Multimodal Tool Correctness

The multimodal tool correctness metric is an agentic LLM metric that assesses your multimodal LLM agent's function/tool-calling ability. It is calculated by checking whether every tool that was expected to be used was indeed called.

info

The MultimodalToolCorrectnessMetric lets you define how strict the correctness check is. By default, it considers matching tool names to be correct, but you can also require the input parameters and output to match.

Required Arguments

To use the MultimodalToolCorrectnessMetric, you'll have to provide the following arguments when creating an MLLMTestCase:

  • input
  • actual_output
  • tools_called
  • expected_tools

Example

from deepeval.metrics import MultimodalToolCorrectnessMetric
from deepeval.test_case import MLLMTestCase, ToolCall

test_case = MLLMTestCase(
    input="What's in this image?",
    actual_output="The image shows a pair of running shoes.",
    # Replace this with the tools that were actually called by your LLM agent
    tools_called=[ToolCall(name="ImageAnalysis"), ToolCall(name="ToolQuery")],
    expected_tools=[ToolCall(name="ImageAnalysis")],
)

metric = MultimodalToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score)
print(metric.reason)
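In the example above, the metric should score 1.0: the only expected tool, ImageAnalysis, appears in tools_called, and the extra ToolQuery call is not penalized by default, since the score is computed against the list of expected_tools (see How Is It Calculated? below).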

There are seven optional parameters when creating a MultimodalToolCorrectnessMetric:

  • [Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
  • [Optional] evaluation_params: a list of ToolCallParams indicating the strictness of the correctness criteria; available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT. For example, supplying a list containing ToolCallParams.INPUT_PARAMETERS but excluding ToolCallParams.OUTPUT will deem a tool correct if the tool name and input parameters match, even if the output does not (see the sketch after this list). Defaults to an empty list.
  • [Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
  • [Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
  • [Optional] verbose_mode: a boolean which when set to True, prints the intermediate steps used to calculate the metric score to the console, as outlined in the How Is It Calculated section. Defaulted to False.
  • [Optional] should_consider_ordering: a boolean which when set to True, will take into account the order in which the tools were called. For example, if expected_tools=[ToolCall(name="ImageAnalysis"), ToolCall(name="ToolQuery"), ToolCall(name="ImageAnalysis")] and tools_called=[ToolCall(name="ImageAnalysis"), ToolCall(name="ImageAnalysis"), ToolCall(name="ToolQuery")], the metric will consider the tool calling incorrect. Only available for ToolCallParams.TOOL and defaulted to False.
  • [Optional] should_exact_match: a boolean which when set to True, will require the tools_called and expected_tools to be exactly the same. Available for both ToolCallParams.TOOL and ToolCallParams.INPUT_PARAMETERS, and defaulted to False.
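For instance, here is a minimal sketch of a stricter configuration that also compares input parameters; the tool name and parameter values are hypothetical, and it assumes ToolCallParams is importable from deepeval.test_case:

from deepeval.metrics import MultimodalToolCorrectnessMetric
from deepeval.test_case import MLLMTestCase, ToolCall, ToolCallParams

test_case = MLLMTestCase(
    input="What's in this image?",
    actual_output="The image shows a pair of running shoes.",
    # Hypothetical call: the input parameters must now match as well
    tools_called=[ToolCall(name="ImageAnalysis", input_parameters={"detail": "high"})],
    expected_tools=[ToolCall(name="ImageAnalysis", input_parameters={"detail": "high"})],
)

# Require matching input parameters in addition to matching tool names
metric = MultimodalToolCorrectnessMetric(
    evaluation_params=[ToolCallParams.INPUT_PARAMETERS],
)
metric.measure(test_case)

# Ordering-sensitive variant (applies to tool-name matching only, per the note above)
ordered_metric = MultimodalToolCorrectnessMetric(should_consider_ordering=True)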
info

Since should_exact_match is a stricter criterion than should_consider_ordering, should_consider_ordering has no effect when should_exact_match is set to True.

How Is It Calculated?

note

The MultimodalToolCorrectnessMetric, unlike all other deepeval metrics, is not calculated using any models or LLMs; instead, it is computed via exact matching between the expected_tools and tools_called parameters.

The multimodal tool correctness metric score is calculated according to the following equation:

$$\text{Tool Correctness} = \frac{\text{Number of Correctly Used Tools (or Correct Input Parameters/Outputs)}}{\text{Total Number of Expected Tools}}$$

This metric assesses the accuracy of your agent's tool usage by comparing the tools_called by your multimodal LLM agent to the list of expected_tools. A score of 1 indicates that every tool utilized by your LLM agent was called correctly according to the list of expected_tools, should_consider_ordering, and should_exact_match, while a score of 0 signifies that none of the tools_called were called correctly.
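For instance, if the metric behaves as the equation above describes, the following hypothetical case should score 0.5, since only one of the two expected tools was called:

from deepeval.metrics import MultimodalToolCorrectnessMetric
from deepeval.test_case import MLLMTestCase, ToolCall

# Hypothetical case: the agent called ImageAnalysis but never called ToolQuery
test_case = MLLMTestCase(
    input="Find products similar to the shoes in this image.",
    actual_output="Here are some similar running shoes.",
    tools_called=[ToolCall(name="ImageAnalysis")],
    expected_tools=[ToolCall(name="ImageAnalysis"), ToolCall(name="ToolQuery")],
)

metric = MultimodalToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score)  # 1 correctly used tool / 2 expected tools = 0.5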

info

If should_exact_match is not enabled and ToolCallParams.INPUT_PARAMETERS is included in evaluation_params, correctness may be a percentage score based on the proportion of correct input parameters (assuming the tool name and output are correct, if applicable).
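For example, in the following sketch (tool name and parameter values are hypothetical), a tool whose name matches but whose input parameters only partially match may receive a fractional score:

from deepeval.metrics import MultimodalToolCorrectnessMetric
from deepeval.test_case import MLLMTestCase, ToolCall, ToolCallParams

# Hypothetical case: the tool name matches, but only one of two input parameters does
test_case = MLLMTestCase(
    input="What's in this image?",
    actual_output="The image shows a pair of running shoes.",
    tools_called=[
        ToolCall(name="ImageAnalysis", input_parameters={"detail": "low", "crop": True}),
    ],
    expected_tools=[
        ToolCall(name="ImageAnalysis", input_parameters={"detail": "high", "crop": True}),
    ],
)

metric = MultimodalToolCorrectnessMetric(
    evaluation_params=[ToolCallParams.INPUT_PARAMETERS],
)
metric.measure(test_case)
print(metric.score)  # may be fractional, e.g. 0.5, since only "crop" matches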