Evaluators
In this guide, you will create custom evaluators to grade your LLM system. An evaluator can apply any logic you want, returning a numeric score associated with a key.
Most evaluators are applied on a run level, scoring each prediction individually. Some summary_evaluators can be applied on an experiment level, letting you score and aggregate metrics across multiple runs.
Evaluators take in a Run and an Example:
Run Object
The key Run object fields are as follows:
- outputs: Dict[str, Any] - The outputs of your pipeline
- inputs: Dict[str, Any] - The inputs to your pipeline. These are the same as the inputs in the example.
- child_runs: List[Run] - If your pipeline has nested steps, these can be accessed and used in your evaluator
Example Object
The key Example object fields are as follows:
- outputs: Dict[str, Any] - The reference labels or other context found in your dataset
- inputs: Dict[str, Any] - The inputs to your pipeline
- metadata: Dict[str, Any] - Any other metadata you have stored in that example within the dataset
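To make these fields concrete, here is a minimal sketch of an evaluator that touches each of them. The "output", "question", "answer", and "difficulty" keys are illustrative assumptions about your pipeline and dataset, not fixed names:
from langsmith.schemas import Example, Run

def contains_reference(run: Run, example: Example) -> dict:
    prediction = run.outputs["output"]       # what your pipeline returned (key is an assumption)
    question = example.inputs["question"]    # same inputs your pipeline received (key is an assumption)
    reference = example.outputs["answer"]    # reference label stored in the dataset (key is an assumption)
    difficulty = (example.metadata or {}).get("difficulty")  # optional metadata on the example
    score = reference.lower() in prediction.lower()
    return {
        "key": "contains_reference",
        "score": score,
        "comment": f"Question: {question} (difficulty: {difficulty})",
    }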
Run Evaluators
Run evaluators are applied to each prediction of your pipeline. The common automated evaluator types are:
- Simple Heuristics: Checking for regex matches, presence/absence of certain words or code, etc.
- AI-assisted: Instruct an "LLM-as-judge" to grade the output of a run based on the prediction and reference answer (or retrieved context).
We will demonstrate some simple ones below.
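As a quick illustration of the heuristic style, a regex-based evaluator might simply scan the output for a forbidden pattern. This is a minimal sketch; the "output" key and the SSN-style pattern are illustrative assumptions:
import re

from langsmith.schemas import Example, Run

# Illustrative pattern: flag any output that leaks something shaped like a US SSN
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def no_ssn_leak(run: Run, example: Example) -> dict:
    prediction = run.outputs["output"]  # assumes your pipeline returns an "output" key
    # Score True when no match is found, False otherwise
    score = SSN_PATTERN.search(prediction) is None
    return {"key": "no_ssn_leak", "score": score}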
Example 1: Exact Match Evaluator
Let's start with a simple evaluator that checks if the model's output exactly matches the reference answer.
from langsmith.schemas import Example, Run

def exact_match(run: Run, example: Example) -> dict:
    reference = example.outputs["answer"]
    prediction = run.outputs["output"]
    score = prediction.lower() == reference.lower()
    return {"key": "exact_match", "score": score}
Let's break this down:
- The evaluator function accepts a Run and an Example and returns a dictionary with the evaluation key and score. The run contains the full trace of your pipeline, and the example contains the inputs and outputs for this data point. If your dataset contains labels, they are found in the example.outputs dictionary, which is kept separate to keep your model from cheating.
- In our dataset, the outputs have an "answer" key that contains the reference answer. Your pipeline generates predictions as a dictionary with an "output" key.
- It compares the prediction and reference (case-insensitive) and returns a dictionary with the evaluation key and score.
You can use this evaluator directly in the evaluate function:
from langsmith.evaluation import evaluate

evaluate(
    <your prediction function>,
    data="<dataset_name>",
    evaluators=[exact_match],
)
Example 2: Parametrizing your evaluator
You may want to parametrize your evaluator as a class. This is useful when you need to pass additional parameters to the evaluator.
from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run

class BlocklistEvaluator:
    def __init__(self, blocklist: list[str]):
        self.blocklist = blocklist

    def __call__(self, run: Run, example: Example | None = None) -> dict:
        model_outputs = run.outputs["output"]
        score = not any(word in model_outputs for word in self.blocklist)
        return {"key": "blocklist", "score": score}

evaluate(
    <your prediction function>,
    data="<dataset_name>",
    evaluators=[BlocklistEvaluator(blocklist=["bad", "words"])],
)
Example 3: Evaluating nested traces
While most evaluations are applied to the inputs and outputs of your system, you can also evaluate all of the subcomponents that are traced within your pipeline.
This is possible by stepping through the run object and collecting the outputs of each component.
As a simple example, let's assume you want to evaluate the expected tools that are invoked in a pipeline.
from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run

def evaluate_trajectory(run: Run, example: Example) -> dict:
    # Collect the tools invoked at level 1 of the trace tree
    steps = [child.name for child in run.child_runs if child.run_type == "tool"]
    expected_steps = example.outputs["expected_tools"]
    score = len(set(steps) & set(expected_steps)) / len(set(steps) | set(expected_steps))
    return {"key": "tools", "score": score}
This lets you grade the performance of intermediate steps in your pipeline.
Note: the example above assumes tools are properly typed in the trace tree.
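If tool calls can appear deeper in the trace tree than the first level, you can walk the tree recursively instead. Below is a minimal sketch; the helper function is our own, not part of the SDK:
from langsmith.schemas import Example, Run

def collect_runs_by_type(run: Run, run_type: str) -> list[Run]:
    """Recursively gather all nested runs of a given type (e.g. "tool")."""
    matches = []
    for child in run.child_runs or []:
        if child.run_type == run_type:
            matches.append(child)
        matches.extend(collect_runs_by_type(child, run_type))
    return matches

def evaluate_trajectory_deep(run: Run, example: Example) -> dict:
    steps = {child.name for child in collect_runs_by_type(run, "tool")}
    expected_steps = set(example.outputs["expected_tools"])
    union = steps | expected_steps
    score = len(steps & expected_steps) / len(union) if union else 1.0
    return {"key": "tools_deep", "score": score}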
Example 4: Structured Output
With function calling, it has become easier than ever to generate feedback metrics using an LLM as a judge: simply specify a Pydantic schema for the output.
Below is an example (in this case using OpenAI's tool calling functionality) that evaluates the faithfulness of a RAG app.
import json
from typing import List

import openai
from langsmith.schemas import Example, Run
from pydantic import BaseModel, Field

openai_client = openai.Client()

class Propositions(BaseModel):
    propositions: List[str] = Field(
        description="The factual propositions generated by the model"
    )

class FaithfulnessScore(BaseModel):
    reasoning: str = Field(description="The reasoning for the faithfulness score")
    score: bool

def faithfulness(run: Run, example: Example) -> dict:
    # Assumes your RAG app includes the prediction in the "output" key in its response
    response: str = run.outputs["output"]
    # Assumes your RAG app includes the retrieved docs as a "context" key in the outputs
    # If not, you can fetch from the child_runs of the run object
    retrieved_docs: list = run.outputs["context"]
    formatted_docs = "\n".join([str(doc) for doc in retrieved_docs])
    extracted = openai_client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {
                "role": "user",
                "content": "Extract all factual propositions from the following text:\n\n"
                f"```\n{response}\n```",
            },
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "Propositions",
                    "description": "Use to record each factual assertion.",
                    "parameters": Propositions.model_json_schema(),
                },
            }
        ],
        tool_choice={"type": "function", "function": {"name": "Propositions"}},
    )
    propositions = [
        prop
        for tc in extracted.choices[0].message.tool_calls
        for prop in json.loads(tc.function.arguments)["propositions"]
    ]
    scores, reasoning = [], []
    tools = [
        {
            "type": "function",
            "function": {
                "name": "FaithfulnessScore",
                "description": "Use to score how faithful the propositions are to the docs.",
                "parameters": FaithfulnessScore.model_json_schema(),
            },
        }
    ]
    for proposition in propositions:
        faithfulness_completion = openai_client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {
                    "role": "user",
                    "content": "Grade whether the proposition can be logically concluded"
                    f" from the docs:\n\nProposition: {proposition}\nDocs:\n"
                    f"```\n{formatted_docs}\n```",
                },
            ],
            tools=tools,
            tool_choice={"type": "function", "function": {"name": "FaithfulnessScore"}},
        )
        faithfulness_args = json.loads(
            faithfulness_completion.choices[0].message.tool_calls[0].function.arguments
        )
        scores.append(faithfulness_args["score"])
        reasoning.append(faithfulness_args["reasoning"])
    average_score = sum(scores) / len(scores) if scores else None
    comment = "\n".join(reasoning)
    return {"key": "faithfulness", "score": average_score, "comment": comment}
Example 5: Returning Multiple Scores
A single evaluator can return multiple scores. This is useful, for instance, when you use tool calling with an LLM-as-judge to extract multiple metrics in a single API call.
import json

import openai
from langsmith.schemas import Example, Run
from pydantic import BaseModel, Field

# Initialize the OpenAI client
openai_client = openai.Client()

class Scores(BaseModel):
    correctness_reasoning: str = Field(description="The reasoning for the correctness score")
    correctness: float = Field(description="The score for the correctness of the prediction")
    conciseness_reasoning: str = Field(description="The reasoning for the conciseness score")
    conciseness: float = Field(description="The score for the conciseness of the prediction")

def multiple_scores(run: Run, example: Example) -> dict:
    reference = example.outputs["answer"]
    prediction = run.outputs["output"]
    messages = [
        {
            "role": "user",
            "content": f"Reference: {reference}\nPrediction: {prediction}",
        },
    ]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "Scores",
                "description": "Use to evaluate the correctness and conciseness of the prediction.",
                "parameters": Scores.model_json_schema(),
            },
        }
    ]
    # Generate the chat completion with structured output
    completion = openai_client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=messages,
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "Scores"}},
    )
    # Extract structured scores from the completion
    scores_args = json.loads(completion.choices[0].message.tool_calls[0].function.arguments)
    return {
        "results": [
            # Provide the key, score and other relevant information for each metric
            {"key": "correctness", "score": scores_args["correctness"], "comment": scores_args["correctness_reasoning"]},
            {"key": "conciseness", "score": scores_args["conciseness"], "comment": scores_args["conciseness_reasoning"]},
        ]
    }
Example 6: Perplexity Evaluator
The flexibility of the functional interface means you can easily apply evaluators from any other library. For instance, you may want to use statistical measures such as perplexity to grade your run output. Below is an example using the evaluate package by HuggingFace, which contains numerous commonly used metrics. Start by installing the evaluate package by running pip install evaluate.
from typing import Optional

from evaluate import load
from langsmith.schemas import Example, Run

class PerplexityEvaluator:
    def __init__(self, prediction_key: Optional[str] = None, model_id: str = "gpt2"):
        self.prediction_key = prediction_key
        self.model_id = model_id
        self.metric_fn = load("perplexity", module_type="metric")

    def evaluate_run(self, run: Run, example: Example) -> dict:
        if run.outputs is None:
            raise ValueError("Run outputs cannot be None")
        prediction = run.outputs[self.prediction_key]
        results = self.metric_fn.compute(
            predictions=[prediction], model_id=self.model_id
        )
        ppl = results["perplexities"][0]
        return {"key": "Perplexity", "score": ppl}
Let's break down what the PerplexityEvaluator is doing:
- Initialize: In the constructor, we're setting up a few properties that will be needed later on.
  - prediction_key: The key to find the model's prediction in the outputs of a run.
  - model_id: The ID of the language model you want to use to compute the metric. In our example, we are using 'gpt2'.
  - metric_fn: The evaluation metric function, loaded from the HuggingFace evaluate package.
- Evaluate: This method takes a run (and optionally an example) and returns an evaluation dictionary.
  - If the run outputs are None, the evaluator raises an error.
  - Otherwise, the outputs are passed to the metric_fn to compute the perplexity. The perplexity score is then returned as part of the evaluation dictionary.

Once you've defined your evaluators, you can use them to evaluate your model:
from langsmith.evaluation import evaluate

evaluate(
    <LLM or chain or agent>,
    data="<dataset_name>",
    evaluators=[BlocklistEvaluator(blocklist=["bad", "words"]), PerplexityEvaluator()],
)
Summary Evaluators
Some metrics can only be defined on the entire experiment level, as opposed to the individual runs of the experiment. For example, you may want to compute the F1 score of a classifier across all runs in an experiment kicked off from a dataset. These are called summary_evaluators.
from typing import List

from langsmith.schemas import Example, Run
from langsmith.evaluation import evaluate

def f1_score_summary_evaluator(runs: List[Run], examples: List[Example]) -> dict:
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    for run, example in zip(runs, examples):
        # Matches the output format of your dataset
        reference = example.outputs["answer"]
        # Matches the output dict in the `predict` function below
        prediction = run.outputs["prediction"]
        if reference and prediction == reference:
            true_positives += 1
        elif prediction and not reference:
            false_positives += 1
        elif not prediction and reference:
            false_negatives += 1
    if true_positives == 0:
        return {"key": "f1_score", "score": 0.0}
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1_score = 2 * (precision * recall) / (precision + recall)
    return {"key": "f1_score", "score": f1_score}

def predict(inputs: dict):
    return {"prediction": True}

evaluate(
    predict,  # Your classifier
    data="<dataset_name>",
    summary_evaluators=[f1_score_summary_evaluator],
)
Recap
Congratulations! You created custom evaluators you can apply to any traced run, letting you surface more relevant information about your application. Custom evaluators like these speed up the development process for application-specific, semantically robust evaluations, and you can extend existing components from the library so you can focus on building your product. All your evals come with:
- Automatic tracing integrations to help you debug, compare, and improve your code
- Easy sharing and mixing of components and results
- Out-of-the-box support for sync and async evaluations for faster runs