Skip to main content

How to Use Off-the-Shelf Evaluators

LangChain's evaluation module provides evaluators you can use as-is for common evaluation scenarios.

It's easy to use these by passing them to the evaluators argument of the evaluate() function.

Copy the code snippets below to get started. You can also configure them for your applications using the arguments mentioned in the "Parameters" sections.
If you don't see an implementation that suits your needs, you can learn how to create your own Custom Run Evaluators in the linked guide, or contribute an string evaluator to the LangChain repository.

note

Most of these evaluators are useful but imperfect! We recommend against blind trust of any single automated metric and to always incorporate them as a part of a holistic testing and evaluation strategy.
Many of the LLM-based evaluators return a binary score for a given data point, so measuring differences in prompt or model performance are most reliable in aggregate over a larger dataset.

Overview

The following table enumerates the off-the-shelf evaluators available in LangSmith, along with their output keys and a simple code sample.

Evaluator nameOutput KeySimple Code Example
QAcorrectnessLangChainStringEvaluator("qa")
Contextual Q&Acontextual accuracyLangChainStringEvaluator("context_qa")
Chain of Thought Q&Acot contextual accuracyLangChainStringEvaluator("cot_qa")
CriteriaDepends on criteria keyLangChainStringEvaluator("criteria", config={ "criteria": <criterion> })

criterion may be one of the default implemented criteria: conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, and criminality.

Or, you may define your own criteria in a custom dict as follows:
{ "criterion_key": "criterion description" }
Labeled CriteriaDepends on criteria keyLangChainStringEvaluator("labeled_criteria", config={ "criteria": <criterion> })

criterion may be one of the default implemented criteria: conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, and criminality.

Or, you may define your own criteria in a custom dict as follows:
{ "criterion_key": "criterion description" }
ScoreDepends on criteria keyLangChainStringEvaluator("score_string", config={ "criteria": <criterion>, "normalize_by": 10 })

criterion may be one of the default implemented criteria: conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, and criminality.

Or, you may define your own criteria in a custom dict as follows:
{ "criterion_key": "criterion description" }. Scores are out of 10, so normalize_by will cast this to a score from 0 to 1.
Labeled ScoreDepends on criteria keyLangChainStringEvaluator("labeled_score_string", config={ "criteria": <criterion>, "normalize_by": 10 })

criterion may be one of the default implemented criteria: conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, and criminality.

Or, you may define your own criteria in a custom dict as follows:
{ "criterion_key": "criterion description" }. Scores are out of 10, so normalize_by will cast this to a score from 0 to 1.
Embedding distanceembedding_cosine_distanceLangChainStringEvaluator("embedding_distance")
String Distancestring_distanceLangChainStringEvaluator("string_distance", config={"distance": "damerau_levenshtein" })

distance defines the string difference metric to be applied, such as levenshtein or jaro_winkler.
Exact Matchexact_matchLangChainStringEvaluator("exact_match")
Regex Matchregex_matchLangChainStringEvaluator("regex_match")
Json Validityjson_validityLangChainStringEvaluator("json_validity")
Json Equalityjson_equalityLangChainStringEvaluator("json_equality")
Json Edit Distancejson_edit_distanceLangChainStringEvaluator("json_edit_distance")
Json Schemajson_schemaLangChainStringEvaluator("json_schema")

Correctness: QA evaluation

QA evalutors help to measure the correctness of a response to a user query or question. If you have a dataset with reference labels or reference context docs, these are the evaluators for you! Three QA evaluators you can load are: "qa", "context_qa", "cot_qa". Based on our meta-evals, we recommend using "cot_qa" or a similar prompt for best results.

  • The "qa" evaluator (reference) instructs an llm to directly grade a response as "correct" or "incorrect" based on the reference answer.
  • The "context_qa" evaluator (reference) instructs the LLM chain to use reference "context" (provided throught the example outputs) in determining correctness. This is useful if you have a larger corpus of grounding docs but don't have ground truth answers to a query.
  • The "cot_qa" evaluator (reference) is similar to the "context_qa" evaluator, except it instructs the LLMChain to use chain of thought "reasoning" before determining a final verdict. This tends to lead to responses that better correlate with human labels, for a slightly higher token and runtime cost.
from langsmith import Client
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa")
context_qa_evaluator = LangChainStringEvaluator("context_qa")
cot_qa_evaluator = LangChainStringEvaluator("cot_qa")

client = Client()
evaluate(
<your pipeline>,
data="<dataset_name>",
evaluators=[qa_evaluator, context_qa_evaluator, cot_qa_evaluator],
metadata={"revision_id": "the version of your pipeline you are testing"},
)
You can customize the evaluator by specifying the LLM used to power its LLM chain or even by customizing the prompt itself. Below is an example using an Anthropic model to run the evaluator, and a custom prompt for the base QA evaluator. Check out the reference docs for more information on the expected prompt format.
from langchain.chat_models import ChatAnthropic
from langchain_core.prompts.prompt import PromptTemplate
from langsmith.evaluation import LangChainStringEvaluator

_PROMPT_TEMPLATE = """You are an expert professor specialized in grading students' answers to questions.
You are grading the following question:
{input}
Here is the real answer:
{reference}
You are grading the following predicted answer:
{prediction}
Respond with CORRECT or INCORRECT:
Grade:
"""

PROMPT = PromptTemplate(
input_variables=["input", "reference", "prediction"], template=_PROMPT_TEMPLATE
)
eval_llm = ChatAnthropic(temperature=0.0)

qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm, "prompt": PROMPT})
context_qa_evaluator = LangChainStringEvaluator("context_qa", config={"llm": eval_llm})
cot_qa_evaluator = LangChainStringEvaluator("cot_qa", config={"llm": eval_llm})
evaluate(
<your pipeline>,
data="<dataset_name>",
evaluators=[qa_evaluator, context_qa_evaluator, cot_qa_evaluator],
)

Criteria Evaluators (No Labels)

If you don't have ground truth reference labels, you can evaluate your run against a custom set of criteria using the "criteria" or "score" evaluators. These are helpful when there are high level semantic aspects of your model's output you'd like to monitor that aren't captured by other explicit checks or rules.

  • The "criteria" evaluator (reference) instructs an LLM to assess if a prediction satisfies the given criteria, outputting a binary score
  • The "score_string" evaluator (reference) has the LLM score the prediction on a numeric scale (default 1-10) based on how well it satisfies the criteria
from langsmith import Client  
from langsmith.evaluation import LangChainStringEvaluator, evaluate

criteria_evaluator = LangChainStringEvaluator(
"criteria",
config={
"criteria": {
"creativity": "Is this submission creative, imaginative, or novel?",
"conciseness": "Is this response concise and to the point?",
}
}
)
score_evaluator = LangChainStringEvaluator(
"score_string",
config={
"criteria": {
"accuracy": "How accurate is this prediction on a scale of 1-10?"
},
# If you want the score to be saved on a scale from 0 to 1
"normalize_by": 10,
}
)

client = Client()
evaluate(
<your pipeline>,
data="<dataset_name>",
evaluators=[
criteria_evaluator,
score_evaluator
],
metadata={"revision_id": "the version of your pipeline you are testing"},
)
Supported Criteria

Default criteria are implemented for the following aspects: conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, and criminality.
To specify custom criteria, write a mapping of a criterion name to its description, such as:

    criterion = {"creativity": "Is this submission creative, imaginative, or novel?"}
criteria_evaluator = LangChainStringEvaluator(
"labeled_criteria",
config={"criteria": criterion}
)
Interpreting the Score

Evaluation scores don't have an inherent "direction" (i.e., higher is not necessarily better). The direction of the score depends on the criteria being evaluated. For example, a score of 1 for "helpfulness" means that the prediction was deemed to be helpful by the model. However, a score of 1 for "maliciousness" means that the prediction contains malicious content, which, of course, is "bad".

Criteria Evaluators (With Labels)

If you have ground truth reference labels, you can evaluate your run against custom criteria while also providing that reference information to the LLM using the "labeled_criteria" or "labeled_score_string" evaluators.

  • The "labeled_criteria" evaluator (reference) instructs an LLM to assess if a prediction satisfies the criteria, taking into account the reference label
  • The "labeled_score_string" evaluator (reference) has the LLM score the prediction on a numeric scale based on how well it satisfies the criteria compared to the reference
from langsmith import Client
from langsmith.evaluation import LangChainStringEvaluator, evaluate

labeled_criteria_evaluator = LangChainStringEvaluator(
"labeled_criteria",
config={
"criteria": {
"helpfulness": (
"Is this submission helpful to the user,"
" taking into account the correct reference answer?"
)
}
},
prepare_data=lambda run, example: {
"prediction": run.outputs["output"],
"reference": example.outputs["answer"],
"input": example.inputs["question"],
}
)
labeled_score_evaluator = LangChainStringEvaluator(
"labeled_score_string",
config={
"criteria": {
"accuracy": "How accurate is this prediction compared to the reference on a scale of 1-10?"
},
"normalize_by": 10,
},
prepare_data=lambda run, example: {
"prediction": run.outputs["output"],
"reference": example.outputs["answer"],
"input": example.inputs["question"],
}
)

client = Client()
evaluate(
<your pipeline>,
data="<dataset_name>",
evaluators=[
labeled_criteria_evaluator,
labeled_score_evaluator
],
metadata={"revision_id": "the version of your pipeline you are testing"},
)

JSON Evaluators

Evaluating extraction and function calling applications often comes down to validating that the LLM's string output can be parsed correctly and comparing it to a reference object. The JSON evaluators provide functionality to check your model's output consistency:

  • The "json_validity" (reference) evaluator checks if a prediction is valid JSON
  • The "json_equality" (reference) evaluator checks if a JSON prediction exactly matches a JSON reference, after normalization
  • The "json_edit_distance" (reference) evaluator computes the normalized edit distance between a JSON prediction and reference
  • The "json_schema" (reference) evaluator checks if a JSON prediction satisfies a provided JSON schema
from langsmith.evaluation import LangChainStringEvaluator, evaluate

json_validity_evaluator = LangChainStringEvaluator("json_validity")
json_equality_evaluator = LangChainStringEvaluator("json_equality")
json_edit_distance_evaluator = LangChainStringEvaluator("json_edit_distance")
json_schema_evaluator = LangChainStringEvaluator(
"json_schema",
config={
"schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer", "minimum": 0}
},
"required": ["name"]
}
}
)

evaluate(
<your pipeline>,
data="<dataset_name>",
evaluators=[
json_validity_evaluator,
json_equality_evaluator,
json_edit_distance_evaluator,
json_schema_evaluator
],
)

String or Embedding Distance

To measure the similarity between a predicted string and a reference, you can use string distance metrics:

  • The "string_distance" (reference) evaluator computes a normalized string edit distance between the prediction and reference
  • The "embedding_distance" (reference) evaluator computes the distance between the text embeddings of the prediction and reference
  • The "exact_match" (reference) evaluator checks for an exact string match between prediction and reference
from langsmith.evaluation import LangChainStringEvaluator, evaluate

string_distance_evaluator = LangChainStringEvaluator(
"string_distance",
config={"distance": "levenshtein", "normalize_score": True}
)
embedding_distance_evaluator = LangChainStringEvaluator(
"embedding_distance",
config={
# Defaults to OpenAI, but you can customize which embedding provider to use:
# "embeddings": HuggingFaceEmbeddings(model="distilbert-base-uncased"),
# Can also choose "euclidean", "chebyshev", "hamming", and "manhattan"
"distance_metric": "cosine",
}
)
exact_match_evaluator = LangChainStringEvaluator(
"exact_match",
config={"ignore_case": True, "ignore_punctuation": True}
)

evaluate(
<your pipeline>,
data="<dataset_name>",
evaluators=[
string_distance_evaluator,
embedding_distance_evaluator,
exact_match_evaluator
],
)

Regex Match

To evaluate predictions against a reference regular expression pattern, you can use the "regex_match" (reference) evaluator. The pattern is provided as a string in the example outputs of the dataset. The evaluator checks if the prediction matches the pattern.

from langsmith.evaluation import LangChainStringEvaluator, evaluate

regex_evaluator = LangChainStringEvaluator(
"regex_match",
config={
# Optionally control which flags to use in the regex match
"flags": re.IGNORECASE
}
)

evaluate(
<your pipeline>,
data="<dataset_name>",
evaluators=[regex_evaluator],
)

Don't see what you're looking for?

These implementations are just a starting point. We want to work with you to build better off-the-shelf evaluation tools for everyone. We'd love feedback and contributions! Send us feedback at support@langchain.dev, check out the Evaluators in LangChain or submit PRs or issues directly to better address your needs.


Was this page helpful?