Migrating from run_on_dataset to evaluate

In Python, we've introduced a cleaner evaluate() function to replace the run_on_dataset function. While we are not deprecating run_on_dataset, the new function lets you get started without needing to install LangChain in your local environment.

This guide walks you through migrating your existing code from run_on_dataset to evaluate().

Key Differences

1. llm_or_chain_factory -> first positional argument

The "thing you are evaluating" (pipeline, target, model, chain, agent, etc.) is always the first positional argument and always has the following signature:

def predict(inputs: dict) -> dict:
    """Call your model or pipeline with the given inputs and return the predictions."""
    # Example:
    # result = client.chat.completions.create(...)
    # response = result.choices[0].message.content
    return {"output": ...}

No need to specify the confusing "llm_or_chain_factory". If you need to create a new version of your object for each data point, initialize it within the predict() function. If you want to evaluate a LangChain object (runnable, etc.), you can directly call evaluate(chain.invoke, data=..., ...).
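
For instance, a minimal call might look like this (a sketch; "my_dataset" is a hypothetical dataset name and predict is the function defined above):

from langsmith.evaluation import evaluate

results = evaluate(
    predict,            # the target is always the first positional argument
    data="my_dataset",  # hypothetical dataset name
)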

2. dataset_name -> data

The data field accepts a broader range of inputs, including the dataset name, id, or an iterator over examples. This lets you easily evaluate over a subset of the data to quickly debug.
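
For example, one way to run against just a handful of examples for a quick debugging pass (a sketch; "my_dataset" is a hypothetical dataset name and predict is the target from above):

import itertools

from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

results = evaluate(
    predict,
    # Pass an iterator over a subset of examples instead of the full dataset.
    data=itertools.islice(client.list_examples(dataset_name="my_dataset"), 5),
)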

If you were previously specifying a dataset_version, you can directly pass the target version like so:

dataset_version = "lates" # your tagged version

results = evaluate(
...,
data=client.list_examples(dataset_name="my_dataset", as_of=dataset_version),
...
)

3. RunEvalConfig -> List[RunEvaluator]

The config has been deprecated (removing the LangChain dependency). Instead, directly provide a list of evaluators to the evaluators argument.

a. Custom evaluators are simple functions that take a `Run` and an `Example` and return a dictionary with the evaluation results. For example:

def exact_match(run: Run, example: Example) -> dict:
    """Calculate the exact match score of the run."""
    expected = example.outputs["answer"]
    predicted = run.outputs["output"]
    return {"score": expected.lower() == predicted.lower(), "key": "exact_match"}

evaluate(
    ...,
    evaluators=[exact_match],
)

Anything that subclasses RunEvaluator still works as it did before; we simply promote your compatible functions to RunEvaluator instances automatically.
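
For instance, a subclass might look like this (a sketch, assuming the RunEvaluator interface's evaluate_run method returning an EvaluationResult):

from typing import Optional

from langsmith.evaluation import EvaluationResult, RunEvaluator
from langsmith.schemas import Example, Run


class ExactMatch(RunEvaluator):
    def evaluate_run(self, run: Run, example: Optional[Example] = None) -> EvaluationResult:
        # Compare the pipeline output against the reference answer.
        return EvaluationResult(
            key="exact_match",
            score=run.outputs["output"].lower() == example.outputs["answer"].lower(),
        )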

b. `LangChain` evaluators can be incorporated using the `LangChainStringEvaluator` wrapper.

For example, if you were previously using the "Criteria" evaluator, this evaluation:

eval_config = RunEvalConfig(
    evaluators=[
        RunEvalConfig.Criteria(
            criteria={"usefulness": "The prediction is useful if..."},
            llm=my_eval_llm,
        )
    ]
)

client.run_on_dataset(..., eval_config=eval_config)

becomes:

from langsmith.evaluation import LangChainStringEvaluator

evaluators=[
    LangChainStringEvaluator(
        "labeled_criteria",
        config={
            "criteria": {
                "usefulness": "The prediction is useful if...",
            },
            "llm": my_eval_llm,
        },
    ),
]

c. To evaluate multi-key datasets using off-the-shelf LangChain evaluators, replace any input_key, reference_key, or prediction_key arguments with a custom prepare_data function.

If your dataset has a single input key and a single reference answer key, and your target pipeline returns its response under a single key, the evaluators can use these values directly without any additional configuration.

For multi-key datasets, you must specify which values to use for the model prediction (and optionally for the expected answer and/or inputs). This is done by providing a prepare_data function that converts a run and example to a dictionary of {"input": ..., "prediction": ..., "reference": ...}:

def prepare_data(run: Run, example: Example) -> dict:
    # Run is the trace of your pipeline
    # Example is a dataset record
    return {
        "prediction": run.outputs["output"],
        "input": example.inputs["input"],
        "reference": example.outputs["answer"],
    }

qa_evaluator = LangChainStringEvaluator(
    "qa",
    prepare_data=prepare_data,
    config={"llm": my_qa_llm},
)

4. batch_evaluators -> summary_evaluators.

These let you compute custom metrics over the whole dataset. For example, precision:

def precision(runs: List[Run], examples: List[Example]) -> dict:
    """Calculate the precision of the runs."""
    expected = [example.outputs["answer"] for example in examples]
    predicted = [run.outputs["output"] for run in runs]
    tp = sum([p == e for p, e in zip(predicted, expected) if p == "yes"])
    fp = sum([p == "yes" and e == "no" for p, e in zip(predicted, expected)])
    return {"score": tp / (tp + fp), "key": "precision"}

5. project_metadata -> metadata.

6. project_name -> experiment_prefix.

evaluate() always appends an experiment UUID to the prefix to ensure uniqueness, so you won't run into those confusing "project already exists" errors.

7. concurrency_level -> max_concurrency.

Migration Steps

1. Update your imports:

from langsmith.evaluation import evaluate

2. Change your run_on_dataset call to evaluate:

results = evaluate(
    ...,
    data=...,
    evaluators=[...],
    summary_evaluators=[...],
    metadata=...,
    experiment_prefix=...,
    max_concurrency=...,
)

3. If you were using a factory function, replace it with a direct invocation:

def predict(inputs: dict):
    my_pipeline = ...
    return my_pipeline.invoke(inputs)

4. If you were using LangChain evaluators, wrap them with LangChainStringEvaluator:

from langsmith.evaluation import LangChainStringEvaluator

evaluators=[
    LangChainStringEvaluator("embedding_distance"),
    LangChainStringEvaluator(
        "labeled_criteria",
        config={"criteria": {"usefulness": "The prediction is useful if..."}},
        prepare_data=prepare_criteria_data,
    ),
]

5. Update any references to project_metadata, project_name, dataset_version, and concurrency_level to use the new argument names.

Support

If you encounter any issues during the migration process or have further questions, please don't hesitate to reach out to our support team at support@langchain.dev. We're here to help ensure a smooth transition!

Happy evaluating!
