Add Metrics to Existing Tests
At times, you may want to apply an evaluator post-hoc. This is useful if you have a new evaluator (or version of an evaluator) and want to add the metrics without re-running your model.
You can do this like so:
from langsmith.beta import compute_test_metrics
def my_evaluator(run, example):
    score = "foo" in run.outputs['output']
    return {"key": "is_foo", "score": score}
# The name of the test you have already run.
# This is DISTINCT from the dataset name
test_project = "test-abc123"
compute_test_metrics(test_project, evaluators=[my_evaluator])
Within the compute_test_metrics function, we list the runs in the test and apply the provided evaluators to each one.
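The "list the runs, then apply each evaluator" pattern can be sketched locally without calling LangSmith. This is only an illustration of the control flow, not the library's implementation: FakeRun, FakeExample, and apply_evaluators are hypothetical names invented for this sketch, and the real function additionally fetches runs from the API and records the results as feedback.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FakeRun:
    """Hypothetical stand-in for a run (not the real langsmith schema)."""
    id: str
    outputs: Optional[dict]
    reference_example_id: str


@dataclass
class FakeExample:
    """Hypothetical stand-in for a dataset example."""
    id: str
    outputs: Optional[dict]


def apply_evaluators(
    runs: List[FakeRun],
    examples_by_id: Dict[str, FakeExample],
    evaluators: List[Callable],
) -> List[dict]:
    """Call every evaluator on every run and collect the results."""
    feedback = []
    for run in runs:
        # Pair each run with the dataset example it was generated from.
        example = examples_by_id.get(run.reference_example_id)
        for evaluator in evaluators:
            result = evaluator(run, example)
            feedback.append({"run_id": run.id, **result})
    return feedback


def is_foo(run, example):
    # Same logic as my_evaluator above.
    return {"key": "is_foo", "score": "foo" in run.outputs["output"]}


runs = [
    FakeRun("r1", {"output": "foo bar"}, "e1"),
    FakeRun("r2", {"output": "baz"}, "e2"),
]
examples_by_id = {
    "e1": FakeExample("e1", {"output": "foo"}),
    "e2": FakeExample("e2", {"output": "qux"}),
}
feedback = apply_evaluators(runs, examples_by_id, [is_foo])
print(feedback)
```

Each evaluator result is kept alongside the run it was computed for, which mirrors how the new metrics end up attached to the existing test runs.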
Below, we will share a quick example.
Prerequisites
Install the requisite packages and generate the initial test results. In practice, you will already have a dataset and test results.
This utility function expects langsmith>=0.1.31.
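If you are unsure which version is installed, a check along these lines may help. It is a minimal sketch using only the standard library; release_tuple and langsmith_is_recent_enough are hypothetical helper names, and the comparison ignores pre-release suffixes (so "0.1.31rc1" is treated like "0.1.31").

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum version stated above.
MINIMUM = (0, 1, 31)


def release_tuple(version_string: str) -> tuple:
    """Leading numeric components, e.g. '0.1.31' -> (0, 1, 31)."""
    parts = []
    for piece in version_string.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def langsmith_is_recent_enough() -> bool:
    """Return False if langsmith is missing or older than MINIMUM."""
    try:
        installed = version("langsmith")
    except PackageNotFoundError:
        return False
    return release_tuple(installed) >= MINIMUM


print(langsmith_is_recent_enough())
```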
# %pip install -U langsmith langchain
import os
import uuid
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# Update if you are self-hosted
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
from langsmith import Client
client = Client()
dataset_name = "My Example Dataset " + uuid.uuid4().hex[:6]
ds = client.create_dataset(dataset_name=dataset_name)
client.create_examples(
    inputs=[{"input": i} for i in range(10)],
    outputs=[{"output": i * (3 % (i + 1))} for i in range(10)],
    dataset_id=ds.id,
)
def my_chain(example_input: dict):
    # The input to the llm_or_chain_factory is
    # the example.inputs
    return {"output": example_input["input"] * 3}
results = client.run_on_dataset(
    dataset_name=dataset_name, llm_or_chain_factory=my_chain
)
test_name = results["project_name"]
View the evaluation results for project 'puzzled-cloud-96' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/cbdb128b-a725-4662-a515-dfe0009cb15c/compare?selectedSessions=28f2c88e-3091-4fcc-bac7-c1dbd8a6a43b
View all tests for Dataset My Example Dataset 512ee7 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/cbdb128b-a725-4662-a515-dfe0009cb15c
[------------------------------------------------->] 10/10
Add Evaluation Metrics
Now that we have existing test results, we can apply new evaluators to this project using the compute_test_metrics utility function.
from langsmith.beta import compute_test_metrics
from langsmith.schemas import Example, Run
def exact_match(run: Run, example: Example):
    # "output" is the key we assigned in the create_examples step above
    expected = example.outputs["output"]
    predicted = run.outputs["output"]
    return {"key": "exact_match", "score": predicted == expected}
# The name of the test you have already run.
# This is DISTINCT from the dataset name
compute_test_metrics(test_name, evaluators=[exact_match])
/var/folders/gf/6rnp_mbx5914kx7qmmh7xzmw0000gn/T/ipykernel_80329/988510393.py:14: UserWarning: Function compute_test_metrics is in beta.
compute_test_metrics(test_name, evaluators=[exact_match])
You can now view the updated test results, including the new exact_match metric, at the link printed above.
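As a sanity check on what to expect in the UI, the exact_match scores can be reproduced locally from the dataset and chain defined above. This is a sketch that recomputes the same arithmetic in plain Python; it does not call LangSmith, and the variable names are illustrative only.

```python
# Reference outputs, same expression as in the create_examples call above.
expected = [i * (3 % (i + 1)) for i in range(10)]
# Chain outputs, same logic as my_chain above.
predicted = [i * 3 for i in range(10)]

# exact_match scores True when the two values are equal.
matches = [p == e for p, e in zip(predicted, expected)]
mismatched_inputs = [i for i, ok in enumerate(matches) if not ok]
match_rate = sum(matches) / len(matches)

print(mismatched_inputs)  # [1, 2]
print(match_rate)  # 0.8
```

Only inputs 1 and 2 differ, because 3 % (i + 1) equals 3 for every i >= 3 and both sides are 0 when i is 0. Assuming all ten runs completed, 8 of the 10 should score True.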
Conclusion
Congrats! You've run evals on an existing test. This makes it easy to backfill evaluation results on old test results.