Testing & Evaluation Recipes
Retrieval Augmented Generation (RAG)
- Q&A System Correctness: evaluate your retrieval-augmented Q&A pipeline end-to-end on a dataset (see the sketch at the end of this list). Iterate, improve, and keep testing.
- Evaluating Q&A Systems with Dynamic Data: use evaluators that dereference labels at evaluation time to handle data that changes over time.
- RAG Evaluation using Fixed Sources: evaluate the response component of a RAG pipeline by providing the retrieved documents in the dataset.
- RAG evaluation with RAGAS: evaluate RAG pipelines using the RAGAS framework. Covers metrics for both the generator AND retriever in both labeled and reference-free contexts (answer correctness, faithfulness, context relevancy, recall and precision).
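A minimal sketch of the end-to-end Q&A setup from the first recipe above, using the cookbook-era `run_on_dataset` API. The dataset name, model, and chain are hypothetical placeholders (the stand-in chain assumes the dataset inputs carry a `question` key); the built-in `"qa"` evaluator grades each prediction against the dataset's reference answer.

```python
from langchain.smith import RunEvalConfig, run_on_dataset
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langsmith import Client

client = Client()

def make_rag_chain():
    # Stand-in for your real retrieval-augmented pipeline: swap in your
    # retriever-backed chain here. Assumes dataset inputs have a "question" key.
    prompt = ChatPromptTemplate.from_template("Answer the question: {question}")
    return prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

run_on_dataset(
    client=client,
    dataset_name="rag-qa-example",        # hypothetical dataset name
    llm_or_chain_factory=make_rag_chain,  # a fresh chain is built per example
    evaluation=RunEvalConfig(evaluators=["qa"]),  # grade against reference answers
)
```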
Chat Bots
- Chat Bot Evals using Simulated Users: evaluate your chat bot using a simulated user. The user is given a task, and you score your assistant based on how well it helps without breaking its instructions.
- Single-turn evals: Evaluate chatbots within multi-turn conversations by treating each data point as an individual dialogue turn. This guide shows how to set up a multi-turn conversation dataset and evaluate a simple chat bot on it.
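A rough sketch of how such a single-turn dataset can be laid out: each example stores the conversation so far plus the new user message as inputs, and the reference reply for that turn as the output. The dataset name and field names (`chat_history`, `question`, `answer`) are illustrative, not prescribed.

```python
from langsmith import Client

client = Client()

# Hypothetical dataset; each example is one dialogue turn.
dataset = client.create_dataset("chat-single-turn-example")

client.create_example(
    inputs={
        "chat_history": [
            {"role": "user", "content": "Hi, I need to reset my password."},
            {"role": "assistant", "content": "Sure, which email is on the account?"},
        ],
        "question": "It's jane@example.com",
    },
    outputs={"answer": "Thanks! I've sent a reset link to jane@example.com."},
    dataset_id=dataset.id,
)
```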
Extraction
- Evaluating an Extraction Chain: measure the similarity between the extracted structured content and structured labels using LangChain's JSON evaluators.
- Exact Match: deterministic comparison of your system output against a reference label.
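A small sketch covering both Extraction recipes with LangChain's evaluator loader: a distance-based JSON comparison, with `"exact_match"` available as the strict deterministic alternative. The prediction and reference values below are made up for illustration.

```python
from langchain.evaluation import load_evaluator

# Distance-based comparison of extracted JSON against the structured label;
# swap in load_evaluator("exact_match") for a strict deterministic check.
json_evaluator = load_evaluator("json_edit_distance")

result = json_evaluator.evaluate_strings(
    prediction='{"name": "Ada Lovelace", "role": "engineer"}',
    reference='{"role": "Engineer", "name": "Ada Lovelace"}',
)
print(result)  # dict containing a "score"; lower means closer to the label
```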
Agents
- Evaluating an Agent's intermediate steps: compare the sequence of actions taken by an agent to an expected trajectory to grade effective tool use.
- Tool Selection: Evaluate the precision of selected tools. Includes an automated prompt writer that improves the tool descriptions based on failure cases.
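A sketch of custom evaluators for both Agents recipes, under two assumptions: the agent is invoked with `return_intermediate_steps=True` so its traced outputs include an `intermediate_steps` list, and each dataset example stores the expected tool sequence under an `expected_steps` key (both hypothetical). Functions with this `(run, example)` signature can be registered as custom evaluators.

```python
from langsmith.schemas import Example, Run

def _tools_called(run: Run) -> list:
    # Tool names in the order the agent called them; serialized actions may be
    # plain dicts rather than AgentAction objects, so handle both.
    return [
        step[0]["tool"] if isinstance(step[0], dict) else step[0].tool
        for step in (run.outputs or {}).get("intermediate_steps", [])
    ]

def trajectory_exact_match(run: Run, example: Example) -> dict:
    """Score 1 only when the agent used the expected tools in the expected order."""
    expected = example.outputs["expected_steps"]  # assumed reference field
    return {"key": "trajectory_exact_match", "score": int(_tools_called(run) == expected)}

def tool_selection_precision(run: Run, example: Example) -> dict:
    """Fraction of the tools the agent selected that appear in the expected set."""
    actual = set(_tools_called(run))
    expected = set(example.outputs["expected_steps"])
    score = len(actual & expected) / len(actual) if actual else 0.0
    return {"key": "tool_selection_precision", "score": score}
```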
Multimodal
- Evaluating Multimodal Models: benchmark a multimodal image classification chain.
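A minimal sketch of a target function for such a benchmark, assuming the dataset's inputs hold an `image_url` field and that a vision-capable chat model is asked to pick a class label (the model name and label set here are purely illustrative).

```python
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")  # any vision-capable chat model

def classify_image(inputs: dict) -> dict:
    """Classify the image at inputs['image_url'] (hypothetical dataset field)."""
    message = HumanMessage(
        content=[
            {"type": "text", "text": "Classify this image as one of: cat, dog, other. Reply with the label only."},
            {"type": "image_url", "image_url": {"url": inputs["image_url"]}},
        ]
    )
    response = llm.invoke([message])
    return {"label": response.content.strip().lower()}
```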
Fundamentals
- Adding Metrics to Existing Tests: Apply new evaluators to existing test results without re-running your model, using the `compute_test_metrics` utility function. This lets you evaluate "post-hoc" and backfill metrics as you define new evaluators.
- Production Candidate Testing: benchmark new versions of your production app using real inputs. Convert production runs to a test dataset, then compare your new system's performance against the baseline.
- Naming Test Projects: manually name your tests with `run_on_dataset(..., project_name='my-project-name')`.
- Exporting Tests to CSV: Use the `get_test_results` beta utility to easily export your test results to a CSV file. This allows you to analyze and report on the performance metrics, errors, runtime, inputs, outputs, and other details of your tests outside of the LangSmith platform. A sketch combining this with explicit project naming follows this list.
- How to download feedback and examples from a test project: goes beyond the utility described above to query and export the predictions, evaluation results, and other information to programmatically add to your reports.