Bootstrap Few-shot Prompting with LangSmith
Prompt engineering is a pain. You can use examples to optimize the prompt for you with the help of tools like LangSmith. Instead of guessing which examples will be the most impactful, you can use tried-and-true evaluation practices to curate and compile the right examples for your pipeline. The main steps are:
- Create a dataset
- Pick a metric to improve
- Create an initial system
- Decide the update logic (few-shot examples vs. instruction teaching vs. other methods, how to format the examples, etc.)
- Train!
Below is an example of bootstrapping a gpt-3.5-turbo model on an entailment task using few-shot examples. This example is inspired by Christopher Potts' example on the SCONE dataset.
The task is natural language inference (NLI): the LLM must predict whether a statement can be logically concluded from a premise (the grounding statement).
%pip install -U langsmith langchain langchain_openai pandas
import os
# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"
os.environ["OPENAI_API_KEY"] = "YOUR API KEY"
# Cache LLM calls in a local SQLite database to avoid repeating identical requests
from langchain.cache import SQLiteCache
from langchain_core.globals import set_llm_cache
set_llm_cache(SQLiteCache(database_path=".langchain.db"))
from langsmith import Client
client = Client()
public_datasets = [
    "https://smith.langchain.com/public/1d065de2-56c1-496e-bc66-bdce308e6537/d",  # train
    "https://smith.langchain.com/public/3205fa05-bd78-4eaf-924f-96df0f577b1f/d",  # train2
    "https://smith.langchain.com/public/fdf16166-1edd-418f-b777-3af82034931d/d",  # dev
    "https://smith.langchain.com/public/aee61506-3c60-4ca8-95c4-0314c9719ca8/d",  # dev2
    "https://smith.langchain.com/public/8d40d210-f8e6-4def-a206-78c5080c5d53/d",  # test
]
for ds in public_datasets:
    client.clone_public_dataset(ds)
train_name = "scone-train2"
dev_name = "scone-dev2"
test_name = "scone-test-one-scoped"
full_test_name = "scone-test"
example = next(client.list_examples(dataset_name=train_name))
print("inputs", example.inputs)
print("outputs", example.outputs)
inputs {'context': 'A man who does not walk confidently dropping produce.', 'question': 'Can we logically conclude for sure that a man who does not walk confidently dropping kale?'}
outputs {'answer': 'No', 'category': 'one_not_scoped'}
As the values above show, these examples can be tricky!
Evaluator
Since we have ground-truth classification labels, we can use an exact-match criterion as our evaluator.
from langsmith.evaluation import run_evaluator
@run_evaluator
def exact_match(run, example):
    # Evaluate the exact match correctness of the NLI result
    try:
        # String outputs: compare predicted and expected labels, ignoring case
        predicted = run.outputs["is_entailed"]
        expected = example.outputs["answer"]
        score = expected.lower() == predicted.lower()
    except Exception:
        try:
            # Structured outputs: compare a boolean is_entailed to the Yes/No label
            expected = example.outputs["answer"]
            expected_bool = {"no": False, "yes": True}.get(expected.strip().lower())
            score = run.outputs["output"].is_entailed == expected_bool
        except Exception:
            # Anything unparseable counts as a miss
            score = 0
    return {
        "key": "exact_match",
        "score": int(score),
    }
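To make the two branches of the evaluator easier to see, here is a minimal standalone sketch of the same comparison that runs without LangSmith. The function name `score_prediction` and the `LABEL_TO_BOOL` table are illustrative assumptions, not part of the LangSmith API.

```python
from typing import Union

# Assumed mapping from the dataset's Yes/No labels to booleans.
LABEL_TO_BOOL = {"yes": True, "no": False}


def score_prediction(predicted: Union[str, bool], expected: str) -> int:
    """Score a prediction that is either a 'Yes'/'No' string or a boolean."""
    expected_norm = expected.strip().lower()
    if isinstance(predicted, bool):
        # Structured-output branch: compare booleans.
        return int(LABEL_TO_BOOL.get(expected_norm) is predicted)
    # String branch: case-insensitive exact match.
    return int(str(predicted).strip().lower() == expected_norm)


print(score_prediction("No", "no"))   # 1
print(score_prediction(True, "No"))   # 0
```

Note that a string such as "No." would still score 0 here: exact match is strict about anything beyond case and surrounding whitespace.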
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
# And we will create a placeholder in the template to add few-shot examples
prompt = PromptTemplate.from_template(
"""You are given some context (a premise) and a question (a hypothesis). You must indicate with Yes/No answer whether we can logically conclude the hypothesis from the premise.
---
Follow the following format.
Context: ${{context}}
Question: ${{question}}
Reasoning: Let's think step by step in order to ${{produce the answer}}. We ...
Answer: Yes or No
---{examples}
Context: {context}
Question: {question}
Reasoning: Let's think step by step in order to"""
).partial(examples="")
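def parse(pred: str):
    fnd = "\nAnswer:"
    idx = pred.find(fnd)
    if idx == -1:
        # No answer marker: return an empty answer rather than slicing at a bogus index
        return {"is_entailed": "", "reasoning": pred.strip()}
    answer = pred[idx + len(fnd) :].strip()
    return {"is_entailed": answer, "reasoning": pred[:idx].strip()}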
chain = prompt | ChatOpenAI(model="gpt-3.5-turbo") | StrOutputParser() | parse
prediction = chain.invoke(example.inputs)
prediction
{'is_entailed': 'No',
'reasoning': 'produce the answer. We know that the man does not walk confidently and drops produce. However, dropping produce does not necessarily mean he drops kale specifically. He could be dropping any type of produce.'}
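Because the evaluator compares strings exactly, an answer such as "No." or "Yes, because ..." would count as wrong. One optional way to guard against that is to normalize the parsed answer before scoring. The helper below is a hypothetical sketch (the name `normalize_label` is an assumption and is not used elsewhere in this notebook).

```python
import re
from typing import Optional


def normalize_label(answer: str) -> Optional[str]:
    """Map a free-form answer such as 'No.' or 'yes, because...' to 'Yes' or 'No'.

    Returns None when the answer does not start with a clear yes/no token.
    """
    match = re.match(r"\s*(yes|no)\b", answer, flags=re.IGNORECASE)
    return match.group(1).capitalize() if match else None


print(normalize_label("No."))              # No
print(normalize_label(" yes, because"))    # Yes
print(normalize_label("Not enough info"))  # None
```

The word boundary in the pattern keeps words like "Not" or "Nope" from being read as "No".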
Initial Evaluation
from langchain.smith import RunEvalConfig
eval_config = RunEvalConfig(
    custom_evaluators=[exact_match],
)
res = client.run_on_dataset(
    dataset_name="scone-test2",  # dev_name,
    llm_or_chain_factory=chain,
    evaluation=eval_config,
    project_metadata={"optimizer": None},
)
View the evaluation results for project 'passionate-copy-48' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f1b328a2-b4e8-473c-808f-e042d38f6ebd/compare?selectedSessions=bb3d33aa-53a1-4d63-8b79-3758df4b1fb7
View all tests for Dataset scone-test2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f1b328a2-b4e8-473c-808f-e042d38f6ebd
[------------------------------------------------->] 200/200
Got about 55% on it. Definitely room for improvement.
✨ Optimize ✨
This just means "use data to update the system". At present, LangChain runnables don't natively support a "backwards" method (à la PyTorch), but you can pretty easily define updates/mutations for the key components you'd want to update (such as prompts or LLMs).
For instance, component-wise, you could apply:
- Few shot prompting: add an additional string input or MessagesPlaceholder in the prompt template
- Updating the instructions: update the prompt template directly (likely the system prompt)
- LLM: fine-tune the model (i.e., do an actual backwards pass).
We will focus on few-shot prompting to limit the search space. We will then apply a simple search loop, loosely in the spirit of a genetic/evolutionary algorithm, to compare the performance of different sets of few-shot examples and pick the ones that provide the most "lift" on the chosen metric.
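As a rough illustration of that selection idea, the sketch below runs a plain random search over candidate example sets and keeps the best-scoring one. Everything in it is a stand-in: `random_search`, `toy_metric`, and the string "examples" are hypothetical, and the metric is a made-up function rather than an LLM evaluation.

```python
import random
from typing import Callable, List, Sequence, Tuple


def random_search(
    pool: Sequence[str],
    score_fn: Callable[[List[str]], float],
    k: int = 3,
    steps: int = 5,
    seed: int = 0,
) -> Tuple[float, List[str]]:
    """Try `steps` random k-subsets of `pool` and return the best (score, subset)."""
    rng = random.Random(seed)
    best_score, best_subset = score_fn([]), []  # baseline: no few-shot examples
    for _ in range(steps):
        candidate = rng.sample(list(pool), k)
        score = score_fn(candidate)
        if score > best_score:
            best_score, best_subset = score, candidate
    return best_score, best_subset


# Made-up metric: pretend that examples containing a negation help the most.
def toy_metric(examples: List[str]) -> float:
    return 0.5 + 0.1 * sum("not" in e for e in examples)


pool = ["not A", "B", "not C", "D", "not E", "F"]
best_score, best_subset = random_search(pool, toy_metric)
print(best_score, best_subset)
```

In the real loop, the score function is a full evaluation run on the dev set, so each step costs one pass over that dataset.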
We'll first create a constructor for our chain that accepts the few-shot examples, letting us re-create the chain with each updated state.
# We will define how we want our few-shot examples to be formatted
import random
from typing import List, Optional
from langchain_core.runnables import RunnableLambda
def format_example(example: dict):
    inputs = example["input"]
    outputs = example["output"]
    return f"""
Context: {inputs['context']}
Question: {inputs['question']}
Reasoning: {outputs['reasoning']}
Answer: {outputs['is_entailed']}
"""
def format_few_shot(input_: dict, examples: Optional[List[dict]] = None):
    if examples:
        # TODO: make this configurable / bound to the prompt template
        input_["examples"] = "--".join(format_example(e) for e in examples) + "--"
    return input_
def create_chain(examples: Optional[List] = None, llm=None):
    llm = llm or ChatOpenAI(model="gpt-3.5-turbo")
    chain = (
        RunnableLambda(format_few_shot).bind(examples=examples)
        | prompt
        | llm
        | StrOutputParser()
        | parse
    ).with_config(tags=["to_train"])
    return chain
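To see what ends up in the `{examples}` slot of the prompt, here is a small standalone sketch that mirrors the formatting above using plain strings. The function `render_examples` and the two demo records are illustrative assumptions; the real chain uses `format_example` and `format_few_shot`.

```python
from typing import List


def render_examples(examples: List[dict]) -> str:
    """Join formatted examples with the same '--' separator used by format_few_shot."""
    blocks = [
        "\n"
        f"Context: {e['input']['context']}\n"
        f"Question: {e['input']['question']}\n"
        f"Reasoning: {e['output']['reasoning']}\n"
        f"Answer: {e['output']['is_entailed']}\n"
        for e in examples
    ]
    return "--".join(blocks) + "--"


# Made-up demonstrations in the {"input": ..., "output": ...} shape used above.
demo = [
    {
        "input": {"context": "A dog runs.", "question": "Does an animal run?"},
        "output": {"reasoning": "A dog is an animal.", "is_entailed": "Yes"},
    },
    {
        "input": {"context": "No dog runs.", "question": "Does a poodle run?"},
        "output": {"reasoning": "A poodle is a dog.", "is_entailed": "No"},
    },
]
rendered = render_examples(demo)
print("---" + rendered)
```

Each demonstration carries its reasoning trace as well as its answer, which is what lets a stronger model's reasoning be reused by a weaker one later on.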
Training
Next, we'll define the training utilities.
from langchain_core.tracers.context import collect_runs
def step(
    construct_chain,
    train_examples,
    eval_config,
    examples=None,
    bootstrap_k: int = 8,
):
    collected = examples.copy() if examples else []
    # Copy before shuffling so the caller's list is neither reordered nor consumed
    train_examples = train_examples.copy()
    random.shuffle(train_examples)
    evaluator = eval_config.custom_evaluators[0]
    # TODO: Batching to speed it up
    while train_examples:
        if len(collected) >= bootstrap_k:
            break
        # Never pop more examples than remain
        batch_size = min(bootstrap_k - len(collected), len(train_examples))
        train_batch = [train_examples.pop() for _ in range(batch_size)]
        batch_ids = {e.id for e in train_batch}
        # Don't show an example as a few-shot demonstration for itself
        chain = construct_chain([e for e in collected if e["id"] not in batch_ids])
        with collect_runs() as cb:
            chain.batch([e.inputs for e in train_batch])
        # Assumes cb.traced_runs is returned in the same order as train_batch
        for run, example in zip(cb.traced_runs, train_batch):
            metric = evaluator.evaluate_run(run, example)
            score = metric.score
            # Keep only the traces the evaluator marked as correct
            if score:
                collected.append(
                    {
                        "input": example.inputs,
                        "output": run.outputs,
                        "id": example.id,
                    }
                )
    return collected
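The core of the bootstrapping step is "run the current system on training examples and keep the traces it got right". The toy version below shows only that filter, with a hard-coded predictor standing in for the LLM chain. The names `bootstrap_demos` and `always_no`, and the toy data, are hypothetical.

```python
import random
from typing import Callable, List


def bootstrap_demos(
    train_set: List[dict],
    predict: Callable[[dict], str],
    k: int = 2,
    seed: int = 0,
) -> List[dict]:
    """Collect up to k training examples that the current system answers correctly."""
    rng = random.Random(seed)
    candidates = train_set.copy()  # leave the caller's list untouched
    rng.shuffle(candidates)
    collected: List[dict] = []
    for item in candidates:
        if len(collected) >= k:
            break
        prediction = predict(item["inputs"])
        if prediction.lower() == item["answer"].lower():
            collected.append({"input": item["inputs"], "output": prediction})
    return collected


# Stand-in for the LLM chain: always answers "No".
def always_no(inputs: dict) -> str:
    return "No"


toy_train = [
    {"inputs": {"question": "q1"}, "answer": "No"},
    {"inputs": {"question": "q2"}, "answer": "Yes"},
    {"inputs": {"question": "q3"}, "answer": "No"},
]
demos = bootstrap_demos(toy_train, always_no)
print(len(demos))  # 2
```

Filtering on correctness keeps the demonstrations free of wrong answers, though it does not guarantee that the reasoning behind a correct answer is sound.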
# Note: this helper shadows Python's built-in eval() within the notebook
def eval(eval_dataset, chain, eval_config, step_n) -> float:
    """Compute the metrics on the validation dataset."""
    dev_results = client.run_on_dataset(
        dataset_name=eval_dataset,
        llm_or_chain_factory=chain,
        evaluation=eval_config,
        verbose=True,
        concurrency_level=8,
        project_metadata={
            "step": step_n,
        },
    )
    df = dev_results.to_dataframe()
    # Assumes a single feedback metric for now
    feedback_key = [c for c in df.columns if c.startswith("feedback.")][0]
    return df[feedback_key].mean()
def train(
    chain_constructor,
    train_dataset,
    eval_dataset,
    eval_config,
    steps: int = 5,
    k: int = 8,
    bootstrap_k: int = 8,
):
    """Run the full training loop"""
    # Step 0 is the baseline: the chain with no few-shot examples
    best_score = eval(eval_dataset, chain_constructor(), eval_config, 0)
    best_step = 0
    scores = [(best_score, [])]
    train_examples = list(client.list_examples(dataset_name=train_dataset))
    for step_number in range(steps):
        collected = step(
            chain_constructor, train_examples, eval_config, bootstrap_k=bootstrap_k
        )
        if len(collected) < k:
            # TODO: probably want some diversity of labels here
            to_sample = min(k - len(collected), len(train_examples))
            # Pad with labeled examples, converted to the dict shape that
            # format_example expects (no reasoning trace exists for these)
            collected += [
                {
                    "input": e.inputs,
                    "output": {"reasoning": "", "is_entailed": e.outputs["answer"]},
                    "id": e.id,
                }
                for e in random.sample(train_examples, to_sample)
            ]
        selected_examples = collected
        updated_chain = chain_constructor(examples=selected_examples)
        updated_score = eval(eval_dataset, updated_chain, eval_config, step_number + 1)
        scores.append((updated_score, selected_examples))
        if updated_score > best_score:
            print(
                f"New best score {updated_score} > {best_score}. Updating selected examples."
            )
            best_score = updated_score
            best_step = step_number + 1
        else:
            print("Underperformed. Continuing")
    print("Best overall score: ", best_score)
    print("Best step: ", best_step)
    return sorted(scores, key=lambda x: x[0], reverse=True)
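The list returned by `train` pairs each score with the examples that produced it, best first. The snippet below illustrates how that ranking behaves on made-up data (the scores and ids are invented for the illustration).

```python
from typing import List, Tuple

# Made-up (score, few-shot examples) pairs, shaped like the list train() builds.
scores: List[Tuple[float, List[dict]]] = [
    (0.86, []),
    (0.88, [{"id": "a"}]),
    (0.92, [{"id": "b"}]),
    (0.88, [{"id": "c"}]),
]

# Sort on the score only; tied entries keep their original, earlier-step-first order.
ranked = sorted(scores, key=lambda pair: pair[0], reverse=True)
best_score, best_examples = ranked[0]
print(best_score, best_examples)  # 0.92 [{'id': 'b'}]
```

Sorting with an explicit key matters here: without it, Python would compare the example lists whenever two scores tie, and dictionaries do not support ordering, so the sort would raise a `TypeError`.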
Train
Now we can finally run the training loop!
import functools
# We will train with gpt-4-turbo
llm = ChatOpenAI(model="gpt-4-turbo-preview")
all_scores = train(
    functools.partial(create_chain, llm=llm),
    train_name,
    dev_name,
    eval_config,
    steps=10,
)
View the evaluation results for project 'bold-show-44' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=0478dc12-5f1a-4d1b-84d6-95699f05bf77
View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50
Experiment Results:
| | feedback.exact_match | error | execution_time | run_id |
|---|---|---|---|---|
| count | 50.00000 | 0 | 50.000000 | 50 |
| unique | NaN | 0 | NaN | 50 |
| top | NaN | NaN | NaN | e45cdb67-3ae6-48b6-9db1-6fe09e39e6a3 |
| freq | NaN | NaN | NaN | 1 |
| mean | 0.86000 | NaN | 0.021456 | NaN |
| std | 0.35051 | NaN | 0.011425 | NaN |
| min | 0.00000 | NaN | 0.007727 | NaN |
| 25% | 1.00000 | NaN | 0.013763 | NaN |
| 50% | 1.00000 | NaN | 0.019525 | NaN |
| 75% | 1.00000 | NaN | 0.023224 | NaN |
| max | 1.00000 | NaN | 0.059278 | NaN |
View the evaluation results for project 'giving-record-97' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=c181b376-6214-4130-8d6e-87ee7c0cfd5f
View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50
Experiment Results:
| | feedback.exact_match | error | execution_time | run_id |
|---|---|---|---|---|
| count | 50.00000 | 0 | 50.000000 | 50 |
| unique | NaN | 0 | NaN | 50 |
| top | NaN | NaN | NaN | ef1483cc-1040-4ebb-a0b0-f770bc9411c5 |
| freq | NaN | NaN | NaN | 1 |
| mean | 0.86000 | NaN | 9.071231 | NaN |
| std | 0.35051 | NaN | 4.016930 | NaN |
| min | 0.00000 | NaN | 4.513033 | NaN |
| 25% | 1.00000 | NaN | 6.605231 | NaN |
| 50% | 1.00000 | NaN | 7.932223 | NaN |
| 75% | 1.00000 | NaN | 10.160974 | NaN |
| max | 1.00000 | NaN | 24.512853 | NaN |
Underperformed. Continuing
View the evaluation results for project 'proper-man-52' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=13f9f137-b12b-41c8-bc51-fc65aed67594
View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[-----------------------> ] 24/50
Error Type: BadRequestError, Message: Error code: 400 - {'error': {'message': 'You requested a model that is not compatible with this engine. Please contact us through our help center at help.openai.com for further questions.', 'type': 'invalid_request_error', 'param': 'model', 'code': None}}
[------------------------------------------------->] 50/50
Experiment Results:
| | feedback.exact_match | error | execution_time | run_id |
|---|---|---|---|---|
| count | 49.000000 | 1 | 50.000000 | 50 |
| unique | NaN | 1 | NaN | 50 |
| top | NaN | Error code: 400 - {'error': {'message': 'You r... | NaN | c3388800-20aa-4c72-8e1c-f96632355fcf |
| freq | NaN | 1 | NaN | 1 |
| mean | 0.836735 | NaN | 10.026921 | NaN |
| std | 0.373438 | NaN | 4.115617 | NaN |
| min | 0.000000 | NaN | 0.559937 | NaN |
| 25% | 1.000000 | NaN | 7.325939 | NaN |
| 50% | 1.000000 | NaN | 9.343092 | NaN |
| 75% | 1.000000 | NaN | 11.909372 | NaN |
| max | 1.000000 | NaN | 24.057484 | NaN |
Underperformed. Continuing
View the evaluation results for project 'proper-quiet-36' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=c6f18469-7df3-41d5-bd70-10ee4a076182
View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[----------------------------> ] 29/50
Error Type: BadRequestError, Message: Error code: 400 - {'error': {'message': 'You requested a model that is not compatible with this engine. Please contact us through our help center at help.openai.com for further questions.', 'type': 'invalid_request_error', 'param': 'model', 'code': None}}
[------------------------------------------------->] 50/50
Experiment Results:
| | feedback.exact_match | error | execution_time | run_id |
|---|---|---|---|---|
| count | 49.000000 | 1 | 50.000000 | 50 |
| unique | NaN | 1 | NaN | 50 |
| top | NaN | Error code: 400 - {'error': {'message': 'You r... | NaN | ac830a9d-4169-49b6-a843-0f4afe138865 |
| freq | NaN | 1 | NaN | 1 |
| mean | 0.897959 | NaN | 7.242384 | NaN |
| std | 0.305839 | NaN | 2.108956 | NaN |
| min | 0.000000 | NaN | 0.525809 | NaN |
| 25% | 1.000000 | NaN | 6.170674 | NaN |
| 50% | 1.000000 | NaN | 6.969927 | NaN |
| 75% | 1.000000 | NaN | 8.018508 | NaN |
| max | 1.000000 | NaN | 12.737470 | NaN |
New best score 0.8979591836734694 > 0.86. Updating selected examples.
View the evaluation results for project 'advanced-competition-88' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=31ece295-31c4-4c3c-b9f0-a1df3dd09adb
View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50
Experiment Results:
| | feedback.exact_match | error | execution_time | run_id |
|---|---|---|---|---|
| count | 50.00000 | 0 | 50.000000 | 50 |
| unique | NaN | 0 | NaN | 50 |
| top | NaN | NaN | NaN | e2d59128-29e4-4562-bc11-93bb60738953 |
| freq | NaN | NaN | NaN | 1 |
| mean | 0.86000 | NaN | 8.488865 | NaN |
| std | 0.35051 | NaN | 4.301064 | NaN |
| min | 0.00000 | NaN | 3.736222 | NaN |
| 25% | 1.00000 | NaN | 6.037187 | NaN |
| 50% | 1.00000 | NaN | 6.998608 | NaN |
| 75% | 1.00000 | NaN | 9.773248 | NaN |
| max | 1.00000 | NaN | 26.641730 | NaN |
Underperformed. Continuing
View the evaluation results for project 'drab-print-47' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=70686baf-1859-4bcf-91b3-82c41843cd86
View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50
Experiment Results:
| | feedback.exact_match | error | execution_time | run_id |
|---|---|---|---|---|
| count | 50.000000 | 0 | 50.000000 | 50 |
| unique | NaN | 0 | NaN | 50 |
| top | NaN | NaN | NaN | 1bd0827b-b405-4bdc-8eb0-ed3105d94e4d |
| freq | NaN | NaN | NaN | 1 |
| mean | 0.900000 | NaN | 10.443896 | NaN |
| std | 0.303046 | NaN | 13.421476 | NaN |
| min | 0.000000 | NaN | 4.744148 | NaN |
| 25% | 1.000000 | NaN | 6.975307 | NaN |
| 50% | 1.000000 | NaN | 8.340018 | NaN |
| 75% | 1.000000 | NaN | 9.440450 | NaN |
| max | 1.000000 | NaN | 101.049986 | NaN |
New best score 0.9 > 0.8979591836734694. Updating selected examples.
View the evaluation results for project 'impressionable-writer-19' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=1f31eff6-8ab8-4b16-baa5-6f3669f4dead
View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50
Experiment Results:
| | feedback.exact_match | error | execution_time | run_id |
|---|---|---|---|---|
| count | 50.000000 | 0 | 50.000000 | 50 |
| unique | NaN | 0 | NaN | 50 |
| top | NaN | NaN | NaN | 041fd757-fb44-4a79-8dcf-d0ab006622f1 |
| freq | NaN | NaN | NaN | 1 |
| mean | 0.880000 | NaN | 7.219473 | NaN |
| std | 0.328261 | NaN | 2.151543 | NaN |
| min | 0.000000 | NaN | 3.604611 | NaN |
| 25% | 1.000000 | NaN | 5.412153 | NaN |
| 50% | 1.000000 | NaN | 7.344393 | NaN |
| 75% | 1.000000 | NaN | 8.157682 | NaN |
| max | 1.000000 | NaN | 13.777614 | NaN |
Underperformed. Continuing
View the evaluation results for project 'drab-map-24' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=aa3fb10d-f9a7-47ac-a90d-c385085339fc
View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50
Experiment Results:
| | feedback.exact_match | error | execution_time | run_id |
|---|---|---|---|---|
| count | 50.000000 | 0 | 50.000000 | 50 |
| unique | NaN | 0 | NaN | 50 |
| top | NaN | NaN | NaN | e8f88ef2-8d1e-4323-ac51-0c7ba1c6b0fd |
| freq | NaN | NaN | NaN | 1 |
| mean | 0.880000 | NaN | 7.352010 | NaN |
| std | 0.328261 | NaN | 2.876893 | NaN |
| min | 0.000000 | NaN | 3.442488 | NaN |
| 25% | 1.000000 | NaN | 5.508052 | NaN |
| 50% | 1.000000 | NaN | 6.563693 | NaN |
| 75% | 1.000000 | NaN | 8.169192 | NaN |
| max | 1.000000 | NaN | 17.694664 | NaN |
Underperformed. Continuing
View the evaluation results for project 'best-step-66' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=1d7c26de-3ae1-470e-8c51-9b2873a442c9
View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50
Experiment Results:
| | feedback.exact_match | error | execution_time | run_id |
|---|---|---|---|---|
| count | 50.000000 | 0 | 50.000000 | 50 |
| unique | NaN | 0 | NaN | 50 |
| top | NaN | NaN | NaN | 31e30bda-a245-4f68-8596-03183b8ffcc3 |
| freq | NaN | NaN | NaN | 1 |
| mean | 0.920000 | NaN | 8.322146 | NaN |
| std | 0.274048 | NaN | 2.587044 | NaN |
| min | 0.000000 | NaN | 5.140714 | NaN |
| 25% | 1.000000 | NaN | 6.780764 | NaN |
| 50% | 1.000000 | NaN | 7.700001 | NaN |
| 75% | 1.000000 | NaN | 9.086863 | NaN |
| max | 1.000000 | NaN | 19.068444 | NaN |
New best score 0.92 > 0.9. Updating selected examples.
View the evaluation results for project 'brief-color-26' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=4b090fa5-87cf-4bab-8f90-d86d91102240
View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50
Experiment Results:
| | feedback.exact_match | error | execution_time | run_id |
|---|---|---|---|---|
| count | 50.00000 | 0 | 50.000000 | 50 |
| unique | NaN | 0 | NaN | 50 |
| top | NaN | NaN | NaN | bd2fe2a3-cb39-4287-9c79-ba214bcdae40 |
| freq | NaN | NaN | NaN | 1 |
| mean | 0.86000 | NaN | 9.189128 | NaN |
| std | 0.35051 | NaN | 5.716492 | NaN |
| min | 0.00000 | NaN | 4.791341 | NaN |
| 25% | 1.00000 | NaN | 6.648413 | NaN |
| 50% | 1.00000 | NaN | 7.485603 | NaN |
| 75% | 1.00000 | NaN | 9.478416 | NaN |
| max | 1.00000 | NaN | 41.826824 | NaN |
Underperformed. Continuing
View the evaluation results for project 'worthwhile-rabbit-93' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa/compare?selectedSessions=c8676b03-e009-4a3b-aa50-1f16a4476dbf
View all tests for Dataset scone-dev2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/0360d4ae-63e0-4e58-b4a7-97f8ef466aaa
[------------------------------------------------->] 50/50
Experiment Results:
| | feedback.exact_match | error | execution_time | run_id |
|---|---|---|---|---|
| count | 50.000000 | 0 | 50.000000 | 50 |
| unique | NaN | 0 | NaN | 50 |
| top | NaN | NaN | NaN | 83776c8b-5772-4521-8b30-17b1cc5defca |
| freq | NaN | NaN | NaN | 1 |
| mean | 0.880000 | NaN | 8.748563 | NaN |
| std | 0.328261 | NaN | 4.640876 | NaN |
| min | 0.000000 | NaN | 5.161556 | NaN |
| 25% | 1.000000 | NaN | 7.018997 | NaN |
| 50% | 1.000000 | NaN | 7.690480 | NaN |
| 75% | 1.000000 | NaN | 9.327333 | NaN |
| max | 1.000000 | NaN | 37.731715 | NaN |
Underperformed. Continuing
Best overall score: 0.92
Best step: 8
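One detail in the logs above is worth spelling out: in the two runs where a request failed, the feedback column has only 49 scored rows, so the mean is taken over 49 rather than 50. The quick check below reproduces the reported means under the assumption that 41 and 44 predictions were correct in those runs; those counts are inferred from the means and are not printed in the output.

```python
def mean_score(correct: int, total: int, errored: int = 0) -> float:
    """Mean exact-match score when errored runs are left out of the average."""
    scored = total - errored
    return correct / scored


# Correct counts below are inferred from the reported means (an assumption).
print(round(mean_score(43, 50), 6))             # 0.86
print(round(mean_score(41, 50, errored=1), 6))  # 0.836735
print(round(mean_score(44, 50, errored=1), 6))  # 0.897959
```

Under that assumption, the "New best score 0.8979591836734694" step would have scored 0.88 had the failed request been counted as a miss, which is still above the 0.86 baseline.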
Compare on held-out set
It's easy to overfit a single benchmark if you explicitly choose your pipeline based on metrics on that benchmark.
Let's compare models on an unseen test set to see whether the selected examples are reliably better.
best_score, best_examples = all_scores[0]
original_model = create_chain()
# This time we will apply gpt-3.5-turbo, but use the few-shot examples + reasoning trajectories
# from gpt-4 to help induce better performance
best_performing_model = create_chain(best_examples)
for model_name, model in [
    ("optimized", best_performing_model),
    # ("original", original_model),
]:
    client.run_on_dataset(
        dataset_name=test_name,
        llm_or_chain_factory=model,
        evaluation=eval_config,
        verbose=True,
        project_metadata={
            "model": model_name,
        },
    )
View the evaluation results for project 'shiny-ship-82' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f1b328a2-b4e8-473c-808f-e042d38f6ebd/compare?selectedSessions=368a8216-6462-4d19-8261-9709fe301b19
View all tests for Dataset scone-test2 at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f1b328a2-b4e8-473c-808f-e042d38f6ebd
[------------------------------------------------->] 200/200
Experiment Results:
| | feedback.exact_match | error | execution_time | run_id |
|---|---|---|---|---|
| count | 200.000000 | 0 | 200.000000 | 200 |
| unique | NaN | 0 | NaN | 200 |
| top | NaN | NaN | NaN | 2ab8873e-b142-4f3f-a970-0ca693ce12c2 |
| freq | NaN | NaN | NaN | 1 |
| mean | 0.870000 | NaN | 1.772289 | NaN |
| std | 0.337147 | NaN | 0.341076 | NaN |
| min | 0.000000 | NaN | 1.205090 | NaN |
| 25% | 1.000000 | NaN | 1.547561 | NaN |
| 50% | 1.000000 | NaN | 1.718797 | NaN |
| 75% | 1.000000 | NaN | 1.897174 | NaN |
| max | 1.000000 | NaN | 3.934606 | NaN |
Using the GPT-4-generated examples, we were able to boost performance on the test set from ~0.54 to ~0.87: not bad!
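To get a feel for how much of a gap like this could be sampling noise, a rough binomial standard error is a handy back-of-the-envelope check. The sketch below treats each example as an independent pass/fail trial, which is a simplifying assumption, and uses a normal approximation for the interval.

```python
import math


def binomial_standard_error(accuracy: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent pass/fail trials."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


for label, acc, n in [
    ("dev, best step", 0.92, 50),
    ("test, optimized", 0.87, 200),
]:
    se = binomial_standard_error(acc, n)
    # 1.96 standard errors gives an approximate 95% interval.
    print(f"{label}: {acc:.2f} +/- {1.96 * se:.3f}")
```

With only 50 dev examples the interval is several points wide, so small differences between training steps should be read with caution; the 200-example test set gives a tighter estimate.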