Evaluations
Run models against your data
Evaluations let you test and compare a selection of AI models against your own datasets.
Whether you're fine-tuning models or measuring performance, Oxen Evaluations simplifies the process by letting you run a prompt through every row of a dataset.
Once you're happy with the results, output the resulting dataset to a new file, another branch, or directly as a new commit.
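Each evaluation listed below pairs a prompt with placeholders such as {query} or {prediction}, which stand in for column values from the dataset. The following is a minimal sketch of that substitution step, not Oxen's implementation: the helper names `render_prompt` and `render_dataset` and the sample rows are hypothetical, and it assumes each placeholder names a column in the row.

```python
import re

# A placeholder is a column name wrapped in braces, e.g. {query}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def render_prompt(template: str, row: dict) -> str:
    """Replace each {column} placeholder with the matching value from `row`."""

    def substitute(match: re.Match) -> str:
        column = match.group(1)
        if column not in row:
            raise KeyError(f"row has no column named {column!r}")
        return str(row[column])

    return PLACEHOLDER.sub(substitute, template)


def render_dataset(template: str, rows: list) -> list:
    """Render one prompt per row, in row order."""
    return [render_prompt(template, row) for row in rows]


template = (
    "Rephrase the following question but keep the intent the same. "
    "Only respond with the new question. {query}"
)
# Hypothetical sample rows, for illustration only.
rows = [
    {"query": "What is the capital of France?"},
    {"query": "How tall is Mount Everest?"},
]
prompts = render_dataset(template, rows)
```

A regular expression is used here rather than `str.format` so that a missing column produces a clear error naming the column.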
Synthetic Question Generation
37ca1f44-1dcf-4dcb-846d-365e6fc4e270
5 row sample completed
Bessie
2 weeks ago
Prompt: Rephrase the following question but keep the intent the same. Only respond with the new question. {query}
12 iterations, 300 tokens, $0.0003
text → text, Lambda Labs / Hermes 3 405B
Source:
Answer Extraction from Qwen Results
1d089fc6-e0af-400d-9328-2aee62229752
100 rows completed
Bessie
2 weeks ago
Prompt: Extract the conclusion and just respond with the text within the <CONCLUSION></CONCLUSION> tag. For example if the tag says <CONCLUSION>1%</CONCLUSION> just respond with "1%". {prediction}
3 iterations, 27430 tokens, $0.0042
text → text, OpenAI / GPT-4o mini
Source:
qwen-72B-results
Target:
qwen-72B-results
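The evaluation above asks a model to pull the answer out of the <CONCLUSION></CONCLUSION> tag. For comparison, the same extraction can be sketched deterministically with a regular expression. This is a swapped-in technique, not what the evaluation ran, and the function name `extract_conclusion` is hypothetical.

```python
import re
from typing import Optional

# Match the text between the tags; DOTALL lets the conclusion span several lines.
CONCLUSION_RE = re.compile(
    r"<CONCLUSION>(.*?)</CONCLUSION>", re.DOTALL | re.IGNORECASE
)


def extract_conclusion(prediction: str) -> Optional[str]:
    """Return the stripped text inside the first CONCLUSION tag, or None if absent."""
    match = CONCLUSION_RE.search(prediction)
    if match is None:
        return None
    return match.group(1).strip()


example = "<REASONING>...</REASONING> <CONCLUSION>1%</CONCLUSION>"
answer = extract_conclusion(example)
```

A regular expression only works when the model closed its tag correctly; returning None for malformed responses makes those rows easy to find and hand to a model-based extractor instead.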
Qwen 72B Vision CoT
eca8d328-3a7d-4d48-ade0-df10a3dffaf2
100 rows completed
Bessie
2 weeks ago
Prompt: {imgname} Here is image and a question that I want you to answer. I need you to strictly follow the format with four specific sections: <SUMMARY></SUMMARY> <CAPTION></CAPTION> <REASONING></REASONING> <CONCLUSION></CONCLUSION>. It is crucial that you adhere to this structure exactly as outlined and that the final answer in the <CONCLUSION></CONCLUSION> matches the standard correct answer precisely. To explain further: SUMMARY: briefly explain what steps you'll take to solve the problem. CAPTION: describe the contents of the image in as much detail as possible, specifically focusing on details relevant to the question. REASONING: outline a step-by-step thought process you would use to solve the problem based on the image. CONCLUSION: give the final answer in a direct format, and it must match the correct answer exactly. If it's a multiple choice question, the conclusion should only include the option without repeating what the option is. Here's how the xml response format should look: <SUMMARY> Summarize how you will approach the problem and explain the steps you will take to reach the answer. </SUMMARY> <CAPTION> Provide a detailed description of the image, particularly emphasizing the aspects related to the question. </CAPTION> <REASONING> Provide a chain-of-thought, logical explanation of the problem. This should outline step-by-step reasoning. </REASONING> <CONCLUSION> State the final answer in a clear and direct format. It must match the correct answer exactly. </CONCLUSION> (Do not forget the <CONCLUSION></CONCLUSION>!) Please apply this format meticulously putting each section in xml tags like above. Analyze the given image and answer the related question, ensuring that the answer matches the standard one perfectly. <QUESTION> {query} </QUESTION>
2 iterations, 122053 tokens, $0.1098
image → text, Fireworks AI / Qwen2 VL 72B Instruct
Source:
Target:
qwen-72B-results
d7c091da-c2ad-4993-bd87-dabe52f286ce
100 rows completed
Bessie
2 weeks ago
Prompt: {imgname} Answer the following question very concisely. Respond with one word if possible {query}
3 iterations, 4609 tokens, $0.0008
image → text, Groq / Llama 3.2 11B Vision (Preview)
Source:
Target:
Judge 11B Direct Responses
154b8be9-7ee8-4b11-9424-2a7efb2c7d13
100 rows completed
Bessie
2 weeks ago
Prompt: Are the two responses equivalent? Ignore punctuation and irrelevant characters and differences in verb tense. Reply with true or false. One word all lowercase. Response 1: {label} Response 2: {prediction}
4 iterations, 5285 tokens, $0.0008
text → text, OpenAI / GPT-4o mini
Source:
llama-3.2-11B-direct-answers
Target:
llama-3.2-11B-direct-answers
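The judge prompt above asks for a one-word reply, "true" or "false", per row. A minimal sketch of turning those replies into an accuracy figure follows. The function name `judge_accuracy` and the sample replies are hypothetical, and skipping unparseable replies (rather than counting them as wrong) is an assumption you may want to change.

```python
from typing import Iterable, Tuple


def judge_accuracy(judgments: Iterable[str]) -> Tuple[float, int]:
    """Return (accuracy over valid replies, number of invalid replies)."""
    correct = 0
    valid = 0
    invalid = 0
    for reply in judgments:
        word = reply.strip().lower()
        if word == "true":
            correct += 1
            valid += 1
        elif word == "false":
            valid += 1
        else:
            # Anything other than true/false is reported separately.
            invalid += 1
    accuracy = correct / valid if valid else 0.0
    return accuracy, invalid


# Hypothetical judge replies, for illustration only.
replies = ["true", "false", "True ", "maybe"]
accuracy, invalid = judge_accuracy(replies)
```

Reporting the invalid count alongside the accuracy shows how often the judge ignored the one-word format.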
Answer questions directly with Llama 3.2 11B
466c5926-6157-4ff6-a8e7-f0bbd5bd8fb3
100 rows completed
Bessie
2 weeks ago
Prompt: {imgname} Answer the following question succinctly with a single word if possible. Question: {query}
3 iterations, 4627 tokens, $0.0008
image → text, Groq / Llama 3.2 11B Vision (Preview)
Source:
Target:
llama-3.2-11B-direct-answers
Judge 11B Responses
d61b6b29-ff7b-4332-9192-6376fd661469
100 rows completed
Bessie
2 weeks ago
Prompt: Are the two responses equivalent? Ignore punctuation and irrelevant characters and differences in verb tense. Reply with true or false. One word all lowercase. Response 1: {label} Response 2: {conclusion}
5 iterations, 5178 tokens, $0.0137
text → text, OpenAI / GPT-4o
Source:
llama-3.2-11B-cot-separate-steps
Target:
llama-3.2-11B-cot-separate-steps
Llama 90B Conclusions
e47608b7-a6f2-4c03-b32f-0a6cb7b85a48
1 / 100 rows, error
Bessie
2 weeks ago
Prompt: {imgname} I have an image and a question that I want you to answer. Take the following summary, caption, and reasoning to come up with a final conclusion. Give the final answer in a direct format, and it must be concise match the correct answer exactly. Do not ramble, just give the final answer, no other words. If it is a numeric value just answer with the number. If it's a multiple choice question, the conclusion should only include the option without repeating what the option is. Question: {query} Summary: {summary} Caption: {caption} Reasoning: {reasoning} Conclusion:
1 iteration, 1984 tokens, $0.0000
image → text, Groq / Llama 3.2 11B Vision (Preview)
Source:
llama-90B-CoT-separate-steps
Target:
Llama 11B Conclusion
428ae1bf-1c58-42b0-bd5f-3a90a5aa7637
100 rows completed
Bessie
2 weeks ago
Prompt: {imgname} I have an image and a question that I want you to answer. Take the following summary, caption, and reasoning to come up with a final conclusion. Give the final answer in a direct format, and it must be concise match the correct answer exactly. Do not ramble, just give the final answer, no other words. If it is a numeric value just answer with the number. If it's a multiple choice question, the conclusion should only include the option without repeating what the option is. Question: {query} Summary: {summary} Caption: {caption} Reasoning: {reasoning} Conclusion:
3 iterations, 83122 tokens, $0.0150
image → text, Groq / Llama 3.2 11B Vision (Preview)
Source:
llama-3.2-11B-cot-separate-steps
Target:
llama-3.2-11B-cot-separate-steps
Llama 11B Reasoning
c1957ee0-d55b-4345-ad8a-138a2302a003
100 rows completed
Bessie
2 weeks ago
Prompt: {imgname} I have an image and a question that I want you to answer. Outline a step-by-step thought process you would use to solve the problem based on the image. Question: {query} Reasoning:
1 iteration, 30674 tokens, $0.0055
image → text, Groq / Llama 3.2 11B Vision (Preview)
Source:
llama-3.2-11B-cot-separate-steps
Target:
llama-3.2-11B-cot-separate-steps
Llama 11B Captions
8d745c64-e9d2-4a59-911c-235c5b712760
100 rows completed
Bessie
2 weeks ago
Prompt: {imgname} I have an image and a question that I want you to answer. Caption the image in detail. Describe the contents of the image, specifically focusing on details relevant to the question. Question: {query} Caption:
2 iterations, 32421 tokens, $0.0058
image → text, Groq / Llama 3.2 11B Vision (Preview)
Source:
llama-3.2-11B-cot-separate-steps
Target:
llama-3.2-11B-cot-separate-steps
Llama 11B Summary
bcc8c95f-63de-402e-8193-dd3b18ff4a9e
100 rows completed
Bessie
2 weeks ago
Prompt: {imgname} I have an image and a question that I want you to answer. Summarize everything everything you would need to do to answer the question. Question: {query} Summary:
2 iterations, 24567 tokens, $0.0044
image → text, Groq / Llama 3.2 11B Vision (Preview)
Source:
Target:
llama-3.2-11B-cot-separate-steps
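The four Llama 11B evaluations above (Summary, Captions, Reasoning, Conclusion) all write to llama-3.2-11B-cot-separate-steps, and the Conclusion prompt reads {summary}, {caption}, and {reasoning}. That suggests a chain in which each run adds a column that a later run consumes. A minimal sketch of such a chain follows. The prompts are abbreviated, `run_chain` and `fake_model` are hypothetical names, the sample row is invented, and the model call is a stub rather than a real API.

```python
from typing import Callable, Dict, List, Tuple

# (output column, abbreviated prompt template) in the order the steps run.
STEPS: List[Tuple[str, str]] = [
    ("summary", "{imgname} Summarize what you would need to do. Question: {query} Summary:"),
    ("caption", "{imgname} Caption the image in detail. Question: {query} Caption:"),
    ("reasoning", "{imgname} Outline a step-by-step thought process. Question: {query} Reasoning:"),
    (
        "conclusion",
        "{imgname} Question: {query} Summary: {summary} "
        "Caption: {caption} Reasoning: {reasoning} Conclusion:",
    ),
]


def run_chain(
    row: Dict[str, str],
    steps: List[Tuple[str, str]],
    call_model: Callable[[str], str],
) -> Dict[str, str]:
    """Run each step in order, storing its reply as a new column on the row."""
    result = dict(row)  # copy so the caller's row is left unchanged
    for column, template in steps:
        prompt = template.format(**result)
        result[column] = call_model(prompt)
    return result


def fake_model(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return f"[stub reply to {len(prompt)} chars]"


# Hypothetical row, for illustration only.
row = {"imgname": "chart_001.png", "query": "Which bar is tallest?"}
filled = run_chain(row, STEPS, fake_model)
```

Splitting the work this way keeps each prompt short, at the cost of one model call per step per row.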
Llama 90B Reasoning
315ba916-3d6b-4f45-9ac6-931166df3682
100 rows completed
Bessie
2 weeks ago
Prompt: {imgname} I have an image and a question that I want you to answer. Outline a step-by-step thought process you would use to solve the problem based on the image. Question: {query} Reasoning:
1 iteration, 29807 tokens, $0.0268
image → text, Groq / Llama 3.2 90B Vision (Preview)
Source:
llama-90B-CoT-separate-steps
Target:
llama-90B-CoT-separate-steps
Llama 90B Caption
3541fabd-5121-43cc-9059-dc04e061196e
100 rows completed
Bessie
2 weeks ago
Prompt: {imgname} I have an image and a question that I want you to answer. Caption the image in detail. Describe the contents of the image, specifically focusing on details relevant to the question. Question: {query}
1 iteration, 23480 tokens, $0.0211
image → text, Groq / Llama 3.2 90B Vision (Preview)
Source:
llama-90B-CoT-separate-steps
Target:
llama-90B-CoT-separate-steps
Llama 90B
9bf56263-d835-4247-a853-046d32cc67b9
100 rows completed
Bessie
2 weeks ago
Prompt: {imgname} I have an image and a question that I want you to answer. Summarize everything everything you would need to do to answer the question. Describe how you will approach the problem step by step and create a plan. Question: {query}
2 iterations, 31030 tokens, $0.0279
image → text, Groq / Llama 3.2 90B Vision (Preview)
Source:
Target:
llama-90B/summaries
Extract Conclusions Llama 3.2 90B
f76f2a5e-7573-4c2b-b8f5-491a9089ee38
100 rows completed
Bessie
2 weeks ago
Prompt: Extract the conclusion from the text, respond with only the text after the <CONCLUSION> tag {prediction}
2 iterations, 19045 tokens, $0.0034
text → text, OpenAI / GPT-4o mini
Source:
Llama-3.2-90B-CoT-100ex
Target:
Llama-3.2-90B-CoT-100ex
Llama 3.2 90B CoT Reasoning
70d53820-a048-47f0-9a8c-7e165956ca2e
100 rows completed
Bessie
2 weeks ago
Prompt: {imgname} Here is image and a question that I want you to answer. I need you to strictly follow the format with four specific sections: <SUMMARY></SUMMARY> <CAPTION></CAPTION> <REASONING></REASONING> <CONCLUSION></CONCLUSION>. It is crucial that you adhere to this structure exactly as outlined and that the final answer in the <CONCLUSION></CONCLUSION> matches the standard correct answer precisely. To explain further: SUMMARY: briefly explain what steps you'll take to solve the problem. CAPTION: describe the contents of the image in as much detail as possible, specifically focusing on details relevant to the question. REASONING: outline a step-by-step thought process you would use to solve the problem based on the image. CONCLUSION: give the final answer in a direct format, and it must match the correct answer exactly. If it's a multiple choice question, the conclusion should only include the option without repeating what the option is. Here's how the xml response format should look: <SUMMARY> Summarize how you will approach the problem and explain the steps you will take to reach the answer. </SUMMARY> <CAPTION> Provide a detailed description of the image, particularly emphasizing the aspects related to the question. </CAPTION> <REASONING> Provide a chain-of-thought, logical explanation of the problem. This should outline step-by-step reasoning. </REASONING> <CONCLUSION> State the final answer in a clear and direct format. It must match the correct answer exactly. </CONCLUSION> (Do not forget the <CONCLUSION></CONCLUSION>!) Please apply this format meticulously putting each section in xml tags like above. Analyze the given image and answer the related question, ensuring that the answer matches the standard one perfectly. <QUESTION> {query} </QUESTION>
1 iteration, 52502 tokens, $0.0473
image → text, Groq / Llama 3.2 90B Vision (Preview)
Source:
Target:
Llama-3.2-90B-CoT-100ex
Extract Conclusions Llama 11B
6869326f-c1ff-45f4-be0b-b5a1d485683e
100 rows completed
Bessie
2 weeks ago
Prompt: Extract the conclusion from the text, respond with only the text after the <CONCLUSION> tag {prediction}
2 iterations, 28659 tokens, $0.0046
text → text, OpenAI / GPT-4o mini
Source:
Llama-3.2-11B-CoT-100ex
Target:
Llama-3.2-11B-CoT-100ex
Extract conclusions GPT-4o
6719bc92-cb04-4ef2-bbd9-96f66bf55ca0
100 rows completed
Bessie
2 weeks ago
Prompt: Extract the conclusion from the text, respond with only the text after the <CONCLUSION> tag {prediction}
3 iterations, 25332 tokens, $0.0039
text → text, OpenAI / GPT-4o mini
Source:
gpt-4o-cot
Target:
gpt-4o-cot
GPT-4o Chain of Thought
bb568442-5903-4f00-a112-dc46fd34a8cd
100 rows completed
Bessie
2 weeks ago
Prompt: {imgname} Here is image and a question that I want you to answer. I need you to strictly follow the format with four specific sections: <SUMMARY></SUMMARY> <CAPTION></CAPTION> <REASONING></REASONING> <CONCLUSION></CONCLUSION>. It is crucial that you adhere to this structure exactly as outlined and that the final answer in the <CONCLUSION></CONCLUSION> matches the standard correct answer precisely. To explain further: SUMMARY: briefly explain what steps you'll take to solve the problem. CAPTION: describe the contents of the image in as much detail as possible, specifically focusing on details relevant to the question. REASONING: outline a step-by-step thought process you would use to solve the problem based on the image. CONCLUSION: give the final answer in a direct format, and it must match the correct answer exactly. If it's a multiple choice question, the conclusion should only include the option without repeating what the option is. Here's how the xml response format should look: <SUMMARY> Summarize how you will approach the problem and explain the steps you will take to reach the answer. </SUMMARY> <CAPTION> Provide a detailed description of the image, particularly emphasizing the aspects related to the question. </CAPTION> <REASONING> Provide a chain-of-thought, logical explanation of the problem. This should outline step-by-step reasoning. </REASONING> <CONCLUSION> State the final answer in a clear and direct format. It must match the correct answer exactly. </CONCLUSION> (Do not forget the <CONCLUSION></CONCLUSION>!) Please apply this format meticulously putting each section in xml tags like above. Analyze the given image and answer the related question, ensuring that the answer matches the standard one perfectly. <QUESTION> {query} </QUESTION>
1 iteration, 132771 tokens, $0.4999
image → text, OpenAI / GPT-4o
Source:
Target:
gpt-4o-cot
Llama 3.2 11B CoT Reasoning
85d938b9-6668-4581-9fd4-7595bcd0304a
100 rows completed
Bessie
2 weeks ago
Prompt: {imgname} Here is image and a question that I want you to answer. I need you to strictly follow the format with four specific sections: <SUMMARY></SUMMARY> <CAPTION></CAPTION> <REASONING></REASONING> <CONCLUSION></CONCLUSION>. It is crucial that you adhere to this structure exactly as outlined and that the final answer in the <CONCLUSION></CONCLUSION> matches the standard correct answer precisely. To explain further: SUMMARY: briefly explain what steps you'll take to solve the problem. CAPTION: describe the contents of the image in as much detail as possible, specifically focusing on details relevant to the question. REASONING: outline a step-by-step thought process you would use to solve the problem based on the image. CONCLUSION: give the final answer in a direct format, and it must match the correct answer exactly. If it's a multiple choice question, the conclusion should only include the option without repeating what the option is. Here's how the xml response format should look: <SUMMARY> Summarize how you will approach the problem and explain the steps you will take to reach the answer. </SUMMARY> <CAPTION> Provide a detailed description of the image, particularly emphasizing the aspects related to the question. </CAPTION> <REASONING> Provide a chain-of-thought, logical explanation of the problem. This should outline step-by-step reasoning. </REASONING> <CONCLUSION> State the final answer in a clear and direct format. It must match the correct answer exactly. </CONCLUSION> (Do not forget the <CONCLUSION></CONCLUSION>!) Please apply this format meticulously putting each section in xml tags like above. Analyze the given image and answer the related question, ensuring that the answer matches the standard one perfectly. <QUESTION> {query} </QUESTION>
1 iteration, 69199 tokens, $0.0111
image → text, Groq / Llama 3.2 11B Vision (Preview)
Source:
Target:
Llama-3.2-11B-CoT-100ex
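Four of the evaluations above run the same chain-of-thought prompt over 100 rows with different models, so their token and cost figures can be compared. The sketch below computes a blended cost per million tokens from the totals shown on each card. It assumes the token count covers both input and output, which the page does not state, and `cost_per_million` is a hypothetical helper name.

```python
# (total tokens, total cost in USD) as shown on each chain-of-thought card.
RUNS = {
    "OpenAI / GPT-4o": (132771, 0.4999),
    "Groq / Llama 3.2 90B Vision (Preview)": (52502, 0.0473),
    "Fireworks AI / Qwen2 VL 72B Instruct": (122053, 0.1098),
    "Groq / Llama 3.2 11B Vision (Preview)": (69199, 0.0111),
}


def cost_per_million(tokens: int, cost_usd: float) -> float:
    """Blended cost in USD per one million tokens."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return cost_usd / tokens * 1_000_000


rates = {model: cost_per_million(t, c) for model, (t, c) in RUNS.items()}

# Print the models from cheapest to most expensive per token.
for model, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{model}: ${rate:.2f} per 1M tokens")
```

The totals differ partly because models tokenize images and text differently, so the per-token rate and the total cost per run can rank models differently.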