Evaluations
Run models against your data
Evaluations let you test and compare a selection of AI models against your datasets.
Whether you're fine-tuning models or measuring performance, Oxen Evaluations lets you run a prompt over every row of a dataset.
Once you're happy with the results, output the resulting dataset to a new file, another branch, or directly as a new commit.
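As a rough illustration of what running a prompt through a dataset involves, the sketch below fills the `{column}` placeholders of a prompt template from each row. It is plain Python, not Oxen's API: the `render_prompt` helper and the two sample rows are hypothetical, and the template is the yes/no prompt used by several of the runs listed on this page.

```python
# Minimal sketch: fill {column} placeholders in a prompt template, one row at a time.
# render_prompt and the sample rows are illustrative assumptions, not part of Oxen.

def render_prompt(template: str, row: dict) -> str:
    """Substitute each {column} placeholder with the matching value from the row."""
    return template.format(**row)


template = (
    "According to the context, answer the question with only 1 or 0. "
    "1 for yes and 0 for no context: {passage} question: {question}"
)

# Hypothetical rows; in practice these would come from the dataset's columns.
rows = [
    {"passage": "Oxen are domesticated cattle.", "question": "are oxen cattle"},
    {"passage": "Water boils at 100 C at sea level.", "question": "does water boil at 50 C at sea level"},
]

# One rendered prompt per row; each would then be sent to the selected model.
prompts = [render_prompt(template, row) for row in rows]

for prompt in prompts:
    print(prompt)
```

Each rendered prompt corresponds to one row, which is why the runs below report progress in rows completed.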
e8049bcd-343d-428f-b8f6-3345578dc606
e8049bcd-343d-428f-b8f6-3345578dc606
56 rows completed
lilian
2 weeks ago
Prompt: Based on the premise,determine whether to agree with the hypothesis. Respond with only 0, 1 or 2. 2 for Support, 0 for Oppose, or 1 for Neutral. premise: {premise} hypothesis: {hypothesis}
3 iterations, 7950 tokens, $0.0012
text to text, OpenAI/GPT-4o mini
Qwen 72B on Boolq Validation
4c00221f-59ec-4771-a0cc-ad3e990516e5
3270 rows completed
lilian
2 weeks ago
Prompt: According to the context, answer the question with only 1 or 0. 1 for yes and 0 for no context: {passage} question: {question}
5 iterations, 662171 tokens, $0.5960
text to text, Fireworks AI/Qwen2.5 72B Instruct
Source:
conflict-main-2a556f1d-96da-402a-8bbe-15e1058c8d21
Target:
conflict-main-2a556f1d-96da-402a-8bbe-15e1058c8d21
llama 70B on Commitment Bank
bd87f366-457c-4190-8b4a-13f6abd77948
56 rows completed
lilian
2 weeks ago
Prompt: Based on the premise,determine whether to agree with the hypothesis. Respond with only 0, 1 or 2. 2 for Support, 0 for Oppose, or 1 for Neutral. premise: {premise} hypothesis: {hypothesis}
2 iterations, 8646 tokens, $0.0078
text to text, Fireworks AI/Llama v3.1 70B Instruct
llama 70B on Boolq
dc29989f-2d95-4996-9589-f14abacc86db
3270 rows completed
lilian
2 weeks ago
Prompt: According to the context, answer the question with only 1 or 0. 1 for yes and 0 for no context: {passage} question: {question}
2 iterations, 600464 tokens, $0.5404
text to text, Fireworks AI/Llama v3.1 70B Instruct
Source:
conflict-main-2a556f1d-96da-402a-8bbe-15e1058c8d21
Target:
conflict-main-2a556f1d-96da-402a-8bbe-15e1058c8d21
Qwen on Commitment Bank Q
b939a272-43f2-46ac-9e26-3b3cef2f456a
56 rows completed
lilian
2 weeks ago
Prompt: Based on the premise,determine whether to agree with the hypothesis. Respond with only 0, 1 or 2. 2 for Support, 0 for Oppose, or 1 for Neutral. premise: {premise} hypothesis: {hypothesis}
4 iterations, 9435 tokens, $0.0085
text to text, Fireworks AI/Qwen2.5 72B Instruct
Qwen 72B on Boolq Validation
a5631f40-d8b1-4343-99e1-32aa2ca08ad7
2700 / 3270 rows, error
lilian
2 weeks ago
Prompt: According to the context, answer the question with only 1 or 0. 1 for yes and 0 for no context: {passage} question: {question}
2 iterations, 546534 tokens, $0.4900
text to text, Fireworks AI/Qwen2.5 72B Instruct
Qwen 72B on Boolq Validation
3f7688c6-01b2-4c3d-a50a-476c550ca736
5 row sample completed
lilian
2 weeks ago
Prompt: According to the context, answer the question with only 1 or 0. 1 for yes and 0 for no context: {passage} question: {question}
3 iterations, 1189 tokens, $0.0011
text to text, Fireworks AI/Qwen2.5 72B Instruct
228b59d1-f298-4849-a01a-93511511a5c4
228b59d1-f298-4849-a01a-93511511a5c4
5 row sample completed
lilian
2 weeks ago
Prompt: According to the context, answer the question concisely. Try to answer in one or a few words. context: {paragraph} question: {question}
1 iteration, 2149 tokens, $0.0019
text to text, Fireworks AI/Qwen2.5 72B Instruct
bbad6e19-ad2b-4c4b-a6bc-22f0e9f33ca3
bbad6e19-ad2b-4c4b-a6bc-22f0e9f33ca3
5 row sample completed
lilian
2 weeks ago
Prompt: According to the context, answer the question concisely. Try to answer in one or a few words. context: {paragraph} question: {question}
2 iterations, 1280 tokens, $0.0012
text to text, Fireworks AI/Qwen2.5 72B Instruct
test gpt4-o
7638571d-fffc-43fc-8135-437bb7b9b1c5
5 row sample completed
lilian
1 month ago
Prompt: Use the passage as context and pick the correct answer from given choices to put into the placeholder in the query. Do not output anthing besides the answer. passage: {passage} choices: {entities} query: {query}
1 iteration, 1578 tokens, $0.0002
text to text, OpenAI/GPT-4o mini
Test llama3.1 8B on super_glue
c09749b7-c681-475e-8643-00fe5fed7d5c
5 row sample completed
lilian
1 month ago
Prompt: Use the passage as context and pick the correct answer from given choices to put into the placeholder in the query. Do not output anthing besides the answer. passage: {passage} choices: {entities} query: {query}
1 iteration, 1632 tokens, $0.0003
text to text, Fireworks AI/Llama v3.1 8B Instruct
Test llama3.1 8B
18f4ac2d-9cab-407b-9bfb-5ac9996eaebc
5 row sample completed
lilian
1 month ago
Prompt: Use the passage as context and pick the correct answer from given choices to put into the placeholder in the query. Do not output anything besides the answer. passage: {passage} choices: {entities} query: {query}
1 iteration, 1570 tokens, $0.0002
text to text, OpenAI/GPT-4o mini
Test llama3.1 8B on super_glue
cb56be48-8656-48c1-be12-727cb2ae1b48
20 row sample completed
lilian
1 month ago
Prompt: Use the passage as context and pick the correct answer from given choices to put into the placeholder in the query. Do not output anthing besides the answer. passage: {passage} query: {query} choices: {entities}
3 iterations, 6614 tokens, $0.0013
text to text, Fireworks AI/Llama v3.1 8B Instruct