Evaluations
Run models against your data
Evaluations let you test and compare a selection of AI models against your own datasets.
Whether you're fine-tuning models or measuring performance, Oxen Evaluations runs a prompt through every row of a dataset, filling placeholders such as {prompt} from that row's columns.
Once you're happy with the results, output the resulting dataset to a new file, another branch, or directly as a new commit.
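The row-by-row flow described above can be sketched in a few lines. This is an illustration only, not Oxen's implementation: `render_prompt`, `run_evaluation`, and the `call_model` callback are hypothetical names, and the sketch assumes each row is a dictionary keyed by column name.

```python
import string

def render_prompt(template: str, row: dict) -> str:
    """Fill {column} placeholders in a prompt template from one dataset row."""
    # Collect placeholder names first so a missing column fails loudly.
    fields = [name for _, name, _, _ in string.Formatter().parse(template) if name]
    missing = [name for name in fields if name not in row]
    if missing:
        raise KeyError(f"row is missing columns: {missing}")
    return template.format(**{name: row[name] for name in fields})

def run_evaluation(template, rows, call_model, target_column):
    """Run the template over every row and store each reply in target_column.

    call_model is a hypothetical callback (prompt -> reply); swap in a real client.
    """
    results = []
    for row in rows:
        prompt = render_prompt(template, row)
        results.append({**row, target_column: call_model(prompt)})
    return results
```

Each result keeps the original columns and adds the target column, which mirrors how a completed evaluation yields a new dataset that can be written to a file, branch, or commit.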
GPT-4o As A Judge: Gemini Pro v Llama 405B
486adfc0-a842-446d-beac-df2cdd87024c
1001 rows completed
Bessie
1 month ago
Prompt: Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else. [User Question] {prompt} [The Start of Assistant A’s Answer] {specific_thoughts_llama_405b_response} [The End of Assistant A’s Answer] [The Start of Assistant B’s Answer] {specific_thoughts_gemini_pro_response} [The End of Assistant B’s Answer]
1 iteration | 1,027,794 tokens | $2.59
text → text | OpenAI/GPT-4o
Target: judgements_llama_405b_v_gemini_pro
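The judge prompt above asks for a verdict of "[[A]]" or "[[B]]". A minimal sketch of turning a column of such judgements into a tally follows. The function names are hypothetical, and taking the last verdict in the text is an assumption made here, since the prompt also asks for a short explanation and a model may emit one before the verdict.

```python
import re
from collections import Counter

# Matches the verdict strings requested by the judge prompt: [[A]] or [[B]].
VERDICT_RE = re.compile(r"\[\[([AB])\]\]")

def parse_verdict(text):
    """Return 'A' or 'B', or None when no verdict string is present."""
    matches = VERDICT_RE.findall(text)
    # Assumption: if several verdicts appear, the last one is the final answer.
    return matches[-1] if matches else None

def tally(judgements):
    """Count verdicts across judgement strings; unparseable ones count as 'invalid'."""
    return Counter(parse_verdict(j) or "invalid" for j in judgements)
```

Counting invalid rows separately makes it visible when the judge ignored the output format, rather than silently skewing the comparison.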
OpenAI GPT-4o Judgement of Llama 405B Responses
f6578af2-65da-4d41-a1ea-b55e6d19ce43
1001 rows completed
Bessie
1 month ago
Prompt: Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else. [User Question] {prompt} [The Start of Assistant A’s Answer] {specific_thoughts_llama_405b_response} [The End of Assistant A’s Answer] [The Start of Assistant B’s Answer] {generic_thoughts_llama_405B_response} [The End of Assistant B’s Answer]
2 iterations | 977,245 tokens | $2.47
text → text | OpenAI/GPT-4o
Source:
Target:
Generic Thoughts Llama 405B
ae7f23c0-0e82-4133-b779-6bd4f75c6782
1000 rows completed
Bessie
1 month ago
Prompt: Respond to the following user query in a comprehensive and detailed way. You can write down your thought process before responding. Write your thoughts after "Here is my thought process:" and write your response after "Here is my response:". User query: {prompt}
1 iteration | 746,390 tokens | $2.24
text → text | Fireworks AI/Llama v3.1 405B Instruct
Target: generic_thoughts_llama_405B
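The generic-thoughts prompt above asks the model to label its reasoning and its answer with fixed phrases. A hedged sketch of separating the two parts of a stored output is below. The function name is hypothetical, and treating an output without the response marker as all response is an assumption.

```python
THOUGHT_MARKER = "Here is my thought process:"
RESPONSE_MARKER = "Here is my response:"

def split_generic_thoughts(output):
    """Split a model output into (thoughts, response) using the prompt's markers."""
    before, sep, after = output.partition(RESPONSE_MARKER)
    if not sep:
        # Assumption: with no response marker, treat the whole output as the response.
        return "", output.strip()
    thoughts = before.replace(THOUGHT_MARKER, "", 1).strip()
    return thoughts, after.strip()
```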
Specific Thoughts Gemini Pro
8a13d411-e496-40f7-a681-335f77048252
1000 rows completed
Bessie
1 month ago
Prompt: Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>". User query: {prompt}
1 iteration | 888,626 tokens | $3.70
text → text | Google/Gemini 1.5 Pro
Target: specific_thoughts_gemini_pro
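The specific-thoughts prompt above puts the final answer after a "<R>" marker. One way to pull that answer out of a stored output is sketched here. The function name is hypothetical, and splitting on the last occurrence is an assumption, made because a draft inside the thoughts may itself contain the marker.

```python
def split_on_final_marker(output, marker="<R>"):
    """Return (thoughts, final_response), splitting on the last occurrence of marker."""
    thoughts, sep, response = output.rpartition(marker)
    if not sep:
        # Assumption: with no marker, treat the whole output as the response.
        return "", output.strip()
    return thoughts.strip(), response.strip()
```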
Specific Thoughts Llama 405B
e95d048d-c536-40bb-a41c-8ab9852d4b87
1000 rows completed
Bessie
1 month ago
Prompt: Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>". User query: {prompt}
1 iteration | 879,230 tokens | $2.64
text → text | Fireworks AI/Llama v3.1 405B Instruct
Source:
Target: specific_thoughts_llama_405b
9ab24c3a-849f-4b45-a81b-ce1f6ebaa72f
5 row sample completed
Bessie
2 months ago
Prompt: Classify the text into one of the following categories. Repond with only the category, one word, all lowercase. If it does not fall into a category, respond with "none". sports finance tech entertainment {prompt}
2 iterations | 899 tokens | $0.0002
text → text | Together.ai/Meta Llama 3.1 8B Instruct Turbo
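The classification prompt above asks for one lowercase word from a fixed list, or "none". Model replies do not always comply exactly, so a small normalization step can help before comparing labels. This is a sketch under assumptions: the function name is hypothetical, and mapping anything outside the list to "none" is a design choice made here, not something the page specifies.

```python
# Categories listed in the prompt, plus the "none" fallback it defines.
ALLOWED = {"sports", "finance", "tech", "entertainment", "none"}

def normalize_label(raw):
    """Lowercase a model reply, trim whitespace and stray punctuation, and validate it."""
    label = raw.strip().strip('."\'').lower()
    # Assumption: replies outside the allowed set are folded into "none".
    return label if label in ALLOWED else "none"
```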
c3632963-40a9-45e5-8553-5d885f31d403
5 row sample completed
Bessie
2 months ago
Prompt: Classify the text into one of the following categories. Repond with only the category, one word, all lowercase. If it does not fall into a category, resond with "none". sports finance tech entertainment {prompt}
2 iterations | 918 tokens | $0.0002
text → text | Together.ai/Meta Llama 3.1 8B Instruct Turbo
Categorize all the prompts
eadf486d-82b5-4e15-9606-1dd25357ff23
5 row sample completed
Bessie
2 months ago
Prompt: Classify the text into entertainment, sports or finance. Limit it to one word. {prompt}
2 iterations | 475 tokens | $0.0001
text → text | OpenAI/GPT-4o mini
a534f90f-33cc-4d81-bf5e-ffdd0696fe63
5 row sample completed
Bessie
2 months ago
Prompt: Classify the text into one of 3 categories, you decide the categories {prompt}
1 iteration | 807 tokens | $0.0003
text → text | OpenAI/GPT-4o mini
Judge The Responses
60088e3f-0acc-45af-927c-9fb4d31c73bb
1000 rows completed
Bessie
2 months ago
Prompt: Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else. [User Question] {prompt} [The Start of Assistant A’s Answer] {specific_thoughts_response} [The End of Assistant A’s Answer] [The Start of Assistant B’s Answer] {generic_thoughts_response} [The End of Assistant B’s Answer]
3 iterations | 991,422 tokens | $0.8923
text → text | Fireworks AI/Llama v3.1 70B Instruct
Source: combined_thoughts
Target:
6373dbc5-d642-42b0-8650-d479684831c8
5 row sample completed
Bessie
2 months ago
Prompt: Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else. [User Question] {prompt} [The Start of Assistant A’s Answer] {specific_thoughts_response} [The End of Assistant A’s Answer] [The Start of Assistant B’s Answer] {generic_thoughts_response} [The End of Assistant B’s Answer]
3 iterations | 2,802 tokens | $0.0025
text → text | Fireworks AI/Llama v3.1 70B Instruct
Source: combined_thoughts
Specific Thought Prompt - Llama 3.1 70B
a72cad52-a06c-4413-95e1-6993a6450fad
1000 rows completed
Bessie
2 months ago
Prompt: Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>". User query: {prompt}
2 iterations | 977,771 tokens | $0.8800
text → text | Fireworks AI/Llama v3.1 70B Instruct
Source:
Target: specific_thoughts_70B
Specific Thought Prompt - Llama 3.1 8B
a0d1316d-c238-49f1-9c98-fa19d0e285e1
1000 rows completed
Bessie
2 months ago
Prompt: Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>". User query: {prompt}
1 iteration | 916,141 tokens | $0.1832
text → text | Fireworks AI/Llama v3.1 8B Instruct
Source:
Target: specific_thoughts
Generate generic thoughts w/ Llama 3.1 8B
ff6d6793-8617-4530-a8c4-80261a88973e
1000 rows completed
Bessie
2 months ago
Prompt: Respond to the following user query in a comprehensive and detailed way. You can write down your thought process before responding. Write your thoughts after "Here is my thought process:" and write your response after "Here is my response:". User query: {prompt}
2 iterations | 773,706 tokens | $0.1547
text → text | Fireworks AI/Llama v3.1 8B Instruct
Source:
Classify the instructions with Llama 3.1 8B
437e4c31-f7f0-4b4a-9ef2-02e56be79482
1000 rows completed
Bessie
2 months ago
Prompt: Below is an instruction that I would like you to analyze: <instruction> {prompt} </instruction> Categorize the instruction above into one of the following categories: General Knowledge Math and Calculations Programming and Coding Reasoning and Problem-Solving Creative Writing Content Writing Art and Design Language and Translation Research and Analysis Conversational Dialogue Data Analysis and Visualization Business and Finance Education and Learning Science and Technology Health and Wellness Personal Development Entertainment and Humor Travel and Leisure Marketing and Sales Game Development Miscellaneous Be sure to provide the exact category name without any additional text.
2 iterations | 312,294 tokens | $0.0625
text → text | Fireworks AI/Llama v3.1 8B Instruct
Target:
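Each run above reports a token count and a total cost, for example 1027794 tokens for $2.59 on the GPT-4o judge run. To compare runs on equal footing, the totals can be converted to an effective price per million tokens. The helper below is a hypothetical sketch. It blends input and output tokens into a single rate, which is an assumption, since the listing does not break the total down.

```python
def cost_per_million(tokens: int, cost_usd: float) -> float:
    """Effective blended price in USD per one million tokens for a run."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return cost_usd / tokens * 1_000_000

# Figures taken from the run summaries listed above.
runs = {
    "GPT-4o judge": (1027794, 2.59),
    "Llama v3.1 8B specific thoughts": (916141, 0.1832),
}
for name, (tokens, cost) in runs.items():
    print(f"{name}: ${cost_per_million(tokens, cost):.2f} per 1M tokens")
```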