Introducing Evaluations, a feature for testing and comparing a selection of AI models against your datasets.
Whether you're fine-tuning models or evaluating performance metrics, Oxen Evaluations simplifies the process by letting you run a prompt through every row of a dataset.
Once you're happy with the results, output the resulting dataset to a new file, another branch, or directly as a new commit.
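Running a prompt through a dataset amounts to filling the prompt's {column} placeholders from each row. The sketch below illustrates that idea only; it is not Oxen's implementation, and `render_prompt` and the sample rows are hypothetical names and data made up for the example.

```python
import re

# Hypothetical helper: fill {column} placeholders in a prompt template
# from one dataset row. Not part of any Oxen API.
def render_prompt(template: str, row: dict) -> str:
    def substitute(match: re.Match) -> str:
        column = match.group(1)
        if column not in row:
            raise KeyError(f"row has no column named {column!r}")
        return str(row[column])

    # A regex is used instead of str.format so that other braces or
    # brackets in the template (for example "[[A]]") are left alone.
    return re.sub(r"\{(\w+)\}", substitute, template)


template = (
    "Classify the text into entertainment, sports or finance. "
    "Limit it to one word.\n{prompt}"
)

# Made-up rows standing in for a dataset with a "prompt" column.
rows = [
    {"prompt": "The striker scored twice."},
    {"prompt": "Shares fell 3% today."},
]

rendered = [render_prompt(template, row) for row in rows]
```

Each entry in `rendered` is what would be sent to the model for the corresponding row; a missing column raises an error rather than silently sending an unfilled placeholder.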
GPT-4o As A Judge: Gemini Pro v Llama 405B
486adfc0-a842-446d-beac-df2cdd87024c 1001 rows 00:07:49 completed
Bessie
6 days ago
Prompt: Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision.
Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible.
Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else.
[User Question]
{prompt}
[The Start of Assistant A’s Answer]
{specific_thoughts_llama_405b_response}
[The End of Assistant A’s Answer]
[The Start of Assistant B’s Answer]
{specific_thoughts_gemini_pro_response}
[The End of Assistant B’s Answer]
text OpenAI/GPT-4o
Source:
combined_thoughts
Target:
judgements_llama_405b_v_gemini_pro
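Because the judge prompt asks for a bare "[[A]]" or "[[B]]" verdict, the target column can be tallied with a small parser. This is a minimal sketch under assumptions: `parse_verdict` and `tally` are hypothetical helpers, the sample strings are invented, and loading the judgements from the dataset is left out. In the template above, A is the Llama 405B response and B is the Gemini Pro response.

```python
import re
from collections import Counter
from typing import Iterable, Optional

VERDICT_PATTERN = re.compile(r"\[\[([AB])\]\]")


def parse_verdict(judgement: str) -> Optional[str]:
    """Return "A" or "B", or None when no verdict string is present."""
    matches = VERDICT_PATTERN.findall(judgement)
    # If the judge wrote an explanation first, the last match is the verdict.
    return matches[-1] if matches else None


def tally(judgements: Iterable[str]) -> Counter:
    """Count verdicts, keeping unparseable outputs in their own bucket."""
    counts: Counter = Counter()
    for text in judgements:
        verdict = parse_verdict(text)
        counts[verdict if verdict is not None else "unparsed"] += 1
    return counts


# Invented judge outputs, for illustration only.
sample = ["[[A]]", "[[B]]", " [[A]]\n", "Both answers are fine."]
counts = tally(sample)
```

Keeping an "unparsed" bucket makes it visible when the judge ignores the output format instead of quietly skewing the win rate.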
OpenAI GPT-4o Judgement of Llama 405B Responses
f6578af2-65da-4d41-a1ea-b55e6d19ce43 1001 rows 00:07:38 completed
Bessie
6 days ago
Prompt: Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision.
Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible.
Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else.
[User Question]
{prompt}
[The Start of Assistant A’s Answer]
{specific_thoughts_llama_405b_response}
[The End of Assistant A’s Answer]
[The Start of Assistant B’s Answer]
{generic_thoughts_llama_405B_response}
[The End of Assistant B’s Answer]
text OpenAI/GPT-4o
Source:
combined_thoughts
Target:
llama_405B_judgements
Generic Thoughts Llama 405B
ae7f23c0-0e82-4133-b779-6bd4f75c6782 1000 rows 02:08:59 completed
Bessie
6 days ago
Prompt: Respond to the following user query in a comprehensive and detailed way. You can write down your thought process before responding. Write your thoughts after "Here is my thought process:" and write your response after "Here is my response:".
User query: {prompt}
text Fireworks AI/Llama v3.1 405B Instruct
Source:
Target:
generic_thoughts_llama_405B
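The generic prompt asks the model to mark its reasoning and its answer with two fixed phrases, so the two parts can be separated afterwards. The following is a sketch, assuming the model followed the format; `split_generic` is a hypothetical helper, not something the page describes.

```python
from typing import Tuple

THOUGHT_MARKER = "Here is my thought process:"
RESPONSE_MARKER = "Here is my response:"


def split_generic(output: str) -> Tuple[str, str]:
    """Return (thoughts, response) from a generic-thoughts model output."""
    head, separator, response = output.partition(RESPONSE_MARKER)
    if not separator:
        # The model skipped the markers: treat the whole output as the response.
        return "", output.strip()
    thoughts = head.replace(THOUGHT_MARKER, "", 1).strip()
    return thoughts, response.strip()
```

For example, `split_generic("Here is my thought process: plan it. Here is my response: done")` yields the thoughts `"plan it."` and the response `"done"`.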
Specific Thoughts Gemini Pro
8a13d411-e496-40f7-a681-335f77048252 1000 rows 04:26:48 completed
Bessie
6 days ago
Prompt: Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>".
User query: {prompt}
text Google/Gemini 1.5 Pro
Source:
Target:
specific_thoughts_gemini_pro
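The specific prompt puts the final answer after a "<R>" marker, with the draft and its evaluation before it. One way to pull out the final response is sketched below; `split_specific` is a hypothetical name, and splitting on the last marker is an assumption made so that a "<R>" quoted inside the draft does not cut the answer short.

```python
from typing import Tuple


def split_specific(output: str, marker: str = "<R>") -> Tuple[str, str]:
    """Return (internal_thoughts, final_response) split on the last marker."""
    thoughts, separator, response = output.rpartition(marker)
    if not separator:
        # No marker found: keep the full output as the response.
        return "", output.strip()
    return thoughts.strip(), response.strip()
```

For example, `split_specific("Draft: hi. Evaluation: too short. <R> Hello there!")` yields the final response `"Hello there!"`.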
Specific Thoughts Llama 405B
e95d048d-c536-40bb-a41c-8ab9852d4b87 1000 rows 02:45:28 completed
Bessie
6 days ago
Prompt: Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>".
User query: {prompt}
text Fireworks AI/Llama v3.1 405B Instruct
Source:
categorizations
Target:
specific_thoughts_llama_405b
9ab24c3a-849f-4b45-a81b-ce1f6ebaa72f
9ab24c3a-849f-4b45-a81b-ce1f6ebaa72f 5 row sample 00:00:03 completed
Bessie
2 weeks ago
Prompt: Classify the text into one of the following categories. Respond with only the category, one word, all lowercase. If it does not fall into a category, respond with "none".
sports
finance
tech
entertainment
{prompt}
text Together.ai/Meta Llama 3.1 8B Instruct Turbo
Source:
c3632963-40a9-45e5-8553-5d885f31d403
c3632963-40a9-45e5-8553-5d885f31d403 5 row sample 00:00:04 completed
Bessie
2 weeks ago
Prompt: Classify the text into one of the following categories. Respond with only the category, one word, all lowercase. If it does not fall into a category, respond with "none".
sports
finance
tech
entertainment
{prompt}
text Together.ai/Meta Llama 3.1 8B Instruct Turbo
Source:
Categorize all the prompts
eadf486d-82b5-4e15-9606-1dd25357ff23 5 row sample 00:00:02 completed
Bessie
2 weeks ago
Prompt: Classify the text into entertainment, sports or finance. Limit it to one word.
{prompt}
text OpenAI/GPT-4o mini
Source:
a534f90f-33cc-4d81-bf5e-ffdd0696fe63
a534f90f-33cc-4d81-bf5e-ffdd0696fe63 5 row sample 00:00:08 completed
Bessie
3 weeks ago
Prompt: Classify the text into one of 3 categories, you decide the categories
{prompt}
text OpenAI/GPT-4o mini
Source:
Judge The Responses
60088e3f-0acc-45af-927c-9fb4d31c73bb 1000 rows 00:10:57 completed
Bessie
3 weeks ago
Prompt: Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision.
Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible.
Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else.
[User Question]
{prompt}
[The Start of Assistant A’s Answer]
{specific_thoughts_response}
[The End of Assistant A’s Answer]
[The Start of Assistant B’s Answer]
{generic_thoughts_response}
[The End of Assistant B’s Answer]
text Fireworks AI/Llama v3.1 70B Instruct
Source:
combined_thoughts
Target:
judgements
6373dbc5-d642-42b0-8650-d479684831c8
6373dbc5-d642-42b0-8650-d479684831c8 5 row sample 00:00:01 completed
Bessie
3 weeks ago
Prompt: Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision.
Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible.
Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else.
[User Question]
{prompt}
[The Start of Assistant A’s Answer]
{specific_thoughts_response}
[The End of Assistant A’s Answer]
[The Start of Assistant B’s Answer]
{generic_thoughts_response}
[The End of Assistant B’s Answer]
text Fireworks AI/Llama v3.1 70B Instruct
Source:
combined_thoughts
Specific Thought Prompt - Llama 3.1 70B
a72cad52-a06c-4413-95e1-6993a6450fad 1000 rows 01:46:03 completed
Bessie
3 weeks ago
Prompt: Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>".
User query: {prompt}
text Fireworks AI/Llama v3.1 70B Instruct
Source:
categorizations
Target:
specific_thoughts_70B
Specific Thought Prompt - Llama 3.1 8B
a0d1316d-c238-49f1-9c98-fa19d0e285e1 1000 rows 01:00:21 completed
Bessie
3 weeks ago
Prompt: Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>".
User query: {prompt}
text Fireworks AI/Llama v3.1 8B Instruct
Source:
categorizations
Target:
specific_thoughts
Generate generic thoughts w/ Llama 3.1 8B
ff6d6793-8617-4530-a8c4-80261a88973e 1000 rows 00:48:22 completed
Bessie
3 weeks ago
Prompt: Respond to the following user query in a comprehensive and detailed way. You can write down your thought process before responding. Write your thoughts after "Here is my thought process:" and write your response after "Here is my response:".
User query: {prompt}
text Fireworks AI/Llama v3.1 8B Instruct
Source:
categorizations
Target:
thoughts
Classify the instructions with Llama 3.1 8B
437e4c31-f7f0-4b4a-9ef2-02e56be79482 1000 rows 00:06:14 completed
Bessie
3 weeks ago
Prompt: Below is an instruction that I would like you to analyze:
<instruction>
{prompt}
</instruction>
Categorize the instruction above into one of the following categories:
General Knowledge
Math and Calculations
Programming and Coding
Reasoning and Problem-Solving
Creative Writing
Content Writing
Art and Design
Language and Translation
Research and Analysis
Conversational Dialogue
Data Analysis and Visualization
Business and Finance
Education and Learning
Science and Technology
Health and Wellness
Personal Development
Entertainment and Humor
Travel and Leisure
Marketing and Sales
Game Development
Miscellaneous
Be sure to provide the exact category name without any additional text.
text Fireworks AI/Llama v3.1 8B Instruct
Source:
Target:
categorizations
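Since the classification prompt demands the exact category name, it can help to check the outputs against the allowed list before using them downstream. This is a sketch only: `match_category` is a hypothetical helper, and tolerating case differences, surrounding whitespace, and a trailing period is an assumption rather than behavior described on this page.

```python
from typing import Optional

# The 21 categories listed in the prompt above.
CATEGORIES = [
    "General Knowledge", "Math and Calculations", "Programming and Coding",
    "Reasoning and Problem-Solving", "Creative Writing", "Content Writing",
    "Art and Design", "Language and Translation", "Research and Analysis",
    "Conversational Dialogue", "Data Analysis and Visualization",
    "Business and Finance", "Education and Learning", "Science and Technology",
    "Health and Wellness", "Personal Development", "Entertainment and Humor",
    "Travel and Leisure", "Marketing and Sales", "Game Development",
    "Miscellaneous",
]

# Lowercased lookup so "math and calculations" maps to the canonical name.
_CANONICAL = {name.lower(): name for name in CATEGORIES}


def match_category(raw: str) -> Optional[str]:
    """Return the canonical category name, or None if the output is off-list."""
    cleaned = raw.strip().rstrip(".").strip()
    return _CANONICAL.get(cleaned.lower())
```

Rows where `match_category` returns None are the ones where the model added extra text or invented a category, and are worth inspecting by hand.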