Evaluations
Run models against your data
Evaluations let you test and compare a selection of AI models against your own datasets.
Whether you're fine-tuning models or measuring performance, Oxen Evaluations lets you run a prompt through every row of a dataset.
Once you're happy with the results, output the resulting dataset to a new file, another branch, or directly as a new commit.
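The prompts listed below contain placeholders such as {prompt} that match dataset column names. A minimal sketch of that idea, assuming each placeholder is filled from the same-named column once per row; this page does not show how Oxen substitutes values, so the render_prompts helper and the sample rows are hypothetical:

```python
import csv
import io

# Template copied from one of the evaluations on this page.
TEMPLATE = (
    "Classify the text into entertainment, sports or finance. "
    "Limit it to one word.\n\n{prompt}"
)


def render_prompts(rows, template):
    """Fill the template once per row, using column names as placeholder names.

    Assumption: every placeholder in the template is a column in each row.
    A template with literal braces would need them doubled for str.format.
    """
    return [template.format(**row) for row in rows]


# Hypothetical two-row dataset with a single "prompt" column.
sample_csv = "prompt\nThe striker scored twice.\nShares fell 3% today.\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
rendered = render_prompts(rows, TEMPLATE)

for text in rendered:
    print(text)
    print("---")
```

Each rendered string is what a model would receive for one row; the model's answer would then be written to a new column alongside the original data.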
486adfc0-a842-446d-beac-df2cdd87024c
OpenAI · OpenAI/GPT 4o · text → text
Bessie
ox
5 months ago
Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. 

Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. 

Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else.

[User Question]
{prompt}

[The Start of Assistant A’s Answer]
{specific_thoughts_llama_405b_response}
[The End of Assistant A’s Answer]

[The Start of Assistant B’s Answer]
{specific_thoughts_gemini_pro_response}
[The End of Assistant B’s Answer]
completed · 1001 rows · 1027794 tokens · $2.59 · 1 iteration
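The judge prompt above asks for a bare "[[A]]" or "[[B]]" verdict, so the output column can be tallied with a small parser. This is a sketch under assumptions: the function names are hypothetical, the sample outputs are invented for illustration, and anything that does not contain a verdict is counted as "invalid" rather than guessed at.

```python
import re
from collections import Counter

# Matches the verdict format the judge prompt asks for: [[A]] or [[B]].
VERDICT_RE = re.compile(r"\[\[(A|B)\]\]")


def parse_verdict(text):
    """Return "A" or "B" for the first verdict found, or None if there is none."""
    match = VERDICT_RE.search(text)
    return match.group(1) if match else None


def tally(outputs):
    """Count verdicts across judge outputs, bucketing unparseable ones."""
    return Counter(parse_verdict(output) or "invalid" for output in outputs)


# Invented judge outputs, including one that ignores the format.
counts = tally(["[[A]]", " [[B]]\n", "Assistant A is better", "[[A]]"])
print(counts)
```

Keeping an explicit "invalid" bucket makes it easy to see how often the judge model ignored the output format before trusting the win rates.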
f6578af2-65da-4d41-a1ea-b55e6d19ce43
OpenAI · OpenAI/GPT 4o · text → text
Bessie
ox
5 months ago
Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. 

Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. 

Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else.

[User Question]
{prompt}

[The Start of Assistant A’s Answer]
{specific_thoughts_llama_405b_response}
[The End of Assistant A’s Answer]

[The Start of Assistant B’s Answer]
{generic_thoughts_llama_405B_response}
[The End of Assistant B’s Answer]
completed · 1001 rows · 977245 tokens · $2.47 · 2 iterations
ae7f23c0-0e82-4133-b779-6bd4f75c6782
Meta · Meta/Llama 3.1 405B Instruct · text → text
Bessie
ox
5 months ago
Respond to the following user query in a comprehensive and detailed way. You can write down your thought process before responding. Write your thoughts after "Here is my thought process:" and write your response after "Here is my response:".
User query: {prompt}
completed · 1000 rows · 746390 tokens · $2.24 · 1 iteration
8a13d411-e496-40f7-a681-335f77048252
Google · Google/Gemini 1.5 Pro · text → text
Bessie
ox
5 months ago
Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>".

User query: {prompt}
specific_thoughts_gemini_pro
completed · 1000 rows · 888626 tokens · $3.70 · 1 iteration
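The prompt above tells the model to write its final response after "<R>", which means the raw output mixes internal thoughts with the answer. The judge prompts on this page reference columns ending in _response, which suggests a split along these lines, though the page does not show how those columns were produced. A minimal sketch with a hypothetical split_final_response helper:

```python
RESPONSE_MARKER = "<R>"


def split_final_response(output):
    """Split a model output into (thoughts, final_response).

    Uses the last "<R>" marker so that a marker mentioned inside the
    draft does not cut the answer short. If the marker is missing, the
    whole output is treated as the response and the thoughts are empty.
    """
    thoughts, marker, response = output.rpartition(RESPONSE_MARKER)
    if not marker:
        return "", output.strip()
    return thoughts.strip(), response.strip()


# Invented model output for illustration.
raw = "Draft: Paris. Evaluation: correct but terse.\n<R>\nThe capital of France is Paris."
thoughts, response = split_final_response(raw)
print(response)
```

Treating a missing marker as "no thoughts" keeps every row usable, at the cost of occasionally passing stray reasoning to the judge; rows like that could also be flagged for review instead.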
e95d048d-c536-40bb-a41c-8ab9852d4b87
Meta · Meta/Llama 3.1 405B Instruct · text → text
Bessie
ox
5 months ago
Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>".

User query: {prompt}
specific_thoughts_llama_405b
completed · 1000 rows · 879230 tokens · $2.64 · 1 iteration
9ab24c3a-849f-4b45-a81b-ce1f6ebaa72f
Meta · Meta/Llama 3.1 8B Instruct Turbo · text → text
Bessie
ox
5 months ago
Classify the text into one of the following categories. Respond with only the category, one word, all lowercase. If it does not fall into a category, respond with "none".

sports
finance
tech
entertainment

{prompt}
completed · 5 row sample · 899 tokens · $0.0002 · 2 iterations
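Even with an instruction to answer in one lowercase word, model outputs can arrive with stray capitalization, punctuation, or an unlisted category. A small check against the categories named in the prompt above catches this. The normalize_label helper is a hypothetical sketch, not part of Oxen:

```python
# Categories listed in the classification prompt, plus its "none" fallback.
ALLOWED_LABELS = {"sports", "finance", "tech", "entertainment", "none"}


def normalize_label(output):
    """Return a cleaned label if it is one of the allowed categories, else None."""
    label = output.strip().strip(".\"'").lower()
    return label if label in ALLOWED_LABELS else None


# Invented model outputs for illustration.
for raw in [" Sports.\n", "tech", "politics", "\"none\""]:
    print(repr(raw), "->", normalize_label(raw))
```

Returning None for off-list answers, rather than mapping them to "none", keeps format failures separate from genuine "none" classifications when comparing models or prompt iterations.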
c3632963-40a9-45e5-8553-5d885f31d403
Meta · Meta/Llama 3.1 8B Instruct Turbo · text → text
Bessie
ox
5 months ago
Classify the text into one of the following categories. Respond with only the category, one word, all lowercase. If it does not fall into a category, respond with "none".

sports
finance
tech
entertainment

{prompt}
completed · 5 row sample · 918 tokens · $0.0002 · 2 iterations
eadf486d-82b5-4e15-9606-1dd25357ff23
OpenAI · OpenAI/GPT 4o mini · text → text
Bessie
ox
5 months ago
Classify the text into entertainment, sports or finance. Limit it to one word.

{prompt}
completed · 5 row sample · 475 tokens · $0.0001 · 2 iterations
a534f90f-33cc-4d81-bf5e-ffdd0696fe63
OpenAI · OpenAI/GPT 4o mini · text → text
Bessie
ox
6 months ago
Classify the text into one of 3 categories, you decide the categories

{prompt}
completed · 5 row sample · 807 tokens · $0.0003 · 1 iteration
60088e3f-0acc-45af-927c-9fb4d31c73bb
Meta · Meta/Llama 3.1 70B Instruct · text → text
Bessie
ox
6 months ago
Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. 

Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. 

Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else.

[User Question]
{prompt}

[The Start of Assistant A’s Answer]
{specific_thoughts_response}
[The End of Assistant A’s Answer]

[The Start of Assistant B’s Answer]
{generic_thoughts_response}
[The End of Assistant B’s Answer]
completed · 1000 rows · 991422 tokens · $0.8923 · 3 iterations
6373dbc5-d642-42b0-8650-d479684831c8
Meta · Meta/Llama 3.1 70B Instruct · text → text
Bessie
ox
6 months ago
Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. 

Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. 

Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else.

[User Question]
{prompt}

[The Start of Assistant A’s Answer]
{specific_thoughts_response}
[The End of Assistant A’s Answer]

[The Start of Assistant B’s Answer]
{generic_thoughts_response}
[The End of Assistant B’s Answer]
combined_thoughts
completed · 5 row sample · 2802 tokens · $0.0025 · 3 iterations
a72cad52-a06c-4413-95e1-6993a6450fad
Meta · Meta/Llama 3.1 70B Instruct · text → text
Bessie
ox
6 months ago
Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>".

User query: {prompt}
specific_thoughts_70B
completed · 1000 rows · 977771 tokens · $0.8800 · 2 iterations
a0d1316d-c238-49f1-9c98-fa19d0e285e1
Meta · Meta/Llama 3.1 8B Instruct · text → text
Bessie
ox
6 months ago
Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>".
User query: {prompt}
specific_thoughts
completed · 1000 rows · 916141 tokens · $0.1832 · 1 iteration
ff6d6793-8617-4530-a8c4-80261a88973e
Meta · Meta/Llama 3.1 8B Instruct · text → text
Bessie
ox
6 months ago
Respond to the following user query in a comprehensive and detailed way. You can write down your thought process before responding. Write your thoughts after "Here is my thought process:" and write your response after "Here is my response:".
User query: {prompt}
completed · 1000 rows · 773706 tokens · $0.1547 · 2 iterations
437e4c31-f7f0-4b4a-9ef2-02e56be79482
Meta · Meta/Llama 3.1 8B Instruct · text → text
Bessie
ox
6 months ago
Below is an instruction that I would like you to analyze:

<instruction>
{prompt}
</instruction>

Categorize the instruction above into one of the following categories: 
General Knowledge
Math and Calculations
Programming and Coding
Reasoning and Problem-Solving
Creative Writing
Content Writing
Art and Design
Language and Translation
Research and Analysis
Conversational Dialogue
Data Analysis and Visualization
Business and Finance
Education and Learning
Science and Technology
Health and Wellness
Personal Development
Entertainment and Humor
Travel and Leisure
Marketing and Sales
Game Development
Miscellaneous

Be sure to provide the exact category name without any additional text.
completed · 1000 rows · 312294 tokens · $0.0625 · 2 iterations
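Each run above reports a total token count and a total cost, which makes it possible to compare models on an effective price per million tokens. The sketch below uses two figures copied from this page; the resulting rate blends input and output tokens, so it is an effective rate for that run, not a published price. The function name is hypothetical.

```python
def cost_per_million_tokens(total_cost_usd, total_tokens):
    """Effective blended cost in USD per one million tokens for a run."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_cost_usd / total_tokens * 1_000_000


# (total cost in USD, total tokens) as reported on the run cards above.
runs = {
    "OpenAI/GPT 4o judge, 1001 rows": (2.59, 1_027_794),
    "Meta/Llama 3.1 8B Instruct categorizer, 1000 rows": (0.0625, 312_294),
}

rates = {
    name: cost_per_million_tokens(cost, tokens)
    for name, (cost, tokens) in runs.items()
}

for name, rate in rates.items():
    print(f"{name}: ${rate:.2f} per 1M tokens")
```

Running a 5 row sample first, as several of the cards above do, gives a cheap estimate of this rate before committing to a full dataset.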