Evaluations
Run models against your data
Evaluations let you test and compare AI models against your own datasets. Whether you're preparing data for fine-tuning or measuring model performance, Oxen Evaluations simplify the process: pick a model, write a prompt template, and run it across every row of a dataset.
Once you're happy with the results, write the resulting dataset to a new file, another branch, or directly as a new commit.
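Conceptually, an evaluation renders a prompt template for every row of your dataset and collects the model's answers into a new column. Here is a minimal hand-rolled sketch of that loop, assuming the `openai` and `pandas` packages; the file and column names are hypothetical, and the real feature runs this for you while tracking tokens and cost:

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
template = "Which language is the prompt written in: {prompt}"

df = pd.read_csv("prompts.csv")  # hypothetical dataset with a `prompt` column
answers = []
for prompt in df["prompt"]:
    # Render the template for this row and ask the model.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": template.format(prompt=prompt)}],
    )
    answers.append(response.choices[0].message.content)

df["language"] = answers  # the model's answer becomes a new column
df.to_csv("prompts_with_language.csv", index=False)
```

The output file can then be versioned like any other dataset, for example with `oxen add` and `oxen commit` from the CLI.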
Here are a few example evaluations as they appear on the evaluations page, each with its status, author, prompt, and run stats:

Demo · completed (5-row sample) · Bart Dubbeldam · 2 days ago
Prompt: Which language is the prompt written in: {prompt}
1 iteration · 1,921 tokens · $0.0003 · text → text · OpenAI / GPT 4o mini
Dissect Prompt · error at 30 / 23,110 rows · Bart Dubbeldam · 2 weeks ago
Prompt: the shared classification prompt (full text below)
3 iterations · 22,852 tokens · $0.0000 · text → text · Meta / Llama 3.2 3B Instruct Turbo
PromptAnalysis · error at 30 / 23,110 rows · Bart Dubbeldam · 2 weeks ago
Prompt: same classification prompt as Dissect Prompt
2 iterations · 21,625 tokens · $0.0000 · text → text · OpenAI / GPT 4o mini
Categories, Sentiment, and language · error at 30 / 23,110 rows · Bart Dubbeldam · 2 weeks ago
Prompt: same classification prompt as Dissect Prompt
43 iterations · 21,635 tokens · $0.0000 · text → text · OpenAI / GPT 4o mini

The three errored runs all use this classification prompt:

You are an expert in NLP and prompt analysis. Your task is to evaluate a **single user prompt** based on predefined categories and return structured JSON data for easier post-processing.

1. Topics
Select up to 3 topics that are most relevant to the prompt from the following list: ["Healthcare", "Finance", "Education", "Technology", "Science", "Politics", "Environment", "Ethics", "Entertainment", "History", "Philosophy", "Psychology", "Sports", "Legal", "Business", "Travel", "Food", "Art", "Literature", "Personal Development", "Programming"]. The first topic should be the most dominant in the prompt. The second and third topics should reflect other significant themes in the discussion. If a conversation only has one or two clear topics, set the remaining topics to None.

2. Language Style
"Formal", "Informal", or "Mixed".

3. Grammar & Slang in User Input
"Perfect" (no mistakes, professional style), "Minor Errors" (small grammar/spelling mistakes, but understandable), "Major Errors" (frequent grammar mistakes, difficult to read), or "Contains Slang" (uses informal slang expressions).

4. Type of Instruction Given to Assistant
Choose one category that best describes what the user is asking the assistant to do:
- Content Generation → user asks for creative content, including writing, design ideas, or brainstorming responses.
- Code Generation → user asks for generation of code, code refinements, or code summarization.
- Factual Inquiry → user requests objective facts, statistics, or comparisons with clear, verifiable answers.
- Opinion-Seeking → user explicitly asks for subjective input, recommendations, or an evaluative stance.
- Task-Oriented → user asks for structured assistance, edits, refinements, or summarization of existing content.
- Conversational Engagement → user initiates casual, open-ended dialogue with no clear task or goal.

Output Format
Return structured JSON output in this format:
{ "topic": ["Art", "Healthcare", None], "language_style": "Formal", "grammar_slang": "Perfect", "instruction_type": "Content Generation" }

Instructions
Analyze the prompt. Select the 3 most relevant topics, ordered by prominence. If there are empty slots, fill them with None. Use only the predefined options for consistency in post-processing. Do not add explanations; only return JSON.

Now, analyze the following prompt: {prompt}
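Note that the requested output format uses Python-style None inside JSON, which strict parsers reject, so the responses need a small normalization step before post-processing. A minimal sketch of such a validator (not part of Oxen; the names and the normalization strategy are illustrative):

```python
import json

# Allowed values, copied from the classification prompt above.
TOPICS = {
    "Healthcare", "Finance", "Education", "Technology", "Science", "Politics",
    "Environment", "Ethics", "Entertainment", "History", "Philosophy",
    "Psychology", "Sports", "Legal", "Business", "Travel", "Food", "Art",
    "Literature", "Personal Development", "Programming",
}
LANGUAGE_STYLES = {"Formal", "Informal", "Mixed"}
GRAMMAR_SLANG = {"Perfect", "Minor Errors", "Major Errors", "Contains Slang"}
INSTRUCTION_TYPES = {
    "Content Generation", "Code Generation", "Factual Inquiry",
    "Opinion-Seeking", "Task-Oriented", "Conversational Engagement",
}

def parse_response(raw: str) -> dict:
    """Parse one model response, checking it uses only the predefined options."""
    # Naive fix-up: the prompt shows Python-style None, which is not valid
    # JSON. "None" does not occur inside any allowed option string, so a
    # plain replace is safe enough for a sketch.
    data = json.loads(raw.replace("None", "null"))
    topics = data["topic"]
    if len(topics) != 3 or any(t is not None and t not in TOPICS for t in topics):
        raise ValueError(f"bad topics: {topics!r}")
    if data["language_style"] not in LANGUAGE_STYLES:
        raise ValueError(f"bad language_style: {data['language_style']!r}")
    if data["grammar_slang"] not in GRAMMAR_SLANG:
        raise ValueError(f"bad grammar_slang: {data['grammar_slang']!r}")
    if data["instruction_type"] not in INSTRUCTION_TYPES:
        raise ValueError(f"bad instruction_type: {data['instruction_type']!r}")
    return data

# Example response matching the format requested by the prompt:
print(parse_response(
    '{"topic": ["Art", "Healthcare", None], "language_style": "Formal", '
    '"grammar_slang": "Perfect", "instruction_type": "Content Generation"}'
))
```

Rows that fail validation can then be flagged for a re-run rather than silently corrupting the output dataset.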