Evaluations
Run models against your data
Introducing Evaluations: a feature for testing and comparing AI models against your own datasets.
Whether you're fine-tuning models or measuring performance, Oxen Evaluations make it quick to run a prompt through an entire dataset.
Once you're happy with the results, write the output to a new file, another branch, or directly as a new commit.
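Conceptually, an evaluation renders your prompt template once per row and collects each model response as a new column alongside the original data. Here is a minimal sketch of that loop in plain Python; the `run_model` function and file names are illustrative placeholders, not the Oxen API:

```python
import pandas as pd

def run_model(prompt: str) -> str:
    # Stand-in for a real model call (OpenAI, Together, etc.);
    # swap in your provider's client here.
    return "English"

# Load the dataset to evaluate against; assumes a "prompt" column.
df = pd.read_csv("prompts.csv")

template = "Which language is the prompt written in:\n{prompt}"

# Render the template once per row and collect the model's answers
# as a new column alongside the original data.
df["response"] = [run_model(template.format(prompt=p)) for p in df["prompt"]]

# Write the results to a new file, ready to commit.
df.to_csv("prompts_with_language.csv", index=False)
```

In the Evaluations UI, Oxen handles this loop for you and can write the results back as a new file, a branch, or a commit.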
Demo
A simple run over a 5-row sample detects the language of each prompt:

Prompt:
Which language is the prompt written in:
{prompt}

Type: text → text
Model: OpenAI/GPT 4o mini
Source: Dissect Prompt
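For every row, the `{prompt}` placeholder is filled in with that row's value from the source column (here, Dissect Prompt) before the text is sent to the model. For instance, with a made-up row value:

```python
template = "Which language is the prompt written in:\n{prompt}"
row_value = "Bonjour, comment allez-vous ?"  # hypothetical value from the Dissect Prompt column
print(template.format(prompt=row_value))
# Which language is the prompt written in:
# Bonjour, comment allez-vous ?
```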
A more involved example classifies each prompt into structured categories and returns JSON for downstream processing. The same prompt was run against multiple models (Meta/Llama 3.2 3B Instruct Turbo and OpenAI/GPT 4o mini) to compare their outputs:

Prompt:
You are an expert in NLP and prompt analysis. Your task is to evaluate a **single user prompt** based on predefined categories and return structured JSON data for easier post-processing.
---
1. Topic Selection
Select up to 3 topics that are most relevant to the prompt from the following list:
["Healthcare", "Finance", "Education", "Technology", "Science", "Politics", "Environment", "Ethics", "Entertainment", "History", "Philosophy", "Psychology", "Sports", "Legal", "Business", "Travel", "Food", "Art", "Literature", "Personal Development", "Programming"]
The first topic should be the most dominant in the prompt.
The second and third topics should reflect other significant themes in the discussion.
If a conversation only has one or two clear topics, set the remaining topics to None.
2. Language Style
"Formal"
"Informal"
"Mixed"
3. Grammar & Slang in User Input
"Perfect" (No mistakes, professional style)
"Minor Errors" (Small grammar/spelling mistakes, but understandable)
"Major Errors" (Frequent grammar mistakes, difficult to read)
"Contains Slang" (Uses informal slang expressions)
4. Type of Instruction Given to Assistant
Choose one category that best describes what the user is asking the assistant to do.
Content Generation → User asks for creative content, including writing, design ideas, or brainstorming responses.
Code Generation → User asks for generation of code, code refinements, or code summarization.
Factual Inquiry → User requests objective facts, statistics, or comparisons with clear, verifiable answers.
Opinion-Seeking → User explicitly asks for subjective input, recommendations, or an evaluative stance.
Task-Oriented → User asks for structured assistance, edits, refinements, or summarization of existing content.
Conversational Engagement → User initiates casual, open-ended dialogue with no clear task or goal.
Output Format
Return structured JSON output in this format:
{
  "topic": ["Art", "Healthcare", None],
  "language_style": "Formal",
  "grammar_slang": "Perfect",
  "instruction_type": "Content Generation"
}
Instructions
1. Analyze the prompt.
2. Select the 3 most relevant topics, ordered by prominence in the conversation. If there are empty slots, fill them with None.
3. Ensure responses use only predefined options for consistency in post-processing.
4. Do not add explanations—only return JSON.
Now, analyze the following prompt:
{prompt}
Type: text → text
Model: Meta/Llama 3.2 3B Instruct Turbo
Source: prompt_analysis
Target: PromptAnalysis
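Because the prompt forces the model into a fixed JSON schema, the results column can be checked mechanically before you commit it. A minimal post-processing sketch, with option sets mirroring the prompt above (the parsing details are an assumption, not part of the Oxen feature):

```python
import json

# Allowed values, mirroring the option lists in the prompt above.
TOPICS = {"Healthcare", "Finance", "Education", "Technology", "Science",
          "Politics", "Environment", "Ethics", "Entertainment", "History",
          "Philosophy", "Psychology", "Sports", "Legal", "Business", "Travel",
          "Food", "Art", "Literature", "Personal Development", "Programming",
          None}  # None for empty topic slots
STYLES = {"Formal", "Informal", "Mixed"}
GRAMMAR = {"Perfect", "Minor Errors", "Major Errors", "Contains Slang"}
INSTRUCTION_TYPES = {"Content Generation", "Code Generation", "Factual Inquiry",
                     "Opinion-Seeking", "Task-Oriented",
                     "Conversational Engagement"}

def is_valid(raw: str) -> bool:
    # Reject rows where the model strayed from the schema. Note that valid
    # JSON uses null (parsed to Python None), not the literal None shown in
    # the prompt's example output.
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return False
    topics = out.get("topic")
    return (
        isinstance(topics, list)
        and len(topics) == 3
        and all((t is None or isinstance(t, str)) and t in TOPICS
                for t in topics)
        and out.get("language_style") in STYLES
        and out.get("grammar_slang") in GRAMMAR
        and out.get("instruction_type") in INSTRUCTION_TYPES
    )
```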