Evaluations
Run models against your data
Introducing Evaluations, a feature for testing and comparing AI models against your own datasets.
Whether you're fine-tuning models or measuring performance, Oxen Evaluations simplify the process, letting you run a prompt over every row of a dataset.
Once you're happy with the results, write the resulting dataset to a new file, another branch, or directly as a new commit.
For example, a simple evaluation prompt classifies the language of each row:

Which language is the prompt written in: {prompt}

The {prompt} placeholder is filled in with that row's value from the prompt column before the request is sent to the model.
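Conceptually, an evaluation run templates the prompt once per row and collects the model's answers as a new column. Here is a minimal sketch of that loop, where `call_model` is a hypothetical stand-in for whichever model you select:

```python
# Sketch of an evaluation run: fill the prompt template with each row's
# column value, send it to the model, and collect the answers as a new
# column. `call_model` is a hypothetical stand-in, not Oxen's actual API.
import pandas as pd

def call_model(prompt: str) -> str:
    # Hypothetical: send `prompt` to the selected model and return its reply.
    return "English"  # canned answer so the sketch runs end to end

template = "Which language is the prompt written in: {prompt}"

df = pd.DataFrame({"prompt": ["Bonjour tout le monde", "Hello world"]})
df["language"] = [call_model(template.format(prompt=p)) for p in df["prompt"]]

df.to_parquet("results.parquet")  # write the results to a new file for review
print(df)
```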
A more involved evaluation, named prompt_analysis, asks the model to return structured JSON for every row:

You are an expert in NLP and prompt analysis. Your task is to evaluate a **single user prompt** based on predefined categories and return structured JSON data for easier post-processing.

1. Topics
Select up to 3 topics that are most relevant to the prompt from the following list:
["Healthcare", "Finance", "Education", "Technology", "Science", "Politics", "Environment", "Ethics", "Entertainment", "History", "Philosophy", "Psychology", "Sports", "Legal", "Business", "Travel", "Food", "Art", "Literature", "Personal Development", "Programming"]
The first topic should be the most dominant in the prompt. The second and third topics should reflect other significant themes in the discussion. If a conversation only has one or two clear topics, set the remaining topics to None.

2. Language Style
"Formal", "Informal", "Mixed"

3. Grammar & Slang in User Input
"Perfect" (no mistakes, professional style)
"Minor Errors" (small grammar/spelling mistakes, but understandable)
"Major Errors" (frequent grammar mistakes, difficult to read)
"Contains Slang" (uses informal slang expressions)

4. Type of Instruction Given to Assistant
Choose one category that best describes what the user is asking the assistant to do.
Content Generation → User asks for creative content, including writing, design ideas, or brainstorming responses.
Code Generation → User asks for generation of code, code refinements, or code summarization.
Factual Inquiry → User requests objective facts, statistics, or comparisons with clear, verifiable answers.
Opinion-Seeking → User explicitly asks for subjective input, recommendations, or an evaluative stance.
Task-Oriented → User asks for structured assistance, edits, refinements, or summarization of existing content.
Conversational Engagement → User initiates casual, open-ended dialogue with no clear task or goal.

Output Format
Return structured JSON output in this format:
{
  "topic": ["Art", "Healthcare", None],
  "language_style": "Formal",
  "grammar_slang": "Perfect",
  "instruction_type": "Content Generation"
}

Instructions
Analyze the prompt.
Select the 3 most relevant topics, ordered by prominence in the conversation.
If there are empty slots, fill them with None.
Use only the predefined options for consistency in post-processing.
Do not add explanations—only return JSON.

Now, analyze the following prompt: {prompt}
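Because the model is instructed to emit JSON only, post-processing the output column is straightforward. A minimal validation sketch, assuming each response arrives as a raw string; note that the prompt's example output uses Python's `None`, which is not valid JSON, so the sketch bluntly normalizes it to `null` before parsing:

```python
# Sketch of validating the model's structured output against the
# predefined options. Assumes each response is a raw JSON string.
import json

LANGUAGE_STYLES = {"Formal", "Informal", "Mixed"}
GRAMMAR_SLANG = {"Perfect", "Minor Errors", "Major Errors", "Contains Slang"}
INSTRUCTION_TYPES = {
    "Content Generation", "Code Generation", "Factual Inquiry",
    "Opinion-Seeking", "Task-Oriented", "Conversational Engagement",
}

def parse_analysis(response: str) -> dict:
    # Blunt normalization for a sketch: the prompt asks for `None`,
    # which must become `null` to be parseable as JSON.
    data = json.loads(response.replace("None", "null"))
    assert len(data["topic"]) == 3  # three slots, padded with null
    assert data["language_style"] in LANGUAGE_STYLES
    assert data["grammar_slang"] in GRAMMAR_SLANG
    assert data["instruction_type"] in INSTRUCTION_TYPES
    return data

print(parse_analysis(
    '{"topic": ["Art", None, None], "language_style": "Formal", '
    '"grammar_slang": "Perfect", "instruction_type": "Content Generation"}'
))
```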