Evaluations
Run models against your data
Evaluations let you test and compare a selection of AI models against your datasets.
Whether you're fine-tuning models or measuring performance, Oxen Evaluations lets you quickly run a prompt through an entire dataset.
Once you're happy with the results, write the resulting dataset to a new file, another branch, or directly as a new commit.
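To make the idea concrete, here is a minimal sketch of what running a prompt through a dataset amounts to. It is not the Oxen API: `fill_template`, `run_evaluation`, and the `run_model` callable are hypothetical names for illustration, and rows are assumed to be plain dictionaries keyed by column name. Placeholders such as `{prompt}` are filled from the matching column of each row.

```python
import re
from typing import Callable

PLACEHOLDER = re.compile(r"\{(\w+)\}")

def fill_template(template: str, row: dict) -> str:
    """Replace each {column} placeholder with that column's value in the row.

    Placeholders with no matching column are left untouched, and values are
    substituted in a single pass so braces inside the data are not re-expanded.
    """
    return PLACEHOLDER.sub(
        lambda m: str(row[m.group(1)]) if m.group(1) in row else m.group(0),
        template,
    )

def run_evaluation(rows: list, template: str,
                   run_model: Callable[[str], str], target_column: str) -> list:
    """Return copies of the rows with the model output stored in target_column."""
    results = []
    for row in rows:
        prompt = fill_template(template, row)
        results.append({**row, target_column: run_model(prompt)})
    return results

if __name__ == "__main__":
    # run_model is a stand-in here (it just upper-cases the prompt).
    rows = [{"prompt": "What is a commit?"}]
    print(run_evaluation(rows, "User query: {prompt}", str.upper, "response"))
```

A regular expression is used instead of `str.format` so that literal brackets or stray braces in a prompt or in the data do not raise errors.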
486adfc0-a842-446d-beac-df2cdd87024c

ox
5 months ago
Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else. [User Question] {prompt} [The Start of Assistant A’s Answer] {specific_thoughts_llama_405b_response} [The End of Assistant A’s Answer] [The Start of Assistant B’s Answer] {specific_thoughts_gemini_pro_response} [The End of Assistant B’s Answer]
combined_thoughts
judgements_llama_405b_v_gemini_pro
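The judge prompt above asks for a verdict of the form `[[A]]` or `[[B]]`. A minimal sketch of how such a judgement column could be tallied follows. The helper names `parse_verdict` and `tally` are hypothetical, and taking the last match is an assumption made here because the prompt also asks for a short explanation, so a model may emit text before the verdict.

```python
import re
from collections import Counter
from typing import Optional

VERDICT = re.compile(r"\[\[(A|B)\]\]")

def parse_verdict(text: str) -> Optional[str]:
    """Return "A" or "B" from the last [[A]]/[[B]] marker, or None if absent."""
    matches = VERDICT.findall(text)
    return matches[-1] if matches else None

def tally(judgements: list) -> Counter:
    """Count verdicts, grouping unparseable outputs under "invalid"."""
    return Counter(parse_verdict(j) or "invalid" for j in judgements)

if __name__ == "__main__":
    sample = ["[[A]]", "B is more detailed. [[B]]", "I cannot decide."]
    print(tally(sample))
```

Keeping an explicit "invalid" bucket makes it visible when a judge model ignores the output format, rather than silently dropping those rows.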
f6578af2-65da-4d41-a1ea-b55e6d19ce43

ox
5 months ago
Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else. [User Question] {prompt} [The Start of Assistant A’s Answer] {specific_thoughts_llama_405b_response} [The End of Assistant A’s Answer] [The Start of Assistant B’s Answer] {generic_thoughts_llama_405B_response} [The End of Assistant B’s Answer]
combined_thoughts
llama_405B_judgements
ae7f23c0-0e82-4133-b779-6bd4f75c6782

ox
5 months ago
Respond to the following user query in a comprehensive and detailed way. You can write down your thought process before responding. Write your thoughts after "Here is my thought process:" and write your response after "Here is my response:". User query: {prompt}
generic_thoughts_llama_405B
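The thought-process prompts in this listing ask the model to put its reasoning before a marker and its answer after it ("Here is my response:" in one variant, "<R>" in the other). A small sketch of separating the two parts of a generated response is shown below. `split_thoughts` is a hypothetical helper, not part of Oxen, and it assumes the markers appear verbatim in the model output.

```python
def split_thoughts(text: str,
                   thought_marker: str = "Here is my thought process:",
                   response_marker: str = "Here is my response:") -> tuple:
    """Split a model output into (thoughts, response) at the response marker.

    If the response marker is missing, the whole text is treated as the
    response and the thoughts are returned as an empty string.
    """
    before, found, after = text.partition(response_marker)
    if not found:
        return "", text.strip()
    if thought_marker:
        # Drop the leading thoughts label so only the reasoning remains.
        before = before.replace(thought_marker, "", 1)
    return before.strip(), after.strip()

if __name__ == "__main__":
    output = "Here is my thought process: compare both.\nHere is my response: Use a branch."
    print(split_thoughts(output))
    print(split_thoughts("draft and critique <R> final answer",
                         thought_marker="", response_marker="<R>"))
```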
8a13d411-e496-40f7-a681-335f77048252

ox
5 months ago
Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>". User query: {prompt}
specific_thoughts_gemini_pro
e95d048d-c536-40bb-a41c-8ab9852d4b87

ox
5 months ago
Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>". User query: {prompt}
categorizations
specific_thoughts_llama_405b
9ab24c3a-849f-4b45-a81b-ce1f6ebaa72f

ox
5 months ago
Classify the text into one of the following categories. Respond with only the category, one word, all lowercase. If it does not fall into a category, respond with "none". sports finance tech entertainment {prompt}
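Classification prompts like this one constrain the output to a fixed label set, so it can help to check model outputs against that set before comparing models. The sketch below is illustrative only: `normalize_category` is a hypothetical helper, and mapping any unexpected output to "none" is an assumption made here for simplicity.

```python
ALLOWED = {"sports", "finance", "tech", "entertainment", "none"}

def normalize_category(raw: str) -> str:
    """Lowercase and trim a model output, falling back to "none" if it is
    not one of the labels listed in the prompt."""
    cleaned = raw.strip().strip(".\"'").lower()
    return cleaned if cleaned in ALLOWED else "none"

if __name__ == "__main__":
    for output in [" Sports.\n", "tech", "The category is finance"]:
        print(repr(output), "->", normalize_category(output))
```

If you need to distinguish a genuine "none" from a malformed answer, return a separate sentinel value instead of reusing "none".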
c3632963-40a9-45e5-8553-5d885f31d403

ox
5 months ago
Classify the text into one of the following categories. Respond with only the category, one word, all lowercase. If it does not fall into a category, respond with "none". sports finance tech entertainment {prompt}
eadf486d-82b5-4e15-9606-1dd25357ff23

ox
5 months ago
Classify the text into entertainment, sports or finance. Limit it to one word. {prompt}
a534f90f-33cc-4d81-bf5e-ffdd0696fe63

ox
6 months ago
Classify the text into one of 3 categories, you decide the categories {prompt}
60088e3f-0acc-45af-927c-9fb4d31c73bb

ox
6 months ago
Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else. [User Question] {prompt} [The Start of Assistant A’s Answer] {specific_thoughts_response} [The End of Assistant A’s Answer] [The Start of Assistant B’s Answer] {generic_thoughts_response} [The End of Assistant B’s Answer]
combined_thoughts
judgements
6373dbc5-d642-42b0-8650-d479684831c8

ox
6 months ago
Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else. [User Question] {prompt} [The Start of Assistant A’s Answer] {specific_thoughts_response} [The End of Assistant A’s Answer] [The Start of Assistant B’s Answer] {generic_thoughts_response} [The End of Assistant B’s Answer]
combined_thoughts
a72cad52-a06c-4413-95e1-6993a6450fad

ox
6 months ago
Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>". User query: {prompt}
categorizations
specific_thoughts_70B
a0d1316d-c238-49f1-9c98-fa19d0e285e1

ox
6 months ago
Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>". User query: {prompt}
categorizations
specific_thoughts
ff6d6793-8617-4530-a8c4-80261a88973e

ox
6 months ago
Respond to the following user query in a comprehensive and detailed way. You can write down your thought process before responding. Write your thoughts after "Here is my thought process:" and write your response after "Here is my response:". User query: {prompt}
categorizations
thoughts
437e4c31-f7f0-4b4a-9ef2-02e56be79482

ox
6 months ago
Below is an instruction that I would like you to analyze:
<instruction> {prompt} </instruction>
Categorize the instruction above into one of the following categories:
General Knowledge
Math and Calculations
Programming and Coding
Reasoning and Problem-Solving
Creative Writing
Content Writing
Art and Design
Language and Translation
Research and Analysis
Conversational Dialogue
Data Analysis and Visualization
Business and Finance
Education and Learning
Science and Technology
Health and Wellness
Personal Development
Entertainment and Humor
Travel and Leisure
Marketing and Sales
Game Development
Miscellaneous
Be sure to provide the exact category name without any additional text.
categorizations