Repository evaluations - ox/Thinking-LLMs

Evaluations

Run models against your data

Evaluations let you test and compare a selection of AI models against your datasets.

Whether you're fine-tuning models or measuring performance, Oxen Evaluations lets you run a prompt through every row of a dataset.

Once you're happy with the results, output the resulting dataset to a new file, another branch, or directly as a new commit.

GPT-4o As A Judge: Gemini Pro v Llama 405B

486adfc0-a842-446d-beac-df2cdd87024c

OpenAI/GPT 4o | text → text

5 months ago

Prompt

Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. 

Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. 

Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else.

[User Question]
{prompt}

[The Start of Assistant A’s Answer]
{specific_thoughts_llama_405b_response}
[The End of Assistant A’s Answer]

[The Start of Assistant B’s Answer]
{specific_thoughts_gemini_pro_response}
[The End of Assistant B’s Answer]

combined_thoughts

combined_responses_llama_405b_gemini_pro.jsonl

judgements_llama_405b_v_gemini_pro

combined_responses_llama_405b_gemini_pro.jsonl

completed | 1001 rows | 1027794 tokens | $2.59 | 1 iteration
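The judge prompt above is a template: each `{column}` placeholder is filled from the matching column of a dataset row, and the model is asked to reply with only `[[A]]` or `[[B]]`. The sketch below shows one way to reproduce that locally. It is a minimal illustration, not Oxen's implementation: the helper names `render` and `parse_verdict` are hypothetical, the template is abbreviated to the placeholder section, and substitution via `str.format_map` is an assumption about how the placeholders behave.

```python
import re

# Abbreviated stand-in for the judge prompt; the placeholder names are the
# column names shown in the evaluation above.
JUDGE_TEMPLATE = (
    "[User Question]\n"
    "{prompt}\n\n"
    "[The Start of Assistant A’s Answer]\n"
    "{specific_thoughts_llama_405b_response}\n"
    "[The End of Assistant A’s Answer]\n\n"
    "[The Start of Assistant B’s Answer]\n"
    "{specific_thoughts_gemini_pro_response}\n"
    "[The End of Assistant B’s Answer]"
)

# Matches the verdict string the prompt asks for: [[A]] or [[B]].
VERDICT_RE = re.compile(r"\[\[([AB])\]\]")


def render(template, row):
    """Fill each {column} placeholder from a row dict (hypothetical helper)."""
    return template.format_map(row)


def parse_verdict(output):
    """Return "A" or "B", or None if the model did not follow the format."""
    match = VERDICT_RE.search(output)
    return match.group(1) if match else None


row = {
    "prompt": "What is 2+2?",
    "specific_thoughts_llama_405b_response": "4",
    "specific_thoughts_gemini_pro_response": "The answer is 4.",
}
print(render(JUDGE_TEMPLATE, row))
print(parse_verdict("[[B]]"))
```

Searching for the pattern, rather than comparing the whole string, tolerates judges that wrap the verdict in quotes or add a stray sentence despite the instruction.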

OpenAI GPT-4o Judgement of Llama 405B Responses

f6578af2-65da-4d41-a1ea-b55e6d19ce43

OpenAI/GPT 4o | text → text

5 months ago

Prompt

Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. 

Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. 

Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else.

[User Question]
{prompt}

[The Start of Assistant A’s Answer]
{specific_thoughts_llama_405b_response}
[The End of Assistant A’s Answer]

[The Start of Assistant B’s Answer]
{generic_thoughts_llama_405B_response}
[The End of Assistant B’s Answer]

combined_thoughts

combined_responses_llama_405b.jsonl

llama_405B_judgements

combined_responses_llama_405b.jsonl

completed | 1001 rows | 977245 tokens | $2.47 | 2 iterations

Generic Thoughts Llama 405B

ae7f23c0-0e82-4133-b779-6bd4f75c6782

Meta/Llama 3.1 405B Instruct | text → text

5 months ago

Prompt

Respond to the following user query in a comprehensive and detailed way. You can write down your thought process before responding. Write your thoughts after "Here is my thought process:" and write your response after "Here is my response:".
User query: {prompt}

main

ultrafeedback_1000.jsonl

generic_thoughts_llama_405B

ultrafeedback_1000.jsonl

completed | 1000 rows | 746390 tokens | $2.24 | 1 iteration

Specific Thoughts Gemini Pro

8a13d411-e496-40f7-a681-335f77048252

Google/Gemini 1.5 Pro | text → text

5 months ago

Prompt

Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>".

User query: {prompt}

main

ultrafeedback_1000.jsonl

specific_thoughts_gemini_pro

ultrafeedback_1000.jsonl

completed | 1000 rows | 888626 tokens | $3.70 | 1 iteration

Specific Thoughts Llama 405B

e95d048d-c536-40bb-a41c-8ab9852d4b87

Meta/Llama 3.1 405B Instruct | text → text

5 months ago

Prompt

Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>".

User query: {prompt}

categorizations

ultrafeedback_1000.jsonl

specific_thoughts_llama_405b

ultrafeedback_1000.jsonl

completed | 1000 rows | 879230 tokens | $2.64 | 1 iteration
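The two thought prompts above delimit the final answer differently: the generic prompt uses the phrases "Here is my thought process:" and "Here is my response:", while the specific prompt puts the final response after "<R>". A judge should usually see only the final response, so the outputs need splitting. The following is a rough sketch of that step; the function names are hypothetical, and the fallback of treating unmarked output as the response is an assumption, not something this page specifies.

```python
THOUGHT_MARKER = "Here is my thought process:"
RESPONSE_MARKER = "Here is my response:"
FINAL_MARKER = "<R>"


def split_generic(output):
    """Split generic-thoughts output into (thoughts, response)."""
    head, sep, tail = output.partition(RESPONSE_MARKER)
    if not sep:
        # Assumption: with no response marker, the whole output is the response.
        return "", output.strip()
    # Drop the thought marker itself if the model emitted it.
    if THOUGHT_MARKER in head:
        head = head.partition(THOUGHT_MARKER)[2]
    return head.strip(), tail.strip()


def split_specific(output):
    """Split specific-thoughts output into (thoughts, response) at the last <R>."""
    thoughts, sep, final = output.rpartition(FINAL_MARKER)
    if not sep:
        # Assumption: with no <R> marker, the whole output is the response.
        return "", output.strip()
    return thoughts.strip(), final.strip()


generic = "Here is my thought process: add the numbers.\nHere is my response: 4"
specific = "Draft: 4. Evaluation: correct.\n<R>\n4"
print(split_generic(generic))
print(split_specific(specific))
```

Using `rpartition` for `<R>` takes the last occurrence, so a draft that mentions the marker inside the thoughts does not truncate the final response.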

9ab24c3a-849f-4b45-a81b-ce1f6ebaa72f

Meta/Llama 3.1 8B Instruct Turbo | text → text

5 months ago

Prompt

Classify the text into one of the following categories. Repond with only the category, one word, all lowercase. If it does not fall into a category, respond with "none".

sports
finance
tech
entertainment

{prompt}

main

ultrafeedback_1000.jsonl

completed | 5 row sample | 899 tokens | $0.0002 | 2 iterations

c3632963-40a9-45e5-8553-5d885f31d403

Meta/Llama 3.1 8B Instruct Turbo | text → text

5 months ago

Prompt

Classify the text into one of the following categories. Repond with only the category, one word, all lowercase. If it does not fall into a category, resond with "none".

sports
finance
tech
entertainment

{prompt}

main

ultrafeedback_1000.jsonl

completed | 5 row sample | 918 tokens | $0.0002 | 2 iterations
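The classification prompt above asks for a single lowercase category, with "none" as the fallback. Small models do not always comply exactly, so a post-processing step can enforce the label set. This is a minimal sketch under that assumption; `normalize_category` is a hypothetical helper, and mapping every unrecognized label to "none" is one possible policy.

```python
# Label set taken from the prompt above.
CATEGORIES = {"sports", "finance", "tech", "entertainment"}


def normalize_category(output):
    """Map raw model output to one of the allowed labels, else "none"."""
    # Trim whitespace, then surrounding quotes and periods, then lowercase.
    label = output.strip().strip("\"'.").lower()
    return label if label in CATEGORIES else "none"


for raw in ["sports", " Finance.\n", '"Tech"', "technology", ""]:
    print(repr(raw), "->", normalize_category(raw))
```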

Categorize all the prompts

eadf486d-82b5-4e15-9606-1dd25357ff23

OpenAI/GPT 4o mini | text → text

5 months ago

Prompt

Classify the text into entertainment, sports or finance. Limit it to one word.

{prompt}

main

ultrafeedback_1000.jsonl

completed | 5 row sample | 475 tokens | $0.0001 | 2 iterations

a534f90f-33cc-4d81-bf5e-ffdd0696fe63

OpenAI/GPT 4o mini | text → text

6 months ago

Prompt

Classify the text into one of 3 categories, you decide the categories

{prompt}

main

ultrafeedback_1000.jsonl

completed | 5 row sample | 807 tokens | $0.0003 | 1 iteration

Judge The Responses

60088e3f-0acc-45af-927c-9fb4d31c73bb

Meta/Llama 3.1 70B Instruct | text → text

6 months ago

Prompt

Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. 

Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. 

Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else.

[User Question]
{prompt}

[The Start of Assistant A’s Answer]
{specific_thoughts_response}
[The End of Assistant A’s Answer]

[The Start of Assistant B’s Answer]
{generic_thoughts_response}
[The End of Assistant B’s Answer]

combined_thoughts

combined_responses.jsonl

judgements

combined_responses.jsonl

completed | 1000 rows | 991422 tokens | $0.8923 | 3 iterations

6373dbc5-d642-42b0-8650-d479684831c8

Meta/Llama 3.1 70B Instruct | text → text

6 months ago

Prompt

Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. 

Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. 

Output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better. Answer with just the verdict string and nothing else.

[User Question]
{prompt}

[The Start of Assistant A’s Answer]
{specific_thoughts_response}
[The End of Assistant A’s Answer]

[The Start of Assistant B’s Answer]
{generic_thoughts_response}
[The End of Assistant B’s Answer]

combined_thoughts

combined_responses.jsonl

completed | 5 row sample | 2802 tokens | $0.0025 | 3 iterations

Specific Thought Prompt - Llama 3.1 70B

a72cad52-a06c-4413-95e1-6993a6450fad

Meta/Llama 3.1 70B Instruct | text → text

6 months ago

Prompt

Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>".

User query: {prompt}

categorizations

ultrafeedback_1000.jsonl

specific_thoughts_70B

ultrafeedback_1000.jsonl

completed | 1000 rows | 977771 tokens | $0.8800 | 2 iterations

Specific Thought Prompt - Llama 3.1 8B

a0d1316d-c238-49f1-9c98-fa19d0e285e1

Meta/Llama 3.1 8B Instruct | text → text

6 months ago

Prompt

Respond to the following user query in a comprehensive and detailed way. But first write down your internal thoughts. This must include your draft response and its evaluation. After this, write your final response after "<R>".
User query: {prompt}

categorizations

ultrafeedback_1000.jsonl

specific_thoughts

ultrafeedback_1000.jsonl

completed | 1000 rows | 916141 tokens | $0.1832 | 1 iteration

Generate generic thoughts w/ Llama 3.1 8B

ff6d6793-8617-4530-a8c4-80261a88973e

Meta/Llama 3.1 8B Instruct | text → text

6 months ago

Prompt

Respond to the following user query in a comprehensive and detailed way. You can write down your thought process before responding. Write your thoughts after "Here is my thought process:" and write your response after "Here is my response:".
User query: {prompt}

categorizations

ultrafeedback_1000.jsonl

thoughts

ultrafeedback_1000.jsonl

completed | 1000 rows | 773706 tokens | $0.1547 | 2 iterations

Classify the instructions with Llama 3.1 8B

437e4c31-f7f0-4b4a-9ef2-02e56be79482

Meta/Llama 3.1 8B Instruct | text → text

6 months ago

Prompt

Below is an instruction that I would like you to analyze:

<instruction>
{prompt}
</instruction>

Categorize the instruction above into one of the following categories: 
General Knowledge
Math and Calculations
Programming and Coding
Reasoning and Problem-Solving
Creative Writing
Content Writing
Art and Design
Language and Translation
Research and Analysis
Conversational Dialogue
Data Analysis and Visualization
Business and Finance
Education and Learning
Science and Technology
Health and Wellness
Personal Development
Entertainment and Humor
Travel and Leisure
Marketing and Sales
Game Development
Miscellaneous

Be sure to provide the exact category name without any additional text.

main

ultrafeedback_1000.jsonl

categorizations

ultrafeedback_1000.jsonl

completed | 1000 rows | 312294 tokens | $0.0625 | 2 iterations
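Each evaluation on this page reports a token count and a total cost, which is enough to compare models by effective price. The sketch below derives a blended cost per million tokens from a few of the figures listed above. It assumes the displayed cost and token count cover the same run and mix input and output tokens, so the result is a rough rate rather than a provider's published price; `cost_per_million` is a hypothetical helper.

```python
def cost_per_million(tokens, cost_usd):
    """Blended USD cost per one million tokens for a completed run."""
    return cost_usd / tokens * 1_000_000


# (tokens, cost in USD) copied from the evaluation summaries on this page.
runs = {
    "GPT-4o As A Judge: Gemini Pro v Llama 405B": (1027794, 2.59),
    "Generic Thoughts Llama 405B": (746390, 2.24),
    "Specific Thoughts Gemini Pro": (888626, 3.70),
    "Classify the instructions with Llama 3.1 8B": (312294, 0.0625),
}

for name, (tokens, cost) in runs.items():
    print(f"{name}: ${cost_per_million(tokens, cost):.2f} per 1M tokens")
```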