Repository evaluations - LilianZhou/super_glue

Evaluations

Run models against your data

Evaluations lets you test and compare a selection of AI models against your datasets.

Whether you're fine-tuning models or evaluating performance metrics, Oxen Evaluations simplifies the process by letting you run a prompt over every row of a dataset.

Once you're happy with the results, output the resulting dataset to a new file, another branch, or directly as a new commit.
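As a rough illustration of what running a prompt through a dataset involves, the sketch below fills a prompt template once per row. It is not Oxen's implementation: the function name `render_prompts` and the sample rows are hypothetical, and the template is copied from one of the runs listed on this page.

```python
# Hypothetical sketch: fill a prompt template once per dataset row.
# The template text matches a BoolQ prompt shown on this page; the rows are made up.
PROMPT_TEMPLATE = (
    "According to the context, answer the question with only 1 or 0. "
    "1 for yes and 0 for no\n"
    "context:\n{passage}\n"
    "question:\n{question}"
)


def render_prompts(template, rows):
    """Return one rendered prompt per row; each row is a dict of column name -> value."""
    return [template.format(**row) for row in rows]


# Made-up rows standing in for a dataset file.
rows = [
    {"passage": "The sky appears blue in daylight.", "question": "is the sky blue"},
    {"passage": "Penguins are flightless birds.", "question": "can penguins fly"},
]

prompts = render_prompts(PROMPT_TEMPLATE, rows)
print(prompts[0])
```

Each rendered prompt would then be sent to the selected model, and the responses collected into a new column of the output dataset.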

e8049bcd-343d-428f-b8f6-3345578dc606

OpenAI/GPT 4o mini (text → text)

LilianZhou

4 months ago

Prompt

Based on the premise, determine whether to agree with the hypothesis. Respond with only 0, 1 or 2. 2 for Support, 0 for Oppose, or 1 for Neutral.
premise:
{premise}
hypothesis:
{hypothesis}
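Because this prompt asks the model to respond with only 0, 1 or 2, the responses usually need a small parsing step before they can be compared with reference labels. The sketch below is one possible approach, not part of the evaluation feature itself; `parse_label` and `LABELS` are hypothetical names, and the label meanings are taken from the prompt text above.

```python
import re

# Label meanings as stated in the prompt: 2 = Support, 0 = Oppose, 1 = Neutral.
LABELS = {0: "Oppose", 1: "Neutral", 2: "Support"}


def parse_label(response):
    """Return 0, 1 or 2 if the response contains one as a standalone digit, else None."""
    match = re.search(r"\b[012]\b", response)
    return int(match.group()) if match else None


for raw in ["2", " 1\n", "Answer: 0.", "Support"]:
    label = parse_label(raw)
    print(repr(raw), "->", LABELS.get(label, "unparseable"))
```

Returning `None` for unparseable responses keeps them visible, so they can be counted separately rather than silently scored as wrong.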

main

cb/super_glue_cb_validation.parquet

main

cb/super_glue_cb_validation.parquet

completed | 56 rows | 7950 tokens | $0.0012 | 3 iterations

Qwen 72B on Boolq Validation

4c00221f-59ec-4771-a0cc-ad3e990516e5

Qwen/Qwen2.5 72B Instruct (text → text)

LilianZhou

5 months ago

Prompt

According to the context, answer the question with only 1 or 0. 1 for yes and 0 for no
context:
{passage}
question:
{question}

conflict-main-2a556f1d-96da-402a-8bbe-15e1058c8d21

boolq/super_glue_boolq_validation.parquet

conflict-main-2a556f1d-96da-402a-8bbe-15e1058c8d21

boolq/super_glue_boolq_validation.parquet

completed | 3270 rows | 662171 tokens | $0.5960 | 5 iterations

llama 70B on Commitment Bank

bd87f366-457c-4190-8b4a-13f6abd77948

Meta/Llama 3.1 70B Instruct (text → text)

LilianZhou

5 months ago

Prompt

Based on the premise, determine whether to agree with the hypothesis. Respond with only 0, 1 or 2. 2 for Support, 0 for Oppose, or 1 for Neutral.
premise:
{premise}
hypothesis:
{hypothesis}

main

cb/super_glue_cb_validation.parquet

3b8c2f56de4e359bb5dd6288bc97fa57

cb/super_glue_cb_validation.parquet

completed | 56 rows | 8646 tokens | $0.0078 | 2 iterations

llama 70B on Boolq

dc29989f-2d95-4996-9589-f14abacc86db

Meta/Llama 3.1 70B Instruct (text → text)

LilianZhou

5 months ago

Prompt

According to the context, answer the question with only 1 or 0. 1 for yes and 0 for no
context:
{passage}
question:
{question}

conflict-main-2a556f1d-96da-402a-8bbe-15e1058c8d21

boolq/super_glue_boolq_validation.parquet

conflict-main-2a556f1d-96da-402a-8bbe-15e1058c8d21

boolq/super_glue_boolq_validation.parquet

completed | 3270 rows | 600464 tokens | $0.5404 | 2 iterations

Qwen on Commitment Bank Q

b939a272-43f2-46ac-9e26-3b3cef2f456a

Qwen/Qwen2.5 72B Instruct (text → text)

LilianZhou

5 months ago

Prompt

Based on the premise, determine whether to agree with the hypothesis. Respond with only 0, 1 or 2. 2 for Support, 0 for Oppose, or 1 for Neutral.
premise:
{premise}
hypothesis:
{hypothesis}

main

cb/super_glue_cb_validation.parquet

main

cb/super_glue_cb_validation.parquet

completed | 56 rows | 9435 tokens | $0.0085 | 4 iterations
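The three completed Commitment Bank runs listed above report different costs for the same 56 rows. One way to compare them is to normalize cost by token count, as in the sketch below. The token and cost figures are copied from this page; since the displayed costs are rounded to four decimal places, the derived rates are approximate. The helper `cost_per_million` is a hypothetical name.

```python
# (model, tokens, cost in USD) for the completed 56-row Commitment Bank runs on this page.
runs = [
    ("OpenAI/GPT 4o mini", 7950, 0.0012),
    ("Meta/Llama 3.1 70B Instruct", 8646, 0.0078),
    ("Qwen/Qwen2.5 72B Instruct", 9435, 0.0085),
]


def cost_per_million(tokens, cost_usd):
    """Effective USD cost per one million tokens, derived from a run's totals."""
    return cost_usd / tokens * 1_000_000


rates = {model: cost_per_million(tokens, cost) for model, tokens, cost in runs}

for model, rate in rates.items():
    print(f"{model}: ~${rate:.2f} per 1M tokens")
```

Normalizing this way separates the price of the model from the length of its prompts and responses, which differ slightly from run to run.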

Qwen 72B on Boolq Validation

a5631f40-d8b1-4343-99e1-32aa2ca08ad7

Qwen/Qwen2.5 72B Instruct (text → text)

LilianZhou

5 months ago

Prompt

According to the context, answer the question with only 1 or 0. 1 for yes and 0 for no
context:
{passage}
question:
{question}

main

boolq/super_glue_boolq_validation.parquet

N/A

boolq/super_glue_boolq_validation.parquet

error: no case clause matching: {:error, "resource_not_found"} | 2700 / 3270 rows | 546534 tokens | $0.4900 | 2 iterations

Qwen 72B on Boolq Validation

3f7688c6-01b2-4c3d-a50a-476c550ca736

Qwen/Qwen2.5 72B Instruct (text → text)

LilianZhou

5 months ago

Prompt

According to the context, answer the question with only 1 or 0. 1 for yes and 0 for no
context:
{passage}
question:
{question}

main

boolq/super_glue_boolq_validation.parquet

completed | 5 row sample | 1189 tokens | $0.0011 | 3 iterations

228b59d1-f298-4849-a01a-93511511a5c4

Qwen/Qwen2.5 72B Instruct (text → text)

LilianZhou

5 months ago

Prompt

According to the context, answer the question concisely. Try to answer in one or a few words.
context:
{paragraph}
question:
{question}
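The prompts on this page refer to dataset columns by name, for example {paragraph} here versus {passage} in the BoolQ runs, so a template only works if every placeholder matches a column in the target file. The sketch below shows one way to check that before launching a run. It is an illustration only: `placeholders` and `missing_columns` are hypothetical helpers, and the column lists are assumed rather than read from the actual parquet files.

```python
from string import Formatter


def placeholders(template):
    """Return the set of {name} fields used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_columns(template, columns):
    """Return placeholders that have no matching column name."""
    return placeholders(template) - set(columns)


template = (
    "According to the context, answer the question concisely. "
    "Try to answer in one or a few words.\n"
    "context:\n{paragraph}\n"
    "question:\n{question}"
)

# Assumed column names, for illustration only.
print(missing_columns(template, ["paragraph", "question", "answer"]))
print(missing_columns(template, ["passage", "question"]))
```

The first call finds nothing missing, while the second reports that `paragraph` has no matching column, which is the kind of mismatch that would otherwise surface only after a run starts.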

main

multirc/super_glue_multirc_test.parquet

completed | 5 row sample | 2149 tokens | $0.0019 | 1 iteration

bbad6e19-ad2b-4c4b-a6bc-22f0e9f33ca3

Qwen/Qwen2.5 72B Instruct (text → text)

LilianZhou

5 months ago

Prompt

According to the context, answer the question concisely. Try to answer in one or a few words.
context:
{paragraph}
question:
{question}

main

multirc/super_glue_multirc_validation.parquet

completed | 5 row sample | 1280 tokens | $0.0012 | 2 iterations

test gpt4-o

7638571d-fffc-43fc-8135-437bb7b9b1c5

OpenAI/GPT 4o mini (text → text)

LilianZhou

5 months ago

Prompt

Use the passage as context and pick the correct answer from given choices to put into the placeholder in the query. Do not output anything besides the answer.
passage:
{passage}
choices:
{entities}
query:
{query}

main

record/super_glue_record_validation.parquet

completed | 5 row sample | 1578 tokens | $0.0002 | 1 iteration

Test llama3.1 8B on super_glue

c09749b7-c681-475e-8643-00fe5fed7d5c

Meta/Llama 3.1 8B Instruct (text → text)

LilianZhou

5 months ago

Prompt

Use the passage as context and pick the correct answer from given choices to put into the placeholder in the query. Do not output anything besides the answer.
passage:
{passage}
choices:
{entities}
query:
{query}

main

record/super_glue_record_validation.parquet

completed | 5 row sample | 1632 tokens | $0.0003 | 1 iteration

Test llama3.1 8B

18f4ac2d-9cab-407b-9bfb-5ac9996eaebc

OpenAI/GPT 4o mini (text → text)

LilianZhou

5 months ago

Prompt

Use the passage as context and pick the correct answer from given choices to put into the placeholder in the query. Do not output anything besides the answer.
passage:
{passage}
choices:
{entities}
query:
{query}

main

record/super_glue_record_validation.parquet

completed | 5 row sample | 1570 tokens | $0.0002 | 1 iteration

Test llama3.1 8B on super_glue

cb56be48-8656-48c1-be12-727cb2ae1b48

Meta/Llama 3.1 8B Instruct (text → text)

LilianZhou

5 months ago

Prompt

Use the passage as context and pick the correct answer from given choices to put into the placeholder in the query. Do not output anything besides the answer.
passage:
{passage}
query:
{query}
choices:
{entities}

main

record/super_glue_record_train.parquet

completed | 20 row sample | 6614 tokens | $0.0013 | 3 iterations