Repository evaluations - datasets/ChartQA

Evaluations/Judge 11B Responses

llama-3.2-11B-cot-separate-steps

val_100_ex.json

Type: text → text

Model: OpenAI/GPT 4o mini

Provider: OpenAI

Target field: are_equivalent

Prompt

Are the two responses equivalent? Reply with true or false.

Response 1:
{label}

Response 2:
{conclusion}
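
As a rough illustration of how a template like the one above could be applied per row, the sketch below fills the `{label}` and `{conclusion}` placeholders from a row and turns the model's reply into a boolean for the `are_equivalent` target field. The function names and the parsing rule are assumptions made for this example, not the evaluation tool's actual implementation, and the model call itself is left out.

```python
from typing import Optional

# Prompt text copied from the evaluation page; {label} and {conclusion}
# are the names of columns in the dataset.
PROMPT_TEMPLATE = """Are the two responses equivalent? Reply with true or false.

Response 1:
{label}

Response 2:
{conclusion}"""


def render_prompt(row: dict) -> str:
    """Fill the template with one row's values (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(label=row["label"], conclusion=row["conclusion"])


def parse_are_equivalent(reply: str) -> Optional[bool]:
    """Map a model reply to True/False; return None if it is neither.

    Assumption: replies are compared case-insensitively and a trailing
    period is ignored. The real tool may parse replies differently.
    """
    text = reply.strip().lower().rstrip(".")
    if text == "true":
        return True
    if text == "false":
        return False
    return None
```

Returning `None` for anything other than a clean true/false keeps malformed replies visible instead of silently counting them as non-equivalent.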

Queued: Dec 6, 2024, 5:33 PM UTC

Completed: Dec 6, 2024, 5:33 PM UTC

Duration: 00:00:02

5 row sample

5 rows processed, 173 tokens used ($0.0000)

Estimated cost for all 100 rows: $0.0006
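
The page does not say how the full-run estimate is derived. One plausible reading is a linear extrapolation from the sample, sketched below. Both helper names and the `usd_per_million_tokens` parameter are hypothetical, and no actual price is assumed here.

```python
def extrapolate_tokens(sample_tokens: int, sample_rows: int, total_rows: int) -> float:
    """Scale the sample's token count linearly to the full dataset."""
    # Multiply before dividing so small integer inputs stay exact.
    return sample_tokens * total_rows / sample_rows


def estimate_cost(tokens: float, usd_per_million_tokens: float) -> float:
    """Cost in USD for a token count at a given (assumed) blended rate."""
    return tokens * usd_per_million_tokens / 1_000_000


# Figures from the page: 173 tokens for the 5-row sample, 100 rows in total.
projected_tokens = extrapolate_tokens(173, 5, 100)  # 3460.0
```

Under this assumption the full run would use about 3,460 tokens. Converting that to dollars depends on the provider's current input and output rates, which the page does not list.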

Sample Results completed

9 columns, 1-5 of 100 rows

imgname

query

Find the average of the percentage value of bars greater than 1?

What is the sum of the two medians?

What's the total sum of peak points of green and red lines?

How many bars have a Very worried value is greater than its Somewhat Worried value?

Is the graph increasing or decreasing?