ox (Bessie), user account
ox's Repositories
Displaying page 5 of 11 (109 total repositories)
(Unnamed repository)
1.7 GB | Updated: 10 months ago
(Unnamed repository)
This dataset contains 404,290 question pairs from Quora, labeled by whether the two questions are duplicates of each other.
72.6 MB | Updated: 10 months ago
(Unnamed repository)
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question: have these models truly acquired robust commonsense capabilities, or do they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense?
14.5 MB | Updated: 10 months ago
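Each WSC item pairs an ambiguous pronoun with two candidate antecedents, and flipping a single "special" word flips the correct answer. A minimal sketch using the classic trophy/suitcase schema (the field names are hypothetical, not this repository's actual format):

```python
# Illustrative Winograd schema: the classic trophy/suitcase pair.
# Field names are hypothetical and not this repository's actual format.
schema = {
    "sentence": "The trophy doesn't fit in the suitcase because it is too {}.",
    "pronoun": "it",
    "candidates": ["the trophy", "the suitcase"],
    # Swapping one "special" word flips the correct antecedent:
    "answers": {"big": "the trophy", "small": "the suitcase"},
}

for word, answer in schema["answers"].items():
    print(schema["sentence"].format(word), "->", answer)
```

Because the two variants differ by one word while the answer flips, a model that relies only on word associations with "big" or "small" cannot get both variants right.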
TriviaQA (Public)
TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high-quality distant supervision for answering the questions.
10.3 GB | Updated: 10 months ago
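"Distant supervision" here means the evidence documents are not hand-annotated: a document counts as weak support for a question when the answer string appears in it. A sketch of that matching idea, with invented data for illustration:

```python
# Sketch of the distant-supervision idea behind TriviaQA-style evidence:
# a document weakly supports a question if the answer string occurs in it.
# The documents below are invented for illustration.
def distantly_supervised(answer: str, documents: list[str]) -> list[str]:
    """Return the documents that contain the answer (case-insensitive)."""
    return [d for d in documents if answer.lower() in d.lower()]

docs = [
    "Mount Everest, at 8,849 m, is Earth's highest mountain.",
    "K2 is the second-highest mountain on Earth.",
]
hits = distantly_supervised("Mount Everest", docs)
print(len(hits))  # 1
```

This is why the supervision is "distant": a matched document may mention the answer without actually explaining it, so the labels are noisy.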
AGIEval (Public)
AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.
9.2 MB | Updated: 10 months ago
MMLU (Public)
MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pre-training by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model's blind spots.
166 MB | Updated: 10 months ago
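The few-shot setting described above amounts to prepending k solved examples to the test question. A sketch assuming the standard MMLU four-choice (A-D) format; the questions themselves are invented for illustration:

```python
# Sketch of k-shot prompt construction for an MMLU-style benchmark.
# MMLU items are four-choice (A-D) questions grouped by subject;
# the example questions below are invented for illustration.
def format_item(question, choices, answer=None):
    """Render one question; leave the answer blank for the test item."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append(f"Answer: {answer if answer is not None else ''}".rstrip())
    return "\n".join(lines)

def build_prompt(dev_items, test_item, subject):
    """Prepend solved dev examples (the 'shots') before the test question."""
    header = f"The following are multiple choice questions about {subject}.\n"
    shots = "\n\n".join(format_item(q, c, a) for q, c, a in dev_items)
    return header + "\n" + shots + "\n\n" + format_item(*test_item)

dev = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
test_q = ("What is 3 + 3?", ["5", "6", "7", "8"])
prompt = build_prompt(dev, test_q, "elementary mathematics")
print(prompt)
```

The model's completion after the final "Answer:" is then compared against the gold letter; the zero-shot case is the same prompt with no dev examples.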
Cats vs Dogs
100.2 MB | Updated: 10 months ago
(Unnamed repository)
Repository of images of cats and dogs for object detection.
141.3 MB | Updated: 10 months ago
(Unnamed repository)
688.8 MB | Updated: 10 months ago