ox (Bessie), user account
ox's Repositories
Displaying page 5 of 11 (109 total repositories)
(Unnamed repository)
1.7 GB | Updated: 10 months ago
(Unnamed repository)
This dataset contains 404,290 question pairs from Quora, labeled by whether the two questions are duplicates of each other.
72.6 MB | Updated: 10 months ago
(Unnamed repository)
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question: have these models truly acquired robust commonsense capabilities, or do they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense?
14.5 MB | Updated: 10 months ago
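Each WSC item pairs an ambiguous pronoun with two candidate antecedents, and flipping a single "special" word flips the correct answer. A minimal sketch using the classic trophy/suitcase schema (the field names are hypothetical, not this repository's actual format):

```python
# Illustrative Winograd schema: the classic trophy/suitcase pair.
# Field names are hypothetical and not this repository's actual format.
schema = {
    "sentence": "The trophy doesn't fit in the suitcase because it is too {}.",
    "pronoun": "it",
    "candidates": ["the trophy", "the suitcase"],
    # Swapping one "special" word flips the correct antecedent:
    "answers": {"big": "the trophy", "small": "the suitcase"},
}

for word, answer in schema["answers"].items():
    print(schema["sentence"].format(word), "->", answer)
```

Because the two variants differ by one word while the answer flips, a model that relies only on word associations with "big" or "small" cannot get both variants right.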
TriviaQA (Public)
TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high-quality distant supervision for answering the questions.
10.3 GB | Updated: 10 months ago
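"Distant supervision" here means the evidence documents are not hand-annotated: a document counts as weak support for a question when the answer string appears in it. A sketch of that matching idea, with invented data for illustration:

```python
# Sketch of the distant-supervision idea behind TriviaQA-style evidence:
# a document weakly supports a question if the answer string occurs in it.
# The documents below are invented for illustration.
def distantly_supervised(answer: str, documents: list[str]) -> list[str]:
    """Return the documents that contain the answer (case-insensitive)."""
    return [d for d in documents if answer.lower() in d.lower()]

docs = [
    "Mount Everest, at 8,849 m, is Earth's highest mountain.",
    "K2 is the second-highest mountain on Earth.",
]
hits = distantly_supervised("Mount Everest", docs)
print(len(hits))  # 1
```

This is why the supervision is "distant": a matched document may mention the answer without actually explaining it, so the labels are noisy.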
AGIEval (Public)
AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.
9.2 MB | Updated: 10 months ago
MMLU (Public)
MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pre-training by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model's blind spots.
166 MB | Updated: 10 months ago
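The few-shot setting described above amounts to prepending k solved examples to the test question. A sketch assuming the standard MMLU four-choice (A-D) format; the questions themselves are invented for illustration:

```python
# Sketch of k-shot prompt construction for an MMLU-style benchmark.
# MMLU items are four-choice (A-D) questions grouped by subject;
# the example questions below are invented for illustration.
def format_item(question, choices, answer=None):
    """Render one question; leave the answer blank for the test item."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append(f"Answer: {answer if answer is not None else ''}".rstrip())
    return "\n".join(lines)

def build_prompt(dev_items, test_item, subject):
    """Prepend solved dev examples (the 'shots') before the test question."""
    header = f"The following are multiple choice questions about {subject}.\n"
    shots = "\n\n".join(format_item(q, c, a) for q, c, a in dev_items)
    return header + "\n" + shots + "\n\n" + format_item(*test_item)

dev = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
test_q = ("What is 3 + 3?", ["5", "6", "7", "8"])
prompt = build_prompt(dev, test_q, "elementary mathematics")
print(prompt)
```

The model's completion after the final "Answer:" is then compared against the gold letter; the zero-shot case is the same prompt with no dev examples.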
Cats vs Dogs
100.2 MB | Updated: 10 months ago
(Unnamed repository)
Repository of images of cats and dogs for object detection.
141.3 MB | Updated: 10 months ago
(Unnamed repository)
688.8 MB | Updated: 10 months ago