Featured Datasets

ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. The project has been instrumental in advancing computer vision and deep learning research.

139.5 gb
1521.2M
Updated: 7 months ago

A Dataset for VQA on Document Images.

9.7 gb
1117K
Updated: 6 months ago

This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp.

41.6 mb
21
Updated: 1 year ago

OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models. It comprises five components: Text Recognition, SceneText-Centric VQA, Document-Oriented VQA, Key Information Extraction, and Handwritten Mathematical Expression Recognition. The benchmark includes 1000 question-answer pairs, and all the answers undergo manual verification and correction to ensure a more precise evaluation.

4.5 gb
3210K
Updated: 2 months ago

This repo contains data for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive"

466.3 mb
4K302
Updated: 11 months ago

This is a tutorial of how to use Oxen.ai with PixArt

185.3 mb
75212115
Updated: 7 months ago

1.2 mb
22
Updated: 11 months ago

Dataset for the paper "OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models"

1 gb
21
Updated: 5 months ago

A dataset from the Allen Institute of AI consisting of genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering. The dataset the Easy Set.

1.5 mb
1
Updated: 1 year ago

Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pre-trained and instruction tuned generative text models in 8B sizes. The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks. Further, in developing these models, they took great care to optimize helpfulness and safety.

16.1 gb
410
Updated: 1 year ago
View all featured repositories
Featured Collections

Some of the Oxen team's favorite collections.

LLM-SFT

Interesting datasets to supervise fine-tune (SFT) language models with.

a collection by ox

Visual LLMs

This collection is datasets for understanding of images with large language models

a collection by datasets

LLM-Feedback

Datasets with human or AI feedback. Useful for training reward models or applying techniques like DPO.

a collection by ox

LLM-Eval

A list of standard benchmarks for LLM evaluation

a collection by ox

Multimodal

List of datasets that cross modalities, combinations of text, image, audio, video etc.

a collection by ox

Browse all collections