Explore Dataset Repositories

Featured Datasets

datasets/ImageNet-1k

public

ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. The project has been instrumental in advancing computer vision and deep learning research.

139.5 GB

Updated: 1 year ago

datasets/DocVQA

public

A Dataset for VQA on Document Images.

9.7 GB

Updated: 11 months ago

lmsys/chatbot_arena_conversations

public

This dataset contains 33K cleaned conversations with pairwise human preferences. It was collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp.

41.6 MB

Updated: 1 year ago

lmms-lab/OCRBench-v2

public

OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models. It comprises five components: Text Recognition, SceneText-Centric VQA, Document-Oriented VQA, Key Information Extraction, and Handwritten Mathematical Expression Recognition. The benchmark includes 1000 question-answer pairs, and all the answers undergo manual verification and correction to ensure a more precise evaluation.

4.5 GB

Updated: 7 months ago

BLINK-Benchmark/BLINK

public

This repo contains data for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive".

466.3 MB

Updated: 1 year ago

ox/PixArtTutorial

public

This is a tutorial on how to use Oxen.ai with PixArt.

185.3 MB

Updated: 1 year ago

mlabonne/harmless_alpaca

public

1.2 MB

Updated: 1 year ago

OpenCoder-LLM/opc-sft-stage1

public

Dataset for the paper "OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models".

1 GB

Updated: 10 months ago

datasets/ARC-Easy

public

A dataset from the Allen Institute for AI consisting of genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question answering. This dataset is the Easy Set.

1.5 MB

Updated: 2 years ago

models/llama-3-8b-instruct

public

Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pre-trained and instruction-tuned generative text models in 8B and 70B sizes. The Llama 3 instruction-tuned models are optimized for dialogue use cases and outperform many of the available open-source chat models on common industry benchmarks. Further, in developing these models, Meta took great care to optimize helpfulness and safety.

16.1 GB

Updated: 2 years ago

Featured Collections

Some of the Oxen team's favorite collections.

LLM-SFT

Visual LLMs

LLM-Feedback

LLM-Eval

Multimodal