Featured Datasets
ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. The project has been instrumental in advancing computer vision and deep learning research.
This dataset contains 33K cleaned conversations with pairwise human preferences. It was collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp.
OCRBench is a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models. It comprises five components: Text Recognition, SceneText-Centric VQA, Document-Oriented VQA, Key Information Extraction, and Handwritten Mathematical Expression Recognition. The benchmark includes 1000 question-answer pairs, and all the answers undergo manual verification and correction to ensure a more precise evaluation.
This repo contains data for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive".
Dataset for the paper "OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models".
Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pre-trained and instruction-tuned generative text models in 8B and 70B sizes. The Llama 3 instruction-tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks. In developing these models, Meta took great care to optimize helpfulness and safety.
Featured Collections
Some of the Oxen team's favorite collections.
Visual LLMs
Datasets for image understanding with large language models.
a collection by datasets
LLM-Feedback
Datasets with human or AI feedback. Useful for training reward models or applying techniques like DPO.
a collection by ox
Multimodal
Datasets that cross modalities, combining text, image, audio, video, and more.
a collection by ox