Home | Oxen.ai

Build World-Class
AI Datasets. Together.

Open-source tools to track, iterate, collaborate on, and discover multi-modal data in any format.

Image, Audio, Video, Tabular, Text, More...

Trusted by a community of engineers and researchers at companies like

Public Datasets

Explore Datasets

Oxen’s public and private datasets allow you to iterate on data within your organization or share them with the world.

datasets/ARC-Challenge

A dataset from the Allen Institute for AI consisting of genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question answering. This dataset contains the Challenge Set of questions.

1 text file > 99%

Question Answering Natural Language Processing

Updated: Dec 21, 2023

datasets/Wikipedia

Wikipedia dataset containing cleaned articles. There are 6.4 million articles that can be streamed via Apache Arrow files.

65 tabular files 98.5%

1 text file 1.5%

Updated: Dec 27, 2023

ox/Flickr8k

A benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events. … The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.

8.1K image files > 99%

9 text files < 1%

3 tabular files < 1%

Image Captioning Computer Vision Generative AI

Updated: Oct 9, 2023

ox/Flowers

An image classification dataset containing 3670 images of flowers across 5 classes: daisy, dandelion, roses, sunflowers, tulips. The images are of nonstandard sizes and aspect ratios, ranging from 500 x 442 px to 143 x 240 px.

3.7K image files > 99%

4 text files < 1%

1 tabular file < 1%

Image Classification Computer Vision

Updated: Nov 28, 2023

ox/MiniSpeechCommands

A subset of speech commands for testing audio recognition systems.

8K audio files > 99%

3 text files < 1%

1 tabular file < 1%

Audio Classification Audio

Updated: Apr 4, 2023

ox/CelebA

CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations.

203K image files > 99%

5 tabular files < 1%

1 text file < 1%

Generative AI Image Classification Computer Vision

Updated: Jan 18, 2023

Measure Performance

Better Datasets.
Better AI.

AI is only as good as the datasets you feed it. Gain visibility into the data that goes in and out of your model.

Version Control

Find the changes that matter

Datasets change every day. Oxen’s version control allows you to quickly narrow down the most important changes that affect your model.

Scalability & Versatility

Thousands of hours of audio?
Millions of images?
A billion rows in your CSV?
No problem.

Oxen’s data version control is built to handle data of any shape or size.

Performance

Built for speed

Oxen.ai saves your engineers hours syncing data across training, testing, and evaluation. From fast data syncing to removing the push/pull bottlenecks of traditional version control systems, Oxen.ai was built for machine learning datasets and workflows.

Data Visibility

Goodbye Messy Blob Storage.
Hello Oxen.

Oxen’s data version control turns your unstructured data into beautifully rendered datasets that evolve over time. Dive into any version of the dataset at any point in time and see exactly what changed.

Command Line Tooling

Powered by industrial-strength version control

Oxen.ai has re-imagined version control for data. At the core are the same principles that have made Git so powerful, but Oxen has optimized everything down to the Merkle trees, hashing, and network protocols to make it fast and effortless for large-scale datasets.
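
To give a feel for why Merkle trees suit this job, here is a minimal Python sketch of the general idea. It is not Oxen's implementation: the function names (`hash_bytes`, `merkle_root`, `changed_paths`), the choice of SHA-256, and the nested dict standing in for a directory of files are all assumptions made for illustration.

```python
import hashlib


def hash_bytes(data: bytes) -> str:
    """Content hash of one file's bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(tree: dict) -> str:
    """Hash a nested dict of {name: bytes | dict} from the leaves up.

    A directory's hash is derived from the names and hashes of its children,
    so any change to a file changes the hash of every ancestor directory.
    """
    entries = []
    for name in sorted(tree):
        node = tree[name]
        child = merkle_root(node) if isinstance(node, dict) else hash_bytes(node)
        entries.append(f"{name}:{child}")
    return hashlib.sha256("\n".join(entries).encode()).hexdigest()


def changed_paths(old: dict, new: dict, prefix: str = "") -> list:
    """List paths that differ, descending only into subtrees whose hashes differ.

    This sketch recomputes hashes on every call; a real system would store them.
    """
    changes = []
    for name in sorted(set(old) | set(new)):
        path = f"{prefix}{name}"
        a, b = old.get(name), new.get(name)
        if a is None or b is None:
            changes.append(path)  # added or removed
        elif isinstance(a, dict) and isinstance(b, dict):
            if merkle_root(a) != merkle_root(b):
                changes.extend(changed_paths(a, b, path + "/"))
        elif isinstance(a, dict) or isinstance(b, dict):
            changes.append(path)  # a file became a directory, or the reverse
        elif hash_bytes(a) != hash_bytes(b):
            changes.append(path)  # file contents changed
    return changes


# Two hypothetical versions of a tiny dataset; only one image differs.
v1 = {"images": {"cat.jpg": b"aaa", "dog.jpg": b"bbb"}, "labels.csv": b"file,label\n"}
v2 = {"images": {"cat.jpg": b"aaa", "dog.jpg": b"ccc"}, "labels.csv": b"file,label\n"}

print(changed_paths(v1, v2))  # ['images/dog.jpg']
```

Because identical subtrees share the same hash, a comparison can skip them entirely, which is what keeps diffs cheap when only a handful of files change in a very large dataset.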

Collaboration

Collaborate with your team

Datasets have many stakeholders: ML engineering, data science, product, legal, auditing, and the community. The more eyes, the better.

Community

Join the growing Herd

Oxen.ai has developed a strong and growing community of individuals focused on furthering machine learning and artificial intelligence, from academic researchers training the next generation of models to full-stack developers leveraging existing APIs to build amazing products. Every Friday we get together to read research papers, discuss them, and apply them to our own work.

Featured Posts, Tutorials, and Case studies

Data Version Control 101 with Oxen

This intro tutorial from Oxen.ai shows how Oxen can make versioning your data as easy as versioning your code. Oxen is built to track and store changes for everything from a single CSV to data repositories with millions of unstructured images, videos, audio or text files. The tutorial will go through what data version control is, why it is important, and how Oxen helps data scientists and engineers gain visibility and confidence when sharing data with the rest of their team. Here's a video ve...

Greg Schoeninger

Nov 9, 2023

Arxiv Dive Manifesto

Every Friday the team at Oxen.ai gets together and goes over research papers, blog posts, or books that help us stay up to date with the latest in Machine Learning and AI. We call it Arxiv Dives because https://arxiv.org/ is a great resource for the latest research in the field. In September of 2023, we decided to make it public so that anyone can join. We’ve had amazing minds from hundreds of companies like Amazon, DoorDash, Meta, Google, and Tesla join the conversation, but I thought it would...

Greg Schoeninger

Nov 5, 2023

How to run Llama-2 on CPU after fine-tuning with LoRA

Running Large Language Models (LLMs) on the edge is a fascinating area of research, and opens up many use cases that require data privacy or lower cost profiles. With libraries like ggml coming onto the scene, it is now possible to get models anywhere from 1 billion to 13 billion parameters to run locally on a laptop with relatively low latency. In this tutorial, we are going to walk step by step through how to fine-tune Llama-2 with LoRA, export it to ggml, and run it on the edge on a CPU. We assume...

Greg Schoeninger

Oct 23, 2023
