Featured Datasets
Featured Collections
Some of the Oxen team's favorite collections.
Visual LLMs
A collection of datasets for understanding images with large language models.
a collection by datasets
LLM-Feedback
Datasets with human or AI feedback, useful for training reward models or applying techniques like DPO.
a collection by ox
Multimodal
A list of datasets that cross modalities: combinations of text, image, audio, video, etc.
a collection by ox
Featured Posts, Tutorials, and Case Studies
Data Version Control 101 with Oxen
This intro tutorial from Oxen.ai shows how Oxen can make versioning your data as easy as versioning your code. Oxen is built to track and store changes for everything from a single CSV to data repositories with millions of unstructured image, video, audio, or text files. The tutorial goes through what data version control is, why it matters, and how Oxen helps data scientists and engineers gain visibility and confidence when sharing data with the rest of their team. Here's a video ve...
Arxiv Dive Manifesto
Every Friday the team at Oxen.ai gets together to go over research papers, blog posts, or books that help us stay up to date with the latest in machine learning and AI. We call it Arxiv Dives because https://arxiv.org/ is a great resource for the latest research in the field. In September of 2023, we decided to make it public so that anyone can join. We've had amazing minds from hundreds of companies like Amazon, DoorDash, Meta, Google, and Tesla join the conversation, but I thought it would...
How to run Llama-2 on CPU after fine-tuning with LoRA
Running Large Language Models (LLMs) on the edge is a fascinating area of research that opens up many use cases requiring data privacy or lower cost profiles. With libraries like ggml coming onto the scene, it is now possible to run models anywhere from 1 billion to 13 billion parameters locally on a laptop with relatively low latency. In this tutorial, we walk step by step through how to fine-tune Llama-2 with LoRA, export it to ggml, and run it on the edge on a CPU. We assume...