datasets's Repositories
This is a cleaned version of the HuggingFaceH4/ultrafeedback_binarized dataset that contains only the chosen and rejected samples.
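A minimal sketch of inspecting the preference pairs with the datasets library; the repository id below is a placeholder, and the "chosen"/"rejected" column names are assumed to match the original HuggingFaceH4/ultrafeedback_binarized schema:

```python
from datasets import load_dataset

# Placeholder repository id; substitute the actual path of this cleaned dataset.
ds = load_dataset("your-org/ultrafeedback-binarized-cleaned", split="train")

# Each row is assumed to carry a "chosen" and a "rejected" response,
# mirroring the original HuggingFaceH4/ultrafeedback_binarized columns.
example = ds[0]
print(example["chosen"])
print(example["rejected"])
```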
Triples of question, context, and answer, each labeled as having the answer in the context, not having the answer in the context, or being a question that does not make sense to ask.
A growing and diverse dataset of text for AI to graze on and learn new information. Just like a pasture in the wild, it is a combination of sources. All the data is in Arrow format, so it is easy to randomly access and stream.
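As a rough illustration of the Arrow-backed access patterns mentioned above, assuming the data is hosted as a Hugging Face dataset (the repository id is a placeholder):

```python
from datasets import load_dataset

# Placeholder repository id; replace with the dataset's actual path.
# A regular load memory-maps the Arrow files, so random access is cheap.
ds = load_dataset("your-org/pasture", split="train")
print(ds[1234])

# Streaming mode iterates over examples without downloading the full corpus.
streamed = load_dataset("your-org/pasture", split="train", streaming=True)
for i, example in enumerate(streamed):
    print(example)
    if i >= 2:
        break
```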
LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models towards GPT-4 vision/language capability.
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.
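For example, the text can be loaded directly from the Hub with the datasets library; the configuration name below is one of the standard WikiText configurations, but check the dataset card for the exact variant you need:

```python
from datasets import load_dataset

# "wikitext-103-raw-v1" is the raw, untokenized WikiText-103 configuration.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Each row holds a line of article text in the "text" column.
print(wikitext[0]["text"])
```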