This dataset contains 3.3K expert-level pairwise human preferences over responses generated by 6 models to 80 MT-bench questions. The annotators are mostly graduate students with expertise in the topic area of each question.
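
As a rough illustration of how pairwise preferences like these can be summarized, the sketch below tallies a per-model win rate from a handful of made-up records. The field names (`question_id`, `model_a`, `model_b`, `winner`), the model names, and the `win_rates` helper are all assumptions made for this example; they are not taken from the description above, so check the actual schema before adapting it.

```python
from collections import Counter

# Hypothetical records: field names, model names, and outcomes are
# illustrative only and do not come from the real dataset.
judgments = [
    {"question_id": 1, "model_a": "model-x", "model_b": "model-y", "winner": "model_a"},
    {"question_id": 1, "model_a": "model-y", "model_b": "model-z", "winner": "model_b"},
    {"question_id": 2, "model_a": "model-x", "model_b": "model-z", "winner": "tie"},
    {"question_id": 2, "model_a": "model-x", "model_b": "model-y", "winner": "model_a"},
]


def win_rates(records):
    """Return each model's share of comparisons in which it was preferred."""
    wins = Counter()
    games = Counter()
    for r in records:
        a, b = r["model_a"], r["model_b"]
        games[a] += 1
        games[b] += 1
        if r["winner"] == "model_a":
            wins[a] += 1
        elif r["winner"] == "model_b":
            wins[b] += 1
        # Any other value (e.g. a tie) counts as a comparison but not a win.
    return {model: wins[model] / games[model] for model in games}


rates = win_rates(judgments)
for model, rate in sorted(rates.items()):
    print(f"{model}: {rate:.2f}")
```

Counting ties as comparisons without a win is only one possible convention; splitting a tie as half a win for each side is an equally reasonable choice.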