TurkQA Data Set

Christopher Malon and Bing Bai
(C) 2013 NEC Laboratories America

This release contains the training and testing data used in the paper

Please cite the paper if you use the data set in your research.


Question and answer data in "results/" is released for non-commercial use under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license.

The sentences in "problems/" are from Wikipedia, released under the Creative Commons Attribution-ShareAlike 3.0 license.

Copies of these licenses appear in the subdirectories.

Download the data set here or here.


TurkQA consists of a selection of sentences from English Wikipedia articles, with questions and answers crowdsourced from workers on Amazon Mechanical Turk.

Each Turk was shown the sentence at the beginning of a random Wikipedia article and instructed to write four questions about it.

The questions may be of two types:

The Turks were given the rules:

Assignments of sentences to Turks were random. If a Turk found an assignment too difficult, it could be traded for another one.

The original sentences are collected in the "problems" subdirectory, with the Turk worker submissions in the "results" subdirectory.

The lines of a result file alternate between questions and answers. Questions appear in plain text. Each answer is either the string "yes" or "no", or two numerical indices separated by a space, which give the range of characters in the original sentence to be taken as the answer.
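As an illustration of the result file format described above, the following Python sketch pairs each question with its answer and recovers the answer text from the original sentence. The function names parse_result_lines and answer_text are hypothetical and not part of the release. The sketch assumes that blank lines can be skipped and that the two indices are a 0-based start and an exclusive end; this description does not state the index convention, so check it against a few known answers before relying on it.

```python
from typing import List, Tuple, Union

# An answer is either the literal string "yes"/"no" or a (start, end) character range.
Answer = Union[str, Tuple[int, int]]


def parse_result_lines(lines: List[str]) -> List[Tuple[str, Answer]]:
    """Pair the alternating question and answer lines of one result file."""
    # ASSUMPTION: blank lines carry no information and can be skipped.
    cleaned = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    pairs = []
    # Step through the lines two at a time: question first, then its answer.
    for i in range(0, len(cleaned) - 1, 2):
        question = cleaned[i]
        raw = cleaned[i + 1].strip()
        if raw in ("yes", "no"):
            answer: Answer = raw
        else:
            # Otherwise the answer is two integers separated by whitespace.
            start, end = raw.split()
            answer = (int(start), int(end))
        pairs.append((question, answer))
    return pairs


def answer_text(sentence: str, answer: Answer) -> str:
    """Return "yes"/"no" unchanged, or the span of the sentence for a range."""
    if isinstance(answer, str):
        return answer
    start, end = answer
    # ASSUMPTION: 0-based start index, exclusive end index.
    return sentence[start:end]
```

To use it, read one file from "results" with readlines(), pass the lines to parse_result_lines, and call answer_text with the matching sentence from "problems". If the extracted spans look shifted or cut short by one character, the index assumption is the first thing to adjust.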

Due to sentence splitting errors, some problems consist of multiple sentences.

For inquiries, contact Christopher Malon (last name at this site).

Amazon Mechanical Turk is a trademark of Amazon Technologies, Inc.