Citation Credit
Neural Text Generation from Structured Data with Application to the Biography Domain
RĂ©mi Lebret, David Grangier and Michael Auli, EMNLP 2016
http://arxiv.org/abs/1603.07771
This publication provides further information about the data, and we kindly ask you to cite this paper when using the data. The data was extracted from the English Wikipedia dump (enwiki-20150901), relying on the articles referenced by WikiProject Biography.
@inproceedings{Lebret_EMNLP2016,
author = {Lebret, R. and Grangier, D. and Auli, M.},
title = {{Neural Text Generation from Structured Data with Application to the Biography Domain}},
booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = {2016}
}
Dataset Description
For each article, we extracted the first paragraph (text) and the infobox (structured data). Each infobox is encoded as a list of (field name, field value) pairs. We used Stanford CoreNLP to preprocess the data, i.e. we broke the text into sentences and tokenized both the text and the field values. The dataset was randomly split into three subsets: train (80%), valid (10%) and test (10%). We strongly recommend using test only for the final evaluation.
The data is organised in three subdirectories for train, valid and test.
Each directory contains 7 files:
SET.id contains the list of wikipedia ids, one article per line.
SET.url contains the url of the wikipedia articles, one article per line.
SET.box contains the infobox data, one article per line.
SET.nb contains the number of sentences per article, one article per line.
SET.sent contains the sentences, one sentence per line.
SET.title contains the title of the wikipedia article, one per line.
SET.contributors contains the url of the wikipedia article history, which lists the authors of the article.
Hence all files allow accessing the information for a given article by line number. It is necessary to use SET.nb to split the sentences (SET.sent) per article. The infobox data in SET.box is encoded as follows: each line encodes one box; each box is a list of tab-separated tokens; each token has the form fieldname_position:wordtype. A field that is empty or contains no readable tokens is indicated with fieldname:. For instance, the first box of the valid set starts with
type_1:pope name_1:michael name_2:iii name_3:of
name_4:alexandria title_1:56th title_2:pope title_3:of title_4:alexandria
title_5:& title_6:patriarch title_7:of title_8:the
title_9:see title_10:of title_11:st. title_12:mark image:
which indicates that the field "type" contains 1 token "pope", the field "name" contains 4 tokens "michael iii of alexandria", the field "title" contains 12 tokens "56th pope of alexandria & patriarch of the see of st. mark", and the field "image" is empty.
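For illustration, the following Python sketch shows one way to read the line-aligned files back and to parse the box encoding described above. It is not part of the dataset release; the function names and the assumed layout (e.g. valid/valid.box after extraction) are ours.

from collections import OrderedDict

def parse_box(line):
    """Turn one tab-separated SET.box line into {field name: [tokens]}.
    Tokens look like fieldname_position:wordtype; an empty field is
    encoded as fieldname: with no position and no value."""
    fields = OrderedDict()
    for token in line.split("\t"):
        key, _, value = token.partition(":")
        name, _, pos = key.rpartition("_")
        if name and pos.isdigit():
            fields.setdefault(name, []).append(value)
        else:
            fields.setdefault(key, [])  # empty-field marker such as image:
    return fields

def load_articles(directory, prefix):
    """Regroup the parallel files per article, using PREFIX.nb to know
    how many lines of PREFIX.sent belong to each article."""
    def read(ext):
        with open(f"{directory}/{prefix}.{ext}", encoding="utf-8") as f:
            return [l.rstrip("\n") for l in f]
    ids, boxes = read("id"), read("box")
    counts = [int(n) for n in read("nb")]
    sents = read("sent")
    articles, offset = [], 0
    for art_id, box, nb in zip(ids, boxes, counts):
        articles.append({"id": art_id,
                         "box": parse_box(box),
                         "sentences": sents[offset:offset + nb]})
        offset += nb
    return articles

On the example above, parse_box would return the single token "pope" for the field "type", the four tokens "michael iii of alexandria" for the field "name", and an empty list for the field "image".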
Dataset Statistics
Statistic | Mean | Q-5% | Q-95% |
---|---|---|---|
# tokens per sentence | 26.1 | 13 | 46 |
# tokens per table | 53.1 | 20 | 108 |
# table tokens per sentence | 9.5 | 3 | 19 |
# fields per table | 19.7 | 9 | 36 |
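These statistics can be recomputed from the released files. The sketch below (again our code, not part of the release) counts tokens per sentence, and tokens and fields per table, assuming whitespace tokenization for sentences and the fieldname_position:wordtype encoding for boxes; the exact numbers may differ marginally from the table depending on how empty fields are treated.

import statistics

def box_counts(box_line):
    """Return (number of value tokens, number of distinct non-empty fields)
    for one SET.box line."""
    tokens = [t for t in box_line.split("\t") if not t.endswith(":")]  # drop empty-field markers
    fields = {t.split(":", 1)[0].rsplit("_", 1)[0] for t in tokens}
    return len(tokens), len(fields)

def dataset_statistics(directory, prefix):
    with open(f"{directory}/{prefix}.sent", encoding="utf-8") as f:
        sent_lens = [len(line.split()) for line in f]
    with open(f"{directory}/{prefix}.box", encoding="utf-8") as f:
        per_box = [box_counts(line.rstrip("\n")) for line in f]
    table_tokens = [nt for nt, _ in per_box]
    table_fields = [nf for _, nf in per_box]
    print("mean tokens per sentence:", statistics.mean(sent_lens))
    print("mean tokens per table:", statistics.mean(table_tokens))
    print("mean fields per table:", statistics.mean(table_fields))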
Published Results
Publication | Model | Perplexity | BLEU | ROUGE | NIST |
---|---|---|---|---|---|
Lebret et al. (2016) | Template Kneser-Ney | 7.46 | 19.8 | 10.7 | 5.19 |
Lebret et al. (2016) | Table Neural Language Model | 4.40 | 34.7 | 25.8 | 7.98 |
The decoding beam width is 5.
Version Information
v1.0 (this version) Initial Release.
License
License information is provided in License.txt
Decompressing zip files
We split the archive into multiple files. To extract, run
cat wikipedia-biography-dataset.z?? > tmp.zip
unzip tmp.zip
rm tmp.zip