HomePub Dataset: Publication Strings from Academic Homepages

HomePub is a dataset that contains 2,500 academic homepages collected from 100 universities around the world. In the HomePub dataset, each publication string is labeled. See the paper below for details:

Yiqing Zhang, Jianzhong Qi, Rui Zhang*, Chuandong Yin (2018). PubSE: A Hierarchical Model for Publication Extraction from Academic Homepages. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, [PDF] (* Contact author)

This data is released under the Community Data License Agreement – Sharing – Version 1.0.

Please cite the above paper if you use this dataset. For inquiries about the dataset, please contact homepub DOT dataset AT gmail DOT com. For general inquiries about the research paper, please contact Prof. Rui Zhang.

How to Get this Dataset

You can access the latest version of this dataset from HERE.

More Information about the HomePub dataset

We collected the visible text from the webpages and then manually tagged all the publication strings in it. During tagging, we mark the beginning and ending byte offsets of each publication string. Each publication string in the HomePub dataset is annotated by two annotators, and any disagreement is resolved by a third annotator. Tagging publication lists is time-consuming, especially when a researcher has a long publication list, so we developed a program, PageTagger, to assist the tagging. With PageTagger, it takes about 2.5 minutes on average to tag one academic homepage.
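Because the annotations are byte offsets rather than character offsets, a reader has to slice the raw bytes of a page before decoding. The following is a minimal sketch of that step, not the dataset's official loader: the function name `extract_publication_strings`, the UTF-8 encoding, the end-exclusive offset convention, and the sample page text are all assumptions made for illustration, so check them against the released files.

```python
from typing import List, Tuple


def extract_publication_strings(
    page_bytes: bytes, spans: List[Tuple[int, int]]
) -> List[str]:
    """Return the text covered by each (start, end) byte span.

    Assumes UTF-8 text and end-exclusive offsets; both are assumptions
    of this sketch, not documented properties of HomePub.
    """
    strings = []
    for start, end in spans:
        if not 0 <= start <= end <= len(page_bytes):
            raise ValueError(f"span ({start}, {end}) lies outside the page")
        strings.append(page_bytes[start:end].decode("utf-8"))
    return strings


# Made-up page text; "ë" takes two bytes in UTF-8, so byte offsets and
# character offsets diverge after it.
page_text = "Zoë Example\nA. Author. A made-up title. 2018.\nContact"
page_bytes = page_text.encode("utf-8")

target = "A. Author. A made-up title. 2018."
start = page_bytes.index(target.encode("utf-8"))
end = start + len(target.encode("utf-8"))

extracted = extract_publication_strings(page_bytes, [(start, end)])
print(extracted)
print(start, page_text.index(target))  # byte offset vs. character offset
```

Slicing the decoded string with byte offsets would shift every span that follows a non-ASCII character, which is why the sketch slices bytes first and decodes afterwards.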

The following table shows some basic statistics of the HomePub dataset. Among the 2,500 annotated pages in the HomePub dataset, 2,188 (87.52%) are academic homepages, while 312 (12.48%) were misclassified as academic homepages. Among the 2,188 academic homepages, 723 (33.04% of the academic homepages, or 28.9% of all 2,500 pages) contain publication lists, which consist of a total of 13,237 publication strings. On average, there are 732.1 (std=1583.3) tokens, 89.9 (std=141.6) lines, and 18.3 (std=35.4) publication strings per homepage. Each publication string contains 215.8 (std=93.1) characters and 49.2 (std=22.9) tokens on average.

Total number of webpages	2,500
Homepages with publication list	723 (28.9%)
Total number of publication strings	13,237
Homepages with multi-line publication strings	117 (16.2%)
Average number of tokens per homepage	732.1 (std=1583.3)
Average number of lines per homepage	89.9 (std=141.6)
Average number of publication strings per homepage	18.3 (std=35.4)
Cross annotator agreement (publication string level)	83.76%
Cross annotator agreement (Cohen’s kappa coefficient)	0.2084
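
For readers unfamiliar with the second agreement measure, Cohen's kappa corrects the observed agreement p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two annotators' label sequences. It is only an illustration of the standard formula: the unit over which HomePub's kappa was computed is not stated here, and the function name and toy labels are invented.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the annotators' marginal label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Invented toy labels, not HomePub annotations.
annotator_1 = ["pub", "pub", "other", "other"]
annotator_2 = ["pub", "other", "other", "other"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)  # p_o = 0.75, p_e = 0.5, so kappa = 0.5
```

Because kappa discounts chance agreement, it can be much lower than raw percentage agreement when one label dominates.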

The following table shows the distribution of the number of lines that each publication string occupies. About 9% of the publication strings occupy more than one line. Specifically, among all 13,237 publication strings in the HomePub dataset, 12,088 (91.32%) are single-line publication strings, 383 (2.89%) occupy two lines, and 621 (4.69%) occupy three lines.

This table shows that multi-line publication strings cannot be ignored when designing extraction algorithms, and that extracting them accurately is a challenging task.

Number of lines each publication string occupies	Number of publication strings	Percentage
1	12,088	91.32%
2	383	2.89%
3	621	4.69%
4	85	0.64%
5	29	0.22%
6	31	0.23%
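
The percentages in the table can be recomputed from the counts. The short sketch below does so and also derives the share of multi-line publication strings; the counts are copied from the table, while the variable names are chosen for illustration only.

```python
# Counts copied from the table: lines occupied -> number of publication strings.
line_counts = {1: 12088, 2: 383, 3: 621, 4: 85, 5: 29, 6: 31}

total = sum(line_counts.values())

# Percentage of all publication strings in each row, rounded as in the table.
percentages = {
    lines: round(100 * count / total, 2) for lines, count in line_counts.items()
}

# Share of publication strings that span more than one line.
multi_line_count = total - line_counts[1]
multi_line_share = round(100 * multi_line_count / total, 2)

print(total)             # 13237
print(percentages)
print(multi_line_count)  # 1149
print(multi_line_share)  # 8.68
```

The counts sum to the 13,237 publication strings reported for the dataset, and the multi-line share comes to 8.68%.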

Acknowledgment

We thank all the contributors to this dataset.