HomePub Dataset: Publication Strings from Academic Homepages

HomePub is a dataset that contains 2,500 academic homepages collected from 100 universities around the world. In the HomePub dataset, each publication string is labeled. See the paper below for details:

This dataset is released under the RECEX SHARED SOURCE LICENSE version 1.0.

Please cite the above paper if you intend to use this dataset. If you have inquiries about the dataset, please contact homepub DOT dataset AT gmail DOT com. If you have general inquiries about the research paper, please contact Prof Rui Zhang.

How to Get this Dataset

More Information about the HomePub dataset

We collected the visible text from the webpages and then manually tagged all the publication strings in it. During tagging, we mark the beginning and ending byte offsets of each publication string. Each publication string in the HomePub dataset is annotated by two annotators, and disagreements are resolved by a third annotator. Tagging publication lists is time-consuming, especially when a researcher has a long publication list, so we developed a program, PageTagger, to assist with the tagging. On average, it takes about 2.5 minutes to tag one academic homepage with the PageTagger tool.
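As a rough illustration of how byte-offset annotations like those described above might be consumed, the sketch below slices tagged spans out of a page. The function name, the (start, end) tuple layout, the end-exclusive convention, and the UTF-8 encoding are all assumptions made for this example; the actual file format of the HomePub annotations is not described here.

```python
from typing import List, Tuple


def extract_publication_strings(page_bytes: bytes,
                                spans: List[Tuple[int, int]]) -> List[str]:
    """Slice each (start, end) byte span out of the page and decode it.

    Assumes end-exclusive byte offsets into UTF-8 text (hypothetical layout).
    """
    strings = []
    for start, end in spans:
        if not 0 <= start < end <= len(page_bytes):
            raise ValueError(f"span ({start}, {end}) is outside the page")
        strings.append(page_bytes[start:end].decode("utf-8"))
    return strings


# Made-up page text, for illustration only. The non-ASCII character shows
# why byte offsets and character offsets can differ.
page = ("Publications\n"
        "J. Müller. An Example Title. Some Venue, 2015.\n"
        "Office: Room 1\n").encode("utf-8")

# Locate the example span programmatically instead of hard-coding offsets.
start = page.index("J. Müller".encode("utf-8"))
end = page.index(b"\n", start)

print(extract_publication_strings(page, [(start, end)]))
```

Working on the raw bytes rather than on decoded text keeps the offsets valid for pages that contain multi-byte characters.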

The following table shows some basic statistics about the HomePub dataset. Among the 2,500 annotated pages in the HomePub dataset, 2,188 (87.52%) are academic homepages, while 312 (12.48%) were misclassified as academic homepages. Among the 2,188 academic homepages, 723 (33.04%, or 28.9% of all 2,500 pages) contain publication lists, which consist of a total of 13,237 publication strings. On average, there are 732.1 (std=1583.3) tokens and 89.9 (std=141.6) lines per homepage, and 18.3 (std=35.4) publication strings per homepage with a publication list (13,237 strings over 723 homepages). Each publication string contains 215.8 (std=93.1) characters and 49.2 (std=22.9) tokens on average.

Total number of webpages                                  2,500
Homepages with publication list                           723 (28.9%)
Total number of publication strings                       13,237
Homepages with multi-line publication strings             117 (16.2%)
Average number of tokens per homepage                     732.1 (std=1583.3)
Average number of lines per homepage                      89.9 (std=141.6)
Average number of publication strings per homepage        18.3 (std=35.4)
Cross annotator agreement (publication string level)      83.76%
Cross annotator agreement (Cohen’s kappa coefficient)     0.2084
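The percentages in the text and the table use different denominators, which is easy to miss. The short check below recomputes them from the reported counts. The choice of denominator for each figure is inferred from the arithmetic rather than stated by the source, and the constant and function names are made up for this example.

```python
# Counts as reported for the HomePub dataset.
TOTAL_PAGES = 2500
ACADEMIC_HOMEPAGES = 2188
WITH_PUBLICATION_LIST = 723
WITH_MULTI_LINE_STRINGS = 117
PUBLICATION_STRINGS = 13237


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Share of annotated pages that are real academic homepages.
print(pct(ACADEMIC_HOMEPAGES, TOTAL_PAGES))              # 87.52
# Publication lists as a share of academic homepages (figure in the text).
print(pct(WITH_PUBLICATION_LIST, ACADEMIC_HOMEPAGES))    # 33.04
# Publication lists as a share of all pages (figure in the table).
print(pct(WITH_PUBLICATION_LIST, TOTAL_PAGES))           # 28.92
# Multi-line homepages as a share of homepages with publication lists.
print(pct(WITH_MULTI_LINE_STRINGS, WITH_PUBLICATION_LIST))  # 16.18
# Average publication strings per homepage with a publication list.
print(round(PUBLICATION_STRINGS / WITH_PUBLICATION_LIST, 1))  # 18.3
```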

The following table shows the distribution of the number of lines that each publication string occupies. About 8.7% of the publication strings occupy more than one line. Specifically, among all the 13,237 publication strings in the HomePub dataset, 12,088 (91.32%) are single-line publication strings, 383 (2.89%) occupy two lines, and 621 (4.69%) occupy three lines.

This table shows that multi-line publication strings cannot be ignored when designing extraction algorithms, and accurately extracting multi-line publication strings is a challenging task for such algorithms.

Number of lines each publication string occupies    Number of publication strings    Percentage
1                                                   12,088                           91.32%
2                                                   383                              2.89%
3                                                   621                              4.69%
4                                                   85                               0.64%
5                                                   29                               0.22%
6                                                   31                               0.23%
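The distribution above can be cross-checked against the reported total with a few lines of arithmetic. The sketch below uses only the counts from the table; the variable names are made up for this example.

```python
# Number of publication strings by the number of lines they occupy,
# copied from the table above.
line_counts = {1: 12088, 2: 383, 3: 621, 4: 85, 5: 29, 6: 31}

total = sum(line_counts.values())
multi_line = total - line_counts[1]
multi_line_share = round(100 * multi_line / total, 2)

print(total)             # 13237, matching the reported total
print(multi_line)        # 1149 strings span more than one line
print(multi_line_share)  # 8.68 (percent)

# Per-row percentages, rounded as in the table.
for lines, count in line_counts.items():
    print(lines, count, f"{100 * count / total:.2f}%")
```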

Acknowledgment

We thank all the contributors to this dataset.