HomePub is a dataset that contains 2,500 academic homepages collected from 100 universities around the world. In the HomePub dataset, each publication string is labeled. See the paper below for details:
Please cite the above paper if you intend to use this dataset. If you have
inquiries about the dataset, please contact (homepub DOT dataset AT gmail DOT
com). If you have general inquiries about the research paper, please contact
Prof Rui Zhang
We collected the visible text from the webpages and then manually tagged all the publication strings in it. During tagging, we mark the beginning and ending byte offsets of each publication string. Each publication string in the HomePub dataset is annotated by two annotators, and any disagreement is resolved by a third annotator. Tagging publication lists is time-consuming, especially when a researcher has a long publication list, so we developed a program, PageTagger, to assist the tagging. On average, it takes about 2.5 minutes to tag one academic homepage with the PageTagger tool.
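
Because the annotations are byte offsets rather than character offsets, a publication string is recovered by slicing the page's bytes before decoding. The following is a minimal sketch of that idea, not the dataset's loader: the annotation file layout is not described here, so the sample page, the `extract_span` helper, the UTF-8 encoding, and the end-exclusive offset convention are all assumptions made for illustration.

```python
# Hypothetical page text; it is made up and not taken from the HomePub files.
page_text = (
    "Jane Doe\n"
    "Publications\n"
    "J. Müller and A. Smith. Parsing homepages. In Proc. Example Conf., 2015.\n"
)
# Assumption: pages are stored as UTF-8 bytes.
page_bytes = page_text.encode("utf-8")


def extract_span(data: bytes, begin: int, end: int) -> str:
    """Return the text between byte offsets [begin, end), decoded as UTF-8.

    Assumption: the ending offset is exclusive.
    """
    if not 0 <= begin <= end <= len(data):
        raise ValueError("offsets fall outside the page")
    return data[begin:end].decode("utf-8")


# Derive offsets for the made-up publication string so the demo is self-checking.
publication = "J. Müller and A. Smith. Parsing homepages. In Proc. Example Conf., 2015."
begin = page_bytes.find(publication.encode("utf-8"))
end = begin + len(publication.encode("utf-8"))

print(begin, end)
print(extract_span(page_bytes, begin, end))
```

The non-ASCII character in the sample shows why the distinction matters: the span is one byte longer than its character count, so slicing the decoded text with byte offsets would drift.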
The following table shows basic statistics of the HomePub dataset. Among the 2,500 annotated pages, 2,188 (87.52%) are academic homepages, while 312 (12.48%) were misclassified as academic homepages. Among the 2,188 academic homepages, 723 contain publication lists (33.04% of the academic homepages, or 28.9% of all 2,500 webpages), and these lists hold a total of 13,237 publication strings. On average, there are 732.1 (std=1583.3) tokens, 89.9 (std=141.6) lines, and 18.3 (std=35.4) publication strings per homepage. Each publication string contains 215.8 (std=93.1) characters and 49.2 (std=22.9) tokens on average.
Statistic | Value |
---|---|
Total number of webpages | 2,500 |
Homepages with publication list | 723 (28.9%) |
Total number of publication strings | 13,237 |
Homepages with multi-line publication strings | 117 (16.2%) |
Average number of tokens per homepage | 732.1 (std=1583.3) |
Average number of lines per homepage | 89.9 (std=141.6) |
Average number of publication strings per homepage | 18.3 (std=35.4) |
Cross annotator agreement (publication string level) | 83.76% |
Cross annotator agreement (Cohen’s kappa coefficient) | 0.2084 |
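
The percentages in the text and in the table use different denominators, which is easy to miss. The short sketch below recomputes them from the counts reported above; the variable names and the `pct` helper are illustrative only, and the choice of denominator for each figure is inferred from the arithmetic rather than stated by the dataset authors.

```python
# Counts as reported in the README.
total_pages = 2500
academic_homepages = 2188
misclassified_pages = 312
with_publication_list = 723
with_multiline_strings = 117


def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


shares = {
    "academic homepages / all pages": pct(academic_homepages, total_pages),
    "misclassified / all pages": pct(misclassified_pages, total_pages),
    "with publication list / academic homepages": pct(with_publication_list, academic_homepages),
    "with publication list / all pages": pct(with_publication_list, total_pages),
    "multi-line / with publication list": pct(with_multiline_strings, with_publication_list),
}

for label, value in shares.items():
    print(f"{label}: {value}%")
```

Under these inferred denominators, the 33.04% in the text is relative to the 2,188 academic homepages, the 28.9% in the table is relative to all 2,500 webpages, and the 16.2% for multi-line strings is relative to the 723 homepages that have a publication list.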
The following table shows the distribution of the number of lines that each publication string occupies. About 8.68% of the publication strings occupy more than one line. Specifically, among all the 13,237 publication strings in the HomePub dataset, 12,088 (91.32%) are single-line publication strings, 383 (2.89%) occupy two lines, and 621 (4.69%) occupy three lines.
This table shows that multi-line publication strings cannot be ignored when
designing extraction algorithms, and that extracting them accurately is a
challenging task for such algorithms.
Number of lines each publication string occupies | Number of publication strings | Percentage |
---|---|---|
1 | 12,088 | 91.32% |
2 | 383 | 2.89% |
3 | 621 | 4.69% |
4 | 85 | 0.64% |
5 | 29 | 0.22% |
6 | 31 | 0.23% |
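
As a quick consistency check on the table above, the sketch below recomputes the total, the per-row percentages, and the multi-line share from the listed counts. The variable names are illustrative; only the counts come from the table.

```python
# Number of lines occupied -> number of publication strings (from the table).
line_counts = {1: 12088, 2: 383, 3: 621, 4: 85, 5: 29, 6: 31}

total = sum(line_counts.values())
percentages = {
    lines: round(100.0 * count / total, 2)
    for lines, count in line_counts.items()
}

# Everything that is not a single-line string counts as multi-line.
multi_line = total - line_counts[1]
multi_line_share = round(100.0 * multi_line / total, 2)

print(f"total publication strings: {total}")
for lines, share in percentages.items():
    print(f"{lines} line(s): {line_counts[lines]} ({share}%)")
print(f"multi-line: {multi_line} ({multi_line_share}%)")
```

The counts sum to the 13,237 publication strings reported for the dataset, so the table covers every annotated string.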
We thank all the contributors to this dataset.