Here are some of the issues that @RyanSteinberg found with the second round of data collection:
16% of extClick events lack section_id data. Manual review of links on 2
pages points to three possible link types as causes. See "extClicks from
Hep and Black Bear pages" sheet for examples:
- links under "External Links" at page bottom
- links in navboxes
- external links embedded within text blocks
69% of fnHover events lack section_id data. This is probably because
hovering behavior happens at the top of articles, in the main section,
where we said section_id would not be captured. That said, manual review
found examples where hovering over links outside of the infobox and main
section was still producing null section_id data. See
"handful of fnHover and fnClick events from Hep/Bear" sheet.
Link to Google sheet with specific examples:
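For anyone who wants to re-check the missing section_id percentages on a fresh export, here is a rough sketch of the calculation. It assumes each event is a dict with an "action" field holding the event type (that field name is my guess, not necessarily what the schema calls it) and a "section_id" field that is None when nothing was captured. The sample rows are made up and only show the input shape.

```python
from collections import defaultdict


def null_section_rate(events):
    """Return {action: fraction of events whose section_id is None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for event in events:
        action = event["action"]  # "action" is an assumed field name
        totals[action] += 1
        if event.get("section_id") is None:
            missing[action] += 1
    return {action: missing[action] / totals[action] for action in totals}


# Made-up sample rows, not real event data.
sample = [
    {"action": "extClick", "section_id": "External_links"},
    {"action": "extClick", "section_id": None},
    {"action": "fnHover", "section_id": None},
    {"action": "fnHover", "section_id": None},
    {"action": "fnHover", "section_id": "History"},
    {"action": "fnClick", "section_id": "History"},
]
rates = null_section_rate(sample)
```

Running the same function separately on events from the infobox and main section versus everything else would show how much of the fnHover gap is expected and how much is not.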
I also took another look at events with negative event_offset_time
values. The numbers looked large at first glance, but in comparison to
all event data, they're insignificant and hardly worth pursuing. Adding a
note to wherever we document data issues is more than sufficient.
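If it helps for the data-issues note, the share of negative offsets could be computed along these lines. This is only a sketch: it assumes events are dicts with a numeric "event_offset_time" field, skips rows where the field is missing, and uses made-up sample values.

```python
def negative_offset_share(events):
    """Fraction of events with a recorded event_offset_time below zero."""
    offsets = [
        e["event_offset_time"]
        for e in events
        if e.get("event_offset_time") is not None
    ]
    if not offsets:
        return 0.0
    return sum(1 for value in offsets if value < 0) / len(offsets)


# Made-up sample rows, not real event data.
sample_offsets = [
    {"event_offset_time": 1200},
    {"event_offset_time": -35},
    {"event_offset_time": 480},
    {"event_offset_time": None},
    {"event_offset_time": 9100},
]
share = negative_offset_share(sample_offsets)
```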
Three more issues (see this document for more info):
Only specific labels are included, but the intent was to include all such
identifier labels (ISSN, ISBN, etc.). The team may decide this is not
worth pursuing.
Counts for "freely accessible" links are low. Determining whether that is
an artifact of capture or just due to low use likely requires quantifying
the total number of such links present in Wikipedia. Maybe you have some
ideas here?
We seem to be missing counts for some link types. Maybe our
specification assumed a particular citation template? Again, the team
may decide this is not worth pursuing.
We should tackle these issues before the third round of data collection.