
Citation Usage instrumentation issues
Closed, Resolved (Public)

Description

Here are some of the issues that @RyanSteinberg found with the second round of data collection:

16% of extClick events lack section_id data. Manual review of links on 2
pages points to three possible link types as causes. See "extClicks from
Hep and Black Bear pages" sheet for examples:
 - links under "External Links" at page bottom
 - links in navboxes
 - external links embedded within text blocks

69% of fnHover events lack section_id data. This is probably because
hovering happens at the top of articles, in the main section, where we
said section_id would not be captured. That said, manual review found
examples where hovering over links outside of the infobox and main
section still produced null section_id data. See
"handful of fnHover and fnClick events from Hep/Bear" sheet.
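
For anyone who wants to recompute the missing-section_id percentages on a later round, here is a minimal sketch. It assumes events are available as a list of dicts with `action` and `section_id` keys; that record shape, the function name and the sample rows are illustrative assumptions, not the actual export format.

```python
from collections import Counter


def null_share_by_action(events):
    """Return {action: fraction of events whose section_id is missing}.

    `events` is assumed to be an iterable of dicts; a missing key and an
    explicit None are both treated as "no section_id".
    """
    totals = Counter()
    missing = Counter()
    for event in events:
        action = event["action"]
        totals[action] += 1
        if event.get("section_id") is None:
            missing[action] += 1
    return {action: missing[action] / totals[action] for action in totals}


# Made-up sample rows, only to show the call.
sample = [
    {"action": "extClick", "section_id": "External_links"},
    {"action": "extClick", "section_id": None},
    {"action": "fnHover", "section_id": None},
    {"action": "fnHover", "section_id": None},
    {"action": "fnHover", "section_id": "History"},
    {"action": "fnHover"},
]
shares = null_share_by_action(sample)
print(shares)
```

Grouping by action keeps the extClick and fnHover rates separate, which matters here because the two are expected to differ.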

Link to google sheet with specific examples:
https://docs.google.com/spreadsheets/d/1UTsp1T3Dac94ny0O80U2mVwXoK3E2B2eAmpjuDUjRk4/edit

I also took another look at events with negative event_offset_time
values. The numbers looked large at first glance, but in comparison to
all event data, they're insignificant and hardly worth pursuing. Adding a
note to wherever we document data issues is more than sufficient.
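
The comparison described above can be sketched as follows. The dict-per-event shape and the function name are assumptions for illustration, and the sample values are invented.

```python
def negative_offset_share(events):
    """Fraction of events with a negative event_offset_time.

    Events without an event_offset_time value are left out of the
    denominator. Returns 0.0 when no event carries the field.
    """
    offsets = [
        e["event_offset_time"]
        for e in events
        if e.get("event_offset_time") is not None
    ]
    if not offsets:
        return 0.0
    negative = sum(1 for value in offsets if value < 0)
    return negative / len(offsets)


# Invented sample values, only to show the call.
sample_offsets = [
    {"event_offset_time": 120},
    {"event_offset_time": -5},
    {"event_offset_time": 3000},
    {"event_offset_time": 45},
]
share = negative_offset_share(sample_offsets)
print(share)
```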

Update
Three more issues (see this document for more info)

*citation_identifier_label*
Only specific labels are included, whereas the intent was to include all
identifier labels (ISSN, ISBN, etc.). The team may decide this is not
worth pursuing.

*freely_accessible*
Likely requires quantification of total number of "freely accessible"
links present in Wikipedia to determine if low numbers are an artifact
of capture or just due to low use. Maybe you have some ideas here?

*citation_in_text_refs*
We seem to be missing counts for some link types. Maybe our
specification assumed a particular citation template? Again, the team
may decide this is not worth pursuing.

We should tackle these issues before the third round of data collection.

Event Timeline

Change 484756 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/extensions/WikimediaEvents@master] Correctly identify section IDs

https://gerrit.wikimedia.org/r/484756

@Miriam any other issues we need to tackle before collecting more data?

Hi @bmansurov, I discussed with the others, and we are OK with starting the data collection, no more issues on our side. Thanks!

@RyanSteinberg

Regarding citation_identifier_label, the list of identifiers is hard-coded according to the field description in the schema.

If this external link is a cited reference and the preceding link is an identifier label (DOI, PMID, PMC), report the preceding link label; required only when action is extClick.

If we want other identifiers such as ISBN, I'll need the list of them. Otherwise I won't be able to distinguish a link to an identifier from any other link as the article is not structured.
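
To illustrate why the list has to be explicit, here is a rough Python sketch of the check described in the schema text. It is not the WikimediaEvents code; the function name and the list-of-labels input are assumptions, and only DOI, PMID and PMC come from the field description.

```python
# Labels named in the schema field description quoted above.
IDENTIFIER_LABELS = frozenset({"DOI", "PMID", "PMC"})


def identifier_label(link_labels, clicked_index, known_labels=IDENTIFIER_LABELS):
    """Return the label of the link preceding the clicked one, if it is
    a known identifier label; otherwise None.

    `link_labels` is assumed to be the visible text of every link in the
    citation, in document order; `clicked_index` points at the clicked
    external link.
    """
    if clicked_index <= 0 or clicked_index >= len(link_labels):
        return None
    preceding = link_labels[clicked_index - 1].strip().upper()
    return preceding if preceding in known_labels else None


# Placeholder link texts, only to show the call.
print(identifier_label(["doi", "10.1000/example"], 1))
print(identifier_label(["ISBN", "0-00-000000-0"], 1))
print(identifier_label(["ISBN", "0-00-000000-0"], 1, IDENTIFIER_LABELS | {"ISBN"}))
```

Without an entry in the set, an ISBN label is indistinguishable from any other preceding link, which is the limitation described above.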

Change 485045 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/extensions/WikimediaEvents@master] CitationUsage: correctly identify freely accessible resources

https://gerrit.wikimedia.org/r/485045

@RyanSteinberg, regarding *freely_accessible*, I've submitted a patch to fix the issue. Apparently template styles have changed, so I had to adapt the code. As for identifying the total number of freely available resources, I'm not sure what the best approach is. One approach is to parse Wikipedia dumps and look for this information.
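
A very rough sketch of the dump-parsing idea, in Python. It assumes that freely accessible resources are marked in wikitext with citation template parameters of the form `<identifier>-access=free` (for example `doi-access=free`); that assumption would need checking against the current template documentation, and it will miss resources marked any other way. The function name and the sample wikitext are made up.

```python
import re
from collections import Counter

# Matches template parameters such as "|doi-access=free" or
# "| hdl-access = free". ASSUMPTION: this is how free access is marked.
FREE_ACCESS_PARAM = re.compile(
    r"\|\s*([a-z0-9]+)-access\s*=\s*free\b", re.IGNORECASE
)


def count_free_access_markers(wikitext):
    """Count "<identifier>-access=free" parameters, keyed by identifier."""
    return Counter(
        match.group(1).lower() for match in FREE_ACCESS_PARAM.finditer(wikitext)
    )


# Made-up wikitext, only to show the call.
sample_wikitext = (
    "{{cite journal |title=A |doi=10.1000/a |doi-access=free}} "
    "{{cite journal |title=B |doi=10.1000/b}} "
    "{{cite web |url=http://example.org |url-access=subscription}} "
    "{{cite journal |title=C |hdl=1234/5 |hdl-access = free }}"
)
counts = count_free_access_markers(sample_wikitext)
print(counts, sum(counts.values()))
```

Run over every page's wikitext in a dump, the summed counts would give a denominator against which to compare the number of captured freely_accessible clicks.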

@RyanSteinberg re: *citation_in_text_refs*,

The example given in the document is https://en.wikipedia.org/w/index.php?title=GPS_signals&oldid=852545363#CITEREFGPS-IS-800D, and the link is http://www.gps.gov/technical/icwg/IS-GPS-800D.pdf

It's also stated that

[the link] is cited 7 times in text but missing no clear way to identify this.

but this is not accurate as that exact link is not cited anywhere (because it doesn't have a backlink to the article).

As for the implementation, we're looking at the number of backlinks preceding the external link.
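
To make "backlinks preceding the external link" concrete, here is a small Python sketch over the HTML of a single reference entry. It is an illustration only, not the extension code: it assumes backlinks are anchors whose href starts with `#cite_ref`, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class BacklinkCounter(HTMLParser):
    """Counts backlink anchors seen before a given external link."""

    def __init__(self, external_url):
        super().__init__()
        self.external_url = external_url
        self.backlinks = 0
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.found:
            return
        href = dict(attrs).get("href") or ""
        if href == self.external_url:
            # Stop counting once the external link itself is reached.
            self.found = True
        elif href.startswith("#cite_ref"):
            self.backlinks += 1


def count_preceding_backlinks(reference_html, external_url):
    """Return the backlink count, or None if the link is not in the entry."""
    parser = BacklinkCounter(external_url)
    parser.feed(reference_html)
    parser.close()
    return parser.backlinks if parser.found else None


# Made-up reference entries, only to show the call.
footnote = (
    '<li id="cite_note-a-1"><span class="mw-cite-backlink">^ '
    '<a href="#cite_ref-a_1-0">a</a> <a href="#cite_ref-a_1-1">b</a></span> '
    '<cite><a class="external text" href="http://example.org/paper.pdf">'
    "Paper</a></cite></li>"
)
anchor_only = (
    '<li><cite id="CITEREFExample"><a class="external text" '
    'href="http://example.org/spec.pdf">Spec</a></cite></li>'
)
print(count_preceding_backlinks(footnote, "http://example.org/paper.pdf"))
print(count_preceding_backlinks(anchor_only, "http://example.org/spec.pdf"))
```

The second entry mirrors the kind of reference discussed above: it is only an anchor target with no backlink of its own, so nothing is counted for it even if the article text points to it several times.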

Change 484756 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Correctly identify section IDs

https://gerrit.wikimedia.org/r/484756

Change 485045 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] CitationUsage: correctly identify freely accessible resources

https://gerrit.wikimedia.org/r/485045

@RyanSteinberg re: *citation_in_text_refs*,

As for the implementation, we're looking at the number of backlinks preceding the external link.

Got it. The high number of NULL citation_in_text_refs values from extClicks originating from "reference" sections seems to come from (1) pages like the example, where references are really just anchors pointing back up into the article, or (2) external links embedded in the main article body. I think this is as good as we can get.

@bmansurov
Thank you for the fix to *freely_accessible*

If we want other identifiers such as ISBN, I'll need the list of them. Otherwise I won't be able to distinguish a link to an identifier from any other link as the article is not structured.

Adding ISBN and ISSN to this hard-coded list seems reasonable to me but I'm still waiting to hear from others on my team (@Lauren.maggio ) if other identifier labels should be included.

@Lauren.maggio and I just discussed citation_identifier_label. Using a more comprehensive list of identifiers from citation style 1 would improve the data quality of this element. That said, we don't think we'll be able to make meaningful use of it and recommend dropping the element altogether. @bmansurov can you remove citation_identifier_label? Thank you for your patience on this one.

Change 486339 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/extensions/WikimediaEvents@master] CitationUsage: drop identifier label

https://gerrit.wikimedia.org/r/486339

Change 486339 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] CitationUsage: drop identifier label

https://gerrit.wikimedia.org/r/486339

bmansurov moved this task from In Progress to Done (current quarter) on the Research board.

Thanks to everyone who helped move this task along.