
Spike: Internal links, Files, etc. may not show accurate contributor information [4 hours]
Open, Needs Triage · Public · Spike

Description

What is the problem?

In the HTML that the WhoColor API returns, and that WhoWroteThat inserts into the page, any wikitext inside [[ ]] (e.g. links, files) is treated as a single token. We therefore attribute the entire token to one contributor and one revision (the revision that added the opening [[). This is not always accurate: if a later contributor modifies the caption of an image, for example, that change is not reflected in WWT.

For example, the image in the top right here has been modified by at least two contributors. However, clicking on that image will show information for only the first contributor (once T231959 is fixed).

Normally, each word in an article gets treated as a separate token, so we know who wrote each word and when.


Possible solutions
  • Do nothing. Perhaps this is not a big problem.
  • Fix WhoColor API. This might be tricky, as tokenising each word inside [[ ]] will often produce invalid wikitext.
  • Don't show revision details about things like images, internal links, etc. This would be a shame, as we do have this information.
  • Find some other way of extracting the revision details of individual words inside files, links, etc. The WhoColor API does give us that information already.
  • Something else. I might suggest a spike of some sort to investigate possibilities.
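
As a rough illustration of the fourth option, the Python sketch below collects every distinct revision found among the tokens inside a [[ ]] run, instead of reporting only the revision of the opening [[. The (text, revision) token shape, the `link_revisions` helper, and the sample data are all assumptions made up for this example; they are not the WhoColor response format, and nested links are not handled.

```python
from typing import List, Tuple

# Hypothetical token shape: (token text, id of the revision that added it).
Token = Tuple[str, int]


def link_revisions(tokens: List[Token]) -> List[Tuple[str, List[int]]]:
    """For each [[ ... ]] run, return its text and the sorted distinct revisions inside it."""
    results: List[Tuple[str, List[int]]] = []
    current = None  # tokens of the link currently being read, or None outside a link
    for text, rev in tokens:
        if text == "[[" and current is None:
            current = [(text, rev)]
        elif current is not None:
            current.append((text, rev))
            if text == "]]":
                link_text = " ".join(t for t, _ in current)
                revs = sorted({r for _, r in current})
                results.append((link_text, revs))
                current = None
    return results


# Made-up data: revision 10 adds the file, revision 25 later rewrites the caption.
tokens = [("see", 10), ("[[", 10), ("File:Cat.jpg", 10), ("|", 10),
          ("sleeping", 25), ("cat", 25), ("]]", 10)]
links = link_revisions(tokens)
```

With this data, `links` holds one entry for the file and lists both revisions 10 and 25, whereas attributing by the opening [[ alone would report only revision 10.
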
Steps to reproduce problem

See the example above.

This will happen whenever two or more contributors (or the same contributor in separate revisions) add or modify a link, file, etc.

Event Timeline

Restricted Application added a subscriber: Aklapper. Sep 4 2019, 1:44 PM

Also affects Category tags.

ifried updated the task description. Sep 17 2019, 12:01 AM
ifried renamed this task from Internal links, Files, etc. may not show accurate contributor information to Spike: Internal links, Files, etc. may not show accurate contributor information [4 hours]. Sep 17 2019, 11:17 PM
ifried moved this task from To Be Estimated/Discussed to Estimated on the Community-Tech board.

A particularly bad example is the link [[Olympia]] in the fourth sentence here: https://tr.wikipedia.org/w/index.php?title=%C3%89douard_Manet. In the revision details we attribute that link to a revision which does not even include the word "Olympia".

I believe that when WikiWho does the diff between the revisions, it mixes up a [[ token with one from another link.

This seems like another example where WhoColor is incorrectly putting things together. The normal WikiWho API attributes the Olympia link to https://tr.wikipedia.org/w/index.php?diff=4019869 which appears to be correct.

Restricted Application edited projects, added Community-Tech; removed Community-Tech (Kanban-Q3-2019-20). Jan 7 2020, 6:23 PM
HMonroy claimed this task. Jan 9 2020, 10:11 PM
HMonroy moved this task from Ready to In Development on the Community-Tech (Kanban-Q3-2019-20) board.
HMonroy removed HMonroy as the assignee of this task. Jan 13 2020, 5:12 PM
HMonroy moved this task from In Development to Ready on the Community-Tech (Kanban-Q3-2019-20) board.
HMonroy added a subscriber: HMonroy.
HMonroy claimed this task. Jan 13 2020, 5:26 PM
HMonroy moved this task from Ready to In Development on the Community-Tech (Kanban-Q3-2019-20) board.
Restricted Application changed the subtype of this task from "Bug Report" to "Spike". Jan 15 2020, 12:11 AM
HMonroy added a comment. Edited Jan 15 2020, 5:47 PM

This issue occurs when the link brackets and the text inside the link do not have the same revision number.
Take the case reported earlier:

A particularly bad example is the link [[Olympia]] in the fourth sentence here: https://tr.wikipedia.org/w/index.php?title=%C3%89douard_Manet. In the revision details we attribute that link to a revision which does not even include the word "Olympia".
I believe when WikiWho is doing the diff between the revisions it is mixing up a [[ token from another link.

The opening [[ of the link (token id 274) has revision:

1: "[["
2: 859742

The word that follows, "Olympia" (token id 275), has revision:

1: "olympia"
2: 4019869

When the user selects [[Olympia]], they are actually selecting <span class="editor-token token-editor-39 active" id="token-274"><a href="/wiki/Olympia_(tablo)" title="Olympia (tablo)">Olympia</a></span>, so WhoWroteThat retrieves the information for the token with id 274, which is the [[ token rather than "Olympia".
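
The lookup described here can be sketched in a few lines of Python. The token ids, revision ids, and span markup are the ones quoted in this comment; the `tokens` dictionary layout and the `revision_for_span` helper are hypothetical stand-ins, not the actual WhoWroteThat code.

```python
import re

# Values quoted in this comment: token id -> (token text, revision id).
tokens = {
    274: ("[[", 859742),
    275: ("olympia", 4019869),
}

# The span that wraps the rendered link, as quoted above.
span_html = (
    '<span class="editor-token token-editor-39 active" id="token-274">'
    '<a href="/wiki/Olympia_(tablo)" title="Olympia (tablo)">Olympia</a></span>'
)


def revision_for_span(html: str) -> int:
    """Hypothetical lookup: read the token id from the span's id attribute."""
    match = re.search(r'id="token-(\d+)"', html)
    if match is None:
        raise ValueError("no token id found in span")
    token_id = int(match.group(1))
    return tokens[token_id][1]


clicked_revision = revision_for_span(span_html)
```

Because the span carries only the id of the [[ token, `clicked_revision` is 859742, and the revision of the "olympia" token (4019869) is never consulted.
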

FaFlo added a subscriber: FaFlo. Tue, Jan 28, 9:38 AM

I am only now getting around to answering this. I will have a look this week and get back to you.

@FaFlo, thank you for getting back to us and for taking the time to look into this issue! Do you have any updates on this?

FaFlo added a comment. Edited Wed, Feb 5, 11:22 AM

I would like to split this apart into distinct issues:

(A) Regarding the example of "olympia" in the fourth sentence of the Manet article (after "başyapıtlarından", WikiWho token ID = 1669, in here):
@dom_walden's guess is right as far as I can see: the square link brackets are mixed up with brackets from an earlier revision, as you can see in the output of the base WikiWho API. This happens because these are small and very common tokens, and the diff can get confused. I cannot offer a simple fix right now. Rather, I would like to know how often WikiWho (base, not WhoColor) is wrong in these instances, to get a better idea of when this happens (after which kinds of revisions/changes).
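
To illustrate how a token-level diff can hand bracket tokens to an earlier revision, here is a small Python sketch. It uses the standard library's `difflib.SequenceMatcher` as a stand-in; WikiWho's own diff and attribution logic may work differently, and the `attribute` helper, token lists, and revision numbers are invented for the example.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def attribute(old: List[Tuple[str, int]], new_tokens: List[str],
              new_rev: int) -> List[Tuple[str, int]]:
    """Tokens the diff reports as unchanged keep their old revision; the rest get new_rev."""
    old_tokens = [text for text, _ in old]
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    result: List[Tuple[str, int]] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(old[i1:i2])  # matched tokens inherit the earlier attribution
        else:
            result.extend((tok, new_rev) for tok in new_tokens[j1:j2])
    return result


# Made-up history: revision 1 links to "paris"; revision 2 replaces it with a link to "olympia".
old = [("see", 1), ("[[", 1), ("paris", 1), ("]]", 1), (".", 1)]
new_tokens = ["see", "[[", "olympia", "]]", "."]
attributed = attribute(old, new_tokens, new_rev=2)
```

In this toy case the brackets around "olympia" stay attributed to revision 1, which never contained that word, while only "olympia" itself is attributed to revision 2.
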

--> There is not much to do about this right now, so I would focus on the second part, which I think is the main reason for this Phabricator task. That is, the wrong attribution gets propagated into the WhoColor markup because we attribute the whole linked word to one revision instead of several (where several would be correct). A better solution for the second part would "soften" the impact of the wrong WikiWho attribution somewhat (at least only the brackets would be colored wrongly).

(B) Let's assume the link/brackets annotation in WikiWho base is correct. Then the remaining problem is the "misattribution in highlighting/span" in WhoColor, since we implemented this "one span for the whole linked word" approach. This was done 'back in the day' because there was no good way to show it differently in the Tampermonkey script interface we had built. So my questions, before changing any span settings etc., would be: what would be the best way to actually show this distinction between link and linked token (and title of linked page) in the WhoWroteThat interface? And secondly, what would be the best way of delivering HTML that makes that possible?

I will also think about how the team working on WhoWroteThat could test different/better ways of transforming the WikiWho output into parsable wikitext (and then HTML) through WhoColor, which we could then simply deploy on our side. I think this is necessary since we are a bit short-staffed, and otherwise it would take quite long to solve these issues.