Investigation: Who Wrote That revision search tool
Open, Normal · Public · 8 Story Points

Description

An investigation task for wish #4, Who Wrote That revision search tool (formerly "Blame tool")

When I see suspicious text on a page, I should be able to find its creator and first revision.

Existing related ticket: T2639

Notes on the project page:
https://meta.wikimedia.org/wiki/Community_Tech/Who_Wrote_That_revision_search_tool

For the investigation:

  • What's wrong with the existing tools?
  • What type of tool should we build: search or color-highlighting?
  • What are the current limitations?
  • Where should we host the new tool, Toolforge or on wiki?
  • How do we assess & deal with the storage needs?

Event Timeline

DannyH created this task. · Jan 4 2018, 2:24 AM
Restricted Application added a subscriber: Aklapper. · Jan 4 2018, 2:24 AM
Tgr added a subscriber: Tgr. · Jan 4 2018, 3:01 AM

It would be nice to clarify the vocabulary on this. Most programmers probably associate "blame tool" with git blame, while most Wikipedians associate it with WikiBlame, which is not actually a blame tool in the previous sense (it's more like git bisect, i.e. a revision search tool).

WhoColor on the other hand seems like a great blame tool (if a little slow currently). It would be awesome if the Foundation could support the authors in getting it on better hardware and ensuring maintainer continuity so it can be relied on in the long term.

(Note that unlike bisecting, blaming - essentially, content persistence - is a complex problem, which has been the topic of several research projects in the past. Trying to write a new tool from scratch is probably not a good idea.)

TBolliger updated the task description. · Jan 10 2018, 12:17 AM
kaldari set the point value for this task to 8. · Jan 10 2018, 12:21 AM
kaldari updated the task description.
FaFlo added a subscriber: FaFlo. · Edited · Jan 13 2018, 4:03 PM

Hi, as an author of WikiWho/WhoColor:

  • Great that this is being picked up, I would be happy to be of assistance.
  • "Note that unlike bisecting, blaming - essentially, content persistence - is a complex problem, which has been the topic of several research projects in the past. Trying to write a new tool from scratch is probably not a good idea" --> yup, and we have evaluated WikiWho in that regard, showing high accuracy, especially for longer, more complex revision histories, although only for English so far (see the paper).
  • Regarding speed: we are processing the EventStreams of several languages on the fly, so that is not an issue. We just don't have a caching layer for the materialized JSON yet, but that is on the to-do list. For the mid-term future (2-3 years), the upkeep and further development of the service is secured at GESIS (my employer), as is the extension to more languages (although maybe not all). For the long term, though, I also think hosting it at the WMF might make more sense.

@FaFlo Thanks for reaching out! I've tried WhoColor and found it really pretty to look at and fast to use. Good to learn that it works with several languages. Thanks for all your fantastic work on it.

Can you tell us why not all languages can be added to the service? Technically speaking, why is the tool dependent on which language it is being used for?
Also, can you explain, at a high level, how the service works currently (via the browser extension)? It's possible that we'd have to convert it into a MediaWiki extension (if we decide to go down that route) and I'm wondering what that would take.

We hope to keep in touch with you as things progress on this wish and we investigate all the possible solutions.

TBolliger renamed this task from Investigation: Blame tool to Investigation: Who Wrote That revision search tool. · Jan 16 2018, 9:07 PM
TBolliger updated the task description.
TBolliger updated the task description.
FaFlo added a comment. · Edited · Jan 17 2018, 3:14 PM

@Niharika

Can you tell us why not all languages can be added to the service? Technically speaking, why is the tool dependent on which language it is being used for?

Technically, that is not a problem at all. Some adaptations are probably necessary for Cyrillic and other non-Latin scripts regarding sentence and token splitting etc., but that is not a major thing. It's mainly a manpower bottleneck right now, but we are in the process of hiring someone to help out with that. And we are adding more languages constantly, so we will try to get all of them in due time. If you guys say you want your extension/application to regularly consume that data in the future, I can make a better case in-house for why we should put more effort into extending it to more languages, and do so more quickly. Basically, GESIS is committed to providing this as a service, as we build other stuff for (social) scientists on top of it, also for non-Wikimedia wikis and other sources in the future.

Also, can you explain, at a high level, how the service works currently (via the browser extension)? It's possible that we'd have to convert it into a MediaWiki extension (if we decide to go down that route) and I'm wondering what that would take.

I attached a crude diagram of the system architecture. Note that the WhoColor API is basically built on top of the WikiWho API (or better: its internal data store), which provides the more "bare" token provenance/change info.
The biggest quality issue for the interface/userscript right now is coloring and parsing the text correctly: simply putting <spans> around every token doesn't work that well and needs a lot of exceptions (e.g. it interferes with correct parsing of templates, refs etc.). That is, however, an issue strictly on the visualization side of things; the provenance/change info you get from WikiWho is not affected by it.

Currently deployed code for WhoColor can be found here: https://github.com/wikiwho/WhoColor
The APIs and some (incomplete, it's on my list) documentation live here: api.wikiwho.net

Thanks for the detailed reply, @FaFlo. Much appreciated!

I'll keep you informed as things develop and we discuss this project internally.

Moving to backlog as CommTech already has plenty of projects in process.

BBlack added a subscriber: BBlack. · Apr 19 2018, 3:51 PM

My random $0.02 as a bystander:

  • In terms of naming it and its focus, it might be better to think in terms of this identifying the revision/edit responsible for the content rather than the editor person/bot. Having the rev/edit info also implicitly gives you the editor person/bot, but you can remove the personal focus of "Who?" (and sometimes the use of this tool would be e.g. to find out "When?" rather than "Who?").
  • A nice feature would be to have something analogous to git blame's -w flag, which ignores commits to the affected text which are whitespace-only. In other words, there could be a raw mode which asks "Which revision last touched this block of text at all?", and also a different mode which asks "Which revision last changed the real content in this block, ignoring other newer edits which just changed the formatting of it?"
FaFlo added a comment. · Apr 21 2018, 7:38 AM

Hi, I can only comment on how we implemented it for api.wikiwho.net, but these are good points in general as well:

  • revisions/"when" vs. "who" : Regarding the actual implementation, you get the rev-ids of the origin and change revisions for a token and then fetch the meta-info for that revision in a second step, such as the editor and timestamp. So like you say, it is not "who" in the first instance, that is just derived.
  • whitespace: we split the text into tokens at the whitespace (and other special chars), so you would not attribute changes where someone just adds/removes whitespace (i.e., whitespaces are not tokens) without altering other text pieces. Alas, if someone split a word into two via a whitespace (or, conversely, concatenated two words), we would attribute the "new" tokens to that editor. If we talk about "formatting" in a wiki-markup sense, one would probably have to ex-post filter those changes that touched "cosmetic" markup like section headers or hrules, which is doable, but more tricky. Or simply run the whole thing in parallel on the parsed, front-end text with formatting ignored.
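
To make the whitespace point concrete, here is a minimal Python sketch of token-level comparison. It is not WikiWho's actual tokenizer: the regular expression and the `changed_tokens` helper are assumptions made up for illustration. It only shows why a whitespace-only edit yields no newly attributed tokens, while splitting a word does.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not WikiWho's code): runs of word
# characters are tokens, every other non-whitespace character is its own
# token, and whitespace is never a token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Return the tokens of `text`, dropping all whitespace."""
    return TOKEN_RE.findall(text)


def changed_tokens(old: str, new: str) -> list[str]:
    """Return tokens of `new` that have no counterpart in `old`.

    This is a simple multiset difference, far cruder than real
    provenance tracking, but enough to show the effect.
    """
    remaining = Counter(tokenize(old))
    added = []
    for token in tokenize(new):
        if remaining[token] > 0:
            remaining[token] -= 1
        else:
            added.append(token)
    return added


old = "The quick brown fox"
# Whitespace-only edit: no token is new, so nothing is attributed.
print(changed_tokens(old, "The  quick\nbrown   fox"))
# Splitting a word with a space creates two "new" tokens.
print(changed_tokens(old, "The qu ick brown fox"))
```

A "ignore formatting" mode in the sense of git blame's -w flag would need an extra filtering step on top of this, since markup characters are still tokens here.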

@FaFlo Hello! Our team is wrapping up work on several projects and Blame tool is next on our list. Our current plan of action is to help build out the Tampermonkey scripts into standalone browser extensions and if all goes well, into on-wiki gadget(s). There are probably a bunch of things we can do to improve the UX (as noted from your comment above) and we'll be looking into that as well.
Do you have a list of languages that are currently supported by the WikiWho API?
Thanks so much for your patience and help on this. :)

FaFlo added a comment. · Jan 7 2019, 4:38 PM

Hi, nice to see this is getting traction again, a proper browser extension/better interface/cleaner highlighting would be great!

The languages currently supported are EN, DE, ES, TR and EU, but adding 2-3 major ones should be possible in the relatively short term.

Thanks @FaFlo! Support for more languages will be awesome. Let me know if we can provide any help with that work.
I'll keep you in the loop as things develop on this project.

@FaFlo I came up with a few questions for you while playing around with WhoColor script on enwiki. :)

  • On the Austria article on enwiki, when I turn on WhoColor, it gives me the sidebar and data as expected, but I cannot click to highlight text beyond a certain point in the article (beginning in the middle of the 20th Century paragraph). Similarly, data for the conflict tab is cut off at the same point. Is that a problem with the extension script, or is the API unable to return that data for some reason? Are the percentage numbers in the sidebar accurately representing the entire article's content %?
  • On the same article, when I turn on the script, the infobox template pretty much vanishes. What causes that?
  • Is the API able to provide us with the revision number for any piece of highlighted text in addition to the author?
  • I gather that we are not able to highlight text that is added within/for templates, tables and categories. Is there anything else that we can't highlight? What about references?
FaFlo added a comment. · Edited · Jan 16 2019, 11:20 AM

@Niharika

The annotations per token that the WikiWho APIs produce are always for the wiki markup, including all tokens, also those in tables and references. The API does *not* expand templates or anything transcluded, which means that the content of those elements is not annotated for now, only the wiki markup that they are called with. That does not, therefore, pertain to tables, references and infoboxes in general, as long as nothing is transcluded. I.e., that is the first source of "error", in the sense that the API simply has no annotations for the transcluded content. (It could be added of course, but that would mean a couple more steps, including processing all templates. In practice I would rather add some nice-looking HTML in the frontend that says something along the lines of "could not color this template".)

Now, having all the annotations (revisions of origin and thereby author, timestamp etc, to respond with YES to that question) already available in the basic API, the secondary WhoColor API is more or less a convenience service: Using some heuristics we came up with, it puts <spans> with meta-information around all the tokens and then goes to Wikipedia's own API for parsing, then returns the "enhanced" HTML. That is done so the client does not have to do it (one could also think about shifting this processing of the base info to <span>ned text to a (heavier) client I guess). See code here.
Regarding your question about the per-user percentages, this means that they are computed on the raw wiki markup content. They should always add up to 100% - in this article at least they do. However, since not all content is parsed into visible HTML, there can be a mismatch between what is visible and what is displayed as "owned" per user on the right.
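
As a rough illustration of that mismatch, the following Python sketch computes per-editor shares over all markup tokens and again over only the tokens that end up visible. The token dictionaries, including the `visible` flag, are hypothetical and not the WhoColor response format.

```python
from collections import Counter


def authorship_shares(tokens: list[dict]) -> dict[str, float]:
    """Return each editor's share of `tokens` in percent.

    `tokens` is a hypothetical list of dicts with an "editor" key; the
    shares sum to 100 whenever the list is non-empty.
    """
    counts = Counter(token["editor"] for token in tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {editor: 100.0 * count / total for editor, count in counts.items()}


# Made-up example: alice wrote a template call (not rendered as text),
# bob wrote one visible word.
tokens = [
    {"str": "{{", "editor": "alice", "visible": False},
    {"str": "infobox", "editor": "alice", "visible": False},
    {"str": "}}", "editor": "alice", "visible": False},
    {"str": "Vienna", "editor": "bob", "visible": True},
]

# Shares over the raw markup, as the sidebar would report them.
print(authorship_shares(tokens))
# Shares over what a reader actually sees highlighted.
print(authorship_shares([t for t in tokens if t["visible"]]))
```

In this toy case the sidebar would credit alice with most of the content even though none of her tokens can be highlighted in the rendered page.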

Next, and probably most important: this <span> stuff is messy. It breaks correct parsing of the span-annotated markup into valid HTML in several instances. This is what leads to the discontinuation of coloring/markup in the article you mentioned, where there is probably some wrongly set span, and I believe it also causes the vanishing of most of the infobox template. It also affects the parsing of tables and references (for tables, we just skip them deliberately now, because it always messes up the parser). We have tried to fix all instances reported to us (shoutout to @Ragesoss), but there are still some out there. It could also likely be that there is a completely different, better way of doing this than with spans. In other words: if a more elegant way of annotating with that meta/color info can be found (spans or an alternative), everything but transcluded content can be colored. We are open to changing the WhoColor API accordingly. The other, more flexible solution in terms of development would be client-side processing of the WikiWho base API's content.
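
For readers who have not looked at the WhoColor code, here is a small Python sketch of the general idea and of one kind of exception it needs. It is an assumption-laden illustration, not the WhoColor implementation: the `wrap_tokens` function and its `skip_templates` switch are hypothetical, and only the `class` (editor id) and positional `id="token-N"` attributes follow what is described in this thread.

```python
def wrap_tokens(tokens: list[tuple[str, str]], skip_templates: bool = True) -> str:
    """Wrap (text, editor_id) tokens in <span> tags before parsing.

    With skip_templates=True, tokens between "{{" and "}}" are emitted
    unchanged so the template call stays intact for the wikitext parser.
    """
    parts = []
    depth = 0  # current nesting level of template braces
    for index, (text, editor) in enumerate(tokens):
        if text == "{{":
            depth += 1
        inside_template = depth > 0
        if text == "}}" and depth > 0:
            depth -= 1
        if skip_templates and inside_template:
            parts.append(text)
        else:
            parts.append(f'<span class="{editor}" id="token-{index}">{text}</span>')
    return "".join(parts)


tokens = [("{{", "1"), ("foo", "1"), ("|", "1"), ("baz", "2"), ("}}", "1"), ("Vienna", "3")]

# Naive wrapping: the braces are split up by spans, so the parser no
# longer sees a template call at all.
print(wrap_tokens(tokens, skip_templates=False))
# With the exception in place, the template survives but is not colored.
print(wrap_tokens(tokens))
```

The second output keeps `{{foo|baz}}` parseable at the cost of leaving it uncolored, which is the same trade-off as deliberately skipping tables.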

Niharika added a comment. · Edited · Jan 16 2019, 12:23 PM

@FaFlo That makes sense. I was thinking about how we could do this better. The most prominent use case I have heard so far for this tool is to catch vandals by figuring out who wrote a given piece of text. For that simple use case, allowing only one author's text to be highlighted at a time makes sense. If the user clicks on another text, the highlight on first text goes away and the new text gets highlighted. And in the sidebar, instead of showing a list of all authors we show more information (revision id, date, author data etc.) for the highlighted text. What do you think of this idea?
Also, what's a good use case for the current multiple highlights view?

@FaFlo Another question - have you looked into highlighting text on the page without modifying the page HTML by adding spans? I was wondering how feasible that might be.

Tgr added a comment. · Jan 16 2019, 8:43 PM

It could also likely be that there is a completely different, better way of doing this than with spans.

Use web annotations and some third-party tool that can handle them, maybe? (see also T149667: Amazing Article Annotations) But it's a massive amount of work, if at all feasible. Making the span logic more clever is probably a lot easier, even if less elegant.

The more fundamental problem with your approach IMO is that for a change like {{foo|bar}} to {{foo|baz}} you don't have any way to locate baz in the page HTML (assuming it is a somewhat common word). To do that reliably, you'd either have to integrate with / reproduce the parsing logic in some way, or do the blame calculation separately for wikitext and for HTML (which brings its own cans of worms as the rendered HTML of a revision changes over time, in part due to parser logic changes and in part due to edits to templates).
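
The ambiguity can be shown with a few lines of Python. This is only a sketch under simple assumptions: `candidate_positions` is a hypothetical helper, and the rendered text is made up. It shows that a word known to have changed in the wikitext may match several places in the rendered page.

```python
import re


def candidate_positions(rendered_text: str, word: str) -> list[int]:
    """Return the start offsets of whole-word matches of `word`."""
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    return [match.start() for match in pattern.finditer(rendered_text)]


# Made-up rendered text in which the changed template argument "baz"
# also occurs elsewhere in ordinary prose.
rendered = "The baz festival draws visitors; another baz opens in May."
positions = candidate_positions(rendered, "baz")

if len(positions) > 1:
    # Without knowledge of the parser, there is no way to tell which
    # occurrence came from {{foo|baz}}.
    print(f"ambiguous: 'baz' appears {len(positions)} times in the rendered text")
```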

Wiki Education has been using the WhoColor API on the Dashboard for a while now, and I can report how it's been received by our users — mostly instructors who are using it for reviewing the work that their students have done on existing articles. Here's an example of it in action: https://dashboard.wikiedu.org/courses/University_of_Chicago/Censorship_and_Information_Control_During_Information_Revolutions_(Autumn)/articles?showArticle=51093345

Setting aside the cases where the annotations don't work because of syntax that isn't handled correctly and the authorship data stops partway through, the most consistent feedback we get is around the lack of support for highlighting most references. We haven't heard many complaints about the limited support for templates. I'm sure that's partly because of the particulars of our program, where most student editors are focused on article text and rarely dive deeply into infoboxes and other templates. But my guess is that even without diving into transcluded content and data present in template parameters, just fixing the known problems with references and unhandled syntax would be a solid 90% solution for a wide variety of use cases: being able to see who added which text to what is (mostly) a text document is almost always what people will want to do with it.

@Ragesoss @FaFlo Is there a reliable way to figure out how commonly annotations break because of syntax errors? I tested it out on a few random pages and it was broken for me on one, so I suppose it's not uncommon.

Wiki Education articles aren't necessarily representative in terms of the likelihood of having markup edge cases, but my rough guess based on that would be about 1 in 40 articles. It's probably higher in the general article population from lists and other table-heavy article genres that our student editors don't work on as often (and probably higher on well-developed articles, since they'll have more markup and more complex markup).

FaFlo added a comment. · Jan 20 2019, 1:11 PM

.. or do the blame calculation separately for wikitext and for HTML.

I think the most feasible option is i) keep doing the calculation only on wikitext (for the reasons you mention), and just highlight what is visible in the HTML (as is done now), but additionally ii) offer another view with highlighting done on the wikitext with all markup, in case the user wants to look into it in more detail. A better version of this would indicate to the user parts in the HTML view for which there is more information to be seen re: links, templates etc. in that secondary wikitext view.
Web annotations sound interesting, but I cannot judge how feasible that is.

Wiki Education articles aren't necessarily representative in terms of the likelihood of having markup edge cases, but my rough guess based on that would be about 1 in 40 articles. It's probably higher in the general article population from lists and other table-heavy article genres that our student editors don't work on as often (and probably higher on well-developed articles, since they'll have more markup and more complex markup).

I would concur with that assessment.

@FaFlo Thank you. That is helpful.
I see that the WhoColor API does include information about the revision id for the token - I'm looking at the tokens array (Inline Model 2). Also, helpfully, each token has a unique token ID. But there does not seem to be a clear way to link the tokens in the extended HTML syntax with the corresponding token in the tokens array.
Is it possible for the tokens array entries to optionally include the token-id for the token it represents?

Krenair added a subscriber: Krenair. · Feb 8 2019, 1:27 AM

Sorry, I didn't see the notification for that message...

Is it possible for the tokens array entries to optionally include the token-id for the token it represents?

Yes, that should be possible and not too hard to add. Let me check.

Now that I had time to look at it: the

id="token-805"

in the extended HTML is *not* actually the WikiWho token ID, but simply a positional index for the token for that revision. (I do admit that we have to update the documentation to that effect...)
The WhoColor userscript goes through the extended HTML in the conflict and age views and retrieves the respective conflict and age scores by order from the list in the inline model 2.
(the "class name" is simply the user id - or an ad-hoc user hash for IPs - but I guess you figured that out already)

The reason why we did not include the actual WikiWho token IDs was simply to reduce the output size of the JSON, but they could be added to inline model 2 if necessary. The question is what you are planning to do with it and whether it is a good trade-off for adding the extra data in the JSON.
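
Since the number in the span id is a positional index, a client can already link a clicked span to its entry in the tokens list by position. The Python sketch below illustrates that lookup. The token dictionaries and their keys are hypothetical stand-ins, not the documented response format.

```python
def token_for_span(span_id: str, tokens: list[dict]) -> dict:
    """Resolve a span id such as "token-805" to tokens[805].

    The number is treated as a position in the revision's token list,
    not as a WikiWho token ID.
    """
    prefix, _, index = span_id.partition("-")
    if prefix != "token" or not index.isdigit():
        raise ValueError(f"unexpected span id: {span_id!r}")
    return tokens[int(index)]


# Made-up token list in document order; "o_rev_id" stands for the
# revision in which the token originated (key name is an assumption).
tokens = [
    {"str": "Austria", "o_rev_id": 100, "editor": "42"},
    {"str": "is", "o_rev_id": 250, "editor": "77"},
]

clicked = token_for_span("token-1", tokens)
print(clicked["o_rev_id"], clicked["editor"])
```

This only works as long as the spans and the list are produced from the same revision and in the same order.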

DannyH removed a subscriber: DannyH. · Feb 19 2019, 4:54 PM
FaFlo added a comment. · Apr 3 2019, 4:38 PM

@Niharika what is the current status of this project, do you need any input? Do you need the token ids in the output as you requested?

@FaFlo Sorry for not getting back to you on this. My team is currently wrapping up a couple of other projects and it's possible that we won't be picking this up for another month or two. I will set up a session with my teammates to discuss our API requirements at the upcoming Wikimedia Hackathon in May and circle back with you. Are you going to be there by any chance? :)

FaFlo added a comment. · Apr 9 2019, 2:58 PM

@Niharika Alright. I won't be at the Hackathon unfortunately. But I'm in San Francisco in the week of May 13-17, so if you guys are based at the SF offices, I could simply swing by to talk about what is already there and what we could provide. Not strictly necessary, but might be helpful.

@Niharika Alright. I won't be at the Hackathon unfortunately. But I'm in San Francisco in the week of May 13-17, so if you guys are based at the SF offices, I could simply swing by to talk about what is already there and what we could provide. Not strictly necessary, but might be helpful.

@FaFlo that sounds great! I will be working remotely that week but @Mooeypoo (lead engineer on the team) will be in the office and has offered to meet up. I'll be joining remotely. I'll set up a meeting time for the 13th. Thanks!

@FaFlo I set up a meeting for 4pm PST on the 13th. You can email me at niharika@wikimedia.org and we can coordinate further. :)

@FaFlo , we took a closer look at the response we're getting from articles that break, and we found that the cause seems to be that the process is running out of memory. See:

</a></span></span></li>\n<li><span class=\"nowrap\"> <a href=\"/wiki/WorldCat_Identities\" class=\"mw-redirect\" title=\"WorldCat Identities\">WorldCat Identities</a> (via VIAF): <a rel=\"nofollow\" class=\"external text\" href=\"https://www.worldcat.org/identities/containsVIAFID/148842731\">148842731</a></span></li></ul>\n</div></td></tr></tbody></table></div>\n<!-- \nNewPP limit report\nParsed by mw1340\nCached time: 20190620223042\nCache expiry: 2592000\nDynamic content: false\nComplications: [vary‐revision‐exists, vary‐revision]\nCPU time usage: 3.880 seconds\nReal time usage: 5.248 seconds\nPreprocessor visited node count: 18361/1000000\nPreprocessor generated node count: 0/1500000\nPost‐expand include size: 407659/2097152 bytes\nTemplate argument size: 56826/2097152 bytes\nHighest expansion depth: 19/40\nExpensive parser function count: 64/500\nUnstrip recursion depth: 1/20\nUnstrip post‐expand size: 324421/5000000 bytes\nNumber of Wikibase entities loaded: 5/400\nLua time usage: 2.067/10.000 seconds\nLua memory usage: 37.49 MB/50 MB\n-->\n<!--\nTransclusion expansion time report (%,ms,calls,template)\n100.00% 4181.082      1 -total\n 31.64% 1322.902      1 Template:Coord\n 19.11%  799.118      1 Template:Reflist\n 14.88%  622.330      1 Template:Infobox_country\n 12.25%  512.137      1 Template:Infobox\n  9.64%  403.196     59 Template:Cite_web\n  6.49%  271.433      3 Template:ISO_3166_code\n  4.78%  199.694     11 Template:Lang\n  4.67%  195.319      3 Template:Small\n  4.44%  185.717      1 Template:Native_name\n-->\n</div>","present_editors":[["Gog the Mild","

Specifically, it looks like the system dumped this into the payload:

NewPP limit report\nParsed by mw1340\nCached time: 20190620223042\nCache expiry: 2592000\nDynamic content: false\nComplications: [vary‐revision‐exists, vary‐revision]\nCPU time usage: 3.880 seconds\nReal time usage: 5.248 seconds\nPreprocessor visited node count: 18361/1000000\nPreprocessor generated node count: 0/1500000\nPost‐expand include size: 407659/2097152 bytes\nTemplate argument size: 56826/2097152 bytes\nHighest expansion depth: 19/40\nExpensive parser function count: 64/500\nUnstrip recursion depth: 1/20\nUnstrip post‐expand size: 324421/5000000 bytes\nNumber of Wikibase entities loaded: 5/400\nLua time usage: 2.067/10.000 seconds\nLua memory usage: 37.49 MB/50 MB\n-->\n<!--\nTransclusion expansion time report (%,ms,calls,template)\n100.00% 4181.082      1 -total\n 31.64% 1322.902      1 Template:Coord\n 19.11%  799.118      1 Template:Reflist\n 14.88%  622.330      1 Template:Infobox_country\n 12.25%  512.137      1 Template:Infobox\n  9.64%  403.196     59 Template:Cite_web\n  6.49%  271.433      3 Template:ISO_3166_code\n  4.78%  199.694     11 Template:Lang\n  4.67%  195.319      3 Template:Small\n  4.44%  185.717      1

These seem to not be of the same class as the <span> issues, but rather pages that have deeper memory issues.

Two questions here --

  1. Can anything be done about the memory issue? Perhaps parsing long pages incrementally into a file rather than memory?
  2. Can we catch those errors and add a field into the API payload so the client can tell that the page may have problems?
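
Until the API reports such failures itself, a client could at least detect a cut-off response. The Python sketch below is one way to do that, assuming the response body is meant to be a single JSON document. The function name and the shape of the returned dictionary are hypothetical.

```python
import json


def parse_whocolor_response(raw: str) -> dict:
    """Parse a response body and report whether it was complete JSON.

    A body truncated mid-string, like the excerpt above, fails to parse,
    so the client can warn the user instead of rendering partial data.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        return {"ok": False, "error": f"invalid JSON: {err.msg}", "data": None}
    return {"ok": True, "error": None, "data": data}


# Made-up bodies: one complete, one cut off inside a string.
complete = '{"extended_html": "<div></div>", "present_editors": [["Gog the Mild", "1"]]}'
truncated = '{"extended_html": "<div></div>", "present_editors": [["Gog the Mild", "'

print(parse_whocolor_response(complete)["ok"])
print(parse_whocolor_response(truncated)["ok"])
```

This catches truncation only. A response that is valid JSON but built from a failed parse would still need a server-side field as suggested in question 2.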

Sorry for the late reply. Valid point regarding the memory, will look into it.

(In reply to Mooeypoo's comment of Fri, 21 June 2019, 01:02.)

[off-topic] @FaFlo: Please consider removing unneeded full quotes, or put them below a -- line which will automatically strip everything after that line. Thanks :)

ifried added a subscriber: ifried. · Jul 10 2019, 12:48 AM

@FaFlo Hello! My name is Ilana, and I’m the new product manager for the Community Tech team. Nice to meet you. The Community Tech team will be working on the Who Wrote That tool soon, so I’ll be reaching out with questions about the WhoColor API on occasion. Today, I have two questions for you:

• What concerns would you have about using the WhoColor API to retrieve data from older versions of a page?
• How far back in revision history does the WhoColor API go?

I’m asking because we’re exploring the possibility of enabling Who Wrote That for older versions of an article. We know that the WhoColor API enables access to older page revisions, so we’re discussing the possibility of including this data in the tool. We’re in the very early stages of discussing this possibility, and we wanted to loop you into the conversation. When you get a chance, we would love to know your thoughts on this topic. Thank you!

FaFlo added a comment. · Jul 11 2019, 7:56 AM

@ifried Hey Ifried, nice to meet you and good to hear :)

What concerns would you have about using the WhoColor API to retrieve data from older versions of a page?

None, you can call any revision of an article and get back the tokens that were in the revision at the time, along with their history.
The only "weird" but intended behavior here is that the lists of ins and outs will also "look into the future" for tokens in older revisions (see example here for token_id: 31614), but you can simply ignore those with higher revids than the oldid you have retrieved.

How far back in revision history does the WhoColor API go?

It retrieves and processes all revisions ever done for a language edition.

FaFlo added a comment. · Jul 11 2019, 8:04 AM

@Mooeypoo , is the memory error in the output still occurring in your testing? We tried a fix.

@ifried Hey Ifried, nice to meet you and good to hear :)

What concerns would you have about using the WhoColor API to retrieve data from older versions of a page?

None, you can call any revision of an article and get back the tokens that were in the revision at the time, along with their history.
The only "weird" but intended behavior here is that the lists of ins and outs will also "look into the future" for tokens in older revisions (see example here for token_id: 31614), but you can simply ignore those with higher revids than the oldid you have retrieved.

I'm sorry, it might be my misunderstanding here, but I don't understand what you mean by "looking into the future" with old rev_ids. Do you mean in terms of conflicts? I'm a little confused, can you explain further?

FaFlo added a comment. · Jul 16 2019, 8:14 AM

Sorry for the ambiguous wording. I meant if you are at revision 5 of an article with 10 revisions, and looking at a token that existed in revision 5 (e.g. with the call https://api.wikiwho.net/en/api/v1.0.0-beta/rev_content/OUR_DUMMY_ARTICLE/5/?......), and if that token has been deleted in, say, revision 8, then you would already see that in its "out" list, although you are at revision 5 currently.
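
For a client that shows an older revision, ignoring the "future" entries could look like the Python sketch below. It assumes each token carries "in" and "out" lists of revision ids and that revision ids grow over time. The token shown is made up, and `history_as_of` is a hypothetical helper.

```python
def history_as_of(token: dict, oldid: int) -> dict:
    """Return a copy of `token` with "in"/"out" entries after `oldid` dropped."""
    return {
        **token,
        "in": [rev for rev in token["in"] if rev <= oldid],
        "out": [rev for rev in token["out"] if rev <= oldid],
    }


# Made-up token matching the example above: it exists in revision 5
# and is deleted in revision 8.
token = {"str": "example", "o_rev_id": 2, "in": [], "out": [8]}

# Viewed at revision 5, the later deletion is filtered out.
print(history_as_of(token, oldid=5))
# Viewed at revision 10, the deletion in revision 8 is kept.
print(history_as_of(token, oldid=10))
```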

ifried moved this task from Backlog to Done on the Who-Wrote-That board. · Tue, Sep 10, 8:38 PM