Provide useful diffs to high-volume consumers of recent changes
Open, NormalPublic

Description

The diff of <parent_revision> vs. <new revision> should be added to RCStream. This will be useful for a wide range of wiki tools.

Diffs are CPU intensive, so they haven't historically been generated at the time that an edit is saved. However, tools are already sending requests to the MediaWiki API for every revision made in English Wikipedia using the RCStream. It would be very beneficial to be able to skip the additional request that causes the diff to be generated.

Given the computational complexity of generating diffs, it's expected that this will cause RCStream to lag by a few additional seconds. This would be acceptable for the applications that are currently hitting the API for diffs since they are currently delayed for the same reason. However, it might make sense to separate the Diff-RCStream from the Metadata-only-RCStream.

Save times are a potential concern here. It was also noted in conversation with Ori and Yuvi (drawing from memory) that RCStream is emitted as a post-save hook so as to not slow page save times. It seems that the diff could be generated and new RCStream events submitted via a similar post-save hook.

Event Timeline

Halfak created this task.May 23 2015, 12:55 PM
Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description. (Show Details)
Halfak added a project: Wikimedia-Stream.
Halfak added subscribers: Halfak, Afandian, yuvipanda and 2 others.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 23 2015, 12:55 PM

Right now Echo generates a diff on every save anyway, so we can generalize it and use that to emit it via RCStream as well.

Uh, Echo isn't going to be doing that for much longer hopefully :/.

Oh damn. But where is Echo going to do it? In a Job queue?

Hi guys, any movement on this?

I can get started writing a Python program to do this by watching the RCStream and pulling diffs from the API if that would be helpful?

I'm not clear if the changes to echo mean that this is the only option or whether you had other ideas, @Halfak?

Joe
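For what it's worth, the client-side approach could start from something like the sketch below. It assumes each RCStream change event is a dict with `type`, `server_url` and `revision.old` / `revision.new` fields, and that the diff is requested through the MediaWiki `action=compare` API module. The function name `compare_url`, the `script_path` default and the sample event are illustrative only, and nothing is actually sent over the network here.

```python
from urllib.parse import urlencode


def compare_url(event, script_path="/w/api.php"):
    """Build an action=compare URL for an edit event, or return None.

    Assumes the event carries `server_url` and `revision: {old, new}`;
    `script_path` is an assumption that holds for Wikimedia wikis.
    """
    if event.get("type") != "edit":
        return None  # non-edit events (e.g. log entries) have no parent diff
    revision = event.get("revision") or {}
    old, new = revision.get("old"), revision.get("new")
    if not old or not new:
        return None
    query = urlencode({
        "action": "compare",
        "fromrev": old,
        "torev": new,
        "format": "json",
    })
    return f"{event['server_url']}{script_path}?{query}"


# Hypothetical event, shaped like the assumption described above.
sample_event = {
    "type": "edit",
    "server_url": "https://en.wikipedia.org",
    "revision": {"old": 653150817, "new": 653150900},
}
print(compare_url(sample_event))
```

A real client would still need rate limiting and error handling around the HTTP request itself.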

It seems clear to me that

  1. Both Echo and RCstream need diffs
  2. The work is best done after page content save is completed
  3. We have worker queues precisely for this type of thing

I think we are currently waiting on someone like @aaron to come confirm or repair my wrong way of thinking. :)

Maybe @mattflaschen or @Legoktm can comment on where Echo does its diff now and whether or not that can be moved to a worker queue.

Echo does its diffing in includes/DiffParser.php, which is called from includes/DiscussionParser.php
The diffing is initiated by hook 'ArticleSaveComplete', where it's wrapped in a DeferredUpdates::addCallableUpdate & all it needs is the Revision object.
Shouldn't be too hard to turn that into a job.
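To illustrate the shape of that change (not Echo's actual code), here is a rough Python sketch of "enqueue on save, diff later in a worker". The `job_queue` deque stands in for the MediaWiki job queue, `emitted` stands in for whatever would publish the event, and both function names are hypothetical.

```python
import difflib
from collections import deque

job_queue = deque()  # stand-in for the real job queue
emitted = []         # stand-in for events published to stream consumers


def on_save_complete(rev_id, old_text, new_text):
    """Runs after the save has finished; only enqueues, so saves stay fast."""
    job_queue.append((rev_id, old_text, new_text))


def run_pending_jobs():
    """Worker side: compute each diff and hand it to the emitter."""
    while job_queue:
        rev_id, old_text, new_text = job_queue.popleft()
        diff = list(difflib.unified_diff(
            old_text.splitlines(), new_text.splitlines(), lineterm=""))
        emitted.append({"rev_id": rev_id, "diff": diff})


on_save_complete(2, "Hello\nworld", "Hello\nwiki world")
run_pending_jobs()
print(emitted[0]["diff"])
```

The point of the split is that the expensive step runs outside the save request, so a slow diff delays the stream event rather than the editor.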

It's possible to add commands to RCStream in addition to subscribe that let individual clients enable events in addition to change (e.g. a diff event). Like emit( "enable", "diff" ); and in theory emit( "disable", "change" ); if they only want diffs even.
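A sketch of how the per-client state for such commands might look on the server side. This is purely hypothetical: the class name `ClientSubscription`, the set of known events, and the default of receiving only `change` are assumptions for illustration, not how RCStream is implemented.

```python
class ClientSubscription:
    """Tracks which event kinds one connected client wants to receive."""

    KNOWN_EVENTS = {"change", "diff"}

    def __init__(self):
        # Assumed default: clients get `change` events unless they opt out.
        self.enabled = {"change"}

    def handle(self, command, event_name):
        """Apply an `enable` or `disable` command sent by the client."""
        if event_name not in self.KNOWN_EVENTS:
            raise ValueError(f"unknown event: {event_name}")
        if command == "enable":
            self.enabled.add(event_name)
        elif command == "disable":
            self.enabled.discard(event_name)
        else:
            raise ValueError(f"unknown command: {command}")

    def wants(self, event_name):
        return event_name in self.enabled


# A client that only wants diffs, as in the example above.
sub = ClientSubscription()
sub.handle("enable", "diff")
sub.handle("disable", "change")
print(sorted(sub.enabled))
```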

However, I don't think clients really want to get every single diff. Most if not all clients apply some kind of filtering before deciding to query for a diff. It'd be a lot of bandwidth for clients. Making separate HTTP requests for the diffs doesn't appear to be a bottleneck or problem for clients. It's just inefficient from the server perspective.

I'd like to block this issue on a better diff in general. Sending the old HTML diffs isn't really in anyone's best interest and still leaves tools to re-invent a whole lot of post processing before it's actually useful. See T56328 for the better format and exposure through the API.

I'd like to re-purpose this task for the broader use case of providing diffs to high-volume subscribers. Instead of RCStream, I'm thinking RESTBase would make a good fit for that purpose.

Subscribers use RCStream to discover changes, and make highly performant requests to RESTBase for the diffs. We can help RESTBase get these efficiently either through dedicated warm-up jobs, or by pre-computing in an on-shutdown hook in MediaWiki, etc.

Krinkle renamed this task from Add diffs to RCStream to Provide useful diffs to high-volume consumers of RCStream.Jun 12 2015, 1:50 AM
Krinkle set Security to None.

In the long-term, general sense, it seems to me this might be yet another use-case for T84923: Reliable publish / subscribe event bus. In the short- / mid-term, though, I can see RESTBase providing that info, but in order for it to be fast, we'd need to limit the scope. Something like, it's acceptable to ask for a diff between parent-rev and current-rev, but not more than that. Once requested for the first time, the diff can be stored into Cassandra and served afterwards. Allowing any two, non-successive versions to be diffed takes up too much storage and gives no real benefit.
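Roughly, the "compute on first request, then serve from storage" idea could look like the sketch below. The in-memory dict is only a stand-in for Cassandra, `difflib` is a stand-in for the real diff engine, and `ParentDiffStore` and its two lookup callbacks are hypothetical names.

```python
import difflib


class ParentDiffStore:
    """Serves only parent-vs-revision diffs, keyed by the revision ID."""

    def __init__(self, get_text, get_parent_id):
        self._get_text = get_text            # rev_id -> wikitext
        self._get_parent_id = get_parent_id  # rev_id -> parent rev_id or None
        self._cache = {}                     # stand-in for Cassandra
        self.computed = 0                    # counts diffs actually generated

    def diff_for(self, rev_id):
        if rev_id not in self._cache:
            parent_id = self._get_parent_id(rev_id)
            old = self._get_text(parent_id).splitlines() if parent_id else []
            new = self._get_text(rev_id).splitlines()
            self._cache[rev_id] = "\n".join(difflib.unified_diff(
                old, new,
                fromfile=f"rev {parent_id}", tofile=f"rev {rev_id}",
                lineterm=""))
            self.computed += 1
        return self._cache[rev_id]


# Toy revision history used only for this example.
texts = {1: "Hello\nworld", 2: "Hello\nwiki world"}
parents = {1: None, 2: 1}
store = ParentDiffStore(texts.__getitem__, parents.__getitem__)
store.diff_for(2)
store.diff_for(2)  # second call is served from the cache
print(store.computed)
```

Because the key is a single revision ID, storage grows linearly with the number of revisions requested, which is the reason for ruling out arbitrary revision pairs.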

I'd just like to add a use-case here. The reason we started this ticket is because CrossRef is interested in a live stream of citations of DOIs (we are the registration agency for most DOIs) as they are cited (and/or uncited). We are listening on RCStream and then fetching every diff to look for a DOI. We wanted to move beyond this because requesting a diff for every event is taxing on the server, uses a lot of bandwidth, and requires lots of HTTP connections, and it probably wouldn't be practical to run this service on a remote server (it's currently on wmflabs tools).

@Halfak thought that the RCStream-watching and diff-fetching could be combined, closer to the source, and that there were others (including himself) who could use this service. In our case we could supply a filter which would cut out a lot of unwanted content, but such a filter might not be applicable in the general case.
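As an illustration of the kind of filter meant here, the sketch below pulls DOIs out of the added and removed lines of a unified diff. The regular expression is a simplified approximation of DOI syntax and `doi_changes` is a hypothetical helper, not the code CrossRef actually runs.

```python
import difflib
import re

# Simplified DOI matcher: "10.", a 4-9 digit registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def doi_changes(unified_diff_lines):
    """Return (cited, uncited) DOIs from the lines of a unified diff."""
    added, removed = set(), set()
    for line in unified_diff_lines:
        if line.startswith(("+++", "---")):
            continue  # file headers (simplification: also skips rare content lines)
        if line.startswith("+"):
            added.update(DOI_PATTERN.findall(line))
        elif line.startswith("-"):
            removed.update(DOI_PATTERN.findall(line))
    # A DOI on both sides was only moved or reformatted, so ignore it.
    return sorted(added - removed), sorted(removed - added)


old = "Intro.\n{{cite journal|doi=10.1000/old123}}"
new = "Intro.\n{{cite journal|doi=10.1000/new456}}"
diff = list(difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=""))
cited, uncited = doi_changes(diff)
print(cited, uncited)
```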

So the use case that I have in mind (and you can take it or leave it) is something along the lines of:

"I would like to subscribe to a stream of diffs in more-or-less real-time so I can filter them. This should be input stream / pub-sub rather than poll-based."

RESTBase is a stateless, content-delivery-focused service, not a pub/sub system. From what you are describing, you really want T84923: Reliable publish / subscribe event bus with a filtering worker attached to it. I am not aware of any possible short-term solutions in this area, though.

Sorry, I'm not sure if we're talking at cross-purposes. I'm talking about the original purpose of this ticket (I didn't mention RESTBase!).

Ah, maybe I wasn't clear. I was merely stating that RESTBase could not be used for your specific purpose (which seems to have been implied by @Krinkle in his earlier comment) and that, in the ideal case, the general-purpose event bus we are envisioning would fit your use-case nicely.

Subscribers use RCStream to discover changes, and make highly performant requests to RESTBase for the diffs. We can help RESTBase get these efficiently either through dedicated warm-up jobs, or by pre-computing in an on-shutdown hook in MediaWiki, etc.

[..] I can see RESTBase providing that info, but in order for it to be fast, we'd need to limit the scope. Something like, it's acceptable to ask for a diff between parent-rev and current-rev, but not more than that. Once requested for the first time, the diff can be stored into Cassandra and served afterwards. Allowing any two, non-successive versions to be diffed takes up too much storage and gives no real benefit.

I agree. Only store the canonical diff associated with a revision ID. Beware though that these requests tend to mostly be made within a very short window after the edit is saved. Currently the existing plethora of tools tend to screen scrape. Typically using fine-tuned urls like:
https://en.wikipedia.org/w/index.php?diff=653150900&diffonly=1&action=render

Or sometimes with the default UI urls (which include skin wrapping and the page content, except on en.wikipedia.org where that was disabled by default for performance reasons; diffonly=0 enables it):
https://en.wikipedia.org/w/index.php?title=Representational_state_transfer&type=revision&diff=653150900&oldid=653150817

And of course these are non-deterministic urls with many variations.
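To make the contrast concrete, a sketch of collapsing those URL variations onto one deterministic key per revision. It assumes the `diff` parameter holds the newer revision ID and that `oldid`, when present, is its direct parent; the key format and the name `canonical_diff_key` are made up for this example.

```python
from urllib.parse import parse_qs, urlsplit


def canonical_diff_key(url):
    """Map a diff URL variant to a single '<host>/diff/<rev_id>' key."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    diff = params.get("diff", [None])[0]
    if diff is None or not diff.isdigit():
        # Values such as diff=prev or diff=next need server-side resolution.
        return None
    return f"{parts.netloc}/diff/{diff}"


urls = [
    "https://en.wikipedia.org/w/index.php?diff=653150900&diffonly=1&action=render",
    "https://en.wikipedia.org/w/index.php?title=Representational_state_transfer"
    "&type=revision&diff=653150900&oldid=653150817",
]
print({canonical_diff_key(u) for u in urls})
```

Both example URLs from above end up on the same key, which is what would let a cache serve all followers from one stored diff.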

Depending on how much computation is saved by re-using data already in memory during the save action in MediaWiki, it may be worthwhile for MediaWiki to compute and cache the diffs after the request shutdown of save actions (instead of waiting for an incoming API request to demand their creation). For that to really be worthwhile, it should compute a re-usable data model, not the HTML. I'd like to keep the new infrastructure from inheriting the HTML-formatted diffs as the main format. They'll remain available over the above urls for interested parties, of course. But for new APIs we can do better.

@mobrovac, does RESTBase currently provide diffs faster than the MediaWiki API? Currently, the MediaWiki API is too slow at generating diffs for us to follow all of the change events we need to examine in real time.

@Krinkle +1 for improving the format of diffs that get stored, but that seems orthogonal to this task. It seems that @Afandian and I could make use of whatever format that was produced -- assuming it was produced quickly enough to keep up.

I'd also like to +1 the current-rev vs. previous-rev diffing. This is what we are looking for.

@Halfak, the idea with RB would be to generate each diff once at a deterministic URL (leveraging Varnish's request coalescing), and sharing the (stored) diff among all followers.

@GWicke, this sounds like an interesting solution. How do you currently subscribe to changes in RESTBase? And is RESTBase prepared to cache enwiki diffs for us?

@Halfak, our main mechanism of tracking changes is currently sending requests to RB from the job queue. This is done for each edit, but also indirect changes to templates or images.

RB can indeed store blobs like diffs. We have some support for time-based retention, so could keep those diffs for a limited time.

This would be very useful to the cocytus project, which I run, and would have completely avoided this time I took down labs-redis.

Cocytus was the reason we created this ticket in the first place, @notconfusing! Hopefully it will replace most of the code.

Sitic added a subscriber: Sitic.Jul 23 2015, 11:44 PM
Restricted Application added a subscriber: StudiesWorld. · View Herald TranscriptDec 10 2015, 11:36 PM
chasemp triaged this task as Normal priority.Apr 11 2016, 2:20 PM
Krinkle moved this task from Inbox to Backlog on the Wikimedia-Stream board.Apr 21 2016, 11:30 PM
Krinkle renamed this task from Provide useful diffs to high-volume consumers of RCStream to Provide useful diffs to high-volume consumers of recent changes.Jan 30 2017, 10:59 PM

@Krinkle

This feature would be helpful (for every single revision):

I am building a web-based, crowd-sourced vandalism patrolling tool; an early demo can be found at http://wikiloop-battlefiled-demo.herokuapp.com. It listens to the RCStream, formats the events on the page, and adds UX features to make patrolling easier and more fun.

It would be helpful if we could show the diff directly on this site. If the diff is not in the RCStream, we might have to send an additional request to the server to obtain a full copy of the wikitext and do the diffing on our side, which adds to your serving load.

@Xinbenlv In general I wouldn't be too concerned about server load. You can use the Revisions query API for this purpose, which also supports batching (via revids) and Varnish HTTP caching (on our end) so long as smaxage=600 is specified. See also the API Etiquette.
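A small sketch of what batched requests could look like. It only builds the URLs; the batch size of 50 reflects the usual per-request limit for non-bot clients, and the `rvprop` / `rvslots` choices, the endpoint constant and the helper name `batched_revision_urls` are assumptions for the example.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"
BATCH_SIZE = 50  # usual revids limit per request for regular clients


def batched_revision_urls(rev_ids, batch_size=BATCH_SIZE):
    """Split revision IDs into batches and build one query URL per batch."""
    rev_ids = list(rev_ids)
    urls = []
    for start in range(0, len(rev_ids), batch_size):
        batch = rev_ids[start:start + batch_size]
        query = urlencode({
            "action": "query",
            "prop": "revisions",
            "revids": "|".join(str(r) for r in batch),
            "rvprop": "ids|content",
            "rvslots": "main",
            "format": "json",
            "smaxage": 600,  # lets the shared HTTP cache answer repeat requests
        })
        urls.append(f"{API_URL}?{query}")
    return urls


urls = batched_revision_urls(range(1, 121))
print(len(urls))
```

With 120 revision IDs this produces three requests instead of 120, and identical batches requested by several clients within the cache window can be answered without reaching MediaWiki.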

If we find latencies do become a problem, this endpoint might be something we could expose via the high-volume REST API instead. See rest v1 for more about that.