Process: Trending service should process real time edits
Closed, Resolved | Public | 5 Estimated Story Points

Description

When an edit happens:

When I query the store for a page I should be able to see:

If user A edited the page 1 time and user B edited it 9 times, the bias is (9/10 - 1/10) = 0.8
If user A edits 5 times and user B edits 5 times, the bias is (5/10 - 5/10) = 0
If an article was only edited by user A then the bias is 1
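
A minimal sketch of the bias calculation described above, in Python for illustration only (the service itself is not written in Python). The function name `edit_bias` is hypothetical, and the handling of pages with more than two editors (comparing the two most active editors) is an assumption, since the task only defines the one- and two-editor cases.

```python
from collections import Counter


def edit_bias(editors):
    """Return the bias for a page, given one entry per edit naming its editor.

    Assumption: when more than two users edited the page, the two most
    active editors are compared. The task only defines the cases with
    one or two editors.
    """
    counts = Counter(editors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top_two = [n for _, n in counts.most_common(2)]
    first = top_two[0]
    second = top_two[1] if len(top_two) > 1 else 0
    # Same as first/total - second/total, computed in one division.
    return (first - second) / total


print(edit_bias(["A"] + ["B"] * 9))      # 0.8
print(edit_bias(["A"] * 5 + ["B"] * 5))  # 0.0
print(edit_bias(["A"] * 3))              # 1.0
```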

An edit should only be processed once and the store should be accessible to all rest endpoints.
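
One way to satisfy the "processed once" requirement is to remember which revisions have already been counted. The sketch below is illustrative Python, not the service's code; `EditStore`, `process_edit` and the single shared `store` instance are hypothetical names, and keying on a revision ID is an assumption.

```python
class EditStore:
    """In-memory store intended to be shared by all endpoints (hypothetical)."""

    def __init__(self):
        self.pages = {}                # page title -> number of edits counted
        self._seen_revisions = set()   # revision IDs already processed

    def process_edit(self, title, rev_id):
        """Count an edit once; return False if this revision was already seen."""
        if rev_id in self._seen_revisions:
            return False
        self._seen_revisions.add(rev_id)
        self.pages[title] = self.pages.get(title, 0) + 1
        return True


# A single module-level instance that every endpoint handler would import.
store = EditStore()
store.process_edit("Example", 101)
store.process_edit("Example", 101)  # duplicate delivery, ignored
print(store.pages["Example"])       # 1
```

Note that the set of seen revisions grows without bound in this sketch; a real implementation would have to drop entries when the corresponding pages are purged.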

Every 100 edits, a purge script is run. During the purge pages which meet the following criteria are removed from the store:

Production requirement:

  • The amount of pages stored in the service's memory should never exceed a configurable number

Event Timeline

Change 322784 had a related patch set uploaded (by Jdlrobson):
Ignore articles outside the main namespace

https://gerrit.wikimedia.org/r/322784

Change 322787 had a related patch set uploaded (by Jdlrobson):
Exclude bots from edit count

https://gerrit.wikimedia.org/r/322787

@Ottomata a few questions:

  • How can I detect if a page has been created in the mediawiki.revision-create schema? It seems I only have the comment to go on, which isn't very reliable.
  • How can I detect if an event in the mediawiki.revision-create schema is a revert? Reverts are important for detecting edit wars, which shouldn't trend.
    • There seems to be a rev_len property, but this doesn't tell me how big the edit was, so I can't calculate bytes changed from the edit before. This is useful for working out what kind of edits are happening on an article. Of course I can ignore the first edit and calculate based on that for subsequent edits, at the loss of the first edit. Is that what I'm going to have to do?

Thanks in advance for your answers.

Change 322975 had a related patch set uploaded (by Jdlrobson):
Track anonymous edits

https://gerrit.wikimedia.org/r/322975

Change 323042 had a related patch set uploaded (by Jdlrobson):
Describe behaviour when pages are moved

https://gerrit.wikimedia.org/r/323042

Change 323043 had a related patch set uploaded (by Jdlrobson):
Describe delete behaviour

https://gerrit.wikimedia.org/r/323043

Change 323072 had a related patch set uploaded (by Jdlrobson):
Process contributor information and how biased editing is

https://gerrit.wikimedia.org/r/323072

Change 322784 merged by Mobrovac:
Ignore articles outside the main namespace

https://gerrit.wikimedia.org/r/322784

Change 322787 merged by Mobrovac:
Exclude bots from edit count

https://gerrit.wikimedia.org/r/322787

Change 322975 merged by Mobrovac:
Track anonymous edits

https://gerrit.wikimedia.org/r/322975

Change 323042 merged by Mobrovac:
Describe behaviour when pages are moved

https://gerrit.wikimedia.org/r/323042

Change 323043 merged by Mobrovac:
Describe delete behaviour

https://gerrit.wikimedia.org/r/323043

@Pchelolo I'll need some help today adding subscriptions to mediawiki.page-move and mediawiki.page-delete events.
@Ottomata I need your help understanding how to obtain the additional information https://phabricator.wikimedia.org/T145554#2816124

Today I'll fix up https://gerrit.wikimedia.org/r/323072 and begin writing patches to manage the amount of pages we watch.

Jdlrobson updated the task description.

Change 323072 merged by Mobrovac:
Process contributor information and how biased editing is

https://gerrit.wikimedia.org/r/323072

Change 323196 had a related patch set uploaded (by Jdlrobson):
Process move and delete events

https://gerrit.wikimedia.org/r/323196

Jdlrobson updated the task description.

Change 323239 had a related patch set uploaded (by Jdlrobson):
WIP: Implement a configurable purge strategy

https://gerrit.wikimedia.org/r/323239

Change 323324 had a related patch set uploaded (by Jdlrobson):
Record bytes change

https://gerrit.wikimedia.org/r/323324

3 patches need review [Ping @mobrovac @Pchelolo ] and then this is done.

Unfortunately, it looks like the event bus service is a little limited: I can't access reverts or whether a page was created (waiting to hear from @Ottomata about whether I need to file some new tickets against it). For a first version we can probably get away without these.

To check whether the page was just created, examine the rev_parent_id property - it will be undefined for new pages.
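
As a rough illustration of that check, in Python with the event represented as a plain dict: `is_page_creation` is a hypothetical helper, and every field in the sample events other than rev_parent_id is an assumption made for the example.

```python
def is_page_creation(event):
    """True when the revision has no parent, i.e. it is a page's first revision.

    A missing rev_parent_id key (the "undefined" case) and an explicit
    None are treated the same way.
    """
    return event.get("rev_parent_id") is None


# Sample events; the page_title and rev_id fields are illustrative only.
new_page = {"page_title": "Example", "rev_id": 10}
later_edit = {"page_title": "Example", "rev_id": 11, "rev_parent_id": 10}

print(is_page_creation(new_page))    # True
print(is_page_creation(later_edit))  # False
```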

As for reverts - I've looked at the hooks and it seems there's absolutely no way for us to tell whether a particular revision was created as a result of a rollback or not. So it would be quite hard to get this info to the service. Is it absolutely required?

@Jdlrobson Where did you get the info about reverts before? I can't find it anywhere..

Change 323196 merged by Mobrovac:
Process move and delete events

https://gerrit.wikimedia.org/r/323196

mobrovac updated the task description.

I added to the task:

Production requirement:

  • The amount of pages stored in the service's memory should never exceed a configurable number

We need to keep the memory bounded, so we have to enforce the maximum number of pages stored in the service's cache. This requirement should be checked on every page addition. If the number of pages would exceed it, the purge mechanism must be invoked. If none of the pages are purged, then we need to find an alternative criterion to remove at least one page, so that the new one can be stored.

What kind of number do you think is acceptable?
The most I've probably seen in the store is 5000.
If the configurable number is going to be higher than this it's probably enough to throw away the oldest edits.

What kind of number do you think is acceptable?
The most I've probably seen in the store is 5000.

I was thinking a couple of thousand pages, but we may as well start with 5k and adjust it if needed.

If the configurable number is going to be higher than this it's probably enough to throw away the oldest edits.

As long as the heuristic ensures no more than the configured amount of pages will be in cache, I'll be happy :)
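
The bounded-store behaviour discussed above could look roughly like the following Python sketch. It is not the merged implementation: `BoundedPageStore`, `max_pages` and the `should_purge` callback are hypothetical names, and "oldest" is taken to mean the page that was added to the store first.

```python
from collections import OrderedDict


class BoundedPageStore:
    """Keeps at most max_pages pages in memory (illustrative sketch)."""

    def __init__(self, max_pages, should_purge):
        if max_pages < 1:
            raise ValueError("max_pages must be at least 1")
        self.max_pages = max_pages
        self.should_purge = should_purge  # hypothetical purge criteria
        self.pages = OrderedDict()        # title -> data, oldest first

    def purge(self):
        """Remove every page that meets the purge criteria."""
        stale = [t for t, data in self.pages.items() if self.should_purge(data)]
        for title in stale:
            del self.pages[title]

    def add(self, title, data):
        """Store a page, making room first if the limit would be exceeded."""
        if title not in self.pages and len(self.pages) >= self.max_pages:
            self.purge()
            # Fallback: if the purge freed nothing, drop the oldest pages.
            while len(self.pages) >= self.max_pages:
                self.pages.popitem(last=False)
        self.pages[title] = data


store = BoundedPageStore(max_pages=2, should_purge=lambda data: False)
for title in ("A", "B", "C"):
    store.add(title, {"edits": 1})
print(list(store.pages))  # ['B', 'C']
```

Checking the limit on every addition, rather than only every 100 edits, is what keeps the page count from ever exceeding the configured number.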

@Pchelolo has answered your Q about page creates. As for reverts, hm, yeah, is this currently present in recentchanges? If so, perhaps we can figure out how recentchanges knows this, and add it to revision-create.

There seems to be a rev_len property but this doesn't tell me how big the edit was so I can't calculate bytes changed from the edit before. This is useful for working out what kind of edits are happening on an article.

Hm, perhaps we can add a rev_parent_len property to revision-create. @Pchelolo whatchu think?

Change 323897 had a related patch set uploaded (by Jdlrobson):
Track newly created pages

https://gerrit.wikimedia.org/r/323897

Reviews needed! All 6 patches are in the description. If these are merged I can finish this service as I'll have most of the information I need.

I'm working around the bytes changes by tracking the size of the first edit seen and comparing to last for the time being - https://gerrit.wikimedia.org/r/323324
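
The workaround amounts to something like the sketch below (illustrative Python; `ByteTracker` and its methods are hypothetical names, and the patch linked above may differ in detail). It relies only on the rev_len value carried by each event.

```python
class ByteTracker:
    """Tracks net bytes changed per page from the first edit seen (sketch)."""

    def __init__(self):
        self._first_len = {}  # title -> rev_len of the first edit seen
        self._last_len = {}   # title -> rev_len of the latest edit seen

    def record(self, title, rev_len):
        """Remember the first size seen for a page and update the latest."""
        self._first_len.setdefault(title, rev_len)
        self._last_len[title] = rev_len

    def bytes_changed(self, title):
        """Net change between the first and the latest edit seen."""
        return self._last_len[title] - self._first_len[title]


tracker = ByteTracker()
for size in (1000, 1200, 900):
    tracker.record("Example", size)
print(tracker.bytes_changed("Example"))  # -100
```

The size of the first edit itself is lost with this approach, and the result is a net change, so additions and removals that cancel out are not visible.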

The following patches still need +2s to complete this task!

Record bytes change (by Jdlrobson, 6 days old) [1]
https://gerrit.wikimedia.org/r/323324
Use Map for storing trending information (by Jdlrobson, 5 days old) [1]
https://gerrit.wikimedia.org/r/323489
Implement a configurable purge strategy (by Jdlrobson, 6 days old) [0]
https://gerrit.wikimedia.org/r/323239
Edit stream should replay edits up to the maximum age (by Jdlrobson, 5 days old) [0]
https://gerrit.wikimedia.org/r/323490

I think this might be stalling because I believe Petr is on vacation.

as for

Record bytes change (by Jdlrobson, 6 days old) [1]

I think it should be feasible and not difficult to add a rev_parent_len field to revision-create that contains the length of the parent revision.

I will take a look at them tomorrow. For future reference, please add me to a patchset if you want me to review it.

The following patches still need +2s to complete this task!

You are missing the one imposing a hard limit on the number of pages stored in memory.

@mobrovac the hard limit of pages is part of Implement a configurable purge strategy (by Jdlrobson, 6 days old) https://gerrit.wikimedia.org/r/323239

Change 323897 merged by jenkins-bot:
Track newly created pages

https://gerrit.wikimedia.org/r/323897

Change 323239 merged by jenkins-bot:
Implement a configurable purge strategy

https://gerrit.wikimedia.org/r/323239

https://gerrit.wikimedia.org/r/323324 is the only patch needed to finish this task (currently has 3 +1s...) - please merge this as priority.

both these patches will make the code better so are worth reviewing if you have time:

Change 323324 merged by jenkins-bot:
Record bytes change

https://gerrit.wikimedia.org/r/323324

Jdlrobson updated the task description.

This can be signed off. I'll poke for review in the other 2 patches in the other task as I now have enough to build off of.

@Jdlrobson can you share some instructions on how to test?

I'm having trouble getting the changes to work:

$ wget -O - http://0.0.0.0:6927/en.wikipedia.org/v1/feed/trending-edits | tail
--2016-12-09 21:07:59--  http://0.0.0.0:6927/en.wikipedia.org/v1/feed/trending-edits
Connecting to 0.0.0.0:6927... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘STDOUT’

    [ <=>    ] 51          --.-K/s   in 0s

2016-12-09 21:07:59 (4.06 MB/s) - written to stdout [51]

The event bus log says:

2016-12-09 21:07:09,316 [3663] (MainThread) 201 POST /v1/events (127.0.0.1) 117.26ms

No, I was wrong regarding the trailing slash. Are you running in Vagrant? Is kafka/ZK running?

@Pchelolo yes in vagrant, and kafka seems to be running:

sudo service kafka status
kafka start/running, process 3150

phuedx claimed this task.
phuedx subscribed.

The service has been deployed.

@Jdlrobson: Perhaps you could run a workshop during the All Hands to help interested Reading Web engineers to set up a dev/test environment.