Surface link changes as a stream
Closed, Resolved, Public

Description

Summary

We'd like to create a stream of events that outputs external links being added to or removed from Wikipedia (along with the metadata associated with the change). What's the best way of doing so?

We'll go with Mediawiki + EventBus extension.

Previously

Here are some alternatives:

  • Previously EventLogging was used for this purpose, but it didn't fully materialize. Is EventLogging still the right solution?
  • Should we use Mediawiki + EventBus extension?
  • Should we use Change Propagation?
  • Should the RecentChanges event stream be expanded to include links?
  • Anything else?

Given that MediaWiki templates/modules generate links, I think the right approach is to parse HTML diffs (as opposed to Wikitext diffs) for links by listening to Parsoid events. Any issues with that? Or a better way of doing it?

Can references added to Wikidata statements be captured too?

Event Timeline

bmansurov created this task.

Clarifying: ChangeProp consumes EventBus data, just as EventStreams does. So you cannot "use" ChangeProp; rather, you will be sending events to EventBus (soon to be called EventGate), consuming them elsewhere, and in turn exposing them to the world.

Previously EventLogging was used for this purpose, but it didn't fully materialize. Is EventLogging still the right solution?

I think eventlogging should not be needed. It seems that this can be done from mediawiki directly.

Should the RecentChanges event stream be expanded to include links?

Probably not; it makes sense to create an event specifically for this. RecentChanges is used to note that an edit event happened. It does not concern itself with the content of the event, and it seems a bad idea to bloat the event schema to do so. See https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/recentchange/2.yaml

See resources below:

Events can be published directly from mediawiki. The event streams we have for RecentChanges and similar do not use eventlogging but rather events are sent directly from mediawiki's backend when they occur. Schemas for those events can be found here: https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/mediawiki

Also please see: https://www.mediawiki.org/wiki/Wikimedia_Technology/Annual_Plans/FY2019/TEC2:_Modern_Event_Platform

And: https://github.com/wikimedia/eventgate

We are happy to help you CR your changes as needed; just let us know if you need a meeting to come up with a plan for the steps needed to accomplish what this ticket wants to do. I would start by drafting the schema of the event you are interested in.

Thanks, @Nuria. I'll wait and see what other teams have to say before drafting a solution. Since we're interested in identifying links, I'm not sure how MediaWiki can be used to generate HTML diffs before we can parse them. That's why I was suggesting Parsoid.

@bmansurov ah, I think I understand what you meant now! If MediaWiki cannot generate the diff you are interested in at the time the page is edited, you need to consume an event that happens later in the chain. Ya, makes sense.

The task at hand is very easy. There's a LinksUpdateComplete hook in MW core; it gets the LinksUpdate object, which contains the list of external links and methods to get link insertions and removals. Writing code in the EventBus extension to use the hook is simple. The important part of the work is for you to come up with the schema for the event you want to get. We already have an event for page properties change. I would assume your event will be quite similar. If you create the schema, I can write the producer code or point you to where to write it.
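For illustration only, the "insertions and removals" idea amounts to a set difference between the links before and after an edit. The sketch below is plain Python, not the MediaWiki PHP API; `diff_links` and the example.org URLs are hypothetical names invented for this example.

```python
def diff_links(before, after):
    """Return (added, removed) links between two link collections.

    Hypothetical helper: it only illustrates the set difference,
    it is not how LinksUpdate is implemented in MediaWiki core.
    """
    before_set, after_set = set(before), set(after)
    added = sorted(after_set - before_set)      # present only after the edit
    removed = sorted(before_set - after_set)    # present only before the edit
    return added, removed


old_links = ["https://example.org/a", "https://example.org/b"]
new_links = ["https://example.org/b", "https://example.org/c"]

added, removed = diff_links(old_links, new_links)
print(added)    # ['https://example.org/c']
print(removed)  # ['https://example.org/a']
```

Links present on both sides (here `https://example.org/b`) appear in neither list, which matches the "just the changes" shape discussed in this task.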

Moving to services-blocked until Research comes up with the schema for the event.

@Pchelolo good to hear. Besides the links themselves, will we be able to extract metadata associated with the change too? We'll need the following:

  • Revision ID
  • Page ID
  • Timestamp of the change
  • possibly others

@bmansurov look at the event schema for properties change I've linked, it contains all of the metadata you need and it's generated from the same hook. Will such a schema work for you if we replace removed_properties with removed_links and added_properties with added_links? We can also emit the full list of links before the edit and after the edit and let clients do the diff. In general, we can get any metadata you can imagine.
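As a rough sketch of what "properties replaced with links" could look like on the producer side, the Python below assembles an event-shaped dict. The field names `added_links`, `removed_links`, `rev_id`, `page_id` and `meta.dt` come from this discussion; the `link`/`external` pair per entry, the helper name `build_links_change_event`, and the choice to always emit both lists are assumptions made for the example, not the merged schema.

```python
from datetime import datetime, timezone


def build_links_change_event(rev_id, page_id, added, removed, dt=None):
    """Assemble an event-shaped dict (hypothetical helper, not EventBus code).

    `added` and `removed` are iterables of (link, is_external) pairs.
    """
    dt = dt or datetime.now(timezone.utc)

    def as_entries(pairs):
        # One object per link, carrying the URL and an external/internal flag.
        return [{"link": link, "external": external} for link, external in pairs]

    return {
        "meta": {"dt": dt.isoformat()},   # timestamp of the change
        "rev_id": rev_id,
        "page_id": page_id,
        "added_links": as_entries(added),
        "removed_links": as_entries(removed),
    }


event = build_links_change_event(
    rev_id=1,
    page_id=2,
    added=[("https://example.org/new", True)],
    removed=[("/wiki/Old_page", False)],
    dt=datetime(2019, 2, 13, 21, 8, 9, tzinfo=timezone.utc),
)
print(event["meta"]["dt"])  # 2019-02-13T21:08:09+00:00
```

Emitting only the changes keeps the event small; the alternative mentioned above (full before and after lists) would push the diffing work onto every consumer.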

@Pchelolo yes, replacing "properties" with "links" should do it. I don't think we need the full list of links, just the changes.

Also, any idea on if this will work with Wikidata?

@Pchelolo yes, replacing "properties" with "links" should do it. I don't think we need the full list of links, just the changes.

Perfect. Just make a change request in Gerrit with a new schema. I can point you to where to write the code for emitting the event, or I can write the code myself if you don't wanna bother with PHP.

Also, any idea on if this will work with Wikidata?

That is a good question. We need advice from the WD people on that, but I don't know of a reason why it would not.

Change 486521 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/event-schemas@master] Add page link changes schema

https://gerrit.wikimedia.org/r/486521

@bmansurov I think you also need to consider a couple more things: a list of links can be very lengthy, so do we have a limit on how much space this field should occupy? Are links URL-encoded? (We probably want them to be.)

Are links url encoded? (we probably want them to be so).

I agree with @Nuria on this. The 'meta.uri' is encoded, so we need to be consistent. Whether LinksUpdate in MW holds encoded or unencoded links needs to be tested.
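To make the encoding question concrete, here is a small Python sketch of percent-encoding a link and reversing it. The actual patch is PHP; the choice of keeping ":" and "/" literal, the helper name `encode_link`, and the example URL are assumptions for illustration.

```python
from urllib.parse import quote, unquote


def encode_link(url):
    """Percent-encode a URL, leaving ':' and '/' readable (assumed policy)."""
    return quote(url, safe=":/")


raw = "https://example.org/search?q=a b&lang=en"
encoded = encode_link(raw)
print(encoded)  # https://example.org/search%3Fq%3Da%20b%26lang%3Den

# Decoding restores the original link exactly.
assert unquote(encoded) == raw

# Encoding twice is not idempotent: '%' itself becomes '%25'.
print(encode_link(encoded))  # https://example.org/search%253Fq%253Da%2520b%2526lang%253Den
```

The last line is why it matters whether the links arrive already encoded: encoding an already-encoded link a second time changes it.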

Change 486691 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/extensions/EventBus@master] Send page-external-links-change event

https://gerrit.wikimedia.org/r/486691

@Nuria @Pchelolo I'm encoding the URL in the above patch. How should we handle the situation with a long list of links? Is that really a problem? Should we create multiple events?

How should we handle the situation with a long list of links? Is that really a problem?

You should not worry about that unless you encounter a list of links that's more than 4 MB long.

Should we create multiple events?

Please don't :) It will make it much more difficult to produce and consume such events.
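If someone wants to sanity-check the size concern, a quick way is to measure the serialized event. This Python sketch assumes the "4mb" figure means 4 MiB of UTF-8 encoded JSON; the constant `MAX_EVENT_BYTES` and both helper names are hypothetical.

```python
import json

# Assumed reading of the "4mb" figure mentioned above.
MAX_EVENT_BYTES = 4 * 1024 * 1024


def event_size_bytes(event):
    """Size of the event once serialized to UTF-8 JSON."""
    return len(json.dumps(event).encode("utf-8"))


def fits_in_one_event(event, limit=MAX_EVENT_BYTES):
    """True if the serialized event stays within the size limit."""
    return event_size_bytes(event) <= limit


# An event carrying 1000 links is still far below the assumed limit.
sample_event = {
    "rev_id": 1,
    "added_links": [
        {"link": f"https://example.org/page/{i}", "external": True}
        for i in range(1000)
    ],
}
print(event_size_bytes(sample_event), fits_in_one_event(sample_event))
```

Under these assumptions, a single event has room for many thousands of links, which is consistent with the advice not to split the change across multiple events.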

Nuria raised the priority of this task from Medium to Needs Triage. Jan 25 2019, 11:44 PM
Nuria moved this task from Incoming to Radar on the Analytics board.

@Pchelolo yes, replacing "properties" with "links" should do it. I don't think we need the full list of links, just the changes.

Also, any idea on if this will work with Wikidata?

*poke* @Addshore :)

Change 486521 merged by Ppchelko:
[mediawiki/event-schemas@master] Add page link changes schema

https://gerrit.wikimedia.org/r/486521

Thanks for your input, all - it's great to see how quickly new event streams can be set up!

What are the next steps for getting the event stream live and accessible from this schema?

@Samwalton9 we still need to see whether URLs are URL-encoded or not, and hook publishing to one of the MediaWiki events (I think @bmansurov is doing this with @Pchelolo's help?). Once events are flowing and looking OK, they can be set to be published to the outside world.

We have discovered that we need to update the LinksUpdate class in core to support this functionality, so it will take a little more time than we anticipated, but we're making progress.

Change 488957 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/core@master] Expose external link additions and deletions

https://gerrit.wikimedia.org/r/488957

bmansurov renamed this task from "How to surface link changes as a stream?" to "Surface link changes as a stream". Feb 7 2019, 4:29 PM
bmansurov updated the task description.

We're getting closer. The last piece of the puzzle is to emit the event publicly via EventStreams. To do that, we need to add the mediawiki.page-links-change topic to the list of exposed topics in puppet, for both production and deployment-prep.

Change 488957 merged by jenkins-bot:
[mediawiki/core@master] Expose external link additions and deletions

https://gerrit.wikimedia.org/r/488957

Change 486691 merged by Ppchelko:
[mediawiki/extensions/EventBus@master] Send page-links-change event

https://gerrit.wikimedia.org/r/486691

Change 489211 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/puppet@production] Add page-links-change event to EventStreams

https://gerrit.wikimedia.org/r/489211

Change 490143 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Add page-links-change to topic config

https://gerrit.wikimedia.org/r/490143

Change 490143 merged by Ottomata:
[mediawiki/event-schemas@master] Add page-links-change to topic config

https://gerrit.wikimedia.org/r/490143

Just saw some errors in prod

Failed processing event: Topic mediawiki.page-links-change not configured

^ should fix.

Change 489211 merged by Ottomata:
[operations/puppet@production] Add page-links-change event to EventStreams

https://gerrit.wikimedia.org/r/489211

Thanks, everyone, for helping me with this task. I see the events are being emitted. Here's some data:

{
    "rev_id": 857576225,
    "performer": {
        "user_text": "Andre Engels",
        "user_registration_dt": "2012-10-31T13:36:19Z",
        "user_is_bot": false,
        "user_id": 6253,
        "user_groups": [
            "*",
            "user",
            "autoconfirmed"
        ],
        "user_edit_count": 81451
    },
    "page_title": "Q61720097",
    "page_namespace": 0,
    "page_is_redirect": false,
    "page_id": 61558166,
    "meta": {
        "offset": 25645,
        "partition": 0,
        "uri": "https://www.wikidata.org/wiki/Q61720097",
        "topic": "eqiad.mediawiki.page-links-change",
        "schema_uri": "mediawiki/page/links-change/1",
        "request_id": "3c81d7f1-4a94-4aae-a0da-38e252f3838c",
        "id": "76d84922-2fd3-11e9-b4fc-1866da994cb1",
        "dt": "2019-02-13T21:08:09+00:00",
        "domain": "www.wikidata.org"
    },
    "database": "wikidatawiki",
    "added_links": [
        {
            "link": "/wiki/Q5",
            "external": false
        },
        {
            "link": "/wiki/Property:P2268",
            "external": false
        },
        {
            "link": "/wiki/Property:P245",
            "external": false
        },
        {
            "link": "/wiki/Property:P2843",
            "external": false
        },
        {
            "link": "/wiki/Property:P31",
            "external": false
        },
        {
            "link": "/wiki/Property:P3782",
            "external": false
        },
        {
            "link": "/wiki/Property:P4927",
            "external": false
        },
        {
            "link": "/wiki/Property:P569",
            "external": false
        },
        {
            "link": "/wiki/Property:P570",
            "external": false
        },
        {
            "link": "/wiki/Property:P650",
            "external": false
        },
        {
            "link": "/wiki/Property:P813",
            "external": false
        },
        {
            "link": "/wiki/Property:P854",
            "external": false
        },
        {
            "link": "https://rkd.nl/en/explore/artists/110591",
            "external": true
        },
        {
            "link": "http://oxfordindex.oup.com/view/10.1093/benz/9780199773787.article.B00007415",
            "external": true
        },
        {
            "link": "http://www.oxfordartonline.com/benezit/view/10.1093/benz/9780199773787.001.0001/acref-9780199773787-e-00007415",
            "external": true
        },
        {
            "link": "http://www.getty.edu/vow/ULANFullDisplay%3Ffind%3D%26role%3D%26nation%3D%26subjectid%3D500036540",
            "external": true
        },
        {
            "link": "http://www.musee-orsay.fr/fr/espace-professionnels/professionnels/chercheurs/rech-rec-art-home/notice-artiste.html%3Fnnumid%3D1023",
            "external": true
        },
        {
            "link": "https://www.invaluable.com/features/viewArtist.cfm%3FartistRef%3D5a5g2du93t",
            "external": true
        },
        {
            "link": "http://www.artnet.com/artists/michel-arnoux",
            "external": true
        }
    ]
}

As was mentioned before, we'll be seeing link changes from Group 2 Wikipedias starting tomorrow.

Here's the stream URL: https://stream.wikimedia.org/v2/stream/page-links-change
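For anyone wanting to consume the stream, here is a hedged Python sketch of the client side. It assumes the endpoint serves Server-Sent Events with one JSON object per `data:` field, and that link values are percent-encoded as in the sample event above. The parser works on any iterable of text lines so it can be tried offline; `iter_sse_events`, `external_links_added` and the sample lines are made up for this example.

```python
import json
from urllib.parse import unquote


def iter_sse_events(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines."""
    data_parts = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # Per the SSE format, one leading space after the colon is dropped.
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line terminates the current event.
            yield json.loads("\n".join(data_parts))
            data_parts = []


def external_links_added(event):
    """Return decoded external links from an event's added_links list."""
    return [
        unquote(entry["link"])
        for entry in event.get("added_links", [])
        if entry.get("external")
    ]


# Hand-written sample lines standing in for the live stream.
sample_lines = [
    "event: message",
    'data: {"added_links": [{"link": "/wiki/Q5", "external": false}, '
    '{"link": "https://example.org/a%3Fb%3D1", "external": true}]}',
    "",
]

events = list(iter_sse_events(sample_lines))
print(external_links_added(events[0]))  # ['https://example.org/a?b=1']
```

A real consumer would feed the lines of a long-lived HTTP response from the stream URL into `iter_sse_events`; that part is left out here because it needs network access and reconnect handling.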

Change 490472 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/eventstreams@master] Add spec for links-change

https://gerrit.wikimedia.org/r/490472

Change 490472 merged by Ottomata:
[mediawiki/services/eventstreams@master] Add spec for links-change

https://gerrit.wikimedia.org/r/490472

Pchelolo claimed this task.

MW train has been deployed, so the events are available for all wikis. The final piece is adding the stream to the documentation, but apparently, it's blocked now by T216184.

I will close this task, as no work remains to be done here; the stream will appear in the documentation as soon as we deploy the EventStreams service.