
Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page
Open, Needs Triage, Public

Description

Representing page links changes as state will be useful for inputs to ML models (T328899), but will also be very useful for state transfer of page links state to other places.

We should create a common event data model for links on MW pages.
See T333497#8772933 for a summary of the different kinds of links that might be on MW pages.
In addition to the links listed there, we should include page redirect targets as a link type.

Looking at the different kinds of links, I can see two broad kinds: links to MW pages (articles, templates, categories(?), images, etc.) and arbitrary hyperlinks to external URLs. If it is sane to put these kinds of links in the same data model, we should, but perhaps external links are different enough to warrant their own data model. This ticket should be used to make and document this decision.

Done is

  • A new mediawiki page link state entity data model is bikeshed and decided on
  • A new mediawiki.page_links_change.v1 stream is produced via EventBus extension. This stream should likely only contain normal wiki page links.

Other streams that represent links should use this new mw link data model.

NOTE: There is a lot of context that we need to get from DBAs and other MediaWiki folks to do this right. See also:

Event Timeline

We may want to emit this change directly from EventBus rather than from a streaming enrichment job. We'd need notifications of the page link changes to do streaming enrichment. We could use the existing mediawiki.page-links-change stream (which does not represent the current state of pages, only links removed and added) as a notification event to enrich a new event, but it would probably be easier to just emit from EventBus.
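To illustrate the difference between the two kinds of event, here is a minimal Python sketch of what an enrichment job would have to do to turn added/removed notifications into full state. The `LinkDiff` shape and the `apply_diff` helper are hypothetical names for illustration only, not the actual mediawiki.page-links-change schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkDiff:
    """Hypothetical notification event: carries only what changed."""
    added: frozenset
    removed: frozenset


def apply_diff(current_links: frozenset, diff: LinkDiff) -> frozenset:
    """Return the full link state after applying one diff event."""
    return (current_links - diff.removed) | diff.added


# The consumer has to hold the previous state to compute the new one.
state = frozenset({"Foo", "Bar"})
state = apply_diff(
    state, LinkDiff(added=frozenset({"Baz"}), removed=frozenset({"Bar"}))
)
print(sorted(state))  # ['Baz', 'Foo']
```

The sketch shows why a diff-only stream is awkward as a source of state: every consumer needs the previous state to make sense of an event, whereas a state event emitted from EventBus would carry the current links directly.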

Ottomata renamed this task from Create new mediawiki.page_links_change schema based on fragment/mediawiki/state/change/page to Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page. Mar 8 2023, 3:06 PM

@calbon This is probably a task that Event Platform will do, unless you want @AikoChou to do some PHP EventBus work :)

If you all do want to do this, we are happy to review, and it would probably help expedite getting it done.

When we do this, we should also consider all the other kinds of mediawiki 'links'. Ideally we can create a common schema for each of these, but emit them as separate streams, e.g. page_links_change, page_images_change, page_category_change, page_templates_change etc. (maybe we'd put links in all the stream names too? TBD I guess).

See T333497#8772933 and below

In https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/914867, @pfischer is modeling a 'link target' in order to get information about page redirect targets for updating search indexes.

This is looking pretty relevant, and I think to do this right we should think together about how we want to model page links in our MW state event schemas. It might be that the redirect target of a page should be propagated to search indexes via the new page_links_change stream described here. But even if we do want to put redirect target info in the page_change stream, it'd be nice if the data model we use for a page link was the same.

Currently the link_target entity does map to MW's LinkTarget. It is related to the page more than to a revision as this is the information we get without further lookups in PageChangeHooks. There is only one use case for us (cirrus search update pipeline, see https://phabricator.wikimedia.org/T325315#8827182) where it's relevant at what revision the redirect existed/pointed to a certain target.

@pfischer pasting some stuff from Slack here so it doesn't get lost.

For the event data model we need to:

  • determine if we can/should make this ‘mediawiki link’ data model able to represent all kinds of mediawiki links, including redirects? Including external links?
  • if we decide that this data model shouldn’t represent a kind of link (e.g. external) then we should document that in this phab task, and explain why, and potentially explain how we might want to represent those with a different data model.
  • Document the possible different kinds of links in the event schema.

Quick recap on what LinkTarget does...

MediaWiki distinguishes between several different kinds of links:

  • links to pages (possibly plus a section on that page, and possibly specified as a relative sub-page link; may link to a special page or non-existing page)
  • relative section links (jumps on the same page, not recorded in the database)
  • interwiki links (links to some external site using an interwiki prefix - the target site doesn't actually have to be a wiki; we can distinguish between "local" interwikis and external interwikis)
  • language links (to sister wikis)
  • external links (full URL)
  • category links
  • image (media) links
  • template links
  • ...maybe I missed something?

While LinkTarget can represent the targets of all kinds of links, it does not distinguish between links with different semantics (e.g. [[Category:Foo]] is a categorization, while [[:Category:Foo]] is a regular link to a category). These would be recorded to different database tables, but the LinkTarget representation would look the same.

Aye, thanks Daniel! This is great. Similar to what Isaac wrote here, but with more detail.

Most likely what we want to do is define a common event data model that can represent as many of these concepts as we can. But we will probably want to put different kinds of links changes into different streams. E.g. mediawiki.page_wiki_links_change, mediawiki.page_external_links_change, mediawiki.page_image_links_change, who knows. Or maybe we'll want most of them in one stream. TBD. The important part for me right now is the data model(s).

I wrote a schema comparison based on the current database schemas linked in @Isaac's comment. Based on that (see DB Schema Transposed sheet), we would end up with the following bag of properties to fit all link schemas listed above:

  • from (all)
  • title (pl, iwl, ll)
  • from_namespace (pl, il, tl)
  • namespace (pl)
  • prefix|lang (iwl, ll)
  • target_id (tl)
  • to (cl, il, el)
  • sortkey (cl)
  • sortkey_prefix (cl)
  • timestamp (cl)
  • collation (cl)
  • type (cl)
  • id (el)
  • index (el)
  • index_60 (el)
  • to_domain_index (el)

@pfischer thanks for compiling all this!
Some fields like to can have different meanings: a full URL for external links, and a page_title for images and categories. I wonder if we should rename some of them.
Some of these fields are also DB optimizations that I'm not sure we need to expose (e.g. index, index_60, to_domain_index), because I think they'd be trivial to extract.

Would something like this work?

  • link_type: enum (page, interwiki, language, external, category, media, template)
  • page_title: optional[string] (for page, interwiki, language, category, media, template)
  • page_namespace: optional[int] (for page, category, media, template)
  • page_id: optional[int] (for page only?)
  • is_redirect: optional[boolean] (for page, category?, template?)
  • external_url: optional[string] (for external)
  • iw_prefix: optional[string] (interwiki, language)
  • fragment?: optional[string] (page, language, interwiki, external)

Examples: P48704
Problems I've seen:

  • no way to distinguish local uploads from media uploaded to commons for media links
  • fragment: is not stored in the db but could be extracted I hope?
  • page_id: is not really stored in the db and will be the target page_id at the time the event is emitted, might help to identify redlinks
    • would this be needed for other types like template?
  • might be hard to distinguish between external vs local interwiki links (should there be another field to distinguish those?)
  • interwiki prefixes are local to the source wiki: wikt might mean en.wiktionary.org on en.wikipedia.org but fr.wiktionary.org on fr.wikipedia.org. The same goes for language links: en might mean en.wiktionary or en.wikisource depending on the project the event is emitted from
  • there are many optional fields here, might be a code smell that indicates we mix too many different types in the same model?
  • the link text itself is not captured: for [[Page#Fragment|click here]] or [https://wikipedia.org click here], the text "click here" is lost
  • what happens if the page has multiple links to the same target URL but with varying fragment? or varying link text? (they might be deduplicated by MW, no?)

Thanks!

no way to distinguish local uploads from media uploaded to commons for media links

Wha? How does MW do this then? I guess if the page doesn't exist locally...look for it in commons? Is that built into MW core? Seems weird!

might be hard to distinguish between external vs local interwiki links (should there be another field to distinguish those?)

Why is this hard? Or do you mean when a wiki link is saved as an external link because the editor used [] instead of [[]]?

interwiki prefixes are local to the source wiki

Maybe we can add a project (or domain?) field that normalizes this?
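A minimal Python sketch of what such normalization could look like, assuming a lookup keyed on the emitting wiki's domain plus the prefix. The `INTERWIKI_MAP` contents and the `resolve_domain` helper are hypothetical; only the two wikt examples come from the comment above, and a real implementation would read the wiki's own interwiki configuration.

```python
from typing import Optional

# Hypothetical lookup: (source wiki domain, interwiki prefix) -> target domain.
INTERWIKI_MAP = {
    ("en.wikipedia.org", "wikt"): "en.wiktionary.org",
    ("fr.wikipedia.org", "wikt"): "fr.wiktionary.org",
}


def resolve_domain(source_domain: str, iw_prefix: str) -> Optional[str]:
    """Resolve a source-local prefix to a domain, or None if unknown."""
    return INTERWIKI_MAP.get((source_domain, iw_prefix))


print(resolve_domain("en.wikipedia.org", "wikt"))
print(resolve_domain("fr.wikipedia.org", "wikt"))
```

With a resolved domain in the event, consumers would not need to know which wiki emitted it in order to interpret the prefix.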

what happens if the page has multiple links to the same target URL but with varying fragment? or varying link text?

I guess not much we can do about that?

there are many optional fields here, might be a code smell that indicates we mix too many different types in the same model?

The main one that I'm not sure about is external_url. I'd like to use the same model for all links, but this is the only field that is mutually exclusive with (almost) all of the other fields. Perhaps external links should get their own model.

Also, categorylinks is just weird in general, as these don't really seem like 'links' to me, but just happen to be implemented that way.


Anyway, so for @pfischer's purpose in T325315: Add support for redirects in CirrusSearch, the basic model we need now is for redirects, which is the same basic model for normal (local) page links, I think?

  • link_type
  • page_namespace_id
  • page_title
  • page_id
  • is_redirect

And, actually...this is almost the same as the entity/page schema fragment (minus link_type, plus revision_count)

I wonder if we can just reuse page entity in the link fragment schema. If we exclude external links from this model, we could do:

title: fragment/mediawiki/state/entity/page_link
$id: /fragment/mediawiki/state/entity/page_link/1.0.0
type: object

allOf:
  - $ref: /fragment/mediawiki/state/entity/page/1.0.0

properties:
  link_type:
    type: string
  is_redirect:
    type: boolean
  
  # + fragment if we can?
  # + link_text if we can?

(We might want to move the page fragment schema revision_count field elsewhere.)

And actually, T325315: Add support for redirects in CirrusSearch is not trying to model links on a page in general, but rather info about a page's redirectness. So perhaps we don't need to use the page_link model there at all, as long as the redirect is always to a local page? I'll comment in that ticket now about this idea.

Thanks!

no way to distinguish local uploads from media uploaded to commons for media links

Wha? How does MW do this then? I guess if the page doesn't exist locally...look for it in commons? Is that built into MW core? Seems weird!

I mean MW certainly knows at some point but it does not seem that LinkTarget or anything in the imagelinks table has any info about that.

might be hard to distinguish between external vs local interwiki links (should there be another field to distinguish those?)

Why is this hard? Or do you mean when a wiki link is saved as an external link because the editor used [] instead of [[]]?

I meant "hard" in the sense of how to properly model this. MW uses the "interwiki" type for encoding both, but we end up, for example, encoding the uri_path of an external interwiki link as a page_title, which sounds weird to me.

interwiki prefixes are local to the source wiki

Maybe we can add a project (or domain?) field that normalizes this?

I believe this would be helpful indeed.

what happens if the page has multiple links to the same target URL but with varying fragment? or varying link text?

I guess not much we can do about that?

For the CirrusSearch use cases we actually don't care, I just added this for other use-cases that might be more interested in knowing fine-grained details about the links themselves.

For CirrusSearch we want to know when the page is a redirect:

  • what is the target page_id
  • ignore it if the target is yet another redirect (double redirects)
  • ignore if the target is an interwiki link

So reusing page_link sounds fine to me (we might not actually need link_type if we don't model interwiki links yet).
If we want to model interwiki links as part of the page_link model and reuse this for the page-state stream, then I'm not sure we have a good model yet.
Not modeling interwiki links yet might mean that we don't set the link target field when we hit a redirect that points to an interwiki link; we'd just know that it's a redirect.
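The CirrusSearch rules listed above can be sketched as a small Python function. This is only an illustration: it assumes the redirect target arrives as a dict using the page_id, is_redirect and iw_prefix field names floated earlier in this thread, and `redirect_target_page_id` is a hypothetical helper, not CirrusSearch code.

```python
from typing import Optional


def redirect_target_page_id(target: Optional[dict]) -> Optional[int]:
    """Return the page_id to index the redirect under, or None to ignore it."""
    if target is None:
        # The page is not a redirect, or the target was not modeled.
        return None
    if target.get("iw_prefix"):
        # Target is an interwiki link: ignore.
        return None
    if target.get("is_redirect"):
        # Target is itself a redirect (double redirect): ignore.
        return None
    return target.get("page_id")


print(redirect_target_page_id({"page_id": 123, "is_redirect": False}))  # 123
print(redirect_target_page_id({"page_id": 456, "is_redirect": True}))   # None
```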

I wonder if we can just reuse page entity in the link fragment schema. If we exclude external links from this model, we could do:

title: fragment/mediawiki/state/change/page_link
$id: /fragment/mediawiki/state/change/page_link/1.0.0
type: object

allOf:
  - $ref: /fragment/mediawiki/state/change/page/1.0.0

properties:
  link_type:
    type: string
  is_redirect:
    type: boolean
  
  # + fragment if we can?
  # + link_text if we can?

(We might want to move the page fragment schema revision_count field elsewhere.)

I think we can omit link_type. Instead, we could add iw_prefix|project|domain:

title: fragment/mediawiki/state/change/page_link
$id: /fragment/mediawiki/state/change/page_link/1.0.0
type: object

allOf:
  - $ref: /fragment/mediawiki/state/entity/page/1.1.0

properties:
  domain:
    type: string
  is_redirect:
    type: boolean
  
  # + fragment if we can?
  # + link_text if we can?

Why did you $ref: /fragment/mediawiki/state/change/page/1.0.0 instead of /fragment/mediawiki/state/entity/page/1.1.0, @Ottomata?

Oops mistake! I did indeed mean to use entity/page, good catch!

And

fragment/mediawiki/state/change/page_link

Should have been fragment/mediawiki/state/entity/page_link

too. Edited.

Alright, so if we reduce the scope from general link to page link -- that is, a link to a wiki page, either local or in another wiki -- you would be fine with the schema we sketched up above, @Ottomata? If so, I'd adapt my change request accordingly.

Based on all the info you've gathered, and comments in T325315, I think we can avoid committing a link_target entity schema for now, and just use page info fields in a redirect_target_page as described in T325315#8898609.

When we get around to working on this ticket, we'll use a schema similar to the one you and David proposed above.

So, I think, we can pause on this ticket, and focus on T325315 without making a new generic link target entity. If so, let's work on that in T325315.

Ya?

Change 914867 had a related patch set uploaded (by Ottomata; author: Peter Fischer):

[schemas/event/primary@master] Encode redirect targets in page change events.

https://gerrit.wikimedia.org/r/914867

Alright, in the latest patch for including redirect page link info (T325315), we have added a page_link entity model, which could be used as in the following example field:

page_link_target:
  # Info about the page linked to:
  interwiki_prefix: mw
  is_redirect: false
  namespace_id: 0
  page_id: 123
  page_title: The_Page_Title

We can expand and add more fields later as we need to support more types of links than redirects.
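As a quick sanity check of the example above, here is a Python sketch that verifies a record carries those five fields with the value types shown. The `EXPECTED_TYPES` table is inferred from the example values only and is not the merged schema. Treating every field as required is an assumption made for simplicity (interwiki_prefix in particular may well be optional for local links).

```python
# Types inferred from the example field above (an assumption, not the schema).
EXPECTED_TYPES = {
    "interwiki_prefix": str,
    "is_redirect": bool,
    "namespace_id": int,
    "page_id": int,
    "page_title": str,
}


def mistyped_fields(record: dict) -> list:
    """Return the names of fields that are missing or have the wrong type."""
    # type(...) is used instead of isinstance so that True is not accepted as an int.
    return [
        name
        for name, expected in EXPECTED_TYPES.items()
        if type(record.get(name)) is not expected
    ]


page_link_target = {
    "interwiki_prefix": "mw",
    "is_redirect": False,
    "namespace_id": 0,
    "page_id": 123,
    "page_title": "The_Page_Title",
}
print(mistyped_fields(page_link_target))  # []
```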

My outstanding question is whether we will want/need to be able to represent external links with the same data model. I'm leaning towards not doing it. If we don't, this means there would be two different fields in a new mediawiki.page_links_change stream: page_links and external_links. @achou, reading the Model Card Data section it looks like your model will not use external links for T328899 outlink, right?

This data model is a fragment schema, so we can rename it later, or make another one, if we decide it should include more than links to MW pages.

If there are no objections, @pfischer and I will move forward with this for adding a redirect_page_link field to the mediawiki/page/change schema in T325315: Add support for redirects in CirrusSearch.

Change 914867 merged by jenkins-bot:

[schemas/event/primary@master] Encode redirect targets in page change events.

https://gerrit.wikimedia.org/r/914867

@achou, reading the Model Card Data section it looks like your model will not use external links for T328899 outlink, right?

Sorry for the delayed response. That's right, the outlink topic model does not use external links to predict article topics.