Limit languages on EntityStub rdf builders
Closed, ResolvedPublic

Description

When an item uses other entities (e.g. a statement of P31:Q5), the non-dump RDF output of that item (i.e. any request without flavor=dump) includes the labels and descriptions of P31 and Q5 in all languages. That bloats the output drastically, causes performance issues, and caused a major incident today (which was mitigated by other means).

After discussion, the decision has been made to use the requested language.

This is faster and avoids the performance issue, but it changes behavior that people might depend on. It might also cause client-side cache pollution.

AC

  • Labels/descriptions are output in the requested language (uselang=?), defaulting to the site language
  • This new behavior is hidden behind a feature flag and will not be enabled in production until an announcement has been made.

Event Timeline

Ladsgroup updated the task description.
Ladsgroup added a subscriber: Manuel.

This is technically a task for Lydia and @Manuel to decide on.

IIRC the output of Special:EntityData is cacheable in the web cache layer. Cache fragmentation should be considered when deciding on language parameters. Maybe the cache should be bypassed if a parameter is present. Or if more than one language is requested. Or something.

Note on the side: it would be nice to turn this into a REST handler at some point.

These days, we only cache a few specific URL patterns (see T260349):

operations/mediawiki-config.git/wmf-config/Wikibase.php
// entity data for URLs matching these patterns will be cached in Varnish and purged if needed;
// all other entity data URLs will receive no caching
$wgWBRepoSettings['entityDataCachePaths'] = [
    // JSON from entity page JS, compare wikibase.entityPage.entityLoaded.js
    '/wiki/Special:EntityData/{entity_id}.json?revision={revision_id}',
    // Turtle from Query Service updater, compare WikibaseRepository.java
    '/wiki/Special:EntityData/{entity_id}.ttl?flavor=dump&revision={revision_id}',
    // third pattern with high volume of requests in Hive, source unknown
    '/wiki/Special:EntityData?id={entity_id}&revision={revision_id}&format=json',
];

The only RDF pattern in there uses flavor=dump, so at least in production, this shouldn’t make a difference for the cache. But I suppose Wikibase should also work correctly if a non-dump URL pattern is configured for caching.
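To make the point about the cached patterns concrete, here is a minimal Python sketch that checks a URL against the three patterns quoted above. The matching logic (each {placeholder} matches exactly one path or query token, and the whole URL must match) is an assumption for illustration only; it is not taken from Wikibase, and the helper names are hypothetical.

```python
import re

# Patterns copied from the config quoted above.
CACHE_PATHS = [
    "/wiki/Special:EntityData/{entity_id}.json?revision={revision_id}",
    "/wiki/Special:EntityData/{entity_id}.ttl?flavor=dump&revision={revision_id}",
    "/wiki/Special:EntityData?id={entity_id}&revision={revision_id}&format=json",
]

PLACEHOLDER = re.compile(r"\{\w+\}")


def pattern_to_regex(pattern):
    """Turn a pattern with {placeholders} into a regex (hypothetical helper)."""
    literal_parts = PLACEHOLDER.split(pattern)
    # Escape the literal parts; assume a placeholder matches one token
    # that contains none of the URL delimiters / ? & =
    return re.compile("[^/?&=]+".join(re.escape(p) for p in literal_parts))


def is_cacheable(url):
    """True if the URL matches one of the configured patterns exactly."""
    return any(pattern_to_regex(p).fullmatch(url) for p in CACHE_PATHS)


dump_url = "/wiki/Special:EntityData/Q42.ttl?flavor=dump&revision=123"
non_dump_url = "/wiki/Special:EntityData/Q42.ttl?revision=123"
print(is_cacheable(dump_url), is_cacheable(non_dump_url))
```

Under these assumptions, a non-dump Turtle request never matches a cached pattern, which is why the language choice should not affect the production cache.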

Adam also pointed out that, if we cache non-dump RDF flavors, we need to purge the cache whenever the labels of any of the mentioned entities change. That’s almost certainly not remotely feasible.

But if we say that non-dump RDF flavors are never cacheable, then it should be okay to use the user interface language for the labels (which can then be controlled with ?uselang as usual).
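As a rough illustration of "request language, controlled with ?uselang", the following Python sketch reads uselang from the query string and falls back to a site language. It deliberately ignores everything else MediaWiki may take into account when picking the interface language; the function name and the default of "en" are assumptions made for this example.

```python
from urllib.parse import parse_qs, urlsplit


def request_language(url, site_language="en"):
    """Return the uselang value of the URL, or the site language if absent.

    Simplified sketch: only the query string is inspected.
    """
    query = parse_qs(urlsplit(url).query)
    values = query.get("uselang")  # parse_qs drops blank values by default
    return values[0] if values else site_language


print(request_language("/wiki/Special:EntityData/Q42.ttl?uselang=de-at"))
print(request_language("/wiki/Special:EntityData/Q42.ttl"))
```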

Then, from a technical point of view, using the request language (the third option) makes sense. It remains for community consultation and product to decide what to do next.

If we go with the request language:

What do I get if I make a request with ?uselang=de-at, and a mentioned item has labels in English, German and Austrian German?

  • Austrian German label
  • Austrian German, German and English label

And what do I get if I make a request with ?uselang=de-at, and a mentioned item only has labels in English and German?

  • German and English label
  • German label
  • no labels at all (in the stub – the full entity data for that item still includes all labels, of course)

I think using the fallback chain would be better: if the item has a label in de-at, show that; if not, then show de, then en, etc.
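The "first label found along the fallback chain" idea can be sketched in Python as follows. The de-at, de, en chain is the one discussed in this thread; the chain table, the function names and the label strings are illustrative assumptions, not Wikibase code or real item data.

```python
# Assumed fallback chains for illustration (de-at -> de -> en as discussed here).
FALLBACK_CHAINS = {
    "de-at": ["de-at", "de", "en"],
    "de": ["de", "en"],
    "en": ["en"],
}


def fallback_chain(lang):
    """Return the chain for lang; unknown codes fall back to English."""
    return FALLBACK_CHAINS.get(lang, [lang, "en"])


def first_label(labels, lang):
    """Return (language code, label) for the first chain entry with a label."""
    for code in fallback_chain(lang):
        if code in labels:
            return code, labels[code]
    return None  # no label anywhere in the chain


# Made-up labels for a mentioned item.
only_de_and_en = {"de": "Beispiel", "en": "example"}
print(first_label(only_de_and_en, "de-at"))
```

With only German and English labels present, a de-at request yields the German label, which corresponds to the second bullet of the second question above.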

flavor=dump is used primarily for updating WDQS, right? In that case, the data is polled because it is known to have changed. So caching seems pointless...

Except that this pattern is requested by multiple users outside the cluster, where usage of the wdqs-updater has been rising steadily over the years.
(This also used to happen for each individual wdqs server inside the cluster, though the new updater will reduce that.)
I forget where the ticket is now, but the data was dug into when the decision was made to cache this.

Besides what Adam said, the problem is that each wdqs host has its own updater, so each change gets requested multiple times even internally.

@Manuel @Lydia_Pintscher

In an attempt to un-stall this I will try to formulate a question to answer.

This blew up recently because of T281272: HTTP 500 error for https://www.wikidata.org/wiki/Special:EntityData/Q30.rdf / ttl, although it had been working previously. However, developers consider providing labels etc. for all languages to bloat the response and hurt performance.

Proposed solutions.

  1. Don't do anything (a perfectly valid option, but the output will be massive, with performance issues, etc.; note: Q42 currently gives around 1 MB of RDF)
  2. Output no labels or descriptions (the dump flavor does this; not necessarily bad, but the most drastic change)
  3. Only output a set of languages:
    A. Use the request language (uselang=?), defaulting to the site language and falling back using a language fallback chain
    B. Some pre-defined set of languages, but which?

When this was discussed this morning, the favored solution seemed to be 3A; however, this means we change a big part of the output by leaving out other languages.

The open questions I see are:

  • Which solution do we go with?
  • Will this be welcomed by the consumers of the API (not having all the languages in the response)?

Amir and I talked about it some more. Based on the technical side of it and several reusers complaining this is my evaluation:

  • We don't remove the labels and descriptions for linked entities completely, because that is already causing issues on the query service, where people are trying to get exactly this data.
  • We do not default to a specific language or set of languages, because that would privilege one or some languages over others for no good reason.
  • We don't leave things as they are, because from both a technical and a reuser point of view the current situation is not great.

That leaves us with going by request language incl. fallbacks.

To move forward with this we need to follow the stable interface policy. This means we make the change, hide it behind a feature flag and turn it off initially, announce it (including having it enabled on a test system), wait, and then turn it on in production. (Announcement to be coordinated with @Mohammed_Sadat_WMDE)
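The rollout described here can be pictured as a per-wiki boolean setting that starts off, is switched on for the test wiki, and is later switched on in production. The Python sketch below is only a simplified picture of that idea: the flag name tmpUseRequestLanguagesForRdfOutput appears in the later config changes on this task, while the dictionary layout and the function names are assumptions for illustration.

```python
# Hypothetical per-wiki values for the temporary flag
# tmpUseRequestLanguagesForRdfOutput during the announcement period.
FLAG_BY_WIKI = {
    "default": False,          # off everywhere unless overridden
    "testwikidatawiki": True,  # enabled on the test system for reusers to try
}


def use_request_languages(wiki, flags=FLAG_BY_WIKI):
    """Look up the flag for a wiki, falling back to the default value."""
    return flags.get(wiki, flags["default"])


def stub_languages(wiki, all_languages, request_chain, flags=FLAG_BY_WIKI):
    """Pick the languages for entity stubs depending on the feature flag."""
    if use_request_languages(wiki, flags):
        return request_chain
    return all_languages  # old behavior: every language


print(stub_languages("testwikidatawiki", ["de", "en", "fr"], ["de", "en"]))
print(stub_languages("wikidatawiki", ["de", "en", "fr"], ["de", "en"]))
```

Turning the feature on in production then amounts to adding a True entry for the production wiki, and removing the flag later makes the new behavior unconditional.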

toan updated the task description.

Okay!

I've trimmed down the description with these new details. I think we can un-stall this?

I don't think you can get around sending "en" as fall-back even if the request is for "lang=es".

Okay!

I've trimmed down the description with these new details. I think we can un-stall this?

Yes from my side! We might need to be explicit about the fallback applying? (i.e. follow Wikidata's existing language fallback chain until a label/description is found if it exists in the chain)

I don't think you can get around sending "en" as fall-back even if the request is for "lang=es".

Language fallback should apply. So if the request is for Spanish but no Spanish label exists then English would be checked and used if available.

So regarding my questions from T285795#7189553, first bullet point, then second bullet point? We only emit one label per entity?

The patch I put up responds with all languages of the fallback chain, so if you ask for Austrian German, the labels and descriptions for Austrian German, German and English will be returned regardless of whether the entity already has a label in Austrian German. That's a little bit of redundancy, but I think it's fine (compared to what we currently have, that's nothing).
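The behavior described for the patch (emit every label whose language is in the fallback chain, rather than stopping at the first hit) can be sketched in Python like this. It is a simplified illustration with made-up labels and a hypothetical function name, not the actual patch.

```python
def stub_labels(labels, chain):
    """Keep every label whose language code appears in the fallback chain.

    The result follows the order of the chain; languages outside the
    chain are dropped, and chain entries without a label are skipped.
    """
    return {code: labels[code] for code in chain if code in labels}


# Made-up labels for a mentioned item.
labels = {
    "de-at": "Beispiel (AT)",
    "de": "Beispiel",
    "en": "example",
    "fr": "exemple",
}
chain = ["de-at", "de", "en"]  # chain for ?uselang=de-at, as discussed above

print(stub_labels(labels, chain))
```

For a de-at request this keeps the de-at, de and en labels and drops fr, so the stub is bounded by the length of the chain rather than by the number of languages the entity has.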

Addshore renamed this task from Decide on languages on EntityStub rdf builders to Limit languages on EntityStub rdf builders. Jul 26 2021, 12:45 PM

Change 708204 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[operations/mediawiki-config@master] Enable request language for RDF stubs in testwikidatawiki

https://gerrit.wikimedia.org/r/708204

Change 708204 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable request language for RDF stubs in testwikidatawiki

https://gerrit.wikimedia.org/r/708204

Mentioned in SAL (#wikimedia-operations) [2021-07-27T06:18:27Z] <ladsgroup@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:708204|Enable request language for RDF stubs in testwikidatawiki (T285795)]], Part I (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2021-07-27T06:20:44Z] <ladsgroup@deploy1002> Synchronized wmf-config/Wikibase.php: Config: [[gerrit:708204|Enable request language for RDF stubs in testwikidatawiki (T285795)]], Part II (duration: 00m 56s)

Deployed on testwikidatawiki. An example of TTL output:

Page | Before | After
https://test.wikidata.org/wiki/Q308 | |

Two weeks after the announcement, we can turn it on everywhere.

Someone needs to add a Documentation task to this.
I assume all the new options available and perhaps a reference link to this ticket would go somewhere in here? https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format

There are no new options available. There’s a purely internal feature flag for the migration period, which we’ll remove again once the migration is done.

@Ladsgroup am I correct in thinking we can move forward with this now?
It has certainly been 2 weeks since the announcement?

The announcement mentioned the 23rd of August as the deploy date. Let me double check.

Yup:

This change is currently available for testing at test.wikidata.org. It will be deployed on Wikidata on August 23rd. You are welcome to give us general feedback by leaving a comment in this ticket.

Change 714322 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Set request languages rdf output for wikidata to true

https://gerrit.wikimedia.org/r/714322

Change 714322 merged by jenkins-bot:

[operations/mediawiki-config@master] Set request languages rdf output for wikidata to true

https://gerrit.wikimedia.org/r/714322

Mentioned in SAL (#wikimedia-operations) [2021-08-23T07:44:39Z] <ladsgroup@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:714322|Set request languages rdf output for wikidata to true (T285795)]] (duration: 00m 57s)

Change 735394 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/mediawiki-config@master] Remove tmpUseRequestLanguagesForRdfOutput Wikibase setting

https://gerrit.wikimedia.org/r/735394

Change 735394 merged by jenkins-bot:

[operations/mediawiki-config@master] Remove tmpUseRequestLanguagesForRdfOutput Wikibase setting

https://gerrit.wikimedia.org/r/735394

Mentioned in SAL (#wikimedia-operations) [2021-11-10T12:32:50Z] <lucaswerkmeister-wmde@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:735394|Remove tmpUseRequestLanguagesForRdfOutput Wikibase setting (T285795)]] (1/2) (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2021-11-10T12:34:03Z] <lucaswerkmeister-wmde@deploy1002> Synchronized wmf-config/Wikibase.php: Config: [[gerrit:735394|Remove tmpUseRequestLanguagesForRdfOutput Wikibase setting (T285795)]] (2/2) (duration: 00m 56s)