Page MenuHomePhabricator

[LEX] Empty senses/forms lists presentation in dump
Closed, ResolvedPublicBUG REPORT

Description

Non-empty "senses" and "forms" tags are presented as arrays in lexemes dumps. But empty lists are presented as objects.

How to reproduce

Non-empty tags are presented as arrays:

"senses":[{"id":"L4-S1",...

But empty tags are looked as objects:

"senses":{}

Expected behavior
Empty lists should be presented as arrays also:

"senses":[]

Notes
Before completing this we must consider:

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

This is sort of the inverse of T138104 and T241422: usually, empty PHP arrays get emitted as empty lists in JSON, and we have to do some extra work to produce empty objects instead (because claims should be an object). Here it’s apparently the other way around – the dumps seem to default to empty objects, but for forms and senses we want lists. We need to make sure that we don’t accidentally turn empty claims back into lists, though.

ItamarWMDE renamed this task from Empty senses/forms lists presentation in dump to [LEX] Empty senses/forms lists presentation in dump.Nov 9 2023, 10:12 AM

Change 982847 had a related patch set uploaded (by Arthur taylor; author: Arthur taylor):

[mediawiki/extensions/Wikibase@master] WIP Return stdClass/Object from Serializers for empty lists

https://gerrit.wikimedia.org/r/982847

Change 983159 had a related patch set uploaded (by Arthur taylor; author: Arthur taylor):

[mediawiki/extensions/WikibaseLexeme@master] Change expected serialization format of JSON dumps to include arrays

https://gerrit.wikimedia.org/r/983159

Change 983162 had a related patch set uploaded (by Arthur taylor; author: Arthur taylor):

[mediawiki/extensions/Wikibase@master] Added strict type and type info to modified files

https://gerrit.wikimedia.org/r/983162

Change 983081 had a related patch set uploaded (by Arthur taylor; author: Arthur taylor):

[mediawiki/extensions/WikibaseLexeme@master] Disable JSON Dump tests to prepare for schema change in Wikibase

https://gerrit.wikimedia.org/r/983081

Before completing this we must consider:

Hm, but this is a bit tricky. Since this issue only affects the dumps (and not Special:EntityData nor wbgetentities), and we don’t seem to have testwikidatawiki entity dumps (compare wikidatawiki, see entities/), it’s not clear to me how we can fulfill the requirement that “The change will be available for testing at least two weeks before deployment on https://test.wikidata.org/” if we consider this a breaking change :|

Before completing this we must consider:

Hm, but this is a bit tricky. Since this issue only affects the dumps (and not Special:EntityData nor wbgetentities), and we don’t seem to have testwikidatawiki entity dumps (compare wikidatawiki, see entities/), it’s not clear to me how we can fulfill the requirement that “The change will be available for testing at least two weeks before deployment on https://test.wikidata.org/” if we consider this a breaking change :|

Meh. That's unfortunate. I guess one way would be to ping the folks dealing with dumps and see if we can (temporarily?) enable them for test.wikidata.org as well?

Alternatively, how did we handle this in T138104 or T241422? Or is it that we are not "changing behavior", but "fixing a bug"?

In any case, we certainly should start the communication around this next week.

@Arian_Bozorg re. your comment above - how do you want to proceed here? If we consider this a breaking change, we would need to selectively enable the functionality, and need to setup JSON dumps on testwiki. If we don't consider this a breaking change I guess we can be a little less rigorous about the rollout. Right now, I have a flag for the change in functionality, but it's not exposed as a config option - should I create the option?

Alternatively, how did we handle this in T138104 or T241422? Or is it that we are not "changing behavior", but "fixing a bug"?

T138104 explicitly only affected the API and not the dumps. T241422 doesn’t mention the dumps, but from looking at the main fix commit, I suspect it also only affected the API. So we’re in new territory here as far as I can tell.

Thanks everyone, I think Michael's option works well for us here. @Masssly will be in touch with the people working dumps tomorrow. And we can turn on the config option before so that they can see how it can affect them.

Let's wait for the response from the people pinged before we move ahead on this. cc: @Masssly

@ArielGlenn, @BTullis, @Milimetric, @JEbe-WMF, @xcollazo: Apologies if you've received notifications elsewhere. We are planning to address the issue of how empty forms and senses are presented in dumps by changing empty lists from objects to arrays, aligning with the presentation of non-empty tags. Since you're involved in generating the dumps, we would appreciate hearing your thoughts before moving forward. Additionally, we are exploring the feasibility of making the fixed dumps available for testing on test.wikidata.org

Since you're involved in generating the dumps, we would appreciate hearing your thoughts before moving forward.

I have no objection in fixing what you folks all agree is a bug. It is also my interpretation that the Stable Interface Policy applies to wikidata.org and not to the dumps.

I do think we should announce and document the change as per your policy, and also on https://lists.wikimedia.org/postorius/lists/xmldatadumps-l.lists.wikimedia.org/ out of courtesy to dumps users, but note these entities dumps are separate from our main database dumps.

Additionally, we are exploring the feasibility of making the fixed dumps available for testing on test.wikidata.org

Is there any use for the proposed testwikidatawiki dumps other than to do the testing proposed in this task? If not, I would rather not setup a dump just for testing.

Additionally, we are exploring the feasibility of making the fixed dumps available for testing on test.wikidata.org

Is there any use for the proposed testwikidatawiki dumps other than to do the testing proposed in this task? If not, I would rather not setup a dump just for testing.

IIRC our thinking (in some meeting where we talked about this) was that, even though this change to the dumps is pretty minor and harmless, it would still be useful to set up testwikidatawiki dumps in general, so that we have them available when we make other changes to the dumps that might require more testing (which will probably happen sooner or later).

After removing the (used) OPTION_USE_OBJECTS_FOR_MAPS (per T353626), I looked at what it would mean to just apply OPTION_USE_OBJECTS_FOR_EMPTY_MAPS change globally (and thereby simply eliminate the need for the option). The effect of making the change globally is that some tests fail because of interactions with the ApiResult serialization.

To avoid making global changes to the way we serialize objects, my preference would be to use the flag - it seems like too large a can of worms to open to address such a minor serialization issue.

After removing the (used) OPTION_USE_OBJECTS_FOR_MAPS (per T353626), I looked at what it would mean to just apply OPTION_USE_OBJECTS_FOR_EMPTY_MAPS change globally (and thereby simply eliminate the need for the option). The effect of making the change globally is that some tests fail because of interactions with the ApiResult serialization.

To avoid making global changes to the way we serialize objects, my preference would be to use the flag - it seems like too large a can of worms to open to address such a minor serialization issue.

I see, that’s unfortunate. In the long term, I feel like it would be better for the serializers to directly return all the information they can (not just objects for all maps, but even the extra metadata that’s currently tacked on in ResultBuilder::getEntitySerializationWithMetaData()), and for later parts of the stack to strip out the metadata they don’t need again – but that would be a lot more effort. Let’s just limit this option to the dumps for now, then.

It is also my interpretation that the Stable Interface Policy applies to wikidata.org and not to the dumps.

That’s actually a good question… the policy (permalink) mentions the RDF and JSON dumps in “Stable Data Formats”, but doesn’t actually list them in “Stable Public APIs” – but neither are they listed in “Unstable Interfaces”. I feel like this is probably just an oversight, and they should be considered stable interfaces – we’ve mentioned them in some breaking change announcements (example), and some of the discussions on the policy talk page also involve the dumps. But this is probably up to Product to decide.

Yeah I'd say this is an oversight and they should be considered stable.

I think the code is ready and we need a product decision – do we (ask WMF to) set up dumps for Test Wikidata, so we can first roll out the change there? Or do we just announce this and then (after 1 week, 2 weeks, whatever) deploy it to Wikidata for the next round of dumps there, without the ability to test it on Test Wikidata?

I don't think we can invest the time right now to do the test dumps, so let's deploy it even if it can't be tested on Test Wikidata unfortunately.

Alright, then we just need to settle the timing I think. The lexeme JSON dumps run each Wednesday (per Puppet), and according to systemd on snapshot1008 they seem to finish in less than an hour, so I think we could for instance send out the announcement tomorrow, let the dumps of 2024-01-17, -24 and -31 still run with the old code (and the “bad” empty senses/form representation), then deploy the fix between 2024-02-01 and 2024-02-06 and have the dumps of 2024-02-07 as the first fixed dumps. Plus or minus any number of weeks if we want (e.g. if we already want the new code for the 2024-01-31 dumps). WDYT?

That sounds good to me. Let's prepare an announcement to send out tomorrow.

The announcement says that we’ll deploy this on 8 February (so the first dump with the new code will be on 14 February); until then, I think this task is stalled, and the best column is probably In Development? Or maybe Waiting for Deployment Window?

Lucas_Werkmeister_WMDE changed the task status from Stalled to Open.Feb 9 2024, 11:34 AM

Unstalling, and I’ve just +2ed the relevant changes. I think on Monday we can backport it to wmf.17 (so that it’s deployed regardless of whether the train reaches group1 on Wednesday before or after the dumps start), and run a few quick test dumps to check that the fix works.

Change 983081 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Disable JSON Dump tests to prepare for schema change in Wikibase

https://gerrit.wikimedia.org/r/983081

Change 982847 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Return stdClass/Object from Serializers for empty lists

https://gerrit.wikimedia.org/r/982847

Change 983162 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Added strict type and type info to modified files.

https://gerrit.wikimedia.org/r/983162

Change 983159 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Change expected serialization format of JSON dumps to include arrays

https://gerrit.wikimedia.org/r/983159

Change 1002431 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Arthur taylor):

[mediawiki/extensions/WikibaseLexeme@wmf/1.42.0-wmf.17] Disable JSON Dump tests to prepare for schema change in Wikibase

https://gerrit.wikimedia.org/r/1002431

Change 1002432 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Arthur taylor):

[mediawiki/extensions/Wikibase@wmf/1.42.0-wmf.17] Return stdClass/Object from Serializers for empty lists

https://gerrit.wikimedia.org/r/1002432

Change 1002433 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Arthur taylor):

[mediawiki/extensions/WikibaseLexeme@wmf/1.42.0-wmf.17] Change expected serialization format of JSON dumps to include arrays

https://gerrit.wikimedia.org/r/1002433

Change 1002431 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@wmf/1.42.0-wmf.17] Disable JSON Dump tests to prepare for schema change in Wikibase

https://gerrit.wikimedia.org/r/1002431

Change 1002432 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@wmf/1.42.0-wmf.17] Return stdClass/Object from Serializers for empty lists

https://gerrit.wikimedia.org/r/1002432

Change 1002433 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@wmf/1.42.0-wmf.17] Change expected serialization format of JSON dumps to include arrays

https://gerrit.wikimedia.org/r/1002433

Mentioned in SAL (#wikimedia-operations) [2024-02-12T16:12:59Z] <logmsgbot> lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:1002431|Disable JSON Dump tests to prepare for schema change in Wikibase (T305660)]], [[gerrit:1002432|Return stdClass/Object from Serializers for empty lists (T305660)]], [[gerrit:1002433|Change expected serialization format of JSON dumps to include arrays (T305660)]]

Mentioned in SAL (#wikimedia-operations) [2024-02-12T16:14:22Z] <logmsgbot> lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1002431|Disable JSON Dump tests to prepare for schema change in Wikibase (T305660)]], [[gerrit:1002432|Return stdClass/Object from Serializers for empty lists (T305660)]], [[gerrit:1002433|Change expected serialization format of JSON dumps to include arrays (T305660)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-02-12T16:22:42Z] <logmsgbot> lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:1002431|Disable JSON Dump tests to prepare for schema change in Wikibase (T305660)]], [[gerrit:1002432|Return stdClass/Object from Serializers for empty lists (T305660)]], [[gerrit:1002433|Change expected serialization format of JSON dumps to include arrays (T305660)]] (duration: 09m 42s)

This is deployed now and should take effect with the next dumps; I did some tests on mwdebug2002, and as far as I can tell, lexemes are indeed fixed ("senses": []) while items didn’t regress ("claims": {}).

The lexeme part of this looks good to me, I’m seeing "senses":[] and "forms":[] in

curl -s 'https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.json.gz' | gunzip | grep -E --color=yes '"(forms|senses)":[[{][]}]' | head -25

I guess we should leave this open until the next non-lexeme dumps to check that items still look correct as well.

Lucas_Werkmeister_WMDE assigned this task to ArthurTaylor.

The non-lexeme dump also looks good:

$ curl -sL https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz | gunzip | grep -E --color=yes '"claims":[[{][]}]' | head -25
{"type":"item","id":"Q23678","labels":{"gl":{"language":"gl","value":"B-2C"},"en":{"language":"en","value":"B-2C"},"es":{"language":"es","value":"B-2C"},"de":{"language":"de","value":"B-2C"}},"descriptions":{"gl":{"language":"gl","value":"Motor foguete desenvolvido polos Estados Unidos."},"en":{"language":"en","value":"rocket engine"},"es":{"language":"es","value":"nave espacial"},"de":{"language":"de","value":"Raketentriebwerk"}},"aliases":{},"claims":{},"sitelinks":{"glwiki":{"site":"glwiki","title":"B-2C","badges":[]}},"pageid":27069,"ns":0,"title":"Q23678","lastrevid":73962379,"modified":"2013-10-01T16:58:46Z"},
{"type":"item","id":"Q56405","labels":{"zh":{"language":"zh","value":"\u683c\u66fc\u8bed"},"br":{"language":"br","value":"Midjoueg"},"gv":{"language":"gv","value":"Miju-Mishmish"}},"descriptions":{},"aliases":{"gv":[{"language":"gv","value":"Deng Geman"},{"language":"gv","value":"Deng Kaman"}]},"claims":{},"sitelinks":{"brwiki":{"site":"brwiki","title":"Midjoueg","badges":[]}},"pageid":59180,"ns":0,"title":"Q56405","lastrevid":16956037,"modified":"2013-03-30T11:03:14Z"},
{"type":"item","id":"Q282911","labels":{"hu":{"language":"hu","value":"Az alkohol hat\u00e1sai"},"ro":{"language":"ro","value":"Efectele etanolului asupra organismului"}},"descriptions":{},"aliases":{},"claims":{},"sitelinks":{"rowiki":{"site":"rowiki","title":"Efectele etanolului asupra organismului","badges":[]},"huwiki":{"site":"huwiki","title":"Az alkohol hat\u00e1sa az emberi szervezetre","badges":[]}},"pageid":273471,"ns":0,"title":"Q282911","lastrevid":1396249191,"modified":"2021-04-05T09:52:56Z"},

There are some "claims":{} in there and no "claims":[].

I think this is done \o/