Update Wikidata unit conversion config (normalized quantities)
Closed, Resolved | Public | 5 Estimated Story Points

Description

As a Wikidata editor, I want to work with normalized quantities in the query service, even if the source data is specified using rather unusual units.

Problem:
The unit conversion (configured in unitConversionConfig.json) has not been updated in three years (last changed in August 2017). In the meantime, a lot more units on Wikidata have gained conversions to standard SI units: re-running the maintenance script which generates the config file grows it by a factor of three, and @Toni_001 states that over a thousand units were added (and also that the current state of units on Wikidata is overall much better anyways). See project chat discussion (permalink).

However, it looks like we’ll need to update that maintenance script a bit before it can be used – when I ran it, the conversions it produced converted units like millimetre and centimetre to themselves (with a factor of 1) rather than to multiples of metre with the appropriate factor. (Personal comment: the script also has a few parameters for configurable Item IDs and then hard-codes a whole bunch of other Item and Property IDs, so I expect it’s not useful for third parties at the moment. Maybe we can improve that too?)
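
A rough sketch of how the self-conversion problem could be spotted in a generated file. It assumes the config is a JSON object keyed by unit Item ID, with each entry holding a "factor" and a target "unit"; that layout, the sample entries, and the function name are assumptions for illustration, not taken from the script. Base units legitimately map to themselves with factor 1, so they have to be passed in explicitly.

```python
import json

# Illustrative sample only: the layout and the Item IDs are assumptions.
SAMPLE_CONFIG = json.loads("""
{
  "Q11573":  {"factor": "1",    "unit": "Q11573",  "label": "metre"},
  "Q174789": {"factor": "1",    "unit": "Q174789", "label": "millimetre"},
  "Q174728": {"factor": "0.01", "unit": "Q11573",  "label": "centimetre"}
}
""")


def suspicious_self_conversions(config, base_units):
    """Return IDs of units that convert to themselves but are not known base units."""
    return sorted(
        unit_id
        for unit_id, entry in config.items()
        if entry["unit"] == unit_id and unit_id not in base_units
    )


if __name__ == "__main__":
    # Only the metre entry is expected to map to itself in this sample.
    print(suspicious_self_conversions(SAMPLE_CONFIG, base_units={"Q11573"}))
```

Anything this reports would still need a human look, since the set of genuine base units has to be supplied by hand.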

Example:
(None yet – there doesn’t seem to be any popular unit that’s currently grotesquely wrong and would require an update ASAP, just a bunch of smaller units that could use the improvement.)

Acceptance criteria:

  • WDQS unit conversion table is updated based on current data from Wikidata

Open questions:

  • Can we get this done ahead of the apparently-upcoming WDQS reload (T267175#6613631)? Otherwise the changes won’t be fully effective until the next reload.
  • @Gehel would it cause any issues with the streaming updater if the same revisions produced different RDF due to changes to the unit conversion config? Do we need to coordinate deploying this with you?

Event Timeline

Dev note: as far as I could tell, the script doesn’t connect to the local wiki at all, so you don’t need production access to test it – just run it locally with --sparql https://query.wikidata.org/sparql and you should get a JSON file corresponding to Wikidata.

Gehel added subscribers: Zbyszko, dcausse.

@dcausse / @Zbyszko : can you answer for the impact on the streaming updater?

We don't yet have a way to fix a journal when changes are made without a new revision; all changes of this kind require special handling and might affect a huge number of triples. Therefore a full reload is the only solution we have at the moment for this kind of operation (both for the current updater and the future streaming updater).

During the transition (unitConversionConfig.json changed but the Blazegraph journal not yet reloaded from fresh dumps), there will be some leaked triples, and the streaming updater is likely to leak a few more triples than the current one.

I'm marking this task as blocking the next reload so that we don't forget about it.

Thanks for fixing this script!

A few notes on the next steps and how we should synchronize our efforts:

Once the script has updated the JSON file read by Wikibase, we will have to re-import the WDQS machines using a dump generated with the new units.
Given current schedules, I think the ideal plan is:

  • update and deploy unitConversionConfig.json on a Friday, before the lexeme dump is started (23:00 UTC on Fridays)
  • once the TTL dumps are available (lexemes and all), generally on the following Thursday, start the cookbook to reimport one WDQS machine
  • once the import is done and the lag is absorbed (one to two weeks), use the data-transfer cookbook to replicate the fresh journal to the other WDQS machines
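
To make the timing in the plan above concrete, here is a small sketch that computes the deploy deadline and the earliest reimport date from a given moment. The 23:00 UTC Friday dump start comes from the list; treating "the following Thursday" as Thursday 00:00 UTC is purely an assumption, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

FRIDAY, THURSDAY = 4, 3  # values returned by datetime.weekday()


def next_weekday_at(now, weekday, hour):
    """First moment strictly after `now` that falls on `weekday` at `hour`:00."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


def reload_plan(now):
    """Return (deploy deadline = dump start, earliest reimport start)."""
    dump_start = next_weekday_at(now, FRIDAY, 23)
    # Assumption: dumps are treated as available from Thursday 00:00 UTC.
    reimport_start = next_weekday_at(dump_start, THURSDAY, 0)
    return dump_start, reimport_start


if __name__ == "__main__":
    deadline, reimport = reload_plan(datetime.now(timezone.utc))
    print("deploy config before:", deadline.isoformat())
    print("earliest reimport:   ", reimport.isoformat())
```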

As to when we should schedule:

  • we are still waiting on T267175 to make sure we fix all outstanding synchronization issues before moving forward with a fresh import

Alright, I uploaded a new unitConversionConfig.json at https://gerrit.wikimedia.org/r/657131; I’ve -2ed it for now due to the unclarity about synchronization issues. Lexeme dumps are currently blocked, though (T220883).

Side note: I did find three units that are currently off by a factor of ten, see the commit message of the mediawiki-config change; the only one that could be popular is probably the technical atmosphere, and even that is apparently only used in a single statement. (Several units underwent minor changes, though, such as the light-year.)

Thanks!

Lexeme RDF dumps are functional and are the ones WDQS is using.

Moving to waiting as T267175 is the last ticket still blocking this and is on the search team's plate.

Ah, right, I mixed up the dumps. Sorry :D

To test this, we can query for the normalized value of the statement “conversion to standard unit: 1 technical atmosphere”, which is present on kilogram-force per square centimetre and on the sandbox item. On the former item, the statement has existed for a while, and the item hasn’t been edited yet since the config change was deployed earlier today, so the query service still has data that was generated using the old unit conversion config. On the sandbox item, I just added the statement, so if the query service has it then it’s using the new config.

SELECT ?subject ?subjectLabel ?amount WHERE {
  VALUES ?subject {
    wd:Q13582667
    wd:Q4115189
  }
  ?subject p:P2442/psn:P2442/wikibase:quantityAmount ?amount.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

subject        subjectLabel                            amount
wd:Q13582667   kilogram-force per square centimetre    9806.65
wd:Q4115189    Wikidata Sandbox                        98066.5

A look at the Wikipedia article tells us that one kilogram-force per square centimetre, aka one technical atmosphere, ought to be 98.0665 kPa or 98066.5 Pa, not 9806.65 Pa as under the old config.
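
The expected value can also be checked from first principles: one kilogram-force is 1 kg times standard gravity (9.80665 m/s², exact by definition), spread over 1 cm² = 0.0001 m². A minimal sketch of that arithmetic, using Decimal to avoid float rounding (the names are just for illustration):

```python
from decimal import Decimal

STANDARD_GRAVITY = Decimal("9.80665")  # m/s^2, exact by definition
CM2_IN_M2 = Decimal("0.0001")          # 1 cm^2 expressed in m^2


def technical_atmosphere_in_pascal():
    """1 kgf/cm^2 in Pa: (1 kg * g0) newtons divided by the area in m^2."""
    force_newton = Decimal(1) * STANDARD_GRAVITY
    return force_newton / CM2_IN_M2


old_value = Decimal("9806.65")  # amount returned under the old config
new_value = Decimal("98066.5")  # amount returned under the new config

if __name__ == "__main__":
    print(technical_atmosphere_in_pascal() == new_value)  # new config matches
    print(new_value / old_value == 10)                    # old one is off by a factor of ten
```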

(An easier test might be to look for one of the many new units that were added with the new config, but I don’t have an example for that.)

Addshore claimed this task.
Addshore added a subscriber: Addshore.

I gather the reload is tracked under T267927: Reload wikidata journal from fresh dumps so I'll close this one!