
Provide a way to add new unit normalizations to the query service without a full reload
Closed, Resolved · Public

Description

We want to be able to add normalization for additional units without needing a full reload.

Note that we don't need to be able to change an existing conversion; that would be nice to have, but it's not a requirement.

Possible implementation strategy, option I:

  • Find all values referencing the respective unit using a SPARQL query
  • Find the statements using those values, and the items using those statements

...option II:

  • Find all values referencing the respective unit using a SPARQL query
  • Find the statements using those values
  • Compute the normalized values using the new mapping, and add them to the triple store.
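
The value-lookup step shared by options I and II could look roughly like the sketch below, assuming the standard Wikibase RDF model in which quantity value nodes carry wikibase:quantityUnit and wikibase:quantityAmount; the endpoint URL and the unit item are illustrative placeholders, not part of the proposal itself:

```python
# Sketch: find all quantity value nodes that reference a given unit.
# The endpoint and the unit item URI are example placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"
UNIT_URI = "http://www.wikidata.org/entity/Q712226"  # example unit item

QUERY = """
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?value ?amount WHERE {
  ?value wikibase:quantityUnit <%s> ;
         wikibase:quantityAmount ?amount .
}
""" % UNIT_URI

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["value"]["value"], row["amount"]["value"])
```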

...option III:

  • Find all statement values referencing the respective unit while scanning a JSON dump
  • Compute the normalized values using the new mapping, and output them (as N-Triples or Turtle).
  • Bulk-load the new triples into the query service
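
A minimal sketch of the dump-scanning variant, assuming the usual Wikidata JSON dump layout (one entity per line inside a JSON array) and a hypothetical conversion factor; it only inspects main snaks and leaves building the actual value-node URIs out of scope:

```python
# Sketch: scan a Wikidata JSON dump for quantity values in a given unit and
# compute their normalized amounts. Dump path, unit items and the conversion
# factor are assumptions for illustration; only main snaks are inspected.
import gzip
import json
from decimal import Decimal

UNIT_URI = "http://www.wikidata.org/entity/Q712226"    # example source unit
TARGET_URI = "http://www.wikidata.org/entity/Q25343"   # example target (SI) unit
FACTOR = Decimal("1000000")                            # hypothetical factor

with gzip.open("wikidata-all.json.gz", "rt") as dump:
    for line in dump:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # skip the array brackets around the dump
        entity = json.loads(line)
        for statements in entity.get("claims", {}).values():
            for statement in statements:
                snak = statement.get("mainsnak", {})
                value = snak.get("datavalue", {}).get("value", {})
                if snak.get("datatype") == "quantity" and value.get("unit") == UNIT_URI:
                    normalized = Decimal(value["amount"]) * FACTOR
                    # A real tool would emit Turtle here, using the value-node
                    # URIs derived from the value hashes in the RDF dump.
                    print(entity["id"], normalized, TARGET_URI)
```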


Event Timeline

Right now I think this should be the plan:

The tool gets two .json config files: the new config and the old config (the old config is optional). Then:

  1. Diff the configs and produce list of new units
  2. For each new primary unit:
    1. Run a SPARQL query to find all values using it, and generate self-referencing normalized statements with wikibase:quantityNormalized
    2. Run a SPARQL query to find all statements using those values (if there are too many, we may have to split the query into batches), and generate parallel normalized statements for those, with the same value.
  3. For each new non-primary unit:
    1. Run a SPARQL query to find all values using it, and generate a new converted value for each one. Generate RDF for those new values, and also wikibase:quantityNormalized statements on the old values.
    2. Run a SPARQL query to find all statements using those values, and generate parallel normalized statements for those, with the new converted value.
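
A skeleton of step 1 and the primary/non-primary split could look like this; the config layout (a JSON object keyed by unit item ID, with "factor" and "unit" fields, a primary unit pointing at itself) and the file names are assumptions for illustration, not a guaranteed description of the actual unit conversion config:

```python
# Sketch: diff the old and new unit configs and classify the new units.
# The config layout (object keyed by unit item ID with "factor"/"unit"
# fields) and the file names are assumptions for illustration only.
import json

def load_config(path):
    if path is None:  # the old config is optional
        return {}
    with open(path) as f:
        return json.load(f)

def new_units(old_path, new_path):
    """Unit IDs present in the new config but not in the old one."""
    old, new = load_config(old_path), load_config(new_path)
    return {uid: conf for uid, conf in new.items() if uid not in old}

def is_primary(unit_id, conf):
    """A primary unit converts to itself (its target unit is itself)."""
    return conf.get("unit") == unit_id

if __name__ == "__main__":
    for unit_id, conf in new_units("units.old.json", "units.new.json").items():
        kind = "primary" if is_primary(unit_id, conf) else "non-primary"
        print(unit_id, kind)  # dispatch to steps 2.x or 3.x accordingly
```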

The output of the tool will be RDF/TTL that can be bulk-loaded into the instance.
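
For a rough idea of the emitted triples, a helper like the one below could format the wikibase:quantityNormalized links as N-Triples/Turtle; the value-node hashes are placeholders, the real ones would come from the query results:

```python
# Sketch: format a wikibase:quantityNormalized triple. Node hashes are
# placeholders; real value-node URIs come from the SPARQL results above.
WIKIBASE = "http://wikiba.se/ontology#"
VALUE_NS = "http://www.wikidata.org/value/"

def normalized_triple(value_hash, normalized_hash):
    return "<%s%s> <%squantityNormalized> <%s%s> ." % (
        VALUE_NS, value_hash, WIKIBASE, VALUE_NS, normalized_hash)

# For a primary unit the value node references itself:
print(normalized_triple("abc123", "abc123"))
```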

We need to see whether we will be able to hold all the values described in memory. So far the most popular unit, square kilometre, has 13398 usages; I think it should not be a problem to hold all of them in memory.

Change 312627 had a related patch set uploaded (by Smalyshev):
Script to produce RDF mappings for new normalized units

https://gerrit.wikimedia.org/r/312627

Change 312627 merged by jenkins-bot:
Script to produce RDF mappings for new normalized units

https://gerrit.wikimedia.org/r/312627

Change 319402 had a related patch set uploaded (by Smalyshev):
Script to produce RDF mappings for new normalized units

https://gerrit.wikimedia.org/r/319402

Change 319402 abandoned by Smalyshev:
Script to produce RDF mappings for new normalized units

Reason:
we can do it without backporting

https://gerrit.wikimedia.org/r/319402