On https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m I observe that there are always about 20k fewer triples on wdqs10xx (eqiad) than on wdqs20xx (codfw) at any given moment. This is strange, because the lag of the servers does not differ significantly, and this phenomenon has persisted since at least last year. I am not sure what causes it. The gap was about 10k in October 2024, then 20k in December 2024.
The discrepancies in the number of triples between the two datacenters can be explained by the way updates are computed.
The streaming updater responsible for computing updates runs independently in eqiad & codfw and thus produces two update streams which, under perfect conditions, would contain exactly the same updates. In reality they can differ for several reasons:
- Fetching the content of a Wikibase item might fail in codfw but succeed in eqiad, leading to a reconcile event in one DC (this is particularly true in case of outages in a single datacenter)
- Late events might not be the same: events are transferred between our two Kafka clusters using MirrorMaker, and late events also trigger a reconcile event
Reconcile events can lead to slightly different numbers of triples, mainly because of orphaned references/values and sitelinks (see T302189).
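To illustrate the mechanism described above, here is a minimal sketch (function and entity names invented, not the actual updater code) of how a single failed fetch followed by a reconcile can leave one DC with a few extra orphaned triples even though both DCs process the same logical updates:

```python
def apply_updates(updates, fetch_fails=frozenset()):
    """Return the resulting triple count for one DC.

    `updates` is a list of (entity_id, triple_count) pairs. Entities in
    `fetch_fails` simulate a failed content fetch: a reconcile event
    later re-imports the entity, but shared reference/value nodes may be
    left orphaned, adding a few stray triples (here assumed to be 2).
    """
    triples = 0
    for entity, triple_count in updates:
        if entity in fetch_fails:
            triples += triple_count + 2  # reconcile leaves 2 orphans
        else:
            triples += triple_count
    return triples

updates = [("Q1", 100), ("Q2", 50), ("Q3", 75)]
eqiad = apply_updates(updates)                      # clean run
codfw = apply_updates(updates, fetch_fails={"Q2"})  # one reconcile
print(codfw - eqiad)  # → 2, a small persistent divergence
```

The point is that each reconcile contributes only a handful of triples, but since the two streams fail independently, the counts drift apart over time without either DC being wrong about any individual entity.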
I think this explains most of the discrepancies we see here. It is very possible that there are real discrepancies too, but we assume that they are acceptable: perfect consistency is not something we can afford for WDQS.
As a comparison, prior to having the streaming updater in place, the number of triples differed on every single node even within the same DC, so this "problem" is not entirely new.
Please see below the evolution of the difference between wdqs2016 (codfw) and wdqs1014 (eqiad) between April 2024 and now:
The difference evolved from 30k to 10k to 20k, which I believe is acceptable (30k is about 0.00018% of the total number of triples).
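As a quick sanity check of that percentage, assuming a total graph size on the order of 16.7 billion triples (an assumption consistent with the figure quoted above, not a number from this ticket):

```python
# Back-of-the-envelope check: what fraction of the graph is a 30k gap?
total_triples = 16_700_000_000  # assumed approximate WDQS graph size
diff = 30_000

print(f"{diff / total_triples:.5%}")  # → 0.00018%
```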
(The bump on 2024-12-02 could be worth investigating to determine whether an outage happened.)
Generally speaking, I think that monitoring "triple divergences" is a better way to assess the accuracy of the state of WDQS vs Wikidata: https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1&var-site=eqiad&var-k8sds=eqiad%20prometheus%2Fk8s&var-opsds=eqiad%20prometheus%2Fops&var-cluster_name=wdqs. There we count the number of triples per hour that are unexpectedly present or missing when applying an update; if this number stays constantly above 30/hour, we assume that we need to perform a full data-reload.
I'm tentatively declining this ticket because the difference is expected; please feel free to re-open if you disagree. If, on the other hand, you notice actual discrepancies between WDQS and Wikidata, please let us know so that we can investigate the reasons and possibly enhance the system.
