
WDQS lag detection required manual adjustment during DC switchover
Open, High, Public

Description

During today's DC switchover, the WDQS lag checking/monitoring required manual adjustment: https://gerrit.wikimedia.org/r/701927

I'm marking this as high priority because it ended up causing user impact: it affected bots that check the lag before editing.

On IRC @Gehel said that the long-term fix for this is T244590: [Epic] Rework the WDQS updater as an event driven application. How long-term are we looking at? Will that be in place for the switch back in ~1 month? Or do we also need a short term solution?

Event Timeline

Legoktm triaged this task as High priority. Jun 28 2021, 5:50 PM
Legoktm created this task.
Addshore moved this task from Inbox to External Realm on the wdwb-tech board.
Addshore moved this task from incoming to monitoring on the Wikidata board.

It is unlikely that the long term fix (T244590) will be in place for the switch back. The simple workaround (but far from ideal) is to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/701927/ during the switch back. The more complex solution would be to implement better metrics in the current updater, but I doubt we will have time to do that before the switch back.

@Gehel what ends up consuming that value? Can we have it read the primary DC from conftool?

For now I've documented this as a manual step: https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter&type=revision&diff=1920831&oldid=1920827

The system is passed a topic name at startup that is used as a reference for the lag (https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/rdf/+/refs/heads/master/tools/src/main/java/org/wikidata/query/rdf/tool/change/KafkaPoller.java#365). The updater then only reports the timestamp of this topic to Blazegraph, which is then consumed by a Prometheus exporter.
The way this "timestamp" (lag) is determined should be changed to either:

  • dynamically determine the "reportingTopic" by calling conftool to find out where eventgate is pushing MW events
  • do what the new updater does (T244590): compute an average of the timestamps instead (see the sketch below)
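
To make the difference between the two options concrete, here is a minimal, hypothetical sketch. It is not the actual KafkaPoller code; the class and method names (LagTracker, observe, singleTopicTimestamp, averagedTimestamp, lag) are made up for illustration. It only shows how a reference timestamp could be derived from polled Kafka records either from a single configured topic (roughly the current behaviour) or as an average over all polled topics (roughly what the new updater does):

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

// Illustrative sketch only, not the real updater code.
class LagTracker {
    private final Map<String, Instant> latestByTopic = new HashMap<>();
    private final Clock clock;

    LagTracker(Clock clock) {
        this.clock = clock;
    }

    /** Remember the newest event timestamp seen per topic in a poll batch. */
    void observe(ConsumerRecords<String, String> batch) {
        for (ConsumerRecord<String, String> record : batch) {
            latestByTopic.merge(record.topic(),
                    Instant.ofEpochMilli(record.timestamp()),
                    (a, b) -> a.isAfter(b) ? a : b);
        }
    }

    /**
     * Roughly the current behaviour: the reference is one DC-specific topic
     * chosen at startup. If that topic stops receiving events (e.g. after a
     * DC switchover), this timestamp stops advancing and the reported lag
     * grows even though the updater keeps up on the other topics.
     */
    Instant singleTopicTimestamp(String reportingTopic) {
        return latestByTopic.getOrDefault(reportingTopic, Instant.EPOCH);
    }

    /** Alternative: average the latest timestamps across all polled topics. */
    Instant averagedTimestamp() {
        if (latestByTopic.isEmpty()) {
            return Instant.EPOCH;
        }
        long avgMillis = (long) latestByTopic.values().stream()
                .mapToLong(Instant::toEpochMilli)
                .average()
                .getAsDouble();
        return Instant.ofEpochMilli(avgMillis);
    }

    /** Lag is simply "now minus the reference timestamp". */
    Duration lag(Instant reference) {
        return Duration.between(reference, clock.instant());
    }
}
```

The averaged variant avoids having to pick a DC-specific topic at startup, which is what made the manual adjustment necessary during the switchover.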

Because none of these changes would be trivial, I think we'd prefer to wait for the new system to be in place.

> Because none of these changes would be trivial, I think we'd prefer to wait for the new system to be in place.

Fair enough, thanks for the explanation. Could you add whichever task tracks deploying the new updater as a blocker of this one?