
No data after 20170517193000 available via Quarry from tables (recentchanges, revision, logging) for several Mediawiki databases (svwiki_p, fiwiki_p, nowiki_p, ...)
Closed, Resolved · Public

Description

When using Quarry to retrieve data from MediaWiki database tables, no data after 20170517193000 is available in the recentchanges, revision, or logging tables for several xxwiki_p databases (where xx = sv, fi, no, nl, pl, tr, it, pt, ...).
For other databases, such as yywiki_p (where yy = en, de, ft, es, da, ru, et, la, lt, lv, nn, rp, ceb, ja, ...), data is available as usual.

SQL:
use svwiki_p;
SELECT NOW(), rc_timestamp FROM recentchanges
ORDER BY rc_timestamp DESC
LIMIT 5

This returns only the following:

NOW(),rc_timestamp
2017-05-18T22:54:38,20170517192932
2017-05-18T22:54:38,20170517192932
2017-05-18T22:54:38,20170517192932
2017-05-18T22:54:38,20170517192921
2017-05-18T22:54:38,20170517192921

That is, more than 27 hours of data is missing.
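The size of the gap can be checked by parsing the newest rc_timestamp (MediaWiki format, YYYYMMDDHHMMSS) and subtracting it from the NOW() value in the output above. A minimal Python sketch; the function name is illustrative only and not part of Quarry or MediaWiki:

```python
from datetime import datetime


def replication_gap_hours(rc_timestamp: str, now_iso: str) -> float:
    """Return the hours between a MediaWiki timestamp and an ISO 8601 time.

    Hypothetical helper for illustration only.
    """
    # MediaWiki stores timestamps as 14-digit strings: YYYYMMDDHHMMSS.
    newest = datetime.strptime(rc_timestamp, "%Y%m%d%H%M%S")
    # Quarry printed NOW() in ISO 8601 form in the output above.
    now = datetime.strptime(now_iso, "%Y-%m-%dT%H:%M:%S")
    return (now - newest).total_seconds() / 3600


gap = replication_gap_hours("20170517192932", "2017-05-18T22:54:38")
print(f"{gap:.1f} hours")  # 27.4 hours
```

Both values are treated as being in the same time zone, which is what the "more than 27 hours" figure assumes.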

Edit:
It seems the databases are now gradually being updated. (I only checked the recentchanges table in svwiki_p.)
15 minutes ago the most recent data was 16.1 hours old; right now the most recent data is 14.6 hours old.
At that pace, the catch-up will be complete and things back to normal in 2-3 hours.
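The 2-3 hour estimate follows from the two observations: the lag shrank by 1.5 hours in 0.25 hours, i.e. about 6 hours of lag cleared per hour, so the remaining 14.6 hours would take roughly 14.6 / 6 ≈ 2.4 hours. A small Python sketch of that arithmetic, assuming the catch-up rate stays constant (the function name is made up for illustration):

```python
def hours_until_caught_up(lag_before_h: float, lag_now_h: float, elapsed_h: float) -> float:
    """Estimate the remaining catch-up time from two lag observations.

    Assumes the replica keeps reducing its lag at the observed rate.
    """
    # Hours of lag removed per hour of wall-clock time.
    rate = (lag_before_h - lag_now_h) / elapsed_h
    if rate <= 0:
        raise ValueError("lag is not shrinking; no estimate possible")
    return lag_now_h / rate


eta = hours_until_caught_up(16.1, 14.6, 0.25)
print(f"{eta:.1f} hours")  # 2.4 hours
```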

It would be interesting to know why this affected some language versions but not others.

Event Timeline

Larske created this task. · May 18 2017, 11:01 PM
Restricted Application added a subscriber: Aklapper. · May 18 2017, 11:01 PM
Larske renamed this task from "No data after 20170517193000 available via Quarry from tables (recentchanges, revisions, logging) for several Mediawiki databases (svwiki_p, fiwiki_p, nowiki_p, ...)" to "No data after 20170517193000 available via Quarry from tables (recentchanges, revision, logging) for several Mediawiki databases (svwiki_p, fiwiki_p, nowiki_p, ...)". · May 18 2017, 11:03 PM
Larske updated the task description.
Larske updated the task description. · May 19 2017, 8:13 AM
jcrespo closed this task as Resolved. · May 19 2017, 9:48 AM
jcrespo claimed this task.
jcrespo added a subscriber: jcrespo.

You can check the replication lag at https://tools.wmflabs.org/replag/ (or better, directly by querying the heartbeat_p.heartbeat table). With the current infrastructure, it is impossible to avoid lag whenever something in production changes the structure of the tables (a schema change). That is going to change with the new architecture planned in T140788.

In particular, a production schema change was ongoing on s2, which includes the following projects: https://noc.wikimedia.org/db.php#tabs-2 According to my monitoring, the schema change has finished and replication should now catch up.

jcrespo added a comment. · Edited · May 19 2017, 9:51 AM

Note that only 1 server (c1) was affected. c3 was unaffected and could have been used temporarily. The aim of the new architecture is not to avoid lag (that is not possible), but to switch the active server to a non-lagged server transparently.

Thanks for the prompt response and the explanation of what was going on. The s2 cluster now seems to be fully updated.

I highly recommend that your code integrate some kind of check of the heartbeat_p.heartbeat table, so that it can produce warnings or user notices when appropriate. More on that: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Identifying_lag
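As a rough illustration of such a check, a tool could turn the lag it reads from heartbeat_p.heartbeat into a user notice. The sketch below is only an assumption about how that might look: the column names `shard` and `lag` (lag in seconds), the `%s` placeholder style, the 5-minute threshold, and the function name are not confirmed by this task and should be checked against the linked documentation.

```python
from typing import Optional

# Assumed query shape; the column names `shard` and `lag` are assumptions,
# and `%s` is the placeholder style used by common Python MySQL drivers.
HEARTBEAT_QUERY = "SELECT lag FROM heartbeat_p.heartbeat WHERE shard = %s"


def lag_notice(lag_seconds: float, threshold_seconds: float = 300.0) -> Optional[str]:
    """Return a user-facing warning if the replica lag exceeds the threshold.

    Hypothetical helper; the default threshold of 300 seconds is arbitrary.
    """
    if lag_seconds <= threshold_seconds:
        return None  # Data is fresh enough; no notice needed.
    hours = lag_seconds / 3600
    return f"Warning: replica data is about {hours:.1f} hours behind production."


# Example with the gap reported in this task (27 h 25 min 6 s = 98706 s).
print(lag_notice(98706))
print(lag_notice(12))  # None
```

A tool would run the query for the shard its wiki lives on (s2 in this incident), pass the result to `lag_notice`, and show the returned message alongside its output.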

Lag will also eventually show up in our graphs, in places like: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1001 (but not at the moment)