
Wikidata dumps creating large amounts of log spam
Closed, ResolvedPublicPRODUCTION ERROR

Description

(Not sure yet if these are the regular dumps or the wikidata-specific ones)

Lots of errors coming from snapshot1008 related to replication lag and read only mode.

This was probably caused by the eqiad s8 master (in the passive datacenter) being reimaged for the upgrade to the latest kernel and security patches, plus the MariaDB 10.1 upgrade. (Jaime's comment: this happens once every 3 years.)

We need to research why this was happening, and find a way to prevent it:

  • Can server maintenance be done in a better way that doesn't create log spam?
  • Can the replication checks reduce their spamminess so that we still get properly alerted but don't receive the same message millions of times? E.g. log an error once, but not on cached subsequent occurrences, or use some other model.
  • Why do the snapshot hosts require replication not to be lagged? Is lag important for them? If not, can they just skip those checks? If yes, shouldn't they be running against the active datacenter rather than eqiad?

While this didn't cause issues for end users, it did cause some issues in production due to the impact on the logging infrastructure: https://grafana.wikimedia.org/dashboard/db/production-logging?orgId=1&from=1537184523025&to=1537197061054 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&from=1537184523025&to=1537197061054

Event Timeline

The wikidata weeklies are running now; the 'regular' (xml/sql) dumps completed on Sept 15th for the first run of the month.

root@snapshot1008:~# netstat -a -p | grep php | grep eqiad | grep db
tcp        0      0 snapshot1008.eqia:41898 db1087.eqiad.wmne:mysql ESTABLISHED 105786/php7.0       
tcp        0      0 snapshot1008.eqia:42528 db1087.eqiad.wmne:mysql ESTABLISHED 107804/php7.0       
tcp        0      0 snapshot1008.eqia:39826 db1087.eqiad.wmne:mysql ESTABLISHED 120869/php7.0       
tcp        0      0 snapshot1008.eqia:36730 db1087.eqiad.wmne:mysql ESTABLISHED 118688/php7.0       
tcp        0      0 snapshot1008.eqia:39936 db1087.eqiad.wmne:mysql ESTABLISHED 104820/php7.0       
tcp        0      0 snapshot1008.eqia:35566 db1087.eqiad.wmne:mysql ESTABLISHED 118107/php7.0       
tcp        0      0 snapshot1008.eqia:36468 db1087.eqiad.wmne:mysql ESTABLISHED 118553/php7.0       
tcp        0      0 snapshot1008.eqia:35032 db1087.eqiad.wmne:mysql ESTABLISHED 117876/php7.0       

root@snapshot1008:~# ps ww -p 105786 107804 120869 118688 104820 118107 118553 117876
   PID TTY      STAT   TIME COMMAND
104820 ?        S     45:03 /usr/bin/php7.0 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 6 --sharding-factor 8 --batch-size 2000 --format ttl --flavor full-dump --entity-type item --entity-type property --no-cache --dbgroupdefault dump --part-id 6-11 --first-page-id 35200001 --last-page-id 38400000
105786 ?        S     41:08 /usr/bin/php7.0 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 4 --sharding-factor 8 --batch-size 2000 --format ttl --flavor full-dump --entity-type item --entity-type property --no-cache --dbgroupdefault dump --part-id 4-11 --first-page-id 35200001 --last-page-id 38400000
107804 ?        R     40:27 /usr/bin/php7.0 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 5 --sharding-factor 8 --batch-size 2000 --format ttl --flavor full-dump --entity-type item --entity-type property --no-cache --dbgroupdefault dump --part-id 5-11 --first-page-id 35200001 --last-page-id 38400000
117876 ?        S     16:24 /usr/bin/php7.0 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 7 --sharding-factor 8 --batch-size 2000 --format ttl --flavor full-dump --entity-type item --entity-type property --no-cache --dbgroupdefault dump --part-id 7-11 --first-page-id 35200001 --last-page-id 38400000
118107 ?        R     15:45 /usr/bin/php7.0 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 3 --sharding-factor 8 --batch-size 2000 --format ttl --flavor full-dump --entity-type item --entity-type property --no-cache --dbgroupdefault dump --part-id 3-11 --first-page-id 35200001 --last-page-id 38400000
118553 ?        R     14:28 /usr/bin/php7.0 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 1 --sharding-factor 8 --batch-size 2000 --format ttl --flavor full-dump --entity-type item --entity-type property --no-cache --dbgroupdefault dump --part-id 1-11 --first-page-id 35200001 --last-page-id 38400000
118688 ?        S     13:59 /usr/bin/php7.0 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 2 --sharding-factor 8 --batch-size 2000 --format ttl --flavor full-dump --entity-type item --entity-type property --no-cache --dbgroupdefault dump --part-id 2-11 --first-page-id 35200001 --last-page-id 38400000
120869 ?        S     12:03 /usr/bin/php7.0 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 0 --sharding-factor 8 --batch-size 2000 --format ttl --flavor full-dump --entity-type item --entity-type property --no-cache --dbgroupdefault dump --part-id 0-11 --first-page-id 35200001 --last-page-id 38400000

So they are indeed using eqiad. Why is that?

Adding @hoo to see what insights he may have.

Well, the mystery of 'why eqiad' is solved; the choice of whether to parse db-eqiad.php or db-codfw.php is determined by the global $wmfDatacenter. This is set in multiversion/MWRealm.php from the value of $wmfCluster, which is taken from the contents of /etc/wikimedia-cluster on the server itself. Since the dumps are run from servers in eqiad, there we have it.
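
A rough sketch of the selection logic described above (illustrative only, not the actual MWRealm.php code; the variable names beyond $wmfCluster/$wmfDatacenter and the include path are assumptions):

// The datacenter a host "is in" comes from a file on the host itself, so
// maintenance scripts started on the eqiad snapshot hosts end up parsing
// db-eqiad.php regardless of which datacenter is currently active.
$wmfCluster = trim( file_get_contents( '/etc/wikimedia-cluster' ) ); // e.g. "eqiad"
$wmfDatacenter = $wmfCluster; // simplified; MWRealm.php derives this value
require_once "$wmfConfigDir/db-$wmfDatacenter.php"; // db-eqiad.php or db-codfw.php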

jcrespo added subscribers: akosiaris, Joe.

CC @Joe @akosiaris not to solve the specific issue, but to note it as a potentially missed step/puppet code change related to the datacenter switchover.

Having said that, if dumps are run from eqiad, they probably should query local dbs to prevent privacy issues? So technically we are running active-active if we have to account for dumps? Will that break something?

Just for the record, the errors were:
2018-09-19 07:13:15 [580058d7b2babd5f85e725ec] snapshot1008 wikidatawiki 1.32.0-wmf.20 DBReplication ERROR: Server db1109 has 29.26508808136 seconds of lag (>= 10) {"host":"db1109","lag":29.26508808136,"maxlag":10}
2018-09-19 07:13:31 [c36272088a424bcc438a31e9] snapshot1008 wikidatawiki 1.32.0-wmf.20 DBReplication ERROR: Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server db1109 is not replicating? {"host":"db1109"}
Was there anything else?

To fix this, the following MediaWiki LB default config could be made adjustable (to use a null logger) by a maintenance-script argument (or by a config we would set for dump maintenance scripts):
MWLBFactory::applyDefaultConfig: 'replLogger' => LoggerFactory::getInstance( 'DBReplication' )
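
A minimal sketch of what that could look like (the --quiet-replication option name and the surrounding wiring are assumptions; only the 'replLogger' key and the LoggerFactory call above are existing code):

use MediaWiki\Logger\LoggerFactory;
use Psr\Log\NullLogger;

// Hypothetical: dump maintenance scripts could ask for a null replication logger,
// so lag warnings are simply dropped instead of flooding the logging pipeline.
$quietReplication = $maintenance->hasOption( 'quiet-replication' ); // assumed flag

$lbConf['replLogger'] = $quietReplication
    ? new NullLogger()                               // silence DBReplication spam
    : LoggerFactory::getInstance( 'DBReplication' ); // current default behaviour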

I just looked into this some more… We could also create a new LB settings key which would be passed as lagWarnThreshold to LoadMonitor (currently this option is never set, as there is no way to do so, thus it always defaults to LoadMonitor::LAG_WARN_THRESHOLD (10s)).

With that, we could enable certain dump-related maintenance scripts to set it to a reasonable value for their operation… for the Wikidata dumpers, 10-20m of lag should pose no problem.
Something else which (I think) we could already do without altering LoadBalancer-related code (just by live-changing the DB configuration in Maintenance) is to entirely disable load monitoring (use LoadMonitorNull), but I'm not sure that is desirable.

Any thoughts?
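
For illustration, a hypothetical shape for the proposed settings key (neither the 'lagWarnThreshold' key nor this exact $wgLBFactoryConf wiring exists today; the value is an example only):

// Hypothetical: an option forwarded to LoadMonitor, which today always falls
// back to LoadMonitor::LAG_WARN_THRESHOLD (10s) because nothing sets it.
$wgLBFactoryConf['loadMonitor'] = [
    'class'            => \Wikimedia\Rdbms\LoadMonitor::class,
    'lagWarnThreshold' => 20 * 60, // 20 minutes is unproblematic for the Wikidata dumpers
];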

Or we could alter LoadMonitor::getServerStates to not log for a given DB server more than once every N seconds (where N could be derived from LAG_WARN_THRESHOLD, which is currently 10s).

Of the above options, I prefer the idea of explicitly increasing the LAG_WARN_THRESHOLD in scripts where that's desired. Anything that depends on a particular db server makes me uneasy, and I don't like the idea of turning off load monitoring completely either.

Anything that depends on a particular db server makes me uneasy

Hm… above I meant that per DB server it should only ever warn once every (say) LAG_WARN_THRESHOLD / 2 seconds… it makes little sense to let a single LoadMonitor instance create more log entries per DB.

(This limitation would be per LoadMonitor instance, so not spanning multiple maintenance runs or web requests (which we don't care about here anyway).)

Ah, I misread! Before we arbitrarily throttle though, I'd like the DBAs to weigh in. It might be that letting the flood actually flood is helpful sometimes, in cases where it means something (not for dump scripts).

@aaron This task mentions various rdbms/LoadBalancer-related configuration options and how they interact with maintenance scripts that seemingly need to tolerate longer replication lag without warnings. I'm not 100% sure, but it looks like this is proposing changes to the library and/or ServiceWiring, if so, I think this might benefit from your feedback as well. Hereby solicited :)

Change 478773 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: reduce LoadBalancer replication log spam

https://gerrit.wikimedia.org/r/478773

Imarlier moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.
Imarlier subscribed.

@aaron is going to see what else can be done to reduce spam, will then assign back to @ArielGlenn

I just revisited the original logs… and I found that the way way way more common error back then was from LoadBalancer::getRandomNonLagged:

2018-09-19 07:13:27 [1d74bb1b0e64917ec3cd47e7] snapshot1008 wikidatawiki 1.32.0-wmf.20 DBReplication ERROR: Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server db1109 is not replicating? {"host":"db1109"}

I'll try to have a look at this soon.

Change 478773 merged by jenkins-bot:
[mediawiki/core@master] rdbms: reduce LoadBalancer replication log spam

https://gerrit.wikimedia.org/r/478773

I just had a brief look at this… and it's not easy to resolve AFAICT, given we don't ever want to completely ignore the fact that a DB is not replicating at all, I guess. We could use a HashBagOStuff in LoadBalancer to make sure we only log these errors once every N seconds (per DB server)?!
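
A rough sketch of that throttling idea (HashBagOStuff and its add() semantics exist; the function, key name and wfDebugLog wiring here are illustrative only, not the proposed LoadBalancer change itself):

// Hypothetical: log "is not replicating?" at most once per DB server per window.
// BagOStuff::add() only succeeds when the key is absent, so further attempts
// within $window seconds are silently dropped.
$throttle = new HashBagOStuff();
$window = 10; // could be derived from LoadMonitor::LAG_WARN_THRESHOLD

function logNotReplicating( HashBagOStuff $throttle, $window, $host ) {
    if ( $throttle->add( "repl-log:$host", 1, $window ) ) {
        wfDebugLog( 'DBReplication', "server $host is not replicating?" );
    }
    // else: already logged for $host within the last $window seconds
}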

Change 483859 had a related patch set uploaded (by Paladox; owner: Aaron Schulz):
[mediawiki/core@REL1_32] rdbms: reduce LoadBalancer replication log spam

https://gerrit.wikimedia.org/r/483859

Change 483942 had a related patch set uploaded (by Reedy; owner: Aaron Schulz):
[mediawiki/core@REL1_31] rdbms: reduce LoadBalancer replication log spam

https://gerrit.wikimedia.org/r/483942

Change 483943 had a related patch set uploaded (by Reedy; owner: Aaron Schulz):
[mediawiki/core@REL1_30] rdbms: reduce LoadBalancer replication log spam

https://gerrit.wikimedia.org/r/483943

Change 483859 merged by jenkins-bot:
[mediawiki/core@REL1_32] rdbms: reduce LoadBalancer replication log spam

https://gerrit.wikimedia.org/r/483859

Change 483942 merged by jenkins-bot:
[mediawiki/core@REL1_31] rdbms: reduce LoadBalancer replication log spam

https://gerrit.wikimedia.org/r/483942

Change 483943 merged by jenkins-bot:
[mediawiki/core@REL1_30] rdbms: reduce LoadBalancer replication log spam

https://gerrit.wikimedia.org/r/483943

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:08 PM