
Phabricator release/2019-07-03/1 from wmf/stable creating lag on codfw hosts
Closed, Resolved · Public

Description

Following:

[02:27:57]  <twentyafterfour>	!log Deploying Phabricator release/2019-07-03/1 from wmf/stable

There has been a huge increase in INSERTs on the phabricator master:
https://grafana.wikimedia.org/d/000000273/mysql?panelId=2&fullscreen&orgId=1&from=1562196465630&to=1562215625463&var-dc=eqiad%20prometheus%2Fops&var-server=db1072&var-port=9104

This is making codfw unable to cope with replication (db1117:3323 was also lagging behind, but as it has faster disks it was able to catch up). The codfw hosts keep increasing their lag:
https://grafana.wikimedia.org/d/000000273/mysql?panelId=6&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2065&var-port=9104&from=now-24h&to=now
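
For reference, a minimal sketch of how these two symptoms could also be spot-checked directly on the hosts with standard MySQL commands (connection details are an assumption; the task itself only links the Grafana panels):

# Hypothetical spot checks: cumulative INSERT counter on the master, lag on a codfw replica
mysql -h db1072 -e "SHOW GLOBAL STATUS LIKE 'Com_insert'"
mysql -h db2065 -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master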

From what I can see, the INSERTs are mostly:

INSERT INTO `phabricator_worker`.`worker_activetask` (`failureTime`, `taskClass`, `leaseOwner`, `leaseExpires`, `failureCount`, `dataID`, `priority`, `objectPHID`, `id`, `dateCreated`, `dateModified`) VALUES (NULL, 'PhabricatorRepositoryGitCommitMessageParserWorker', NUL)

INSERT INTO `phabricator_repository`.`repository_commit` (`repositoryID`, `phid`, `authorIdentityPHID`, `committerIdentityPHID`, `commitIdentifier`, `epoch`, `authorPHID`, `auditStatus`, `summary`, `importStatus`) VALUES ('xxx', 'xxx, NULL, N)

Is this expected? When can we expect it to finish? The codfw hosts will keep lagging indefinitely if this doesn't stop or the insertion rate doesn't decrease.

Details

Related Gerrit Patches:
operations/puppet : production · Phabricator: Set taskmasters to 4

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jul 4 2019, 5:00 AM
mmodell added a subscriber: mmodell. · Jul 4 2019, 5:53 AM

I'm cleaning up the worker queue to lighten the load. It should subside soon.

Peachey88 updated the task description. · Jul 4 2019, 10:04 AM

Mentioned in SAL (#wikimedia-operations) [2019-07-04T10:47:29Z] <marostegui> Ease replication consistency option on db2065 to allow it to catch a bit - T227251


I have relaxed the replication consistency variables (sync_binlog and innodb_flush_log_at_trx_commit) so db2065 can catch up a bit. I will wait for its lag to reach 0 before setting them back to their defaults (after which it will start lagging behind again at the current INSERT rate).
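
For reference, a minimal sketch of the relax/restore cycle, assuming the usual fully durable defaults (sync_binlog=1, innodb_flush_log_at_trx_commit=1); the exact values used on db2065 are not recorded in this task:

# Relax durability so the replica can apply replication events faster
mysql -h db2065 -e "SET GLOBAL sync_binlog = 0; SET GLOBAL innodb_flush_log_at_trx_commit = 2"
# Once Seconds_Behind_Master reaches 0, restore the fully durable defaults
mysql -h db2065 -e "SET GLOBAL sync_binlog = 1; SET GLOBAL innodb_flush_log_at_trx_commit = 1"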

Mentioned in SAL (#wikimedia-operations) [2019-07-04T12:42:55Z] <marostegui> Restore defaults replication consistency options on db2065 - T227251

I have restored the defaults after db2065 caught up.

@Marostegui: The phabricator work queue is almost empty now, see https://phabricator.wikimedia.org/daemon/. There were well over 1 million jobs; we are now down to just over 300,000, and those are search jobs which should not put significant insert/update load on mysql. Rather, those jobs will be doing a lot of mysql read queries and inserting a lot of documents into elasticsearch.

I could cancel the rest of the search jobs. I think that would still produce quite a bit of database activity, but maybe less than all the queue status updates needed to execute the jobs. However, it would leave the search index missing some data.
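
If we did decide to drop them, a sketch of how that might be done with Phabricator's worker CLI, assuming bin/worker cancel accepts a --class filter and that the install lives under /srv/phab (both assumptions; the search worker class name below is also an assumption):

# Hypothetical: cancel remaining queued search jobs by worker class
cd /srv/phab/phabricator
./bin/worker cancel --class PhabricatorSearchWorker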

@Marostegui: OK, I found a way to slow down the queue: I lowered phd.taskmasters to 1.
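
A minimal sketch of how such a local override is typically applied with Phabricator's config CLI and a daemon restart (the install path is an assumption):

cd /srv/phab/phabricator
./bin/config set phd.taskmasters 1   # local override; puppet may later revert it
./bin/phd restart                    # restart daemons so the new taskmaster count takes effect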

Now the graphs look better. Unfortunately, puppet will set the config back to 10 taskmasters unless we make a commit to rOPUP Wikimedia Puppet.

Change 520770 had a related patch set uploaded (by 20after4; owner: 20after4):
[operations/puppet@production] Phabricator: Set taskmasters to 4

https://gerrit.wikimedia.org/r/520770

mmodell triaged this task as High priority. · Jul 4 2019, 2:59 PM

Change 520770 merged by Marostegui:
[operations/puppet@production] Phabricator: Set taskmasters to 4

https://gerrit.wikimedia.org/r/520770

> Now the graphs look better. Unfortunately, puppet will set the config back to 10 taskmasters unless we make a commit to rOPUP Wikimedia Puppet.

I have merged the patch. Do we have to restart anything on the phab side?

MoritzMuehlenhoff lowered the priority of this task from High to Normal. · Jul 5 2019, 12:31 PM
Marostegui closed this task as Resolved. · Jul 5 2019, 12:50 PM
Marostegui assigned this task to mmodell.

Just to clarify, we have lowered the priority because the slaves are no longer lagging.
A few minutes ago the master went back to normal INSERT values (normal meaning the pre-upgrade rate).
Resolving this - thanks @mmodell!