
Massive increase of writes in m3 section
Closed, Resolved · Public

Description

The m3 db primary master db1072 has suffered a huge increase in writes, which coincides with the Gerrit upgrade this past 8th June.

m3 contains the Phabricator databases, not Gerrit's, but the times match, so I don't know if it could be related somehow.

17:14 demon@deploy1001: Finished deploy [gerrit/gerrit@7324140]: 2.15.2 (duration: 00m 11s)

17:14 demon@deploy1001: Started deploy [gerrit/gerrit@7324140]: 2.15.2

17:12 no_justification: gerrit: taking offline for 2.14 -> 2.15 upgrade

https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1072&var-port=9104&panelId=2&fullscreen&from=1528454408179&to=1528627208179

7 days graph: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1072&var-port=9104&from=1528110351086&to=1528715151086

Revisions and Commits

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 10 2018, 10:42 AM
Marostegui triaged this task as High priority. · Jun 10 2018, 10:45 AM
greg added a comment. · Jun 10 2018, 10:45 AM

Related: The Gerrit upgrade included a migration that created many new git refs. Those are replicated to Phabricator and thus it also had to ingest/index them.

Ah right! I'm on my phone and cannot check what the writes are. Any ETA for when that will be finished?

Codfw is lagging behind as it cannot cope with the amount of writes. Not a big deal as it is not used, but it is indicative of how massive this is. It would be nice to have an estimate of when it will be finished.

It will be a long while, as Phabricator has to parse all the new commits (NoteDb). We should probably try to ignore refs/changes/**/meta in Phabricator.

Long while meaning... days?
I could ease replication later to keep the Codfw master from falling further behind.

We haven't needed to replicate any refs other than heads and tags since we brought gitiles online... Disable them. Now. And prune them from Phab while we're at it.
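A minimal sketch of the allow-list policy described in the comment above, keeping only branch heads and tags; the patterns and the should_replicate helper are illustrative assumptions, not the actual gerrit/Phabricator replication configuration:

```python
from fnmatch import fnmatch

# Illustrative allow-list: only branch heads and tags are replicated.
# These patterns are assumptions for the sketch, not the real
# replication configuration used in production.
ALLOWED = ("refs/heads/*", "refs/tags/*")

def should_replicate(ref: str) -> bool:
    """Return True if the ref matches the allow-list."""
    return any(fnmatch(ref, pattern) for pattern in ALLOWED)

assert should_replicate("refs/heads/master")
assert should_replicate("refs/tags/v2.15.2")
assert not should_replicate("refs/changes/40/437140/meta")  # NoteDb metadata ref
assert not should_replicate("refs/changes/40/437140/1")     # patchset ref
```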

Paladox added a comment (edited). · Jun 10 2018, 12:47 PM

But Phabricator changed its behaviour and now clones

refs/**

So to fix this we need a regex to not clone

refs/changes/**/meta
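
A minimal sketch of the kind of pattern meant here: a regular expression matching Gerrit's NoteDb meta refs so they can be excluded while other refs are still cloned. The pattern and helper name are assumptions for illustration; the actual fix is D1067.

```python
import re

# Gerrit shards change refs as refs/changes/<last two digits>/<change number>/<suffix>;
# the NoteDb metadata lives under the "meta" suffix.
NOTEDB_META = re.compile(r"^refs/changes/\d{2}/\d+/meta$")

def is_notedb_meta(ref: str) -> bool:
    """Return True for NoteDb meta refs that should not be cloned."""
    return NOTEDB_META.match(ref) is not None

assert is_notedb_meta("refs/changes/40/437140/meta")
assert not is_notedb_meta("refs/changes/40/437140/1")  # ordinary patchset ref
assert not is_notedb_meta("refs/heads/master")
```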
Marostegui moved this task from Triage to In progress on the DBA board. · Jun 11 2018, 9:12 AM
Marostegui updated the task description. · Jun 11 2018, 11:06 AM

This https://phabricator.wikimedia.org/D1067 will fix it so that no new NoteDb refs are cloned.

When are you planning to get that deployed?

Someone needs to approve it and merge it, and then I think @mmodell would have to deploy it.

Excellent - thanks! :)

Mentioned in SAL (#wikimedia-operations) [2018-06-11T17:25:32Z] <twentyafterfour> Phabricator: deploying hotfix (D1067) refs T196840 T196860 T196855

I'm going to stop phd and attempt to clear out the backlog from the queue (it's a lot of useless updates that we ultimately don't need to write to the db).

Mentioned in SAL (#wikimedia-operations) [2018-06-11T17:59:29Z] <twentyafterfour> phabricator: taking phd offline while I clear out the queue backlog (downtime is logged in icinga) see T196840

Phabricator is going to parse the existing refs/changes/*/*/meta commits (no new ones will be added to the queue, so this will eventually go down). According to @mmodell it queued over 8 million refs.

@Marostegui: I canceled some of the queued jobs which should have helped somewhat. The only thing I know to do beyond this is to stop replicating from gerrit.

Mentioned in SAL (#wikimedia-operations) [2018-06-12T22:24:58Z] <twentyafterfour> phabricator: I scheduled a 24 hour downtime in icinga for the phd service, to give me time to work on this issue. See T196840

mmodell added a comment (edited). · Jun 12 2018, 11:14 PM

I'm deleting queued jobs in batches of 100,000. I've also reduced the number of phabricator workers to 3 (from 10) so overall there should be a reduction in write traffic. It won't return to normal until I've managed to clear the queue which is still at least 5 million jobs.
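
A back-of-the-envelope sketch of the drain-time reasoning behind canceling jobs instead of letting them run; the per-worker throughput is an assumed figure for illustration only, not a measured value.

```python
# Rough estimate of how long the backlog would take to drain on its own.
queued_jobs = 5_000_000          # remaining queue size mentioned above
workers = 3                      # phd workers after the reduction from 10
tasks_per_worker_per_second = 5  # assumption for illustration, not measured

drain_seconds = queued_jobs / (workers * tasks_per_worker_per_second)
print(f"estimated drain time: {drain_seconds / 3600:.1f} hours")
# With these assumptions the queue would take roughly 93 hours (about four
# days) to empty, which is why bulk-canceling jobs is the faster path.
```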

@Marostegui: I canceled some of the queued jobs which should have helped somewhat. The only thing I know to do beyond this is to stop replicating from gerrit.

Thanks for the heads up.
We are not in such a bad situation (at least for now) that we have to stop replicating things, so let's wait for everything to finish.

I've got the queue down to 3.1M by canceling jobs. There is still write traffic involved even to delete the jobs so it hasn't really reduced the traffic as much as I'd like. I think things should return to normal in a couple of hours once I've gotten the queue cleared.

jcrespo lowered the priority of this task from High to Medium. · Jun 13 2018, 5:39 AM
jcrespo added a subscriber: jcrespo.

I don't think this is high priority from our perspective: they have dedicated db resources, the replica is up to date, and they were aware of the issue at the end of the week (they acked the problems this caused), so as long as this is temporary it is not a huge issue for us. Yes, codfw is lagging, and reparsing all commits doesn't scale, so an alternative approach should be found to avoid this happening on every upgrade, but both the master and the replica are healthy, so no big deal.

The Gerrit NoteDb migration was a one-time event, so it shouldn't really be something that happens with every update.

Vvjjkkii renamed this task from Massive increase of writes in m3 section to jbbaaaaaaa. · Jul 1 2018, 1:05 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed mmodell as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
Marostegui renamed this task from jbbaaaaaaa to Massive increase of writes in m3 section. · Jul 2 2018, 5:07 AM
Marostegui closed this task as Resolved.
Marostegui assigned this task to mmodell.
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description.