Migrate parsercache hosts to file per table
Closed, Resolved · Public

Description

We believe that https://gerrit.wikimedia.org/r/#/c/354504/ caused the increase in disk space usage on the parsercache hosts last Saturday, June 10th (T167784).

The parsercache hosts are not using InnoDB file per table, so all data lives in the shared ibdata1 files, which are eating most of the space.

root@pc1004:/srv/sqldata-cache# df -hT /srv
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc1004--vg-srv xfs   2.2T  1.9T  261G  89% /srv

root@pc1004:/srv/sqldata-cache# ls -lh ibdata1
-rw-rw---- 1 mysql mysql 1.9T Jun 10 10:19 ibdata1
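
A quick way to confirm the setting (a sketch, not output captured from the actual hosts; the variable reads 0 while everything lives in the shared tablespace):

mysql --skip-ssl -B -N -e "SELECT @@innodb_file_per_table;"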

Event Timeline

Marostegui moved this task from Triage to Pending comment on the DBA board.

Change 358167 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Parsercache: temporarily increase limit for space alarm

https://gerrit.wikimedia.org/r/358167

Change 358167 merged by Volans:
[operations/puppet@production] MariaDB: temporarily increase limit for space alarm

https://gerrit.wikimedia.org/r/358167

After today's issue with the parser cache, there is not much disk space margin left on the hosts.
Even with the cleanup that @tstarling has run, the freed space will never be returned to the OS, because without file per table InnoDB keeps it inside the shared ibdata1 tablespace.
Right now we don't have much room:

root@pc1004:/srv/sqldata-cache# df -hT /srv/
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc1004--vg-srv xfs   2.2T  1.9T  267G  88% /srv

And that is only with this amount of binlogs as I have been purging the old ones:

root@pc1004:/srv/sqldata-cache# mysql --skip-ssl -e "show binary logs;" -B -N | wc -l
26

I am not sure how many binlogs we generate per hour, but other hosts that I have not fully purged have generated around 10 per hour, which works out to around 240G per day in the best-case scenario (we expire logs after 1 day), so there is not much room there.
The ibdata1 file shouldn't grow much more and should remain stable, and the binlogs will normally grow back to roughly the same amount of space (as I said, 240G in total in the best-case scenario).
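
To measure the current on-disk binlog footprint instead of estimating it, something like this would work (a sketch; the second column of SHOW BINARY LOGS is the file size in bytes):

mysql --skip-ssl -B -N -e "SHOW BINARY LOGS;" | awk '{sum += $2} END {printf "%.1f GiB\n", sum/1024^3}'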

As a quick hack we can set up a cronjob to purge binlogs and keep only 12 hours of binlogs instead of 24h.
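
Such a job could be as simple as this (a sketch of the idea, not the exact crontab deployed; PURGE BINARY LOGS BEFORE is standard MariaDB syntax):

# /etc/cron.d/purge-binlogs (hypothetical): keep only ~12 hours of binlogs, checked hourly
0 * * * * root mysql --skip-ssl -e "PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 12 HOUR);"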

Marostegui renamed this task from "Migrate parsercache host to file per table" to "Migrate parsercache hosts to file per table". Jun 10 2017, 3:42 PM
Marostegui added a project: SRE.
Marostegui updated the task description. (Show Details)

Until we decide how and when to migrate these hosts to file per table (it needs to be done soon), I have left a screen session running on pc1004-1006 and pc2004-2006 purging logs every 6 hours. It will not purge logs unless there are at least 20 binlogs on the host.
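
A minimal sketch of such a loop, assuming the 12-hour retention window proposed above (the actual screen session was not recorded here):

while true; do
    # only purge once at least 20 binlogs have accumulated
    if [ "$(mysql --skip-ssl -B -N -e 'SHOW BINARY LOGS;' | wc -l)" -ge 20 ]; then
        mysql --skip-ssl -e "PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 12 HOUR);"
    fi
    sleep 6h
done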

jcrespo moved this task from Pending comment to In progress on the DBA board.

Taking db1096 & db2072 (current spare hosts originally intended for s8) as a temporary measure to fail over the pc hosts.

Change 358907 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Add temporary parsercache machines to both eqiad and codfw

https://gerrit.wikimedia.org/r/358907

Change 358907 merged by Jcrespo:
[operations/puppet@production] Add temporary parsercache machines to both eqiad and codfw

https://gerrit.wikimedia.org/r/358907

Change 358913 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-prometheus: Add temporary hosts to monitoring

https://gerrit.wikimedia.org/r/358913

Change 358913 merged by Jcrespo:
[operations/puppet@production] mariadb-prometheus: Add temporary hosts to monitoring

https://gerrit.wikimedia.org/r/358913

Change 358918 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Enable file-per-table option on parsercaches

https://gerrit.wikimedia.org/r/358918

Change 358918 merged by Jcrespo:
[operations/puppet@production] mariadb: Enable file-per-table option on parsercaches

https://gerrit.wikimedia.org/r/358918
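
For reference, the setting itself is a one-line my.cnf change (illustrative snippet; the real change is puppet-managed, see the gerrit link above). It only affects tables created or rebuilt afterwards, and ibdata1 never shrinks, which is why the hosts have to be reinitialized rather than just reconfigured:

[mysqld]
innodb_file_per_table = 1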

Change 358926 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] parsercache: Switchover pc1004 and pc2004 to db1096 and db2072

https://gerrit.wikimedia.org/r/358926

Mentioned in SAL (#wikimedia-operations) [2017-06-14T11:34:16Z] <jynus> about to deploy performance-impacting change on the parsercache persistent storage T167567

Change 358926 merged by jenkins-bot:
[operations/mediawiki-config@master] parsercache: Switchover pc1004 and pc2004 to db1096 and db2072

https://gerrit.wikimedia.org/r/358926

Change 359195 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Pool db1099 and db1101 as temporary substitutes of pc2/3

https://gerrit.wikimedia.org/r/359195

Change 359195 merged by Jcrespo:
[operations/puppet@production] mariadb: Pool db1099 and db1101 as temporary substitutes of pc2/3

https://gerrit.wikimedia.org/r/359195

Mentioned in SAL (#wikimedia-operations) [2017-06-16T08:50:53Z] <jynus> bringing down pc1005 and pc1006 for maintenance T167567

pc1004-1006 have been upgraded and restarted, and they are catching up on replication from the currently pooled servers db1096, db1099 and db1101. I will keep the temporary hosts pooled for some time, not only to let replication catch up, but also to tune the purge system.
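
Catch-up progress can be followed with the usual replication status query (a sketch; Seconds_Behind_Master drops to 0 once a host is in sync):

mysql --skip-ssl -e "SHOW SLAVE STATUS\G" | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'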

pc2004-2006's replication is currently stopped, and they have deliberately not yet been upgraded/defragmented. I have set expire_logs_days = 30 on the eqiad servers so the codfw hosts can catch up later.
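
expire_logs_days is dynamic, so the longer retention can be applied without a restart (a sketch; the persistent value would also go into the puppet-managed my.cnf):

mysql --skip-ssl -e "SET GLOBAL expire_logs_days = 30;"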

This is done; service monitoring continues at T167784.