
purgeParserCache.php should not take over 24 hours for its daily run
Open, High, Public

Description

This is a follow-up from T280605#7070345.

Background

The purgeParserCache.php script is scheduled to run every night to prune ParserCache blobs that are past their expiry date. Our blobs generally have an expiry of 30 days, which means we expect each nightly run to remove roughly the blobs that were stored 30 days earlier.

Problem

As of writing, the purge script takes over a week to complete a single run. This has several consequences:

  1. Because a run takes about 10 days, we effectively have to accommodate blobs for up to 37 days rather than 30-31 days. This means more space is occupied by default.
  2. Each run takes longer than the last. This means the backlog is growing, and with it the space consumption. E.g. I expect we'll soon be accommodating blobs for 40 days, etc. There is no obvious end, other than a full disk.
  3. With the backlog growing, each run takes even longer, as it has to iterate over more blobs to purge them. See point 2.

What we know

The script ran daily up until 19 April 2020 (last year):

  • 19 April 2020: Run took 1 day (the last time this happened).
  • 24 April 2020: Run took 3 days.
  • 26 Jun 2020: Run took 4 days.
  • 28 Nov 2020: Run took 6 days.
  • 13 Apr 2021: Run took 7 days.
  • 7 May 2021: Run was aborted after 5 days, during which it completed 81% (2 May 01:51 - 7 May 05:23).
  • 13 May 2021: The current run is at 26%, which has taken 116 hours so far (7 May 05:23 - 13 May 01:42). Extrapolating, I would expect ~446 hours in total, or 18 days.

(Caveat: The script's percentage meter assumes all shards are equal, which they probably aren't.)

The script iterates over each parser cache database host, then over each parser cache table on that host, and then selects and deletes rows with a past expiry date in batches of 100 (code 1, code 2). It sleeps 500 ms between batches.

This sleep was introduced in 2016 to mitigate T150124: Parsercache purging can create lag.

The first mitigation used a 100 ms sleep, which was later increased to 500 ms.
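Schematically, the current loop looks roughly like the sketch below. This is a minimal illustration, not the actual purgeParserCache.php code: it assumes the standard objectcache-style schema (a keyname primary key and an exptime column), and $hosts, $tables, getConnection(), and $cutoff are hypothetical stand-ins.

<?php
foreach ( $hosts as $host ) {
    $dbw = getConnection( $host );
    foreach ( $tables as $table ) {
        do {
            // Select a batch of up to 100 expired rows by primary key...
            $keys = $dbw->selectFieldValues(
                $table,
                'keyname',
                [ 'exptime < ' . $dbw->addQuotes( $cutoff ) ],
                __METHOD__,
                [ 'LIMIT' => 100 ]
            );
            if ( $keys ) {
                // ...then delete them by primary key.
                $dbw->delete( $table, [ 'keyname' => $keys ], __METHOD__ );
            }
            // Fixed 500 ms sleep between batches.
            usleep( 500 * 1000 );
        } while ( count( $keys ) === 100 );
    }
}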

Note that this task is not about product features adding more blobs to the ParserCache in general. I believe that, as it stands, the problem this task is about will continue to worsen even if our demand remains constant going forward. However, the increased demand in the last 12 months (see T280605) has pushed us over an invisible tipping point and cascaded into this self-reinforcing regression.

Data

# [19:15 UTC] krinkle at mwmaint1002.eqiad.wmnet in /var/log/mediawiki/mediawiki_job_parser_cache_purging
# $ fgrep "Deleting" syslog.log
Apr 10 01:00:03 mwmaint1002 mediawiki_job_parser_cache_purging[146571]: Deleting objects expiring before 01:00, 10 April 2020
Apr 11 03:56:45 mwmaint1002 mediawiki_job_parser_cache_purging[240365]: Deleting objects expiring before 03:56, 11 April 2020
Apr 12 08:12:32 mwmaint1002 mediawiki_job_parser_cache_purging[72089]: Deleting objects expiring before 08:12, 12 April 2020
Apr 13 07:29:56 mwmaint1002 mediawiki_job_parser_cache_purging[2846]: Deleting objects expiring before 07:29, 13 April 2020
Apr 14 02:38:09 mwmaint1002 mediawiki_job_parser_cache_purging[210524]: Deleting objects expiring before 02:38, 14 April 2020
Apr 15 01:00:03 mwmaint1002 mediawiki_job_parser_cache_purging[83431]: Deleting objects expiring before 01:00, 15 April 2020
Apr 16 01:00:06 mwmaint1002 mediawiki_job_parser_cache_purging[66915]: Deleting objects expiring before 01:00, 16 April 2020
Apr 17 01:15:19 mwmaint1002 mediawiki_job_parser_cache_purging[12688]: Deleting objects expiring before 01:15, 17 April 2020
Apr 18 01:00:01 mwmaint1002 mediawiki_job_parser_cache_purging[204947]: Deleting objects expiring before 01:00, 18 April 2020
Apr 19 13:07:02 mwmaint1002 mediawiki_job_parser_cache_purging[182094]: Deleting objects expiring before 13:07, 19 April 2020
Apr 21 20:59:35 mwmaint1002 mediawiki_job_parser_cache_purging[250916]: Deleting objects expiring before 20:59, 21 April 2020
Apr 24 19:47:46 mwmaint1002 mediawiki_job_parser_cache_purging[101035]: Deleting objects expiring before 19:47, 24 April 2020
Apr 27 23:09:18 mwmaint1002 mediawiki_job_parser_cache_purging[206784]: Deleting objects expiring before 23:09, 27 April 2020
Apr 30 16:53:08 mwmaint1002 mediawiki_job_parser_cache_purging[205448]: Deleting objects expiring before 16:53, 30 April 2020
May 3 00:37:27 mwmaint1002 mediawiki_job_parser_cache_purging[42048]: Deleting objects expiring before 00:37, 3 May 2020
May 5 05:13:29 mwmaint1002 mediawiki_job_parser_cache_purging[20388]: Deleting objects expiring before 05:13, 5 May 2020
May 7 16:30:25 mwmaint1002 mediawiki_job_parser_cache_purging[19386]: Deleting objects expiring before 16:30, 7 May 2020
May 9 21:54:40 mwmaint1002 mediawiki_job_parser_cache_purging[259760]: Deleting objects expiring before 21:54, 9 May 2020
May 12 03:47:45 mwmaint1002 mediawiki_job_parser_cache_purging[13882]: Deleting objects expiring before 03:47, 12 May 2020
May 14 04:12:29 mwmaint1002 mediawiki_job_parser_cache_purging[164959]: Deleting objects expiring before 04:12, 14 May 2020
May 16 02:39:31 mwmaint1002 mediawiki_job_parser_cache_purging[32112]: Deleting objects expiring before 02:39, 16 May 2020
May 18 02:41:09 mwmaint1002 mediawiki_job_parser_cache_purging[196859]: Deleting objects expiring before 02:41, 18 May 2020
May 20 07:12:25 mwmaint1002 mediawiki_job_parser_cache_purging[167348]: Deleting objects expiring before 07:12, 20 May 2020
May 22 13:35:47 mwmaint1002 mediawiki_job_parser_cache_purging[186721]: Deleting objects expiring before 13:35, 22 May 2020
May 25 06:03:37 mwmaint1002 mediawiki_job_parser_cache_purging[149790]: Deleting objects expiring before 06:03, 25 May 2020
May 28 13:49:01 mwmaint1002 mediawiki_job_parser_cache_purging[190184]: Deleting objects expiring before 13:49, 28 May 2020
Jun 1 05:22:29 mwmaint1002 mediawiki_job_parser_cache_purging[41516]: Deleting objects expiring before 05:22, 1 June 2020
Jun 3 14:01:01 mwmaint1002 mediawiki_job_parser_cache_purging[161911]: Deleting objects expiring before 14:01, 3 June 2020
Jun 5 23:24:21 mwmaint1002 mediawiki_job_parser_cache_purging[19792]: Deleting objects expiring before 23:24, 5 June 2020
Jun 9 18:49:46 mwmaint1002 mediawiki_job_parser_cache_purging[228933]: Deleting objects expiring before 18:49, 9 June 2020
Jun 13 06:37:17 mwmaint1002 mediawiki_job_parser_cache_purging[33396]: Deleting objects expiring before 06:37, 13 June 2020
Jun 16 06:29:11 mwmaint1002 mediawiki_job_parser_cache_purging[84032]: Deleting objects expiring before 06:29, 16 June 2020
Jun 19 06:14:53 mwmaint1002 mediawiki_job_parser_cache_purging[92680]: Deleting objects expiring before 06:14, 19 June 2020
Jun 22 17:19:58 mwmaint1002 mediawiki_job_parser_cache_purging[88936]: Deleting objects expiring before 17:19, 22 June 2020
Jun 26 11:15:00 mwmaint1002 mediawiki_job_parser_cache_purging[156464]: Deleting objects expiring before 11:15, 26 June 2020
Jun 30 09:46:00 mwmaint1002 mediawiki_job_parser_cache_purging[205484]: Deleting objects expiring before 09:46, 30 June 2020
Jul 3 17:42:57 mwmaint1002 mediawiki_job_parser_cache_purging[221089]: Deleting objects expiring before 17:42, 3 July 2020
Jul 6 21:08:00 mwmaint1002 mediawiki_job_parser_cache_purging[119347]: Deleting objects expiring before 21:08, 6 July 2020
Jul 9 17:05:20 mwmaint1002 mediawiki_job_parser_cache_purging[36442]: Deleting objects expiring before 17:05, 9 July 2020
Jul 12 08:39:17 mwmaint1002 mediawiki_job_parser_cache_purging[128375]: Deleting objects expiring before 08:39, 12 July 2020
Jul 14 22:00:43 mwmaint1002 mediawiki_job_parser_cache_purging[100580]: Deleting objects expiring before 22:00, 14 July 2020
Jul 17 08:42:25 mwmaint1002 mediawiki_job_parser_cache_purging[24302]: Deleting objects expiring before 08:42, 17 July 2020
Jul 20 01:20:05 mwmaint1002 mediawiki_job_parser_cache_purging[64583]: Deleting objects expiring before 01:20, 20 July 2020
Jul 22 23:32:32 mwmaint1002 mediawiki_job_parser_cache_purging[253940]: Deleting objects expiring before 23:32, 22 July 2020
Jul 26 00:31:13 mwmaint1002 mediawiki_job_parser_cache_purging[40242]: Deleting objects expiring before 00:31, 26 July 2020
Jul 29 06:18:16 mwmaint1002 mediawiki_job_parser_cache_purging[191576]: Deleting objects expiring before 06:18, 29 July 2020
Aug 1 08:44:23 mwmaint1002 mediawiki_job_parser_cache_purging[29297]: Deleting objects expiring before 08:44, 1 August 2020
Aug 4 10:56:00 mwmaint1002 mediawiki_job_parser_cache_purging[173135]: Deleting objects expiring before 10:56, 4 August 2020
Aug 7 22:14:41 mwmaint1002 mediawiki_job_parser_cache_purging[39261]: Deleting objects expiring before 22:14, 7 August 2020
Aug 11 06:58:47 mwmaint1002 mediawiki_job_parser_cache_purging[258229]: Deleting objects expiring before 06:58, 11 August 2020
Aug 14 07:15:29 mwmaint1002 mediawiki_job_parser_cache_purging[54964]: Deleting objects expiring before 07:15, 14 August 2020
Aug 17 06:40:56 mwmaint1002 mediawiki_job_parser_cache_purging[100453]: Deleting objects expiring before 06:40, 17 August 2020
Aug 20 01:53:44 mwmaint1002 mediawiki_job_parser_cache_purging[231161]: Deleting objects expiring before 01:53, 20 August 2020
Aug 22 22:47:23 mwmaint1002 mediawiki_job_parser_cache_purging[185114]: Deleting objects expiring before 22:47, 22 August 2020
Aug 25 20:14:44 mwmaint1002 mediawiki_job_parser_cache_purging[175794]: Deleting objects expiring before 20:14, 25 August 2020
Aug 28 21:03:15 mwmaint1002 mediawiki_job_parser_cache_purging[38037]: Deleting objects expiring before 21:03, 28 August 2020
Sep 1 04:16:18 mwmaint1002 mediawiki_job_parser_cache_purging[254032]: Deleting objects expiring before 04:16, 1 September 2020
Oct 28 01:00:07 mwmaint1002 mediawiki_job_parser_cache_purging[85535]: Deleting objects expiring before 01:00, 28 October 2020
Nov 1 13:27:46 mwmaint1002 mediawiki_job_parser_cache_purging[111100]: Deleting objects expiring before 13:27, 1 November 2020
Nov 5 03:02:09 mwmaint1002 mediawiki_job_parser_cache_purging[125718]: Deleting objects expiring before 03:02, 5 November 2020
Nov 8 06:52:23 mwmaint1002 mediawiki_job_parser_cache_purging[210463]: Deleting objects expiring before 06:52, 8 November 2020
Nov 11 04:04:49 mwmaint1002 mediawiki_job_parser_cache_purging[46811]: Deleting objects expiring before 04:04, 11 November 2020
Nov 13 18:05:08 mwmaint1002 mediawiki_job_parser_cache_purging[23469]: Deleting objects expiring before 18:05, 13 November 2020
Nov 16 08:24:39 mwmaint1002 mediawiki_job_parser_cache_purging[25291]: Deleting objects expiring before 08:24, 16 November 2020
Nov 18 11:56:21 mwmaint1002 mediawiki_job_parser_cache_purging[130565]: Deleting objects expiring before 11:56, 18 November 2020
Nov 21 18:06:30 mwmaint1002 mediawiki_job_parser_cache_purging[145145]: Deleting objects expiring before 18:06, 21 November 2020
Nov 27 06:01:34 mwmaint1002 mediawiki_job_parser_cache_purging[149815]: Deleting objects expiring before 06:01, 27 November 2020
Dec 3 06:56:15 mwmaint1002 mediawiki_job_parser_cache_purging[253031]: Deleting objects expiring before 06:56, 3 December 2020
Dec 9 03:39:20 mwmaint1002 mediawiki_job_parser_cache_purging[144320]: Deleting objects expiring before 03:39, 9 December 2020
Dec 14 05:59:59 mwmaint1002 mediawiki_job_parser_cache_purging[111951]: Deleting objects expiring before 05:59, 14 December 2020
Dec 18 08:23:34 mwmaint1002 mediawiki_job_parser_cache_purging[49615]: Deleting objects expiring before 08:23, 18 December 2020
Dec 21 09:08:16 mwmaint1002 mediawiki_job_parser_cache_purging[149733]: Deleting objects expiring before 09:08, 21 December 2020
Dec 24 17:45:30 mwmaint1002 mediawiki_job_parser_cache_purging[13185]: Deleting objects expiring before 17:45, 24 December 2020
Dec 27 23:09:22 mwmaint1002 mediawiki_job_parser_cache_purging[236394]: Deleting objects expiring before 23:09, 27 December 2020
Dec 31 04:37:15 mwmaint1002 mediawiki_job_parser_cache_purging[209639]: Deleting objects expiring before 04:37, 31 December 2020
Jan 3 10:12:26 mwmaint1002 mediawiki_job_parser_cache_purging[9945]: Deleting objects expiring before 10:12, 3 January 2021
Jan 6 22:31:39 mwmaint1002 mediawiki_job_parser_cache_purging[145468]: Deleting objects expiring before 22:31, 6 January 2021
Jan 10 11:40:57 mwmaint1002 mediawiki_job_parser_cache_purging[37493]: Deleting objects expiring before 11:40, 10 January 2021
Jan 14 08:21:18 mwmaint1002 mediawiki_job_parser_cache_purging[104818]: Deleting objects expiring before 08:21, 14 January 2021
Jan 18 10:04:29 mwmaint1002 mediawiki_job_parser_cache_purging[201674]: Deleting objects expiring before 10:04, 18 January 2021
Jan 23 02:11:11 mwmaint1002 mediawiki_job_parser_cache_purging[131431]: Deleting objects expiring before 02:11, 23 January 2021
Jan 28 03:31:21 mwmaint1002 mediawiki_job_parser_cache_purging[17858]: Deleting objects expiring before 03:31, 28 January 2021
Feb 2 01:50:58 mwmaint1002 mediawiki_job_parser_cache_purging[155799]: Deleting objects expiring before 01:50, 2 February 2021
Feb 7 02:45:33 mwmaint1002 mediawiki_job_parser_cache_purging[99534]: Deleting objects expiring before 02:45, 7 February 2021
Feb 12 01:00:02 mwmaint1002 mediawiki_job_parser_cache_purging[54460]: Deleting objects expiring before 01:00, 12 February 2021
Feb 17 08:06:44 mwmaint1002 mediawiki_job_parser_cache_purging[109741]: Deleting objects expiring before 08:06, 17 February 2021
Feb 22 07:49:48 mwmaint1002 mediawiki_job_parser_cache_purging[213157]: Deleting objects expiring before 07:49, 22 February 2021
Feb 27 15:09:30 mwmaint1002 mediawiki_job_parser_cache_purging[334]: Deleting objects expiring before 15:09, 27 February 2021
Mar 5 00:05:42 mwmaint1002 mediawiki_job_parser_cache_purging[7982]: Deleting objects expiring before 00:05, 5 March 2021
Mar 10 15:46:03 mwmaint1002 mediawiki_job_parser_cache_purging[66538]: Deleting objects expiring before 15:46, 10 March 2021
Mar 16 19:45:27 mwmaint1002 mediawiki_job_parser_cache_purging[100664]: Deleting objects expiring before 19:45, 16 March 2021
Mar 23 09:25:13 mwmaint1002 mediawiki_job_parser_cache_purging[120833]: Deleting objects expiring before 09:25, 23 March 2021
Mar 30 08:56:02 mwmaint1002 mediawiki_job_parser_cache_purging[92844]: Deleting objects expiring before 08:56, 30 March 2021
Apr 6 02:31:55 mwmaint1002 mediawiki_job_parser_cache_purging[185950]: Deleting objects expiring before 02:31, 6 April 2021
Apr 13 00:41:29 mwmaint1002 mediawiki_job_parser_cache_purging[17162]: Deleting objects expiring before 00:41, 13 April 2021
Apr 19 18:08:28 mwmaint1002 mediawiki_job_parser_cache_purging[151047]: Deleting objects expiring before 18:08, 19 April 2021
Apr 26 02:20:46 mwmaint1002 mediawiki_job_parser_cache_purging[90978]: Deleting objects expiring before 02:20, 26 April 2021
May 2 01:51:04 mwmaint1002 mediawiki_job_parser_cache_purging[216627]: Deleting objects expiring before 01:51, 2 May 2021
May 7 05:23:37 mwmaint1002 mediawiki_job_parser_cache_purging[56878]: Deleting objects expiring before 05:23, 7 May 2021

Event Timeline

Change 685878 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] purgeParserCache.php: Remove carriage return printer, minor cleanup

https://gerrit.wikimedia.org/r/685878

Krinkle added subscribers: Kormat, Marostegui, jcrespo.

@Marostegui @Kormat @jcrespo

I need from you a simple explanation or best guess about what the story is with parser cache and replication lag, how (or whether) it relates to the purge script, and what order of magnitude you think we can experiment with in terms of shortening the sleep or increasing the batch size.

From a brief IRC chat with @jcrespo, I have understood the following (please confirm):

  • We believe the purge script is unlikely to be, by itself, a source of replication lag in the "traditional" sense. (It has no concurrency; only one instance runs at any given time; it does not use transactions; and its selects and deletes are trivial, using primary keys and simple indexed conditions.)
  • We believe that even without any purging, parser cache databases sometimes experience lag from their main operation of receiving large blobs during web requests. The lag is not common, but it does happen from time to time.
  • We believe that to keep this from happening a lot, we want to reduce the number of delete queries the purge script interweaves with that main operation, since these compete with it and could cause lag that way. This is why we need the purge script to be graceful in "some way".

Right now that "way" is a batch size of 100 rows, and a sleep of 0.5 seconds.

One alternative I considered is explicitly waiting for replication, as we do in other maintenance scripts. However, I think this is not something we want to do here (and I've documented as much), for three reasons: it would make the script require cross-dc coordination, which could cause reliability issues (in general we want the purge to continue even if there is cross-dc trouble); it could make the script "too" graceful (waiting for replication, plus the cross-dc latency to acknowledge that replication, might make the wait longer and waste more idle time); and it might not be possible in the current configuration, assuming MW isn't aware of cross-dc PC hosts currently.
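For reference, this is the pattern other maintenance scripts use and that I'm suggesting we avoid here (a minimal sketch using the LBFactory API; the surrounding batch code is elided):

<?php
use MediaWiki\MediaWikiServices;

$lbFactory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
// ...delete a batch of expired rows...
// Block until replicas have caught up before the next batch:
$lbFactory->waitForReplication();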

A more advanced approach could be a rewrite of the script to handle each parsercache host in parallel. Today, we do 1 batch on 1 table on 1 host, and then wait; the other hosts are untouched during the majority of the script's run. Handling them in parallel could mean we issue one batch on each host, then wait 0.5 s minus the time spent on the other hosts, and then iterate again for the next batch on each host. Based on my napkin math, that could cut the runtime by an order of magnitude. Whether this is acceptable depends on whether you think it is okay for the load we currently add to a single host to be experienced on all hosts at the same time.
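Roughly, the loop could look like this (a hypothetical sketch; purgeBatch() and $hostConnections are illustrative names, not existing code):

<?php
$delayUs = 500 * 1000;
do {
    $start = microtime( true );
    $moreWork = false;
    foreach ( $hostConnections as $dbw ) {
        // One 100-row select/delete cycle per host per iteration;
        // purgeBatch() returns whether expired rows remain on that host.
        $moreWork = purgeBatch( $dbw, $cutoff, 100 ) || $moreWork;
    }
    // Wait "0.5s - time spent on the other hosts".
    $elapsedUs = (int)( ( microtime( true ) - $start ) * 1e6 );
    usleep( max( 0, $delayUs - $elapsedUs ) );
} while ( $moreWork );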

A more naive approach would be to crank up the batch size and/or reduce the sleep within the existing approach.

I propose we start with the naive approach if possible, and that we schedule a time soon to do this together (in-place patch on mwmaint1002) and monitor the impact as we go.

I've now heard you say this for the second time:

it does not use transactions

But every query is a transaction. If you send a query without an explicit START TRANSACTION and COMMIT, you can think of it, in a practical sense, as if it had been enveloped by those implicitly. You can test this by running a write query without START TRANSACTION and seeing that it blocks other writes on the same rows, and that its changes cannot be seen until it finishes (it is atomic). It is true there are some optimizations in the case of single SELECTs, but that is only because they are a special case without writes.
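To illustrate (a hypothetical PDO example against a scratch InnoDB table; the connection details and the pc_demo table are illustrative, not the script's actual code):

<?php
$dbh = new PDO( 'mysql:host=127.0.0.1;dbname=test', 'user', 'pass' );

// No START TRANSACTION / COMMIT, yet this DELETE is still atomic:
$dbh->exec( "DELETE FROM pc_demo WHERE exptime < NOW() LIMIT 100" );

// For locking and atomicity it behaves like the explicit form:
$dbh->beginTransaction();
$dbh->exec( "DELETE FROM pc_demo WHERE exptime < NOW() LIMIT 100" );
$dbh->commit();

// In both cases InnoDB holds row locks on the affected rows while the
// DELETE runs, so a concurrent write to those rows from another session
// blocks until the (implicit or explicit) transaction commits.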

LSobanski moved this task from Triage to Refine on the DBA board.

Have we considered using table partitioning to make the purge less expensive? If we partitioned based on expiration date, we could drop the older partition(s) in a single operation, rather than have to query for and remove individual rows.
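For concreteness, the idea would be something like the following (hypothetical DDL, issued here via PHP for consistency with the other sketches; table and partition names are illustrative). One caveat: MySQL requires the columns in the partitioning expression to be part of every unique key, so the tables' keyname-only primary key would also need to change.

<?php
$dbh->exec( "
    ALTER TABLE pc_demo
    PARTITION BY RANGE ( TO_DAYS(exptime) ) (
        PARTITION p20210501 VALUES LESS THAN ( TO_DAYS('2021-05-01') ),
        PARTITION p20210601 VALUES LESS THAN ( TO_DAYS('2021-06-01') ),
        PARTITION pmax VALUES LESS THAN MAXVALUE
    )
" );
// The purge would then be a single metadata operation per table:
$dbh->exec( "ALTER TABLE pc_demo DROP PARTITION p20210501" );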

Change 685878 merged by jenkins-bot:

[mediawiki/core@master] purgeParserCache.php: Remove carriage return printer, minor cleanup

https://gerrit.wikimedia.org/r/685878

Have we considered using table partitioning to make the purge less expensive? If we partitioned based on expiration date, we could drop the older partition(s) in a single operation, rather than having to query for and remove individual rows.

Dropping a partition isn't an online operation, meaning we'd block the table entirely on each drop, which is a no-go.

A more advanced approach could be a rewrite of the script to handle each parsercache host in parallel. Today, we do 1 batch on 1 table on 1 host, and then wait; the other hosts are untouched during the majority of the script's run. Handling them in parallel could mean we issue one batch on each host, then wait 0.5 s minus the time spent on the other hosts, and then iterate again for the next batch on each host. Based on my napkin math, that could cut the runtime by an order of magnitude. Whether this is acceptable depends on whether you think it is okay for the load we currently add to a single host to be experienced on all hosts at the same time.

I really like this idea. How long do you think it'd take you to implement?

A more naive approach would be to crank up the batch size and/or reduce the sleep within the existing approach.

I propose we start with the naive approach if possible, and that we schedule a time soon to do this together (in-place patch on mwmaint1002) and monitor the impact as we go.

+1 to try this. It most likely means more lag, but as of today we do not use the codfw hosts, so lag shouldn't be a big consideration right now. I would propose increasing the batch size a bit and seeing how that goes.

Change 692581 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] pc1010: Disable notifications

https://gerrit.wikimedia.org/r/692581

Change 692581 merged by Marostegui:

[operations/puppet@production] pc1010: Disable notifications

https://gerrit.wikimedia.org/r/692581

Mentioned in SAL (#wikimedia-operations) [2021-05-18T12:40:29Z] <Krinkle> krinkle@mw1002 purge-parsercache-now.php on pc1010 (spare, depooled), ref P16060, T280605, T282761

@Krinkle did the manual purge on pc1010 finish? If so, are we ok to go ahead and optimize the tables there so we can proceed and swap pc1010 with pc1007?

@Kormat Yes, the purge has completed on all 255 tables. Go ahead!

Optimize run started against pc1010.

Optimize of pc1010 finished.

Disk space usage went from 3.72TB to 2.44TB.

It ran from 2021-05-20T08:42:38+00:00 to 2021-05-20T19:40:40+00:00 (~11h duration).

Change 693413 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/mediawiki-config@master] db-eqiad.php: Set pc1010 as pc1 primary.

https://gerrit.wikimedia.org/r/693413

Change 693413 merged by jenkins-bot:

[operations/mediawiki-config@master] db-eqiad.php: Set pc1010 as pc1 primary.

https://gerrit.wikimedia.org/r/693413

Mentioned in SAL (#wikimedia-operations) [2021-05-25T09:00:25Z] <kormat@deploy1002> Synchronized wmf-config/db-eqiad.php: Set pc1010 as pc1 primary T282761 (duration: 00m 58s)

Mentioned in SAL (#wikimedia-operations) [2021-05-25T09:01:28Z] <kormat> stopping replication on pc1010 T282761

Mentioned in SAL (#wikimedia-operations) [2021-05-25T09:05:20Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc[2007,2010].codfw.wmnet,pc1007.eqiad.wmnet with reason: Purging parsercache T282761

Mentioned in SAL (#wikimedia-operations) [2021-05-25T09:05:23Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc[2007,2010].codfw.wmnet,pc1007.eqiad.wmnet with reason: Purging parsercache T282761

Current status:

  • pc1010 is now the primary for pc1 in mw-config
  • pc1010 is set as mysql_role 'master' in puppet
  • I've run stop slave; reset slave all on pc1010, so it no longer replicates from pc1007
  • I've created a downtime for 7 days for pc[2007,2010].codfw.wmnet,pc1007.eqiad.wmnet

Change 694331 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] pc1010: Set mysql_role as 'master'

https://gerrit.wikimedia.org/r/694331

Change 694331 merged by Kormat:

[operations/puppet@production] pc1010: Set mysql_role as 'master'

https://gerrit.wikimedia.org/r/694331

Mentioned in SAL (#wikimedia-operations) [2021-05-25T17:34:00Z] <Krinkle> mwmaint1002: Running purge-parsercache-now.php on server 2/4 (pc1007, depooled spare). Ref P16060, T280605, T282761.

The purge has finished as of 2021-05-26T06:00Z. I'll start the optimize process now.

Mentioned in SAL (#wikimedia-operations) [2021-05-26T08:11:24Z] <kormat> running 'optimize table' over parsercache db on pc1007 with replication enabled T282761

Optimize of pc1007 (and replicas) finished.

Disk space usage went from 3.91TB to 2.3TB.

It ran from 2021-05-26T08:11:37+00:00 to 2021-05-26T17:07:47+00:00 (~9h duration).

It took about 2.5h more for the replicas to catch up.

Mentioned in SAL (#wikimedia-operations) [2021-05-27T12:50:39Z] <kormat@deploy1002> Synchronized wmf-config/db-eqiad.php: Repool pc1007 as pc1 master T282761 (duration: 01m 04s)

Change 696398 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] pc1010: Move to pc2.

https://gerrit.wikimedia.org/r/696398

Change 696398 merged by Kormat:

[operations/puppet@production] pc1010: Move to pc2.

https://gerrit.wikimedia.org/r/696398

Current status:

  • pc1 is repooled and back in service.
  • pc1010 is now in pc2, and replicating from pc1008. This means it will have at least _some_ relevant entries when it becomes pc2 primary next week.

Change 697921 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/mediawiki-config@master] db-eqiad.php: Set pc1010 as pc2 primary.

https://gerrit.wikimedia.org/r/697921

Change 697921 merged by jenkins-bot:

[operations/mediawiki-config@master] db-eqiad.php: Set pc1010 as pc2 primary.

https://gerrit.wikimedia.org/r/697921

Mentioned in SAL (#wikimedia-operations) [2021-06-03T10:13:53Z] <kormat@deploy1002> Synchronized wmf-config/db-eqiad.php: Set pc1010 as pc2 primary T282761 (duration: 00m 58s)

pc1010 is now pc2 primary, and is no longer replicating from pc1008:

root@pc1010.eqiad.wmnet[(none)]> stop slave;
Query OK, 0 rows affected (0.025 sec)

root@pc1010.eqiad.wmnet[(none)]> reset slave all;
Query OK, 0 rows affected (0.049 sec)

@Krinkle: you can run your purge against pc1008 now.

Mentioned in SAL (#wikimedia-operations) [2021-06-03T10:21:07Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc2008.codfw.wmnet,pc1008.eqiad.wmnet with reason: Purging parsercache T282761

Mentioned in SAL (#wikimedia-operations) [2021-06-03T10:21:12Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc2008.codfw.wmnet,pc1008.eqiad.wmnet with reason: Purging parsercache T282761

Mentioned in SAL (#wikimedia-operations) [2021-06-04T13:39:35Z] <Krinkle> mwmaint1002: Running purge_parsercache_now.php on pc1008, server 3/4, ref T282761

15:39:55 <Krinkle> kormat: it's running now, tee'ed to /home/krinkle/purge_parsercache_now_pc1008.log

Run finished at 2021-06-05T14:30. Running optimize over all pc* tables now.

Mentioned in SAL (#wikimedia-operations) [2021-06-08T12:14:26Z] <kormat> setting pc1008 back as pc2 primary T282761

Mentioned in SAL (#wikimedia-operations) [2021-06-08T12:15:51Z] <kormat@deploy1002> Synchronized wmf-config/db-eqiad.php: Repool pc1008 as pc2 master T282761 (duration: 00m 57s)

Optimize of pc1008 (and replica) finished.

Disk space usage went from 3.94TB to 2.23TB.

It ran from 2021-06-07T09:14:19+00:00 to 2021-06-07T16:59:09+00:00 (~7.75h duration).

It took about 2h more for the replica (pc2008) to catch up.

Change 698777 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/mediawiki-config@master] db-eqiad.php: Set pc1010 as pc3 primary.

https://gerrit.wikimedia.org/r/698777

Change 698778 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] pc1010: Move to pc3.

https://gerrit.wikimedia.org/r/698778

Change 698778 merged by Kormat:

[operations/puppet@production] pc1010: Move to pc3.

https://gerrit.wikimedia.org/r/698778

Change 698777 merged by jenkins-bot:

[operations/mediawiki-config@master] db-eqiad.php: Set pc1010 as pc3 primary.

https://gerrit.wikimedia.org/r/698777

Mentioned in SAL (#wikimedia-operations) [2021-06-08T14:05:07Z] <kormat> setting pc1010 as pc3 primary T282761

Mentioned in SAL (#wikimedia-operations) [2021-06-08T14:05:50Z] <kormat@deploy1002> Synchronized wmf-config/db-eqiad.php: Set pc1010 as pc3 master T282761 (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2021-06-08T15:19:04Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc2009.codfw.wmnet,pc1009.eqiad.wmnet with reason: Purging parsercache pc3 T282761

Mentioned in SAL (#wikimedia-operations) [2021-06-08T15:19:09Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc2009.codfw.wmnet,pc1009.eqiad.wmnet with reason: Purging parsercache pc3 T282761

Mentioned in SAL (#wikimedia-operations) [2021-06-08T15:23:39Z] <Krinkle> mwmaint1002: Running purge-parsercache-now.php on server 4/4 (pc1009) ref P16060, T280605, T282761.

Mentioned in SAL (#wikimedia-operations) [2021-06-10T10:28:51Z] <kormat> running optimize tables against pc1009 (pc3) T282761

Optimize of pc1009 (and replica) finished.

Disk space usage went from 3.9TB to 2TB.

It ran from 2021-06-10T10:28:26+00:00 to 2021-06-10T18:39:59+00:00 (~8.2h duration).