
Parsercache sudden increase of connections
Closed, Resolved · Public

Description

At 18:00 UTC all parsercache hosts experienced a sudden increase in connections:



While the connections dropped to more normal values at around 19:11 UTC, they are still a bit over normal values.
The load hit pc1008 harder than the other hosts, so it was depooled and pc1010 was pooled instead: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/580117/

pc1010 seemed to behave better than pc1008, which will be troubleshot separately at T247787.
The reason for this sudden connection increase is still unknown and needs to be researched.

At around 19:16, however, we see a big increase in writes, which is still ongoing:
https://grafana.wikimedia.org/d/000000273/mysql?from=1584353429837&to=1584434949647&var-dc=eqiad%20prometheus%2Fops&var-server=pc1009&var-port=9104&fullscreen&panelId=2

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 16 2020, 7:44 PM
Marostegui triaged this task as High priority. · Mar 16 2020, 7:44 PM
Marostegui added a project: Operations.
Marostegui updated the task description.
jcrespo added a subscriber: jcrespo. · Edited · Mar 16 2020, 7:45 PM

pc1008 coincidental hw issues handled separately at T247787

May be related to T247562.

Marostegui lowered the priority of this task from High to Medium. · Edited · Mar 17 2020, 6:35 AM

May be related to T247562.

Thanks - however, I don't think it is related as the timestamps do not match.
The parsercache connections went fully back to previous values at around 23:28. However, I believe we should still investigate what could have caused this.

https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&fullscreen&panelId=37&from=1584340499940&to=1584426899941&var-dc=eqiad%20prometheus%2Fops&var-server=pc1007&var-port=9104

There were issues, probably related to T247562 (as @brennen pointed out), throughout the day, including the times we had the connection spikes: https://logstash.wikimedia.org/goto/ced05ae17369f4b693a46b26621dbc7a. However, as can be seen, the errors start to decrease around 18:00 UTC, right when the connection spikes began:
https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&fullscreen&panelId=37&from=1584340499940&to=1584426899941&var-dc=eqiad%20prometheus%2Fops&var-server=pc1007&var-port=9104

Marostegui added a comment. · Edited · Mar 17 2020, 11:40 AM

The cron jobs that were run from mwmaint1002 around those times:

Mar 16 17:50:01 mwmaint1002 CRON[164787]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 17:51:01 mwmaint1002 CRON[165026]: (www-data) CMD (echo "$$: Starting dispatcher" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log; /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki >> /var/log/wikidata/dispatchChanges-wikidatawiki.log 2>&1; echo "$$: Dispatcher exited with $?" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log)
Mar 16 17:51:01 mwmaint1002 CRON[165027]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 17:52:01 mwmaint1002 CRON[165207]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 17:53:01 mwmaint1002 CRON[165462]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 17:54:01 mwmaint1002 CRON[165667]: (www-data) CMD (echo "$$: Starting dispatcher" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log; /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki >> /var/log/wikidata/dispatchChanges-wikidatawiki.log 2>&1; echo "$$: Dispatcher exited with $?" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log)
Mar 16 17:54:01 mwmaint1002 CRON[165668]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 17:55:01 mwmaint1002 CRON[165856]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 17:56:01 mwmaint1002 CRON[166038]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 17:57:01 mwmaint1002 CRON[166236]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 17:57:01 mwmaint1002 CRON[166237]: (www-data) CMD (echo "$$: Starting dispatcher" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log; /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki >> /var/log/wikidata/dispatchChanges-wikidatawiki.log 2>&1; echo "$$: Dispatcher exited with $?" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log)
Mar 16 17:58:01 mwmaint1002 CRON[166493]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 17:59:01 mwmaint1002 CRON[166657]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 18:00:01 mwmaint1002 CRON[166878]: (www-data) CMD (/usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/pruneChanges.php --wiki testwikidatawiki --number-of-days=3 >> /var/log/wikidata/prune-testwikidata.log 2>&1)
Mar 16 18:00:01 mwmaint1002 CRON[166880]: (www-data) CMD (echo "$$: Starting dispatcher" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log; /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki >> /var/log/wikidata/dispatchChanges-wikidatawiki.log 2>&1; echo "$$: Dispatcher exited with $?" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log)
Mar 16 18:00:01 mwmaint1002 CRON[166879]: (www-data) CMD (/usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/pruneChanges.php --wiki wikidatawiki --number-of-days=3 >> /var/log/wikidata/prune2.log 2>&1)
Mar 16 18:00:01 mwmaint1002 CRON[166881]: (www-data) CMD (echo "$$: Starting dispatcher" >> /var/log/wikidata/dispatchChanges-testwikidatawiki.log; /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki testwikidatawiki >> /var/log/wikidata/dispatchChanges-testwikidatawiki.log 2>&1; echo "$$: Dispatcher exited with $?" >> /var/log/wikidata/dispatchChanges-testwikidatawiki.log)
Mar 16 18:00:01 mwmaint1002 CRON[166883]: (www-data) CMD (/usr/local/bin/mwscriptwikiset extensions/FlaggedRevs/maintenance/updateStats.php flaggedrevs.dblist > /dev/null 2> /dev/null)
Mar 16 18:00:01 mwmaint1002 CRON[166882]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 18:01:01 mwmaint1002 CRON[167210]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 18:02:01 mwmaint1002 CRON[167454]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 18:03:01 mwmaint1002 CRON[167748]: (www-data) CMD (echo "$$: Starting dispatcher" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log; /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki >> /var/log/wikidata/dispatchChanges-wikidatawiki.log 2>&1; echo "$$: Dispatcher exited with $?" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log)
Mar 16 18:03:01 mwmaint1002 CRON[167749]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 18:04:01 mwmaint1002 CRON[168078]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 18:05:01 mwmaint1002 CRON[168324]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 18:06:01 mwmaint1002 CRON[168611]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 18:06:01 mwmaint1002 CRON[168612]: (www-data) CMD (echo "$$: Starting dispatcher" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log; /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki >> /var/log/wikidata/dispatchChanges-wikidatawiki.log 2>&1; echo "$$: Dispatcher exited with $?" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log)
Mar 16 18:07:01 mwmaint1002 CRON[168797]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 18:08:01 mwmaint1002 CRON[169187]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 18:09:01 mwmaint1002 CRON[169417]: (www-data) CMD (echo "$$: Starting dispatcher" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log; /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki >> /var/log/wikidata/dispatchChanges-wikidatawiki.log 2>&1; echo "$$: Dispatcher exited with $?" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log)
Mar 16 18:09:01 mwmaint1002 CRON[169419]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 18:10:01 mwmaint1002 CRON[169600]: (www-data) CMD (/usr/local/bin/foreachwiki extensions/CirrusSearch/maintenance/saneitizeJobs.php --push --refresh-freq=7200 >> /var/log/mediawiki/cirrus-sanitize/push-jobs.log 2>&1)
Mar 16 18:10:01 mwmaint1002 CRON[169601]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)

And around the time where we see the increase in writes, which starts at 19:16 or so:

Mar 16 19:08:01 mwmaint1002 CRON[191181]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 19:09:01 mwmaint1002 CRON[191342]: (www-data) CMD (echo "$$: Starting dispatcher" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log; /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki >> /var/log/wikidata/dispatchChanges-wikidatawiki.log 2>&1; echo "$$: Dispatcher exited with $?" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log)
Mar 16 19:09:01 mwmaint1002 CRON[191343]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 19:10:01 mwmaint1002 CRON[191618]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 19:11:01 mwmaint1002 CRON[191844]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 19:12:01 mwmaint1002 CRON[192061]: (www-data) CMD (echo "$$: Starting dispatcher" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log; /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki >> /var/log/wikidata/dispatchChanges-wikidatawiki.log 2>&1; echo "$$: Dispatcher exited with $?" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log)
Mar 16 19:12:01 mwmaint1002 CRON[192062]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 19:13:01 mwmaint1002 CRON[192265]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 19:14:01 mwmaint1002 CRON[192477]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 19:15:01 mwmaint1002 CRON[192600]: (www-data) CMD (echo "$$: Starting dispatcher" >> /var/log/wikidata/dispatchChanges-testwikidatawiki.log; /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki testwikidatawiki >> /var/log/wikidata/dispatchChanges-testwikidatawiki.log 2>&1; echo "$$: Dispatcher exited with $?" >> /var/log/wikidata/dispatchChanges-testwikidatawiki.log)
Mar 16 19:15:01 mwmaint1002 CRON[192601]: (www-data) CMD (echo "$$: Starting dispatcher" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log; /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki >> /var/log/wikidata/dispatchChanges-wikidatawiki.log 2>&1; echo "$$: Dispatcher exited with $?" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log)
Mar 16 19:15:01 mwmaint1002 CRON[192611]: (www-data) CMD (/usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/pruneChanges.php --wiki wikidatawiki --number-of-days=3 >> /var/log/wikidata/prune2.log 2>&1)
Mar 16 19:16:01 mwmaint1002 CRON[193098]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 19:17:01 mwmaint1002 CRON[193251]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 19:18:01 mwmaint1002 CRON[193598]: (www-data) CMD (echo "$$: Starting dispatcher" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log; /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki >> /var/log/wikidata/dispatchChanges-wikidatawiki.log 2>&1; echo "$$: Dispatcher exited with $?" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log)
Mar 16 19:18:01 mwmaint1002 CRON[193599]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 19:18:07 mwmaint1002 crontab[193766]: (root) LIST (www-data)
Mar 16 19:19:01 mwmaint1002 CRON[195739]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 19:20:01 mwmaint1002 CRON[195940]: (www-data) CMD (/usr/local/bin/mwscript maintenance/getLagTimes.php --wiki aawiki --report 2>/dev/null >/dev/null)
Mar 16 19:21:01 mwmaint1002 CRON[196097]: (www-data) CMD (echo "$$: Starting dispatcher" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log; /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki >> /var/log/wikidata/dispatchChanges-wikidatawiki.log 2>&1; echo "$$: Dispatcher exited with $?" >> /var/log/wikidata/dispatchChanges-wikidatawiki.log)
Marostegui updated the task description. · Mar 17 2020, 11:41 AM

I don't see any MediaWiki deploys in https://wikitech.wikimedia.org/wiki/Server_Admin_Log at either of these times.

At 19:11 I see "marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Pool pc1010 instead of pc1008 as pc1008 is overloaded". That seems like it would cause an increase in misses and so load on the other pc servers. How long does it usually take to recover after something like that? I want to think there's a downward trend in the graphs, but if it's not just my imagination it's pretty slow.

On the MediaWiki log side, I see an increase of "Async set op failed" between 17:58 and 19:08, which seems consistent with parser cache. Oddly, they seem to come in bursts every 5 minutes. I don't see any pattern in source URLs or anything.

I don't see anything too obvious beginning at around 19:11. There's an increase in slow parses, but that may be consistent with increased load due to increased parser cache misses.

I don't see any MediaWiki deploys in https://wikitech.wikimedia.org/wiki/Server_Admin_Log at either of these times.

At 19:11 I see "marostegui@deploy1001: Synchronized wmf-config/db-eqiad.php: Pool pc1010 instead of pc1008 as pc1008 is overloaded". That seems like it would cause an increase in misses and so load on the other pc servers. How long does it usually take to recover after something like that? I want to think there's a downward trend in the graphs, but if it's not just my imagination it's pretty slow.

That deployment was just a long shot, and it actually went well. We replaced pc1008 with another host, just to see if that new host (which belongs to pc1) would perform better.
It did perform better and connections decreased, but the reason why all parsercache hosts experienced such a massive increase in connections at around 18:00 remains unknown :-(

Anomie added a comment. · Edited · Mar 17 2020, 3:15 PM

It did perform better and connections decreased, but the reason why all parsercache hosts experienced such a massive increase in connections at around 18:00 remains unknown :-(

Each write to ParserCache sets two keys into the backend, which will probably get sharded to two different servers. Once SqlBagOStuff opens a connection to one of the servers, it keeps that connection open until request shutdown. So if we assume that pc1008 is somehow failing in a way that has connections hang open for a while, we'd also see a smaller increase in idle open connections on pc1007 and pc1009 for the cases where ParserCache's first write goes to pc1007/pc1009 and the second one goes to pc1008. That seems consistent with what the three graphs show.

Interesting, so your theory is that pc1008 is the culprit rather than the other way around.
We have made some progress on the investigation related to pc1008 and it does show some performance issues with its disks (T247787#5975506)
I will start the incident report tomorrow btw

There's an increase in slow parses, but that may be consistent with increased load due to increased parser cache misses.
How long does it usually take to recover after something like that? I want to think there's a downward trend in the graphs, but if it's not just my imagination it's pretty slow

Yep, it would take 30 days for a full recovery, in a logarithmic pattern (first 1 hour would be the most effective, with decreasing increments after that). Note that as long as hit ratio stays above 50%, performance loss would not create an outage.

I would add this behavior to the list of things that could probably be improved at T133523, long term. The double key write you describe was unknown to me, but it would mean a SPOF (on a server slowdown, not only 33% of connections would fail or degrade, but all of them). While an architecture refactoring would be needed, as mentioned there, for a proper solution (automatic failover, etc.), I think that a simple patch adding a smaller than usual timeout could help here, as a parsercache disk write could be considered a non-fatal error, at least under some circumstances? Open for discussion, not sure about this.

jcrespo added a comment. · Edited · Mar 17 2020, 4:46 PM

On the other hand, I wonder whether the increase in writes (misses, to some extent) on pc1 and pc3 after the failover, which is real (https://grafana.wikimedia.org/d/000000273/mysql?fullscreen&panelId=2), can be justified by the virtual wipe of data on pc2, which I would expect to be the only "cold" one.

Edit:
For example, pc1 has lost 1TB of free disk space in only 1 day, and is still getting lots of writes:
https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&fullscreen&panelId=6&from=1583858890502&to=1584463690502&var-contentModel=wikitext

That is abnormal.

In other words, I don't think @Anomie is wrong, but I think there is still something ongoing we should keep an eye on. We could have another instance of T167784 or T167784#3473685

Not the case, the extra space is only happening on pc1010, which is expected.

Mentioned in SAL (#wikimedia-operations) [2020-03-17T17:10:21Z] <jynus> purging some old rows on pc1010 on a screen to earn some time T247788

The double key write you describe was unknown to me,

So a quick background: The parser has many factors that can result in different output. For some of these we don't even bother to cache when they're not at their default values, but there are still at least 6 for which we do store separate outputs for all the possible values.

But the thing is that while these 6 factors can result in different output, it's pretty often the case that for any particular page they don't actually make a difference. For example, the user's "thumbnail size" preference is irrelevant if there are no images in the page, and the user's UI language is irrelevant if the content part of the page doesn't use it.

So we do something that's very much like the HTTP Vary header: we track which of the factors actually affected the parser's output and store the list under one key, and then the actual output under a second key derived from just the relevant factors. When a later request is checking the cache, we fetch the first key so we can construct the second key with only the factors that are relevant, avoiding a reparse (and cache entries with duplicate content) when only irrelevant factors are different.
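As a rough illustration of the Vary-like scheme described above, here is a minimal Python sketch. It is not MediaWiki's actual ParserCache code: the class name, the key formats, and the option names `userlang` and `thumbsize` are all hypothetical, and a plain dict stands in for the sharded backend.

```python
import hashlib


class TwoKeyCache:
    """Hypothetical sketch of a Vary-style two-key cache (not MediaWiki code)."""

    def __init__(self):
        # Stand-in for the sharded backend; a real backend could place
        # each of the two keys on a different server.
        self.backend = {}

    @staticmethod
    def _content_key(page, options, used):
        # Only the options that actually affected the output go into the key.
        relevant = sorted((name, options[name]) for name in used)
        digest = hashlib.sha1(repr(relevant).encode()).hexdigest()
        return f"pcache:{page}:{digest}"

    def save(self, page, options, used, output):
        # Write 1: the list of relevant options, under a per-page key.
        self.backend[f"pcache-opts:{page}"] = sorted(used)
        # Write 2: the output, under a key derived from the relevant options.
        self.backend[self._content_key(page, options, used)] = output

    def get(self, page, options):
        # Read 1: find out which options matter for this page.
        used = self.backend.get(f"pcache-opts:{page}")
        if used is None:
            return None
        # Read 2: build the second key from just those options.
        return self.backend.get(self._content_key(page, options, used))


cache = TwoKeyCache()
# Assume the parse of "Foo" depended only on the UI language.
cache.save("Foo", {"userlang": "en", "thumbsize": 220}, {"userlang"}, "<p>hi</p>")
# A different thumbnail size is irrelevant here, so this is still a hit.
print(cache.get("Foo", {"userlang": "en", "thumbsize": 300}))
```

In this sketch every save performs two backend writes, which is the "double key write" discussed in this thread.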

P.S. The switch to Parsoid, when that eventually happens, will probably be able to eliminate some of those 6 factors but probably not all of them.

but would mean a SPOF (not only 33% of connections would fail or degrade on a server slowdown, but all of them).

There's two things to look at here: fraction of PC entries that go missing when one server is wiped, and the increase in idle connections on unaffected servers when one of the servers is hanging connections.

For PC entries affected, if just one of the three servers is wiped we should wind up with about 55%: 11% having both keys on the wiped server, 22% having just the first on the wiped server, and 22% having just the second on the wiped server, while 44% would have both keys only on the two unaffected servers and so would not be affected.

For requests that result in an increase in idle connections on the unaffected servers, that should be only the 22% of requests having just the second key on the hanging server (the first write opens a connection to a healthy server, which then sits idle while the second write hangs). Another 33% touch the hanging server first and so don't open an idle connection to one of the healthy ones, and the 44% never touch the hanging server at all.
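The percentages above can be checked by enumerating the nine possible placements of the two keys. This is a sketch under an assumption not stated in the thread: each key lands on one of three servers independently and uniformly. The host names are used only as labels, and the idle-connection case is taken to be the one described earlier in the thread (first write to a healthy server, second write to the hanging one).

```python
from fractions import Fraction
from itertools import product

SERVERS = ("pc1007", "pc1008", "pc1009")  # labels only
HANGING = "pc1008"  # the server assumed to be wiped or hanging


def placement_shares(servers=SERVERS, bad=HANGING):
    """Share of requests per case, assuming independent uniform key placement."""
    shares = {
        "both": Fraction(0),
        "first_only": Fraction(0),
        "second_only": Fraction(0),
        "neither": Fraction(0),
    }
    weight = Fraction(1, len(servers) ** 2)
    for first, second in product(servers, repeat=2):
        if first == bad and second == bad:
            case = "both"
        elif first == bad:
            case = "first_only"
        elif second == bad:
            case = "second_only"
        else:
            case = "neither"
        shares[case] += weight
    return shares


shares = placement_shares()
# Entries lost when the bad server is wiped: any entry with a key on it.
lost_on_wipe = 1 - shares["neither"]
# Requests leaving an idle connection on a healthy server:
# first write goes to a healthy server, second write hangs on the bad one.
idle_on_healthy = shares["second_only"]
# Requests that hit the bad server with their first write.
hits_bad_first = shares["both"] + shares["first_only"]

for name, share in shares.items():
    print(f"{name}: {share} ({float(share):.1%})")
print(f"lost on wipe: {lost_on_wipe}, idle on healthy: {idle_on_healthy}, "
      f"bad server first: {hits_bad_first}")
```

The exact fractions are 1/9, 2/9, 2/9 and 4/9, so the rounded figures quoted above add up to 99% rather than 100%.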

Yeah, that'll probably be needed. MW's SqlBagOStuff won't do any early eviction based on storage space pressure, only on the normal time-based expiry (30 days, IIRC).

AMooney added a subscriber: AMooney.

Untagging because there is nothing for CPT to do at this point. Anomie will stay subscribed and will retag if needed.

Thanks @Anomie for the detailed explanation, like Jaime, I also had no idea about that double write factor, that explains things. I have added these learnings to T133523 as Jaime proposed.
Any objection on closing this task as this was clearly a consequence and not the cause?

As Jaime said, there's still work to do on pc1010 (purge, defragmentation, replication etc) but we can probably use T247787: investigate pc1008 for possible hardware issues / performance under high load for that.

Any objection on closing this task as this was clearly a consequence and not the cause?

If you're asking me, I have no objection.

Marostegui closed this task as Resolved. · Mar 18 2020, 1:42 PM
Marostegui assigned this task to Anomie.

Resolving. Thanks @Anomie for the clarifications and explaining what was going on.

I will be doing an IR and posting the link here btw