
Wikidata entity dumper keeps connecting to depooled host for really long time
Closed, ResolvedPublic

Description

I have depooled a replica in es5 to do maintenance, and after an hour there are still connections to it from the entity data dumper. That is the only maintenance script/dumper that still holds connections and keeps reconnecting to the depooled DB after this long. This hinders the DBAs' ability to do maintenance (and, more importantly, primary switchovers are not reflected, meaning the primary can end up receiving a lot of reads).

This is similar to T298485: MW scripts should reload the database config. A proper fix would be to reload the config on the fly, but at least as a quick measure the dumper should not take this long: it should be split into smaller batches that each take maybe an hour or less.

Event Timeline

Hm, this would be visible in the dump files, right? I remember noticing a while ago that the dump started with a subset of item IDs in ascending order, then eventually jumped back to the beginning and did another subset, and so on. If I understand correctly, this is because we run eight(?) dumpers in parallel, going through the item IDs congruent to 0–7 modulo 8 (or something like that – it might also be based on the page ID), and then we concatenate their results at the end (with some slight editing to fix up the JSON array syntax). If we increase the number of dumpers, the results will probably be listed in a different order.
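A minimal sketch (Python, with hypothetical names) of the sharding scheme described above, assuming the "eight(?) dumpers" reading is right: each dumper takes the entity IDs congruent to its shard number modulo the shard count, writes a JSON array of its own, and the shard outputs are concatenated at the end with a small fix-up of the array brackets. `fetch_entity` and `all_entity_ids` are placeholders, not the real script's interface.

```python
import json

SHARD_COUNT = 8  # assumption: eight parallel dumpers


def dump_shard(shard, all_entity_ids, fetch_entity):
    """Dump every entity whose numeric ID is congruent to `shard` mod SHARD_COUNT."""
    lines = []
    for entity_id in all_entity_ids:
        if entity_id % SHARD_COUNT != shard:
            continue
        lines.append(json.dumps(fetch_entity(entity_id)))
    # Each shard file is a complete JSON array on its own.
    return "[\n" + ",\n".join(lines) + "\n]"


def concatenate_shards(shard_outputs):
    """Join the per-shard arrays into one big array, fixing up the bracket syntax."""
    bodies = [s.strip()[1:-1].strip() for s in shard_outputs]  # drop the outer [ ]
    return "[\n" + ",\n".join(b for b in bodies if b) + "\n]"
```

This is why adding or removing dumpers would reorder the concatenated output: each shard's IDs stay sorted internally, but the shard boundaries change.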

Hm, this would be visible in the dump files, right?

I don't think so, maybe I'm missing your point.

What it should be is something along the lines of the other dumpers: run for a short period of time and then pick up from where it left off. So after doing maybe 1000 items, it can exit and signal the last item ID it finished; the runner then picks that up, starts from that item ID, and runs the script again, concatenating to the same file. It doesn't need to change the structure of the file or the concurrency of the jobs.

Okay, that makes sense. So we’d have the same number of dumpers running in parallel, each of them sequentially running through its segment of the item IDs with periodic restarts.
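A sketch of that restart scheme (Python, illustrative only; `run_dumper` and BATCH_SIZE are assumptions, not the real dumper's interface): one shard's runner keeps re-invoking a short-lived dumper, resuming from the last entity ID the previous run reported and appending to the same output file.

```python
BATCH_SIZE = 1000  # assumption: items per run before the dumper exits


def run_shard(shard, out_path, run_dumper):
    """Keep restarting the dumper for this shard until it reports nothing is left.

    `run_dumper(shard, start_after, limit)` is assumed to return a tuple of
    (json_fragment, last_id_done), with last_id_done set to None when the
    shard is exhausted. Because each restart opens fresh DB connections, a
    depooled replica is only held for the duration of one batch rather than
    the whole multi-hour dump.
    """
    last_id = 0
    with open(out_path, "a", encoding="utf-8") as out:
        while True:
            fragment, last_id = run_dumper(shard, start_after=last_id, limit=BATCH_SIZE)
            out.write(fragment)  # concatenate onto the same per-shard file
            if last_id is None:
                break
```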

I'll have a look soon (probably in early March, as I'm away-ish now).

Today I was trying to upgrade s8 to bullseye and I can't depool any host; they all end up with lingering connections from snapshot1011.eqiad.wmnet. In the meantime, can you tell me when this script is run, so I can avoid that time?

Mentioned in SAL (#wikimedia-operations) [2022-03-02T08:02:20Z] <Amir1> killing all entity dumpers of wikidata in snapshot1008 (T300255)

I'm sorry, but I just killed all the RDF and JSON dumpers that kept connecting to the depooled s8 DB I needed to do maintenance on. Please assume this week's dump has failed.

If the DBAs are going to kill our dumps over this, then I assume finding a solution must be high priority.

Change 768032 had a related patch set uploaded (by Hoo man; author: Hoo man):

[operations/puppet@production] Wikibase dumps: Lower batch size (reduce run time)

https://gerrit.wikimedia.org/r/768032

Change 768032 merged by Ladsgroup:

[operations/puppet@production] Wikibase dumps: Lower batch size (reduce run time)

https://gerrit.wikimedia.org/r/768032

I don’t think we can verify this; let’s just assume it’s working better now unless we hear otherwise from the DBAs.