
Wikidata entity dumper keeps connecting to depooled host for really long time
Closed, ResolvedPublic

Description

I have depooled a replica in es5 to do maintenance, and after an hour there are still connections to it from the entity data dumper. That is the only maintenance script/dumper that still holds connections and keeps reconnecting to the depooled DB after this long. This hinders the DBAs' ability to do maintenance (and, more importantly, primary switchovers are not reflected, meaning the primary can end up receiving a lot of reads).

This is similar to T298485: MW scripts should reload the database config. A proper fix would be to reload the config on the fly, but at least as a quick measure the dumper should not take this long: it should be split into smaller batches that each take maybe an hour or less.

Event Timeline

Hm, this would be visible in the dump files, right? I remember noticing a while ago that the dump started with a subset of item IDs in ascending order, then eventually jumped back to the beginning and did another subset, and so on. If I understand correctly, this is because we run eight(?) dumpers in parallel, going through the item IDs congruent to 0–7 modulo 8 (or something like that – it might also be based on the page ID), and then we concatenate their results at the end (with some slight editing to fix up the JSON array syntax). If we increase the number of dumpers, the results will probably be listed in a different order.
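A minimal sketch (Python, with hypothetical names) of the sharding scheme described above, assuming the "eight(?) dumpers" reading is right: each dumper takes the entity IDs congruent to its shard number modulo the shard count, writes a JSON array of its own, and the shard outputs are concatenated at the end with a small fix-up of the array brackets. `fetch_entity` and `all_entity_ids` are placeholders, not the real script's interface.

```python
import json

SHARD_COUNT = 8  # assumption: eight parallel dumpers


def dump_shard(shard, all_entity_ids, fetch_entity):
    """Dump every entity whose numeric ID is congruent to `shard` mod SHARD_COUNT."""
    lines = []
    for entity_id in all_entity_ids:
        if entity_id % SHARD_COUNT != shard:
            continue
        lines.append(json.dumps(fetch_entity(entity_id)))
    # Each shard file is a complete JSON array on its own.
    return "[\n" + ",\n".join(lines) + "\n]"


def concatenate_shards(shard_outputs):
    """Join the per-shard arrays into one big array, fixing up the bracket syntax."""
    bodies = [s.strip()[1:-1].strip() for s in shard_outputs]  # drop the outer [ ]
    return "[\n" + ",\n".join(b for b in bodies if b) + "\n]"
```

This is why adding or removing dumpers would reorder the concatenated output: each shard's IDs stay sorted internally, but the shard boundaries change.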

Hm, this would be visible in the dump files, right?

I don't think so, maybe I'm missing your point.

What it should be is something along the lines of the other dumpers: run for a short period of time and then pick up from where it left off. So after doing maybe 1000 items, it can exit and signal the last item ID it finished; the runner then picks that up, starts from that item ID, and runs the script again, concatenating to the same file. It doesn't need to change the structure of the file or the concurrency of the jobs.

Okay, that makes sense. So we’d have the same number of dumpers running in parallel, each of them sequentially running through its segment of the item IDs with periodic restarts.
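A sketch of that restart scheme (Python, illustrative only; `run_dumper` and BATCH_SIZE are assumptions, not the real dumper's interface): one shard's runner keeps re-invoking a short-lived dumper, resuming from the last entity ID the previous run reported and appending to the same output file.

```python
BATCH_SIZE = 1000  # assumption: items per run before the dumper exits


def run_shard(shard, out_path, run_dumper):
    """Keep restarting the dumper for this shard until it reports nothing is left.

    `run_dumper(shard, start_after, limit)` is assumed to return a tuple of
    (json_fragment, last_id_done), with last_id_done set to None when the
    shard is exhausted. Because each restart opens fresh DB connections, a
    depooled replica is only held for the duration of one batch rather than
    the whole multi-hour dump.
    """
    last_id = 0
    with open(out_path, "a", encoding="utf-8") as out:
        while True:
            fragment, last_id = run_dumper(shard, start_after=last_id, limit=BATCH_SIZE)
            out.write(fragment)  # concatenate onto the same per-shard file
            if last_id is None:
                break
```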

I'll have a look soon (probably in early March, as I'm away-ish now).

Today I was trying to upgrade s8 to bullseye and I can't depool any host; they all end up with lingering connections from snapshot1011.eqiad.wmnet. In the meantime, can you tell me when this script is run, so I can avoid that time?

Mentioned in SAL (#wikimedia-operations) [2022-03-02T08:02:20Z] <Amir1> killing all entity dumpers of wikidata in snapshot1008 (T300255)

I'm sorry, but I just killed all the RDF and JSON dumpers that kept connecting to the depooled s8 DB I needed to do maintenance on. Please assume this week's dump has failed.

If the DBAs are going to kill our dumps over this, then I assume finding a solution must be high priority.

Change 768032 had a related patch set uploaded (by Hoo man; author: Hoo man):

[operations/puppet@production] Wikibase dumps: Lower batch size (reduce run time)

https://gerrit.wikimedia.org/r/768032

Change 768032 merged by Ladsgroup:

[operations/puppet@production] Wikibase dumps: Lower batch size (reduce run time)

https://gerrit.wikimedia.org/r/768032

I don’t think we can verify this; let’s just assume it’s working better now unless we hear otherwise from the DBAs.