
Make sure Wikibase dump maintenance scripts solely use the "dump" db group
Closed, Resolved · Public

Description

As repeatedly requested, Wikibase should only connect to the "dump" database group in the (long-running) dump scripts (repo/maintenance/dumpEntities.php, repo/maintenance/dumpJson.php and repo/maintenance/dumpRdf.php).

We have a few options to implement this:

  1. Try to pass the db group (as string) to all classes that get database connections?
  2. Create our own (static?) wrapper which we can use instead of wfGetDB (and wfGetLB / the analogous function on MediaWikiServices)?
  3. Propose a core patch to be able to change the default for $groups in LoadBalancer::getConnection.

I'm not sure about this… all options seem kind of dirty in their own way, except maybe the third, if properly thought through.
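
To illustrate option 1, a service that gets its own connections could take the group as a constructor argument and pass it through to the LoadBalancer. This is only a rough sketch; the class and property names are made up, not existing Wikibase code:

class EntityDumpPager {

    /** @var LoadBalancer */
    private $loadBalancer;

    /** @var string|false Query group to use, e.g. 'dump', or false for the generic pool */
    private $dbGroup;

    public function __construct( LoadBalancer $loadBalancer, $dbGroup = false ) {
        $this->loadBalancer = $loadBalancer;
        $this->dbGroup = $dbGroup;
    }

    private function getReplicaConnection() {
        // Passing the group makes the LoadBalancer prefer servers in that
        // group (e.g. the "dump" servers) over the generic replica pool.
        $groups = $this->dbGroup !== false ? [ $this->dbGroup ] : [];

        return $this->loadBalancer->getConnection( DB_REPLICA, $groups );
    }
}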

Event Timeline

hoo created this task.Oct 3 2016, 2:51 AM
Restricted Application added a subscriber: Aklapper. Oct 3 2016, 2:51 AM
hoo added a subscriber: aaron.Oct 3 2016, 5:51 PM

My preferred solution would be to introduce a "ConnectionManager" or "ConnectionHandle" interface into core. WikibaseClient has ConsistentReadConnectionManager, which would be one possible implementation. Setting the desired db group could easily be added to that.

Relevant code that currently takes a LoadBalancer or a DatabaseBase, or that calls wfGetDB or wfGetLB, would need to be changed to take a ConnectionManager.

I think this would be the cleanest solution. I have actually often wished for having such a class in core.
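
Very roughly, such an interface might look like this (method names are only illustrative, loosely following WikibaseClient's ConsistentReadConnectionManager; this is not an existing core API):

interface ConnectionManager {

    /**
     * Get a replica connection for the configured wiki, using the
     * configured query groups (e.g. [ 'dump' ]).
     *
     * @return IDatabase
     */
    public function getReadConnection();

    /**
     * Get a master connection for the configured wiki.
     *
     * @return IDatabase
     */
    public function getWriteConnection();

    /**
     * Return a connection previously obtained from this manager to the
     * underlying LoadBalancer.
     */
    public function releaseConnection( IDatabase $db );
}

An implementation would be constructed from a LoadBalancer, a wiki (db) name and the default query groups, so services only need this one object injected.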

hoo added a comment.Oct 27 2016, 3:46 PM

My preferred solution would be to introduce a "ConnectionManager" or "ConnectionHandle" interface into core. WikibaseClient has ConsistentReadConnectionManager, which would be one possible implementation. Setting the desired db group could easily be added to that.

This is very similar to my second idea… having it in core (in a widely used manner, with broad consensus) would probably make it a nice solution.

I'm not sure the gains outweigh the amount of work needed here, but then again, injecting DB connections rather than using global functions sounds like a good idea to me in general.

One thing that is unclear: what are the benefits of introducing a new class rather than extending LoadBalancer (which is what we use to "lazily" inject database connections these days)?

aaron added a comment.Oct 27 2016, 4:26 PM

Does getLazyConnectionRef() help here?

hoo added a comment.Oct 27 2016, 4:31 PM

Does getLazyConnectionRef() help here?

Hm… I guess DBConnRef also kind of overlaps with the idea behind the classes Daniel suggested: wrap the database connection (getter) in a stateful way, so that we can attach further information to it (the default group to use, in this case).

We probably could use DBConnRef here.

Two things about DBConnRef that I don't like:

  • It hogs the connection until it is destroyed. That makes me uncomfortable about making it a member of a service: services should not own a connection permanently.
  • If your service needs read and write connections, you have to pass in two DBConnRef objects.

Regarding using LoadBalancer: we can, but then you need to pass in three arguments: the LB, the wiki (db) name, and the groups. The ConnectionManager would bundle that info (just like DBConnRef does).
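
Without such a wrapper, each consumer ends up repeating something like this (illustrative only; the wiki id and group are example values):

// Each service needs the LoadBalancer plus the wiki (db) name plus the groups:
$dbr = $this->loadBalancer->getConnection( DB_REPLICA, [ 'dump' ], 'wikidatawiki' );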

hoo added a comment.Oct 28 2016, 7:39 PM

I can see that… but it seems to me we're adding another layer just because the underlying ones are "bad"; we should try to avoid that.

If the new thing allows us to use it as a replacement both for passing LoadBalancer around to encapsulate connections and for DBConnRef entirely, that would be a way forward, but just adding another thing sounds off to me.

If the new thing allows us to use it as a replacement both for passing LoadBalancer around to encapsulate connections and for DBConnRef entirely, that would be a way forward, but just adding another thing sounds off to me.

The new ConnectionManager thingy would be a replacement for passing LoadBalancer + wiki name + groups. It encapsulates the same information as a DBConnRef, but it would be different in some ways:

  • it would not implement the DatabaseBase interface
  • it would not hog a connection permanently.
  • it would not automatically release anything when going out of scope.

To me, DBConnRef is a way to automatically release a connection after using it. It's basically a C++ style "automatic object" version of the Database object. So a DBConnRef's scope should be a single method invocation, it should not be a member. The new ConnectionManager could very well have a getConnectionRef() method in addition to a getConnection() method.
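
In other words, usage inside a service method could look roughly like this (a sketch only; the getConnectionRef() signature and the query are made up for illustration):

public function countItemLinks() {
    // The ref only lives for the duration of this method call, so the
    // service itself never holds on to the underlying connection.
    $dbr = $this->connectionManager->getConnectionRef( DB_REPLICA );

    return (int)$dbr->selectField( 'wb_items_per_site', 'COUNT(*)', [], __METHOD__ );
}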

For the task at hand: do whatever is simple and gets the job done. If you don't want to mess with core, the simplest way would be to just pass the groups in addition to the wiki name. Using DBConnRef as a member without any modification is probably not a good idea, because it would hog the DB connection indefinitely.

Patch adding ConnectionManager to core: https://gerrit.wikimedia.org/r/#/c/322644/3

hoo added a comment.Oct 5 2017, 1:51 PM

I looked into this some more just now:

I think the best way to go here is to make it possible (how?) to always append a "default" group to the $groups array in LoadBalancer::getConnection. That way, this group will be chosen over the "generic pool", whenever no group is explicitly asked for.
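
Conceptually that would mean something along these lines inside LoadBalancer::getConnection (a pseudo-code sketch of the idea, not the actual core change; $this->defaultGroup is hypothetical):

public function getConnection( $i, $groups = [], $domain = false, $flags = 0 ) {
    if ( $groups === [] && $this->defaultGroup !== null ) {
        // No group explicitly requested: fall back to the configured
        // default group (e.g. 'dump') instead of the generic pool.
        $groups = [ $this->defaultGroup ];
    }
    // ... existing connection selection logic ...
}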

@hoo how necessary is this? Should we pick it up for Wikidata-Campsite?

hoo added a comment.Jun 5 2018, 1:59 PM

@hoo how necessary is this? Should we pick it up for Wikidata-Campsite?

Yes, we really want this.

Change 440022 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/core@master] Allow passing the default DB group to use in Maintenance scripts

https://gerrit.wikimedia.org/r/440022

hoo claimed this task.Jun 12 2018, 9:53 PM
hoo moved this task from Incoming to Peer Review on the Wikidata-Campsite board.

Change 440022 merged by jenkins-bot:
[mediawiki/core@master] Allow passing the default DB group to use in Maintenance scripts

https://gerrit.wikimedia.org/r/440022

Change 440986 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] snapshot: make wikidata dump cronjobs use dump db servers

https://gerrit.wikimedia.org/r/440986

Change 441072 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/core@master] Maintenance::finalSetup: Make sure we re-create LBFactory

https://gerrit.wikimedia.org/r/441072

Change 441073 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Make dump maintenance scripts use the "dump" db group

https://gerrit.wikimedia.org/r/441073

Change 441072 merged by jenkins-bot:
[mediawiki/core@master] Maintenance::finalSetup: Make sure we re-create LBFactory

https://gerrit.wikimedia.org/r/441072

Change 441073 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Make dump maintenance scripts use the "dump" db group

https://gerrit.wikimedia.org/r/441073

Change 441212 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Add "dumpDBDefaultGroup" repo setting

https://gerrit.wikimedia.org/r/441212

Change 441212 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add "dumpDBDefaultGroup" repo setting

https://gerrit.wikimedia.org/r/441212
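
For reference, on a wiki that defines a "dump" query group, this would presumably be wired up along these lines (illustrative snippet; the exact server configuration depends on the deployment):

// LocalSettings.php: make the Wikibase dump maintenance scripts prefer the "dump" group.
$wgWBRepoSettings['dumpDBDefaultGroup'] = 'dump';

// The group itself is defined in the replica configuration, e.g.:
$wgDBservers[1]['groupLoads'] = [ 'dump' => 1 ];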

hoo closed this task as Resolved.Jun 25 2018, 10:58 AM

I'll evaluate this with @ArielGlenn before we can further increase the number of shards/dump runners.

Change 440986 merged by ArielGlenn:
[operations/puppet@production] snapshot: make wikidata dump cronjobs use dump db servers

https://gerrit.wikimedia.org/r/440986

@ArielGlenn I could confirm that this works correctly on the current dump run:

$ sudo -u dumpsgen sh -c "lsof -p `ps aux | grep dumpRdf.php | awk '{ print $2}' | paste -s -d ,`" | awk '/->db/ {print $9}'
snapshot1008.eqiad.wmnet:56086->db1087.eqiad.wmnet:mysql
snapshot1008.eqiad.wmnet:59318->db1087.eqiad.wmnet:mysql
snapshot1008.eqiad.wmnet:39248->db1087.eqiad.wmnet:mysql
snapshot1008.eqiad.wmnet:49804->db1087.eqiad.wmnet:mysql
snapshot1008.eqiad.wmnet:56006->db1087.eqiad.wmnet:mysql
snapshot1008.eqiad.wmnet:51752->db1087.eqiad.wmnet:mysql
jcrespo reopened this task as Open.EditedAug 13 2018, 9:36 AM

dumpJson.php is failing many times per minute, even though I only rebooted an rc replica (which is fully depooled) for maintenance:

https://logstash.wikimedia.org/goto/e2dee89b02bc72774c4320671d879b05

Edit: This may be off-topic here, but I am not sure where in the tree to report it, so please move it to where it belongs, or tell me to open a new bug.

I didn't get a chance to check things then, but all open connections from snapshot1008 to dbs are now to db1087, as they should be. This will take some digging.

I will hazard a guess that the script initially retrieves the correct db for the dumps group but does not try to refresh it periodically (or on every failure). We could consider doing so after some number of consecutive failures; @hoo what do you think?

jcrespo added a comment.EditedAug 13 2018, 11:09 AM

does not try to refresh it periodically

But db1101 was never a dump host, only an rc host, and it used to have load 1. Also, it didn't have any connected clients (I checked).

does not try to refresh it periodically

But db1101 was never a dump host, only an rc host, and it used to have load 1. Also, it didn't have any connected clients (I checked).

That's the most bizarre thing ever. All I can say is, we'll dig into it, and sorry!

No need to be sorry. Could this be a load balancer bug?

hoo added a comment.Aug 15 2018, 1:33 PM

No need to be sorry. Could this be a load balancer bug?

I just looked into this briefly and skimmed all (seemingly) relevant code paths… I still have no idea how this could happen (also these errors don't seem to be on mwlog1001?!).

Does anyone have any further hints here?

I still have no idea how this could happen (also these errors don't seem to be on mwlog1001?!).

Does dumpJson.php call any function that could force a check on all replicas to fail (and, because it has old configuration, it fails on a depooled non-slow host)?

Addshore moved this task from Unsorted 💣 to Next on the User-Addshore board.
Addshore moved this task from incoming to in progress on the Wikidata board.Sep 18 2018, 2:25 PM
hoo added a comment.Oct 15 2018, 11:49 AM

I still have no idea how this could happen (also these errors don't seem to be on mwlog1001?!).

Does dumpJson.php call any function that could force a check on all replicas to fail (and, because it has old configuration, it fails on a depooled non-slow host)?

I really can't think of anything like this. In the past MediaWiki used to "magically" do weird things like this, but I think it no longer does.

Having a quick look at what connections the dumpers currently have open also looks good to me:

hoo@snapshot1008:~$ sudo -u dumpsgen lsof -li | grep -P '('`ps aux | awk '/maintenance\/dump/ { print $2 }' | paste -d "|" -s`')' | grep -oP '(db|es)\d+\..*\.wmnet' | sort | uniq -c
      8 db1104.eqiad.wmnet
      4 es1011.eqiad.wmnet
      4 es1013.eqiad.wmnet
      6 es1014.eqiad.wmnet
      2 es1019.eqiad.wmnet

I will also log this information (in a screen on snapshot1008) for the next hours/days and see whether any unexpected connections are opened.

@jcrespo Do you have any more information on what to look out for here?

jcrespo added a comment.EditedOct 15 2018, 12:10 PM

@jcrespo Do you have any more information on what to look out for here?

The problem is most likely the load balancer: working with old data, even if the server is not needed, generates lag warnings because of some complex dependency in how the lag check works (generating 40K logs per minute). Abandon ship: this is not going to work, and nothing that Wikibase developers or DBAs do can fix it.

Although you could do a new database request every 10 minutes or so to mitigate that. Kill the process and restart it so it reloads the new config.

hoo added a comment.Oct 16 2018, 9:27 AM

@jcrespo Do you have any more information on what to look out for here?

The problem is most likely the load balancer: working with old data, even if the server is not needed, generates lag warnings because of some complex dependency in how the lag check works (generating 40K logs per minute). Abandon ship: this is not going to work, and nothing that Wikibase developers or DBAs do can fix it.
Although you could do a new database request every 10 minutes or so to mitigate that. Kill the process and restart it so it reloads the new config.

We currently run the dumper instances for about 1h... would it be helpful to shorten this to maybe 30m?

@jcrespo Do you have any more information on what to look out for here?

The problem is most likely the load balancer: working with old data, even if the server is not needed, generates lag warnings because of some complex dependency in how the lag check works (generating 40K logs per minute). Abandon ship: this is not going to work, and nothing that Wikibase developers or DBAs do can fix it.
Although you could do a new database request every 10 minutes or so to mitigate that. Kill the process and restart it so it reloads the new config.

We currently run the dumper instances for about 1h... would it be helpful to shorten this to maybe 30m?

It will definitely help; it may still get into race conditions, but obviously fewer than if it is every 1h :-)

hoo added a comment.Oct 17 2018, 6:16 PM

We currently run the dumper instances for about 1h... would it be helpful to shorten this to maybe 30m?

It will definitely help; it may still get into race conditions, but obviously fewer than if it is every 1h :-)

I'll look into this :)

I found no connections from our dump scripts to any db other than db1104 (and some es hosts).

Change 468059 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Wikidata entity dumps: Halve pagesPerBatch

https://gerrit.wikimedia.org/r/468059

Change 468059 merged by ArielGlenn:
[operations/puppet@production] Wikidata entity dumps: Halve pagesPerBatch

https://gerrit.wikimedia.org/r/468059

The above script should be in effect for the rdf jobs this week and for all wikidata weeklies after that.

@hoo is this done now? If so can we close it? :)

The patch that was pushed out cuts down the amount of time during which a script may fail to notice a db configuration change, but it doesn't address the root cause, which is rather harder.

hoo closed this task as Resolved.Dec 6 2018, 11:48 AM

The issue at hand has been solved, and we also halved the time the dumper instances are running.

For actually fixing the root cause of the above mentioned problems (the dumpers take a long time to pick up MW DB configuration changes), please see T211326: (Wikidata) Dump maintenance scripts should periodically reload DB configuration.