New error "DB is set and has not been closed by the Load Balancer" for certain bad revisions during page content dumps
Closed, Resolved (Public)

Description

Typical error:

Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways
getting/checking text 406916 failed (Generic error while obtaining text for id 406916) for revision 413206

This is new; I guess it might have to do with changes in preparation for being able to reload the db config.

You can reproduce this from snapshot1009 (testbed host) as the dumpsgen user: run the command

/usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php dumpTextPass.php --wiki=azwiki --report=1 --stub=gzip:/mnt/dumpsdata/temp/dumpsgen/db_config_issue/azwiki-20220822-stub-meta-history.xml.gz   --spawn=/usr/bin/php7.2 --output=bzip2:/mnt/dumpsdata/temp/dumpsgen/db_config_issue/azwiki-20220822-pages-meta-history.xml.bz2

The dumps do continue on after the faulty revision is processed, but it would be nice to sort out whatever the problem is here.

Event Timeline

ArielGlenn triaged this task as Medium priority.
ArielGlenn added subscribers: daniel, Ladsgroup.

Adding @Ladsgroup and @daniel since I suspect they'll know which part of recent work might have come into play here, if any. Note that dumpText.php and TextPassDumper.php haven't been changed recently. The earlier run on the 1st of the month did not produce these errors.

This would probably fix your case. I don't know what exactly caused it to be triggered now, but I know the underlying issue (the change from a db connection to a db connection ref): https://gerrit.wikimedia.org/r/c/mediawiki/core/+/824677

Thanks, added myself as a cc on the patch.

Merged and backported. Can you tell me if that fixed your issue?

I'll be able to test when the train rolls out tomorrow to the group2 wikis; the errors are on a few of those.

The error is still there, annoyingly enough, @Ladsgroup

There's not a stack trace; the job continues on, as mentioned in the task description. Full text of what I see in the logs:

getting/checking text 406916 failed (Received text is unplausible for id 406916) for revision 413206
      1617
      
       (Will retry 4 more times)
getting/checking text 406916 failed (Generic error while obtaining text for id 406916) for revision 413206
      1617
      
       (Will retry 3 more times)
Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways
getting/checking text 406916 failed (Generic error while obtaining text for id 406916) for revision 413206
      1617
      
       (Will retry 2 more times)
Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways
getting/checking text 406916 failed (Generic error while obtaining text for id 406916) for revision 413206
      1617
      
       (Will retry 1 more times)
Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways
getting/checking text 406916 failed (Generic error while obtaining text for id 406916) for revision 413206
      1617

That inner exception text is coming from TextPassDumper::rotateDb. There is a check for open connections after LoadBalancer::closeAll, and it is unexpected for that check to fail, because the load balancer should know about its own open connections and close them.
Something wrong within MediaWiki-libs-Rdbms?
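
To make the suspicion above concrete, here is a toy Python model of the pattern being described: close everything through the balancer, then verify that the caller's own handle is closed. It is not MediaWiki code. The class and function names (ToyConnection, ToyLoadBalancer, rotate_db) are hypothetical, and the idea that the failing handle is one the balancer does not track is only an assumption for illustration, not a confirmed diagnosis.

```python
class ToyConnection:
    """Stand-in for a database handle (hypothetical, not a MediaWiki class)."""

    def __init__(self):
        self._open = True

    def close(self):
        self._open = False

    def is_open(self):
        return self._open


class ToyLoadBalancer:
    """Tracks only the connections it handed out itself."""

    def __init__(self):
        self._tracked = []

    def get_connection(self):
        conn = ToyConnection()
        self._tracked.append(conn)
        return conn

    def close_all(self):
        for conn in self._tracked:
            conn.close()
        self._tracked.clear()


def rotate_db(lb, db):
    """Close everything via the balancer, then check the caller's handle."""
    lb.close_all()
    if db is not None and db.is_open():
        raise RuntimeError("DB is set and has not been closed by the Load Balancer")


# Case 1: the handle came from the balancer, so close_all() closes it.
lb = ToyLoadBalancer()
tracked = lb.get_connection()
rotate_db(lb, tracked)

# Case 2 (assumed scenario): the balancer never tracked this handle,
# so it stays open and the post-close check raises.
lb = ToyLoadBalancer()
untracked = ToyConnection()
try:
    rotate_db(lb, untracked)
    outcome = "ok"
except RuntimeError as exc:
    outcome = str(exc)
print(outcome)
```

In this model the check can only fail when the handle being inspected is not among the connections the balancer closes, which is why a failure points at connection tracking rather than at the dump script itself.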

Errors again today from the run, same job, same wikis: cawiki, etwiki, ocwiki, azwiki, hrwiki, nowiki, ruwikinews. Just noting it here so we know the error is still an issue.

Just a note that we are still seeing some of these with the most recent run.

Are you still seeing this error? It shouldn't happen after loadmonitor clean up.

Yeah, we still see them, and we got this on Feb 2, 8:40 am UTC.

*******Wikis with exceptions:
etwiki, ocwiki
===========================================================

*** Wiki: etwiki
=====================
[20230202082805]: Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways

*** Wiki: ocwiki
=====================
[20230202081000]: Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways

This is only an excerpt; there were more errors than shown here. We also got errors from several wikis later that day and the next day.
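
For anyone wanting to tally these per wiki, a minimal Python sketch follows. It assumes the report layout is exactly as in the excerpt above (a "*** Wiki: name" header followed by timestamped lines); the function name count_errors_by_wiki is made up for this example, and the sample text is copied from the excerpt rather than read from a real log file.

```python
import re
from collections import Counter

ERROR_TEXT = "DB is set and has not been closed by the Load Balancer"
# Matches per-wiki headers such as "*** Wiki: etwiki"; the summary line
# "*******Wikis with exceptions:" does not match because it has no
# space after the third asterisk.
WIKI_HEADER = re.compile(r"^\*\*\* Wiki: (\S+)")


def count_errors_by_wiki(report_text):
    """Count occurrences of ERROR_TEXT under each '*** Wiki:' header."""
    counts = Counter()
    current_wiki = None
    for line in report_text.splitlines():
        header = WIKI_HEADER.match(line)
        if header:
            current_wiki = header.group(1)
        elif current_wiki is not None and ERROR_TEXT in line:
            counts[current_wiki] += 1
    return counts


sample = """*******Wikis with exceptions:
etwiki, ocwiki
===========================================================

*** Wiki: etwiki
=====================
[20230202082805]: Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways

*** Wiki: ocwiki
=====================
[20230202081000]: Rebooting getText infrastructure failed (DB is set and has not been closed by the Load Balancer) Trying to continue anyways
"""

counts = count_errors_by_wiki(sample)
for wiki, n in sorted(counts.items()):
    print(wiki, n)
```

Run against a full report, the same function would give a quick view of which wikis are affected on a given run and how often.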

Just a note that we still regularly see these errors on each dump run for a small selection of wikis.

I marked a couple of these as bad just to see what that process was like; see T346969.

Milimetric claimed this task.
Milimetric moved this task from Active to Done on the Dumps-Generation board.

We suspected this was resolved with the latest update to dumps generation code. I checked the logs of all dumps jobs for the past 3 weeks and found zero instances of this error. In the previous runs of dumps there were thousands of instances. So I'm marking this as resolved for now.

There was also https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1078491 recently, as part of T351615: maintenance/dumpBackup.php does not dump when 'actor' is in $wgSharedTables; that likely helped here as well.