
beta-update-databases-eqiad hasn't completed in 22 days because of (?) FlowUpdateRevisionContentLength:version2
Closed, ResolvedPublic

Description

https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/

FlowPopulateLinksTables is the last task that's logged as being done (indeed, already complete) before it fails.

The next one is FlowUpdateRevisionContentLength:version2, but it never returns…

Event Timeline

Jdforrester-WMF raised the priority of this task from to High.
Jdforrester-WMF updated the task description. (Show Details)
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/label=deployment-bastion-eqiad,wikidb=enwiki/lastFailedBuild/console is timing out.

There are a lot of edits by the Selenium user, and this one-time update is not optimized. We should see if we can just run it without the timeout, e.g. from the shell.

Change 197226 had a related patch set uploaded (by Greg Grossmeier):
Set timeout to 45 minutes for db update

https://gerrit.wikimedia.org/r/197226

Jdforrester-WMF renamed this task from beta-update-databases-eqiad hasn't completed in 20 days because of (?) FlowUpdateRevisionContentLength:version2 to beta-update-databases-eqiad hasn't completed in 22 days because of (?) FlowUpdateRevisionContentLength:version2.Mar 19 2015, 1:42 AM
Jdforrester-WMF set Security to None.
Mattflaschen-WMF raised the priority of this task from High to Unbreak Now!.Mar 19 2015, 2:32 AM

There is some Flow updater which hammers the Parsoid Varnish cache with requests such as:

pass/200 57 POST http://parsoid/enwiki/Topic:S3cn7knq5tqf9nf6
pass/200 55 POST http://parsoid/enwiki/Topic:S3cna8dz7v2wm24g

On the Parsoid side (/data/project/parsoid/parsoid.log), looking for the first topic:

{"name":"parsoid","hostname":"deployment-parsoid05","pid":7053,"level":30,"logType":"info",
"wiki":"enwiki","title":"Topic:S3cn7knq5tqf9nf6","oldId":null,
"msg":"started serializing","longMsg":"started serializing","time":"2015-03-19T07:29:55.789Z","v":0}

{"name":"parsoid","hostname":"deployment-parsoid05","pid":7053,"level":30,"logType":"info",
"wiki":"enwiki","title":"Topic:S3cn7knq5tqf9nf6","oldId":null,
"msg":"completed serializing in 3 ms","longMsg":"completed serializing in\n3\nms","time":"2015-03-19T07:29:55.793Z","v":0}

{"name":"parsoid","hostname":"deployment-parsoid05","pid":7055,"level":30,"logType":"info",
"wiki":"enwiki","title":"Topic:S3cn7knq5tqf9nf6","oldId":null,
"msg":"started serializing","longMsg":"started serializing","time":"2015-03-19T08:30:22.259Z","v":0}

{"name":"parsoid","hostname":"deployment-parsoid05","pid":7055,"level":30,"logType":"info",
"wiki":"enwiki","title":"Topic:S3cn7knq5tqf9nf6","oldId":null,
"msg":"completed serializing in 4 ms","longMsg":"completed serializing in\n4\nms","time":"2015-03-19T08:30:22.262Z","v":0}

Note the timestamps 07:29 and 08:30, which indicate the update of that topic keeps being redone on each run.

I have no idea which update task is running, why it is updating existing fields, or why it ends up hitting Parsoid. This might need to be investigated.

The update job is aborted by Jenkins after 30 minutes, and since the update starts from scratch each time, the job can never complete.
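That failure mode can be reproduced in miniature with the coreutils `timeout` command: work that is hard-killed mid-run and always restarts from the first batch never gets past the cutoff. This is only an illustrative stand-in for the 30-minute Jenkins limit; the file name and timings here are made up:

```shell
# A job that logs one batch every 0.4 s but is hard-killed after 1 s:
# only the batches finished before the kill make it into the log, and a
# retry that starts over from batch 1 hits the same wall every time.
timeout 1 sh -c 'for i in 1 2 3; do sleep 0.4; echo "batch $i done"; done' > progress.log || true
cat progress.log
```

Without persisting its progress, each retry redoes batch 1 first, which matches the repeated 07:29/08:30 serializations of the same topic seen in the Parsoid log above.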

The workaround is to run the script manually from deployment-bastion.eqiad.wmflabs inside a screen session. Something like:

screen
sudo -u mwdeploy mwscript update.php --wiki=enwiki --quick

Then mark the update task as completed in the update table, to prevent update.php from running it again.
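The "mark it completed" step could be sketched as below. This assumes MediaWiki's standard `updatelog` bookkeeping table and its `ul_key` column, and that the `sql.php` maintenance script on this branch accepts a `--query` option; if it does not, the same INSERT can be run from a database prompt. The key name is taken from the job log:

```shell
# Record the slow updater as already done so the next update.php run skips it.
# Run on deployment-bastion.eqiad.wmflabs. updatelog/ul_key are the standard
# MediaWiki schema; the --query flag is assumed to be available on this branch.
sudo -u mwdeploy mwscript sql.php --wiki=enwiki \
    --query "INSERT INTO updatelog (ul_key) VALUES ('FlowUpdateRevisionContentLength:version2')"
```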

Change 197226 merged by jenkins-bot:
Beta timing out jobs now abort 45 mins for db update

https://gerrit.wikimedia.org/r/197226

And that was deployed. The next run of the job is in 35 minutes… let's see what happens.

https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/8314/label=deployment-bastion-eqiad,wikidb=enwiki/console

Catchable fatal error: Argument 1 passed to LogRowUpdateGenerator::loadTopic() must be an instance of Flow\Model\UUID, string given, called in /mnt/srv/mediawiki-staging/php-master/extensions/Flow/maintenance/FlowFixLog.php on line 104 and defined in /mnt/srv/mediawiki-staging/php-master/extensions/Flow/maintenance/FlowFixLog.php on line 148
wmerrors: timed out during fatal error handler, aborting
Build step 'Execute shell' marked build as failure
Finished: FAILURE

Change 198133 had a related patch set uploaded (by EBernhardson):
postId and topicId can be strings or UUIDs

https://gerrit.wikimedia.org/r/198133

Change 198133 merged by jenkins-bot:
postId and topicId can be strings or UUIDs

https://gerrit.wikimedia.org/r/198133

That patch fixed the second issue (which was different from the first).

Apparently someone manually ran the 'FlowUpdateRevisionContentLength:version2' update, which was taking a very long time to run, so the root cause appears to be fixed.

I guess the production wikis have been/will be updated using a batched maintenance script as well. We can probably close this task.