Maxlag lag incorrectly causing bot to crash with pywikibot.data.api.APIError
Closed, ResolvedPublic

Description

Pywikibot checks if maxlag is returned ( https://phabricator.wikimedia.org/diffusion/PWBC/browse/master/pywikibot/data/api.py;31c0234c3e0dd5f0740aa7d69d94651fbcb8bea8$2075 )

lagpattern = re.compile(r"Waiting for [\d.]+: (?P<lag>\d+) seconds? lagged")
....

if code == "maxlag":
    lag = lagpattern.search(info)
    if lag:
        pywikibot.log(
            u"Pausing due to database lag: " + info)
        self.site.throttle.lag(int(lag.group("lag")))
        continue

https://phabricator.wikimedia.org/diffusion/MW/browse/master/includes/api/ApiMain.php;cfa04ceb2b8878ab237d8dd2bdbd578972bce80e$1122

Parsing a string out of the api output is just horrible, but it seems to be the only way to retrieve the maxlag value. The lag used to be reported as an integer number of seconds, but is now a float, so the regex no longer matches. A simple fix is to update the regex to r"Waiting for [\d.]+: (?P<lag>\d+(\.\d+)?) seconds? lagged"
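The updated regex can be exercised directly. The sample messages below are illustrative only (the `[\d.]+` part matches whatever lagged server the API names), and note that the caller would also need `float(...)` rather than `int(...)` on the captured group:

```python
import re

# Updated pattern: the captured lag may now be a float, not just an integer.
lagpattern = re.compile(
    r"Waiting for [\d.]+: (?P<lag>\d+(\.\d+)?) seconds? lagged")

# An old-style integer message still matches:
m = lagpattern.search("Waiting for 10.64.16.28: 5 seconds lagged")
print(m.group("lag"))  # prints: 5

# A float message, which the old regex rejected, now matches too:
m = lagpattern.search("Waiting for 10.64.16.28: 3.2 seconds lagged")
print(m.group("lag"))  # prints: 3.2
print(float(m.group("lag")))  # prints: 3.2
```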

Bigger question is if we want to know the maximum lag and if we want to parse strings. I see several options here:

  • Just patch it up (short term solution, should do this whatever longer term we choose)
  • Stop using the lag info for backing up and instead just use the standard incremental backing off
  • Update the mediawiki api to include the lag in the output in a structured format so we don't have to do evil string parsing
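As a sketch of the second option, standard incremental (exponential) backing off could look roughly like this; the zero-argument `request` callable and the retry/cap parameters are invented for illustration:

```python
import random
import time

def incremental_backoff(request, max_retries=5, base=1.0, cap=120.0):
    """Retry a zero-argument callable with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return request()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: propagate the last error
            # Double the delay on each attempt, capped, plus some jitter
            # so many bots don't retry in lock-step.
            delay = min(cap, base * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay / 2))
```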
Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. · Aug 26 2016, 4:35 PM

I think using the maxlag data is fundamentally incorrect: having N seconds lag does not imply that the lag will be resolved in N seconds: if there are no edits, the replag will likely be resolved much quicker, and if there is massive congestion, it could take much longer.

From the api docs: https://www.mediawiki.org/wiki/Manual:Maxlag_parameter

The following HTTP headers are set:
Retry-After: a recommended minimum number of seconds that the client should wait before retrying
X-Database-Lag: The number of seconds of lag of the most lagged slave

So we should just follow the Retry-After header instead.
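Honouring the header could be as simple as the sketch below. The helper name and the fallback default are assumptions; also note that per HTTP, Retry-After may alternatively carry an HTTP-date, which this sketch ignores:

```python
def retry_delay(headers, default=5.0):
    """Return seconds to wait before retrying, from response headers.

    Falls back to `default` when the header is missing or is not a
    plain number of seconds.
    """
    value = headers.get("Retry-After") or headers.get("retry-after")
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

print(retry_delay({"Retry-After": "12"}))  # prints: 12.0
print(retry_delay({}))                     # prints: 5.0
```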

I agree with @valhallasw. On the manual page I also noticed the nice link https://www.mediawiki.org/w/api.php?action=query&titles=MediaWiki&format=json&maxlag=-1 to trigger a maxlag warning. We should probably add a test for this.

Xqt added a comment.Aug 27 2016, 5:07 PM

I don't think that it is fundamentally incorrect. The maxlag parameter is a good enough prediction of what the next replag will be. I guess it is a better strategy to use it than running into refused requests over and over again, or (another strategy we used) doubling the throttle after each blocked request.

Anyway, I would propose another strategy: let all requests pass, but fetch the replag and wait that long before the _next_ request.
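That proposal could be sketched as follows (the class and method names are invented for illustration): each response's reported lag is remembered, and the wait is paid at the start of the next request instead of retrying the current one:

```python
import time

class LagAwareThrottle:
    """Let every request through; charge the reported replag to the
    *next* request rather than retrying the current one."""

    def __init__(self):
        self._wait_until = 0.0

    def before_request(self):
        # Sleep only if an earlier response told us to back off.
        delay = self._wait_until - time.monotonic()
        if delay > 0:
            time.sleep(delay)

    def after_response(self, lag_seconds):
        # Remember the lag reported by the API (0 or None means no lag).
        if lag_seconds:
            self._wait_until = time.monotonic() + lag_seconds
```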

Xqt triaged this task as High priority.Aug 27 2016, 5:07 PM

Change 307117 had a related patch set uploaded (by Xqt):
[bugfix] Update lagpattern

https://gerrit.wikimedia.org/r/307117

Change 307118 had a related patch set uploaded (by Xqt):
[bugfix] Update lagpattern

https://gerrit.wikimedia.org/r/307118

Xqt claimed this task.Aug 28 2016, 10:05 AM

Change 307119 had a related patch set uploaded (by Xqt):
[bugfix] Update lagpattern

https://gerrit.wikimedia.org/r/307119

Change 307120 had a related patch set uploaded (by Xqt):
[bugfix] Update lagpattern

https://gerrit.wikimedia.org/r/307120

Change 307117 abandoned by Xqt:
[bugfix] Update lagpattern

https://gerrit.wikimedia.org/r/307117

Change 307119 merged by jenkins-bot:
[bugfix] Update lagpattern

https://gerrit.wikimedia.org/r/307119

Change 307120 merged by jenkins-bot:
[bugfix] Update lagpattern

https://gerrit.wikimedia.org/r/307120

Change 307118 merged by jenkins-bot:
[bugfix] Update lagpattern

https://gerrit.wikimedia.org/r/307118

Change 307121 had a related patch set uploaded (by Xqt):
[bugfix] Add test for lagpattern

https://gerrit.wikimedia.org/r/307121

With r307119 and r307120 merged, this should be resolved for now, but the 'how long should we wait' issue remains.

I don't think that it is fundamentally incorrect. The maxlag parameter is a good enough prediction of what the next replag will be.

It's a good prediction for the next replag, but not for when the replag will be small enough again. Besides, the wiki sends a 'Retry-After' header, which is exactly what we need (but I'm not sure how easily accessible it is in our code).

Xqt added a comment.Aug 28 2016, 12:36 PM
This comment was removed by Xqt.

Change 307122 had a related patch set uploaded (by Merlijn van Deen):
Add maxlag regexp tests

https://gerrit.wikimedia.org/r/307122

Change 307121 merged by jenkins-bot:
[bugfix] Add test for lagpattern

https://gerrit.wikimedia.org/r/307121

Change 307122 merged by jenkins-bot:
Add maxlag regexp tests

https://gerrit.wikimedia.org/r/307122

Xqt lowered the priority of this task from High to Low.Aug 28 2016, 4:04 PM

Change 307127 had a related patch set uploaded (by Xqt):
[IMPR] use retry-after value

https://gerrit.wikimedia.org/r/307127

Xqt added a comment.Jul 14 2017, 10:13 AM

Didn't find the trick for the tests.

Change 307127 merged by jenkins-bot:
[pywikibot/core@master] [IMPR] use retry-after value

https://gerrit.wikimedia.org/r/307127

Xqt closed this task as Resolved.Apr 24 2018, 8:42 PM