Page MenuHomePhabricator

Tired of APIError: readonly
Closed, DuplicatePublic

Description

APIError: readonly: The database has been automatically locked while the slave database servers catch up to the master [readonlyreason:Waiting for 3 lagged database(s); help:See https://es.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.]

I am quite fed up with this. I'm doing a search and replace task using template.py, and I can't finish them because of continuously receiving this kind of messages that abort the grid job. Before you ask, I'm running the task via the job grid and I'm using a maxlag value of 3 (< 5, which is "nicer" & recommended, and a put_throttle waiting of 15 seconds, which is very respectful IMHO, which results in 4 edits per minute).

#!/bin/bash
cd /data/project/mabot/scripts
python template.py "Ficha de científico" -subst -cc -family:wikipedia -lang:es -summary:"Fusionando {{ficha de científico}} en {{ficha de persona}} por [[WP:BOT/S|solicitud]]" -always -namespace:0 -maxlag:3 -pt:15

The task should've been finished in a day or two, because it's just a couple thousand of templates that, if the job were to run without stop, it'd have been finished already.

Before you ask too, I've tried to setup a .bigbrotherrc watchdog, but apparently jstart does not work with this pywiki script:

Traceback (most recent call last):
  File "template.py", line 122, in <module>
    import pywikibot
ImportError: No module named pywikibot
[2017-30-04T15:14] /mnt/nfs/labstore-secondary-tools-project/mabot/.pywikibot/jobs/tarea.sh exited with code 1. Respawning...

Please help me sort this out. Either this is a Labs issue or an API issue or the eswiki db is very lagged. I don't really know, so I'd appreciate an investigation here and, oh, maybe if pywikibot could be used with jstart I could just have the bot restarted automatically.

Event Timeline

The error message means that the eswiki db is lagged, yes. I think the main problem that you're having is that whatever script you are running isn't retrying after the readonly error, so you have to manually restart the job.

Also somewhat relatedly, does maxlag=3 even do anything given:

# Activate read-only mode for bots when lag is getting high.
# This should be lower than 'max lag' in the LBFactory conf.
$wgAPIMaxLagThreshold = 3;

?

Thanks @Legoktm. I am not sure if there's a pywikibot option for the template.py script or others that forces the bot to retry when they trigger the readonly error. I didn't knew that maxlag thing you mention. Following mw:Maxlag, I decided that '3' was a nice value given that '5' was the standard one. I can try reduce it from 3 to 2 or 1 if that'd help, and also make the bot wait more time between edits, although I think 15 seconds between each is acceptable. Maybe we should have @jcrespo investigate if there's something that is lagging the eswiki DB so much, and I'd also appreciate if @Xqt or @jayvdb, which are very experienced python developers could weight in and see if an option for retrying a job when the readonly error is triggered is already there or could be developed. Regards.

See also T154011: Manage temporary readonly error for the pywikibot-side handling of the error.

apparently jstart does not work with this pywiki script

Pywikibot in general support being started with jstart (and some of my bots are configured to use jstart). In fact, there is little special about jstart besides that it puts the job into a 'continuous' queue and the script in a wrapper. I tried to investigate to see what went wrong with your tool, but:

Your tool dir is chmod 770:

$ ls -ld /data/project/mabot
drwxrwx--- 7 tools.mabot tools.mabot 4096 May  2 14:34 /data/project/mabot

...making it impossible for others to know what is going on with your tool.

Also, the complete command you use for calling jsub and jstart would be helpful.

@zhuyifei1999 Sorry but I'd like to keep my project closed to my eyes only and the admins if they ever need to see what's going on there. However you can see in the description which command I'm using for the task and to submit the job. I'll repeat it here nonetheless.

jsub command
jsub -once -N fichact -l release=trusty -mem 3G /data/project/mabot/.pywikibot/jobs/tarea.sh
pywikibot job
#!/bin/bash
cd /data/project/mabot/scripts
python template.py "Ficha de científico" -subst -cc -family:wikipedia -lang:es -summary:"Fusionando {{ficha de científico}} en {{ficha de persona}} por [[WP:BOT/S|solicitud]]" -always -namespace:0 -maxlag:1 -pt:15

Using jstart in lieu of jsub gave me:

Traceback (most recent call last):
  File "template.py", line 122, in <module>
    import pywikibot
ImportError: No module named pywikibot
[2017-30-04T15:14] /mnt/nfs/labstore-secondary-tools-project/mabot/.pywikibot/jobs/tarea.sh exited with code 1. Respawning...

The job I was performing with my bot is nearly finished so maybe this is not that important now.

Regards.

The necessary questions (which I could have investigate myself if the dir were at least 004) before knowing what went wrong with jstart:

  1. do you have a custom $PATH, $PYTHONPATH, or virtualenv?
  2. where is your install of pywikibot you are trying to load from? (i.e. where do you expect import pywikibot to import pywikibot from?)
  3. do you use a directory-based pywikibot install (user-config.py, apicache, etc in working dir /data/project/mabot/scripts) or user-based install (... in ~/.pywikibot)

And no... deliberately keeping your project closed is against labs ideology of open source and open data.

You can see here a summary of the lag of s7, which is where eswiki is hosted (it is fully public): https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=7&fullscreen&orgId=1&from=now-7d&to=now

With the exception of a small window on the 2017-04-30, where there was a known degradation of service (probably due to lots of requests sent at the same time by multiple users), I do not see huge spikes of lag, at most spikes of ~1 second of only one server at a time, which probably should not put things on read only, and is considered a normal occurence.

Either this is a Labs issue or an API issue or the eswiki db is very lagged.

My bet is in 2 places (but I am guessing right now)- it could be an issue with the client code, that doesn't handle spurious errors adequately (I am not familiar of how pywikibot works, but if it doesn't handle errors for you, maybe it is expecting you to handle those, so you need some wrapper to retry when it fails. Another option is that mediawiki could be extremely cautious and switch to read only too fast. The last problem is not easy to solve because being more lax and serving content that is more outdated could have worse consequences for both the client bots and the general health of the system. A solution on that regard may need very complex analysis of the causes, and we barely now start to have a good monitoring to even start to change to a better logic:

Having said that, I am not saying there is nothing to do at database level (we can fine-tune semi-sync replication to make writes slower and see the impact of ROW-based replication), but I think you may be looking for a practical solution to implement short term (those will not be fast at all), so I would suggest to put a wrapper to capture read-only errors, wait, and retry automatically. If you see T95501 performance has done an incredible great job to reduce lag causes and we are continuously working on making it better. My first assessment to your request:

( Maybe we should have @jcrespo investigate if there's something that is lagging the eswiki DB so much)

is that dbs are not lagged or in an anomalous state (except for the period already mentioned with bad performance), but obviously I could be looking at the wrong graphs. I think some people has expressed the will of rewriting some of the logic- see T95799, but that is a very complex issue and probably not what you need- what you need is to be able to do batch edits so what I would do is to get the list of queries that failed in a batch and retry them in a second pass (or get someone to help you with such change). If you execute something for long enough, it is very likely that you will find an error, and on the next try it will work- but that should not stop your processing.

The task is already finished with the template replaced so I'm done here. Thank you to everyone who have helped. Maybe T154011 is the solution for this, I don't know.

However and in Re your questions, it's been some time I followed this manual to setup my bot, so I'm using the shared pywikibot files, which have been working well. Maybe that answers them all.

I am not offering any tool created or coded by me ex novo but using the already public and shared pywikibot suite, so there's nothing to freely license which is not already freely licensed and accessible already. Other files I keep, such as user-config.py or personal how-to notes should certainly not be made public; and I plan to keep it that way. In tools I help to mantain, such as rTSTW, the code is indeed public and freely licensed so it can be forked, etc. That does not apply in this case, as so let's not wikilawyer over it. Thanks.

https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Developing#Using_the_shared_Pywikibot_files_.28recommended_setup.29 is a $PYTHONPATH based setup. AFAIK the griddoes not import environment variables from the bastion (T143691#2585880), so .bash_profile must have executed by a bash login shell. I will investigate this.

@MarcoAurelio while T164277 is not fixed, a workaround is to set the environment within the script you are running (in this case /mnt/nfs/labstore-secondary-tools-project/mabot/.pywikibot/jobs/tarea.sh)

@zhuyifei1999

Something like:

jstart -N jobname -l release=trusty -mem 3G /mnt/nfs/labstore-secondary-tools-project/mabot/.pywikibot/jobs/tarea.sh

?

I meant add export PYTHONPATH=/data/project/shared/pywikipedia/core:/data/project/shared/pywikipedia/core/externals/httplib2:/shared/pywikipedia/core/scripts to /mnt/nfs/labstore-secondary-tools-project/mabot/.pywikibot/jobs/tarea.sh (if you decide to stick to $PYTHONPATH method)

For your jstart problem, you can expressly set PWB path by using python /shared/pywikipedia/core/pwb.py yourscriptname.py as command. It's what I use for my bot.

T154011: Manage temporary readonly error will normally fix the error in PWB side.

Change 354246 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/mediawiki-config@master] Log "api-readonly" errors

https://gerrit.wikimedia.org/r/354246

Yup, I'm tired of it too, but should be handled by Pywikibot. MediaWiki seems to have introduced a new way of throwing a readonly error and Pywikibot doesn't handle that properly yet. https://gerrit.wikimedia.org/r/#/c/333033/ is in code review for that, but seems to be a bit stuck

Change 354246 merged by jenkins-bot:
[operations/mediawiki-config@master] Log "api-readonly" errors

https://gerrit.wikimedia.org/r/354246

Mentioned in SAL (#wikimedia-operations) [2017-05-31T13:19:58Z] <hoo@tin> Synchronized wmf-config/InitialiseSettings.php: Log "api-readonly" errors (T164191, T123867) (duration: 00m 43s)