One of them reported to me at 15:30 (UTC), by @Multichill.
I wasn't able to find anything meaningful in any of the logs, not even a record for writes failing because the DB went read only.
hoo | Jan 17 2016, 4:06 PM
F7734856: pasted_file (attachment) | Apr 24 2017, 6:49 AM
| Status | Assigned | Task |
|---|---|---|
| Declined | None | T3268 Database replication lag issues (tracking) |
| Duplicate | None | T108551 Database locked error while publishing article using CX |
| Resolved | aaron | T95501 Fix causes of replica lag and get it to under 5 seconds at peak |
| Resolved | Lydia_Pintscher | T123867 Repeated reports of wikidatawiki (s5) API going read only |
The output:
File "C:\pywikibot\coredev\pywikibot\data\api.py", line 2172, in submit
    raise APIError(**result['error'])
pywikibot.data.api.APIError: readonly: The wiki is currently in read-only mode [readonlyreason:Waiting for 5 lagged database(s); help:See https://www.wikidata.org/w/api.php for API usage]
<class 'pywikibot.data.api.APIError'>
CRITICAL: Closing network session.
Timestamp: 16:29 or 16:30 17 January 2016 CET.
At the same time I was using WIDAR and that stalled too. That was actually the trigger to look at my bot to see if it was still editing.
Almost forgot that I have logging enabled:
2016-01-17 16:30:13 api.py, 2077 in submit: WARNING API error readonly: The wiki is currently in read-only mode
2016-01-17 16:30:13 api.py, 2168 in submit: VERBOSE API Error: query= u'{u\'maxlag\': [\'5\'], u\'format\': [u\'json\'], \'baserevid\': [292839452], u\'bot\': [1], u\'value\': [u\'{"entity-type": "item", "numeric-id": 2365880}\'], \'entity\': [u\'Q22087659\'], u\'assert\': [u\'user\'], u\'token\': [u\'token+\\\\\'], \'snaktype\': [u\'value\'], \'action\': [u\'wbcreateclaim\'], \'property\': [u\'P195\']}'
2016-01-17 16:30:13 api.py, 2170 in submit: VERBOSE response= {u'servedby': u'mw1224', u'error': {u'info': u'The wiki is currently in read-only mode', u'readonlyreason': u'Waiting for 5 lagged database(s)', u'code': u'readonly', u'help': u'See https://www.wikidata.org/w/api.php for API usage'}}
2016-01-17 16:30:13 throttle.py, 140 in checkMultiplicity: VERBOSE Found 1 wikipedia:nl processes running, including this one.
2016-01-17 16:30:13 __init__.py, 732 in stopme: VERBOSE Dropped throttle(s).
2016-01-17 16:30:13 http.py, 92 in _flush: CRITICAL Closing network session.
2016-01-17 16:30:13 http.py, 96 in _flush: VERBOSE Network session closed.
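The crash above happens because the readonly APIError propagates up and the bot shuts down. A client could instead catch that specific error code and retry after a delay. A minimal sketch; `APIError` here is a hypothetical stand-in for `pywikibot.data.api.APIError`, and `submit_with_retry` is not part of pywikibot:

```python
import time


class APIError(Exception):
    """Simplified stand-in for pywikibot.data.api.APIError."""
    def __init__(self, code, info, **kwargs):
        self.code = code
        self.info = info
        super().__init__('%s: %s' % (code, info))


def submit_with_retry(submit, max_retries=3, delay=60, sleep=time.sleep):
    """Call submit(), retrying when the wiki reports it is read-only."""
    last = None
    for _ in range(max_retries):
        try:
            return submit()
        except APIError as e:
            if e.code != 'readonly':
                raise        # unrelated API error: propagate immediately
            last = e
            sleep(delay)     # give the lagged replicas time to catch up
    raise last               # still read-only after all retries
```

The delay of 60 seconds is an arbitrary choice; the server's response actually suggests how long to wait, so a real client would prefer that value when available.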
I've just uploaded a patch to log readonlyreason errors: https://gerrit.wikimedia.org/r/264595
Over the last 2 days there were 251 hits of:
DELETE /* MessageGroupStats::clear 127.0.0.1 */ FROM `translate_groupstats` WHERE tgs_group = 'page-Wikidata:List of properties/Summary table' AND tgs_lang = 'en' /* e567ea5f86b3e2b07cd283d8aa9270a9 db1058 wikidatawiki 11s */ /* localhost */
and 115 of:
REPLACE /* OAIHook::updateSave 127.0.0.1 */ INTO `updates` (up_page, up_action, up_timestamp, up_sequence) VALUES ('16941457', 'modify', '20160117141656', NULL) /* 4d6179cc9face20ce4fa70babf729cc6 db1058 wikidatawiki 11s */ /* localhost */
Both within small periods of time.
EDIT: I initially said the first one looked like the better candidate; I now think I do not have enough data to say that, and both could even be responsible at the same time.
I saw these as well (in tendril), but couldn't see any time-wise correlation; they also only appeared for db1058 (the master) there. Given we have statement-based replication, these should show up on the slaves as well if they run for longer periods of time, shouldn't they?
Actually, now that I see, both could be responsible, as both are reported around 15:30 UTC.
Given we have statement based replication
Not 100% true everywhere anymore. See our puppet config.
these should show up for the slaves as well, in case they run for longer periods of time, shouldn't they?
No, I do not think our monitoring system would catch system-user queries on slaves even if we used STATEMENT replication exclusively. And even if it did, that would not rule out problems with them: a query only needs to take less than 10 seconds to go unnoticed, yet it can still create pileups due to single-threaded replication. The fact that this created lag on all 5 of our slaves (usually) means the problem is at the centralized distribution point of transactions (the master), unless it is a progressive failover where the servers are depooled one by one.
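The pileup point can be illustrated with some back-of-the-envelope arithmetic (the durations are made-up numbers, chosen only to show the effect): statements that run concurrently on the master are replayed one after another by a replica's single SQL thread, so none of them needs to be individually slow for lag to accumulate.

```python
# Hypothetical write statements that ran concurrently on the master,
# each individually under the 10s radar of slow-query monitoring.
statement_durations = [8, 9, 7, 8]  # seconds (illustrative values)

# On the master they overlap, so wall-clock time is roughly the longest one.
master_wallclock = max(statement_durations)    # ~9s

# A single-threaded replica replays them serially.
replica_wallclock = sum(statement_durations)   # 32s

lag_buildup = replica_wallclock - master_wallclock
print(lag_buildup)  # 23 seconds of lag from statements that never alarmed
```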
Related, but probably off-topic: you may want to have a look at this page: https://www.wikidata.org/wiki/Wikidata:List_of_properties/Summary_table
That table should really get archived… rendering it has been failing forever anyway.
The queries you list above seem to have happened after the lag spike, but the following two happened immediately before it:
db1058 wikiuser wikidatawiki mw1206 - 15236731133 73B94B9B0 4s 2016-01-17 15:29:52 SELECT /* WikiPage::lockAndGetLatest … */ page_latest FROM `page` WHERE page_id = '23582281' AND page_namespace = '4' AND page_title = 'Database_reports/birthday_today' LIMIT 1 FOR UPDATE
db1058 wikiuser wikidatawiki mw1013 localhost 15236732019 73B9496E6 4s 2016-01-17 15:29:42 DELETE /* LinksUpdate::incrTableUpdate 127.0.0.1 */ FROM `pagelinks` WHERE pl_from = '23582281' AND ((pl_namespace = '0' AND pl_title IN ('Algeria', 'Amanda_de_Cadenet', 'Andorra', 'Archivist', 'Australia', 'Author', 'Ballet_dancer', 'Bandleader', 'Belize', 'Bolivia', 'Byelorussian_Soviet_Socialist_Republic', 'Camera_operator', 'Chinese_Academy_of_Social_Sciences', 'Civil_engineer', 'Coach_(basketball)', 'Coach_(sport)', 'Conductor_(music)', 'Councillor', 'Democratic_Republic_of_the_Congo', 'Diplomat', 'Dominican_Republic', 'Ecuador', 'Equatorial_Guinea', 'Estonia', 'Estonian_Academy_of_Sciences', 'Ethiopia', 'Film_director', 'Go_player', 'Greenland', 'Heimatforscher', 'Henry_Hill', 'Illustrator', 'Iran', 'Jo_Peters_Po
No idea why I wasn't able to find these yesterday (but I found the ones you listed above) :S
I just got reports about this again (supposedly shortly after 2016-01-23 15:34:43 CET), but I couldn't find anything regarding that in tendril.
I suppose it is possible; there is a 26-second lag spike on db1045, but probably only for a few seconds. But it is not s5 going read only, it is the API going read only because of "too much lag" on db1045. It is a protection measure, which means it is working as intended. We can minimize how often this happens (we mentioned adding a second API server when we get the hardware) and avoid lag by fixing MediaWiki's queries, but this is not an error that should never happen: it is meant to force clients to retry and not saturate the servers.
The API only goes read only if more than half of the servers are lagged for more than 5s. That really should not happen unless there actually are too many writes.
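That "more than half of the servers lagged beyond the threshold" condition can be sketched as a simple predicate. This is an illustrative simplification, not MediaWiki's actual implementation (which lives in its database load-balancer code); the function name and 5-second default are assumptions taken from the discussion above:

```python
def api_should_go_readonly(replica_lags, max_lag=5):
    """Return True when more than half of the replicas lag beyond max_lag.

    replica_lags: lag in seconds for each pooled replica.
    """
    lagged = sum(1 for lag in replica_lags if lag > max_lag)
    return lagged > len(replica_lags) / 2


print(api_should_go_readonly([1, 2, 6, 7, 8]))  # 3 of 5 lagged -> True
print(api_should_go_readonly([1, 2, 3, 7, 8]))  # 2 of 5 lagged -> False
```

Under this logic a single lagging vslow or dump slave never trips the switch on its own, which matches the point made later in the thread that individual servers lagging is a normal state.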
I agree that it's ok for that to happen sometimes, but it shouldn't happen often.
To actually know how often this happens, I would like to get https://gerrit.wikimedia.org/r/264595 merged and deployed, please review.
Maybe we should increase APIMaxLagThreshold to 7s or even 10s? In order to be able to make such choices, we might also want to log the lag times in case the API goes read only.
I do not own such changes; you should discuss them with Aaron, who I think is trying to lower all those timeouts. Or better, with the Wikidata developers, to avoid long queries and to monitor lag, as reported several times by me: T111769.
@Multichill: Is this relevant still?
We could also start listening to 'api-readonly' logs, to find out how often this happens.
I made this edit, which might help in the short term. But there should definitely be some limits on how people behave on the wikis. I'll check the db query and might be able to do something.
In the last 10 minutes:
pywikibot.data.api.APIError: readonly: The database has been automatically locked while the slave database servers catch up to the master [readonlyreason:Waiting for 7 lagged database(s); help:See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.]
<class 'pywikibot.data.api.APIError'>
CRITICAL: Closing network session.
pywikibot.data.api.APIError: readonly: The database has been automatically locked while the slave database servers catch up to the master [readonlyreason:Waiting for 5 lagged database(s); help:See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.]
<class 'pywikibot.data.api.APIError'>
CRITICAL: Closing network session.
This is most likely due to T148609#3183908, as we had issues with most of the servers there, which were having spikes in threads, load, lag, etc.
Just had it again:
pywikibot.data.api.APIError: readonly: The database has been automatically locked while the slave database servers catch up to the master [readonlyreason:Waiting for 3 lagged database(s); help:See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.]
<class 'pywikibot.data.api.APIError'>
For how long did this happen? Was it just a small bump?
I have checked the s5 slaves around the time this ticket was created and I have not found any significant delay, just small 1-second peaks (which might not even be real, but only graphing issues). Would that be enough to cause this error?
There is also nothing on the general shard lag around that time: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=6&fullscreen&orgId=1&from=now-24h&to=now&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=x1&var-role=All
The shard and datacenter are wrong in that grafana link. I think the correct one is: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=6&fullscreen&orgId=1&from=now-24h&to=now&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=s5&var-role=All
To my shock, it has one spike of two minutes lag:
Thanks! Too used to eqiad :-)
I will check that 2-minute lag (which, however, is not at the time the comment was added).
Looks like db2059 (an API slave) had some lag peaks around that time, which match some slow queries too. Could be related to all the stuff being discussed in T163495.
Thanks. This is a sample of slow queries in tendril which might be useful. Some slow queries, but I don't see any slow write queries causing replication lag.
Just improvements to the database queries, probably not related to this bug.
That is the max lag, and it is normal on the slaves that MediaWiki does not wait for. This issue has nothing to do with the databases; MediaWiki does what it is programmed to do: if it detects lag, even of a few seconds, it disables the API (by design). If you do not like that functionality, a specific change request should be directed to the maintainers of MediaWiki's persistence functionality. If you do not like your bot erroring out or crashing, complain to the API client application, as the response clearly tells you to retry in X seconds. The databases themselves have no issues; individual servers lag from time to time, and that is the way it is programmed (which to me makes sense, but I have no say on that).
Note I am not saying this bug is invalid; I am saying there is nothing broken in the databases or their hardware. For example, there is a vslow slave on each shard that lags all the time, and that is ok because it is not used for main queries (it has weight 0). As you can see here, only 1 server is slower, and normally by 1 second ( https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=5&fullscreen&orgId=1&from=1492933452444&to=1493019852444&var-dc=codfw%20prometheus%2Fops ); how the application responds to that is not a database issue, as that is a normal state.
Select queries do not cause lag unless something is so impacting that it blocks all available resources (InnoDB reads never block writes). Only writes do (or blocking queries), so while the queries mentioned may be bad, lag is caused by writes: either bots writing too many rows at a time, or large write transactions. MediaWiki's current response is to put things in read only when a lot of load is detected; if that is undesired (or could be tuned better), direct your complaints to MediaWiki-libs-Rdbms (not sure if that is handled by performance, platform, or whom?), not DBA. (I think queries should only be blocked when the majority of slaves are lagged, not when only one is.) Replication is asynchronous by design, and lag will always exist, unless you want us to slow down writes and write synchronously, reducing throughput to 1/1000 of the current one (and creating even more errors).
I can tell you what I think happens: Wikidata is heavily populated by bots, and they create lots of traffic, which I think often stresses s5 more than normal. A better approach than stopping serving all requests would be to rate-limit the bad users, but I think no such technology exists yet (and it is not easy to implement). I have also seen that some job-related writes are sometimes too intensive (see T163544). What I can tell you is what we as DBAs will be doing to minimize the impact on Wikidata: we are going to give it dedicated hardware on its own separate shard and call it s8. Will that solve these issues? Probably not. I think this is not a bug, and I would support making it invalid: if your client detects an error, it should retry once after 1 minute (or after the time it is told to wait). The only things I would support are making the detection less strict (for example, I think the lag-detection code is too strict, complaining when the lag is >1 second while the detection error itself is 1+ seconds, and I have already transmitted my opinion on that) and making its effects milder (like allowing more than X seconds of lag, or not blocking certain actions in read-only mode), but the general logic I think works as intended.
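The "retry after the time you are told to wait" advice can be made concrete: for maxlag and readonly errors the MediaWiki API indicates a wait time (via the `Retry-After` HTTP header for maxlag rejections), and clients should sleep for that long before retrying. A hedged sketch of the decision logic only; `backoff_seconds` is a hypothetical helper, not an existing API, and the 5-second fallback is an assumption:

```python
def backoff_seconds(error_code, retry_after_header):
    """Decide how long to sleep before retrying a MediaWiki API request.

    error_code: the 'code' field from the API error response, if any.
    retry_after_header: the Retry-After header value, or None if absent.
    Returns seconds to wait, or None when the error is not retryable.
    """
    # 'maxlag' and 'readonly' are the two lag-related codes seen in
    # this task; anything else is treated as a hard error.
    if error_code not in ('maxlag', 'readonly'):
        return None
    try:
        return int(retry_after_header)
    except (TypeError, ValueError):
        return 5  # assumed fallback when the server gives no hint
```

A caller would loop: issue the request, feed the error code and header into `backoff_seconds`, sleep if it returns a number, and give up after a bounded number of attempts.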
I agree with @jcrespo. This phab card is becoming a place for general database issues of Wikidata, which are valid but I think should be addressed in another way. Overall, the load (especially the write load) needs to be reduced. That can happen via several tasks we are working on (moving wb_terms to elastic, killing the entity_per_page table, using redis as lock manager instead of SQL, etc.). This might solve it or might not, but we will still have database locks (perhaps less frequently), and given the size of the project, that is inevitable.
Thanks @jcrespo for your response. The task was filed because we noticed the database going read-only every once in a while, and because T100123 hasn't been implemented yet, this has impact. Something might have been wrong. I'll just sum up your response as: works as designed.
That leaves us without any actions in this task so I think it's best to just close this one. Do you agree @hoo ?
Change 349867 abandoned by Ladsgroup:
[WIP] [Code experiment] Try to select all columns from wb_terms
Change 354246 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/mediawiki-config@master] Log "api-readonly" errors
Change 354246 merged by jenkins-bot:
[operations/mediawiki-config@master] Log "api-readonly" errors
Mentioned in SAL (#wikimedia-operations) [2017-05-31T13:19:58Z] <hoo@tin> Synchronized wmf-config/InitialiseSettings.php: Log "api-readonly" errors (T164191, T123867) (duration: 00m 43s)