Repeated reports of wikidatawiki (s5) API going read only
Closed, ResolvedPublic

Description

One of them was reported to me at 15:30 (UTC) by @Multichill.

I wasn't able to find anything meaningful in any of the logs, not even a record of writes failing because the DB went read-only.

Event Timeline

hoo raised the priority of this task from to Needs Triage.
hoo updated the task description. (Show Details)
hoo added projects: Wikidata, DBA.
hoo added subscribers: hoo, jcrespo, Multichill and 4 others.

The output:

  File "C:\pywikibot\coredev\pywikibot\data\api.py", line 2172, in submit
    raise APIError(**result['error'])
pywikibot.data.api.APIError: readonly: The wiki is currently in read-only mode [readonlyreason:Waiting for 5 lagged database(s); help:See https://www.wikidata.org/w/api.php for API usage]<class 'pywikibot.data.api.APIError'>
CRITICAL: Closing network session.

Timestamp: 16:29 or 16:30 17 January 2016 CET.

At the same time I was using WIDAR and that stalled too. That was actually the trigger to look at my bot to see if it was still editing.

Almost forgot that I have logging enabled:

2016-01-17 16:30:13             api.py, 2077 in             submit: WARNING  API error readonly: The wiki is currently in read-only mode
2016-01-17 16:30:13             api.py, 2168 in             submit: VERBOSE  API Error: query=
u'{u\'maxlag\': [\'5\'], u\'format\': [u\'json\'], \'baserevid\': [292839452], u\'bot\': [1], u\'value\': [u\'{"entity-type": "item", "numeric-id": 2365880}\'], \'entity\': [u\'Q22087659\'], u\'assert\': [u\'user\'], u\'token\': [u\'token+\\\\\'], \'snaktype\': [u\'value\'], \'action\': [u\'wbcreateclaim\'], \'property\': [u\'P195\']}'
2016-01-17 16:30:13             api.py, 2170 in             submit: VERBOSE             response=
{u'servedby': u'mw1224', u'error': {u'info': u'The wiki is currently in read-only mode', u'readonlyreason': u'Waiting for 5 lagged database(s)', u'code': u'readonly', u'help': u'See https://www.wikidata.org/w/api.php for API usage'}}
2016-01-17 16:30:13        throttle.py,  140 in  checkMultiplicity: VERBOSE  Found 1 wikipedia:nl processes running, including this one.
2016-01-17 16:30:13        __init__.py,  732 in             stopme: VERBOSE  Dropped throttle(s).
2016-01-17 16:30:13            http.py,   92 in             _flush: CRITICAL Closing network session.
2016-01-17 16:30:13            http.py,   96 in             _flush: VERBOSE  Network session closed.
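
For what it's worth, here is a minimal sketch of how a bot could handle this error instead of aborting: catch the "readonly" APIError and retry after a delay. The wrapper name and the delays are illustrative and my own; only the pywikibot APIError class shown in the traceback above is assumed.

import time

import pywikibot
from pywikibot.data.api import APIError


def retry_on_readonly(write_call, max_attempts=5, base_delay=60):
    """Run write_call(), retrying while the API reports read-only mode."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write_call()
        except APIError as error:
            if error.code != 'readonly':
                raise  # some other API error, not the one discussed here
            delay = base_delay * attempt
            pywikibot.output('API read only (%s), sleeping %d s' % (error.info, delay))
            time.sleep(delay)
    raise RuntimeError('API still read only after %d attempts' % max_attempts)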

I've just uploaded a patch to log readonlyreason errors: https://gerrit.wikimedia.org/r/264595

In the last 2 days there were 251 hits of

DELETE /* MessageGroupStats::clear 127.0.0.1 */ 
FROM `translate_groupstats` 
WHERE tgs_group = 'page-Wikidata:List of properties/Summary table' 
AND tgs_lang = 'en' 
/* e567ea5f86b3e2b07cd283d8aa9270a9 db1058 wikidatawiki 11s */ /* localhost */

and 115 of:

REPLACE /* OAIHook::updateSave 127.0.0.1 */ 
INTO `updates` (up_page, up_action, up_timestamp, up_sequence) 
VALUES ('16941457', 'modify', '20160117141656', NULL)
/* 4d6179cc9face20ce4fa70babf729cc6 db1058 wikidatawiki 11s */ /* localhost */

Both within small periods of time.

EDIT: I initially said that the first one looked like the better candidate; I now think I do not have enough data to say that, and both could even be responsible at the same time.

I saw these as well (in tendril), but couldn't see any correlation (time-wise); also, they only appeared for db1058 (master) there. Given we have statement-based replication, these should show up on the slaves as well if they run for longer periods of time, shouldn't they?

Actually, now that I look again, both could be responsible, as both are reported around 15:30 UTC.

Given we have statement-based replication

Not 100% true everywhere anymore. See our puppet config.

these should show up on the slaves as well if they run for longer periods of time, shouldn't they?

No, I do not think our monitoring system would catch system-user queries on slaves even if we used STATEMENT replication exclusively, and even if it did, that would not rule out problems with them, because a query only needs to stay under 10 seconds to go unnoticed while still creating pileups due to single-threaded replication. The fact that this created lag on all 5 of our slaves (usually) means that the problem is at the centralized distribution point of transactions (the master), unless it is a progressive failover where the servers are depooled one by one.
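
To illustrate the kind of lag check being talked about here (a sketch only; it assumes pymysql and direct read access to the replicas, which ordinary users and bots do not have), one could poll Seconds_Behind_Master on each slave:

import pymysql

# Placeholder host names; only db1045 is mentioned elsewhere in this task.
REPLICAS = ['db1045', 'db1049', 'db1070']


def replica_lag(host):
    """Return Seconds_Behind_Master for one replica (None if it is not replicating)."""
    conn = pymysql.connect(host=host, user='watchdog', password='...',
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute('SHOW SLAVE STATUS')
            row = cur.fetchone()
            return row['Seconds_Behind_Master'] if row else None
    finally:
        conn.close()


for host in REPLICAS:
    print(host, replica_lag(host))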

On a related, but probably off-topic, note: you may want to have a look at this page: https://www.wikidata.org/wiki/Wikidata:List_of_properties/Summary_table

That table should really get archived… rendering it has been failing forever anyway.

The queries you list above seem to have happened after the lag spike, but the following two happened immediately before it:

db1058	wikiuser	wikidatawiki	mw1206	-	15236731133	73B94B9B0	4s	2016-01-17 15:29:52
SELECT /* WikiPage::lockAndGetLatest … */ page_latest FROM `page` WHERE page_id = '23582281' AND page_namespace = '4' AND page_title = 'Database_reports/birthday_today' LIMIT 1 FOR UPDATE
db1058	wikiuser	wikidatawiki	mw1013	localhost	15236732019	73B9496E6	4s	2016-01-17 15:29:42
DELETE /* LinksUpdate::incrTableUpdate 127.0.0.1 */ FROM `pagelinks` WHERE pl_from = '23582281' AND ((pl_namespace = '0' AND pl_title IN ('Algeria', 'Amanda_de_Cadenet', 'Andorra', 'Archivist', 'Australia', 'Author', 'Ballet_dancer', 'Bandleader', 'Belize', 'Bolivia', 'Byelorussian_Soviet_Socialist_Republic', 'Camera_operator', 'Chinese_Academy_of_Social_Sciences', 'Civil_engineer', 'Coach_(basketball)', 'Coach_(sport)', 'Conductor_(music)', 'Councillor', 'Democratic_Republic_of_the_Congo', 'Diplomat', 'Dominican_Republic', 'Ecuador', 'Equatorial_Guinea', 'Estonia', 'Estonian_Academy_of_Sciences', 'Ethiopia', 'Film_director', 'Go_player', 'Greenland', 'Heimatforscher', 'Henry_Hill', 'Illustrator', 'Iran', 'Jo_Peters_Po

No idea why I wasn't able to find these yesterday (but I found the ones you listed above) :S

I just got reports about this again (supposedly shortly after 2016-01-23 15:34:43 CET), but I couldn't find anything regarding that in tendril.

I suppose it is possible; there is a spike of 26 seconds of lag on db1045, but probably only for a few seconds. But it is not s5 going read only; it is the API going read only because of "too much lag" on db1045. It is a protection measure. It means it is working as intended. We can minimize this happening (we mentioned having a second API server, when we get the hardware) and avoid lag by fixing MediaWiki's queries, but this is not an error that should never happen: it is meant to force users to retry and not saturate the servers.

The API only goes read only if more than half of the servers are lagged by more than 5s. That really should not happen unless there actually are too many writes.
I agree that it's OK for that to happen sometimes, but it shouldn't happen often.

To actually know how often this happens, I would like to get https://gerrit.wikimedia.org/r/264595 merged and deployed, please review.

Maybe we should increase APIMaxLagThreshold to 7s or even 10s? In order to be able to make such choices, we might also want to log the lag times when the API goes read only.
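
To make the behaviour described just above concrete, here is a small sketch of the decision logic; this is my own illustration, not the actual MediaWiki implementation, and the threshold, host names and lag values are examples only:

API_MAX_LAG_THRESHOLD = 5  # seconds; the comment above proposes raising it to 7 or 10


def api_should_go_readonly(replica_lag):
    """Return True if API writes should be refused, given per-replica lag in seconds."""
    lagged = [host for host, lag in replica_lag.items()
              if lag > API_MAX_LAG_THRESHOLD]
    # As described above: read-only mode is entered only when more than half
    # of the replicas are lagged beyond the threshold.
    return len(lagged) > len(replica_lag) / 2


# Only db1045 is lagged (the 26 s spike mentioned earlier), so writes continue.
print(api_should_go_readonly(
    {'db1045': 26.0, 'dbX1': 0.4, 'dbX2': 0.7, 'dbX3': 0.2, 'dbX4': 0.3}))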

hoo renamed this task from Repeated reports of wikidatawiki (s5) going read only to Repeated reports of wikidatawiki (s5) API going read only. Jan 25 2016, 9:53 AM
hoo set Security to None.

I do not own such changes; you should discuss them with Aaron, who I think is trying to lower all those timeouts. Or better, with the Wikidata developers, to avoid long queries and to monitor lag, as reported several times by me: T111769.

@Multichill: Is this relevant still?

We could also start listening to 'api-readonly' logs, to find out how often this happens.

I still get read-only mode when editing via the API several times per week.

I did this edit, which might help in the short term. But there should definitely be some limits on how people behave on wikis. I'll check the DB queries and might be able to do something.

@Multichill: Have you had such issues recently? My bot stopped getting them.

In the last 10 minutes:

pywikibot.data.api.APIError: readonly: The database has been automatically locked while the slave database servers catch up to the master [readonlyreason:Waiting for 7 lagged database(s); help:See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.]
<class 'pywikibot.data.api.APIError'>
CRITICAL: Closing network session.


pywikibot.data.api.APIError: readonly: The database has been automatically locked while the slave database servers catch up to the master [readonlyreason:Waiting for 5 lagged database(s); help:See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.]
<class 'pywikibot.data.api.APIError'>
CRITICAL: Closing network session.

This is most likely due to T148609#3183908, as we had issues with most of the servers there, which were having spikes in threads, load, lag, etc.

Just had it again:

pywikibot.data.api.APIError: readonly: The database has been automatically locked while the slave database servers catch up to the master [readonlyreason:Waiting for 3 lagged database(s); help:See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.]
<class 'pywikibot.data.api.APIError'>

For how long did this happen? Was it just a small bump?
I have checked the s5 slaves around the time this ticket was created and I have not found any significant delay, just small 1-second peaks (which might not even be real, but only graphing issues). Would that be enough to cause this error?

There is also nothing on the general shard lag around that time: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=6&fullscreen&orgId=1&from=now-24h&to=now&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=x1&var-role=All

The shard and datacenter are wrong in the Grafana link. I think the correct one is: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=6&fullscreen&orgId=1&from=now-24h&to=now&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=s5&var-role=All

To my shock, it has one spike of two minutes of lag:

[Attached screenshot: pasted_file, 79 KB; lag graph showing the two-minute spike]

Thanks! Too used to eqiad :-)
I will check that two-minute lag (which, however, is not at the time when the comment was added).

Looks like db2059 (api slave) had some peaks of lag around that time, which matches some slow queries too. Could be related to all the stuff being discussed here: T163495

Thanks. This is a sample of slow queries in tendril which might be useful. I don't see any slow write queries causing replication lag; these are just possible database query improvements, probably not related to this bug:

  • SELECT /* ApiQueryPagePropNames::execute */ DISTINCT pp_propname FROM page_props ORDER BY pp_propname LIMIT 501 /* eca54e37324ba61f23cd87946fe72863 db2059 wikidatawiki 29s */ (We don't have an index on pp_propname; we probably should. See the sketch after this list.)
  • SELECT /* Wikibase\TermSqlIndex::fetchTerms */ term_entity_type, term_type, term_language, term_text, term_entity_id FROM wb_terms WHERE term_entity_type = 'property' AND term_entity_id IN ('243', '1036', '1054', '675', '957', '212', '921', '345', '1899', '625', '106', '1344', '3417', '279', '1343', '1889', '3832', '3797', '3788', '699', '585', '495', '361', '703', '1315', '1711', '244', '366', '734', '2283', '2079', '176', '571', '1128', '69', '136', '1659', '405', '2578', '248') AND term_language = 'en' AND term_type IN ('label', 'description') /* 00b21093309916582c83809bac85cbcd db2045 wikidatawiki 293s */ (This is strange: we have an index on term_language, term_entity_id.)
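
As referenced in the first item above, here is a minimal sketch (assuming pymysql and read access to a replica, both of which are my assumptions) of checking whether that query can use an index; EXPLAIN shows a NULL key and a filesort when it cannot:

import pymysql

conn = pymysql.connect(host='db-replica.example', user='ro_user', password='...',
                       db='wikidatawiki', cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute('EXPLAIN SELECT DISTINCT pp_propname FROM page_props '
                'ORDER BY pp_propname LIMIT 501')
    for row in cur.fetchall():
        # 'key' stays NULL and 'Extra' mentions a filesort if no usable index exists.
        print(row['key'], row['Extra'])
conn.close()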
jcrespo unsubscribed.

That is the max lag, and it is normal on the slaves that are not waited on by MediaWiki. This issue has nothing to do with the databases; MediaWiki does what it is programmed to do: if it detects lag, even just a few seconds of it, it disables the API (by design). If you do not like that functionality, a specific change should be directed to the maintainers of MediaWiki's persistence layer; if you do not like your bot erroring out or crashing, complain to the API client application (the response clearly tells you to retry in X seconds). The databases themselves have no issues; individual servers lag from time to time, and that is the way it is programmed (which to me makes sense, but I have no say on that).

Note I am not saying this bug is invalid; I am saying there is nothing broken in the databases or their hardware. For example, there is a vslow slave on each shard that lags all the time, and that is OK because it is not used for main queries (it has weight 0). As you can see here, only 1 server is slower, and normally by 1 second ( https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=5&fullscreen&orgId=1&from=1492933452444&to=1493019852444&var-dc=codfw%20prometheus%2Fops ); how the application responds to that is not a database issue, as that is a normal state.

Select queries do not cause lag unless there is something so impactful that it blocks all available resources (InnoDB reads never block writes); only writes do (or blocking queries). So while the queries mentioned may be bad, lag is caused by writes: either bots writing too many rows at a time, or large write transactions. MediaWiki's current response is to put things in read-only mode when a lot of load is detected; if that is undesired (or maybe can be tuned better), direct your complaints to MediaWiki-libs-Rdbms (not sure if that is handled by performance, platform, or whom?), not DBA (I think queries should only be blocked if the majority of slaves are lagged, not when only one is lagged). Replication is asynchronous by design, and lag will always exist, unless you want us to slow down writes and write synchronously, reducing the throughput to 1/1000 of the current one (and creating even more errors).

I can tell you what I think happens: wikidata is highly populated by bots and they create lots of traffic, and I think that often causes stress on s5, higher than normal. A better approach than stopping serving all requests would be to rate-limit the bad users, but I think there is no such technology yet (and it is not easy to implement). I have also seen that some job-related writes are sometimes too intensive (see T163544). What I can tell you is what we as DBAs will be doing to minimize the impact on wikidata: we are going to give it dedicated hardware on its own separate shard and call it s8. Will that solve these issues? The answer is probably no; I think this is not a bug, and I would support making it invalid. If your client detects an error, it should retry once after 1 minute (or the time you are told to wait). The only thing I would support is making the detection less strict (for example, I think the lag detection code is too strict, complaining when the lag is >1 second even though the measurement error is itself 1+ seconds, and I have already transmitted my opinion on that) and its effects milder (like allowing more than X seconds of lag, or not blocking certain actions in read-only mode), but the general logic I think works as intended.
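
On the client side, bots can already rate-limit themselves. A minimal pywikibot user-config.py sketch (the account name and the values below are examples only, not a recommendation):

# user-config.py for pywikibot; the values below are illustrative only.
family = 'wikidata'
mylang = 'wikidata'
usernames['wikidata']['wikidata'] = 'ExampleBot'  # hypothetical bot account
put_throttle = 30  # wait at least 30 seconds between write requests
maxlag = 5         # back off automatically while reported replica lag exceeds 5 s

This does not replace server-side rate limiting, but it reduces the write pressure any single bot contributes.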

I agree with @jcrespo. This phab card is becoming a place for general database issues of Wikidata, which are valid but should, I think, be addressed in another way. Overall, the load (especially the write load) needs to be reduced. That can happen via several tasks that we are doing (moving wb_terms to elastic, killing the entity_per_page table, using redis as a lock manager instead of sql, etc.). This might solve it or it might not, but we will keep having database locks (maybe less frequently, but they will happen) and, given the size of the project, that is inevitable.

Thanks @jcrespo for your response. The task was filed because we noticed the database going read-only every once in a while, and because T100123 hasn't been implemented yet, this has impact. Something might have been wrong. I'll just sum up your response as: works as designed.

That leaves us without any actions in this task, so I think it's best to just close this one. Do you agree, @hoo?

Lydia_Pintscher claimed this task.

I think so too :)

Change 349867 abandoned by Ladsgroup:
[WIP] [Code experiment] Try to select all columns from wb_terms

https://gerrit.wikimedia.org/r/349867

This patch was not related to this task and had an incorrect bug number.

Change 354246 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/mediawiki-config@master] Log "api-readonly" errors

https://gerrit.wikimedia.org/r/354246

Change 354246 merged by jenkins-bot:
[operations/mediawiki-config@master] Log "api-readonly" errors

https://gerrit.wikimedia.org/r/354246

Stashbot subscribed.

Mentioned in SAL (#wikimedia-operations) [2017-05-31T13:19:58Z] <hoo@tin> Synchronized wmf-config/InitialiseSettings.php: Log "api-readonly" errors (T164191, T123867) (duration: 00m 43s)