
Long running query MessageCache::loadFromDB(en)-small on WMF "special" slaves
Closed, ResolvedPublicPRODUCTION ERROR

Description

db1051	1031386465	wikiuser	mw1193	enwiki	49s
SELECT /* MessageCache::loadFromDB(en)-small */ page_title,old_id,old_text,old_flags FROM `page`,`revision`,`text` WHERE page_is_redirect = '0' AND page_namespace = '8' AND (page_title NOT LIKE '%/%' ESCAPE '`' ) AND (page_latest=rev_id) AND (rev_text_id=old_id) AND (page_len <= 1024)

Reasons for suspicion:

  • The query allows very broad parameters: there is no LIMIT, and page_len <= 1024 does not guarantee that millions of rows won't be returned; if the namespace grows slightly bigger, it actually will.
  • It is run as wikiuser, not wikiadmin: it is the result of an end-user API call, not a job, maintenance script, or one-off query.
  • It doesn't use JOIN...ON syntax but hand-built SQL, which is not only bad code; it also suggests the query is not going through the regular SQL sanitization.

All of these raise the suspicion of a potential SQL injection or backdoor issuing this strange query. It could be a legitimate low-level query, but in that case: a) what is calling it with such bad parameters, and are those hardcoded? b) Even if no security issue is found, this is still a normal bug due to the odd syntax and absence of limits, which causes it to take >40 seconds to execute.

Event Timeline

There's no SQL injection or backdoor here. It is going through the usual sanitization.

The query comes from here specifically, and the $conds variable referenced there is set slightly earlier. I also note that namespace 8 (the MediaWiki namespace) is generally restricted to editing by wiki administrators; normal users can't edit there.

The use of comma-joins rather than JOIN...ON actually seems to be fairly common in MediaWiki. It's easier to pass the ON condition in $conds than to specify $join_conds for the same end result (unless there are query planning differences?). Old code might predate the addition of $join_conds in 2008 in rMW2705364478a1: Add a way to do different JOINs with $tables.
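
For illustration, here is a sketch of what the same query would look like with explicit JOIN...ON syntax; the join conditions simply move from the WHERE clause into ON clauses, so for these inner joins the result set is the same:

SELECT page_title, old_id, old_text, old_flags
FROM page
JOIN revision ON page_latest = rev_id
JOIN text ON rev_text_id = old_id
WHERE page_is_redirect = '0'
  AND page_namespace = '8'
  AND page_title NOT LIKE '%/%' ESCAPE '`'
  AND page_len <= 1024;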

The lack of quotes in (page_len <= 1024) is because only equalities have a calling method that auto-quotes values; inequalities have to be quoted manually, and here the code does intval( $wgMaxMsgCacheEntrySize ) rather than $dbr->addQuotes( $wgMaxMsgCacheEntrySize ).

The "1024" is a configurable variable. The query will always contain page_is_redirect = '0' and page_namespace = '8', those values are hard-coded. The LIKE bit might also be page_title LIKE '%/foo' for various "foo".

The lack of a limit appears to be intentional. Looking at the code, it's trying to copy all messages for a language (here the default language, i.e. no '/' in the title) shorter than $wgMaxMsgCacheEntrySize into cache (memcached or redis, I'm not sure which and it doesn't matter) for a day or until someone creates/edits/deletes a page in the MediaWiki namespace.


When I try it (via sql from terbium), that query takes far less than 49 seconds, touches nowhere near millions of rows, and returns only 1463 rows.

mysql:wikiadmin@db1051 [enwiki]> pager cat > /dev/null; SELECT /* MessageCache::loadFromDB(en)-small */ page_title,old_id,old_text,old_flags FROM `page`,`revision`,`text` WHERE page_is_redirect = '0' AND page_namespace = '8' AND (page_title NOT LIKE '%/%' ESCAPE '`' ) AND (page_latest=rev_id) AND (rev_text_id=old_id) AND (page_len <= 1024);
PAGER set to 'cat > /dev/null'
1463 rows in set (0.28 sec)

mysql:wikiadmin@db1051 [enwiki]> pager; SHOW STATUS like 'Hand%';
Default pager wasn't set, using stdout.
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 1738  |
| Handler_icp_match          | 1738  |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 38039 |
| Handler_read_last          | 0     |
| Handler_read_next          | 3201  |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 150   |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 148   |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+

Timing is similar on other replicas, although they touch many fewer rows (Handler_read_key=2927, Handler_read_next=1738) probably thanks to the lack of partitioning on revision. But maybe I was running into already-warm caches every time, on all the replicas I tried it on.

I'm ok with removing the security classification, but I would prettify this and use batching, or move it to a maintenance call run only on vslow slaves (which, ironically, would probably make it faster). Not high priority, but I believe this is still a (non-security) bug: 49 seconds is really close to the 60-second execution limit, and once it goes over that, it will fail every time. I believe reading ~40K rows plus a temporary table, the special structure, and cold caches can make the query this slow on that slave.

One other item of note: the code doesn't seem to actually be requesting any query group. It should usually be running on the generic replicas, where it would be faster, just as it would be on the vslow group.

Is there a reason s1 and s3 have the special contributions replicas as '50' instead of '1' like the other shards? Would it make sense to reduce them to reduce the chances they're used for queries like this?

but I would prettify this

If by "prettify this" you mean have it use JOIN...ON, that's easy to do. If you mean anything else or in addition, please clarify.

and use batching

Batching could be done, although the code would be making all the batched calls one after the other from the same PHP request. What batch size would you recommend?

That'd have to add ORDER BY page_len, page_id to be compatible with the use of the page_redirect_namespace_len index, which is somewhat of an odd thing to be ordering by.
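
As a rough sketch of how the batches could be chained (the values 250 and 12345 are placeholders standing in for the page_len and page_id of the last row of the previous batch, not real values), each follow-up query would continue from where the previous one stopped:

SELECT /* MessageCache::loadFromDB(en)-small */ page_title, page_len, page_id, old_id, old_text, old_flags
FROM `page`, `revision`, `text`
WHERE page_is_redirect = '0' AND page_namespace = '8'
  AND (page_title NOT LIKE '%/%' ESCAPE '`')
  AND (page_latest = rev_id) AND (rev_text_id = old_id)
  AND (page_len <= 1024)
  AND (page_len > 250 OR (page_len = 250 AND page_id > 12345))
ORDER BY page_len, page_id
LIMIT 1000;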

or move it to a maintenance call

I don't think a maintenance script would work for the use case here. When people edit a message they expect it to take effect right away, so even the job queue might be noticeably slow.

I believe reading ~40K rows plus a temporary table, the special structure, and cold caches can make the query this slow on that slave.

The 148 Handler_tmp_write and 150 Handler_read_rnd_next in my paste come from me not using the '-A' option when starting the sql client (and lacking the rights to use FLUSH). Those disappear if I start it with '-A'. In other words, I don't see any temporary table use going on here.

Not that it makes a difference, 49 seconds is still slow no matter what the cause.
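
For reference, the clean way to isolate per-query handler counters would be to reset the session status first, which is exactly the FLUSH right that wikiadmin currently lacks; a sketch of the usual pattern:

FLUSH STATUS;                 -- resets this session's Handler_* counters; requires the RELOAD privilege
-- ... run the query under test here ...
SHOW STATUS LIKE 'Hand%';     -- now reflects only the query just run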

Is there a reason s1 and s3 have the special contributions replicas as '50' instead of '1' like the other shards?

s3's special slaves are not actually special; they do not have any special partitioning because the tables are so small.
s1 requires so much load balancing that having 2 dedicated slaves just for rc, etc. would take 2 huge servers' worth of capacity, and in general they would not be warmed up to serve the specially marked queries. I cannot justify the purchase of 10K servers just to not run queries on them. As I have complained many times in the past, having special slaves is something we should aim to retire, not fix, as in the long run it negatively impacts reliability, performance, maintenance and high availability, as this very example shows. There is a plan to increase capacity by running multiple instances per server, but that is in the future.

Rant aside: the pooling state of a server is controlled by its "main" weight, so all servers that cannot be allowed to lag must have a weight of 1. That is problematic; implementing a "majority" replication check would solve this issue, I think, but Performance, who tend to fine-tune that, also have a lot on their plate. Majority rule plus multi-instance servers could be a long-term fix for the weight problem. The vslow part would work only as a temporary (bad) patch.

If by "prettify this" you mean have it use JOIN...ON

Yes, I just mean that.

That'd have to add ORDER BY page_len, page_id

Is there a dependency on 1024 as the configured limit, or can it change easily? How many messages actually go over that limit? Can't we just iterate over all messages using the primary key index, with a batch size in a PHP variable, let's say 1000 rows as a starting point? Note that this will only make each query faster; the overall API latency will, if anything, go higher, which is why I was concerned about doing this in the foreground.

I don't think a maintenance script would work for the use case here

What about a job? I am not sure if this comes from a job or an API call, but I think it is an API call based on the IP of the MediaWiki server. Probably too much overhead and not worth it, but I do not know whether the original API call succeeded, or would succeed, if this query fails.

any temporary table use going on here.

I see Handler_tmp_write | 148, which normally means a temporary table has been created, but it could be from a previous query.

(and lacking the rights to use FLUSH)

I think I can add RELOAD privileges to wikiadmin; it seems sensible, but it will take me some time to deploy it throughout the fleet. If for any reason that isn't possible, I will add a root-owned procedure, callable by wikiadmin, to get the handler statistics (and/or a maintenance PHP script to run EXPLAIN and the query, which we may need later for further performance testing). In my defense, you never formally asked for these before, and I can barely keep up with the requests I am directly asked for. 0:-) 0:-) Now is the moment to ask for more of that: I think you should have access to the performance_schema statistics and to ops.event_logging, at the very least; a proper dashboard on tendril will take years to finish at my pace.

I would like to insist on what I said before: if this is not a security issue, then for me, as the requester, this would have "low" priority. I didn't know it was old; it just looked funny, and it was Sunday, so I didn't have time to research it and notice that this wasn't someone doing a full table scan of private data.

If by "prettify this" you mean have it use JOIN...ON

Yes, I just mean that.

Ok. I can try to make a patch for that part of it later today. If you want to retitle this bug, that would save me having to think of a more appropriate title.

That'd have to add ORDER BY page_len, page_id

Is there a dependency on 1024 as the configured limit, or can it change easily?

I don't think there's any reason for 1024 in particular. If someone were to make a graph of page sizes, there'd probably be a long tail of really big pages like https://en.wikipedia.org/wiki/MediaWiki:Common.css that aren't worth putting in this cache, and at some point someone thought 1024 would be a good cutoff point.

How many messages actually go over that limit?

On enwiki, there are 1463 messages under the limit in the default localization and 318 over. In all non-default localizations, there are 275 under the limit and 51 over.

On Commons, which has much more language-stuff going on, there are 1096 under and 289 over in the default localization, and 18314 under and 734 over in all non-default localizations.

On Meta, there are 30374 under and 14271 over in the default localization, and 58308 under and 254 over in all non-default localizations. This seems to be largely due to CentralNotice; excluding anything beginning with "CentralNotice-" brings it down to 631/68 and 2067/201.
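
Counts like these can be obtained with something along the lines of the following sketch (the 1024 cutoff is hard-coded here and the CentralNotice exclusion is left out):

SELECT page_title LIKE '%/%' AS non_default_localization,
       page_len <= 1024      AS under_limit,
       COUNT(*)              AS messages
FROM page
WHERE page_namespace = 8 AND page_is_redirect = 0
GROUP BY non_default_localization, under_limit;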

Can't we just iterate over all messages using the primary key index, with a batch size in a PHP variable, let's say 1000 rows as a starting point?

No, because as far as the database is concerned, the messages aren't any different from the millions of articles and other pages on the wiki; the only thing that distinguishes them is that they're in namespace 8. page_redirect_namespace_len makes complete sense here.
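
For context, if I remember maintenance/tables.sql correctly, that index covers (page_is_redirect, page_namespace, page_len), which matches this query's conditions exactly; treat the definition below as a from-memory sketch rather than a quote of the schema file:

CREATE INDEX /*i*/page_redirect_namespace_len ON /*_*/page (page_is_redirect, page_namespace, page_len);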

Note that this will only make each query faster; the overall API latency will, if anything, go higher, which is why I was concerned about doing this in the foreground.

The main bug here is that the query is very unsuited to running on the contributions slaves, because for each page row it has to check every revision partition to find the match. Batching doesn't really change that:

mysql:wikiadmin@db1051 [enwiki]> pager cat > /dev/null
PAGER set to 'cat > /dev/null'
mysql:wikiadmin@db1051 [enwiki]> SELECT /* MessageCache::loadFromDB(en)-small */ page_title,page_len,page_id,old_id,old_text,old_flags FROM `page`,`revision`,`text` WHERE page_is_redirect = '0' AND page_namespace = '8' AND (page_title NOT LIKE '%/%' ESCAPE '`' ) AND (page_latest=rev_id) AND (rev_text_id=old_id) AND (page_len <= 1024) ORDER BY page_len, page_id LIMIT 1000;
1000 rows in set (0.17 sec)

mysql:wikiadmin@db1051 [enwiki]> pager; SHOW STATUS like 'Hand%';
Default pager wasn't set, using stdout.
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 1156  |
| Handler_icp_match          | 1156  |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 25991 |
| Handler_read_last          | 0     |
| Handler_read_next          | 2154  |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+

With a cold cache, I suspect this would still take a while to run.

It's much better on other replicas.

+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 1738  |
| Handler_icp_match          | 1738  |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1464  |
| Handler_read_last          | 0     |
| Handler_read_next          | 1738  |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+

The times I seemed to hit this query with a cold cache on non-contributions replicas, it was still somewhat slow, but more like 3 seconds than 50.

I don't think a maintenance script would work for the use case here

what about a job?

A job would be possible for post-edit updating of the cache, although it might confuse users who're used to messages being updated pretty much immediately after they save an edit. It wouldn't work so well for the daily cache expiration, though, unless the stale cache entry is still present to be used for the request that would trigger the job (I'm not sure if that's the case or not).

I am not sure if this comes from a job or an API call, but I think it is an API call based on the IP of the MediaWiki server. Probably too much overhead and not worth it, but I do not know whether the original API call succeeded, or would succeed, if this query fails.

This query comes from the localization lookup cache class, whenever something tries to access a localization message and the memcached/redis copy is missing or expired. It might happen in response to a job (although that seems relatively unlikely), an API hit, a web UI hit, or anything else that accesses localization messages.

In my defense, you never formally asked for these before, and I can barely keep up with the requests I am directly asked for. 0:-) 0:-)

I don't really need it, which is why I never asked. I had just been ignoring those two ≈150 counts that I knew weren't part of the query I was testing. Now that I've found out that using "sql enwiki -A" instead of "sql enwiki" prevents the client from making the extra queries, I'll remember to use that.

jcrespo renamed this task from This database mediawiki query on enwiki looks so badly written, that I think it is security-suspicious to Long running query MessageCache::loadFromDB(en)-small on WMF "special" slaves.May 8 2017, 4:40 PM

The main bug here is that the query is very unsuited to running on the contributions slaves

Can we move it to "API" server group? I know it is not the "right" way to solve it , but 1) it would be faster (if it is possible, I know sometimes is not) 2) It will work on WMF 3) it will not break 3rd party 4) we will get rid of those special slave in the long run.

Sure, I don't see any reason we couldn't do that.

Anomie changed the visibility from "Custom Policy" to "Public (No Login Required)".

Change 352623 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Prettify MessageCache::loadFromDB()-small query

https://gerrit.wikimedia.org/r/352623

Change 352624 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] MessageCache: Avoid 'contributions' replicas

https://gerrit.wikimedia.org/r/352624

Change 352623 merged by jenkins-bot:
[mediawiki/core@master] Prettify MessageCache::loadFromDB()-small query

https://gerrit.wikimedia.org/r/352623

Change 352624 merged by jenkins-bot:
[mediawiki/core@master] MessageCache: Avoid 'contributions' replicas

https://gerrit.wikimedia.org/r/352624

Current status for 1.30.0-wmf.1: the query itself has not materially changed, although it now uses JOIN...ON syntax rather than comma joins. It will, however, now be run only on the 'api' replicas, where it seems to complete in around 3 seconds when run with a cold cache.

Krinkle moved this task from Untriaged to Older on the Wikimedia-production-error board.

I personally don't remember this query being on the top worrying list anymore (unlike a bunch of others); I would close it as resolved.

Krinkle claimed this task.

Thanks.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:10 PM