Page MenuHomePhabricator

GlobalPreferences deploy caused a significant increase in reads on s3
Closed, ResolvedPublic8 Story Points

Event Timeline

Niharika triaged this task as High priority.Mar 22 2018, 4:39 PM
Niharika created this task.
TBolliger edited projects, added Community-Tech-Sprint; removed Community-Tech.
jcrespo added a subscriber: Volans.Mar 22 2018, 5:47 PM
MaxSem claimed this task.Mar 22 2018, 8:58 PM

Grabbing this because I unlike Sam have prod access for investigation.

We are logging SQL queries for testwiki (testwiki.log). I've done some slicing throught these, however the enabling on GP roughly coincides with train deployment of a new MW version, making precise analysis hard. Specifically, there's a huge increase in external store queries (due to some MCR work? due to somehow increased traffic?):

maxsem@mwlog1001:~$ for file in *qc.log; do total=`wc -l $file`; count=`grep 'SELECT  blob_text' $file | wc -l`; echo "$file $count $total" ; done
21qc.log 81270 305302 21qc.log
22qc.log 3714588 4874350 22qc.log
23qc.log 1926239 2836952 23qc.log

And a corresponding increase in readonly checks:

maxsem@mwlog1001:~$ for file in *qc.log; do total=`wc -l $file`; count=`grep "SHOW GLOBAL VARIABLES LIKE 'read_only'" $file | wc -l`; echo "$file $count $total" ; done
21qc.log 14419 305302 21qc.log
22qc.log 658799 4874350 22qc.log
23qc.log 326875 2836952 23qc.log

Neither of these dropped after GP was disabled, so clearly a wrong lead. @Marostegui, is there a record of which queries were causing the slowdown somewhere in Tendril?

Hey @MaxSem - thanks for digging into this.
I am not sure I am understanding the relationship between the train and this issue.
So, what I saw is that this config change matches, exactly the spike on the graphs below:

17:35 maxsem@tin: Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/420947/ (duration: 01m 15s)

https://grafana.wikimedia.org/dashboard/db/mysql?panelId=8&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1077&var-port=9104&from=1521637034143&to=1521681962143
https://grafana.wikimedia.org/dashboard/db/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1077&var-port=9104&from=1521637034143&to=1521681962143

(https://dev.mysql.com/doc/refman/5.7/en/server-status-variables.html#statvar_Handler_read_next) which matches with the huge spike on the first graph (sorts).

As soon as that was reverted, the values came back to normal values. So not sure why the train is involved here.
Thanks for investigating!

@Marostegui (CC @jcrespo) can we get an idea of which queries were causing the slowdown? This is getting hard to debug without that data.
If it's not possible to get the data without the code being deployed, can we go ahead and re-enable this on testwiki for a couple of hours to observe?

TBolliger set the point value for this task to 8.Mar 28 2018, 11:31 PM

🔮

@Niharika as you might know, our DBAs are out for the end of this week, do you think this can wait Monday? If not let me know and I'll try to have a look although I'm missing some context on what analysis was already done.

Quick answer: the queries were not slow (or not yet) but they are probably not correct as they were scanning lots and lots of rows.

Thankfully it was only enabled on testwiki (and it was already visible on the graphs), otherwise it would have caused lots of more issues.

@Volans Yes, this can wait until Monday. We need to take a look at the queries which were causing this and I suspect we'll have to re-enable this on test wiki for a short while to get that data.

kaldari added a subscriber: kaldari.Apr 3 2018, 8:22 PM

@MaxSem: Would it be worth doing a temporary deployment to test wiki to take a closer look at the queries, or would it be possible to recreate the issue on Beta Labs?

@MaxSem: Want to schedule a temporary deployment for sometime next week so that we can get more data?

I'll discuss the action plan with Sam.

I'll discuss the action plan with Sam.

Max, can you also mention the action plan here? It's a good idea to have everyone on the same page.
Thanks.

MaxSem added a comment.Apr 9 2018, 8:45 PM

We decided to add more logging as a starting point.

Change 425199 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/core@master] Log DefaultPreferencesFactory usage

https://gerrit.wikimedia.org/r/425199

Change 425201 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/extensions/GlobalPreferences@master] Add logging

https://gerrit.wikimedia.org/r/425201

Change 425199 merged by jenkins-bot:
[mediawiki/core@master] Log DefaultPreferencesFactory usage

https://gerrit.wikimedia.org/r/425199

Change 425210 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[operations/mediawiki-config@master] Add logging channel for preference stuff

https://gerrit.wikimedia.org/r/425210

Change 425201 merged by jenkins-bot:
[mediawiki/extensions/GlobalPreferences@master] Add logging

https://gerrit.wikimedia.org/r/425201

@Marostegui and @jcrespo, we've fixed the other problems and introduced logging to pinpoint this bug. I'd like to try deploying to group0 wikis again to see how it works now and collect some logs - is that fine by you?

Can you specify which wikis are in group0?

Can you specify which wikis are in group0?

mediaWiki.org; test.wikipedia.org; test2.wikipedia.org for this task.

Can you specify which wikis are in group0?

mediaWiki.org; test.wikipedia.org; test2.wikipedia.org for this task.

I'm fine with that, we should be able to revert if we see issues again quickly

@Marostegui Thank you. We'll keep a watch on tendril and logstash. We'll revert if we see issues.

Check also the mysql graphs for the s3 slaves.

Change 425210 merged by jenkins-bot:
[operations/mediawiki-config@master] Add logging channel for preference stuff

https://gerrit.wikimedia.org/r/425210

We tried running it for ~1 hour, got the same increase in DB load. The new logging is available in the preferences channel. @Samwilson says he has some ideas.

Change 425740 had a related patch set uploaded (by Samwilson; owner: Samwilson):
[mediawiki/extensions/GlobalPreferences@master] Add instance cache for global prefs storage

https://gerrit.wikimedia.org/r/425740

So it looks to me like (for a logged-in user's request) the user three times: once when loading the session, once when setting $wgUser, and once in the ResourceLoaderContext.

When GP is enabled, each of these adds an additional database query, and one additional one because GP has no equivalent to the User class' override cache.

(I'm no doubt over simplifying things, but anyway this is about that additional query.)

So, we could add a runtime cache to GP, so it only fetches things once? Something like: https://gerrit.wikimedia.org/r/#/c/425740/
but I'm not really sure that one additional query is what's causing that huge increase in reads. Seems excessive, doesn't it?

So, sorry if I got your hopes up! I had been thinking earlier that there was some more obvious magic about what's going on in User::loadOptions(), but there isn't.

This measure is definitely good, however the overload happens on the local database, not centralauth.

We are fans of caching! :-)

Our worry is, if some of the least used wikis create this increase, when scaling to large wikis, that could be an outage (load of group0 is less than 1% of the total wikis). Hope the context of our fears is clear.

Change 426841 had a related patch set uploaded (by Samwilson; owner: Samwilson):
[mediawiki/extensions/GlobalPreferences@master] Add runtime cache for central IDs

https://gerrit.wikimedia.org/r/426841

Could this be related to looking up the central ID? I've attempted to reproduce the bug locally by running 1000 requests, and do see considerable increase in Handler_read_next:

Without GP:

With GP:

With GP (patched):

So the patch https://gerrit.wikimedia.org/r/#/c/426841 does improve things, but the numbers are still higher (sorry, the graphs above have different Y-axis scales). And of course this is all very rough (single DB server, one user).

I wouldn't have thought queries like this could be a possible culprit:

SELECT /* LocalIdLookup::lookupUserNames  */  user_id,user_name  FROM `user`    WHERE user_name = 'Admin'

Oh, but if both of the above patches are applied, things look a bit better (still higher):

Change 426841 merged by jenkins-bot:
[mediawiki/extensions/GlobalPreferences@master] Add runtime cache for central IDs

https://gerrit.wikimedia.org/r/426841

Change 425740 abandoned by Samwilson:
Add instance cache for global prefs storage

Reason:
Is being done in https://gerrit.wikimedia.org/r/#/c/427544/ in a better way.

https://gerrit.wikimedia.org/r/425740

With caching in place, we will try another time at 15:00 SF this Wednesday.

@MaxSem Noticeable, yes. But is it temporary?

For my analysis of last week's testing, I used a different strategy than before: instead of comparing logs for time periods of the same length, I compared logs with the same line count.

Both files in comparison had 2149813 lines, however the share of DB queries in them was wildly different:

maxsem@mwlog1001:~$ wc -l t2*
   461628 t2g.log     # With GP
   285585 t2ng.log    # Without GP

which by itself suggests a difference in DB activity per request. Request count:

maxsem@mwlog1001:~$ grep 'DEBUG: Start request' t1ng.log | wc -l
21104    # Without GP
maxsem@mwlog1001:~$ grep 'DEBUG: Start request' t1g.log | wc -l
13029    # With GP

The most popular queries without GP:

maxsem@mwlog1001:~$ sed -r "s/'.*?'/?/" t2ng.log | sed -r "s/\\b\\d+\\b/?/" | sed -r "s/blobs_cluster\\d+/blobs_cluster?/" | sort | uniq -c | sort -rn
  35334 SELECT  page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang  FROM `page`    WHERE page_namespace = ?  LIMIT ?
  22423 SELECT  blob_text  FROM `blobs_cluster?`    WHERE blob_id = ?  LIMIT ?
  20339 SELECT  page_touched,pp_propname  FROM `page` LEFT JOIN `page_props` ON ((page_id=pp_page) AND pp_propname = ?  LIMIT ?
  18387 BEGIN
  16585 SELECT ts FROM heartbeat.heartbeat WHERE shard = ? ORDER BY ts DESC LIMIT ?
  10306 SELECT  user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount  FROM `user`    WHERE user_id = ?  LIMIT ?
   9910 SELECT  page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang  FROM `page`    WHERE page_id = ?  LIMIT ?
   9399 SELECT  ug_user,ug_group,ug_expiry  FROM `user_groups`    WHERE ug_user = ?
   9301 SELECT  user_id  FROM `user`    WHERE user_name = ?  LIMIT ?
   8810 SELECT  up_property,up_value  FROM `user_properties`    WHERE up_user = ?
   8169 SELECT  page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model,page_lang  FROM `page`    WHERE page_namespace = ?  LIMIT ?
   7929 SELECT  pp_value  FROM `page_props`    WHERE pp_propname = ?  LIMIT ?
   5561 COMMIT
   4475 SELECT  rev_id,rev_page,rev_text_id,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,user_name  FROM `revision` LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) INNER JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((rev_user != ?) AND (user_id = rev_user))   WHERE page_namespace = ? AND (rev_id=page_latest)  LIMIT 1
   4296 SHOW GLOBAL VARIABLES LIKE ?
   4064 SHOW SESSION VARIABLES LIKE ?
...

While with GP:

maxsem@mwlog1001:~$ sed -r "s/'.*?'/?/" t2g.log | sed -r "s/\\b[0-9]+\\b/?/" | sed -r "s/blobs_cluster[0-9]+/blobs_cluster?/" | sort | uniq -c | sort -rn | head -n 100
 213026 SELECT  blob_text  FROM `blobs_cluster?`    WHERE blob_id = ?  LIMIT ?
  24988 SELECT  page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang  FROM `page`    WHERE page_namespace = ?  LIMIT ?
  22024 SELECT  blob_text  FROM `blobs`    WHERE blob_id = ?  LIMIT ?
  15009 BEGIN
  14067 SELECT ts FROM heartbeat.heartbeat WHERE shard = ? ORDER BY ts DESC LIMIT ?
  12765 SELECT  page_touched,pp_propname  FROM `page` LEFT JOIN `page_props` ON ((page_id=pp_page) AND pp_propname = ?  LIMIT ?
  10010 SELECT  user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount  FROM `user`    WHERE user_id = ?  LIMIT ?
   9138 SELECT  page_namespace,page_title,page_id,page_len,page_is_redirect,page_latest,page_content_model,page_lang  FROM `page`    WHERE page_id = ?  LIMIT ?
   8676 SELECT  ug_user,ug_group,ug_expiry  FROM `user_groups`    WHERE ug_user = ?
   8556 SELECT  up_property,up_value  FROM `user_properties`    WHERE up_user = ?
   7490 SELECT  user_id  FROM `user`    WHERE user_name = ?  LIMIT ?
   6715 SELECT  page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model,page_lang  FROM `page`    WHERE page_namespace = ?  LIMIT ?
   6393 SELECT  pp_value  FROM `page_props`    WHERE pp_propname = ?  LIMIT ?
   4715 COMMIT
...

Conclusion: GlobalPreferences breaks page blob caching. Which returns us to that One Line Of Doom that we can't remove due to T153256: Unable to overwrite services using MediaWikiServices hook.

Change 430012 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/core@master] Don't initialize MediaWikiServices before extensions have been loaded

https://gerrit.wikimedia.org/r/430012

TBolliger reassigned this task from MaxSem to kaldari.May 8 2018, 10:04 PM
TBolliger edited projects, added Community-Tech; removed Community-Tech-Sprint.

Taking this out of the sprint and assigning to @kaldari to find a DBA to code review and test this patch. If this patch does not alleviate these database spikes, we need to have a meeting to discuss the future of this project.

kaldari reassigned this task from kaldari to MaxSem.May 9 2018, 5:40 PM

Patch is merged now.

@kaldari @MaxSem What's the next step? Another test deploy? (after merging https://gerrit.wikimedia.org/r/#/c/432045/)

Change 402413 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/extensions/GlobalPreferences@master] Don't force PreferencesFactory initialization

https://gerrit.wikimedia.org/r/402413

MaxSem added a comment.May 9 2018, 6:09 PM

Above is the second part of the fix.

Change 430012 merged by jenkins-bot:
[mediawiki/core@master] Don't initialize MediaWikiServices before extensions have been loaded

https://gerrit.wikimedia.org/r/430012

Change 402413 merged by jenkins-bot:
[mediawiki/extensions/GlobalPreferences@master] Don't force PreferencesFactory initialization

https://gerrit.wikimedia.org/r/402413

Let's try another test deployment, say on Tuesday after the train?

Let's try another test deployment, say on Tuesday after the train?

+1. Can you schedule that?

(was scheduled yesterday)

Change 433149 had a related patch set uploaded (by Samwilson; owner: Samwilson):
[operations/mediawiki-config@master] Deploy GlobalPreferences to test wikis and mw.org (forth time)

https://gerrit.wikimedia.org/r/433149

Change 433149 merged by jenkins-bot:
[operations/mediawiki-config@master] Deploy GlobalPreferences to test wikis and mw.org (fourth time)

https://gerrit.wikimedia.org/r/433149

Mentioned in SAL (#wikimedia-operations) [2018-05-15T22:39:09Z] <samwilson@tin> Synchronized wmf-config/InitialiseSettings.php: Deploying GlobalPreferences T190425 (duration: 01m 21s)

During the deployment, we saw no increase in DB load, so we decided to leave it live overnight. In case something goes wrong, it can be reverted by deploying https://gerrit.wikimedia.org/r/#/c/433291/

@Marostegui or @jcrespo, does it look fine now?

It is looking a lot better now. I would suggest to wait for 24h and then deploy to some more wikis and see how it goes.

@MaxSem can you please update this ticket? Can we close as resolved?

MaxSem closed this task as Resolved.May 21 2018, 7:00 PM
MaxSem moved this task from Needs Review/Feedback to Q1 2018-19 on the Community-Tech-Sprint board.
MaxSem removed a project: Patch-For-Review.

\m/ 😂 \m/

Nice work guys!