Page MenuHomePhabricator

s7 master is executing 54K transactions/s
Closed, ResolvedPublic

Event Timeline

jcrespo created this task.Jan 22 2016, 11:13 AM
jcrespo updated the task description. (Show Details)
jcrespo raised the priority of this task from to Unbreak Now!.
jcrespo added a subscriber: jcrespo.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 22 2016, 11:13 AM

I am doubting between unbreak now and high, because it has not yet "broken" the wikis, but it is at 1000-2000 connections, once it reaches 5000, every single wiki will stop working.

The main ofending query is:

SELECT /* CentralAuthUser::loadAttached  */  lu_wiki  FROM `localuser`   WHERE lu_name IS NULL

Other frequent queries:

SELECT /* CentralAuthUser::loadGroups  */  ggp_permission,ggp_group  FROM `global_group_permissions`
SELECT /* CentralAuthUser::loadState [SANITIZED] */ gu_id,gu_name,lu_wiki,gu_salt,gu_password,gu_auth_token,gu_locked,gu_hidden,gu_registration,gu_email,gu_email_authenticated,gu_home_db,gu_cas_token  FROM `globaluser` LEFT OUTER JOIN `localuser` ON ((gu_name=lu_name) AND lu_wiki = 'ptwiki')  WHERE gu_name = '[SANITIZED]'  LIMIT 1

Happening since 18-20 UTC yesteday, large spike since 23 UTC.

For some reason the CentralAuthUser object is create with a null username which ends up producing the lu_name IS NULL queries.

We can see such occurrences on the beta cluster logstash by looking for CentralAuthVerbose. That yields messages such as:

Loading CentralAuthUser for user  from cache object
Loading attached wiki list for global user  from DB

Ie the user is not set.

That is all related to the recent session management change that got deployed this week (cced relevant parties).

jcrespo set Security to None.
jcrespo removed subscribers: bd808, Tgr, Anomie, hashar.

I was looking through dbperformance.log earlier (fluorine running out of space, this was one of the biggest files) and noticed a LOT of centralauth database corrections coming through session checks looking for... I think it was renamed users? Will be in ops channel logs from 7-10 hours ago. That log had wfBacktrace output

jcrespo added a comment.EditedJan 22 2016, 1:09 PM

Reaching 60K QPS now. I can kill manually some queries if needed (it will be at >100K level to avoid a full outage)

These are the entries I was looking at, may or may not be related:

2016-01-22 13:10:12 mw1025 enwiki DBPerformance INFO: Expectation (masterConns <= 0) by MediaWiki::main not met:
[connect to 10.64.16.22 (centralauth)]
TransactionProfiler.php line 311 calls wfBacktrace()
TransactionProfiler.php line 146 calls TransactionProfiler->reportExpectationViolated()
LoadBalancer.php line 584 calls TransactionProfiler->recordConnection()
CentralAuthUtils.php line 47 calls LoadBalancer->getConnection()
CentralAuthUser.php line 150 calls CentralAuthUtils::getCentralDB()
CentralAuthUser.php line 2038 calls CentralAuthUser->getSafeReadDB()
CentralAuthUser.php line 438 calls CentralAuthUser->loadAttached()
CentralAuthUser.php line 421 calls CentralAuthUser->loadFromCacheObject()
CentralAuthUser.php line 284 calls CentralAuthUser->loadFromCache()
CentralAuthUser.php line 2092 calls CentralAuthUser->loadState()
CentralAuthHooks.php line 1674 calls CentralAuthUser->renameInProgress()
Hooks.php line 195 calls CentralAuthHooks::onSessionCheckInfo()
SessionManager.php line 876 calls Hooks::run()
SessionManager.php line 216 calls MediaWiki\Session\SessionManager->loadSessionInfoFromStore()
WebRequest.php line 658 calls MediaWiki\Session\SessionManager->getSessionById()
User.php line 1148 calls WebRequest->getSession()
User.php line 359 calls User->loadFromSession()
User.php line 5019 calls User->load()
User.php line 2605 calls User->loadOptions()
RequestContext.php line 427 calls User->getOption()
ContextSource.php line 152 calls RequestContext->getSkin()
MobileFrontend.hooks.php line 548 calls ContextSource->getSkin()
Hooks.php line 195 calls MobileFrontendHooks::onSpecialPageBeforeExecute()
SpecialPage.php line 381 calls Hooks::run()
SpecialPageFactory.php line 564 calls SpecialPage->run()
MediaWiki.php line 280 calls SpecialPageFactory::executePath()
MediaWiki.php line 738 calls MediaWiki->performRequest()
MediaWiki.php line 510 calls MediaWiki->main()
index.php line 43 calls MediaWiki->run()
index.php line 3 calls include()``

Change 265759 had a related patch set uploaded (by BryanDavis):
Check for username before creating user in onSessionCheckInfo

https://gerrit.wikimedia.org/r/265759

Change 265759 merged by jenkins-bot:
Check for non-null username before creating CentralAuthUser in hooks

https://gerrit.wikimedia.org/r/265759

Change 265767 had a related patch set uploaded (by BryanDavis):
Check for non-null username before creating CentralAuthUser in hooks

https://gerrit.wikimedia.org/r/265767

Change 265767 merged by jenkins-bot:
Check for non-null username before creating CentralAuthUser in hooks

https://gerrit.wikimedia.org/r/265767

bd808 claimed this task.Jan 22 2016, 5:38 PM
Restricted Application added a project: User-bd808. · View Herald TranscriptJan 22 2016, 5:38 PM
bd808 added a comment.Jan 22 2016, 5:39 PM

@jcrespo reported on irc that queries per second are dropping sharply on te s7 master following deployment of the backport.

jcrespo closed this task as Resolved.Jan 22 2016, 6:27 PM

This is no longer an unbreak now. I am resolving it because there is not danger of database stress. However, I have 2 followups/questions:

  • Queries on the master are still 2x 3x the ones before the deployment. If that is 1) the long tail 2) a difference due to maintenance, post-issues, different times 3) something expected due to the deployment, I cannot say. This not an issue for database load, so I am not personally worried, but want to note it if someones considers it significant. I need more time to reliable tell this was the case.
  • I assume this was some SELECTS already going to the master. I understand that some SELECTS, specially if dealing with centralauth may *really* need to the master, could you confirm this was the case and that they really cannot be sent to the slave (they are time-sensitive). If it wasn't the case, maybe we can open another ticket to move them to a slave to avoid future incidents?
bd808 added a subscriber: aaron.Jan 22 2016, 7:37 PM
  • I assume this was some SELECTS already going to the master. I understand that some SELECTS, specially if dealing with centralauth may *really* need to the master, could you confirm this was the case and that they really cannot be sent to the slave (they are time-sensitive). If it wasn't the case, maybe we can open another ticket to move them to a slave to avoid future incidents?

The logic to choose between the master and slave databases when populating a CentralAuthUser object was introduced recently in rECAU6b73405fcef5: Allow for CentralAuthUser::loadState() to generally use DB slaves by @aaron. It looks like $wgCentralAuthUseSlaves = true is currently only set on testwiki.

bd808 moved this task from To Do to Done on the User-bd808 board.Feb 1 2016, 3:28 PM
bd808 moved this task from Done to Archive on the User-bd808 board.Feb 21 2016, 8:53 PM
Restricted Application added subscribers: Jay8g, TerraCodes. · View Herald TranscriptDec 15 2016, 4:18 PM