|Resolved||Deskana||T75616 Tracking: API/backend issues blocking Wikipedia app development|
|Resolved||Anomie||T32788 Allow triggering of user password reset email via the API|
|Open||None||T90925 General authentication improvements for MediaWiki|
|Resolved||Anomie||T48179 Allow a challenge stage during authentication|
|Open||None||T5709 Refactoring to make external authentication and identity systems easier|
|Resolved||Tgr||T43201 UserLoadFromSession considered evil|
|Resolved||Anomie||T67493 Session is started by EditAction (problem for extensions using UserLoadFromSession hook)|
|Open||None||T55156 Provide option to force a login session to end within a certain time|
|Open||None||T89459 Modernize MediaWiki authentication system (AuthManager)|
|Resolved||Anomie||T123451 Deploy SessionManager and bot passwords|
|Resolved||bd808||T124406 s7 master is executing 54K transactions/s|
I am doubting between unbreak now and high, because it has not yet "broken" the wikis, but it is at 1000-2000 connections, once it reaches 5000, every single wiki will stop working.
The main ofending query is:
SELECT /* CentralAuthUser::loadAttached */ lu_wiki FROM `localuser` WHERE lu_name IS NULL
Other frequent queries:
SELECT /* CentralAuthUser::loadGroups */ ggp_permission,ggp_group FROM `global_group_permissions` SELECT /* CentralAuthUser::loadState [SANITIZED] */ gu_id,gu_name,lu_wiki,gu_salt,gu_password,gu_auth_token,gu_locked,gu_hidden,gu_registration,gu_email,gu_email_authenticated,gu_home_db,gu_cas_token FROM `globaluser` LEFT OUTER JOIN `localuser` ON ((gu_name=lu_name) AND lu_wiki = 'ptwiki') WHERE gu_name = '[SANITIZED]' LIMIT 1
For some reason the CentralAuthUser object is create with a null username which ends up producing the lu_name IS NULL queries.
We can see such occurrences on the beta cluster logstash by looking for CentralAuthVerbose. That yields messages such as:
Loading CentralAuthUser for user from cache object Loading attached wiki list for global user from DB
Ie the user is not set.
That is all related to the recent session management change that got deployed this week (cced relevant parties).
I was looking through dbperformance.log earlier (fluorine running out of space, this was one of the biggest files) and noticed a LOT of centralauth database corrections coming through session checks looking for... I think it was renamed users? Will be in ops channel logs from 7-10 hours ago. That log had wfBacktrace output
These are the entries I was looking at, may or may not be related:
2016-01-22 13:10:12 mw1025 enwiki DBPerformance INFO: Expectation (masterConns <= 0) by MediaWiki::main not met: [connect to 10.64.16.22 (centralauth)] TransactionProfiler.php line 311 calls wfBacktrace() TransactionProfiler.php line 146 calls TransactionProfiler->reportExpectationViolated() LoadBalancer.php line 584 calls TransactionProfiler->recordConnection() CentralAuthUtils.php line 47 calls LoadBalancer->getConnection() CentralAuthUser.php line 150 calls CentralAuthUtils::getCentralDB() CentralAuthUser.php line 2038 calls CentralAuthUser->getSafeReadDB() CentralAuthUser.php line 438 calls CentralAuthUser->loadAttached() CentralAuthUser.php line 421 calls CentralAuthUser->loadFromCacheObject() CentralAuthUser.php line 284 calls CentralAuthUser->loadFromCache() CentralAuthUser.php line 2092 calls CentralAuthUser->loadState() CentralAuthHooks.php line 1674 calls CentralAuthUser->renameInProgress() Hooks.php line 195 calls CentralAuthHooks::onSessionCheckInfo() SessionManager.php line 876 calls Hooks::run() SessionManager.php line 216 calls MediaWiki\Session\SessionManager->loadSessionInfoFromStore() WebRequest.php line 658 calls MediaWiki\Session\SessionManager->getSessionById() User.php line 1148 calls WebRequest->getSession() User.php line 359 calls User->loadFromSession() User.php line 5019 calls User->load() User.php line 2605 calls User->loadOptions() RequestContext.php line 427 calls User->getOption() ContextSource.php line 152 calls RequestContext->getSkin() MobileFrontend.hooks.php line 548 calls ContextSource->getSkin() Hooks.php line 195 calls MobileFrontendHooks::onSpecialPageBeforeExecute() SpecialPage.php line 381 calls Hooks::run() SpecialPageFactory.php line 564 calls SpecialPage->run() MediaWiki.php line 280 calls SpecialPageFactory::executePath() MediaWiki.php line 738 calls MediaWiki->performRequest() MediaWiki.php line 510 calls MediaWiki->main() index.php line 43 calls MediaWiki->run() index.php line 3 calls include()``
This is no longer an unbreak now. I am resolving it because there is not danger of database stress. However, I have 2 followups/questions:
- Queries on the master are still 2x 3x the ones before the deployment. If that is 1) the long tail 2) a difference due to maintenance, post-issues, different times 3) something expected due to the deployment, I cannot say. This not an issue for database load, so I am not personally worried, but want to note it if someones considers it significant. I need more time to reliable tell this was the case.
- I assume this was some SELECTS already going to the master. I understand that some SELECTS, specially if dealing with centralauth may *really* need to the master, could you confirm this was the case and that they really cannot be sent to the slave (they are time-sensitive). If it wasn't the case, maybe we can open another ticket to move them to a slave to avoid future incidents?
The logic to choose between the master and slave databases when populating a CentralAuthUser object was introduced recently in rECAU6b73405fcef5: Allow for CentralAuthUser::loadState() to generally use DB slaves by @aaron. It looks like $wgCentralAuthUseSlaves = true is currently only set on testwiki.