aaron (Aaron Schulz)
User

User Details

User Since
Oct 20 2014, 5:25 PM (209 w, 3 h)
Availability
Available
IRC Nick
AaronSchulz
LDAP User
Aaron Schulz
MediaWiki User
Aaron Schulz

Recent Activity

Today

aaron added a comment to T207223: getDBLoadBalancerFactory()->getExternalLB() returns a LB with the wrong database..

In the getMasterDatabase() method posted above, I noticed that the database domain (e.g. DB/schema/prefix) is missing from getConnection(). Instead that should be:

Mon, Oct 22, 9:22 PM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database
aaron claimed T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).
Mon, Oct 22, 8:50 PM · Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
aaron moved T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate) from Inbox to Doing on the Performance-Team board.
Mon, Oct 22, 8:49 PM · Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
aaron claimed T207223: getDBLoadBalancerFactory()->getExternalLB() returns a LB with the wrong database..
Mon, Oct 22, 8:44 PM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database
aaron moved T207223: getDBLoadBalancerFactory()->getExternalLB() returns a LB with the wrong database. from Inbox to Next-up on the Performance-Team board.
Mon, Oct 22, 8:43 PM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database

Fri, Oct 19

aaron closed T193271: Refactor MessageCache to deal with NS_MEDIAWIKI pages that aren't standard interface messages as Resolved.
Fri, Oct 19, 3:33 AM · MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Patch-For-Review, MediaWiki-Cache, Performance-Team

Wed, Oct 17

aaron added a comment to T205369: Investigate > 40% Save Timing regression (2018-09-05).

For reference: I see three extension methods that access the ParserOutput via WikiPage::prepareContentForEdit:

  • TemplateDataHooks::onPageContentSave
  • SimpleCaptcha::shouldCheck
  • SpamBlacklistHooks::filterMergedContent

All of these should be hitting a cached instance, but perhaps they are not for some reason. The caching logic in WikiPage is not nice. Perhaps it would be better to have an in-process cache in the RevisionRenderer service. That would be straightforward, but would not cache PST content for pre-PST content.
Wed, Oct 17, 4:59 AM · Core Platform Team (MCR), Core Platform Team Kanban, Multi-Content-Revisions (Reactive), Performance-Team
aaron closed T193565: Foreign query for metawiki fails with "Table 'centralauth.page' doesn't exist" (DBConnRef mixup?) as Resolved.

Fixed in master.

Wed, Oct 17, 12:11 AM · MW-1.32-notes, MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), Patch-For-Review, Core Platform Team Kanban (Doing), Performance-Team, Core Platform Team (Security, stability, performance and scalability (TEC1)), Wikimedia-production-error, MediaWiki-Database

Tue, Oct 16

aaron closed T202553: Database->insertSelect() generates invalid SQL when * is passed as $conds as Resolved.
Tue, Oct 16, 9:37 PM · MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Patch-For-Review, Performance-Team, MediaWiki-Database

Mon, Oct 15

aaron created T207090: Requesting deployment access to servers for Performance Team task for perf-roots.
Mon, Oct 15, 8:23 PM · Patch-For-Review, Operations, SRE-Access-Requests
aaron added a comment to T193271: Refactor MessageCache to deal with NS_MEDIAWIKI pages that aren't standard interface messages.

I'm posting here instead of at a separate task, because I can't decipher whether this is a regression, side effect, bug or anything else. As a result of a3d6c1411dad, a lot of interface messages (as in, messages actually used in the interface) are no longer cached. This results in a whopping 172 database queries for interface messages on Special:Version on MediaWiki-Vagrant using MW master a3d6c1411dad or newer. Compared to the 10 there were before, this is a 1620% increase. As every query ends up in the debug log, both the Query overview and debug log tabs of the debug toolbar have become rather difficult to use.

Mon, Oct 15, 2:53 PM · MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Patch-For-Review, MediaWiki-Cache, Performance-Team

Sat, Oct 13

aaron added a comment to T202149: Exception thrown for failure to save settings appears ~ 1000 times/day.

I still see 100-200 per 3 hour interval.

Sat, Oct 13, 1:30 AM · MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Patch-For-Review, BetaFeatures, Wikimedia-production-error, MediaWiki-Authentication-and-authorization

Fri, Oct 12

aaron added a comment to T205369: Investigate > 40% Save Timing regression (2018-09-05).

Looking at https://performance.wikimedia.org/xhgui/run/view?id=5bbfdc7c3f3dfaea44b5847c after a null edit on https://en.wikipedia.org/wiki/1857_in_Sweden I see MediaWiki\Revision\RenderedRevision::getSlotParserOutputUncached being hit 4 times even though normal pages have only 1 slot...

Fri, Oct 12, 7:54 PM · Core Platform Team (MCR), Core Platform Team Kanban, Multi-Content-Revisions (Reactive), Performance-Team

Thu, Oct 11

aaron added a comment to T205369: Investigate > 40% Save Timing regression (2018-09-05).

Looking at https://performance.wikimedia.org/xhgui/run/view?id=5bbfdc7c3f3dfaea44b5847c after a null edit on https://en.wikipedia.org/wiki/1857_in_Sweden I see MediaWiki\Revision\RenderedRevision::getSlotParserOutputUncached being hit 4 times even though normal pages have only 1 slot...

Thu, Oct 11, 11:31 PM · Core Platform Team (MCR), Core Platform Team Kanban, Multi-Content-Revisions (Reactive), Performance-Team
aaron moved T193565: Foreign query for metawiki fails with "Table 'centralauth.page' doesn't exist" (DBConnRef mixup?) from Inbox to Doing on the Performance-Team board.
Thu, Oct 11, 8:16 PM · MW-1.32-notes, MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), Patch-For-Review, Core Platform Team Kanban (Doing), Performance-Team, Core Platform Team (Security, stability, performance and scalability (TEC1)), Wikimedia-production-error, MediaWiki-Database
aaron moved T206475: Users are unable to edit (add new topics to) [[:fa:wikipedia:پرسش‌های متفرقه]] from Inbox to Radar on the Performance-Team board.
Thu, Oct 11, 8:15 PM · Growth-Team (Current Sprint), MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Performance-Team (Radar), MediaWiki-ResourceLoader, Patch-For-Review, StructuredDiscussions

Wed, Oct 10

aaron added a comment to T205369: Investigate > 40% Save Timing regression (2018-09-05).

The alerts fired but we didn't act on them and then 7 days later it got back to the new "normal":

I think we can change the alert, or add another one, so that we alert on a hard limit instead (as we have done on some other dashboards). In the upcoming Grafana 5.3.0, alerts will have reminders.

Wed, Oct 10, 7:09 PM · Core Platform Team (MCR), Core Platform Team Kanban, Multi-Content-Revisions (Reactive), Performance-Team
aaron added a comment to T206580: Use object stash for persisting last-use proprety to control curation toolbar display.

@aaron - Any thoughts about the course of action suggested above in T206580#4655234?

Wed, Oct 10, 6:38 PM · MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), Patch-For-Review, Growth-Team (Current Sprint)

Tue, Oct 9

aaron updated subscribers of T205369: Investigate > 40% Save Timing regression (2018-09-05).
Tue, Oct 9, 11:14 PM · Core Platform Team (MCR), Core Platform Team Kanban, Multi-Content-Revisions (Reactive), Performance-Team
aaron added a comment to T204423: MW 1.31 install reports "InvalidArgumentException ... DatabaseDomain.php: Domain has too few or too many parts ".

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/465300/ also mentions the ID in the message.

Tue, Oct 9, 12:39 AM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database, MediaWiki-Installer
aaron added a comment to T204423: MW 1.31 install reports "InvalidArgumentException ... DatabaseDomain.php: Domain has too few or too many parts ".

Does this occur in master? I more so wonder if https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/452878/ happens to help.

Tue, Oct 9, 12:36 AM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database, MediaWiki-Installer

Fri, Oct 5

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions.

The throttling mechanism that memcache offers is good of course, but maybe -R 20 is not optimal?

Fri, Oct 5, 10:37 AM · MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), Patch-For-Review, Performance-Team (Radar), Wikimedia-production-error, Gadgets, User-Elukey, MediaWiki-Cache, Operations

Thu, Oct 4

aaron added a comment to T120140: Lock wait timeout exceeded (WikiPage::insertRedirectEntry) when editing a self-redirecting template.

Are these jobs that try to also move user subpages?

Thu, Oct 4, 11:27 PM · Wikimedia-production-error, MediaWiki-Database, MediaWiki-Templates

Wed, Oct 3

aaron claimed T193271: Refactor MessageCache to deal with NS_MEDIAWIKI pages that aren't standard interface messages.
Wed, Oct 3, 8:53 PM · MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Patch-For-Review, MediaWiki-Cache, Performance-Team
aaron added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

CAS errors on user might also help pinpoint some causes.

Wed, Oct 3, 12:09 AM · Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
aaron added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

Yes and yes. I think if COMMIT takes a few seconds, then even with this UPDATE near the transaction end, multiple writes can still pile up if enough tabs are opened or other things locking user rows are going on.

Wed, Oct 3, 12:08 AM · Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database

Tue, Oct 2

aaron added a comment to T204174: FileOperation error "SwiftFileBackend::addMissingMetadata: {path} was not stored with SHA-1 metadata.".

If you go that route, then something like this might work: have getFileSha1() and the sha1 field be null for certain containers, and have doOperations and friends pass a flag to getFileStat/Sha1 to keep the current behavior of lazy-loading and not using null.

Tue, Oct 2, 11:50 PM · Performance-Team, Thumbor, MediaWiki-File-management, Wikimedia-production-error
aaron added a comment to T204174: FileOperation error "SwiftFileBackend::addMissingMetadata: {path} was not stored with SHA-1 metadata.".

Is it that much space? If you add an option, you have to have getFileStat return some dummy value for the SHA1 and also not have that mess up the logic in doOperations(), which is why it seemed easier to just include the header.

Tue, Oct 2, 11:47 PM · Performance-Team, Thumbor, MediaWiki-File-management, Wikimedia-production-error
aaron added a comment to T204346: PHP-timed out requests also emit LoadBalancer::destruct error "you can't run this command now: COMMIT".

Why does Database::close() ever try to commit anyway?

Tue, Oct 2, 9:05 PM · Performance-Team (Radar), Wikimedia-production-error, MediaWiki-Database
aaron closed T189999: Enforce database transaction rollback conventions to protect against certain try/catch patterns as Resolved.
Tue, Oct 2, 8:59 PM · Performance-Team, TechCom, MediaWiki-Database
aaron updated the task description for T189999: Enforce database transaction rollback conventions to protect against certain try/catch patterns.
Tue, Oct 2, 8:59 PM · Performance-Team, TechCom, MediaWiki-Database
aaron added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

I don't see many of these in the logs for the last 7 days. This is likely caused by editing in parallel (multiple rollback tabs at once).

Tue, Oct 2, 5:33 PM · Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
aaron added a comment to T148603: Limit the Quality version of the flagged revision in Arabic Wikipedia to ns=0.

It looks like there is no way to say "Level 2 (reviewed for quality) is not allowed as a tag on pages outside namespace 0". Right now, I suppose it is just convention that reviewers only mark template revisions at level 1 (basic review). If $wgFlaggedRevsTags included a 'namespaces' field with (NS => level) as the value (defaulting to all of $wgFlaggedRevsNamespaces at the highest level, the status quo), then this could be configured.
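
To make the proposed 'namespaces' field concrete, here is a minimal Python sketch of the lookup it implies. This is not FlaggedRevs code: the constants, the function names and the dict layout are all hypothetical, and the fallback for a namespace that is missing from the map is an assumption (it is treated like the status quo).

```python
# Hypothetical stand-ins for the wiki configuration discussed above.
NS_MAIN, NS_TEMPLATE = 0, 10
FLAGGED_NAMESPACES = [NS_MAIN, NS_TEMPLATE]  # plays the role of $wgFlaggedRevsNamespaces
HIGHEST_LEVEL = 2  # "reviewed for quality" in the comment above


def max_level_for(tag_config, namespace):
    """Return the highest level allowed in a namespace, or None if it is not reviewable."""
    if namespace not in FLAGGED_NAMESPACES:
        return None
    # Assumption: without a 'namespaces' entry, the namespace keeps the
    # status quo and allows the highest level.
    return tag_config.get("namespaces", {}).get(namespace, HIGHEST_LEVEL)


def is_level_allowed(tag_config, namespace, level):
    """Check whether a review level may be applied to a page in this namespace."""
    limit = max_level_for(tag_config, namespace)
    return limit is not None and level <= limit
```

With a map such as `{"namespaces": {NS_MAIN: 2, NS_TEMPLATE: 1}}`, level 2 would be rejected for templates and accepted for articles, while an empty config keeps the current behavior.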

Tue, Oct 2, 3:41 AM · Wikimedia-Site-requests

Mon, Oct 1

aaron added a comment to T204174: FileOperation error "SwiftFileBackend::addMissingMetadata: {path} was not stored with SHA-1 metadata.".

It's used for originals. I don't think it matters much for thumbnails, but it's hard to cleanly tell that to SwiftFileBackend. It seems like it might be easiest to have thumbor hash the local file and save the metadata in the PUT request to avoid these errors (and slowness of triggering a GET to POST the missing data).

Mon, Oct 1, 9:34 PM · Performance-Team, Thumbor, MediaWiki-File-management, Wikimedia-production-error
aaron added a comment to T202553: Database->insertSelect() generates invalid SQL when * is passed as $conds.

I think any callers should use '', not '*', which doesn't make much sense to me. That said, we already started the pattern, so it may as well work here too.

Mon, Oct 1, 9:22 PM · MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Patch-For-Review, Performance-Team, MediaWiki-Database
aaron closed T172941: Track duplicate parses on page save as Declined.

Closing per comments on patch about MCR refactor.

Mon, Oct 1, 8:52 PM · Patch-For-Review, MediaWiki-Page-editing, MediaWiki-Parser, Performance-Team
aaron claimed T205369: Investigate > 40% Save Timing regression (2018-09-05).
Mon, Oct 1, 8:19 PM · Core Platform Team (MCR), Core Platform Team Kanban, Multi-Content-Revisions (Reactive), Performance-Team
aaron added a comment to T205893: Automatically trigger waitForReplication after a sufficiently high number of rows has been written.

Sometimes callers might want I/O (swift/elastics/blazegraph) near DB I/O transactions, so even if we use setTransactionListener() (like Maintenance) and listen for points where no trx is active anywhere (kind of like DeferredUpdates), we'd want to be careful about waiting for lag too long or erroring out. Then again, mixed source-IO code should generally follow guidelines (https://www.mediawiki.org/wiki/Database_transactions#Updating_secondary_non-RDBMS_stores) and use patterns like doing the key/value writes first and committing or using commit hooks/deferred updates. So...maybe a callback could listen to setTransactionListener(), it could be given the affected row count, and a deferred MergeableUpdate could be added to DeferredUpdates when the recent count across DBs is high (using pass-by-ref listener callback vars for last-time and running-count or something). The update could wait for replication, and would do so after any related I/O updates that relate to the DB writes.
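
The idea in this comment can be sketched roughly as follows. This is an illustrative Python model, not MediaWiki code: `ReplicationWaitUpdate`, `DeferredQueue`, `make_listener` and `ROW_THRESHOLD` are hypothetical names, and the threshold value is made up. The listener accumulates affected-row counts and, once the running total is high, queues a mergeable update that waits for replication when the queue is drained.

```python
from collections import deque

ROW_THRESHOLD = 1000  # hypothetical cutoff for "a high number of rows"


class ReplicationWaitUpdate:
    """Hypothetical mergeable deferred update that waits for replication once."""

    def __init__(self, wait_for_replication):
        self.wait_for_replication = wait_for_replication
        self.merged = 1

    def merge(self, other):
        # Repeated requests collapse into a single pending wait.
        self.merged += other.merged

    def run(self):
        self.wait_for_replication()


class DeferredQueue:
    """Tiny stand-in for a deferred-update queue that merges same-type updates."""

    def __init__(self):
        self.updates = deque()

    def add(self, update):
        for queued in self.updates:
            if type(queued) is type(update):
                queued.merge(update)
                return
        self.updates.append(update)

    def run_all(self):
        while self.updates:
            self.updates.popleft().run()


def make_listener(queue, wait_for_replication, threshold=ROW_THRESHOLD):
    """Build a transaction listener that tracks a running affected-row count."""
    running = {"rows": 0}  # plays the role of the pass-by-ref counter variable

    def on_transaction_commit(affected_rows):
        running["rows"] += affected_rows
        if running["rows"] >= threshold:
            running["rows"] = 0
            queue.add(ReplicationWaitUpdate(wait_for_replication))

    return on_transaction_commit
```

Because the update is merged rather than appended, several large transactions in a row still result in one replication wait, and that wait runs only when the queue is drained, after other queued work.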

Mon, Oct 1, 7:35 PM · Wikimedia-Incident, MediaWiki-Database

Sep 2 2018

aaron closed T189702: Replace transcache table with objectcache backend as Resolved.
Sep 2 2018, 9:26 PM · MW-1.32-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), MediaWiki-Database, Patch-For-Review, Core-Platform-Team-Old, Performance-Team, MediaWiki-Templates

Aug 31 2018

aaron added a comment to T196378: Investigate solutions for MySQL connection pooling.

The main blocker right now is to decide on a tunneling technology, as most seem to have issues.

Aug 31 2018, 8:00 PM · DBA, Availability (MediaWiki-MultiDC), Performance-Team (Radar), Operations
aaron added a comment to T202910: add performance team members to webserver_misc_static servers to maintain sitemaps.

perf-roots seems appropriate. If anything extra is needed, that can always be discussed in the future (probably by adding to perf-roots).

Aug 31 2018, 8:21 AM · Patch-For-Review, Performance-Team (Radar), SRE-Access-Requests, Operations

Aug 29 2018

aaron added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

All calls to incEditCountImmediate currently move it to the end of the transaction. According to logstash << +channel:DBPerformance +"user_editcount=user_editcount+N" +"sub-optimal" >> it seems to usually be very fast, though I see occasional entries a little over 1 second. I suppose in that case, a fast enough edit rate by a single user could make a pile-up. I wonder if the delay comes from COMMIT itself?

Aug 29 2018, 9:14 PM · Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
aaron renamed T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate) from Lock wait timeout exceeded when doing fast edits due to articule edit count locking to Lock wait timeout exceeded when doing fast edits due to article edit count locking.
Aug 29 2018, 8:56 PM · Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
aaron added a commit to T32956: Make ResourceLoader a standalone library: rMWdc3fc6cf81ec: resourceloader: Audit use of JSON encoding and use json_encode directly.
Aug 29 2018, 1:34 AM · Performance-Team, Librarization, MediaWiki-ResourceLoader

Aug 28 2018

aaron added a comment to T170596: Could not acquire lock 'LinksUpdate:job:pageid:xxx'.

https://en.wikipedia.org/wiki/User:Sam_Sailor/CSD_log seems to be an offending page (many links, possible parallel updates).

Aug 28 2018, 6:54 PM · MW-1.32-notes (WMF-deploy-2018-08-28 (1.32.0-wmf.19)), Performance-Team, MediaWiki-JobQueue, Wikimedia-production-error
aaron closed T202650: Please add aaron to perf-team as Resolved.

Confirmed.

Aug 28 2018, 6:33 PM · Patch-For-Review, Operations, SRE-Access-Requests
aaron closed T202650: Please add aaron to perf-team, a subtask of T202648: Please add everyone on the performance team to perf-roots, as Resolved.
Aug 28 2018, 6:33 PM · SRE-Access-Requests, Operations
aaron added a comment to T170596: Could not acquire lock 'LinksUpdate:job:pageid:xxx'.

Aside from using a narrower exception type and catching it, it's probably even easier to make acquirePageLock() return a boolean and log the error to a channel (possibly INFO level). The page_id should be extra logstash metadata, to make grouping easier. I suspect certain pages (like Commonist gallery subpages or such) are more likely to be offenders than others.
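
A rough illustration of that suggestion, written in Python rather than PHP and using an in-process lock as a stand-in for the real page lock. The function names, the logger channel and the lock registry are all hypothetical; only the shape (return a boolean, log at INFO, attach page_id as structured metadata) comes from the comment above.

```python
import logging
import threading

logger = logging.getLogger("LinksUpdate")  # hypothetical channel name

# Hypothetical in-process stand-in for the per-page lock store.
_page_locks = {}
_registry_guard = threading.Lock()


def acquire_page_lock(page_id):
    """Try to take the lock for page_id; return False instead of raising."""
    with _registry_guard:
        lock = _page_locks.setdefault(page_id, threading.Lock())
    if lock.acquire(blocking=False):
        return True
    # Log at INFO and attach page_id as structured metadata for grouping.
    logger.info("Could not acquire page lock", extra={"page_id": page_id})
    return False


def release_page_lock(page_id):
    """Release a lock previously taken with acquire_page_lock()."""
    _page_locks[page_id].release()
```

A caller would then skip or retry the update when `acquire_page_lock()` returns False, instead of wrapping the call in a try/catch.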

Aug 28 2018, 5:47 PM · MW-1.32-notes (WMF-deploy-2018-08-28 (1.32.0-wmf.19)), Performance-Team, MediaWiki-JobQueue, Wikimedia-production-error

Aug 27 2018

aaron added a comment to T201240: Transaction timeout for LinksUpdate::updateLinksTimestamp (SET page_links_updated) .

I can't seem to reproduce this slowness (using mwdebug1002).

Aug 27 2018, 8:04 PM · Performance-Team, Core-Platform-Team-Old, Regression, Wikimedia-production-error, MediaWiki-Page-editing
Nemo_bis awarded T189702: Replace transcache table with objectcache backend a Doubloon token.
Aug 27 2018, 7:35 PM · MW-1.32-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), MediaWiki-Database, Patch-For-Review, Core-Platform-Team-Old, Performance-Team, MediaWiki-Templates

Aug 23 2018

aaron added a comment to T164382: Evaluate the need for FORCE INDEX (ls_field_val) [now IGNORE INDEX (ls_log_id)], delete the index hint if not needed anymore.

I don't recall. It's been long enough that it's worth testing how queries run without it.

Aug 23 2018, 7:36 AM · MediaWiki-Logging, DBA

Aug 21 2018

aaron closed T198239: Rollout use of mcrouter for MediaWiki in production as Resolved.
Aug 21 2018, 8:02 AM · MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Patch-For-Review, Availability (MediaWiki-MultiDC), Performance-Team
aaron closed T198239: Rollout use of mcrouter for MediaWiki in production, a subtask of T192370: Deploy mcrouter to production as a wancache backend, as Resolved.
Aug 21 2018, 8:02 AM · Patch-For-Review, Performance-Team (Radar), Availability (MediaWiki-MultiDC), Operations

Aug 18 2018

aaron updated the task description for T198239: Rollout use of mcrouter for MediaWiki in production.
Aug 18 2018, 5:51 AM · MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Patch-For-Review, Availability (MediaWiki-MultiDC), Performance-Team

Aug 14 2018

aaron closed T118893: Consider using APC for the individually cached keys (e.g. 'TOO BIG') in MessageCache as Resolved.
Aug 14 2018, 6:16 PM · MW-1.32-notes (WMF-deploy-2018-07-31 (1.32.0-wmf.15)), Patch-For-Review, Performance-Team, MediaWiki-Cache

Aug 13 2018

aaron added a comment to T185724: Publish Doxygen for RunningStat library.

Where are the jenkins jobs defined?

Aug 13 2018, 8:23 PM · Librarization, Performance-Team, RunningStat, Continuous-Integration-Config
aaron claimed T200471: [regression] LBFactorySimple breaks ExternalStorage, trying to connect to external server with local database name.
Aug 13 2018, 8:16 PM · Patch-For-Review, MW-1.32-notes (WMF-deploy-2018-08-21 (1.32.0-wmf.18)), MW-1.31-release-notes, Performance-Team, Regression, MW-1.31-release, MediaWiki-Database
aaron updated the task description for T198239: Rollout use of mcrouter for MediaWiki in production.
Aug 13 2018, 7:59 PM · MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Patch-For-Review, Availability (MediaWiki-MultiDC), Performance-Team

Aug 10 2018

aaron added a comment to T164860: Update Echo's caching strategy for multi-dc compatibility.

Can this task be closed?

Aug 10 2018, 8:02 PM · MW-1.32-notes (WMF-deploy-2018-08-21 (1.32.0-wmf.18)), Performance-Team (Radar), Growth-Team (Current Sprint), Availability (MediaWiki-MultiDC), Collaboration-Team-Triage, Notifications

Aug 8 2018

aaron added a comment to T201482: LinksUpdate fails, spams exception logs, whenever replication lag on any server rises above 10s.

Something like that approach seems worth trying.

Aug 8 2018, 9:11 PM · Core Platform Team (Security, stability, performance and scalability (TEC1)), Core Platform Team Backlog (Next), MW-1.32-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), Patch-For-Review, Performance-Team (Radar), MediaWiki-Database
aaron added a comment to T154719: PageTriage opens master connection on GET for ArticleMetadata cache misses.

Is it possible to just update pagetriage_page_tags on page saves (and other relevant POST requests) when there are already master connections? For anything that depends on things updated via the job queue (like backlinks), those would have to be attached to such LinksUpdates (which already run in POST/jobs). Why do things have to be updated on page views?

Aug 8 2018, 6:38 PM · MW-1.32-notes (WMF-deploy-2018-08-21 (1.32.0-wmf.18)), Performance-Team (Radar), Patch-For-Review, Growth-Team (Current Sprint), Collaboration-Team-Triage (Collab-Team-This-Quarter), Availability, MediaWiki-extensions-PageCuration
aaron closed T196608: Notice: Undefined index: ChronologyProtection in /srv/mediawiki/core/includes/libs/rdbms/lbfactory/LBFactory.php on line 504 in web upgrader as Resolved.
Aug 8 2018, 5:47 PM · MW-1.32-notes (WMF-deploy-2018-08-07 (1.32.0-wmf.16)), Patch-For-Review, Performance-Team, MediaWiki-Installer, MediaWiki-Database

Aug 4 2018

aaron added a comment to T201016: Include ADD operation in memcached stats and grafana dashboard.

I noticed that regular memcached counts ADD as it does SET (cmd_set). This is the case for both STORED and NOT_STORED. There is no cmd_add. However, mcrouter does seem to expose a cmd_add counter. Perhaps there can be a mcrouter dashboard similar to the Memcache one in Grafana?

Aug 4 2018, 12:10 AM · Graphite, Operations

Aug 3 2018

aaron removed a subtask for T88445: MediaWiki active/active datacenter investigation and work (tracking): T164504: Tracking: Cleanup x1 database connection patterns.
Aug 3 2018, 6:07 PM · Availability (MediaWiki-MultiDC), Performance-Team, Epic
aaron removed a parent task for T164504: Tracking: Cleanup x1 database connection patterns: T88445: MediaWiki active/active datacenter investigation and work (tracking).
Aug 3 2018, 6:07 PM · DBA
aaron removed a project from T164504: Tracking: Cleanup x1 database connection patterns: Availability (MediaWiki-MultiDC).
Aug 3 2018, 6:05 PM · DBA
aaron added a comment to T164504: Tracking: Cleanup x1 database connection patterns.

Are there any tasks here that remain and are blockers to multi-DC?

Aug 3 2018, 12:10 AM · DBA

Aug 2 2018

aaron created T201016: Include ADD operation in memcached stats and grafana dashboard.
Aug 2 2018, 3:32 PM · Graphite, Operations

Aug 1 2018

aaron updated the task description for T198239: Rollout use of mcrouter for MediaWiki in production.
Aug 1 2018, 5:24 PM · MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Patch-For-Review, Availability (MediaWiki-MultiDC), Performance-Team

Jul 30 2018

aaron added a comment to T196608: Notice: Undefined index: ChronologyProtection in /srv/mediawiki/core/includes/libs/rdbms/lbfactory/LBFactory.php on line 504 in web upgrader.

Regression from fb51330084b4bde1880c76589e55e7cd87ed0c6d I assume

Jul 30 2018, 11:59 PM · MW-1.32-notes (WMF-deploy-2018-08-07 (1.32.0-wmf.16)), Patch-For-Review, Performance-Team, MediaWiki-Installer, MediaWiki-Database
aaron moved T198239: Rollout use of mcrouter for MediaWiki in production from Next-up to Doing on the Performance-Team board.
Jul 30 2018, 8:19 PM · MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Patch-For-Review, Availability (MediaWiki-MultiDC), Performance-Team
aaron moved T196608: Notice: Undefined index: ChronologyProtection in /srv/mediawiki/core/includes/libs/rdbms/lbfactory/LBFactory.php on line 504 in web upgrader from Next-up to Doing on the Performance-Team board.
Jul 30 2018, 8:19 PM · MW-1.32-notes (WMF-deploy-2018-08-07 (1.32.0-wmf.16)), Patch-For-Review, Performance-Team, MediaWiki-Installer, MediaWiki-Database
aaron triaged T189702: Replace transcache table with objectcache backend as Low priority.
Jul 30 2018, 8:12 PM · MW-1.32-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), MediaWiki-Database, Patch-For-Review, Core-Platform-Team-Old, Performance-Team, MediaWiki-Templates
aaron moved T189702: Replace transcache table with objectcache backend from Inbox to Blocked on the Performance-Team board.
Jul 30 2018, 8:12 PM · MW-1.32-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), MediaWiki-Database, Patch-For-Review, Core-Platform-Team-Old, Performance-Team, MediaWiki-Templates
aaron moved T196608: Notice: Undefined index: ChronologyProtection in /srv/mediawiki/core/includes/libs/rdbms/lbfactory/LBFactory.php on line 504 in web upgrader from Inbox to Next-up on the Performance-Team board.
Jul 30 2018, 8:11 PM · MW-1.32-notes (WMF-deploy-2018-08-07 (1.32.0-wmf.16)), Patch-For-Review, Performance-Team, MediaWiki-Installer, MediaWiki-Database
aaron claimed T196608: Notice: Undefined index: ChronologyProtection in /srv/mediawiki/core/includes/libs/rdbms/lbfactory/LBFactory.php on line 504 in web upgrader.
Jul 30 2018, 8:11 PM · MW-1.32-notes (WMF-deploy-2018-08-07 (1.32.0-wmf.16)), Patch-For-Review, Performance-Team, MediaWiki-Installer, MediaWiki-Database
aaron moved T200506: Previewing a non-style-only gadget that you already have enabled causes a syntax error from Inbox to Next-up on the Performance-Team board.
Jul 30 2018, 8:10 PM · MW-1.32-notes (WMF-deploy-2018-08-28 (1.32.0-wmf.19)), Performance-Team, MediaWiki-ResourceLoader
aaron assigned T200506: Previewing a non-style-only gadget that you already have enabled causes a syntax error to Krinkle.
Jul 30 2018, 8:10 PM · MW-1.32-notes (WMF-deploy-2018-08-28 (1.32.0-wmf.19)), Performance-Team, MediaWiki-ResourceLoader
aaron triaged T200629: Using fully-qualified function calls is faster as Low priority.
Jul 30 2018, 8:09 PM · MediaWiki-Codesniffer, Performance-Team, Performance, MediaWiki-General-or-Unknown
TK-999 awarded T189702: Replace transcache table with objectcache backend a Love token.
Jul 30 2018, 12:47 PM · MW-1.32-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), MediaWiki-Database, Patch-For-Review, Core-Platform-Team-Old, Performance-Team, MediaWiki-Templates
aaron closed T199762: WikiPage::updateCategoryCounts causing Lock wait timeout exceeded as Resolved.
Jul 30 2018, 3:50 AM · Wikimedia-production-error, Performance-Team, Core-Platform-Team-Old, MediaWiki-Database
aaron closed T199762: WikiPage::updateCategoryCounts causing Lock wait timeout exceeded, a subtask of T30499: 1205: Lock wait timeout exceeded; try restarting transaction (tracking), as Resolved.
Jul 30 2018, 3:50 AM · Technical-Debt, Tracking, MediaWiki-Database

Jul 27 2018

aaron added a comment to T200471: [regression] LBFactorySimple breaks ExternalStorage, trying to connect to external server with local database name.

This may be caused by rMW14ee3f210782 self-merged by @aaron

Jul 27 2018, 1:16 AM · Patch-For-Review, MW-1.32-notes (WMF-deploy-2018-08-21 (1.32.0-wmf.18)), MW-1.31-release-notes, Performance-Team, Regression, MW-1.31-release, MediaWiki-Database

Jul 26 2018

aaron added a comment to T199762: WikiPage::updateCategoryCounts causing Lock wait timeout exceeded.

From https://logstash.wikimedia.org/goto/0b9191830a12ab3d15bce062cdb36a93, this seems to be better. But we should wait longer.

Jul 26 2018, 10:48 PM · Wikimedia-production-error, Performance-Team, Core-Platform-Team-Old, MediaWiki-Database
aaron added a comment to T200468: Percona XtraDB Cluster gives error when using GET_LOCK() when pxc_strict_mode=ENFORCING is set (e.g. By ApiStashEdit.php).

At a glance, it looks like xtradb cluster is built on Galera (which is something itself to consider in the future). Use of GET_LOCK is tricky there since it would have to use wsrep or have such queries directed to a dedicated master (perhaps with some HA in front that doesn't split brain).

Jul 26 2018, 10:19 PM · MediaWiki-Database
aaron added a comment to T200420: Wikidata dispatching stuck (not releasing lockmanager locks).

Ah, right, I read that ternary backwards, <<$maxTime < PHP_INT_MAX ? PHP_INT_MAX : 1>>.

Jul 26 2018, 5:34 PM · MW-1.32-notes (WMF-deploy-2018-07-31 (1.32.0-wmf.15)), Patch-For-Review, User-Addshore, Wikidata-Campsite, Wikidata
aaron added a comment to T200420: Wikidata dispatching stuck (not releasing lockmanager locks).

Something to note, because the locks are no longer in the DB, we end up selecting the same 15 or so wikis that are locked all of the time.
It could be that the other wikis actually don't have locks:

before using the redis lock manager the status of the lock from the db was also in the select so that locked dbs would not be selected at all.

Jul 26 2018, 5:05 PM · MW-1.32-notes (WMF-deploy-2018-07-31 (1.32.0-wmf.15)), Patch-For-Review, User-Addshore, Wikidata-Campsite, Wikidata

Jul 19 2018

aaron created T200026: RepoGroup exceptions due to "false" being passed as a key to MapCacheLRU.
Jul 19 2018, 4:20 PM · MW-1.32-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)), Performance-Team, Patch-For-Review, Release-Engineering-Team (Kanban), Release, Train Deployments
aaron added a comment to T199594: Exception "Job queue is read-only".

Normally, it would be odd to let jobs pile up but not execute them, though the multi-DC use case of $wgReadOnly in one of the DCs wasn't considered in T130795. Ideally, jobs enqueued on GET/HEAD wouldn't be a thing...but that's not going away anytime soon.

Jul 19 2018, 12:52 PM · Services (done), MW-1.32-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)), User-Joe, Operations, Wikimedia-production-error, Core-Platform-Team-Old, WMF-JobQueue

Jul 17 2018

aaron added a comment to T199762: WikiPage::updateCategoryCounts causing Lock wait timeout exceeded.

My first inclination is to try to reduce the refreshCounts() calls.

Jul 17 2018, 8:22 PM · Wikimedia-production-error, Performance-Team, Core-Platform-Team-Old, MediaWiki-Database
aaron updated the task description for T198239: Rollout use of mcrouter for MediaWiki in production.
Jul 17 2018, 3:30 PM · MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Patch-For-Review, Availability (MediaWiki-MultiDC), Performance-Team

Jul 16 2018

aaron removed a project from T92357: Fix problematic database master queries performed on HTTP GET/HEAD: MW-1.30-release-notes (WMF-deploy-2017-06-13_(1.30.0-wmf.5)).
Jul 16 2018, 9:12 PM · Availability (MediaWiki-MultiDC), Patch-For-Review, MediaWiki-General-or-Unknown
aaron placed T95501: Fix causes of slave lag and get it to under 5 seconds at peak up for grabs.
Jul 16 2018, 9:12 PM · Performance-Team (Radar), Goal, Availability
aaron placed T190260: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" trying to undelete a file up for grabs.
Jul 16 2018, 8:37 PM · MW-1.32-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)), Multimedia, Core-Platform-Team-Old, Patch-For-Review, MediaWiki-Page-deletion, MediaWiki-File-management, Performance-Team

Jul 11 2018

aaron closed T194403: Wikimedia\Rdbms\ChronologyProtector::initPositions: expected but failed to find position index. as Resolved.
Jul 11 2018, 12:43 PM · MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Release-Engineering-Team (Watching / External), Performance-Team, MediaWiki-Database, Wikimedia-production-error
aaron closed T199218: MemcachedBagOStuff.php: Key contains invalid characters as Resolved.

This was fixed by the 61a7e1acd0af4a5386df03335733accfde179fa1 backport.

Jul 11 2018, 10:07 AM · MediaWiki-General-or-Unknown
aaron closed T199039: "Fatal exception of type "Exception"" when using Special:LanguageStats on MediaWiki.org as Resolved.

Fixed with the 61a7e1acd0af4a5386df03335733accfde179fa1 backport.

Jul 11 2018, 10:06 AM · MW-1.32-notes (WMF-deploy-2018-09-18 (1.32.0-wmf.22)), Wikimedia-production-error, MediaWiki-extensions-Translate
aaron updated subscribers of T194403: Wikimedia\Rdbms\ChronologyProtector::initPositions: expected but failed to find position index..

Change 445110 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: fix value of ChronologyProtector::POSITION_COOKIE_TTL

https://gerrit.wikimedia.org/r/445110

Jul 11 2018, 9:39 AM · MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Release-Engineering-Team (Watching / External), Performance-Team, MediaWiki-Database, Wikimedia-production-error
aaron added a comment to T194403: Wikimedia\Rdbms\ChronologyProtector::initPositions: expected but failed to find position index..

Given how low server_failure_limit is, it might help to lower server_retry_timeout from 30s to something < 5s. Consistent hash ejections seem like the most obvious thing that could cause an acknowledged write to be seen as not being there for any of the next 5 seconds.

Jul 11 2018, 9:07 AM · MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Release-Engineering-Team (Watching / External), Performance-Team, MediaWiki-Database, Wikimedia-production-error

Jul 10 2018

aaron updated the task description for T198239: Rollout use of mcrouter for MediaWiki in production.
Jul 10 2018, 6:37 PM · MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Patch-For-Review, Availability (MediaWiki-MultiDC), Performance-Team