aaron (Aaron Schulz)
User

User Details

User Since
Oct 20 2014, 5:25 PM (216 w, 23 h)
Availability
Available
IRC Nick
AaronSchulz
LDAP User
Aaron Schulz
MediaWiki User
Aaron Schulz [ Global Accounts ]

Recent Activity

Today

aaron added a comment to T210992: Increase parsercache keys TTL from 22 days back to 30 days.

@Krinkle can you confirm whether those are the two reverts we have to do?
I have been looking around and I haven't found anything else to revert
Thank you!

Tue, Dec 11, 3:50 PM · Operations, Performance-Team, DBA

Yesterday

aaron created T211631: Add info boxes to all save timing graphs on Grafana.
Mon, Dec 10, 9:00 PM · Performance-Team

Sat, Dec 8

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

Today another mediawiki alert from ~12:24 to ~12:27 UTC. @Nikerabbit, @aaron - do you think that we can narrow down specific events (besides TTL expiring) that may cause this?

Sat, Dec 8, 6:18 AM · Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations

Sun, Dec 2

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

To clarify, the $useMutex logic in WAN cache never triggers due to minAsOf=INF, resulting in stampedes when someone invalidates the cache. Instead, this should be treated like a regular TTL expiration and have one thread at a time doing regeneration.
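
As an illustration of the "one thread at a time doing regeneration" idea, here is a minimal in-process sketch. It is not the WAN cache implementation: `SingleFlightCache` and its methods are hypothetical names, and this toy version makes waiting threads block on a per-key lock, whereas a real cache might instead serve a stale value to them.

```python
import threading


class SingleFlightCache:
    """Toy cache (hypothetical) where only one thread regenerates a missing key."""

    def __init__(self):
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()  # protects both dicts above

    def _lock_for(self, key):
        # One regeneration lock per key, created on first use.
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def invalidate(self, key):
        # Treated the same as an expiry: the next get() regenerates once.
        with self._guard:
            self._values.pop(key, None)

    def get(self, key, regenerate):
        with self._guard:
            if key in self._values:
                return self._values[key]
        with self._lock_for(key):
            # Re-check: another thread may have regenerated while we waited.
            with self._guard:
                if key in self._values:
                    return self._values[key]
            value = regenerate()
            with self._guard:
                self._values[key] = value
            return value
```

With this shape, a burst of readers arriving right after `invalidate()` results in a single call to `regenerate` rather than one per reader.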

Sun, Dec 2, 9:58 PM · Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations
aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

Is the value so big that every time it is SET it causes a TKO? Besides the 24h TTL, the cache value is updated when there are changes to the underlying data. In other words: when people create or remove translatable pages, aggregate message groups or other similar stuff.

Sun, Dec 2, 10:33 AM · Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations

Fri, Nov 30

Imarlier awarded T205369: Investigate > 40% Save Timing regression (2018-09-05) a Mountain of Wealth token.
Fri, Nov 30, 7:06 PM · Core Platform Team Kanban (Done with CPT), Performance-Team-notice, Core Platform Team (MCR), Multi-Content-Revisions (Reactive), Performance-Team

Mon, Nov 26

aaron closed T209483: Got connection to 'yuewiktionary', but expected local domain ('aawiktionary'). as Resolved.
Mon, Nov 26, 5:58 PM · MW-1.33-notes (1.33.0-wmf.6; 2018-11-27), Patch-For-Review, MediaWiki-extensions-WikimediaMaintenance

Thu, Nov 22

aaron added a comment to T209857: Create ISP ranking based on RUM data.

@Gilles: Comcast only has cable infrastructure in terms of what the ISP provides itself. Customers with cable can also get XFinity Mobile (https://www.tomsguide.com/us/xfinity-mobile-faq,news-25223.html). That's basically just a bunch of Wi-Fi hotspots built off of Verizon. I don't know how many people are using that and it seems new-ish. Also, the latency figures are quite low, which makes me doubt that it is XFinity Mobile; it is more likely regular wireless/xfinity.

Thu, Nov 22, 7:19 PM · MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), Patch-For-Review, Performance-Team
aaron added a comment to T209857: Create ISP ranking based on RUM data.

It looks sane, though I wonder why Comcast is so high in usage for mobile? Is that mostly from touchpad devices instead of smartphones?

Thu, Nov 22, 8:22 AM · MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), Patch-For-Review, Performance-Team

Tue, Nov 20

aaron added a comment to T196378: Investigate solutions for MySQL connection pooling.
Tue, Nov 20, 8:55 PM · DBA, Availability (MediaWiki-MultiDC), Performance-Team (Radar), Operations
aaron added a comment to T208934: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen.

It definitely seems like something worth doing. Having the potential for high use cache keys becoming unusable for undefined periods of time is too much of a stability concern.

Tue, Nov 20, 8:04 PM · Performance-Team (Radar), User-Elukey, MediaWiki-Cache, Operations
aaron added a comment to T157651: sql.php runs LoadExtensionSchemaUpdates.

Indeed. The updater calls the LoadExtensionSchemaUpdates hook from the constructor (bleh) and Echo and SecurePoll use dropTable/modifyField (which are executed immediately) instead of dropExtensionTable/modifyExtensionField (which would just push an entry to the update list).

This is a very, very ugly accident waiting to happen. @aaron any preference how to prevent it? We could make getSchemaVars static, or split out the extension loading part from the constructor, or add some flag that prevents calling non-extension methods on the updater. Maybe even a unit test which calls the hook with a fake updater which fails the test if any non-extension methods are called on it.

Tue, Nov 20, 7:53 PM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Performance-Team, MediaWiki-Database, MediaWiki-Maintenance-scripts, Beta-Cluster-reproducible
aaron triaged T204423: MW 1.31 install reports "InvalidArgumentException ... DatabaseDomain.php: Domain has too few or too many parts " as Low priority.
Tue, Nov 20, 7:47 PM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database, MediaWiki-Installer

Mon, Nov 19

aaron changed the status of T204423: MW 1.31 install reports "InvalidArgumentException ... DatabaseDomain.php: Domain has too few or too many parts " from Open to Stalled.
Mon, Nov 19, 8:36 PM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database, MediaWiki-Installer

Wed, Nov 14

aaron added a comment to T205369: Investigate > 40% Save Timing regression (2018-09-05).

Since CategoryMembershipChangeJob runs via the job queue, wouldn't that have little effect on save timing itself? I guess it wouldn't hurt to optimize.

Wed, Nov 14, 10:48 PM · Core Platform Team Kanban (Done with CPT), Performance-Team-notice, Core Platform Team (MCR), Multi-Content-Revisions (Reactive), Performance-Team

Nov 9 2018

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

Keys are set by add/cas normally, so it seems like some key that takes a long time to regenerate might have expired (there are two data points at the elevated value over more than just a few seconds) or a class of many keys expired. The other possibility is some sudden change in access patterns for keys, which seems less likely, especially the more periodic this is.

Checking keys via tcpdump is not super easy, so I tried to order the packets by size, inspecting the keys with bigger size and lower TTL. I keep seeing this pattern of SETs for metawiki:translate-groups (TTL 2 hours) followed by gets during the timeframe in which timeouts happen, which, given the size (~380k), could surely aggravate the bandwidth problem that we are seeing. Would it be possible to increase this TTL to, say, 24h to see if anything changes? Or possibly migrate the code to something more like gadget-definition, namely setting the new key only if really needed and not every two hours by default. It would help remove a (big) variable from the problem.

To answer the point brought up by @Nikerabbit about the code not changed in ages - I suspect that in the past nutcracker might have masked problems like this one (see T208934), and that the change to mcrouter caused more errors raised due to bandwidth/latency variations. I am probably not right about the metawiki:translate-groups key as culprit, but I'd just need to see if removing a big key makes any difference in what we see in the graphs, of course if this is not going to affect users in any way.

Nov 9 2018, 2:15 AM · Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations

Nov 8 2018

aaron added a comment to T208487: [RfC] Add CURRENT_TIMESTAMP support for `wl_notificationtimestamps` in watchlist table.

wl_notificationtimestamp is not meant to store the time the article was watched but the last revision the user saw on the page (NULL if they saw the latest revision). This would require a new column. Ideally, if watchlist sizes were limited, this wouldn't need an index, but they are not.

Nov 8 2018, 9:53 PM · User-D3r1ck01, TechCom-RFC, Growth-Team, MediaWiki-Database
aaron placed T208487: [RfC] Add CURRENT_TIMESTAMP support for `wl_notificationtimestamps` in watchlist table up for grabs.
Nov 8 2018, 7:19 AM · User-D3r1ck01, TechCom-RFC, Growth-Team, MediaWiki-Database

Nov 7 2018

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

Keys are set by add/cas normally, so it seems like some key that takes a long time to regenerate might have expired (there are two data points at the elevated value over more than just a few seconds) or a class of many keys expired. The other possibility is some sudden change in access patterns for keys, which seems less likely, especially the more periodic this is.

Nov 7 2018, 7:19 PM · Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations

Nov 6 2018

aaron updated subscribers of T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.
Nov 6 2018, 10:24 AM · Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations

Nov 5 2018

aaron added a comment to T204346: PHP-timed out requests also emit LoadBalancer::destruct error "you can't run this command now: COMMIT".

Fixed in bf30fcb71427d673f7c83a067b3241040d3470b6. Rollback is used instead and uses $ignoreErrors so as not to trigger the exception in reportQueryError().

Nov 5 2018, 11:15 PM · Performance-Team, Wikimedia-production-error, MediaWiki-Database
aaron closed T39159: sqlite: DatabaseBase::delete and DatabaseBase::update return ResultWrapper object as Resolved.

Cleaned up in 633eb437a3b808518469c6eaf4e86a436941d837

Nov 5 2018, 10:10 PM · Performance-Team, goodfirstbug, SQLite, MediaWiki-Database

Nov 2 2018

aaron added a comment to T194299: Lock wait timeout exceeded in SqlIdGenerator::generateNewId.

openConnection is badly named and still reuses connections. You'd probably want getConnection with CONN_TRX_AUTO

I hate this hack. This may *still* re-use connections, if anything else used CONN_TRX_AUTO. We should have CONN_NEW.

Nov 2 2018, 6:28 AM · MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), User-Addshore, Wikidata-Ministry-Of-Magic, Patch-For-Review, Wikimedia-production-error, MediaWiki-extensions-WikibaseRepository, Wikidata

Nov 1 2018

aaron added a comment to T194299: Lock wait timeout exceeded in SqlIdGenerator::generateNewId.

openConnection is badly named and still reuses connections. You'd probably want getConnection with CONN_TRX_AUTO

Nov 1 2018, 10:23 PM · MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), User-Addshore, Wikidata-Ministry-Of-Magic, Patch-For-Review, Wikimedia-production-error, MediaWiki-extensions-WikibaseRepository, Wikidata
aaron closed T203925: Save times for changes to translation variable text in centralnotice paralysingly slow as Resolved.
Nov 1 2018, 7:52 PM · Core Platform Team Kanban (Done with CPT), Performance-Team-notice, Fundraising Sprint Vestigial tails shoot from the hip, Fundraising Sprint USB stands for underhanded socket bureaucracy, Fundraising Sprint They Live, Core Platform Team (Security, stability, performance and scalability (TEC1)), Performance-Team, Fundraising Sprint Sasquatches can't find us either, Language-Team, Fundraising Sprint Raw data can give you salmonella, MediaWiki-extensions-Translate, Fundraising-Backlog, MediaWiki-extensions-CentralNotice

Oct 29 2018

aaron added a comment to T206341: Evaluate scalability and performance of PHP7 compared to HHVM.

What about our use of register_postsend_function? Is there anything equivalent?

Oct 29 2018, 9:33 PM · Patch-For-Review, Performance-Team (Radar), Operations
aaron closed T207809: PHP error "CdnPurgeJob never inserted." as Resolved.
Oct 29 2018, 9:18 PM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team, Core Platform Team Backlog (Watching / External), Patch-For-Review, Services (watching), MediaWiki-Cache, WMF-JobQueue, Wikimedia-production-error
aaron moved T207809: PHP error "CdnPurgeJob never inserted." from Inbox to Doing on the Performance-Team board.
Oct 29 2018, 9:18 PM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team, Core Platform Team Backlog (Watching / External), Patch-For-Review, Services (watching), MediaWiki-Cache, WMF-JobQueue, Wikimedia-production-error
aaron added a comment to T207223: getDBLoadBalancerFactory()->getExternalLB() returns a LB with the wrong database..

The recommended fix does work for our extension. I apparently had configured Echo properly in the past so it was working properly. For AbuseFilter I had to patch it to use the same pattern since it only specifies the database and not the cluster to use.

Oct 29 2018, 6:44 PM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database
jcrespo awarded T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate) a Love token.
Oct 29 2018, 8:49 AM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database

Oct 28 2018

aaron added a comment to T207809: PHP error "CdnPurgeJob never inserted.".

@aaron The fix LGTM, but do we know why this started happening? It'd be nice to know what commit or task prompted it so that we can learn why it wasn't prevented by our tests and/or Jenkins.

In theory, this kind of warning can be triggered in tests and would be captured by Jenkins in a way that fails the build. That definitely worked at some point.

Oct 28 2018, 10:17 PM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team, Core Platform Team Backlog (Watching / External), Patch-For-Review, Services (watching), MediaWiki-Cache, WMF-JobQueue, Wikimedia-production-error

Oct 27 2018

aaron closed T54781: "Could not acquire 'messages:de-formal:status' lock" opening page with page name >256 chars; memcache too slow as Invalid.

Closing, per " The Error Occurs if the memcache is too slow".

Oct 27 2018, 9:14 PM · MediaWiki-General-or-Unknown
aaron added a comment to T174549: MessageCache::loadFromDB makes too many slow queries with wrong index.

This will be better with a3d6c1411dad3e057b if there are many message pages that exist for extension use.

Oct 27 2018, 9:07 PM · MediaWiki-Database, MediaWiki-Cache
aaron added a comment to T207979: uselang=sr shows markup tags in installer.

4b1db1190bb8f2a115c6a81a5ee487b7d18cd303 seems more likely.

Oct 27 2018, 9:03 PM · MW-1.32-notes, Regression, MW-1.32-release, MediaWiki-Cache, Performance-Team, I18n, MediaWiki-Installer
aaron added a comment to T204423: MW 1.31 install reports "InvalidArgumentException ... DatabaseDomain.php: Domain has too few or too many parts ".

Note that git master (19dd28798163) installs fine with postgres, which has the same DB domain patches as 1.32.

Oct 27 2018, 3:15 PM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database, MediaWiki-Installer

Oct 26 2018

aaron added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

It looks like the errors come from some tool (JS?) that fires a bunch of API requests from a Special:Search tab to edit numerous pages in parallel. Each burst is always for a certain user ID with a single referrer URL.

Oct 26 2018, 11:43 PM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
aaron committed rEBTFcadd89ea1200: Avoid use of IDatabase::insert() return value (authored by aaron).
Avoid use of IDatabase::insert() return value
Oct 26 2018, 9:41 PM
aaron committed rERXB3d0705e6c75f: Avoid using wfSplitWikiID() and use newer cache key methods (authored by aaron).
Avoid using wfSplitWikiID() and use newer cache key methods
Oct 26 2018, 9:23 PM
aaron committed rEBSSMWC0e8b6916ea2a: Avoid use of IDatabase::update return value (authored by aaron).
Avoid use of IDatabase::update return value
Oct 26 2018, 8:32 PM
aaron committed rEBTF63bb559a1c9d: Avoid use of IDatabase::update return value (authored by aaron).
Avoid use of IDatabase::update return value
Oct 26 2018, 7:53 PM
aaron added a comment to T208003: WatchedItemStore::addWatchBatchForUser does not have outer scope..

Does this really need to call commitAndWaitForReplication() when there is only one batch? Is it ever called thousands of times in a row?

Oct 26 2018, 5:55 AM · MW-1.33-notes (1.33.0-wmf.4; 2018-11-13), Patch-For-Review, Growth-Team (Current Sprint), MediaWiki-Watchlist, Regression, Wikimedia-production-error

Oct 25 2018

aaron added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

COMMIT takes a few seconds

It is unlikely for a commit itself to take long: at commit time there is only a metadata change plus a flush to disk cache, which should be very fast. Look for contention somewhere as a first option when you see commit taking a long time (e.g. large transactions blocking each other).

Oct 25 2018, 5:48 PM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
aaron closed T207223: getDBLoadBalancerFactory()->getExternalLB() returns a LB with the wrong database. as Declined.
Oct 25 2018, 5:46 PM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database

Oct 24 2018

aaron committed rERXB04e424a12439: Avoid using wfSplitWikiID() and use newer cache key methods (authored by aaron).
Avoid using wfSplitWikiID() and use newer cache key methods
Oct 24 2018, 9:32 PM

Oct 22 2018

aaron added a comment to T207223: getDBLoadBalancerFactory()->getExternalLB() returns a LB with the wrong database..

In the getMasterDatabase() method posted above, I noticed that the database domain (e.g. DB/schema/prefix) is missing from getConnection(). Instead that should be:

Oct 22 2018, 9:22 PM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database
aaron claimed T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).
Oct 22 2018, 8:50 PM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
aaron moved T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate) from Inbox to Doing on the Performance-Team board.
Oct 22 2018, 8:49 PM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
aaron claimed T207223: getDBLoadBalancerFactory()->getExternalLB() returns a LB with the wrong database..
Oct 22 2018, 8:44 PM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database
aaron moved T207223: getDBLoadBalancerFactory()->getExternalLB() returns a LB with the wrong database. from Inbox to Current Quarter Goals on the Performance-Team board.
Oct 22 2018, 8:43 PM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database

Oct 19 2018

aaron closed T193271: Refactor MessageCache to deal with NS_MEDIAWIKI pages that aren't standard interface messages as Resolved.
Oct 19 2018, 3:33 AM · MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Patch-For-Review, MediaWiki-Cache, Performance-Team

Oct 17 2018

aaron added a comment to T205369: Investigate > 40% Save Timing regression (2018-09-05).

For reference: I see three extension methods that access the ParserOutput via WikiPage::prepareContentForEdit:

  • TemplateDataHooks::onPageContentSave
  • SimpleCaptcha::shouldCheck
  • SpamBlacklistHooks::filterMergedContent

All of these should be hitting a cached instance, but perhaps they are not for some reason. The caching logic in WikiPage is not nice. Perhaps it would be better to have an in-process cache in the RevisionRenderer service. That would be straightforward, but would not cache PST content for pre-PST content.
Oct 17 2018, 4:59 AM · Core Platform Team Kanban (Done with CPT), Performance-Team-notice, Core Platform Team (MCR), Multi-Content-Revisions (Reactive), Performance-Team
aaron closed T193565: Foreign query for metawiki fails with "Table 'centralauth.page' doesn't exist" (DBConnRef mixup?) as Resolved.

Fixed in master.

Oct 17 2018, 12:11 AM · Core Platform Team Kanban (Done with CPT), MW-1.32-notes, MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), Patch-For-Review, Performance-Team, Core Platform Team (Security, stability, performance and scalability (TEC1)), Wikimedia-production-error, MediaWiki-Database

Oct 16 2018

aaron closed T202553: Database->insertSelect() generates invalid SQL when * is passed as $conds as Resolved.
Oct 16 2018, 9:37 PM · MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Patch-For-Review, Performance-Team, MediaWiki-Database

Oct 15 2018

aaron created T207090: Requesting deployment access to servers for Performance Team task for perf-roots.
Oct 15 2018, 8:23 PM · Patch-For-Review, Operations, SRE-Access-Requests
aaron added a comment to T193271: Refactor MessageCache to deal with NS_MEDIAWIKI pages that aren't standard interface messages.

I'm posting here instead of at a separate task, because I can't decipher whether this is a regression, side effect, bug or anything else. As a result of a3d6c1411dad, a lot of interface messages (as in, messages actually used in the interface) are no longer cached. This results in a whopping 172 database queries for interface messages on Special:Version on MediaWiki-Vagrant using MW master a3d6c1411dad or newer. Compared to the 10 there were before, this is a 1620% increase. As every query ends up in the debug log, both the Query overview and debug log tab of the debug toolbar have become rather difficult to use.

Oct 15 2018, 2:53 PM · MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Patch-For-Review, MediaWiki-Cache, Performance-Team

Oct 13 2018

aaron added a comment to T202149: Exception thrown for failure to save settings appears ~ 1000 times/day.

I still see 100-200 per 3 hour interval.

Oct 13 2018, 1:30 AM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), BetaFeatures, Wikimedia-production-error, MediaWiki-Authentication-and-authorization

Oct 12 2018

aaron added a comment to T205369: Investigate > 40% Save Timing regression (2018-09-05).

Looking at https://performance.wikimedia.org/xhgui/run/view?id=5bbfdc7c3f3dfaea44b5847c after a null edit on https://en.wikipedia.org/wiki/1857_in_Sweden I see MediaWiki\Revision\RenderedRevision::getSlotParserOutputUncached being hit 4 times even though normal pages have only 1 slot...

Oct 12 2018, 7:54 PM · Core Platform Team Kanban (Done with CPT), Performance-Team-notice, Core Platform Team (MCR), Multi-Content-Revisions (Reactive), Performance-Team

Oct 11 2018

aaron added a comment to T205369: Investigate > 40% Save Timing regression (2018-09-05).

Looking at https://performance.wikimedia.org/xhgui/run/view?id=5bbfdc7c3f3dfaea44b5847c after a null edit on https://en.wikipedia.org/wiki/1857_in_Sweden I see MediaWiki\Revision\RenderedRevision::getSlotParserOutputUncached being hit 4 times even though normal pages have only 1 slot...

Oct 11 2018, 11:31 PM · Core Platform Team Kanban (Done with CPT), Performance-Team-notice, Core Platform Team (MCR), Multi-Content-Revisions (Reactive), Performance-Team
aaron moved T193565: Foreign query for metawiki fails with "Table 'centralauth.page' doesn't exist" (DBConnRef mixup?) from Inbox to Doing on the Performance-Team board.
Oct 11 2018, 8:16 PM · Core Platform Team Kanban (Done with CPT), MW-1.32-notes, MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), Patch-For-Review, Performance-Team, Core Platform Team (Security, stability, performance and scalability (TEC1)), Wikimedia-production-error, MediaWiki-Database
aaron moved T206475: Users are unable to edit (add new topics to) [[:fa:wikipedia:پرسش‌های متفرقه]] from Inbox to Radar on the Performance-Team board.
Oct 11 2018, 8:15 PM · Growth-Team (Current Sprint), MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Performance-Team (Radar), MediaWiki-ResourceLoader, Patch-For-Review, StructuredDiscussions

Oct 10 2018

aaron added a comment to T205369: Investigate > 40% Save Timing regression (2018-09-05).

The alerts fired but we didn't act on them and then 7 days later it got back to the new "normal":

I think we can change or add another alert where we alert on a hard limit instead (as we have done on some other dashboards). In the coming Grafana 5.3.0 the alerts will have reminders.

Oct 10 2018, 7:09 PM · Core Platform Team Kanban (Done with CPT), Performance-Team-notice, Core Platform Team (MCR), Multi-Content-Revisions (Reactive), Performance-Team
aaron added a comment to T206580: Use object stash for persisting last-use proprety to control curation toolbar display.

@aaron - Any thoughts about the course of action suggested above in T206580#4655234?

Oct 10 2018, 6:38 PM · MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), Patch-For-Review, Growth-Team (Current Sprint)

Oct 9 2018

aaron updated subscribers of T205369: Investigate > 40% Save Timing regression (2018-09-05).
Oct 9 2018, 11:14 PM · Core Platform Team Kanban (Done with CPT), Performance-Team-notice, Core Platform Team (MCR), Multi-Content-Revisions (Reactive), Performance-Team
aaron added a comment to T204423: MW 1.31 install reports "InvalidArgumentException ... DatabaseDomain.php: Domain has too few or too many parts ".

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/465300/ also mentions the ID in the message.

Oct 9 2018, 12:39 AM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database, MediaWiki-Installer
aaron added a comment to T204423: MW 1.31 install reports "InvalidArgumentException ... DatabaseDomain.php: Domain has too few or too many parts ".

Does this occur in master? I more so wonder if https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/452878/ happens to help.

Oct 9 2018, 12:36 AM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database, MediaWiki-Installer

Oct 5 2018

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

The throttling mechanism that memcache offers is good of course, but maybe -R 20 is not optimal?

Oct 5 2018, 10:37 AM · Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations

Oct 4 2018

aaron added a comment to T120140: Lock wait timeout exceeded (WikiPage::insertRedirectEntry) when editing a self-redirecting template.

Are these jobs that try to also move user subpages?

Oct 4 2018, 11:27 PM · Wikimedia-production-error, MediaWiki-Database, MediaWiki-Templates

Oct 3 2018

aaron claimed T193271: Refactor MessageCache to deal with NS_MEDIAWIKI pages that aren't standard interface messages.
Oct 3 2018, 8:53 PM · MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Patch-For-Review, MediaWiki-Cache, Performance-Team
aaron added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

CAS errors on user might also help pinpoint some causes.

Oct 3 2018, 12:09 AM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
aaron added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

Yes and yes. I think if COMMIT takes a few seconds, then even with this UPDATE near the transaction end, multiple writes can still pile up if enough tabs are opened or other things locking user rows are going on.

Oct 3 2018, 12:08 AM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database

Oct 2 2018

aaron added a comment to T204174: FileOperation error "SwiftFileBackend::addMissingMetadata: {path} was not stored with SHA-1 metadata.".

If you go that route, then something like this might work: have getFileSha1() and the sha1 field be null for certain containers, and have doOperations and friends pass a flag to getFileStat/Sha1 to keep the current behavior of lazy-loading and not using null.

Oct 2 2018, 11:50 PM · MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), Patch-For-Review, Performance-Team, Thumbor, MediaWiki-File-management, Wikimedia-production-error
aaron added a comment to T204174: FileOperation error "SwiftFileBackend::addMissingMetadata: {path} was not stored with SHA-1 metadata.".

Is it that much space? If you add an option, you have to have getFileStat return some dummy value for the SHA1 and also not have that mess up the logic in doOperations(), which is why it seemed easier to just include the header.

Oct 2 2018, 11:47 PM · MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), Patch-For-Review, Performance-Team, Thumbor, MediaWiki-File-management, Wikimedia-production-error
aaron added a comment to T204346: PHP-timed out requests also emit LoadBalancer::destruct error "you can't run this command now: COMMIT".

Why does Database::close() ever try to commit anyway?

Oct 2 2018, 9:05 PM · Performance-Team, Wikimedia-production-error, MediaWiki-Database
aaron closed T189999: Enforce database transaction rollback conventions to protect against certain try/catch patterns as Resolved.
Oct 2 2018, 8:59 PM · Performance-Team, TechCom, MediaWiki-Database
aaron updated the task description for T189999: Enforce database transaction rollback conventions to protect against certain try/catch patterns.
Oct 2 2018, 8:59 PM · Performance-Team, TechCom, MediaWiki-Database
aaron added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

I don't see many of these in the logs for the last 7 days. This is likely caused by editing in parallel (multiple rollback tabs at once).

Oct 2 2018, 5:33 PM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
aaron added a comment to T148603: Limit the Quality version of the flagged revision in Arabic Wikipedia to ns=0.

It looks like there is no way to say "Level 2 (reviewed for quality) is not allowed as a tag on pages outside namespace 0". Right now, I suppose it is just convention that reviewers only mark template revisions as level 1 (basic review). If $wgFlaggedRevsTags included a 'namespaces' field with (NS => level) as the value (defaulting to all of $wgFlaggedRevsNamespaces at the highest level, which is the status quo), then this could be configured.
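
To make the proposed lookup concrete, here is a rough sketch in Python rather than PHP. The 'namespaces' field is the hypothetical addition suggested above and does not exist in the extension; `max_allowed_level` and the shape of `tag_config` are likewise assumptions made for illustration only.

```python
def max_allowed_level(tag_config, reviewable_namespaces, namespace):
    """Return the highest review level permitted in a namespace.

    tag_config is assumed to look like {"levels": 2, "namespaces": {10: 1}},
    where "namespaces" (hypothetical) maps namespace ID to a maximum level.
    """
    if namespace not in reviewable_namespaces:
        return 0  # namespace is not reviewable at all
    per_namespace = tag_config.get("namespaces")
    if per_namespace is None:
        # Status quo: every reviewable namespace allows the highest level.
        return tag_config["levels"]
    # Namespaces not listed explicitly also default to the highest level.
    return per_namespace.get(namespace, tag_config["levels"])
```

For example, with `{"levels": 2, "namespaces": {10: 1}}` and reviewable namespaces `[0, 10]`, namespace 0 would allow level 2 while namespace 10 (templates) would be capped at level 1.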

Oct 2 2018, 3:41 AM · Wikimedia-Site-requests

Oct 1 2018

aaron added a comment to T204174: FileOperation error "SwiftFileBackend::addMissingMetadata: {path} was not stored with SHA-1 metadata.".

It's used for originals. I don't think it matters much for thumbnails, but it's hard to cleanly tell that to SwiftFileBackend. It seems like it might be easiest to have thumbor hash the local file and save the metadata in the PUT request to avoid these errors (and slowness of triggering a GET to POST the missing data).

Oct 1 2018, 9:34 PM · MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), Patch-For-Review, Performance-Team, Thumbor, MediaWiki-File-management, Wikimedia-production-error
aaron added a comment to T202553: Database->insertSelect() generates invalid SQL when * is passed as $conds.

I think callers should use '' rather than '*'; the latter doesn't make much sense to me. That said, we already started the pattern, so it may as well work here too.

Oct 1 2018, 9:22 PM · MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Patch-For-Review, Performance-Team, MediaWiki-Database
aaron closed T172941: Track duplicate parses on page save as Declined.

Closing per comments on patch about MCR refactor.

Oct 1 2018, 8:52 PM · Patch-For-Review, MediaWiki-Page-editing, MediaWiki-Parser, Performance-Team
aaron claimed T205369: Investigate > 40% Save Timing regression (2018-09-05).
Oct 1 2018, 8:19 PM · Core Platform Team Kanban (Done with CPT), Performance-Team-notice, Core Platform Team (MCR), Multi-Content-Revisions (Reactive), Performance-Team
aaron added a comment to T205893: Automatically trigger waitForReplication after a sufficiently high number of rows has been written.

Sometimes callers might want I/O (swift/elastics/blazegraph) near DB I/O transactions, so even if we use setTransactionListener() (like Maintenance) and listen for points where no trx is active anywhere (kind of like DeferredUpdates), we'd want to be careful about waiting too long for lag or erroring out. Then again, mixed source-IO code should generally follow guidelines (https://www.mediawiki.org/wiki/Database_transactions#Updating_secondary_non-RDBMS_stores) and use patterns like doing the key/value writes first and committing or using commit hooks/deferred updates. So...maybe a callback could listen to setTransactionListener(), it could be given the affected row count, and a deferred MergeableUpdate could be added to DeferredUpdates when the count has recently been high across DBs (using pass-by-ref listener callback vars for last-time and running-count or something). The update could wait for replication, and would do so after any related I/O updates that relate to the DB writes.
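
The callback idea in this comment can be sketched roughly as follows, in Python rather than MediaWiki's PHP. Every name here (`RowCountWatcher`, `DeferredQueue`, `ReplicationWaitUpdate`, the threshold and window values) is hypothetical, and the toy queue only imitates the merge behaviour of a mergeable deferred update. The running count resets after a quiet period rather than using a true sliding window.

```python
import time


class ReplicationWaitUpdate:
    """Hypothetical deferred update that waits for replicas to catch up."""

    def __init__(self, wait_for_replication):
        self.wait_for_replication = wait_for_replication

    def run(self):
        self.wait_for_replication()


class DeferredQueue:
    """Toy deferred-update queue: at most one pending update per class."""

    def __init__(self):
        self._updates = {}

    def add_mergeable(self, update):
        # A second update of the same class merges into the first one.
        self._updates.setdefault(type(update), update)

    def run_all(self):
        updates, self._updates = list(self._updates.values()), {}
        for update in updates:
            update.run()


class RowCountWatcher:
    """Listener called with the affected row count when no trx is active."""

    def __init__(self, queue, wait_for_replication,
                 threshold=1000, window=10.0, clock=time.monotonic):
        self.queue = queue
        self.wait_for_replication = wait_for_replication
        self.threshold = threshold
        self.window = window
        self.clock = clock
        self.running_count = 0
        self.last_time = clock()

    def on_transactions_idle(self, affected_rows):
        now = self.clock()
        if now - self.last_time > self.window:
            self.running_count = 0  # writes were quiet; start counting afresh
        self.last_time = now
        self.running_count += affected_rows
        if self.running_count >= self.threshold:
            self.queue.add_mergeable(
                ReplicationWaitUpdate(self.wait_for_replication))
            self.running_count = 0
```

Because the wait is queued rather than run inline, it would happen after other deferred work tied to the same writes, which is the ordering the comment asks for.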

Oct 1 2018, 7:35 PM · Wikimedia-Incident, MediaWiki-Database

Sep 2 2018

aaron closed T189702: Replace transcache table with objectcache backend as Resolved.
Sep 2 2018, 9:26 PM · MW-1.32-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), MediaWiki-Database, Patch-For-Review, Core-Platform-Team-Old, Performance-Team, MediaWiki-Templates

Aug 31 2018

aaron added a comment to T196378: Investigate solutions for MySQL connection pooling.

The main blocker right now is to decide on a tunneling technology, as most seem to have issues.

Aug 31 2018, 8:00 PM · DBA, Availability (MediaWiki-MultiDC), Performance-Team (Radar), Operations
aaron added a comment to T202910: add performance team members to webserver_misc_static servers to maintain sitemaps.

perf-roots seems appropriate. If anything extra is needed, that can always be discussed in the future (probably by adding to perf-roots).

Aug 31 2018, 8:21 AM · Patch-For-Review, Performance-Team (Radar), SRE-Access-Requests, Operations

Aug 29 2018

aaron added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

All calls to incEditCountImmediate currently move it to the end of the transaction. According to logstash << +channel:DBPerformance +"user_editcount=user_editcount+N" +"sub-optimal" >> it seems to usually be very fast, though I see occasional entries a little over 1 second. I suppose in that case, a fast enough edit rate by a single user could make a pile-up. I wonder if the delay comes from COMMIT itself?

Aug 29 2018, 9:14 PM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
aaron renamed T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate) from Lock wait timeout exceeded when doing fast edits due to articule edit count locking to Lock wait timeout exceeded when doing fast edits due to article edit count locking.
Aug 29 2018, 8:56 PM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
aaron added a commit to T32956: Make ResourceLoader a standalone library: rMWdc3fc6cf81ec: resourceloader: Audit use of JSON encoding and use json_encode directly.
Aug 29 2018, 1:34 AM · Performance-Team, Librarization, MediaWiki-ResourceLoader

Aug 28 2018

aaron added a comment to T170596: Could not acquire lock 'LinksUpdate:job:pageid:xxx'.

https://en.wikipedia.org/wiki/User:Sam_Sailor/CSD_log seems to be an offending page (many links, possible parallel updates).

Aug 28 2018, 6:54 PM · MW-1.32-notes (WMF-deploy-2018-08-28 (1.32.0-wmf.19)), Performance-Team, MediaWiki-JobQueue, Wikimedia-production-error
aaron closed T202650: Please add aaron to perf-team as Resolved.

Confirmed.

Aug 28 2018, 6:33 PM · Patch-For-Review, Operations, SRE-Access-Requests
aaron closed T202650: Please add aaron to perf-team, a subtask of T202648: Please add everyone on the performance team to perf-roots, as Resolved.
Aug 28 2018, 6:33 PM · SRE-Access-Requests, Operations
aaron added a comment to T170596: Could not acquire lock 'LinksUpdate:job:pageid:xxx'.

Aside from using a narrower exception type and catching it, it's probably even easier to make acquirePageLock() return a boolean and log the error to a channel (possibly INFO level). The page_id should be extra logstash metadata, to make grouping easier. I suspect certain pages (like Commonist gallery subpages or such) are more likely to be offenders than others.
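
As a rough illustration of the "return a boolean and log" approach, here is a Python sketch. It is not the MediaWiki implementation: the in-process lock table stands in for a named database lock, and the logger channel name and the `acquire_page_lock` / `release_page_lock` helpers are assumptions made for the example.

```python
import logging
import threading

logger = logging.getLogger("LinksUpdate")  # hypothetical channel name

_locks = {}  # page_id -> threading.Lock; toy stand-in for a named DB lock
_locks_guard = threading.Lock()


def acquire_page_lock(page_id):
    """Try to take the per-page lock; return False instead of raising."""
    with _locks_guard:
        lock = _locks.setdefault(page_id, threading.Lock())
    if lock.acquire(blocking=False):
        return True
    # Attach page_id as structured metadata so log entries group by page.
    logger.info("Could not acquire LinksUpdate lock",
                extra={"page_id": page_id})
    return False


def release_page_lock(page_id):
    """Release a lock previously taken with acquire_page_lock()."""
    _locks[page_id].release()
```

A caller would check the return value and skip or retry the update, rather than wrapping the call in a try/catch for a broad exception type.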

Aug 28 2018, 5:47 PM · MW-1.32-notes (WMF-deploy-2018-08-28 (1.32.0-wmf.19)), Performance-Team, MediaWiki-JobQueue, Wikimedia-production-error

Aug 27 2018

aaron added a comment to T201240: Transaction timeout for LinksUpdate::updateLinksTimestamp (SET page_links_updated) .

I can't seem to reproduce this slowness (using mwdebug1002).

Aug 27 2018, 8:04 PM · Performance-Team, Core-Platform-Team-Old, Regression, Wikimedia-production-error, MediaWiki-Page-editing
Nemo_bis awarded T189702: Replace transcache table with objectcache backend a Doubloon token.
Aug 27 2018, 7:35 PM · MW-1.32-notes (WMF-deploy-2018-09-04 (1.32.0-wmf.20)), MediaWiki-Database, Patch-For-Review, Core-Platform-Team-Old, Performance-Team, MediaWiki-Templates

Aug 23 2018

aaron added a comment to T164382: Evaluate the need for FORCE INDEX (ls_field_val) [now IGNORE INDEX (ls_log_id)], delete the index hint if not needed anymore.

I don't recall. It's been long enough that it's worth testing how queries run without it.

Aug 23 2018, 7:36 AM · MediaWiki-Logging, DBA

Aug 21 2018

aaron closed T198239: Rollout use of mcrouter for MediaWiki in production as Resolved.
Aug 21 2018, 8:02 AM · MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Patch-For-Review, Availability (MediaWiki-MultiDC), Performance-Team
aaron closed T198239: Rollout use of mcrouter for MediaWiki in production, a subtask of T192370: Deploy mcrouter to production as a wancache backend, as Resolved.
Aug 21 2018, 8:02 AM · Patch-For-Review, Performance-Team (Radar), Availability (MediaWiki-MultiDC), Operations

Aug 18 2018

aaron updated the task description for T198239: Rollout use of mcrouter for MediaWiki in production.
Aug 18 2018, 5:51 AM · MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Patch-For-Review, Availability (MediaWiki-MultiDC), Performance-Team

Aug 14 2018

aaron closed T118893: Consider using APC for the individually cached keys (e.g. 'TOO BIG') in MessageCache as Resolved.
Aug 14 2018, 6:16 PM · MW-1.32-notes (WMF-deploy-2018-07-31 (1.32.0-wmf.15)), Patch-For-Review, Performance-Team, MediaWiki-Cache

Aug 13 2018

aaron added a comment to T185724: Publish Doxygen for RunningStat library.

Where are the jenkins jobs defined?

Aug 13 2018, 8:23 PM · Librarization, Performance-Team, RunningStat, Continuous-Integration-Config