aaron (Aaron Schulz)
User

User Details

User Since
Oct 20 2014, 5:25 PM (225 w, 5 d)
Availability
Available
IRC Nick
AaronSchulz
LDAP User
Aaron Schulz
MediaWiki User
Aaron Schulz [ Global Accounts ]

Recent Activity

Yesterday

aaron added a comment to T211661: Automatically clean up unused thumbnails in Swift.

It seems that Swift has built-in support for object expiration, which can be requested by setting a header (either X-Delete-After or X-Delete-At).
It also looks like the expiry can be re-set, either by first removing it via X-Remove-Delete-At, and then setting it anew, or by updating the metadata in-place.
Is there a reason why these mechanisms are not under consideration?

A straw-man proposal:

  • On thumbnail creation, set X-Delete-After: 2592000 (one month).
  • Each time a thumbnail is retrieved, there's a 0.01% chance that we also use the opportunity to reset the expiry to one month.
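
A minimal sketch of the straw-man above, assuming the caller sends the returned headers to Swift itself (the creation headers on the PUT, the refresh headers on a metadata POST). The function names are hypothetical. Only the header name and the two numbers come from the proposal.

```python
import random

ONE_MONTH_SECONDS = 30 * 24 * 60 * 60  # 2592000, the value in the proposal
REFRESH_PROBABILITY = 0.0001           # 0.01% of reads


def creation_headers(ttl=ONE_MONTH_SECONDS):
    """Headers to send with the PUT that stores a new thumbnail."""
    return {"X-Delete-After": str(ttl)}


def refresh_headers_on_read(rng=random, ttl=ONE_MONTH_SECONDS,
                            probability=REFRESH_PROBABILITY):
    """Return headers that reset the expiry, or None for most reads.

    Only a small random fraction of reads pays for the extra request.
    """
    if rng.random() < probability:
        return {"X-Delete-After": str(ttl)}
    return None
```

With these numbers a thumbnail would need on the order of 10,000 reads per month before a refresh becomes likely, so rarely read thumbnails would still expire.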
Sat, Feb 16, 8:49 AM · Patch-For-Review, Traffic, media-storage, Performance-Team, Operations
aaron added a comment to T146257: Create objectcache/BagOStuff library.

Nope. What do you want to call it? ObjectCache or BagOStuff?

Sat, Feb 16, 7:40 AM · Patch-For-Review, User-Addshore, MediaWiki-Cache, Librarization
aaron added a comment to T146257: Create objectcache/BagOStuff library.

So is anything other than EventRelayer blocking this?

Sat, Feb 16, 7:02 AM · Patch-For-Review, User-Addshore, MediaWiki-Cache, Librarization

Fri, Feb 15

aaron added a comment to T151903: Special:Search performs DB writes on GET request.

I'd prefer we didn't tack preferences and bits of data like this onto sessions (we used to many years ago and that was lame), especially given the cross-DC latency. I worry people will go overboard with such features that cause slow writes.

Fri, Feb 15, 7:04 AM · Availability (MediaWiki-MultiDC), Discovery-Search, CirrusSearch, Discovery
aaron added a comment to T215611: MediaWiki errors overloading logstash.

I think Timo backported the change, so it should be live:

Fri, Feb 15, 1:48 AM · MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Patch-For-Review, Core Platform Team Kanban (Blocked Externally), Core Platform Team (Security, stability, performance and scalability (TEC1)), Performance-Team, Wikimedia-production-error, Wikimedia-Logstash, Operations, MediaWiki-Database, monitoring
aaron added a comment to T151903: Special:Search performs DB writes on GET request.

Wouldn't it be feasible to have the search request generate a simple job that saves that preference asynchronously?

The jobqueue is already capable of listening to the events in both DCs and executing them in the primary one only, so most things that can happen post-send can probably be deferred to the jobqueue.

A generic preference-setting job would be useful for this and similar cases.

A delay in making the new preference available might be unintuitive for users. In particular, if they perform another search a few seconds after the one where they saved, they will expect the new setting to load. I'm not entirely clear on things, but I think ChronologyProtector ensures this with the current setup, whereas if the preference is pushed into a job there would be no guarantee? It could also be that simple UI affordances to keep everything selected will negate any concern about reading the preference.

Fri, Feb 15, 1:23 AM · Availability (MediaWiki-MultiDC), Discovery-Search, CirrusSearch, Discovery
aaron added a comment to T215850: LoadBalancer.php uses $wgDBname even though $wgDBservers is defined.

$wgDBname and $wgDBprefix are used by methods like wfWikiId() and "DB domain" methods from WikiMap. The former is a very old method. They are assumed to contain the DB name/prefix of the current wiki (wiki farms will set this depending on the vhost or URL or something).

Fri, Feb 15, 1:02 AM · MediaWiki-Database

Thu, Feb 14

aaron closed T215566: MediaWiki 1.32.0 Installation Error: "Could not select database" as Resolved.
Thu, Feb 14, 8:23 PM · MW-1.32-notes, MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), MediaWiki-Database, MW-1.32-release, MediaWiki-Installer
aaron added a comment to T215611: MediaWiki errors overloading logstash.
Thu, Feb 14, 8:20 PM · MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Patch-For-Review, Core Platform Team Kanban (Blocked Externally), Core Platform Team (Security, stability, performance and scalability (TEC1)), Performance-Team, Wikimedia-production-error, Wikimedia-Logstash, Operations, MediaWiki-Database, monitoring
aaron added a comment to T151903: Special:Search performs DB writes on GET request.

Wouldn't it be feasible to have the search request generate a simple job that saves that preference asynchronously?

The jobqueue is already capable of listening to the events in both DCs and executing them in the primary one only, so most things that can happen post-send can probably be deferred to the jobqueue.

Thu, Feb 14, 7:05 PM · Availability (MediaWiki-MultiDC), Discovery-Search, CirrusSearch, Discovery
aaron added a comment to T215611: MediaWiki errors overloading logstash.

^I don't have enough context for this patch. Is the configuration for regular servers set up to not produce DEBUG lines outside? Otherwise, I don't understand how the patch can produce the desired output.

Thu, Feb 14, 5:38 PM · MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Patch-For-Review, Core Platform Team Kanban (Blocked Externally), Core Platform Team (Security, stability, performance and scalability (TEC1)), Performance-Team, Wikimedia-production-error, Wikimedia-Logstash, Operations, MediaWiki-Database, monitoring

Wed, Feb 13

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

@aaron Any more ideas about what could be tuned to make this work? I am also wondering if https://gerrit.wikimedia.org/r/487622 should work out of the box or if a change in the translation extension is needed to make the lockTSE logic work?

Wed, Feb 13, 6:48 PM · MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations
aaron closed T111877: incorporate master/slave datacenter guidelines into developer docs as Resolved.

I updated the "Database transactions" and "Performance guidelines" pages, which I'd hope most people that work on MediaWiki will encounter when looking for generic guidelines. Things can be better linked, though that's a broader matter...

Wed, Feb 13, 1:18 AM · Performance-Team, MediaWiki-General-or-Unknown, Documentation
aaron closed T111877: incorporate master/slave datacenter guidelines into developer docs, a subtask of T88666: RFC: Master/slave datacenter strategy for MediaWiki, as Resolved.
Wed, Feb 13, 1:18 AM · Performance-Team, TechCom-RFC (TechCom-Approved), Availability (MediaWiki-MultiDC)

Mon, Feb 11

aaron moved T111877: incorporate master/slave datacenter guidelines into developer docs from Backlog: Small & Maintenance to Current Quarter Goals on the Performance-Team board.
Mon, Feb 11, 8:16 PM · Performance-Team, MediaWiki-General-or-Unknown, Documentation

Sun, Feb 10

aaron added a comment to T215740: Create Icinga check for ArcLamp (xenon-log) service health.

I recall something like this when testing the old python-memcached-relay daemon experiment (before we decided on mcrouter instead). That used non-blocking checks and conditional sleep (polling), but since there is only one server here, the message() timeout could work. Something like https://github.com/andymccurdy/redis-py/issues/631 with try/catch for redis.TimeoutError and resubscribe logic should be doable.
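
A rough sketch of the timeout-and-resubscribe loop described above. It is written against a duck-typed subscription object so it runs without a Redis server. The `consume` function, its `connect` factory, and its parameters are hypothetical names; with redis-py, the assumption is that `connect` would return a subscribed PubSub object and `retry_on` would be something like `(redis.TimeoutError, redis.ConnectionError)`.

```python
import time


def consume(connect, handle, retry_on, poll_timeout=5.0,
            max_reconnects=3, backoff=0.0):
    """Poll a pub/sub subscription, resubscribing after errors.

    connect  -- hypothetical factory returning an object that offers
                get_message(timeout=...), such as a subscribed PubSub
    handle   -- callback for each message; return False to stop
    retry_on -- tuple of exception classes that trigger a resubscribe
    Returns the number of resubscribes that were needed.
    """
    reconnects = 0
    subscription = connect()
    while True:
        try:
            message = subscription.get_message(timeout=poll_timeout)
        except retry_on:
            reconnects += 1
            if reconnects > max_reconnects:
                raise  # give up so a health check can notice the failure
            time.sleep(backoff)
            subscription = connect()  # build a fresh subscription
            continue
        if message is None:
            continue  # nothing arrived within poll_timeout; poll again
        if handle(message) is False:
            return reconnects
```

Re-raising after `max_reconnects` is a deliberate choice in this sketch: a daemon that exits loudly is easier to monitor than one that retries forever.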

Sun, Feb 10, 10:48 PM · Wikimedia-Incident, Performance-Team

Thu, Feb 7

aaron committed rMSCA96e4666bdf8c: Convert mwversionsinuse to pure python (authored by bd808).
Convert mwversionsinuse to pure python
Thu, Feb 7, 12:10 PM
aaron committed rMSCA818684363996: Compile wikiversions.json to cdb for local sync (authored by bd808).
Compile wikiversions.json to cdb for local sync
Thu, Feb 7, 12:10 PM

Wed, Feb 6

aaron closed T207247: WANObjectCacheTest::testGetWithSeveralCheckKeys is flaky as Resolved.
Wed, Feb 6, 7:10 PM · MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, Performance-Team, Wikimedia-production-error (Shared Build Failure), MediaWiki-Cache

Mon, Feb 4

aaron moved T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate) from Doing to Backlog: Small & Maintenance on the Performance-Team board.
Mon, Feb 4, 9:24 PM · Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error

Sun, Feb 3

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

Also note that even the old LRU algorithm avoided bumping things in a slab more than once a minute (https://memcached.org/blog/modern-lru/), so that perhaps makes it more likely that some other keys are flooding out the slab, since being "hot" does not mean being at the top of the list so much (if there are other things like prepared edit parse blobs and parsoid serialization blobs coming in on every edit).

Sun, Feb 3, 1:54 AM · MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations

Wed, Jan 30

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

@aaron, @Nikerabbit - if you guys have time in the next few days, can we chat about the next steps for this task?

Wed, Jan 30, 8:08 AM · MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations

Tue, Jan 29

aaron added a comment to T207247: WANObjectCacheTest::testGetWithSeveralCheckKeys is flaky.

Is this still reproducible in master?

Tue, Jan 29, 1:28 AM · MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, Performance-Team, Wikimedia-production-error (Shared Build Failure), MediaWiki-Cache

Sat, Jan 26

elukey awarded T214275: Consider removing the last traces of nutcracker in Mediawiki configs a Like token.
Sat, Jan 26, 8:01 AM · Patch-For-Review, Performance-Team, User-Elukey, Operations, MediaWiki-Cache

Fri, Jan 25

aaron added a comment to T214275: Consider removing the last traces of nutcracker in Mediawiki configs.

Things needed here:

  • Use only mcrouter in deployment-prep (no multiwrite) from MW
  • Remove puppet code for deployment-prep
  • Install mcrouter on the memcached servers used by labswiki
  • Make MW use mcrouter on labswiki
  • Remove "memcached-pecl" cache entry from config
  • Remove labswiki nutcracker code from puppet
Fri, Jan 25, 10:11 PM · Patch-For-Review, Performance-Team, User-Elukey, Operations, MediaWiki-Cache
aaron added a comment to T214471: wdio browser tests fail locally due to session not being persisted before 2nd stage of login starts.

I don't think ChronologyProtector is involved.

Fri, Jan 25, 8:15 PM · MW-1.33-notes (1.33.0-wmf.17; 2019-02-12), Wikimedia-production-error (Shared Build Failure), Patch-For-Review, User-Addshore, MediaWiki-Authentication-and-authorization, Release-Engineering-Team (Kanban), MediaWiki-Core-Testing, User-zeljkofilipin

Wed, Jan 23

aaron added a comment to T172497: Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the real master without puppet dependency.

From a perspective of layered architecture and separation of concerns, I'm not sure I like the idea of a MediaWiki script. But some script that reads from etcd and does the updates to the table seems reasonable.

Wed, Jan 23, 8:41 PM · Core Platform Team Backlog (Watching / External), Performance-Team (Radar), MediaWiki-Database, Wikimedia-Incident, DBA
aaron added a comment to T212881: addWiki.php broken creating ES tables.

To be useful, the tables.sql and other bits need to also handle idempotence or have some script parameter to skip them though...
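
As an illustration of the idempotence point, here is a small sketch that uses SQLite as a stand-in for the real MySQL external storage setup. The helper name and the column definitions are assumptions made for the example and are not taken from the actual tables.sql.

```python
import sqlite3


def create_blob_table(conn, table):
    """Create a blob table if it is missing; safe to call repeatedly.

    `table` must come from trusted configuration, because identifiers
    cannot be bound as query parameters.
    """
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" '
        "(blob_id INTEGER PRIMARY KEY, blob_text BLOB NOT NULL)"
    )


conn = sqlite3.connect(":memory:")
create_blob_table(conn, "blobs_cluster24")
create_blob_table(conn, "blobs_cluster24")  # second run does not raise
```

A plain CREATE TABLE on the second call would fail with an "already exists" error, which is the failure mode a re-run of the script has to avoid.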

Wed, Jan 23, 7:49 AM · Patch-For-Review, Performance-Team, MediaWiki-extensions-WikimediaMaintenance
aaron added a comment to T212881: addWiki.php broken creating ES tables.

I see

Error: 1050 Table 'blobs_cluster24' already exists (10.64.32.184)
Wed, Jan 23, 7:45 AM · Patch-For-Review, Performance-Team, MediaWiki-extensions-WikimediaMaintenance

Fri, Jan 18

aaron added a comment to T212881: addWiki.php broken creating ES tables.

Did it finish?

Fri, Jan 18, 10:17 PM · Patch-For-Review, Performance-Team, MediaWiki-extensions-WikimediaMaintenance

Jan 16 2019

aaron added a comment to T111264: Decouple chronology protector from authentication .

Another option is using the existing 'ChronologyClientId' header that MediaWiki supports for acting on behalf of a client (e.g. no need to forward the agent/IP).

Jan 16 2019, 10:28 PM · Core Platform Team Backlog (Watching / External), Performance-Team (Radar), Services (watching), Parsoid, RESTBase, Availability

Jan 14 2019

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

@Nikerabbit looking forward to seeing it deployed, thanks to both you and Aaron for this work.

One question - would it be possible, as a long-term view, to see this key being broken down into smaller pieces to avoid a never-ending increase? I am a bit worried that we are now resolving this issue but leaving another one (a big key that can generate a lot of traffic) aside. It is fine if the answer is No, I am just wondering what's possible for the future :)

Jan 14 2019, 11:47 PM · MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations

Jan 12 2019

aaron added a comment to T213398: Move performance team dashboards to /performance/ folder in Grafana.
Jan 12 2019, 7:14 AM · Performance-Team

Jan 9 2019

aaron added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

I assume the 6 errors correlate to the 6 edits in the same minute, given that the update happens post-send (it is already biased towards later). As such, it would seem that 6 out of 6 timed out.

From that I would say the locking of the row happens earlier, either in the same piece of code for another user or, more likely, in a different piece of code that locks the same rows, and it just happens that the edit count update is frequent enough to be the only "sufferer".

We could setup some monitoring so that we identify what is locking the rows when that happens for debugging.

Jan 9 2019, 7:04 PM · Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error

Jan 7 2019

aaron added a comment to T197733: Error after install using CLI: "Cannot execute query from MessageCache::loadFromDB(en)-big while transaction status is ERROR.".

This was probably fixed alongside T207979.

Jan 7 2019, 5:59 PM · MW-1.32-release, MediaWiki-Cache, Performance-Team, MediaWiki-Installer
aaron added a comment to T166718: Make FlaggedRevs revision rating box responsive.

I don't recall an overriding reason it has to be there. Seems reasonable to experiment with changing it as long as the common desktop case (fuller window) looks the same.

Jan 7 2019, 5:39 PM · Mobile, MediaWiki-extensions-FlaggedRevs

Dec 21 2018

aaron added a comment to T194299: Lock wait timeout exceeded in SqlIdGenerator::generateNewId.

That sounds right.

Dec 21 2018, 6:10 PM · Wikidata-Campsite, User-Addshore, Wikimedia-production-error, MediaWiki-extensions-WikibaseRepository, Wikidata

Dec 20 2018

aaron closed T211631: Add info boxes to all save timing graphs on Grafana as Resolved.
Dec 20 2018, 11:13 PM · Performance-Team
aaron updated the task description for T211631: Add info boxes to all save timing graphs on Grafana.
Dec 20 2018, 11:13 PM · Performance-Team

Dec 19 2018

aaron added a comment to T93142: [Task] Look into Wikibase use of memcached to see what needs broadcasted purges.

I see EntityRevisionCache and CacheAwarePropertyInfoStore seem to use set() on invalidation. Also, EntityRevisionCache, CachingEntityRevisionLookup, and PopulateInterwiki seem to call delete() on a non-WAN cache instance.

Dec 19 2018, 1:18 AM · Performance, Wikidata
aaron updated the task description for T211631: Add info boxes to all save timing graphs on Grafana.
Dec 19 2018, 12:55 AM · Performance-Team
aaron added a comment to T212129: Use a multi-dc aware store for ObjectCache's MainStash if needed..

The current callers don't assume the level of durability as with mysql, just that the data will likely not be randomly removed (e.g. high eviction rate, power outage, network blips). The WAN cache callers can handle a fair amount of that on the other hand.

Dec 19 2018, 12:06 AM · User-mobrovac, Services (doing), User-jijiki, Core Platform Team Kanban (Doing), Core Platform Team (Security, stability, performance and scalability (TEC1)), Performance-Team (Radar), Operations, MediaWiki-Cache, serviceops

Dec 18 2018

aaron added a comment to T211721: Establish an SLA for session storage.

To add to what @Tgr found, we have to search for usage of MediaWikiServices::getInstance()->getMainObjectStash(); as that's what that method uses under the hood.

There are quite a few uses of it here:

https://codesearch.wmflabs.org/search/?q=getMainObjectStash&i=nope&files=&repos=

I would propose that whatever is not extremely valuable (so anything that's not in the session store) should be stored on memcached via mcrouter, so that we can have multi-dc writes and broadcast both writes and evictions if needed. We can't realistically support sub-millisecond latencies in the service we're designing and most uses of this cache are for caching purposes indeed.

@aaron do you think using mcrouter for MainObjectStash is feasible?

Dec 18 2018, 11:56 PM · Core Platform Team Backlog (Later), Performance-Team (Radar), TechCom, Services (next), Operations, User-Clarakosi, Core Platform Team (Session Management Service (CDP2)), User-Eevans
aaron updated the task description for T211631: Add info boxes to all save timing graphs on Grafana.
Dec 18 2018, 8:40 PM · Performance-Team
aaron added a comment to T212129: Use a multi-dc aware store for ObjectCache's MainStash if needed..

We need persistence and replication. The plan is to use the same store as session for the rest of the object stash usage (probably Cassandra). Flags like WRITE_SYNC might be used in a few callers, and should use appropriate backend requests (e.g. QUORUM_* settings in Cassandra). The callers of the main object stash all need persistence and replication though (callers have already been migrated to stash vs WAN cache and such).

Dec 18 2018, 8:22 PM · User-mobrovac, Services (doing), User-jijiki, Core Platform Team Kanban (Doing), Core Platform Team (Security, stability, performance and scalability (TEC1)), Performance-Team (Radar), Operations, MediaWiki-Cache, serviceops
aaron added a subtask for T88445: MediaWiki active/active datacenter investigation and work (tracking): T212129: Use a multi-dc aware store for ObjectCache's MainStash if needed..
Dec 18 2018, 8:19 PM · User-mobrovac, Services (designing), Core Platform Team (Security, stability, performance and scalability (TEC1)), Core Platform Team Backlog (Epic), Performance-Team (Radar), Availability (MediaWiki-MultiDC), Epic
aaron added a parent task for T212129: Use a multi-dc aware store for ObjectCache's MainStash if needed.: T88445: MediaWiki active/active datacenter investigation and work (tracking).
Dec 18 2018, 8:19 PM · User-mobrovac, Services (doing), User-jijiki, Core Platform Team Kanban (Doing), Core Platform Team (Security, stability, performance and scalability (TEC1)), Performance-Team (Radar), Operations, MediaWiki-Cache, serviceops

Dec 13 2018

aaron updated the task description for T211631: Add info boxes to all save timing graphs on Grafana.
Dec 13 2018, 2:50 PM · Performance-Team

Dec 11 2018

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

I'm not sure why the recache() calls would cause many CAS commands though. The only threads doing the regeneration (and CAS) would be those of requests doing updates...which should not be that frequent.

Dec 11 2018, 11:27 PM · MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations
aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

Today another mediawiki alert from ~12:24 to ~12:27 UTC. @Nikerabbit, @aaron - do you think that we can narrow down specific events (besides TTL expiring) that may cause this?

Actually, from TranslatePostInitGroups, I only see global variable and config file timestamp dependencies, so those are not actually user controlled purges. We don't use TranslateSVG, which calls delete().

Dec 11 2018, 8:12 PM · MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations
aaron closed T204174: FileOperation error "SwiftFileBackend::addMissingMetadata: {path} was not stored with SHA-1 metadata." as Resolved.
Dec 11 2018, 7:59 PM · MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), Patch-For-Review, Performance-Team, Thumbor, MediaWiki-File-management, Wikimedia-production-error
aaron moved T211631: Add info boxes to all save timing graphs on Grafana from Current Quarter Goals to Backlog: Small & Maintenance on the Performance-Team board.
Dec 11 2018, 7:05 PM · Performance-Team
aaron moved T211631: Add info boxes to all save timing graphs on Grafana from Backlog: Small & Maintenance to Current Quarter Goals on the Performance-Team board.
Dec 11 2018, 7:05 PM · Performance-Team
aaron added a comment to T210992: Increase parsercache keys TTL from 22 days back to 30 days.

@Krinkle can you confirm whether those are the two reverts we have to do?
I have been looking around and I haven't found anything else to revert
Thank you!

Dec 11 2018, 3:50 PM · Performance-Team (Radar), Patch-For-Review, Operations, DBA

Dec 10 2018

aaron created T211631: Add info boxes to all save timing graphs on Grafana.
Dec 10 2018, 9:00 PM · Performance-Team

Dec 8 2018

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

Today another mediawiki alert from ~12:24 to ~12:27 UTC. @Nikerabbit, @aaron - do you think that we can narrow down specific events (besides TTL expiring) that may cause this?

Dec 8 2018, 6:18 AM · MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations

Dec 2 2018

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

To clarify, the $useMutex logic in WAN cache never triggers due to minAsOf=INF, resulting in stampedes when someone invalidates the cache. Instead, this should be treated like a regular TTL expiration and have one thread at a time doing regeneration.
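
The "one thread at a time doing regeneration" idea can be sketched with an add()-based mutex over a toy in-memory store. This is not the WANObjectCache implementation: `TinyCache`, `get_with_mutex`, and the `is_fresh` callback are hypothetical names, and TTL handling is left out.

```python
class TinyCache:
    """In-memory stand-in for a memcached-like store (no TTL handling)."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value

    def add(self, key, value):
        """Store only if the key is absent; report whether we won."""
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def delete(self, key):
        self.data.pop(key, None)


def get_with_mutex(cache, key, regenerate, is_fresh):
    """Return a value, letting only the mutex holder regenerate it."""
    value = cache.get(key)
    if value is not None and is_fresh(value):
        return value
    if cache.add("mutex:" + key, 1):  # we won the regeneration lock
        try:
            value = regenerate()
            cache.set(key, value)
        finally:
            cache.delete("mutex:" + key)
        return value
    if value is not None:
        return value  # another caller is regenerating; serve the stale copy
    return regenerate()  # nothing to serve; compute without caching
```

The sketch only helps if invalidation leaves a stale copy behind to serve. When the key is simply gone, every caller that loses the lock falls through to the last line and regenerates, which is the stampede case.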

Dec 2 2018, 9:58 PM · MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations
aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

Is the value so big that every time it is SET it causes a TKO? Besides the 24h TTL, the cache value is updated when there are changes to the underlying data. In other words: when people create or remove translatable pages, aggregate message groups or other similar stuff.

Dec 2 2018, 10:33 AM · MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations

Nov 30 2018

Imarlier awarded T205369: Investigate > 40% Save Timing regression (2018-09-05) a Mountain of Wealth token.
Nov 30 2018, 7:06 PM · Core Platform Team Kanban (Done with CPT), Performance-Team-notice, Core Platform Team (MCR), Multi-Content-Revisions (Reactive), Performance-Team

Nov 26 2018

aaron closed T209483: Got connection to 'yuewiktionary', but expected local domain ('aawiktionary'). as Resolved.
Nov 26 2018, 5:58 PM · MW-1.32-notes, MW-1.33-notes (1.33.0-wmf.6; 2018-11-27), Patch-For-Review, MediaWiki-extensions-WikimediaMaintenance

Nov 22 2018

aaron added a comment to T209857: Create Autonomous Systems ranking based on RUM data.

@Gilles: Comcast only has cable infrastructure in terms of what the ISP provides itself. For customers with cable, they can also get XFinity Mobile (https://www.tomsguide.com/us/xfinity-mobile-faq,news-25223.html). That's basically just a bunch of Wi-Fi hotspots built off of Verizon. I don't know how many people are using that and it seems new-ish. Also, the latency figures are quite low, which makes me doubt that it is XFinity Mobile and more likely regular wireless/xfinity.

Nov 22 2018, 7:19 PM · MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), Performance-Team
aaron added a comment to T209857: Create Autonomous Systems ranking based on RUM data.

It looks sane, though I wonder why Comcast is so high in usage for mobile? Is that mostly from touchpad devices instead of smartphones?

Nov 22 2018, 8:22 AM · MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), Performance-Team

Nov 20 2018

aaron added a comment to T196378: Investigate solutions for MySQL connection pooling.
Nov 20 2018, 8:55 PM · DBA, Availability (MediaWiki-MultiDC), Performance-Team (Radar), Operations
aaron added a comment to T208934: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen.

It definitely seems like something worth doing. Having the potential for high use cache keys becoming unusable for undefined periods of time is too much of a stability concern.

Nov 20 2018, 8:04 PM · Performance-Team (Radar), User-Elukey, MediaWiki-Cache, Operations
aaron added a comment to T157651: sql.php runs LoadExtensionSchemaUpdates.

Indeed. The updater calls the LoadExtensionSchemaUpdates hook from the constructor (bleh) and Echo and SecurePoll use dropTable/modifyField (which are executed immediately) instead of dropExtensionTable/modifyExtensionField (which would just push an entry to the update list).

This is a very, very ugly accident waiting to happen. @aaron any preference how to prevent it? We could make getSchemaVars static, or split out the extension loading part from the constructor, or add some flag that prevents calling non-extension methods on the updater. Maybe even a unit test which calls the hook with a fake updater which fails the test if any non-extension methods are called on it.

Nov 20 2018, 7:53 PM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Performance-Team, MediaWiki-Database, MediaWiki-Maintenance-scripts, Beta-Cluster-reproducible
aaron triaged T204423: MW 1.31 install reports "InvalidArgumentException ... DatabaseDomain.php: Domain has too few or too many parts " as Low priority.
Nov 20 2018, 7:47 PM · Performance-Team, MW-1.31-release, MediaWiki-Database, MediaWiki-Installer

Nov 19 2018

aaron changed the status of T204423: MW 1.31 install reports "InvalidArgumentException ... DatabaseDomain.php: Domain has too few or too many parts " from Open to Stalled.
Nov 19 2018, 8:36 PM · Performance-Team, MW-1.31-release, MediaWiki-Database, MediaWiki-Installer

Nov 14 2018

aaron added a comment to T205369: Investigate > 40% Save Timing regression (2018-09-05).

Since CategoryMembershipChangeJob runs via the job queue, wouldn't that have little effect on save timing itself? I guess it wouldn't hurt to optimize.

Nov 14 2018, 10:48 PM · Core Platform Team Kanban (Done with CPT), Performance-Team-notice, Core Platform Team (MCR), Multi-Content-Revisions (Reactive), Performance-Team

Nov 9 2018

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

Keys are set by add/cas normally, so it seems like some key that takes a long time to regenerate might have expired (there are two data points at the elevated value over more than just a few seconds) or a class of many keys expired. The other possibility is some sudden change in access patterns for keys, which seems less likely, especially the more periodic this is.

Checking keys via tcpdump is not super easy, so I tried to order the packets by size, inspecting the keys with bigger size and lower TTL. I keep seeing this pattern of SETs for metawiki:translate-groups (TTL 2 hours) followed by gets during the timeframe in which timeouts happen, which, given the size (~380k), could surely aggravate the bandwidth problem that we are seeing. Would it be possible to increase this TTL to, say, 24h to see if anything changes? Or possibly migrate the code to something more like gadget-definition, namely setting the new key only if really needed and not every two hours by default. It would help remove a (big) variable from the problem.

To answer the point brought up by @Nikerabbit about the code not having changed in ages - I suspect that in the past nutcracker might have masked problems like this one (see T208934), and that the change to mcrouter caused more errors to be raised due to bandwidth/latency variations. I am probably not right about the metawiki:translate-groups key as culprit, but I'd just need to see if removing a big key makes any difference in what we see in the graphs, of course if this is not going to affect users in any way.

Nov 9 2018, 2:15 AM · MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations

Nov 8 2018

aaron added a comment to T208487: [RfC] Add CURRENT_TIMESTAMP support for `wl_notificationtimestamps` in watchlist table.

wl_notificationtimestamp is not meant to store the time the article was watched but the last revision the user saw on the page (NULL if they saw the latest revision). This would require a new column. Ideally, if watchlist sizes were limited, this wouldn't need an index, but they are not.

Nov 8 2018, 9:53 PM · User-D3r1ck01, TechCom-RFC, Growth-Team, MediaWiki-Database
aaron placed T208487: [RfC] Add CURRENT_TIMESTAMP support for `wl_notificationtimestamps` in watchlist table up for grabs.
Nov 8 2018, 7:19 AM · User-D3r1ck01, TechCom-RFC, Growth-Team, MediaWiki-Database

Nov 7 2018

aaron added a comment to T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.

Keys are set by add/cas normally, so it seems like some key that takes a long time to regenerate might have expired (there are two data points at the elevated value over more than just a few seconds) or a class of many keys expired. The other possibility is some sudden change in access patterns for keys, which seems less likely, especially the more periodic this is.

Nov 7 2018, 7:19 PM · MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations

Nov 6 2018

aaron updated subscribers of T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions.
Nov 6 2018, 10:24 AM · MW-1.33-notes (1.33.0-wmf.16; 2019-02-05), Patch-For-Review, Performance-Team (Radar), Wikimedia-production-error, User-Elukey, MediaWiki-Cache, Operations

Nov 5 2018

aaron added a comment to T204346: PHP-timed out requests also emit LoadBalancer::destruct error "you can't run this command now: COMMIT".

Fixed in bf30fcb71427d673f7c83a067b3241040d3470b6. Rollback is used instead and uses $ignoreErrors so as not to trigger the exception in reportQueryError().

Nov 5 2018, 11:15 PM · Performance-Team, Wikimedia-production-error, MediaWiki-Database
aaron closed T39159: sqlite: DatabaseBase::delete and DatabaseBase::update return ResultWrapper object as Resolved.

Cleaned up in 633eb437a3b808518469c6eaf4e86a436941d837

Nov 5 2018, 10:10 PM · Performance-Team, good first bug, SQLite, MediaWiki-Database

Nov 2 2018

aaron added a comment to T194299: Lock wait timeout exceeded in SqlIdGenerator::generateNewId.

openConnection is badly named and still reuses connections. You'd probably want getConnection with CONN_TRX_AUTO

I hate this hack. This may *still* re-use connections, if anything else used CONN_TRX_AUTO. We should have CONN_NEW.

Nov 2 2018, 6:28 AM · Wikidata-Campsite, User-Addshore, Wikimedia-production-error, MediaWiki-extensions-WikibaseRepository, Wikidata

Nov 1 2018

aaron added a comment to T194299: Lock wait timeout exceeded in SqlIdGenerator::generateNewId.

openConnection is badly named and still reuses connections. You'd probably want getConnection with CONN_TRX_AUTO

Nov 1 2018, 10:23 PM · Wikidata-Campsite, User-Addshore, Wikimedia-production-error, MediaWiki-extensions-WikibaseRepository, Wikidata
aaron closed T203925: Save times for changes to translation variable text in centralnotice paralysingly slow as Resolved.
Nov 1 2018, 7:52 PM · Core Platform Team Kanban (Done with CPT), Performance-Team-notice, Fundraising Sprint Vestigial tails shoot from the hip, Fundraising Sprint USB stands for underhanded socket bureaucracy, Fundraising Sprint They Live, Core Platform Team (Security, stability, performance and scalability (TEC1)), Performance-Team, Fundraising Sprint Sasquatches can't find us either, Language-Team, Fundraising Sprint Raw data can give you salmonella, MediaWiki-extensions-Translate, Fundraising-Backlog, MediaWiki-extensions-CentralNotice

Oct 29 2018

aaron added a comment to T206341: Evaluate scalability and performance of PHP7 compared to HHVM.

What about our use of register_postsend_function? Is there anything equivalent?

Oct 29 2018, 9:33 PM · Patch-For-Review, Performance-Team (Radar), Operations
aaron closed T207809: PHP error "CdnPurgeJob never inserted." as Resolved.
Oct 29 2018, 9:18 PM · MW-1.32-notes, MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team, Core Platform Team Backlog (Watching / External), Patch-For-Review, Services (watching), MediaWiki-Cache, WMF-JobQueue, Wikimedia-production-error
aaron moved T207809: PHP error "CdnPurgeJob never inserted." from Inbox to Doing on the Performance-Team board.
Oct 29 2018, 9:18 PM · MW-1.32-notes, MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team, Core Platform Team Backlog (Watching / External), Patch-For-Review, Services (watching), MediaWiki-Cache, WMF-JobQueue, Wikimedia-production-error
aaron added a comment to T207223: getDBLoadBalancerFactory()->getExternalLB() returns a LB with the wrong database..

The recommended fix does work for our extension. I apparently had configured Echo properly in the past so it was working properly. For AbuseFilter I had to patch it to use the same pattern since it only specifies the database and not the cluster to use.

Oct 29 2018, 6:44 PM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database
jcrespo awarded T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate) a Love token.
Oct 29 2018, 8:49 AM · Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error

Oct 28 2018

aaron added a comment to T207809: PHP error "CdnPurgeJob never inserted.".

@aaron The fix LGTM, but do we know why this started happening? It'd be nice to know what commit or task prompted it so that we can learn why it wasn't prevented by our tests and/or Jenkins.

In theory, this kind of warning can be triggered in tests and would be captured by Jenkins in a way that fails the build. That definitely worked at some point.

Oct 28 2018, 10:17 PM · MW-1.32-notes, MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team, Core Platform Team Backlog (Watching / External), Patch-For-Review, Services (watching), MediaWiki-Cache, WMF-JobQueue, Wikimedia-production-error

Oct 27 2018

aaron closed T54781: "Could not acquire 'messages:de-formal:status' lock" opening page with page name >256 chars; memcache too slow as Invalid.

Closing, per "The Error Occurs if the memcache is too slow".

Oct 27 2018, 9:14 PM · MediaWiki-General-or-Unknown
aaron added a comment to T174549: MessageCache::loadFromDB makes too many slow queries with wrong index.

This will be better with a3d6c1411dad3e057b if there are many message pages that exist for extension use.

Oct 27 2018, 9:07 PM · MediaWiki-Database, MediaWiki-Cache
aaron added a comment to T207979: uselang=sr shows markup tags in installer.

4b1db1190bb8f2a115c6a81a5ee487b7d18cd303 seems more likely.

Oct 27 2018, 9:03 PM · MW-1.32-notes, Regression, MW-1.32-release, MediaWiki-Cache, Performance-Team, I18n, MediaWiki-Installer
aaron added a comment to T204423: MW 1.31 install reports "InvalidArgumentException ... DatabaseDomain.php: Domain has too few or too many parts ".

Note that git master (19dd28798163) installs fine with postgres, which has the same DB domain patches as 1.32.

Oct 27 2018, 3:15 PM · Performance-Team, MW-1.31-release, MediaWiki-Database, MediaWiki-Installer

Oct 26 2018

aaron added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

It looks like the errors come from some tool (JS?) that fires a bunch of API requests from a Special:Search tab to edit numerous pages in parallel. Each burst is always for a certain user ID with a single referrer URL.

Oct 26 2018, 11:43 PM · Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error
aaron committed rEBTFcadd89ea1200: Avoid use of IDatabase::insert() return value (authored by aaron).
Oct 26 2018, 9:41 PM
aaron committed rERXB3d0705e6c75f: Avoid using wfSplitWikiID() and use newer cache key methods (authored by aaron).
Oct 26 2018, 9:23 PM
aaron committed rEBSSMWC0e8b6916ea2a: Avoid use of IDatabase::update return value (authored by aaron).
Oct 26 2018, 8:32 PM
aaron committed rEBTF63bb559a1c9d: Avoid use of IDatabase::update return value (authored by aaron).
Oct 26 2018, 7:53 PM
aaron added a comment to T208003: WatchedItemStore::addWatchBatchForUser does not have outer scope..

Does this really need to call commitAndWaitForReplication() when there is only one batch? Is it ever called thousands of times in a row?

Oct 26 2018, 5:55 AM · MW-1.33-notes (1.33.0-wmf.4; 2018-11-13), Patch-For-Review, Growth-Team (Current Sprint), MediaWiki-Watchlist, Regression, Wikimedia-production-error

Oct 25 2018

aaron added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

COMMIT takes a few seconds

It is unlikely for a commit to take long, as in the actual physical changes that happen then: at commit time there is only a metadata change + flush to disk cache, which should be very fast. Look for contention somewhere as a first option when you see a commit taking a long time (e.g. large transactions blocking each other).

Oct 25 2018, 5:48 PM · Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error
aaron closed T207223: getDBLoadBalancerFactory()->getExternalLB() returns a LB with the wrong database. as Declined.
Oct 25 2018, 5:46 PM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database

Oct 24 2018

aaron committed rERXB04e424a12439: Avoid using wfSplitWikiID() and use newer cache key methods (authored by aaron).
Oct 24 2018, 9:32 PM

Oct 22 2018

aaron added a comment to T207223: getDBLoadBalancerFactory()->getExternalLB() returns a LB with the wrong database..

In the getMasterDatabase() method posted above, I noticed that the database domain (e.g. DB/schema/prefix) is missing from getConnection(). Instead that should be:

Oct 22 2018, 9:22 PM · Performance-Team, MW-1.32-release, MW-1.31-release, MediaWiki-Database
aaron claimed T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).
Oct 22 2018, 8:50 PM · Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error
aaron moved T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate) from Inbox to Doing on the Performance-Team board.
Oct 22 2018, 8:49 PM · Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error