
aaron (Aaron Schulz)
User

User Details

User Since
Oct 20 2014, 5:25 PM (314 w, 2 d)
Availability
Available
IRC Nick
AaronSchulz
LDAP User
Aaron Schulz
MediaWiki User
Aaron Schulz

Recent Activity

Today

aaron added a comment to T213089: Upgrade memcached cluster to Debian Stretch/Buster.

Regarding RedisLockManager: it only needs 2 of the 3 hosts to be reachable. If one of them is depooled or refuses connections, no one should notice any disruption. For otherwise unreachable servers, there is a 2 second timeout (and the redis server will be avoided for the rest of the request).
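The quorum behaviour described above can be illustrated with a minimal sketch. This is not the RedisLockManager code; `acquire_with_quorum`, `try_lock`, and the server names are hypothetical, and a real lock manager would also release any partial locks when the quorum is not reached.

```python
from typing import Callable, Iterable, Set


def acquire_with_quorum(
    servers: Iterable[str],
    try_lock: Callable[[str], bool],
    quorum: int,
    unreachable: Set[str],
) -> bool:
    """Return True if at least `quorum` servers granted the lock.

    `unreachable` is shared for the lifetime of one request, so a server
    that failed once is skipped on later calls (hypothetical behaviour,
    modelled on the comment above).
    """
    granted = 0
    for server in servers:
        if server in unreachable:
            continue  # already known to be down during this request
        try:
            if try_lock(server):
                granted += 1
        except ConnectionError:
            unreachable.add(server)  # avoid it for the rest of the request
    return granted >= quorum


def one_host_down(server: str) -> bool:
    # Simulates one of three hosts refusing connections.
    if server == "redis2":
        raise ConnectionError("connection refused")
    return True


down: Set[str] = set()
ok = acquire_with_quorum(["redis1", "redis2", "redis3"], one_host_down, 2, down)
print(ok, sorted(down))
```

With one host down, two grants still satisfy a quorum of 2, which is why callers should not notice the outage.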

Wed, Oct 28, 5:55 PM · Wikidata, Platform Engineering, User-jijiki, serviceops, Performance-Team (Radar), Operations, User-Elukey

Mon, Oct 26

aaron created T266502: Deprecate and remove wfMemcKey().
Mon, Oct 26, 7:34 PM · MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), Patch-For-Review, Performance-Team, MediaWiki-Cache
aaron closed T255479: Replace count metric with timing metric for WAN cache gets as Resolved.

Added cache hit latency panel

Mon, Oct 26, 7:27 PM · MW-1.36-notes (1.36.0-wmf.11; 2020-09-29), Performance-Team, MediaWiki-Cache
aaron added a comment to T265749: Research to create service for DeferredUpdates::addUpdate().

It would be good to get cleanup patches like https://gerrit.wikimedia.org/r/c/mediawiki/core/+/596082 merged first.

Mon, Oct 26, 7:12 PM · MediaWiki-General, Dependency injection
aaron moved T250407: Deprecate wfForeignMemcKey() and BagOStuff::getKeyInternal() from Blocked or Needs-CR to Doing on the Performance-Team board.
Mon, Oct 26, 6:37 PM · MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), Technical-Debt (Deprecation process), MediaWiki-Cache, Performance-Team
aaron claimed T250407: Deprecate wfForeignMemcKey() and BagOStuff::getKeyInternal().
Mon, Oct 26, 6:37 PM · MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), Technical-Debt (Deprecation process), MediaWiki-Cache, Performance-Team

Wed, Oct 21

aaron closed T257009: WikitextContentTest::testIsCountable failure: Argument 1 passed to AbstractContent::getParserOutput() must be an instance of Title as Resolved.
Wed, Oct 21, 8:44 PM · MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), MediaWiki-ContentHandler

Tue, Oct 20

aaron renamed T257009: WikitextContentTest::testIsCountable failure: Argument 1 passed to AbstractContent::getParserOutput() must be an instance of Title from Argument 1 passed to AbstractContent::getParserOutput() must be an instance of Title to WikitextContentTest::testIsCountable failure: Argument 1 passed to AbstractContent::getParserOutput() must be an instance of Title.
Tue, Oct 20, 10:36 PM · MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), MediaWiki-ContentHandler
aaron added a project to T265778: Make FileBackend code that scrapes PHP warning robust: Performance-Team.
Tue, Oct 20, 7:30 PM · Performance-Team, Commons, MediaWiki-File-management

Fri, Oct 16

aaron created T265778: Make FileBackend code that scrapes PHP warning robust.
Fri, Oct 16, 11:50 PM · Performance-Team, Commons, MediaWiki-File-management

Tue, Oct 13

aaron created T265386: Rewrite LoadMonitor to better handle cache regeneration and improve separation of concern.
Tue, Oct 13, 6:34 PM · Wikimedia-Rdbms, Performance-Team

Tue, Oct 6

aaron created T264787: Make WANCache worthRefreshExpiring() account for values with FLD_TTL less than $lowTTL.
Tue, Oct 6, 7:12 PM · Patch-For-Review, Performance-Team, MediaWiki-Cache

Mon, Oct 5

aaron added a comment to T255334: Translation page does not contain the latest translations/last translation.

The default timeout for waitForReplication() for web requests is 1 second. JobRunner passes 3 seconds to its calls and sets setDefaultReplicationWaitTimeout( 3 ). Maybe TranslateRenderJob should use a different timeout?

Mon, Oct 5, 7:41 PM · Performance-Team (Radar), MW-1.36-notes (1.36.0-wmf.13; 2020-10-12), Language-Team (Language-2020-October-December), MW-1.35-notes (1.35.0-wmf.39; 2020-06-30), MediaWiki-extensions-Translate
aaron triaged T264604: MediaWiki to route specific keys to /*/mw-with-onhost-tier/ as Medium priority.
Mon, Oct 5, 6:36 PM · Patch-For-Review, User-jijiki, Operations, serviceops, Performance-Team

Thu, Oct 1

aaron added a comment to T257009: WikitextContentTest::testIsCountable failure: Argument 1 passed to AbstractContent::getParserOutput() must be an instance of Title.

This only happens with paratest, so I suspect that some tests depend on each other running by accident.

Thu, Oct 1, 4:36 AM · MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), MediaWiki-ContentHandler

Wed, Sep 30

aaron added a comment to T93049: Same MassMessage is being sent more than once.

Are the jobs failing near the end and getting retried? Also, job B can still be enqueued if a duplicate job A is marked as running. This has long been the intended semantics (and the kafka queue does this too last I checked). If job A fails, it seems like the redis job queue (https://github.com/wikimedia/mediawiki-services-jobrunner/blob/master/redisJobChronService) would then allow duplicate job C, though JobQueueDB would not. Not sure about the Kafka queue in that situation. The behavior there should be standardized in any case.

Wed, Sep 30, 8:58 PM · Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, Operations, MassMessage
aaron added a comment to T255479: Replace count metric with timing metric for WAN cache gets.

The Grafana dashboards will still need updating.

Wed, Sep 30, 4:29 AM · MW-1.36-notes (1.36.0-wmf.11; 2020-09-29), Performance-Team, MediaWiki-Cache

Sep 28 2020

aaron added a comment to T243233: MediaWiki should provide a LocalClusterObjectCache service.

The problem with LocalClusterCache is that there can be cycles or odd behavior for early calls. As long as the wiring works, then I'd love to have such a method.

Sep 28 2020, 6:29 PM · Patch-For-Review, MediaWiki-Cache

Sep 17 2020

aaron closed T253697: Support hash-based deduplication in KeyValueDependencyStore as Declined.

Declined, assuming that the DB main stash has enough GB of space.

Sep 17 2020, 9:05 PM · Performance-Team, MediaWiki-ResourceLoader
aaron closed T253697: Support hash-based deduplication in KeyValueDependencyStore, a subtask of T113916: Redesign ResourceLoader's file dependency tracking (module_deps), as Declined.
Sep 17 2020, 9:05 PM · Epic, Performance-Team, MediaWiki-ResourceLoader
aaron added a comment to T262978: flaggedpage_config.fpc_select is unused.

It's not meaningfully used, though it is NOT NULL so the code still references it during insertion (using -1).

Sep 17 2020, 5:47 PM · User-DannyS712, MediaWiki-extensions-FlaggedRevs, Schema-change

Sep 14 2020

aaron added a comment to T259520: Lots of Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode.

1k/min would be considered high for codfw I would say, eqiad is a different story.
What looks strange to me is the fact that I stopped centralauth master and the wiki that showed up as lagged was enwiki. Although on this last event, stopping wikidata's master, only shows wikidata as lagged, which is what I would have expected with s7 (where centralauth lives).

By "showed up as lagged", do you mean that there were mostly logstash messages with wiki:enwiki or did you see this in some graph of MW statsd data or such? I could probably tweak the "all replica DBs lagged" entries to have some extra fields.

Sep 14 2020, 8:45 PM · Patch-For-Review, Platform Team Workboards (Clinic Duty Team), Performance-Team
aaron added a comment to T254430: Rdbms overhead due to "SELECT @@GLOBAL.read_only" queries.

Analysis:

LoadBalancer::isMasterConnectionReadOnly relies on caching in $this->srvCache. srvCache defaults to EmptyBagOStuff, which would explain why isMasterConnectionReadOnly hits serverIsReadOnly every time. However, LoadBalancer doesn't use the default value for srvCache, since LBFactory actually provides that value. It in turn gets it from ServiceWiring via MWLBFactory. ServiceWiring gets it from ObjectCache::makeLocalServerCache(). But ObjectCache::makeLocalServerCache() will return an EmptyBagOStuff in CLI mode.

I suppose that means that this isn't a real problem for us in production, since we call the job runner via the web. But for people who use maintenance/runJobs.php, it would be a problem.

In general, it seems strange that ObjectCache::makeLocalServerCache() wouldn't just return a HashBagOStuff. Code that wants a "local server cache" would probably work better with a cache that is transient, than with no cache at all. Is there a good reason not to have makeLocalServerCache() return a HashBagOStuff as a fallback?
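To make the difference concrete, here is a small sketch of a no-op cache versus a transient in-process cache. The classes `EmptyCache` and `HashCache` and the function `is_read_only` are hypothetical stand-ins, not the MediaWiki BagOStuff classes; the point is only that a per-process dictionary already removes the repeated "read-only" probes within one CLI run.

```python
class EmptyCache:
    """Stores nothing, so every lookup is a miss."""

    def get(self, key):
        return None

    def set(self, key, value):
        pass


class HashCache:
    """Transient cache that lives only as long as the process."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def is_read_only(cache, probe):
    """Return the cached flag, falling back to the expensive probe on a miss."""
    cached = cache.get("server-read-only")
    if cached is None:
        cached = probe()
        cache.set("server-read-only", cached)
    return cached


def count_probes(cache, calls=5):
    """Count how often the probe (a stand-in for the DB query) actually runs."""
    probes = 0

    def probe():
        nonlocal probes
        probes += 1
        return False

    for _ in range(calls):
        is_read_only(cache, probe)
    return probes


print(count_probes(EmptyCache()), count_probes(HashCache()))
```

With the no-op cache the probe runs on every call; with the in-process cache it runs once per process.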

Sep 14 2020, 8:36 PM · Language-Team (Language-2020-July-September), MW-1.36-notes (1.36.0-wmf.10; 2020-09-22), Patch-For-Review, Platform Team Workboards (Clinic Duty Team), Wikimedia-Rdbms, Regression

Sep 10 2020

aaron closed T257460: Support the configuration of mcrouter routing prefixes in MemcachedBagOStuff as Resolved.
Sep 10 2020, 5:06 PM · MW-1.36-notes (1.36.0-wmf.6; 2020-08-25), MediaWiki-Cache, Performance-Team

Aug 12 2020

aaron closed T259520: Lots of Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode as Resolved.
Aug 12 2020, 7:16 AM · Patch-For-Review, Platform Team Workboards (Clinic Duty Team), Performance-Team

Aug 10 2020

aaron renamed T230245: Make SwiftFileBackend::doStoreInternal defer the opening of file handles to stay in the concurrency limit from GenerateFancyCaptchas.php crashes with "FormatJson.php: File not found in" after 1000 iterations to Make SwiftFileBackend::doStoreInternal defer the opening of file handles to stay in the concurrency limit.
Aug 10 2020, 11:44 PM · Performance-Team, Patch-For-Review, Commons, MediaWiki-File-management, Platform Engineering (Icebox), Operations, SRE-swift-storage, Editing-team, ConfirmEdit (CAPTCHA extension)
aaron moved T230245: Make SwiftFileBackend::doStoreInternal defer the opening of file handles to stay in the concurrency limit from Doing to Backlog: Small & Maintenance on the Performance-Team board.
Aug 10 2020, 8:21 PM · Performance-Team, Patch-For-Review, Commons, MediaWiki-File-management, Platform Engineering (Icebox), Operations, SRE-swift-storage, Editing-team, ConfirmEdit (CAPTCHA extension)
aaron closed T253055: Multiversion bug in relative module_deps path as Resolved.
Aug 10 2020, 8:11 PM · MW-1.36-notes (1.36.0-wmf.2; 2020-07-28), EngProd-Virtual-Hackathon, MediaWiki-ResourceLoader, Performance-Team
aaron moved T257460: Support the configuration of mcrouter routing prefixes in MemcachedBagOStuff from Doing to Blocked or Needs-CR on the Performance-Team board.
Aug 10 2020, 8:11 PM · MW-1.36-notes (1.36.0-wmf.6; 2020-08-25), MediaWiki-Cache, Performance-Team

Aug 6 2020

aaron awarded T250248: Fast stale ParserCache responses on PoolCounter contention a Yellow Medal token.
Aug 6 2020, 10:01 PM · Platform Team Sprints Board (Sprint 1), Patch-For-Review, MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), Platform Team Workboards (Clinic Duty Team), MediaWiki-Parser

Aug 5 2020

aaron added a comment to T221159: FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW.

Patch was merged, removing the patch for review tag.

On the patch, @jcrespo said:

I believe this is not working as intended, not because code- I can see now SELECT MASTER_GTID_WAIT('171978924-171978924-255115225', 1) on the logs, but errors out as:
"Timed out waiting for replication to reach 171970645-171970645-288070551,171978778-171978778-3298185533,171978924-171978924-255115225,180359242-180359242-170963125"

1.34.0-wmf.25

It may need some shaking on infra, maybe, to make it work? Otherwise I don't understand where those extra coords come from.

@Marostegui replied:

I believe this is not working as intended, not because code- I can
see now SELECT MASTER_GTID_WAIT('171978924-171978924-255115225', 1)
on the logs, but errors out as:
"Timed out waiting for replication to reach 171970645-171970645-288070551,171978778-171978778-3298185533,171978924-171978924-255115225,180359242-180359242-170963125"

1.34.0-wmf.25

It may need some shaking on infra, maybe, to make it work?
Otherwise I don't understand where those extra coords come from.

https://phabricator.wikimedia.org/T224422#5558330

Aug 5 2020, 4:34 AM · MW-1.36-notes (1.36.0-wmf.5; 2020-08-18), Patch-For-Review, Performance-Team (Radar), User-mobrovac, Services (watching), Goal, Wikimedia-Rdbms, DBA

Jul 29 2020

aaron closed T240684: Test gutter pool failover in production and memcached 1.5.x, a subtask of T244852: Upgrade and improve our application object caching service (memcached), as Resolved.
Jul 29 2020, 1:38 AM · Patch-For-Review, Operations, serviceops
aaron closed T240684: Test gutter pool failover in production and memcached 1.5.x as Resolved.

Yeah, technically, all sorts of anomalies are possible, so callers should always (a) avoid DB updates based on cache data, (b) choose reasonable, possibly adaptive, TTLs for the given use case (e.g. would the world suffer if cache was wrong for X days/hours/minutes?), (c) be cognizant of CDN/ObjectCache TTL interactions (race conditions already dictate this awareness anyway, even without lost updates).
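As one illustration of point (b), an adaptive TTL can scale with how long the underlying data has gone unchanged, clamped to a sane range. This is a sketch under assumptions: the function name `adaptive_ttl`, the default bounds, and the 0.2 factor are hypothetical and not taken from the task.

```python
def adaptive_ttl(age_seconds: float, min_ttl: int = 30,
                 max_ttl: int = 86400, factor: float = 0.2) -> int:
    """Pick a TTL proportional to the age of the last change.

    Recently changed data gets a short TTL (it is likely to change again),
    while long-stable data can be cached for longer. The result is always
    within [min_ttl, max_ttl].
    """
    ttl = factor * max(age_seconds, 0)
    return int(min(max(ttl, min_ttl), max_ttl))


for age in (10, 3600, 10**9):
    print(age, adaptive_ttl(age))
```

The upper bound answers the "would the world suffer if cache was wrong for X" question directly: it is the longest a stale value can survive.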

Jul 29 2020, 1:38 AM · Performance-Team, Sustainability (Incident Followup), Operations, serviceops

Jul 23 2020

aaron added a comment to T244340: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached).

Side note: if not already done, I'd double check how the WarmUp route behaves when the local memcached is not available for any reason (roll restart, temporary issues for weird bugs, etc..). Ideally the timeout should be very tight, few ms, and mcrouter should behave as if the GET ended up with a miss. The thing that I am worried about is that mcrouter uses the 1s timeout that we set for the main shards, but I didn't investigate a lot in the doc/code so feel free to discard if already discussed :)

Jul 23 2020, 4:31 AM · User-jijiki, Sustainability (Incident Followup), Performance-Team, Patch-For-Review, Operations, serviceops
aaron added a comment to T252391: Reimage one memcached shard to Buster.

Given the libketama-style consistent hashing in twemproxy and that, AFAIK, CentralAuth sessions can regenerate (notwithstanding one-off CSRF token failures and such) from the existence of the long-term centralauth_Token cookie. That would at least prevent logouts.

Jul 23 2020, 12:21 AM · User-jijiki, Growth-Team (Current Sprint), User-Elukey, Patch-For-Review, Operations, serviceops

Jul 22 2020

aaron added a comment to T254422: Move CentralAuth sessions from redis backend to kask.

In my KeyValueStore refactor patches, a consume() method was added, since implementing it with TTL changes is not generically the most convenient way to best-effort atomically consume a key.

Jul 22 2020, 11:47 PM · Code-Health-Objective, Platform Engineering Roadmap Decision Making, Platform Engineering Roadmap, Platform Team Workboards (Epics), Platform Team Initiatives (Session Management Service (CDP2)), User-Clarakosi, User-Eevans

Jul 17 2020

aaron added a comment to T240684: Test gutter pool failover in production and memcached 1.5.x.

We don't really need purges to go to the gutter cache, given the low TTL there.

Jul 17 2020, 1:30 AM · Performance-Team, Sustainability (Incident Followup), Operations, serviceops

Jul 16 2020

aaron added a comment to T154552: ApiLogin should not open master connection to centralauth DB.

This entry point will just be treated by CDN like POST is w.r.t multi-DC for the foreseeable future.

Jul 16 2020, 9:02 PM · Sustainability (MediaWiki-MultiDC), MediaWiki-Authentication-and-authorization, MediaWiki-extensions-CentralAuth
aaron removed a subtask for T92357: Fix database master queries from HTTP GET/HEAD before active-active multi-dc: T154552: ApiLogin should not open master connection to centralauth DB.
Jul 16 2020, 9:01 PM · Performance-Team, Sustainability (MediaWiki-MultiDC), MediaWiki-General
aaron removed a parent task for T154552: ApiLogin should not open master connection to centralauth DB: T92357: Fix database master queries from HTTP GET/HEAD before active-active multi-dc.
Jul 16 2020, 9:01 PM · Sustainability (MediaWiki-MultiDC), MediaWiki-Authentication-and-authorization, MediaWiki-extensions-CentralAuth
aaron added a comment to T134842: SpecialCentralAutoLogin calls User::saveSettings() on HTTP GET presend.

So this only affects users with user_token=''? The field is not nullable. Apparently only users created between March and June 2012 are affected by this:

mysql:wikiadmin@db1080 [enwiki]> select floor(user_id/1000000),count(*) from user where user_token='';
+------------------------+----------+
| floor(user_id/1000000) | count(*) |
+------------------------+----------+
|                     16 |   475413 |
+------------------------+----------+
1 row in set (13.51 sec)

mysql:wikiadmin@db1080 [enwiki]> select min(user_registration),max(user_registration),min(user_id),max(user_id) from user where user_id between 16000000 and 17000000 and user_token='';
+------------------------+------------------------+--------------+--------------+
| min(user_registration) | max(user_registration) | min(user_id) | max(user_id) |
+------------------------+------------------------+--------------+--------------+
| 20120322143714         | 20120615121417         |     16522112 |     16999995 |
+------------------------+------------------------+--------------+--------------+
1 row in set (0.54 sec)
Jul 16 2020, 9:00 PM · MediaWiki-Authentication-and-authorization, MediaWiki-extensions-CentralAuth, Sustainability
aaron renamed T216287: BannerMessageGroup::registerGroupHook of CentralNotice must not query master on GET request (page views) from CentralNotice must not query master on GET request (page views) to BannerMessageGroup::registerGroupHook of CentralNotice must not query master on GET request (page views).
Jul 16 2020, 8:19 PM · Fr-CentralNotice-Translation-Bugs, FR-Q2-FY2019-20-cleanup-list, Fundraising-Backlog, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC), Wikimedia-production-error, MediaWiki-extensions-CentralNotice
aaron added a comment to T216287: BannerMessageGroup::registerGroupHook of CentralNotice must not query master on GET request (page views).

BannerMessageGroup::registerGroupHook is triggering master queries (a) on HTTP GET requests and also (b) inside of a getWithSetCallback() call, which should use DB_REPLICA data.

Jul 16 2020, 8:19 PM · Fr-CentralNotice-Translation-Bugs, FR-Q2-FY2019-20-cleanup-list, Fundraising-Backlog, Performance-Team (Radar), Sustainability (MediaWiki-MultiDC), Wikimedia-production-error, MediaWiki-extensions-CentralNotice
aaron created T258125: Treat 3X increases in non-exempt GET reqs/sec that do master updates as a deploy blocker.
Jul 16 2020, 2:00 AM · Sustainability (MediaWiki-MultiDC), Performance-Team

Jul 8 2020

aaron created T257460: Support the configuration of mcrouter routing prefixes in MemcachedBagOStuff.
Jul 8 2020, 2:25 PM · MW-1.36-notes (1.36.0-wmf.6; 2020-08-25), MediaWiki-Cache, Performance-Team

Jul 6 2020

aaron moved T257003: Make memcached BagOStuff classes use 'gets' only if a CAS token is needed from Inbox to Doing on the Performance-Team board.
Jul 6 2020, 8:05 PM · MW-1.36-notes (1.36.0-wmf.1; 2020-07-21), Performance-Team, MediaWiki-Cache

Jul 2 2020

aaron created T257009: WikitextContentTest::testIsCountable failure: Argument 1 passed to AbstractContent::getParserOutput() must be an instance of Title.
Jul 2 2020, 11:56 PM · MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), MediaWiki-ContentHandler
aaron created T257003: Make memcached BagOStuff classes use 'gets' only if a CAS token is needed.
Jul 2 2020, 9:29 PM · MW-1.36-notes (1.36.0-wmf.1; 2020-07-21), Performance-Team, MediaWiki-Cache
DannyS712 awarded T88044: Make rollback use POST instead of GET (use AJAX in GUI) a Dislike token.
Jul 2 2020, 3:02 PM · MediaWiki-Page-History, Performance-Team (Radar), User-notice, MediaWiki-Page-Diffs

Jun 30 2020

aaron added a comment to T250205: Reduce rate of purges emitted by MediaWiki.

I'm not fond of the idea of not sending purges for indirect edits

Agreed. The proposal to stop sending these purges does not stand by itself but rather would be an implementation step of "moving" the purges from one place to another (remove here, add there).

I'm not fond of […] using RefreshLinksJob instead of HtmlCacheUpdateJob (too slow IMO).

[…] Skipping backlink CDN purges for templates/entities/files with millions of backlinks might work, assuming the editor is logged in and not showing things to logged out users...I can't think of a clever way to maintain expectations.

I think for something like making a change to Template:Information or Template:Infobox, it's more valuable for logged-out users that pages get served quickly, than for those indirect changes to be applied instantly (e.g. with a cache miss resulting in a 2-60 second blank screen, awaiting an expensive reparse, possibly hitting the lower timeout threshold from GET compared to on-edit/job).

As I understand it, that last part is exactly what we're proposing. Although it would work as you'd like for unregistered editors as well, I think? They too get a full session that bypasses the CDN and lasts for days/weeks (tied to their browser, incl typical session restore/continuation).

In a nutshell:

  • Still purge from edit.
  • Still bump page_touched recursively from a (quick) job. This means any natural cache miss or passthrough (editor with session) will still lazy re-parse as needed.
  • Move recursive purges away from the quick job that does the page_touched bumps, and move it to the job that does the re-parses.
Jun 30 2020, 5:44 PM · Platform Engineering Roadmap Decision Making, MW-1.35-notes (1.35.0-wmf.35; 2020-06-02), Sustainability (Incident Followup), Performance-Team (Radar), Platform Engineering, serviceops, Traffic, Operations

Jun 24 2020

aaron claimed T256287: "Database selection is disallowed to enable reuse" is badly worded.
Jun 24 2020, 8:14 PM · MW-1.35-notes, MW-1.36-notes (1.36.0-wmf.3; 2020-08-04), MW-1.34-notes, Patch-For-Review, Developer Productivity, Platform Team Workboards (Clinic Duty Team), Performance-Team, Wikimedia-Rdbms
aaron added a comment to T255493: Consider phasing out ILoadBalancer::getConnectionRef in favour of ILoadBalancer::getLazyConnectionRef.

It is still useful to have a gauge of connectivity, in some cases, before attempting to use DB handles. That makes me lean towards keeping both out of convenience.

Jun 24 2020, 4:45 PM · Platform Engineering Roadmap Decision Making, Platform Engineering, Developer Productivity, Wikimedia-Rdbms
aaron added a comment to T254608: Monitor read and write traffic to Memcached at the keygroup level.

@aaron Do I understand correctly that the approach of storing the size of data, in the data stored in Memcached itself, mainly to avoid re-serialisation because we currently use the PECL implementation which deserialises natively first, and that this would be redundant after T234455?

Jun 24 2020, 4:43 PM · Patch-For-Review, observability, Sustainability (Incident Followup), Performance-Team, MediaWiki-Cache
aaron added a comment to T253697: Support hash-based deduplication in KeyValueDependencyStore.

Some more info:

aaron@mwmaint1002:~$ mwscript eval.php --wiki=testwiki
> echo strlen(json_encode(['paths'=>'ref:af18298daea159b6ca5283c0c1aa45e7155e4412', 'asOf' => time()]));
74
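The 74-byte figure can be reproduced outside PHP with a compact JSON encoding. In this sketch the `asOf` value is a hypothetical 10-digit Unix timestamp standing in for `time()`; any 10-digit timestamp gives the same length.

```python
import json

# Same shape as the record above; the timestamp is an assumed placeholder.
record = {
    "paths": "ref:af18298daea159b6ca5283c0c1aa45e7155e4412",
    "asOf": 1592966700,
}

# PHP's json_encode() emits no spaces, so use compact separators here.
encoded = json.dumps(record, separators=(",", ":"))
print(len(encoded))
```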
Jun 24 2020, 2:45 AM · Performance-Team, MediaWiki-ResourceLoader

Jun 22 2020

aaron moved T253679: Dedicated Flamegraphs for save timing from Backlog: Future Goals to Next In This Quarter / Oct-Dec 2020 on the Performance-Team board.
Jun 22 2020, 8:24 PM · Patch-For-Review, Arc-Lamp, Performance-Team

Jun 16 2020

aaron renamed T255493: Consider phasing out ILoadBalancer::getConnectionRef in favour of ILoadBalancer::getLazyConnectionRef from Consider phasing out ILoadBalancer::getLazyConnectionRef in favour of getConnectionRef to Consider phasing out ILoadBalancer::getConnectionRef in favour of ILoadBalancer::getLazyConnectionRef.
Jun 16 2020, 8:25 PM · Platform Engineering Roadmap Decision Making, Platform Engineering, Developer Productivity, Wikimedia-Rdbms

Jun 15 2020

aaron created T255479: Replace count metric with timing metric for WAN cache gets.
Jun 15 2020, 8:16 PM · MW-1.36-notes (1.36.0-wmf.11; 2020-09-29), Performance-Team, MediaWiki-Cache

Jun 10 2020

aaron committed rECACf23d7622d671: Fix stray & in "wpShowRejects" URL parameter (authored by aaron).
Fix stray & in "wpShowRejects" URL parameter
Jun 10 2020, 9:29 AM

Jun 4 2020

aaron added a comment to T244340: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached).

I suspect that the keys that cause trouble are big text/JSON blobs and ParserOutput objects, all of which don't directly get purged. READ_VERIFIED is used by MultiWriteBagOStuff already when deciding whether it is safe to do cache-aside/backfill into lower cache tiers. This could probably be integrated into mcrouter by having such calls use a route prefix that uses a warmup route. A similar flag could be added/used by WANObjectCache for other blob keys that don't receive delete()/touchCheckKey() calls. This would add a lot of I/O resistance without making hard-to-reason-about changes to cache invalidation.

Jun 4 2020, 9:50 PM · User-jijiki, Sustainability (Incident Followup), Performance-Team, Patch-For-Review, Operations, serviceops

Jun 3 2020

aaron placed T238493: Frontend save timing regression on/after 30 October 2019 up for grabs.
Jun 3 2020, 1:38 AM · Wikimedia-Incident, Performance-Team
aaron reopened T238493: Frontend save timing regression on/after 30 October 2019 as "Open".
Jun 3 2020, 1:38 AM · Wikimedia-Incident, Performance-Team
aaron closed T238493: Frontend save timing regression on/after 30 October 2019 as Declined.

Unfortunately, I'm not seeing any xenon .log files going back that far. In terms of backend contribution, I don't see anything to do here.

Jun 3 2020, 1:24 AM · Wikimedia-Incident, Performance-Team

Jun 2 2020

aaron added a comment to T246456: Performance review of Wikidata Bridge.

From the perspective of popular/major articles, likely to have infoboxes, the extra 42.1 KB for loading the "app" JS doesn't seem crazy. I've looked through the code several times and it seems reasonable. Testing with fast/slow 3G doesn't reveal obnoxious reflows or delay either. Having the edit link go directly to a Q<X> page when the JS hasn't fully loaded felt somewhat jarring, though I don't imagine that happening often. I don't see much editing at all given how discreet the icon is (a good thing).

Jun 2 2020, 4:20 AM · Wikidata, Wikidata-Bridge, Performance-Team

May 30 2020

aaron added a comment to T246456: Performance review of Wikidata Bridge.

I've been looking at this from time to time, and haven't found any real problems yet. Some of the things I'm looking out for are:

May 30 2020, 7:13 PM · Wikidata, Wikidata-Bridge, Performance-Team

May 26 2020

aaron created T253697: Support hash-based deduplication in KeyValueDependencyStore.
May 26 2020, 10:08 PM · Performance-Team, MediaWiki-ResourceLoader
aaron closed T247717: Reduce flamegraph.pl threshold from minwidth=2 to minwidth=1 as Resolved.

Per-function flame graphs are T253679

May 26 2020, 8:01 PM · EngProd-Virtual-Hackathon, Performance-Team, Arc-Lamp
aaron created T253679: Dedicated Flamegraphs for save timing.
May 26 2020, 8:01 PM · Patch-For-Review, Arc-Lamp, Performance-Team

May 23 2020

aaron closed T250578: Add WANObjectCache size metrics to Grafana dashboards as Resolved.

Added two panels to the regeneration row.

May 23 2020, 6:46 PM · observability, MediaWiki-Cache, Performance-Team

May 21 2020

aaron added a comment to T250205: Reduce rate of purges emitted by MediaWiki.

I'm not fond of the idea of not sending purges for indirect edits nor using RefreshLinksJob instead of HtmlCacheUpdateJob (too slow IMO).

May 21 2020, 9:38 AM · Platform Engineering Roadmap Decision Making, MW-1.35-notes (1.35.0-wmf.35; 2020-06-02), Sustainability (Incident Followup), Performance-Team (Radar), Platform Engineering, serviceops, Traffic, Operations

May 11 2020

aaron moved T139044: Enable GTID on beta cluster mariaDB once upgraded from Inbox to Radar on the Performance-Team board.
May 11 2020, 8:25 PM · Performance-Team (Radar), Release-Engineering-Team, Beta-Cluster-Infrastructure

May 8 2020

aaron added a comment to T246456: Performance review of Wikidata Bridge.

hey @aaron, @Gilles,

could you, please, give us an update on this task? could you also tell us something we could tackle proactively that may help?

We are currently implementing the last stories and that information would help us enormously to shape our next steps.

Thanks a lot in advance!

May 8 2020, 2:45 PM · Wikidata, Wikidata-Bridge, Performance-Team
aaron added a comment to T133821: Make CDN purges reliable.

At a later time, we could think of changing the logic, and make purges avoid race conditions, removing the need for the rebound purges.
One way to implement this would be the following:

  • No more changes are needed at the application layer
  • All purged servers join a single consumer group per datacenter. This will ensure each purge message is consumed by only one purged instance.
  • This instance will take care of sending the purges to all the cache backends in the DC first, and to all the frontends afterwards

This would ensure there are no fe/be race conditions.

May 8 2020, 12:15 AM · Sustainability, serviceops, Performance-Team (Radar), Operations, Traffic

May 4 2020

aaron moved T250578: Add WANObjectCache size metrics to Grafana dashboards from Inbox to Backlog: Small & Maintenance on the Performance-Team board.
May 4 2020, 2:35 PM · observability, MediaWiki-Cache, Performance-Team

Apr 22 2020

aaron added a comment to T245900: Introduce dependency injection into jobs.

I agree that jobs are not always "events" or "event handler hooks" (what about jobs that subdivide or are added by maintenance scripts and so on?). There is probably a lot of stuff that can indeed be moved in that direction though.

Apr 22 2020, 1:26 PM · Platform Engineering Roadmap Decision Making, Platform Team Initiatives (Decoupling (CDP2)), Dependency injection, MediaWiki-JobQueue, Platform Engineering

Apr 20 2020

aaron closed T248147: Wikimedia\Rdbms\Database::normalizeUpsertKeys called with deprecated parameter style: the unique key array should be a string or array of string arrays generating 2 million warnings in 24 hours, a subtask of T233872: 1.35.0-wmf.24 deployment blockers, as Resolved.
Apr 20 2020, 7:35 PM · Patch-For-Review, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)), Release, Train Deployments
aaron closed T248147: Wikimedia\Rdbms\Database::normalizeUpsertKeys called with deprecated parameter style: the unique key array should be a string or array of string arrays generating 2 million warnings in 24 hours as Resolved.
Apr 20 2020, 7:35 PM · MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), AntiSpoof, ProofreadPage, PageCuration, Wikidata, Growth-Team, Wikimedia-General-or-Unknown, Platform Team Workboards (Clinic Duty Team)

Apr 18 2020

aaron committed rESRXe1c22c6243ba: Convert $wgMemc use to WANObjectCache (authored by aaron).
Convert $wgMemc use to WANObjectCache
Apr 18 2020, 8:36 PM

Apr 17 2020

aaron closed T230025: Create HtmlCacheUpdater service class to normalize purging code as Resolved.
Apr 17 2020, 10:01 AM · MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), MediaWiki-Page-derived-data, Platform Team Workboards (Clinic Duty Team), User-Daniel, Performance-Team
aaron placed T228468: Move stats updates from AuthManager::autoCreateUser() HTTP GET to the job queue up for grabs.
Apr 17 2020, 10:00 AM · Sustainability (MediaWiki-MultiDC), Performance-Team
aaron closed T206283: Failed deferred updates should be queued as jobs if possible (Deadlock from LinksUpdate in WikiPage::updateCategoryCounts) as Resolved.
Apr 17 2020, 10:00 AM · MW-1.35-notes (1.35.0-wmf.24; 2020-03-17), Platform Team Workboards (Clinic Duty Team), MediaWiki-Page-derived-data, Performance-Team, Wikimedia-production-error
aaron closed T206283: Failed deferred updates should be queued as jobs if possible (Deadlock from LinksUpdate in WikiPage::updateCategoryCounts), a subtask of T30599: Deadlock tracking bug (tracking), as Resolved.
Apr 17 2020, 10:00 AM · MediaWiki-General, Tracking-Neverending
aaron added a comment to T250248: Fast stale ParserCache responses on PoolCounter contention.

Tim mentioned this in conversation earlier but I think it's worth writing down: we'd have to ensure that the post-save page view isn't stale. It would be pretty confusing if editors didn't see their own edits.

Apr 17 2020, 1:06 AM · Platform Team Sprints Board (Sprint 1), Patch-For-Review, MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), Platform Team Workboards (Clinic Duty Team), MediaWiki-Parser

Apr 16 2020

aaron created T250407: Deprecate wfForeignMemcKey() and BagOStuff::getKeyInternal().
Apr 16 2020, 4:43 PM · MW-1.36-notes (1.36.0-wmf.16; 2020-11-03), Technical-Debt (Deprecation process), MediaWiki-Cache, Performance-Team

Apr 15 2020

aaron updated the task description for T250239: Make BagOStuff key encoding more consistent.
Apr 15 2020, 1:33 AM · Patch-For-Review, MediaWiki-Cache, Performance-Team

Apr 14 2020

aaron created T250239: Make BagOStuff key encoding more consistent.
Apr 14 2020, 10:55 PM · Patch-For-Review, MediaWiki-Cache, Performance-Team

Apr 11 2020

aaron added a comment to T248890: MW Memcached get hit ratio trend over the past months.

I think the wikibase improvement simply avoided a bunch of repetitive duplicate queries, which technically would lower the hit rate.
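
A small worked example of why dropping duplicate queries can lower the hit rate. The numbers below are made up for illustration and do not come from the actual dashboards; the assumption is that the removed gets were repeats of keys already in cache, so only hits disappear while misses stay the same.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of cache gets that were served from cache."""
    total = hits + misses
    return hits / total if total else 0.0


# Hypothetical traffic: 900 hits and 100 misses, where 500 of the hits
# are repeated gets of keys that were already fetched earlier.
before = hit_ratio(hits=900, misses=100)

# Removing the 500 duplicate gets removes hits only; misses are unchanged.
after = hit_ratio(hits=900 - 500, misses=100)

print(f"before={before:.2f} after={after:.2f}")
```

The ratio drops even though the cache is doing strictly less work, which is why the trend on its own is not evidence of a regression.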

Apr 11 2020, 4:12 PM · Performance-Team, Operations
aaron added a comment to T157651: sql.php must not run LoadExtensionSchemaUpdates.

I would suggest the opposite: keep sql.php, drop patchSql.php. I don't think many people are familiar with the latter (compare patchSql vs sql docs for example) and I don't think it's terribly useful - passing a file path is more user-friendly than passing a patch name. And it does not even replace the schema vars, meaning it's actively harmful to anyone who uses table prefixes or non-default table settings.

So, IMO

  • keep mysql.php which is indeed widely used at least in Wikimedia production for debugging, more user-friendly than sql.php (which channels query output through PHP which does weird things to it) and not problematic (it already requires a write flag for performing any changes - although that relies on a master/slave distinction so that could be improved, cf T249683#6039238 - and does not accidentally run updates).
  • kill patchSql.php which is IMO pretty useless. (Probably worth a wikitech question to ensure it is indeed not used.)
  • keep sql.php manual debugging mode, which is the only way to debug a non-MySQL server, but require an explicit --debug flag to be used. Do not invoke the updater (not even for variable transformations) when that's used, it seems pointless and just extra code path exposure (cf the fatal error it gave during the incident).
  • keep sql.php for running scripts but require a --write flag like mysql.php does for scripts that change data. (I would even separate an admin mode for schema changes and a write mode for data changes via a restricted user.)
  • if sql.php is invoked with neither a script file nor the --debug flag, just exit with an error (and without creating a DatabaseUpdater or doing anything else nontrivial).
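
The flag rules proposed above could be sketched roughly as follows. This is an illustration in Python, not the actual sql.php code; the function name, the mode strings, and the `script_changes_data` parameter are all hypothetical, and how a script would be classified as changing data is left open.

```python
from typing import Optional


def choose_mode(script_file: Optional[str], debug: bool, write: bool,
                script_changes_data: bool = False) -> str:
    """Decide what to do before any nontrivial setup happens.

    Raises ValueError for invocations that should exit with an error.
    """
    if script_file is None:
        # Interactive debugging must be requested explicitly.
        if not debug:
            raise ValueError("no script file given; pass --debug for manual mode")
        return "debug"
    # Scripts that modify data need an explicit opt-in.
    if script_changes_data and not write:
        raise ValueError("script changes data; pass --write to run it")
    return "write-script" if script_changes_data else "read-script"
```

The point of checking this first is that the error paths return before anything like an updater object is constructed.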
Apr 11 2020, 12:18 AM · Sustainability (Incident Followup), MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), Wikidata, Growth-Team, StructuredDiscussions, Platform Team Workboards (Clinic Duty Team), Patch-For-Review, Performance-Team, MediaWiki-Maintenance-system

Apr 10 2020

aaron added a comment to T244058: Strategy for storing parser output for "old revision" (Popular diffs and permalinks).

Another thought from the TechCom meeting: we could just have the CDN cache output for old revisions and diffs for a short time (5 minutes?).

Apr 10 2020, 9:54 PM · Platform Team Workboards (Clinic Duty Team), Sustainability (Incident Followup), Performance-Team (Radar), Parsing-Team--ARCHIVED, TechCom, Performance Issue, serviceops, Operations

Apr 9 2020

aaron added a comment to T248962: Occasional NIC Tx bandwidth saturation for mc1027.

@aaron one thing that would be useful, in my opinion, is having instrumentation in MediaWiki about key size volume/bytes. Even per "key family" would be enough, just to spot regressions in bandwidth usage from grafana rather than tracking them down via tcpdump. Do you think that it would be possible?
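
A minimal sketch of what per-"key family" byte accounting could look like. It assumes colon-separated keys where the second component names the family; that key layout, the class name, and the example keys are assumptions for illustration, not MediaWiki's actual metrics code.

```python
from collections import Counter


class KeyFamilyStats:
    """Accumulates value sizes and counts per key family."""

    def __init__(self) -> None:
        self.bytes_by_family: Counter = Counter()
        self.count_by_family: Counter = Counter()

    @staticmethod
    def family_of(key: str) -> str:
        # Assumed layout: "<wiki>:<family>:<rest...>"; fall back to the whole key.
        parts = key.split(":")
        return parts[1] if len(parts) > 1 else parts[0]

    def record(self, key: str, value: bytes) -> None:
        family = self.family_of(key)
        self.bytes_by_family[family] += len(value)
        self.count_by_family[family] += 1


stats = KeyFamilyStats()
stats.record("examplewiki:preprocess-hash:abc", b"x" * 10)  # hypothetical keys
stats.record("examplewiki:preprocess-hash:def", b"y" * 5)
stats.record("examplewiki:page-summary:1", b"z" * 3)
```

Flushing these counters to a metrics backend once per request would give a per-family bandwidth view without needing packet captures.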

Apr 9 2020, 8:40 PM · Patch-For-Review, MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), Performance-Team, Operations

Apr 7 2020

aaron added a comment to T248962: Occasional NIC Tx bandwidth saturation for mc1027.

Hmm, it would help if cacheGetTree()/cacheSetTree() were replaced by getWithSetCallback() perhaps. Lots of optimizations are not used atm due to that fact.
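
For readers unfamiliar with the pattern, here is a toy illustration of a "get with set callback" style cache in Python. It is not WANObjectCache and implements none of its optimizations; the class and method names are hypothetical. It only shows why the caller no longer needs a separate get step and set step.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TinyCache:
    """In-memory cache where a miss is filled by a caller-supplied callback."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_with_set_callback(self, key: str, ttl: float,
                              callback: Callable[[], Any],
                              now: Optional[float] = None) -> Any:
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: callback is not run
        value = callback()  # miss or expired: regenerate and store
        self._store[key] = (now + ttl, value)
        return value
```

Because lookup and regeneration go through one call, the cache layer is in a position to add behaviour around the callback, which split get/set helpers cannot do.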

Apr 7 2020, 10:29 PM · Patch-For-Review, MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), Performance-Team, Operations
aaron added a comment to T248962: Occasional NIC Tx bandwidth saturation for mc1027.

I wonder if it would be useful for the template name to appear in the key when possible. Right now it's just an opaque hash. I doubt that many invocations of different templates have the same top-level text, so I don't see it adding much fragmentation. Maybe some statsd logging could be done instead though.

Apr 7 2020, 6:25 PM · Patch-For-Review, MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), Performance-Team, Operations

Apr 3 2020

aaron added a comment to T248962: Occasional NIC Tx bandwidth saturation for mc1027.

It stores the serialized naive "top frame" (e.g. headings, paragraphs, template invocation parameters) of the wikitext of pages, as well as the "sub-frames" from template invocations, upon recursive expansion. This all happens on page parse. Note that these keys do not need purges. If template X is invoked the same way on multiple pages, then parses of those pages will reuse a common sub-frame cache key for those template invocations and likewise for templates invoked from that template. So, I suppose a popular template invoked with a low enough cardinality of parameter/context bundles would trigger traffic spikes upon invalidation. The traffic would come from either (a) refreshLinks jobs or (b) page views to backlink pages that got purged via htmlCacheUpdate jobs.
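
A rough sketch of the key-sharing behaviour described above, assuming the cache key is derived from a hash of the text being preprocessed. The key prefix, the use of SHA-1, and the function names are assumptions for illustration and are not taken from the MediaWiki source.

```python
import hashlib
from typing import Callable, Dict


def frame_cache_key(wikitext: str) -> str:
    """Content-derived key: identical text always maps to the same key."""
    digest = hashlib.sha1(wikitext.encode("utf-8")).hexdigest()
    return "preprocess-frame:" + digest  # hypothetical prefix


def cached_frame(cache: Dict[str, str], wikitext: str,
                 parse: Callable[[str], str]) -> str:
    """Return the parsed frame, parsing only if this exact text is new."""
    key = frame_cache_key(wikitext)
    if key not in cache:
        cache[key] = parse(wikitext)
    return cache[key]
```

Under this assumption, two pages that invoke a template with identical text hit the same entry, and edited text simply produces a different key, which would be consistent with the keys not needing purges.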

Apr 3 2020, 4:55 PM · Patch-For-Review, MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), Performance-Team, Operations
aaron claimed T248147: Wikimedia\Rdbms\Database::normalizeUpsertKeys called with deprecated parameter style: the unique key array should be a string or array of string arrays generating 2 million warnings in 24 hours.
Apr 3 2020, 4:30 PM · MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), AntiSpoof, ProofreadPage, PageCuration, Wikidata, Growth-Team, Wikimedia-General-or-Unknown, Platform Team Workboards (Clinic Duty Team)

Mar 30 2020

aaron added a comment to T248890: MW Memcached get hit ratio trend over the past months.

The timing of this SAL entry suggests some relation:

Mar 30 2020, 7:54 PM · Performance-Team, Operations

Mar 29 2020

aaron committed rESPRb46e95374c83: Convert $wgMemc use to WANObjectCache (authored by aaron).
Convert $wgMemc use to WANObjectCache
Mar 29 2020, 10:14 PM

Mar 25 2020

aaron added a comment to P10769 Rev IDs of en.wikipedia.org pages that were popular on 2020-03-24.
$titles = [
        'COVID-19',
        'Pandemia di COVID-19 del 2020 in Italia',
        'Pandemia di COVID-19 del 2019-2020',
        'Influenza spagnola',
        'Pandemia di COVID-19 del 2019-2020 nel mondo'
];
Mar 25 2020, 7:38 PM

Mar 20 2020

Krinkle awarded T246077: SQlite has wrong DB structure after upgrading to 1.35 an Orange Medal token.
Mar 20 2020, 5:26 PM · MW-1.35-notes (1.35.0-wmf.25; 2020-03-24), Platform Team Workboards (Clinic Duty Team), MW-1.35-release, MediaWiki-Installer, MediaWiki-User-management, SQLite
aaron closed T244095: assertArraySubset() will be removed in PHPUnit 9, a subtask of T243600: Preparation for the PHPUnit 9 upgrade, as Resolved.
Mar 20 2020, 1:50 AM · MW-1.36-notes (1.36.0-wmf.11; 2020-09-29), MW-1.35-notes (1.35.0-wmf.27; 2020-04-07), User-Daimona, Patch-For-Review, MediaWiki-Core-Testing
aaron closed T244095: assertArraySubset() will be removed in PHPUnit 9 as Resolved.
Mar 20 2020, 1:50 AM · MW-1.35-notes (1.35.0-wmf.25; 2020-03-24), Patch-For-Review, MediaWiki-extensions-DonationInterface, Product-Infrastructure-Team-Backlog, Reading List Service, Growth-Team, PageCuration, MediaWiki-Core-Testing

Mar 19 2020

aaron added a comment to T244095: assertArraySubset() will be removed in PHPUnit 9.

Is anyone working on this atm?

I don't think so, mainly because it's not clear how to proceed. Although, given how much assertArraySubset is used, we should probably just implement our own simplified version of it.
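
To make "a simplified version" concrete, here is one possible shape for such a check, written in Python rather than PHP. It is a sketch with its own semantics (recursive, equality-based, mappings only) and is not a port of PHPUnit's assertArraySubset(); the function name is hypothetical.

```python
from typing import Any, Mapping


def assert_array_subset(subset: Mapping[Any, Any], array: Mapping[Any, Any],
                        path: str = "") -> None:
    """Fail unless every key/value in `subset` also appears in `array`.

    Nested mappings are compared recursively; extra keys in `array` are fine.
    """
    for key, expected in subset.items():
        where = f"{path}[{key!r}]"
        if key not in array:
            raise AssertionError(f"missing key at {where}")
        actual = array[key]
        if isinstance(expected, Mapping) and isinstance(actual, Mapping):
            assert_array_subset(expected, actual, where)
        elif actual != expected:
            raise AssertionError(
                f"mismatch at {where}: expected {expected!r}, got {actual!r}")
```

Reporting the path of the first mismatch keeps failures readable without needing a full diff of both structures.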

Mar 19 2020, 10:02 PM · MW-1.35-notes (1.35.0-wmf.25; 2020-03-24), Patch-For-Review, MediaWiki-extensions-DonationInterface, Product-Infrastructure-Team-Backlog, Reading List Service, Growth-Team, PageCuration, MediaWiki-Core-Testing