aaron (Aaron Schulz)
User

User Details

User Since
Oct 20 2014, 5:25 PM (156 w, 4 d)
Availability
Available
IRC Nick
AaronSchulz
LDAP User
Aaron Schulz
MediaWiki User
Aaron Schulz

Recent Activity

Yesterday

aaron added a comment to T173696: Cache constraint check results.

Probably hotTTR is way too high. It's really "expected time till refresh given 1 hit/sec". With 50/min, you'd get maybe 2 updates (new values) per regex. I'll put up a patch for that.
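
A minimal sketch of that arithmetic, under a simplified model that is an assumption here and not the actual WANObjectCache logic: each hit independently triggers a refresh with a probability chosen so that a key receiving 1 hit/sec is refreshed about once per hotTTR. The function name and the example numbers are hypothetical.

```python
def expected_refreshes(hot_ttr_sec, hits_per_sec, window_sec):
    """Expected number of refreshes in a time window under the simplified model."""
    if hot_ttr_sec <= 0:
        raise ValueError("hot_ttr_sec must be positive")
    # Chance that a single hit triggers a refresh, calibrated so that a key
    # receiving 1 hit/sec is refreshed once per hot_ttr_sec on average.
    per_hit_chance = min(1.0, 1.0 / hot_ttr_sec)
    return window_sec * hits_per_sec * per_hit_chance


# 50 hits/min against a hypothetical hotTTR of 900 s, observed for 30 minutes.
print(expected_refreshes(900, 50 / 60, 1800))
```

Under this model the refresh count scales with the hit rate and inversely with hotTTR, so a lower-traffic key needs a smaller hotTTR to see the same number of new values.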

Fri, Oct 20, 4:16 PM · MW-1.30-release-notes (WMF-deploy-2017-07-25_(1.30.0-wmf.11)), Patch-For-Review, Wikidata-Sprint, Wikibase-Quality-Constraints, Wikibase-Quality, Wikidata

Thu, Oct 19

aaron added a comment to T173696: Cache constraint check results.

I did a bunch of requests against https://www.wikidata.org/w/api.php?action=wbcheckconstraints&format=json&id=Q42&constraintid=P1476%24F24FF782-E994-4946-BEEC-104CC592534F, which checks a format constraint for “title”. It’s always the same regex and only a handful of different values (17). But while I could see a sharp rise in requests in Grafana corresponding to the times when I sent those requests (permalink), most of them are still cache misses. I’m not sure how to interpret that – it seems values aren’t entering the cache map very often?

Thu, Oct 19, 5:45 PM · MW-1.30-release-notes (WMF-deploy-2017-07-25_(1.30.0-wmf.11)), Patch-For-Review, Wikidata-Sprint, Wikibase-Quality-Constraints, Wikibase-Quality, Wikidata

Wed, Oct 18

Krinkle awarded T178531: Add statsd metric to WANObjectCache an Orange Medal token.
Wed, Oct 18, 8:37 PM · Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron created T178531: Add statsd metric to WANObjectCache.
Wed, Oct 18, 8:31 PM · Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron placed T160298: "Special:ActiveUsers" throws database query error with sql_mode=only_full_group_by up for grabs.
Wed, Oct 18, 6:21 PM · MW-1.27-release-notes, MW-1.29-release-notes, MW-1.28-release-notes, MW-1.29-release, MW-1.28-release, MW-1.27-release, Technical-Debt, Easy, MediaWiki-Special-pages

Tue, Oct 17

aaron added a comment to T173696: Cache constraint check results.

Reopening. This task is supposed to be for caching results in general, which isn’t done yet at all, though we had a lot of discussion on caching regex checks specifically here, which in hindsight should’ve been in a separate task. Also, IMO the regex caching isn’t done yet, since the Grafana stats are pretty unsatisfactory.

(Perhaps we should repurpose this task to be just about regex checking, open a new one for general caching, and reshuffle the parent tasks so that this one is a child of the new task?)

Tue, Oct 17, 3:53 PM · MW-1.30-release-notes (WMF-deploy-2017-07-25_(1.30.0-wmf.11)), Patch-For-Review, Wikidata-Sprint, Wikibase-Quality-Constraints, Wikibase-Quality, Wikidata

Thu, Oct 12

aaron placed T75174: Make PHPUnit tests pass with PHP 5.5/PostgreSQL on Travis CI up for grabs.
Thu, Oct 12, 9:38 PM · MW-1.30-release-notes, PostgreSQL, Goal, MediaWiki-Core-Tests

Fri, Oct 6

aaron moved T177073: Split the backend savetiming metric into submetrics from Next-up to Doing on the Performance-Team board.
Fri, Oct 6, 6:02 PM · MW-1.31-release-notes (WMF-deploy-2017-10-10 (1.31.0-wmf.3)), Patch-For-Review, Performance-Team

Thu, Oct 5

aaron added a comment to T175672: Make apache/maintenance hosts TLS connections to mariadb work.

We discussed proxies in the last performance meeting and we're OK with that (it would cut down on handshake latency anyway).

Thu, Oct 5, 10:06 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
aaron added a comment to T155110: JobRunner transaction fname for Job::run() can mismatch __METHOD__ in a subclass.

JobRunner always starts an LBFactory transaction.

Thu, Oct 5, 10:01 PM · MediaWiki-JobQueue
aaron closed T42451: "Transaction already in progress" error in sqlite as Resolved.

This was actually fixed for new installs before that patch by moving the object cache table to a separate DB.

Thu, Oct 5, 9:41 PM · SQLite, MediaWiki-Database
aaron closed T42451: "Transaction already in progress" error in sqlite, a subtask of T72710: StorageException in EditEntityActionTest::testActionForPage (edit-already-exists) and related failures, as Resolved.
Thu, Oct 5, 9:41 PM · § Wikidata-Sprint-2015-02-25, Patch-For-Review, Wikidata, MediaWiki-extensions-WikibaseRepository
aaron placed T134811: Consider REST with SSL (HyperSwitch/Cassandra) for session storage up for grabs.
Thu, Oct 5, 1:12 AM · Services (blocked), Availability (Multiple-active-datacenters), Operations, Performance-Team

Wed, Oct 4

aaron added a comment to T175672: Make apache/maintenance hosts TLS connections to mariadb work.

So what I extract from the errors is you're trying to connect to db2048 by IP and not by hostname, and the certificates we expose for mysql do not include verification information for the ip address in its SAN. In fact, I don't think we ever did add that info to our certs.

So if we had the hostname instead of the IP in db-codfw.php, it should work. I think performance was a reason for using IPs instead of hostnames there, so we might need to reissue the certificates if we want to keep using IPs. I think that would mean a huge amount of maintenance work for the DBAs.

Wed, Oct 4, 8:47 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
aaron added a comment to T175672: Make apache/maintenance hosts TLS connections to mariadb work.

Also, there is https://bugs.php.net/bug.php?id=74445 :)

Wed, Oct 4, 8:39 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
aaron updated the task description for T175672: Make apache/maintenance hosts TLS connections to mariadb work.
Wed, Oct 4, 8:33 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
aaron renamed T175672: Make apache/maintenance hosts TLS connections to mariadb work from Make client certs available for apache/maintenance hosts for TLS connections to mariadb to Make apache/maintenance hosts TLS connections to mariadb work.
Wed, Oct 4, 7:07 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
aaron added a comment to T155110: JobRunner transaction fname for Job::run() can mismatch __METHOD__ in a subclass.

You can always do what extensions/CentralAuth/includes/LocalRenameJob/LocalRenameJob.php does AFAIK.

Wed, Oct 4, 6:03 PM · MediaWiki-JobQueue

Tue, Oct 3

aaron added a comment to T175672: Make apache/maintenance hosts TLS connections to mariadb work.

Looking at http://php.net/manual/en/mysqli.ssl-set.php, I would think you'd only need to set capath=/etc/ssl/certs, while setting all other parameters to NULL (except maybe cipher, as I have no idea what is the actual default cipherlist for mysqli on HHVM).

I tried that first but it yields "SSL connection error: SSL_CTX_set_default_verify_paths failed (10.192.32.108)".

Tue, Oct 3, 11:01 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
aaron added a comment to T177017: Re-enable per-filter profiling on wikis where it was disabled.

I'd look for the new method calls that are being reached, whether they show up, and how large their profile is if they do. Note that you can use Ctrl-F on the SVG images to highlight matches in purple.

Tue, Oct 3, 9:15 PM · Anti-Harassment, AbuseFilter

Mon, Oct 2

aaron created T177258: Update.php fails with postgres due to ip_changes population.
Mon, Oct 2, 9:58 PM · PostgreSQL, MediaWiki-Maintenance-scripts, MediaWiki-Installer
aaron added a comment to T177017: Re-enable per-filter profiling on wikis where it was disabled.

I think it's fine to roll out there as long as you are watching https://grafana.wikimedia.org/dashboard/db/save-timing?refresh=5m&orgId=1 and check the -index.svg flamegraph at https://performance.wikimedia.org/xenon/svgs/daily/ for the day of deployment on the following day (current-day values are always useless/incomplete).

Mon, Oct 2, 9:45 PM · Anti-Harassment, AbuseFilter
aaron closed T160298: "Special:ActiveUsers" throws database query error with sql_mode=only_full_group_by as Resolved.
Mon, Oct 2, 6:09 PM · MW-1.27-release-notes, MW-1.29-release-notes, MW-1.28-release-notes, MW-1.29-release, MW-1.28-release, MW-1.27-release, Technical-Debt, Easy, MediaWiki-Special-pages
aaron placed T173450: Setup grafana alert for job error rate up for grabs.
Mon, Oct 2, 6:08 PM · Performance-Team

Tue, Sep 26

aaron added a comment to T175672: Make apache/maintenance hosts TLS connections to mariadb work.

Looking at http://php.net/manual/en/mysqli.ssl-set.php, I would think you'd only need to set capath=/etc/ssl/certs, while setting all other parameters to NULL (except maybe cipher, as I have no idea what is the actual default cipherlist for mysqli on HHVM).

Tue, Sep 26, 3:29 PM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations

Sep 20 2017

aaron moved T166199: Add metrics for master queries on HTTP GET/HEAD from Next-up to Doing on the Performance-Team board.
Sep 20 2017, 7:21 PM · MW-1.31-release-notes (WMF-deploy-2017-10-03 (1.31.0-wmf.2)), Performance-Team, Availability (Multiple-active-datacenters)
aaron added a comment to T173696: Cache constraint check results.

Interesting idea! It feels a bit weird to implement logic like this on top of the cache (I thought that’s the cache’s job?), but you’re the expert :) it sounds like it makes a lot of sense, at least, since the set of regexes is mostly static and the set of values is highly dynamic, with some very commonly used values.

I think I’ll remove the “don’t bother” microtime check, though, since it seems that even for an extremely simple query like SELECT (1 AS ?x) {}, the query service rarely responds in less than 0.04 seconds, and never in less than 0.02 seconds (tested from a Cloud VPS system within the Eqiad cluster).

Sep 20 2017, 10:30 AM · MW-1.30-release-notes (WMF-deploy-2017-07-25_(1.30.0-wmf.11)), Patch-For-Review, Wikidata-Sprint, Wikibase-Quality-Constraints, Wikibase-Quality, Wikidata
aaron added a comment to T173696: Cache constraint check results.

If you want to avoid flooding the cache with rarely used long-tail combinations, maybe something like this could be done:

Sep 20 2017, 3:24 AM · MW-1.30-release-notes (WMF-deploy-2017-07-25_(1.30.0-wmf.11)), Patch-For-Review, Wikidata-Sprint, Wikibase-Quality-Constraints, Wikibase-Quality, Wikidata

Sep 19 2017

aaron added a comment to T176101: Cannot delete File:MKC,S.jpg on zhwiki due to DBQueryError.

Problem seems to be:

if ( $this->stage <= MIGRATION_WRITE_BOTH ) {
	$fields[$this->key] = $this->lang->truncate( $comment->text, 255 );
}

...LocalFile already used addQuotes(), and this can remove the ending quote character.

That's not the real problem. The problem is that the different behavior of IDatabase->insertSelect()'s $varMap versus ->insert()'s $a wasn't noticed, so the code was incorrectly quoting the value passed to CommentStore->insert() (from the original pre-CommentStore code) rather than quoting the returned literal fields for passing into IDatabase->insertSelect().

Sep 19 2017, 3:07 AM · Patch-For-Review, Vuln-Inject, Wikimedia-log-errors, Chinese-Sites, MediaWiki-Page-deletion

Sep 18 2017

aaron removed a project from T176101: Cannot delete File:MKC,S.jpg on zhwiki due to DBQueryError: Security.

Problem seems to be:

if ( $this->stage <= MIGRATION_WRITE_BOTH ) {
	$fields[$this->key] = $this->lang->truncate( $comment->text, 255 );
}
Sep 18 2017, 9:24 AM · Patch-For-Review, Vuln-Inject, Wikimedia-log-errors, Chinese-Sites, MediaWiki-Page-deletion
aaron triaged T176101: Cannot delete File:MKC,S.jpg on zhwiki due to DBQueryError as Unbreak Now! priority.
Sep 18 2017, 9:23 AM · Patch-For-Review, Vuln-Inject, Wikimedia-log-errors, Chinese-Sites, MediaWiki-Page-deletion
aaron added a project to T176101: Cannot delete File:MKC,S.jpg on zhwiki due to DBQueryError: Security.
Sep 18 2017, 9:10 AM · Patch-For-Review, Vuln-Inject, Wikimedia-log-errors, Chinese-Sites, MediaWiki-Page-deletion

Sep 16 2017

aaron added a comment to T175834: TranslatablePageMoveJob commit while in atomic sections.

Probably the onMoveTranslationUnits handler should be a closure sent to DeferredUpdates. Anything that needs to COMMIT/BEGIN within its scope needs full transaction control, which is not usually guaranteed when some hook triggers.

Sep 16 2017, 9:43 PM · MediaWiki-extensions-Translate, Wikimedia-log-errors

Sep 15 2017

aaron added a comment to T173477: wmf.14 Blocker - Post Mortem - Cannot flush pre-lock snapshot because writes are pending.
  • PROBLEM: in LinksUpdate, runForTitle() starts off with acquirePageLock(), then calls doUpdate() for the secondary update list, and returns without committing. This meant that any caller using this method inside a loop had to call commitMasterChanges() itself somehow; otherwise, the next acquirePageLock() call would fail. The multi-title case of RefreshLinksJob had a for-loop that did not do this. Note that acquirePageLock() uses getScopedLockAndFlush(), which is intended for "critical sections" (https://en.wikipedia.org/wiki/Critical_section) involving reading/writing to the database. Since it makes no sense to acquire a lock and then read a stale snapshot (from REPEATABLE-READ) from *before* lock acquisition, Database demands that any transaction be cleared. It will do so automatically if there are no writes, but otherwise it fails since committing prematurely may break atomicity.
  • INTRODUCTION: This had been broken since 63a3911a67507731695bad3188f486219a563b7d, but nothing used multi-title refreshLinks jobs. 0df49eeaf49dcd84cee5afc678de43ebd6c984c5 introduced a use case for this and made the bug manifest itself.
  • AVOIDANCE: since this would seem to happen for any multi-title job run, I'm not sure how this got past testing unless (a) no links were updated and the test jobs were triggered by null or non-link-changing edits or (b) the edited test entities only had one backlink. Future backlink change propagation testing should cover these cases.
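
The failure mode described in the PROBLEM item can be illustrated with a toy model. This is only a sketch: ToyDatabase, run_for_title, and refresh_links are hypothetical stand-ins, not MediaWiki classes, and the lock is reduced to a check that refuses to proceed while writes are pending.

```python
class ToyDatabase:
    """Hypothetical stand-in for a connection that tracks pending writes."""

    def __init__(self):
        self.pending_writes = 0

    def write(self):
        self.pending_writes += 1

    def commit(self):
        self.pending_writes = 0

    def acquire_lock_and_flush(self):
        # Taking the lock requires a fresh snapshot, so the open transaction
        # must be cleared. That is only safe when nothing has been written.
        if self.pending_writes:
            raise RuntimeError(
                "Cannot flush pre-lock snapshot because writes are pending"
            )


def run_for_title(db, title):
    # Mirrors the described flow: lock first, then write, no commit.
    db.acquire_lock_and_flush()
    db.write()


def refresh_links(db, titles, commit_each):
    for title in titles:
        run_for_title(db, title)
        if commit_each:
            # The caller has to commit before the next lock acquisition.
            db.commit()


refresh_links(ToyDatabase(), ["A", "B"], commit_each=True)  # succeeds
refresh_links(ToyDatabase(), ["A"], commit_each=False)      # one title: succeeds
# refresh_links(ToyDatabase(), ["A", "B"], commit_each=False) raises on "B"
```

A single-title run never reaches a second lock acquisition, so in this model only multi-title loops without a commit hit the error.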
Sep 15 2017, 9:39 AM · RelEng-Archive-FY201718-Q1
aaron added a comment to T174993: Vandalism in "In the news" articles persisting in the app' ?.

I'd leave it open. The above change avoids the jobqueue and thus avoids fast tail jobs piling up behind slow Wikidata jobs at the head of the queue. It should help, I'd assume.

Sep 15 2017, 6:26 AM · Reading-Infrastructure-Team-Backlog, Services (watching), Mobile, Wikipedia-iOS-App-Backlog, Wikipedia-Android-App-Backlog, iOS-app-Bugs, Android-app-Bugs

Sep 14 2017

aaron added a comment to T174993: Vandalism in "In the news" articles persisting in the app' ?.

I think LinksUpdate for the page directly edited can probably be moved (back) to doing the actual work post-send.

The 'enqueue' parameter can be removed from MediaWiki::restInPeace() since, unlike in the PRESEND run, the user is not waiting on it to run. If a caller really wants to enqueue a job post-send, it can always use lazyPush() instead of adding an EnqueueableDataUpdate to the POSTSEND deferred update list.

Sep 14 2017, 8:25 AM · Reading-Infrastructure-Team-Backlog, Services (watching), Mobile, Wikipedia-iOS-App-Backlog, Wikipedia-Android-App-Backlog, iOS-app-Bugs, Android-app-Bugs
aaron added a comment to T174993: Vandalism in "In the news" articles persisting in the app' ?.

I think LinksUpdate for the page directly edited can probably be moved (back) to doing the actual work post-send.

Sep 14 2017, 8:02 AM · Reading-Infrastructure-Team-Backlog, Services (watching), Mobile, Wikipedia-iOS-App-Backlog, Wikipedia-Android-App-Backlog, iOS-app-Bugs, Android-app-Bugs
aaron added a comment to T174993: Vandalism in "In the news" articles persisting in the app' ?.

As far as I can tell, the page image(s) are handled as part of deferred linksUpdate processing. This means that the updates would be executed after the main web request, but on the same PHP thread that handled the original edit request.

Sep 14 2017, 7:56 AM · Reading-Infrastructure-Team-Backlog, Services (watching), Mobile, Wikipedia-iOS-App-Backlog, Wikipedia-Android-App-Backlog, iOS-app-Bugs, Android-app-Bugs

Sep 13 2017

aaron added a comment to T102899: Implement or find a generic leaderboard web interface.

For catching slow queries, we can use logging to logstash when the runtime passes a certain threshold (to avoid spamming the service). A leaderboard could be added to Kibana for the top occurrences of normalized messages.
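
A minimal sketch of the threshold idea, assuming a standard Python logger stands in for the logstash pipeline. The threshold value, the function name, and the field name are hypothetical. The message text depends only on the query name, so identical queries group under one normalized message while the runtime travels as a separate field.

```python
import logging

SLOW_QUERY_THRESHOLD_SEC = 5.0  # hypothetical cutoff


def log_if_slow(logger, query_name, elapsed_sec,
                threshold_sec=SLOW_QUERY_THRESHOLD_SEC):
    """Log a query only when its runtime reaches the threshold."""
    if elapsed_sec < threshold_sec:
        return False  # fast queries are not logged, to avoid spamming
    # Runtime goes into a structured field, not into the message text.
    logger.warning("Slow query: %s", query_name,
                   extra={"elapsed_sec": elapsed_sec})
    return True


logger = logging.getLogger("slow-queries")
log_if_slow(logger, "ExampleClass::exampleMethod", 7.2)  # logged
log_if_slow(logger, "ExampleClass::exampleMethod", 0.3)  # skipped
```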

Sep 13 2017, 1:22 PM · Performance-Team
aaron renamed T99060: Create a dashboard of key user-centric performance metrics from Performance key metrics dashboard(s) to Create a dashboard of key user-centric performance metrics.
Sep 13 2017, 12:25 PM · Performance-Team

Sep 12 2017

aaron moved T95501: Fix causes of slave lag and get it to under 5 seconds at peak from Next-up to Blocked on the Performance-Team board.
Sep 12 2017, 10:08 AM · Goal, Performance-Team, Availability
aaron moved T161749: Introduce InterruptMutexManager from Next-up to Backlog on the Performance-Team board.
Sep 12 2017, 10:08 AM · TechCom-RfC (ArchCom-Approved), User-Daniel, Performance-Team, MediaWiki-General-or-Unknown
aaron moved T121440: Dedicated post-edit cache busting cookie to prevent stale reads (session consistency) from Potential goals to Backlog on the Performance-Team board.
Sep 12 2017, 10:08 AM · Performance-Team
aaron moved T121440: Dedicated post-edit cache busting cookie to prevent stale reads (session consistency) from Next-up to Potential goals on the Performance-Team board.
Sep 12 2017, 10:07 AM · Performance-Team
aaron moved T171071: Perform testing for TLS effect on connection rate from Doing to Blocked on the Performance-Team board.
Sep 12 2017, 10:01 AM · Availability (Multiple-active-datacenters), DBA, Operations, Performance-Team
aaron updated the task description for T175672: Make apache/maintenance hosts TLS connections to mariadb work.
Sep 12 2017, 8:44 AM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
aaron created T175672: Make apache/maintenance hosts TLS connections to mariadb work.
Sep 12 2017, 8:43 AM · Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations

Sep 9 2017

aaron updated the task description for T175437: Improve [rollback] logic when it encounters null edits.
Sep 9 2017, 1:08 AM · MediaWiki-Recent-changes, MediaWiki-History-or-Diffs
aaron updated the task description for T175437: Improve [rollback] logic when it encounters null edits.
Sep 9 2017, 12:54 AM · MediaWiki-Recent-changes, MediaWiki-History-or-Diffs
aaron created T175439: SQL error with postgres during 1.30 update.php run.
Sep 9 2017, 12:29 AM · MW-1.28-release-notes, MW-1.27-release-notes, MW-1.29-release-notes, Patch-For-Review, PostgreSQL, MediaWiki-Database

Sep 8 2017

aaron created T175437: Improve [rollback] logic when it encounters null edits.
Sep 8 2017, 11:48 PM · MediaWiki-Recent-changes, MediaWiki-History-or-Diffs
aaron created T175418: Create new instances memc05 and memc06 running memcached.
Sep 8 2017, 8:26 PM · Release-Engineering-Team (Watching / External), Availability (Multiple-active-datacenters), Beta-Cluster-Infrastructure
aaron added a comment to T171071: Perform testing for TLS effect on connection rate.

Hey,

It would be nice to do a test with MariaDB 10.0 and 10.1 if possible, to see if there are any regressions.
For that matter, on codfw you can pretty much use any slave for 10.0 (they all have pretty much the same HW), so for the sake of picking one from s1:
db2048

If you want to do the same test with MariaDB 10.1, you could use db2062. db2062 is being used lately to reclone some hosts, so you might want to give us a heads up before using it, to make sure it doesn't have mysql down for one of those maintenances.

Sep 8 2017, 7:04 PM · Availability (Multiple-active-datacenters), DBA, Operations, Performance-Team

Sep 5 2017

aaron added a comment to T173710: Job queue is increasing non-stop.

Those refreshLinks jobs (from wikibase) are the only ones that use multiple titles per job, so they will be a lot slower (seems to be 50 pages/job) than the regular ones from MediaWiki core. That is a bit on the slow side for a run time of a non-rare job type (e.g. TMH or GWT).

Sep 5 2017, 8:12 PM · Patch-For-Review, Services (watching), Performance-Team (Radar), CirrusSearch, Discovery, Wikidata-Sprint, Wikidata, Operations, MediaWiki-JobQueue

Sep 2 2017

aaron closed T173520: Fatal error: Stack overflow in [files] for wmf.14 as Resolved.
Sep 2 2017, 7:10 PM · MW-1.30-release-notes (WMF-deploy-2017-09-05 (1.30.0-wmf.17)), Patch-For-Review, ProofreadPage, Wikimedia-log-errors
aaron closed T173520: Fatal error: Stack overflow in [files] for wmf.14, a subtask of T170632: 1.30.0-wmf.14 deployment blockers, as Resolved.
Sep 2 2017, 7:10 PM · RelEng-Archive-FY201718-Q1, Train Deployments, Release
aaron closed T173520: Fatal error: Stack overflow in [files] for wmf.14, a subtask of T170634: 1.30.0-wmf.16 deployment blockers, as Resolved.
Sep 2 2017, 7:10 PM · RelEng-Archive-FY201718-Q1, Train Deployments, Release

Aug 31 2017

aaron added a comment to T173710: Job queue is increasing non-stop.

Could we always bump page_touched, but only send the purges to varnish if the timestamp is within the past four days? Would that let us run the older jobs faster since if I understand correctly the throttling is to avoid overloading varnish with purges?

Unfortunately the throttling still happens regardless of page touched. Throttling isn't based on actual purges performed but on the number of work items in a job. Work items are a simple count of pages in the job, rather than how many pages will actually be purged. Changing this behavior would basically increase the number of purges we send to varnish.

Aug 31 2017, 9:05 PM · Patch-For-Review, Services (watching), Performance-Team (Radar), CirrusSearch, Discovery, Wikidata-Sprint, Wikidata, Operations, MediaWiki-JobQueue
aaron added a comment to T173710: Job queue is increasing non-stop.

Correcting myself after a discussion with @ema: since we have up to 4 cache layers (at most), we should process any job with a root timestamp newer than 4 times the cache TTL cap. So anything older than 4 days should be safely discardable.

This would account for about 1% of jobs according to Gwicke's sampling, but I suspect that under large pressure the distribution could get significantly worse.
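
The discard rule above can be sketched as follows. The one-day TTL cap is an assumption inferred from "4 layers" and "4 days"; the names are hypothetical and timestamps are plain Unix seconds.

```python
CACHE_LAYERS = 4
CACHE_TTL_CAP_SEC = 24 * 3600  # assumption: a 1-day cap, implied by "4 days"


def is_discardable(root_timestamp, now):
    """True when the job's root timestamp is older than layers * TTL cap."""
    max_age_sec = CACHE_LAYERS * CACHE_TTL_CAP_SEC
    return now - root_timestamp > max_age_sec


now = 1_504_200_000  # arbitrary reference time
print(is_discardable(now - 5 * 24 * 3600, now))  # five days old
print(is_discardable(now - 3 * 24 * 3600, now))  # three days old
```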

Aug 31 2017, 6:15 PM · Patch-For-Review, Services (watching), Performance-Team (Radar), CirrusSearch, Discovery, Wikidata-Sprint, Wikidata, Operations, MediaWiki-JobQueue
aaron added a comment to T173710: Job queue is increasing non-stop.

As far as retries go, the attempts hash for wikidatawiki:htmlCacheUpdate has few entries with run counts no greater than 3. The only incrementing code is doPop() in MediaWiki, the same code that made them go up to 3 to begin with. If the same job ran many times, I'd expect there to be very high values there.

Aug 31 2017, 12:21 AM · Patch-For-Review, Services (watching), Performance-Team (Radar), CirrusSearch, Discovery, Wikidata-Sprint, Wikidata, Operations, MediaWiki-JobQueue

Aug 30 2017

aaron moved T157210: Gadget dependencies sometimes don't update from Next-up to Backlog on the Performance-Team board.
Aug 30 2017, 6:54 PM · MW-1.30-release-notes (WMF-deploy-2017-09-05 (1.30.0-wmf.17)), Performance-Team, MediaWiki-Cache, Gadgets
aaron added a project to T157210: Gadget dependencies sometimes don't update: Performance-Team.
Aug 30 2017, 6:54 PM · MW-1.30-release-notes (WMF-deploy-2017-09-05 (1.30.0-wmf.17)), Performance-Team, MediaWiki-Cache, Gadgets
aaron moved T173949: ext.math.styles is sometimes loaded when not needed - causing CSS size in production to fluctuate from Inbox to Radar on the Performance-Team board.
Aug 30 2017, 6:48 PM · Performance-Team (Radar), MediaWiki-Parser, Readers-Web-Backlog (Tracking), Math, Performance
aaron added a comment to T174422: Make dbBatchSize in WikiPageUpdater configurable.

@Ladsgroup @aaron Would $wgUpdateRowsPerQuery be appropriate here, too? Or is it important for this particular query to use a different batch size?

Aug 30 2017, 5:53 PM · Patch-For-Review, Performance-Team (Radar), Wikidata-Sprint, User-Ladsgroup, Wikidata

Aug 25 2017

aaron renamed T172357: ChronologyProtector redirect optimization depends on inappropriate $wgLocalVirtualHosts setting from $wgLocalVirtualHosts should include login.wikimedia.org, wikidata.org and others? to ChronologyProtector redirect optimization depends on inappropriate $wgLocalVirtualHosts setting.
Aug 25 2017, 11:24 PM · MW-1.30-release-notes, Performance-Team, Deployments, MediaWiki-extensions-CentralAuth
aaron added a comment to T173710: Job queue is increasing non-stop.

Though this bit is problematic:

Aug 25 2017, 10:37 PM · Patch-For-Review, Services (watching), Performance-Team (Radar), CirrusSearch, Discovery, Wikidata-Sprint, Wikidata, Operations, MediaWiki-JobQueue
aaron added a comment to T173710: Job queue is increasing non-stop.

Ignored purges still count as work items, yes.

Aug 25 2017, 9:09 PM · Patch-For-Review, Services (watching), Performance-Team (Radar), CirrusSearch, Discovery, Wikidata-Sprint, Wikidata, Operations, MediaWiki-JobQueue
aaron closed T172559: Ensure getLagTimes.php is working properly as Resolved.
Aug 25 2017, 8:03 PM · MW-1.30-release-notes, Operations, monitoring, Performance-Team
aaron added a comment to T173710: Job queue is increasing non-stop.

Note that for de-duplication, as long as the job has rootJobTimestamp set, it will ignore rows already touched (page_touched) to a higher/equal value, and likewise not send purges to the corresponding pages. So the CDN aspects *should* already have lots of de-duplication, the job spam notwithstanding.
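
A minimal sketch of that de-duplication check, assuming timestamps are 14-digit MediaWiki-style strings (which compare correctly as text). The function name and the sample data are hypothetical.

```python
def pages_to_purge(page_touched, root_job_timestamp):
    """Return titles whose page_touched is older than the job's root timestamp.

    Pages already touched at or after the root timestamp are skipped, so they
    are neither updated again nor purged.
    """
    return [
        title
        for title, touched in page_touched.items()
        if touched < root_job_timestamp
    ]


touched = {
    "Page_A": "20170825190000",  # older than the root job: needs work
    "Page_B": "20170825200300",  # equal: already covered
    "Page_C": "20170825210000",  # newer: already covered
}
print(pages_to_purge(touched, "20170825200300"))
```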

Aug 25 2017, 8:03 PM · Patch-For-Review, Services (watching), Performance-Team (Radar), CirrusSearch, Discovery, Wikidata-Sprint, Wikidata, Operations, MediaWiki-JobQueue

Aug 24 2017

aaron added a comment to T173710: Job queue is increasing non-stop.

Secondary purges were for dealing with replication lag scenarios, not lost purges. That was one extra purge (2X).

One easy change I can see is to not use CdnCacheUpdate from HtmlCacheUpdateJob (but still use it for the pages directly being edited). There is already a processing delay anyway (and if there is none, there is less likely to be replag, though that is not guaranteed), so there is less "de facto" use in a secondary purge for backlinks.

Aug 24 2017, 10:17 PM · Patch-For-Review, Services (watching), Performance-Team (Radar), CirrusSearch, Discovery, Wikidata-Sprint, Wikidata, Operations, MediaWiki-JobQueue
aaron added a comment to T173710: Job queue is increasing non-stop.

Secondary purges were for dealing with replication lag scenarios, not lost purges. That was one extra purge (2X).

Aug 24 2017, 10:03 PM · Patch-For-Review, Services (watching), Performance-Team (Radar), CirrusSearch, Discovery, Wikidata-Sprint, Wikidata, Operations, MediaWiki-JobQueue
aaron added a comment to T173710: Job queue is increasing non-stop.

In other words, base jobs for entities that will divide up and purge all backlinks to the given entity. Note that each job has two entries.

Wait - each job has two entries? You mean, there are duplicates inserted, and not pruned?...

Aug 24 2017, 7:07 PM · Patch-For-Review, Services (watching), Performance-Team (Radar), CirrusSearch, Discovery, Wikidata-Sprint, Wikidata, Operations, MediaWiki-JobQueue
aaron added a comment to T173710: Job queue is increasing non-stop.

From

mwscript maintenance/runJobs.php wikidatawiki --type htmlCacheUpdate --nothrottle --maxjobs 100 | grep "IsSelf=1"
Aug 24 2017, 1:28 AM · Patch-For-Review, Services (watching), Performance-Team (Radar), CirrusSearch, Discovery, Wikidata-Sprint, Wikidata, Operations, MediaWiki-JobQueue

Aug 23 2017

aaron committed rELGN810fe34c0822: Use global stash instance instead of local cluster instance (authored by Bawolff).
Use global stash instance instead of local cluster instance
Aug 23 2017, 9:59 PM
aaron placed T156924: Allow integration of data from etcd into the MediaWiki configuration up for grabs.
Aug 23 2017, 7:13 PM · MediaWiki-Platform-Team (MWPT-Q2-Oct-Dec-2017), Patch-For-Review, Performance-Team (Radar), Availability (Multiple-active-datacenters), Services (watching), discovery-system, User-Joe, User-mobrovac, Operations
aaron added a comment to T166199: Add metrics for master queries on HTTP GET/HEAD.

CommonSettings.php will still need updating.

Aug 23 2017, 7:11 PM · MW-1.31-release-notes (WMF-deploy-2017-10-03 (1.31.0-wmf.2)), Performance-Team, Availability (Multiple-active-datacenters)
aaron moved T171071: Perform testing for TLS effect on connection rate from Next-up to Doing on the Performance-Team board.
Aug 23 2017, 7:09 PM · Availability (Multiple-active-datacenters), DBA, Operations, Performance-Team
aaron moved T172357: ChronologyProtector redirect optimization depends on inappropriate $wgLocalVirtualHosts setting from Next-up to Blocked on the Performance-Team board.
Aug 23 2017, 7:09 PM · MW-1.30-release-notes, Performance-Team, Deployments, MediaWiki-extensions-CentralAuth
aaron moved T172941: Track duplicate parses on page save from Doing to Next-up on the Performance-Team board.
Aug 23 2017, 7:09 PM · Performance-Team
aaron triaged T173786: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true as Low priority.
Aug 23 2017, 6:54 PM · MediaWiki-Platform-Team, Performance-Team, Operations, HHVM
aaron moved T173786: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true from Inbox to Blocked on the Performance-Team board.
Aug 23 2017, 6:53 PM · MediaWiki-Platform-Team, Performance-Team, Operations, HHVM
aaron moved T173710: Job queue is increasing non-stop from Inbox to Radar on the Performance-Team board.
Aug 23 2017, 6:49 PM · Patch-For-Review, Services (watching), Performance-Team (Radar), CirrusSearch, Discovery, Wikidata-Sprint, Wikidata, Operations, MediaWiki-JobQueue
aaron moved T173796: Performance review of ArticleCreationWorkflow extension from Inbox to Next-up on the Performance-Team board.
Aug 23 2017, 6:47 PM · Community-Tech, MediaWiki-extensions-ArticleCreationWorkflow, Performance-Team
aaron assigned T173796: Performance review of ArticleCreationWorkflow extension to Krinkle.
Aug 23 2017, 6:47 PM · Community-Tech, MediaWiki-extensions-ArticleCreationWorkflow, Performance-Team

Aug 22 2017

aaron added a comment to T172357: ChronologyProtector redirect optimization depends on inappropriate $wgLocalVirtualHosts setting.

Looks like using WikiMap is the best choice IMO.

Aug 22 2017, 7:30 PM · MW-1.30-release-notes, Performance-Team, Deployments, MediaWiki-extensions-CentralAuth
aaron added a comment to T173710: Job queue is increasing non-stop.

Mostly htmlCacheUpdate jobs on wikidatawiki:

Aug 22 2017, 7:28 PM · Patch-For-Review, Services (watching), Performance-Team (Radar), CirrusSearch, Discovery, Wikidata-Sprint, Wikidata, Operations, MediaWiki-JobQueue
aaron placed T150506: Run lazyImportLocalNames() on creation and run script to backfill them up for grabs.
Aug 22 2017, 7:23 PM · Availability (Multiple-active-datacenters), MediaWiki-extensions-CentralAuth, Performance-Team
jcrespo awarded T172559: Ensure getLagTimes.php is working properly a Pterodactyl token.
Aug 22 2017, 4:56 PM · MW-1.30-release-notes, Operations, monitoring, Performance-Team

Aug 17 2017

aaron added a comment to T154424: TransactionProfiler should not apply to SqlBagOStuff.

In config, with $wgTrxProfilerLimits you can do:

Aug 17 2017, 11:05 PM · MW-1.29-release-notes, MediaWiki-Cache, MediaWiki-Database

Aug 16 2017

aaron added a comment to T172357: ChronologyProtector redirect optimization depends on inappropriate $wgLocalVirtualHosts setting.

Another option is to have the MediaWiki.php code use InterWikiLookup::getAllPrefixes() and the 'iw_url' field for each item.

Aug 16 2017, 7:21 PM · MW-1.30-release-notes, Performance-Team, Deployments, MediaWiki-extensions-CentralAuth
aaron moved T173450: Setup grafana alert for job error rate from Inbox to Next-up on the Performance-Team board.
Aug 16 2017, 7:14 PM · Performance-Team
aaron triaged T173450: Setup grafana alert for job error rate as Low priority.
Aug 16 2017, 7:14 PM · Performance-Team
aaron created T173450: Setup grafana alert for job error rate.
Aug 16 2017, 7:08 PM · Performance-Team
aaron moved T172941: Track duplicate parses on page save from Inbox to Doing on the Performance-Team board.
Aug 16 2017, 6:54 PM · Performance-Team
aaron triaged T172941: Track duplicate parses on page save as Normal priority.
Aug 16 2017, 6:53 PM · Performance-Team

Aug 14 2017

aaron closed T171371: Investigate 30x increase in Jobrunner errors as Resolved.

Closing. The two remaining logging-related improvement action items have their own tasks.

Aug 14 2017, 8:49 PM · MW-1.30-release-notes, Patch-For-Review, Release-Engineering-Team (Watching / External), Regression, Performance-Team, JobRunner-Service
aaron updated the task description for T171371: Investigate 30x increase in Jobrunner errors.
Aug 14 2017, 8:49 PM · MW-1.30-release-notes, Patch-For-Review, Release-Engineering-Team (Watching / External), Regression, Performance-Team, JobRunner-Service
aaron updated the task description for T171371: Investigate 30x increase in Jobrunner errors.
Aug 14 2017, 8:48 PM · MW-1.30-release-notes, Patch-For-Review, Release-Engineering-Team (Watching / External), Regression, Performance-Team, JobRunner-Service
aaron added a comment to T171371: Investigate 30x increase in Jobrunner errors.

Error rate went from 500-1000/s to 50-80/s.

Aug 14 2017, 8:24 PM · MW-1.30-release-notes, Patch-For-Review, Release-Engineering-Team (Watching / External), Regression, Performance-Team, JobRunner-Service
aaron added a comment to T171371: Investigate 30x increase in Jobrunner errors.

Mentioned in SAL (#wikimedia-operations) [2017-08-12T20:00:52Z] <krinkle@tin> Synchronized php-1.30.0-wmf.13/includes/jobqueue/JobQueueGroup.php: T171371 - Log job pushes to bogus wikis (duration: 00m 53s)

Aug 14 2017, 8:06 PM · MW-1.30-release-notes, Patch-For-Review, Release-Engineering-Team (Watching / External), Regression, Performance-Team, JobRunner-Service