aaron (Aaron Schulz)
User

User Details

User Since
Oct 20 2014, 5:25 PM (182 w, 3 d)
Availability
Available
IRC Nick
AaronSchulz
LDAP User
Aaron Schulz
MediaWiki User
Aaron Schulz

Recent Activity

Today

aaron added a comment to T192473: deployment-prep has jobqueue/caching issues.

The warnings are pointless, the patch above adds an isset() check.

Thu, Apr 19, 4:42 AM · MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), Patch-For-Review, Puppet, Beta-Cluster-Infrastructure

Thu, Apr 12

aaron added a comment to T191802: [Epic] Determine a strategy to store files between 5 and 100 Gb.

This is related to T149847 in that we would *have* to stop moving file content around in Special:MovePage just to rename files.

Thu, Apr 12, 5:35 PM · media-storage, Multimedia
aaron added a parent task for T149847: RFC: Use content hash based image / thumb URLs: T191802: [Epic] Determine a strategy to store files between 5 and 100 Gb.
Thu, Apr 12, 5:35 PM · Performance-Team (Radar), Services (later), Traffic, Operations, TechCom-RFC, Zero, Wikipedia-iOS-App-Backlog, Wikipedia-Android-App-Backlog, Reading-Admin, Commons, Epic, RESTBase-API, Parsoid, Multimedia, MediaWiki-File-management
aaron added a subtask for T191802: [Epic] Determine a strategy to store files between 5 and 100 Gb: T149847: RFC: Use content hash based image / thumb URLs.
Thu, Apr 12, 5:34 PM · media-storage, Multimedia

Wed, Apr 11

aaron added a comment to T191916: Warning: Destructor threw an object exception: exception 'Wikimedia\Rdbms\DBUnexpectedError' with message 'Wikimedia\Rdbms\Database::close: Expected mass commit of all peer transactions (DBO_TRX set).' in /srv/mediawiki/php-1.31.0-wmf.29/includes/libs/rdbms/database/Database.php:3602.

I suspect the transactions are just empty ones with SELECT statements, which don't need to give errors here.

Wed, Apr 11, 5:28 AM · MW-1.31-release-notes (WMF-deploy-2018-04-10 (1.31.0-wmf.29)), Performance-Team, MediaWiki-Database, Patch-For-Review, Wikimedia-log-errors

Tue, Apr 10

aaron added a comment to T175834: TranslatablePageMoveJob commit while in atomic sections.

The message index code could do with a large amount of rework. In the meantime, I can't tell why the MessageIndexRebuildJob::newJob() instance must run immediately in isValid()...it's not like the method rechecks what it did before after the rebuild. If nothing else depends on it being immediate, then it should use a DeferredUpdate. If it has to be immediate...then CONN_TRX_AUTO can be considered (as long as it doesn't deadlock by having two transactions updating the same rows).

Tue, Apr 10, 11:13 PM · Language-2018-Apr-June, MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), Patch-For-Review, Language-Team, MediaWiki-extensions-Translate, Wikimedia-log-errors

Mon, Apr 9

aaron added a subtask for T151466: Performance Q2 2017/18 goal: Install and use mcrouter in deployment-prep: T190979: build new version of mcrouter package.
Mon, Apr 9, 6:52 PM · Release-Engineering-Team (Watching / External), Availability (MediaWiki-MultiDC), Beta-Cluster-Infrastructure, Performance-Team
aaron added a parent task for T190979: build new version of mcrouter package: T151466: Performance Q2 2017/18 goal: Install and use mcrouter in deployment-prep.
Mon, Apr 9, 6:52 PM · Patch-For-Review, User-Joe, Operations

Wed, Apr 4

aaron added a comment to T190960: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query".

@aaron - if you don't like the current model, should we think of an alternative, simpler one based on the heartbeat table, or is this still ok for you?

Wed, Apr 4, 9:08 PM · Wikimedia-Incident, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), User-notice, Patch-For-Review, DBA, Wikimedia-log-errors

Thu, Mar 29

aaron added a comment to T190960: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query".

I don't mean "noise" as "unrelated to deploy", rather "expected, but doesn't matter".

Thu, Mar 29, 9:38 PM · Wikimedia-Incident, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), User-notice, Patch-For-Review, DBA, Wikimedia-log-errors
aaron added a comment to T190960: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query".

The temp table stat increases just seem like noise due to some queries going from "SELECT @@" to "SHOW GLOBAL VARIABLES LIKE 'gtid_%'". E.g.:

Thu, Mar 29, 2:19 AM · Wikimedia-Incident, MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), User-notice, Patch-For-Review, DBA, Wikimedia-log-errors

Sat, Mar 24

aaron added a comment to T189999: Enforce database transaction rollback conventions to protect against certain try/catch patterns.

DBO_IGNORE can only be enabled through the config or Database::factory directly. This flag is now largely irrelevant and could probably be finished off with a deprecation.

Sat, Mar 24, 12:06 PM · MW-1.31-release-notes (WMF-deploy-2018-04-10 (1.31.0-wmf.29)), Patch-For-Review, MediaWiki-Database

Thu, Mar 22

aaron added a comment to T190396: Consider splitting the IDatabase interface.

It's kind of hard to do this in practice, given the use of load balancers and so on. Some stuff can be removed, deprecated, or moved to IMaintainableDatabase though.

Thu, Mar 22, 3:48 PM · MW-1.31-release-notes (WMF-deploy-2018-03-27 (1.31.0-wmf.27)), Patch-For-Review, User-Addshore, MediaWiki-Database

Mar 19 2018

aaron updated the task description for T189999: Enforce database transaction rollback conventions to protect against certain try/catch patterns.
Mar 19 2018, 2:25 AM · MW-1.31-release-notes (WMF-deploy-2018-04-10 (1.31.0-wmf.29)), Patch-For-Review, MediaWiki-Database
aaron created T189999: Enforce database transaction rollback conventions to protect against certain try/catch patterns.
Mar 19 2018, 1:34 AM · MW-1.31-release-notes (WMF-deploy-2018-04-10 (1.31.0-wmf.29)), Patch-For-Review, MediaWiki-Database

Mar 14 2018

aaron closed T188875: Unexpected errors when ROLLBACK fails due to the DB server having "gone away" as Resolved.
Mar 14 2018, 8:07 AM · MW-1.31-release-notes (WMF-deploy-2018-03-20 (1.31.0-wmf.26)), Performance-Team, Patch-For-Review, MediaWiki-Database
aaron renamed T188875: Unexpected errors when ROLLBACK fails due to the DB server having "gone away" from Slow transaction killer breaks error handling to Unexpected errors when ROLLBACK fails due to the DB server having "gone away".
Mar 14 2018, 8:07 AM · MW-1.31-release-notes (WMF-deploy-2018-03-20 (1.31.0-wmf.26)), Performance-Team, Patch-For-Review, MediaWiki-Database

Mar 10 2018

mmodell awarded T187956: Array to string conversion in MySQLMasterPos.php on line 41 a Cookie token.
Mar 10 2018, 10:54 PM · MediaWiki-Database, Wikimedia-log-errors
aaron closed T187956: Array to string conversion in MySQLMasterPos.php on line 41 as Resolved.
Mar 10 2018, 10:11 PM · MediaWiki-Database, Wikimedia-log-errors

Mar 8 2018

aaron added a comment to T188875: Unexpected errors when ROLLBACK fails due to the DB server having "gone away".

Note that the code making the "Expectation (readQueryTime <= 30) by JobRunner::run" logs does not roll anything back.

Mar 8 2018, 8:29 PM · MW-1.31-release-notes (WMF-deploy-2018-03-20 (1.31.0-wmf.26)), Performance-Team, Patch-For-Review, MediaWiki-Database

Mar 4 2018

aaron added a comment to T188721: Global rename of Erik_Fastman to Glorious_Engine stuck "in progress" since 28th February on wikidatawiki.

Reconnecting in the case of rollback is a corner case, since normally just closing like that should error out. If ROLLBACK fails due to connection loss, there really isn't a need to reconnect, since everything should have rolled back on connection loss in the first place. Some sort of flag to disable reconnection during rollback would be needed.

Mar 4 2018, 8:50 PM · GlobalRename, MediaWiki-extensions-CentralAuth, Wikimedia-Site-requests

Mar 3 2018

aaron added a comment to T184670: [wmf.16-regression] Fatal exception of type "Flow\Exception\InvalidDataException" for opting out from "Structured Discussions on user talk".

Don't null revisions just reuse the same rev_text_id and insert no new blob? At least that's how it used to work.

Mar 3 2018, 3:13 AM · MW-1.31-release-notes (WMF-deploy-2018-02-27 (1.31.0-wmf.23)), StructuredDiscussions, User-notice-collaboration, Patch-For-Review, Collaboration-Team-Triage (Collab-Team-This-Quarter), Regression
aaron triaged T188801: Migrate wl_notificationtimestamp updates to the job queue as Normal priority.
Mar 3 2018, 12:46 AM · MediaWiki-Watchlist, Patch-For-Review, Availability (MediaWiki-MultiDC)
aaron added a comment to T160993: MysqlUpdater::doWatchlistUpdate is very slow.

Is there anything actionable here?

Mar 3 2018, 12:41 AM · MW-1.31-release-notes (WMF-deploy-2018-03-06 (1.31.0-wmf.24)), Performance, MediaWiki-Database

Mar 2 2018

aaron committed R1981:fcfe3879da95: Fix comment typo (authored by aaron).
Fix comment typo
Mar 2 2018, 10:24 AM

Feb 27 2018

aaron added a comment to T97562: WANObjectCache relay daemon or mcrouter support.

For reference, there is T156938, for evaluating dynomite.

Feb 27 2018, 11:28 PM · Services (watching), Availability (MediaWiki-MultiDC), Analytics, User-mobrovac, EventBus
aaron added a comment to T97562: WANObjectCache relay daemon or mcrouter support.
In T97562#3977706, @Joe wrote:

I'm reopening this since the status of the FLOSS mcrouter project in the last year has been dire:

  • It has been a year (!!!) since their last release
  • There is no indication of what could be stable or not
  • The build has changed radically between 0.24.0 (the version I packaged) and the current 0.36.0 (the last tagged version, already 1 year old), and it's broken again. I had to spend almost a week making the first build behave, and it seems I'd need to spend a similar amount of time this time around.

    This circles me back to looking at alternatives. @aaron I'm taking a look at Netflix's dynomite, which I'm not sure would do what we want exactly, but right now the situation of the FLOSS version of mcrouter is not such I can endorse its production use.
Feb 27 2018, 11:24 PM · Services (watching), Availability (MediaWiki-MultiDC), Analytics, User-mobrovac, EventBus
RandomDSdevel awarded T187942: Replication lag detection broken in wmf.22 a Baby Tequila token.
Feb 27 2018, 9:52 PM · MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), User-notice, Patch-For-Review, Performance-Team, MediaWiki-Database, Wikimedia-log-errors

Feb 25 2018

aaron closed T187942: Replication lag detection broken in wmf.22 as Resolved.
Feb 25 2018, 2:38 PM · MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), User-notice, Patch-For-Review, Performance-Team, MediaWiki-Database, Wikimedia-log-errors
aaron closed T187942: Replication lag detection broken in wmf.22, a subtask of T183961: 1.31.0-wmf.22 deployment blockers, as Resolved.
Feb 25 2018, 2:38 PM · Release-Engineering-Team (Kanban), Release, Train Deployments

Feb 22 2018

aaron added a comment to T187980: Memcached error "A TIMEOUT OCCURRED" for key "WANCache:v:enwiki:sidebar:en".

Since the value is "false", the callback runs, unless it's running somewhere else and there is no interim value. When this happens a lot in a short time, there will be interim values (lasting up to 30 sec) used, unless they also return false due to some memcached error. If everything returns false, then the callback runs all the time, regardless of the mutex. It won't be empty though.

Feb 22 2018, 11:08 PM · Performance-Team (Radar), Wikimedia-log-errors, MediaWiki-Cache, MediaWiki-Interface
aaron added a comment to T187980: Memcached error "A TIMEOUT OCCURRED" for key "WANCache:v:enwiki:sidebar:en".

I noticed that too yesterday. Note that there is a PECL memcached bug that causes things to say TIMEOUT after a KEY TOO LONG or VALUE TOO LARGE error, which makes for confusing failures and logs. I'm not sure if that is at play, but it wouldn't surprise me, and statistically it would affect the most-fetched keys (whatever they are).

Feb 22 2018, 5:18 PM · Performance-Team (Radar), Wikimedia-log-errors, MediaWiki-Cache, MediaWiki-Interface
aaron added a comment to T187942: Replication lag detection broken in wmf.22.

The php warning is noise. The "Database is read-only" flood is an actual bug...no idea why that happened.

Feb 22 2018, 3:18 AM · MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), User-notice, Patch-For-Review, Performance-Team, MediaWiki-Database, Wikimedia-log-errors
aaron added a comment to T187942: Replication lag detection broken in wmf.22.

$dbr->getLag() and $lb->getLagTimes() work fine in eval.php on wmf22 wikis as well.

Feb 22 2018, 2:37 AM · MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), User-notice, Patch-For-Review, Performance-Team, MediaWiki-Database, Wikimedia-log-errors
aaron added a comment to T187942: Replication lag detection broken in wmf.22.

I've been looking at the 21->22 logs, changes, and trying things on mw.org. I don't see a read-only problem there and https://www.mediawiki.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb= looks fine. I don't see any lag in the DBs or seen by MW in that time (LoadBalancer graph at Grafana, though the resolution is low).

Feb 22 2018, 2:32 AM · MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), User-notice, Patch-For-Review, Performance-Team, MediaWiki-Database, Wikimedia-log-errors

Feb 20 2018

zeljkofilipin awarded T185328: "User should be able to change preferences" Selenium test fails when targeting mediawiki-vagrant a Party Time token.
Feb 20 2018, 9:07 AM · MW-1.31-release-notes (WMF-deploy-2018-02-06 (1.31.0-wmf.20)), Patch-For-Review, Performance-Team (Radar), MediaWiki-Cache, MediaWiki-Vagrant, User-zeljkofilipin, Release-Engineering-Team (Kanban)

Feb 17 2018

aaron updated the task description for T185664: FlaggedRevs: code stewardship review.
Feb 17 2018, 9:06 PM · MediaWiki-extensions-FlaggedRevs, Code-Stewardship-Reviews

Feb 15 2018

aaron closed T186947: many statistics have fallen to 0 on azwiktionary, ruwikiquote, and ptwikisource as Resolved.
Feb 15 2018, 11:01 PM · MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), Patch-For-Review, Performance-Team, MediaWiki-General-or-Unknown, MediaWiki-Special-pages

Feb 14 2018

aaron placed T169249: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam up for grabs.
Feb 14 2018, 9:48 PM · Patch-For-Review, Performance-Team, Operations

Feb 8 2018

aaron committed rESRXa24942ef371e: Add TTL to set() call (authored by aaron).
Add TTL to set() call
Feb 8 2018, 8:51 PM
aaron committed rERXBa39a6a4ac147: Add TTL to set() call (authored by aaron).
Add TTL to set() call
Feb 8 2018, 8:50 PM

Feb 7 2018

aaron added a comment to T184854: hhvm memcached and php7 memcached extensions do not play well together.

I see, hhvm works with and without the flags, so they could be set in the background.

Feb 7 2018, 10:23 PM · PHP 7.0 support, Performance-Team (Radar), MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), Patch-For-Review, User-ArielGlenn, MediaWiki-Platform-Team
aaron added a comment to T184854: hhvm memcached and php7 memcached extensions do not play well together.

Lots of keys use no value, 0, or TTL_INDEFINITE (all infinite), so there will be a lot of old keys.

Feb 7 2018, 10:07 PM · PHP 7.0 support, Performance-Team (Radar), MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), Patch-For-Review, User-ArielGlenn, MediaWiki-Platform-Team
aaron added a comment to T186752: Swap objectcache table for MEMORY engine?.

MEMORY tables were kind of lame last time anyone checked, though I suppose someone can take a look. I doubt it would be too useful given a good innodb buffer pool size.

Feb 7 2018, 9:50 PM · MediaWiki-Cache, MediaWiki-Database
aaron added a comment to T152934: Log accessing private information by those with 'abusefilter-private' permission.

Sorry about the slow review...this extension has a bit of an ownership problem, with random people stepping in for CR. I was thinking someone else would have merged this by now.

Feb 7 2018, 9:48 PM · Epic, MW-1.31-release-notes (WMF-deploy-2018-02-13 (1.31.0-wmf.21)), Stewards-and-global-tools, Security-Team, AbuseFilter

Feb 6 2018

aaron closed T185328: "User should be able to change preferences" Selenium test fails when targeting mediawiki-vagrant as Resolved.

Verified by local selenium test runs (passes with the fix and fails without the fix).

Feb 6 2018, 11:56 PM · MW-1.31-release-notes (WMF-deploy-2018-02-06 (1.31.0-wmf.20)), Patch-For-Review, Performance-Team (Radar), MediaWiki-Cache, MediaWiki-Vagrant, User-zeljkofilipin, Release-Engineering-Team (Kanban)

Jan 26 2018

aaron added a comment to T185328: "User should be able to change preferences" Selenium test fails when targeting mediawiki-vagrant.

Do these tests actually use replication, or is it a single DB server? Header logs would also be useful.

Jan 26 2018, 8:00 PM · MW-1.31-release-notes (WMF-deploy-2018-02-06 (1.31.0-wmf.20)), Patch-For-Review, Performance-Team (Radar), MediaWiki-Cache, MediaWiki-Vagrant, User-zeljkofilipin, Release-Engineering-Team (Kanban)

Jan 17 2018

aaron added a comment to T161190: Configure External Store on Vagrant with blobs tables in databases named after wiki DB and proper isolation.

@aaron Am I right that having two databases with the same name (i.e. "enwiki where revision, page, etc. are" and "enwiki where blobs is") on the same machine requires two different MySQL data directories and ports/sockets?

Jan 17 2018, 10:35 PM · Patch-For-Review, MediaWiki-Database, MediaWiki-Vagrant
aaron added a comment to T185055: Stack overflow when Redis is down.

So I cannot contact redis via nutcracker on tin. I noticed the password was not actually set for redis (trying to AUTH when no password set results in an error); using CONFIG SET requirepass <x> didn't make a difference though. In any case, I can use redis-cli to talk to the local redis instance on 01/02 themselves. I'm not sure how much of this is nutcracker vs redis. Restarting either does not help.

Jan 17 2018, 6:07 PM · Beta-Cluster-Infrastructure, Performance-Team (Radar), MediaWiki-JobQueue, Operations, Beta-Cluster-reproducible

Jan 13 2018

aaron added a comment to T182322: ChronologyProtector breaks if two requests write different sets of databases.

Yes, https://gerrit.wikimedia.org/r/396546 .

Jan 13 2018, 8:55 PM · MW-1.31-release-notes (WMF-deploy-2018-01-16 (1.31.0-wmf.17)), Patch-For-Review, MediaWiki-Database, Wikidata, Performance-Team, User-Addshore, User-notice

Jan 10 2018

aaron added a comment to T171071: Perform testing for TLS effect on connection rate.

I fixed a stupid hostname var bug. Now I get numbers that make sense:

Same-DC (db2070.codfw.wmnet):
string(57) "0.001196186542511 sec/conn (non-SSL) [db2070.codfw.wmnet]"
string(60) "0.00027136325836182 sec/query (non-SSL) [db2070.codfw.wmnet]"
string(53) "0.059528641700745 sec/conn (SSL) [db2070.codfw.wmnet]"
string(56) "0.00028834581375122 sec/query (SSL) [db2070.codfw.wmnet]"
Cross-DC (db1055.eqiad.wmnet):
string(56) "0.10918385744095 sec/conn (non-SSL) [db1055.eqiad.wmnet]"
string(57) "0.03636349439621 sec/query (non-SSL) [db1055.eqiad.wmnet]"
string(52) "0.25189030647278 sec/conn (SSL) [db1055.eqiad.wmnet]"
string(54) "0.036419949531555 sec/query (SSL) [db1055.eqiad.wmnet]"
Jan 10 2018, 12:09 AM · Patch-For-Review, Availability (MediaWiki-MultiDC), DBA, Operations, Performance-Team
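
The timings in the T171071 comment above can be turned into overhead figures with a little arithmetic. The sketch below is only an illustration: the numbers are copied from that comment, while the `timings` layout and the `overhead` helper are hypothetical names introduced here and are not part of the original benchmark script.

```python
# Per-connection and per-query timings (seconds), copied from the comment above.
# The dictionary layout and the helper name are assumptions made for this sketch.
timings = {
    "same-DC (db2070.codfw.wmnet)": {
        "conn": {"plain": 0.001196186542511, "ssl": 0.059528641700745},
        "query": {"plain": 0.00027136325836182, "ssl": 0.00028834581375122},
    },
    "cross-DC (db1055.eqiad.wmnet)": {
        "conn": {"plain": 0.10918385744095, "ssl": 0.25189030647278},
        "query": {"plain": 0.03636349439621, "ssl": 0.036419949531555},
    },
}


def overhead(plain, ssl):
    """Return (extra seconds added by SSL, ssl/plain ratio)."""
    return ssl - plain, ssl / plain


for host, kinds in timings.items():
    for kind, t in kinds.items():
        extra, ratio = overhead(t["plain"], t["ssl"])
        print(f"{host} {kind}: +{extra * 1000:.3f} ms, x{ratio:.2f}")
```

On these figures the SSL cost appears to sit almost entirely in connection setup, while per-query time is nearly unchanged in both the same-DC and cross-DC cases.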

Jan 9 2018

aaron added a comment to T184529: Define a way to get a database connection based on a logical wiki ID..

I see wiki IDs as a type of "domain ID" that just uses two ASCII components, (dbname,prefix), neither using slashes, to avoid the ugliness of things like "mysite?hnewswiki-en" having to appear in config or in "table_wiki" DB fields. For B/C, the non-slash rule can't be a hard rule that throws errors. Given that, the getWiki() functions should use known-to-be-encoded wiki ID values or use DatabaseDomain to derive them. There could be a stricter WikiDatabaseDomain subclass. Changing those methods would probably both fix and break things for the slash-scenario; maybe the "doesn't use domain hierarchy delimiter character" restriction could then be enforced by default behind a flag that could be disabled for legacy-mode.

Jan 9 2018, 11:00 PM · User-Daniel, MediaWiki-Database
RandomDSdevel awarded T182322: ChronologyProtector breaks if two requests write different sets of databases a Doubloon token.
Jan 9 2018, 1:42 AM · MW-1.31-release-notes (WMF-deploy-2018-01-16 (1.31.0-wmf.17)), Patch-For-Review, MediaWiki-Database, Wikidata, Performance-Team, User-Addshore, User-notice

Dec 14 2017

aaron added a comment to T171071: Perform testing for TLS effect on connection rate.

I keep coming up with times like:

Dec 14 2017, 9:55 PM · Patch-For-Review, Availability (MediaWiki-MultiDC), DBA, Operations, Performance-Team
aaron moved T171071: Perform testing for TLS effect on connection rate from Blocked to Doing on the Performance-Team board.
Dec 14 2017, 9:45 PM · Patch-For-Review, Availability (MediaWiki-MultiDC), DBA, Operations, Performance-Team

Dec 12 2017

aaron added a comment to T173450: Setup grafana alert for job error rate.

I started a quick dashboard at https://grafana.wikimedia.org/dashboard/db/job-queue-alerts?orgId=1&from=now-12h&to=now with some alerts.

Dec 12 2017, 11:29 PM · Performance-Team
aaron added a comment to T175672: Make apache/maintenance hosts TLS connections to mariadb work.

@aaron the proxy is installed but unconfigured, - we still have to fix some issues with the start and process, but do you want me to point it to the real master? Do you want me to point it to a soon to be setup master test host?

Dec 12 2017, 6:29 PM · Performance-Team (Radar), Availability (MediaWiki-MultiDC), DBA, Operations

Dec 11 2017

aaron added a comment to T173450: Setup grafana alert for job error rate.

I suppose we can use jobrunner.runner-status.error.rate, sumSeries(jobrunner.pop.*.failed.*.rate), and sumSeries(jobrunner.pop.*.ok.*.rate) to make alerts in a Grafana dashboard.

Dec 11 2017, 9:53 PM · Performance-Team
aaron closed T182390: 2017-12-07 Huge SaveTiming spike as Resolved.

Yeah, same thing.

Dec 11 2017, 7:38 PM · Performance-Team

Dec 8 2017

aaron added a comment to T182322: ChronologyProtector breaks if two requests write different sets of databases.

I'm not sure why the time check logic is so complicated, I guess it got prematurely generalized from the single-DB case.

Dec 8 2017, 8:33 PM · MW-1.31-release-notes (WMF-deploy-2018-01-16 (1.31.0-wmf.17)), Patch-For-Review, MediaWiki-Database, Wikidata, Performance-Team, User-Addshore, User-notice

Dec 7 2017

aaron updated the task description for T151466: Performance Q2 2017/18 goal: Install and use mcrouter in deployment-prep.
Dec 7 2017, 6:53 AM · Release-Engineering-Team (Watching / External), Availability (MediaWiki-MultiDC), Beta-Cluster-Infrastructure, Performance-Team

Dec 6 2017

Envlh awarded T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007 a Heartbreak token.
Dec 6 2017, 8:50 AM · MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), Performance-Team, Wikidata, Datasets-General-or-Unknown

Dec 5 2017

aaron moved T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007 from Doing to Blocked on the Performance-Team board.
Dec 5 2017, 11:36 PM · MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), Performance-Team, Wikidata, Datasets-General-or-Unknown
aaron closed T178531: Add statsd metric to WANObjectCache as Resolved.

Probably some MW fixes actually reaching production.

Dec 5 2017, 11:07 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron added a comment to T180035: MediaWiki core Selenium tests fail when targeting Vagrant.

Does this still occur?

Dec 5 2017, 4:19 AM · Performance-Team (Radar), MediaWiki-Cache, MediaWiki-Vagrant, Release-Engineering-Team (Kanban), User-zeljkofilipin
aaron closed T180793: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs as Resolved.
Dec 5 2017, 4:17 AM · MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)), Patch-For-Review, DBA, Wikimedia-log-errors, Performance-Team
aaron added a comment to T180793: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs.

By reducing the lock max wait times and pushing the brunt of lag waits out of the critical section, less real time should be wasted.

Dec 5 2017, 4:17 AM · MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)), Patch-For-Review, DBA, Wikimedia-log-errors, Performance-Team

Dec 4 2017

aaron moved T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007 from Inbox to Doing on the Performance-Team board.
Dec 4 2017, 9:00 PM · MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), Performance-Team, Wikidata, Datasets-General-or-Unknown

Dec 2 2017

aaron added a comment to T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007.

Change 394779 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Try to opportunistically flush statsd data in maintenance scripts

https://gerrit.wikimedia.org/r/394779

Dec 2 2017, 9:20 PM · MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), Performance-Team, Wikidata, Datasets-General-or-Unknown
aaron added a comment to T178531: Add statsd metric to WANObjectCache.

There is some caller that is not making keys correctly, which causes this. I can't find any more looking through all of core and extensions and mediawiki-config.

Thanks for looking into it! In the meantime I've blackholed said metrics to avoid graphite disks filling up

Dec 2 2017, 8:27 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron added a comment to T178531: Add statsd metric to WANObjectCache.

Change 394493 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.10] Add temporary logging for bad WAN cache statsd keys

https://gerrit.wikimedia.org/r/394493

Dec 2 2017, 8:26 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron added a comment to T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007.

How long do these run? The sample rate in config is set to be extremely low. So perhaps:

  • The buffering class buffers things that won't even be saved
  • The buffering could be disabled in CLI mode
Dec 2 2017, 8:18 PM · MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), Performance-Team, Wikidata, Datasets-General-or-Unknown

Nov 29 2017

aaron added a comment to T180035: MediaWiki core Selenium tests fail when targeting Vagrant.

I noticed a worse bug of cpPosTime cookies not being used (not related to WAN cache). The patch for that is above.

Change 393983 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Make ChronologyProtector actually use cpPosTime cookies

https://gerrit.wikimedia.org/r/393983

Nov 29 2017, 5:51 AM · Performance-Team (Radar), MediaWiki-Cache, MediaWiki-Vagrant, Release-Engineering-Team (Kanban), User-zeljkofilipin
aaron added a comment to T180035: MediaWiki core Selenium tests fail when targeting Vagrant.

I noticed a worse bug of cpPosTime cookies not being used (not related to WAN cache). The patch for that is above.

Nov 29 2017, 4:52 AM · Performance-Team (Radar), MediaWiki-Cache, MediaWiki-Vagrant, Release-Engineering-Team (Kanban), User-zeljkofilipin
aaron added a comment to T180035: MediaWiki core Selenium tests fail when targeting Vagrant.

The simple thing is to not set INTERIM keys in the same request that purged them. The duration of that rule would be HOLDOFF_TTL so that the array holding the purged keys doesn't get too big for long-running maintenance scripts. This can be done with a HashBagOStuff nested in the WAN cache object easily enough.

Nov 29 2017, 12:52 AM · Performance-Team (Radar), MediaWiki-Cache, MediaWiki-Vagrant, Release-Engineering-Team (Kanban), User-zeljkofilipin
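
The rule described in the comment above (skip interim values for keys this process purged recently, and forget those keys after a hold-off period) can be sketched roughly as follows. This is a minimal illustration in Python, not MediaWiki's implementation: `RecentPurges`, `note_purge`, `may_set_interim`, and the 10-second hold-off are all hypothetical, and a plain dict stands in for the nested HashBagOStuff.

```python
import time


class RecentPurges:
    """Remembers keys purged by this process for a limited hold-off period."""

    def __init__(self, holdoff_ttl, clock=time.monotonic):
        self.holdoff_ttl = holdoff_ttl  # seconds a purge is remembered
        self.clock = clock              # injectable so tests can fake time
        self._purged_at = {}            # key -> time of the purge

    def note_purge(self, key):
        """Record that this process just purged `key`."""
        self._prune()
        self._purged_at[key] = self.clock()

    def may_set_interim(self, key):
        """False while `key` was purged here within the hold-off period."""
        self._prune()
        return key not in self._purged_at

    def _prune(self):
        # Drop expired entries so long-running scripts don't grow the map forever.
        cutoff = self.clock() - self.holdoff_ttl
        self._purged_at = {k: t for k, t in self._purged_at.items() if t > cutoff}


# Usage with a fake clock (all values are illustrative):
now = [100.0]
purges = RecentPurges(holdoff_ttl=10, clock=lambda: now[0])
purges.note_purge("sidebar:en")
print(purges.may_set_interim("sidebar:en"))  # still inside the hold-off
now[0] += 10
print(purges.may_set_interim("sidebar:en"))  # hold-off has elapsed
```

Pruning on every call keeps the map bounded by the number of keys purged within one hold-off window, which is the concern the comment raises for long-running maintenance scripts.
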
aaron added a comment to T180035: MediaWiki core Selenium tests fail when targeting Vagrant.

This looks like an integration issue with ChronologyProtector vs WANObjectCache.

Nov 29 2017, 12:43 AM · Performance-Team (Radar), MediaWiki-Cache, MediaWiki-Vagrant, Release-Engineering-Team (Kanban), User-zeljkofilipin

Nov 28 2017

aaron added a comment to T178531: Add statsd metric to WANObjectCache.

I guess we will need MW side logging now. Probably can just add it to wmf branch.

Nov 28 2017, 5:14 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron created T181528: Document the various parameters of each Job class.
Nov 28 2017, 5:02 PM · Documentation, MediaWiki-JobQueue

Nov 27 2017

aaron added a comment to T178531: Add statsd metric to WANObjectCache.

There is some caller that is not making keys correctly, which causes this. I can't find any more looking through all of core and extensions and mediawiki-config.

Nov 27 2017, 7:59 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team

Nov 23 2017

aaron renamed T181216: Get rid of pointless EnqueueJob usage from Get rid of enqueue job to Get rid of pointless EnqueueJob usage.
Nov 23 2017, 10:03 AM · MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), Patch-For-Review, Services (done), MediaWiki-JobQueue
aaron added a comment to T181216: Get rid of pointless EnqueueJob usage.

They were mentioned in https://www.mediawiki.org/wiki/Requests_for_comment/Master-slave_datacenter_strategy_for_MediaWiki#Job_queuing though it was never set up (partly from people being busy with other things). In general jobs are enqueued on POST requests or from other jobs, all in the master datacenter. In rare cases, jobs are enqueued on GET or possibly POST (if the api-promise-nonwrite thing is set up in VCL). This should work in a way where the cross-DC propagation is async, rather than having JobQueue::push() blocking on cross-DC traffic.

Nov 23 2017, 10:01 AM · MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), Patch-For-Review, Services (done), MediaWiki-JobQueue

Nov 21 2017

jcrespo awarded T180793: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs a Doubloon token.
Nov 21 2017, 11:57 AM · MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)), Patch-For-Review, DBA, Wikimedia-log-errors, Performance-Team

Nov 20 2017

aaron moved T180793: Frequent "Wikimedia\\Rdbms\\DatabaseMysqlBase::lock failed to acquire lock" errors on WMF mediawiki logs from Next-up to Doing on the Performance-Team board.
Nov 20 2017, 9:23 PM · MW-1.31-release-notes (WMF-deploy-2017-12-05 (1.31.0-wmf.11)), Patch-For-Review, DBA, Wikimedia-log-errors, Performance-Team
aaron moved T151466: Performance Q2 2017/18 goal: Install and use mcrouter in deployment-prep from Next-up to Doing on the Performance-Team board.
Nov 20 2017, 9:23 PM · Release-Engineering-Team (Watching / External), Availability (MediaWiki-MultiDC), Beta-Cluster-Infrastructure, Performance-Team

Nov 17 2017

aaron added a comment to T171881: CL support for Wikipedia Zero piracy problems.

On the MW side of (2) above, it appears the SwiftFileBackend code in MW uses PHP's urlencode to transform the filenames into upload URL paths. urlencode documentation claims that it percent-encodes everything but alphanumerics and -_. (so the set it does not encode is almost the official Unreserved Set, but it's missing the tilde). It also encodes spaces as + rather than %20 because it's meant for query strings rather than paths. PHP's rawurlencode would probably have been more appropriate here as it conforms to the RFC and excludes from encoding exactly the Unreserved Set and doesn't do the +-for-spaces thing. However, in practice, we can deal with the ~ issue and spaces have already been made into underscores, so the plusses shouldn't ever actually appear.

Regardless, this explanation seems consistent with observations of the upload.wm.o paths I've seen. We can normalize on similar rules there (but leave spaces as %20 just to be technically-correct, which again won't matter in practice). If at some later date we want to use a prettier normalization we can do that, too, but for now it would be simplest to leave the MediaWiki side alone and just conform everything else to its expectations.
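
The difference between the two encoding rules described above can be sketched in a few lines of Python. This does not call PHP; urlencode_like and rawurlencode_like are hypothetical helpers that only mimic the behaviour as described in the comment (everything but alphanumerics and -_. encoded, with + for spaces, versus exactly the RFC 3986 unreserved set left alone).

```python
import string

# RFC 3986 unreserved set: letters, digits, and - . _ ~
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
# Per the comment above, urlencode leaves the same set alone except the tilde.
URLENCODE_SAFE = UNRESERVED - {"~"}


def percent_encode(text, safe, space_as_plus=False):
    """Percent-encode the UTF-8 bytes of text, leaving characters in safe untouched."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in safe:
            out.append(ch)
        elif ch == " " and space_as_plus:
            out.append("+")
        else:
            out.append("%{:02X}".format(byte))
    return "".join(out)


def urlencode_like(text):
    # Assumed imitation of the query-string style: ~ is encoded, space becomes +.
    return percent_encode(text, URLENCODE_SAFE, space_as_plus=True)


def rawurlencode_like(text):
    # Assumed imitation of the RFC-style path encoding: ~ kept, space becomes %20.
    return percent_encode(text, UNRESERVED)
```

With these helpers, urlencode_like("a b~c.pdf") gives "a+b%7Ec.pdf" while rawurlencode_like("a b~c.pdf") gives "a%20b~c.pdf". For a name such as "File_name-1.pdf", with spaces already turned into underscores and no tilde, both return the input unchanged, which is why the mismatch rarely shows up in practice.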

Nov 17 2017, 11:26 PM · Patch-For-Review, Community-Liaisons (Oct-Dec 2017), Zero

Nov 16 2017

aaron added a comment to T178849: Click on fullImageLink <a> for PDF on File: page no longer rendering in browser.

So, the post_as_copy = true case works if SwiftFileBackend no longer blacklists Content-Type from non-PUTs. It would always re-assert the old value if nothing was passed in by the describe() caller.

Nov 16 2017, 8:41 PM · MediaWiki-File-management, Commons, MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, Regression, media-storage, Multimedia-Team-Working-Board, Multimedia
aaron added a comment to T178849: Click on fullImageLink <a> for PDF on File: page no longer rendering in browser.

We should be mindful of the Swift post_as_copy option when set to false. At the moment that does *not* allow changing Content-Type via POST.

Nov 16 2017, 8:24 PM · MediaWiki-File-management, Commons, MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, Regression, media-storage, Multimedia-Team-Working-Board, Multimedia

Nov 8 2017

aaron closed T178531: Add statsd metric to WANObjectCache as Resolved.
Nov 8 2017, 8:35 PM · MW-1.31-release-notes (WMF-deploy-2017-11-28 (1.31.0-wmf.10)), Patch-For-Review, MediaWiki-Cache, monitoring, Performance-Team
aaron closed T179999: CentralAuthUser::loadFromCache doesn't call the makeKey() methods as needed as Resolved.
Nov 8 2017, 8:35 PM · MW-1.31-release-notes (WMF-deploy-2017-10-31 (1.31.0-wmf.6)), MediaWiki-extensions-CentralAuth, Patch-For-Review, Performance-Team
aaron closed T179999: CentralAuthUser::loadFromCache doesn't call the makeKey() methods as needed, a subtask of T178634: 1.31.0-wmf.7 deployment blockers, as Resolved.
Nov 8 2017, 8:35 PM · RelEng-Archive-FY201718-Q2, Train Deployments, Release
aaron created T179999: CentralAuthUser::loadFromCache doesn't call the makeKey() methods as needed.
Nov 8 2017, 2:50 AM · MW-1.31-release-notes (WMF-deploy-2017-10-31 (1.31.0-wmf.6)), MediaWiki-extensions-CentralAuth, Patch-For-Review, Performance-Team

Oct 31 2017

aaron updated the task description for T151466: Performance Q2 2017/18 goal: Install and use mcrouter in deployment-prep.
Oct 31 2017, 10:27 PM · Release-Engineering-Team (Watching / External), Availability (MediaWiki-MultiDC), Beta-Cluster-Infrastructure, Performance-Team
aaron added a comment to T151466: Performance Q2 2017/18 goal: Install and use mcrouter in deployment-prep.

So, running mcrouter via screen -r with the config in /etc/mcrouter/mcrouter.json on tin seems to work fine. The pool replication works and the timings are comparable to twemproxy -- often better than twemproxy.

Oct 31 2017, 10:21 PM · Release-Engineering-Team (Watching / External), Availability (MediaWiki-MultiDC), Beta-Cluster-Infrastructure, Performance-Team

Oct 30 2017

aaron moved T171071: Perform testing for TLS effect on connection rate from Doing to Blocked on the Performance-Team board.
Oct 30 2017, 9:03 PM · Patch-For-Review, Availability (MediaWiki-MultiDC), DBA, Operations, Performance-Team

Oct 26 2017

aaron closed T175418: Create new instances memc05 and memc06 running memcached as Resolved.
Oct 26 2017, 10:26 PM · Release-Engineering-Team (Watching / External), Availability (MediaWiki-MultiDC), Beta-Cluster-Infrastructure
aaron closed T175418: Create new instances memc05 and memc06 running memcached, a subtask of T151466: Performance Q2 2017/18 goal: Install and use mcrouter in deployment-prep, as Resolved.
Oct 26 2017, 10:26 PM · Release-Engineering-Team (Watching / External), Availability (MediaWiki-MultiDC), Beta-Cluster-Infrastructure, Performance-Team

Oct 24 2017

aaron added a comment to T135261: {{REVISION*}} magic words should not display usernames and timestamps for null edits.

That "cannot merge" message is mostly useless and overly-technical in a Gerrit specific way (e.g. you can't "submit" without "+2", which is obvious anyway). Just look for "merge conflict" on the changeset page or where the patch shows up in listings, since that actually matters and is common.

Oct 24 2017, 1:54 AM · MW-1.31-release-notes (WMF-deploy-2017-11-07 (1.31.0-wmf.7)), MW-1.28-release (WMF-deploy-2016-06-14_(1.28.0-wmf.6)), Patch-For-Review, MediaWiki-Parser
aaron added a comment to T178857: Is the ConfirmAccount extension maintained?.

There have always been a lot of feature requests or bug reports due to misconfiguration/version-mismatch and so on. I don't really have the time anymore (and haven't for some time, in fact) to sift through and find the serious bugs. When I become aware of one I try to fix it, but if it's not major then I probably won't look at it.

Oct 24 2017, 1:52 AM · MediaWiki-extensions-ConfirmAccount

Oct 23 2017

aaron added a comment to T177073: Split the backend savetiming metric into submetrics.

Actually, I just moved them to https://grafana-admin.wikimedia.org/dashboard/db/backend-save-timing-breakdown?refresh=5m&orgId=1 .

Oct 23 2017, 8:50 PM · MW-1.31-release-notes (WMF-deploy-2017-10-10 (1.31.0-wmf.3)), Patch-For-Review, Performance-Team
aaron moved T171071: Perform testing for TLS effect on connection rate from Blocked to Doing on the Performance-Team board.
Oct 23 2017, 8:29 PM · Patch-For-Review, Availability (MediaWiki-MultiDC), DBA, Operations, Performance-Team