
Replace parsercache keys to something more meaningful on db-XXXX.php
Closed, ResolvedPublic

Description

Right now the parsercache structure on db-eqiad.php and db-codfw.php looks like:

$wmgParserCacheDBs = [
    '10.64.0.12'   => '10.192.0.104',  # pc2007, A1 4.4TB 256GB
    '10.64.32.72'  => '10.192.16.35',  # pc2008, B3 4.4TB 256GB
    '10.64.48.128' => '10.192.32.10',  # pc2009, C1 4.4TB 256GB
    # 'spare' => '10.192.48.14',  # pc2010, D3 4.4TB 256GB # spare host. Use it to replace any of the above if needed
];

The values in the first column (10.64.0.12, 10.64.32.72, 10.64.48.128) are not really IPs but sharding keys.
They are very confusing, especially while doing maintenance (pooling/depooling), and they are prone to causing human error.
e.g.: "oh, that IP isn't in use, it should be fine to delete"

Ideally we would like to replace them with something like:

$wmgParserCacheDBs = [
    'pc1' => '10.192.0.104',  # pc2007, A1 4.4TB 256GB
    'pc2' => '10.192.16.35',  # pc2008, B3 4.4TB 256GB
    'pc3' => '10.192.32.10',  # pc2009, C1 4.4TB 256GB
    # 'spare' => '10.192.48.14',  # pc2010, D3 4.4TB 256GB # spare host. Use it to replace any of the above if needed
];

We are not sure, though, what this change would cause:

  • Will that generate a miss as soon as they get changed in config?
  • Will that break something else mediawiki-wise?
  • The old keys would need to be purged manually after $retention_period probably?
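The first question can be explored with a small sketch. It assumes that the shard for a cache key is chosen by hashing the sharding label together with the key, so that renaming a label changes which server some keys land on. The functions `pickShardLabel()` and `countRemappedKeys()`, the md5-based selection, and the `examplewiki` key names are illustrative assumptions, not MediaWiki's actual implementation:

```php
<?php
// Simplified stand-in for a hash-based shard picker (an assumption made for
// illustration, not MediaWiki's code): the label with the lowest
// md5( label . separator . key ) wins. Expects a non-empty $shards array.
function pickShardLabel( array $shards, string $cacheKey ): string {
	$best = '';
	$bestHash = null;
	foreach ( array_keys( $shards ) as $label ) {
		$label = (string)$label;
		$hash = md5( $label . "\0" . $cacheKey );
		if ( $bestHash === null || strcmp( $hash, $bestHash ) < 0 ) {
			$best = $label;
			$bestHash = $hash;
		}
	}
	return $best;
}

// Counts how many sample keys resolve to a different server address once the
// labels change, even though the server addresses themselves stay the same.
function countRemappedKeys( array $old, array $new, int $numKeys ): int {
	$moved = 0;
	for ( $i = 0; $i < $numKeys; $i++ ) {
		$key = "examplewiki:pcache:idhash:$i"; // hypothetical key names
		if ( $old[pickShardLabel( $old, $key )] !== $new[pickShardLabel( $new, $key )] ) {
			$moved++;
		}
	}
	return $moved;
}

$old = [
	'10.64.0.12' => '10.192.0.104',
	'10.64.32.72' => '10.192.16.35',
	'10.64.48.128' => '10.192.32.10',
];
$new = [
	'pc1' => '10.192.0.104',
	'10.64.32.72' => '10.192.16.35',
	'10.64.48.128' => '10.192.32.10',
];

echo countRemappedKeys( $old, $new, 1000 ) . " of 1000 sample keys map to a different server\n";
```

Under that assumption, every remapped key is a cache miss until it is regenerated on its new server, and the copy left on the old server is only removed once it expires and is purged.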

Details

Related Gerrit Patches:
operations/mediawiki-config : master : db-eqiad.php: Depool pc1009
operations/mediawiki-config : master : db-codfw.php: Depool pc2009.
operations/mediawiki-config : master : db-eqiad,db-codfw.php: Change last parsercache key
operations/mediawiki-config : master : db-eqiad.php: Depool pc1008
operations/mediawiki-config : master : db-eqiad,db-codfw.php: Change second parsercache key
operations/mediawiki-config : master : db-eqiad.php: Depool pc1007
operations/mediawiki-config : master : db-eqiad.php: Change parsercache key
operations/mediawiki-config : master : db-codfw.php: Change parsercache key

Event Timeline


T210992

Shouldn't we wait for the TTL to effectively increase before doing more operations? (i.e. wait 22-30 days after the deploy was done)

Definitely - I was not going to do it next week, this was just a re-kickoff of the task :-)
We should also change codfw first, and navigate on codfw with x-wikimedia-debug to make sure nothing horrible and unexpected breaks by changing it.

Change 498322 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Change parsercache key

https://gerrit.wikimedia.org/r/498322

I would like to start moving this forward. I want to merge https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/498322/ which changes one parsercache key on codfw. I will then browse codfw looking for obvious breakages on logstash, and if nothing arises I would like to roll it out to db-eqiad.php as well.
Any objections to the above patchset to get codfw changed?

Change 498322 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Change parsercache key

https://gerrit.wikimedia.org/r/498322

Mentioned in SAL (#wikimedia-operations) [2019-03-27T06:02:43Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Change one parsercache key on codfw - T210725 (duration: 00m 57s)

Marostegui added a comment.EditedMar 27 2019, 6:17 AM

I have merged and deployed the above change and used x-wikimedia-debug with profile and log mode ON to browse codfw.
The first time I browsed enwiki there were 17 errors.
The first time I browsed dewiki there were 10 errors.
Apparently errors were also generated on donatewiki, a total of 13.

I am not sure if they are strictly related to the parsercache key change:
https://logstash.wikimedia.org/goto/74a2efc376b8eb38208420cf7a323304

unique_id	       	XJsSVQrAEEIAAAPVh1oAAAAR
{"id":"XJsTcgrAEEIAAAPVh84AAAAY","type":"ErrorException","file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/utils/FileContentsHasher.php","line":57,"message":"PHP Warning: filemtime(): No such file or directory","code":0,"url":"/w/load.php?lang=de&modules=startup&only=scripts&skin=vector","caught_by":"mwe_handler","suppressed":true,"backtrace":[{"file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/utils/FileContentsHasher.php","line":57,"function":"handleError","class":"MWExceptionHandler","type":"::","args":["integer","string","string","integer","array","array"]},{"file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/utils/FileContentsHasher.php","line":99,"function":"getFileContentsHashInternal","class":"FileContentsHasher","type":"->","args":["string","string"]},{"file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/resourceloader/ResourceLoaderModule.php","line":1021,"function":"getFileContentsHash","class":"FileContentsHasher","type":"::","args":["array"]},{"function":"safeFileHash","class":"ResourceLoaderModule","type":"::","args":["string"]},{"file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/resourceloader/ResourceLoaderFileModule.php","line":586,"function":"array_map","args":["array","array"]},{"file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/resourceloader/ResourceLoaderFileModule.php","line":625,"function":"getFileHashes","class":"ResourceLoaderFileModule","type":"->","args":["ResourceLoaderContext"]},{"file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/resourceloader/ResourceLoaderModule.php","line":833,"function":"getDefinitionSummary","class":"ResourceLoaderFileModule","type":"->","args":["ResourceLoaderContext"]},{"file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/resourceloader/ResourceLoaderStartUpModule.php","line":255,"function":"getVersionHash","class":"ResourceLoaderModule","type":"->","args":["ResourceLoaderContext"]},{"file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/resourceloader/ResourceLoaderStartUpModule.php","line":438,"function":"getModuleRegistrations","class":"ResourceLoaderStartUpModule","type":"->","args":["ResourceLoaderContext"]},{"file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/resourceloader/ResourceLoaderModule.php","line":727,"function":"getScript","class":"ResourceLoaderStartUpModule","type":"->","args":["ResourceLoaderContext"]},{"file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/resourceloader/ResourceLoaderModule.php","line":694,"function":"buildContent","class":"ResourceLoaderModule","type":"->","args":["ResourceLoaderContext"]},{"file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/resourceloader/ResourceLoaderModule.php","line":830,"function":"getModuleContent","class":"ResourceLoaderModule","type":"->","args":["ResourceLoaderContext"]},{"file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/resourceloader/ResourceLoader.php","line":662,"function":"getVersionHash","class":"ResourceLoaderModule","type":"->","args":["ResourceLoaderContext"]},{"function":"Closure$ResourceLoader::getCombinedVersion","args":["string"]},{"file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/resourceloader/ResourceLoader.php","line":674,"function":"array_map","args":["Closure$ResourceLoader::getCombinedVersion;4443","array"]},{"file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/resourceloader/ResourceLoader.php","line":755,"function":"getCombinedVersion","class":"ResourceLoader","type":"->","args":["ResourceLoaderContext","array"]},{"file":"/srv/mediawiki/php-1.33.0-wmf.22/load.php","line":46,"function":"respond","class":"ResourceLoader","type":"->","args":["ResourceLoaderContext"]},{"file":"/srv/mediawiki/w/load.php","line":3,"function":"include","args":["string"]}]}
normalized_message	       	{"id":"XJsSVQrAEEIAAAPVh1oAAAAR","type":"ErrorException","file":"/srv/mediawiki/php-1.33.0-wmf.22/includes/utils/FileContentsHasher.php","line":57,"message":"PHP Warning: filemtime(): No such file or directory","code":0,"url":"/w/load.php?lang=en&modules=
url	       	/w/load.php?lang=en&modules=startup&only=scripts&skin=vector

@aaron @Krinkle any thoughts on those errors?

aaron added a comment.Mar 27 2019, 7:57 AM

They don't seem related, nor are they actually thrown exceptions (just warnings caught by MWExceptionHandler and logged using Exception objects).

Thanks @aaron!
I am going to keep browsing codfw throughout today, but so far it doesn't look like it broke anything.

We should start thinking about deploying the same thing on eqiad and how to do that. There have been some ideas about how to do that in a safe way. Let me throw some here and we can discuss:

  1. Just do a normal deploy and change the first key, the same thing we did on codfw.
  2. Just change a mw host manually and wait for errors which will also warm up the keys
  3. Just deploy the change to the canaries (11 hosts)

I personally prefer #3.
What if we deploy the change to the canaries early in the morning on a Monday or Tuesday, and also manually create a lock at /var/lock/scap.operations_mediawiki-config.lock until the SWAT (around 5 hours after the initial deployment to the canaries) to prevent any further deployments by mistake?
After 4-5 hours, if no errors have shown up, we can fully deploy the change to the rest of the hosts.
I prefer option #3 because it covers more hosts than #2: enough to catch errors more easily than with a single host and to warm up the key faster, but still a small enough set of hosts to revert quickly if something breaks and to limit the user impact.

More ideas welcome!

aaron added a comment.Mar 27 2019, 5:09 PM

FYI, I was playing around with my PHP interpreter last night using:

<?php
$wmgUseNewPCShards = useNewShardKeys(
	1553792594, /* UNIX timestamp to finish transition */
	86400 /* seconds of transition */
);
$wmgParserCacheDBs = [
	# 'sharding key' => 'server ip address' # DO NOT CHANGE THE SHARDING KEY - T210725
	( $wmgUseNewPCShards ? 'pc1' : '10.64.0.12' ) => '10.192.0.104',  # pc2007, A1 4.4TB 256GB # pc1
	'10.64.32.72'  => '10.192.16.35',  # pc2008, B3 4.4TB 256GB # pc2
	'10.64.48.128' => '10.192.32.10',  # pc2009, C1 4.4TB 256GB # pc3
	# 'spare' => '10.192.48.14',  # pc2010, D3 4.4TB 256GB # spare host. Use it to replace any of the above if needed
];

function useNewShardKeys( $deadlineTimestamp, $windowSeconds ) {
	$windowLeft = $deadlineTimestamp - microtime( true );
	if ( $windowLeft <= 0 ) {
		$useNewHash = true;
	} elseif ( $windowLeft >= $windowSeconds ) {
		$useNewHash = false;
	} else {
		$chance = ( 1 - $windowLeft / $windowSeconds );
		$useNewHash = mt_rand( 1, 1e9 ) <= 1e9 * $chance;
	}

	return $useNewHash;
}

print_r( $wmgParserCacheDBs );
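To make the transition window easier to reason about, here is a minimal sketch of the same three-phase logic with the clock passed in as a parameter and the probability returned instead of a coin flip. The helper name `newShardKeyChance()` is hypothetical and exists only for this illustration:

```php
<?php
// Hypothetical helper mirroring the branches of useNewShardKeys() above:
// 0 before the window opens, 1 after the deadline, linear in between.
function newShardKeyChance( float $deadline, float $windowSeconds, float $now ): float {
	$windowLeft = $deadline - $now;
	if ( $windowLeft <= 0 ) {
		return 1.0; // deadline passed: always use the new key
	}
	if ( $windowLeft >= $windowSeconds ) {
		return 0.0; // window not open yet: always use the old key
	}
	return 1 - $windowLeft / $windowSeconds;
}

$deadline = 1553792594; // same example deadline as in the snippet above
foreach ( [ 24, 18, 12, 6, 0 ] as $hoursLeft ) {
	$chance = newShardKeyChance( $deadline, 86400, $deadline - $hoursLeft * 3600 );
	printf( "%2d hours before the deadline: chance %.2f\n", $hoursLeft, $chance );
}
```

Because the original function draws a fresh random number each time the config is evaluated, the same host can alternate between the old and the new key during the window. Only the overall share of evaluations using the new key grows over time.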

By the way, what do you plan to do in terms of cleaning up old hash-scheme entries? Just wait it out for the purgeParserCache.php cron to run?

Personally, if there is space and it works,

Just wait it out for the purgeParserCache.php cron to run

Seems ok to me. If we are low on space or there are issues, we can run it manually or create a script to delete old data.

By the way, what do you plan to do in terms of cleaning up old hash-scheme entries? Just wait it out for the purgeParserCache.php cron to run?

Yep, that was my idea :)

Change 499737 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Change parsercache key

https://gerrit.wikimedia.org/r/499737

Marostegui added a comment.EditedMar 28 2019, 9:40 AM

I would like to elaborate more on my idea on how to deploy the change to eqiad for the first key (the one we already changed in codfw):

Some day at 05:00 UTC:

  • Deploy the change to the canaries only (the 11 hosts listed below):
mw1279.eqiad.wmnet
mw1276.eqiad.wmnet
mw1261.eqiad.wmnet
mw1264.eqiad.wmnet
mwdebug1002.eqiad.wmnet
mwdebug1001.eqiad.wmnet
mw1263.eqiad.wmnet
mw1262.eqiad.wmnet
mw1278.eqiad.wmnet
mw1277.eqiad.wmnet
mw1265.eqiad.wmnet
  • Create a manual lock on deploy1001 by doing: touch /var/lock/scap.operations_mediawiki-config.lock
  • Check the parsercache graphs, errors, browse the sites using mwdebug1001 and mwdebug1002, keep checking for errors.
  • After 3-4 hours, if all goes well and nothing weird is seen on graphs and logs, remove the lock.
  • Do a normal scap sync-file to deploy the key everywhere.

I would like to proceed with the above ^ plan Tuesday or Thursday next week
cc @aaron @Krinkle @jcrespo

Did you run the warmup script on codfw? what was the effect?

Did you run the warmup script on codfw? what was the effect?

No, I didn't yet, I wanted to see if the proposal made sense before getting into it :)

Mentioned in SAL (#wikimedia-operations) [2019-04-03T14:18:37Z] <marostegui> Stop replication on pc2007 for testing - T210725

So, I have done the following test: using X-Wikimedia-Debug, I browsed a few wikis on codfw and saw the keys being inserted correctly in pc2007's binlog (I had stopped replication from eqiad and pt-heartbeat so that only "my" writes would reach the binlog).

REPLACE /* SqlBagOStuff::insertMulti  */ INTO `pc247` (keyname,value,exptime) VALUES ('tiwiki:pcache:idhash:595-0!<redacted>
REPLACE /* SqlBagOStuff::insertMulti  */ INTO `pc136` (keyname,value,exptime) VALUES ('tumwiki:pcache:idoptions:1' <redacted>

logstash links:
https://logstash.wikimedia.org/goto/3a4426ab5a9812219c04b4c5ae95d23f
https://logstash.wikimedia.org/goto/95d41f3226ac07df80a650e2e41b4926

I have blocked a window on Tuesday to tentatively get it deployed if no one objects after the above tests and comments: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1822580&oldid=1822579

Can I get some reviews and either -1 or +1 on the actual change?: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/499737/

aaron added a comment.Apr 5 2019, 6:12 PM

I would like to elaborate more on my idea on how to deploy the change to eqiad for the first key (the one we already changed in codfw):
Someday at 05:00AM UTC

  • Create a manual lock on deploy1001 by doing: touch /var/lock/scap.operations_mediawiki-config.lock
  • Check the parsercache graphs, errors, browse the sites using mwdebug1001 and mwdebug1002, keep checking for errors.
  • After 3-4 hours, if all goes well and nothing weird is seen on graphs and logs, remove the lock.
  • Do a normal scap sync-file to deploy the key everywhere.

So no one will be able to do deploys for 3-4 hours? Maybe it would be easier to just do a conditional on wfHostname() and deploy that the regular way, followed by deploying a second patch to make it unconditional after a few hours? I suppose if it's a quiet time of the day, a 4 hour lock is OK (any unrelated emergency deploy would just require coordination).
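A minimal sketch of what such a hostname conditional could look like, assuming the canary list from the deployment plan earlier in this task. The helper `firstShardKeyForHost()` is hypothetical, `php_uname( 'n' )` stands in for wfHostname() so the snippet runs on its own, and whether the reported node name is fully qualified is also an assumption:

```php
<?php
// Hypothetical helper, not part of wmf-config: picks the sharding key for the
// first parsercache shard depending on whether the host is a canary.
function firstShardKeyForHost( string $host, array $canaries ): string {
	return in_array( $host, $canaries, true ) ? 'pc1' : '10.64.0.12';
}

// Canary hosts as listed in the deployment plan in this task.
$canaries = [
	'mw1279.eqiad.wmnet', 'mw1276.eqiad.wmnet', 'mw1261.eqiad.wmnet',
	'mw1264.eqiad.wmnet', 'mwdebug1002.eqiad.wmnet', 'mwdebug1001.eqiad.wmnet',
	'mw1263.eqiad.wmnet', 'mw1262.eqiad.wmnet', 'mw1278.eqiad.wmnet',
	'mw1277.eqiad.wmnet', 'mw1265.eqiad.wmnet',
];

$host = php_uname( 'n' ); // stand-in for wfHostname()
$wmgParserCacheDBs = [
	// Server addresses copied from the example array in the task description.
	firstShardKeyForHost( $host, $canaries ) => '10.192.0.104',
	'10.64.32.72' => '10.192.16.35',
	'10.64.48.128' => '10.192.32.10',
];

print_r( $wmgParserCacheDBs );
```

A follow-up patch would then drop the conditional and use 'pc1' everywhere, which is the second deploy mentioned above.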

Toying around some more...to do a host-based deploy in one patch one could do:

<?php
// Function takes (hostname, transition deadline as a YmdHis timestamp string, transition window seconds)
$wmgUseNewPCShards = wmfHostInNewPhase( posix_uname()['nodename'], '20190308160000', 86400 );
$wmgParserCacheDBs = [
	# 'sharding key' => 'server ip address' # DO NOT CHANGE THE SHARDING KEY - T210725
	( $wmgUseNewPCShards ? 'pc1' : '10.64.0.12' ) => '10.192.0.104',  # pc2007, A1 4.4TB 256GB # pc1
	'10.64.32.72'  => '10.192.16.35',  # pc2008, B3 4.4TB 256GB # pc2
	'10.64.48.128' => '10.192.32.10',  # pc2009, C1 4.4TB 256GB # pc3
	# 'spare' => '10.192.48.14',  # pc2010, D3 4.4TB 256GB # spare host. Use it to replace any of the above if needed
];

function wmfHostInNewPhase( $host, $deadlineTimestamp, $windowSeconds ) {
	$windowLeft = strtotime( $deadlineTimestamp ) - microtime( true );
	if ( $windowLeft <= 0 ) {
		return true;
	} elseif ( $windowLeft >= $windowSeconds ) {
		return false;
	} else { // map hosts to a 32-bit unsigned advancing clock (needs 64-bits in PHP)
		return crc32( $host ) <= ( 1 - $windowLeft / $windowSeconds ) * ( pow( 2, 32 ) - 1 );
	}
}

// TESTING //
$hoursLeft = 12;
$deadline = DateTime::createFromFormat( 'U', time() + $hoursLeft * 3600 )->format( 'YmdHis' );
echo "$deadline\n";
$moved = 0;
for ( $i = 0; $i < 40; ++$i ) {
	$host = "host-$i";
	$moved += wmfHostInNewPhase( $host, $deadline, 86400 );
	echo "$host " . (int)wmfHostInNewPhase( $host, $deadline, 86400 ) . "\n";
}
echo "$moved hosts using new config out of $i\n";
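One property of the host-based approach is that each host gets a fixed position in the rollout: once the advancing threshold passes a host's CRC32 value, that host stays on the new key. The sketch below replays the window with an injected clock to show this. The names `hostInNewPhaseAt()` and `hostsInNewPhaseAt()` and the `host-N` names are hypothetical, and the code assumes 64-bit PHP:

```php
<?php
// Hypothetical variant of wmfHostInNewPhase() that takes UNIX timestamps and
// the current time as arguments, so the rollout can be replayed.
function hostInNewPhaseAt( string $host, int $deadline, int $windowSeconds, int $now ): bool {
	$windowLeft = $deadline - $now;
	if ( $windowLeft <= 0 ) {
		return true;
	}
	if ( $windowLeft >= $windowSeconds ) {
		return false;
	}
	// crc32() yields 0..2^32-1 on 64-bit PHP; the threshold grows with time.
	return crc32( $host ) <= ( 1 - $windowLeft / $windowSeconds ) * 0xFFFFFFFF;
}

// Returns the subset of $hosts already using the new config at time $now.
function hostsInNewPhaseAt( array $hosts, int $deadline, int $windowSeconds, int $now ): array {
	return array_values( array_filter(
		$hosts,
		function ( $host ) use ( $deadline, $windowSeconds, $now ) {
			return hostInNewPhaseAt( $host, $deadline, $windowSeconds, $now );
		}
	) );
}

$hosts = [];
for ( $i = 0; $i < 40; $i++ ) {
	$hosts[] = "host-$i"; // made-up host names, as in the test loop above
}

$deadline = 1553792594;
$window = 86400;
$previous = [];
foreach ( [ 24, 18, 12, 6, 0 ] as $hoursLeft ) {
	$moved = hostsInNewPhaseAt( $hosts, $deadline, $window, $deadline - $hoursLeft * 3600 );
	// No host that had already moved should ever fall back to the old key.
	$regressed = count( array_diff( $previous, $moved ) );
	printf( "%2d hours left: %d hosts on the new key, %d regressed\n", $hoursLeft, count( $moved ), $regressed );
	$previous = $moved;
}
```

Unlike the mt_rand() version, the decision here depends only on the host name and the time, so a given host does not alternate between keys during the window.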

So no one will be able to do deploys for 3-4 hours? Maybe it would be easier to just do a conditional on wfHostname() and deploy that the regular way, followed by deploying a second patch to make it unconditional after a few hours? I suppose if it's a quiet time of the day, a 4 hour lock is OK (any unrelated emergency deploy would just require coordination).

Yeah, it is very early in EU morning, and I will release the lock before the SWAT. Of course I will be online and if there is any emergency deploy I will release the lock :-)
Normally in EU morning it is just us DBAs, deploying wmf-config :-)

Change 499737 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Change parsercache key

https://gerrit.wikimedia.org/r/499737

Mentioned in SAL (#wikimedia-operations) [2019-04-09T06:01:54Z] <marostegui> Deploy parsercache key change on canaries only - T210725

Mentioned in SAL (#wikimedia-operations) [2019-04-09T06:30:14Z] <marostegui> Change parsercache keys on mw[1270-1279] - T210725

Mentioned in SAL (#wikimedia-operations) [2019-04-09T06:39:38Z] <marostegui> Change parsercache keys on mw[1260-1269] - T210725

Mentioned in SAL (#wikimedia-operations) [2019-04-09T07:03:47Z] <marostegui> Change parsercache keys on mw[1280-1289] - T210725

Mentioned in SAL (#wikimedia-operations) [2019-04-09T07:09:50Z] <marostegui> Change parsercache keys on mw[1221-1229] - T210725

Mentioned in SAL (#wikimedia-operations) [2019-04-09T07:21:53Z] <marostegui> Change parsercache keys on mw[1230-1235,1238-1239] - T210725

Mentioned in SAL (#wikimedia-operations) [2019-04-09T07:48:58Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Deploy parsercache key change everywhere T210725 (duration: 00m 53s)

After all the controlled changes on chunks of mw servers, the key has been changed everywhere.
Let's wait a few weeks (especially with Easter being next week) before going for the second key change.

The parsercache hit ratio is back to normal values after the first key change on the 9th, so it took around 3 days to fully recover (https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=1554620012644&to=1555063147728).
I am going to wait until 9th of May to see how the disk space trends go before going for the second key change.

Mentioned in SAL (#wikimedia-operations) [2019-04-12T12:57:18Z] <marostegui> Purge old rows and optimize tables on spare host pc1010 T210725

pc1010 (which replicates from pc1007) has finished deleting its old rows (rows that had not been purged) and optimizing its tables, and now has 300G more free space than pc1007.
I will pool pc1010 instead of pc1007 and optimize pc1007 so that it gains some extra space and we can forget about it over the upcoming Easter days.

(1) pc1007.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.7T  1.7T  62% /srv
===== NODE GROUP =====
(2) pc[1008-1009].eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.5T  2.0T  56% /srv
===== NODE GROUP =====
(1) pc1010.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.4T  2.0T  56% /srv

Change 503921 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool pc1007

https://gerrit.wikimedia.org/r/503921

Change 503921 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool pc1007

https://gerrit.wikimedia.org/r/503921

pc1007 is now back to 56% after the optimization

pc[2007-2010].codfw.wmnet,pc[1007-1010].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) pc2010.codfw.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.3T  2.2T  51% /srv
===== NODE GROUP =====
(1) pc2009.codfw.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.4T  2.0T  55% /srv
===== NODE GROUP =====
(1) pc2007.codfw.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.1T  2.3T  49% /srv
===== NODE GROUP =====
(1) pc2008.codfw.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.4T  2.0T  54% /srv
===== NODE GROUP =====
(1) pc1007.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.3T  2.1T  52% /srv
===== NODE GROUP =====
(2) pc[1008-1009].eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.5T  2.0T  56% /srv
===== NODE GROUP =====
(1) pc1010.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.5T  1.9T  57% /srv
================

pc1007 disk space has now been stable for 9 days, moving between 60-61%. Once the 30 days have gone by and everything has been purged, I will optimize again and then move on to the second key change.

Change 508170 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change second parsercache key

https://gerrit.wikimedia.org/r/508170

I have reserved a window for Tuesday 14th of May to change the second parsercache key.
This is what will be pushed: https://gerrit.wikimedia.org/r/#/c/508170/ (will appreciate a review of it)
The procedure will be the same as done previously.

  • Canaries
  • Several batches of 10 hosts

@aaron @Krinkle @jcrespo I would appreciate a review of ^ - it should be a quick one (it is the same we did with the first key change that went well), just want another pair of eyes to confirm :-)

Change 508170 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change second parsercache key

https://gerrit.wikimedia.org/r/508170

Mentioned in SAL (#wikimedia-operations) [2019-05-14T06:01:41Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Change parsercache on codfw T210725 (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2019-05-14T06:01:47Z] <marostegui> Lock wmf-config deployment on deploy1001 to slowly change parsercache key on eqiad - T210725

Mentioned in SAL (#wikimedia-operations) [2019-05-14T06:02:11Z] <marostegui> Deploy parsercache change to eqiad canaries - T210725

Mentioned in SAL (#wikimedia-operations) [2019-05-14T09:27:36Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Deploy second parsercache key change everywhere after deploying it in batches first T210725 (duration: 00m 50s)

The second key has been replaced everywhere.
It will take a few days to get the hit rate back to previous values as we saw with the previous key change:
https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=now-12h&to=now&refresh=30s

The hit rate seems back to normal, so I am guessing in a couple of weeks this could get resolved?

The hit rate seems back to normal, so I am guessing in a couple of weeks this could get resolved?

No, there is still one key to be changed - my plan was, in 2 weeks, to:

  1. optimize pc1008 and pc2008
  2. change the last key (pc1009-pc2009)

there is still one key to be changed

That is what I meant with "resolved" :-P.

Any objection to replacing the last key on the 25th of June (Tuesday), early in the morning, with the same procedure that was followed for the other two?

Mentioned in SAL (#wikimedia-operations) [2019-06-14T10:22:05Z] <marostegui> Optimize tables on pc2008 - T210725

I have started a defragmentation on all tables on pc2008.
For the record:

root@pc2008:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.6T  1.8T  60% /srv

The optimize finished on pc2008:

root@pc2008:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.1T  2.3T  48% /srv

Change 517357 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool pc1008

https://gerrit.wikimedia.org/r/517357

Change 517357 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool pc1008

https://gerrit.wikimedia.org/r/517357

Mentioned in SAL (#wikimedia-operations) [2019-06-17T05:03:49Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool pc1008 and pool pc1010 temporarily while pc1008 gets all its tables optimized T210725 (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2019-06-17T05:03:55Z] <marostegui> Optimize all pc1008's tables T210725

pc1008 tables optimization finished:

root@pc1008:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.0T  2.4T  46% /srv

Mentioned in SAL (#wikimedia-operations) [2019-06-18T04:28:27Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool pc1008 after optimizing its tables T210725 (duration: 00m 47s)

Change 517807 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change last parsercache key

https://gerrit.wikimedia.org/r/517807

So this is the change I will push on the 25th of June to change the last key: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/517807/
I will follow the same procedure that was followed for the previous two keys.

Change 517807 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Change last parsercache key

https://gerrit.wikimedia.org/r/517807

Mentioned in SAL (#wikimedia-operations) [2019-06-25T05:01:13Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Change parsercache key T210725 (duration: 00m 58s)

Mentioned in SAL (#wikimedia-operations) [2019-06-25T05:01:20Z] <marostegui> Change parsercache key on the canaries T210725

Mentioned in SAL (#wikimedia-operations) [2019-06-25T08:51:01Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Change parsercache key everywhere after deploying it in small batches for a few hours T210725 (duration: 00m 57s)

I have finished deploying the last key change. I did it in small batches over a few hours: https://grafana.wikimedia.org/render/d-solo/000000106/parser-cache?panelId=1&orgId=1&from=1561367808736&to=1561454208737&refresh=10s&var-contentModel=wikitext&width=1000&height=500&tz=Europe%2FMadrid

It will take a few days until the hit ratio is back to previous values as we have seen with the previous changes. I have added a reminder on my calendar in a month to optimize tables on pc1009 and pc2009 as we have done on the previous changes.

Marostegui closed this task as Resolved.Jun 25 2019, 1:20 PM

Change 532287 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool pc2009.

https://gerrit.wikimedia.org/r/532287

Change 532287 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool pc2009.

https://gerrit.wikimedia.org/r/532287

Mentioned in SAL (#wikimedia-operations) [2019-08-26T05:08:13Z] <marostegui> Optimize tables on pc2009 - T210725

Mentioned in SAL (#wikimedia-operations) [2019-08-26T05:09:04Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Depool pc2009 for optimize T210725 (duration: 02m 53s)

Mentioned in SAL (#wikimedia-operations) [2019-08-27T05:12:37Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool pc2009 after optimize T210725 (duration: 00m 47s)

Change 532514 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool pc1009

https://gerrit.wikimedia.org/r/532514

Change 532514 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool pc1009

https://gerrit.wikimedia.org/r/532514

Mentioned in SAL (#wikimedia-operations) [2019-08-27T05:24:42Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Depool pc1009 for optimize T210725 (duration: 00m 45s)

Mentioned in SAL (#wikimedia-operations) [2019-08-27T05:28:07Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool pc1009 for optimize T210725 (duration: 00m 45s)

Mentioned in SAL (#wikimedia-operations) [2019-08-27T05:28:27Z] <marostegui> Optimize pc1009 - T210725

Mentioned in SAL (#wikimedia-operations) [2019-08-28T05:03:52Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool pc1009 after optimize T210725 (duration: 00m 54s)

pc1009 and pc2009 have now been optimized.
As all the active parsercache hosts have been cleaned up and optimized, I am going to work on the spares (pc1010 and pc2010) to clean up the old garbage left over from using them to temporarily replace the other hosts: there are around 170k rows per table that were not cleaned up. I will then optimize them.
Currently pc1010 is at 81% usage, and I expect this operation to reduce it to 59% or so, like pc1007 (from which it replicates).

Mentioned in SAL (#wikimedia-operations) [2019-08-28T05:54:41Z] <marostegui> Remove old rows from pc1010 - T210725

Mentioned in SAL (#wikimedia-operations) [2019-08-28T11:31:09Z] <marostegui> Optimize pc1010 after deleting old rows - T210725

Mentioned in SAL (#wikimedia-operations) [2019-08-28T13:15:16Z] <marostegui> Optimize pc2010 after deleting old rows - T210725

Marostegui added a comment.EditedAug 29 2019, 8:17 AM

pc2010 is done with all the cleanups+optimizations

ssh pc2010.codfw.wmnet df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.1T  2.3T  49% /srv

pc1010 is done:

root@pc1010:/srv/sqldata-cache/parsercache# df -hT /srv/
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.4T  2.0T  55% /srv