pc100[123] maintenance and upgrade
Closed, Resolved (Public)

Description

pc100[123] haven't had attention for some time, and deserve the usual:

  • pkg/kernel/trusty upgrade
  • mediawiki code review for mariadb 10 compatibility (probably fine)
  • if ok, mariadb 10 re-puppetization and upgrade

These boxes can be depooled one-at-a-time in db-eqiad.php; they just rebuild themselves on repool.

Buffer pool preload would be useful since load balancing isn't(?) possible.
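
As a rough sketch of what a buffer pool preload could look like once a host is on MariaDB 10 (the dump/load variables below exist in MariaDB 10.0 and later, not in 5.5; whether and when to run them on these boxes is an assumption, not something decided in this task):

```sql
-- Before stopping mysqld: write the list of pages currently in the buffer pool to disk.
SET GLOBAL innodb_buffer_pool_dump_now = ON;
-- Check that the dump has finished.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_dump_status';

-- After the restart: read the dumped page list back in to warm the buffer pool.
SET GLOBAL innodb_buffer_pool_load_now = ON;
-- Check the progress of the load.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_load_status';
```

The same effect can be made automatic with innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup in my.cnf.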

Also worth a p_s trial? Your call.

Event Timeline

Springle assigned this task to jcrespo.
Springle set the priority of this task to Needs Triage.
Springle updated the task description.
Springle subscribed.
Springle set Security to None.

Running the following on pc1001 to capture a sample of traffic for later analysis of MariaDB 10 compatibility:

-- Save the current settings in session variables so they can be restored later.
SET @slow_query_log := @@GLOBAL.slow_query_log; -- ON
SET @slow_query_log_file := @@GLOBAL.slow_query_log_file;
SET @long_query_time := @@GLOBAL.long_query_time;
SET @log_slow_rate_limit := @@GLOBAL.log_slow_rate_limit;

-- Temporarily log a sample (about 1 in 100) of all queries, regardless of duration.
SET GLOBAL slow_query_log := ON;
SET GLOBAL log_slow_rate_limit := 100;
SET GLOBAL slow_query_log_file := '/tmp/slow.log';
SET GLOBAL long_query_time := 0;

MariaDB [(none)]> SELECT @slow_query_log, @slow_query_log_file, @long_query_time, @log_slow_rate_limit;
+-----------------+----------------------+------------------+----------------------+
| @slow_query_log | @slow_query_log_file | @long_query_time | @log_slow_rate_limit |
+-----------------+----------------------+------------------+----------------------+
|               1 | pc1001-slow.log      |                1 |                    1 |
+-----------------+----------------------+------------------+----------------------+
1 row in set (0.00 sec)

Log set back to normal.
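
For reference, a sketch of how the restore could be done from the values saved earlier. This assumes it runs in the same connection that saved them, since user variables such as @slow_query_log are session-scoped; the exact statements that were actually run are not recorded here.

```sql
-- Restore the slow query log settings from the session variables saved earlier.
SET GLOBAL long_query_time := @long_query_time;
SET GLOBAL log_slow_rate_limit := @log_slow_rate_limit;
SET GLOBAL slow_query_log_file := @slow_query_log_file;
SET GLOBAL slow_query_log := @slow_query_log;

-- Confirm that the live values match the originals.
SELECT @@GLOBAL.slow_query_log, @@GLOBAL.slow_query_log_file,
       @@GLOBAL.long_query_time, @@GLOBAL.log_slow_rate_limit;
```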

This wiki page documents the queries being run there and how fast they are, in order to compare with MariaDB 10:

It is a simple key-value store, using a unique key, so it should not cause much of a problem. The actual migration is tomorrow.

If someone needs the full log or output (which may include private queries), it is at root@pc1001:/home/jynus

Current state: OS/Kernel/Package updated. Rebooted. MariaDB-WMF 10 installed but not started yet and mysql_upgrade not run. Old configuration in place (puppet agent disabled). pc1001 depooled from mediawiki.

@Springle, could you double-check (+1/-1) https://gerrit.wikimedia.org/r/#/c/213784/ ? I am mostly worried about the site.pp/mariadb role; I will try the my.cnf changes and performance offline first.

pc1001 and pc1002 upgraded; pc1003 is left. I am getting slightly lower QPS on pc1001. I need to investigate and check the performance more thoroughly.

Actionables:

  • Check read_only on master servers on icinga with a puppet rule
  • Create a query on kibana-logstash for db-related errors
  • Plot SELECT sum(errors) as errors, sum(warnings) as warnings FROM sys.statements_with_errors_or_warnings; on tendril
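
As a sketch, the two database-side checks above come down to queries like these. How they would be wired into icinga or tendril is not specified here, and the second query assumes the sys schema is installed and performance_schema is enabled on the host:

```sql
-- read_only check: returns 1 when the server is read-only, 0 when it accepts writes.
-- A master is expected to return 0.
SELECT @@GLOBAL.read_only AS read_only;

-- Totals of statement errors and warnings, suitable for plotting over time.
SELECT SUM(errors) AS errors, SUM(warnings) AS warnings
FROM sys.statements_with_errors_or_warnings;
```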

Actionables:

  • Create a query on kibana-logstash for db-related errors

I just made https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError for this from the MediaWiki point of view.

Thanks, @bd808. I would add a couple of panels (group by error message & count, for example) and add it to the main page, if everyone is OK with that (obviously, I can do that myself).

I think I can say this is done; I will reopen if I find another issue. Extra monitoring will be ticketed separately if needed.