
es[12]00[123] maintenance and upgrade
Closed, Resolved (Public)

Description

  • pkg/kernel/trusty upgrade
  • mediawiki code review for mariadb 10 compatibility
  • Database upgrade

Related Objects

Mentioned In
rOMWCb9ecb8e2fce4: Repool es2003 after maintenance
rOMWCec224fc746d6: Depooling es2003 and es2004 for maintenance
rOMWC67d51a062b16: Repool es2001, es2002 after maintenance
rOMWC1c7c9f39e0ab: Promote es1009 as the new master
rOMWC915acb58885c: Repool es1004 after maintenance
rOPUP06f993c57930: Updating es1004 to mariadb10
rOMWC83bd0b08c053: Repool es1003, depool es1004 for regular maintenance
rOPUP669e801ba5fd: Upgrading es1003 to mariadb10
rOMWC796d140beb50: Repool es1002, depool es1003 for regular maintenance
rOMWC7cb025188350: Repool es1001, depool es1002 for regular maintenance
rOMWC4ec0ba4723a2: Depool es1001 for regular maintenance
rOMWCef7b7120bba7: Repool es2005, es2006 and es2007 after maintenance
rOMWC03617acf08f2: Depool es2005, es2006 and es2007 for maintenance
rOMWC19fd3a6a0390: Repool es1005 after maintenance
rOMWC18b041633438: Depool es1005
rOMWC85ac3b4dca0b: Repool es1007
rOPUP46d10afea1bc: Upgrade es1007 from MariaDB 5.5 to 10
rOMWC259a74a278a5: Depool es1007
rOMWC109fb7661d16: Repool db1009
rOMWC64c13db246a3: Master->Slave switchover of es1009 to es1008 for maintenance
rOMWCd6f596dd781c: Repool es1008
rOMWC882487408757: Repool es2008, es2009 and es2010
rOMWCcdb846e3b053: Depool es1008 and es2008 (and its slaves) for CHANGE MASTER
rOMWCadc60cf713e3: Repool es2008, es2009 and es2010 after maintenance
rOMWC6cb914a9ff96: Depool es2008 for maintenance
rOMWC52c2b0e075dd: Depool es2009 for maintenance
rOMWC72cf28d69dd6: Repool es2010 after maintenance
rOMWCf419db955be8: Repool es1008 after maintenance
rOMWC93d218cacc30: depool es2010 for maintenance
rOMWC250736197500: depool es1008
rOMWC9857c9cf092b: Repool es1010
rOMWC74e979aa0bcc: Depooling es1010 for maintenance
Mentioned Here
T103843: Faulty memory on es2004 (purchase one module)

Event Timeline

jcrespo claimed this task.
jcrespo raised the priority of this task from to Medium.
jcrespo updated the task description.
jcrespo added projects: DBA, acl*sre-team.
jcrespo added a subscriber: Springle.

While a baseline of "5.6/10 or higher" is needed for some of the features we want/need, and normalization is great for maintenance, I have some reservations about running 100% of the hosts on the exact same OS and database package/version.

At least for the OS, all machines should be updated consistently to trusty, which brings many important improvements to the low-level OS components.

(trusty at minimum, ideally jessie!)

es1010 upgraded (if you do not want to be notified, unsubscribe!)

Tomorrow, after a bit more investigation, I will perform a master-master failover of es1009 (with es1008), which is a bit more delicate because it involves a major version upgrade.

That is the last server to upgrade on es3.

Switchover completed. No relevant errors on kibana. There are some errors in the es1008 error log about es1009 disconnecting, but they predate the failover (probably caused by the temporary 10 -> 5.5 replication). I will check for differences and rebuild es1009 if necessary.
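The check for differences between the two hosts could be sketched roughly as follows. This is only an illustration, assuming per-table checksums have already been collected from each host by some means and loaded into plain dictionaries. The function name `diff_checksums` and the sample table names are hypothetical and are not part of any tooling mentioned in this task. Checksums taken on different major versions may also not be directly comparable, so a reported difference is a hint for further inspection, not proof of drift.

```python
# Sentinel so that a table missing on one host never compares equal
# to a table whose checksum happens to be None on the other host.
_MISSING = object()


def diff_checksums(master, replica):
    """Return the sorted names of tables that differ between two hosts.

    `master` and `replica` map table name -> checksum. A table counts as
    different if its checksum does not match or if it exists on one host only.
    """
    tables = set(master) | set(replica)
    return sorted(
        t for t in tables
        if master.get(t, _MISSING) != replica.get(t, _MISSING)
    )


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    old_master = {"table_a": 1111, "table_b": 2222}
    new_master = {"table_a": 1111, "table_b": 9999, "table_c": 3333}
    print(diff_checksums(old_master, new_master))
```

An empty result would suggest the rebuild is unnecessary; any listed table would need a closer look before deciding.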

es1008 is the (temporarily?) new master of es3 and es1009 is down for maintenance. I have updated tendril to avoid confusion. After the upgrade of es1009 we can return it to its previous state or (I prefer this one) leave it as is and puppetize the change (the heartbeat check is running on the new master, but icinga is checking the wrong node).
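For context on why the monitored node matters: a heartbeat-style check generally works by having a process on the master write the current time at short intervals, and then estimating lag on a replica as the gap between "now" and the newest replicated timestamp. If the monitoring looks at a node that no longer receives fresh heartbeats, the computed lag is meaningless. A minimal sketch of that calculation follows, assuming timezone-aware timestamps. The function names and the 30-second threshold are illustrative assumptions, not the actual icinga check or puppet configuration.

```python
from datetime import datetime, timezone


def replication_lag_seconds(last_heartbeat, now):
    """Estimate lag as the age of the newest replicated heartbeat.

    Negative values (clock skew) are clamped to zero.
    """
    return max(0.0, (now - last_heartbeat).total_seconds())


def is_lagging(last_heartbeat, now, threshold_seconds=30.0):
    """Return True when the estimated lag exceeds the (assumed) threshold."""
    return replication_lag_seconds(last_heartbeat, now) > threshold_seconds


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    heartbeat = datetime(2015, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
    current = datetime(2015, 6, 1, 12, 0, 10, tzinfo=timezone.utc)
    print(replication_lag_seconds(heartbeat, current))
```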

I've just realized that we do not have any 10s as masters in production ("pc" are the closest thing, but they are not real "masters"), and we do not have things like read-write and good icinga checks for those. I am going to wait to update 1009 until:

a) we have decided if we want to do a potentially irreversible jump to 10 (there are some incompatibilities in MariaDB 10's GTID implementation)
and
b) I or someone else implements better puppet/icinga support for a 10 production master

We can also use it for Master-Master tests between versions.

Current status: es1005 is depooled, but not yet scheduled for maintenance because it is still being used by wikiadmin (snapshotting).

Depooling es2005, es2006 and es2007.

With today's work I would say the task is finished.

Summary:

  • Upgraded all es* slaves (18 hosts) to trusty, with a kernel upgrade and reboot, and to MariaDB 10.0.16
  • Decided to keep the 2 masters/rw nodes on 5.5 (es1009 and es1006), although es1009 was upgraded to trusty, too

Issues:

  • If we decide to upgrade the masters to 10, a better puppet template and monitoring will be needed, and we will have to decide whether to retire or replace things like heartbeat
  • es2004 was upgraded like the others, but a faulty memory module is preventing it from being repooled right now. Tracked in T103843. Not a problem for production.