eqiad row D switch upgrade
Closed, Declined, Public

Assigned To

Authored By

	• ayounsi
	Aug 3 2017, 11:04 PM

Description

Similar to T167274, T168462, T169345, T170380 and especially T148506.

This is the last switch that needs to be upgraded to fix T133387.

As eqiad is the active datacenter and row D contains multiple systems, the first step is to evaluate and discuss the difficulty/impact of the upgrade. Note that all services are supposed to have at least row redundancy.
Thanks to the effort put into the preparation of T148506, this might be easier than before.

I'm planning on doing the upgrade on Wednesday Feb. 14th at 0800 PDT; 1500 UTC; 1100 EDT
That date is mainly a placeholder: I can reschedule at will, so let me know if it's an issue for anyone.
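US time zones are on standard time in mid-February, so the local offsets are worth double-checking. A minimal sketch, assuming 1500 UTC is the reference time for the window (the helper name `local_start` is made up for illustration):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: 15:00 UTC on Wednesday Feb 14, 2018 is the reference start time.
window_start = datetime(2018, 2, 14, 15, 0, tzinfo=timezone.utc)


def local_start(tz_name: str) -> str:
    """Render the window start in the given IANA time zone."""
    local = window_start.astimezone(ZoneInfo(tz_name))
    return local.strftime("%a %b %d %H:%M %Z")


for tz in ("America/Los_Angeles", "UTC", "America/New_York"):
    print(f"{tz}: {local_start(tz)}")
```

If a different zone is the authoritative one, build `window_start` in that zone instead and convert from there.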

2h total maintenance time.

As the previous standard upgrade was smooth, we're doing the same thing, which means the whole row will go down for 10 to 20 minutes (previous rows took ~10 min) if there are no complications.

The full list of servers is available at: https://racktables.wikimedia.org/index.php?row_id=2102&location_id=2006&page=row&tab=default

To summarize, here are the types of hosts in that row:

analytics*
aqs*
auth*
cp*
db*
dbstore*
druid*
dumpsdata*
einsteinium <- Icinga
elastic*
es*
kafka*
kubernetes*
labpuppetmaster*
labservices*
labstore*
labweb*
logstash*
maps*
mc*
ms-be*
mw*
ocg*
ores*
pc*
puppetmaster*
rdb*
restbase*
restbase-dev*
scb*
snapshot*
stat*
thorium
thumbor*
wdqs*
wtp*

I subscribed users on that task based on https://etherpad.wikimedia.org/p/p0Iq39YWZg . Don't hesitate to add people I could have missed, or remove yourself from the task if you're not involved.

Timeline, - to be completed - please edit and add anyone who needs to be notified or any extra step that needs to be done.

Days before:

Depool puppetmaster1002 ( T148506#3196641 )
Switchover from einsteinium to tegmen T163324
Switchover from oresrdb1001 to oresrdb1002 ( T163326 )
Ban all elasticsearch nodes of row D ( @Gehel )
Fail etcd over to codfw
Ensure we can survive a loss of labservices1001 ( T163402 )
Failover affected DB masters (T186188)

1h before the window:

Warn people of the upcoming maintenance
Ping @elukey to disable kafka
Ping @elukey to drain traffic from hadoop nodes
Ping @Eevans to mute restbase* alerts ( T148506#3202477 )
Ping @Gehel for elasticsearch and logstash coordination
Disable elasticsearch check https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search.svc.eqiad.wmnet&service=ElasticSearch+health+check+for+shards
Downtime switch in Icinga/LibreNMS
...

After the upgrade:

Confirm switches are in a healthy state
Re-enable igmp-snooping
Confirm T133387 is fixed (otherwise re-disable igmp-snooping)
Run a LibreNMS discovery/poll
Ask for confirmation of "all good" from the list of users above
Repool puppetmaster1002 (@akosiaris)
Unban elasticsearch nodes of row D (@Gehel )
Elasticsearch reindex that time period if we see any lost writes (@Gehel )
Re-enable all Icinga checks
Remove monitoring downtime
...

Details

	Subject	Repo	Branch	Lines +/-
	mariadb: Adding rack allocations, some formatting fixes, read-only	operations/mediawiki-config	master	+153 -127


Related Objects

Status	Assigned	Task
Declined	• ayounsi	T172459 eqiad row D switch upgrade
Resolved	elukey	T165519 rack and setup mw1307-1348
Resolved	• Cmjohnson	T167130 Decom mw1170-mw1179, and replace them with new systems.
Resolved	• Cmjohnson	T168271 Decommission mw1170-mw1179
Resolved	• Cmjohnson	T177387 Decomission mw1161-69
Resolved	• Cmjohnson	T181613 Please move db1110 and change its ip
Resolved	• Cmjohnson	T183895 Decommission mw1180-1200
Resolved	• Cmjohnson	T185004 Decommission mw1201-mw1220

Event Timeline

• ayounsi created this task. · Aug 3 2017, 11:04 PM

Restricted Application added a subscriber: Aklapper. · Aug 3 2017, 11:04 PM

• ayounsi mentioned this in T163402: Ensure we can survive a loss of labservices1001. · Aug 3 2017, 11:09 PM

• ayounsi updated the task description. · Aug 4 2017, 1:40 AM

Paladox subscribed. · Aug 4 2017, 1:45 AM

Hi,

We have some critical DB hosts in that row. Either they would need to be failed over, or we would need to tell users that a read-only period is happening.
To fail over those hosts we'd need more time, and we probably won't be able to make it in time for the date that you suggest.

Those critical systems are:

db1068 - s4 master (commons)
db1062 - s7 master (among other big wikis....centralauth is there)

Marostegui added a subscriber: • jcrespo. · Aug 4 2017, 4:49 AM

A couple of notes from my side after reading the host list:

Analytics:

All the analytics* hosts in row D being down shouldn't be an issue for a brief amount of time, since the Hadoop cluster data is replicated in 3 rows.
We'd need to inform data analysts and probably the analytics mailing list, since thorium runs all the Analytics websites and stat* are hosts on which our colleagues run various kinds of jobs (some requiring mysql data, others Hadoop, etc.).
Two Kafka hosts of the Analytics cluster will go down at the same time. That's not ideal, but we already verified that it is fine for a brief time window.

Session/Object cache:

As far as I can see, 4 mc10* hosts will go down at the same time (1/4 of the cluster), meaning that some logged-in users will probably see some impact (sessions lost, auth issues, etc.). This will happen because MediaWiki connects to the mc* cluster via nutcracker on localhost, and each mc* host represents a non-replicated shard. So when one mc* host goes down, nutcracker detects it and takes it out, recalculating its consistent hashing pool (so requests that would have targeted the hosts that are down will go to another one). This is probably fine, but it might be good to alert our community liaisons beforehand just in case (so editors will be informed straight away if the maintenance takes a bit longer than expected).
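To illustrate the remapping described above, here is a toy consistent-hash ring. It is only a sketch: it is not nutcracker's actual ketama implementation, and the host names, key names, and `points_per_node` value are made up. It shows that ejecting 4 of 16 nodes only moves the keys that lived on the ejected nodes. Since the shards are not replicated, those moved keys would miss on their new host.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring (illustrative only, not nutcracker's code)."""

    def __init__(self, nodes, points_per_node=100):
        self.points_per_node = points_per_node
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        for i in range(self.points_per_node):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def eject(self, node: str) -> None:
        """Drop every point owned by a node that is considered down."""
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        """Return the node owning the first point at or after the key's hash."""
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical 16-node pool; 4 nodes (1/4 of the pool) go down together.
nodes = [f"mc-{i:02d}" for i in range(16)]
ring = HashRing(nodes)
keys = [f"session:{i}" for i in range(10_000)]

before = {k: ring.lookup(k) for k in keys}
down = set(nodes[:4])
for node in down:
    ring.eject(node)
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved to another node")
```

Keys that were on surviving nodes stay where they were, which is why the impact should be limited to roughly the share of sessions stored on the hosts that go down.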

Config cluster:

conf1003 will go down but it shouldn't be a big issue since zookeeper/etcd will keep working without any problem. etcd mirror is not pulling data from conf1003 so that one should be ok too.

Job Queues:

rdb1006 is an eqiad replica of rdb1005, so we are fine on this side too. Just to be super paranoid, it might be good to force a Redis restart on rdb2005 to force replication and make sure that, if rdb1005 goes down at the same time, we have all the data in perfect sync with the master (I know it might be too paranoid, but I am writing down everything :)

MediaWiki:

Better to double check all the hosts that we take down (if the related clusters are balanced), and possibly depool them beforehand from pybal.

Last but not least, I will be on vacation from the 20th to the 27th, so I will not be around for the Analytics part. @Ottomata should be around though! (I need to check with him, but there shouldn't be any issue.)

For the elasticsearch cluster:

The cluster should be able to survive the loss of a full row. This is not something we have tested under load yet, so this is a good occasion to do it. In case of trouble, we can always switch the search traffic to codfw.

The risks are:

performance (obviously, with 1/4 of the capacity gone, we might see increased response times)
the shard check will most probably alert (with 1 row gone, not all replicas can be allocated)
in the worst case, we might lose writes to some indices, but we can reindex those afterward.
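The first two risks can be put into rough numbers. This is a back-of-the-envelope sketch under stated assumptions, not a description of the real cluster settings: traffic is assumed to be spread evenly over 4 equal rows, each shard is assumed to have 4 copies (a hypothetical figure), and row-aware allocation is simplified to "at most one copy of a shard per row".

```python
def per_node_load_factor(rows_total: int, rows_lost: int) -> float:
    """Relative load on each surviving node if load was spread evenly across rows."""
    return rows_total / (rows_total - rows_lost)


def unassigned_copies(copies_per_shard: int, rows_total: int, rows_lost: int) -> int:
    """Copies left unassigned when at most one copy of a shard may live in each row."""
    surviving_rows = rows_total - rows_lost
    return max(0, copies_per_shard - surviving_rows)


# Assumed figures: 4 rows, 1 lost, 4 copies per shard (1 primary + 3 replicas).
print(f"load factor on surviving nodes: {per_node_load_factor(4, 1):.2f}")
print(f"unassigned copies per shard: {unassigned_copies(4, 4, 1)}")
```

Under these assumptions each surviving node takes about a third more load, and one copy per shard stays unassigned until the row is back, which would be enough to trip a shard health check.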

Conclusion:

We will not drain row D before the maintenance, but will keep an eye on it and be ready to switch traffic to codfw in case of trouble.

• ayounsi updated the task description. · Aug 10 2017, 5:39 PM

Change 371444 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Adding rack allocations, some formatting fixes

https://gerrit.wikimedia.org/r/371444

gerritbot added a project: Patch-For-Review. · Aug 11 2017, 6:01 AM

I don't think it's safe to do this maintenance until we rack all the new mediawiki machines. We have almost half of our capacity for MediaWiki in row D. We have plans to remediate that when the new mediawiki servers are racked (see T165519), but I'd say racking and setting up those servers should be a hard blocker for this maintenance at the moment.

Specifically: 19 out of 48 API appservers are in row D, and 21 out of 58 appservers are in row D.
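A quick way to see what those counts mean for the remaining hosts, assuming every server in a pool has equal capacity and traffic is balanced evenly (both are simplifying assumptions, and the function names are made up for illustration):

```python
# Host counts quoted above: (hosts in row D, total hosts in the pool).
POOLS = {
    "API appservers": (19, 48),
    "appservers": (21, 58),
}


def row_share(in_row: int, total: int) -> float:
    """Fraction of a pool's hosts that sit in the affected row."""
    return in_row / total


def surviving_load_factor(in_row: int, total: int) -> float:
    """How much more traffic each remaining host takes once the row is depooled."""
    return total / (total - in_row)


for name, (in_row, total) in POOLS.items():
    print(
        f"{name}: {row_share(in_row, total):.1%} in row D, "
        f"remaining hosts at {surviving_load_factor(in_row, total):.2f}x load"
    )
```

Whether the remaining hosts can absorb that extra load depends on how much headroom the pools normally have, which the numbers above do not capture.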

We might be able to withstand such a loss, but we'd need to explicitly depool these servers before the maintenance (set them to "inactive" in etcd). We already know this is an imbalance (that was due to space reasons, at the time) and we already have a task to solve it. I'd honestly not perform this maintenance until we do that.

I'll add a task dependency.

Sounds fair :)
Marking this task as a dependency of T165519. Any idea of the time-line for T165519?

• ayounsi updated the task description. · Aug 11 2017, 6:37 PM

Marostegui mentioned this in T163190: Checksum data on s7. · Aug 23 2017, 10:28 AM

Change 371444 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Adding rack allocations, some formatting fixes, read-only

https://gerrit.wikimedia.org/r/371444

Restricted Application added a subscriber: jeblad. · Aug 25 2017, 6:07 AM

jeblad unsubscribed. · Aug 25 2017, 9:28 PM

Joe mentioned this in T167130: Decom mw1170-mw1179, and replace them with new systems. · Aug 28 2017, 7:52 AM

• ema moved this task from Backlog to General on the Traffic board. · Aug 29 2017, 8:36 AM

• ema triaged this task as Medium priority. · Sep 28 2017, 2:48 PM

• ema moved this task from General to Network on the Traffic board. · Nov 10 2017, 4:37 PM

Joe closed subtask T165519: rack and setup mw1307-1348 as Resolved. · Jan 16 2018, 1:31 PM

Reviving this thread as T165519 is now resolved.

Aiming to do the upgrade on Wednesday Feb 14. All the details from the description should still be accurate. Let me know if not or if it should be postponed more.

• ayounsi updated the task description. · Feb 1 2018, 12:23 AM

In T172459#3935923, @ayounsi wrote:

Reviving this thread as T165519 is now resolved.

Aiming to do the upgrade on Wednesday Feb 14. All the details from the description should still be accurate. Let me know if not or if it should be postponed more.

I am afraid this cannot be done on that date.
After the last events on s5 (master crash) and s8 (split from s5) we now have 4 masters in row D that we'd need to failover.
I am going to create a task so we can have it on our radar (I completely forgot about this), but realistically failing over 4 masters in 2 weeks will not happen (this requires read only time).

Sorry!

Marostegui mentioned this in T186188: Failover DB masters in row D. · Feb 1 2018, 7:13 AM

@Marostegui do you have an approximate timeline I can base this task on?

In T172459#3938158, @ayounsi wrote:

@Marostegui do you have an approximate timeline I can base this task on?

Not really. We probably won't be doing more than one failover per week (there are 4).
I just saw that we also have to re-allocate some servers to other rows in order to move the best candidates out of row D and then fail over to them.
I will talk to Jaime on Monday and see how we can fit this into our (already) huge backlog :-)
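As a rough lower bound on that pace, the following sketch assumes the first failover could happen on Feb 1, 2018 and that failovers run back to back with no gaps. Both are optimistic assumptions, since servers must be re-allocated first, and the function name is made up for illustration:

```python
from datetime import date, timedelta
from math import ceil


def last_failover_date(start: date, masters: int, per_week: int = 1) -> date:
    """Earliest date of the final failover at the given weekly pace."""
    weeks_needed = ceil(masters / per_week)
    return start + timedelta(weeks=weeks_needed - 1)


proposed_upgrade = date(2018, 2, 14)
earliest = last_failover_date(date(2018, 2, 1), masters=4)
print(earliest.isoformat(), "is after the proposed date:", earliest > proposed_upgrade)
```

Even in this best case the fourth failover lands after the proposed upgrade date.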

Maybe your manager should talk to our manager to help us prioritize it? Right now, goals are our top priority unless we are told otherwise. Failing over those servers was on the roadmap, but a) within 2 months, not 2 weeks, and b) I was going to ask if it could be done at the same time as a datacenter failover, which we were told should happen "soon" and would make things infinitely easier for us, and probably for you, too. We failed to sync on this in our meetings; managers could discuss this on our behalf, I guess?

The question (after the initially proposed 2 week timeline was rejected) was merely "do you have an approximate timeline I can base this task on?". The answer seems to be either "within 2 months" or "after the next datacenter failover testing" (which I think is expected sometime during Q4, so probably more like 3-4 months out?). I don't think managers necessarily need to discuss this on anyone's behalf (:p), you just need to find an answer to the question from your end, which may (or may not?) depend on setting a date for the next datacenter failover date.

I would definitely vote for waiting for the DC switchover if that is possible and a reasonable timeframe for NetOps. Otherwise, we'd need to squeeze this into our huge backlog, and that might affect goals planning.

Marostegui updated the task description. · Feb 1 2018, 5:50 PM

@BBlack The thing is, we physically could do this in 2 weeks if we put it at our top priority and do nothing else. I don't know how urgent this is: whether it is long-tail maintenance that can wait, or things are literally breaking apart. A manager would know where to put it on our pile and how to prioritize it with more context. That is why I mentioned it, so we can provide you with a more accurate timing.

In our case, literally as I write things, mysql is breaking apart... wait for the ticket.

you just need to find an answer to the question from your end

If it were entirely up to us and a reasonable expectation, I would wait until a failover, in 4+ months, as it saves us time. But I do not think it is entirely up to us; we can discuss.

In T172459#3939141, @jcrespo wrote:

@BBlack The thing is, we physically could do this in 2 weeks if we put it at our top priority and do nothing else. I don't know how urgent this is: whether it is long-tail maintenance that can wait, or things are literally breaking apart. A manager would know where to put it on our pile and how to prioritize it with more context. That is why I mentioned it, so we can provide you with a more accurate timing.

In our case, literally as I write things, mysql is breaking apart... wait for the ticket.

you just need to find an answer to the question from your end

If it were entirely up to us and a reasonable expectation, I would wait until a failover, in 4+ months, as it saves us time. But I do not think it is entirely up to us; we can discuss.

As far as I'm aware, there's currently no reason for this to be a high priority (and the ticket also says this can be delayed). Clearly the current proposal is putting a lot of pressure on DB operations, so let's reschedule this to (much?) later.

One option (which we discussed in the meeting yesterday) is to delay this upgrade until we do the eqiad->codfw switchover, in the next quarter (Q4). Would that work for everyone or is there a need for this upgrade to happen sooner than that?

(And obviously... we should aim to avoid this situation of not being able to reboot network stacks in the long-term, but that's not a reason to drop all other DB work now.)

Indeed not urgent, I was not aware of the DB requirements. Waiting for the next DC switchover works for me.

• ayounsi mentioned this in T186756: Move labstore1006 and 1007 to 10G enabled racks in row A & D. · Feb 7 2018, 11:16 PM

• Cmjohnson closed subtask T185004: Decommission mw1201-mw1220 as Resolved. · Aug 7 2018, 8:06 PM

Marostegui changed the status of subtask T186188: Failover DB masters in row D from Open to Stalled. · Aug 20 2018, 9:29 AM

• ayounsi mentioned this in T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw). · Oct 15 2018, 4:07 PM

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. · Jan 26 2019, 9:25 PM

Marostegui changed the status of subtask T186188: Failover DB masters in row D from Stalled to Open. · Jun 19 2019, 10:23 AM

• jcrespo changed the status of subtask T186188: Failover DB masters in row D from Open to Stalled. · Apr 29 2020, 7:29 AM

Marostegui changed the status of subtask T186188: Failover DB masters in row D from Stalled to Open. · Aug 28 2020, 10:36 AM

Forgot about that old task! Not needed anymore as we're not using multicast anymore.

Marostegui removed a subtask: T186188: Failover DB masters in row D. · Sep 28 2020, 8:46 AM

BBlack moved this task from Network to Done on the Traffic board. · Oct 8 2021, 6:01 PM

Restricted Application added a project: Infrastructure-Foundations. · Oct 8 2021, 6:02 PM

• ayounsi mentioned this in T327248: eqiad/codfw virtual-chassis upgrades. · Jan 18 2023, 10:33 AM
