
eqiad row D switch upgrade
Open, Stalled, Medium, Public

Description

Similar to T167274, T168462, T169345, T170380 and especially T148506.

This is the last switch that needs to be upgraded to fix T133387.

As eqiad is the active datacenter and row D contains multiple systems, the first step is to evaluate and discuss the difficulty/impact of the upgrade. Note that all services are supposed to have at least row redundancy.
Thanks to the effort put into preparing T148506, this might be easier than before.

I'm planning on doing the upgrade on Wednesday Feb. 14th at 0800 PDT; 1500 UTC; 1100 EDT
That date is only a starting point; I can reschedule freely, so let me know if it is an issue for anyone.

2h total maintenance time.

As the previous standard upgrade went smoothly, we're doing the same thing, which means the whole row will go down for 10 to 20 minutes (previous rows took ~10 minutes) if there are no complications.

The full list of servers is available at: https://racktables.wikimedia.org/index.php?row_id=2102&location_id=2006&page=row&tab=default

To summarize, here are the types of hosts in that row:

analytics*
aqs*
auth*
cp*
db*
dbstore*
druid*
dumpsdata*
einsteinium <- Icinga
elastic*
es*
kafka*
kubernetes*
labpuppetmaster*
labservices*
labstore*
labweb*
logstash*
maps*
mc*
ms-be*
mw*
ocg*
ores*
pc*
puppetmaster*
rdb*
restbase*
restbase-dev*
scb*
snapshot*
stat*
thorium
thumbor*
wdqs*
wtp*

I subscribed users to this task based on https://etherpad.wikimedia.org/p/p0Iq39YWZg . Don't hesitate to add people I could have missed, or remove yourself from the task if you're not involved.

Timeline (to be completed): please edit and add anyone who needs to be notified or any extra step that needs to be done.

Days before:

  • Depool puppetmaster1002 ( T148506#3196641 )
  • Switchover from einsteinium to tegmen T163324
  • Switchover from oresrdb1001 to oresrdb1002 ( T163326 )
  • Ban all elasticsearch nodes of row D ( @Gehel )
  • Fail etcd over to codfw
  • Ensure we can survive a loss of labservices1001 ( T163402 )
  • Failover affected DB masters (T186188)

1h before the window:

After the upgrade:

  • Confirm switches are in a healthy state
  • Re-enable igmp-snooping
  • Confirm T133387 is fixed (otherwise re-disable igmp-snooping)
  • Run a LibreNMS discovery/poll
  • Ask the list of users above to confirm that all is good
  • Repool puppetmaster1002 (@akosiaris)
  • Unban elasticsearch nodes of row D (@Gehel )
  • Elasticsearch reindex that time period if we see any lost writes (@Gehel )
  • Re-enable all Icinga checks
  • Remove monitoring downtime
  • ...

Event Timeline

ayounsi created this task. Aug 3 2017, 11:04 PM
Restricted Application added a subscriber: Aklapper. Aug 3 2017, 11:04 PM
ayounsi updated the task description. Aug 4 2017, 1:40 AM
Paladox added a subscriber: Paladox. Aug 4 2017, 1:45 AM

Hi,

We have some critical DB hosts in that row. We would need either to fail them over or to tell users that a read-only period is coming.
Failing over those hosts takes more time, and we will probably not be able to do it in time for the date that you suggest.

Those critical systems are:

db1068 - s4 master (commons)
db1062 - s7 master (among other big wikis....centralauth is there)

elukey added a subscriber: Ottomata. Edited · Aug 4 2017, 8:04 AM

A couple of notes from my side after reading the host list:

Analytics:

  1. Having all the analytics* hosts in row D down for a brief time shouldn't be an issue, since the Hadoop cluster data is replicated in 3 rows.
  2. We'd need to inform data analysts, and probably the analytics mailing list, since thorium runs all the Analytics websites and the stat* hosts are where our colleagues run various kinds of jobs (some requiring MySQL data, others Hadoop, etc.).
  3. Two Kafka hosts of the Analytics cluster will go down at the same time. That is not ideal, but we already verified that it is fine for a brief time window.

Session/Object cache:

As far as I can see, 4 mc10* hosts will go down at the same time (1/4 of the cluster), meaning that some logged-in users will probably see some impact (lost sessions, auth issues, etc.). This will happen because MediaWiki connects to the mc* cluster via nutcracker on localhost, and each mc* host represents a non-replicated shard. So when one mc* host goes down, nutcracker detects it and takes it out, recalculating its consistent hashing pool (so requests that would have targeted the hosts that are down will go to another one). This is probably fine, but it might be good to alert our community liaisons beforehand just in case (so editors can be informed straight away if the maintenance takes a bit longer than expected).
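To illustrate the remapping described above, here is a minimal sketch of a consistent hash ring in Python. It is not nutcracker's actual implementation: the MD5-based ring, the 100 virtual points per node, and the node and key names are all assumptions chosen for illustration. It only shows the general property that, when nodes drop out, the keys they owned move to surviving nodes while every other key stays where it was.

```python
import bisect
import hashlib


def _point(label: str) -> int:
    """Map a string to a ring position (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent hash ring with virtual points (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        # Each node contributes `vnodes` points spread around the ring.
        self._ring = sorted(
            (_point(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first point after its own position,
        # wrapping around at the end of the ring.
        idx = bisect.bisect(self._points, _point(key)) % len(self._ring)
        return self._ring[idx][1]


def remap_after_loss(nodes, lost, keys):
    """Return the rings before/after losing `lost`, plus the keys that moved."""
    before = HashRing(nodes)
    after = HashRing([n for n in nodes if n not in lost])
    moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
    return before, after, moved


if __name__ == "__main__":
    # Hypothetical 16-shard cluster losing 4 shards (1/4 of the cluster).
    nodes = [f"shard{i:02d}" for i in range(1, 17)]
    lost = set(nodes[:4])
    keys = [f"session:{i}" for i in range(2000)]
    _, _, moved = remap_after_loss(nodes, lost, keys)
    print(f"{len(moved)} of {len(keys)} keys moved to a surviving shard")
```

Because the shards are not replicated, a key that moves effectively starts from an empty cache entry on its new shard, which is why logged-in users may notice lost sessions.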

Config cluster:

conf1003 will go down, but it shouldn't be a big issue since zookeeper/etcd will keep working without any problem. The etcd mirror is not pulling data from conf1003, so that one should be OK too.

Job Queues:

rdb1006 is an eqiad replica of rdb1005, so we are fine on this side too. Just to be super paranoid, it might be good to force a Redis restart on rdb2005 to force replication and make sure that, if rdb1005 goes down at the same time, we have all the data in perfect sync with the master (I know it might be too paranoid, but I am writing down everything :)

MediaWiki:

Better to double check all the hosts that we take down (if the related clusters are balanced), and possibly depool them beforehand from pybal.

Last but not least, I will be on vacation from the 20th to the 27th, so I will not be around for the Analytics part. @Ottomata should be around though! (I need to check with him, but there shouldn't be any issue.)

Gehel added a comment. Aug 8 2017, 8:15 AM

For the elasticsearch cluster:

The cluster should be able to survive the loss of a full row. This is not something we have tested under load yet, so this is a good occasion to do it. In case of trouble, we can always switch the search traffic to codfw.

The risks are:

  • performance (obviously, with 1/4 of the capacity gone, we might see increased response times)
  • the shard check will most probably alert (with 1 row gone, not all replicas can be allocated)
  • in the worst case, we might lose writes to some indices, but we can reindex those afterward.

Conclusion:

We will not drain row D before the maintenance, but will keep an eye on it and be ready to switch traffic to codfw in case of trouble.
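For reference, if draining (the "ban" step in the task description) were preferred at some point, one possible shape for the request body is sketched below. This is only a sketch: it assumes the nodes are tagged with a custom node attribute, hypothetically named `row` here, and uses Elasticsearch's `cluster.routing.allocation.exclude.<attribute>` cluster setting, sent with `PUT /_cluster/settings`. The attribute name and value are illustrative and not taken from the actual cluster configuration.

```python
import json


def exclude_by_attribute(attribute: str, value: str, persistent: bool = False) -> dict:
    """Build a cluster-settings body that moves shards off matching nodes."""
    scope = "persistent" if persistent else "transient"
    return {scope: {f"cluster.routing.allocation.exclude.{attribute}": value}}


def clear_exclusion(attribute: str, persistent: bool = False) -> dict:
    """Build the body that lifts the exclusion (a null value resets the setting)."""
    scope = "persistent" if persistent else "transient"
    return {scope: {f"cluster.routing.allocation.exclude.{attribute}": None}}


def render(body: dict) -> str:
    """Serialize a body as JSON, ready to send to PUT /_cluster/settings."""
    return json.dumps(body, indent=2)


if __name__ == "__main__":
    # "row" and "D" are hypothetical attribute/value names.
    print(render(exclude_by_attribute("row", "D")))
    print(render(clear_exclusion("row")))
```

A transient setting is used by default so that the exclusion does not survive a full cluster restart; whether that is the right choice here is a judgment call.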

ayounsi updated the task description. Aug 10 2017, 5:39 PM

Change 371444 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Adding rack allocations, some formatting fixes

https://gerrit.wikimedia.org/r/371444

Joe added a comment. Aug 11 2017, 7:20 AM

I don't think it is safe to do this maintenance until we rack all the new MediaWiki machines. We have almost half of our capacity for MediaWiki in row D. We have plans to remediate that when the new MediaWiki servers are racked (see T165519), but I'd say racking and setting up those servers should be a hard blocker for this maintenance at the moment.

Specifically: 19 out of 48 API appservers are in row D, and 21 out of 58 appservers are in row D.
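As a quick check of what those numbers imply, the arithmetic can be sketched as follows. The host counts come from the comment above; the assumption that load spreads evenly over the remaining servers is a simplification, and the pool names are just labels.

```python
# Hosts in row D vs. total, per pool (figures quoted in the comment above).
POOLS = {
    "api_appservers": (19, 48),
    "appservers": (21, 58),
}


def summarize(pools):
    """Compute what is left, and how much harder it works, if a row is lost."""
    report = {}
    for name, (in_row, total) in pools.items():
        remaining = total - in_row
        report[name] = {
            "remaining": remaining,
            # Share of the pool that disappears with the row, in percent.
            "share_lost_pct": round(100 * in_row / total, 1),
            # Relative load per remaining server, assuming even distribution.
            "load_factor": round(total / remaining, 2),
        }
    return report


if __name__ == "__main__":
    for name, stats in summarize(POOLS).items():
        print(name, stats)
    # api_appservers: 29 remaining, 39.6% lost, load factor 1.66
    # appservers:     37 remaining, 36.2% lost, load factor 1.57
```

Under that simplification, each remaining server would handle roughly 1.6 times its usual share of requests for the duration of the outage.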

We might be able to withstand such a loss, but we'd need to explicitly depool these servers before the maintenance (set them to "inactive" in etcd). We already know this is an imbalance (it was due to space constraints at the time) and we already have a task to solve it. I'd honestly not perform this maintenance until we do that.

I'll add a task dependency.

ayounsi changed the task status from Open to Stalled. Aug 11 2017, 6:35 PM

Sounds fair :)
Marking this task as a dependency of T165519. Any idea of the time-line for T165519?

ayounsi updated the task description. Aug 11 2017, 6:37 PM

Change 371444 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Adding rack allocations, some formatting fixes, read-only

https://gerrit.wikimedia.org/r/371444

Restricted Application added a subscriber: jeblad. Aug 25 2017, 6:07 AM
jeblad removed a subscriber: jeblad. Aug 25 2017, 9:28 PM
ema moved this task from Triage to General on the Traffic board. Aug 29 2017, 8:36 AM
ema triaged this task as Medium priority. Sep 28 2017, 2:48 PM
ema moved this task from General to Network on the Traffic board. Nov 10 2017, 4:37 PM

Reviving this thread as T165519 is now resolved.

Aiming to do the upgrade on Wednesday Feb 14. All the details from the description should still be accurate. Let me know if not or if it should be postponed more.

ayounsi updated the task description. Feb 1 2018, 12:23 AM

I am afraid this cannot be done on that date.
After the last events on s5 (master crash) and s8 (split from s5) we now have 4 masters in row D that we'd need to failover.
I am going to create a task so we can have it on our radar (I completely forgot about this), but realistically failing over 4 masters in 2 weeks will not happen (this requires read only time).

Sorry!

@Marostegui do you have an approximate timeline I can base this task on?

Not really. We probably won't be doing more than one failover per week (there are 4).
I just saw that we also have to re-allocate some servers to other rows in order to move the best candidates out of row D and then fail over to them.
I will talk to Jaime on Monday and see how we can fit this into our (already) huge backlog :-)

Maybe your manager should talk to our manager to help us prioritize it? Right now, goals are our top priority unless we are told otherwise. Failing over those servers was on the roadmap, but a) within 2 months, not 2 weeks, and b) I was going to ask if it could be done at the same time as a datacenter failover, which we were told should happen "soon" and which would make things infinitely easier for us, and probably for you, too. We failed to sync on this in our meetings; managers could discuss this on our behalf, I guess?

BBlack added a comment. Feb 1 2018, 5:37 PM

The question (after the initially proposed 2-week timeline was rejected) was merely "do you have an approximate timeline I can base this task on?". The answer seems to be either "within 2 months" or "after the next datacenter failover testing" (which I think is expected sometime during Q4, so probably more like 3-4 months out?). I don't think managers necessarily need to discuss this on anyone's behalf (:p); you just need to find an answer to the question from your end, which may (or may not?) depend on setting a date for the next datacenter failover.

I would definitely vote for waiting for the DC switchover, if that is possible and a reasonable timeframe for NetOps. Otherwise, we'd need to squeeze this into our huge backlog, and that might affect goals planning.

Marostegui updated the task description. Feb 1 2018, 5:50 PM
jcrespo added a comment. Edited · Feb 1 2018, 8:53 PM

@BBlack The thing is, we physically could do this in 2 weeks if we made it our top priority and did nothing else. I don't know how urgent this is: whether it is long-tail maintenance that can wait, or things are literally breaking apart. A manager would know where to put it on our pile and how to prioritize it with more context; that is why I mentioned it, so we can provide you with a more accurate timing.

In our case, literally as I write things, mysql is breaking apart... wait for the ticket.

you just need to find an answer to the question from your end

If it were entirely down to a reasonable expectation, I would wait until a failover, in 4+ months, as it would save us time. But I do not think it is entirely up to us; we can discuss.

mark added a subscriber: mark. Feb 6 2018, 1:10 PM

As far as I'm aware, there's currently no reason for this to be a high priority (and the ticket also says this can be delayed). Clearly the current proposal is putting a lot of pressure on DB operations, so let's reschedule this to (much?) later.

One option (which we discussed in the meeting yesterday) is to delay this upgrade until we do the eqiad->codfw switchover, in the next quarter (Q4). Would that work for everyone or is there a need for this upgrade to happen sooner than that?

(And obviously... we should aim to avoid this situation of not being able to reboot network stacks in the long-term, but that's not a reason to drop all other DB work now.)

Indeed not urgent, I was not aware of the DB requirements. Waiting for the next DC switchover works for me.

Marostegui changed the status of subtask T186188: Failover DB masters in row D from Open to Stalled. Aug 20 2018, 9:29 AM
Marostegui changed the status of subtask T186188: Failover DB masters in row D from Stalled to Open. Jun 19 2019, 10:23 AM