Can't commit on asw-b-codfw
Closed, Resolved · Public
Authored By

	ayounsi
	Nov 5 2021, 9:13 AM

Description

commit or commit synchronize on asw-b-codfw fails with:

fpc2: 
configuration check succeeds
error: failed to copy file '//var/etc/if_alias_map+' to 'fpc7'

Log shows:

Nov 5 09:03:30 asw-b-codfw ffp[77709]: LIBJNX_REPLICATE_RCP_ERROR: rcp -l -Ji var/etc/if_alias_map+ fpc7://var/etc/if_alias_map+ : rcp: var/etc/if_alias_map+: Input/output error
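To spot which virtual-chassis member is rejecting the file copy across many such log lines, something like the following sketch could help. The regular expression is inferred from the single sample line above, not from Juniper documentation, and the names `RCP_ERROR` and `parse_rcp_error` are made up for this illustration.

```python
import re

# Pattern for the rcp replication error quoted above. The field layout is
# an assumption based on that one sample line only.
RCP_ERROR = re.compile(
    r"LIBJNX_REPLICATE_RCP_ERROR: rcp .*? "
    r"(?P<member>fpc\d+):(?P<path>\S+) : rcp: \S+: (?P<reason>.+)$"
)


def parse_rcp_error(line):
    """Return target member, destination path and reason, or None if no match."""
    match = RCP_ERROR.search(line)
    if match is None:
        return None
    return match.groupdict()


sample = (
    "Nov 5 09:03:30 asw-b-codfw ffp[77709]: LIBJNX_REPLICATE_RCP_ERROR: "
    "rcp -l -Ji var/etc/if_alias_map+ fpc7://var/etc/if_alias_map+ : "
    "rcp: var/etc/if_alias_map+: Input/output error"
)
print(parse_rcp_error(sample))
```

Here the interesting fields are the target member (fpc7) and the reason (an I/O error), which point at the destination's storage rather than at the configuration itself.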

Looking around I was not able to find relevant doc. The closest was https://kb.juniper.net/InfoCenter/index?page=content&id=KB36459&showDraft=false or https://kb.juniper.net/InfoCenter/index?page=content&id=KB33527&cat=ROUTER_PRODUCTS&actp=LIST but for a different platform.

As support is expired I'm not able to open a JTAC case (cf. T294792).

One option is to try a master switchover, or more aggressively reboot FPC7, but those can be risky without anyone onsite and no support if it gets worse.

Details

Other Assignee: cmooney

Event Timeline

ayounsi triaged this task as High priority. (Nov 5 2021, 9:13 AM)

ayounsi created this task.

Restricted Application added a project: Infrastructure-Foundations. (Nov 5 2021, 9:13 AM)

Restricted Application added a subscriber: Aklapper.

ayounsi updated the task description. (Nov 5 2021, 9:14 AM)

Maintenance_bot added a project: SRE. (Nov 5 2021, 9:45 AM)

RhinosF1 subscribed. (Nov 6 2021, 10:05 PM)

According to @akosiaris this is due to a failed hard drive, and it might not come back up from a reboot.

@Papaul when you're back, let's replace FPC7 with one of our spares until we can RMA it.

wiki_willy added a project: ops-codfw. (Nov 8 2021, 4:28 PM)

wiki_willy subscribed.

Papaul moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board. (Nov 15 2021, 12:54 PM)

I downgraded Junos on the QFX5100 at https://netbox.wikimedia.org/dcim/devices/3423/ and ran request system zeroize on it. This is the one we will use to replace fpc7 in row B. @ayounsi, let me know if tomorrow at 9:15am my time works for you to replace fpc7. Note that the root password is set to the server mgmt password.

Model: qfx5100-48s-6q
Junos: 14.1X53-D43.7

Thanks

That works for me, thanks. Can you send a calendar invite? Note that the link in your comment doesn't point to any specific device.

This will cause a hard downtime for 6 servers (rack B7), for up to 1h, but most likely less:

(1) thanos-be2002.codfw.wmnet
role::thanos::backend:
Observability SREs - @lmata , @fgiunchedi
As this is one server, according to https://wikitech.wikimedia.org/wiki/Service_restarts#Thanos nothing needs to be done pre-emptively

(1) furud.codfw.wmnet
role::analytics_cluster::hadoop/client:
Analytics SREs - @razzi @Ottomata, @BTullis
Is any prep work or depool needed?

(2) ms-be[2033,2047].codfw.wmnet
role::swift::storage:
Data Persistence SREs - @LSobanski
As it's more than one server, according to https://wikitech.wikimedia.org/wiki/Service_restarts#Swift is anything needed before the maintenance?

(2) elastic[2043-2044].codfw.wmnet
role::elasticsearch::cirrus:
Search Platform SREs - @Gehel
As it's more than one server, according to https://wikitech.wikimedia.org/wiki/Service_restarts#Elasticsearch is anything needed before the maintenance?
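The bracketed host expressions in the list above can be expanded into individual hostnames, for example to feed a downtime script. This is a minimal sketch that only handles the two forms seen here (comma-separated values and inclusive numeric ranges in a single pair of brackets); `expand_hosts` is a name chosen for illustration, not an existing tool.

```python
import re


def expand_hosts(expr):
    """Expand 'name[a-b,c].domain' into a list of hostnames.

    Only one bracket group is supported; an expression without brackets
    is returned unchanged as a single-element list.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        first = int(start)
        last = int(end) if end else first
        width = len(start)  # preserve zero padding, if any
        for number in range(first, last + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


impacted = []
for expr in [
    "thanos-be2002.codfw.wmnet",
    "furud.codfw.wmnet",
    "ms-be[2033,2047].codfw.wmnet",
    "elastic[2043-2044].codfw.wmnet",
]:
    impacted.extend(expand_hosts(expr))
print(len(impacted), impacted)
```

Expanding the four expressions yields six hostnames, matching the six servers in rack B7 mentioned above.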

Doing it today is too soon, so let's plan it for Thursday.

ayounsi added a subscriber: BTullis. (Nov 17 2021, 10:08 AM)

Adding @MatthewVernon for the Swift hosts.

LSobanski added a project: SRE-swift-storage. (Nov 17 2021, 10:25 AM)

I don't think so, no - the frontends will not route requests to down servers (at least in theory!). We'll be more vulnerable to failures elsewhere, but I think we have to live with that.

I don't believe that we need to do any prep or depooling work for furud.codfw.wmnet.
We can downtime it in Icinga, but I think that's the limit of what we need to do.

The elasticsearch cluster should be able to cope with losing 2 nodes with no issues. Thanks for flagging this, and please ping @RKemper and myself when starting the maintenance, so that we can keep our eyes open!

@ayounsi after a chat with the team we think we should be fine, we will monitor and be available should something happen.

For the record, there is also a link to lvs2007. After chatting with @BBlack on IRC, the usual procedure (disable puppet, then stop pybal) is to be done before the maintenance.

furud does not run any active services; it can be restarted anytime.

Mentioned in SAL (#wikimedia-operations) [2021-11-18T15:35:43Z] <XioNoX> cr2-codfw# set interfaces et-1/0/3 disable - T295118

Mentioned in SAL (#wikimedia-operations) [2021-11-18T15:39:24Z] <XioNoX> lvs2007:~$ sudo service pybal stop - T295118

Mentioned in SAL (#wikimedia-operations) [2021-11-18T15:49:56Z] <XioNoX> asw-b-codfw> request system power-off member 7 - T295118

Mentioned in SAL (#wikimedia-operations) [2021-11-18T18:52:21Z] <XioNoX> asw-b-codfw> request system reboot member 7 - T295118

Mentioned in SAL (#wikimedia-operations) [2021-11-19T09:30:35Z] <XioNoX> re-enable cr2-codfw<->asw-b7-codfw link after disabling inet6 on cr2-codfw:ae2 - T295118

Mentioned in SAL (#wikimedia-operations) [2021-11-19T09:53:27Z] <XioNoX> run commit full on asw-b-codfw - T295118
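To reconstruct a timeline from the SAL mentions above, each entry can be split into timestamp, user, message and task ID. This is a sketch only: the pattern is inferred from the entries quoted in this task, and `SAL_ENTRY` and `parse_sal` are hypothetical names.

```python
import re
from datetime import datetime, timezone

# Layout assumed from the entries above: [timestamp] <user> message - task
SAL_ENTRY = re.compile(
    r"\[(?P<ts>[^\]]+)\] <(?P<user>[^>]+)> (?P<message>.*) - (?P<task>T\d+)$"
)


def parse_sal(line):
    """Parse one 'Mentioned in SAL' line into a dict, or return None."""
    match = SAL_ENTRY.search(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return entry


power_off = parse_sal(
    "Mentioned in SAL (#wikimedia-operations) [2021-11-18T15:49:56Z] "
    "<XioNoX> asw-b-codfw> request system power-off member 7 - T295118"
)
reboot = parse_sal(
    "Mentioned in SAL (#wikimedia-operations) [2021-11-18T18:52:21Z] "
    "<XioNoX> asw-b-codfw> request system reboot member 7 - T295118"
)
# Time elapsed between the two logged commands.
print(reboot["ts"] - power_off["ts"])
```

The difference only measures the gap between two log entries; it says nothing about how long the member was actually down.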

Current status:

IPv6 is still broken on asw-b7-codfw (for traffic local and transiting through the switch)
inet6 is disabled on cr2-codfw:ae2 (to row B)
- That means row B has uplink redundancy for v4 but not v6
lvs2007 and codfw will stay depooled until Monday, when more intrusive remediation will be performed
- codfw can be repooled if needed (eg. eqiad issue)
JTAC ticket can't be opened until T294792 is done

On Monday we will try, in order of impact:

master switchover
reboot B2 (master)

If neither is successful, we will plan a fabric upgrade.

Hopefully we won't need to, but if asw1-b2-codfw needs to be rebooted, here are the impacted servers:
ms-be2041
ms-be2046
ms-be2031
ms-be2032
ms-fe2006
moss-be2002 (not active)
@MatthewVernon

elastic2041
elastic2042
elastic2057
@RKemper

thanos-fe2002
kafka-logging2002 (not active)
@fgiunchedi

cp2031
cp2032
lvs2009
lvs2008
@BBlack

Please let me know if any depool is needed, especially if not listed on https://wikitech.wikimedia.org/wiki/Service_restarts
The LVS might be particularly problematic?

@ayounsi both lvs2008 and lvs2009 are primary LVS, so lvs2010 would assume the load of both during the asw1-b2-codfw reboot. Far from ideal, but it should be OK.

Mentioned in SAL (#wikimedia-operations) [2021-11-22T13:04:02Z] <XioNoX> asw-b-codfw# set virtual-chassis member 7 mastership-priority 255 - T295118

The above command doesn't commit on a pre-provisioned VC.

I did this instead:

[edit virtual-chassis member 2]
-   role routing-engine;
+   role line-card;

With a commit confirmed 1.

fpc7 is now the master, but after the rollback, fpc2 still shows as linecard. A commit full didn't fix that.

However the IPv6 issue is solved.

Edit: another round of flipping the fpc2 roles fixed it.

Mentioned in SAL (#wikimedia-operations) [2021-11-22T13:30:17Z] <XioNoX> re-enabling V6 between cr2-codfw and asw-b-codfw - T295118

Mentioned in SAL (#wikimedia-operations) [2021-11-22T13:32:54Z] <XioNoX> re-enable pybal on lvs2007 - T295118

Codfw repooled, everything is back to normal.
