
Can't commit on asw-b-codfw
Closed, Resolved · Public

Description

commit or commit synchronize on asw-b-codfw fails with:

fpc2: 
configuration check succeeds
error: failed to copy file '//var/etc/if_alias_map+' to 'fpc7'

Log shows:

Nov 5 09:03:30 asw-b-codfw ffp[77709]: LIBJNX_REPLICATE_RCP_ERROR: rcp -l -Ji var/etc/if_alias_map+ fpc7://var/etc/if_alias_map+ : rcp: var/etc/if_alias_map+: Input/output error
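For anyone grepping syslog for the same failure, here is a minimal Python sketch that pulls the target member, remote path, and error reason out of a line like the one above. The regular expression and the `parse_rcp_error` helper are illustrative assumptions based only on this single log line, not a documented Junos log format.

```python
import re
from typing import Optional

# Hypothetical pattern, derived only from the single log line quoted in this task.
RCP_ERROR = re.compile(
    r"LIBJNX_REPLICATE_RCP_ERROR: rcp .*? "
    r"(?P<target>fpc\d+):(?P<path>\S+) : rcp: \S+: (?P<reason>.+)$"
)


def parse_rcp_error(line: str) -> Optional[dict]:
    """Return the target member, remote path and reason, or None if no match."""
    match = RCP_ERROR.search(line)
    return match.groupdict() if match else None


LINE = (
    "Nov 5 09:03:30 asw-b-codfw ffp[77709]: LIBJNX_REPLICATE_RCP_ERROR: "
    "rcp -l -Ji var/etc/if_alias_map+ fpc7://var/etc/if_alias_map+ : "
    "rcp: var/etc/if_alias_map+: Input/output error"
)

print(parse_rcp_error(LINE))
```

Here the target is fpc7 and the reason is an I/O error, which suggests the copy fails on the receiving member rather than on the member where the commit was issued.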

Looking around, I was not able to find relevant documentation. The closest were https://kb.juniper.net/InfoCenter/index?page=content&id=KB36459&showDraft=false and https://kb.juniper.net/InfoCenter/index?page=content&id=KB33527&cat=ROUTER_PRODUCTS&actp=LIST but for a different platform.

As support has expired, I'm not able to open a JTAC case (cf. T294792).

One option is to try a master switchover or, more aggressively, to reboot FPC7, but both can be risky with no one onsite and no support if it gets worse.

Details

Other Assignee
cmooney

Event Timeline

ayounsi triaged this task as High priority. Nov 5 2021, 9:13 AM
ayounsi created this task.
ayounsi added a subscriber: akosiaris.

According to @akosiaris this is due to a failed hard drive, and it might not come back up from a reboot.

@Papaul when you're back, let's replace FPC7 with one of our spares until we can RMA it.

I downgraded Junos on the QFX5100 at https://netbox.wikimedia.org/dcim/devices/3423/ and ran request system zeroize on it. This is the one we will use to replace fpc7 in row B. @ayounsi, let me know if tomorrow at 9:15am my time works for you for us to replace fpc7. Note that the root password is set to the server mgmt password.

Model: qfx5100-48s-6q
Junos: 14.1X53-D43.7

Thanks

That works for me, thanks. Can you send a calendar invite? Note that the link in your comment doesn't point to any specific device.

This will cause a hard downtime for 6 servers (rack B7), for up to 1h, but most likely less:

(1) thanos-be2002.codfw.wmnet
role::thanos::backend:
Observability SREs - @lmata , @fgiunchedi
As this is one server, according to https://wikitech.wikimedia.org/wiki/Service_restarts#Thanos nothing needs to be done pre-emptively

(1) furud.codfw.wmnet
role::analytics_cluster::hadoop/client:
Analytics SREs - @razzi @Ottomata, @BTullis
Is any prep work or depool needed?

(2) ms-be[2033,2047].codfw.wmnet
role::swift::storage:
Data Persistence SREs - @LSobanski
As it's more than one server: per https://wikitech.wikimedia.org/wiki/Service_restarts#Swift, is anything needed before the maintenance?

(2) elastic[2043-2044].codfw.wmnet
role::elasticsearch::cirrus:
Search Platform SREs - @Gehel
As it's more than one server: per https://wikitech.wikimedia.org/wiki/Service_restarts#Elasticsearch, is anything needed before the maintenance?

Doing it today is too soon, so let's plan it for Thursday.
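The host list above uses a compact bracket notation (comma-separated values and dash ranges). To turn it into individual hostnames, for example to schedule monitoring downtime, a minimal Python sketch could look like this. The `expand_hosts` helper is hypothetical and only handles a single bracket group of the two forms seen in this task; it is not the parser of any particular tool.

```python
import re
from typing import List

# Hypothetical helper: handles one [..] group with commas and dash ranges only.
BRACKETS = re.compile(
    r"(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\](?P<suffix>[^\[\]]*)"
)


def expand_hosts(pattern: str) -> List[str]:
    """Expand 'name[a,b]' or 'name[a-b]' into a list of hostnames."""
    match = BRACKETS.fullmatch(pattern)
    if match is None:
        # No bracket group: the pattern is already a single hostname.
        return [pattern]
    hosts = []
    for item in match["body"].split(","):
        start, _, end = item.strip().partition("-")
        end = end or start  # a bare value is a range of one
        width = len(start)  # preserve zero padding, if any
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{match['prefix']}{number:0{width}d}{match['suffix']}")
    return hosts


for pattern in (
    "thanos-be2002.codfw.wmnet",
    "furud.codfw.wmnet",
    "ms-be[2033,2047].codfw.wmnet",
    "elastic[2043-2044].codfw.wmnet",
):
    print(expand_hosts(pattern))
```

Expanding the four entries this way yields six hostnames, matching the count given for rack B7.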

I don't think so, no - the frontends will not route requests to down servers (at least in theory!); we'll be more vulnerable to failures elsewhere, but I think we have to live with that.

I don't believe that we need to do any prep or depooling work for furud.codfw.wmnet.
We can downtime it in Icinga, but I think that's the limit of what we need to do.

The elasticsearch cluster should be able to cope with losing 2 nodes with no issues. Thanks for flagging this, and please ping @RKemper and myself when starting the maintenance, so that we can keep our eyes open!

@ayounsi after a chat with the team we think we should be fine, we will monitor and be available should something happen.

For the record, there is also a link to lvs2007. After chatting with @BBlack on IRC, the usual procedure (disable Puppet, then stop PyBal) is to be done before the maintenance.

furud does not run any active services; it can be restarted anytime.

Mentioned in SAL (#wikimedia-operations) [2021-11-18T15:35:43Z] <XioNoX> cr2-codfw# set interfaces et-1/0/3 disable - T295118

Mentioned in SAL (#wikimedia-operations) [2021-11-18T15:39:24Z] <XioNoX> lvs2007:~$ sudo service pybal stop - T295118

Mentioned in SAL (#wikimedia-operations) [2021-11-18T15:49:56Z] <XioNoX> asw-b-codfw> request system power-off member 7 - T295118

Mentioned in SAL (#wikimedia-operations) [2021-11-18T18:52:21Z] <XioNoX> asw-b-codfw> request system reboot member 7 - T295118

Mentioned in SAL (#wikimedia-operations) [2021-11-19T09:30:35Z] <XioNoX> re-enable cr2-codfw<->asw-b7-codfw link after disabling inet6 on cr2-codfw:ae2 - T295118

Mentioned in SAL (#wikimedia-operations) [2021-11-19T09:53:27Z] <XioNoX> run commit full on asw-b-codfw - T295118
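To reconstruct a timeline from the SAL mentions above, here is a minimal Python sketch that parses an entry into timestamp, author, message, and task ID, then computes the time between two entries. The regular expression and the `parse_sal` helper are assumptions modeled on the lines quoted in this task, not an official SAL format definition.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern modeled on the "Mentioned in SAL" lines in this task.
SAL_ENTRY = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)\] "
    r"<(?P<who>[^>]+)> (?P<msg>.*?)(?: - (?P<task>T\d+))?$"
)


def parse_sal(line):
    """Return a dict with ts (aware datetime), who, msg and task, or None."""
    match = SAL_ENTRY.search(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(
        entry["ts"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return entry


POWER_OFF = (
    "Mentioned in SAL (#wikimedia-operations) [2021-11-18T15:49:56Z] "
    "<XioNoX> asw-b-codfw> request system power-off member 7 - T295118"
)
REBOOT = (
    "Mentioned in SAL (#wikimedia-operations) [2021-11-18T18:52:21Z] "
    "<XioNoX> asw-b-codfw> request system reboot member 7 - T295118"
)

power_off = parse_sal(POWER_OFF)
reboot = parse_sal(REBOOT)
elapsed = reboot["ts"] - power_off["ts"]
print(power_off["msg"])
print(elapsed)
```

The computed interval is only the time between the two logged commands; the log does not say when servers actually lost or regained connectivity.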

Current status:

  • IPv6 is still broken on asw-b7-codfw (for traffic local to and transiting through the switch)
  • inet6 is disabled on cr2-codfw:ae2 (to row B)
    • That means row B has uplink redundancy for v4 but not v6
  • lvs2007 and codfw will stay depooled until Monday, when more intrusive remediation will be performed
    • codfw can be repooled if needed (e.g. an eqiad issue)
  • JTAC ticket can't be opened until T294792 is done

On Monday we will try (in order of impact):

  1. master switchover
  2. reboot B2 (master)

If that is not successful, we will plan a fabric upgrade.

Hopefully we won't need to, but if asw1-b2-codfw needs to be rebooted, here are the impacted servers:
ms-be2041
ms-be2046
ms-be2031
ms-be2032
ms-fe2006
moss-be2002 (not active)
@MatthewVernon

elastic2041
elastic2042
elastic2057
@RKemper

thanos-fe2002
kafka-logging2002 (not active)
@fgiunchedi

cp2031
cp2032
lvs2009
lvs2008
@BBlack

Please let me know if any depool is needed, especially if not listed on https://wikitech.wikimedia.org/wiki/Service_restarts
The LVS might be particularly problematic?

@ayounsi both lvs2008 and lvs2009 are primary LVS, so lvs2010 would assume the load of both during the asw1-b2-codfw reboot. Far from ideal, but it should be OK.

Mentioned in SAL (#wikimedia-operations) [2021-11-22T13:04:02Z] <XioNoX> asw-b-codfw# set virtual-chassis member 7 mastership-priority 255 - T295118

The above command doesn't commit on a pre-provisioned VC.

I did this instead:

[edit virtual-chassis member 2]
-   role routing-engine;
+   role line-card;

With a commit confirmed 1.

fpc7 is now the master, but after the rollback, fpc2 still shows as line-card. A commit full didn't fix that.

However, the IPv6 issue is solved.

Edit: another round of flipping the fpc2 roles fixed it.

Mentioned in SAL (#wikimedia-operations) [2021-11-22T13:30:17Z] <XioNoX> re-enabling V6 between cr2-codfw and asw-b-codfw - T295118

Mentioned in SAL (#wikimedia-operations) [2021-11-22T13:32:54Z] <XioNoX> re-enable pybal on lvs2007 - T295118

Codfw repooled, everything is back to normal.