
Move codfw frack to new infra
Closed, Resolved · Public

Description

Now that the new codfw infra is ready and configured, it's time to move the servers over and test it.

That will start on Tuesday, August 8th, at 15:00 UTC / 10:00 CDT / 08:00 PDT, and should take at most 4h, or 6h if a rollback is needed.
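The local times and window lengths can be cross-checked with a few lines of Python. This is only a sketch: the fixed offsets (UTC-5 for CDT, UTC-7 for US Pacific daylight time, which applies in August) and the helper names `window` and `fmt` are assumptions for illustration; the 4h and 6h durations come from the plan above.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumed fixed offsets for August (daylight saving time in effect).
CDT = timezone(timedelta(hours=-5), "CDT")
PDT = timezone(timedelta(hours=-7), "PDT")

# Start of the maintenance window, as announced in the task.
START = datetime(2017, 8, 8, 15, 0, tzinfo=UTC)


def window(start, hours):
    """Return (start, end) for a window lasting the given number of hours."""
    return start, start + timedelta(hours=hours)


def fmt(dt, tz):
    """Format a datetime as HH:MM followed by the zone name."""
    return dt.astimezone(tz).strftime("%H:%M %Z")


if __name__ == "__main__":
    for label, hours in (("planned", 4), ("with rollback", 6)):
        start, end = window(START, hours)
        spans = [f"{fmt(start, tz)} -> {fmt(end, tz)}" for tz in (UTC, CDT, PDT)]
        print(f"{label}: " + ", ".join(spans))
```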

During the window (2h):

  • Deactivate interfaces on cr1/2-codfw to pfw1/2-codfw
  • Advertise codfw-frack routes from pfw3-codfw
delete protocols bgp group Production export NONE
delete protocols bgp group VPN export NONE
set protocols bgp group Production export BGP_fundraising_export
set protocols bgp group VPN export BGP_fundraising_aggregates
  • Repatch servers to new switch stack
hostname | new port | old port
payments2001 | ge-0/0/0 | pfw1:ge-2/0/0
payments2003 | ge-0/0/1 | pfw1:ge-2/0/1
pay-lvs2001 | ge-0/0/2 | pfw1:ge-2/0/2
heka | ge-0/0/5 | pfw1:ge-2/0/3
saiph | ge-0/0/13 | pfw1:ge-2/0/4
alnilam | ge-0/0/9 | pfw1:ge-2/0/5
rigel | ge-0/0/14 | pfw1:ge-2/0/6
frdb2001 | ge-0/0/10 | pfw1:ge-2/0/7
frbackup2001 | ge-0/0/6 | pfw1:ge-2/0/8
payments2002 | ge-1/0/3 | pfw2:ge-11/0/0
pay-lvs2002 | ge-1/0/4 | pfw2:ge-11/0/1
mintaka | ge-1/0/11 | pfw2:ge-11/0/2
alnitak | ge-1/0/12 | pfw2:ge-11/0/3
bellatrix | ge-1/0/7 | pfw2:ge-11/0/4
betelgeuse | ge-1/0/8 | pfw2:ge-11/0/5
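
A repatch plan like the one above is easy to sanity-check before the window. The sketch below is illustrative only: it copies the hostname, new port and old port values from the table into a hypothetical `REPATCH` list, checks that nothing is assigned twice, and groups hosts by the first number of the new port name (assumed here to identify the switch stack member).

```python
from collections import Counter, defaultdict

# (hostname, new port on the new switch stack, old port on pfw1/pfw2),
# copied from the repatch table.
REPATCH = [
    ("payments2001", "ge-0/0/0", "pfw1:ge-2/0/0"),
    ("payments2003", "ge-0/0/1", "pfw1:ge-2/0/1"),
    ("pay-lvs2001", "ge-0/0/2", "pfw1:ge-2/0/2"),
    ("heka", "ge-0/0/5", "pfw1:ge-2/0/3"),
    ("saiph", "ge-0/0/13", "pfw1:ge-2/0/4"),
    ("alnilam", "ge-0/0/9", "pfw1:ge-2/0/5"),
    ("rigel", "ge-0/0/14", "pfw1:ge-2/0/6"),
    ("frdb2001", "ge-0/0/10", "pfw1:ge-2/0/7"),
    ("frbackup2001", "ge-0/0/6", "pfw1:ge-2/0/8"),
    ("payments2002", "ge-1/0/3", "pfw2:ge-11/0/0"),
    ("pay-lvs2002", "ge-1/0/4", "pfw2:ge-11/0/1"),
    ("mintaka", "ge-1/0/11", "pfw2:ge-11/0/2"),
    ("alnitak", "ge-1/0/12", "pfw2:ge-11/0/3"),
    ("bellatrix", "ge-1/0/7", "pfw2:ge-11/0/4"),
    ("betelgeuse", "ge-1/0/8", "pfw2:ge-11/0/5"),
]


def duplicates(values):
    """Return the sorted values that appear more than once."""
    return sorted(v for v, n in Counter(values).items() if n > 1)


def check(plan):
    """Report duplicated hostnames, new ports and old ports in the plan."""
    hosts, new_ports, old_ports = zip(*plan)
    return {
        "hostnames": duplicates(hosts),
        "new_ports": duplicates(new_ports),
        "old_ports": duplicates(old_ports),
    }


def by_member(plan):
    """Group hostnames by the first number in the new port name.

    Assumption: in "ge-X/Y/Z", X identifies the stack member.
    """
    groups = defaultdict(list)
    for host, new_port, _old_port in plan:
        member = new_port.split("-")[1].split("/")[0]
        groups[member].append(host)
    return dict(groups)


if __name__ == "__main__":
    print(check(REPATCH))
    for member, hosts in sorted(by_member(REPATCH).items()):
        print(f"member {member}: {len(hosts)} hosts: {', '.join(hosts)}")
```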

After the migration, testing (2h):

  • Verify monitoring is all green
  • Verify BGP sessions are UP (pybal)
  • Do failover tests (unplug each device and core link, verify failover time/behavior)
  • Verify NAT
  • Verify cross DC syncs

Rollback decision

Cleanup

  • cr1/2
# That's mr1-codfw (unrelated)
delete policy-options prefix-list fundraising-codfw4 208.80.153.196/32
# Not needed after migration (old pfw-codfw lo0)
delete policy-options prefix-list fundraising-codfw4 208.80.153.195/32
# Old BGP neighbor IPs
delete protocols bgp group fundraising neighbor 208.80.153.215
delete protocols bgp group fundraising neighbor 208.80.153.217
# Multipath not needed
delete protocols bgp group fundraising multipath
# Static routes only needed during transition
delete routing-options static route 208.80.153.197/32

Set previous interfaces to pfw-codfw as disabled

  • pfw-eqiad
delete firewall family inet filter loopback4 term allow_codfw from source-address 208.80.153.195/32

Remove IPsec/BGP to old pfw-codfw

  • Remove dns entries
  • Remove rancid config
  • Remove from Icinga
  • Remove from LibreNMS
  • Remove from torrus

Unrack final part of rack elevation, back to T169643

Event Timeline


Change 368824 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Add codfw frack to Smokeping, Icinga and Rancid

https://gerrit.wikimedia.org/r/368824

Change 368824 merged by Ayounsi:
[operations/puppet@production] Add codfw frack to Smokeping, Icinga and Rancid

https://gerrit.wikimedia.org/r/368824

Change 369436 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Icinga: remove Juniper Alarms check as not exposed via SNMP

https://gerrit.wikimedia.org/r/369436

Change 369436 merged by Ayounsi:
[operations/puppet@production] Icinga: remove Juniper Alarms check as not exposed via SNMP

https://gerrit.wikimedia.org/r/369436

Mentioned in SAL (#wikimedia-operations) [2017-08-08T15:03:06Z] <XioNoX> starting pfw-codfw migration - T171970

Change 370658 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Remove pfw-codfw from Smokeping, Rancid, Torrus, Icinga

https://gerrit.wikimedia.org/r/370658

Change 370661 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Remove pfw-codfw from DNS

https://gerrit.wikimedia.org/r/370661

Failover tests

Rate: 1 ping per second

Tested item | ping from bast2001 to rigel | ping from external to pfw3 | ping from tellurium to frbackup2001 | Notes
cr1 – pfw3a link | no loss | no loss | no loss |
cr2 – pfw3b link | no loss | no loss | no loss |
pfw3a node (primary) | 28 pings lost | 25 pings lost | 60 pings lost | ~6min to full up
pfw3b node (secondary) | no loss | no loss | no loss | Icinga alert about fab links being down + cr interfaces down
RG0 Manual failover | 316 pings lost | 84 pings lost | 326 pings lost | To be investigated, triggered a rg1 failover as well
RG0 Manual failover; node0 to node1 | 32 pings lost | 32 pings lost | 70 pings lost | Second try, failover count went from 0 to 5
RG0 Manual failover; node1 to node0 | 16 pings lost | 16 pings lost | 54 pings lost |
pfw3a – fasw-c8a link | no loss | no loss | no loss |
pfw3b – fasw-c8b link | 1 ping lost | 1 ping lost | 1 ping lost |
pfw3a – pfw3b control link | 25 pings lost | no loss | 4 pings lost | Respects https://kb.juniper.net/InfoCenter/index?page=content&id=KB22717
pfw3a – pfw3b data link | no loss | no loss | no loss |
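
Since the tests ran at 1 ping per second, each lost ping corresponds to roughly one second of unreachability. The sketch below converts the non-trivial loss counts from the table into an approximate outage duration. The names `outage` and `LOSSES` are made up for this example, and the result is only an estimate, since it ignores ping timeouts and timing jitter.

```python
from datetime import timedelta

PING_RATE_HZ = 1  # from the test setup: 1 ping per second

# Lost pings per test, in column order:
# (bast2001 to rigel, external to pfw3, tellurium to frbackup2001)
LOSSES = {
    "pfw3a node (primary)": (28, 25, 60),
    "RG0 manual failover (first try)": (316, 84, 326),
    "RG0 manual failover; node0 to node1": (32, 32, 70),
    "RG0 manual failover; node1 to node0": (16, 16, 54),
    "pfw3a - pfw3b control link": (25, 0, 4),
}


def outage(pings_lost, rate_hz=PING_RATE_HZ):
    """Approximate outage duration for a number of consecutive lost pings."""
    return timedelta(seconds=pings_lost / rate_hz)


if __name__ == "__main__":
    for item, counts in LOSSES.items():
        worst = max(counts)
        print(f"{item}: worst case about {outage(worst)} (h:mm:ss)")
```

Read this way, the first manual RG0 failover stands out at more than five minutes on two of the three paths, while the later attempts stay at or below about 70 seconds.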

To be investigated:

  • Manual failover
  • Packet loss on two interfaces:

https://librenms.wikimedia.org/device/device=153/tab=port/port=13330/ (possibly MTU related)
Possible existing PR with Juniper. JTAC still investigating.

https://librenms.wikimedia.org/device/device=154/tab=port/port=13451/ (possibly bad optic/fiber)
EDIT: Fixed by replacing the fiber

Change 370658 merged by Ayounsi:
[operations/puppet@production] Remove pfw-codfw from Smokeping, Rancid, Torrus, Icinga

https://gerrit.wikimedia.org/r/370658

Change 370661 merged by Ayounsi:
[operations/dns@master] Remove pfw-codfw from DNS

https://gerrit.wikimedia.org/r/370661

Mentioned in SAL (#wikimedia-operations) [2017-08-09T15:19:45Z] <XioNoX> removing old pfw related config from cr1-codfw - T171970

https://librenms.wikimedia.org/device/device=153/tab=port/port=13330/ (possibly MTU related)
Possible existing PR with Juniper. JTAC still investigating.

This seems to be a regression from https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1233005
Cosmetic issue, doesn't impact production traffic.

The fix has been committed on version 15.1X49-D110 which currently is expected to be released on 13th of Sept 2017.

Some answers from Juniper about the other issues noticed:

  • Presence of core dumps
/var/crash/corefiles:
total blocks: 70484
-rw-r--r--  1 root  wheel    5148694 Jul 25 19:49 localhost.srxpfe.4653.1501012051.core.tgz
-rw-r--r--  1 root  wheel    5138996 Jul 25 19:50 localhost.srxpfe.5250.1501012159.core.tgz
-rw-r--r--  1 root  wheel    5154594 Jul 25 19:52 localhost.srxpfe.5481.1501012264.core.tgz
-rw-r--r--  1 root  wheel    5153916 Jul 25 19:54 localhost.srxpfe.5711.1501012369.core.tgz
-rw-r--r--  1 root  wheel    5142979 Jul 25 19:56 localhost.srxpfe.5947.1501012473.core.tgz
-rw-r--r--  1 root  wheel    5169073 Jul 25 19:57 localhost.srxpfe.6177.1501012578.core.tgz
-rw-r--r--  1 root  wheel    5158215 Jul 25 19:59 localhost.srxpfe.6406.1501012683.core.tgz

couldn’t match the core files with any relevant information in our database

Those are from before the firewall went into production and haven't occurred since.
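
The core file names appear to embed a Unix timestamp as their last numeric field, which gives a quick way to see how closely spaced the crashes were. The sketch below assumes the name layout is `<host>.<process>.<pid>.<epoch>.core.tgz`; that layout and the helper names are assumptions made for illustration, not something confirmed by Juniper.

```python
from datetime import datetime, timezone

# File names copied from the /var/crash/corefiles listing.
CORE_FILES = [
    "localhost.srxpfe.4653.1501012051.core.tgz",
    "localhost.srxpfe.5250.1501012159.core.tgz",
    "localhost.srxpfe.5481.1501012264.core.tgz",
    "localhost.srxpfe.5711.1501012369.core.tgz",
    "localhost.srxpfe.5947.1501012473.core.tgz",
    "localhost.srxpfe.6177.1501012578.core.tgz",
    "localhost.srxpfe.6406.1501012683.core.tgz",
]


def core_time(name):
    """Parse the fourth dot-separated field as a Unix epoch (assumed layout)."""
    epoch = int(name.split(".")[3])
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


def intervals(names):
    """Seconds between consecutive core files, in listing order."""
    times = [core_time(n) for n in names]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print("first core:", core_time(CORE_FILES[0]).isoformat())
    print("last core: ", core_time(CORE_FILES[-1]).isoformat())
    print("gaps (s):  ", intervals(CORE_FILES))
```

Under that assumption, all seven cores fall on 2017-07-25 within about ten minutes of each other, roughly 105 seconds apart, which is consistent with the July 25 dates in the listing.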

  • Recurring critical syslog message "ksyncd: PVIDB: Error retrieving 'platform.ipc_version_icu_bypass' variable value"

tl;dr;

In conclusion these logs will not have any impact on SRX functionality or production traffic.

  • xntpd: receive: Unexpected origin timestamp from xxx

tl;dr;

It has been confirmed that this is more of a cosmetic log and does not seem to have any impact in the network performance.

The suggested workaround is to filter them out with:

#set file messages match "!(Error retrieving 'platform.ipc_version_icu_bypass' variable value | Unexpected origin timestamp)"
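
To illustrate what a negated match of this kind does, here is a small Python sketch that applies the same two patterns to sample lines. It only mimics the idea with Python's `re` module and says nothing about how Junos itself evaluates the expression. The `keep` helper and the third sample line are made up for illustration. The sketch also drops the spaces around `|`, on the assumption that in a regular expression those spaces would become part of each alternative.

```python
import re

# The two message fragments from the suggested workaround, joined as
# alternatives. The "." in the variable name is escaped to match literally.
NOISE = re.compile(
    r"Error retrieving 'platform\.ipc_version_icu_bypass' variable value"
    r"|Unexpected origin timestamp"
)


def keep(line):
    """Return True if the line should stay in the log (no noise pattern found)."""
    return NOISE.search(line) is None


SAMPLE = [
    "ksyncd: PVIDB: Error retrieving 'platform.ipc_version_icu_bypass' variable value",
    "xntpd: receive: Unexpected origin timestamp from xxx",
    "example: some other message that should be kept",  # hypothetical line
]

if __name__ == "__main__":
    for line in SAMPLE:
        print("keep" if keep(line) else "drop", "|", line)
```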
  • Spike of ~500 messages "pfe_stats_notify_update: I am here" every hour, with a severity of "ERROR", from the process "pfed".

These messages also turned out to be harmless, and they can be suppressed the same way as the previous logs.

The fix has been committed on version 15.1X49-D110 which currently is expected to be released on 13th of Sept 2017.

Firewalls upgraded, confirmed fixed.