
Migrate hydrogen/chromium to jessie
Closed, ResolvedPublic

Description

Still on precise, migrate to jessie (maerlant and nescio are already on jessie, others are using trusty)

Event Timeline

MoritzMuehlenhoff raised the priority of this task from to Needs Triage.
MoritzMuehlenhoff updated the task description.
MoritzMuehlenhoff added a project: Operations.
Restricted Application added subscribers: StudiesWorld, Aklapper. Jan 15 2016, 12:21 PM
Dzahn added a subscriber: Dzahn. Apr 19 2016, 8:39 PM

since these are dnsrecursors (in addition to urldownloader), what steps have to be taken before one of them can be taken down for reinstall? any?

Restricted Application added a subscriber: TerraCodes. Apr 19 2016, 8:39 PM

For the dnsrec service the server should be depooled via confctl. For NTP all our servers are configured to use multiple NTP servers, so as long as only one system is being reimaged at a time, it should be fine.

fgiunchedi triaged this task as Medium priority. Apr 27 2016, 2:58 PM

Change 285753 had a related patch set uploaded (by Dzahn):
installserver: let hydrogen use jessie installer

https://gerrit.wikimedia.org/r/285753

Dzahn added a comment. Apr 27 2016, 9:12 PM

For the dnsrec service the server should be depooled via confctl.

get:

[palladium:~] $ sudo confctl --tags dc=eqiad,cluster=dns,service=pdns_recursor --action get hydrogen.wikimedia.org
{"hydrogen.wikimedia.org": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=dns,service=pdns_recursor"}

[palladium:~] $ sudo confctl --tags dc=eqiad,cluster=dns,service=pdns_recursor --action get chromium.wikimedia.org
{"chromium.wikimedia.org": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=dns,service=pdns_recursor"}

set (not executed) would then be:

sudo confctl --tags dc=eqiad,cluster=dns,service=pdns_recursor --action set/pooled=no hydrogen.wikimedia.org

and no edit needed in the puppet repo in conftool-data.. does that all seem right?
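A minimal sketch, in Python, of how the `get` output above could be checked before depooling. It assumes each line of output is one JSON object shaped exactly like the two shown in this comment; the function names (`parse_confctl_get`, `pooled_hosts`, `safe_to_depool`) are made up for illustration and are not part of conftool.

```python
import json

# The two lines of `confctl ... --action get` output quoted in this comment.
SAMPLE = [
    '{"hydrogen.wikimedia.org": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=dns,service=pdns_recursor"}',
    '{"chromium.wikimedia.org": {"pooled": "yes", "weight": 10}, "tags": "dc=eqiad,cluster=dns,service=pdns_recursor"}',
]


def parse_confctl_get(line):
    """Return {hostname: {"pooled": ..., "weight": ...}} for one output line."""
    record = json.loads(line)
    # "tags" describes the selector, not a host, so it is dropped.
    return {key: value for key, value in record.items() if key != "tags"}


def pooled_hosts(lines):
    """Return the sorted hostnames whose state is pooled == "yes"."""
    hosts = {}
    for line in lines:
        line = line.strip()
        if line:
            hosts.update(parse_confctl_get(line))
    return sorted(h for h, state in hosts.items() if state.get("pooled") == "yes")


def safe_to_depool(lines, host):
    """True if at least one other host stays pooled after removing `host`."""
    return any(h != host for h in pooled_hosts(lines))


print(pooled_hosts(SAMPLE))
print(safe_to_depool(SAMPLE, "hydrogen.wikimedia.org"))
```

The only rule encoded here is "never depool the last pooled recursor", which is an assumption drawn from the one-at-a-time advice earlier in the thread, not a documented conftool check.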

Change 285753 merged by Dzahn:
installserver: let hydrogen use jessie installer

https://gerrit.wikimedia.org/r/285753

hydrogen and chromium also appear on T136562 for not having RAID.

that should be done as part of this ticket too

Dzahn added a comment. Aug 23 2016, 6:49 PM

we picked hydrogen to start with.

https://gerrit.wikimedia.org/r/#/c/306262/

removes it from /etc/resolv.conf on LVS servers

after that we are going to depool it

Dzahn added a comment. Aug 23 2016, 7:57 PM

hydrogen was in netboot.cfg twice with different partman recipe

https://gerrit.wikimedia.org/r/#/c/306272/

had to racreset to see console output after reboot, booted into PXE, reinstalled with jessie now

re-added to puppet, re-added to salt

Dzahn added a comment. Aug 23 2016, 9:56 PM

[hydrogen:~] $ gen_fingerprints
+---------+---------+-------------------------------------------------+
| Cipher  | Algo    | Fingerprint                                     |
+---------+---------+-------------------------------------------------+
| RSA     | MD5     | 3d:b7:20:30:7f:3b:d4:78:6d:b0:f9:96:2b:f9:32:00 |
| RSA     | SHA-256 | 7tJdX+OpxpRab4RniQJdC0gh4xwEO5anOMjRPHAhZ9o=    |
+---------+---------+-------------------------------------------------+
| DSA     | MD5     | d8:b1:0f:dd:7e:3a:06:09:97:82:6c:0b:32:e3:f5:d1 |
| DSA     | SHA-256 | fJQI1z+Fc8NXy7oatkUGZqA1wZpZYiIxql08X8qw/L0=    |
+---------+---------+-------------------------------------------------+
| ECDSA   | MD5     | 23:99:0d:af:94:0d:1f:33:4b:e8:bb:c6:a2:ec:50:23 |
| ECDSA   | SHA-256 | Pg6ebLcSEsMAqo7PAC4SAdoPOwFg7Z+JnYzwo4bcMQM=    |
+---------+---------+-------------------------------------------------+
| ED25519 | MD5     | 64:f1:05:5e:19:fa:c3:d5:6f:14:7f:c9:7d:d2:50:09 |
| ED25519 | SHA-256 | AEo1xsX4w0bBOKtmDPSld8/87rh8ibTYpA+g9zaaBgg=    |
+---------+---------+-------------------------------------------------+

Dzahn added a comment. Edited Aug 23 2016, 10:01 PM

21:05 mutante: hydrogen - reinstall finished, re-added to salt, restarted ntpd
20:42 mutante: hydrogen - signing new puppet cert
20:22 mutante: hydrogen - reinstalling one more time, wrong partitioning
19:55 mutante: re-signing new puppet certs for hydrogen, initial run, new salt key


installed a second time. now with RAID (/dev/md0)

restarted NTP server, checked that it was in sync with chromium.. icinga recovered..

checked that /etc/powerdns was populated, service is running after second puppet run

tested with dig from palladium that it answers requests
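The dig check is not quoted in the thread. As a rough stand-in, here is a sketch in Python (standard library only) that sends a single A query over UDP and reports whether the recursor returned an answer. The helper names, the 2 second timeout and the query ID are arbitrary choices for illustration; the server address and the name to look up are left to the caller.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID, used to match the reply to the query


def build_query(name, query_id=QUERY_ID):
    """Build a minimal DNS query: one A/IN question, recursion desired."""
    # Header: ID, flags (0x0100 = RD), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def recursor_answers(server, name, timeout=2.0):
    """True if `server` replies with NOERROR and at least one answer record."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    if len(reply) < 12:
        return False
    reply_id, flags, _, ancount, _, _ = struct.unpack("!HHHHHH", reply[:12])
    is_response = bool(flags & 0x8000)
    rcode = flags & 0x000F
    return reply_id == QUERY_ID and is_response and rcode == 0 and ancount > 0
```

Calling `recursor_answers("<recursor address>", "<some name>")` from another host would roughly correspond to the manual dig test; it does not validate the answer contents, only that one came back.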

21:27 logmsgbot: dzahn@palladium conftool action : set/pooled=yes; selector: dc=eqiad,cluster=dns,name=hydrogen.wikimedia.org

saw traffic coming back in ganglia

https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=hydrogen.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1471978686&g=network_report&z=large&c=Miscellaneous%20eqiad

waited a little while and then reverted the LVS config change:

https://gerrit.wikimedia.org/r/#/c/306275/

ran puppet on lvs100x...

NTP in sync with chromium:

root@hydrogen:~# ntpdc -c peers | grep chrom
+chromium.wikime 2620:0:861:1:20 3 128 377 0.00009 -0.012633 0.08954

and Icinga: NTP OK: Offset -0.006912 secs
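For reference, a small Python sketch that pulls the offset out of the `ntpdc -c peers` line above and compares it against a threshold. It assumes the usual eight-column layout (remote, local, st, poll, reach, delay, offset, disp); the 0.05 second threshold and the function names are assumptions for illustration, not the values Icinga uses.

```python
# The peer line quoted above.
SAMPLE = "+chromium.wikime 2620:0:861:1:20 3 128 377 0.00009 -0.012633 0.08954"


def parse_ntpdc_peer(line):
    """Split one `ntpdc -c peers` row into named fields."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}: {line!r}")
    remote = fields[0]
    # ntpdc prefixes the remote name with a one-character flag; keep it raw.
    flag = remote[0] if not remote[0].isalnum() else ""
    return {
        "flag": flag,
        "remote": remote[len(flag):],
        "local": fields[1],
        "stratum": int(fields[2]),
        "poll": int(fields[3]),
        "reach": fields[4],
        "delay": float(fields[5]),
        "offset": float(fields[6]),
        "dispersion": float(fields[7]),
    }


def offset_ok(line, max_offset=0.05):
    """True if the absolute offset (seconds) is within `max_offset`."""
    return abs(parse_ntpdc_peer(line)["offset"]) <= max_offset


peer = parse_ntpdc_peer(SAMPLE)
print(peer["remote"], peer["offset"], offset_ok(SAMPLE))
```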

also, counter increasing here on lvs1002

root@lvs1002:~# ipvsadm -Ln -u 208.80.154.239:53
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
UDP  208.80.154.239:53 wrr
  -> 208.80.154.50:53             Route   10     0          1215214   
  -> 208.80.154.157:53            Route   10     0          2709388
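
"Counter increasing" means comparing two runs of the command. A sketch in Python of that comparison, parsing only the real-server rows of output shaped like the listing above: the first sample is the output quoted here, while the second sample is invented purely to show the comparison and is not real data. The function names are made up.

```python
import re

# Matches real-server rows such as "  -> 208.80.154.50:53  Route  10  0  1215214".
# The header row does not match because "Port" is not numeric.
REAL_SERVER = re.compile(r"^\s*->\s+(\S+:\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")

FIRST = """\
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
UDP  208.80.154.239:53 wrr
  -> 208.80.154.50:53             Route   10     0          1215214
  -> 208.80.154.157:53            Route   10     0          2709388
"""

# Hypothetical later sample: the numbers are made up for illustration.
SECOND = FIRST.replace("1215214", "1215300").replace("2709388", "2709388")


def parse_inactconn(output):
    """Return {"addr:port": InActConn} for every real-server row."""
    counters = {}
    for line in output.splitlines():
        match = REAL_SERVER.match(line)
        if match:
            counters[match.group(1)] = int(match.group(5))
    return counters


def increased(before, after):
    """Return the real servers whose InActConn value grew between samples."""
    old, new = parse_inactconn(before), parse_inactconn(after)
    return sorted(addr for addr in new if addr in old and new[addr] > old[addr])


print(parse_inactconn(FIRST))
print(increased(FIRST, SECOND))
```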

Dzahn claimed this task. Aug 23 2016, 10:10 PM
Dzahn removed a project: Patch-For-Review.
Dzahn set Security to None.
Dzahn added a comment. Edited Aug 24 2016, 11:50 PM

21:27 mutante: depooling chromium for reinstall. scheduled downtime for host and service IPs
21:50 mutante: running puppet on lvs servers, removing chromium from resolv.conf for reinstall
22:05 mutante: stopping puppet and pdns-recursor on chromium
22:20 mutante: rebooting chromium into PXE

22:49 mutante: chromium - revoking and re-signing puppet certs, salt keys, initial puppet run..

https://gerrit.wikimedia.org/r/#/c/306555/

https://gerrit.wikimedia.org/r/#/c/306311/

reinstalled with jessie

< mutante> !log chromium - install ntpdate, stop ntp, sync time with hydrogen, start ntp, remove ntpdate

< icinga-wm> RECOVERY - NTP peers on chromium is OK: NTP OK: Offset -0.00151 secs

16:49 < logmsgbot> !log dzahn@palladium conftool action : set/pooled=yes; selector: name=chromium.wikimedia.org

https://gerrit.wikimedia.org/r/#/c/306559/

Dzahn closed this task as Resolved. Aug 25 2016, 12:09 AM

17:14 < mutante> !log chromium back in service - both eqiad DNS recursors now on jessie