
Request VM for webperf (metrics processing)
Closed, Resolved (Public)

Description

Labs Project Tested: N/A
Site: Main DCs (EQIAD and CODFW)
Number of systems: 1 in each main DC (multi-dc, active/inactive)
Service: role::webperf (python-based eventlogging subscribers that publish to statsd, https://wikitech.wikimedia.org/wiki/webperf)
Networking Requirements: internal, access to Kafka and Statsd
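As a rough illustration of what "eventlogging subscribers that publish to statsd" means, here is a minimal sketch, not the actual webperf code. It assumes events arrive as plain dicts (the Kafka consumer is left out) and uses the standard statsd timing line format `<name>:<value>|ms` over UDP. The prefix `webperf.example`, the field names, and the host/port defaults are hypothetical.

```python
import socket


def format_timing(metric, millis):
    """Build a statsd timing line of the form '<name>:<value>|ms'."""
    return f"{metric}:{int(round(millis))}|ms"


def events_to_statsd(events, prefix="webperf.example"):
    """Yield one statsd timing line per numeric field of each event dict.

    `prefix` is a hypothetical metric namespace; non-numeric fields
    (and booleans) are skipped.
    """
    for event in events:
        for field, value in sorted(event.items()):
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            yield format_timing(f"{prefix}.{field}", value)


def send_lines(lines, host="localhost", port=8125):
    """Send each line as one UDP datagram (host and port are assumptions)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for line in lines:
            sock.sendto(line.encode("utf-8"), (host, port))
    finally:
        sock.close()
```

In a real subscriber the `events` iterable would be fed from a Kafka consumer, which is why the VMs need network access to both Kafka and Statsd.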

This will replace and obsolete hafnium, which should be decommissioned.

Current specs (hafnium)

New specs, per VM:

  • Processor Requirements: 4 Virtual CPUs
  • Memory: 8GB
  • Disks: 50GB HDD

Event Timeline

For now it'll be active/inactive. The current interaction with Statsd, Graphite and Kafka complicates a multi-dc active/active setup (or one where the hosts automatically switch over if one fails), although we are interested in exploring that in the future.

Change 387215 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] introduce webperf1001

https://gerrit.wikimedia.org/r/387215

Suggesting we introduce webperf1001.eqiad.wmnet/webperf2001.codfw.wmnet for this rather than using the misc names. @akosiaris does that sound OK (and can the requirements be fulfilled by ganeti VMs)?

@Dzahn Yes and yes, both sound fine.
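The naming suggested above can be sketched as a small helper. This assumes, based only on the two hostnames in this task, that the first digit of the four-digit suffix selects the data center (1 for eqiad, 2 for codfw). The function name and the mapping table are hypothetical.

```python
import re

# Assumed mapping, inferred from webperf1001.eqiad.wmnet / webperf2001.codfw.wmnet.
DC_BY_DIGIT = {"1": "eqiad", "2": "codfw"}


def expected_fqdn(short_name):
    """Return '<short_name>.<dc>.wmnet' for names like 'webperf1001'."""
    match = re.fullmatch(r"[a-z]+(\d)\d{3}", short_name)
    if match is None or match.group(1) not in DC_BY_DIGIT:
        raise ValueError(f"cannot derive a data center from {short_name!r}")
    return f"{short_name}.{DC_BY_DIGIT[match.group(1)]}.wmnet"
```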

@Krinkle Nicely written task! Thanks!

Change 387215 merged by Dzahn:
[operations/dns@master] introduce webperf1001

https://gerrit.wikimedia.org/r/387215

Change 387270 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] introduce webperf2001.codfw.wmnet

https://gerrit.wikimedia.org/r/387270

Change 387270 merged by Dzahn:
[operations/dns@master] introduce webperf2001.codfw.wmnet

https://gerrit.wikimedia.org/r/387270

Krinkle triaged this task as Medium priority. Oct 30 2017, 8:27 PM
mmodell added a subscriber: R3609901.

Mentioned in SAL (#wikimedia-operations) [2017-11-17T11:01:55Z] <akosiaris> create webperf1001, webperf2001 in ganeti T179036

Can we use stretch? I'll assume stretch unless there are reasons not to.

Change 392030 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] add webperf1001/2001 to site, using webperf role

https://gerrit.wikimedia.org/r/392030

Change 392031 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server/DHCP: add webperf1001/2001

https://gerrit.wikimedia.org/r/392031

Change 392035 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Introduce webperf1001, webperf2001

https://gerrit.wikimedia.org/r/392035

Change 392035 abandoned by Alexandros Kosiaris:
Introduce webperf1001, webperf2001

Reason:
Abandoning in favor of https://gerrit.wikimedia.org/r/392030 and https://gerrit.wikimedia.org/r/392031

https://gerrit.wikimedia.org/r/392035

Change 392031 merged by Dzahn:
[operations/puppet@production] install_server/DHCP: add webperf1001/2001

https://gerrit.wikimedia.org/r/392031

Still having issues with both 1001 and 2001, despite retrying the install on 1001 and restarting, rebooting and connecting to the console many times. For some reason I don't get to see console output even though the status is shown as "Up/Up". I can PXE boot them and see them getting an ACK, the installer starting to be served, and then the initrd.tar.gz being sent. But after that I don't see anything anymore.

Whether I wait a long time or not (assuming I just don't see console output, because @akosiaris apparently could see the console just fine, including how it was installing), I never get it to a state where it responds to pings or where I can use "install-console" from the puppetmaster to connect to it. It's just not reachable to me, even after setting boot_order to disk, waiting and restarting it. I also double-checked that the row is A and the IP is in that network, so it's not a DHCP issue either.

Change 392617 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] install_server: Assign VMs the correct tty

https://gerrit.wikimedia.org/r/392617

Change 392617 merged by Alexandros Kosiaris:
[operations/puppet@production] install_server: Assign VMs the correct tty

https://gerrit.wikimedia.org/r/392617

Fixed the console issue in the above patch, reimaged the VMs and just ran puppet for the first time. I am guessing this is successfully done; the only thing left is assigning the hosts the correct role (which is it, btw?)

@akosiaris thank you! Wow so many others were in the wrong file as well. ..

Re: role, Krinkle pointed out that it's NOT yet the webperf role (https://gerrit.wikimedia.org/r/#/c/392030/)

@akosiaris thank you! Wow so many others were in the wrong file as well. ..

Yup, I already fixed each and every one of those manually. Logged in SAL as well (no task though, I doubt there's a need for one). It's also documented in https://wikitech.wikimedia.org/wiki/Ganeti#Update_DHCP, so I'm not sure how we ended up with all these mistakes.

Change 392653 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] add webperf nodes with test role, add shell for perf-roots

https://gerrit.wikimedia.org/r/392653

Change 392653 merged by Dzahn:
[operations/puppet@production] add webperf nodes with test role, add shell for perf-roots

https://gerrit.wikimedia.org/r/392653

Krinkle closed this task as Resolved. (Edited) Nov 21 2017, 6:31 PM
Krinkle moved this task from Inbox to Next: Goal / Oct-Dec '21 on the Performance-Team board.
Krinkle moved this task from Next: Goal / Oct-Dec '21 to Blocked (old) on the Performance-Team board.

Thanks!

Next step is to actually migrate the role, which will be done by Performance Team and tracked via parent task (T158837).

Next step is to actually migrate the role, which will be done by Performance Team and tracked via parent task (T179036).

That's actually T158837.