Request VM for webperf (metrics processing)
Closed, Resolved · Public

Description

Labs Project Tested: N/A
Site: Main DCs (EQIAD and CODFW)
Number of systems: 1 in each main DC (multi-dc, active/inactive)
Service: role::webperf (python-based eventlogging subscribers that publish to statsd, https://wikitech.wikimedia.org/wiki/webperf)
Networking Requirements: internal, access to Kafka and Statsd

This will replace and obsolete hafnium, which should be decommissioned.

Current specs (hafnium)

New specs, per VM:

  • Processor Requirements: 4 Virtual CPUs
  • Memory: 8GB
  • Disks: 50GB HDD
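
For readers unfamiliar with the workload named in the Service field, here is a minimal sketch of what a Python subscriber that turns events into StatsD metrics might look like. It is illustrative only and is not the actual webperf code: the event layout, the `webperf.example` prefix and the function name `event_to_statsd` are all assumptions. The only fixed part is the standard StatsD timing line format, `name:value|ms`.

```python
import json
from typing import List


def event_to_statsd(raw_message: bytes, prefix: str = "webperf.example") -> List[str]:
    """Convert one JSON-encoded event into StatsD timing lines.

    Hypothetical event layout: {"schema": ..., "event": {<metric>: <milliseconds>}}.
    """
    event = json.loads(raw_message)
    lines = []
    for key, value in sorted(event.get("event", {}).items()):
        # Keep only non-negative numbers; bool is a subclass of int, so exclude it.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and value >= 0:
            lines.append(f"{prefix}.{key}:{int(value)}|ms")
    return lines


# Made-up sample message, standing in for what a Kafka consumer would deliver.
sample = b'{"schema": "ExampleTiming", "event": {"responseStart": 120, "loadEventEnd": 845, "isAnon": true}}'
for line in event_to_statsd(sample):
    print(line)
```

A real service would read messages from Kafka and send each line to the StatsD host over UDP. Both are left out here so the sketch runs without network access, which is also why the VM only needs internal access to Kafka and Statsd.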

Event Timeline

Krinkle created this task. Oct 25 2017, 7:33 PM
Restricted Application added a subscriber: Aklapper. Oct 25 2017, 7:33 PM

For now it'll be active/inactive. The current interaction with Statsd, Graphite and Kafka complicates a multi-dc active/active setup (or one where the hosts automatically switch over if one fails), although we are interested in exploring that in the future.

Krinkle updated the task description. Oct 25 2017, 7:36 PM

Change 387215 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] introduce webperf1001

https://gerrit.wikimedia.org/r/387215

Suggesting we introduce webperf1001.eqiad.wmnet/webperf2001.codfw.wmnet for this rather than using the misc names. @akosiaris does that sound ok (and can the requirements be fulfilled by ganeti VMs)?

@Dzahn Yes and yes, both sound fine.

@Krinkle Nicely written task! Thanks!

Change 387215 merged by Dzahn:
[operations/dns@master] introduce webperf1001

https://gerrit.wikimedia.org/r/387215

Change 387270 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] introduce webperf2001.codfw.wmnet

https://gerrit.wikimedia.org/r/387270

Change 387270 merged by Dzahn:
[operations/dns@master] introduce webperf2001.codfw.wmnet

https://gerrit.wikimedia.org/r/387270

Krinkle triaged this task as Medium priority. Oct 30 2017, 8:27 PM
faidon added a subscriber: faidon. Nov 17 2017, 4:21 AM

What's the status of this?

Restricted Application assigned this task to R3609901. Nov 17 2017, 4:21 AM
mmodell removed R3609901 as the assignee of this task. Nov 17 2017, 4:33 AM
mmodell added a subscriber: R3609901.

Mentioned in SAL (#wikimedia-operations) [2017-11-17T11:01:55Z] <akosiaris> create webperf1001, webperf2001 in ganeti T179036

Dzahn claimed this task. Nov 17 2017, 11:30 AM

Can we use stretch? I'll assume stretch unless there are reasons not to.

Change 392030 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] add webperf1001/2001 to site, using webperf role

https://gerrit.wikimedia.org/r/392030

Change 392031 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server/DHCP: add webperf1001/2001

https://gerrit.wikimedia.org/r/392031

Change 392035 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Introduce webperf1001, webperf2001

https://gerrit.wikimedia.org/r/392035

Change 392035 abandoned by Alexandros Kosiaris:
Introduce webperf1001, webperf2001

Reason:
Abandoning in favor of https://gerrit.wikimedia.org/r/392030 and https://gerrit.wikimedia.org/r/392031

https://gerrit.wikimedia.org/r/392035

Change 392031 merged by Dzahn:
[operations/puppet@production] install_server/DHCP: add webperf1001/2001

https://gerrit.wikimedia.org/r/392031

Dzahn added a comment. Nov 18 2017, 7:15 AM

Still having issues with these, both 1001 and 2001, despite retrying the install on 1001 and restarting, rebooting and connecting to the console many times. For some reason I don't get to see console output even though the status is shown as "Up/Up". I can PXE boot them and see it getting an ACK, the installer being served and then the initrd.tar.gz being sent. But after that I don't see anything anymore. Whether I wait a long time or not (assuming I just don't see console output, because @akosiaris apparently could see the console just fine and how it was installing), I never get it to a state where it responds to pings or where I can use "install-console" from puppetmaster to connect to it. It's just not reachable for me, even after setting boot_order to disk, waiting and restarting it. I also double-checked that the row is A and the IP is in that network, and it's not a DHCP issue either.

Change 392617 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] install_server: Assign VMs the correct tty

https://gerrit.wikimedia.org/r/392617

Change 392617 merged by Alexandros Kosiaris:
[operations/puppet@production] install_server: Assign VMs the correct tty

https://gerrit.wikimedia.org/r/392617

Fixed the console issue in the above patch, reimaged the VMs and just ran puppet for the first time. I am guessing this is successfully done; the only thing left is assigning the hosts the correct role (which is it, btw?)

Dzahn added a comment. Nov 21 2017, 2:16 PM

@akosiaris thank you! Wow so many others were in the wrong file as well. ..

Re: role, Krinkle pointed out that it's NOT yet the webperf role (https://gerrit.wikimedia.org/r/#/c/392030/)

@akosiaris thank you! Wow so many others were in the wrong file as well. ..

Yup, I already fixed each and every one of those manually. Logged in SAL as well (no task though, doubt there's a need for one). It's also documented in https://wikitech.wikimedia.org/wiki/Ganeti#Update_DHCP, so I'm not sure how we ended up with all of these being wrong.

Change 392653 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] add webperf nodes with test role, add shell for perf-roots

https://gerrit.wikimedia.org/r/392653

Change 392653 merged by Dzahn:
[operations/puppet@production] add webperf nodes with test role, add shell for perf-roots

https://gerrit.wikimedia.org/r/392653

Krinkle closed this task as Resolved. Edited Nov 21 2017, 6:31 PM

Thanks!

Next step is to actually migrate the role, which will be done by Performance Team and tracked via parent task (T158837).

Dzahn added a comment. Nov 21 2017, 6:49 PM

Next step is to actually migrate the role, which will be done by Performance Team and tracked via parent task (T179036).

That's actually T158837.