Fri, Sep 13

JHedden claimed T220530: Ensure clouddb1001 is monitored appropriately from the tendril/prometheus side.

Looking into using profile::prometheus::mysqld_exporter_instance to feed clouddb data into tools-prometheus.wmflabs.org.

Fri, Sep 13, 9:04 PM · Data-Services, cloud-services-team (Kanban)
JHedden added a comment to T53434: Implement a system to monitor tools on tool-labs.
Fri, Sep 13, 2:10 PM · cloud-services-team (Kanban), User-Matthewrbowker, community-labs-monitoring, Toolforge

Wed, Sep 11

JHedden updated the task description for T223905: HA for openstack services.
Wed, Sep 11, 3:47 PM · Goal, cloud-services-team (Kanban)
JHedden closed T221301: Toolschecker webservice checks get out of sync likely from timeouts, a subtask of T220650: tools-manifest - webservicemonitor needs a longer timeout, as Resolved.
Wed, Sep 11, 3:34 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge
JHedden closed T221301: Toolschecker webservice checks get out of sync likely from timeouts as Resolved.

Icinga checks for the webservice have been removed.

Wed, Sep 11, 3:34 PM · Toolforge, cloud-services-team (Kanban)
JHedden closed T205524: cloudvps: neutron: agents failed to communicate with server, a subtask of T167293: Nova-network to Neutron migration, as Resolved.
Wed, Sep 11, 3:26 PM · Epic, Cloud-Services
JHedden closed T205524: cloudvps: neutron: agents failed to communicate with server as Resolved.

T224981 seems to have resolved this.

Wed, Sep 11, 3:26 PM · cloud-services-team (Kanban), Cloud-Services
JHedden moved T231793: Remove systemd from openstack-mitaka from Needs discussion to Doing on the cloud-services-team (Kanban) board.
Wed, Sep 11, 3:04 PM · cloud-services-team (Kanban), Operations
JHedden claimed T231793: Remove systemd from openstack-mitaka.
Wed, Sep 11, 3:04 PM · cloud-services-team (Kanban), Operations
JHedden added a comment to T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail.
18:46:22 <revi> job 6568688 (tools.stewardbots) is not being deleted for few minutes, looks like something abnormal? (in the past it has been deleted within minutes)

(the Q I left on #wikimedia-cloud, crossposting here, assuming it's related)

Wed, Sep 11, 1:34 PM · cloud-services-team (Kanban), Toolforge

Mon, Sep 9

JHedden closed T232322: labspuppetmaster1001 puppet-merge failing as Resolved.

Fixed. The changes from commit 2e5424b4010c29da463eaf3c4ca2898c0a8fb79d had been applied, but for some odd reason the commit itself had not. I've cleaned it up and everything is in sync now.

Mon, Sep 9, 1:48 PM · Puppet, Operations, cloud-services-team

Fri, Sep 6

JHedden added a comment to T231222: Address icinga noise from wmflabs.

I've looked at packet captures and traced processes on both ends. As far as I can tell, the metrics are being sent and stored correctly in Graphite's Whisper database.

Fri, Sep 6, 1:26 PM · ORES, Scoring-platform-team (Current)

Thu, Sep 5

JHedden created P9044 recent ores-web-01 metrics.
Thu, Sep 5, 9:07 PM

Wed, Sep 4

JHedden added a comment to T231222: Address icinga noise from wmflabs.

I'm curious if the timeouts are only seen from a single virtual machine, or if you're seeing the same results from multiple ORES virtual machines. Could you please run your script on a couple of different hosts and share the results with host names?

Wed, Sep 4, 8:20 PM · ORES, Scoring-platform-team (Current)
JHedden closed T231999: Icinga nova-compute process check flapping as Resolved.
Wed, Sep 4, 3:13 PM · Cloud-VPS
JHedden created T231999: Icinga nova-compute process check flapping.
Wed, Sep 4, 3:01 PM · Cloud-VPS

Tue, Sep 3

JHedden added a comment to T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC.

Cloud VPS OpenStack has been fully switched over and all services are back online.

Tue, Sep 3, 1:41 PM · cloud-services-team (Kanban), wikitech.wikimedia.org, Operations, DBA

Thu, Aug 29

JHedden closed T229448: showmount not working on labstore1004 & labstore1005 as Resolved.
Thu, Aug 29, 3:49 PM · Data-Services, cloud-services-team (Kanban)
JHedden added a comment to T229448: showmount not working on labstore1004 & labstore1005.

Since showmount is not reliable with NFS v4 (as also seen in other cases; see T171508), we could check NFS connectivity over the cluster IP with the nagios rpcinfo wrapper:

Thu, Aug 29, 2:07 PM · Data-Services, cloud-services-team (Kanban)

Wed, Aug 28

JHedden added a comment to T229448: showmount not working on labstore1004 & labstore1005.

The RPC portmapper is out of sync with the NFS server. NFS will need to be restarted to resolve this, but unfortunately that will cause a brief interruption in client IO.

Wed, Aug 28, 9:39 PM · Data-Services, cloud-services-team (Kanban)

Tue, Aug 27

JHedden moved T221301: Toolschecker webservice checks get out of sync likely from timeouts from Needs discussion to Doing on the cloud-services-team (Kanban) board.
Tue, Aug 27, 4:43 PM · Toolforge, cloud-services-team (Kanban)

Aug 16 2019

JHedden closed T227041: Three small ganeti VMs to host haproxy for OpenStack endpoints, a subtask of T223907: Set up HA endpoints for keystone, glance, nova, designate apis, as Resolved.
Aug 16 2019, 3:04 PM · Patch-For-Review, cloud-services-team (Kanban)
JHedden closed T227041: Three small ganeti VMs to host haproxy for OpenStack endpoints as Resolved.

For this phase we're going to install haproxy directly on the openstack controllers. We will not be needing these VMs. Thank you for all the information, it was very helpful.

Aug 16 2019, 3:04 PM · vm-requests, cloud-services-team (Kanban), Operations

Aug 15 2019

JHedden added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.
Aug 15 2019, 3:31 PM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge

Aug 13 2019

JHedden added a comment to T230442: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

This host also has a bad disk in slot number 8. T230289

Aug 13 2019, 6:58 PM · ops-eqiad, Operations
JHedden added a comment to T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

There are no workloads on this host now. We're good to have this replaced anytime. Thanks!

Aug 13 2019, 5:12 PM · cloud-services-team, ops-eqiad, Operations
JHedden closed T230247: Increase VCPU quota for wikidata-query project as Resolved.
Aug 13 2019, 5:03 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests), Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
JHedden moved T221301: Toolschecker webservice checks get out of sync likely from timeouts from Doing to Needs discussion on the cloud-services-team (Kanban) board.
Aug 13 2019, 1:54 PM · Toolforge, cloud-services-team (Kanban)

Aug 9 2019

JHedden added a comment to T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC.

The plan looks good to me. In the pre-failover stage I'll be shutting down the OpenStack scheduler and designate services to ensure there are no actions in queue, then re-enabling these in the clean up steps.

Aug 9 2019, 2:05 PM · cloud-services-team (Kanban), wikitech.wikimedia.org, Operations, DBA

Aug 8 2019

JHedden closed T230157: tools-sgewebgrid-lighttpd-0915 not responding as Resolved.

Only interesting things from the logs:

Aug 8 2019, 7:58 PM · cloud-services-team
JHedden created T230157: tools-sgewebgrid-lighttpd-0915 not responding.
Aug 8 2019, 7:24 PM · cloud-services-team
JHedden updated subscribers of T213567: Toolforge: refresh grafana dashboard.

I haven't added any new exporters yet. I think you might be referring to the changes @Bstorm made for kubelet stats in T228573.

Aug 8 2019, 1:23 PM · cloud-services-team (Kanban), Toolforge

Aug 7 2019

JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

Ran into a new failure scenario on gridengine. It might be a false positive, but it did cause the webservice to remain running:

queue instance "webgrid-lighttpd@tools-sgewebgrid-lighttpd-0923.tools.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=2.757500 (= 2.757500 + 0.50 * 0.000000 with nproc=4) >= 2.75
Aug 7 2019, 7:03 PM · Toolforge, cloud-services-team (Kanban)
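
The arithmetic in the gridengine message above can be reproduced with a short sketch. The function and parameter names are hypothetical; the only thing taken from the log is the shape of the calculation (a base load plus a weighted adjustment term, compared against a threshold). How gridengine derives the adjustment term is not shown in the log and is not modelled here.

```python
def is_overloaded(base_load, adjustment_factor, adjustment_term, threshold):
    """Return (adjusted_load, overloaded) for a queue instance.

    Mirrors the expression printed in the log line:
        np_load_avg = base_load + adjustment_factor * adjustment_term
    The instance is dropped when the adjusted value reaches the threshold.
    """
    adjusted = base_load + adjustment_factor * adjustment_term
    return adjusted, adjusted >= threshold


# Values copied from the log line for tools-sgewebgrid-lighttpd-0923.
adjusted, overloaded = is_overloaded(2.7575, 0.50, 0.0, 2.75)
print(f"np_load_avg={adjusted:.6f} overloaded={overloaded}")
```

With the logged values the adjustment contributes nothing, so the instance is dropped because the raw normalized load (2.7575) is just over the 2.75 threshold, which is consistent with the "might be a false positive" reading.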
JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

Fixed NGINX timeouts to match WSGI and added better status checking after issuing webservice commands.

Aug 7 2019, 5:14 PM · Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T149589: Puppet tab in Horizon unusably slow.

Viewing the instance console log can occasionally take longer than expected. This process queries multiple APIs and communicates directly with the hypervisor supporting the VM, so there are many potential places for delay and resource contention.

Aug 7 2019, 3:10 PM · cloud-services-team (Kanban), Patch-For-Review, Operations, Puppet, Cloud-Services
JHedden closed T230003: openstack: cleanup neutron user as Resolved.

Confirmed openstack role assignment list --names is working as expected now.

Aug 7 2019, 2:45 PM · cloud-services-team (Kanban)
JHedden added a comment to T230003: openstack: cleanup neutron user.

Found another one too:

Aug 7 2019, 2:43 PM · cloud-services-team (Kanban)
JHedden added a comment to T230003: openstack: cleanup neutron user.

Nice catch, thanks for the back story too.

Aug 7 2019, 2:30 PM · cloud-services-team (Kanban)

Aug 5 2019

JHedden added a comment to T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC.

As per the sync on the SRE meeting, @JHedden will be online from WMCS.
I will handle the announcement for wikitech, could you handle the announcement (if it is needed) for the OpenStack part of things?

Aug 5 2019, 4:28 PM · cloud-services-team (Kanban), wikitech.wikimedia.org, Operations, DBA
JHedden closed T229846: anticompositebot tool missing project directory as Resolved.

The anticompositebot tool directory and configuration are present and active now.

Aug 5 2019, 4:20 PM · cloud-services-team, Toolforge
JHedden created T229846: anticompositebot tool missing project directory.
Aug 5 2019, 4:09 PM · cloud-services-team, Toolforge
JHedden closed T229787: Toolforge: sudden issues in both gridengine and k8s webservices as Resolved.

The Icinga check description was recently updated for T228878 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/525536). The new name/description for this service appears to have removed the existing acks and downtime.

Aug 5 2019, 1:29 PM · cloud-services-team (Kanban)

Jul 26 2019

JHedden created P8811 cloudvirt1015 testing VM crash.
Jul 26 2019, 10:00 PM
JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

The few I spot checked also lined up with timeouts in the etcd server log:

Jul 26 2019, 9:10 PM · Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

It seems rare, but I've also noticed a few timeouts from SGE: 2019-07-26T19:29:42.700456 Timed out attempting to start webservice (15s)

Jul 26 2019, 8:31 PM · Toolforge, cloud-services-team (Kanban)
JHedden updated the task description for T225713: CPU scaling governor audit.
Jul 26 2019, 1:42 PM · User-fgiunchedi, Operations

Jul 25 2019

JHedden added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

Created these VMs

openstack server list --project testlabs --long -c ID -c Name -c Host | grep cv1015
| 30f17a94-252e-46d2-aa28-e6f24c9c457e | cv1015-testing03                  | cloudvirt1015   |
| d1b13075-ace4-44ba-8f26-c9c12a360184 | cv1015-testing02                  | cloudvirt1015   |
| b99a2376-1bb1-48f9-9889-00d3aedb9a43 | cv1015-testing01                  | cloudvirt1015   |
| e65ff310-f0ef-451c-956c-8d21b21cc12a | cv1015-testing04                  | cloudvirt1015   |
Jul 25 2019, 2:42 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)
JHedden added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

Thanks @RobH. I'll spin up some stress testing VMs on that host and let them run until Andrew gets back from vacation next week.

Jul 25 2019, 1:56 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)

Jul 24 2019

JHedden added a comment to T224324: LB for cloudelastic.

@EBernhardson Unfortunately we don't have a good solution for this today or in the near future. We've discussed future load balancing as a service options, but this requires a lot of effort on backend upgrades, automation and configuration.

Jul 24 2019, 2:08 PM · Discovery-Search (Current work), Cloud-Services, Elasticsearch, Discovery

Jul 23 2019

JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

The webservice checks are getting better, but it ran into a new failure:

Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: Traceback (most recent call last):
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/bin/webservice", line 169, in <module>
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     start(job, 'Starting webservice')
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/bin/webservice", line 61, in start
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     job.request_start()
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 456, in request_start
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     pykube.Deployment(self.api, self._get_deployment()).create()
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/lib/python2.7/dist-packages/pykube/objects.py", line 76, in create
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     self.api.raise_for_status(r)
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 104, in raise_for_status
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     raise HTTPError(payload["message"])
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: pykube.exceptions.HTTPError: client: etcd member https://tools-k8s-etcd-01.tools.eqiad.wmflabs:2379 has no leader
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: --------------------------------------------------------------------------------
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: ERROR in toolschecker [/var/lib/toolschecker/toolschecker.py:454]:
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: webservice kubernetes: error starting
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: --------------------------------------------------------------------------------
...
Jul 23 16:53:43 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: --------------------------------------------------------------------------------
Jul 23 16:53:43 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: ERROR in toolschecker [/var/lib/toolschecker/toolschecker.py:448]:
Jul 23 16:53:43 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: webservice kubernetes: found existing webservice running
Jul 23 16:53:43 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: --------------------------------------------------------------------------------
Jul 23 2019, 5:27 PM · Toolforge, cloud-services-team (Kanban)
JHedden moved T221301: Toolschecker webservice checks get out of sync likely from timeouts from Needs discussion to Doing on the cloud-services-team (Kanban) board.
Jul 23 2019, 2:57 PM · Toolforge, cloud-services-team (Kanban)
JHedden claimed T221301: Toolschecker webservice checks get out of sync likely from timeouts.
Jul 23 2019, 2:57 PM · Toolforge, cloud-services-team (Kanban)
JHedden moved T221301: Toolschecker webservice checks get out of sync likely from timeouts from Inbox to Needs discussion on the cloud-services-team (Kanban) board.
Jul 23 2019, 2:34 PM · Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

The current configuration is set to check every 1 minute and retry every 1 minute after a failure.

Jul 23 2019, 2:29 PM · Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T228731: https://dumps.wikimedia.org/other/pageviews/ lacks hourly pageviews since 20190722-17:00.

Maybe related are some hiera values at:

hieradata/common.yaml
# Dumps distribution server currently serving traffic over NFS to cloud vps instances
dumps_dist_active_vps: labstore1007.wikimedia.org
# Dumps distribution server currently serving web and rsync mirror traffic
# Also serves stat* hosts over nfs
dumps_dist_active_web: labstore1006.wikimedia.org
Jul 23 2019, 12:56 PM · Analytics-Kanban, Analytics, cloud-services-team, Wikimedia-Portals

Jul 22 2019

JHedden added a comment to T228573: toolforge k8s nodes oom?.

That's great! Now that we have that, this query should be helpful for future investigations: container_memory_usage_bytes{job="k8s-node",instance="tools-worker-1015.tools.eqiad.wmflabs"}

Jul 22 2019, 6:23 PM · Toolforge, cloud-services-team (Kanban)
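
As a rough illustration, the query in the comment above could be sent to the Prometheus HTTP API as an instant query on `/api/v1/query`. The sketch below only builds the request URL and does not contact any server. The base URL `http://prometheus.example` and the helper name are placeholders, not the real Toolforge endpoint.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build a Prometheus instant-query URL for the given PromQL expression.

    The expression is percent-encoded so that braces, quotes and commas
    survive transport as a single `query` parameter.
    """
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# PromQL expression copied from the comment; the base URL is hypothetical.
QUERY = (
    'container_memory_usage_bytes'
    '{job="k8s-node",instance="tools-worker-1015.tools.eqiad.wmflabs"}'
)
url = instant_query_url("http://prometheus.example/", QUERY)
print(url)
```

The resulting URL can be fetched with any HTTP client; the response is JSON with the matching series under `data.result`.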
JHedden added a comment to T228573: toolforge k8s nodes oom?.

I'm not seeing anything strange in the prometheus-node exporter memory usage. GC and overall heap allocation history look good on tools-worker-1015.

Jul 22 2019, 2:51 PM · Toolforge, cloud-services-team (Kanban)

Jul 19 2019

JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

The recent webservice critical status was related to existing webservice instances left running. When concurrent requests from both icinga1001.wikimedia.org and icinga2001.wikimedia.org are made to the webservice endpoint they can leave the webservice instance running, causing the checks to fail going forward.

Jul 19 2019, 10:48 PM · Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T225265: Fix labstore checks on cloudstore1008/9.

That patch ^ fixes the NRPE error CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds, which occurs when showmount runs on the host that is not supporting the VIP.

Jul 19 2019, 2:42 PM · Data-Services, cloud-services-team (Kanban)

Jul 18 2019

JHedden closed T179848: Unable to add user to group in debian stretch instance as Resolved.

I've confirmed that this process works as expected on the recent stretch image. If you're still having an issue adding users to a local group, please let us know.

Jul 18 2019, 7:50 PM · cloud-services-team (Kanban), Cloud-VPS

Jul 17 2019

JHedden added a comment to T227019: Redirect all space.wmflabs.org traffic to HTTPS.

Try reordering the rules with the HTTPS redirect rule on top. Something like:

RewriteEngine On
RewriteCond %{HTTPS} off
RewriteCond %{HTTP:X-Forwarded-Proto} !https
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
Jul 17 2019, 8:41 PM · VPS-Projects, Space (Jul-Sep-2019)
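
The rewrite rules above combine two conditions, and consecutive `RewriteCond` lines are ANDed by default, so the redirect fires only when the request is plain HTTP both at the server and at any proxy in front of it. A minimal sketch of that decision logic follows, with hypothetical function names. It models the conditions as the unanchored, case-sensitive pattern matches that mod_rewrite performs, and treats a missing header as an empty string.

```python
def needs_https_redirect(https_var, forwarded_proto):
    """Mirror the two RewriteCond lines.

    https_var        value of %{HTTPS}, "on" or "off"
    forwarded_proto  value of the X-Forwarded-Proto header, "" if absent
    """
    served_plain = "off" in https_var             # RewriteCond %{HTTPS} off
    proxied_plain = "https" not in forwarded_proto  # RewriteCond ... !https
    return served_plain and proxied_plain


def redirect_target(host, request_uri):
    """Mirror the RewriteRule substitution https://%{HTTP_HOST}%{REQUEST_URI}."""
    return f"https://{host}{request_uri}"


# A request that reached the proxy over HTTPS must not be redirected again,
# otherwise the client would loop.
print(needs_https_redirect("off", "https"))
print(needs_https_redirect("off", "http"))
print(redirect_target("space.wmflabs.org", "/latest"))
```

Checking the forwarded protocol matters behind a TLS-terminating proxy, where the backend always sees `HTTPS` as off.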

Jul 16 2019

JHedden closed T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown as Resolved.

Changes pushed and verified

Jul 16 2019, 3:46 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
JHedden closed T210995: cloudvps: rabbitmq metrics as Resolved.

I've updated the rabbitmq dashboard for the eqiad prometheus/labs datasource. https://grafana.wikimedia.org/d/000000617/cloudvps-rabbitmq

Jul 16 2019, 2:56 PM · cloud-services-team (Kanban)

Jul 15 2019

JHedden moved T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown from Important to Doing on the cloud-services-team (Kanban) board.
Jul 15 2019, 7:55 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
JHedden closed T227395: tools-worker-1022 k8s duplicate node as Resolved.
Jul 15 2019, 4:04 PM · cloud-services-team
JHedden updated subscribers of T227395: tools-worker-1022 k8s duplicate node.

@aborrero @Bstorm Is there anything else we need or want to check before deleting the bad node?

Jul 15 2019, 1:10 PM · cloud-services-team

Jul 12 2019

JHedden closed T219054: Install fish shell for Toolforge use, a subtask of T55704: Packages to be added to toollabs puppet, as Resolved.
Jul 12 2019, 9:53 PM · Cloud-Services, Tracking-Neverending, Toolforge
JHedden closed T219054: Install fish shell for Toolforge use as Resolved.
Jul 12 2019, 9:53 PM · Toolforge (Software install/update), cloud-services-team (Kanban)
JHedden added a comment to T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown.

Ran some tests with resume_guests_state_on_host_boot enabled and libvirt-guests configured to not start VMs.

Jul 12 2019, 7:56 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
JHedden claimed T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown.
Jul 12 2019, 2:41 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
JHedden added a comment to T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown.

Instead of using virsh autostart, could we let Nova resume the state of VMs after a hypervisor reboot?

Jul 12 2019, 2:25 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)

Jul 9 2019

JHedden closed T227060: nova-fullstack: alert due to leaked instances as Resolved.
Jul 9 2019, 1:27 PM · cloud-services-team (Kanban)

Jul 8 2019

JHedden added a comment to T227395: tools-worker-1022 k8s duplicate node.

Looks like this was affected by DNS testing that was happening on cloudservices1003. Based on the logs, the only way I can see the FQDN changing is with the following example.

Jul 8 2019, 5:02 PM · cloud-services-team

Jul 7 2019

JHedden created T227395: tools-worker-1022 k8s duplicate node.
Jul 7 2019, 4:48 AM · cloud-services-team

Jul 5 2019

JHedden added a comment to T223906: Active/active rabbitMQ servers on wmcs controller nodes.

This has been completed:

Jul 5 2019, 1:42 PM · cloud-services-team (Kanban)

Jul 3 2019

JHedden closed T227222: Degraded RAID on cloudelastic1003 as Resolved.

This host was rebooted for T224228. I downtimed the host and services in icinga but it looks like this slipped through.

Jul 3 2019, 7:04 PM · ops-eqiad, Operations
JHedden added a comment to T227041: Three small ganeti VMs to host haproxy for OpenStack endpoints.

Per T223907, the chosen approach for providing HA over the 3 haproxy nodes is pacemaker. Corosync is actually an implementation detail that provides cluster management services (communication, membership, quorum) and could easily be exchanged for heartbeat.

Jul 3 2019, 1:44 PM · vm-requests, cloud-services-team (Kanban), Operations

Jul 2 2019

JHedden added a comment to T227060: nova-fullstack: alert due to leaked instances.

Looks like the list used for all_cloudvirts is not maintaining any order. It's updating libvirtd.conf on every puppet run:

Jul 2 2019, 2:17 PM · cloud-services-team (Kanban)
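
The general remedy for this kind of churn is to give the list a deterministic order before it is rendered, so that identical input always produces a byte-identical file. The sketch below shows the idea in Python rather than Puppet. The function name and the `example_host_list` key are hypothetical and are not the actual libvirtd.conf setting involved.

```python
def render_host_list(hosts):
    """Render a config line from a collection of host names.

    Sorting (and de-duplicating) first means the output does not depend
    on the order in which the hosts were discovered.
    """
    ordered = sorted(set(hosts))
    quoted = ", ".join(f'"{host}"' for host in ordered)
    return f"example_host_list = [{quoted}]\n"


# The same hosts in two different discovery orders render identically,
# so a config management run sees no change to apply.
first_run = render_host_list(["cloudvirt1002", "cloudvirt1001"])
second_run = render_host_list(["cloudvirt1001", "cloudvirt1002"])
print(first_run == second_run)
```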
JHedden added a comment to T227060: nova-fullstack: alert due to leaked instances.

With fresh eyes I found that the VMs are intermittently failing with

Jul 2 2019, 2:04 PM · cloud-services-team (Kanban)
JHedden added a comment to T227060: nova-fullstack: alert due to leaked instances.

I didn't find anything in the OpenStack or daemon logs on cloudcontrol1003 or a few cloudvirts I spot checked with fullstack instances. I'll check more in the morning.

Jul 2 2019, 7:52 AM · cloud-services-team (Kanban)
JHedden added a comment to T227060: nova-fullstack: alert due to leaked instances.

I got the page and was also working on this event, cleaning VMs that I found to be active and online. T227057 https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring#nova-fullstack

Jul 2 2019, 7:42 AM · cloud-services-team (Kanban)
JHedden added a comment to T227057: cloudcontrol1003/Check for VMs leaked by the nova-fullstack test is CRITICAL.

Fullstack VMs currently leaked

| ID                                   | Name                  | Status | Task State | Power State | Networks                                                          | Availability Zone | Host          | Properties |
+--------------------------------------+-----------------------+--------+------------+-------------+-------------------------------------------------------------------+-------------------+---------------+------------+
| e83bf2eb-0256-4ac7-8b36-e8d3c6e3c519 | fullstackd-1562051634 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.7.181                            | nova              | cloudvirt1004 |            |
| 9551a639-6d8c-4619-8c16-1f652457478f | fullstackd-1562049720 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.7.176, 172.16.7.177, 172.16.7.18 | nova              | cloudvirt1007 |            |
| 3653dacc-568a-4b9c-a126-3db668ea4972 | fullstackd-1562014758 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.7.109, 172.16.7.11               | nova              | cloudvirt1002 |            |
| 8f4a7f1e-7329-4a97-ab79-79f3955a2880 | fullstackd-1561887686 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.6.109, 172.16.6.110              | nova              | cloudvirt1005 |            |
| 21a45c21-476f-4278-b3a2-ca0d134e0fed | fullstackd-1561861047 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.5.44, 172.16.5.45                | nova              | cloudvirt1006 |            |
| 32e92dfc-290f-4c05-8df8-b43ec48033bb | fullstackd-1561740112 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.4.250, 172.16.4.251              | nova              | cloudvirt1002 |            |
Jul 2 2019, 7:18 AM · Cloud-VPS
JHedden created T227057: cloudcontrol1003/Check for VMs leaked by the nova-fullstack test is CRITICAL.
Jul 2 2019, 7:14 AM · Cloud-VPS

Jul 1 2019

JHedden added a comment to T227041: Three small ganeti VMs to host haproxy for OpenStack endpoints.

I feel that 1 CPU might be too limiting; haproxy is multi-threaded and we'll have a number of backends defined.

Jul 1 2019, 10:35 PM · vm-requests, cloud-services-team (Kanban), Operations
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

There's technically nothing stopping us from using 4 nodes in corosync. I'd vote for using all 4 and keeping the configuration in sync, avoiding any one-off special configuration.

Jul 1 2019, 8:41 PM · Patch-For-Review, cloud-services-team (Kanban)
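
For context on the 3-node versus 4-node question, the quorum arithmetic can be sketched as below. This assumes a simple majority rule with one vote per node; the actual corosync quorum settings for this cluster are not given in the comment and may differ.

```python
def majority_quorum(nodes):
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can be lost while a majority remains."""
    return nodes - majority_quorum(nodes)


for n in (3, 4):
    print(f"{n} nodes: quorum={majority_quorum(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Under that assumption a fourth node does not raise the number of failures the cluster survives, so the case for using all four rests on keeping the configuration uniform, as the comment argues.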
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

This is strictly true, yes, but that's how AWS provides load balancer HA and how lots of production web services work. Right now, we are just down until we manually move the service. If the DNS TTL is set short, it would only be down for that long (we also handle most manual failovers entirely via DNS already--again this would be superior to that method). It's not 100% foolproof, but is a gigantic leap forward from where we are. Haproxy servers would just need to be manually "depooled" from DNS before maintenance.

Jul 1 2019, 8:02 PM · Patch-For-Review, cloud-services-team (Kanban)
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

@Bstorm Using round robin DNS alone does not provide HA. All it does is rotate the first host address returned to the client. If the address given to the client is not reachable, the client will not ask for another address or try the other addresses in the record. This also leads to DNS caching issues when one of the addresses in the record is offline. If we were to use round robin DNS entries, we'd need to ensure that all the addresses defined are reachable at all times (using something like keepalived/corosync/pacemaker).

Jul 1 2019, 7:49 PM · Patch-For-Review, cloud-services-team (Kanban)
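
To make the gap concrete, this is roughly what a client would have to do itself for round robin DNS to act as failover: walk every address in the record until one accepts the connection. It is a sketch only. The function name is hypothetical, the `connect` callable is injected so the example runs without a network, and the addresses come from the documentation range.

```python
def connect_first_reachable(addresses, connect):
    """Try each address in order and return (address, connection).

    `connect` is any callable that returns a connection object or raises
    OSError. If every address fails, ConnectionError is raised with the
    collected errors.
    """
    errors = []
    for address in addresses:
        try:
            return address, connect(address)
        except OSError as exc:
            errors.append((address, exc))
    raise ConnectionError(f"no address reachable: {errors}")


def fake_connect(address):
    """Stand-in for a real socket connect; the first address is 'down'."""
    if address == "192.0.2.1":
        raise OSError("connection refused")
    return f"connection to {address}"


chosen, conn = connect_first_reachable(["192.0.2.1", "192.0.2.2"], fake_connect)
print(chosen, conn)
```

A client that only uses the first address returned has no such loop, which is the failure mode the comment describes.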
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

What's the plan for managing the address HAproxy will use?

Jul 1 2019, 7:15 PM · Patch-For-Review, cloud-services-team (Kanban)
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

@Andrew How many hosts will support the OpenStack APIs in your HA architecture design?

Jul 1 2019, 6:55 PM · Patch-For-Review, cloud-services-team (Kanban)
JHedden closed T225067: labtestvirt2003: test different power management / CPU setups for faster kvm as Resolved.
Jul 1 2019, 4:00 PM · Continuous-Integration-Infrastructure, cloud-services-team (Kanban)
JHedden closed T225067: labtestvirt2003: test different power management / CPU setups for faster kvm, a subtask of T223971: Old cloudvirt (with Intel Xeon) are twice slower than new ones (Intel Sky Lake), as Resolved.
Jul 1 2019, 4:00 PM · cloud-services-team (Kanban), Continuous-Integration-Infrastructure
JHedden added a comment to T225067: labtestvirt2003: test different power management / CPU setups for faster kvm.

Closing this task. The default (dynamic) power regulator settings are not impacting the virtual machine performance.

Jul 1 2019, 4:00 PM · Continuous-Integration-Infrastructure, cloud-services-team (Kanban)

Jun 28 2019

JHedden added a comment to T213413: Adapt Toolschecker to work with Prometheus.

Fixing these critical alerts in icinga to reduce noise and give us a better direction for migrating over to prometheus.

Jun 28 2019, 9:02 PM · Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T223906: Active/active rabbitMQ servers on wmcs controller nodes.

Both rabbitMQ servers should be disk nodes. Currently cloudcontrol1004 is a RAM node.
{nodes,[{disc,[rabbit@cloudcontrol1003]},{ram,[rabbit@cloudcontrol1004]

Jun 28 2019, 8:15 PM · cloud-services-team (Kanban)
JHedden added a comment to T224688: Outstanding icinga critical on cloudcontrol-dev hosts.
  • understand how notification works (I don't fully understand them yet!), by understanding the relationship between profile::base::notifications, icinga, and defined checks for a given server/service.

I think the best approach to handling this with Icinga is to downtime hosts and/or disable active checks for the services we're working on. This will help reduce noise for both notifications and dashboard visibility.

  • figure out a proper value for the hiera keys, a value that results in servers producing alerts in IRC/email only and not pages (since this is the -dev cluster)

Notifications are disabled, but cloudservices2002-dev.wikimedia.org is configured to alert via IRC only. It is not configured for SMS paging.

  • check which alerts are true/false positives, and fix them, i.e, get the openstack deployment in shape by fixing any remaining missing configuration bits (remember, is a new deployment!)

These alerts were all relevant and are now fixed.

Jun 28 2019, 4:24 PM · cloud-services-team (Kanban), Cloud-Services
JHedden added a comment to T226731: Implement nova host-aggregates.

I'm certainly in favor of replacing custom code with upstream code! In particular it seems like we'll need this in order to make live-migration work sensibly between different CPU-typed cloudvirts, right?

Jun 28 2019, 3:57 PM · Cloud-VPS, cloud-services-team

Jun 27 2019

JHedden created T226731: Implement nova host-aggregates.
Jun 27 2019, 4:04 PM · Cloud-VPS, cloud-services-team

Jun 26 2019

JHedden added a comment to T226463: Connecting to Wiki Replicas from whgi.wikidumpparse.eqiad.wmflabs intermittently gives SSL error.

Hi @notconfusing, I think this might be related to the mysql-community-client 5.7.25-1debian9 package you have installed on whgi. The MySQL community builds are bundled with yaSSL[0], while the wiki replicas are running 10.1.39-MariaDB, which uses OpenSSL.

Jun 26 2019, 10:18 PM · VPS-Projects
JHedden added a comment to T226647: nova-fullstack crashed from a keystone timeout.

This is related to the task T226632. Nova-api was stopped and started for the wmf_scheduler_hosts_pool config update.

Jun 26 2019, 3:45 PM · cloud-services-team (Kanban)

Jun 25 2019

JHedden closed T225823: Request creation of asyncwiki VPS project as Resolved.
Jun 25 2019, 5:17 PM · Cloud-VPS (Project-requests)