JHedden (Jason Hedden)
User

User Details

User Since
May 28 2019, 6:09 PM (16 w, 6 d)
Availability
Available
LDAP User
Jhedden
MediaWiki User
JHedden (WMF)

Recent Activity

Yesterday

JHedden added a comment to T233665: Forward our neutron-l3-agent routing hacks to Openstack Newton.

It looks like the purpose of the routing hacks is to disable NAT between each source:destination pair listed in dmz_cidr:

profile::openstack::eqiad1::neutron::dmz_cidr:
 - 172.16.0.0/21:91.198.174.0/24
 - 172.16.0.0/21:198.35.26.0/23
 - 172.16.0.0/21:10.0.0.0/8
 - 172.16.0.0/21:208.80.152.0/22
 - 172.16.0.0/21:103.102.166.0/24
 - 172.16.0.0/21:172.16.0.0/21
Mon, Sep 23, 10:30 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
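The source:destination pairs quoted above can be read as a NAT exemption list. The following is a minimal sketch of that reading, not the neutron-l3-agent implementation: the function names `parse_pairs` and `skips_nat` are hypothetical, and the sketch assumes IPv4-only entries so that a plain split on ":" is safe.

```python
import ipaddress

# Pairs copied from the dmz_cidr hiera key quoted above ("source:destination").
DMZ_CIDR = [
    "172.16.0.0/21:91.198.174.0/24",
    "172.16.0.0/21:198.35.26.0/23",
    "172.16.0.0/21:10.0.0.0/8",
    "172.16.0.0/21:208.80.152.0/22",
    "172.16.0.0/21:103.102.166.0/24",
    "172.16.0.0/21:172.16.0.0/21",
]


def parse_pairs(entries):
    """Turn "src_cidr:dst_cidr" strings into (network, network) tuples.

    Assumption: entries are IPv4 only, so splitting on ":" is unambiguous.
    """
    pairs = []
    for entry in entries:
        src, dst = entry.split(":")
        pairs.append((ipaddress.ip_network(src), ipaddress.ip_network(dst)))
    return pairs


def skips_nat(src_ip, dst_ip, pairs):
    """Return True when the flow matches a pair, i.e. NAT would be bypassed."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    return any(src in s and dst in d for s, d in pairs)


PAIRS = parse_pairs(DMZ_CIDR)

# A VM address talking to a 10.0.0.0/8 address matches a pair.
print(skips_nat("172.16.1.10", "10.64.0.5", PAIRS))
# The same VM talking to an address outside every destination range does not.
print(skips_nat("172.16.1.10", "8.8.8.8", PAIRS))
```

Under this reading, traffic from the 172.16.0.0/21 range keeps its original source address only when the destination falls in one of the listed ranges; everything else would still be translated.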

Thu, Sep 19

JHedden closed T221272: Expose new ipblocks.ipb_sitewide column to the replicas as Resolved.

The ipblocks and ipblocks_ipindex views have been rebuilt with the ipb_sitewide field on all of the replicas.

Thu, Sep 19, 8:32 PM · cloud-services-team (Kanban), Data-Services, Security-Team, Anti-Harassment
JHedden added a comment to T219374: Prepare and check storage layer for hi.wikisource.

So everything done?
From my side I can connect fine and query the views.

Thu, Sep 19, 2:38 PM · Core Platform Team Workboards (Clinic Duty Team), cloud-services-team, Analytics, Data-Services, DBA
JHedden added a comment to T219374: Prepare and check storage layer for hi.wikisource.

@JHedden after your +1, I have merged the change. Can you run puppet and try again to see if that index error is no more?

Thu, Sep 19, 2:26 PM · Core Platform Team Workboards (Clinic Duty Team), cloud-services-team, Analytics, Data-Services, DBA
JHedden added a comment to T219374: Prepare and check storage layer for hi.wikisource.

OK, the replica DNS entries are set up for hiwikisource now.

Thu, Sep 19, 1:44 PM · Core Platform Team Workboards (Clinic Duty Team), cloud-services-team, Analytics, Data-Services, DBA
JHedden added a comment to T219374: Prepare and check storage layer for hi.wikisource.

Excellent, you can proceed on labsdb1010 and labsdb1011 too. I have created the DB there too.

Thu, Sep 19, 1:27 PM · Core Platform Team Workboards (Clinic Duty Team), cloud-services-team, Analytics, Data-Services, DBA
JHedden added a comment to T219374: Prepare and check storage layer for hi.wikisource.

Yep, it's working now. The views have been created on labsdb1009 and labsdb1012.

Thu, Sep 19, 1:22 PM · Core Platform Team Workboards (Clinic Duty Team), cloud-services-team, Analytics, Data-Services, DBA
JHedden added a comment to T219374: Prepare and check storage layer for hi.wikisource.

Getting an error when trying to create the views. Maybe similar to what we saw in T193187?

Thu, Sep 19, 1:15 PM · Core Platform Team Workboards (Clinic Duty Team), cloud-services-team, Analytics, Data-Services, DBA

Tue, Sep 17

JHedden added a comment to T220530: Ensure clouddb1001 is monitored appropriately from the tendril/prometheus side.

clouddb100[12].clouddb-services.eqiad.wmflabs node and mariadb metrics are in tools-prometheus.wmflabs.org now.

Tue, Sep 17, 4:37 PM · Data-Services, cloud-services-team (Kanban)

Mon, Sep 16

JHedden added a comment to T220530: Ensure clouddb1001 is monitored appropriately from the tendril/prometheus side.

Added a new security group rule to the clouddb-services openstack project

Mon, Sep 16, 9:27 PM · Data-Services, cloud-services-team (Kanban)

Fri, Sep 13

JHedden claimed T220530: Ensure clouddb1001 is monitored appropriately from the tendril/prometheus side.

Looking into using profile::prometheus::mysqld_exporter_instance to feed clouddb data into tools-prometheus.wmflabs.org.

Fri, Sep 13, 9:04 PM · Data-Services, cloud-services-team (Kanban)
JHedden added a comment to T53434: Implement a system to monitor tools on tool-labs.
Fri, Sep 13, 2:10 PM · cloud-services-team (Kanban), User-Matthewrbowker, community-labs-monitoring, Toolforge

Wed, Sep 11

JHedden updated the task description for T223905: HA for openstack services.
Wed, Sep 11, 3:47 PM · Goal, cloud-services-team (Kanban)
JHedden closed T221301: Toolschecker webservice checks get out of sync likely from timeouts, a subtask of T220650: tools-manifest - webservicemonitor needs a longer timeout, as Resolved.
Wed, Sep 11, 3:34 PM · cloud-services-team (Kanban), Toolforge
JHedden closed T221301: Toolschecker webservice checks get out of sync likely from timeouts as Resolved.

Icinga checks for the webservice have been removed.

Wed, Sep 11, 3:34 PM · Toolforge, cloud-services-team (Kanban)
JHedden closed T205524: cloudvps: neutron: agents failed to communicate with server, a subtask of T167293: Nova-network to Neutron migration, as Resolved.
Wed, Sep 11, 3:26 PM · Epic, Cloud-Services
JHedden closed T205524: cloudvps: neutron: agents failed to communicate with server as Resolved.

T224981 seems to have resolved this.

Wed, Sep 11, 3:26 PM · cloud-services-team (Kanban), Cloud-Services
JHedden moved T231793: Remove systemd from openstack-mitaka from Needs discussion to Doing on the cloud-services-team (Kanban) board.
Wed, Sep 11, 3:04 PM · cloud-services-team (Kanban), Operations
JHedden claimed T231793: Remove systemd from openstack-mitaka.
Wed, Sep 11, 3:04 PM · cloud-services-team (Kanban), Operations
JHedden added a comment to T232536: Toolforge Kubernetes internal API down, causing `webservice` and other tooling to fail.
18:46:22 <revi> job 6568688 (tools.stewardbots) is not being deleted for few minutes, looks like something abnormal? (in the past it has been deleted within minutes)

(the Q I left on #wikimedia-cloud, crossposting here, assuming it's related)

Wed, Sep 11, 1:34 PM · cloud-services-team (Kanban), Toolforge

Mon, Sep 9

JHedden closed T232322: labspuppetmaster1001 puppet-merge failing as Resolved.

Fixed. The changes for commit 2e5424b4010c29da463eaf3c4ca2898c0a8fb79d were applied, but for some odd reason not the commit? I've cleaned it up and everything is in sync now.

Mon, Sep 9, 1:48 PM · Puppet, Operations, cloud-services-team

Fri, Sep 6

JHedden added a comment to T231222: Address icinga noise from wmflabs.

I've looked at packet captures and traced processes on both ends. As far as I can tell, the metrics are being sent and stored correctly in graphite's whisper database.

Fri, Sep 6, 1:26 PM · ORES, Scoring-platform-team (Current)

Thu, Sep 5

JHedden created P9044 recent ores-web-01 metrics.
Thu, Sep 5, 9:07 PM

Wed, Sep 4

JHedden added a comment to T231222: Address icinga noise from wmflabs.

I'm curious if the timeouts are only seen from a single virtual machine, or if you're seeing the same results from multiple ORES virtual machines. Could you please run your script on a couple of different hosts and share the results with host names?

Wed, Sep 4, 8:20 PM · ORES, Scoring-platform-team (Current)
JHedden closed T231999: Icinga nova-compute process check flapping as Resolved.
Wed, Sep 4, 3:13 PM · Cloud-VPS
JHedden created T231999: Icinga nova-compute process check flapping.
Wed, Sep 4, 3:01 PM · Cloud-VPS

Tue, Sep 3

JHedden added a comment to T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC.

Cloud VPS OpenStack has been fully switched over and all services are back online.

Tue, Sep 3, 1:41 PM · cloud-services-team (Kanban), wikitech.wikimedia.org, Operations, DBA

Thu, Aug 29

JHedden closed T229448: showmount not working on labstore1004 & labstore1005 as Resolved.
Thu, Aug 29, 3:49 PM · Data-Services, cloud-services-team (Kanban)
JHedden added a comment to T229448: showmount not working on labstore1004 & labstore1005.

Since showmount is not reliable with NFS v4 (as also seen in T171508), we could check NFS connectivity over the cluster IP with the nagios rpcinfo wrapper:

Thu, Aug 29, 2:07 PM · Data-Services, cloud-services-team (Kanban)

Wed, Aug 28

JHedden added a comment to T229448: showmount not working on labstore1004 & labstore1005.

The RPC portmapper is out of sync with the NFS server. NFS will need to be restarted to resolve this, but unfortunately that will cause a brief interruption in client IO.

Wed, Aug 28, 9:39 PM · Data-Services, cloud-services-team (Kanban)

Tue, Aug 27

JHedden moved T221301: Toolschecker webservice checks get out of sync likely from timeouts from Needs discussion to Doing on the cloud-services-team (Kanban) board.
Tue, Aug 27, 4:43 PM · Toolforge, cloud-services-team (Kanban)

Aug 16 2019

JHedden closed T227041: Three small ganeti VMs to host haproxy for OpenStack endpoints, a subtask of T223907: Set up HA endpoints for keystone, glance, nova, designate apis, as Resolved.
Aug 16 2019, 3:04 PM · cloud-services-team (Kanban)
JHedden closed T227041: Three small ganeti VMs to host haproxy for OpenStack endpoints as Resolved.

For this phase we're going to install haproxy directly on the openstack controllers. We will not be needing these VMs. Thank you for all the information, it was very helpful.

Aug 16 2019, 3:04 PM · vm-requests, Operations, cloud-services-team (Kanban)

Aug 15 2019

JHedden added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.
Aug 15 2019, 3:31 PM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge

Aug 13 2019

JHedden added a comment to T230442: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

This host also has a bad disk in slot number 8. T230289

Aug 13 2019, 6:58 PM · ops-eqiad, Operations
JHedden added a comment to T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

There are no workloads on this host now. We're good to have this replaced anytime. Thanks!

Aug 13 2019, 5:12 PM · cloud-services-team, ops-eqiad, Operations
JHedden closed T230247: Increase VCPU quota for wikidata-query project as Resolved.
Aug 13 2019, 5:03 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests), Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
JHedden moved T221301: Toolschecker webservice checks get out of sync likely from timeouts from Doing to Needs discussion on the cloud-services-team (Kanban) board.
Aug 13 2019, 1:54 PM · Toolforge, cloud-services-team (Kanban)

Aug 9 2019

JHedden added a comment to T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC.

The plan looks good to me. In the pre-failover stage I'll be shutting down the OpenStack scheduler and designate services to ensure there are no actions in queue, then re-enabling these in the clean up steps.

Aug 9 2019, 2:05 PM · cloud-services-team (Kanban), wikitech.wikimedia.org, Operations, DBA

Aug 8 2019

JHedden closed T230157: tools-sgewebgrid-lighttpd-0915 not responding as Resolved.

Only interesting things from the logs:

Aug 8 2019, 7:58 PM · cloud-services-team
JHedden created T230157: tools-sgewebgrid-lighttpd-0915 not responding.
Aug 8 2019, 7:24 PM · cloud-services-team
JHedden updated subscribers of T213567: Toolforge: refresh grafana dashboard.

I haven't added any new exporters yet. I think you might be referring to the changes @Bstorm made for kubelet stats in T228573.

Aug 8 2019, 1:23 PM · cloud-services-team (Kanban), Toolforge

Aug 7 2019

JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

Ran into a new failure scenario on gridengine. It might be a false positive, but it did cause the webservice to remain running:

queue instance "webgrid-lighttpd@tools-sgewebgrid-lighttpd-0923.tools.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=2.757500 (= 2.757500 + 0.50 * 0.000000 with nproc=4) >= 2.75
Aug 7 2019, 7:03 PM · Toolforge, cloud-services-team (Kanban)
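The arithmetic in the scheduler message above can be reproduced with a small sketch. This only mirrors the numbers printed in the message (load value, 0.50 factor, adjustment term, 2.75 threshold); the function and parameter names are hypothetical and this is not gridengine's own code.

```python
def adjusted_load(np_load_avg, factor, adjustment):
    """Combine the reported load with the weighted adjustment term."""
    return np_load_avg + factor * adjustment


def is_overloaded(np_load_avg, factor, adjustment, threshold):
    """Apply the ">=" comparison shown in the scheduler message."""
    return adjusted_load(np_load_avg, factor, adjustment) >= threshold


# Values taken from the message: 2.757500 + 0.50 * 0.000000 >= 2.75
value = adjusted_load(2.757500, 0.50, 0.0)
dropped = is_overloaded(2.757500, 0.50, 0.0, 2.75)
print(value, dropped)
```

With these inputs the adjustment term is zero, so the queue instance was dropped because the reported load was 0.0075 above the threshold, which is consistent with the suspicion that this is a borderline case.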
JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

Fixed NGINX timeouts to match WSGI and added better status checking after issuing webservice commands.

Aug 7 2019, 5:14 PM · Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T149589: Puppet tab in Horizon unusably slow.

Viewing the instance console log can occasionally take longer than expected. This process queries multiple APIs and communicates directly with the hypervisor supporting the VM, i.e. there are many potential places for delay and resource contention.

Aug 7 2019, 3:10 PM · cloud-services-team (Kanban), Patch-For-Review, Operations, Puppet, Cloud-Services
JHedden closed T230003: openstack: cleanup neutron user as Resolved.

Confirmed openstack role assignment list --names is working as expected now.

Aug 7 2019, 2:45 PM · cloud-services-team (Kanban)
JHedden added a comment to T230003: openstack: cleanup neutron user.

Found another one too:

Aug 7 2019, 2:43 PM · cloud-services-team (Kanban)
JHedden added a comment to T230003: openstack: cleanup neutron user.

Nice catch, thanks for the back story too.

Aug 7 2019, 2:30 PM · cloud-services-team (Kanban)

Aug 5 2019

JHedden added a comment to T229657: Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC.

As per the sync on the SRE meeting, @JHedden will be online from WMCS.
I will handle the announcement for wikitech, could you handle the announcement (if it is needed) for the OpenStack part of things?

Aug 5 2019, 4:28 PM · cloud-services-team (Kanban), wikitech.wikimedia.org, Operations, DBA
JHedden closed T229846: anticompositebot tool missing project directory as Resolved.

The anticompositebot tool directory and configuration are present and active now.

Aug 5 2019, 4:20 PM · cloud-services-team, Toolforge
JHedden created T229846: anticompositebot tool missing project directory.
Aug 5 2019, 4:09 PM · cloud-services-team, Toolforge
JHedden closed T229787: Toolforge: sudden issues in both gridengine and k8s webservices as Resolved.

The Icinga check description was recently updated for T228878 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/525536). The new name/description for this service appears to have removed the existing acks and downtime.

Aug 5 2019, 1:29 PM · cloud-services-team (Kanban)

Jul 26 2019

JHedden created P8811 cloudvirt1015 testing VM crash.
Jul 26 2019, 10:00 PM
JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

The few I spot checked also lined up with timeouts in the etcd server log:

Jul 26 2019, 9:10 PM · Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

It seems rare, but I've also noticed a few timeouts from SGE: 2019-07-26T19:29:42.700456 Timed out attempting to start webservice (15s)

Jul 26 2019, 8:31 PM · Toolforge, cloud-services-team (Kanban)
JHedden updated the task description for T225713: CPU scaling governor audit.
Jul 26 2019, 1:42 PM · User-fgiunchedi, Operations

Jul 25 2019

JHedden added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

Created these VMs

openstack server list --project testlabs --long -c ID -c Name -c Host| grep cv1015
| 30f17a94-252e-46d2-aa28-e6f24c9c457e | cv1015-testing03                  | cloudvirt1015   |
| d1b13075-ace4-44ba-8f26-c9c12a360184 | cv1015-testing02                  | cloudvirt1015   |
| b99a2376-1bb1-48f9-9889-00d3aedb9a43 | cv1015-testing01                  | cloudvirt1015   |
| e65ff310-f0ef-451c-956c-8d21b21cc12a | cv1015-testing04                  | cloudvirt1015   |
Jul 25 2019, 2:42 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)
JHedden added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

Thanks @RobH. I'll spin up some stress testing VMs on that host and let them run until Andrew gets back from vacation next week.

Jul 25 2019, 1:56 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)

Jul 24 2019

JHedden added a comment to T224324: LB for cloudelastic.

@EBernhardson Unfortunately we don't have a good solution for this today or in the near future. We've discussed future load balancing as a service options, but this requires a lot of effort on backend upgrades, automation and configuration.

Jul 24 2019, 2:08 PM · Discovery-Search (Current work), Cloud-Services, Elasticsearch, Discovery

Jul 23 2019

JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

The webservice checks are getting better, but it ran into a new failure:

Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: Traceback (most recent call last):
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/bin/webservice", line 169, in <module>
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     start(job, 'Starting webservice')
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/bin/webservice", line 61, in start
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     job.request_start()
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 456, in request_start
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     pykube.Deployment(self.api, self._get_deployment()).create()
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/lib/python2.7/dist-packages/pykube/objects.py", line 76, in create
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     self.api.raise_for_status(r)
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 104, in raise_for_status
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     raise HTTPError(payload["message"])
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: pykube.exceptions.HTTPError: client: etcd member https://tools-k8s-etcd-01.tools.eqiad.wmflabs:2379 has no leader
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: --------------------------------------------------------------------------------
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: ERROR in toolschecker [/var/lib/toolschecker/toolschecker.py:454]:
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: webservice kubernetes: error starting
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: --------------------------------------------------------------------------------
...
Jul 23 16:53:43 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: --------------------------------------------------------------------------------
Jul 23 16:53:43 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: ERROR in toolschecker [/var/lib/toolschecker/toolschecker.py:448]:
Jul 23 16:53:43 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: webservice kubernetes: found existing webservice running
Jul 23 16:53:43 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: --------------------------------------------------------------------------------
Jul 23 2019, 5:27 PM · Toolforge, cloud-services-team (Kanban)
JHedden moved T221301: Toolschecker webservice checks get out of sync likely from timeouts from Needs discussion to Doing on the cloud-services-team (Kanban) board.
Jul 23 2019, 2:57 PM · Toolforge, cloud-services-team (Kanban)
JHedden claimed T221301: Toolschecker webservice checks get out of sync likely from timeouts.
Jul 23 2019, 2:57 PM · Toolforge, cloud-services-team (Kanban)
JHedden moved T221301: Toolschecker webservice checks get out of sync likely from timeouts from Inbox to Needs discussion on the cloud-services-team (Kanban) board.
Jul 23 2019, 2:34 PM · Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

The current configuration is set to check every 1 minute and retry every 1 minute after a failure.

Jul 23 2019, 2:29 PM · Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T228731: https://dumps.wikimedia.org/other/pageviews/ lacks hourly pageviews since 20190722-17:00.

Maybe related are some hiera values at:

hieradata/common.yaml
# Dumps distribution server currently serving traffic over NFS to cloud vps instances
dumps_dist_active_vps: labstore1007.wikimedia.org
# Dumps distribution server currently serving web and rsync mirror traffic
# Also serves stat* hosts over nfs
dumps_dist_active_web: labstore1006.wikimedia.org
Jul 23 2019, 12:56 PM · Analytics-Kanban, Analytics, cloud-services-team, Wikimedia-Portals

Jul 22 2019

JHedden added a comment to T228573: toolforge k8s nodes oom?.

That's great! Now that we have that, this query should be helpful for future investigation: container_memory_usage_bytes{job="k8s-node",instance="tools-worker-1015.tools.eqiad.wmflabs"}

Jul 22 2019, 6:23 PM · Toolforge, cloud-services-team (Kanban)
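For scripted investigation, the selector quoted above could be sent to the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). The sketch below only builds the URL. The base URL `https://prometheus.example.org` is a placeholder, since the exact address and path prefix of the Toolforge Prometheus instance are not given here, and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def build_selector(metric, labels):
    """Render metric{key="value",...} with labels in insertion order.

    Assumption: label values contain no quotes or backslashes, so no
    escaping is needed.
    """
    rendered = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{metric}{{{rendered}}}"


def build_query_url(base_url, selector):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': selector})}"


SELECTOR = build_selector(
    "container_memory_usage_bytes",
    {"job": "k8s-node", "instance": "tools-worker-1015.tools.eqiad.wmflabs"},
)
# Placeholder base URL; substitute the real Prometheus endpoint.
URL = build_query_url("https://prometheus.example.org", SELECTOR)
print(SELECTOR)
print(URL)
```

Changing the `instance` label is enough to run the same check against another worker node.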
JHedden added a comment to T228573: toolforge k8s nodes oom?.

I'm not seeing anything strange for the prometheus-node exporter memory usage. GC and overall heap allocation history looks good on tools-worker-1015.

Jul 22 2019, 2:51 PM · Toolforge, cloud-services-team (Kanban)

Jul 19 2019

JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

The recent webservice critical status was related to existing webservice instances left running. When concurrent requests from both icinga1001.wikimedia.org and icinga2001.wikimedia.org are made to the webservice endpoint they can leave the webservice instance running, causing the checks to fail going forward.

Jul 19 2019, 10:48 PM · Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T225265: Fix labstore checks on cloudstore1008/9.

That patch ^ fixes the NRPE error "CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds", which occurs when the host that is not supporting the VIP runs showmount.

Jul 19 2019, 2:42 PM · Data-Services, cloud-services-team (Kanban)

Jul 18 2019

JHedden closed T179848: Unable to add user to group in debian stretch instance as Resolved.

I've confirmed that this process works as expected on the recent stretch image. If you're still having an issue adding users to a local group, please let us know.

Jul 18 2019, 7:50 PM · cloud-services-team (Kanban), Cloud-VPS

Jul 17 2019

JHedden added a comment to T227019: Redirect all space.wmflabs.org traffic to HTTPS.

Try reordering the rules with the HTTPS redirect rule on top. Something like:

RewriteEngine On
RewriteCond %{HTTPS} off
RewriteCond %{HTTP:X-Forwarded-Proto} !https
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
Jul 17 2019, 8:41 PM · VPS-Projects, Space (Jul-Sep-2019)
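The intent of the rewrite rules above can be modelled in a few lines: consecutive RewriteCond lines are ANDed by default, so the redirect fires only when the connection is not HTTPS and the proxy did not report HTTPS either. This is a simplified sketch for reasoning about the rule order, not Apache's implementation; `redirect_target` is a hypothetical name and header lookup is treated as case-sensitive for brevity.

```python
import re


def redirect_target(https_on, headers, host, request_uri):
    """Return the 301 target URL, or None when no redirect applies."""
    forwarded = headers.get("X-Forwarded-Proto", "")
    # RewriteCond %{HTTPS} off
    cond_https_off = not https_on
    # RewriteCond %{HTTP:X-Forwarded-Proto} !https  (negated regex match)
    cond_not_forwarded_https = re.search("https", forwarded) is None
    if cond_https_off and cond_not_forwarded_https:
        # RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
        return f"https://{host}{request_uri}"
    return None


# Plain HTTP request with no proxy header: redirected.
print(redirect_target(False, {}, "space.wmflabs.org", "/latest"))
# Request already terminated as HTTPS at the proxy: left alone.
print(redirect_target(False, {"X-Forwarded-Proto": "https"}, "space.wmflabs.org", "/latest"))
```

The second condition is what prevents a redirect loop when TLS is terminated at a proxy in front of the web server.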

Jul 16 2019

JHedden closed T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown as Resolved.

Changes pushed and verified

Jul 16 2019, 3:46 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
JHedden closed T210995: cloudvps: rabbitmq metrics as Resolved.

I've updated the rabbitmq dashboard for the eqiad prometheus/labs datasource. https://grafana.wikimedia.org/d/000000617/cloudvps-rabbitmq

Jul 16 2019, 2:56 PM · cloud-services-team (Kanban)

Jul 15 2019

JHedden moved T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown from Important to Doing on the cloud-services-team (Kanban) board.
Jul 15 2019, 7:55 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
JHedden closed T227395: tools-worker-1022 k8s duplicate node as Resolved.
Jul 15 2019, 4:04 PM · cloud-services-team
JHedden updated subscribers of T227395: tools-worker-1022 k8s duplicate node.

@aborrero @Bstorm Is there anything else we need or want to check before deleting the bad node?

Jul 15 2019, 1:10 PM · cloud-services-team

Jul 12 2019

JHedden closed T219054: Install fish shell for Toolforge use, a subtask of T55704: Packages to be added to toollabs puppet, as Resolved.
Jul 12 2019, 9:53 PM · Cloud-Services, Tracking-Neverending, Toolforge
JHedden closed T219054: Install fish shell for Toolforge use as Resolved.
Jul 12 2019, 9:53 PM · Toolforge (Software install/update), cloud-services-team (Kanban)
JHedden added a comment to T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown.

Ran some tests with resume_guests_state_on_host_boot enabled and libvirt-guests configured to not start VMs.

Jul 12 2019, 7:56 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
JHedden claimed T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown.
Jul 12 2019, 2:41 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
JHedden added a comment to T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown.

Instead of using virsh autostart, could we let Nova resume the state of VMs after a hypervisor reboot?

Jul 12 2019, 2:25 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)

Jul 9 2019

JHedden closed T227060: nova-fullstack: alert due to leaked instances as Resolved.
Jul 9 2019, 1:27 PM · cloud-services-team (Kanban)

Jul 8 2019

JHedden added a comment to T227395: tools-worker-1022 k8s duplicate node.

Looks like this was affected by DNS testing that was happening on cloudservices1003. Based on the logs, the only way I can see the FQDN changing is with the following example.

Jul 8 2019, 5:02 PM · cloud-services-team

Jul 7 2019

JHedden created T227395: tools-worker-1022 k8s duplicate node.
Jul 7 2019, 4:48 AM · cloud-services-team

Jul 5 2019

JHedden added a comment to T223906: Active/active rabbitMQ servers on wmcs controller nodes.

This has been completed:

Jul 5 2019, 1:42 PM · cloud-services-team (Kanban)

Jul 3 2019

JHedden closed T227222: Degraded RAID on cloudelastic1003 as Resolved.

This host was rebooted for T224228. I downtimed the host and services in icinga but it looks like this slipped through.

Jul 3 2019, 7:04 PM · ops-eqiad, Operations
JHedden added a comment to T227041: Three small ganeti VMs to host haproxy for OpenStack endpoints.

Per T223907, the chosen approach for providing HA over the 3 haproxy nodes is pacemaker. Corosync is actually an implementation detail to provide cluster management services, i.e. communication, membership, quorum and could be easily exchanged for heartbeat.

Jul 3 2019, 1:44 PM · vm-requests, Operations, cloud-services-team (Kanban)

Jul 2 2019

JHedden added a comment to T227060: nova-fullstack: alert due to leaked instances.

Looks like the list used for all_cloudvirts is not maintaining any order. It's updating libvirtd.conf every puppet run:

Jul 2 2019, 2:17 PM · cloud-services-team (Kanban)
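The ordering problem described above is a common cause of config files changing on every run: the same set of hosts rendered in a different order produces different file contents. The sketch below illustrates the idea in Python rather than Puppet. The key name `example_hosts` and the function names are hypothetical and are not the actual libvirtd.conf setting or template.

```python
def render_unsorted(hosts):
    """Render the list in whatever order it arrives (unstable output)."""
    quoted = ", ".join(f'"{host}"' for host in hosts)
    return f"example_hosts = [{quoted}]"


def render_sorted(hosts):
    """Sort before rendering so identical membership gives identical text."""
    return render_unsorted(sorted(hosts))


# Same hosts, different order on two consecutive runs.
run_a = ["cloudvirt1004", "cloudvirt1002", "cloudvirt1005"]
run_b = ["cloudvirt1005", "cloudvirt1004", "cloudvirt1002"]

print(render_unsorted(run_a) == render_unsorted(run_b))
print(render_sorted(run_a) == render_sorted(run_b))
```

Sorting the list before it reaches the template is one way to make the rendered file stable, so that a change is reported only when the set of hosts actually changes.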
JHedden added a comment to T227060: nova-fullstack: alert due to leaked instances.

With fresh eyes I found that the VMs are intermittently failing with

Jul 2 2019, 2:04 PM · cloud-services-team (Kanban)
JHedden added a comment to T227060: nova-fullstack: alert due to leaked instances.

I didn't find anything in the OpenStack or daemon logs on cloudcontrol1003 or a few cloudvirts I spot checked with fullstack instances. I'll check more in the morning.

Jul 2 2019, 7:52 AM · cloud-services-team (Kanban)
JHedden added a comment to T227060: nova-fullstack: alert due to leaked instances.

I got the page and was also working on this event, cleaning VMs that I found to be active and online. T227057 https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring#nova-fullstack

Jul 2 2019, 7:42 AM · cloud-services-team (Kanban)
JHedden added a comment to T227057: cloudcontrol1003/Check for VMs leaked by the nova-fullstack test is CRITICAL.

Fullstack VMs currently leaked

| ID                                   | Name                  | Status | Task State | Power State | Networks                                                          | Availability Zone | Host          | Properties |
+--------------------------------------+-----------------------+--------+------------+-------------+-------------------------------------------------------------------+-------------------+---------------+------------+
| e83bf2eb-0256-4ac7-8b36-e8d3c6e3c519 | fullstackd-1562051634 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.7.181                            | nova              | cloudvirt1004 |            |
| 9551a639-6d8c-4619-8c16-1f652457478f | fullstackd-1562049720 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.7.176, 172.16.7.177, 172.16.7.18 | nova              | cloudvirt1007 |            |
| 3653dacc-568a-4b9c-a126-3db668ea4972 | fullstackd-1562014758 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.7.109, 172.16.7.11               | nova              | cloudvirt1002 |            |
| 8f4a7f1e-7329-4a97-ab79-79f3955a2880 | fullstackd-1561887686 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.6.109, 172.16.6.110              | nova              | cloudvirt1005 |            |
| 21a45c21-476f-4278-b3a2-ca0d134e0fed | fullstackd-1561861047 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.5.44, 172.16.5.45                | nova              | cloudvirt1006 |            |
| 32e92dfc-290f-4c05-8df8-b43ec48033bb | fullstackd-1561740112 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.4.250, 172.16.4.251              | nova              | cloudvirt1002 |            |
Jul 2 2019, 7:18 AM · Cloud-VPS
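The numeric suffix in the fullstackd-* names above looks like a Unix epoch timestamp. Assuming that is the case (it is not stated in the task), a short sketch can turn each name into a creation time and an age, which helps when deciding which leaked VMs are old enough to clean up. The function names are hypothetical.

```python
from datetime import datetime, timezone


def created_at(name):
    """Parse the trailing number of "fullstackd-<epoch>" as a UTC time.

    Assumption: the suffix is seconds since the Unix epoch.
    """
    _prefix, _sep, suffix = name.rpartition("-")
    return datetime.fromtimestamp(int(suffix), tz=timezone.utc)


def age_hours(name, now):
    """Return how many hours old the instance is at time `now`."""
    return (now - created_at(name)).total_seconds() / 3600


NAMES = [
    "fullstackd-1562051634",
    "fullstackd-1562049720",
    "fullstackd-1562014758",
    "fullstackd-1561887686",
    "fullstackd-1561861047",
    "fullstackd-1561740112",
]

for name in NAMES:
    print(name, created_at(name).isoformat())
```

Under this assumption, fullstackd-1562051634 was created at 2019-07-02 07:13:54 UTC, while the oldest entry in the table dates from several days earlier.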
JHedden created T227057: cloudcontrol1003/Check for VMs leaked by the nova-fullstack test is CRITICAL.
Jul 2 2019, 7:14 AM · Cloud-VPS

Jul 1 2019

JHedden added a comment to T227041: Three small ganeti VMs to host haproxy for OpenStack endpoints.

I feel that 1 CPU might be too limiting: haproxy is multi-threaded and we'll have a number of backends defined.

Jul 1 2019, 10:35 PM · vm-requests, Operations, cloud-services-team (Kanban)
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

There's technically nothing stopping us from using 4 nodes in corosync. I'd vote for using all 4 and keeping the configuration in sync, avoiding any one off special configuration.

Jul 1 2019, 8:41 PM · cloud-services-team (Kanban)
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

This is strictly true, yes, but that's how AWS provides load balancer HA and how lots of production web services work. Right now, we are just down until we manually move the service. If the DNS TTL is set short, it would only be down for that long (we also handle most manual failovers entirely via DNS already--again this would be superior to that method). It's not 100% foolproof, but is a gigantic leap forward from where we are. Haproxy servers would just need to be manually "depooled" from DNS before maintenance.

Jul 1 2019, 8:02 PM · cloud-services-team (Kanban)
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

@Bstorm Using round robin DNS alone does not provide HA. All it does is rotate the first host address returned to the client. If the address given to the client is not reachable, the client will not ask for another address or try the other addresses in the record. This also leads to DNS caching issues when one of the addresses in the record is offline. If we were to use round robin DNS entries, we'd need to ensure that all the addresses defined are reachable at all times (using something like keepalived/corosync/pacemaker).

Jul 1 2019, 7:49 PM · cloud-services-team (Kanban)
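The failure mode described above can be illustrated with a small simulation. It models only the behaviour the comment describes, a server that rotates the answer order and a client that tries just the first address; real resolvers and clients vary. The addresses are documentation-range placeholders and the function names are hypothetical.

```python
from collections import deque


def rotate_answers(addresses, queries):
    """Yield the answer order for each query, rotating by one each time."""
    ring = deque(addresses)
    for _ in range(queries):
        yield list(ring)
        ring.rotate(-1)


def first_address_success_rate(addresses, reachable, queries):
    """Fraction of queries where the first returned address is reachable."""
    hits = sum(
        1 for answer in rotate_answers(addresses, queries) if answer[0] in reachable
    )
    return hits / queries


ADDRESSES = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
# One of the three haproxy addresses is offline.
REACHABLE = {"192.0.2.10", "192.0.2.12"}

rate = first_address_success_rate(ADDRESSES, REACHABLE, queries=6)
print(rate)
```

In this model, with one of three addresses offline, a third of the lookups hand the client an address that does not answer. That is the gap a floating address managed by keepalived or pacemaker is meant to close.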
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

What's the plan for managing the address HAproxy will use?

Jul 1 2019, 7:15 PM · cloud-services-team (Kanban)
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

@Andrew How many hosts will support the OpenStack APIs in your HA architecture design?

Jul 1 2019, 6:55 PM · cloud-services-team (Kanban)
JHedden closed T225067: labtestvirt2003: test different power management / CPU setups for faster kvm as Resolved.
Jul 1 2019, 4:00 PM · Continuous-Integration-Infrastructure, cloud-services-team (Kanban)