
JHedden (Jason Hedden)
User

User Details

User Since
May 28 2019, 6:09 PM (12 w, 2 d)
Availability
Available
LDAP User
Jhedden
MediaWiki User
JHedden (WMF) [ Global Accounts ]

Recent Activity

Fri, Aug 16

JHedden closed T227041: Three small ganeti VMs to host haproxy for OpenStack endpoints, a subtask of T223907: Set up HA endpoints for keystone, glance, nova, designate apis, as Resolved.
Fri, Aug 16, 3:04 PM · Patch-For-Review, cloud-services-team (Kanban)
JHedden closed T227041: Three small ganeti VMs to host haproxy for OpenStack endpoints as Resolved.

For this phase we're going to install haproxy directly on the OpenStack controllers, so we will not need these VMs. Thank you for all the information; it was very helpful.

Fri, Aug 16, 3:04 PM · vm-requests, Operations, cloud-services-team (Kanban)

Thu, Aug 15

JHedden added a comment to T229551: Database-reports can't see packages in its virtualenv on the grid.
Thu, Aug 15, 3:31 PM · Community-Tech (Kanban (Q1 2019-20)), Patch-For-Review, Tools, Toolforge

Tue, Aug 13

JHedden added a comment to T230442: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

This host also has a bad disk in slot number 8. T230289

Tue, Aug 13, 6:58 PM · ops-eqiad, Operations
JHedden added a comment to T230289: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only.

There are no workloads on this host now. We're good to have this replaced anytime. Thanks!

Tue, Aug 13, 5:12 PM · cloud-services-team, ops-eqiad, Operations
JHedden closed T230247: Increase VCPU quota for wikidata-query project as Resolved.
Tue, Aug 13, 5:03 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests), Discovery-Wikidata-Query-Service-Sprint, Wikidata, Wikidata-Query-Service
JHedden moved T221301: Toolschecker webservice checks get out of sync likely from timeouts from Doing to Needs discussion on the cloud-services-team (Kanban) board.
Tue, Aug 13, 1:54 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)

Fri, Aug 9

JHedden added a comment to T229657: Switchover m5 primary master: db1073 to db1133.

The plan looks good to me. In the pre-failover stage I'll be shutting down the OpenStack scheduler and designate services to ensure there are no actions in queue, then re-enabling these in the clean up steps.

Fri, Aug 9, 2:05 PM · Patch-For-Review, cloud-services-team (Kanban), wikitech.wikimedia.org, Operations, DBA

Thu, Aug 8

JHedden closed T230157: tools-sgewebgrid-lighttpd-0915 not responding as Resolved.

Only interesting things from the logs:

Thu, Aug 8, 7:58 PM · cloud-services-team
JHedden created T230157: tools-sgewebgrid-lighttpd-0915 not responding.
Thu, Aug 8, 7:24 PM · cloud-services-team
JHedden updated subscribers of T213567: Toolforge: refresh grafana dashboard.

I haven't added any new exporters yet. I think you might be referring to the changes @Bstorm made for kubelet stats in T228573.

Thu, Aug 8, 1:23 PM · cloud-services-team (Kanban), Toolforge

Wed, Aug 7

JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

Ran into a new failure scenario on gridengine. It might be a false positive, but it did cause the webservice to remain running:

queue instance "webgrid-lighttpd@tools-sgewebgrid-lighttpd-0923.tools.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=2.757500 (= 2.757500 + 0.50 * 0.000000 with nproc=4) >= 2.75
Wed, Aug 7, 7:03 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
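The arithmetic in the scheduler message above can be reproduced with a minimal Python sketch. The function and parameter names are hypothetical, and the reading of the three numbers (a reported load value, a 0.50 adjustment factor, and a 0.0 adjustment term) is taken only from the message text, not from the gridengine source.

```python
def adjusted_load(reported_load: float, adjustment_factor: float,
                  adjustment_term: float) -> float:
    """Combine the values the way the log message prints them:
    reported_load + adjustment_factor * adjustment_term."""
    return reported_load + adjustment_factor * adjustment_term


def is_overloaded(load: float, threshold: float) -> bool:
    """The queue instance is dropped when the load reaches the threshold."""
    return load >= threshold


# Values copied from the log line; the threshold is the 2.75 it reports.
load = adjusted_load(2.757500, 0.50, 0.000000)
print(load, is_overloaded(load, 2.75))
```

With these inputs the adjustment term contributes nothing, and the load sits only 0.0075 above the threshold, which fits the suspicion that this was a borderline trigger.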
JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

Fixed NGINX timeouts to match WSGI and added better status checking after issuing webservice commands.

Wed, Aug 7, 5:14 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T149589: Puppet tab in Horizon unusably slow.

Viewing the instance console log can occasionally take longer than expected. This process queries multiple APIs and communicates directly with the hypervisor supporting the VM, so there are many potential places for delay and resource contention.

Wed, Aug 7, 3:10 PM · cloud-services-team (Kanban), Patch-For-Review, Operations, Puppet, Cloud-Services
JHedden closed T230003: openstack: cleanup neutron user as Resolved.

Confirmed openstack role assignment list --names is working as expected now.

Wed, Aug 7, 2:45 PM · cloud-services-team (Kanban)
JHedden added a comment to T230003: openstack: cleanup neutron user.

Found another one too:

Wed, Aug 7, 2:43 PM · cloud-services-team (Kanban)
JHedden added a comment to T230003: openstack: cleanup neutron user.

Nice catch, thanks for the back story too.

Wed, Aug 7, 2:30 PM · cloud-services-team (Kanban)

Mon, Aug 5

JHedden added a comment to T229657: Switchover m5 primary master: db1073 to db1133.

As per the sync on the SRE meeting, @JHedden will be online from WMCS.
I will handle the announcement for wikitech, could you handle the announcement (if it is needed) for the OpenStack part of things?

Mon, Aug 5, 4:28 PM · Patch-For-Review, cloud-services-team (Kanban), wikitech.wikimedia.org, Operations, DBA
JHedden closed T229846: anticompositebot tool missing project directory as Resolved.

The anticompositebot tool directory and configuration are present and active now.

Mon, Aug 5, 4:20 PM · cloud-services-team, Toolforge
JHedden created T229846: anticompositebot tool missing project directory.
Mon, Aug 5, 4:09 PM · cloud-services-team, Toolforge
JHedden closed T229787: Toolforge: sudden issues in both gridengine and k8s webservices as Resolved.

The icinga check description was recently updated for T228878 https://gerrit.wikimedia.org/r/c/operations/puppet/+/525536 . The new name/description for this service appears to have removed the existing acks and downtime.

Mon, Aug 5, 1:29 PM · cloud-services-team (Kanban)

Fri, Jul 26

JHedden created P8811 cloudvirt1015 testing VM crash.
Fri, Jul 26, 10:00 PM
JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

The few I spot checked also lined up with timeouts in the etcd server log:

Fri, Jul 26, 9:10 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

It seems rare, but I've also noticed a few timeouts from SGE: 2019-07-26T19:29:42.700456 Timed out attempting to start webservice (15s)

Fri, Jul 26, 8:31 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
JHedden updated the task description for T225713: CPU scaling governor audit.
Fri, Jul 26, 1:42 PM · User-fgiunchedi, Operations

Thu, Jul 25

JHedden added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

Created these VMs

openstack server list --project testlabs --long -c ID -c Name -c Host | grep cv1015
| 30f17a94-252e-46d2-aa28-e6f24c9c457e | cv1015-testing03                  | cloudvirt1015   |
| d1b13075-ace4-44ba-8f26-c9c12a360184 | cv1015-testing02                  | cloudvirt1015   |
| b99a2376-1bb1-48f9-9889-00d3aedb9a43 | cv1015-testing01                  | cloudvirt1015   |
| e65ff310-f0ef-451c-956c-8d21b21cc12a | cv1015-testing04                  | cloudvirt1015   |
Thu, Jul 25, 2:42 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)
JHedden added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

Thanks @RobH. I'll spin up some stress testing VMs on that host and let them run until Andrew gets back from vacation next week.

Thu, Jul 25, 1:56 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)

Wed, Jul 24

JHedden added a comment to T224324: LB for cloudelastic.

@EBernhardson Unfortunately we don't have a good solution for this today or in the near future. We've discussed future load-balancing-as-a-service options, but this requires a lot of effort on backend upgrades, automation, and configuration.

Wed, Jul 24, 2:08 PM · Discovery-Search (Current work), Cloud-Services, Elasticsearch, Discovery

Jul 23 2019

JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

The webservice checks are getting better, but it ran into a new failure:

Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: Traceback (most recent call last):
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/bin/webservice", line 169, in <module>
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     start(job, 'Starting webservice')
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/bin/webservice", line 61, in start
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     job.request_start()
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 456, in request_start
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     pykube.Deployment(self.api, self._get_deployment()).create()
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/lib/python2.7/dist-packages/pykube/objects.py", line 76, in create
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     self.api.raise_for_status(r)
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:   File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 104, in raise_for_status
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]:     raise HTTPError(payload["message"])
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: pykube.exceptions.HTTPError: client: etcd member https://tools-k8s-etcd-01.tools.eqiad.wmflabs:2379 has no leader
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: --------------------------------------------------------------------------------
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: ERROR in toolschecker [/var/lib/toolschecker/toolschecker.py:454]:
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: webservice kubernetes: error starting
Jul 23 16:50:32 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: --------------------------------------------------------------------------------
...
Jul 23 16:53:43 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: --------------------------------------------------------------------------------
Jul 23 16:53:43 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: ERROR in toolschecker [/var/lib/toolschecker/toolschecker.py:448]:
Jul 23 16:53:43 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: webservice kubernetes: found existing webservice running
Jul 23 16:53:43 tools-checker-03 uwsgi-toolschecker_webservice_kubernetes[7908]: --------------------------------------------------------------------------------
Jul 23 2019, 5:27 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
JHedden moved T221301: Toolschecker webservice checks get out of sync likely from timeouts from Needs discussion to Doing on the cloud-services-team (Kanban) board.
Jul 23 2019, 2:57 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
JHedden claimed T221301: Toolschecker webservice checks get out of sync likely from timeouts.
Jul 23 2019, 2:57 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
JHedden moved T221301: Toolschecker webservice checks get out of sync likely from timeouts from Inbox to Needs discussion on the cloud-services-team (Kanban) board.
Jul 23 2019, 2:34 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

The current configuration is set to check every 1 minute and retry every 1 minute after a failure.

Jul 23 2019, 2:29 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T228731: https://dumps.wikimedia.org/other/pageviews/ lacks hourly pageviews since 20190722-17:00.

Maybe related are some hiera values at:

hieradata/common.yaml
# Dumps distribution server currently serving traffic over NFS to cloud vps instances
dumps_dist_active_vps: labstore1007.wikimedia.org
# Dumps distribution server currently serving web and rsync mirror traffic
# Also serves stat* hosts over nfs
dumps_dist_active_web: labstore1006.wikimedia.org
Jul 23 2019, 12:56 PM · Analytics-Kanban, Analytics, cloud-services-team, Wikimedia-Portals

Jul 22 2019

JHedden added a comment to T228573: toolforge k8s nodes oom?.

That's great! Now that we have that, this query should be helpful for future investigation: container_memory_usage_bytes{job="k8s-node",instance="tools-worker-1015.tools.eqiad.wmflabs"}

Jul 22 2019, 6:23 PM · Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T228573: toolforge k8s nodes oom?.

I'm not seeing anything strange for the prometheus-node exporter memory usage. GC and overall heap allocation history looks good on tools-worker-1015.

Jul 22 2019, 2:51 PM · Toolforge, cloud-services-team (Kanban)

Jul 19 2019

JHedden added a comment to T221301: Toolschecker webservice checks get out of sync likely from timeouts.

The recent webservice critical status was related to existing webservice instances left running. When concurrent requests from both icinga1001.wikimedia.org and icinga2001.wikimedia.org are made to the webservice endpoint they can leave the webservice instance running, causing the checks to fail going forward.

Jul 19 2019, 10:48 PM · Patch-For-Review, Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T225265: Fix labstore checks on cloudstore1008/9.

That patch ^ fixes the NRPE error CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds when the host not supporting the VIP runs showmount.

Jul 19 2019, 2:42 PM · Data-Services, cloud-services-team (Kanban)

Jul 18 2019

JHedden closed T179848: Unable to add user to group in debian stretch instance as Resolved.

I've confirmed that this process works as expected on the recent stretch image. If you're still having an issue adding users to a local group, please let us know.

Jul 18 2019, 7:50 PM · cloud-services-team (Kanban), Cloud-VPS

Jul 17 2019

JHedden added a comment to T227019: Redirect all space.wmflabs.org traffic to HTTPS.

Try reordering the rules with the HTTPS redirect rule on top. Something like:

RewriteEngine On
RewriteCond %{HTTPS} off
RewriteCond %{HTTP:X-Forwarded-Proto} !https
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
Jul 17 2019, 8:41 PM · VPS-Projects, Space (Jul-Sep-2019)
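The decision those rewrite rules encode can be sketched in Python. This is an illustration only: the function names are hypothetical, and it assumes mod_rewrite's default behaviour of combining consecutive RewriteCond lines with AND and treating each pattern as a regular expression.

```python
import re


def should_redirect(https: str, forwarded_proto: str) -> bool:
    """Return True when both conditions hold:
    %{HTTPS} matches "off" AND X-Forwarded-Proto does not match "https"."""
    https_is_off = re.search(r"off", https) is not None
    proto_not_https = re.search(r"https", forwarded_proto) is None
    return https_is_off and proto_not_https


def redirect_target(host: str, request_uri: str) -> str:
    """Build the Location for the 301, mirroring https://%{HTTP_HOST}%{REQUEST_URI}."""
    return f"https://{host}{request_uri}"


# A plain HTTP request that did not arrive through a TLS-terminating proxy.
if should_redirect("off", "http"):
    print(redirect_target("space.wmflabs.org", "/index.php"))
```

The second condition is what prevents a redirect loop when TLS is terminated at a proxy: the backend sees HTTPS as off, but the forwarded header already says https, so no redirect is issued.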

Jul 16 2019

JHedden closed T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown as Resolved.

Changes pushed and verified

Jul 16 2019, 3:46 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
JHedden closed T210995: cloudvps: rabbitmq metrics as Resolved.

I've updated the rabbitmq dashboard for the eqiad prometheus/labs datasource. https://grafana.wikimedia.org/d/000000617/cloudvps-rabbitmq

Jul 16 2019, 2:56 PM · cloud-services-team (Kanban)

Jul 15 2019

JHedden moved T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown from Important to Doing on the cloud-services-team (Kanban) board.
Jul 15 2019, 7:55 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
JHedden closed T227395: tools-worker-1022 k8s duplicate node as Resolved.
Jul 15 2019, 4:04 PM · cloud-services-team
JHedden updated subscribers of T227395: tools-worker-1022 k8s duplicate node.

@aborrero @Bstorm Is there anything else we need or want to check before deleting the bad node?

Jul 15 2019, 1:10 PM · cloud-services-team

Jul 12 2019

JHedden closed T219054: Install fish shell for Toolforge use, a subtask of T55704: Packages to be added to toollabs puppet, as Resolved.
Jul 12 2019, 9:53 PM · Cloud-Services, Tracking-Neverending, Toolforge
JHedden closed T219054: Install fish shell for Toolforge use as Resolved.
Jul 12 2019, 9:53 PM · Toolforge (Software install/update), cloud-services-team (Kanban)
JHedden added a comment to T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown.

Ran some tests with resume_guests_state_on_host_boot enabled and libvirt-guests configured to not start VMs.

Jul 12 2019, 7:56 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
JHedden claimed T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown.
Jul 12 2019, 2:41 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
JHedden added a comment to T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown.

Instead of using virsh autostart, could we let Nova resume the state of VMs after a hypervisor reboot?

Jul 12 2019, 2:25 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)

Jul 9 2019

JHedden closed T227060: nova-fullstack: alert due to leaked instances as Resolved.
Jul 9 2019, 1:27 PM · cloud-services-team (Kanban)

Jul 8 2019

JHedden added a comment to T227395: tools-worker-1022 k8s duplicate node.

Looks like this was affected by DNS testing that was happening on cloudservices1003. Based on the logs, the only way I can see the FQDN changing is with the following example.

Jul 8 2019, 5:02 PM · cloud-services-team

Jul 7 2019

JHedden created T227395: tools-worker-1022 k8s duplicate node.
Jul 7 2019, 4:48 AM · cloud-services-team

Jul 5 2019

JHedden added a comment to T223906: Active/active rabbitMQ servers on wmcs controller nodes.

This has been completed:

Jul 5 2019, 1:42 PM · cloud-services-team (Kanban)

Jul 3 2019

JHedden closed T227222: Degraded RAID on cloudelastic1003 as Resolved.

This host was rebooted for T224228. I downtimed the host and services in icinga but it looks like this slipped through.

Jul 3 2019, 7:04 PM · ops-eqiad, Operations
JHedden added a comment to T227041: Three small ganeti VMs to host haproxy for OpenStack endpoints.

Per T223907, the chosen approach for providing HA over the 3 haproxy nodes is pacemaker. Corosync is actually an implementation detail that provides cluster management services (communication, membership, and quorum), and it could easily be exchanged for heartbeat.

Jul 3 2019, 1:44 PM · vm-requests, Operations, cloud-services-team (Kanban)

Jul 2 2019

JHedden added a comment to T227060: nova-fullstack: alert due to leaked instances.

Looks like the list used for all_cloudvirts is not maintaining any order. It's updating libvirtd.conf every puppet run:

Jul 2 2019, 2:17 PM · cloud-services-team (Kanban)
JHedden added a comment to T227060: nova-fullstack: alert due to leaked instances.

With fresh eyes I found that the VMs are intermittently failing with

Jul 2 2019, 2:04 PM · cloud-services-team (Kanban)
JHedden added a comment to T227060: nova-fullstack: alert due to leaked instances.

I didn't find anything in the OpenStack or daemon logs on cloudcontrol1003 or a few cloudvirts I spot checked with fullstack instances. I'll check more in the morning.

Jul 2 2019, 7:52 AM · cloud-services-team (Kanban)
JHedden added a comment to T227060: nova-fullstack: alert due to leaked instances.

I got the page and was also working on this event, cleaning VMs that I found to be active and online. T227057 https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring#nova-fullstack

Jul 2 2019, 7:42 AM · cloud-services-team (Kanban)
JHedden added a comment to T227057: cloudcontrol1003/Check for VMs leaked by the nova-fullstack test is CRITICAL.

Fullstack VMs currently leaked

| ID                                   | Name                  | Status | Task State | Power State | Networks                                                          | Availability Zone | Host          | Properties |
+--------------------------------------+-----------------------+--------+------------+-------------+-------------------------------------------------------------------+-------------------+---------------+------------+
| e83bf2eb-0256-4ac7-8b36-e8d3c6e3c519 | fullstackd-1562051634 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.7.181                            | nova              | cloudvirt1004 |            |
| 9551a639-6d8c-4619-8c16-1f652457478f | fullstackd-1562049720 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.7.176, 172.16.7.177, 172.16.7.18 | nova              | cloudvirt1007 |            |
| 3653dacc-568a-4b9c-a126-3db668ea4972 | fullstackd-1562014758 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.7.109, 172.16.7.11               | nova              | cloudvirt1002 |            |
| 8f4a7f1e-7329-4a97-ab79-79f3955a2880 | fullstackd-1561887686 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.6.109, 172.16.6.110              | nova              | cloudvirt1005 |            |
| 21a45c21-476f-4278-b3a2-ca0d134e0fed | fullstackd-1561861047 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.5.44, 172.16.5.45                | nova              | cloudvirt1006 |            |
| 32e92dfc-290f-4c05-8df8-b43ec48033bb | fullstackd-1561740112 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.4.250, 172.16.4.251              | nova              | cloudvirt1002 |            |
Jul 2 2019, 7:18 AM · Cloud-VPS
JHedden created T227057: cloudcontrol1003/Check for VMs leaked by the nova-fullstack test is CRITICAL.
Jul 2 2019, 7:14 AM · Cloud-VPS

Jul 1 2019

JHedden added a comment to T227041: Three small ganeti VMs to host haproxy for OpenStack endpoints.

I feel that 1 CPU might be too limiting; haproxy is multi-threaded and we'll have a number of backends defined.

Jul 1 2019, 10:35 PM · vm-requests, Operations, cloud-services-team (Kanban)
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

There's technically nothing stopping us from using 4 nodes in corosync. I'd vote for using all 4 and keeping the configuration in sync, avoiding any one-off special configuration.

Jul 1 2019, 8:41 PM · Patch-For-Review, cloud-services-team (Kanban)
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

This is strictly true, yes, but that's how AWS provides load balancer HA and how lots of production web services work. Right now, we are just down until we manually move the service. If the DNS TTL is set short, it would only be down for that long (we also handle most manual failovers entirely via DNS already--again this would be superior to that method). It's not 100% foolproof, but is a gigantic leap forward from where we are. Haproxy servers would just need to be manually "depooled" from DNS before maintenance.

Jul 1 2019, 8:02 PM · Patch-For-Review, cloud-services-team (Kanban)
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

@Bstorm Using round robin DNS alone does not provide HA. All it does is rotate the first host address returned to the client. If the address given to the client is not reachable, the client will not ask for another address or try the other addresses in the record. This also leads to DNS caching issues when one of the addresses in the record is offline. If we were to use round robin DNS entries, we'd need to ensure that all the addresses defined are reachable at all times (using something like keepalived/corosync/pacemaker).

Jul 1 2019, 7:49 PM · Patch-For-Review, cloud-services-team (Kanban)
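The failure mode described in the comment above can be illustrated with a small simulation. This is a sketch under stated assumptions: the addresses are made up, the helper names are hypothetical, and the client is modelled exactly as the comment describes it (it uses only the first address in each answer and never retries another).

```python
from collections import deque


def resolve_round_robin(record: deque) -> list:
    """Return the current address order, then rotate the record by one,
    so each answer starts with a different address."""
    answer = list(record)
    record.rotate(-1)
    return answer


def first_address_only_client(answer: list, reachable: set) -> bool:
    """Model a client that connects to the first address and gives up on failure."""
    return answer[0] in reachable


record = deque(["10.0.0.1", "10.0.0.2", "10.0.0.3"])  # hypothetical haproxy addresses
reachable = {"10.0.0.1", "10.0.0.3"}                   # 10.0.0.2 is offline

results = [first_address_only_client(resolve_round_robin(record), reachable)
           for _ in range(6)]
failures = results.count(False)
print(f"{failures} of {len(results)} requests failed")
```

Under this model one address in three being offline fails one request in three until the record is changed, which is why the comment pairs round robin DNS with something that keeps every listed address reachable.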
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

What's the plan for managing the address HAproxy will use?

Jul 1 2019, 7:15 PM · Patch-For-Review, cloud-services-team (Kanban)
JHedden added a comment to T223907: Set up HA endpoints for keystone, glance, nova, designate apis.

@Andrew How many hosts will support the OpenStack APIs in your HA architecture design?

Jul 1 2019, 6:55 PM · Patch-For-Review, cloud-services-team (Kanban)
JHedden closed T225067: labtestvirt2003: test different power management / CPU setups for faster kvm as Resolved.
Jul 1 2019, 4:00 PM · Continuous-Integration-Infrastructure, cloud-services-team (Kanban)
JHedden closed T225067: labtestvirt2003: test different power management / CPU setups for faster kvm, a subtask of T223971: Old cloudvirt (with Intel Xeon) are twice slower than new ones (Intel Sky Lake), as Resolved.
Jul 1 2019, 4:00 PM · cloud-services-team (Kanban), Continuous-Integration-Infrastructure
JHedden added a comment to T225067: labtestvirt2003: test different power management / CPU setups for faster kvm.

Closing this task. The default (dynamic) power regulator settings are not impacting the virtual machine performance.

Jul 1 2019, 4:00 PM · Continuous-Integration-Infrastructure, cloud-services-team (Kanban)

Jun 28 2019

JHedden added a comment to T213413: Adapt Toolschecker to work with Prometheus.

Fixing these critical alerts in icinga to reduce noise and give us a better direction for migrating over to prometheus.

Jun 28 2019, 9:02 PM · Toolforge, cloud-services-team (Kanban)
JHedden added a comment to T223906: Active/active rabbitMQ servers on wmcs controller nodes.

Both rabbitMQ servers should be disk nodes. Currently cloudcontrol1004 is a RAM node.
{nodes,[{disc,[rabbit@cloudcontrol1003]},{ram,[rabbit@cloudcontrol1004]

Jun 28 2019, 8:15 PM · cloud-services-team (Kanban)
JHedden added a comment to T224688: Outstanding icinga critical on cloudcontrol-dev hosts.
  • understand how notification works (I don't fully understand them yet!), by understanding the relationship between profile::base::notifications, icinga, and defined checks for a given server/service.

I think the best approach to handling this with Icinga is to downtime hosts and/or disable active checks for the services we're working on. This will help reduce noise for both notifications and dashboard visibility.

  • figure out a proper value for the hiera keys, a value that results in servers producing alerts in IRC/email only and not pages (since this is the -dev cluster)

Notifications are disabled, but cloudservices2002-dev.wikimedia.org is configured to alert via IRC only. It is not configured for SMS paging.

  • check which alerts are true/false positives, and fix them, i.e., get the openstack deployment in shape by fixing any remaining missing configuration bits (remember, this is a new deployment!)

These alerts were all relevant and now fixed.

Jun 28 2019, 4:24 PM · cloud-services-team (Kanban), Cloud-Services
JHedden added a comment to T226731: Implement nova host-aggregates.

I'm certainly in favor of replacing custom code with upstream code! In particular it seems like we'll need this in order to make live-migration work sensibly between different CPU-typed cloudvirts, right?

Jun 28 2019, 3:57 PM · Cloud-VPS, cloud-services-team

Jun 27 2019

JHedden created T226731: Implement nova host-aggregates.
Jun 27 2019, 4:04 PM · Cloud-VPS, cloud-services-team

Jun 26 2019

JHedden added a comment to T226463: Connecting to Wiki Replicas from whgi.wikidumpparse.eqiad.wmflabs intermittently gives SSL error.

Hi @notconfusing, I think this might be related to the mysql-community-client 5.7.25-1debian9 package you have installed on whgi. The mysql community builds are bundled with yaSSL[0] and the wiki replicas are using 10.1.39-MariaDB, which is using OpenSSL.

Jun 26 2019, 10:18 PM · VPS-Projects
JHedden added a comment to T226647: nova-fullstack crashed from a keystone timeout.

This is related to the task T226632. Nova-api was stopped and started for the wmf_scheduler_hosts_pool config update.

Jun 26 2019, 3:45 PM · cloud-services-team (Kanban)

Jun 25 2019

JHedden closed T225823: Request creation of asyncwiki VPS project as Resolved.
Jun 25 2019, 5:17 PM · Cloud-VPS (Project-requests)
JHedden added a comment to T225823: Request creation of asyncwiki VPS project.

All set, your asyncwiki VPS project has been created.

Jun 25 2019, 5:17 PM · Cloud-VPS (Project-requests)
JHedden added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.

I've confirmed the new 3.4.2-1+deb8u4 python3.4 package is working and reverted the patch on tools-puppetmaster-01.tools.eqiad.wmflabs.

Jun 25 2019, 2:48 PM · cloud-services-team (Kanban), Toolforge
JHedden added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.

New package is available python3.4/oldstable 3.4.2-1+deb8u4

Jun 25 2019, 2:37 PM · cloud-services-team (Kanban), Toolforge
JHedden added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.

Confirmed it's due to the python3.4 upgrade. After applying this patch to tools-puppetmaster-01.tools.eqiad.wmflabs puppet agent is working again.

Jun 25 2019, 2:17 PM · cloud-services-team (Kanban), Toolforge
JHedden added a comment to T226480: toolforge: puppet issue probably related to puppet-enc.

Looks like this is related to the python3.4 upgrade. I think it's blocking puppet-enc from running correctly.

Jun 25 2019, 1:55 PM · cloud-services-team (Kanban), Toolforge

Jun 24 2019

JHedden added a comment to T224188: rack/setup/install (3) new osd ceph nodes.

There's a lot of good information in this task. I'm still catching up, but I wanted to note that it's important to consider the replication factor when designing the network architecture for Ceph. By default Ceph uses synchronous replicated pools, which ensures that data is physically copied to multiple OSDs before sending the acknowledgment to the client. This leads to another benefit of segmenting the public and cluster network traffic. For every single write request on the public network, there are 2 replicated writes on the cluster network.

Jun 24 2019, 7:35 PM · ops-eqiad, Operations, cloud-services-team (Kanban), Cloud-Services
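The traffic ratio in the comment above can be worked through in a few lines of Python. This is a rough sketch: the function name is hypothetical, and the pool size of 3 is an assumption inferred from the "2 replicated writes" figure rather than something stated in the task.

```python
def cluster_replica_writes(client_writes: int, pool_size: int) -> int:
    """Writes carried on the cluster network for a replicated pool.

    The primary OSD receives each client write over the public network and
    forwards it to the remaining (pool_size - 1) replicas over the cluster network.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return client_writes * (pool_size - 1)


POOL_SIZE = 3  # assumption: three copies of every object

for writes in (1, 1000):
    print(writes, "public writes ->",
          cluster_replica_writes(writes, POOL_SIZE), "cluster writes")
```

Because the client acknowledgment waits for the replicas, the cluster network carries twice the write volume of the public network at this pool size, which is the sizing argument for separating the two.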
JHedden added a comment to T226239: access.log is not being written for wsexport.

When switching log files the lighttpd service needs to be restarted to release the existing file handle. I restarted the service with webservice restart and verified log messages are now showing up in access.log.

Jun 24 2019, 3:24 PM · E-Book-Export-Reliability, Community-Tech, Cloud-Services

Jun 20 2019

JHedden closed T101631: rev_len should be available also for deleted revisions in database replicas as Resolved.
Jun 20 2019, 3:26 PM · cloud-services-team (Kanban), Data-Services, Cloud-VPS
JHedden closed T101631: rev_len should be available also for deleted revisions in database replicas, a subtask of T150767: Wikireplica service for tools and labs - issues and missing available views (tracking), as Resolved.
Jun 20 2019, 3:26 PM · Data-Services, Tracking-Neverending, DBA

Jun 18 2019

JHedden closed T224192: Onboard jhedden to Wikimedia Foundation as SRE in Cloud Services as Resolved.
Jun 18 2019, 4:07 PM · Patch-For-Review, cloud-services-team (Kanban)
JHedden closed T224981: rabbitmq: connectivity issues between cloudservices1004 and rabbitmq as Resolved.
Jun 18 2019, 4:06 PM · cloud-services-team (Kanban)
JHedden updated subscribers of T101631: rev_len should be available also for deleted revisions in database replicas.

@Anomie asked a good question on the patch review. "Should you do the same for ar_len, here and in the other archive views?" referring to: if(ar_deleted&1,null,ar_len) as ar_len

Jun 18 2019, 4:00 PM · cloud-services-team (Kanban), Data-Services, Cloud-VPS
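The SQL expression quoted above, if(ar_deleted&1,null,ar_len), is a bitmask test. A minimal Python equivalent is sketched below; the constant and function names are hypothetical, and treating bit 0 of ar_deleted as the "text hidden" flag is an assumption based on the expression itself.

```python
from typing import Optional

DELETED_TEXT = 1  # assumption: bit 0 of ar_deleted marks hidden revision text


def visible_ar_len(ar_deleted: int, ar_len: Optional[int]) -> Optional[int]:
    """Return None (SQL NULL) when the text bit is set, otherwise the stored length."""
    return None if ar_deleted & DELETED_TEXT else ar_len


# Only rows with bit 0 set are masked; other bits leave the length visible.
for flags in (0, 1, 2, 3):
    print(flags, visible_ar_len(flags, 120))
```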
JHedden added a comment to T224627: Requesting access to ops group in admin for jeh.

Yes, this is resolved. The invalid jeh address defined in the exim alias was fixed.

Jun 18 2019, 12:52 PM · SRE-Access-Requests, Operations
JHedden updated the task description for T224192: Onboard jhedden to Wikimedia Foundation as SRE in Cloud Services.
Jun 18 2019, 12:27 PM · Patch-For-Review, cloud-services-team (Kanban)
JHedden added a comment to T224627: Requesting access to ops group in admin for jeh.

I fixed the incorrect email alias. Thanks for letting me know.

Jun 18 2019, 12:25 PM · SRE-Access-Requests, Operations

Jun 17 2019

JHedden added a comment to T225067: labtestvirt2003: test different power management / CPU setups for faster kvm.

Results of different power regulator settings on labtestvirt2003.codfw.wmnet.

Jun 17 2019, 7:28 PM · Continuous-Integration-Infrastructure, cloud-services-team (Kanban)
JHedden updated the task description for T224192: Onboard jhedden to Wikimedia Foundation as SRE in Cloud Services.
Jun 17 2019, 4:00 PM · Patch-For-Review, cloud-services-team (Kanban)
JHedden added a comment to T225932: support ssl for openstack REST endpoints.

Standard practice is to use the same ports, as there's only one endpoint entry per service + region + interface.

Jun 17 2019, 2:26 PM · cloud-services-team (Kanban)
JHedden updated the task description for T224192: Onboard jhedden to Wikimedia Foundation as SRE in Cloud Services.
Jun 17 2019, 1:40 PM · Patch-For-Review, cloud-services-team (Kanban)
JHedden updated the task description for T224192: Onboard jhedden to Wikimedia Foundation as SRE in Cloud Services.
Jun 17 2019, 12:49 PM · Patch-For-Review, cloud-services-team (Kanban)

Jun 6 2019

JHedden updated the task description for T224192: Onboard jhedden to Wikimedia Foundation as SRE in Cloud Services.
Jun 6 2019, 7:24 PM · Patch-For-Review, cloud-services-team (Kanban)