
dcaro (David Caro)
SRE & amateur yak shaver

User Details

User Since
Nov 2 2020, 11:59 AM (46 w, 2 d)
Availability
Available
IRC Nick
dcaro
LDAP User
David Caro
MediaWiki User
DCaro (WMF) [ Global Accounts ]

Recent Activity

Today

dcaro updated subscribers of T291557: Puppet agent failure detected on instance toolsbeta-puppetmaster-04 in project toolsbeta.

This seems related to the NFS tests that @Andrew and @Bstorm have been working on:

dcaro@toolsbeta-puppetmaster-04:/$ sudo -i
root@toolsbeta-puppetmaster-04:~# puppet agent --test
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for toolsbeta-puppetmaster-04.toolsbeta.eqiad.wmflabs
Info: Applying configuration version '(ec85155414) root - contacts: don't fail if _role is not defined on labs realm'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Error: /Stage[main]/Puppetmaster::Gitpuppet/File[/home/gitpuppet/.ssh]: Could not evaluate: Stale file handle @ rb_file_s_lstat - /home/gitpuppet/.ssh
Notice: /Stage[main]/Puppetmaster::Gitpuppet/File[/home/gitpuppet/.ssh/id_rsa]: Dependency File[/home/gitpuppet/.ssh] has failures: true
Warning: /Stage[main]/Puppetmaster::Gitpuppet/File[/home/gitpuppet/.ssh/id_rsa]: Skipping because of failed dependencies
Warning: /Stage[main]/Puppetmaster::Gitpuppet/File[/home/gitpuppet/.ssh/gitpuppet-private-repo]: Skipping because of failed dependencies
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 7.10 seconds
Wed, Sep 22, 10:20 AM · cloud-services-team (Kanban)
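
To pull the failing resources out of a run like the one above without reading every line, a small parser can help. This is a minimal sketch, assuming the `Error: <resource>: Could not evaluate: <reason>` and `Warning: <resource>: Skipping because of failed dependencies` line shapes seen in that output; the function name `summarize_puppet_run` is hypothetical and not part of any Puppet tooling.

```python
import re

# Line shapes taken from the puppet agent output above (assumed stable).
ERROR_RE = re.compile(r"^Error: (?P<resource>/.+?): Could not evaluate: (?P<reason>.+)$")
SKIP_RE = re.compile(r"^Warning: (?P<resource>/.+): Skipping because of failed dependencies$")


def summarize_puppet_run(output):
    """Return ({failed resource: reason}, [resources skipped due to dependencies])."""
    failed, skipped = {}, []
    for line in output.splitlines():
        match = ERROR_RE.match(line)
        if match:
            failed[match["resource"]] = match["reason"]
            continue
        match = SKIP_RE.match(line)
        if match:
            skipped.append(match["resource"])
    return failed, skipped


sample = """\
Error: /Stage[main]/Puppetmaster::Gitpuppet/File[/home/gitpuppet/.ssh]: Could not evaluate: Stale file handle @ rb_file_s_lstat - /home/gitpuppet/.ssh
Warning: /Stage[main]/Puppetmaster::Gitpuppet/File[/home/gitpuppet/.ssh/id_rsa]: Skipping because of failed dependencies
Warning: /Stage[main]/Puppetmaster::Gitpuppet/File[/home/gitpuppet/.ssh/gitpuppet-private-repo]: Skipping because of failed dependencies
Notice: Applied catalog in 7.10 seconds
"""

failed, skipped = summarize_puppet_run(sample)
for resource, reason in failed.items():
    print(f"FAILED  {resource}: {reason}")
for resource in skipped:
    print(f"SKIPPED {resource}")
```

Here the single stale-handle failure on `/home/gitpuppet/.ssh` accounts for both skipped resources, which is what points at the NFS mount rather than at the catalog.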
dcaro triaged T291557: Puppet agent failure detected on instance toolsbeta-puppetmaster-04 in project toolsbeta as Medium priority.
Wed, Sep 22, 9:56 AM · cloud-services-team (Kanban)
dcaro added a comment to T291546: Puppet agent failure detected on instance tools-k8s-worker-53 in project tools.

It seems that something happened at ~2:36:

2021-09-22T11:27:52,285520659+02:00.png (977×2 px, 169 KB)

Wed, Sep 22, 9:42 AM · cloud-services-team (Kanban)
dcaro added a comment to T291546: Puppet agent failure detected on instance tools-k8s-worker-53 in project tools.

Also from the email:

Date: Wed, 22 Sep 2021 08:15:07 +0000
From: root <root@tools.wmflabs.org>
To: dcaro@wikimedia.org
Subject: [Cloud VPS alert][tools] Puppet failure on tools-k8s-worker-53.tools.eqiad.wmflabs (172.16.1.128)
Wed, Sep 22, 9:06 AM · cloud-services-team (Kanban)
dcaro triaged T291546: Puppet agent failure detected on instance tools-k8s-worker-53 in project tools as High priority.
Wed, Sep 22, 8:56 AM · cloud-services-team (Kanban)
dcaro updated the task description for T291544: Transition Toolforge Build Service.
Wed, Sep 22, 8:38 AM · Toolforge, cloud-services-team (Kanban)
dcaro created T291544: Transition Toolforge Build Service.
Wed, Sep 22, 8:25 AM · Toolforge, cloud-services-team (Kanban)

Yesterday

dcaro closed T290059: [quarry] Consolidate the redis configuration, stop using the db server for redis as Invalid.

This was intended and temporary.

Tue, Sep 21, 12:53 PM · cloud-services-team (Kanban)
dcaro closed T290307: Icinga/Check for VMs leaked by the nova-fullstack test as Invalid.

This is too old to be actioned.

Tue, Sep 21, 12:52 PM · cloud-services-team (Kanban)
dcaro added a comment to T290970: File System corruption on cloud-vps instances.

:facepalm: for myself xd

Tue, Sep 21, 10:22 AM · cloud-services-team (Kanban)
dcaro added a comment to T290970: File System corruption on cloud-vps instances.

Finding out where the VM backups happen:

11:39 AM ~/Work/wikimedia/operations-puppet  (production|…1)
dcaro@vulcanus$ git grep -B 1 virt_ceph_and_backy
manifests/site.pp-node /^cloudvirt102[1-8]\.eqiad\.wmnet$/ {
manifests/site.pp:    role(wmcs::openstack::eqiad1::virt_ceph_and_backy)
...
Tue, Sep 21, 9:53 AM · cloud-services-team (Kanban)
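
To see which hosts the first node pattern in that `git grep` output covers, the regex can be tried against candidate hostnames. A minimal sketch, assuming only the one pattern shown above (the grep output is truncated, so other node blocks may also carry the role); the candidate list is made up for illustration.

```python
import re

# Pattern copied from the manifests/site.pp line in the git grep output.
NODE_RE = re.compile(r"^cloudvirt102[1-8]\.eqiad\.wmnet$")


def matching_hosts(candidates):
    """Return the candidate hostnames that the node regex matches."""
    return [host for host in candidates if NODE_RE.match(host)]


# Hypothetical candidates: cloudvirt1020 .. cloudvirt1029.
candidates = [f"cloudvirt10{n}.eqiad.wmnet" for n in range(20, 30)]
backup_hosts = matching_hosts(candidates)
print("\n".join(backup_hosts))
```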
dcaro added a comment to T291446: cloudcontrol1004 galera crash.

That's a bit less than 4 days, I guess the disk is being used heavily.

Tue, Sep 21, 9:52 AM · Cloud-VPS, cloud-services-team (Kanban)
dcaro added a comment to T290970: File System corruption on cloud-vps instances.

Expanding on @Andrew's log, I created a dummy script (F34649566) to parse the output and look for patterns. This is the output, using the last 100 lines for each host: F34649565

Tue, Sep 21, 9:32 AM · cloud-services-team (Kanban)

Fri, Sep 3

dcaro updated the task description for T290318: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet.
Fri, Sep 3, 2:25 PM · SRE, ops-eqiad, DC-Ops
dcaro added a comment to T290307: Icinga/Check for VMs leaked by the nova-fullstack test.

The instances seem to be working now. I tried to ssh to the latest stuck instance but got an auth error, so I will try to pull
the syslog logs.

Fri, Sep 3, 11:28 AM · cloud-services-team (Kanban)
dcaro added a parent task for T290321: HTTP 502 "Next hop connection failed" when trying to upload a file: T290318: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet.
Fri, Sep 3, 11:15 AM · Phabricator
dcaro added a subtask for T290318: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet: T290321: HTTP 502 "Next hop connection failed" when trying to upload a file.
Fri, Sep 3, 11:15 AM · SRE, ops-eqiad, DC-Ops
dcaro created T290321: HTTP 502 "Next hop connection failed" when trying to upload a file.
Fri, Sep 3, 11:14 AM · Phabricator
dcaro created T290318: hw troubleshooting: megaraid reset due to fatal error for labstore1005.eqiad.wmnet.
Fri, Sep 3, 11:00 AM · SRE, ops-eqiad, DC-Ops
dcaro triaged T290307: Icinga/Check for VMs leaked by the nova-fullstack test as High priority.
Fri, Sep 3, 9:13 AM · cloud-services-team (Kanban)
dcaro triaged T290306: The labs-monitoring dashboard does not exist anymore, replace/remove the alert pointing to it as High priority.
Fri, Sep 3, 9:07 AM · cloud-services-team (Kanban)
dcaro added a comment to T290294: Icinga/Persistent high iowait - labstore1004.

On labstore1005 side:

Sep 03 02:29:42 labstore1005 kernel: drbd tools: meta connection shut down by peer.
Sep 03 02:29:42 labstore1005 kernel: drbd tools: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 03 02:29:52 labstore1005 kernel: drbd misc: meta connection shut down by peer.
Sep 03 02:29:52 labstore1005 kernel: drbd misc: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Fri, Sep 3, 5:31 AM · cloud-services-team (Kanban)
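
The DRBD kernel lines above pack three state transitions into one message. A minimal sketch for splitting them out, assuming the `field( Old -> New )` shape seen in those lines; `parse_transitions` is a hypothetical helper name.

```python
import re

# Matches e.g. "conn( Connected -> NetworkFailure )".
TRANSITION_RE = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")


def parse_transitions(line):
    """Map each DRBD state field in a kernel log line to its (old, new) pair."""
    return {field: (old, new) for field, old, new in TRANSITION_RE.findall(line)}


line = (
    "Sep 03 02:29:42 labstore1005 kernel: drbd tools: peer( Primary -> Unknown ) "
    "conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )"
)
transitions = parse_transitions(line)
for field, (old, new) in transitions.items():
    print(f"{field}: {old} -> {new}")
```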
dcaro added a comment to T290294: Icinga/Persistent high iowait - labstore1004.

From the graphs (https://grafana-rw.wikimedia.org/d/000000568/labstore1004-1005-1006-1007?orgId=1) it seems that labstore1005, the pair for drbd4:

root@labstore1004:~# grep -B 1 drbd4 /etc/drbd.d/*
/etc/drbd.d/tools.res-  on labstore1004 {
/etc/drbd.d/tools.res:    device    /dev/drbd4;
--
/etc/drbd.d/tools.res-  on labstore1005 {
/etc/drbd.d/tools.res:    device    /dev/drbd4;
Fri, Sep 3, 5:27 AM · cloud-services-team (Kanban)
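
The same device-to-host lookup that the `grep -B 1` above does by eye can be sketched in code. This assumes a resource file where each `on <host> {` block contains a plain `device <path>;` line, as in the two matches shown; the sample text is a reduced, hypothetical reconstruction and not the real `/etc/drbd.d/tools.res`.

```python
import re
from collections import defaultdict

ON_RE = re.compile(r"^\s*on\s+(\S+)\s*\{")
DEVICE_RE = re.compile(r"^\s*device\s+(\S+?);")


def hosts_by_device(text):
    """Return {device path: [hosts]} for a DRBD-style resource definition."""
    mapping = defaultdict(list)
    host = None
    for line in text.splitlines():
        match = ON_RE.match(line)
        if match:
            host = match.group(1)  # remember which host block we are in
            continue
        match = DEVICE_RE.match(line)
        if match and host:
            mapping[match.group(1)].append(host)
    return dict(mapping)


# Hypothetical, trimmed-down resource file built from the grep matches above.
sample = """\
resource tools {
  on labstore1004 {
    device    /dev/drbd4;
  }
  on labstore1005 {
    device    /dev/drbd4;
  }
}
"""
pairs = hosts_by_device(sample)
print(pairs)
```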
dcaro added a comment to T290294: Icinga/Persistent high iowait - labstore1004.

DRBD seems up to date:

root@labstore1004:~# cat /proc/drbd
version: 8.4.7 (api:1/proto:86-101)
srcversion: F7D2F0C9036CD0E796D5958
Fri, Sep 3, 5:12 AM · cloud-services-team (Kanban)
dcaro triaged T290294: Icinga/Persistent high iowait - labstore1004 as High priority.
Fri, Sep 3, 5:07 AM · cloud-services-team (Kanban)

Wed, Sep 1

dcaro closed T285858: Install the new ceph osd machines cloudcephosd10(1[6-9]|20) using cookbooks as Resolved.

Docs updated

Wed, Sep 1, 3:22 PM · Patch-For-Review, cloud-services-team (Kanban)
dcaro closed T289670: Leftover backup config on cloudstore1009? /dev/srv/ceph_backups as Resolved.
Wed, Sep 1, 9:46 AM · cloud-services-team (Kanban)

Tue, Aug 31

dcaro added a comment to T288528: quarry to python 3.7.

Now that the buster instances are using the trove DB, we will need to update the database to the latest right before moving to the new hosts:

root@quarry-db-01:~# mysqldump quarry | mysql -h vuylalotiaz.svc.trove.eqiad1.wikimedia.cloud -p quarry -u quarry
Tue, Aug 31, 10:56 AM · Quarry (FY2021/2022-Q1)
dcaro added a comment to T290059: [quarry] Consolidate the redis configuration, stop using the db server for redis.

This might be intended to separate the old (using the db server as redis) and the buster deployment (using the redis VM for it instead).

Tue, Aug 31, 10:23 AM · cloud-services-team (Kanban)
dcaro updated subscribers of T289568: [quarry] Move quarry database to Trove, update backup system.

Moved the new (buster) hosts to use the trove DB instance. I had to remove a no-longer-used column from the
query_revisions table (dbs was replaced by query_database at some point, but it's still left over on the old database).

Tue, Aug 31, 10:22 AM · Patch-For-Review, cloud-services-team (Kanban)
dcaro added a comment to T289850: Setup Fix Suggester Bot on Toolforge.

OK, I removed the ssh-agent implementation in my app and used a private key, so with phpseclib (pure PHP implementation of SSH2) I don't need openssh-client in the base image.

Now I'm stuck here when I try to add a gerrit stream event into a Redis stream:

ERR unknown command 'XADD'
Tue, Aug 31, 8:43 AM · Fix-Suggester-Bot, cloud-services-team (Kanban), Toolforge
dcaro triaged T290059: [quarry] Consolidate the redis configuration, stop using the db server for redis as High priority.
Tue, Aug 31, 8:26 AM · cloud-services-team (Kanban)

Mon, Aug 30

dcaro added a comment to T289700: wmcs.ceph.osd.bootstrap_and_add failing to connect to rados.

Hmm... yep, I think I know what's going on: when creating the keyring we don't put a dependency between the exec that
populates the file and the file resource itself. In this case, the file gets created empty first, and when
the exec comes in, it sees the file there already and just skips.
Will send a patch.

Mon, Aug 30, 9:09 AM · cloud-services-team (Kanban)
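
The ordering problem described in the keyring comment above can be illustrated with a toy model. This is only a sketch, not the actual Puppet code: it assumes the exec is guarded by a "skip if the file already exists" condition, and both `file_resource` and `exec_resource` are hypothetical stand-ins written in Python.

```python
import os
import tempfile


def file_resource(path):
    """Stand-in for a file resource: make sure the file exists, content unmanaged."""
    if not os.path.exists(path):
        open(path, "w").close()  # creates an empty file


def exec_resource(path):
    """Stand-in for the exec: populate the keyring unless the file already exists."""
    if os.path.exists(path):
        return False  # guard triggers, nothing is written
    with open(path, "w") as handle:
        handle.write("[client.bootstrap-osd]\n\tkey = PLACEHOLDER\n")
    return True


def apply(resources):
    """Apply the resources in the given order and return the keyring size in bytes."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "ceph.keyring")
        for resource in resources:
            resource(path)
        return os.path.getsize(path)


# Without an explicit dependency either order is possible.
file_first_size = apply([file_resource, exec_resource])
exec_first_size = apply([exec_resource, file_resource])
print(f"file first: {file_first_size} bytes, exec first: {exec_first_size} bytes")
```

When the file resource happens to run first, the keyring stays at zero bytes, which matches the empty keyring seen on the host; forcing the exec to run before the file resource avoids it in this model.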
dcaro added a comment to T289700: wmcs.ceph.osd.bootstrap_and_add failing to connect to rados.

Indeed there's something going on with puppet and the creation of the keyring.

Mon, Aug 30, 9:02 AM · cloud-services-team (Kanban)

Thu, Aug 26

Restricted Application added a project to T273713: Packets discarded on dumpsdata1001: Infrastructure-Foundations.
Thu, Aug 26, 3:16 PM · Infrastructure-Foundations, Dumps-Generation, Datasets-General-or-Unknown, netops, SRE
dcaro updated the task description for T289661: Audit usages or the realm variable with a view to drop it.
Thu, Aug 26, 8:20 AM · Patch-For-Review, User-jbond, Infrastructure-Foundations, Puppet, Cloud-VPS, cloud-services-team (Kanban)
dcaro claimed T289661: Audit usages or the realm variable with a view to drop it.
Thu, Aug 26, 8:14 AM · Patch-For-Review, User-jbond, Infrastructure-Foundations, Puppet, Cloud-VPS, cloud-services-team (Kanban)
dcaro reassigned T289663: Icinga/Check for VMs leaked by the nova-fullstack test from dcaro to Andrew.

@Andrew handing it over to you, as it's not clear to me if you tried the other patches or not (the ones about checking different puppet state file paths), feel free to abandon them if they are not needed and close the task.

Thu, Aug 26, 7:37 AM · Patch-For-Review, cloud-services-team (Kanban)
dcaro claimed T289670: Leftover backup config on cloudstore1009? /dev/srv/ceph_backups.

From the meeting, this is a leftover config, can go away.

Thu, Aug 26, 7:28 AM · cloud-services-team (Kanban)

Wed, Aug 25

dcaro added a comment to T289700: wmcs.ceph.osd.bootstrap_and_add failing to connect to rados.

For some reason the keyring is empty:

root@cloudcephosd1018:~# wc /var/lib/ceph/bootstrap-osd/ceph.keyring
0 0 0 /var/lib/ceph/bootstrap-osd/ceph.keyring
Wed, Aug 25, 4:19 PM · cloud-services-team (Kanban)
dcaro added a comment to T289663: Icinga/Check for VMs leaked by the nova-fullstack test.

I started a new VM with the same image, and while booting (before the first puppet run), I connected to the virsh
console and was able to print the puppet config, showing that the state dir is not the one we are looking at
(/var/lib/puppet/state):

agent_catalog_run_lockfile = /var/cache/puppet/state/agent_catalog_run.lock
agent_disabled_lockfile = /var/cache/puppet/state/agent_disabled.lock
classfile = /var/cache/puppet/state/classes.txt
graphdir = /var/cache/puppet/state/graphs
lastrunfile = /var/cache/puppet/state/last_run_summary.yaml
lastrunreport = /var/cache/puppet/state/last_run_report.yaml
resourcefile = /var/cache/puppet/state/resources.txt
statedir = /var/cache/puppet/state
statefile = /var/cache/puppet/state/state.yaml
statettl = 2764800
Wed, Aug 25, 12:39 PM · Patch-For-Review, cloud-services-team (Kanban)
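
A quick way to compare the settings printed above against the path a check expects is to parse the `key = value` lines. A minimal sketch, assuming that output shape; `parse_puppet_config` is a hypothetical helper and the sample is a subset of the lines above.

```python
def parse_puppet_config(output):
    """Parse 'key = value' lines into a dict, ignoring anything else."""
    settings = {}
    for line in output.splitlines():
        key, separator, value = line.partition(" = ")
        if separator:
            settings[key.strip()] = value.strip()
    return settings


# The state dir the check was looking at, per the comment above.
EXPECTED_STATEDIR = "/var/lib/puppet/state"

sample = """\
lastrunfile = /var/cache/puppet/state/last_run_summary.yaml
statedir = /var/cache/puppet/state
statettl = 2764800
"""
settings = parse_puppet_config(sample)
statedir_matches = settings["statedir"] == EXPECTED_STATEDIR
print(f"statedir={settings['statedir']} matches expected: {statedir_matches}")
```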
dcaro closed T289655: [Cloud VPS alert][paws] Puppet failure on paws-k8s-haproxy-1.paws.eqiad.wmflabs (172.16.0.191) as Resolved.
Wed, Aug 25, 10:01 AM · cloud-services-team (Kanban)
dcaro added a comment to T289655: [Cloud VPS alert][paws] Puppet failure on paws-k8s-haproxy-1.paws.eqiad.wmflabs (172.16.0.191).

Remounted the failing NFS mount, and now everything looks OK: there are no more NFS timeout messages in dmesg, and the
timesyncd service starts correctly. Closing.

Wed, Aug 25, 10:01 AM · cloud-services-team (Kanban)
dcaro moved T289670: Leftover backup config on cloudstore1009? /dev/srv/ceph_backups from Inbox to Needs discussion on the cloud-services-team (Kanban) board.
Wed, Aug 25, 9:53 AM · cloud-services-team (Kanban)
dcaro reassigned T285858: Install the new ceph osd machines cloudcephosd10(1[6-9]|20) using cookbooks from dcaro to Andrew.

Re-assign to me once the new OSD is deployed to follow up with the docs.

Wed, Aug 25, 9:53 AM · Patch-For-Review, cloud-services-team (Kanban)
dcaro created T289670: Leftover backup config on cloudstore1009? /dev/srv/ceph_backups.
Wed, Aug 25, 9:49 AM · cloud-services-team (Kanban)
dcaro added a comment to T289655: [Cloud VPS alert][paws] Puppet failure on paws-k8s-haproxy-1.paws.eqiad.wmflabs (172.16.0.191).

There were many errors (from way before the first ntp error) about the NFS mount of cloudstore1009.wikimedia.org timing out.
I manually ran umount -l /mnt/nfs/secondary-cloudstore1009.wikimedia.org-scratch, and then it started working
continuously, so the issue seems related to that (though why timesyncd is bound to some NFS mount still escapes me).

Wed, Aug 25, 9:27 AM · cloud-services-team (Kanban)
dcaro triaged T289663: Icinga/Check for VMs leaked by the nova-fullstack test as High priority.
Wed, Aug 25, 9:08 AM · Patch-For-Review, cloud-services-team (Kanban)
dcaro added a comment to T289655: [Cloud VPS alert][paws] Puppet failure on paws-k8s-haproxy-1.paws.eqiad.wmflabs (172.16.0.191).

When running manually with debug logs it was passing:

root@paws-k8s-haproxy-1:~# SYSTEMD_LOG_LEVEL=debug /lib/systemd/systemd-timesyncd
...
Connecting to time server 172.16.2.81:123 (ntp-01.cloudinfra.wmflabs.org).
Sent NTP request to 172.16.2.81:123 (ntp-01.cloudinfra.wmflabs.org).
NTP response:
  leap         : 0
  version      : 4
  mode         : 4
  stratum      : 2
  precision    : 0.000000 sec (-24)
  root distance: 0.033295 sec
  reference    : n/a
  origin       : 1629881115.524
  receive      : 1629881115.523
  transmit     : 1629881115.523
  dest         : 1629881115.525
  offset       : -0.002 sec
  delay        : +0.001 sec
  packet count : 1
  jitter       : 0.000
  poll interval: 64
  adjust (slew): -0.002 sec
  status       : 8193 sync
  time now     : 1629881115.525
  constant     : 2
  offset       : -0.002 sec
  freq offset  : +2108154 (+32 ppm)
interval/delta/delay/jitter/drift 64s/-0.002s/0.001s/0.000s/+32ppm
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/timesync1 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=3 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
Synchronized to time server for the first time 172.16.2.81:123 (ntp-01.cloudinfra.wmflabs.org)
...
Wed, Aug 25, 9:03 AM · cloud-services-team (Kanban)
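
The offset and delay in the NTP response above can be cross-checked from the four timestamps with the standard NTP on-wire formulas. A minimal sketch, assuming `origin`, `receive`, `transmit` and `dest` correspond to the usual T1..T4; since the log prints them rounded to milliseconds, the result only approximates the logged `-0.002` and `+0.001`.

```python
def ntp_offset_and_delay(origin, receive, transmit, dest):
    """Clock offset and round-trip delay from the four NTP timestamps (seconds)."""
    offset = ((receive - origin) + (transmit - dest)) / 2
    delay = (dest - origin) - (transmit - receive)
    return offset, delay


# Values copied from the timesyncd debug output above.
offset, delay = ntp_offset_and_delay(
    origin=1629881115.524,
    receive=1629881115.523,
    transmit=1629881115.523,
    dest=1629881115.525,
)
print(f"offset {offset:+.4f} s, delay {delay:+.4f} s")
```

Both values are around a millisecond, so when the daemon does get to talk to the server the sync itself is healthy.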
dcaro added a comment to T289655: [Cloud VPS alert][paws] Puppet failure on paws-k8s-haproxy-1.paws.eqiad.wmflabs (172.16.0.191).

It started failing this morning at ~7:30:

Aug 25 07:24:05 paws-k8s-haproxy-1 systemd[1]: Stopping Network Time Synchronization...
### It worked here
Aug 25 07:24:05 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Succeeded.
Aug 25 07:24:05 paws-k8s-haproxy-1 systemd[1]: Stopped Network Time Synchronization.
Aug 25 07:24:05 paws-k8s-haproxy-1 systemd[1]: Starting Network Time Synchronization...
### First failure
Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Start operation timed out. Terminating.
Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Main process exited, code=killed, status=15/TERM
Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Failed with result 'timeout'.
Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: Failed to start Network Time Synchronization.
Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 1.
Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: Stopped Network Time Synchronization.
Wed, Aug 25, 8:31 AM · cloud-services-team (Kanban)
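
The gap between the "Starting" line and the first "Start operation timed out" line above can be computed from the log prefixes. A minimal sketch, assuming the `Mon DD HH:MM:SS` prefix shown, an English locale for month names, and the year 2021 (the log lines do not carry a year).

```python
from datetime import datetime


def stamp(line, year=2021):
    """Parse the 'Aug 25 07:24:05' prefix of a log line; the year is an assumption."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


start = (
    "Aug 25 07:24:05 paws-k8s-haproxy-1 systemd[1]: "
    "Starting Network Time Synchronization..."
)
timeout = (
    "Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: "
    "systemd-timesyncd.service: Start operation timed out. Terminating."
)
elapsed = (stamp(timeout) - stamp(start)).total_seconds()
print(f"start attempt ran for {elapsed:.0f} s before systemd gave up")
```

That interval lines up with systemd's default start timeout of 90 seconds, assuming the unit does not override `TimeoutStartSec`, so the service was hanging during startup rather than crashing.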
dcaro added a comment to T289655: [Cloud VPS alert][paws] Puppet failure on paws-k8s-haproxy-1.paws.eqiad.wmflabs (172.16.0.191).

It's timing out trying to sync the time, it's happening on several instances, looking:

root@paws-k8s-haproxy-1:~# run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for paws-k8s-haproxy-1.paws.eqiad.wmflabs
Info: Applying configuration version '(e9aa6938a9) Manuel Arostegui - install_server: Reimage db1160 to Buster'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Wed, Aug 25, 8:24 AM · cloud-services-team (Kanban)
dcaro triaged T289655: [Cloud VPS alert][paws] Puppet failure on paws-k8s-haproxy-1.paws.eqiad.wmflabs (172.16.0.191) as High priority.
Wed, Aug 25, 8:18 AM · cloud-services-team (Kanban)

Tue, Aug 24

dcaro triaged T289569: [quarry] Fancy up the CI pipeline in Jenkins as High priority.
Tue, Aug 24, 11:46 AM · cloud-services-team (Kanban)
dcaro added a comment to T289568: [quarry] Move quarry database to Trove, update backup system.

From https://etherpad.wikimedia.org/p/productionize-quarry

Tue, Aug 24, 11:45 AM · Patch-For-Review, cloud-services-team (Kanban)
dcaro triaged T289568: [quarry] Move quarry database to Trove, update backup system as High priority.
Tue, Aug 24, 11:44 AM · Patch-For-Review, cloud-services-team (Kanban)
dcaro added a comment to T289502: Figure out how to delete Glance images.

There are two tasks related to this:
2021-08-11 00:00:37 unassigned :T273723(MOp)(Inbox)[backups] Periodically cleanup non-handled backups
2021-08-10 19:02:47 unassigned :T273720(MOp)(Inbox)[ceph][rbd] Periodically cleanup dangling snapshots

Tue, Aug 24, 10:27 AM · cloud-services-team (Kanban)
dcaro created T289563: Error parsing "/var/lib/prometheus/node.d/node_cloudvirt_libvirt_stats.prom".
Tue, Aug 24, 9:53 AM · cloud-services-team (Kanban)

Mon, Aug 23

dcaro assigned T289415: Need temporary Cinder volume increase to restart MariaDB to nskaggs.

+1 from me

Mon, Aug 23, 2:23 PM · Cloud-VPS (Quota-requests), cloud-services-team (Kanban)

Aug 23 2021

dcaro closed T287471: Validate and autocomplete database names in the database input field as Resolved.

The validation added is kinda loose: the data for the available databases gets populated from the queries that have already been run, but it can't differentiate between those that have a valid DB and those that don't. There's some filtering out of obviously wrong DB names, but that's all.

Aug 23 2021, 11:04 AM · Patch-For-Review, cloud-services-team (Kanban), Quarry (FY2021/2022-Q1)

Aug 20 2021

dcaro set the image for Quarry (FY2021/2022-Q1) to F34611161: profile.
Aug 20 2021, 1:05 PM
dcaro closed T289337: [Cloud VPS alert][admin] Puppet failure on buildvm-f575acf4-c566-4258-9a5f-20b9ec80ccad.admin.eqiad1.wikimedia.cloud (172.16.4.109) as Resolved.
Aug 20 2021, 9:10 AM · cloud-services-team (Kanban)
dcaro added a comment to T289337: [Cloud VPS alert][admin] Puppet failure on buildvm-f575acf4-c566-4258-9a5f-20b9ec80ccad.admin.eqiad1.wikimedia.cloud (172.16.4.109).

This VM is used to create new images, my guess is that when booting up, the puppet state file has the timestamp when
the base image was created, thus failing the check.

Aug 20 2021, 9:10 AM · cloud-services-team (Kanban)
dcaro triaged T289337: [Cloud VPS alert][admin] Puppet failure on buildvm-f575acf4-c566-4258-9a5f-20b9ec80ccad.admin.eqiad1.wikimedia.cloud (172.16.4.109) as High priority.
Aug 20 2021, 8:49 AM · cloud-services-team (Kanban)
dcaro added a comment to T288982: Productionize quarry a bit.

When, where, who and how should we decide what goes here? (has that been done already? might have missed it.)

Aug 20 2021, 8:47 AM · Quarry (FY2021/2022-Q1), cloud-services-team (Kanban), Epic
dcaro added a comment to T289282: Shall we drop the backy2 backup jobs.

Though I agree that the current setup is not as useful as it could be, I do think that we should find a replacement first. This might require some refining; here are some random ideas, feel free to ignore them:

Aug 20 2021, 8:45 AM · cloud-services-team (Kanban)
dcaro added a comment to T275024: Toolforge: Update go runtime.

Might be a good candidate for extra docs? (as in "How to build your go code" kinda thing).

Aug 20 2021, 7:20 AM · Toolforge (Software install/update), cloud-services-team (Kanban)
dcaro added a comment to T289313: Rollback DNS record creation on dynamicproxy registration failure.

It would be really useful if we were able to log an error with an id (request id?) somewhere, regardless of whether it should try/retry the action or not. Now that we have central logging it should be doable, making it more trackable and debuggable, and unblocking some extra observability and possible future automation.

Aug 20 2021, 7:16 AM · cloud-services-team (Kanban), Horizon

Aug 19 2021

dcaro closed T286695: Request creation of wikisp VPS project as Resolved.

Done, the new project name is "wikisp".

Aug 19 2021, 1:41 PM · cloud-services-team (Kanban), Wikimedia-Small-Projects-User-Group, Cloud-VPS (Project-requests)
dcaro closed T288863: Request increased quota for wikilearn Cloud VPS project as Resolved.

Change done:

dcaro@cloudcontrol1003:~$ sudo wmcs-openstack quota set wikilearn --ram $((24*1024)) --cores 18
Aug 19 2021, 12:46 PM · Cloud-VPS (Quota-requests)
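
The `$((24*1024))` in the quota command above is just a unit conversion. A tiny sketch of the same arithmetic, assuming the `--ram` quota is expressed in MiB and the intent was 24 GiB; `gib_to_mib` is a hypothetical helper.

```python
def gib_to_mib(gib):
    """Convert GiB to MiB (1 GiB = 1024 MiB)."""
    return gib * 1024


ram_mib = gib_to_mib(24)
print(f"--ram {ram_mib} --cores 18")
```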
dcaro closed T289239: exim paniclog on tools-sgecron-01.tools.eqiad.wmflabs has non-zero size as Resolved.
Aug 19 2021, 9:48 AM · cloud-services-team (Kanban)
dcaro added a comment to T289239: exim paniclog on tools-sgecron-01.tools.eqiad.wmflabs has non-zero size.

There are no more errors currently happening:

root@tools-sgecron-01:~# cat /var/log/exim4/paniclog
root@tools-sgecron-01:~#
Aug 19 2021, 9:48 AM · cloud-services-team (Kanban)
dcaro triaged T289239: exim paniclog on tools-sgecron-01.tools.eqiad.wmflabs has non-zero size as High priority.
Aug 19 2021, 9:43 AM · cloud-services-team (Kanban)
dcaro added a comment to T288863: Request increased quota for wikilearn Cloud VPS project.

Sorry for derailing the ticket, but it would be quite interesting to hear what your use case for the S3-like storage is :), that way we can take it into consideration at this early stage.

Aug 19 2021, 8:09 AM · Cloud-VPS (Quota-requests)
dcaro added a comment to T288863: Request increased quota for wikilearn Cloud VPS project.

About the S3-like service, currently we don't have any. We have it in mind, but it might take some time until it's implemented. Depending on your use case you might be able to get away with cinder volumes.

Aug 19 2021, 7:46 AM · Cloud-VPS (Quota-requests)
dcaro added a comment to T288863: Request increased quota for wikilearn Cloud VPS project.

@Asaf Thanks for the reply, we can consider higher quotas yes, but I'll need some specific numbers :)
Can you fill out the following template?

Aug 19 2021, 7:21 AM · Cloud-VPS (Quota-requests)

Aug 18 2021

dcaro claimed T286695: Request creation of wikisp VPS project.
Aug 18 2021, 4:03 PM · cloud-services-team (Kanban), Wikimedia-Small-Projects-User-Group, Cloud-VPS (Project-requests)
dcaro added a comment to T289159: cloudvirt1028 overheating issues.

There were some logs that stopped on the 8th of August and have not happened again. Each time all CPUs complained, and
then there's an OK message in the same second too:

root@cloudvirt1028:~# dmesg -T | grep 'Package temperature above' | cut -d] -f1 | sort | uniq | sort
[Mon Aug  9 02:37:37 2021
[Mon Aug  9 02:43:37 2021
[Sat Aug  7 18:52:00 2021
[Sat Aug  7 18:57:58 2021
[Sat Aug  7 19:15:10 2021
[Sat Aug  7 19:21:33 2021
[Sat Aug  7 20:01:09 2021
[Sat Aug  7 20:26:29 2021
[Sat Aug  7 20:55:56 2021
[Sat Aug  7 21:20:30 2021
[Sat Aug  7 21:32:59 2021
[Sat Aug  7 21:58:12 2021
[Sat Aug  7 22:03:12 2021
[Sat Aug  7 22:12:26 2021
[Sat Aug  7 22:23:02 2021
[Sat Aug  7 22:31:46 2021
[Sat Aug  7 22:58:59 2021
[Sat Aug  7 23:04:11 2021
[Sat Aug  7 23:14:59 2021
[Sat Aug  7 23:30:30 2021
[Sat Aug  7 23:45:17 2021
[Sat Aug  7 23:50:29 2021
[Sun Aug  8 03:21:42 2021
[Sun Aug  8 05:42:01 2021
[Sun Aug  8 06:40:13 2021
[Sun Aug  8 07:04:24 2021
[Sun Aug  8 07:36:09 2021
[Sun Aug  8 07:42:47 2021
[Sun Aug  8 07:51:42 2021
[Sun Aug  8 07:57:30 2021
[Sun Aug  8 08:03:39 2021
[Sun Aug  8 08:12:17 2021
[Sun Aug  8 08:30:35 2021
[Sun Aug  8 08:38:50 2021
[Sun Aug  8 08:59:13 2021
[Sun Aug  8 09:04:35 2021
[Sun Aug  8 09:12:13 2021
[Sun Aug  8 09:29:10 2021
[Sun Aug  8 09:34:53 2021
[Sun Aug  8 09:40:42 2021
[Sun Aug  8 09:50:15 2021
[Sun Aug  8 09:55:16 2021
[Sun Aug  8 10:00:16 2021
[Sun Aug  8 10:12:21 2021
[Sun Aug  8 10:21:29 2021
[Sun Aug  8 10:44:51 2021
[Sun Aug  8 11:25:05 2021
[Sun Aug  8 12:04:14 2021
[Sun Aug  8 12:56:53 2021
[Sun Aug  8 13:15:09 2021
[Sun Aug  8 13:20:31 2021
[Sun Aug  8 13:32:23 2021
[Sun Aug  8 13:37:30 2021
[Sun Aug  8 13:56:31 2021
[Sun Aug  8 14:02:34 2021
[Sun Aug  8 14:23:33 2021
[Sun Aug  8 14:32:23 2021
[Sun Aug  8 14:39:36 2021
[Sun Aug  8 14:44:46 2021
[Sun Aug  8 14:54:17 2021
[Sun Aug  8 15:55:52 2021
[Sun Aug  8 16:01:01 2021
[Sun Aug  8 16:25:46 2021
[Sun Aug  8 16:31:09 2021
[Sun Aug  8 16:37:15 2021
[Sun Aug  8 16:43:19 2021
[Sun Aug  8 17:00:11 2021
[Sun Aug  8 17:06:26 2021
[Sun Aug  8 17:57:00 2021
[Sun Aug  8 18:02:20 2021
[Sun Aug  8 18:07:20 2021
[Sun Aug  8 18:19:12 2021
[Sun Aug  8 18:25:52 2021
[Sun Aug  8 18:37:15 2021
[Sun Aug  8 18:53:55 2021
[Sun Aug  8 19:00:07 2021
[Sun Aug  8 19:13:09 2021
[Sun Aug  8 19:19:20 2021
[Sun Aug  8 19:25:24 2021
[Sun Aug  8 19:31:59 2021
[Sun Aug  8 19:49:41 2021
[Sun Aug  8 19:54:41 2021
[Sun Aug  8 20:07:34 2021
[Sun Aug  8 20:32:30 2021
[Sun Aug  8 20:45:16 2021
[Sun Aug  8 21:01:03 2021
[Sun Aug  8 22:01:13 2021
Aug 18 2021, 2:32 PM · cloud-services-team (Kanban)
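
Note that `sort | uniq | sort` in the cloudvirt1028 listing above orders the timestamps as text, so the `Mon Aug  9` entries are printed first even though they are the most recent. A minimal sketch for ordering such lines chronologically, assuming the `dmesg -T` prefix format shown and an English locale for day and month names; the sample is a handful of lines from that output.

```python
from datetime import datetime


def parse_dmesg_stamp(line):
    """Parse a 'dmesg -T' prefix such as '[Sat Aug  7 18:52:00 2021'."""
    return datetime.strptime(line.strip().lstrip("["), "%a %b %d %H:%M:%S %Y")


def chronological(lines):
    """Return the lines sorted by their parsed timestamp instead of as text."""
    return sorted(lines, key=parse_dmesg_stamp)


sample = [
    "[Mon Aug  9 02:37:37 2021",
    "[Mon Aug  9 02:43:37 2021",
    "[Sat Aug  7 18:52:00 2021",
    "[Sun Aug  8 22:01:13 2021",
    "[Sun Aug  8 03:21:42 2021",
]
ordered = chronological(sample)
for line in ordered:
    print(line)
```

Sorted this way, the first event in the sample is the Saturday evening one and the last is the Monday 02:43 one.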
dcaro closed T289158: summary: WARNING: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. as Resolved.
Aug 18 2021, 2:27 PM · cloud-services-team (Kanban)
dcaro added a comment to T289158: summary: WARNING: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle..

Ran puppet manually without issues, the alert stopped triggering:

root@cloudvirt1028:~# run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for cloudvirt1028.eqiad.wmnet
Info: Applying configuration version '(cfb285998d) Jelto - hieradata: cleanup old canary api server'
Notice: Applied catalog in 19.05 seconds
Aug 18 2021, 2:27 PM · cloud-services-team (Kanban)
dcaro triaged T289159: cloudvirt1028 overheating issues as Medium priority.
Aug 18 2021, 2:23 PM · cloud-services-team (Kanban)
dcaro triaged T289158: summary: WARNING: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. as High priority.
Aug 18 2021, 2:21 PM · cloud-services-team (Kanban)
dcaro updated subscribers of T288962: Cannot create web proxy because of Duplicate RecordSet.

@Andrew I'm seeing the entry in the DNS and in the designate logs, but openstack does not seem to have it in its database. The docs point to a 'designate' cli that seems not to be available anymore, any guidance on how to continue? (I can doodle with the DB, but before messing with the internals of anything I'd like your input). Thanks!

Aug 18 2021, 2:04 PM · cloud-services-team (Kanban)
dcaro added a comment to T288863: Request increased quota for wikilearn Cloud VPS project.

@Asaf Hi, I'll need some clarifications if you may :)

Aug 18 2021, 1:45 PM · Cloud-VPS (Quota-requests)
dcaro added a comment to T284471: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet.

There's a new switch on D5, we should be able to start racking these :)

Aug 18 2021, 1:27 PM · SRE, ops-eqiad, DC-Ops
dcaro removed a watcher for Spicerack: dcaro.
Aug 18 2021, 9:41 AM

Aug 17 2021

dcaro added a comment to T288976: Add black formatting to quarry linter.

About the 80 char lines, that's just a config value that can be changed, it's been a while since I cared what that number is xd

Aug 17 2021, 2:41 PM · Quarry
dcaro added a comment to T288976: Add black formatting to quarry linter.

We can copy the scripts from spicerack: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/tox.ini#33
That will allow formatting it too (and using isort also), and it's already set up in a way that works on CI.

It formats it in place after submission? Call me old, but I don't like that.

No, it has a 'check' mode, that is what runs after submission (on ci), but also has a 'format' mode that will format it the way the CI wants (that you have to manually run if you want to).

Aug 17 2021, 2:40 PM · Quarry
dcaro added a comment to T288962: Cannot create web proxy because of Duplicate RecordSet.

I can't see the recordset on the horizon ui or in the cli:

root@cloudcontrol1003:~# wmcs-openstack recordset list --os-project-id=cloudinfra wmcloud.org. | grep wikicommunityhealth
| cbcf4738-33b6-424a-8343-4128db8d744d | wikicommunityhealth-dev.wmcloud.org.                   | A     | 185.15.56.49                                                                              | ACTIVE | NONE   |

(that's the dev one, not the one being added).
looking

Aug 17 2021, 8:30 AM · cloud-services-team (Kanban)
dcaro added a comment to T288976: Add black formatting to quarry linter.

We can copy the scripts from spicerack: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/spicerack/+/refs/heads/master/tox.ini#33
That will allow formatting it too (and using isort also), and it's already set up in a way that works on CI.

Aug 17 2021, 7:59 AM · Quarry

Aug 16 2021

dcaro triaged T288955: Rename DutchTom to DutchTina on wikitech.wikimedia.org as Medium priority.
Aug 16 2021, 2:56 PM · wikitech.wikimedia.org
dcaro triaged T288962: Cannot create web proxy because of Duplicate RecordSet as Medium priority.
Aug 16 2021, 2:56 PM · cloud-services-team (Kanban)
dcaro closed T288959: Project tools instance tools-sgeexec-0947 is down as Resolved.
Aug 16 2021, 2:55 PM · cloud-services-team (Kanban)
dcaro added a comment to T288959: Project tools instance tools-sgeexec-0947 is down.

The server is back up and running, will reopen and investigate deeper if this happens again.

Aug 16 2021, 2:55 PM · cloud-services-team (Kanban)
dcaro added a comment to T288962: Cannot create web proxy because of Duplicate RecordSet.

The dns entry is there:

04:33 PM ~/Work/wikimedia/analytics_quarry_web  (master|✔)
dcaro@vulcanus$ dig +short wikicommunityhealth.wmcloud.org
185.15.56.49
Aug 16 2021, 2:41 PM · cloud-services-team (Kanban)
dcaro moved T288962: Cannot create web proxy because of Duplicate RecordSet from Inbox to Clinic Duty on the cloud-services-team (Kanban) board.
Aug 16 2021, 2:39 PM · cloud-services-team (Kanban)
dcaro added a project to T288962: Cannot create web proxy because of Duplicate RecordSet: cloud-services-team (Kanban).
Aug 16 2021, 2:38 PM · cloud-services-team (Kanban)
dcaro claimed T288962: Cannot create web proxy because of Duplicate RecordSet.
Aug 16 2021, 2:38 PM · cloud-services-team (Kanban)
dcaro added a comment to T288959: Project tools instance tools-sgeexec-0947 is down.

From openstack logs:

[6822729.159131] Out of memory: Kill process 9980 (update_items_fr) score 190 or sacrifice child
[6822729.161914] Killed process 9980 (update_items_fr) total-vm:6671716kB, anon-rss:6229280kB, file-rss:0kB, shmem-rss:0kB
[7035198.109082] Out of memory: Kill process 3146 (update_items_fr) score 188 or sacrifice child
[7035198.111435] Killed process 3146 (update_items_fr) total-vm:6612468kB, anon-rss:6169216kB, file-rss:0kB, shmem-rss:0kB
[7038690.572639] Out of memory: Kill process 16532 (update_items_fr) score 189 or sacrifice child
[7038690.575322] Killed process 16532 (update_items_fr) total-vm:6614376kB, anon-rss:6172660kB, file-rss:0kB, shmem-rss:0kB
[7072114.491442] Out of memory: Kill process 26759 (update_items_fr) score 184 or sacrifice child
[7072114.493856] Killed process 26759 (update_items_fr) total-vm:6481244kB, anon-rss:6036696kB, file-rss:0kB, shmem-rss:0kB
[9051844.357016] Out of memory: Kill process 6299 (perl) score 212 or sacrifice child
[9051844.361744] Killed process 6299 (perl) total-vm:8471636kB, anon-rss:6923376kB, file-rss:0kB, shmem-rss:0kB
[9152003.281682] Out of memory: Kill process 20733 (perl) score 213 or sacrifice child
[9152003.284239] Killed process 20733 (perl) total-vm:8471688kB, anon-rss:6973748kB, file-rss:0kB, shmem-rss:0kB
Aug 16 2021, 2:20 PM · cloud-services-team (Kanban)
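
To see at a glance which processes the OOM killer hit in a log like the one above, the "Killed process" lines can be aggregated per process name. A minimal sketch, assuming the kernel message format shown; `summarize_oom` is a hypothetical helper and the sample is three lines from that log.

```python
import re
from collections import defaultdict

KILLED_RE = re.compile(
    r"Killed process (\d+) \(([^)]+)\) total-vm:(\d+)kB, anon-rss:(\d+)kB"
)


def summarize_oom(log):
    """Return {process name: {'kills': count, 'max_anon_rss_kb': largest RSS seen}}."""
    stats = defaultdict(lambda: {"kills": 0, "max_anon_rss_kb": 0})
    for _pid, name, _total_vm, anon_rss in KILLED_RE.findall(log):
        entry = stats[name]
        entry["kills"] += 1
        entry["max_anon_rss_kb"] = max(entry["max_anon_rss_kb"], int(anon_rss))
    return dict(stats)


sample = """\
[6822729.161914] Killed process 9980 (update_items_fr) total-vm:6671716kB, anon-rss:6229280kB, file-rss:0kB, shmem-rss:0kB
[9051844.361744] Killed process 6299 (perl) total-vm:8471636kB, anon-rss:6923376kB, file-rss:0kB, shmem-rss:0kB
[9152003.284239] Killed process 20733 (perl) total-vm:8471688kB, anon-rss:6973748kB, file-rss:0kB, shmem-rss:0kB
"""
oom_stats = summarize_oom(sample)
for name, entry in oom_stats.items():
    gib = entry["max_anon_rss_kb"] / (1024 * 1024)
    print(f"{name}: {entry['kills']} kill(s), peak anon-rss {gib:.1f} GiB")
```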
dcaro triaged T288959: Project tools instance tools-sgeexec-0947 is down as High priority.
Aug 16 2021, 2:18 PM · cloud-services-team (Kanban)
dcaro claimed T288955: Rename DutchTom to DutchTina on wikitech.wikimedia.org.
Aug 16 2021, 1:56 PM · wikitech.wikimedia.org