Vgutierrez (Valentín Gutiérrez)
Traffic Security Engineer

User Details

User Since
Feb 12 2018, 9:51 AM (52 w, 6 d)
Availability
Available
IRC Nick
vgutierrez
LDAP User
Vgutierrez
MediaWiki User
Unknown

Recent Activity

Wed, Feb 13

Vgutierrez updated the task description for T207389: Rename the Certcentral project.
Wed, Feb 13, 5:55 PM · Patch-For-Review, Acme-chief
Vgutierrez added a comment to T207389: Rename the Certcentral project.

pinkunicorn certificate successfully issued from acmechief1001:

Feb 13 15:53:08 acmechief1001 acme-chief-backend[10861]: Number of certificates per status: Counter({'VALID': 22, 'INITIAL': 2})
Feb 13 15:53:08 acmechief1001 acme-chief-backend[10861]: Creating initial self-signed certificate for pinkunicorn / ec-prime256v1
Feb 13 15:53:08 acmechief1001 acme-chief-backend[10861]: Creating initial self-signed certificate for pinkunicorn / rsa-2048
Feb 13 15:53:10 acmechief1001 acme-chief-backend[10861]: Handling new certificate event for pinkunicorn / ec-prime256v1
Feb 13 15:53:12 acmechief1001 acme-chief-backend[10861]: Triggering DNS zone update...
Feb 13 15:53:12 acmechief1001 acme-chief-backend[10861]: Running subprocess ['/usr/local/bin/acme-chief-gdnsd-sync.py', '--remote-servers', 'authdns1001.wikimedia.org', 'authdns2001.wikimedia.org', 'multatuli.wikimedia.org', '--', '_acme-challenge.pinkunicorn.wikimedia.org', 'YJJ16UymyHiq85pnvjDEk5MhcvzBAB06_F1kYAOYnEA']
Feb 13 15:53:14 acmechief1001 acme-chief-backend[10861]: Handling pushed CSR event for pinkunicorn / ec-prime256v1
Feb 13 15:53:14 acmechief1001 acme-chief-backend[10861]: Handling validated challenges event for pinkunicorn / ec-prime256v1
Feb 13 15:53:14 acmechief1001 acme-chief-backend[10861]: Handling pushed challenges event for pinkunicorn / ec-prime256v1
Feb 13 15:53:16 acmechief1001 acme-chief-backend[10861]: Handling order finalized event for pinkunicorn / ec-prime256v1
Feb 13 15:53:17 acmechief1001 acme-chief-backend[10861]: Pushing the new certificate for pinkunicorn / ec-prime256v1
Feb 13 15:53:17 acmechief1001 acme-chief-backend[10861]: Handling new certificate event for pinkunicorn / rsa-2048
Feb 13 15:53:18 acmechief1001 acme-chief-backend[10861]: Skipping challenge validation for certificate pinkunicorn / rsa-2048
Feb 13 15:53:23 acmechief1001 acme-chief-backend[10861]: Handling pushed challenges event for pinkunicorn / rsa-2048
Feb 13 15:53:24 acmechief1001 acme-chief-backend[10861]: Handling order finalized event for pinkunicorn / rsa-2048
Feb 13 15:53:25 acmechief1001 acme-chief-backend[10861]: Pushing the new certificate for pinkunicorn / rsa-2048
Wed, Feb 13, 3:55 PM · Patch-For-Review, Acme-chief
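
The log above shows a DNS-01 validation: a TXT value is synced to the authoritative DNS servers under _acme-challenge.pinkunicorn.wikimedia.org. As a rough sketch of how such a value is derived under RFC 8555 (this is not acme-chief's own code; the function names, token and thumbprint below are hypothetical placeholders):

```python
import base64
import hashlib


def dns01_record_name(domain: str) -> str:
    """Return the TXT record name queried during DNS-01 validation."""
    # A wildcard identifier is validated at its base domain.
    if domain.startswith("*."):
        domain = domain[2:]
    return "_acme-challenge." + domain.rstrip(".")


def dns01_txt_value(token: str, account_thumbprint: str) -> str:
    """Return base64url(SHA-256(key authorization)) without padding."""
    # The key authorization joins the challenge token and the account
    # key thumbprint with a single dot.
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# Hypothetical inputs, for illustration only.
print(dns01_record_name("pinkunicorn.wikimedia.org"))
print(dns01_txt_value("example-token", "example-thumbprint"))
```

A SHA-256 digest is 32 bytes, so the unpadded base64url value is always 43 characters long, which matches the length of the value passed to the sync script in the log.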
Vgutierrez updated the task description for T207389: Rename the Certcentral project.
Wed, Feb 13, 3:54 PM · Patch-For-Review, Acme-chief
Vgutierrez updated the task description for T207389: Rename the Certcentral project.
Wed, Feb 13, 3:37 PM · Patch-For-Review, Acme-chief
Vgutierrez updated the task description for T207389: Rename the Certcentral project.
Wed, Feb 13, 3:37 PM · Patch-For-Review, Acme-chief
Vgutierrez committed rOSCC682b57bf098c: debian: Add release 0.10 to changelog (authored by Vgutierrez).
debian: Add release 0.10 to changelog
Wed, Feb 13, 6:31 AM
Vgutierrez committed rOSCCd258d2af3401: Release 0.10 (authored by Vgutierrez).
Release 0.10
Wed, Feb 13, 6:26 AM
Vgutierrez committed rOSCCdc72efd02056: acme-chief: Bump to buster (authored by Vgutierrez).
acme-chief: Bump to buster
Wed, Feb 13, 6:26 AM
Vgutierrez committed rOSCC0beaaef42e1b: Release 0.10 (authored by Vgutierrez).
Release 0.10
Wed, Feb 13, 6:26 AM

Tue, Feb 12

Vgutierrez committed rOSCC4a29ed88674d: acme-chief: Bump to buster (authored by Vgutierrez).
acme-chief: Bump to buster
Tue, Feb 12, 5:00 PM
Vgutierrez triaged T215925: Upgrade acme-chief to run in debian buster as Normal priority.
Tue, Feb 12, 4:53 PM · Patch-For-Review, Operations, Traffic, Acme-chief
Vgutierrez created T215925: Upgrade acme-chief to run in debian buster.
Tue, Feb 12, 4:53 PM · Patch-For-Review, Operations, Traffic, Acme-chief
Vgutierrez updated the task description for T207389: Rename the Certcentral project.
Tue, Feb 12, 1:40 PM · Patch-For-Review, Acme-chief
Vgutierrez committed rOSCC724ef162bc07: debian: rename certcentral to acme-chief (authored by Vgutierrez).
debian: rename certcentral to acme-chief
Tue, Feb 12, 10:26 AM
Vgutierrez committed rOSCCe5b41ec8c032: Release 0.9 This release includes the following changes: * Implement… (authored by Vgutierrez).
Release 0.9 This release includes the following changes: * Implement…
Tue, Feb 12, 10:26 AM
Vgutierrez updated the task description for T207389: Rename the Certcentral project.
Tue, Feb 12, 10:20 AM · Patch-For-Review, Acme-chief
Vgutierrez updated the task description for T207389: Rename the Certcentral project.
Tue, Feb 12, 10:19 AM · Patch-For-Review, Acme-chief
Vgutierrez committed rOSCCdee531a87473: Release 0.9 This release includes the following changes: * Implement… (authored by Vgutierrez).
Release 0.9 This release includes the following changes: * Implement…
Tue, Feb 12, 10:13 AM
Vgutierrez committed rOSCC18812bc9f0e0: debian: rename certcentral to acme-chief (authored by Vgutierrez).
debian: rename certcentral to acme-chief
Tue, Feb 12, 8:38 AM
Vgutierrez committed rOSCC8cb711b24be7: debian: rename certcentral to acme-chief (authored by Vgutierrez).
debian: rename certcentral to acme-chief
Tue, Feb 12, 8:28 AM
Vgutierrez committed rOSCC52552fc07a0b: Rename certcentral to acme-chief (authored by Vgutierrez).
Rename certcentral to acme-chief
Tue, Feb 12, 8:18 AM
Vgutierrez committed rOSCC5ec2b0e98c6f: certcentral: Implement staging time (authored by Vgutierrez).
certcentral: Implement staging time
Tue, Feb 12, 8:18 AM

Mon, Feb 11

Vgutierrez updated the task description for T207389: Rename the Certcentral project.
Mon, Feb 11, 5:40 PM · Patch-For-Review, Acme-chief
Vgutierrez committed rOSCC5f5c1e7605ff: Edit Project Config (authored by Vgutierrez).
Edit Project Config
Mon, Feb 11, 4:52 PM
Vgutierrez committed rOSCCec03a61b4ba2: Rename certcentral to acme-chief (authored by Vgutierrez).
Rename certcentral to acme-chief
Mon, Feb 11, 4:37 PM
Vgutierrez updated the task description for T207389: Rename the Certcentral project.
Mon, Feb 11, 4:20 PM · Patch-For-Review, Acme-chief
Vgutierrez updated the task description for T207389: Rename the Certcentral project.
Mon, Feb 11, 4:10 PM · Patch-For-Review, Acme-chief
Vgutierrez moved T213417: lvs2002: raid battery failure from Triage to Hardware on the Traffic board.
Mon, Feb 11, 2:57 PM · Operations, ops-codfw, Traffic
Vgutierrez reopened T207389: Rename the Certcentral project as "Open".
Mon, Feb 11, 12:46 PM · Patch-For-Review, Acme-chief
Vgutierrez closed T215783: certcentral fails to renew certificates as Resolved.

As soon as https://gerrit.wikimedia.org/r/489164 was merged, the certificates were renewed as expected:

Mon, Feb 11, 12:46 PM · Patch-For-Review, Operations, Traffic, Acme-chief
Vgutierrez closed T207389: Rename the Certcentral project as Resolved.
Mon, Feb 11, 12:32 PM · Patch-For-Review, Acme-chief
Vgutierrez triaged T215783: certcentral fails to renew certificates as High priority.
Mon, Feb 11, 12:19 PM · Patch-For-Review, Operations, Traffic, Acme-chief
Vgutierrez created T215783: certcentral fails to renew certificates.
Mon, Feb 11, 12:19 PM · Patch-For-Review, Operations, Traffic, Acme-chief

Fri, Feb 8

Vgutierrez committed rOSCC014b3960a7eb: Rename certcentral to acme-chief (authored by Vgutierrez).
Rename certcentral to acme-chief
Fri, Feb 8, 7:53 AM
Vgutierrez committed rOSCCedfbe5a63b11: Rename certcentral to acme-chief (authored by Vgutierrez).
Rename certcentral to acme-chief
Fri, Feb 8, 7:25 AM
Vgutierrez committed rOSCCcc0203ea7ef7: Rename certcentral to acme-chief (authored by Vgutierrez).
Rename certcentral to acme-chief
Fri, Feb 8, 7:12 AM

Thu, Feb 7

Vgutierrez added a comment to T214274: Degraded RAID on cp5010.

here is the log line:

Jan 21 01:39:21 cp5010 kernel: [7472184.163052] sd 1:0:0:0: [sdb] Stopping disk
Thu, Feb 7, 9:19 PM · Traffic, ops-eqsin, Operations
Vgutierrez added a comment to T214274: Degraded RAID on cp5010.

that's right, the kernel shut down sdb due to the errors, that's why it is not even listed by lshw

Thu, Feb 7, 9:12 PM · Traffic, ops-eqsin, Operations
Vgutierrez updated subscribers of T214274: Degraded RAID on cp5010.

since @ayounsi is going to the eqsin datacenter later this month, maybe we could join efforts and replace sdb.
^^ @RobH

Thu, Feb 7, 4:10 PM · Traffic, ops-eqsin, Operations
Vgutierrez committed rOSCC3fa7a5d8572b: certcentral: Implement staging time (authored by Vgutierrez).
certcentral: Implement staging time
Thu, Feb 7, 12:29 PM
Vgutierrez closed T215103: cp nodes still try to OCSP staple the already expired digicert-2017 certificate as Resolved.

After merging the change, the following commands have been issued over cumin:

rm -f /etc/update-ocsp.d/digicert-2017-ecdsa-unified.conf
rm -f /etc/update-ocsp.d/digicert-2017-rsa-unified.conf
rm -f /var/cache/ocsp/digicert-2017-ecdsa-unified.ocsp
rm -f /var/cache/ocsp/digicert-2017-rsa-unified.ocsp
Thu, Feb 7, 12:04 PM · Operations, Traffic

Wed, Feb 6

Vgutierrez added a comment to T215389: esams cache layer mangles downloads of specific url.

Checking the rest of the text cluster in esams from bast3002 showed that all of them were affected. After restarting varnish-frontend the issue is gone. I'll leave the task open for further discussion with @BBlack

Wed, Feb 6, 12:33 PM · Operations, Traffic

Tue, Feb 5

Vgutierrez added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

This is looking way better than the previous attempt. All the systems have been running 19 days without issues since the firmware upgrade:

vgutierrez@cumin1001:~$ sudo cumin cp[1075-1090].eqiad.wmnet "uptime |cut -d' ' -f 4-5"
16 hosts will be targeted:
cp[1075-1090].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(16) cp[1075-1090].eqiad.wmnet
----- OUTPUT of 'uptime |cut -d' ' -f 4-5' -----
19 days
Tue, Feb 5, 4:21 PM · Patch-For-Review, Traffic, Operations
Vgutierrez added a comment to T196560: rack/setup/install LVS200[7-10].

So, the NIC issue reported in T203194 seems to be fixed after upgrading the NIC firmware to version 21.40 (https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=3x5g0).

Tue, Feb 5, 4:12 PM · Patch-For-Review, ops-codfw, Traffic, Operations
Vgutierrez committed rOSCC808aa227f1b8: certcentral: Implement staging time (authored by Vgutierrez).
certcentral: Implement staging time
Tue, Feb 5, 3:15 PM
Vgutierrez closed T213301: Avoid inter-hosts puppet dependencies on certificate deployment as Resolved.
Tue, Feb 5, 9:21 AM · Acme-chief
Vgutierrez closed T213301: Avoid inter-hosts puppet dependencies on certificate deployment, a subtask of T213705: Deploy managed LetsEncrypt certs for all public use-cases, as Resolved.
Tue, Feb 5, 9:21 AM · Traffic, Operations, Acme-chief, Goal

Mon, Feb 4

Vgutierrez claimed T207389: Rename the Certcentral project.
Mon, Feb 4, 7:29 AM · Patch-For-Review, Acme-chief
Vgutierrez raised the priority of T207389: Rename the Certcentral project from Normal to High.

+1 for acme-chief.
@Krenair I'd definitely like to hear your input here but I want to move this forward during February.
@BBlack what are your thoughts?

Mon, Feb 4, 6:59 AM · Patch-For-Review, Acme-chief

Sat, Feb 2

Vgutierrez triaged T215103: cp nodes still try to OCSP staple the already expired digicert-2017 certificate as Normal priority.
Sat, Feb 2, 3:44 PM · Operations, Traffic
Vgutierrez moved T215103: cp nodes still try to OCSP staple the already expired digicert-2017 certificate from Triage to TLS on the Traffic board.
Sat, Feb 2, 3:44 PM · Operations, Traffic
Vgutierrez created T215103: cp nodes still try to OCSP staple the already expired digicert-2017 certificate.
Sat, Feb 2, 3:44 PM · Operations, Traffic

Tue, Jan 29

Vgutierrez closed T214872: cp2014 host down as Resolved.

everything got back to normal after a reboot

Tue, Jan 29, 12:13 AM · ops-codfw, Traffic, Operations

Mon, Jan 28

Vgutierrez moved T214872: cp2014 host down from Triage to Hardware on the Traffic board.
Mon, Jan 28, 11:53 PM · ops-codfw, Traffic, Operations
Vgutierrez created T214872: cp2014 host down.
Mon, Jan 28, 11:50 PM · ops-codfw, Traffic, Operations

Mon, Jan 21

Vgutierrez committed rOSCCd7a33928526b: certcentral: Implement staging time (authored by Vgutierrez).
certcentral: Implement staging time
Mon, Jan 21, 9:47 AM
Vgutierrez committed rOSCC0c804db36018: certcentral: Implement staging time (authored by Vgutierrez).
certcentral: Implement staging time
Mon, Jan 21, 9:18 AM
Vgutierrez committed rOSCCa6fb46fc5f1e: certcentral: Implement staging time (authored by Vgutierrez).
certcentral: Implement staging time
Mon, Jan 21, 9:18 AM
Vgutierrez moved T214274: Degraded RAID on cp5010 from Triage to Hardware on the Traffic board.
Mon, Jan 21, 7:00 AM · Traffic, ops-eqsin, Operations
Vgutierrez added a project to T214274: Degraded RAID on cp5010: Traffic.

initial failure at 01:39:

vgutierrez@cp5010:~$ grep sdb /var/log/kern.log |grep -v "__ext4_get_inode_loc" |grep -v "IO failure"
Jan 21 01:39:17 cp5010 kernel: [7472180.491194] blk_update_request: I/O error, dev sdb, sector 2056
Jan 21 01:39:17 cp5010 kernel: [7472180.498009] md/raid1:md0: Disk failure on sdb1, disabling device.
Jan 21 01:39:17 cp5010 kernel: [7472180.517585] blk_update_request: I/O error, dev sdb, sector 19557568
Jan 21 01:39:17 cp5010 kernel: [7472180.524786] Buffer I/O error on dev sdb3, logical block 2968, lost async page write
Jan 21 01:39:17 cp5010 kernel: [7472180.539578] blk_update_request: I/O error, dev sdb, sector 1376643888
Jan 21 01:39:17 cp5010 kernel: [7472180.546977] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 172080489)
Jan 21 01:39:17 cp5010 kernel: [7472180.546980] Buffer I/O error on device sdb3, logical block 169638758
Jan 21 01:39:17 cp5010 kernel: [7472180.554269] Buffer I/O error on device sdb3, logical block 169638759
Jan 21 01:39:17 cp5010 kernel: [7472180.561567] Buffer I/O error on device sdb3, logical block 169638760
Jan 21 01:39:17 cp5010 kernel: [7472180.574918] blk_update_request: I/O error, dev sdb, sector 1377903848
Jan 21 01:39:17 cp5010 kernel: [7472180.574938] blk_update_request: I/O error, dev sdb, sector 1385932840
Jan 21 01:39:17 cp5010 kernel: [7472180.574941] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173241606)
Jan 21 01:39:17 cp5010 kernel: [7472180.574943] Buffer I/O error on device sdb3, logical block 170799877
Jan 21 01:39:17 cp5010 kernel: [7472180.574948] blk_update_request: I/O error, dev sdb, sector 1387288288
Jan 21 01:39:17 cp5010 kernel: [7472180.574949] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173411037)
Jan 21 01:39:17 cp5010 kernel: [7472180.574950] Buffer I/O error on device sdb3, logical block 170969308
Jan 21 01:39:17 cp5010 kernel: [7472180.574953] blk_update_request: I/O error, dev sdb, sector 1387358600
Jan 21 01:39:17 cp5010 kernel: [7472180.574954] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173419826)
Jan 21 01:39:17 cp5010 kernel: [7472180.574955] Buffer I/O error on device sdb3, logical block 170978097
Jan 21 01:39:17 cp5010 kernel: [7472180.575001] blk_update_request: I/O error, dev sdb, sector 1387450928
Jan 21 01:39:17 cp5010 kernel: [7472180.575004] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173431367)
Jan 21 01:39:17 cp5010 kernel: [7472180.575005] Buffer I/O error on device sdb3, logical block 170989638
Jan 21 01:39:17 cp5010 kernel: [7472180.575009] blk_update_request: I/O error, dev sdb, sector 1387705752
Jan 21 01:39:17 cp5010 kernel: [7472180.575011] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173463220)
Jan 21 01:39:17 cp5010 kernel: [7472180.575012] Buffer I/O error on device sdb3, logical block 171021491
Jan 21 01:39:17 cp5010 kernel: [7472180.575014] blk_update_request: I/O error, dev sdb, sector 1387727912
Jan 21 01:39:17 cp5010 kernel: [7472180.575016] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173465992)
Jan 21 01:39:17 cp5010 kernel: [7472180.575017] Buffer I/O error on device sdb3, logical block 171024261
Jan 21 01:39:17 cp5010 kernel: [7472180.575018] Buffer I/O error on device sdb3, logical block 171024262
Jan 21 01:39:17 cp5010 kernel: [7472180.575071] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173563570)
Jan 21 01:39:17 cp5010 kernel: [7472180.575078] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173588828)
Jan 21 01:39:17 cp5010 kernel: [7472180.575083] EXT4-fs warning (device sdb3): ext4_end_bio:314: I/O error -5 writing to inode 12 (offset 0 size 0 starting block 173646893)
Jan 21 01:39:17 cp5010 kernel: [7472180.581918] Buffer I/O error on dev sdb3, logical block 2968, lost async page write
Jan 21 01:39:20 cp5010 kernel: [7472183.994814] Buffer I/O error on dev sdb3, logical block 190472252, lost async page write
Jan 21 01:39:21 cp5010 kernel: [7472184.092088] EXT4-fs (sdb3): Delayed block allocation failed for inode 12 at logical offset 12485094 with max blocks 739 with error 5
Jan 21 01:39:21 cp5010 kernel: [7472184.105848] EXT4-fs (sdb3): This should not happen!! Data will be lost
Jan 21 01:39:21 cp5010 kernel: [7472184.116381] Buffer I/O error on dev sdb3, logical block 2968, lost async page write
Jan 21 01:39:21 cp5010 kernel: [7472184.125134] Buffer I/O error on dev sdb3, logical block 190472252, lost async page write
Jan 21 01:39:21 cp5010 kernel: [7472184.134545] EXT4-fs (sdb3): Delayed block allocation failed for inode 12 at logical offset 12485833 with max blocks 9 with error 5
Jan 21 01:39:21 cp5010 kernel: [7472184.138930]  disk 1, wo:1, o:0, dev:sdb1
Jan 21 01:39:21 cp5010 kernel: [7472184.148144] EXT4-fs (sdb3): This should not happen!! Data will be lost
Jan 21 01:39:21 cp5010 kernel: [7472184.163052] sd 1:0:0:0: [sdb] Stopping disk
Jan 21 01:39:21 cp5010 kernel: [7472184.163065] sd 1:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Mon, Jan 21, 7:00 AM · Traffic, ops-eqsin, Operations
Vgutierrez added a comment to T214274: Degraded RAID on cp5010.

kern.log is reporting multiple failures in /dev/sdb3 as well

Jan 21 06:41:47 cp5010 kernel: [7490330.204759] EXT4-fs error (device sdb3) in ext4_reserve_inode_write:5448: IO failure
Jan 21 06:41:48 cp5010 kernel: [7490331.199323] EXT4-fs error (device sdb3) in ext4_reserve_inode_write:5448: IO failure
Jan 21 06:41:49 cp5010 kernel: [7490332.212809] EXT4-fs error (device sdb3) in ext4_reserve_inode_write:5448: IO failure
vgutierrez@cp5010:~$ grep "IO failure" /var/log/kern.log |wc -l
15916
Mon, Jan 21, 6:42 AM · Traffic, ops-eqsin, Operations
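
The grep pipelines above count and filter kernel messages for sdb by hand. A minimal sketch of the same idea in Python follows, grouping messages for one device by kind. The category names and the `tally` helper are assumptions made for illustration, not an existing tool:

```python
import re
from collections import Counter

# Matches the syslog prefix up to the kernel timestamp and captures the message.
LINE_RE = re.compile(r"kernel: \[\s*\d+\.\d+\] (?P<message>.*)$")


def classify(message: str) -> str:
    """Map a kernel message to a coarse, hypothetical category."""
    if message.startswith("blk_update_request: I/O error"):
        return "block I/O error"
    if message.startswith("Buffer I/O error"):
        return "buffer I/O error"
    if message.startswith("EXT4-fs"):
        return "ext4"
    if message.startswith("md/raid"):
        return "md raid"
    return "other"


def tally(lines, device="sdb"):
    """Count kernel messages mentioning `device`, grouped by category."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and device in match.group("message"):
            counts[classify(match.group("message"))] += 1
    return counts


sample = [
    "Jan 21 01:39:17 cp5010 kernel: [7472180.491194] blk_update_request: I/O error, dev sdb, sector 2056",
    "Jan 21 01:39:17 cp5010 kernel: [7472180.498009] md/raid1:md0: Disk failure on sdb1, disabling device.",
    "Jan 21 01:39:17 cp5010 kernel: [7472180.524786] Buffer I/O error on dev sdb3, logical block 2968, lost async page write",
    "Jan 21 06:41:47 cp5010 kernel: [7490330.204759] EXT4-fs error (device sdb3) in ext4_reserve_inode_write:5448: IO failure",
]
print(tally(sample))
```

To run it against a real log, pass an open file object for /var/log/kern.log in place of `sample`.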

Fri, Jan 18

Vgutierrez closed T214041: inconsistencies between pybal configuration and IPVS status as Resolved.
Fri, Jan 18, 1:36 PM · Operations, Pybal, Traffic

Jan 18 2019

Vgutierrez renamed T214072: dns200[12] lack IPv6 records from prometheus metrics apparently are missing some ipvs entries to dns200[12] lack IPv6 records.
Jan 18 2019, 10:31 AM · Patch-For-Review, monitoring, Traffic, Operations, Pybal
Vgutierrez closed T214072: dns200[12] lack IPv6 records as Resolved.

as @Joe properly pointed out in our IRC discussions, the main issue here is that dns200[12] lacked IPv6 records

Jan 18 2019, 10:31 AM · Patch-For-Review, monitoring, Traffic, Operations, Pybal
Vgutierrez added a comment to T214072: dns200[12] lack IPv6 records.

It appears that prometheus is not listing any IPVS service without backends, and right now (IPVS wise), dns_rec6 doesn't have any backend server configured in lvs200[25].

Jan 18 2019, 9:32 AM · Patch-For-Review, monitoring, Traffic, Operations, Pybal

Jan 17 2019

Vgutierrez created T214072: dns200[12] lack IPv6 records.
Jan 17 2019, 6:48 PM · Patch-For-Review, monitoring, Traffic, Operations, Pybal
Vgutierrez triaged T214041: inconsistencies between pybal configuration and IPVS status as Normal priority.
Jan 17 2019, 2:30 PM · Operations, Pybal, Traffic
Vgutierrez added a comment to T214041: inconsistencies between pybal configuration and IPVS status.

After removing a service in pybal, a restart is not enough to get rid of the service at the IPVS level; it should be removed manually with ipvsadm -D -t ip:port (or -u instead of -t if it's UDP instead of TCP)

Jan 17 2019, 2:08 PM · Operations, Pybal, Traffic
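
The manual cleanup described above can be sketched as a small helper that only builds the ipvsadm command line for a stale service and does not execute it. The helper name and the example addresses (taken from the documentation ranges) are hypothetical. Wrapping IPv6 addresses in brackets follows the ipvsadm man page as far as I know, so treat it as an assumption to verify:

```python
import ipaddress
from typing import List


def ipvs_delete_command(address: str, port: int, protocol: str = "tcp") -> List[str]:
    """Build the argv that deletes one virtual service from the IPVS table."""
    # -t selects a TCP service, -u a UDP service.
    flag = {"tcp": "-t", "udp": "-u"}[protocol]
    ip = ipaddress.ip_address(address)
    # Assumption: IPv6 service addresses are written as [addr]:port.
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return ["ipvsadm", "-D", flag, f"{host}:{port}"]


# Example addresses from the documentation ranges, not real services.
print(ipvs_delete_command("192.0.2.10", 53, "udp"))
print(ipvs_delete_command("2001:db8::1", 443))
```

Returning a list rather than running the command keeps the sketch safe to try anywhere; the result could be handed to subprocess.run on a load balancer after reviewing it.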
Vgutierrez moved T214041: inconsistencies between pybal configuration and IPVS status from Triage to LoadBalancer on the Traffic board.
Jan 17 2019, 1:43 PM · Operations, Pybal, Traffic
Vgutierrez created T214041: inconsistencies between pybal configuration and IPVS status.
Jan 17 2019, 1:43 PM · Operations, Pybal, Traffic
Vgutierrez added a comment to T213820: certcentral is incompatible with the current python3-acme version shipped in stretch-backports.

@Vgutierrez: this is done now right?

Jan 17 2019, 1:00 PM · Patch-For-Review, Acme-chief
Vgutierrez committed rOSCC8c358c09b4ad: CI: Run tests with minimum and latest dependencies (authored by Vgutierrez).
CI: Run tests with minimum and latest dependencies
Jan 17 2019, 12:41 PM
Vgutierrez committed rOSCCe9b0a56e2f27: CI: Run tests against py{35,36,37} with min and latest deps (authored by Vgutierrez).
CI: Run tests against py{35,36,37} with min and latest deps
Jan 17 2019, 12:23 PM
Vgutierrez committed rOSCCd08a1028b57f: debian: Add release 0.8 to changelog (authored by Vgutierrez).
debian: Add release 0.8 to changelog
Jan 17 2019, 11:43 AM
Vgutierrez committed rOSCC9a579db63d74: Release 0.8 (authored by Vgutierrez).
Release 0.8
Jan 17 2019, 11:32 AM
Vgutierrez committed rOSCCef8c96347cdc: certcentral: Allow specifying authorized hosts and regex in the config (authored by Vgutierrez).
certcentral: Allow specifying authorized hosts and regex in the config
Jan 17 2019, 11:32 AM
Vgutierrez committed rOSCC033eadd885cc: certcentral: Bump josepy to the latest version shipped in stretch-bp (authored by Vgutierrez).
certcentral: Bump josepy to the latest version shipped in stretch-bp
Jan 17 2019, 11:32 AM
Vgutierrez committed rOSCCb153b67eb3d9: certcentral: Bump acme to the latest version shipped in stretch-backports (authored by Vgutierrez).
certcentral: Bump acme to the latest version shipped in stretch-backports
Jan 17 2019, 11:32 AM
Vgutierrez committed rOSCC6bc17d2f7202: acme_requests: Handle TCP/HTTPS errors (authored by Vgutierrez).
acme_requests: Handle TCP/HTTPS errors
Jan 17 2019, 11:31 AM
Vgutierrez added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

firmware upgrade completed for all the affected systems.

Jan 17 2019, 11:11 AM · Patch-For-Review, Traffic, Operations

Jan 16 2019

Vgutierrez added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

cp1088 has been affected as well after the kernel upgrade

Jan 16 2019, 7:54 AM · Patch-For-Review, Traffic, Operations

Jan 15 2019

Vgutierrez committed rOSCCaa6c77b0a421: Release 0.8 (authored by Vgutierrez).
Release 0.8
Jan 15 2019, 4:56 PM
Vgutierrez committed rOSCCa8e98f88082c: certcentral: Allow specifying authorized hosts and regex in the config (authored by Vgutierrez).
certcentral: Allow specifying authorized hosts and regex in the config
Jan 15 2019, 3:20 PM
Vgutierrez committed rOSCC2b70f5ea1c1f: certcentral: Allow specifying authorized hosts and regex in the config (authored by Vgutierrez).
certcentral: Allow specifying authorized hosts and regex in the config
Jan 15 2019, 3:03 PM
Vgutierrez committed rOSCCec5e85f78812: certcentral: Bump josepy to the latest version shipped in stretch-bp (authored by Vgutierrez).
certcentral: Bump josepy to the latest version shipped in stretch-bp
Jan 15 2019, 2:51 PM
Vgutierrez added a comment to T213820: certcentral is incompatible with the current python3-acme version shipped in stretch-backports.

@Krenair yeah, same issue, the proper patch is https://gerrit.wikimedia.org/r/#/c/operations/software/certcentral/+/484438/2/certcentral/acme_requests.py though

Jan 15 2019, 2:46 PM · Patch-For-Review, Acme-chief
Vgutierrez committed rOSCCb64b3e2e3d07: certcentral: Bump acme to the latest version shipped in stretch-backports (authored by Vgutierrez).
certcentral: Bump acme to the latest version shipped in stretch-backports
Jan 15 2019, 2:45 PM
Vgutierrez committed rOSCC6d3ce64ce735: certcentral: Bump acme to the latest version shipped in stretch-backports (authored by Vgutierrez).
certcentral: Bump acme to the latest version shipped in stretch-backports
Jan 15 2019, 2:42 PM
Vgutierrez created T213820: certcentral is incompatible with the current python3-acme version shipped in stretch-backports.
Jan 15 2019, 2:33 PM · Patch-For-Review, Acme-chief
Vgutierrez committed rOSCC922ea6d28422: certcentral: Allow specifying authorized hosts and regex in the config (authored by Vgutierrez).
certcentral: Allow specifying authorized hosts and regex in the config
Jan 15 2019, 8:53 AM
Vgutierrez added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

so far we've seen crashes in the following servers (updated on 16/01/19):

  • cp1078 (twice)
  • cp1080
  • cp1084
  • cp1085
  • cp1088
Jan 15 2019, 7:40 AM · Patch-For-Review, Traffic, Operations
Vgutierrez added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

Thanks for handling cp1078 @Dzahn. It looks like 4.9.144 is also affected

Jan 15 2019, 6:47 AM · Patch-For-Review, Traffic, Operations

Jan 14 2019

Vgutierrez added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

kernel upgraded successfully in cp1075-cp1090:

vgutierrez@cumin1001:~$ sudo cumin cp[1075-1090].eqiad.wmnet 'uname -v'
16 hosts will be targeted:
cp[1075-1090].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(16) cp[1075-1090].eqiad.wmnet
----- OUTPUT of 'uname -v' -----
#1 SMP Debian 4.9.144-1 (2018-12-30)
Jan 14 2019, 5:48 PM · Patch-For-Review, Traffic, Operations
Vgutierrez triaged T213737: Allow specifying a custom period of time before deploying a newly issued certificate as Normal priority.
Jan 14 2019, 5:13 PM · Patch-For-Review, Traffic, Operations, Acme-chief
Vgutierrez moved T213705: Deploy managed LetsEncrypt certs for all public use-cases from Triage to TLS on the Traffic board.
Jan 14 2019, 2:48 PM · Traffic, Operations, Acme-chief, Goal
Vgutierrez added a parent task for T213301: Avoid inter-hosts puppet dependencies on certificate deployment: T213705: Deploy managed LetsEncrypt certs for all public use-cases.
Jan 14 2019, 2:42 PM · Acme-chief
Vgutierrez added a subtask for T213705: Deploy managed LetsEncrypt certs for all public use-cases: T213301: Avoid inter-hosts puppet dependencies on certificate deployment.
Jan 14 2019, 2:42 PM · Traffic, Operations, Acme-chief, Goal
Vgutierrez triaged T213705: Deploy managed LetsEncrypt certs for all public use-cases as Normal priority.
Jan 14 2019, 2:41 PM · Traffic, Operations, Acme-chief, Goal
Vgutierrez created T213705: Deploy managed LetsEncrypt certs for all public use-cases.
Jan 14 2019, 2:41 PM · Traffic, Operations, Acme-chief, Goal

Jan 10 2019

Vgutierrez added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

yeah, it's included as part of 4.9.134: https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.9.134

Jan 10 2019, 4:56 PM · Patch-For-Review, Traffic, Operations