Andrew (Andrew Bogott)
User

User Details

User Since: Nov 2 2014, 11:35 PM (236 w, 6 d)
Availability: Available
IRC Nick: andrewbogott
LDAP User: Unknown
MediaWiki User: Andrewbogott

Recent Activity

Today

Andrew updated subscribers of T223832: puppet vs. stretch vs. keystone.
Sun, May 19, 3:37 PM
Andrew updated the task description for T223832: puppet vs. stretch vs. keystone.
Sun, May 19, 3:26 PM
Andrew created T223832: puppet vs. stretch vs. keystone.
Sun, May 19, 3:23 PM
Andrew created P8542 puppet never settles on cloudcontrol2001-dev.
Sun, May 19, 3:23 PM · cloud-services-team

Thu, May 16

Andrew triaged T219079: Horizon - Not possible to remove A record from Record Set as Low priority.

I can confirm that this is present on our current Horizon. It seems like something that can be worked around in the meantime, so I'm going to mark it as low priority and hope that future upgrades address it.

Thu, May 16, 9:33 PM · Horizon, cloud-services-team (Kanban)
Andrew closed T216239: CloudVPS: drain and rebuild labvirt1009 as cloudvirt1009 as Resolved.
Thu, May 16, 9:11 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew closed T218514: puppet breakage on Jessie tools nodes (and probably on Jessie VMs everywhere) as Resolved.

This seems to be fine now. I double-checked the state of apt and dpkg, and although a few things are stuck from race conditions, there's nothing comprehensive or serious going on.

Thu, May 16, 7:32 PM · cloud-services-team (Kanban)
Andrew placed T166845: monitor some things on all Cloud instances (discussion) up for grabs.

I still sort of want this but I'm clearly not really working on it.

Thu, May 16, 6:39 PM · Patch-For-Review, cloud-services-team (Kanban), Cloud-VPS
Andrew closed T223370: CloudVPS DNS: weird behavior in integration-puppetmaster01 as Resolved.

This particular thing shouldn't happen any more.

Thu, May 16, 6:38 PM · Patch-For-Review, Continuous-Integration-Infrastructure, cloud-services-team (Kanban)
Andrew closed T222893: Add new Cloud VPS project "sso" as Resolved.

Done

Thu, May 16, 5:24 PM · Cloud-VPS (Project-requests), cloud-services-team
Andrew closed T221157: Request creation of Gratitude VPS project as Resolved.

I've created this project and made @Maximilianklein and @Rubberpaw project admins. I'm unable to locate a dev account for 'Epenn-cs' but one of the others can add them if they have an account.

Thu, May 16, 5:22 PM · Cloud-VPS (Project-requests)
Andrew closed T222363: Request creation of wikilink VPS project as Resolved.

I've created this project and added @Samwalton9 as project admin. They can add additional users as needed.

Thu, May 16, 5:19 PM · Cloud-VPS (Project-requests)
Andrew closed T222800: Requesting quota increase for 'puppet-diffs' project as Resolved.

I doubled the RAM and core quota for the project -- let me know if that doesn't get you where you need to go :)

Thu, May 16, 5:18 PM · Operations, Cloud-VPS (Quota-requests), puppet-compiler
Andrew closed T222800: Requesting quota increase for 'puppet-diffs' project, a subtask of T221969: Puppet catalog compiler - increasing max concurrent jobs, as Resolved.
Thu, May 16, 5:18 PM · puppet-compiler, Continuous-Integration-Infrastructure, Release-Engineering-Team
Andrew added a comment to T223370: CloudVPS DNS: weird behavior in integration-puppetmaster01.

I'm still puzzled about this. A second look suggests that the only way the firstboot script would fail is if it's unable to resolve the name of the host. But if /that/ happens then I can't imagine things ever working.

Thu, May 16, 3:41 PM · Patch-For-Review, Continuous-Integration-Infrastructure, cloud-services-team (Kanban)
Andrew updated subscribers of T223458: mgmt outages for cloud* systems seem to page everyone.

If only WMCS staff (or people explicitly opting in for pages to those systems) got paged then this is probably correct behavior. @ArielGlenn says they did not get pages. @fgiunchedi /did/ get paged though. So... ???

Thu, May 16, 3:31 PM · cloud-services-team (Kanban)
Andrew created T223458: mgmt outages for cloud* systems seem to page everyone.
Thu, May 16, 3:22 PM · cloud-services-team (Kanban)

Wed, May 15

Dzahn awarded T221654: Puppet broken on VMs in deployment-prep a Like token.
Wed, May 15, 9:02 PM · Patch-For-Review, Beta-Cluster-Infrastructure, serviceops
Andrew closed T219390: Have puppet-merge on puppetmaster1001 publish the official sha1 after merging as Resolved.

This is done now, published at

Wed, May 15, 7:14 PM · Patch-For-Review, cloud-services-team (Kanban), Puppet
Andrew closed T219390: Have puppet-merge on puppetmaster1001 publish the official sha1 after merging, a subtask of T171188: Move the main WMCS puppetmaster into the Labs realm, as Resolved.
Wed, May 15, 7:14 PM · Patch-For-Review, cloud-services-team (Kanban), Cloud-Services, Puppet, Operations
Andrew added a comment to T220853: VMs on cloudvirt1015 crashing - bad mainboard/memory.

*bump* Chris, do you have any thoughts about what we should do next here?

Wed, May 15, 7:13 PM · Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban)
Andrew moved T223148: Cloud Services: reallocate workload from rack B5-eqiad from Inbox to Doing on the cloud-services-team (Kanban) board.
Wed, May 15, 7:12 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations
Andrew moved T223370: CloudVPS DNS: weird behavior in integration-puppetmaster01 from Inbox to Doing on the cloud-services-team (Kanban) board.
Wed, May 15, 7:11 PM · Patch-For-Review, Continuous-Integration-Infrastructure, cloud-services-team (Kanban)
Andrew closed T216190: Rebuild labvirt1012 as cloudvirt1012 as Resolved.
Wed, May 15, 7:07 PM · cloud-services-team (Kanban), Patch-For-Review
Andrew awarded T221654: Puppet broken on VMs in deployment-prep a Barnstar token.
Wed, May 15, 6:59 PM · Patch-For-Review, Beta-Cluster-Infrastructure, serviceops
Andrew closed T221183: Rename and re-assign cloud dns servers as Resolved.
Wed, May 15, 6:16 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew claimed T221183: Rename and re-assign cloud dns servers.
Wed, May 15, 6:11 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew created T223394: https://toolsadmin.wikimedia.org/profile/settings/ssh-keys/delete throws a 500.
Wed, May 15, 5:48 PM · cloud-services-team (Kanban), Striker
Andrew claimed T223370: CloudVPS DNS: weird behavior in integration-puppetmaster01.
Wed, May 15, 5:26 PM · Patch-For-Review, Continuous-Integration-Infrastructure, cloud-services-team (Kanban)
Andrew added a comment to T223370: CloudVPS DNS: weird behavior in integration-puppetmaster01.

So it looks like any failure at all in the firstboot script means that firstboot will be run again on subsequent boots. I'm pretty sure that this is wrong and that we should only ever run it once, even if it throws an error, but I'm going to look at more failure cases first.

Wed, May 15, 4:58 PM · Patch-For-Review, Continuous-Integration-Infrastructure, cloud-services-team (Kanban)
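
The run-once behavior proposed in the comment above could look roughly like the sketch below. It is an illustration only, not the actual firstboot script: the `run_once` helper and the marker file are hypothetical names, and recording the attempt before running is an assumption about how "only ever run it once, even if it throws an error" might be implemented.

```python
from pathlib import Path
from typing import Callable


def run_once(marker: Path, action: Callable[[], None]) -> bool:
    """Run `action` at most once; return True if it was attempted on this call.

    `marker` is a hypothetical sentinel file. It is created *before* the
    action runs, so a failing action is not retried on the next boot.
    """
    if marker.exists():
        return False
    marker.touch()
    try:
        action()
    except Exception as exc:
        # Log the failure instead of propagating it; no retry is scheduled.
        print(f"firstboot failed, not retrying: {exc}")
    return True
```

With this shape, a failure such as an unresolvable hostname is logged once and the script does not re-run on later boots. The trade-off is that a transient failure is never retried automatically.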
Andrew added a comment to T223370: CloudVPS DNS: weird behavior in integration-puppetmaster01.

Ah, I see. There are some VMs which were built from a bad base image and for some reason re-run the firstboot script on every boot. It should be fixed now, and I thought I'd caught them all... I'll do a bit more research.

Wed, May 15, 4:39 PM · Patch-For-Review, Continuous-Integration-Infrastructure, cloud-services-team (Kanban)
Andrew reassigned T223370: CloudVPS DNS: weird behavior in integration-puppetmaster01 from Andrew to aborrero.

...and it looks like puppet caught up?

Wed, May 15, 2:44 PM · Patch-For-Review, Continuous-Integration-Infrastructure, cloud-services-team (Kanban)
Andrew added a comment to T223370: CloudVPS DNS: weird behavior in integration-puppetmaster01.

This is likely a result of puppet going out of sync. The old recursors were 208.80.155.118 and 208.80.154.20; they were replaced with 208.80.154.143 and 208.80.154.24 as per T221183.

Wed, May 15, 2:41 PM · Patch-For-Review, Continuous-Integration-Infrastructure, cloud-services-team (Kanban)

Tue, May 14

Andrew added a comment to T223272: CloudVPS: evaluate if we can make rsync use 10G in cloudvirts.

Drive-by comment: I've also been disappointed at transfer speeds when migrating to/from 10G systems, but I never followed up to figure out what the bottleneck was. Worth investigating, I think.

Tue, May 14, 4:59 PM · Patch-For-Review, Operations, netops, cloud-services-team (Kanban)
Andrew added a comment to T222800: Requesting quota increase for 'puppet-diffs' project.

Approved

Tue, May 14, 4:17 PM · Operations, Cloud-VPS (Quota-requests), puppet-compiler
Andrew added a comment to T221157: Request creation of Gratitude VPS project.

Approved -- we'll create this in the next few days.

Tue, May 14, 4:16 PM · Cloud-VPS (Project-requests)
Andrew added a comment to T222363: Request creation of wikilink VPS project.

Approved -- we'll try to get this created in the next few days.

Tue, May 14, 4:14 PM · Cloud-VPS (Project-requests)
Andrew added a comment to T222893: Add new Cloud VPS project "sso".

Approved

Tue, May 14, 4:14 PM · Cloud-VPS (Project-requests), cloud-services-team

Mon, May 13

Andrew added a comment to T223148: Cloud Services: reallocate workload from rack B5-eqiad.

I think we should risk the slight chance of a multi-hour outage. Three days isn't enough time to give proper notice of an evacuation, and if things go well the work will have been in vain anyway. So, I propose:

Mon, May 13, 6:24 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations
Andrew added a comment to T223148: Cloud Services: reallocate workload from rack B5-eqiad.

I've asked for clarification about what kind of power outage is feared here. Since emptying 1028 will cause downtime anyway I want to know if the expected downtime from the PDU move is more or less than the downtime associated with evacuation.

Mon, May 13, 5:54 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations
Andrew added a comment to T223126: Install new PDUs into b5-eqiad.

Just to clarify -- best case (normal) scenario is no interruption? And worst case is... brief power interruption? Or no power for hours?

Mon, May 13, 5:53 PM · Patch-For-Review, ops-eqiad, Operations
Andrew closed T216724: relocate/reimage cloudvirt1024 with 10G interfaces as Resolved.
Mon, May 13, 5:11 PM · Patch-For-Review, Operations, cloud-services-team (Kanban)
Andrew closed T216724: relocate/reimage cloudvirt1024 with 10G interfaces, a subtask of T216195: Move cloudvirt hosts to 10Gb ethernet, as Resolved.
Mon, May 13, 5:11 PM · ops-eqiad, DC-Ops, Operations, Epic, cloud-services-team (Kanban)
Andrew created T223107: cloudvirt 'different negotiated speed than requested' alerts.
Mon, May 13, 2:48 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew added a comment to T210850: WMCS-related dashboards using Diamond metrics.

Thanks arturo! I worked on this a bit last week but didn't make a whole lot of progress.

Mon, May 13, 1:24 PM · cloud-services-team (Kanban), Operations

Fri, May 10

Andrew updated the task description for T210850: WMCS-related dashboards using Diamond metrics.
Fri, May 10, 2:39 PM · cloud-services-team (Kanban), Operations
Andrew updated the task description for T210850: WMCS-related dashboards using Diamond metrics.
Fri, May 10, 2:38 PM · cloud-services-team (Kanban), Operations
Andrew claimed T210850: WMCS-related dashboards using Diamond metrics.
Fri, May 10, 2:09 PM · cloud-services-team (Kanban), Operations

Thu, May 9

dduvall awarded Blog Post: Nova-network is gone! an Evil Spooky Haunted Tree token.
Thu, May 9, 3:46 PM · Toolforge, Cloud-VPS
Andrew added a comment to T166337: wsexport tool leaking files in /tmp.

To be clear: The issue is the leaking files. It doesn't matter where the files wind up, they'll cause problems either way.

Thu, May 9, 2:56 PM · Community-Tech-Sprint, Community-Tech, E-Book-Export-Reliability, Tools, Toolforge

Wed, May 8

Andrew updated subscribers of T221339: Missing index on revision_userindex.rev_actor.
Wed, May 8, 8:03 PM · Patch-For-Review, Data-Services
Andrew added a comment to T218844: Update Gerrit /r/p/ links to /r/.

Ran sudo cumin --force --timeout 500 -o json "A:all" "sed -i 's%/r/p/%/r/%' /srv/composer/.git/config" for the composer urls

Wed, May 8, 7:23 PM · MW-1.34-notes (1.34.0-wmf.4; 2019-05-07), User-zeljkofilipin, Patch-For-Review, good first bug, Documentation, Epic, Wikimedia-General-or-Unknown, Gerrit
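
For readers unfamiliar with the sed expression in the comment above, here is a minimal Python sketch of the same substitution applied to the text of a git config file. The function name `rewrite_gerrit_urls` and the sample URL are assumptions made for illustration; the original work was done with sed via cumin, not with this code.

```python
def rewrite_gerrit_urls(config_text: str) -> str:
    """Replace the first '/r/p/' on each line with '/r/'.

    This mirrors sed 's%/r/p/%/r/%', which (without the 'g' flag)
    substitutes only the first match per line.
    """
    return "".join(
        line.replace("/r/p/", "/r/", 1)
        for line in config_text.splitlines(keepends=True)
    )


# Hypothetical sample input, shaped like a remote entry in .git/config.
sample = "[remote \"origin\"]\n\turl = https://gerrit.wikimedia.org/r/p/operations/puppet\n"
updated = rewrite_gerrit_urls(sample)
```

Lines that do not contain `/r/p/` pass through unchanged, so running the rewrite on an already-updated config should be harmless.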
Andrew added a comment to T218844: Update Gerrit /r/p/ links to /r/.

After merging the above patch I corrected the urls in /var/lib/git/operations/puppet/.git/config and /var/lib/git/operations/software/.git/config on all prod and cloud hosts.

Wed, May 8, 5:13 PM · MW-1.34-notes (1.34.0-wmf.4; 2019-05-07), User-zeljkofilipin, Patch-For-Review, good first bug, Documentation, Epic, Wikimedia-General-or-Unknown, Gerrit

Tue, May 7

Andrew closed Restricted Task, a subtask of T171188: Move the main WMCS puppetmaster into the Labs realm, as Resolved.
Tue, May 7, 8:25 PM · Patch-For-Review, cloud-services-team (Kanban), Cloud-Services, Puppet, Operations

Mon, May 6

mmodell awarded Blog Post: Nova-network is gone! a Yellow Medal token.
Mon, May 6, 5:13 PM · Toolforge, Cloud-VPS
Andrew updated the task description for T216549: Hold back spare drives in all cloudvirts.
Mon, May 6, 3:05 PM · cloud-services-team (Kanban)
Andrew updated the task description for T216195: Move cloudvirt hosts to 10Gb ethernet.
Mon, May 6, 3:02 PM · ops-eqiad, DC-Ops, Operations, Epic, cloud-services-team (Kanban)
Andrew closed T221138: relocate/reimage cloudvirt1004 with 10G interfaces, a subtask of T216195: Move cloudvirt hosts to 10Gb ethernet, as Resolved.
Mon, May 6, 3:02 PM · ops-eqiad, DC-Ops, Operations, Epic, cloud-services-team (Kanban)
Andrew closed T221138: relocate/reimage cloudvirt1004 with 10G interfaces as Resolved.
Mon, May 6, 3:02 PM · Patch-For-Review, Operations, Epic, cloud-services-team (Kanban)
Andrew closed T221139: relocate/reimage cloudvirt1003 with 10G interfaces, a subtask of T216195: Move cloudvirt hosts to 10Gb ethernet, as Resolved.
Mon, May 6, 3:01 PM · ops-eqiad, DC-Ops, Operations, Epic, cloud-services-team (Kanban)
Andrew closed T221139: relocate/reimage cloudvirt1003 with 10G interfaces as Resolved.
Mon, May 6, 3:01 PM · Patch-For-Review, Operations, Epic, cloud-services-team (Kanban)
Andrew closed T221140: relocate/reimage cloudvirt1002 with 10G interfaces as Resolved.
Mon, May 6, 3:01 PM · Patch-For-Review, Operations, Epic, cloud-services-team (Kanban)
Andrew closed T221140: relocate/reimage cloudvirt1002 with 10G interfaces, a subtask of T216195: Move cloudvirt hosts to 10Gb ethernet, as Resolved.
Mon, May 6, 3:01 PM · ops-eqiad, DC-Ops, Operations, Epic, cloud-services-team (Kanban)
Andrew closed T221141: relocate/reimage cloudvirt1001 with 10G interfaces as Resolved.

Thank you for working on all these, @Cmjohnson !

Mon, May 6, 3:01 PM · Patch-For-Review, Operations, Epic, cloud-services-team (Kanban)
Andrew closed T221141: relocate/reimage cloudvirt1001 with 10G interfaces, a subtask of T216195: Move cloudvirt hosts to 10Gb ethernet, as Resolved.
Mon, May 6, 3:01 PM · ops-eqiad, DC-Ops, Operations, Epic, cloud-services-team (Kanban)
Andrew updated the task description for T221141: relocate/reimage cloudvirt1001 with 10G interfaces.
Mon, May 6, 3:01 PM · Patch-For-Review, Operations, Epic, cloud-services-team (Kanban)
Andrew updated the task description for T212772: Track remaining trusty servers in production.
Mon, May 6, 2:37 PM · cloud-services-team (Kanban), Patch-For-Review, Operations

Sun, May 5

Andrew added a comment to T222522: DNS resolutions error's within WMCS.

There are a fair number of timeout messages when pdns and mdns try to sync via axfr. I suspect that that's related to the problem but so far that's all I've found.

Sun, May 5, 3:56 AM · cloud-services-team, Cloud-VPS
Andrew added a comment to T222522: DNS resolutions error's within WMCS.

Frustratingly there are no log messages from pdns-recursor about these failures.

Sun, May 5, 2:49 AM · cloud-services-team, Cloud-VPS

Fri, May 3

Andrew added a comment to T221063: Request creation of Wikimedia developer account to join existing VPS project.

Hm, I seem not to have gotten notified when this was last updated -- sorry! The good news is that self-serve account creation is now back online, so Tupino should be able to just create a new account on wikitech.

Fri, May 3, 5:27 PM · Soweego, Wikidata, cloud-services-team, Cloud-VPS

Thu, May 2

D3r1ck01 awarded Blog Post: Nova-network is gone! a Party Time token.
Thu, May 2, 10:42 PM · Toolforge, Cloud-VPS
greg awarded Blog Post: Nova-network is gone! a Barnstar token.
Thu, May 2, 10:29 PM · Toolforge, Cloud-VPS
chasemp awarded Blog Post: Nova-network is gone! a Love token.
Thu, May 2, 9:25 PM · Toolforge, Cloud-VPS
Krenair awarded Blog Post: Nova-network is gone! a Mountain of Wealth token.
Thu, May 2, 9:11 PM · Toolforge, Cloud-VPS
Andrew updated the post content for Blog Post: Nova-network is gone!.
Thu, May 2, 9:06 PM · Toolforge, Cloud-VPS

Tue, Apr 30

Andrew added a comment to T209707: tagged_interface sometimes exceeds IFNAMSIZ.

I still need something like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474272/ in order to get cloudvirt1024 online (and to pave the way towards upgrading similar hardware to Stretch). Are there competing solutions at this point, or should I just be bold and merge 474272?

Tue, Apr 30, 9:25 PM · Traffic, Operations

Sat, Apr 27

Andrew updated the task description for T212772: Track remaining trusty servers in production.
Sat, Apr 27, 11:41 PM · cloud-services-team (Kanban), Patch-For-Review, Operations

Fri, Apr 26

Andrew closed T221106: cloudservices2002-dev: bootstrap as Resolved.

pdns is now running and happily talking to the db. I added some docs to the puppet class about what I did.

Fri, Apr 26, 6:14 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew closed T221106: cloudservices2002-dev: bootstrap, a subtask of T218575: Reallocate LDAP database from labtestservices2001, as Resolved.
Fri, Apr 26, 6:14 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew closed T219789: Horizon proxy UI doesn't allow dashes in proxy names as Invalid.
Fri, Apr 26, 5:46 PM · cloud-services-team (Kanban), Horizon
Andrew closed T219953: labservices1001/1002 sometimes unresponsive as Declined.

I think this is moot since we're shutting down these systems. T221857

Fri, Apr 26, 5:44 PM · Patch-For-Review
Andrew reassigned T221857: Decommission labservices1001 & labservices1002 from Andrew to RobH.
Fri, Apr 26, 5:39 PM · ops-eqiad, decommission, Operations
Andrew updated the task description for T221857: Decommission labservices1001 & labservices1002.
Fri, Apr 26, 5:17 PM · ops-eqiad, decommission, Operations
Andrew updated the task description for T212772: Track remaining trusty servers in production.
Fri, Apr 26, 12:43 PM · cloud-services-team (Kanban), Patch-For-Review, Operations
Andrew closed T221049: relocate/reimage cloudvirt1005 with 10G interfaces as Resolved.
Fri, Apr 26, 4:46 AM · Patch-For-Review, Operations, Epic, cloud-services-team (Kanban)
Andrew closed T221049: relocate/reimage cloudvirt1005 with 10G interfaces, a subtask of T216195: Move cloudvirt hosts to 10Gb ethernet, as Resolved.
Fri, Apr 26, 4:46 AM · ops-eqiad, DC-Ops, Operations, Epic, cloud-services-team (Kanban)
Andrew closed T221048: relocate/reimage cloudvirt1006 with 10G interfaces as Resolved.
Fri, Apr 26, 4:46 AM · Patch-For-Review, Operations, Epic, cloud-services-team (Kanban)
Andrew closed T221048: relocate/reimage cloudvirt1006 with 10G interfaces, a subtask of T216195: Move cloudvirt hosts to 10Gb ethernet, as Resolved.
Fri, Apr 26, 4:46 AM · ops-eqiad, DC-Ops, Operations, Epic, cloud-services-team (Kanban)
Andrew updated the task description for T221049: relocate/reimage cloudvirt1005 with 10G interfaces.
Fri, Apr 26, 4:46 AM · Patch-For-Review, Operations, Epic, cloud-services-team (Kanban)
Andrew updated the task description for T221048: relocate/reimage cloudvirt1006 with 10G interfaces.
Fri, Apr 26, 4:46 AM · Patch-For-Review, Operations, Epic, cloud-services-team (Kanban)

Thu, Apr 25

Andrew created P8440 killallpods.sh.
Thu, Apr 25, 10:58 PM
Andrew reassigned T221818: Decommission labnet1001 & labnet1002 from Andrew to RobH.
Thu, Apr 25, 10:50 PM · ops-eqiad, decommission, Operations
Andrew reassigned T221817: Decommission labcontrol1001 & labcontrol1002 from Andrew to RobH.

@RobH, I'm supposed to assign decom hosts to you at this point, right?

Thu, Apr 25, 10:49 PM · ops-eqiad, decommission, Operations
Andrew updated the task description for T221817: Decommission labcontrol1001 & labcontrol1002.
Thu, Apr 25, 10:48 PM · ops-eqiad, decommission, Operations
Andrew updated the task description for T221818: Decommission labnet1001 & labnet1002.
Thu, Apr 25, 9:38 PM · ops-eqiad, decommission, Operations
Andrew added a comment to T221063: Request creation of Wikimedia developer account to join existing VPS project.

To do this I need:

Thu, Apr 25, 9:16 PM · Soweego, Wikidata, cloud-services-team, Cloud-VPS
Andrew added a comment to T221721: Puppet broken on several vms in toolsbeta.

I hand-edited resolv.conf on these hosts so that they will survive the upcoming nameserver change.

Thu, Apr 25, 7:27 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew reassigned T220144: Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet from Andrew to RobH.
Thu, Apr 25, 4:37 PM · Patch-For-Review, Operations, decommission, Data-Services, cloud-services-team (Kanban)
Andrew renamed T221817: Decommission labcontrol1001 & labcontrol1002 from Reclaim/Decommission labcontrol1001, 1002 to Decommission labcontrol1001, 1002.
Thu, Apr 25, 4:27 PM · ops-eqiad, decommission, Operations
Andrew renamed T221818: Decommission labnet1001 & labnet1002 from Reclaim/Decommission labnet1001, 1002 to Decommission labnet1001, 1002.
Thu, Apr 25, 4:27 PM · ops-eqiad, decommission, Operations