Andrew (Andrew Bogott)
User

Projects (9)

User Details

User Since
Nov 2 2014, 11:35 PM (154 w, 2 d)
Availability
Available
IRC Nick
andrewbogott
LDAP User
Unknown
MediaWiki User
Andrewbogott

Recent Activity

Today

Andrew updated subscribers of T178510: Upgrade puppetmaster on toolsbeta and test.

cc: everyone who has been active in the toolsbeta project

Wed, Oct 18, 3:53 PM · cloud-services-team
Andrew edited projects for T178508: Update VPS puppetmasters to 4.8 or newer, added: cloud-services-team (Kanban); removed cloud-services-team.
Wed, Oct 18, 3:46 PM · cloud-services-team (Kanban)
Andrew created T178510: Upgrade puppetmaster on toolsbeta and test.
Wed, Oct 18, 3:45 PM · cloud-services-team
Andrew added a comment to T178508: Update VPS puppetmasters to 4.8 or newer.

The future parser has few complaints, so we're ready to move on to actual upgrade testing.

Wed, Oct 18, 3:44 PM · cloud-services-team (Kanban)
Andrew created T178508: Update VPS puppetmasters to 4.8 or newer.
Wed, Oct 18, 3:42 PM · cloud-services-team (Kanban)

Yesterday

Andrew added a comment to T177944: k8s nodes sometimes getting bad token value from hiera.

Current theory is that this happens when the labs-private repo is in the process of being rebased.

Tue, Oct 17, 9:29 PM · Toolforge

Mon, Oct 16

Andrew closed T178052: pagetranslation log_type missing on replicas as Resolved.

All set.

Mon, Oct 16, 3:18 PM · Patch-For-Review, Security-Team, DBA, Data-Services

Thu, Oct 12

Andrew added a comment to T97081: toolsbeta: set up puppet-compiler / temporary-apply.

Here's my latest attempt to describe what works. Once the concerned patches are merged I'll try to get this down on wikitech someplace.

Thu, Oct 12, 9:01 PM · Patch-For-Review, cloud-services-team (Kanban), puppet-compiler, Toolforge
Andrew closed T178082: Shutdown ocg Cloud VPS project? as Resolved.

Horizon can't quite delete everything yet, so I generally delete everything that Horizon can see first and then use the 'delete' link in wikitech. Horizon is /close/ to being able to do everything but it needs a bit of work.

Thu, Oct 12, 5:57 PM · OfflineContentGenerator, Cloud-VPS
Andrew added a comment to T178052: pagetranslation log_type missing on replicas.

It looks to me like this is filtered in maintain-views.yaml via logging_whitelist. However, I don't see that 'pagetranslation' has ever been in that list (or at least not since 2016-10-12, which is when the history becomes murky).

Thu, Oct 12, 2:14 PM · Patch-For-Review, Security-Team, DBA, Data-Services
Andrew triaged T177855: Difficulty applying profile class parameters in Horizon interface as Normal priority.
Thu, Oct 12, 2:06 PM · cloud-services-team (Kanban), Horizon

Wed, Oct 11

Andrew created T177959: Should VPS puppetmasters include labs-ns0/ns-1 in their resolv.confs?.
Wed, Oct 11, 3:53 PM · cloud-services-team (Kanban)
Andrew closed T177944: k8s nodes sometimes getting bad token value from hiera as Resolved.

When I refreshed puppet on the affected host, it included this diff:

Wed, Oct 11, 3:23 PM · Toolforge

Tue, Oct 10

Andrew created T177880: Automatically run maintain-views and maintain-meta_p when config changes on cloud replicas.
Tue, Oct 10, 8:21 PM · cloud-services-team (Kanban), Data-Services
Andrew added a comment to T167114: Open view for term_full_entity_id in wb_terms table in labs.

ok -- I was expecting this table to be present in enwiki. If it's wikidata-specific then we're probably done. @Ladsgroup can you confirm?

Tue, Oct 10, 7:34 PM · cloud-services-team (Kanban), Data-Services, User-Ladsgroup, Wikidata-Sprint, Wikidata
Andrew updated subscribers of T167114: Open view for term_full_entity_id in wb_terms table in labs.

I've run maintain-views, but the wb_terms table isn't getting replicated at all. I don't see any evidence of filtering in the sanitarium files but I may be looking in the wrong place... @Marostegui, any ideas?

Tue, Oct 10, 7:02 PM · cloud-services-team (Kanban), Data-Services, User-Ladsgroup, Wikidata-Sprint, Wikidata
Andrew added a comment to T177299: Revert temporary increase of floating-ip quota for 'cvn' project (trusty to debian migration).

ok! I've raised the quota to 4 IPs. Let's leave this task open and you can nudge me when you're ready to clean up.

Tue, Oct 10, 5:40 PM · Cloud-VPS (Quota-requests)
Andrew renamed T177299: Revert temporary increase of floating-ip quota for 'cvn' project (trusty to debian migration) from Temporary increase of floating-ip quota for 'cvn' project (trusty to debian migration) to Revert temporary increase of floating-ip quota for 'cvn' project (trusty to debian migration).
Tue, Oct 10, 5:39 PM · Cloud-VPS (Quota-requests)
Andrew closed T177500: Request creation of MWStake VPS project as Resolved.

We don't support CamelCase in project names, so I've created a project called 'mwstake'. @MarkAHershberger is a project admin and can add other users or admins as needed.

Tue, Oct 10, 5:37 PM · Cloud-VPS (Project-requests)
Andrew claimed T177299: Revert temporary increase of floating-ip quota for 'cvn' project (trusty to debian migration).
Tue, Oct 10, 3:40 PM · Cloud-VPS (Quota-requests)
Andrew added a comment to T177299: Revert temporary increase of floating-ip quota for 'cvn' project (trusty to debian migration).

Do you already have RAM/CPU quota to create the additional instances? Is it really just the IPs that are holding you back?

Tue, Oct 10, 3:39 PM · Cloud-VPS (Quota-requests)
Andrew claimed T177500: Request creation of MWStake VPS project.

Approved, will do shortly

Tue, Oct 10, 3:37 PM · Cloud-VPS (Project-requests)
Andrew created T177851: Page if fullstack test fails more than once in a row.
Tue, Oct 10, 3:35 PM · cloud-services-team (Kanban)
Andrew created T177850: Page if the grid engine master is unreachable.
Tue, Oct 10, 3:35 PM · monitoring, Toolforge, cloud-services-team (Kanban)
Andrew closed T177834: Wikimedia Cloud (labs) dns is intermittently failing as Resolved.

This seems to have been caused by https://gerrit.wikimedia.org/r/#/c/382415/, which has now been reverted.

Tue, Oct 10, 2:03 PM · Operations, Cloud-Services

Mon, Oct 9

Andrew added a comment to T177427: Remove non-interactive bots from #wikimedia-cloud.

I don't mind the idea of unbreak now still showing up via wikibugs. [in the main channel]

Is that possible? +1 if so!

Mon, Oct 9, 5:57 PM · Patch-For-Review, Cloud-Services, cloud-services-team (Kanban)

Fri, Oct 6

Andrew added a comment to T97081: toolsbeta: set up puppet-compiler / temporary-apply.

I'm trying to reproduce the tools puppet compiler described here. A few things have clearly changed since this was last built... The hiera setup I seem to need looks like this:

Fri, Oct 6, 6:21 PM · Patch-For-Review, cloud-services-team (Kanban), puppet-compiler, Toolforge

Thu, Oct 5

Andrew added a comment to T177450: Not all content is getting replicated to wikitech-static.

I ran the export and import by hand just now, and I think we're getting the complete wiki.

Thu, Oct 5, 8:32 PM · cloud-services-team (Kanban), wikitech.wikimedia.org
Andrew added a comment to T177427: Remove non-interactive bots from #wikimedia-cloud.

I've directed shinken-wm to talk in #wikimedia-cloud-feed.

Thu, Oct 5, 8:11 PM · Patch-For-Review, Cloud-Services, cloud-services-team (Kanban)
mmodell awarded T177427: Remove non-interactive bots from #wikimedia-cloud a Like token.
Thu, Oct 5, 7:34 PM · Patch-For-Review, Cloud-Services, cloud-services-team (Kanban)
Andrew renamed T167973: Move wikitech and labstestwiki to s3 from move wikitech and labstestwiki to s3 (needs discussion) to move wikitech and labstestwiki to s3.
Thu, Oct 5, 4:03 PM · Data-Services, wikitech.wikimedia.org, cloud-services-team, DBA
Andrew added a comment to T177450: Not all content is getting replicated to wikitech-static.

A --current dump is 8.6M, a --full dump is 7.2G. So doing --full may not be practical.

Thu, Oct 5, 3:56 PM · cloud-services-team (Kanban), wikitech.wikimedia.org
Andrew closed T176090: wikitech-static sync failing as Resolved.
Thu, Oct 5, 3:55 PM · MW-1.30-release-notes (WMF-deploy-2017-09-19 (1.30.0-wmf.19)), MediaWiki-Maintenance-scripts, Operations
Andrew added a comment to T167973: Move wikitech and labstestwiki to s3.

Is there anything I can do to nudge this along, short of 'clone Jaime'?

Thu, Oct 5, 2:45 PM · Data-Services, wikitech.wikimedia.org, cloud-services-team, DBA
Andrew closed T167820: rack/setup/install labweb100[12].wikimedia.org as Resolved.

These boxes are up and installed and seem ok. Actual service implementation is T168470

Thu, Oct 5, 2:34 PM · Patch-For-Review, Cloud-Services, Operations
Andrew added a comment to T176757: CamelCase vs. VPS instance naming.

Adding a regex validation to the instance name in Horizon turns out to be non-trivial in the current version.

Thu, Oct 5, 2:33 PM · cloud-services-team (Kanban)
Andrew closed T170492: figure out if nodepool is overwhelming rabbitmq and/or nova as Resolved.

rabbit is now much quieter, so this is /maybe/ better. Closing for now, optimistically.

Thu, Oct 5, 2:32 PM · cloud-services-team (Kanban), Release-Engineering-Team (Watching / External), Nodepool, Cloud-VPS, Continuous-Integration-Infrastructure, Patch-For-Review
Andrew added a comment to T177450: Not all content is getting replicated to wikitech-static.

Wikitech is dumped using

Thu, Oct 5, 12:29 PM · cloud-services-team (Kanban), wikitech.wikimedia.org
Andrew created T177450: Not all content is getting replicated to wikitech-static.
Thu, Oct 5, 12:11 AM · cloud-services-team (Kanban), wikitech.wikimedia.org

Wed, Oct 4

Andrew created T177443: Missing .deb dependencies for appserver on Stretch.
Wed, Oct 4, 10:20 PM · User-Elukey, HHVM, Operations
Andrew closed T176044: Replace kernel and reboot labvirt1015, 1016, 1017, 1018 as Resolved.

Every labvirt is now running Linux labvirt1008 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Wed, Oct 4, 7:42 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew created T177427: Remove non-interactive bots from #wikimedia-cloud.
Wed, Oct 4, 5:42 PM · Patch-For-Review, Cloud-Services, cloud-services-team (Kanban)

Tue, Oct 3

mmodell awarded Blog Post: New dedicated puppetmasters for cloud instances a Cookie token.
Tue, Oct 3, 11:23 PM · Cloud-VPS
Andrew closed T177293: Create a general-purpose labs download server for big files as Resolved.

To add new files, copy them to download-01.download.eqiad.wmlabs:/srv/public_files/

Tue, Oct 3, 10:46 PM · Patch-For-Review, Huggle, Cloud-Services
Andrew closed T177293: Create a general-purpose labs download server for big files, a subtask of T177145: Huggle development environment - portable virtual box, as Resolved.
Tue, Oct 3, 10:46 PM · Huggle, Cloud-Services
Andrew added a comment to T177145: Huggle development environment - portable virtual box.

In order to keep big files off of NFS, I've created a static download site for things like this. Your file, for example, is now:

Tue, Oct 3, 10:44 PM · Huggle, Cloud-Services
Andrew updated subscribers of T177279: Request increased quota for webperf labs project.

Approved! @chasemp will help with the specifics of performance testing.

Tue, Oct 3, 6:41 PM · Performance-Team (Radar), Cloud-VPS (Quota-requests)
Andrew reopened T177218: New function to reset password without viewing email as "Open".

Closed by accident or vandalism

Tue, Oct 3, 5:10 PM · Trash
Andrew reopened T177293: Create a general-purpose labs download server for big files as "Open".

Closed in error, best I can tell

Tue, Oct 3, 5:08 PM · Patch-For-Review, Huggle, Cloud-Services
Andrew reopened T177293: Create a general-purpose labs download server for big files, a subtask of T177145: Huggle development environment - portable virtual box, as Open.
Tue, Oct 3, 5:08 PM · Huggle, Cloud-Services
Andrew created T177293: Create a general-purpose labs download server for big files.
Tue, Oct 3, 3:45 PM · Patch-For-Review, Huggle, Cloud-Services

Sun, Oct 1

Andrew added a comment to T177164: puppet-phabricator and gerrit-test3 have gone down.

https://wikitech.wikimedia.org/wiki/Incident_documentation/20171001-labvirt1015

Sun, Oct 1, 11:00 PM · Cloud-VPS
Andrew added a comment to T171473: labvirt1015 crashes.

Here's the latest mcelog. Without timestamps it's hard to correlate this to the failures but still seems bad.

Sun, Oct 1, 10:19 PM · cloud-services-team (Kanban), DC-Ops, Operations, ops-eqiad
Andrew added a comment to T171473: labvirt1015 crashes.

The last syslog before reboot was at Oct 1 01:21:01. It was down for many hours and didn't page because I downtimed it during the hardware replacement and didn't clear the downtime before putting it back into service :( There's nothing in the syslog or kernel log to indicate distress.

Sun, Oct 1, 10:05 PM · cloud-services-team (Kanban), DC-Ops, Operations, ops-eqiad

Fri, Sep 29

Andrew updated the post content for Blog Post: Automated OpenStack Testing, now with charts and graphs.
Fri, Sep 29, 11:04 PM · cloud-services-team
bd808 awarded Blog Post: Automated OpenStack Testing, now with charts and graphs a The World Burns token.
Fri, Sep 29, 10:06 PM · cloud-services-team
Andrew renamed Blog Post: Automated OpenStack Testing, now with charts and graphs blog post from Automated openstack testing, now with charts and graphs to Automated OpenStack Testing, now with charts and graphs.
Fri, Sep 29, 9:33 PM · cloud-services-team
Andrew updated the post content for Blog Post: Automated OpenStack Testing, now with charts and graphs.
Fri, Sep 29, 9:31 PM · cloud-services-team
Andrew created Blog Post: Automated OpenStack Testing, now with charts and graphs.
Fri, Sep 29, 9:26 PM · cloud-services-team
Andrew triaged T176757: CamelCase vs. VPS instance naming as Normal priority.
Fri, Sep 29, 9:01 PM · cloud-services-team (Kanban)
Andrew added a comment to T176044: Replace kernel and reboot labvirt1015, 1016, 1017, 1018.

I've rebuilt labvirt1015, 1017 and 1018 (and the labtestvirts) with 4.4.0-81. So now all of our virt nodes are running that kernel except for 1016, which needs an evacuation before I mess with it.

Fri, Sep 29, 8:50 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew closed T175029: rabbitmq: Consume and log messages sent to notifications.error as Resolved.
Fri, Sep 29, 8:38 PM · Patch-For-Review, cloud-services-team (Kanban), Release-Engineering-Team (Watching / External), Nodepool, Cloud-VPS, Continuous-Integration-Infrastructure
Andrew closed T175029: rabbitmq: Consume and log messages sent to notifications.error, a subtask of T170492: figure out if nodepool is overwhelming rabbitmq and/or nova, as Resolved.
Fri, Sep 29, 8:38 PM · cloud-services-team (Kanban), Release-Engineering-Team (Watching / External), Nodepool, Cloud-VPS, Continuous-Integration-Infrastructure, Patch-For-Review
Andrew created P6060 VMs on labvirt1016.
Fri, Sep 29, 8:26 PM
Andrew closed T167556: Define a metric to track OpenStack system availability as Resolved.
Fri, Sep 29, 3:25 PM · Patch-For-Review, Goal, cloud-services-team (FY2017-18)
Andrew closed T167556: Define a metric to track OpenStack system availability, a subtask of T166396: Program 1 Outcome 4: VPS hosting, as Resolved.
Fri, Sep 29, 3:25 PM · cloud-services-team (FY2017-18), Goal
Andrew placed T164290: Set up external DNS record for wikitech-static up for grabs.
Fri, Sep 29, 3:25 PM · Operations, cloud-services-team (Kanban), Cloud-Services, wikitech.wikimedia.org
Andrew closed T115194: Some labs instances IP have multiple PTR entries in DNS as Resolved.

This is as fixed as it's going to be. Any time there's a designate outage I need to run the dnsleaks script to clean up.

Fri, Sep 29, 3:09 PM · Patch-For-Review, Wikimedia-Incident, Cloud-VPS, Operations, Cloud-Services
Andrew closed T170447: Set good availability-zone defaults for nova users as Resolved.

An equivalent to 375941 was merged as part of a larger refactor, and I'm pretty sure this is adequate for the problem.

Fri, Sep 29, 2:49 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
Andrew triaged T175029: rabbitmq: Consume and log messages sent to notifications.error as Normal priority.
Fri, Sep 29, 2:47 PM · Patch-For-Review, cloud-services-team (Kanban), Release-Engineering-Team (Watching / External), Nodepool, Cloud-VPS, Continuous-Integration-Infrastructure
Andrew closed T176645: lots of cloud-local puppetmasters broken as Resolved.

There are only a few left that are broken, and I've emailed all the owners.

Fri, Sep 29, 1:55 PM · cloud-services-team (Kanban)

Tue, Sep 26

Andrew added a comment to T176757: CamelCase vs. VPS instance naming.

I've confirmed that nova detects name collisions between 'camelcase' and 'CamelCase'. So this isn't especially urgent. There's still a potential race if the users get really lucky and create overlapping named instances at exactly the same time.

Tue, Sep 26, 7:27 PM · cloud-services-team (Kanban)
Andrew created T176757: CamelCase vs. VPS instance naming.
Tue, Sep 26, 2:59 PM · cloud-services-team (Kanban)

Mon, Sep 25

Andrew triaged T176645: lots of cloud-local puppetmasters broken as High priority.
Mon, Sep 25, 3:31 PM · cloud-services-team (Kanban)
Andrew added a project to T176645: lots of cloud-local puppetmasters broken: cloud-services-team (Kanban).
Mon, Sep 25, 3:31 PM · cloud-services-team (Kanban)
Andrew created T176645: lots of cloud-local puppetmasters broken.
Mon, Sep 25, 3:31 PM · cloud-services-team (Kanban)
Andrew created T176632: Remove salt master (and related packages) from labcontrol1001.
Mon, Sep 25, 1:47 PM · cloud-services-team (Kanban), Goal, Technical-Debt, Operations-Software-Development, Operations
Andrew moved T175846: Request creation of Zppix-Wiki-AI VPS project from Inbox to Discussion needed on the Cloud-VPS (Project-requests) board.
Mon, Sep 25, 3:34 AM · User-Zppix, Cloud-VPS (Project-requests)
Andrew added a comment to T175846: Request creation of Zppix-Wiki-AI VPS project.

which due to requirements of mw-vagrant isn't possible for me

Mon, Sep 25, 1:18 AM · User-Zppix, Cloud-VPS (Project-requests)

Fri, Sep 22

Andrew added a comment to T176437: puppet ca_server confusion.

As far as I can see, the docs only describe setting ca_server once, for agents, in the [main] block. I am missing an explanation of why we would set it twice, and what setting it in [agent] does vs. what setting it in [master] does.

Fri, Sep 22, 2:13 PM · cloud-services-team (Kanban), Operations
Andrew closed T165555: nova-fullstack is losing instances on creation as Resolved.

This has almost totally stopped happening; when it does happen it's usually for a good (but new) reason. So I don't think this bug itself is useful anymore.

Fri, Sep 22, 2:18 AM · Patch-For-Review, cloud-services-team (Kanban), Cloud-Services

Thu, Sep 21

Andrew created T176437: puppet ca_server confusion.
Thu, Sep 21, 8:19 PM · cloud-services-team (Kanban), Operations
Andrew closed T176381: Grant tool "tool-db-usage" (s53508) ability to read full table_schema.TABLES contents as Resolved.

Done via GRANT SHOW VIEW ON *.* TO 's53508'@'%' on labsdb1001 and labsdb1003

Thu, Sep 21, 4:04 PM · DBA, Data-Services

Wed, Sep 20

Andrew renamed T90784: Monitor nova-scheduler log for lost contact with compute nodes from Monitor nova services to Monitor nova-scheduler log for lost contact with compute nodes.
Wed, Sep 20, 9:23 PM · labs-sprint-118, labs-sprint-117, labs-sprint-116, Labs-Sprint-109, Patch-For-Review, monitoring, Cloud-Services
Andrew added a comment to T90784: Monitor nova-scheduler log for lost contact with compute nodes.

This is modestly different, but needs to be retitled. T42022 is about public http APIs, this is about internal services which can break despite the public APIs functioning.

Wed, Sep 20, 9:23 PM · labs-sprint-118, labs-sprint-117, labs-sprint-116, Labs-Sprint-109, Patch-For-Review, monitoring, Cloud-Services
Andrew added a comment to T167556: Define a metric to track OpenStack system availability.

I've added fullstack success % to the above graph. We still need to add some auto-cleanup functions to the fullstack test to keep accurate numbers.

Wed, Sep 20, 3:25 PM · Patch-For-Review, Goal, cloud-services-team (FY2017-18)

Tue, Sep 19

Andrew added a comment to T176044: Replace kernel and reboot labvirt1015, 1016, 1017, 1018.

I've moved labvirt1018 to 4.4.0-83 but can't reproduce this issue.

Tue, Sep 19, 10:13 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew added a comment to T175846: Request creation of Zppix-Wiki-AI VPS project.

Is this something that could be done within the existing ores project?

Tue, Sep 19, 6:16 PM · User-Zppix, Cloud-VPS (Project-requests)
Andrew added a comment to T176090: wikitech-static sync failing.

@Reedy definitely no need to cherry-pick if this is getting pushed out today :)

Tue, Sep 19, 2:39 PM · MW-1.30-release-notes (WMF-deploy-2017-09-19 (1.30.0-wmf.19)), MediaWiki-Maintenance-scripts, Operations
Andrew added a comment to T176044: Replace kernel and reboot labvirt1015, 1016, 1017, 1018.

Was network connectivity lost to the server at large or to the VMs running on that labvirt instance?

Tue, Sep 19, 2:34 PM · Patch-For-Review, cloud-services-team (Kanban)

Sep 17 2017

Andrew added a comment to T176090: wikitech-static sync failing.

@Reedy I'm on holiday and so only got as far as seeing that that one use-case produces the problem. I don't know immediately how to find all the mismatches, although there are quite a few:

Sep 17 2017, 8:41 PM · MW-1.30-release-notes (WMF-deploy-2017-09-19 (1.30.0-wmf.19)), MediaWiki-Maintenance-scripts, Operations
Krenair awarded T176090: wikitech-static sync failing a Manufacturing Defect? token.
Sep 17 2017, 6:46 PM · MW-1.30-release-notes (WMF-deploy-2017-09-19 (1.30.0-wmf.19)), MediaWiki-Maintenance-scripts, Operations
Andrew added a project to T176090: wikitech-static sync failing: Operations.
Sep 17 2017, 5:43 PM · MW-1.30-release-notes (WMF-deploy-2017-09-19 (1.30.0-wmf.19)), MediaWiki-Maintenance-scripts, Operations
Andrew added a comment to T176090: wikitech-static sync failing.

That file is produced on silver via this command:

Sep 17 2017, 5:42 PM · MW-1.30-release-notes (WMF-deploy-2017-09-19 (1.30.0-wmf.19)), MediaWiki-Maintenance-scripts, Operations
Andrew created T176090: wikitech-static sync failing.
Sep 17 2017, 5:41 PM · MW-1.30-release-notes (WMF-deploy-2017-09-19 (1.30.0-wmf.19)), MediaWiki-Maintenance-scripts, Operations

Sep 16 2017

Andrew added a comment to T176044: Replace kernel and reboot labvirt1015, 1016, 1017, 1018.

I re-imaged labvirt1015 and 1017. They're now running 4.4.0-93-generic and rebooting fine. Do I know what just happened here? I do not.

Sep 16 2017, 4:27 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew added a comment to T176044: Replace kernel and reboot labvirt1015, 1016, 1017, 1018.

I just upgraded labvirt1015 and 1017 to -93 and rebooted, and both lost network config just like we saw with -83. So something very bad is going on here. I'm going to re-image 1015 and see where I get.

Sep 16 2017, 3:50 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew triaged T176044: Replace kernel and reboot labvirt1015, 1016, 1017, 1018 as High priority.
Sep 16 2017, 3:23 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew created T176044: Replace kernel and reboot labvirt1015, 1016, 1017, 1018.
Sep 16 2017, 3:23 PM · Patch-For-Review, cloud-services-team (Kanban)

Sep 14 2017

Andrew added a comment to T167556: Define a metric to track OpenStack system availability.

I have some api uptime stats at https://grafana.wikimedia.org/dashboard/db/wmcs-api-uptimes?orgId=1

Sep 14 2017, 10:01 PM · Patch-For-Review, Goal, cloud-services-team (FY2017-18)