
Andrew (Andrew Bogott)
User

Projects (13)

User Details

User Since
Nov 2 2014, 11:35 PM (317 w, 2 d)
Availability
Available
IRC Nick
andrewbogott
LDAP User
Unknown
MediaWiki User
Andrewbogott [ Global Accounts ]

Recent Activity

Today

Andrew updated the task description for T268746: [ceph] cloudcephosd1004-1015 think that their hard drives are HDD when they are SSD.
Wed, Dec 2, 9:15 PM · DC-Ops, cloud-services-team (Kanban)
Andrew updated the task description for T268746: [ceph] cloudcephosd1004-1015 think that their hard drives are HDD when they are SSD.
Wed, Dec 2, 6:46 PM · DC-Ops, cloud-services-team (Kanban)
Andrew added a comment to T269019: Request creation of toolhub VPS project.

+1 approved in today's meeting

Wed, Dec 2, 4:45 PM · cloud-services-team (Kanban), Toolhub, Cloud-VPS (Project-requests)
Andrew added a comment to T267433: Enable support for nested VMs.

I still need to merge and implement this but it should be straightforward -- ping me if you don't hear back by the end of the week.

Wed, Dec 2, 4:37 PM · cloud-services-team (Kanban), Cloud-VPS, Patch-For-Review
Andrew added a comment to T269252: add logrotation for haproxy logs on cloudcontrols.

Brooke says to look at how it's configured for k8s on Toolforge.

Wed, Dec 2, 4:35 PM · cloud-services-team (Kanban)
Andrew placed T269252: add logrotation for haproxy logs on cloudcontrols up for grabs.
Wed, Dec 2, 4:34 PM · cloud-services-team (Kanban)
Andrew created T269252: add logrotation for haproxy logs on cloudcontrols.
Wed, Dec 2, 4:33 PM · cloud-services-team (Kanban)
Andrew updated the task description for T268746: [ceph] cloudcephosd1004-1015 think that their hard drives are HDD when they are SSD.
Wed, Dec 2, 4:47 AM · DC-Ops, cloud-services-team (Kanban)

Yesterday

Andrew added a comment to T266187: relocate/reimage cloudvirt1025 with 10G interfaces.

I'm unable to PXE-boot this host. It doesn't display much of anything; it just hangs for a while and then fails over to the HDD.

Tue, Dec 1, 11:22 PM · cloud-services-team (Hardware), ops-eqiad, DC-Ops, Operations
Andrew updated the task description for T268746: [ceph] cloudcephosd1004-1015 think that their hard drives are HDD when they are SSD.
Tue, Dec 1, 10:14 PM · DC-Ops, cloud-services-team (Kanban)
Andrew added a comment to T268746: [ceph] cloudcephosd1004-1015 think that their hard drives are HDD when they are SSD.

ATA device, with non-removable media
Model Number: MTFDDAK1T9TDN
Serial Number: 19472511BD26
Firmware Revision: D1DF003
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
Used: unknown (minor revision code 0x006d)
Supported: 10 9 8 7 6 5
Likely used: 10
Configuration:
Logical max current
cylinders 16383 0
heads 16 0
sectors/track 63 0

LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 3750748848
Logical Sector size: 512 bytes
Physical Sector size: 4096 bytes
Logical Sector-0 offset: 0 bytes
device size with M = 1024*1024: 1831420 MBytes
device size with M = 1000*1000: 1920383 MBytes (1920 GB)
cache/buffer size = unknown
Form Factor: 2.5 inch
Nominal Media Rotation Rate: Solid State Device
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, with device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
Advanced power management level: 254
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
	     Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
Enabled Supported:
	   *	SMART feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	   *	Advanced Power Management feature set
	   *	48-bit Address feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	64-bit World wide name
	   *	IDLE_IMMEDIATE with UNLOAD
	    	Write-Read-Verify feature set
	   *	WRITE_UNCORRECTABLE_EXT command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	    	unknown 119[6]
	    	unknown 119[8]
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Gen3 signaling speed (6.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Phy event counters
	   *	NCQ priority information
	   *	READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
	   *	DMA Setup Auto-Activate optimization
	   *	Software settings preservation
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Write Same (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	   *	SANITIZE_ANTIFREEZE_LOCK_EXT command
	   *	SANITIZE feature set
	   *	CRYPTO_SCRAMBLE_EXT command
	   *	BLOCK_ERASE_EXT command
	   *	DOWNLOAD MICROCODE DMA command
	   *	WRITE BUFFER DMA command
	   *	READ BUFFER DMA command
	   *	Data Set Management TRIM supported (limit 8 blocks)
	   *	Deterministic read ZEROs after TRIM

Logical Unit WWN Device Identifier: 500a07512511bd26
NAA : 5
IEEE OUI : 00a075
Unique ID : 12511bd26
Checksum: correct
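
To compare what the drive itself reports with what the OS or Ceph believes, one option is to pull the `Nominal Media Rotation Rate` line out of `hdparm -I` output like the dump above. The following is a minimal sketch only: the function name is hypothetical, and treating a purely numeric value as a spinning disk is an assumption about how rotating drives fill in this field.

```python
import re


def media_type_from_hdparm(text: str) -> str:
    """Classify a drive as 'ssd', 'hdd', or 'unknown' from `hdparm -I` text."""
    match = re.search(r"Nominal Media Rotation Rate:\s*(.+)", text)
    if match is None:
        return "unknown"
    value = match.group(1).strip()
    if value == "Solid State Device":
        return "ssd"
    # Assumption: rotating drives report a plain RPM number here.
    if value.isdigit():
        return "hdd"
    return "unknown"


# Two lines copied from the dump above.
sample = (
    "Form Factor: 2.5 inch\n"
    "Nominal Media Rotation Rate: Solid State Device\n"
)
print(media_type_from_hdparm(sample))  # prints: ssd
```

If this says `ssd` while the kernel or Ceph classifies the same device as rotational, the mismatch is being introduced somewhere above the drive, which fits the RAID controller theory.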

Tue, Dec 1, 10:08 PM · DC-Ops, cloud-services-team (Kanban)
Andrew updated the task description for T268746: [ceph] cloudcephosd1004-1015 think that their hard drives are HDD when they are SSD.
Tue, Dec 1, 9:27 PM · DC-Ops, cloud-services-team (Kanban)
Andrew updated the task description for T268746: [ceph] cloudcephosd1004-1015 think that their hard drives are HDD when they are SSD.
Tue, Dec 1, 8:51 PM · DC-Ops, cloud-services-team (Kanban)
Andrew updated the task description for T268746: [ceph] cloudcephosd1004-1015 think that their hard drives are HDD when they are SSD.
Tue, Dec 1, 8:38 PM · DC-Ops, cloud-services-team (Kanban)
Andrew assigned T268746: [ceph] cloudcephosd1004-1015 think that their hard drives are HDD when they are SSD to RobH.

I've moved the workload off of cloudcephosd1015.eqiad.wmnet so we can experiment. For starters, @RobH is going to upgrade the firmware (including the RAID controller), boot back to the OS, and then we'll see what it looks like. If we need to reinstall the OS for it to re-detect the drives, that's also fine.

Tue, Dec 1, 4:06 AM · DC-Ops, cloud-services-team (Kanban)

Thu, Nov 26

Andrew added a comment to T268786: ceph pg 6.91 inconsistent.

Just now, a repeat of this:

Thu, Nov 26, 3:33 AM · Cloud-VPS, cloud-services-team (Kanban)

Wed, Nov 25

Andrew added a comment to T268746: [ceph] cloudcephosd1004-1015 think that their hard drives are HDD when they are SSD.

<_dcaro> David Caro hmmm... the new servers for ceph (in codfw) have the same brand of disks (a bit smaller size), but they are detected correctly, I think it might be the RAID controller on the other ones that's messing things up

Wed, Nov 25, 6:30 PM · DC-Ops, cloud-services-team (Kanban)
Andrew updated subscribers of T266261: rearrange networking for cloudceph200[1-3]-dev and rename.

I'm a bit lost in the backscroll :) Did the second NICs get attached and assigned for all these? @dcaro points out that the kernel thinks they are disconnected:

Wed, Nov 25, 5:11 PM · cloud-services-team (Kanban)
Andrew claimed T268190: Custom Flavour for Wikidumpparse Cloud VPS project.

This is approved; someone will create the new flavor soon.

Wed, Nov 25, 4:31 PM · cloud-services-team (Kanban), Humaniki, Cloud-VPS (Quota-requests)
Andrew added a comment to T268746: [ceph] cloudcephosd1004-1015 think that their hard drives are HDD when they are SSD.

related tasks for this hardware: T251619, T242133

Wed, Nov 25, 3:14 PM · DC-Ops, cloud-services-team (Kanban)

Tue, Nov 24

Andrew added a comment to T262350: bad failure cases for wmcs custom puppet enc.

legoktm> Could you have an enabled: true hiera key or something that the ENC sets, and which Puppet will refuse to run without? That would distinguish a barf with no hieradata from an instance that simply has no extra hieradata.
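
One way to read this suggestion, sketched in Python rather than Puppet: the ENC always sets a sentinel key, so an empty or missing result can be told apart from a healthy instance that has no extra hieradata. The key name `enabled` comes from the comment; the function name and the dictionary shape are hypothetical and are not the actual WMCS ENC interface.

```python
def enc_data_is_trustworthy(hieradata, sentinel="enabled"):
    """Return True only if the ENC result carries the sentinel key set to True.

    A failed ENC lookup typically yields nothing (None or an empty dict),
    while a healthy instance with no extra hieradata still has the sentinel.
    """
    return isinstance(hieradata, dict) and hieradata.get(sentinel) is True


# ENC failed: nothing came back, so the run should be refused.
print(enc_data_is_trustworthy({}))
# ENC worked, and the instance has no extra hieradata.
print(enc_data_is_trustworthy({"enabled": True}))
```

The point of the design is that "no data" stops being a valid answer, so a broken ENC cannot silently produce a catalog that looks like a bare instance.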

Tue, Nov 24, 6:09 PM · cloud-services-team (Kanban)

Mon, Nov 23

Andrew renamed T266261: rearrange networking for cloudceph200[1-3]-dev and rename from rearrange networking for cloudceph200[1-3]-dev to rearrange networking for cloudceph200[1-3]-dev and rename.
Mon, Nov 23, 5:32 PM · cloud-services-team (Kanban)
Andrew added a comment to T258103: Repurpose labtestpuppetmaster2001.wikimedia.org as cloudcephmon2003-dev.codfw.wmnet.

It's idle, you can power it down whenever.

Mon, Nov 23, 3:23 PM · Patch-For-Review, cloud-services-team (Kanban)

Sun, Nov 22

Andrew added a comment to T261134: upgrade cloud-vps openstack to Openstack version 'Stein'.

Since the cloudcontrol nodes are already running Buster, I upgraded them to Stein just now. All seems well for the moment.

Sun, Nov 22, 6:01 PM · cloud-services-team (Kanban)
Andrew awarded T267967: set up bpo repos for OpenStack Stein and Debian Buster a Cookie token.
Sun, Nov 22, 5:59 PM · cloud-services-team (Kanban)
Andrew added a comment to T267967: set up bpo repos for OpenStack Stein and Debian Buster.

It works! For designate, at least.

Sun, Nov 22, 5:59 PM · cloud-services-team (Kanban)

Fri, Nov 20

Andrew reassigned T258103: Repurpose labtestpuppetmaster2001.wikimedia.org as cloudcephmon2003-dev.codfw.wmnet from Andrew to Papaul.

This needs to be moved to row B before it can connect to cloud-hosts1-b-codfw. Leaving that in @Papaul's hands for now.

Fri, Nov 20, 5:07 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew updated the task description for T258103: Repurpose labtestpuppetmaster2001.wikimedia.org as cloudcephmon2003-dev.codfw.wmnet.
Fri, Nov 20, 4:26 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew reassigned T266261: rearrange networking for cloudceph200[1-3]-dev and rename from Andrew to Papaul.

@Papaul, if you want to make the netbox/network changes, I can do the actual re-imaging. There's nothing happening on these boxes currently so you can break them whenever :)

Fri, Nov 20, 2:37 PM · cloud-services-team (Kanban)
Andrew updated the task description for T266261: rearrange networking for cloudceph200[1-3]-dev and rename.
Fri, Nov 20, 2:36 PM · cloud-services-team (Kanban)
Andrew updated the task description for T266261: rearrange networking for cloudceph200[1-3]-dev and rename.
Fri, Nov 20, 2:14 PM · cloud-services-team (Kanban)

Thu, Nov 19

Andrew created T268285: update RAID controller firmware on labstore1006, 1007.
Thu, Nov 19, 9:55 PM · ops-eqiad, cloud-services-team (Kanban), Operations
Andrew added a comment to T268280: labstore1006 spontaneous reboot.

This is the same failure. Not a very useful suggestion though:

Thu, Nov 19, 9:51 PM · cloud-services-team (Hardware)
Andrew added a comment to T268280: labstore1006 spontaneous reboot.

"an unrecoverable system error (NMI) has occurred. (Service Information: 0x00CC47F0, 0x00CC4AF0)"

Thu, Nov 19, 9:50 PM · cloud-services-team (Hardware)
Andrew added a comment to T268280: labstore1006 spontaneous reboot.

This is a little bit like https://support.hpe.com/hpesc/public/docDisplay?docId=a00037929en_us&docLocale=en_US

Thu, Nov 19, 9:32 PM · cloud-services-team (Hardware)
Andrew added a comment to T268280: labstore1006 spontaneous reboot.

Just before it died there were several things like this:

Thu, Nov 19, 9:30 PM · cloud-services-team (Hardware)
Andrew added a comment to T268175: central logging for OpenStack services.

There's an upstream catch-all for filtering openstack related services:

Thu, Nov 19, 5:29 PM · Patch-For-Review, cloud-services-team (Kanban), Cloud-VPS
Andrew closed T268176: Allow kafka network access to cloudvirt hosts as Resolved.

It works!

Thu, Nov 19, 3:02 PM · cloud-services-team (Kanban), Cloud-VPS
Andrew closed T268176: Allow kafka network access to cloudvirt hosts, a subtask of T268175: central logging for OpenStack services, as Resolved.
Thu, Nov 19, 3:02 PM · Patch-For-Review, cloud-services-team (Kanban), Cloud-VPS

Wed, Nov 18

Andrew created T268176: Allow kafka network access to cloudvirt hosts.
Wed, Nov 18, 8:36 PM · cloud-services-team (Kanban), Cloud-VPS
Andrew created T268175: central logging for OpenStack services.
Wed, Nov 18, 8:34 PM · Patch-For-Review, cloud-services-team (Kanban), Cloud-VPS
Andrew changed the status of T266198: Move labstore1004 and labstore1005 to 10G Ethernet from Open to Stalled.

This is stalled pending available 10G rackspace in eqiad. The Tetris game there is well underway.

Wed, Nov 18, 8:07 PM · cloud-services-team (Hardware), Epic, ops-eqiad, Data-Services, Operations
Andrew closed T261132: Move all cloud-vps VMs to Ceph, a subtask of T216195: Move cloudvirt hosts to 10Gb ethernet, as Resolved.
Wed, Nov 18, 7:51 PM · cloud-services-team (Hardware), ops-eqiad, DC-Ops, Operations, Epic
Andrew closed T261132: Move all cloud-vps VMs to Ceph, a subtask of T194334: [Epic] Modern Cloud VPS storage layer, as Resolved.
Wed, Nov 18, 7:51 PM · cloud-services-team (Kanban), Epic, Cloud-VPS
Andrew closed T261132: Move all cloud-vps VMs to Ceph, a subtask of T259399: Upgrade cloudvirts to Debian Buster, as Resolved.
Wed, Nov 18, 7:51 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
Andrew closed T261132: Move all cloud-vps VMs to Ceph as Resolved.

I'm closing this; upgrading and moving remaining hardware can be dealt with via different tasks.

Wed, Nov 18, 7:51 PM · cloud-services-team (Kanban)
Andrew closed T267499: Request increased quota for wikidumpparse Cloud VPS project as Resolved.
Wed, Nov 18, 6:33 PM · Cloud-VPS (Quota-requests)

Tue, Nov 17

Andrew claimed T266198: Move labstore1004 and labstore1005 to 10G Ethernet.
Tue, Nov 17, 9:52 PM · cloud-services-team (Hardware), Epic, ops-eqiad, Data-Services, Operations
Andrew added a comment to T266198: Move labstore1004 and labstore1005 to 10G Ethernet.

to validate the move, check 'drbd-overview' output before and after
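
A before/after check like this can be made mechanical by saving the command output twice and diffing the two captures. The sketch below is a minimal, hypothetical helper; the sample strings are placeholders and are not real `drbd-overview` output.

```python
import difflib


def changed_lines(before: str, after: str) -> list[str]:
    """Return only the lines that differ between two captured outputs."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff marks removed lines with "- " and added lines with "+ ".
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Placeholder captures standing in for the saved command output.
before = "res0 stateA\nres1 stateA"
after = "res0 stateA\nres1 stateB"

for line in changed_lines(before, after):
    print(line)
```

An empty result means the two captures match, which is the outcome to hope for after the move.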

Tue, Nov 17, 9:49 PM · cloud-services-team (Hardware), Epic, ops-eqiad, Data-Services, Operations
Andrew added a comment to T267378: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev.

If they have hardware RAID, then put all drives in one big RAID 10 and use partman recipe hwraid-1dev.cfg. If there is no hardware RAID then... I think raid10-4dev.cfg? It's hard for me to say: I'm not familiar with the hardware, and all the partman recipes have been rewritten since I last looked. Whatever is simple is fine with me; it's not very critical for these hosts.

Tue, Nov 17, 8:53 PM · Patch-For-Review, ops-codfw, Operations, DC-Ops
Andrew added a comment to T267378: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev.

We only need OS partitions for these. what OS partitions?

Tue, Nov 17, 6:53 PM · Patch-For-Review, ops-codfw, Operations, DC-Ops

Mon, Nov 16

Andrew updated subscribers of T267433: Enable support for nested VMs.

update: I'm still waiting to have codfw1dev working properly so I can test things there. Right now our puppetmasters there are messed up (probably my doing) and @jbond is looking at untangling the cert mess there.

Mon, Nov 16, 8:01 PM · cloud-services-team (Kanban), Cloud-VPS, Patch-For-Review
Andrew created T267967: set up bpo repos for OpenStack Stein and Debian Buster.
Mon, Nov 16, 7:52 PM · cloud-services-team (Kanban)
Andrew closed T216549: Hold back spare drives in all cloudvirts as Resolved.

This is no longer necessary; we aren't using local storage on any of these other than 1019 and 1020.

Mon, Nov 16, 7:36 PM · cloud-services-team (Kanban)
Andrew added a comment to T267378: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev.

In the future, please make those changes earlier. I have already applied the label on all the hosts, so now I have to go back and make those changes again.

Mon, Nov 16, 7:00 PM · Patch-For-Review, ops-codfw, Operations, DC-Ops
Andrew added a comment to T267378: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev.

Please note the hostname change -- these should be cloudcephmon2001-dev and cloudcephmon2002-dev.

Mon, Nov 16, 6:40 PM · Patch-For-Review, ops-codfw, Operations, DC-Ops
Andrew renamed T267378: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev from (Need By: TBD) rack/setup/install cloudcephmon200[12] to (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev.
Mon, Nov 16, 6:40 PM · Patch-For-Review, ops-codfw, Operations, DC-Ops
Andrew renamed T258103: Repurpose labtestpuppetmaster2001.wikimedia.org as cloudcephmon2003-dev.codfw.wmnet from Decide fate of labtestpuppetmaster2001.wikimedia.org to Repurpose labtestpuppetmaster2001.wikimedia.org as cloudcephmon2003-dev.codfw.wmnet.
Mon, Nov 16, 6:39 PM · Patch-For-Review, cloud-services-team (Kanban)
Andrew added a comment to T267378: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev.

I'm going to reuse an old puppetmaster as cloudcephmon2003 (T258103) -- does that server also need to be re-racked or can we just rename it in place?

Mon, Nov 16, 6:37 PM · Patch-For-Review, ops-codfw, Operations, DC-Ops
Andrew added a comment to T267935: wikitech: INSERT command denied to user 'wikiuser'@'10.64.32.36' for table 'comment' (10.64.0.98).

Nothing has changed that I know of. That IP is a prod mw server (mw1334.eqiad.wmnet); I've no idea why it would be trying to insert into the wikitech database.

Mon, Nov 16, 4:25 PM · wikitech.wikimedia.org, cloud-services-team (Kanban)

Wed, Nov 11

Andrew added a comment to T267478: Adoption request for wikilint.

I've added @Matanya as a maintainer; what became of @Tidoni? Should I add them as well?

Wed, Nov 11, 3:08 PM · Toolforge-standards-committee
Andrew added a comment to T267499: Request increased quota for wikidumpparse Cloud VPS project.

What is the correct way to use more disk-space, for a database context, exceeding the flavour-definition? I can imagine putting it on the NFS (slow), or creating another volume in horizon (can't see how to do that in the interface or with wikitech).

Wed, Nov 11, 2:59 PM · Cloud-VPS (Quota-requests)

Tue, Nov 10

Andrew added a comment to T267499: Request increased quota for wikidumpparse Cloud VPS project.

I actually increased your quotas to 18 vcpus and 38GB of RAM because there wouldn't be room otherwise (since there's also a 2-core instance in that project.)

Tue, Nov 10, 7:28 PM · Cloud-VPS (Quota-requests)
Andrew added a comment to T267105: Wikitech static sync syslog spam.

thank you @Reedy!

Tue, Nov 10, 7:01 PM · MW-1.35-notes, MW-1.36-notes (1.36.0-wmf.18; 2020-11-17), MW-1.35-release, MediaWiki-Maintenance-system, wikitech.wikimedia.org
Andrew closed T267618: Request increased quota for toolsbeta Cloud VPS project, a subtask of T267616: Set up docker-registry and image builder infra in toolsbeta, as Resolved.
Tue, Nov 10, 6:44 PM · cloud-services-team (Kanban), Toolforge
Andrew closed T267618: Request increased quota for toolsbeta Cloud VPS project as Resolved.

done

Tue, Nov 10, 6:44 PM · Cloud-VPS (Quota-requests)

Mon, Nov 9

Andrew closed T261336: 2020-08-26: tools NFS share cleanup, a subtask of T261335: Fix alerting for disk space on the NFS servers, as Resolved.
Mon, Nov 9, 6:03 PM · Data-Services, cloud-services-team (Kanban)
Andrew closed T261336: 2020-08-26: tools NFS share cleanup as Resolved.

Seems ok for now

Mon, Nov 9, 6:03 PM · Data-Services, cloud-services-team (Kanban)
Andrew reassigned T267078: Open the ceph throttle a bit for tools-k8s-etcd server from Andrew to Bstorm.
Mon, Nov 9, 6:01 PM · cloud-services-team (Kanban), Toolforge

Fri, Nov 6

Andrew added a comment to T267378: (Need By: TBD) rack/setup/install cloudcephmon200[12]-dev.

@Andrew: "We only need OS partitions for these." Does this mean just a normal raid10 lvm setup of the 4 disks or what? I'm assuming yes.

Fri, Nov 6, 1:05 AM · Patch-For-Review, ops-codfw, Operations, DC-Ops

Thu, Nov 5

Andrew added a comment to T247449: wdumps custom generated dumps storage space.

@bennofs, this tool is currently consuming about 225 GB of disk space. Can you please update us about your plan for automating cleanup? In the meantime, a quick by-hand cleanup would help keep things moving.

Thu, Nov 5, 10:49 PM · Data-Services, cloud-services-team (Kanban)
Andrew added a comment to T248188: Zoomviewer has ~450,000 files in NFS home directory .

@dschwen this tool is currently using 310 GB of space. Is that expected, or do we need to add more cleanup automation here?

Thu, Nov 5, 10:47 PM · Tools
Andrew moved T257274: Database table for request tool needs maintenance to prevent data loss from Clinic Duty to Doing on the cloud-services-team (Kanban) board.
Thu, Nov 5, 10:20 PM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)
Andrew moved T257275: Database tables for checkwiki tool needs maintenance to prevent data loss from Clinic Duty to Doing on the cloud-services-team (Kanban) board.
Thu, Nov 5, 10:20 PM · Patch-For-Review, Tools, Data-Services, cloud-services-team (Kanban)
Andrew added a comment to T266180: Request increased quota for wmf-research-tools Cloud VPS project.

update: we needed to order some new hardware to get those cloudvirts online so things are delayed a bit. Hopefully not more than another week or two :(

Thu, Nov 5, 10:04 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)

Wed, Nov 4

Andrew closed T267162: Request creation of wikicommunityhealth VPS project (Community Health Metrics: Understanding Editor Drop-off) as Resolved.

I've created the project and added CristianCantoro, marcmiquel, and elaragon.

Wed, Nov 4, 5:23 PM · Cloud-VPS (Project-requests)
Andrew claimed T267162: Request creation of wikicommunityhealth VPS project (Community Health Metrics: Understanding Editor Drop-off).

Approved -- I'll set this up soon.

Wed, Nov 4, 4:15 PM · Cloud-VPS (Project-requests)

Mon, Nov 2

Andrew added a comment to T267078: Open the ceph throttle a bit for tools-k8s-etcd server.

those VMs are now using the flavor 'g2.cores1.ram2.disk20.4xiops'.

Mon, Nov 2, 11:07 PM · cloud-services-team (Kanban), Toolforge

Nov 2 2020

Andrew added a comment to T266777: integration instances suffer from high IO latency due to Ceph.

I've adjusted the throttling rules for all integration-agent-docker-10xx nodes.

Nov 2 2020, 10:59 PM · cloud-services-team (Kanban), Cloud-VPS, Release-Engineering-Team (CI & Testing services), Continuous-Integration-Infrastructure
Andrew added a comment to T266777: integration instances suffer from high IO latency due to Ceph.

And just below:

iops_max=bm,iops_rd_max=rm,iops_wr_max=wm
Specify bursts in requests per second, either for all request types or for reads or writes only. Bursts allow the guest I/O to spike above the limit temporarily.

That one is in OpenStack rocky / Nova 18.0.0+ and is exposed as disk_write_iops_sec_max:

Nov 2 2020, 10:57 PM · cloud-services-team (Kanban), Cloud-VPS, Release-Engineering-Team (CI & Testing services), Continuous-Integration-Infrastructure
Andrew updated the task description for T266068: Onboard David Caro to Wikimedia Foundation as SRE in Cloud Services.
Nov 2 2020, 7:14 PM · Patch-For-Review, cloud-services-team (Kanban)

Oct 30 2020

Andrew awarded T263145: cloudvirt1033 psu redundancy alert a Like token.
Oct 30 2020, 4:51 PM · Operations, cloud-services-team (Kanban), ops-eqiad
Andrew added a comment to T266777: integration instances suffer from high IO latency due to Ceph.

I've resized integration-agent-docker-1020 to a flavor with increased IO throttles (4x the standard limits). Let's see what the graphs look like on Monday and we can determine if these limits are adequate or overkill or what.
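
As a rough illustration of what "4x the standard limits" means for a flavor, the sketch below scales a set of baseline throttles and emits them as `quota:`-prefixed flavor extra specs. The baseline numbers and the function name are hypothetical; the real standard limits are not stated in this thread.

```python
# Hypothetical baseline throttles; the real standard limits are not given here.
BASELINE_LIMITS = {
    "disk_read_iops_sec": 500,
    "disk_write_iops_sec": 500,
    "disk_total_bytes_sec": 200_000_000,
}


def scaled_extra_specs(baseline, factor):
    """Return flavor extra specs with every limit multiplied by `factor`."""
    return {f"quota:{name}": str(limit * factor) for name, limit in baseline.items()}


# Print the key=value pairs for a flavor with 4x the baseline throttles.
for key, value in scaled_extra_specs(BASELINE_LIMITS, 4).items():
    print(f"{key}={value}")
```

Keeping the multiplier in one place makes it easy to try a smaller factor if Monday's graphs suggest 4x is overkill.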

Oct 30 2020, 4:04 PM · cloud-services-team (Kanban), Cloud-VPS, Release-Engineering-Team (CI & Testing services), Continuous-Integration-Infrastructure

Oct 29 2020

Andrew added a parent task for T236582: "automation-framework" Cloud VPS project jessie deprecation: T266822: puppetmasters in cloud-vps project automation-framework.
Oct 29 2020, 7:44 PM · Cloud-VPS (Debian Jessie Deprecation)
Andrew added a subtask for T266822: puppetmasters in cloud-vps project automation-framework: T236582: "automation-framework" Cloud VPS project jessie deprecation.
Oct 29 2020, 7:44 PM · cloud-services-team (Kanban), Cloud-VPS
Andrew created T266822: puppetmasters in cloud-vps project automation-framework.
Oct 29 2020, 6:04 PM · cloud-services-team (Kanban), Cloud-VPS
Andrew added a comment to T266777: integration instances suffer from high IO latency due to Ceph.

Our conversation in August was about whether or not Integration VMs should be early adopters of Ceph. The entire cloud was/is/has moved to Ceph; this was never in question.

Oct 29 2020, 5:10 PM · cloud-services-team (Kanban), Cloud-VPS, Release-Engineering-Team (CI & Testing services), Continuous-Integration-Infrastructure
Andrew closed T266793: puppet last_run_summary.yaml incoherent when catalog can't compile as Resolved.

I've worked around the one specific case of this that was bothering me; I'm not going to dig into historical changes in puppet yaml output today :)

Oct 29 2020, 4:40 PM · cloud-services-team (Kanban)
Andrew created T266793: puppet last_run_summary.yaml incoherent when catalog can't compile.
Oct 29 2020, 2:56 PM · cloud-services-team (Kanban)
Andrew added a comment to T266777: integration instances suffer from high IO latency due to Ceph.

If you have specific iops or throughput numbers that you need for proper performance you can create a quota request task and we can make a custom flavor with different limits.

Oct 29 2020, 2:40 PM · cloud-services-team (Kanban), Cloud-VPS, Release-Engineering-Team (CI & Testing services), Continuous-Integration-Infrastructure
Andrew closed T260916: Move CI instances to use ceph in WMCS, a subtask of T194334: [Epic] Modern Cloud VPS storage layer, as Resolved.
Oct 29 2020, 2:40 PM · cloud-services-team (Kanban), Epic, Cloud-VPS
Andrew closed T260916: Move CI instances to use ceph in WMCS as Resolved.

All VMs have been moved to Ceph so this is done.

Oct 29 2020, 2:40 PM · Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure

Oct 28 2020

Andrew added a comment to T266623: relocate/reimage cloudvirt1030 with 10G interfaces.

@Bstorm you are correct, that is the NIC that is in the server, but the 10G capability would require a 10G SFP transceiver. I believe we only have the 1G transceivers on-site. @wiki_willy @Andrew @Bstorm do we want to order these for cloudvirt1025-1030?

Oct 28 2020, 6:48 PM · cloud-services-team (Hardware), ops-eqiad, DC-Ops, Operations
Andrew added a comment to T266180: Request increased quota for wmf-research-tools Cloud VPS project.

This request is approved, but can't be immediately granted.

Oct 28 2020, 3:50 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)
Andrew assigned T266174: Request creation of pipelinelib-experimental VPS project to nskaggs.

This request was approved in today's meeting; assigning to @nskaggs to create this week.

Oct 28 2020, 3:38 PM · cloud-services-team (Kanban), Release-Engineering-Team, Cloud-VPS (Project-requests)
Andrew added a comment to T261132: Move all cloud-vps VMs to Ceph.

Other than the special-case VMs in clouddb-services and wdqs VMs (which may remain with local storage for good), all VMs are now hosted on Ceph. Cloudvirt1025-30 are awaiting network upgrades before being rebuilt as Ceph-enabled hypervisors.

Oct 28 2020, 2:19 AM · cloud-services-team (Kanban)
Andrew updated the task description for T261132: Move all cloud-vps VMs to Ceph.
Oct 28 2020, 2:13 AM · cloud-services-team (Kanban)
Andrew updated the task description for T261132: Move all cloud-vps VMs to Ceph.
Oct 28 2020, 2:13 AM · cloud-services-team (Kanban)
Andrew added a comment to T266587: ToolsDB replication is broken.

@Andrew Question: we have enough storage that I could build a new database replica on a ceph VM right? I'd wanted to wait until we had cinder volumes to migrate these to allow detachability, but that would be one way to get this "unlocked" from the hardware. They'd need far less of a throttle for storage, of course.

Oct 28 2020, 1:48 AM · Patch-For-Review, Data-Services, cloud-services-team (Kanban)

Oct 27 2020

Andrew added a comment to T216195: Move cloudvirt hosts to 10Gb ethernet.

@Cmjohnson, in case you were waiting to do all these in bulk: all remaining cloudvirts are now ready for upgrade.

Oct 27 2020, 11:26 PM · cloud-services-team (Hardware), ops-eqiad, Operations, DC-Ops, Epic