GTirloni (Giovanni Tirloni)
Operations Engineer

User Details

User Since
Sep 4 2018, 6:39 PM (24 w, 3 d)
Availability
Available
IRC Nick
gtirloni
LDAP User
GTirloni
MediaWiki User
GTirloni (WMF)

Recent Activity

Yesterday

GTirloni triaged T216781: Create Buster image as Normal priority.
Fri, Feb 22, 5:59 PM · cloud-services-team (Kanban), Cloud-VPS
GTirloni added a comment to T216781: Create Buster image.

Our Puppet repository isn't prepared for Buster so this experiment will have to stop here for now.

Fri, Feb 22, 5:57 PM · cloud-services-team (Kanban), Cloud-VPS
GTirloni added a parent task for T213546: Prepare puppet for Debian buster: T216781: Create Buster image.
Fri, Feb 22, 5:54 PM · Patch-For-Review, Packaging, Puppet, Operations
GTirloni added a subtask for T216781: Create Buster image: T213546: Prepare puppet for Debian buster.
Fri, Feb 22, 5:54 PM · cloud-services-team (Kanban), Cloud-VPS
GTirloni added a comment to T216659: tools puppetmaster is badly overloaded.

Resized tools-puppetmaster-01 to m1.large.

Fri, Feb 22, 4:05 PM · cloud-services-team (Kanban)
GTirloni closed T215416: shinken: python3-irc missing as Resolved.
Fri, Feb 22, 1:34 PM · Patch-For-Review, cloud-services-team (Kanban), Shinken
GTirloni added a comment to T216375: "Looks like you already have another webservice running" failure when trying to migrate webservice.

tools.commons-video-clicks experienced this today. Here are the contents of the existing service.manifest:

Fri, Feb 22, 11:56 AM · Toolforge
GTirloni triaged T216375: "Looks like you already have another webservice running" failure when trying to migrate webservice as Normal priority.
Fri, Feb 22, 11:47 AM · Toolforge
GTirloni closed T216706: Adopt cloud.wikimedia.org as top-level domain for cloud-services as Declined.
Fri, Feb 22, 2:05 AM · cloud-services-team (Kanban)
GTirloni added a comment to T216706: Adopt cloud.wikimedia.org as top-level domain for cloud-services.

@bd808 cool, thanks for the historical perspective. I'll close this ticket since there aren't any action items right now. When the time comes, we can use one of those domains.

Fri, Feb 22, 2:05 AM · cloud-services-team (Kanban)

Thu, Feb 21

GTirloni added a comment to T216781: Create Buster image.

bootstrap-vz 0.9.11+20180121git-1 doesn't know about Buster. It also tries to install Puppet from puppetlabs.com APT repository.

Thu, Feb 21, 11:10 PM · cloud-services-team (Kanban), Cloud-VPS
GTirloni created T216781: Create Buster image.
Thu, Feb 21, 11:07 PM · cloud-services-team (Kanban), Cloud-VPS
GTirloni closed T210122: Cloud VPS: Default image is not allocated automatically as Invalid.
Thu, Feb 21, 7:54 PM · cloud-services-team (Kanban), Cloud-VPS
GTirloni added a comment to T216195: Move cloudvirt hosts to 10Gb ethernet.

Related T190364

Thu, Feb 21, 7:50 PM · cloud-services-team (Kanban)
GTirloni triaged T216733: cloudvirts: ensure we're running the latest raid controller firmware as Normal priority.
Thu, Feb 21, 4:31 PM · cloud-services-team (Kanban), Cloud-VPS
GTirloni closed T216422: Virtualize NFS servers used exclusively by Cloud VPS tenants as Declined.
Thu, Feb 21, 3:58 PM · Data-Services, cloud-services-team (Kanban)
GTirloni closed T216422: Virtualize NFS servers used exclusively by Cloud VPS tenants, a subtask of T207536: Move various support services for Cloud VPS currently in prod into their own instances, as Declined.
Thu, Feb 21, 3:58 PM · cloud-services-team (Kanban), Operations, Cloud-VPS
GTirloni added a comment to T216422: Virtualize NFS servers used exclusively by Cloud VPS tenants.

It seems this ticket should be closed in light of the Ceph goal, right?

Thu, Feb 21, 2:35 PM · Data-Services, cloud-services-team (Kanban)
GTirloni added a comment to T216707: CloudVPS: cloudvirtan1002 puppet failures due to memory allocation issues?.

This server has 128GB of RAM. There are two VMs currently running on it:

Thu, Feb 21, 12:26 PM · Analytics-Kanban, Analytics-Cluster, Analytics, cloud-services-team (Kanban)
GTirloni added a comment to T216422: Virtualize NFS servers used exclusively by Cloud VPS tenants.

We need to be careful with huge QCOW2 files because moving them around will be really painful.

Thu, Feb 21, 12:15 PM · Data-Services, cloud-services-team (Kanban)
GTirloni moved T216688: Document procedure for controlled cluster restart from Inbox to Important on the cloud-services-team (Kanban) board.
Thu, Feb 21, 12:02 PM · Toolforge, cloud-services-team (Kanban)
GTirloni moved T95922: Adopt service status dashboard from Blocked to Needs discussion on the cloud-services-team (Kanban) board.
Thu, Feb 21, 12:01 PM · cloud-services-team (Kanban), Cloud-VPS
GTirloni moved T95922: Adopt service status dashboard from Inbox to Blocked on the cloud-services-team (Kanban) board.
Thu, Feb 21, 12:01 PM · cloud-services-team (Kanban), Cloud-VPS
GTirloni renamed T95922: Adopt service status dashboard from Labs needs a reliable and communicative status dashboard to Adopt service status dashboard.
Thu, Feb 21, 12:01 PM · cloud-services-team (Kanban), Cloud-VPS
GTirloni added a comment to T95922: Adopt service status dashboard.

This is critical for proper communication with end users.

Thu, Feb 21, 12:01 PM · cloud-services-team (Kanban), Cloud-VPS
GTirloni edited projects for T95922: Adopt service status dashboard, added: cloud-services-team (Kanban); removed Cloud-Services.
Thu, Feb 21, 11:57 AM · cloud-services-team (Kanban), Cloud-VPS
GTirloni added a comment to T216706: Adopt cloud.wikimedia.org as top-level domain for cloud-services.

I feel like that's bad branding. People have to remember "wmf" is "wikimedia" and tie that with "cloud" and remember it's ".org". It's almost a password.

Thu, Feb 21, 11:51 AM · cloud-services-team (Kanban)
GTirloni moved T216706: Adopt cloud.wikimedia.org as top-level domain for cloud-services from Inbox to Needs discussion on the cloud-services-team (Kanban) board.
Thu, Feb 21, 11:47 AM · cloud-services-team (Kanban)
GTirloni created T216706: Adopt cloud.wikimedia.org as top-level domain for cloud-services.
Thu, Feb 21, 11:47 AM · cloud-services-team (Kanban)
GTirloni created T216688: Document procedure for controlled cluster restart.
Thu, Feb 21, 9:41 AM · Toolforge, cloud-services-team (Kanban)
GTirloni closed T214637: Setup CSP http header in Quarry as Resolved.
Thu, Feb 21, 9:29 AM · Patch-For-Review, Security, Quarry
GTirloni closed T216685: labpuppetmaster - Slowness and puppet-enc errors as Resolved.
Thu, Feb 21, 9:12 AM · cloud-services-team (Kanban), Cloud-VPS
GTirloni updated subscribers of T216685: labpuppetmaster - Slowness and puppet-enc errors.

labpuppetmaster1001 stuck processes:

Thu, Feb 21, 9:12 AM · cloud-services-team (Kanban), Cloud-VPS
GTirloni created T216685: labpuppetmaster - Slowness and puppet-enc errors.
Thu, Feb 21, 9:08 AM · cloud-services-team (Kanban), Cloud-VPS

Wed, Feb 20

GTirloni added a comment to T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020.

cloudvirt1020 has been reimaged with Stretch, and its RAID configuration now contains 2 spares.

Wed, Feb 20, 7:02 PM · cloud-services-team (Kanban), Patch-For-Review, Epic, Cloud-VPS
GTirloni closed T216004: Degraded RAID on cloudvirt1018 as Resolved.
Wed, Feb 20, 7:01 PM · cloud-services-team (Kanban), ops-eqiad, Operations
GTirloni added a comment to T216004: Degraded RAID on cloudvirt1018.

All good, thank you!

Wed, Feb 20, 7:01 PM · cloud-services-team (Kanban), ops-eqiad, Operations
GTirloni closed T194855: Degraded RAID on cloudvirt1020, a subtask of T216208: ToolsDB overload and cleanup, as Resolved.
Wed, Feb 20, 5:43 PM · Patch-For-Review, TCB-Team, Phragile, Data-Services, cloud-services-team (Kanban)
GTirloni closed T194855: Degraded RAID on cloudvirt1020 as Resolved.
Wed, Feb 20, 5:43 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations
GTirloni closed T194855: Degraded RAID on cloudvirt1020, a subtask of T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020, as Resolved.
Wed, Feb 20, 5:43 PM · cloud-services-team (Kanban), Patch-For-Review, Epic, Cloud-VPS
GTirloni added a comment to T194855: Degraded RAID on cloudvirt1020.

@Cmjohnson thank you!

Wed, Feb 20, 5:43 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations

Tue, Feb 19

GTirloni closed T216481: Remove views on ep_* tables on the wikireplicas hosts, a subtask of T174802: Archive and drop education program (ep_*) tables on all wikis, as Resolved.
Tue, Feb 19, 7:40 PM · Patch-For-Review, User-notice, Datasets-General-or-Unknown, Data-Services, DBA
GTirloni closed T216481: Remove views on ep_* tables on the wikireplicas hosts as Resolved.
Tue, Feb 19, 7:40 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services
GTirloni added a comment to T216481: Remove views on ep_* tables on the wikireplicas hosts.

Views removed.

Tue, Feb 19, 7:40 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services
GTirloni closed T215892: Degraded RAID on cloudvirt1024 as Resolved.
Tue, Feb 19, 6:33 PM · cloud-services-team (Kanban), ops-eqiad, Operations
GTirloni reopened T215892: Degraded RAID on cloudvirt1024 as "Open".
Tue, Feb 19, 6:21 PM · cloud-services-team (Kanban), ops-eqiad, Operations
GTirloni added a comment to T216481: Remove views on ep_* tables on the wikireplicas hosts.

Not yet.

Tue, Feb 19, 3:39 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services
GTirloni added a comment to T211939: Drop several views from srwikinews.

Please ignore the above commits, they were meant for T216481. Sorry, bad copy/paste.

Tue, Feb 19, 1:55 PM · Patch-For-Review, User-Banyek, cloud-services-team (Kanban), Data-Services, User-Zoranzoki21
GTirloni created T216506: imagemagick: No such file or directory - /etc/ImageMagick-6/policy.xml20190219-7388-jrbucp.lock.
Tue, Feb 19, 1:41 PM · Patch-For-Review, cloud-services-team (Kanban)
GTirloni added a comment to T216441: Evaluate transferring the non-replicated tables to the new toolsdb server.

Could someone fill me in on why we don't replicate these databases/tables from a technical and operational perspective?

Tue, Feb 19, 12:07 PM · Data-Services, cloud-services-team (Kanban)
GTirloni claimed T216481: Remove views on ep_* tables on the wikireplicas hosts.
Tue, Feb 19, 12:03 PM · Patch-For-Review, cloud-services-team (Kanban), Data-Services
GTirloni awarded T215586: Custom Kubernetes deployment fails from Stretch bastion a Yellow Medal token.
Tue, Feb 19, 1:42 AM · Patch-For-Review, cloud-services-team (Kanban), Toolforge

Mon, Feb 18

GTirloni added a comment to T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020.

toolsdb is now being monitored

Mon, Feb 18, 8:25 PM · cloud-services-team (Kanban), Patch-For-Review, Epic, Cloud-VPS
GTirloni added a project to T216404: deployment-db03.deployment-prep.eqiad.wmflabs instance can not start: Cloud-VPS.
Mon, Feb 18, 1:28 PM · User-Ryasmeen, Wikidata, User-Addshore, Cloud-VPS, cloud-services-team (Kanban), Beta-Cluster-Infrastructure
GTirloni edited projects for T216404: deployment-db03.deployment-prep.eqiad.wmflabs instance can not start, added: cloud-services-team (Kanban); removed cloud-services-team.
Mon, Feb 18, 1:28 PM · User-Ryasmeen, Wikidata, User-Addshore, Cloud-VPS, cloud-services-team (Kanban), Beta-Cluster-Infrastructure
GTirloni updated subscribers of T216404: deployment-db03.deployment-prep.eqiad.wmflabs instance can not start.
root@cloudvirt1026:/var/lib/nova/instances/27460e9d-5548-4cd6-9472-548db6402294# qemu-img check ./disk
qemu-img: Could not open './disk': Could not open backing file: Could not open '/var/lib/nova/instances/_base/76a35d5edf0cd19144cac5d4b0a44e7f9212fa14': No such file or directory
Mon, Feb 18, 1:28 PM · User-Ryasmeen, Wikidata, User-Addshore, Cloud-VPS, cloud-services-team (Kanban), Beta-Cluster-Infrastructure
GTirloni claimed T216404: deployment-db03.deployment-prep.eqiad.wmflabs instance can not start.
Mon, Feb 18, 12:47 PM · User-Ryasmeen, Wikidata, User-Addshore, Cloud-VPS, cloud-services-team (Kanban), Beta-Cluster-Infrastructure

Fri, Feb 15

GTirloni updated the task description for T216239: CloudVPS: drain and rebuild labvirt1009 as cloudvirt1009.
Fri, Feb 15, 5:14 PM · Patch-For-Review, cloud-services-team (Kanban)
GTirloni awarded T215211: cloud instance rescue tools a Yellow Medal token.
Fri, Feb 15, 4:04 PM · Patch-For-Review, Cloud-VPS, cloud-services-team (Kanban)
GTirloni updated the task description for T216239: CloudVPS: drain and rebuild labvirt1009 as cloudvirt1009.
Fri, Feb 15, 2:54 PM · Patch-For-Review, cloud-services-team (Kanban)
GTirloni added a comment to T194855: Degraded RAID on cloudvirt1020.

cloudvirt1020 is also 5x slower to enter the BIOS menu (ESC+9) than cloudvirt1019. Not sure what that means.

Fri, Feb 15, 1:50 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations
GTirloni added a comment to T193264: Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020.

cloudvirt1020 has a bad disk. See T194855.

Fri, Feb 15, 1:47 PM · cloud-services-team (Kanban), Patch-For-Review, Epic, Cloud-VPS
GTirloni added a comment to T194855: Degraded RAID on cloudvirt1020.

@Cmjohnson cloudvirt1020 is reporting a disk missing:

Fri, Feb 15, 1:46 PM · Patch-For-Review, cloud-services-team (Kanban), ops-eqiad, Operations

Thu, Feb 14

GTirloni added a comment to T216170: toolsdb - Per-user connection limits.

Thanks for your feedback.

Thu, Feb 14, 9:06 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge, Data-Services
GTirloni created T216173: labsdb1005/6 - Upgrade to Stretch.
Thu, Feb 14, 7:47 PM · Data-Services, cloud-services-team (Kanban)
GTirloni created T216170: toolsdb - Per-user connection limits.
Thu, Feb 14, 7:22 PM · Patch-For-Review, cloud-services-team (Kanban), Toolforge, Data-Services
GTirloni created T216168: Review labsdb1005 MariaDB configuration against prod standards.
Thu, Feb 14, 7:11 PM · Data-Services, cloud-services-team (Kanban)
GTirloni created T216167: Verify checkwiki tool against excessive DB usage.
Thu, Feb 14, 7:07 PM · Data-Services, Toolforge, cloud-services-team (Kanban)
GTirloni updated subscribers of T181375: Revamp first boot process for new VMs.

We recently discussed supporting cloud-init / user-data so that terraform can be used with more advanced automation.

Thu, Feb 14, 12:35 PM · Patch-For-Review, cloud-services-team (Kanban)

Wed, Feb 13

GTirloni added a comment to T216004: Degraded RAID on cloudvirt1018.

Looking in the RAID controller firmware logs, it seems we have consistent issues with all disks (which could point to a faulty controller, cable or enclosure). What do you think?

Wed, Feb 13, 8:37 PM · cloud-services-team (Kanban), ops-eqiad, Operations
GTirloni added a comment to T215892: Degraded RAID on cloudvirt1024.

Yep, slots 0 and 3 are gone and need replacement.

Wed, Feb 13, 8:35 PM · cloud-services-team (Kanban), ops-eqiad, Operations
GTirloni updated subscribers of T216004: Degraded RAID on cloudvirt1018.

Slots 2 & 3 were part of the outage today. Even though they show as online, could we replace them? They are likely in a pair so we'll need to do it one at a time.

Wed, Feb 13, 7:16 PM · cloud-services-team (Kanban), ops-eqiad, Operations
GTirloni created P8079 cloudvirt1018 - fs corruption.
Wed, Feb 13, 6:58 PM
GTirloni added a comment to T216004: Degraded RAID on cloudvirt1018.

/var/lib/nova/instances took some damage today:

Wed, Feb 13, 3:35 PM · cloud-services-team (Kanban), ops-eqiad, Operations
GTirloni created T216040: Implement START_DELAY in libvirt.
Wed, Feb 13, 2:23 PM · Cloud-VPS, cloud-services-team (Kanban)
GTirloni edited P8076 cloudvirt1018 disk list.
Wed, Feb 13, 1:55 PM
GTirloni updated the title for P8076 cloudvirt1018 disk list from cloudvirt1018 smart data to cloudvirt1018 disk list.
Wed, Feb 13, 1:53 PM
GTirloni created P8076 cloudvirt1018 disk list.
Wed, Feb 13, 1:52 PM

Tue, Feb 12

GTirloni created T215968: profile::grafana - Could not find data item profile::grafana::secret_key .
Tue, Feb 12, 9:58 PM · Patch-For-Review, cloud-services-team (Kanban)
GTirloni updated the task description for T210818: Move admin cron jobs to systemd timers.
Tue, Feb 12, 8:10 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)
GTirloni updated the task description for T210818: Move admin cron jobs to systemd timers.
Tue, Feb 12, 7:11 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)
GTirloni added a comment to T210818: Move admin cron jobs to systemd timers.

systemd::timer::job { 'toolfoge_clush_update':
    ensure                    => present,
    description               => 'Update list of Toolforge servers for clush',
    command                   => "/usr/local/sbin/tools-clush-generator /etc/clustershell/tools.yaml --observer-pass ${observer_pass}",
    interval                  => {
        'start'    => 'OnCalendar',
        'interval' => '*-*-* *:00:00', # hourly
    },
    logging_enabled           => false,
    monitoring_enabled        => true,
    monitoring_contact_groups => 'wmcs-team',
    user                      => 'root',
}
Tue, Feb 12, 4:43 PM · Patch-For-Review, Epic, cloud-services-team (Kanban)

Mon, Feb 11

GTirloni closed T215417: labmon1001: archive-instances not working as Resolved.
Mon, Feb 11, 6:03 PM · cloud-services-team (Kanban)
GTirloni added a comment to T215417: labmon1001: archive-instances not working.

It seems that at some point in April 2018, archive-instances was executed as root, which left some files under /srv/carbon/whisper/archived_metrics with incorrect ownership. The regular cronjob running as _graphite could not move files there.

Mon, Feb 11, 6:03 PM · cloud-services-team (Kanban)
GTirloni renamed T215417: labmon1001: archive-instances not working from labmon1001: DISK WARNING - free space: /srv 103841 MB (5% inode=93%): to labmon1001: archive-instances not working.
Mon, Feb 11, 5:47 PM · cloud-services-team (Kanban)
GTirloni added a comment to T215417: labmon1001: archive-instances not working.

Relevant docs: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Metrics_life-cycle

Mon, Feb 11, 5:42 PM · cloud-services-team (Kanban)
GTirloni created P8066 too many redirects.
Mon, Feb 11, 4:59 PM
GTirloni closed T212308: Rerun maintain-views for all tables to drop valid_tag and tag_summary tables as Resolved.
Mon, Feb 11, 4:41 PM · Patch-For-Review, cloud-services-team (Kanban)
GTirloni closed T212308: Rerun maintain-views for all tables to drop valid_tag and tag_summary tables, a subtask of T212254: Drop valid_tag table, as Resolved.
Mon, Feb 11, 4:41 PM · Patch-For-Review, DBA
GTirloni added a comment to T212308: Rerun maintain-views for all tables to drop valid_tag and tag_summary tables.

maintain-views --all-databases --replace-all executed on all replicas (labsdb1009, labsdb1010 and labsdb1011).

Mon, Feb 11, 4:41 PM · Patch-For-Review, cloud-services-team (Kanban)
GTirloni added a comment to T206706: Establish documentation review process for WMCS.

@bd808 thanks!

Mon, Feb 11, 11:19 AM · cloud-services-team (Kanban)
GTirloni added a comment to T208099: nova: can we expose the creator and virt host of VMs to the public?.

OpenStack browser has been updated to show the hypervisor in the VM list today (this information is already in the server view).

Mon, Feb 11, 11:01 AM · Patch-For-Review, cloud-services-team (Kanban)
GTirloni committed R2073:cef96a7ca7e0: Add hypervisor to VM list in project page (authored by GTirloni).
Add hypervisor to VM list in project page
Mon, Feb 11, 10:58 AM
GTirloni created T215778: Publish new Debian 9.7 images.
Mon, Feb 11, 10:52 AM · Cloud-VPS, cloud-services-team (Kanban)
GTirloni claimed T215417: labmon1001: archive-instances not working.
Mon, Feb 11, 10:46 AM · cloud-services-team (Kanban)
GTirloni triaged T215417: labmon1001: archive-instances not working as Normal priority.
Mon, Feb 11, 9:56 AM · cloud-services-team (Kanban)
GTirloni merged T215758: Free up disk space on labmon1001 into T215417: labmon1001: archive-instances not working.
Mon, Feb 11, 9:56 AM · cloud-services-team (Kanban)
GTirloni merged task T215758: Free up disk space on labmon1001 into T215417: labmon1001: archive-instances not working.
Mon, Feb 11, 9:56 AM · Cloud-Services
GTirloni added a comment to T206706: Establish documentation review process for WMCS.

I think this task might be a good place for me to add something I've been thinking about, please let me know if I should add it somewhere else.

Mon, Feb 11, 12:45 AM · cloud-services-team (Kanban)

Sat, Feb 9

GTirloni added a comment to T206951: Puppet doesn't restart ferm on failure.

I don't know if this is related but today I noticed that, if iptables rules are cleared (iptables -F), subsequent puppet runs will not re-apply them. I also had to run systemctl restart ferm to get them re-applied.

Sat, Feb 9, 1:09 AM · Wikimedia-Incident, Traffic, Operations