
Set up Misc Object Storage Service (moss)
Open, Stalled, Needs Triage · Public

Description

This task tracks setting up MOSS, the misc object storage service. The service is meant to host all existing (and new) object storage use cases that have accumulated over the years in the media storage cluster (listed, with more details, at T264291: Swift users and their usage).

The cluster will have at least 3x replication, will span eqiad and codfw, and will offer S3 + Swift API compatibility (i.e. the same as the thanos cluster).
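
To make the API-compatibility point concrete, here is a minimal sketch of talking to an S3-compatible MOSS endpoint with boto3. The endpoint URL, bucket name and credentials are hypothetical placeholders; the real service name/IP is part of the work listed below.

# Minimal sketch: S3-compatible access to MOSS via boto3.
# Endpoint, credentials and bucket name are placeholders, not real values.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://moss.example.wmnet",  # placeholder endpoint
    aws_access_key_id="EXAMPLE_ACCESS_KEY",
    aws_secret_access_key="EXAMPLE_SECRET_KEY",
)

s3.create_bucket(Bucket="example-bucket")
s3.put_object(Bucket="example-bucket", Key="hello.txt", Body=b"hello from MOSS")

obj = s3.get_object(Bucket="example-bucket", Key="hello.txt")
print(obj["Body"].read())  # b'hello from MOSS'

The same objects would also be reachable via the Swift API, as on the thanos cluster.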

  • Evaluate whether we need encrypted backend traffic across datacenters for the cluster (likely IPsec)
  • Decide on initial storage policies (replication factor, SSD/HDD, site-local vs global, which should be default, etc.); see the sketch below this list
  • Bring frontends online: T275513 T275511
  • Bring backends online: T276642 T276637
  • Bring up service IPs / LVS and certs
  • Bring up dashboards/monitoring/alerting
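
As a concrete illustration of the storage-policies item above: with a Swift-compatible API, a non-default policy is normally selected per container at creation time via the X-Storage-Policy header. A rough sketch with python-swiftclient follows, assuming a hypothetical auth URL, credentials and policy name; the actual policy set (and whether Ceph RGW exposes placement the same way stock Swift does) is exactly what still needs deciding.

# Rough sketch: choosing a storage policy when creating a container via the
# Swift API. Auth URL, credentials and policy name are hypothetical.
from swiftclient.client import Connection

conn = Connection(
    authurl="https://moss.example.wmnet/auth/v1.0",  # placeholder auth endpoint
    user="example:user",
    key="EXAMPLE_KEY",
)

# Containers created without the header land in the default policy;
# this one explicitly asks for a (hypothetical) site-local policy.
conn.put_container(
    "example-container",
    headers={"X-Storage-Policy": "eqiad-only-3x"},  # hypothetical policy name
)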

Once the service/cluster is up we can start migrating users / use cases (in a different task, TBD)

Event Timeline

Hi everybody, is there a timeline for MOSS? The ML-Team is currently using the Thanos Swift cluster to store objects/models; we don't require a lot of space, but at the same time we are not a great fit either :) MOSS will surely be better, and now that we are loading models into Swift we were wondering whether it would be best to wait a bit for MOSS or keep going with Thanos.

A couple of Friday-afternoon thoughts, not any kind of policy statement:

Swift is available both directly within the WMF network and, via our usual caching layers, to the outside world.

I infer that you are thinking that MOSS (or something like it) might be the object storage for your shared data platform? That's an interesting idea (and, indeed, Ceph's RGW (S3) layer now has support for bucket notifications - https://docs.ceph.com/en/latest/radosgw/notifications/ ), though it might well have hardware/architectural implications for what we do with MOSS. Our current Swift clusters are not very high-performance, for example, because we rely heavily on caching everything; if we were thinking about something that might support more "live" workflows, we might want to think harder about NVMe for BlueStore journals or similar...
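
For reference, RGW's bucket notifications are driven through the standard S3 notification API. A rough sketch with boto3, assuming the topic has already been created via RGW's topic API (per the linked docs) and using placeholder bucket/topic names and endpoint:

# Rough sketch: wiring up RGW bucket notifications via the standard S3 API.
# Bucket name, topic ARN and endpoint are placeholders; the topic itself would
# be created beforehand through RGW's topic API, as in the linked Ceph docs.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://moss.example.wmnet",  # placeholder endpoint
    aws_access_key_id="EXAMPLE_ACCESS_KEY",
    aws_secret_access_key="EXAMPLE_SECRET_KEY",
)

s3.put_bucket_notification_configuration(
    Bucket="example-bucket",
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "Id": "example-notify",
                "TopicArn": "arn:aws:sns:default::example-topic",  # placeholder ARN
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)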

Hello!

Yes! No policy statements here; we are in the 'feedback / alignment building' phase of talking about Shared Data Platform. :)

Data stored in Shared Data Platform is intended to be loaded by engineers into their datastores for serving production features, not for access by production features directly for serving. The storage in Shared Data Platform is two-tiered: Kafka for realtime data, and an object store for historical, long-term data and more static datasets. I'd expect that most product usages will read from and write to this system via Kafka, rather than going directly to the object store.

However, some use cases will need to 'bootstrap' current state and/or historical data from the object store. Bootstrapping would cause pretty spiky read loads: rare but high-volume.

Also, it is not yet clear to us where analytics fits in here. In an idealized view, data for analytics could live in a totally different storage cluster (Hadoop, or a different object store cluster). However, copying this data between different clusters just for idealized reasons might not be worth the overhead. We could consider 'analytics' a non-production use, and be okay with analytics tooling (e.g. Apache Spark in k8s?) accessing data in the shared data platform object store. Analytics usage would also cause spiky reads, and occasional spiky writes too.
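
To make the bootstrap pattern concrete, a consumer might first do a bulk read of historical objects and then switch to the Kafka stream. A very rough sketch follows; all endpoint names, bucket/prefix and topic names are hypothetical placeholders, not anything that has been decided.

# Very rough sketch of the two-tier bootstrap pattern discussed above:
# 1) bulk-read historical state from the object store, 2) follow Kafka for
# realtime updates. All names and endpoints are hypothetical placeholders.
import boto3
from kafka import KafkaConsumer  # kafka-python

s3 = boto3.client(
    "s3",
    endpoint_url="https://moss.example.wmnet",  # placeholder endpoint
    aws_access_key_id="EXAMPLE_ACCESS_KEY",
    aws_secret_access_key="EXAMPLE_SECRET_KEY",
)

# 1) Bootstrap: spiky, high-volume read of historical snapshots.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="shared-data", Prefix="snapshots/example/"):
    for entry in page.get("Contents", []):
        body = s3.get_object(Bucket="shared-data", Key=entry["Key"])["Body"].read()
        # ... load the snapshot into the local datastore ...

# 2) Steady state: consume realtime updates from Kafka.
consumer = KafkaConsumer(
    "example.updates",
    bootstrap_servers=["kafka.example.wmnet:9092"],  # placeholder broker
    auto_offset_reset="latest",
)
for message in consumer:
    pass  # ... apply the update to the local datastore ...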

Anyway, all this is very preliminary and just in the ideas phase. Stay tuned for a feedback meeting about all this soon. :)

Will Toolforge and Cloud VPS jobs be able to read from and write to their own custom buckets? (That would be super helpful.)

Change 876221 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] thanos: drain thanos-be[1,2]004

https://gerrit.wikimedia.org/r/876221

Change 876221 merged by MVernon:

[operations/puppet@production] thanos: drain thanos-be[1,2]004

https://gerrit.wikimedia.org/r/876221

Change 889277 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] Revert "thanos: drain thanos-be[1,2]004"

https://gerrit.wikimedia.org/r/889277

MatthewVernon changed the task status from Open to Stalled. Feb 15 2023, 10:13 AM

Marking this as stalled, as we've had to restore thanos-be[1,2]004 to thanos use, meaning we can't progress this task until next fiscal year when we can buy two backend servers for MOSS.

Change 889277 merged by MVernon:

[operations/puppet@production] Revert "thanos: drain thanos-be[1,2]004"

https://gerrit.wikimedia.org/r/889277

The extra hardware needed is due to arrive in Q1, so I expect getting MOSS going will be a KR for Q2 [obviously I can't promise that at this point!]

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1003.eqiad.wmnet with OS bookworm completed:

  • moss-be1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312051002_mvernon_1090681_moss-be1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm executed with errors:

  • moss-be1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • New OS is buster but bookworm was requested
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm completed:

  • moss-be1002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312051637_mvernon_1270041_moss-be1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1001.eqiad.wmnet with OS bookworm completed:

  • moss-be1001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312061234_mvernon_1783865_moss-be1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Change 1005110 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] aptrepo: add external repository for Ceph reef release

https://gerrit.wikimedia.org/r/1005110

Change 1005110 merged by MVernon:

[operations/puppet@production] aptrepo: add external repository for Ceph reef release

https://gerrit.wikimedia.org/r/1005110

Mentioned in SAL (#wikimedia-operations) [2024-02-20T15:30:20Z] <Emperor> import ceph-reef packages to apt1001 T279621

Mentioned in SAL (#wikimedia-operations) [2024-02-20T15:55:23Z] <Emperor> import ceph-reef packages to apt1001 T279621

Change 1009494 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/docker-images/production-images@master] Add new ceph container image

https://gerrit.wikimedia.org/r/1009494

Change 1009494 merged by MVernon:

[operations/docker-images/production-images@master] Add new ceph container image

https://gerrit.wikimedia.org/r/1009494