
Set up Misc Object Storage Service (moss)
Open, Stalled, Needs Triage · Public

Description

This task tracks setting up MOSS, the misc object storage service. The service is meant to host all existing (and new) object storage use cases that have accumulated over the years in the media storage cluster (listed, with more details, at T264291: Swift users and their usage).

The cluster will have at least 3x replication, will span eqiad and codfw, and will offer S3 + Swift API compatibility (i.e. the same as the thanos cluster).
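
To make the API-compatibility point concrete, here is a minimal sketch of talking to an S3-compatible MOSS endpoint with boto3. The endpoint URL, bucket name and credentials are hypothetical placeholders; the real service name/IP is part of the work listed below.

# Minimal sketch: S3-compatible access to MOSS via boto3.
# Endpoint, credentials and bucket name are placeholders, not real values.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://moss.example.wmnet",  # placeholder endpoint
    aws_access_key_id="EXAMPLE_ACCESS_KEY",
    aws_secret_access_key="EXAMPLE_SECRET_KEY",
)

s3.create_bucket(Bucket="example-bucket")
s3.put_object(Bucket="example-bucket", Key="hello.txt", Body=b"hello from MOSS")

obj = s3.get_object(Bucket="example-bucket", Key="hello.txt")
print(obj["Body"].read())  # b'hello from MOSS'

The same objects would also be reachable via the Swift API, as on the thanos cluster.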

  • Evaluate whether we need encrypted backend traffic across datacenters for the cluster (likely IPsec)
  • Decide on initial storage policies (replication factor, SSD/HDD, site-local vs global, which should be default, etc.); see the sketch below this list
  • Bring frontends online: T275513 T275511
  • Bring backends online: T276642 T276637
  • Bring up service IPs / LVS and certs
  • Bring up dashboards/monitoring/alerting
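
As a concrete illustration of the storage-policies item above: with a Swift-compatible API, a non-default policy is normally selected per container at creation time via the X-Storage-Policy header. A rough sketch with python-swiftclient follows, assuming a hypothetical auth URL, credentials and policy name; the actual policy set (and whether Ceph RGW exposes placement the same way stock Swift does) is exactly what still needs deciding.

# Rough sketch: choosing a storage policy when creating a container via the
# Swift API. Auth URL, credentials and policy name are hypothetical.
from swiftclient.client import Connection

conn = Connection(
    authurl="https://moss.example.wmnet/auth/v1.0",  # placeholder auth endpoint
    user="example:user",
    key="EXAMPLE_KEY",
)

# Containers created without the header land in the default policy;
# this one explicitly asks for a (hypothetical) site-local policy.
conn.put_container(
    "example-container",
    headers={"X-Storage-Policy": "eqiad-only-3x"},  # hypothetical policy name
)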

Once the service/cluster is up we can start migrating users / use cases (in a different task, TBD)

Event Timeline

Hi everybody, is there a timeline for MOSS? The ML-Team is currently using the Thanos Swift cluster to store objects/models; we don't require a lot of space, but at the same time we are not a great fit either :) MOSS will surely be better, and now that we are loading models into Swift we were wondering whether it would be best to wait a bit for MOSS or keep going with Thanos.

A couple of Friday-afternoon thoughts, not any kind of policy statement:

Swift is available both directly within the WMF network and, via our usual caching layers, to the outside world.

I infer that you are thinking that MOSS (or something like it) might be the object storage for your shared data platform? That's an interesting idea (and, indeed, Ceph's RGW (S3) layer now has support for bucket notifications - https://docs.ceph.com/en/latest/radosgw/notifications/ ), though it might well have hardware/architectural implications for what we do with MOSS. Our current Swift clusters are not very high-performance, for example, because we rely heavily on caching everything; if we were thinking about something that might support more "live" workflows, we might want to think harder about NVMe for BlueStore journals or similar...
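
For reference, RGW's bucket notifications are driven through the standard S3 notification API. A rough sketch with boto3, assuming the topic has already been created via RGW's topic API (per the linked docs) and using placeholder bucket/topic names and endpoint:

# Rough sketch: wiring up RGW bucket notifications via the standard S3 API.
# Bucket name, topic ARN and endpoint are placeholders; the topic itself would
# be created beforehand through RGW's topic API, as in the linked Ceph docs.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://moss.example.wmnet",  # placeholder endpoint
    aws_access_key_id="EXAMPLE_ACCESS_KEY",
    aws_secret_access_key="EXAMPLE_SECRET_KEY",
)

s3.put_bucket_notification_configuration(
    Bucket="example-bucket",
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "Id": "example-notify",
                "TopicArn": "arn:aws:sns:default::example-topic",  # placeholder ARN
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)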

Hello!

Yes! No policy statements here; we are in the 'feedback / alignment building' phase of talking about Shared Data Platform. :)

Data stored in Shared Data Platform is intended to be loaded by engineers into their datastores for serving production features, not for access by production features directly for serving. The storage in Shared Data Platform is two-tiered: Kafka for realtime data, and an object store for historical, long-term data and more static datasets. I'd expect that most product usages will read from and write to this system via Kafka, rather than going directly to the object store.

However, some use cases will need to 'bootstrap' current state and/or historical data from the object store. Bootstrapping would cause pretty spiky read loads: rare but high-volume.

Also, it is not yet clear to us where analytics fits in here. In an idealized view, data for analytics could live in a totally different storage cluster (Hadoop, or a different object store cluster). However, copying this data between different clusters just for idealized reasons might not be worth the overhead. We could consider 'analytics' a non-production use, and be okay with analytics tooling (e.g. Apache Spark in k8s?) accessing data in the shared data platform object store. Analytics usage would also cause spiky reads, and occasional spiky writes too.
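
To make the bootstrap pattern concrete, a consumer might first do a bulk read of historical objects and then switch to the Kafka stream. A very rough sketch follows; all endpoint names, bucket/prefix and topic names are hypothetical placeholders, not anything that has been decided.

# Very rough sketch of the two-tier bootstrap pattern discussed above:
# 1) bulk-read historical state from the object store, 2) follow Kafka for
# realtime updates. All names and endpoints are hypothetical placeholders.
import boto3
from kafka import KafkaConsumer  # kafka-python

s3 = boto3.client(
    "s3",
    endpoint_url="https://moss.example.wmnet",  # placeholder endpoint
    aws_access_key_id="EXAMPLE_ACCESS_KEY",
    aws_secret_access_key="EXAMPLE_SECRET_KEY",
)

# 1) Bootstrap: spiky, high-volume read of historical snapshots.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="shared-data", Prefix="snapshots/example/"):
    for entry in page.get("Contents", []):
        body = s3.get_object(Bucket="shared-data", Key=entry["Key"])["Body"].read()
        # ... load the snapshot into the local datastore ...

# 2) Steady state: consume realtime updates from Kafka.
consumer = KafkaConsumer(
    "example.updates",
    bootstrap_servers=["kafka.example.wmnet:9092"],  # placeholder broker
    auto_offset_reset="latest",
)
for message in consumer:
    pass  # ... apply the update to the local datastore ...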

Anyway, all this is very preliminary and just in the ideas phase. Stay tuned for a feedback meeting about all this soon. :)

Will Toolforge and Cloud VPS jobs be able to read from and write to their own custom buckets? (That would be super helpful.)

Change 876221 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] thanos: drain thanos-be[1,2]004

https://gerrit.wikimedia.org/r/876221

Change 876221 merged by MVernon:

[operations/puppet@production] thanos: drain thanos-be[1,2]004

https://gerrit.wikimedia.org/r/876221

Change 889277 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] Revert "thanos: drain thanos-be[1,2]004"

https://gerrit.wikimedia.org/r/889277

MatthewVernon changed the task status from Open to Stalled. Feb 15 2023, 10:13 AM

Marking this as stalled, as we've had to restore thanos-be[1,2]004 to thanos use, meaning we can't progress this task until next fiscal year when we can buy two backend servers for MOSS.

Change 889277 merged by MVernon:

[operations/puppet@production] Revert "thanos: drain thanos-be[1,2]004"

https://gerrit.wikimedia.org/r/889277

The extra hardware needed is due to arrive in Q1, so I expect getting MOSS going will be a KR for Q2 [obviously I can't promise that at this point!]

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1003.eqiad.wmnet with OS bookworm completed:

  • moss-be1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312051002_mvernon_1090681_moss-be1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm executed with errors:

  • moss-be1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • New OS is buster but bookworm was requested
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm completed:

  • moss-be1002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312051637_mvernon_1270041_moss-be1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1001.eqiad.wmnet with OS bookworm completed:

  • moss-be1001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312061234_mvernon_1783865_moss-be1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Change 1005110 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] aptrepo: add external repository for Ceph reef release

https://gerrit.wikimedia.org/r/1005110

Change 1005110 merged by MVernon:

[operations/puppet@production] aptrepo: add external repository for Ceph reef release

https://gerrit.wikimedia.org/r/1005110

Mentioned in SAL (#wikimedia-operations) [2024-02-20T15:30:20Z] <Emperor> import ceph-reef packages to apt1001 T279621

Mentioned in SAL (#wikimedia-operations) [2024-02-20T15:55:23Z] <Emperor> import ceph-reef packages to apt1001 T279621

Change 1009494 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/docker-images/production-images@master] Add new ceph container image

https://gerrit.wikimedia.org/r/1009494

Change 1009494 merged by MVernon:

[operations/docker-images/production-images@master] Add new ceph container image

https://gerrit.wikimedia.org/r/1009494