
reconfigure codfw minio cluster after additional devices added
Closed, Resolved (Public)

Description

Once all of the drives from T405982 are added, we need to reconfigure the minio setup so that it uses all 4 hosts and all 4 drives on each host.

Steps required:

  • all drives installed
  • all services that write to minio are stopped or put into maintenance
  • data copied across from codfw cluster to eqiad cluster
    • example command: mcli mirror --insecure franio-codfw/ franio-eqiad/
  • deploy config
  • on codfw franio hosts, wipe everything under /minio-0*/, including the hidden .minio.sys directory
  • ensure all directories are owned by minio:minio
  • all 4 machines must be online and able to talk to each other on port 9100
  • stop minio
  • start minio
  • copy data back

Rough timing on the initial data copy (not the catch-up syncs) was about 35-40 minutes in each of the two tests. Keep that in mind when planning downtime notifications.
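
The steps above can be sketched as a dry-run script. This is only an illustration under assumptions, not the procedure that was actually run: the systemd unit name (MINIO_UNIT), the PEERS host list, and the helper function names are hypothetical, and the config deploy is left as a comment because its mechanism is not described here. The wipe, stop, and start steps apply to each codfw host. Commands are printed rather than executed unless DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the steps above. Commands are only printed unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
MINIO_UNIT="${MINIO_UNIT:-minio}"   # assumption: systemd unit name
PEERS="${PEERS:-}"                  # hypothetical: space-separated codfw host names

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf 'DRY-RUN: %s\n' "$*"
  else
    "$@"
  fi
}

# Copy data between two mcli aliases (command taken from the task description).
mirror() {
  run mcli mirror --insecure "$1/" "$2/"
}

# Return 0 if a TCP connection to port 9100 on the host opens within 3 seconds.
peer_reachable() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/9100"' _ "$1" 2>/dev/null
}

# Wipe each data directory, including dotfiles such as .minio.sys
# (a plain /minio-0*/* glob would skip them), then fix ownership.
wipe_and_chown() {
  local d
  for d in /minio-0*; do
    [ -d "$d" ] || continue
    run find "$d" -mindepth 1 -delete
    run chown -R minio:minio "$d"
  done
}

# Steps in the order listed in the task. Runs in a subshell so that
# "set -eu" stops at the first failure without affecting the caller.
main() (
  set -eu
  # Assumed done beforehand: drives installed, writers stopped or in maintenance.
  mirror franio-codfw franio-eqiad
  # Deploy config here (mechanism not shown in this task).
  wipe_and_chown
  for h in $PEERS; do
    peer_reachable "$h" || { echo "cannot reach $h:9100" >&2; exit 1; }
  done
  run systemctl stop "$MINIO_UNIT"
  run systemctl start "$MINIO_UNIT"
  mirror franio-eqiad franio-codfw
)

# Only print the plan when asked to, e.g. "bash reconfigure.sh plan".
if [ "${1:-}" = "plan" ]; then
  main
fi
```

Running it with the argument "plan" prints the command sequence for review before any downtime window is announced.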

Event Timeline

Drive addition is scheduled for tomorrow 2026-01-30. We can then do the expansion earlier in the maintenance week.

Drives have been added. New devices have been formatted with XFS filesystem, added to fstab, mount points created, permissions adjusted, and drive mounted. Ready for reconfiguration.
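
A quick pre-flight check along these lines could confirm the state described above before reconfiguring. It is a sketch only: the /minio-0* mount point pattern is inferred from the task description, the function names are hypothetical, and it assumes findmnt (util-linux) and GNU stat are available on the hosts. It only reads state and changes nothing.

```shell
# Pre-flight sketch: confirm each data mount is XFS and owned by minio:minio.

# Return 0 if the path is itself a mount point with the given filesystem type.
is_mounted_as() {
  [ "$(findmnt -n -o FSTYPE --mountpoint "$1" 2>/dev/null)" = "$2" ]
}

# Return 0 if the path is owned by the given user:group.
owned_by() {
  [ "$(stat -c '%U:%G' "$1" 2>/dev/null)" = "$2" ]
}

# Check every directory matching /minio-0* (pattern assumed from the task
# description) and report each problem on stderr. Returns 1 if any check fails.
check_all() {
  local d status=0
  for d in /minio-0*; do
    [ -d "$d" ] || continue
    is_mounted_as "$d" xfs || { echo "$d: not an XFS mount point" >&2; status=1; }
    owned_by "$d" minio:minio || { echo "$d: not owned by minio:minio" >&2; status=1; }
  done
  return "$status"
}

# Example: check_all && echo "drives look ready"
```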

Dwisehaupt moved this task from Up Next to In Progress on the fundraising-tech-ops board.

One last data copy was done. The config was pushed out and the data was wiped. The cluster was reconfigured with 4 hosts and 4 drives per host, and it came up clean. The data copy back to codfw is in progress.

Dwisehaupt moved this task from In Progress to Done on the fundraising-tech-ops board.
Dwisehaupt added a subscriber: AStein-WMF.

For reference, the data copy this last time took 78 minutes, compared to the previous runs, which were reliably in the 35-40 minute range each time I did a fresh reclone. Something to be aware of if we do it in the future.

Also, there was an oversight in the planning for this. In all of the tests and VB setups I didn't have access tokens for multiple users set up. Due to this, I missed the fact that they were not retained in the process. @AStein-WMF was able to help and recreate the production access tokens. This had to be done on the command line since minio removed the ability to do anything but bucket management through the web UI.

We will write up steps on how to back up these access tokens for future backup/recover/transfer processes.
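
As a starting point for that write-up, one possible approach is sketched below. It is an assumption, not a tested procedure: it relies on the installed mcli build providing the `admin cluster iam export` and `admin user list` subcommands, and whether the export actually covers the access tokens that were lost here would need to be verified on a test cluster first. The helper names are hypothetical, and commands are only printed unless DRY_RUN=0.

```shell
# Dry-run sketch: capture IAM state for an mcli alias before a wipe.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf 'DRY-RUN: %s\n' "$*"
  else
    "$@"
  fi
}

# List users and export IAM data for the given alias.
# Assumption: these admin subcommands exist in the deployed mcli version.
backup_iam() {
  run mcli admin user list --insecure "$1"
  run mcli admin cluster iam export --insecure "$1"
}

# Example: backup_iam franio-codfw
```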

As of now, minio is back up in codfw with all 4 hosts in a 16 drive config. Access is restored, metabase is rendering data, and jobs are running.