
Accidental removal of some files under /srv/deployment on deploy1002
Closed, Resolved, Public

Description

Around 11:13 UTC I mistakenly ran rm -rf under /srv/deployment on deploy1002. I stopped it immediately, but some files have probably been deleted. I am really sorry for the sloppy mistake.

I tried to run puppet right afterwards, since my understanding is that the dirs under /srv/deployment are checked out by puppet based on a config, and got:

Notice: /Stage[main]/Profile::Mediawiki::Deployment::Server/File[/srv/deployment/mediawiki-staging]/ensure: created (corrective)
Error: Execution of '/usr/bin/scap deploy --init' returned 70: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[dumps/dumps]/Scap_source[dumps/dumps]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 70:  (corrective)
Error: Execution of '/usr/bin/scap deploy --init' returned 70: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[integration/docroot]/Scap_source[integration/docroot]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 70:  (corrective)
Error: Execution of '/usr/bin/scap deploy --init' returned 1: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[logstash/plugins]/Scap_source[logstash/plugins]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 1:  (corrective)
Error: Execution of '/usr/bin/scap deploy --init' returned 70: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[parsoid/deploy]/Scap_source[parsoid/deploy]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 70:  (corrective)
Error: Execution of '/usr/bin/scap deploy --init' returned 70: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[wikimedia/discovery/analytics]/Scap_source[wikimedia/discovery/analytics]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 70:  (corrective)
Error: Execution of '/usr/bin/scap deploy --init' returned 70: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[wdqs/wdqs]/Scap_source[wdqs/wdqs]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 70:  (corrective)
Error: Execution of '/usr/bin/scap deploy --init' returned 70: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[cassandra/logstash-logback-encoder]/Scap_source[cassandra/logstash-logback-encoder]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 70:  (corrective)
Error: Execution of '/usr/bin/scap deploy --init' returned 70: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[cassandra/twcs]/Scap_source[cassandra/twcs]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 70:  (corrective)
Error: Execution of '/usr/bin/scap deploy --init' returned 70: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[debmonitor/deploy]/Scap_source[debmonitor/deploy]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 70:  (corrective)
Error: Execution of '/usr/bin/scap deploy --init' returned 70: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[releng/phatality]/Scap_source[releng/phatality]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 70:  (corrective)
Error: Execution of '/usr/bin/scap deploy --init' returned 70: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[design/style-guide]/Scap_source[design/style-guide]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 70:  (corrective)
Info: Class[Profile::Mediawiki::Deployment::Server]: Unscheduling all events on Class[Profile::Mediawiki::Deployment::Server]

Subsequent puppet runs got the errors down to:

Error: Execution of '/usr/bin/scap deploy --init' returned 1: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[logstash/plugins]/Scap_source[logstash/plugins]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 1:  (corrective)
Error: Execution of '/usr/bin/scap deploy --init' returned 70: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[parsoid/deploy]/Scap_source[parsoid/deploy]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 70:  (corrective)

As a test, I saved a copy of the parsoid/deploy dir under the root home dir, removed the original and ran puppet, since git status was complaining that it was not an existing git repo. The weird git status is now gone, but the scap deploy --init run by puppet still fails.

I am going to send this task to ops@ to warn people.

Event Timeline

elukey triaged this task as High priority. May 2 2022, 11:32 AM
elukey created this task.

As suggested in the chat, I have created /var/lock/scap-global-lock containing the message "Please check https://phabricator.wikimedia.org/T307349".

A backup is being placed under /srv/restore/srv/deployment on deploy1002 by Jaime. The last backup was taken today at 04:13 UTC.

The deletion happened around 11:13 UTC. Apart from the ores deployments (done by me), there is only one other deployment, done by @Ladsgroup: https://sal.toolforge.org/production?p=0&q=deploy1002&d=

The last error that puppet highlights is:

Error: Execution of '/usr/bin/scap deploy --init' returned 1: 
Error: /Stage[main]/Profile::Mediawiki::Deployment::Server/Scap::Source[logstash/plugins]/Scap_source[logstash/plugins]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy --init' returned 1:  (corrective)

Bacula recovery log for the record:
{P27348}

hashar raised the priority of this task from High to Unbreak Now!. May 2 2022, 12:16 PM
hashar subscribed.

This is 100% an unbreak now.

Restore finished ok:

02-May 12:22 backup1001.eqiad.wmnet-fd JobId 437308: Elapsed time=00:37:57, Transfer rate=27.85 M Bytes/second
02-May 12:22 backup1001.eqiad.wmnet JobId 437308: Bacula backup1001.eqiad.wmnet 9.4.2 (04Feb19):
  Build OS:               x86_64-pc-linux-gnu debian 10.5
  JobId:                  437308
  Job:                    RestoreFiles.2022-05-02_11.44.26_38
  Restore Client:         deploy1002.eqiad.wmnet-fd
  Where:                  /srv/restore
  Replace:                Always
  Start time:             02-May-2022 11:44:28
  End time:               02-May-2022 12:22:26
  Elapsed time:           37 mins 58 secs
  Files Expected:         422,301
  Files Restored:         422,301
  Bytes Restored:         62,725,004,800 (62.72 GB)
  Rate:                   27535.1 KB/s
  FD Errors:              0
  FD termination status:  OK
  SD termination status:  OK
  Termination:            Restore OK

I checked the HEAD of all git repos under /srv/deployment with:

find /srv/deployment -name .git -print0 -execdir git rev-parse HEAD \;

I got a few errors, which are oddities:

fatal: not a git repository: /Users/gozala/Projects/events/.git/modules/@modules/raw.github.com/Gozala/extendables/v0.2.0
find: ‘/srv/deployment/imagecatalog’: Permission denied
fatal: not a git repository: /srv/deployment/netbox/old_wheels/../.git/modules/wheels
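For anyone repeating this check: because -print0 terminates each path with a NUL byte rather than a newline, the path and the commit hash end up fused on one line (as in the comparisons below). A minimal sketch of an alternative that prints one "path hash" pair per line, sorted by path, so the output can be diffed directly between hosts. The function name list_heads is hypothetical, and the sketch assumes paths contain no newlines.

```shell
# Print "<repo dir> <HEAD sha>" for every .git entry under a root, sorted by path.
# Repositories where HEAD cannot be resolved are reported as ERROR instead of aborting.
list_heads() {
    root="$1"
    # -prune stops find from descending into the .git directories themselves
    find "$root" -name .git -prune 2>/dev/null | sort | while IFS= read -r gitdir; do
        repo=$(dirname "$gitdir")
        # git -C runs the command as if it had been started in $repo
        if head=$(git -C "$repo" rev-parse HEAD 2>/dev/null); then
            printf '%s %s\n' "$repo" "$head"
        else
            printf '%s %s\n' "$repo" "ERROR"
        fi
    done
}
```

Running list_heads /srv/deployment on each host and saving the output to a file named after the host gives two files that can be compared with diff -U0.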

Comparison between deploy2002 and deploy1002:

$ colordiff -U0 --text deploy2002 deploy1002 
--- deploy2002	2022-05-02 14:33:48.321048639 +0200
+++ deploy1002	2022-05-02 14:33:49.585064545 +0200
@@ -82 +81,0 @@
-/srv/deployment/ores/deploy/submodules/assets/.git08b9cebc5e0148ace73915a94e65a5dd0f0ba9a2
@@ -88 +87,2 @@
-/srv/deployment/parsoid/config/node_modules/events/@modules/raw.github.com/Gozala/extendables/v0.2.0/.git/srv/deployment/parsoid/config/.gitf12b09d5dfeb18c74104a5cdefc7ec4ea23a2b2a
+/srv/deployment/parsoid/config/.gitf12b09d5dfeb18c74104a5cdefc7ec4ea23a2b2a
+/srv/deployment/parsoid/config/node_modules/events/@modules/raw.github.com/Gozala/extendables/v0.2.0/.git/srv/deployment/ores/deploy/submodules/assets/.git08b9cebc5e0148ace73915a94e65a5dd0f0ba9a2

Mentioned in SAL (#wikimedia-operations) [2022-05-02T12:48:01Z] <volans> swapped /srv/deployment directory on deploy1002 with the one from the latest backup - T307349

After restore:

$ colordiff -U0 --text deploy2002 deploy1002 
--- deploy2002	2022-05-02 14:51:09.946189381 +0200
+++ deploy1002	2022-05-02 14:51:10.454195799 +0200
@@ -20 +20,2 @@
-/srv/deployment/cassandra/twcs/.git9d5b9879035e0918a49dd30d67e467db8921a14f
+/srv/deployment/cassandra/metrics-collector/.gitd0169ee17be33c4aabc92034c7ba8f042e1008c4
+/srv/deployment/cassandra/twcs/.git636083128725f344826fef4e4514d42f01c8a3b7
@@ -24,0 +26,4 @@
+/srv/deployment/cp-jobqueue/cp-jobqueue/.gitaa53d3ffd698688ca79de53a2c2855f938cba9b8
+/srv/deployment/cp-jobqueue/cp-jobqueue/src/.git7b42eb21c752809adc8a492d7d697a1615c1e781
+/srv/deployment/cpjobqueue/deploy/.git07d8c3223706daf92da959d46ac0802370457b2c
+/srv/deployment/cpjobqueue/deploy/src/.gitfed526a21b9906347d03fd56f4e8d598b85f5c74
@@ -29 +34 @@
-/srv/deployment/design/style-guide/.git008d60450ddbcecff2b8f46b62f541dd465c20bd
+/srv/deployment/design/style-guide/.git9b3b0fbb2db298fc757776ef25dd030ce6a4a765
@@ -33 +38 @@
-/srv/deployment/dumps/dumps/.gitcd309394414535e6156d3e5dd10e21b39bac52ce
+/srv/deployment/dumps/dumps/.gitf7c16d47689f92e8f3663ef6bf30939a289fe187
@@ -61,0 +67 @@
+/srv/deployment/httpbb/.gitd85df68a86ec4b587c0b134a4212690f0220e20c
@@ -64,0 +71 @@
+/srv/deployment/jobrunner.old/jobrunner/.git161c84cfd4dfa536e09278ce65a585c8d6313aeb
@@ -68 +75 @@
-/srv/deployment/logstash/plugins/.gitc9d186a5df45c27d341f9fb6db924cda2a726224
+/srv/deployment/logstash/plugins/.git7fb88433b784ddbe913a848b3cd2d45353ffacc2
@@ -80 +87 @@
-/srv/deployment/ores/deploy/.git98a1b2e51d79e9e13140c03168e12e173b6c891e
+/srv/deployment/ores/deploy/.git29de1cc854a8226d657002d5d44ffa39382276cc
@@ -82 +88,0 @@
-/srv/deployment/ores/deploy/submodules/assets/.git08b9cebc5e0148ace73915a94e65a5dd0f0ba9a2
@@ -87,4 +93,5 @@
-/srv/deployment/ores/deploy/submodules/wheels/.git1e9f54533b744996f46558c11c82b0180bfe0f49
-/srv/deployment/parsoid/config/node_modules/events/@modules/raw.github.com/Gozala/extendables/v0.2.0/.git/srv/deployment/parsoid/config/.gitf12b09d5dfeb18c74104a5cdefc7ec4ea23a2b2a
-/srv/deployment/parsoid/deploy/.gitebfb301d6544ebb9407e11bb6fbb1e0dffaa4bde
-/srv/deployment/parsoid/deploy/src/.git7ddab4db4df014194ec7ab3a144e679dea885be2
+/srv/deployment/ores/deploy/submodules/wheels/.git85c0dccefacb803d60e600d4d5f4ae86a74c6068
+/srv/deployment/parsoid/config/.gitf12b09d5dfeb18c74104a5cdefc7ec4ea23a2b2a
+/srv/deployment/parsoid/config/node_modules/events/@modules/raw.github.com/Gozala/extendables/v0.2.0/.git/srv/deployment/ores/deploy/submodules/assets/.git08b9cebc5e0148ace73915a94e65a5dd0f0ba9a2
+/srv/deployment/parsoid/deploy/.gitd2d48702eda33ff9f5fcdf579ae0fa15517754f8
+/srv/deployment/parsoid/deploy/src/.git74730a37bc180ffb72ced084b41b5d39deed241e
@@ -103,0 +111 @@
+/srv/deployment/prometheus/jmx_exporter/.git0ee30e891c51e5d166d925c502558ff4a4e28016
@@ -107,0 +116,2 @@
+/srv/deployment/recommendation-api/deploy/.gitdb7fd80990a4c12638a2384632046bfd1d234aa7
+/srv/deployment/recommendation-api/deploy/src/.git7e0017724a164b5d1b240a0afb039f62b3a9ce9b

Mentioned in SAL (#wikimedia-operations) [2022-05-02T12:48:01Z] <volans> swapped /srv/deployment directory on deploy1002 with the one from the latest backup - T307349

This was done by running:

$ sudo mv /srv/deployment /srv/deployment.T307349 && sudo mv /srv/restore/srv/deployment /srv/
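The swap above is two renames in a row. A slightly more defensive sketch, for reference, checks that the restored copy exists and is non-empty, and that the backup path is free, before touching the live directory. The function name swap_in_restore and its argument order are hypothetical; on the real host it would need to run as root.

```shell
# swap_in_restore LIVE RESTORED BACKUP
# Moves LIVE aside to BACKUP, then moves RESTORED into LIVE's place.
swap_in_restore() {
    live="$1"
    restored="$2"
    backup="$3"
    # Refuse to proceed if the restored copy is missing or empty
    if [ ! -d "$restored" ] || [ -z "$(ls -A "$restored")" ]; then
        echo "restored copy missing or empty: $restored" >&2
        return 1
    fi
    # Refuse to clobber an existing backup of the live directory
    if [ -e "$backup" ]; then
        echo "backup path already exists: $backup" >&2
        return 1
    fi
    # The second mv only runs if the first one succeeded
    mv "$live" "$backup" && mv "$restored" "$live"
}
```

With the paths from this task, the call would be swap_in_restore /srv/deployment /srv/restore/srv/deployment /srv/deployment.T307349.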

I ran the git rev-parse HEAD check again:

$ diff --text -U0 deploy2002 deploy1002
--- deploy2002	2022-05-02 15:17:27.886187073 +0200
+++ deploy1002	2022-05-02 15:17:31.078228280 +0200
@@ -89 +88,0 @@
-/srv/deployment/ores/deploy/submodules/assets/.git08b9cebc5e0148ace73915a94e65a5dd0f0ba9a2
@@ -95 +94,2 @@
-/srv/deployment/parsoid/config/node_modules/events/@modules/raw.github.com/Gozala/extendables/v0.2.0/.git/srv/deployment/parsoid/config/.gitf12b09d5dfeb18c74104a5cdefc7ec4ea23a2b2a
+/srv/deployment/parsoid/config/.gitf12b09d5dfeb18c74104a5cdefc7ec4ea23a2b2a
+/srv/deployment/parsoid/config/node_modules/events/@modules/raw.github.com/Gozala/extendables/v0.2.0/.git/srv/deployment/ores/deploy/submodules/assets/.git08b9cebc5e0148ace73915a94e65a5dd0f0ba9a2

elukey lowered the priority of this task from Unbreak Now! to Medium. May 2 2022, 1:21 PM

Everything has been restored and /var/lock/scap-global-lock has been removed, so deployments can proceed.

Please note that we kept the old /srv/deployment dir on deploy1002 (/srv/deployment.T307349), in case anything needs to be fetched from there.

To all deployers: sorry for the inconvenience. Please keep an extra eye open when deploying to see if anything weird pops up.

Let's keep this task open for a couple more days in case people need to ask questions or request info.

From IRC: the big diff at T307349#7895665 shows that deploy2002 has repositories at a more recent git commit. The reason is that deploy2002 was reprovisioned by Puppet, which cloned all repositories listed in scap::sources of hieradata/role/common/deployment_server/kubernetes.yaml and checked out origin/HEAD. However, some of those repositories are no longer deployed by scap, so they never advanced on deploy1001, which is stalled at a commit in the past.

Eventually the hourly sync_deployment_dir.timer made deploy2002 pull from deploy1001, which corrected the incorrect state set by Puppet.

A follow-up action would be to remove obsolete repositories from hieradata/role/common/deployment_server/kubernetes.yaml and delete them from the primary deployment server. There might also be repositories no longer defined in Puppet which would need to be deleted from the deployment server.
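As a starting point for that cleanup, here is a rough sketch that lists git checkouts present on disk but absent from a list of expected repositories. It assumes a hypothetical plain-text file with one repository name per line (for example parsoid/deploy), prepared by hand from the scap::sources entries; the function name find_unmanaged is also hypothetical. Nested submodule checkouts are reported too, so the output needs manual review before anything is deleted.

```shell
# find_unmanaged EXPECTED_FILE ROOT
# Prints checkouts under ROOT (relative paths) that are not listed in EXPECTED_FILE.
find_unmanaged() {
    expected_file="$1"
    root="$2"
    tmp_expected=$(mktemp)
    tmp_actual=$(mktemp)
    sort -u "$expected_file" > "$tmp_expected"
    # List every directory containing a .git entry, relative to ROOT
    ( cd "$root" && find . -name .git -prune 2>/dev/null ) \
        | sed -e 's|^\./||' -e 's|/\.git$||' \
        | sort -u > "$tmp_actual"
    # comm -13 keeps only the lines unique to the second (on-disk) list
    comm -13 "$tmp_expected" "$tmp_actual"
    rm -f "$tmp_expected" "$tmp_actual"
}
```

Swapping comm -13 for comm -23 gives the opposite view: repositories that are expected but missing on disk.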

Adding the Parsoid team here for awareness: please check that the repo and all its submodules look as expected on the next deploy, and that no data was lost on deploy1002.

ssastry edited projects, added Parsoid (Tracking); removed Parsoid.

@elukey We haven't received any bad reports so far. Are we good to close this task, or are there any outstanding action items?

T309162 is still actionable from the incident.

We can close this task and see if any cleanup is needed in the follow-up task :)

There is a directory "deployment.T307349" under /srv/ on deploy1002 that uses 47GB.

And the deploy server is out of disk in /srv now.

Mentioned in SAL (#wikimedia-operations) [2023-02-14T19:36:34Z] <mutante> root@deploy1002:/srv# rm -rf deployment.T307349/

I deleted it.

19:36 < mutante> !log root@deploy1002:/srv# rm -rf deployment.T307349/
19:36 < mutante> /dev/mapper/vg0-srv 277G 216G 47G 83% /srv