I can confirm these have been happening for a while now. The pattern is always:
[backup1001:~] $ sudo bconsole
Connecting to Director backup1001.eqiad.wmnet:9101
1000 OK: 103 backup1001.eqiad.wmnet Version: 9.4.2 (04 February 2019)
Enter a period to cancel a command.
*restore
..
5: Select the most recent backup for a client
..
96: gitlab1001.wikimedia.org-fd
..
Select the Client (1-228): 96
The defined FileSet resources are:
Selection list for "FileSet" is empty!
No FileSet found for client "gitlab1001.wikimedia.org-fd".
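That "Selection list for FileSet is empty" error usually means the director has no FileSet (and matching Job) defined for this client. As a rough sketch only, with names, paths, and schedule/storage resources being assumptions rather than our actual config, the director side would need something like:

```
# /etc/bacula/bacula-dir.conf (sketch; all names and paths are hypothetical)
FileSet {
  Name = "gitlab"
  Include {
    Options { signature = MD5; compression = GZIP }
    File = /srv/gitlab-backup
    File = /etc/gitlab/config_backup
  }
}

Job {
  Name = "gitlab1001-backup"           # hypothetical job name
  Client = "gitlab1001.wikimedia.org-fd"
  FileSet = "gitlab"
  Type = Backup
  Level = Full
  Schedule = "WeeklyCycle"             # assumed schedule resource
  Storage = File                       # assumed storage resource
  Pool = Default
  Messages = Standard
}
```

In our setup these stanzas would normally be generated by puppet rather than edited by hand, so the real fix is on the puppet side.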
Let's call it Done once we have confirmed on the Bacula side that there is a full backup and it's restorable.
added to DNS
dag.wikipedia.org is an alias for dyna.wikimedia.org.
Fri, Jun 11
I reverted the firewall (ferm) change that allowed mgmt to connect to install since as comments above say it couldn't be used for firmware upgrades.
I agree, pretty sure this only happens when we add/remove disks, never happened randomly on just a reboot to me.
The 150G secondary disk has been removed from the prometheus3001 VM.
secondary disk has been removed
Strangely, the network interface was renamed: it was ens14 before the shutdown, and after rebooting it is ens13.
Wed, Jun 9
Here are the MAC addresses, but we still need the DHCP change; I could not upload it right now:
dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 8 --disk 10 --network public eqiad_D doh1002
Ready to create Ganeti VM doh1002.wikimedia.org in the ganeti01.svc.eqiad.wmnet cluster on row D with 2 vCPUs, 8GB of RAM, 10GB of disk in the public network.
I copied the data to @deploy1002:/srv/miscweb (using gerrit:699064) and compressed it over there; it's a 440MB .tar.gz with default settings.
dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 8 --disk 10 --network public eqiad_C doh1001
Ready to create Ganeti VM doh1001.wikimedia.org in the ganeti01.svc.eqiad.wmnet cluster on row C with 2 vCPUs, 8GB of RAM, 10GB of disk in the public network.
The size I want to add here is 3.6G _uncompressed_, so what I actually want to add should be only a couple hundred MB in a single .tar.gz. Given that the other discussion there talks about terabytes, it seems like a drop in the ocean though, maybe?
Tue, Jun 8
reserved port 4111 as public TLS service port for miscweb
Tentatively closing since it sounds like the issues are resolved.
Mon, Jun 7
Sat, Jun 5
Alright, so if we can't make public dumps then I don't see a difference between the two things; both contain secrets, right?
Fri, Jun 4
I think all backups are treated as if they contain secrets, it's not like they are made public. I am not aware of 2 levels of security when it comes to that.
I added /etc/gitlab/config_backup to the suggested backup::set, so it would consist of /srv/gitlab-backup and /etc/gitlab/config_backup as part of a single set called "gitlab".
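As a hedged sketch of what that single set could look like in puppet (the class and resource names follow the usual backup::set pattern, but the exact interface and where the paths are declared should be checked against the actual module; everything here is illustrative):

```puppet
# Sketch only: assumes a profile that registers one logical backup set
# named "gitlab" covering both directories. Not verified against the
# real backup module's interface.
class profile::gitlab::backup {
    include ::profile::backup::host

    # One set named "gitlab"; the corresponding fileset would list
    # /srv/gitlab-backup and /etc/gitlab/config_backup on the director side.
    backup::set { 'gitlab': }
}
```

The point of keeping it as a single set is that a restore of "gitlab" brings back both the data and the config in one operation.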
I can also wget http://mirrors.wikimedia.org/debian/pool/main/f/ffmpeg/libavcodec58_4.3.2-0+deb11u1_amd64.deb without problems now.
18:38 < icinga-wm> PROBLEM - Check systemd state on cp1087 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket
Sure you want to open ssh to the public before backups and logging tasks are done?
boldly closing this - shell access was done quite some time ago and there have been no further issues that we heard of
boldly closing this since there obviously aren't blockers anymore or contractors wouldn't have been able to work all this time
ACCEPT tcp -- anywhere gitlab.wikimedia.org tcp dpt:http
ACCEPT tcp -- anywhere gitlab.wikimedia.org tcp dpt:https
ACCEPT tcp -- anywhere gitlab.wikimedia.org tcp dpt:ssh
Thanks! I think the SSH part was T276148 (shouldn't that be closed now? We can't have the ssh port open before the ssh setup is done, surely?) and separate from this ticket, though.
per T276144#7133694 the SSH port is open to the world now.
Alright, thanks! That's not even on a different note but exactly the kind of feedback needed. So we'll have to add /etc/gitlab/config_backup to our new backup::set. It wouldn't be separate, though. In which way are they supposed to be separate?
@ssingh doh5001.wikimedia.org is ready for you now. doh5002 on hold for lack of IP in that subnet.
Thu, Jun 3
@JAnstee_WMF Alright, I made a patch to replace your key and uploaded it to code review.
If it turns out you need to make new keys, just paste the public part here and ask us to update the repository.
Ok, thank you. Hmm.. Let's try this:
ACK, thanks @fgiunchedi. As commented on the linked ticket, multiple people have mentioned we have metal there that prometheus could move to. Seems to me that is what we should probably do (instead of using that hardware as another ganeti server).
doh5001 has been created, but doh5002 hit resource limits here as well; even though we used just 10G of disk, maybe it's another resource:
May 31 16:02:36 deneb docker-report-releng: ERROR[docker-report] Debmonitor report for image docker-registry.wikimedia.org/releng/quibble-jessie-php56:0.0.31-1 failed
2021-06-03 19:19:30 3d 2h 47m 35s 3/3 CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service
checked on install* that nginx-full is gone, nginx-light is there and restarted nginx to be sure
Jun 2 23:05:38 bast4003 sshd: Accepted key ED25519 SHA256:plaVmNDA1Ug/00RQCUV2WfIKRDNwP7GLq9NouyMKMJM found at /etc/ssh/userkeys/janstee:1
Jun 2 23:05:38 bast4003 sshd: Postponed publickey for janstee from 184.108.40.206 port 60707 ssh2 [preauth]
Would you mind pasting the contents of /Users/janstee/.ssh/config, Jaime?
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp1087.eqiad.wmnet with reason: replaced DIMM https://phabricator.wikimedia.org/T278729
debug1: Executing proxy command: exec ssh -a -W stat1006.ulsfo.wmnet:22 email@example.com
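For reference, a typical ~/.ssh/config stanza that would produce a ProxyCommand like the one above might look like this (the username and bastion hostname are assumptions for illustration, not taken from Jaime's actual config):

```
# Sketch: route internal *.wmnet hosts through a bastion.
# User and bastion name are illustrative only.
Host *.wmnet
    User janstee
    ProxyCommand ssh -a -W %h:%p bast4003.wikimedia.org
```

Comparing the real config against something like this should show where the odd target hostname in the debug output is coming from.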
Wed, Jun 2
[doh3001:~] $ df -h
Filesystem      Size  Used Avail Use% Mounted on
..
/dev/vda1       8.9G  1.6G  6.8G  19% /
VMs have been created.
We ran out of resources (disk) on esams ganeti. After discussion, creating the machines with just 10G instead.
Today we tried to create 2 new VMs in the ganeti esams cluster and actually ran out of resources.
When we recreate the bastions without prometheus, we don't need to use 40GB disk anymore, right?
all interested, please take a look at the latest Gerrit link above and feel free to review/comment there.
We now have the Bacula client (fd for file daemon) running on gitlab1001. It shows up in the process list:
[gitlab1001:~] $ ps aux | grep bacula
root 11735 0.0 0.0 26140 7684 ? Ssl 19:00 0:00 /usr/sbin/bacula-fd -fP -c /etc/bacula/bacula-fd.conf
There are 2 things needed to make Bacula backups work on a given host: the client (bacula-fd) running on the host itself, and a matching FileSet/Job definition on the director side.
git clone "https://gerrit.wikimedia.org/r/operations/container/miscweb"
Tue, Jun 1
thanks, Rob! the continuation of this will be T280597
note: if a 'periodic job' is needed, these are not literally cron jobs anymore; nowadays they are systemd timers, and in other, non-mediawiki places we are still in the process of converting from cron to systemd. The puppet class 'profile::mediawiki::periodic_job' should be used for this; it will create a service and a timer on the mwmaint hosts.
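As a hedged sketch of what a declaration could look like (the job name, command, and interval below are made-up placeholders, and the exact parameter names of profile::mediawiki::periodic_job should be checked against the puppet repo before use):

```puppet
# Sketch only: names, command, and parameter set are illustrative.
profile::mediawiki::periodic_job { 'my-maintenance-job':
    command  => '/usr/local/bin/mwscript maintenance/someScript.php --wiki=aawiki',
    interval => '*-*-* 04:30:00',   # systemd OnCalendar syntax
}
```

The interval uses systemd calendar-event syntax rather than a crontab line, which is the main thing to watch out for when converting existing cron jobs.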
Sun, May 30
Ok, thanks for additional info! cheers
Sat, May 29
MariaDB [wikistats]> insert into mediawikis (statsurl, method) values ("https://jojowiki.com/api.php",8);