
clouddumps100[12] puppet alert: "Puppet performing a change on every puppet run"
Closed, ResolvedPublic

Assigned To
Authored By
Andrew
Sep 12 2023, 4:11 PM
Referenced Files
F37917756: image.png
Oct 2 2023, 9:45 AM
F37917749: image.png
Oct 2 2023, 9:45 AM
F37917753: image.png
Oct 2 2023, 9:45 AM
F37853141: image.png
Sep 29 2023, 4:09 PM

Description

There's some sort of argument happening over /srv/dumps/xmldatadumps/public/dvd.html -- puppet is updating it on almost every run.

Ariel says:

there's probably a dvd.html file that was added in puppet and updated, and some old copy that is on the primary dumps nfs share that is getting rsynced over, overwriting the one in puppet
the easy fix then is to remove the copy on the primary nfs share, and toss it, indeed
11:03 AM and then might as well toss it on the secondary nfs share too
11:04 AM if you make a task we can make sure it's gone from any spare nfs share, so that if some host is swapped in to replace another, the problem doesn't return

Event Timeline

Ok, to resolve I'm going to erase this dvd.html file from all the dumpsdata hosts as per docs:

sudo -u dumpsgen rm /data/xmldatadumps/public/dvd.html
sudo -u dumpsgen rm /data/xmldatadumps/public/dvd.html.not

(and done)

That was fixed, but now there seem to be some permissions that are getting overwritten:

Notice: /Stage[main]/Dumps::Web::Dirs/File[/srv/dumps/xmldatadumps/public/wikidatawiki/entities]/owner: owner changed 'dumpsgen' to 'root' (corrective)
Notice: /Stage[main]/Dumps::Web::Dirs/File[/srv/dumps/xmldatadumps/public/wikidatawiki/entities]/group: group changed 'dumpsgen' to 'root' (corrective)
Notice: /Stage[main]/Dumps::Web::Fetches::Phab/Dumps::Web::Fetches::Job[phabdumps]/File[/srv/dumps/xmldatadumps/public/other/misc]/owner: owner changed 'dumpsgen' to 'root' (corrective)
Notice: /Stage[main]/Dumps::Web::Fetches::Phab/Dumps::Web::Fetches::Job[phabdumps]/File[/srv/dumps/xmldatadumps/public/other/misc]/group: group changed 'dumpsgen' to 'root' (corrective)

Looking into the first case, here is the directory that puppet keeps modifying:

btullis@clouddumps1001:~$ ls -l /srv/dumps/xmldatadumps/public/wikidatawiki
total 6304
drwxr-xr-x 2 dumpsgen dumpsgen  192512 Aug  1 09:35 20230620
drwxr-xr-x 2 dumpsgen dumpsgen 1486848 Aug 20 09:48 20230701
drwxr-xr-x 2 dumpsgen dumpsgen  196608 Sep  1 09:35 20230720
drwxr-xr-x 2 dumpsgen dumpsgen 1478656 Sep 20 09:45 20230801
drwxr-xr-x 2 dumpsgen dumpsgen  192512 Aug 24 21:21 20230820
drwxr-xr-x 2 dumpsgen dumpsgen 1486848 Sep 18 18:37 20230901
drwxr-xr-x 2 dumpsgen dumpsgen  188416 Sep 24 22:33 20230920
lrwxrwxrwx 1 root     root          30 Sep 22  2015 entities -> ../other/wikibase/wikidatawiki
drwxrwxr-x 2 dumpsgen dumpsgen 1200128 Sep 24 22:32 latest

Puppet is setting that symlink to be owned by root:root.
That resource is defined here: https://github.com/wikimedia/operations-puppet/blob/production/modules/dumps/manifests/web/dirs.pp#L43-L47
...with a warning that this is mounted over NFS at different locations on different hosts.
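Worth noting: a symlink is an inode of its own, so it carries its own owner/group, independent of the target's — which is why `ls -l` can show the `entities` link as `root root` while the directories around it belong to `dumpsgen`. A minimal sketch of the distinction (scratch paths, not the real dump tree):

```python
import os
import stat
import tempfile

# Build a target directory and a symlink to it in a scratch area.
scratch = tempfile.mkdtemp()
target = os.path.join(scratch, "wikidatawiki")
link = os.path.join(scratch, "entities")
os.mkdir(target)
os.symlink(target, link)

# lstat() inspects the link itself; stat() follows it to the target.
# Each has its own inode, so each can have its own owner/group
# (changing the link's owner needs os.lchown, not os.chown).
link_info = os.lstat(link)
target_info = os.stat(link)
print(stat.S_ISLNK(link_info.st_mode))    # True: the link itself
print(stat.S_ISDIR(target_info.st_mode))  # True: what it points to
```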

Interestingly, if I type `showmount` it doesn't list any NFS clients of this host.

btullis@clouddumps1001:~$ sudo showmount
Hosts on clouddumps1001:

The second resource that is being managed and changed by puppet is this:

btullis@clouddumps1001:~$ ls -ld /srv/dumps/xmldatadumps/public/other/misc
drwxr-xr-x 2 root root 4096 Dec 16  2017 /srv/dumps/xmldatadumps/public/other/misc

This is defined here: https://github.com/wikimedia/operations-puppet/blob/production/modules/dumps/manifests/web/fetches/phab.pp#L6-L13 and is of type dumps::web::fetches::job

I'm going to start by investigating the first of these changes.

As far as I can see, that class is only applied to the two hosts in question:

btullis@cumin1001:~$ sudo cumin C:dumps::web::dirs
2 hosts will be targeted:
clouddumps[1001-1002].wikimedia.org
DRY-RUN mode enabled, aborting

This is a good thing, because it means that we should be able to make the change in puppet to match what is being synced. Since showmount lists no NFS clients of this host, I'm assuming that the change in ownership is happening as a result of an rsync process. I will try to find the source next and verify this.

Change 961412 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Change the owner:group of the wikidatawiki entities link

https://gerrit.wikimedia.org/r/961412

Although I prepared a patch to fix the wikidatawiki issue, I'm not sure it's necessary.

I have removed the /data/xmldatadumps/public/wikidatawiki/entities symlink from dumpsdata1006, since it appears that this might have been a stray symlink that wasn't supposed to be there. The corresponding symlink isn't present for the commonswiki dump on dumpsdata1006; it only appears once the data reaches the clouddumps100[1-2] servers.

btullis@clouddumps1002:~$ ls -l /srv/dumps/xmldatadumps/public/wikidatawiki/
total 6296
drwxr-xr-x 2 dumpsgen dumpsgen  192512 Aug  1 09:35 20230620
drwxr-xr-x 2 dumpsgen dumpsgen 1474560 Aug 20 09:48 20230701
drwxr-xr-x 2 dumpsgen dumpsgen  188416 Sep  1 09:35 20230720
drwxr-xr-x 2 dumpsgen dumpsgen 1519616 Sep 20 09:45 20230801
drwxr-xr-x 2 dumpsgen dumpsgen  200704 Aug 24 21:21 20230820
drwxr-xr-x 2 dumpsgen dumpsgen 1474560 Sep 18 18:37 20230901
drwxr-xr-x 2 dumpsgen dumpsgen  212992 Sep 27 16:25 20230920
lrwxrwxrwx 1 root     root          30 Sep 22  2015 entities -> ../other/wikibase/wikidatawiki
drwxrwxr-x 2 dumpsgen dumpsgen 1150976 Sep 27 16:25 latest
btullis@clouddumps1002:~$ ls -l /srv/dumps/xmldatadumps/public/commonswiki
total 3224
drwxr-xr-x 2 dumpsgen dumpsgen 532480 Jul 20 09:45 20230601
drwxr-xr-x 2 dumpsgen dumpsgen 167936 Aug  1 09:40 20230620
drwxr-xr-x 2 dumpsgen dumpsgen 536576 Aug 20 09:57 20230701
drwxr-xr-x 2 dumpsgen dumpsgen 167936 Sep  1 09:38 20230720
drwxr-xr-x 2 dumpsgen dumpsgen 561152 Sep 20 09:54 20230801
drwxr-xr-x 2 dumpsgen dumpsgen 155648 Aug 22 07:46 20230820
drwxr-xr-x 2 dumpsgen dumpsgen 557056 Sep 13 01:43 20230901
drwxr-xr-x 2 dumpsgen dumpsgen 163840 Sep 27 16:35 20230920
lrwxrwxrwx 1 root     root         29 Feb 21  2021 entities -> ../other/wikibase/commonswiki
drwxrwxr-x 2 dumpsgen dumpsgen 421888 Sep 27 16:35 latest
btullis@dumpsdata1006:/data/xmldatadumps/public$ ls -l wikidatawiki/ commonswiki/
commonswiki/:
total 2008
drwxr-xr-x 2 dumpsgen dumpsgen 569344 Sep 20 09:54 20230801
drwxr-xr-x 2 dumpsgen dumpsgen 180224 Aug 22 07:46 20230820
drwxr-xr-x 2 dumpsgen dumpsgen 552960 Sep 13 01:43 20230901
drwxr-xr-x 2 dumpsgen dumpsgen 167936 Sep 22 06:50 20230920
drwxrwxr-x 2 dumpsgen dumpsgen 565248 Sep 22 06:50 latest

wikidatawiki/:
total 4952
drwxr-xr-x 2 dumpsgen dumpsgen 1486848 Sep 20 09:45 20230801
drwxr-xr-x 2 dumpsgen dumpsgen  184320 Aug 24 21:21 20230820
drwxr-xr-x 2 dumpsgen dumpsgen 1519616 Sep 18 18:37 20230901
drwxr-xr-x 2 dumpsgen dumpsgen  204800 Sep 24 22:33 20230920
lrwxrwxrwx 1 dumpsgen dumpsgen      30 Sep 22  2015 entities -> ../other/wikibase/wikidatawiki
drwxrwxr-x 2 dumpsgen dumpsgen 1654784 Sep 24 22:32 latest

So I've deleted that symlink from dumpsdata1006:

btullis@dumpsdata1006:/data/xmldatadumps/public/wikidatawiki$ sudo rm entities

I haven't quite worked out precisely where the /srv/dumps/xmldatadumps/public/other/misc chown is coming from, but I think it's something to do with rsyncer_misc and the dumps-rsyncer.service running on dumpsdata1006.

I'll keep looking into it.

Here is the source directory on dumpsdata1006. It is owned by dumpsgen:dumpsgen and is empty.

btullis@dumpsdata1006:/data/xmldatadumps/public/other$ ls -la misc/
total 8
drwxr-xr-x  2 dumpsgen dumpsgen 4096 Aug 29  2017 .
drwxr-xr-x 13 dumpsgen dumpsgen 4096 Aug 29  2017 ..

I believe therefore that I can safely delete it.

btullis@dumpsdata1006:/data/xmldatadumps/public/other$ sudo rmdir misc/
btullis@dumpsdata1006:/data/xmldatadumps/public/other$

I'll monitor the puppet runs for a little while, to see if they now run without changes.

Oh, that didn't work. The owner and group of clouddumps1002:/srv/dumps/xmldatadumps/public/other/misc was still changed back to dumpsgen:dumpsgen by this rsync process.

Sep 28 11:22:46 clouddumps1002 rsyncd[734853]: connect from dumpsdata1006.eqiad.wmnet (2620:0:861:109:10:64:130:3)
Sep 28 11:22:46 clouddumps1002 rsyncd[734853]: rsync allowed access on module data from dumpsdata1006.eqiad.wmnet (2620:0:861:109:10:64:130:3)
Sep 28 11:22:47 clouddumps1002 rsyncd[734853]: rsync to data/xmldatadumps/public/ from dumpsdata1006.eqiad.wmnet (2620:0:861:109:10:64:130:3)
Sep 28 11:22:47 clouddumps1002 rsyncd[734853]: receiving file list
Sep 28 11:22:51 clouddumps1002 rsyncd[734860]: connect from dumpsdata1006.eqiad.wmnet (2620:0:861:109:10:64:130:3)
Sep 28 11:22:51 clouddumps1002 rsyncd[734860]: rsync allowed access on module data from dumpsdata1006.eqiad.wmnet (2620:0:861:109:10:64:130:3)
Sep 28 11:22:51 clouddumps1002 rsyncd[734853]: sent 13854 bytes  received 12706907 bytes  total size 18162011245923
Sep 28 11:22:51 clouddumps1002 rsyncd[734860]: rsync to data/xmldatadumps/public/ from dumpsdata1006.eqiad.wmnet (2620:0:861:109:10:64:130:3)
Sep 28 11:22:51 clouddumps1002 rsyncd[734860]: receiving file list
Sep 28 11:22:51 clouddumps1002 rsyncd[734860]: sent 25034 bytes  received 16884311 bytes  total size 17290727

As we can see from these two consecutive ls operations (although it would have been more useful if I had captured the times at which they were executed):

btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/other$ ls -ld misc/
drwxr-xr-x 2 root root 4096 Dec 16  2017 misc/
btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/other$ ls -ld misc/
drwxr-xr-x 2 dumpsgen dumpsgen 4096 Nov 18  2022 misc/
btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/other$
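Since I didn't capture the timestamps above, next time a small watcher could log exactly when the owner/group flips. A sketch only (the path and poll interval are placeholders, not anything deployed):

```python
import grp
import os
import pwd
import time

def owner_of(path):
    """Return (owner, group) names for path, as ls -l shows them."""
    st = os.stat(path)
    return (pwd.getpwuid(st.st_uid).pw_name, grp.getgrgid(st.st_gid).gr_name)

def watch(path, interval=5):
    """Print a timestamped line whenever the owner/group of path changes."""
    prev = None
    while True:
        cur = owner_of(path)
        if cur != prev:
            print(time.strftime("%F %T"), *cur)
            prev = cur
        time.sleep(interval)

if __name__ == "__main__":
    # Hypothetical target, matching the directory from the puppet notices.
    watch("/srv/dumps/xmldatadumps/public/other/misc")
```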

I'll have to look again more closely at how the dumps-rsyncer service works on dumpsdata1006.

I have noticed that each run seems to log some broken pipes, which doesn't seem right.

btullis@dumpsdata1006:~$ systemctl status dumps-rsyncer.service 
● dumps-rsyncer.service - Dumps rsyncer service
     Loaded: loaded (/lib/systemd/system/dumps-rsyncer.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2023-08-31 10:51:03 UTC; 4 weeks 0 days ago
   Main PID: 1310 (bash)
      Tasks: 2 (limit: 76754)
     Memory: 5.9G
        CPU: 1d 18h 3min 3.067s
     CGroup: /system.slice/dumps-rsyncer.service
             ├─   1310 /bin/bash /usr/local/bin/rsync-via-primary.sh --do_tarball --do_rsync_xml --xmldumpsdir /data/xmldatadumps/public --xmlremotedirs dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clou>
             └─2835104 sleep 600

Sep 28 10:51:22 dumpsdata1006 dumps-rsyncer[2829552]: ls: write error: Broken pipe
Sep 28 11:01:31 dumpsdata1006 dumps-rsyncer[2830569]: ls: write error: Broken pipe
Sep 28 11:01:40 dumpsdata1006 dumps-rsyncer[2830626]: ls: write error: Broken pipe
Sep 28 11:01:49 dumpsdata1006 dumps-rsyncer[2830676]: ls: write error: Broken pipe
Sep 28 11:11:58 dumpsdata1006 dumps-rsyncer[2833845]: ls: write error: Broken pipe
Sep 28 11:12:07 dumpsdata1006 dumps-rsyncer[2833874]: ls: write error: Broken pipe
Sep 28 11:12:15 dumpsdata1006 dumps-rsyncer[2833915]: ls: write error: Broken pipe
Sep 28 11:22:25 dumpsdata1006 dumps-rsyncer[2834977]: ls: write error: Broken pipe
Sep 28 11:22:33 dumpsdata1006 dumps-rsyncer[2835024]: ls: write error: Broken pipe
Sep 28 11:22:42 dumpsdata1006 dumps-rsyncer[2835052]: ls: write error: Broken pipe

Change 961412 abandoned by Btullis:

[operations/puppet@production] Change the owner:group of the wikidatawiki entities link

Reason:

Not required as we found another way.

https://gerrit.wikimedia.org/r/961412

I found some sources of the /srv/dumps/xmldatadumps/public/other/misc folder that were being rsynced from dumpsdata1006 and dumpsdata1003 to clouddumps100[1-2].

I have removed these empty source directories and I'm hopeful that this will fix the continual puppet changes. Monitoring the fix now.

BTullis moved this task from Blocked / Waiting to Done on the Data-Platform-SRE board.

I think it's fixed.

image.png (280×926 px, 31 KB)

BTullis moved this task from Done to In Progress on the Data-Platform-SRE board.

It's not quite fixed; it's doing something else annoying. It keeps rewriting the file /etc/ferm/conf.d/10_dumps_rsyncd:

image.png (874×861 px, 84 KB)

The definition of the ferm rule seems to be the same before and after the change, but the zero-padding of some IPv6 addresses differs and one address uses a different letter case:
image.png (588×1 px, 167 KB)

image.png (321×1 px, 122 KB)

I will look into this again.

Here's the full list of IP addresses that are being modified.

(base) btullis@marlin:~/tmp$ cat clouddumps1.txt | tr ' ' '\n' > clouddumps1-translated.txt 
(base) btullis@marlin:~/tmp$ cat clouddumps2.txt | tr ' ' '\n' > clouddumps2-translated.txt 
(base) btullis@marlin:~/tmp$ diff clouddumps1-translated.txt clouddumps2-translated.txt 
1c1
< -&R_SERVICE(tcp,
---
> +&R_SERVICE(tcp,
36,48c36,48
< 2620:0:860:101:10:192:0:141
< 2620:0:860:102:10:192:16:26
< 2620:0:860:103:10:192:32:9
< 2620:0:861:101:10:64:0:66
< 2620:0:861:102:10:64:16:101
< 2620:0:861:103:10:64:32:141
< 2620:0:861:103:10:64:32:164
< 2620:0:861:105:10:64:21:118
< 2620:0:861:108:10:64:53:31
< 2620:0:861:109:10:64:130:3
< 2620:0:861:10d:10:64:134:2
< 2620:0:861:2:208:80:154:142
< 2620:0:861:3:208:80:154:71
---
> 2620::860:101:10:192:0:141
> 2620::860:102:10:192:16:26
> 2620::860:103:10:192:32:9
> 2620::861:101:10:64:0:66
> 2620::861:102:10:64:16:101
> 2620::861:103:10:64:32:141
> 2620::861:103:10:64:32:164
> 2620::861:105:10:64:21:118
> 2620::861:108:10:64:53:31
> 2620::861:109:10:64:130:3
> 2620::861:10D:10:64:134:2
> 2620::861:2:208:80:154:142
> 2620::861:3:208:80:154:71
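The two sides of that diff differ only in textual form. As an illustration (the production code is Ruby/Puppet, not Python), parsing one host's two spellings with Python's `ipaddress` module shows they are the same 128-bit value:

```python
import ipaddress

# One host from the diff above, in its two spellings.
padded = "2620:0:861:10d:10:64:134:2"     # zero group written out, lowercase
compressed = "2620::861:10D:10:64:134:2"  # '::' shorthand, uppercase hex digit

# Textually different, numerically identical.
print(ipaddress.ip_address(padded) == ipaddress.ip_address(compressed))  # True
```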

They are all our servers, so it's not that we are getting the addresses from a DNS server that is not under our control.

(base) btullis@marlin:~/tmp$ dig -f ips.txt 
wdqs2009.codfw.wmnet.
wcqs2001.codfw.wmnet.
mwlog2002.codfw.wmnet.
dumpsdata1004.eqiad.wmnet.
phab1004.eqiad.wmnet.
mwlog1002.eqiad.wmnet.
dumpsdata1005.eqiad.wmnet.
stat1007.eqiad.wmnet.
stat1006.eqiad.wmnet.
dumpsdata1006.eqiad.wmnet.
dumpsdata1007.eqiad.wmnet.
clouddumps1001.wikimedia.org.
clouddumps1002.wikimedia.org.

That list of clients comes from here and here.

All of those hosts are submitted to a dnsquery::lookup function here.

So why are some addresses returned zero-padded and lowercase, while others come back compressed and with uppercase hex?

@jbond - Could I ask you to cast your eye over this (T346165#9219409) please, if you have a little time.

We're seeing the same change happening repeatedly on clouddumps100[12], but it's not on every puppet run.
https://puppetboard.wikimedia.org/node/clouddumps1001.wikimedia.org
https://puppetboard.wikimedia.org/node/clouddumps1002.wikimedia.org

The DNS results for these servers are always consistent, but the zero-padding and case of the results are inconsistent, which is resulting in these repeat puppet changes.
There doesn't seem to be anything strange about the resolver configuration for these servers. I checked with the Traffic team in case it's something we have seen before, but it didn't ring any bells.

If you could help me to check out the possible cause, I'd be very grateful.

Change 963062 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:dumps::distribution::ferm: pass array directly do ferm::service

https://gerrit.wikimedia.org/r/963062

@BTullis from a quick look, the resource that is changing is the ferm::service resource from profile::dumps::distribution::ferm. This creates an array of addresses based on DNS lookups. My best guess is that the DNS is changing frequently, which is why you see puppet changes.


Thanks @jbond, that was my first thought too. I thought the addresses might belong to external servers, whose DNS we aren't responsible for. But it turns out that they are addresses for our own servers, so I don't think they are changing.

It's just the zero-padding and case of the IPv6 addresses and most of them are in the .wmnet domain, so they aren't resolvable from anywhere else.

Perhaps we could add a normalization function at the ferm or puppet-dns-lookup layer (lowercase, with the zeros handled in a consistent way)?
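For illustration only (the real fix ended up as a monkey patch in wmflib on the Ruby side), such a normalizer just needs to round-trip each address through a parser that emits the RFC 5952 canonical form: lowercase hex, with `::` used only for runs of two or more zero groups. In Python that is what `ipaddress` produces:

```python
import ipaddress

def normalize_v6(addr: str) -> str:
    """Return the RFC 5952 canonical text form of an IPv6 address."""
    return str(ipaddress.IPv6Address(addr))

# Both spellings seen in the ferm diff collapse to one canonical string,
# so repeated lookups would always compare equal.
print(normalize_v6("2620:0:861:10D:10:64:134:2"))
print(normalize_v6("2620::861:10d:10:64:134:2"))
```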

@BTullis if it's the zero padding we could be hitting a bug in ruby-resolv; I thought I had added a local hack, but I can take a look tomorrow. All answers should already be lowercase, but let me know if you see otherwise (i.e. point me to a puppet report).

But it turns out that they are addresses for our own servers, so I don't think that they are changing.

There is a mix; take a look at profile::dumps::distribution::mirrors.

Change 963062 merged by Jbond:

[operations/puppet@production] P:dumps::distribution::ferm: pass array directly do ferm::service

https://gerrit.wikimedia.org/r/963062

Change 963299 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] wmflib: Add monkey patching

https://gerrit.wikimedia.org/r/963299

So why are some returning with uppercase padded zeros, while others are returned without?

I have sent a patch to try to fix the IPv6 representation, but it will need to wait for @jhathaway to come back for review.

Looks good to me too. Thanks again @jbond.
I'll check back again next week, once it's deployed.

Change 963299 merged by Jbond:

[operations/puppet@production] wmflib: Add monkey patching

https://gerrit.wikimedia.org/r/963299

@BTullis I have merged a patch and run puppet on the two clouddumps hosts 5 times now and seen no diff. Hopefully this has fixed it, but perhaps leave it a day and see what the resulting changes look like.

Great! Many thanks indeed @jbond - I'll monitor for a day or so, as you suggest.