@hashar reports that there is graphite data missing:
That grafana dashboard uses graphite data from labs:
This seems likely related to all the operations surrounding T299744: cloudmetrics1004 potential hardware problem
It seems we have lost historical data too, but we synced it when moving to the newer boxes, no?
aborrero@cumin1001:~ $ sudo cumin --force cloudmetrics1* 'du -h --max-depth=1 /srv/carbon/'
4 hosts will be targeted:
cloudmetrics[1001-1004].eqiad.wmnet
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====
(1) cloudmetrics1001.eqiad.wmnet
----- OUTPUT of 'du -h --max-depth=1 /srv/carbon/' -----
605G /srv/carbon/whisper
605G /srv/carbon/
===== NODE GROUP =====
(1) cloudmetrics1002.eqiad.wmnet
----- OUTPUT of 'du -h --max-depth=1 /srv/carbon/' -----
335G /srv/carbon/whisper
335G /srv/carbon/
===== NODE GROUP =====
(1) cloudmetrics1004.eqiad.wmnet
----- OUTPUT of 'du -h --max-depth=1 /srv/carbon/' -----
3.5G /srv/carbon/whisper
3.5G /srv/carbon/
===== NODE GROUP =====
(1) cloudmetrics1003.eqiad.wmnet
----- OUTPUT of 'du -h --max-depth=1 /srv/carbon/' -----
34G /srv/carbon/whisper
34G /srv/carbon/
================
We are serving graphite stats from cloudmetrics1004.eqiad.wmnet due to T297814: cloudmetrics1003 seizes up under load, but it seems the data is still there on cloudmetrics1001. So perhaps the simplest thing to do is to make cloudmetrics1001 the primary.
Change 756954 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: monitoring: sharpen primary/backup rsync setup
Change 756957 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/dns@master] wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints
Change 756958 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: monitoring: make cloudmetrics1001 the primary
Change 756958 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] wmcs: monitoring: make cloudmetrics1001 the primary
Change 756957 merged by Arturo Borrero Gonzalez:
[operations/dns@master] wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints
Mentioned in SAL (#wikimedia-cloud) [2022-01-25T10:49:46Z] <arturo> made cloudmetrics1001/1002 primary/backup respectively (T299744, T297814, T300011)
Change 756954 merged by Andrew Bogott:
[operations/puppet@production] wmcs: monitoring: sharpen primary/backup rsync setup
Part of this discrepancy is due to sparse files being handled differently on the different hosts. "du --apparent-size" produces the same usage on 1001 and 1002.
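For anyone following along, here is a minimal sketch of why plain "du" and "du --apparent-size" can disagree. It assumes a Unix-like host where os.stat exposes st_blocks; the file name example.wsp and the helper names are made up for illustration and are not taken from the real hosts. Whether the two numbers differ depends on whether the filesystem actually stores the file sparsely.

```python
import os
import tempfile


def disk_usage(path):
    """Return (apparent_bytes, allocated_bytes) for a single file.

    apparent_bytes is the logical file length (what `du --apparent-size` sums).
    allocated_bytes is st_blocks * 512, the space actually allocated on disk
    (what plain `du` sums). st_blocks is only available on Unix-like systems.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def make_sparse_file(path, size):
    """Create a file of the given logical size without writing any data.

    On filesystems that support sparse files this allocates few or no blocks.
    """
    with open(path, "wb") as f:
        f.truncate(size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name, only for illustration.
        path = os.path.join(tmp, "example.wsp")
        make_sparse_file(path, 100 * 1024 * 1024)
        apparent, allocated = disk_usage(path)
        print(f"apparent size: {apparent} bytes")
        print(f"allocated:     {allocated} bytes")
```

If a copy of such a file is written out in full on another host (for example by a copy that does not preserve holes), the allocated number grows while the apparent size stays the same, which would be consistent with the mismatch seen between hosts above.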
After a manual rsync:
andrew@cumin1001:~$ sudo cumin --force cloudmetrics1* 'du --apparent-size -h --max-depth=1 /srv/carbon/'
4 hosts will be targeted:
cloudmetrics[1001-1004].eqiad.wmnet
FORCE mode enabled, continuing without confirmation
(1) cloudmetrics1003.eqiad.wmnet
628G /srv/carbon/whisper
628G /srv/carbon/
(3) cloudmetrics[1001-1002,1004].eqiad.wmnet
627G /srv/carbon/whisper
PASS |███████████████████████████████████████████████████████████████████████| 100% (4/4) [00:17<00:00, 5.25s/hosts]
FAIL | | 0% (0/4) [00:17<?, ?hosts/s]
100.0% (4/4) success ratio (>= 100.0% threshold) for command: 'du --apparent-si...=1 /srv/carbon/'.
100.0% (4/4) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Change 757097 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Revert "wmcs: monitoring: make cloudmetrics1001 the primary"
Change 757097 merged by Andrew Bogott:
[operations/puppet@production] Revert "wmcs: monitoring: make cloudmetrics1001 the primary"
Change 757101 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/dns@master] Revert "wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints"
Change 757101 merged by Andrew Bogott:
[operations/dns@master] Revert "wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints"
Change 757110 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Make cloudmetrics1003/1004 monitoring hosts, 1001/1002 spare systems.
Change 757110 merged by Andrew Bogott:
[operations/puppet@production] Make cloudmetrics1003/1004 monitoring hosts, 1001/1002 spare systems.
There's a pretty big gap in history but most historical data is now present on cloudmetrics1003/1004. I confirmed that metrics are getting copied over to 1004 every hour.
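One possible way to spot-check the hourly copy, sketched under assumptions: walk a directory tree and see whether the newest file modification time is within the last hour. The helper names and the one-hour default are hypothetical, and this is not how the check above was actually done. It also assumes the copy preserves modification times, in which case the newest mtime on the backup reflects how recent the copied data is rather than when the copy ran.

```python
import os
import time


def newest_mtime(root):
    """Return the most recent file mtime under root, or None if no files exist."""
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                mtime = os.stat(os.path.join(dirpath, name)).st_mtime
            except FileNotFoundError:
                # The file vanished between listing and stat; skip it.
                continue
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def is_fresh(root, max_age_seconds=3600, now=None):
    """True if at least one file under root was modified within max_age_seconds."""
    now = time.time() if now is None else now
    newest = newest_mtime(root)
    return newest is not None and (now - newest) <= max_age_seconds


if __name__ == "__main__":
    # Path taken from the du output in this task; adjust as needed.
    print(is_fresh("/srv/carbon/whisper"))
```

Walking a tree of this size can take a while, so on a real host it may make more sense to point it at a single metric subdirectory.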