missing graphite data
Closed, Resolved (Public)

Description

@hashar reports that there is graphite data missing:

image.png (720×990 px, 92 KB)

That grafana dashboard uses graphite data from labs:

image.png (389×579 px, 30 KB)

This seems likely related to all the operations surrounding T299744: cloudmetrics1004 potential hardware problem

Event Timeline

It seems we have lost historical data too. But didn't we sync it when moving to the newer boxes?

aborrero@cumin1001:~ $ sudo cumin --force cloudmetrics1* 'du -h --max-depth=1  /srv/carbon/'
4 hosts will be targeted:
cloudmetrics[1001-1004].eqiad.wmnet
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====
(1) cloudmetrics1001.eqiad.wmnet
----- OUTPUT of 'du -h --max-depth=1  /srv/carbon/' -----
605G    /srv/carbon/whisper
605G    /srv/carbon/
===== NODE GROUP =====
(1) cloudmetrics1002.eqiad.wmnet
----- OUTPUT of 'du -h --max-depth=1  /srv/carbon/' -----
335G    /srv/carbon/whisper
335G    /srv/carbon/
===== NODE GROUP =====
(1) cloudmetrics1004.eqiad.wmnet
----- OUTPUT of 'du -h --max-depth=1  /srv/carbon/' -----
3.5G    /srv/carbon/whisper
3.5G    /srv/carbon/
===== NODE GROUP =====
(1) cloudmetrics1003.eqiad.wmnet
----- OUTPUT of 'du -h --max-depth=1  /srv/carbon/' -----
34G     /srv/carbon/whisper
34G     /srv/carbon/
================
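
For anyone who wants to spot this kind of divergence mechanically, here is a minimal Python sketch. It is not part of any existing tooling: the helper names `parse_size` and `lagging_hosts` and the 0.9 threshold are assumptions made for illustration. The figures are the ones from the output above. Note that plain `du` reports allocated blocks, so a low figure is a hint to investigate rather than proof of missing data.

```python
import re

# Binary multipliers for the suffixes printed by `du -h`.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a `du -h` style size such as '3.5G' into bytes."""
    match = re.fullmatch(r"([0-9.]+)([KMGT]?)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * UNITS.get(unit, 1))


def lagging_hosts(usage, min_ratio=0.9):
    """Return hosts whose usage is below min_ratio of the largest host."""
    sizes = {host: parse_size(size) for host, size in usage.items()}
    largest = max(sizes.values())
    return sorted(h for h, s in sizes.items() if s < min_ratio * largest)


# /srv/carbon/whisper usage per host, copied from the cumin output above.
USAGE = {
    "cloudmetrics1001": "605G",
    "cloudmetrics1002": "335G",
    "cloudmetrics1003": "34G",
    "cloudmetrics1004": "3.5G",
}

if __name__ == "__main__":
    print(lagging_hosts(USAGE))
```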

We are serving graphite stats from cloudmetrics1004.eqiad.wmnet due to T297814: cloudmetrics1003 seizes up under load, but it seems that the data is still present on cloudmetrics1001. So perhaps the simplest fix is to make cloudmetrics1001 the primary.

Change 756954 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] wmcs: monitoring: sharpen primary/backup rsync setup

https://gerrit.wikimedia.org/r/756954

Change 756957 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints

https://gerrit.wikimedia.org/r/756957

Change 756958 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] wmcs: monitoring: make cloudmetrics1001 the primary

https://gerrit.wikimedia.org/r/756958

Change 756958 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs: monitoring: make cloudmetrics1001 the primary

https://gerrit.wikimedia.org/r/756958

Change 756957 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints

https://gerrit.wikimedia.org/r/756957

Mentioned in SAL (#wikimedia-cloud) [2022-01-25T10:49:46Z] <arturo> made cloudmetrics1001/1002 primary/backup respectively (T299744, T297814, T300011)

Change 756954 merged by Andrew Bogott:

[operations/puppet@production] wmcs: monitoring: sharpen primary/backup rsync setup

https://gerrit.wikimedia.org/r/756954

Part of this discrepancy is due to sparse files being handled differently on the different hosts. "du --apparent-size" produces the same usage on 1001 and 1002.
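
To illustrate why the two `du` modes can disagree, here is a small Python sketch, assuming a Unix-like system where `os.stat` exposes `st_blocks`. The helper names and the file name `sparse.wsp` are made up for the example. It creates a file with a large hole and compares its apparent size (what `du --apparent-size` counts) with the space actually allocated (what plain `du` counts). How much space is really allocated depends on the filesystem and on how the file was copied, which is one way the same data can show different usage on different hosts.

```python
import os
import tempfile


def make_sparse_file(path, size):
    """Create a file of `size` bytes without writing any data blocks."""
    with open(path, "wb") as handle:
        handle.truncate(size)


def disk_usage(path):
    """Return (apparent_bytes, allocated_bytes) for a file.

    st_size is the apparent size; st_blocks counts 512-byte units
    that are actually allocated on disk.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sparse.wsp")  # hypothetical file name
        make_sparse_file(path, 10 * 1024 * 1024)
        apparent, allocated = disk_usage(path)
        print(f"apparent:  {apparent} bytes")
        print(f"allocated: {allocated} bytes")


if __name__ == "__main__":
    demo()
```

On filesystems that support sparse files, the allocated figure is typically far smaller than the apparent one.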

After a manual rsync:

andrew@cumin1001:~$ sudo cumin --force cloudmetrics1* 'du --apparent-size -h --max-depth=1 /srv/carbon/'
4 hosts will be targeted:
cloudmetrics[1001-1004].eqiad.wmnet
FORCE mode enabled, continuing without confirmation

===== NODE GROUP =====
(1) cloudmetrics1003.eqiad.wmnet
----- OUTPUT of 'du --apparent-si...=1 /srv/carbon/' -----
628G /srv/carbon/whisper
628G /srv/carbon/
===== NODE GROUP =====
(3) cloudmetrics[1001-1002,1004].eqiad.wmnet
----- OUTPUT of 'du --apparent-si...=1 /srv/carbon/' -----
627G /srv/carbon/whisper
627G /srv/carbon/
PASS |███████████████████████████████████████████████████████████████████████| 100% (4/4) [00:17<00:00, 5.25s/hosts]
FAIL | | 0% (0/4) [00:17<?, ?hosts/s]
100.0% (4/4) success ratio (>= 100.0% threshold) for command: 'du --apparent-si...=1 /srv/carbon/'.
100.0% (4/4) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Change 757097 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Revert "wmcs: monitoring: make cloudmetrics1001 the primary"

https://gerrit.wikimedia.org/r/757097

Change 757097 merged by Andrew Bogott:

[operations/puppet@production] Revert "wmcs: monitoring: make cloudmetrics1001 the primary"

https://gerrit.wikimedia.org/r/757097

Change 757101 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/dns@master] Revert "wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints"

https://gerrit.wikimedia.org/r/757101

Change 757101 merged by Andrew Bogott:

[operations/dns@master] Revert "wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints"

https://gerrit.wikimedia.org/r/757101

Change 757110 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Make cloudmetrics1003/1004 monitoring hosts, 1001/1002 spare systems.

https://gerrit.wikimedia.org/r/757110

Change 757110 merged by Andrew Bogott:

[operations/puppet@production] Make cloudmetrics1003/1004 monitoring hosts, 1001/1002 spare systems.

https://gerrit.wikimedia.org/r/757110

There's a pretty big gap in the history, but most historical data is now present on cloudmetrics1003/1004. I confirmed that metrics are getting copied over to 1004 every hour.
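
One way such a check could be scripted is sketched below in Python. This is not how the confirmation above was done. It assumes the sync preserves file modification times, and the function names and the two-hour tolerance are hypothetical choices. The idea is to find the newest file under the whisper directory on the backup host and check that it is recent enough for an hourly copy.

```python
import os
import time


def newest_mtime(root):
    """Return the most recent modification time under root, or None if empty."""
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            mtime = os.stat(os.path.join(dirpath, name)).st_mtime
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def is_fresh(root, max_age_seconds=2 * 3600, now=None):
    """True if at least one file under root was modified recently enough."""
    now = time.time() if now is None else now
    newest = newest_mtime(root)
    return newest is not None and (now - newest) <= max_age_seconds


if __name__ == "__main__":
    # Path taken from the du output earlier in this task.
    print(is_fresh("/srv/carbon/whisper"))
```

The tolerance is deliberately wider than one hour so that a sync that is still running does not count as stale.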