Low disk space on the root partition for several Hadoop workers
Closed, Resolved (Public)

Description

We have recently seen a number of alerts from Hadoop worker nodes that are experiencing low disk space conditions on their root partitions.

This is causing the hadoop-yarn-nodemanager processes to crash because they are unable to allocate any disk space.

See here for report emails: https://groups.google.com/a/wikimedia.org/g/data-platform-alerts/search?q=nodemanager%20critical%20after%3A2025-01-09

For example, we can see that an-worker1154 suddenly used up 20% of the space in /, at which point YARN crashed.

image.png (795×1 px, 58 KB)
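For anyone wanting to check a host by hand, here is a minimal sketch of a root-partition usage check. The function names and the threshold are made up for illustration; this is not the alerting check that generated the emails above.

```shell
# usage_pct [MOUNTPOINT]: print the used percentage of a filesystem as a bare integer.
# df -P forces the POSIX one-line-per-filesystem layout, where column 5 is the capacity.
usage_pct() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_usage MOUNTPOINT THRESHOLD: return 1 with a warning if usage is at or above THRESHOLD.
check_usage() {
    pct=$(usage_pct "$1")
    if [ "$pct" -ge "$2" ]; then
        echo "WARNING: $1 is ${pct}% full (threshold ${2}%)"
        return 1
    fi
    echo "OK: $1 is ${pct}% full"
}
```

Usage would be something like check_usage / 90, where 90 is an arbitrary example threshold.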

Event Timeline

I have a feeling that this might have been caused by an accidental copy of data to a user's home directory on the Hadoop workers.
I found some mediawiki-history files with yesterday's timestamp on an-worker1154.

root@an-worker1154:/home/fab# ls -lh mediawiki_history/
total 11G
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01019-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:44 part-01059-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 450M Jan  8 19:46 part-01091-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01149-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:46 part-01159-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01188-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01215-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01235-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:46 part-01284-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:45 part-01373-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 445M Jan  8 19:45 part-01406-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01432-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01568-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01574-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01605-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 446M Jan  8 19:45 part-01615-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01626-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:46 part-01658-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:46 part-01775-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 445M Jan  8 19:46 part-01801-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 448M Jan  8 19:44 part-01808-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01824-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 449M Jan  8 19:45 part-01875-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:45 part-01944-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet
-rw-r----- 1 fab wikidev 447M Jan  8 19:44 part-01995-0dcb3c89-0fec-4c52-b795-21324d042169-c000.snappy.parquet

Then I checked and found quite a lot of files in this user's home directory on various Hadoop workers.

btullis@cumin1002:~$ sudo cumin A:hadoop-worker 'du -sh /home/fab'
113 hosts will be targeted:
an-worker[1065-1069,1078-1177].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
OK to proceed on 113 hosts? Enter the number of affected hosts to confirm or "q" to quit: 113
===== NODE GROUP =====
(1) an-worker1106.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
9.2G    /home/fab
===== NODE GROUP =====
(1) analytics1075.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
26G     /home/fab
===== NODE GROUP =====
(1) an-worker1141.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
3.6G    /home/fab
===== NODE GROUP =====
(4) an-worker[1090,1139,1143,1147].eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
13G     /home/fab
===== NODE GROUP =====
(1) an-worker1175.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
8.0G    /home/fab
===== NODE GROUP =====
(2) an-worker1172.eqiad.wmnet,analytics1074.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
15G     /home/fab
===== NODE GROUP =====
(1) an-worker1129.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
891M    /home/fab
===== NODE GROUP =====
(1) an-worker1083.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
1.5G    /home/fab
===== NODE GROUP =====
(1) an-worker1169.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
14G     /home/fab
===== NODE GROUP =====
(2) an-worker[1089,1124].eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
8.8G    /home/fab
===== NODE GROUP =====
(2) an-worker[1117,1176].eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
7.0G    /home/fab
===== NODE GROUP =====
(1) an-worker1066.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
1.4G    /home/fab
===== NODE GROUP =====
(2) an-worker[1116,1119].eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
9.6G    /home/fab
===== NODE GROUP =====
(2) an-worker[1065,1157].eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
16G     /home/fab
===== NODE GROUP =====
(1) an-worker1156.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
12G     /home/fab
===== NODE GROUP =====
(3) an-worker[1115,1118,1154].eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
11G     /home/fab
===== NODE GROUP =====
(1) an-worker1112.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
401M    /home/fab
===== NODE GROUP =====
(1) an-worker1110.eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
8.3G    /home/fab
===== NODE GROUP =====
(85) an-worker[1067-1069,1078-1082,1084-1088,1091-1105,1107-1109,1111,1113-1114,1120-1123,1125-1128,1130-1138,1140,1142,1144-1146,1148-1153,1155,1158-1168,1170-1171,1173-1174,1177].eqiad.wmnet,analytics[1070-1073,1076-1077].eqiad.wmnet
----- OUTPUT of 'du -sh /home/fab' -----
8.0K    /home/fab

@fab - was this a mistake, or were you doing this deliberately? We don't have much free space on the root volume of the Hadoop workers, so it isn't a good idea to store data there.

Testing again with:

btullis@cumin1002:~$ sudo cumin A:hadoop-worker 'du -s /home/*|sort -n|tail -n 1'

This shows that /home/fab is the largest home directory on all of these servers. So I'm guessing it's a mistake and we should remove the contents, but I'll wait for now.
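For reference, the pipeline in that command works as follows: du -s prints one total per directory, sort -n orders the totals numerically from smallest to largest, and tail -n 1 keeps the last (largest) entry. A minimal local sketch of the same idea, wrapped in a hypothetical helper named largest_subdir that is not part of cumin or any existing tooling:

```shell
# largest_subdir PARENT: print the path of the largest immediate subdirectory of PARENT.
largest_subdir() {
    parent="$1"
    # du -sk prints "<size in KiB><TAB><path>" for each directory matched by the glob.
    # sort -n sorts by the leading size, tail -n 1 keeps the biggest, cut -f2 drops the size.
    du -sk "$parent"/*/ 2>/dev/null | sort -n | tail -n 1 | cut -f2
}
```

Running largest_subdir /home on a single worker should print the same directory that the cumin command reports for that host, with a trailing slash because of the */ glob.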

Mentioned in SAL (#wikimedia-operations) [2025-01-10T10:02:44Z] <elukey> elukey@cumin1002:~$ sudo cumin -b 20 'an-worker*' 'apt-get clean' (safety to free space and avoid issues on hadoop) - T383320

BTullis added a subscriber: fkaelin.

I have spoken to @fkaelin about this and we understand what happened. It was an accidental invocation of hadoop distcp.
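The exact distcp command is not recorded in this ticket. DistCp runs its copy tasks on the cluster's worker nodes, so a destination that resolves to a local path (for example a file:// URI) would presumably be written to local disk on whichever workers ran the tasks, which would be consistent with the files being spread across many hosts. As a sketch of one possible guard, assuming that explanation, the hypothetical wrapper below refuses to run unless the destination is an explicit hdfs:// URI. The function names are invented for illustration.

```shell
# is_hdfs_uri URI: succeed only if URI starts with the hdfs:// scheme.
is_hdfs_uri() {
    case "$1" in
        hdfs://*) return 0 ;;
        *)        return 1 ;;
    esac
}

# guarded_distcp SRC DST: run 'hadoop distcp SRC DST' only when DST is an hdfs:// URI.
# Returns 2 without invoking hadoop when the destination looks local or ambiguous.
guarded_distcp() {
    src="$1"
    dst="$2"
    if ! is_hdfs_uri "$dst"; then
        echo "refusing: destination '$dst' is not an hdfs:// URI" >&2
        return 2
    fi
    hadoop distcp "$src" "$dst"
}
```

This is deliberately strict: it also rejects scheme-less paths, which are resolved against the default filesystem and may be perfectly valid HDFS paths.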

Removed the files with:

sudo cumin A:hadoop-worker 'rm /home/fab/mediawiki_history/*.parquet'
sudo cumin A:hadoop-worker 'rmdir /home/fab/mediawiki_history'
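Note that the rm with an unmatched glob reports "No such file or directory" on any host where the directory is absent, which was presumably the case for the 85 hosts reporting 8.0K above. Below is a sketch of a more defensive variant that skips missing directories and defaults to a dry run. The helper name remove_parquet_dir and its --apply argument are hypothetical and were not used in this cleanup.

```shell
# remove_parquet_dir DIR [--apply]
# Without --apply, only report how much data would be removed.
# With --apply, delete the *.parquet files and then the directory itself.
remove_parquet_dir() {
    dir="$1"
    mode="${2:-dry-run}"
    if [ ! -d "$dir" ]; then
        echo "skip: $dir does not exist"
        return 0
    fi
    if [ "$mode" = "--apply" ]; then
        rm -f -- "$dir"/*.parquet
        # rmdir fails if anything other than the parquet files was in the
        # directory, so unexpected content is never deleted silently.
        rmdir -- "$dir"
        echo "removed: $dir"
    else
        echo "would remove: $(du -sh "$dir" | cut -f1) in $dir"
    fi
}
```

As in the original commands, rmdir acts as the safety net: it only succeeds once the directory is empty.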

I'll close this ticket now.