
Blazegraph journal too large on wdqs2003
Closed, Resolved, Public

Description

The same issue described in T284445 is now affecting wdqs2003. /srv is full due to wikidata.jnl and has been for the past 24 hours according to Icinga.

root@wdqs2003:~# df -Th /srv
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/md2       ext4  2.9T  2.8T  6.1M 100% /srv
root@wdqs2003:~# ls -lh /srv/wdqs/wikidata.jnl 
-rw-rw-r-- 1 blazegraph blazegraph 2.8T Aug  9 19:25 /srv/wdqs/wikidata.jnl

Event Timeline

ryankemper@wdqs2003:~$ ls -lah /srv/wdqs
total 2.8T
drwxrwxr-x 3 blazegraph wikidev    4.0K Aug  6 07:26 .
drwxr-xr-x 5 root       root       4.0K Jun  2 22:59 ..
-rw-rw-r-- 1 blazegraph wikidev      54 Aug  6 07:26 aliases.map
-rw-rw-r-- 1 blazegraph blazegraph  23G Aug 10 07:41 categories.jnl
-rw-r--r-- 1 root       root          0 Jun  3 01:25 data_loaded
-rw-r--r-- 1 blazegraph blazegraph  83K Aug  6 07:26 dcatap-20210806.rdf
drwxrwxr-x 2 blazegraph wikidev     68K Aug 10 07:41 dumps
-rw-rw-r-- 1 blazegraph blazegraph 2.8T Aug  9 19:25 wikidata.jnl
ryankemper@wdqs2003:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             63G     0   63G   0% /dev
tmpfs            13G  1.3G   12G  11% /run
/dev/md0         73G   15G   55G  22% /
tmpfs            63G  4.0K   63G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            63G     0   63G   0% /sys/fs/cgroup
tmpfs           1.0G   20K  1.0G   1% /var/lib/nginx
/dev/md2        2.9T  2.8T  120K 100% /srv
tmpfs            13G     0   13G   0% /run/user/24320

wikidata.jnl is over-inflated. Here's a healthy host to compare against:

ryankemper@wdqs2002:~$ ls -lah /srv/wdqs
total 998G
drwxrwxr-x 3 blazegraph wikidev    4.0K Aug  6 07:38 .
drwxr-xr-x 5 root       root       4.0K Jun  4 04:46 ..
-rw-rw-r-- 1 blazegraph wikidev      54 Aug  6 07:38 aliases.map
-rw-rw-r-- 1 blazegraph blazegraph  23G Aug 10 07:56 categories.jnl
-rw-r--r-- 1 root       root          0 Jun  4 06:46 data_loaded
-rw-r--r-- 1 blazegraph blazegraph  83K Aug  6 07:38 dcatap-20210806.rdf
drwxrwxr-x 2 blazegraph wikidev     64K Aug 10 07:56 dumps
-rw-rw-r-- 1 blazegraph blazegraph 975G Aug 10 17:07 wikidata.jnl
ryankemper@wdqs2002:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             48G     0   48G   0% /dev
tmpfs           9.5G  952M  8.6G  10% /run
/dev/md0         73G  7.3G   62G  11% /
tmpfs            48G  4.0K   48G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            48G     0   48G   0% /sys/fs/cgroup
tmpfs           1.0G   20K  1.0G   1% /var/lib/nginx
/dev/md2        2.9T  999G  1.8T  36% /srv
tmpfs           9.5G     0  9.5G   0% /run/user/24320

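The same journal-size check could be run across the rest of the fleet to spot any other inflated hosts. A hypothetical Cumin one-liner (the host selector pattern is an assumption, not taken from this task):

# Hypothetical fleet-wide check of journal sizes from a cumin host
sudo cumin 'wdqs2*' 'ls -lh /srv/wdqs/wikidata.jnl'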
Solution is pretty simple: nuke wikidata.jnl (with the host de-pooled) and then kick off a data-transfer from wdqs2002 (external) or wdqs2005 (internal). A sketch of the steps follows.
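The commands below are assembled from the SAL entries further down; the cookbook invocation and rm/pool commands are taken verbatim from them, while `sudo depool` is assumed as the counterpart of the `sudo pool` run at the end.

# 1. De-pool wdqs2003 so it stops serving queries (assumed counterpart of `sudo pool`)
sudo depool

# 2. Remove the over-inflated journal
sudo rm -fv /srv/wdqs/wikidata.jnl

# 3. From the cumin host, transfer a healthy journal from another server
sudo -i cookbook sre.wdqs.data-transfer \
    --source wdqs2005.codfw.wmnet --dest wdqs2003.codfw.wmnet \
    --reason "transferring fresh wikidata journal to resolve disk issue" \
    --blazegraph_instance blazegraph

# 4. Once the updater has caught up on lag, re-pool the host
sudo pool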

Mentioned in SAL (#wikimedia-operations) [2021-08-10T17:13:30Z] <ryankemper> T288501 [WDQS] ryankemper@wdqs2003:~$ sudo rm -fv /srv/wdqs/wikidata.jnl

Mentioned in SAL (#wikimedia-operations) [2021-08-10T17:19:46Z] <ryankemper> T288501 sudo -i cookbook sre.wdqs.data-transfer --source wdqs2005.codfw.wmnet --dest wdqs2003.codfw.wmnet --reason "transferring fresh wikidata journal to resolve disk issue" --blazegraph_instance blazegraph on cumin2001 tmux session wdqs_data_xfer

Mentioned in SAL (#wikimedia-operations) [2021-08-10T18:45:26Z] <ryankemper> T288501 data-transfer of wikidata.jnl completed successfully. Host needs to catch up on ~22 hours of WDQS lag before being re-pooled

Mentioned in SAL (#wikimedia-operations) [2021-08-10T18:46:15Z] <ryankemper> T288501 (Misread grafana graph, wdqs2003 only has 1.33 hours to catch up on)

Mentioned in SAL (#wikimedia-operations) [2021-08-10T21:40:14Z] <ryankemper> T288501 ryankemper@wdqs2003:~$ sudo pool