
wdqs1005: disk space critical on /srv/
Closed, Resolved · Public

Description

Some links

Alert link => https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=wdqs1005&service=Disk+space

Graph of triples not updating (due to lack of disk) => https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&var-cluster_name=wdqs&from=1608044124030&to=1608105553755

General context

For whatever reason, the wikidata.jnl journal on wdqs1005 is roughly 135 GB larger than on the vast majority of the fleet:

ryankemper@cumin1001:~$ sudo cumin 'P{wdqs*}' 'du -h /srv/wdqs/wikidata.jnl'
19 hosts will be targeted:
wdqs[2001-2008].codfw.wmnet,wdqs[1003-1013].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) wdqs1005.eqiad.wmnet
----- OUTPUT of 'du -h /srv/wdqs/wikidata.jnl' -----
1021G   /srv/wdqs/wikidata.jnl
===== NODE GROUP =====
(1) wdqs1007.eqiad.wmnet
----- OUTPUT of 'du -h /srv/wdqs/wikidata.jnl' -----
948G    /srv/wdqs/wikidata.jnl
===== NODE GROUP =====
(17) wdqs[2001-2008].codfw.wmnet,wdqs[1003-1004,1006,1008-1013].eqiad.wmnet
----- OUTPUT of 'du -h /srv/wdqs/wikidata.jnl' -----
886G    /srv/wdqs/wikidata.jnl
================
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (19/19) [00:00<00:00, 19.42hosts/s]
FAIL |                                                                                                                                                                                                                                                                                                                      |   0% (0/19) [00:00<?, ?hosts/s]
100.0% (19/19) success ratio (>= 100.0% threshold) for command: 'du -h /srv/wdqs/wikidata.jnl'.
100.0% (19/19) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
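For illustration, here is a small sketch of how the journal sizes above could be compared programmatically to flag outliers. The host/size pairs mirror the cumin results; the helper names and the 50 GB threshold are assumptions for this example, not fleet policy.

```python
def to_gb(size: str) -> float:
    """Convert a `du -h` size like '1021G' or '1.5T' to gigabytes."""
    units = {"M": 1 / 1024, "G": 1, "T": 1024}
    return float(size[:-1]) * units[size[-1]]

def find_outliers(sizes: dict[str, float], threshold_gb: float = 50.0) -> dict[str, float]:
    """Return hosts whose journal exceeds the fleet median by more than threshold_gb."""
    values = sorted(sizes.values())
    median = values[len(values) // 2]
    return {h: s - median for h, s in sizes.items() if s - median > threshold_gb}

# Sizes taken from the du output above (three 886G hosts stand in for the 17).
fleet = {"wdqs1005": to_gb("1021G"), "wdqs1007": to_gb("948G")}
fleet.update({f"wdqs{n}": to_gb("886G") for n in (1003, 1004, 1006)})

print(find_outliers(fleet))  # wdqs1005 is ~135 GB over the fleet median
```

With these numbers the fleet median is 886 GB, so wdqs1005 (+135 GB) and, to a lesser extent, wdqs1007 (+62 GB) show up as outliers, matching the picture in the cumin output.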

This does not seem to be attributable to a higher triple count (see the graph link in the Description above).