Temp files left around in wikistats_1/ ?
Closed, Resolved, Public

Description

One of our dumps mirror maintainers reported finding what looks like a few temporary files in the wikistats_1 directory. On our web host the files are these:

-rw-------  1 dumpsgen dumpsgen 1.2G May 29  2018 zi628Wdw
-rw-------  1 dumpsgen dumpsgen  64M Aug 10  2018 zinG3GQw
-rw-------  1 dumpsgen dumpsgen 2.0G May 29  2018 ziUIbTGY

Should these be cleaned up? I don't see them on stat1007, so perhaps we can remove them from the web server without issue?

I also notice that the web server has both wikistats_1 and wikistats_1.0 directories. Should we be keeping both of these? Or is one left over as an artifact from a previous rsync configuration?

Event Timeline

ArielGlenn triaged this task as Medium priority. Apr 16 2021, 5:06 AM
ArielGlenn created this task.
ArielGlenn added a subscriber: Ottomata.

@Ottomata I'm tagging you because you knew about the rsync at some point; if someone else would know better, please feel free to redirect me. Thanks!

@ArielGlenn thank you for noticing, please delete!

I will delete the temp files immediately, thanks! Do we need both of wikistats_1 and wikistats_1.0 dirs, or can one of them go?

It looks like the https://dumps.wikimedia.org/other/wikistats_1.0/ folder is empty, so that can be deleted.

The https://dumps.wikimedia.org/other/wikistats_1/ folder contains all kinds of crazy and very outdated reports and results. If we wanted to reclaim that space, we could look through access logs for the past month to see if anyone's downloading it. I'd imagine we wouldn't find anything. So, my opinion: archive in HDFS and delete. If anyone feels uncomfortable with that, keep it around until we need the space.
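
For what it's worth, the access-log check could be sketched roughly like this. This is only a minimal sketch: it assumes the web server writes the common "combined" log format (the real format on our hosts may differ), and the function name `count_downloads` and the `BOT_HINTS` user-agent heuristic are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: "combined" access log format; adjust the pattern if the
# actual server logs are laid out differently.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical, crude bot heuristic based on user-agent substrings.
BOT_HINTS = ("bot", "crawler", "spider")


def count_downloads(lines, prefix="/other/wikistats_1/", days=30, now=None):
    """Count successful non-bot GETs under `prefix` in the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not parse
        when = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
        if when < cutoff or not m["path"].startswith(prefix):
            continue
        if m["method"] != "GET" or not m["status"].startswith("2"):
            continue
        if any(hint in m["agent"].lower() for hint in BOT_HINTS):
            continue
        hits[m["path"]] += 1
    return hits
```

Feeding it the lines of a month's worth of logs (for example from an open file object) gives a per-path count; an empty result would support the "archive and delete" option. Because the prefix ends in a slash, requests for wikistats_1.0/ are not counted.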

It takes up about 15G so honestly it's not that big a deal to keep around, even if there are only a few downloaders. I can't tell about our mirrors of course, but even from our own web server there are a few downloaders that aren't bots. So, meh. Keep?

So mean, feeding digital hoarding addictions :) Ok, sure, keep it

ArielGlenn claimed this task.

Moar data! OK, closing :-)