
dumps.wikimedia.org/enwiki/latest/ out of date files
Closed, Resolved · Public

Description

Hi, if you go to https://dumps.wikimedia.org/enwiki/latest/ you can notice that some files are indexed that shouldn't be there, such as enwiki-latest-abstract.xml, enwiki-latest-pages-meta-current27.xml-p54663462p55702527.bz2, or all the files with a November timestamp.

In the attachment you can find the current enwiki 'latest' listing.

Keep up the great work,
Enrico Bonetti Vieno

Event Timeline

Yes, these are there because there is no equivalent any more; the abstracts dumps are now gz-compressed, the page content files were split along a different page range this month because that's how the timing fell out, etc.

I see, thanks ... well, I'm doing some automation, so for now I'm using a temporary fix based on the dates.
Let me know if you plan to fix it; otherwise I'll switch right away to the JSON files ;)
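
Something along these lines (just a minimal sketch of my workaround, assuming the /latest/ index is a plain Apache-style listing with "DD-Mon-YYYY HH:MM" timestamps; the pattern and the grace window are placeholders, not the exact script):

```python
import re
import urllib.request
from datetime import datetime, timedelta

LATEST_URL = "https://dumps.wikimedia.org/enwiki/latest/"
# hypothetical pattern for an Apache-style index line: the href, then the
# listing's "DD-Mon-YYYY HH:MM" modification timestamp
ENTRY_RE = re.compile(r'href="([^"]+)">[^<]*</a>\s+(\d{2}-\w{3}-\d{4} \d{2}:\d{2})')

def fresh_latest_files(grace_days=25):
    """Return 'latest' entries whose listing date is close to the newest one."""
    html = urllib.request.urlopen(LATEST_URL).read().decode("utf-8")
    entries = [(name, datetime.strptime(stamp, "%d-%b-%Y %H:%M"))
               for name, stamp in ENTRY_RE.findall(html)]
    newest = max(stamp for _, stamp in entries)
    cutoff = newest - timedelta(days=grace_days)
    return sorted(name for name, stamp in entries if stamp >= cutoff)
```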

Keep it up, Enrico

I'm not sure a complete fix is desirable. I can remove all the old abstract.xml links, that's easy enough. But, for example, old meta-history files should not necessarily be cleaned up after every run, because the run on the 20th of the month doesn't produce such files, so downloaders will want the files available from the run on the 1st of the month. Thoughts?

Exactly, from my perspective it's pretty clear: it should contain only the latest available data for every table/typology/group, while right now it's a rough mix: just take a look at the dates of the enwiki-latest-pages-meta-history files.
It's a bit of a tradeoff: sacrificing temporal consistency between the files to get the latest available data. On the other hand, anyone aiming for temporally consistent data should definitely go for the monthly dump :)

Do you agree?

Cheers, Enrico

It's more that we have a lot of folks interested in the full history content, but they may not always pick it up right when the files are available. So for convenience they ought to be able to grab those files by using the 'latest' links, even if those files are dated YYYYMM01 and mixed in with files that are dated YYYYMM20 (like right now), because there will be no full history content with date YYYYMM20.

Yep, yep, I agree with you, though as of now the situation is different: what's the convenience of keeping all the following links instead of just the latest one? 😏

  1. 13-Nov-2017
  2. 11-Jan-2018
  3. 10-Feb-2018
  4. 09-Mar-2018

Cheers, Enrico

Hmm, I might be able to clean up the older ones; let me think about a nice way to fold that into the existing architecture.

That would be nice, thanks! Next time I'll describe the issue better from the beginning :)

Have a nice day, Enrico

Change 421851 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] clean up all 'latest' links from most runs older than current run

https://gerrit.wikimedia.org/r/421851

Change 422879 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] clean up dumps 'latest' links that are too old

https://gerrit.wikimedia.org/r/422879

There are a couple of things going on here.

First, it would be nice to be able to configure the dumps jobs themselves to do this cleanup, which is what the first patchset does.
The second issue, however, is that all these files are later rsynced out to our web server; this is not a straight rsync of everything, because we want to skip partial files and other stuff. So we can't just rsync blah --delete to get rid of out-of-date 'latest' links and RSS files. Instead we need a cleanup job for that on the servers, which is what the second patchset would do.
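
Roughly the shape of that server-side cleanup (a sketch only; the real logic is in the patchset above, and the path, cutoff and filename checks here are placeholder assumptions):

```python
import os
import time

LATEST_DIR = "/srv/dumps/enwiki/latest"  # hypothetical path on the web server
MAX_AGE_DAYS = 50                        # roughly the current run plus one previous

def purge_stale_latest(directory=LATEST_DIR, max_age_days=MAX_AGE_DAYS):
    """Remove 'latest' links and RSS files whose mtime is older than the cutoff."""
    cutoff = time.time() - max_age_days * 86400
    for entry in os.scandir(directory):
        # only touch 'latest' links and their RSS files, never the dated run files
        if not ("-latest-" in entry.name or entry.name.endswith("-rss.xml")):
            continue
        # look at the link itself rather than its target
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            os.remove(entry.path)
```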

Change 422879 merged by ArielGlenn:
[operations/puppet@production] clean up dumps 'latest' links that are too old

https://gerrit.wikimedia.org/r/422879

Change 421851 merged by ArielGlenn:
[operations/dumps@master] clean up all 'latest' links from most runs older than current run

https://gerrit.wikimedia.org/r/421851

Both of these changes are live. We won't see the full effect until after the next full run on April 1st, so I'll leave this ticket open until we see some big wikis finish up from that run.

I checked out the results: that's a major improvement, nicely done, I would say perfect! 💪 Or almost: I cannot find the 7z enwiki-latest-pages-meta-history files :(

With the changes, we only keep links from one dump previous to the current one. The current one right now is the one that will generate new 7z files. You'll need to wait a week or so for those to show up.

I'm sorry, bear with me, let me just ask:

  1. Is the update process gonna break the user experience during the beginning of every month? 😅
  2. If doing this incrementally by swapping in a ready table/typology/group is cumbersome, why not wait until the whole process is done, then update?

Cheers, Enrico

Folks want their latest links as soon as the files for a job are available, so that's why we don't wait until the dumps for a wiki are complete.
There's no swapping that can be done; because file names can vary for the large wikis, we have to decide to keep or purge by date, at the time of the run.
We could decide to keep two old dump runs' worth of files, but this would mean more clutter for those following the links and not filtering out by date.
We could decide only to clean up "too old" files for the particular dump step being run, but this is harder to manage properly when things fail and we backfill manually.
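
To illustrate the date-based keep/purge decision (a sketch only; the "current run plus one previous" retention mentioned earlier and the example dates are just assumptions for the illustration):

```python
from datetime import date

def runs_to_keep(run_dates, keep_previous=1):
    """Given past run dates, return those whose 'latest' links survive."""
    ordered = sorted(run_dates, reverse=True)
    return set(ordered[: keep_previous + 1])  # current run plus N previous runs

def should_purge(link_run_date, run_dates, keep_previous=1):
    return link_run_date not in runs_to_keep(run_dates, keep_previous)

# With runs on March 1st, March 20th and April 1st, only links from the
# March 20th and April 1st runs survive:
runs = [date(2018, 3, 1), date(2018, 3, 20), date(2018, 4, 1)]
assert should_purge(date(2018, 3, 1), runs) is True
assert should_purge(date(2018, 3, 20), runs) is False
```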

I'm trying to balance all of these out with a good choice, but my choice might not be optimal for others. Still open to discussion though.

I see, this behavior is going to break some automated scripts, but it still seems a good trade-off... I don't know the system well enough to give any more suggestions: if you want to point me to some materials or explain briefly, maybe I can offer some more. Either way, it's still a major improvement! 💪

Cheers, Enrico

ArielGlenn claimed this task.

I'm going to go ahead and close this for now.