
A "current" version of the upload.tar dir would be nice
Closed, ResolvedPublic

Description

A special version of upload.tar that includes only the most recent version of
each file would be nice. Perhaps it should include only files that are linked
from the current versions of the articles.
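
For illustration only, a minimal sketch of the selection this request implies, assuming direct read access to a MediaWiki database (the connection details below are hypothetical). In MediaWiki's schema the image table holds only the current version of each file (superseded versions live in oldimage), and imagelinks records which files pages use, so joining the two yields the current-version, currently-linked file set:

  import pymysql

  # Hypothetical connection details; any MediaWiki database is laid out the same way.
  conn = pymysql.connect(host="db.example.org", user="reader", database="commonswiki")
  with conn.cursor() as cur:
      # `image` holds one row per file, always the current version;
      # `imagelinks.il_to` is the filename a page links to.
      cur.execute(
          """SELECT DISTINCT img_name
             FROM image
             JOIN imagelinks ON il_to = img_name"""
      )
      for (name,) in cur:
          # img_name is stored as binary, so decode for display.
          print(name.decode("utf-8") if isinstance(name, bytes) else name)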


Version: unspecified
Severity: enhancement
URL: http://dumps.wikimedia.org/

Details

Reference
bz1298
Title | Reference | Author | Source Branch | Dest Branch
Revert "Exclude deploy servers from target list if /srv/mediawiki is a symlink" | repos/releng/scap!128 | dancy | review/dancy/T329857-cleanup | master
search: Disable auto versioning in glent | repos/data-engineering/airflow-dags!335 | ebernhardson | work/ebernhardson/glent-upload-auto-version | main
search: Bump glent to 0.3.3 | repos/data-engineering/airflow-dags!334 | ebernhardson | work/ebernhardson/glent-0.3.3 | main
search: Bump glent jar to 0.3.2 | repos/data-engineering/airflow-dags!324 | ebernhardson | work/ebernhardson/glent-0.3.2 | main
search: Port transfer_to_es from airflow 1 | repos/data-engineering/airflow-dags!318 | ebernhardson | work/ebernhardson/transfer-to-es | main
HivePartitionWriter: Cast values to appropriate types | repos/search-platform/discolytics!22 | ebernhardson | work/ebernhardson/partition-writer-cast-values | main
Migrate search_satisfaction DAG | repos/data-engineering/airflow-dags!309 | pfischer | migrate-search-dag-search_satisfaction | main
Migrate scripts used by search_satisfaction AirFlow DAG | repos/search-platform/discolytics!20 | pfischer | migrate-cli-for-search_satisfaction-dag | main
Migrate glent_weekly DAG | repos/data-engineering/airflow-dags!292 | pfischer | migrate-search-dag-glent_weekly | main
Exclude deploy servers from target list if /srv/mediawiki is a symlink | repos/releng/scap!103 | dancy | T329857 | master
Migrate scripts used by ores_predictions AirFlow DAG | repos/search-platform/discolytics!17 | pfischer | migrate-cli-for-ores_predictions-dag | main
search: Import popularity score from airflow 1 | repos/data-engineering/airflow-dags!266 | ebernhardson | work/ebernhardson/popularity_score | main
search: Import query clicks from airflow 1 | repos/data-engineering/airflow-dags!265 | ebernhardson | work/ebernhardson/query_clicks | main
search: Port incoming_links from airflow 1 | repos/data-engineering/airflow-dags!252 | ebernhardson | work/ebernhardson/incoming-links | main
search: Port export_queries_to_relforge from airflow 1 | repos/data-engineering/airflow-dags!251 | ebernhardson | work/ebernhardson/export-queries-to-relforge | main
Port incoming_links from airflow v1 | repos/search-platform/discolytics!15 | ebernhardson | work/ebernhardson/incoming_links | main
Port export_queries_to_relforge | repos/search-platform/discolytics!12 | ebernhardson | work/ebernhardson/export-queries-to-relforge | main
search: Port import_cirrus_indexes from airflow 1 | repos/data-engineering/airflow-dags!247 | ebernhardson | work/ebernhardson/import_cirrus_indexes | main
Make scap pull exit if /srv/mediawiki is a symlink | repos/releng/scap!91 | dancy | review/dancy/no-pull-via-symlink | master
Port import_cirrus_indexes | repos/search-platform/discolytics!11 | ebernhardson | work/ebernhardson/import_cirrus_indexes | main

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:10 PM
bzimport set Reference to bz1298.

avarab wrote:

Regular upload dumps are now provided, marking this as FIXED.

Reopening: this feature request is not about a recent dump, but about a dump
containing only the most recent version of each file.

While actual .tar files are probably not feasible at our current scale (~3 TB for the current versions of Commons files alone), getting some offsite image mirrors and redistribution is on the table. Tomasz, assigning this one to you since you'll be coordinating the data dump work.

Releasing this bug so that anyone who has time can take it on.

These are now semi-available (I'm running them on an ad hoc basis, they are generated on a mirror site rather than on one of our servers, we're still working out hardware issues with them, etc.). If you're willing to deal with directories moving around and possible inaccessibility, you can get these before the official announcement from http://ftpmirror.your.org/pub/wikimedia/imagedumps/ in the tarballs/full and tarballs/incrs directories. These are indeed current version only, per project, *except* for commons.
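
For anyone scripting the retrieval, a minimal fetch sketch assuming the layout above; the tarball name here is a made-up placeholder, since the real filenames (and directories) may move around:

  import shutil
  import urllib.request

  base = "http://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/full/"
  name = "somewiki-local-media.tar"  # hypothetical filename; check the directory listing first

  # Stream to disk rather than reading into memory, since these can be very large.
  with urllib.request.urlopen(base + name) as resp, open(name, "wb") as out:
      shutil.copyfileobj(resp, out)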

If you want Commons images, you should get them via rsync from rsync://ftpmirror.your.org/wikimedia-images/; please see http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Current_Mirrors for more information about what data is mirrored where.
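
And a sketch of the rsync pull, driven from Python for consistency with the example above; this assumes rsync is installed locally, and the destination path is hypothetical:

  import subprocess

  # -a preserves the directory structure and file metadata, -v reports each file;
  # rerunning later only transfers files that have changed on the mirror.
  subprocess.run(
      ["rsync", "-av",
       "rsync://ftpmirror.your.org/wikimedia-images/",
       "/data/wikimedia-images/"],
      check=True,
  )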

Anyone on this bug who's not on the xmldatadumps-l list had better get on it, since that's generally where updates about this sort of thing will be sent.

Hmm, I guess since the official announcement went out we can call this done, or close enough to done at any rate. (Everyone on the xmldatadumps-l list yet??)