download_dump: Handle exception in `get_dump_name()` when a directory name is not a numeric string
Closed, Resolved, Public

Description

In https://github.com/wikimedia/pywikibot/blob/master/scripts/maintenance/download_dump.py#L72

def get_dump_name(self, db_name, typ):
    """Check if dump file exists locally in a Toolforge server."""
    db_path = '/public/dumps/public/{0}/'.format(db_name)
    if os.path.isdir(db_path):
        dates = map(int, os.listdir(db_path))
        dates = sorted(dates, reverse=True)
        for date in dates:
            dump_filepath = ('/public/dumps/public/{0}/{1}/{2}-{3}-{4}'
                             .format(db_name, date, db_name, date, typ))
            if os.path.isfile(dump_filepath):
                return dump_filepath
    return None

The call to `map(int, ...)` converts the list of directory names (which are strings) into a list of integers, but the conversion raises a `ValueError` if any directory name is not a numeric string, for example `latest`.
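One possible way to avoid the exception is to skip any entry whose name is not purely numeric before sorting. The following is only a sketch under stated assumptions, not necessarily what the patch merged for this task does: it is written as a standalone function rather than a method, and the `dumps_root` parameter is hypothetical, added so the function can be tried outside Toolforge.

```python
import os


def get_dump_name(db_name, typ, dumps_root='/public/dumps/public'):
    """Return the newest local dump file path, or None if none is found.

    dumps_root is a hypothetical parameter (not in the original method);
    it defaults to the Toolforge path used in the report.
    """
    db_path = os.path.join(dumps_root, db_name)
    if not os.path.isdir(db_path):
        return None
    # Keep only names made of decimal digits (e.g. '20180101');
    # entries such as 'latest' or 'entities' are skipped instead of
    # being passed to int().
    dates = sorted(
        (name for name in os.listdir(db_path) if name.isdecimal()),
        key=int, reverse=True)
    for date in dates:
        dump_filepath = os.path.join(
            db_path, date, '{0}-{1}-{2}'.format(db_name, date, typ))
        if os.path.isfile(dump_filepath):
            return dump_filepath
    return None
```

Filtering with `str.isdecimal()` keeps the directory names as strings, so the path is built from the name exactly as it appears on disk, and `key=int` preserves the numeric ordering of the original code.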

On Toolforge there is one directory named `latest`:

rafid@tools-bastion-03:~$ find /public/dumps/public/ -name 'latest'
/public/dumps/public/de_labswikimedia/latest
rafid@tools-bastion-03:~$ ls -l /public/dumps/public/de_labswikimedia/latest
total 128
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-abstract.xml
-rw-r--r-- 1 root root 652 Mar  3  2013 de_labswikimedia-latest-abstract.xml-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-all-titles-in-ns0.gz
-rw-r--r-- 1 root root 676 Mar  3  2013 de_labswikimedia-latest-all-titles-in-ns0.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-categorylinks.sql.gz
-rw-r--r-- 1 root root 676 Mar  3  2013 de_labswikimedia-latest-categorylinks.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-category.sql.gz
-rw-r--r-- 1 root root 661 Mar  3  2013 de_labswikimedia-latest-category.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-externallinks.sql.gz
-rw-r--r-- 1 root root 676 Mar  3  2013 de_labswikimedia-latest-externallinks.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-flaggedpages.sql.gz
-rw-r--r-- 1 root root 673 Mar  3  2013 de_labswikimedia-latest-flaggedpages.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-flaggedrevs.sql.gz
-rw-r--r-- 1 root root 670 Mar  3  2013 de_labswikimedia-latest-flaggedrevs.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-imagelinks.sql.gz
-rw-r--r-- 1 root root 667 Mar  3  2013 de_labswikimedia-latest-imagelinks.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-image.sql.gz
-rw-r--r-- 1 root root 652 Mar  3  2013 de_labswikimedia-latest-image.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-interwiki.sql.gz
-rw-r--r-- 1 root root 664 Mar  3  2013 de_labswikimedia-latest-interwiki.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-iwlinks.sql.gz
-rw-r--r-- 1 root root 658 Mar  3  2013 de_labswikimedia-latest-iwlinks.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-langlinks.sql.gz
-rw-r--r-- 1 root root 664 Mar  3  2013 de_labswikimedia-latest-langlinks.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-md5sums.txt
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-oldimage.sql.gz
-rw-r--r-- 1 root root 661 Mar  3  2013 de_labswikimedia-latest-oldimage.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-pagelinks.sql.gz
-rw-r--r-- 1 root root 664 Mar  3  2013 de_labswikimedia-latest-pagelinks.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-page_props.sql.gz
-rw-r--r-- 1 root root 667 Mar  3  2013 de_labswikimedia-latest-page_props.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-page_restrictions.sql.gz
-rw-r--r-- 1 root root 688 Mar  3  2013 de_labswikimedia-latest-page_restrictions.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-pages-articles-multistream-index.txt.bz2
-rw-r--r-- 1 root root 736 Mar  3  2013 de_labswikimedia-latest-pages-articles-multistream-index.txt.bz2-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-pages-articles-multistream.xml.bz2
-rw-r--r-- 1 root root 718 Mar  3  2013 de_labswikimedia-latest-pages-articles-multistream.xml.bz2-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-pages-articles.xml.bz2
-rw-r--r-- 1 root root 682 Mar  3  2013 de_labswikimedia-latest-pages-articles.xml.bz2-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-pages-logging.xml.gz
-rw-r--r-- 1 root root 676 Mar  3  2013 de_labswikimedia-latest-pages-logging.xml.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-pages-meta-current.xml.bz2
-rw-r--r-- 1 root root 694 Mar  3  2013 de_labswikimedia-latest-pages-meta-current.xml.bz2-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-pages-meta-history.xml.7z
-rw-r--r-- 1 root root 691 Mar  3  2013 de_labswikimedia-latest-pages-meta-history.xml.7z-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-pages-meta-history.xml.bz2
-rw-r--r-- 1 root root 694 Mar  3  2013 de_labswikimedia-latest-pages-meta-history.xml.bz2-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-page.sql.gz
-rw-r--r-- 1 root root 649 Mar  3  2013 de_labswikimedia-latest-page.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-protected_titles.sql.gz
-rw-r--r-- 1 root root 685 Mar  3  2013 de_labswikimedia-latest-protected_titles.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-redirect.sql.gz
-rw-r--r-- 1 root root 661 Mar  3  2013 de_labswikimedia-latest-redirect.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-site_stats.sql.gz
-rw-r--r-- 1 root root 667 Mar  3  2013 de_labswikimedia-latest-site_stats.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-stub-articles.xml.gz
-rw-r--r-- 1 root root 676 Mar  3  2013 de_labswikimedia-latest-stub-articles.xml.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-stub-meta-current.xml.gz
-rw-r--r-- 1 root root 688 Mar  3  2013 de_labswikimedia-latest-stub-meta-current.xml.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-stub-meta-history.xml.gz
-rw-r--r-- 1 root root 688 Mar  3  2013 de_labswikimedia-latest-stub-meta-history.xml.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-templatelinks.sql.gz
-rw-r--r-- 1 root root 676 Mar  3  2013 de_labswikimedia-latest-templatelinks.sql.gz-rss.xml
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-user_groups.sql.gz
-rw-r--r-- 1 root root 670 Mar  3  2013 de_labswikimedia-latest-user_groups.sql.gz-rss.xml

Event Timeline

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. Jan 3 2018, 4:11 AM

What is de_labswikimedia? @ArielGlenn

Apparently wikidata has another:

06:11:33 0 ✓ zhuyifei1999@tools-bastion-05: ~$ find /public/dumps/public/ -mindepth 2 -maxdepth 2 | grep -vP '/\d+$' | xargs ls -l
/public/dumps/public/de_labswikimedia/latest:
total 128
---------- 1 root root   0 Aug 22  2014 de_labswikimedia-latest-abstract.xml
[...]
-rw-r--r-- 1 root root 670 Mar  3  2013 de_labswikimedia-latest-user_groups.sql.gz-rss.xml

/public/dumps/public/wikidatawiki/entities:
total 148
drwxrwxr-x 2 400 400  4096 Oct 18 18:16 20171016
drwxrwxr-x 2 400 400  4096 Oct 20 14:34 20171018
drwxrwxr-x 2 400 400  4096 Oct 26 01:59 20171023
drwxrwxr-x 2 400 400  4096 Oct 27 20:43 20171026
drwxr-xr-x 2 400 400  4096 Nov  1 07:59 20171030
drwxrwxr-x 2 400 400  4096 Nov  3 08:07 20171101
drwxrwxr-x 2 400 400  4096 Nov  9 06:11 20171106
drwxrwxr-x 2 400 400  4096 Nov 12 06:26 20171109
drwxrwxr-x 2 400 400  4096 Nov 16 07:12 20171113
drwxrwxr-x 2 400 400  4096 Nov 18 20:10 20171116
drwxrwxr-x 2 400 400  4096 Nov 22 23:06 20171120
drwxrwxr-x 2 400 400  4096 Nov 22 23:06 20171122
drwxrwxr-x 2 400 400  4096 Nov 27 03:15 20171127
drwxr-xr-x 2 400 400  4096 Nov 28 11:23 20171128
drwxr-xr-x 2 400 400  4096 Dec  7 12:05 20171204
drwxr-xr-x 2 400 400  4096 Dec  9 09:14 20171206
drwxr-xr-x 2 400 400  4096 Dec 12 06:13 20171209
drwxrwxr-x 2 400 400  4096 Dec 14 06:39 20171211
drwxrwxr-x 2 400 400  4096 Dec 16 14:46 20171214
drwxrwxr-x 2 400 400  4096 Dec 20 13:45 20171218
drwxrwxr-x 2 400 400  4096 Dec 23 07:16 20171220
drwxrwxr-x 2 400 400  4096 Dec 27 13:01 20171225
drwxrwxr-x 2 400 400  4096 Dec 29 08:03 20171227
drwxrwxr-x 2 400 400  4096 Jan  1 03:15 20180101
-rw-rw-r-- 1 400 400 51563 Dec 29 08:03 dcatap.rdf
lrwxrwxrwx 1 400 400    39 Dec 27 06:20 latest-all.json.bz2 -> 20171225/wikidata-20171225-all.json.bz2
lrwxrwxrwx 1 400 400    38 Dec 26 09:19 latest-all.json.gz -> 20171225/wikidata-20171225-all.json.gz
lrwxrwxrwx 1 400 400    43 Dec 27 13:01 latest-all.ttl.bz2 -> 20171225/wikidata-20171225-all-BETA.ttl.bz2
lrwxrwxrwx 1 400 400    42 Dec 27 05:08 latest-all.ttl.gz -> 20171225/wikidata-20171225-all-BETA.ttl.gz
lrwxrwxrwx 1 400 400    45 Dec 29 08:03 latest-truthy.nt.bz2 -> 20171227/wikidata-20171227-truthy-BETA.nt.bz2
lrwxrwxrwx 1 400 400    44 Dec 28 19:22 latest-truthy.nt.gz -> 20171227/wikidata-20171227-truthy-BETA.nt.gz
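The `find ... | grep -vP '/\d+$'` search above can be approximated in Python when checking which second-level entries would break an integer conversion. This is a rough sketch: the function name `non_numeric_entries` and its `dumps_root` parameter are hypothetical, and it only lists the offending paths without reproducing the `ls -l` output.

```python
import os


def non_numeric_entries(dumps_root='/public/dumps/public'):
    """Yield paths two levels below dumps_root whose name is not all digits.

    Both the function name and the dumps_root parameter are illustrative
    assumptions; the default is the Toolforge path from the report.
    """
    for db_name in sorted(os.listdir(dumps_root)):
        db_path = os.path.join(dumps_root, db_name)
        if not os.path.isdir(db_path):
            # Plain files at the first level have no second-level entries.
            continue
        for name in sorted(os.listdir(db_path)):
            if not name.isdecimal():
                yield os.path.join(db_path, name)
```

On a tree like the one shown, this would be expected to report entries such as `de_labswikimedia/latest` and `wikidatawiki/entities`.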

Btw, I wonder what that `latest` directory is for when nobody can read the files in it:

06:11:51 0 ✓ zhuyifei1999@tools-bastion-05: ~$ head -c 1 /public/dumps/public/de_labswikimedia/latest/de_labswikimedia-latest-*.gz | xxd
head: cannot open ‘/public/dumps/public/de_labswikimedia/latest/de_labswikimedia-latest-all-titles-in-ns0.gz’ for reading: Permission denied
[...]
head: cannot open ‘/public/dumps/public/de_labswikimedia/latest/de_labswikimedia-latest-user_groups.sql.gz’ for reading: Permission denied
D3r1ck01 moved this task from Backlog to Needs Review on the Pywikibot board. Nov 5 2018, 11:27 AM
Xqt moved this task from Needs Review to Backlog on the Pywikibot board. Feb 3 2019, 11:36 AM

Change 488067 had a related patch set uploaded (by Rafidaslam; owner: rafid):
[pywikibot/core@master] download_dump: Handle get_dump_name() if there's a 'latest' dir

https://gerrit.wikimedia.org/r/488067

rafidaslam triaged this task as Low priority. Feb 5 2019, 2:52 PM
This comment was removed by rafidaslam.
Xqt closed this task as Resolved. Feb 6 2019, 4:09 PM

Change 488067 merged by jenkins-bot:
[pywikibot/core@master] download_dump: Handle get_dump_name() if there are non-number dirnames

https://gerrit.wikimedia.org/r/488067