simplify flickrripper.cleanUpTitle
Closed, Declined · Public

Description

The cleanUpTitle() function of flickrripper can be simplified. It contains:

title = title.strip()
title = re.sub(r'[<{\[]', '(', title)
title = re.sub(r'[>}\]]', ')', title)
title = re.sub(r'[ _]?\(!\)', '', title)
title = re.sub(',:[ _]', ', ', title)
title = re.sub('[;:][ _]', ', ', title)
title = re.sub(r'[\t\n ]+', ' ', title)
title = re.sub(r'[\r\n ]+', ' ', title)
title = re.sub('[\n]+', '', title)
title = re.sub('[?!]([.\"]|$)', r'\1', title)
title = re.sub('[&#%?!]', '^', title)
title = re.sub('[;]', ',', title)
title = re.sub(r'[/+\\:]', '-', title)
title = re.sub('--+', '-', title)
title = re.sub(',,+', ',', title)
title = re.sub('[-,^]([.]|$)', r'\1', title)
title = title.replace(' ', '_')
title = title.strip('_')

Obviously the regex --+ can be replaced with -+, ,,+ with ,+, etc. Replacements to ' ' can generally be shortened to '_' directly, since the last statements turn every space into an underscore anyway. [;] is the same as just ;. [\t\n ]+ and [\r\n ]+ can be combined, and [\n]+ is redundant because no newline is left by the time it runs. [/+\\:] can also be combined with --+, and several other replacements can be combined, simplified, or removed.
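The points above can be sketched as follows. This is only an illustration, not the patch that was uploaded: the names clean_up_title_original and clean_up_title_simplified are made up for the comparison, and the simplified body is one possible reading of the suggestions. Dropping the ',:[ _]' rule is an extra assumption; it relies on the later comma-collapsing step producing the same result.

```python
import re


def clean_up_title_original(title):
    """Reference copy of the statements quoted in the task description."""
    title = title.strip()
    title = re.sub(r'[<{\[]', '(', title)
    title = re.sub(r'[>}\]]', ')', title)
    title = re.sub(r'[ _]?\(!\)', '', title)
    title = re.sub(',:[ _]', ', ', title)
    title = re.sub('[;:][ _]', ', ', title)
    title = re.sub(r'[\t\n ]+', ' ', title)
    title = re.sub(r'[\r\n ]+', ' ', title)
    title = re.sub('[\n]+', '', title)
    title = re.sub('[?!]([.\"]|$)', r'\1', title)
    title = re.sub('[&#%?!]', '^', title)
    title = re.sub('[;]', ',', title)
    title = re.sub(r'[/+\\:]', '-', title)
    title = re.sub('--+', '-', title)
    title = re.sub(',,+', ',', title)
    title = re.sub('[-,^]([.]|$)', r'\1', title)
    title = title.replace(' ', '_')
    return title.strip('_')


def clean_up_title_simplified(title):
    """Hypothetical simplification; a sketch, not the upstream code."""
    title = title.strip()
    title = re.sub(r'[<{\[]', '(', title)
    title = re.sub(r'[>}\]]', ')', title)
    title = re.sub(r'[ _]?\(!\)', '', title)
    # ',:[ _]' is dropped (assumption): '[;:][ _]' yields ',, ' for that
    # input and the comma-collapsing step below reduces it to ','.
    title = re.sub(r'[;:][ _]', ', ', title)
    # One whitespace rule, writing '_' directly instead of ' '.
    title = re.sub(r'[\t\r\n ]+', '_', title)
    title = re.sub(r'[?!]([."]|$)', r'\1', title)
    title = re.sub(r'[&#%?!]', '^', title)
    # '[/+\\:]' and '--+' merged into a single rule.
    title = re.sub(r'[-/+\\:]+', '-', title)
    # '[;]' and ',,+' merged into a single rule.
    title = re.sub(r'[;,]+', ',', title)
    title = re.sub(r'[-,^]([.]|$)', r'\1', title)
    return title.strip('_')
```

Both functions return 'Hello,_World' for '  Hello: World!  '. Comparing the two on a larger set of real Flickr titles would be a sensible check before relying on the shorter version.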

Source code is available to download from Gerrit: https://gerrit.wikimedia.org/r/#/admin/projects/pywikibot/core (flickrripper.py file is in scripts folder)

Related Objects

Status      Assigned
Invalid     None
Declined    Mh-3110

Event Timeline

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper.
Xqt triaged this task as Lowest priority. Mar 4 2020, 9:42 AM

Hi @Xqt,
I would like to work on this.

Thanks

@Mh-3110: please ask if you need any help

@Xqt,
Since all these regex statements only clean the title, why not use the replace() method with the characters to replace in a list?
Something like this:

import re

def cleanUpTitle(title):
    # Characters that are removed from the title outright.
    forbidden_characters = [';', ',', ':', '<', '>', '[', ']', '_', '&', '#',
                            '%', '?', '!', '^', '$', '/', '\\']
    for char in forbidden_characters:
        title = title.replace(char, '')
    # Collapse runs of spaces once, after all removals.
    title = re.sub(r' +', ' ', title)
    return title.strip().replace(' ', '_')

What do you think about this?

Use a tuple instead of a list. Also, this would need multiple tuples, because different characters are replaced with different outputs. Perhaps also look into https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist
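A minimal sketch of the "multiple tuples" idea, assuming the groups are taken from the single-character substitutions in the task description (brackets to parentheses, &#%?! to ^, ; to ,, /+\: to -, whitespace to _). The names REPLACEMENT_GROUPS, TRANSLATION_TABLE and replace_grouped_characters are hypothetical, and the context-dependent rules such as '[;:][ _]' or the run-collapsing rules are deliberately not covered, so this is not a drop-in replacement for cleanUpTitle().

```python
# Hypothetical grouping: each entry pairs a tuple of characters with the
# single character that replaces all of them.
REPLACEMENT_GROUPS = (
    (('<', '{', '['), '('),
    (('>', '}', ']'), ')'),
    (('&', '#', '%', '?', '!'), '^'),
    ((';',), ','),
    (('/', '+', '\\', ':'), '-'),
    ((' ', '\t', '\r', '\n'), '_'),
)

# Flatten the groups into one table so the title is scanned only once.
TRANSLATION_TABLE = str.maketrans({
    char: replacement
    for chars, replacement in REPLACEMENT_GROUPS
    for char in chars
})


def replace_grouped_characters(title):
    """Apply the per-character replacements; a sketch, not upstream code."""
    return title.strip().translate(TRANSLATION_TABLE).strip('_')
```

With these groups, replace_grouped_characters(' a<b>&c; d/e ') gives 'a(b)^c,_d-e'. A translation table keeps the character groups as plain data, which makes it easier to compare them against a list such as the Titleblacklist.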

Thanks @Dvorapa,
Will have a look at it
Thanks

Change 579867 had a related patch set uploaded (by Mh-3110; owner: Mahuton):
[pywikibot/core@master] [cleanup]simplify flickrripper.cleanUpTitle

https://gerrit.wikimedia.org/r/579867

flickrripper is no longer actively maintained. Please feel free to reopen if you are still using this script.

Change 579867 abandoned by Xqt:
[pywikibot/core@master] [cleanup]simplify flickrripper.cleanUpTitle

Reason:
flickrripper is no longer actively maintained. Please feel free to reopen if you are still using this script.

https://gerrit.wikimedia.org/r/579867