
Make Pywikibot remove disambiguation bracket in labels to created new items for articles in Wikidata
Closed, Resolved (Public)

Description

Wikidata is a common database used by Wikimedia projects. Pywikibot has a script named scripts/newitem.py that creates items in this database for newly created pages.

However, page titles can contain brackets. This task is to remove these brackets from the label of the newly created item, but only if the page is in the main namespace (ns: 0, the namespace used for articles).

Example: Georgia (country) -> Georgia

Event Timeline

Bugreporter renamed this task from newitem.py: Remove disambiguation bracket in labels to Remove disambiguation bracket in labels when creating new items for articles. Jul 26 2018, 5:41 AM

Also for other scripts that can create items.

When a page like "Georgia (country)" gets imported to Wikidata, the new item has label "Georgia (country)", instead of "Georgia". It should be changed to strip the disambiguator if appropriate.

Xqt triaged this task as Medium priority. Jul 29 2018, 11:26 AM

The issue seems to be in the Pywikibot method wikidatabot.create_item_for_page, which takes a page (from a pagegenerator) and automatically creates an item for it.

Can we assume that any brackets at the end of the string can be safely removed? Do all Wikipedias use them only for disambiguation?

Most wikis have retained the built-in MediaWiki feature where trailing
brackets are stripped when a page is saved:

Wikitext before saving: [[Foo (bar)|]]
Wikitext after saving: [[Foo (bar)|Foo]].

wikidatabot should strip the bracket when importing links from wikis which
have this feature turned on.

On de.wiktionary there are many categories like
[[Category:Substantiv (Deutsch)]] or [[Category:Substantiv (Englisch)]]. These disambiguations should not be stripped.

Alternatively, the script should add the whole name (with brackets) as an alias.

On de.wiktionary there are many categories like
[[Category:Substantiv (Deutsch)]] or [[Category:Substantiv (Englisch)]]. These disambiguations should not be stripped.

I have already said "Only do it when handling articles (ns0)"
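
A minimal sketch of the rule as discussed so far, assuming a hypothetical helper named label_and_aliases (it is not part of Pywikibot): the trailing bracket is stripped only for namespace 0, and the full title is kept as an alias, as suggested above. The regex and all names here are assumptions for illustration only.

```python
import re

# Assumption: a disambiguator is a single trailing "(...)" group
# separated from the rest of the title by whitespace.
TRAILING_BRACKETS = re.compile(r'\s+\([^()]*\)$')


def label_and_aliases(title, namespace):
    """Return (label, aliases) for a new item; hypothetical helper."""
    if namespace != 0:
        # Non-article pages (e.g. categories) keep their full title.
        return title, []
    label = TRAILING_BRACKETS.sub('', title)
    # Keep the original title as an alias only if something was stripped.
    aliases = [title] if label != title else []
    return label, aliases
```

With this sketch, an article titled "Georgia (country)" would get the label "Georgia" plus the alias "Georgia (country)", while a category such as "Substantiv (Deutsch)" in namespace 14 is left untouched.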

Will mentor this task for Google-Code-in-2018 with whoever wishes.

Won't this cause errors due to duplicate labels?

Still, it is a good idea, but newitem.py needs to be able to resolve those errors.

Won't this cause errors due to duplicate labels?

Wikidata items are indexed by a unique QID, so multiple items with the same label (like a title) are not a problem at all; they are simply homonyms. Currently the job is done by specialized bots, which have to remove the bracketed content after newitem.py has created the item.

Aklapper renamed this task from Remove disambiguation bracket in labels when creating new items for articles to Make Pywikibot remove disambiguation bracket in labels to created new items for articles in Wikidata. Oct 7 2018, 1:56 AM
Aklapper updated the task description. (Show Details)
Aklapper moved this task from Proposed tasks to Imported in GCI Site on the Google-Code-in-2018 board.

@D3r1ck01 @Xqt and others:
Do you prefer a regex or an index approach here? The second looks cleaner to me.
What about implementing this as a parameter like withoutDisambiguation in the page.title() method? That would make it easy to reuse elsewhere.
I also think this parameter could be used directly in [[ https://github.com/wikimedia/pywikibot/blob/master/pywikibot/bot.py#L2144 | create_item_for_page() ]], in which case scripts/newitem.py would not need to be modified.

@Framawiki, if this is about https://gerrit.wikimedia.org/r/c/pywikibot/core/+/470627, then I'd say regex solutions become really confusing when they try to solve a complex problem, and a code base with many of them is hard to understand. Regexes are good for parsing problems and reasonably fast, but indexing is generally easier to understand, however complex the problem, and is fast too.

But in this case we're just checking for brackets, which is a pretty simple, straightforward problem, so I think either solution (regex or indexing) can work well and will not cause much trouble. Honestly, for future-proofing and scalability I'd go for the index approach, but let's hear what others have to say :)

@D3r1ck01 @Xqt and others:
Do you prefer a regex or an index approach here? The second looks cleaner to me.

Sorry, I don't follow. Could you show me the different approaches?

What about implementing this as a parameter like withoutDisambiguation in the page.title() method? That would make it easy to reuse elsewhere.

Sounds good, but there is a general problem that isn't taken into account: there may be real titles that include such a bracketed tail. Therefore something like without_brackets would be a more neutral name, but it doesn't solve the underlying problem.

I also think this parameter could be used directly in create_item_for_page(), in which case scripts/newitem.py would not need to be modified.

Agree

Best
xqt
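
A rough sketch of the parameter idea from this exchange, using the neutral name without_brackets proposed above. SketchPage is a hypothetical stand-in written for illustration and is not the Pywikibot page class; the regex is likewise an assumption.

```python
import re


class SketchPage:
    """Hypothetical stand-in for a page object (not the Pywikibot class)."""

    def __init__(self, title):
        self._title = title

    def title(self, without_brackets=False):
        """Return the title, optionally without one trailing "(...)" group."""
        if without_brackets:
            # Assumption: strip a single, non-nested bracket group at the end.
            return re.sub(r'\s+\([^()]*\)$', '', self._title)
        return self._title
```

The default keeps the full title, so existing callers would be unaffected, and only a caller that explicitly asks for it gets the shortened form. As noted above, a flag like this cannot tell a disambiguator from brackets that are genuinely part of a title.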

@Xqt To clarify, I haven't implemented the second approach. We have two options: the regex-based approach I illustrated, or an index-based approach (looping over the characters in the string, looking for an open parenthesis and trimming everything after it, or a more advanced version of the same that handles edge cases such as nested, unmatched, or multiple parens).
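
For comparison, the simple index-based variant described here could look roughly like the sketch below. The function name strip_trailing_parens is hypothetical, and this is not the code from the patch; it handles one trailing group only and does not deal with nested parens.

```python
def strip_trailing_parens(title):
    """Index-based sketch: drop one trailing "(...)" group, if present."""
    if not title.endswith(')'):
        return title
    open_idx = title.rfind('(')
    if open_idx == -1:
        # Closing paren without an opening one: leave the title alone.
        return title
    stripped = title[:open_idx].rstrip()
    # Never return an empty label; fall back to the full title instead.
    return stripped if stripped else title
```

Using rfind() means only the last bracket group is removed, so a title with two trailing groups would need the function applied twice (or a loop).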

@Framawiki @D3r1ck01

Since all of you recommend doing it the index way, I'll do that instead.

Would it suffice if I handle only:

1. Georgia (Country)
2. Georgia (Something) (Country) // here it would remove both

with the index approach?

The following will break stuff:

3. Georgia ((Country))
4. Georgia (Country
5. Georgia Country)

Are any of the last few probable (and thus worth implementing)? Should I just look for the last open paren and remove everything after it (in 2. for example)?
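
To make the five cases concrete, here is a small regex sketch that removes every well-formed trailing group and leaves malformed titles untouched. The pattern and the name strip_disambiguators are assumptions for illustration, not necessarily what the patch uses.

```python
import re

# Assumption: one or more non-nested "(...)" groups at the end of the
# title, each preceded by whitespace.
TRAILING_GROUPS = re.compile(r'(\s+\([^()]*\))+$')


def strip_disambiguators(title):
    """Remove all well-formed trailing bracket groups from a title."""
    return TRAILING_GROUPS.sub('', title)


cases = [
    'Georgia (Country)',
    'Georgia (Something) (Country)',
    'Georgia ((Country))',
    'Georgia (Country',
    'Georgia Country)',
]
for case in cases:
    print(repr(case), '->', repr(strip_disambiguators(case)))
```

Under this pattern, cases 1 and 2 both reduce to "Georgia", while cases 3 to 5 do not match and are returned unchanged, which seems a safer failure mode than trimming a malformed title.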

Since all of you recommend doing it the index way

Oh, I am fine with regex, which might be less complex than an index-search implementation here.

Change 470627 had a related patch set uploaded (by Shreyasminocha; owner: Shreyasminocha):
[pywikibot/core@master] Strip disambiguation parens from articles

https://gerrit.wikimedia.org/r/470627

OK, although I preferred the index option, I chose the regex version for its simplicity. We can always create a follow-up task later to get cleaner code, even though it already looks very good to me.

Patch is ready for review.

Change 470627 merged by jenkins-bot:
[pywikibot/core@master] [IMPR] Strip disambiguation parens from articles

https://gerrit.wikimedia.org/r/470627

@Xqt Thank you so much for your patience with me.