Maniphest T200399

Make Pywikibot remove disambiguation brackets from labels of new Wikidata items created for articles
Closed, ResolvedPublic
Assigned To

Authored By

	Bugreporter
	Jul 26 2018, 5:33 AM

Description

Wikidata is a common database used by Wikimedia projects. Pywikibot has a script named scripts/newitem.py that creates items in this database for newly created pages.

However, page titles can contain brackets. This task is to remove the bracketed part from the label of a newly created item, but only if the page is in the main namespace (ns 0, which is used for articles).

Example: Georgia (country) → Georgia
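
A minimal sketch of the requested behaviour, assuming a plain title string and a numeric namespace. The helper name `label_from_title` is hypothetical and is not part of Pywikibot:

```python
import re

# Matches one trailing "(...)" group, plus any whitespace before it.
_TRAILING_BRACKETS = re.compile(r'\s*\([^()]*\)$')


def label_from_title(title: str, namespace: int) -> str:
    """Return a label for a new item (hypothetical helper).

    The trailing bracketed part is removed only for articles (ns 0).
    """
    if namespace != 0:
        return title
    stripped = _TRAILING_BRACKETS.sub('', title)
    # Fall back to the full title if stripping would leave nothing.
    return stripped or title


print(label_from_title('Georgia (country)', 0))      # Georgia
print(label_from_title('Substantiv (Deutsch)', 14))  # Substantiv (Deutsch)
```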

Details

	Subject	Repo	Branch	Lines +/-
	[IMPR] Strip disambiguation parens from articles	pywikibot/core	master	+19 -2

Event Timeline

Bugreporter created this task. Jul 26 2018, 5:33 AM

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. Jul 26 2018, 5:33 AM

Also for other scripts that can create items.

Dvorapa added a project: Pywikibot. Jul 29 2018, 8:42 AM

Could you explain a bit, please?

When a page like "Georgia (country)" gets imported to Wikidata, the new item has label "Georgia (country)", instead of "Georgia". It should be changed to strip the disambiguator if appropriate.

Xqt triaged this task as Medium priority. Jul 29 2018, 11:26 AM

The issue seems to be in the Pywikibot method wikidatabot.create_item_for_page, which takes a page (from a pagegenerator) and auto-creates an item for it.

Can we assume that all brackets at the end of the string can be safely removed? Do all Wikipedias use them only for disambiguation?

Most wikis have retained the built-in MediaWiki feature where trailing
brackets are stripped when a page is saved:

Wikitext before saving: [[Foo (bar)|]]
Wikitext after saving: [[Foo (bar)|Foo]].

wikidatabot should strip the bracket when importing links from wikis which
have this feature turned on.
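
A simplified model of the save-time expansion shown above, as a sketch only. It covers just the trailing "(...)" case; the real MediaWiki feature also handles other forms such as namespace prefixes. The function name `expand_pipe_trick` is hypothetical:

```python
import re

# Group 1: the full link target, e.g. "Foo (bar)".
# Group 2: the target without its trailing "(...)" part, e.g. "Foo".
# Only links whose label after "|" is empty are matched.
_EMPTY_PIPE_LINK = re.compile(r'\[\[(([^\[\]|()]+?)\s*\([^()\[\]|]*\))\|\]\]')


def expand_pipe_trick(wikitext: str) -> str:
    """Fill in the empty label of links like [[Foo (bar)|]]."""
    return _EMPTY_PIPE_LINK.sub(r'[[\1|\2]]', wikitext)


print(expand_pipe_trick('[[Foo (bar)|]]'))  # [[Foo (bar)|Foo]]
```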

On de.wiktionary there are many categories like
[[Category:Substantiv (Deutsch)]] or [[Category:Substantiv (Englisch)]]. These disambiguations should not be stripped.

deryckchan unsubscribed. Aug 21 2018, 4:30 PM

Alternatively, the script should add the whole name (with brackets) as an alias.
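
One way the alias idea could look, as a sketch: the stripped title becomes the label and the full title is kept as an alias. The helper `build_item_data` is hypothetical, and the dictionary layout follows the labels/aliases structure of Wikidata's entity JSON as an assumption about what the item-creating code would submit:

```python
import re


def build_item_data(title: str, language: str) -> dict:
    """Build label and alias data for a new item (hypothetical helper)."""
    label = re.sub(r'\s*\([^()]*\)$', '', title) or title
    data = {'labels': {language: {'language': language, 'value': label}}}
    if label != title:
        # Keep the full page title, brackets included, as an alias.
        data['aliases'] = {language: [{'language': language, 'value': title}]}
    return data


print(build_item_data('Georgia (country)', 'en'))
```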

In T200399#4482820, @JAnD wrote:

On de.wiktionary there are many categories like
[[Category:Substantiv (Deutsch)]] or [[Category:Substantiv (Englisch)]]. These disambiguations should not be stripped.

I have already said "Only do it when handling articles (ns0)"

deryckchan unsubscribed. Aug 26 2018, 4:43 PM

Will mentor this task for Google-Code-in-2018 with whoever wishes.

Won't this cause errors due to duplicate labels?

Still, it is a good idea, but the new-item script needs to be able to resolve those errors.

In T200399#4551339, @jayvdb wrote:

Won't this cause errors due to duplicate labels?

Wikidata items are identified by a unique QID, so multiple items with the same label (like a title) are not a problem at all; they are just homonyms. Currently the job is done by specialized bots that have to remove the bracketed content after newitem.py has created the item.

revi awarded a token. Sep 21 2018, 3:53 PM

revi subscribed.

https://codein.withgoogle.com/tasks/5448776681521152/

xSavitar moved this task from Backlog to Needs Review on the Pywikibot board. Oct 14 2018, 3:39 PM

Liuxinyu970226 subscribed. Oct 21 2018, 2:13 AM

Shreyasminocha claimed this task. Oct 30 2018, 11:22 AM

@D3r1ck01 @Xqt and others:
Do you prefer a regex or an index approach here? The second looks cleaner to me.
What about implementing this as a param like withoutDisambiguation in the page.title() method? That would make it easy to reuse elsewhere.
I also think this param can be used directly in [[ https://github.com/wikimedia/pywikibot/blob/master/pywikibot/bot.py#L2144 | create_item_for_page() ]], in which case scripts/newitem.py would not need to be modified.
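
A sketch of how such a parameter could be wired up. `PageStub` is a hypothetical stand-in, not the Pywikibot page class, and the parameter name `without_disambiguation` is only a snake_case rendering of the name proposed above:

```python
import re


class PageStub:
    """Hypothetical stand-in for a page object."""

    def __init__(self, title: str, namespace: int = 0):
        self._title = title
        self.namespace = namespace

    def title(self, without_disambiguation: bool = False) -> str:
        """Return the title, optionally without a trailing "(...)" part."""
        if without_disambiguation:
            return re.sub(r'\s*\([^()]*\)$', '', self._title)
        return self._title


def label_for_new_item(page: PageStub) -> str:
    # The item-creating code asks for the stripped title for articles only,
    # so a script such as newitem.py would not need its own stripping logic.
    return page.title(without_disambiguation=page.namespace == 0)


print(label_for_new_item(PageStub('Georgia (country)')))  # Georgia
```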

@Framawiki, if this is about https://gerrit.wikimedia.org/r/c/pywikibot/core/+/470627, then I'd say regex solutions become really confusing when they try to solve a complex problem, and a code base with many of them is hard to understand. Regexes are good for parsing problems and reasonably fast, but indexing is generally easier to understand, however complex the problem, and it is fast too.

In this case, though, we're just checking for brackets, which is pretty straightforward. It's a fairly simple problem, and I think either solution (regex or indexing) can work well without causing much trouble. Honestly, for future-proofing and scalability I'd go for the index approach, but let's hear what others have to say :)

@D3r1ck01 @Xqt and others:
Do you prefer a regex or an index approach here? The second looks cleaner to me.

Sorry, I don’t have it. Could you show me the different approach?

What about implementing this as a param like withoutDisambiguation in the page.title() method? That would make it easy to reuse elsewhere.

Sounds good, but there is a general problem that isn't taken into account: there may be real titles that include such a bracketed tail. Therefore something like without_brackets would be a more neutral name, but it doesn't solve the underlying problem.

I also think this param can be used directly in create_item_for_page(), in which case scripts/newitem.py would not need to be modified.

Agree

Best
xqt

@Xqt To clarify, I haven't implemented the second approach. We have two options: the regex-based approach I illustrated, or an index-based approach (looping over the characters in the string, looking for an open parenthesis and trimming everything after it, or a more advanced version of the same that handles edge cases such as nested, unmatched, or multiple parens).
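
For comparison, a sketch of the two options side by side. Neither function is the code from the Gerrit change; both names are hypothetical, and the index variant is the simple "cut at the first open parenthesis" form described above:

```python
import re


def strip_with_regex(title: str) -> str:
    """Remove trailing "(...)" groups that are anchored at the end."""
    return re.sub(r'(\s*\([^()]*\))+$', '', title)


def strip_with_index(title: str) -> str:
    """Cut the title at the first open parenthesis."""
    position = title.find('(')
    if position <= 0:
        # No parenthesis, or the title starts with one: leave it alone.
        return title
    return title[:position].rstrip()


for sample in ('Georgia (country)', 'Foo (bar) baz'):
    print(repr(strip_with_regex(sample)), repr(strip_with_index(sample)))
```

The two agree on "Georgia (country)", but the simple index variant also truncates a title with brackets in the middle, while the anchored regex leaves it untouched.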

@Framawiki @D3r1ck01

Since all of you recommend doing it the index way, I'll do that instead.

Would it suffice if I handle only:

1. Georgia (Country)
2. Georgia (Something) (Country) // here it would remove both

with the index approach?

The following will break stuff:

3. Georgia ((Country))
4. Georgia (Country
5. Georgia Country)

Are any of the last few probable (and thus worth implementing)? Should I just look for the last open paren and remove everything after it (in 2. for example)?
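
A sketch of the "more advanced" index variant, to show how the five cases above could behave. This is one possible policy, not the merged code: balanced trailing groups are removed (including nested ones), and titles with unmatched parens are left unchanged. The function name is hypothetical:

```python
def strip_trailing_groups(title: str) -> str:
    """Remove balanced trailing "(...)" groups; leave unbalanced titles alone."""
    result = title.rstrip()
    while result.endswith(')'):
        depth = 0
        start = -1
        # Walk backwards to find the "(" that balances the final ")".
        for index in range(len(result) - 1, -1, -1):
            char = result[index]
            if char == ')':
                depth += 1
            elif char == '(':
                depth -= 1
                if depth == 0:
                    start = index
                    break
        if start <= 0:
            # No matching "(" was found, or the group is the whole title.
            break
        result = result[:start].rstrip()
    return result


cases = [
    'Georgia (Country)',
    'Georgia (Something) (Country)',
    'Georgia ((Country))',
    'Georgia (Country',
    'Georgia Country)',
]
for case in cases:
    print(repr(case), '->', repr(strip_trailing_groups(case)))
```

Under this policy cases 1 to 3 all become "Georgia", and cases 4 and 5 are returned as they were.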

Since all of you recommend doing it the index way

Oh, I am fine with regex, which might be less complex than an index-search implementation here.

Change 470627 had a related patch set uploaded (by Shreyasminocha; owner: Shreyasminocha):
[pywikibot/core@master] Strip disambiguation parens from articles

https://gerrit.wikimedia.org/r/470627

gerritbot added a project: Patch-For-Review. Oct 31 2018, 3:33 PM

OK, although I prefer the index option, I chose the regex version for its simplicity. We can always create a task later to change it for cleaner code, even though it already seems very good to me.

Patch is ready for review.

Xqt closed this task as Resolved. Nov 5 2018, 7:26 AM

Change 470627 merged by jenkins-bot:
[pywikibot/core@master] [IMPR] Strip disambiguation parens from articles

https://gerrit.wikimedia.org/r/470627

@Xqt Thank you so much for your patience with me.

Liuxinyu970226 unsubscribed. Nov 6 2018, 12:30 PM

matej_suchanek unsubscribed. Nov 24 2018, 1:21 PM