
Omit the interwikilinks from stop words
Closed, Resolved, Public

Description

Interwiki links come up in the list of stop words, which is expected, but we need to omit them.

Event Timeline

Ladsgroup claimed this task.
Ladsgroup raised the priority of this task from to Low.
Ladsgroup updated the task description.
Ladsgroup added a subscriber: Ladsgroup.

Looks like the code is done. I just talked to @Ladsgroup and he needs to run this code against the new wikis. Once that is ready, we can mark this done.

@Ladsgroup: I noticed that the bot removed a few interwiki prefixes which happen to be true stopwords in Portuguese:
https://meta.wikimedia.org/w/index.php?diff=13746843&oldid=13426323
E.g.: "de" (of), "da" (of), "as" (the), "na"/"no" (on/at). It should probably keep those interwiki prefixes which are also words in a given language.

Hey, pretty valid point.
There are two ways of excluding interwiki links:

  1. Remove interwiki links in every revision
  2. Run this with them and exclude them from results

The first one seems pretty accurate, but it means running the parser on every revision, which makes the code pretty slow since we run it on ~1M revisions.

The first approach takes several weeks to finish but the second one takes about three or four days.

My suggestion: we don't exclude anything up front and instead exclude interwiki links during human review, or we try a faster approach to removing interwiki links, like re.sub('\[\[(en|pt|..)\:','', revision.text), but I doubt it will give us enough efficiency.
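
For reference, the second approach (run with interwiki links and exclude them from the results) could be combined with the earlier request to keep prefixes that are real words in the language. This is only a minimal sketch: filter_stopwords, the keep_words list, and all sample data below are hypothetical and not taken from the actual bot code.

```python
def filter_stopwords(candidates, interwiki_prefixes, keep_words=()):
    """Drop candidates that are interwiki prefixes, unless explicitly kept.

    All three arguments are hypothetical inputs for illustration; the real
    stop word pipeline may represent them differently.
    """
    prefixes = {p.lower() for p in interwiki_prefixes}
    keep = {w.lower() for w in keep_words}
    return [
        w for w in candidates
        if w.lower() not in prefixes or w.lower() in keep
    ]


# Made-up sample data, loosely based on the Portuguese example above.
candidates = ["de", "da", "en", "pt", "que", "as"]
interwiki_prefixes = ["de", "da", "en", "pt", "as", "fr"]
portuguese_words = ["de", "da", "as", "na", "no"]

result = filter_stopwords(candidates, interwiki_prefixes, portuguese_words)
print(result)  # ['de', 'da', 'que', 'as']
```

The keep list would have to be maintained per language, which is the manual step this sketch does not solve.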

You can use a regex to remove interwikilinks. That should be at least an order of magnitude faster than mwparserfromhell since it doesn't have to build an abstract syntax tree.

>>> import re
>>> 
>>> PREFIXES = ["fr", "en"]
>>> 
>>> interwiki_re = re.compile(r"\[\[:?(" + "|".join(PREFIXES) + "):[^\]]+\]\]")
>>> 
>>> text = "Foo bar.  Herp derp [[:fr:Hats]] [[Talk:Pajamas]] ."
>>> 
>>> interwiki_re.sub("", text)
'Foo bar.  Herp derp  [[Talk:Pajamas]] .'
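
A slightly more defensive variant of the same idea, as a sketch: build_interwiki_re is a hypothetical helper name, and the sample sentence is made up. It escapes each prefix so characters such as "-" are matched literally. Because only the link markup is removed from the text, an ordinary word such as "de" in running prose is left alone.

```python
import re


def build_interwiki_re(prefixes):
    """Compile a pattern matching [[prefix:Title]] and [[:prefix:Title]]."""
    # Escape each prefix so that characters such as "-" are treated literally.
    alternatives = "|".join(re.escape(p) for p in prefixes)
    return re.compile(r"\[\[:?(?:" + alternatives + r"):[^\]]+\]\]")


interwiki_re = build_interwiki_re(["de", "en", "pt"])

# Made-up sample text: "de" appears both as a word and as a link prefix.
text = "A casa de Maria. [[de:Haus]] [[en:House]] [[Talk:Casa]]"
cleaned = interwiki_re.sub("", text)
print(cleaned)
```

Here the word "de" survives in the cleaned text while the [[de:Haus]] and [[en:House]] links are dropped, and [[Talk:Casa]] is untouched because "Talk" is not in the prefix list.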

A quick test of speed.

>>> import time
>>> import mwparserfromhell as mwparser
>>> 
>>> start = time.time();foo = [mwparser.parse(text) for i in range(10000)];time.time()-start
1.7691254615783691
>>> 
>>> start = time.time();foo = [interwiki_re.sub("", text) for i in range(10000)];time.time()-start
0.13825416564941406
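
The same comparison can be made a bit more repeatable with the standard timeit module. This is a sketch only: best_time is a hypothetical helper, the mwparserfromhell part is skipped when that third-party package is not installed, and the numbers will differ from machine to machine.

```python
import re
import timeit

interwiki_re = re.compile(r"\[\[:?(?:fr|en):[^\]]+\]\]")
text = "Foo bar.  Herp derp [[:fr:Hats]] [[Talk:Pajamas]] ."


def best_time(func, number=10000, repeat=3):
    """Return the fastest of `repeat` runs, each calling func `number` times."""
    # Taking the minimum makes the result less sensitive to background noise.
    return min(timeit.repeat(func, number=number, repeat=repeat))


regex_seconds = best_time(lambda: interwiki_re.sub("", text))
print(f"regex:  {regex_seconds:.4f} s per 10000 calls")

try:
    import mwparserfromhell  # third-party; optional for this sketch
except ImportError:
    mwparserfromhell = None

if mwparserfromhell is not None:
    parser_seconds = best_time(lambda: mwparserfromhell.parse(text))
    print(f"parser: {parser_seconds:.4f} s per 10000 calls")
```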

I tried your suggestion on Vietnamese Wikipedia and it took 27632.947533369064 seconds to finish (7h40m) which is acceptable for me. I will re-run this for all wikis :)

Probably, but it now includes "en" and "pt" which are not words in Portuguese (but are language codes).

I can't think of anything straightforward to handle "en" and "pt". Maybe we should just let them be?

Halfak moved this task from Completed to Backlog on the Machine-Learning-Team (Active Tasks) board.

@Ladsgroup, you said this was {{done}} in chat, but I don't see a stopwords list for zh