
Omit the interwiki links from stop words
Closed, Resolved · Public

Description

Interwiki links come up in the list of stop words, which is expected, but we need to omit them.

Event Timeline

Ladsgroup claimed this task.
Ladsgroup raised the priority of this task from to Low.
Ladsgroup updated the task description. (Show Details)
Ladsgroup added a subscriber: Ladsgroup.
Restricted Application added a subscriber: Aklapper. · Aug 21 2015, 4:29 PM
Halfak added a subscriber: Halfak. · Sep 11 2015, 4:35 PM

Looks like the code is done. I just talked to @Ladsgroup and he needs to run this code against the new wikis. Once that is ready, we can mark this done.

Ladsgroup closed this task as Resolved. · Sep 22 2015, 11:02 AM
He7d3r added a subscriber: He7d3r. · Sep 22 2015, 1:54 PM

@Ladsgroup: I noticed that the bot removed a few interwiki prefixes which happen to be true stopwords in Portuguese:
https://meta.wikimedia.org/w/index.php?diff=13746843&oldid=13426323
E.g. "de" (of), "da" (of), "as" (the), "na"/"no" (on/at). The bot should probably keep those interwiki prefixes that are also words in the given language.
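One way to do that (a minimal sketch; the prefix and word sets below are illustrative examples, not the bot's actual data) would be to subtract the language's known words from the exclusion list before filtering:

# Minimal sketch: only exclude prefixes that are not real words in the
# target language.  Both sets are illustrative, not the bot's actual data.
PREFIXES = {"de", "da", "as", "na", "no", "en", "pt", "fr"}
portuguese_words = {"de", "da", "as", "na", "no"}  # hypothetical wordlist

prefixes_to_exclude = PREFIXES - portuguese_words
# -> {"en", "pt", "fr"}: pure language codes, safe to drop from the results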

Hey, pretty valid point.
There are two ways of excluding interwiki links:

  1. Remove interwiki links in every revision
  2. Run this with them and exclude them from results

The first one seems more accurate, but the change would have to be imposed on the parser at a very large scale, and running the parser makes the code quite slow since we run it on ~1M revisions.

The first approach would take several weeks to finish, but the second one takes about three or four days.
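To make the trade-off concrete, here is a rough sketch of what the second approach looks like (the names and data are illustrative, not the bot's actual code): the expensive counting pass runs unchanged, and the exclusion only touches the final counts.

# Rough sketch of option 2: count tokens with interwiki prefixes left in,
# then drop the prefixes when building the final stop word list.
from collections import Counter

PREFIXES = {"en", "pt", "fr", "de"}  # illustrative prefix set

def top_stopwords(token_counts, n=200):
    # Keep the n most frequent tokens that are not interwiki prefixes.
    return [word for word, _ in token_counts.most_common()
            if word not in PREFIXES][:n]

counts = Counter(["de", "casa", "en", "de", "casa", "o"])
print(top_stopwords(counts, n=3))  # ['casa', 'o'] -- "de" and "en" dropped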

My suggestion: we don't do any excluding at all and instead drop interwiki links during human review, or we try a faster approach to removing interwiki links, like re.sub(r'\[\[(en|pt|..)\:', '', revision.text), but I doubt it would give us enough of a speedup.

You can use a regex to remove interwiki links. That should be at least an order of magnitude faster than mwparserfromhell, since it doesn't have to build an abstract syntax tree.

>>> import re
>>> 
>>> PREFIXES = ["fr", "en"]
>>> 
>>> interwiki_re = re.compile(r"\[\[:?(" + "|".join(PREFIXES) + r"):[^\]]+\]\]")
>>> 
>>> text = "Foo bar.  Herp derp [[:fr:Hats]] [[Talk:Pajamas]] ."
>>> 
>>> interwiki_re.sub("", text)
'Foo bar.  Herp derp  [[Talk:Pajamas]] .'

A quick test of speed.

>>> import time
>>> import mwparserfromhell as mwparser
>>> 
>>> start = time.time();foo = [mwparser.parse(text) for i in range(10000)];time.time()-start
1.7691254615783691
>>> 
>>> start = time.time();foo = [interwiki_re.sub("", text) for i in range(10000)];time.time()-start
0.13825416564941406
Halfak reopened this task as Open. · Sep 25 2015, 4:36 PM

I tried your suggestion on Vietnamese Wikipedia and it took 27632.947533369064 seconds (~7h40m) to finish, which is acceptable to me. I will re-run this for all wikis :)

Probably, but it now includes "en" and "pt", which are not words in Portuguese (but are language codes).

I can't think of anything straightforward to handle "en" and "pt". Maybe we should just let them be?

Ladsgroup moved this task from Done to Backlog on the Scoring-platform-team (Current) board.
Halfak moved this task from Done to Backlog on the Scoring-platform-team (Current) board.

@Ladsgroup, you said this was {{done}} in chat, but I don't see a stopwords list for zh.

Halfak closed this task as Resolved. · Nov 19 2015, 11:41 PM