Interwiki links come up in the list of stop words, which is expected, but we need to omit them.
@Ladsgroup: I noticed that the bot removed a few interwiki prefixes which happen to be true stopwords in Portuguese:
E.g. "de" (of), "da" (of the), "as" (the), "na"/"no" (on/at). The bot should probably keep those interwiki prefixes that are also words in the given language.
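For illustration, a quick sketch of the overlap being described (both lists here are small illustrative samples, not the full stopword or prefix sets):

```python
# Illustrative subset of Portuguese stopwords
PT_STOPWORDS = {"de", "da", "as", "na", "no", "a", "o", "e", "que"}

# Illustrative subset of interwiki language prefixes
# ("de" German, "da" Danish, "as" Assamese, "na" Nauruan, "no" Norwegian, ...)
INTERWIKI_PREFIXES = {"de", "da", "as", "na", "no", "en", "fr", "pt"}

# Prefixes that collide with genuine Portuguese stopwords and
# therefore should not be dropped from the results wholesale
collisions = sorted(PT_STOPWORDS & INTERWIKI_PREFIXES)
print(collisions)  # ['as', 'da', 'de', 'na', 'no']
```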
Hey, that's a pretty valid point.
There are two ways of excluding interwiki links:
- Remove interwiki links in every revision
- Run this with them and exclude them from results
The first one seems pretty accurate, but the change would have to be applied across the parser at a very large scale, and parsing every revision makes the code pretty slow, since we run it on ~1M revisions.
The first approach takes several weeks to finish but the second one takes about three or four days.
My suggestion: we don't do any excluding during the run and instead remove interwiki links during human review, or we try a faster approach to removing interwiki links, such as re.sub(r'\[\[(en|pt|..)\:', '', revision.text), but I doubt it gives us enough efficiency.
You can use a regex to remove interwiki links. That should be at least an order of magnitude faster than mwparserfromhell, since it doesn't have to build an abstract syntax tree.
>>> import re
>>>
>>> PREFIXES = ["fr", "en"]
>>>
>>> interwiki_re = re.compile(r"\[\[:?(" + "|".join(PREFIXES) + "):[^\]]+\]\]")
>>>
>>> text = "Foo bar. Herp derp [[:fr:Hats]] [[Talk:Pajamas]] ."
>>>
>>> interwiki_re.sub("", text)
'Foo bar. Herp derp  [[Talk:Pajamas]] .'
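Note that because the pattern only matches inside `[[...]]` brackets, prefixes such as "de" or "da" appearing as plain words in Portuguese prose are left untouched, which addresses the original report. A sketch with the Portuguese-relevant codes (the prefix list and sample sentence are my own illustrative assumptions):

```python
import re

# Illustrative prefix list including codes that double as
# Portuguese stopwords ("de", "da", "no", ...)
PREFIXES = ["de", "da", "no", "en", "fr"]
interwiki_re = re.compile(r"\[\[:?(" + "|".join(PREFIXES) + r"):[^\]]+\]\]")

text = "A capital de Portugal [[de:Lissabon]] fica na Europa."
print(interwiki_re.sub("", text))
# The bracketed link is removed; the plain word "de" survives:
# 'A capital de Portugal  fica na Europa.'
```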
A quick test of speed.
>>> import time
>>>
>>> start = time.time(); foo = [mwparser.parse(text) for i in range(10000)]; time.time() - start
1.7691254615783691
>>>
>>> start = time.time(); foo = [interwiki_re.sub("", text) for i in range(10000)]; time.time() - start
0.13825416564941406
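For a more robust comparison, the standard-library timeit module avoids the manual time.time() bookkeeping and repeats the measurement to reduce noise. A sketch of the regex half (the numbers above come from the transcript, not this snippet):

```python
import re
import timeit

PREFIXES = ["fr", "en"]
interwiki_re = re.compile(r"\[\[:?(" + "|".join(PREFIXES) + r"):[^\]]+\]\]")
text = "Foo bar. Herp derp [[:fr:Hats]] [[Talk:Pajamas]] ."

# Time 10,000 substitutions; repeat three times and take the
# best run to reduce scheduler and warm-up noise.
best = min(timeit.repeat(
    lambda: interwiki_re.sub("", text),
    number=10_000, repeat=3))
print(f"regex sub x10k: {best:.4f}s")
```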