Omit interwiki links from stop words
Interwiki links come up in the list of stop words, which is expected, but we need to omit them.

Looks like the code is done. I just talked to @Ladsgroup and he needs to run this code against the new wikis. Once that is ready, we can mark this done.

@Ladsgroup: I noticed that the bot removed a few interwiki prefixes which happen to be true stopwords in Portuguese:
E.g.: "de" (of), "da" (of), "as" (the), "na"/"no" (on/at). It should probably keep those interwiki prefixes which are also words in a given language.
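To illustrate the idea, here is a minimal sketch of filtering a candidate stop word list so that interwiki prefixes are dropped unless they are also real words in the language. All names here (`INTERWIKI_PREFIXES`, `PT_REAL_WORDS`, `filter_stopwords`) are hypothetical, and the sets are tiny examples, not the bot's actual data.

```python
# Hypothetical example data: a few interwiki prefixes and the ones that
# are also genuine Portuguese words (taken from the examples above).
INTERWIKI_PREFIXES = {"en", "pt", "fr", "de", "da", "as", "na", "no"}
PT_REAL_WORDS = {"de", "da", "as", "na", "no"}


def filter_stopwords(candidates, prefixes, keep):
    """Drop interwiki prefixes from candidates unless they are in keep."""
    return [w for w in candidates if w not in prefixes or w in keep]


candidates = ["de", "que", "en", "da", "fr", "para"]
filtered = filter_stopwords(candidates, INTERWIKI_PREFIXES, PT_REAL_WORDS)
print(filtered)
```

The per-language keep-list would still have to come from somewhere, for example from human review.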

Hey, that's a pretty valid point.
There are two ways of excluding interwiki links:

  1. Remove interwiki links in every revision
  2. Run this with them and exclude them from results

The first one seems more accurate, but it requires a large-scale change to the parser, and it makes the code pretty slow since we run this code on ~1M revisions.

The first approach takes several weeks to finish but the second one takes about three or four days.

My suggestion: we don't exclude anything up front and instead exclude interwiki links during human review, or we try a faster approach to removing interwiki links, like re.sub('\[\[(en|pt|..)\:','', revision.text), but I doubt it will be efficient enough.

You can use a regex to remove interwikilinks. That should be at least an order of magnitude faster than mwparserfromhell since it doesn't have to build an abstract syntax tree.

>>> import re
>>> PREFIXES = ["fr", "en"]
>>> interwiki_re = re.compile(r"\[\[:?(" + "|".join(PREFIXES) + "):[^\]]+\]\]")
>>> text = "Foo bar.  Herp derp [[:fr:Hats]] [[Talk:Pajamas]] ."
>>> interwiki_re.sub("", text)
'Foo bar.  Herp derp  [[Talk:Pajamas]] .'
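The same pattern could be wrapped for reuse across many revisions. This is only a sketch under assumptions: `compile_interwiki_re` and `strip_interwiki` are hypothetical helper names, the prefix list is an example, and `re.escape` is added defensively in case a prefix contains regex metacharacters.

```python
import re


def compile_interwiki_re(prefixes):
    """Build a pattern matching [[xx:Title]] and [[:xx:Title]] for the given prefixes."""
    alternation = "|".join(re.escape(p) for p in prefixes)
    return re.compile(r"\[\[:?(?:" + alternation + r"):[^\]]+\]\]")


def strip_interwiki(texts, prefixes):
    """Yield each text with matching interwiki links removed."""
    pattern = compile_interwiki_re(prefixes)  # compiled once, reused per text
    for text in texts:
        yield pattern.sub("", text)


revisions = [
    "Foo bar.  Herp derp [[:fr:Hats]] [[Talk:Pajamas]] .",
    "Veja também [[en:Example]] e [[Ajuda:Conteúdos]].",
]
cleaned = list(strip_interwiki(revisions, ["fr", "en"]))
print(cleaned)
```

Links whose prefix is not in the list, such as namespace links, are left untouched.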

A quick test of speed.

>>> import time
>>> import mwparserfromhell as mwparser  # assumed alias; the original snippet did not show this import
>>> start = time.time(); foo = [mwparser.parse(text) for i in range(10000)]; time.time() - start
>>> start = time.time(); foo = [interwiki_re.sub("", text) for i in range(10000)]; time.time() - start

I tried your suggestion on Vietnamese Wikipedia and it took 27632.947533369064 seconds to finish (7h40m), which is acceptable to me. I will re-run this for all wikis :)

Probably, but it now includes "en" and "pt" which are not words in Portuguese (but are language codes).

I can't think of anything straightforward to handle "en" and "pt". Maybe we should just let them be?

@Ladsgroup, you said this was {{done}} in chat, but I don't see a stopwords list for zh.

