Interwiki links come up in the list of stop words, which is expected, but we need to omit them.
Description
Event Timeline
Looks like the code is done. I just talked to @Ladsgroup and he needs to run this code against the new wikis. Once that is ready, we can mark this done.
@Ladsgroup: I noticed that the bot removed a few interwiki prefixes which happen to be true stopwords in Portuguese:
https://meta.wikimedia.org/w/index.php?diff=13746843&oldid=13426323
E.g.: "de" (of), "da" (of), "as" (the), "na"/"no" (on/at). It should probably keep those interwiki prefixes which are also words in a given language.
Hey, pretty valid point:
There are two ways of excluding interwiki links:
- Remove interwiki links in every revision
- Run this with them and exclude them from results
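A minimal sketch of the second approach, assuming the results come back as a word-frequency mapping; `word_counts` and `INTERWIKI_PREFIXES` are hypothetical names for illustration:

```python
# Hypothetical sketch: run the counting as-is, then drop interwiki
# prefixes from the resulting stop-word candidates afterwards.
INTERWIKI_PREFIXES = {"en", "fr", "pt", "de"}  # illustrative subset

word_counts = {"de": 1200, "casa": 800, "fr": 950, "tempo": 600}

filtered = {word: count for word, count in word_counts.items()
            if word not in INTERWIKI_PREFIXES}
# filtered keeps only "casa" and "tempo"
```

Note that this naive filter also drops prefixes like "de" that happen to be real words in the language, which is exactly the problem raised above.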
The first one seems pretty accurate, but it requires running the parser at a very large scale, which makes the code pretty slow since we run it on ~1M revisions.
The first approach takes several weeks to finish, but the second one takes about three or four days.
My suggestion: we don't do any excluding at all and instead exclude interwiki links during human review, or we try a faster approach to removing interwiki links, like re.sub('\[\[(en|pt|..)\:', '', revision.text), but I doubt it gives us enough efficiency.
You can use a regex to remove interwiki links. That should be at least an order of magnitude faster than mwparserfromhell since it doesn't have to build an abstract syntax tree.
```python
>>> import re
>>>
>>> PREFIXES = ["fr", "en"]
>>>
>>> interwiki_re = re.compile(r"\[\[:?(" + "|".join(PREFIXES) + "):[^\]]+\]\]")
>>>
>>> text = "Foo bar. Herp derp [[:fr:Hats]] [[Talk:Pajamas]] ."
>>>
>>> interwiki_re.sub("", text)
'Foo bar. Herp derp [[Talk:Pajamas]] .'
```
A quick test of speed.
```python
>>> import time
>>>
>>> start = time.time();foo = [mwparser.parse(text) for i in range(10000)];time.time()-start
1.7691254615783691
>>>
>>> start = time.time();foo = [interwiki_re.sub("", text) for i in range(10000)];time.time()-start
0.13825416564941406
```
I tried your suggestion on Vietnamese Wikipedia and it took 27632.947533369064 seconds (7h40m) to finish, which is acceptable for me. I will re-run this for all wikis :)
@He7d3r: Does this look good to you?
https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/pt
Probably, but it now includes "en" and "pt" which are not words in Portuguese (but are language codes).
I can't think of anything straightforward to handle "en" and "pt". Maybe we should just let them be?
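One possible heuristic, sketched under the assumption that a dictionary of real words in the target language is available (the word sets below are tiny illustrative samples, not real dictionaries):

```python
# Hypothetical sketch: drop a language code from the stop-word list only
# when it is NOT also a real word in the target language. This would keep
# "de"/"da" (real Portuguese words) while dropping "en"/"pt".
LANGUAGE_CODES = {"en", "pt", "de", "fr", "da", "na"}   # illustrative
PORTUGUESE_WORDS = {"de", "da", "na", "no", "as"}       # illustrative

stopwords = ["de", "da", "en", "pt", "casa"]
cleaned = [w for w in stopwords
           if w not in LANGUAGE_CODES or w in PORTUGUESE_WORDS]
# "de" and "da" survive (real words); "en" and "pt" are dropped.
```

This still needs a trustworthy wordlist per language, which may be why letting them be is simpler.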
@Ladsgroup, you said this was {{done}} in chat, but I don't see a stopwords list for zh.