Omit the interwikilinks from stop words
Closed, Resolved · Public

Assigned To

Authored By

	Ladsgroup
	Aug 21 2015, 4:29 PM

Description

Interwiki links come up in the list of stop words, which is expected, but we need to omit them.

Related Objects

Status	Assigned	Task
Open	None	T227094 Update RC Filters for new ORES capacities (July, 2019)
Resolved	SBisson	T225561 Update ORES thresholds for nlwiki
Open	None	T223273 Update srwiki thresholds for goodfaith model
Resolved	SBisson	T225562 Deploy ORES filters for zhwiki
Open	None	T225563 Deploy ORES filters for jawiki
Resolved	Halfak	T224484 ORES deployment: Early June
Resolved	Halfak	T224481 Train/test zhwiki editquality models
Resolved	Halfak	T223382 Improvements to ORES localization and support
Resolved	Halfak	T109366 Chinese language utilities
Open	None	T106846 Afrikaans language utilities
Resolved	Ladsgroup	T106845 Arabic language utilities
Resolved	Halfak	T107590 Dutch language utilities
Resolved	Halfak	T106844 Estonian language utilities
Resolved	ToAruShiroiNeko	T106835 Hebrew language utilities
Open	None	T106843 Armenian language utilities
Resolved	Halfak	T107591 Italian language utilities
Resolved	Ladsgroup	T106833 Polish language utilities
Resolved	Halfak	T106836 Russian language utilities
Resolved	Halfak	T106837 Ukrainian language utilities
Invalid	ToAruShiroiNeko	T107609 Community outreach
Resolved	Ladsgroup	T110964 TF-IDF to determine global stop words
Resolved	Ladsgroup	T109844 Omit the interwikilinks from stop words

Event Timeline

Ladsgroup created this task. Aug 21 2015, 4:29 PM

Ladsgroup claimed this task.

Ladsgroup raised the priority of this task from to Low.

Ladsgroup updated the task description. (Show Details)

Ladsgroup added a project: Machine-Learning-Team (Active Tasks).

Ladsgroup subscribed.

Restricted Application added a subscriber: Aklapper. Aug 21 2015, 4:29 PM

Ladsgroup added a project: Bad-Words-Detection-System. Aug 27 2015, 9:00 PM

Ladsgroup set Security to None.

Ladsgroup moved this task from Backlog to Active on the Bad-Words-Detection-System board. Aug 27 2015, 9:08 PM

Halfak moved this task from Parked to Backlog on the Machine-Learning-Team (Active Tasks) board. Sep 1 2015, 6:56 PM

https://github.com/wiki-ai/Bad-Words-Detection-System/commit/0363bb2fa24bf7e9bf7997a9712f937f905d8a82

Looks like the code is done. I just talked to @Ladsgroup and he needs to run this code against the new wikis. Once that is ready, we can mark this done.

Halfak added a parent task: T110964: TF-IDF to determine global stop words. Sep 11 2015, 4:39 PM

Ladsgroup closed this task as Resolved. Sep 22 2015, 11:02 AM

@Ladsgroup: I noticed that the bot removed a few interwiki prefixes which happen to be true stopwords in Portuguese:
https://meta.wikimedia.org/w/index.php?diff=13746843&oldid=13426323
E.g.: "de" (of), "da" (of), "as" (the), "na"/"no" (on/at). It should probably keep those interwiki prefixes which are also words in a given language.
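
One way to act on this, as a minimal sketch: drop interwiki prefixes from the candidate list unless they appear on a per-language keep-list. The names `INTERWIKI_PREFIXES`, `KEEP` and `filter_stop_words` are hypothetical and not taken from the bot's code, and the keep-list below only holds the Portuguese examples mentioned above; a real list would need human review.

```python
# Hypothetical sample of interwiki prefixes; the real set is much larger.
INTERWIKI_PREFIXES = {"en", "pt", "de", "da", "as", "na", "no", "fr"}

# Prefixes that are also real words, per language (illustrative only).
KEEP = {"pt": {"de", "da", "as", "na", "no"}}


def filter_stop_words(candidates, lang):
    """Drop interwiki prefixes unless they are known words in `lang`."""
    keep = KEEP.get(lang, set())
    return [w for w in candidates
            if w not in INTERWIKI_PREFIXES or w in keep]


candidates = ["de", "que", "en", "da", "fr", "para"]
filtered = filter_stop_words(candidates, "pt")
print(filtered)
```

With this shape, a language without a keep-list entry simply loses every prefix, which matches the bot's current behaviour.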

Hey, Pretty valid point:
There are two ways of excluding interwiki links:

1. Remove interwiki links in every revision
2. Run this with them and exclude them from results

The first one seems pretty accurate, but it has to run the parser over every revision, which makes the code pretty slow since we run it on ~1M revisions.

The first approach takes several weeks to finish but the second one takes about three or four days.

My suggestion: either we skip the exclusion entirely and remove interwiki links during human review, or we try a faster way of removing interwiki links, like re.sub('\[\[(en|pt|..)\:','', revision.text), but I doubt it will be efficient enough.
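
To make the difference between the two approaches concrete, here is a rough sketch. It is not the code from the repository: the stripping step uses a plain regex instead of a full parser, `PREFIXES` is a two-item sample, and the function names are made up for illustration.

```python
import re
from collections import Counter

PREFIXES = ["en", "de"]  # hypothetical sample of interwiki prefixes
interwiki_re = re.compile(r"\[\[:?(" + "|".join(PREFIXES) + r"):[^\]]+\]\]")
word_re = re.compile(r"\w+")


def count_after_stripping(text):
    """Approach 1: remove interwiki links from the text, then count words."""
    return Counter(word_re.findall(interwiki_re.sub("", text).lower()))


def count_then_exclude(text):
    """Approach 2: count words in the raw text, then drop prefix tokens."""
    counts = Counter(word_re.findall(text.lower()))
    for prefix in PREFIXES:
        counts.pop(prefix, None)
    return counts


text = "A casa de Maria [[de:Haus]] [[en:House]]"
stripped = count_after_stripping(text)
excluded = count_then_exclude(text)
print(stripped["de"], excluded["de"])
```

Approach 1 keeps the "de" that occurs in running text, while approach 2 discards every "de" and still counts the link targets ("haus", "house").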

You can use a regex to remove interwikilinks. That should be at least an order of magnitude faster than mwparserfromhell since it doesn't have to build an abstract syntax tree.

>>> import re
>>> 
>>> PREFIXES = ["fr", "en"]
>>> 
>>> # Both pattern pieces are raw strings so the backslashes reach the regex engine unchanged.
>>> interwiki_re = re.compile(r"\[\[:?(" + "|".join(PREFIXES) + r"):[^\]]+\]\]")
>>> 
>>> text = "Foo bar.  Herp derp [[:fr:Hats]] [[Talk:Pajamas]] ."
>>> 
>>> interwiki_re.sub("", text)
'Foo bar.  Herp derp  [[Talk:Pajamas]] .'

A quick test of speed.

>>> import time
>>> import mwparserfromhell as mwparser
>>> 
>>> start = time.time();foo = [mwparser.parse(text) for i in range(10000)];time.time()-start
1.7691254615783691
>>> 
>>> start = time.time();foo = [interwiki_re.sub("", text) for i in range(10000)];time.time()-start
0.13825416564941406
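
For a more repeatable measurement, something like the following could be used. This is only a sketch built on the standard `timeit` module; `best_time` is a hypothetical helper, and the numbers it prints will differ from the ones above depending on the machine.

```python
import re
import timeit

PREFIXES = ["fr", "en"]
interwiki_re = re.compile(r"\[\[:?(" + "|".join(PREFIXES) + r"):[^\]]+\]\]")
text = "Foo bar.  Herp derp [[:fr:Hats]] [[Talk:Pajamas]] ."


def best_time(func, number=10000, repeat=3):
    """Return the fastest total time (seconds) for `number` calls to `func`."""
    return min(timeit.repeat(func, number=number, repeat=repeat))


regex_seconds = best_time(lambda: interwiki_re.sub("", text))
print(f"regex: {regex_seconds:.4f}s for 10000 calls")

# To time the parser the same way (requires the third-party
# mwparserfromhell package):
#   import mwparserfromhell
#   best_time(lambda: mwparserfromhell.parse(text))
```

Taking the minimum over a few repeats reduces noise from other processes on the machine.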

Halfak reopened this task as Open. Sep 25 2015, 4:36 PM

I tried your suggestion on Vietnamese Wikipedia and it took 27632.947533369064 seconds to finish (7h40m) which is acceptable for me. I will re-run this for all wikis :)

@He7d3r: Does this look good to you?
https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/pt

Probably, but it now includes "en" and "pt" which are not words in Portuguese (but are language codes).

I can't think of anything straightforward to handle "en" and "pt". Maybe we should just let them be?

Ladsgroup moved this task from Backlog to Completed on the Machine-Learning-Team (Active Tasks) board. Sep 30 2015, 10:04 AM

Ladsgroup moved this task from Completed to Backlog on the Machine-Learning-Team (Active Tasks) board.

@Ladsgroup, you said this was {{done}} in chat, but I don't see a stopwords list for zh

Halfak moved this task from Backlog to Completed on the Machine-Learning-Team (Active Tasks) board. Oct 16 2015, 6:01 PM

Halfak closed this task as Resolved. Nov 19 2015, 11:41 PM

Phabricator_maintenance added a project: User-Ladsgroup. Aug 12 2016, 8:09 PM
