
Add features for English Language idioms to articlequality models
Closed, Resolved · Public

Event Timeline

Restricted Application added a subscriber: Aklapper.

See https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/tests/test_enwiki.py

I'm imagining something like this:

from revscoring.features import modifiers, wikitext
from revscoring.languages import english

# Count of idiom matches in the revision's text
idioms_count = english.revision.idioms.matches

local_wiki = [
  ...
  idioms_count,
  # words_to_watch_count comes from the existing enwiki feature list
  words_to_watch_count + idioms_count,
  (words_to_watch_count + idioms_count) / modifiers.max(wikitext.revision.words, 1),
  idioms_count / modifiers.max(wikitext.revision.words, 1)
]

I ran a test with this and discussed it yesterday with @HAKSOAT. In https://gist.github.com/halfak/b9ce3f174a066e4851d04a2de7d2437d, we can see that many of the phrases that get picked up are not actually problematic idioms. It seems that this is because of the way that Wiktionary categorizes "idioms". @HAKSOAT and I discussed a few different strategies for cleaning up the dataset.

One method that I proposed is to gather a set of known-good articles and scan them using the idiom set. We could then remove any "idioms" that are commonly matched in the good articles; that would also let us automatically regenerate the dataset in the future. A rough sketch of the idea is below.
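Everything in this sketch is hypothetical: `idiom_patterns` stands in for a dict of per-idiom compiled regexes and `good_texts` for the extracted text of the known-good articles.

from collections import Counter

def prune_idioms(idiom_patterns, good_texts, max_hits=5):
    """Drop any "idiom" that matches more than `max_hits` times across a
    corpus of known-good articles; frequent matches in good prose suggest
    the phrase is benign rather than a quality problem."""
    hits = Counter()
    for text in good_texts:
        for idiom, pattern in idiom_patterns.items():
            hits[idiom] += len(pattern.findall(text))
    return {idiom for idiom in idiom_patterns if hits[idiom] <= max_hits}

Re-running something like this against a fresh set of good articles whenever the Wiktionary-derived list changes is what would keep the cleanup automatic.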

One other thing I noticed is that we tend to pick up idioms in citation text. I wonder if we could somehow exclude all tags when scanning; that would let us disregard block quotes and <ref> tags. This could be a non-issue, though, and it sounds hard, so if we don't find an obvious way to do it, that's OK.
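If we did want to try it, one plausible route (an assumption on my part, not something we settled on) would be to parse the wikitext with mwparserfromhell and drop tag nodes before scanning:

import mwparserfromhell

def strip_tag_content(text):
    """Remove tag nodes (<ref>, <blockquote>, etc.) from wikitext so the
    idiom scan only sees the article's own prose."""
    code = mwparserfromhell.parse(text)
    for tag in code.filter_tags(recursive=True):
        try:
            code.remove(tag)
        except ValueError:
            pass  # node was already dropped along with a parent tag
    return str(code)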

I have been able to work on the English idioms, and I'm excited to say that the speed has improved by a large margin. Before now, we were simply piping the whole list of English idioms together into one long regex alternation, which was quite inefficient.

To make this efficient, we needed to factor the repeated prefixes out of the list. During a chat with @Halfak, he suggested I look into the trie data structure.

Hence, I was able to convert the list of English idioms into a trie and then serialize that trie back into the regex we want. Before the change, extracting idioms took about 18.34 seconds; after the change, it took about 0.71 seconds.
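For anyone curious, here is a minimal sketch of the idea (my own illustration, not the exact code from the patch): build a character trie from the phrase list, then walk it to emit a single regex in which each shared prefix appears only once.

import re

def trie_regex(phrases):
    """Build a character trie from `phrases`, then serialize it into one
    regex in which shared prefixes appear only once."""
    trie = {}
    for phrase in phrases:
        node = trie
        for char in phrase:
            node = node.setdefault(char, {})
        node[""] = {}  # end-of-phrase marker

    def serialize(node):
        end = "" in node  # some phrase terminates at this node
        branches = [re.escape(char) + serialize(child)
                    for char, child in sorted(node.items()) if char != ""]
        if not branches:
            return ""
        if len(branches) == 1:
            # If a phrase also ends here, the remaining suffix is optional.
            return "(?:" + branches[0] + ")?" if end else branches[0]
        body = "(?:" + "|".join(branches) + ")"
        return body + "?" if end else body

    return serialize(trie)

# ["cat", "car", "cart"] collapses to "ca(?:r(?:t)?|t)" instead of
# "cat|car|cart", so the engine never re-reads the shared "ca" prefix.
pattern = re.compile(r"\b" + trie_regex(["cat", "car", "cart"]) + r"\b")

On the real idiom list, that prefix sharing is what turns the roughly 18-second scan into a sub-second one: the regex engine no longer has to try every alternative from scratch at each position.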