Maniphest T205545

Add English Language idioms to revscoring
Closed, Resolved
Public

Assigned To

Authored By

	Halfak
	Sep 26 2018, 2:28 PM

Description

Idioms are generally informal or otherwise problematic in formal articles.

See https://en.wiktionary.org/wiki/Category:English_idioms for a useful list

It would be nice to have a way to automatically extract a dataset from this list.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		HAKSOAT	T247000 Add features for English Language idioms to articlequality models
		Resolved		HAKSOAT	T205545 Add English Language idioms to revscoring

Event Timeline

Halfak created this task. Sep 26 2018, 2:28 PM

Restricted Application added a project: artificial-intelligence. Sep 26 2018, 2:28 PM

Restricted Application added a subscriber: Aklapper.

Halfak moved this task from Unsorted to User experience on the Machine-Learning-Team board. Sep 26 2018, 4:41 PM

Halfak moved this task from User experience to Research & analysis on the Machine-Learning-Team board.

awight added a project: good first task. Sep 26 2018, 6:40 PM

Halfak edited projects, added Machine-Learning-Team (Research); removed Machine-Learning-Team. Apr 2 2019, 9:32 PM

Restricted Application edited projects, added Machine-Learning-Team; removed Machine-Learning-Team (Research). Apr 2 2019, 9:32 PM

Harej moved this task from Research & analysis to New development on the Machine-Learning-Team board. Apr 3 2019, 1:52 AM

Harej triaged this task as Medium priority. Apr 3 2019, 5:11 AM

Hello @Harej, is it possible for me to get more guidance on this task?

Right now, I think the best step is to write a script that extracts all of the idioms from the page and turns them into a nice machine-readable format. I think Python with mwparserfromhell should be useful for finding the wikilinks on this page. You can see where we store language assets like this here: https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/english.py

It looks like "words to watch" is most closely related to this task. See this line specifically: https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/english.py#L212

Halfak removed a subscriber: Harej. Dec 11 2019, 12:00 AM

Thanks for the pointers, @Halfak. I have joined the channel on IRC. I'll look at the pointers and get back to you.

So, I'm considering adding a function to that module that fetches the idioms using mwparserfromhell and returns them, probably as a list. What do you think of this approach?

I think that makes a lot of sense. If you return them as a list, it is trivial to count them later.
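To illustrate the counting step mentioned above, here is a minimal sketch; it is not code from revscoring. It assumes the idioms are already available as a plain list of strings. The function names `build_idiom_regex` and `count_idioms` and the three sample idioms are made up for this example.

```python
import re


def build_idiom_regex(idioms):
    """Compile one case-insensitive pattern that matches any idiom as a whole phrase."""
    # De-duplicate, drop empty strings, and try longer idioms first so that
    # a long idiom wins over a shorter one that is its prefix.
    ordered = sorted({i for i in idioms if i}, key=len, reverse=True)
    if not ordered:
        # Nothing to match: return a pattern that can never match.
        return re.compile(r"(?!)")
    alternation = "|".join(re.escape(i) for i in ordered)
    # (?<!\w) and (?!\w) keep a match from starting or ending inside a word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def count_idioms(text, idiom_regex):
    """Return the number of non-overlapping idiom matches in text."""
    return len(idiom_regex.findall(text))


# Hypothetical sample data, not taken from Wiktionary.
idioms = ["kick the bucket", "piece of cake", "break the ice"]
idiom_regex = build_idiom_regex(idioms)
text = "The exam was a piece of cake, and the joke helped break the ice."
print(count_idioms(text, idiom_regex))  # prints 2
```

A single compiled alternation keeps the count to one pass over the text, which matters if the list grows to thousands of idioms. Whether revscoring would want exact phrase matching like this, or something that also handles inflected forms, is left open here.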

Hello @Halfak, I'd like to claim this task.

HAKSOAT claimed this task. Dec 24 2019, 2:20 AM

Hello @Halfak, I hope you are having a good time this festive season. I'm about to parse the text here: https://en.wiktionary.org/wiki/Category:English_idioms I'd normally use the requests and BeautifulSoup combo, but I believe there's a tool that does this already. I tried importing pywikibot, but it looks like it needs some initial user configuration. Is there any other method of doing this? Is there a way to use pywikibot for this purpose that I'm not aware of yet?

I would use mwapi and get what I needed from the API directly.

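import mwapi

# Identify the script to the API; replace the placeholder with real contact details.
session = mwapi.Session(
  "https://en.wiktionary.org",
  user_agent="idiom extraction script <contact address>")

# continuation=True makes get() return a generator that keeps requesting
# further batches until the category listing is exhausted.
results = session.get(
  action='query',
  list='categorymembers',
  cmtitle="Category:English idioms",
  cmlimit='max',  # ask for the largest batch the API allows per request
  formatversion=2,
  continuation=True)

# Collect the page title of every category member.
idioms = []
for doc in results:
  for page_doc in doc['query']['categorymembers']:
    idioms.append(page_doc['title'])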
Thanks for this

@Halfak, please take a look at my pull request: https://github.com/wikimedia/revscoring/pull/466

Halfak edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team. Jan 22 2020, 8:40 PM

Halfak moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board. Feb 10 2020, 5:54 PM

Halfak moved this task from Review to Parked on the Machine-Learning-Team (Active Tasks) board. Feb 21 2020, 10:39 PM

Halfak moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board. Feb 24 2020, 5:49 PM

Halfak added a parent task: T247000: Add features for English Language idioms to articlequality models. Mar 5 2020, 4:20 PM

Halfak closed this task as Resolved. Jun 22 2020, 4:37 PM