
Add English Language idioms to revscoring
Closed, Resolved (Public)

Description

Idioms are generally informal or otherwise problematic in formal articles.

See https://en.wiktionary.org/wiki/Category:English_idioms for a useful list

It would be nice to have a way to automatically extract a dataset from this list.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Harej triaged this task as Medium priority. Apr 3 2019, 5:11 AM

Hello @Harej Is it possible for me to get more guidance for this task?

Halfak lowered the priority of this task from Medium to Low. Dec 10 2019, 11:39 PM

Right now, I think the best step is to write a script that extracts all of the idioms from the page and turns them into a nice machine-readable format. I think Python with mwparserfromhell should be useful for finding the wikilinks on this page. You can see where we store language assets like this here: https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/english.py

It looks like "words to watch" is most closely related to this task. See this line specifically: https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/english.py#L212

Thanks for the pointers @Halfak I have joined the channel on IRC. I'll look at the pointers and get back to you.

So, I'm considering adding a function to that module that fetches the idioms using mwparserfromhell and returns them probably as a list. What do you think of this approach?

I think that makes a lot of sense. If you return them as a list, it is trivial to count them later.
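
To illustrate why a plain list is convenient, here is a minimal sketch of counting idiom occurrences in a piece of text. It is not how revscoring does it; the function name `count_idioms` and the sample data are hypothetical, and it assumes a case-insensitive whole-phrase match is what is wanted.

```python
import re


def count_idioms(text, idioms):
    """Count case-insensitive, whole-phrase occurrences of each idiom in text.

    Hypothetical helper for illustration; not part of revscoring.
    """
    counts = {}
    for idiom in idioms:
        # \b anchors stop a phrase from matching inside a longer word.
        # They assume the idiom starts and ends with a word character.
        pattern = re.compile(r"\b" + re.escape(idiom) + r"\b", re.IGNORECASE)
        n = len(pattern.findall(text))
        if n:
            counts[idiom] = n
    return counts


# Hypothetical sample data, not taken from Wiktionary.
sample_idioms = ["piece of cake", "break the ice"]
sample_text = "The exam was a piece of cake. A joke helped break the ice."

counts = count_idioms(sample_text, sample_idioms)
total = sum(counts.values())
```

Compiling one pattern per idiom is fine for a sketch; with thousands of idioms a single combined pattern would likely be faster.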

Hello @Halfak I'd like to claim this task.

Hello @Halfak, I hope you are having a good time this festive season. I'm about to parse the text here: https://en.wiktionary.org/wiki/Category:English_idioms. I'd normally use the requests and BeautifulSoup combo, but I believe there's a tool that does this already. I tried importing pywikibot, but it looks like it needs some initial user configuration. Is there another way to do this, or a way to use pywikibot for this purpose that I'm not aware of yet?

I would use mwapi and get what I needed from the API directly.

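import mwapi

# The user_agent string is a placeholder; replace it with your own contact details.
session = mwapi.Session(
  "https://en.wiktionary.org",
  user_agent="idiom extraction script <your contact info>")

# continuation=True makes get() return a generator that follows the API's
# "continue" tokens, so every batch of category members is fetched.
results = session.get(
  action='query',
  list='categorymembers',
  cmtitle="Category:English idioms",
  cmlimit='max',  # larger batches than the default of 10 per request
  formatversion=2,
  continuation=True)

# Collect the page title of every member of the category.
idioms = []
for doc in results:
  for page_doc in doc['query']['categorymembers']:
    idioms.append(page_doc['title'])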
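
The task description asks for a machine-readable dataset, so the titles collected above still need some cleanup. The following is only a sketch under stated assumptions: the helper names `clean_titles` and `to_regexes`, the prefixes in `SKIP_PREFIXES`, and the choice to lowercase titles are all hypothetical, and the sample titles stand in for the real API results.

```python
import json
import re

# Assumption: members whose titles carry these namespace prefixes
# (for example subcategories) are not idioms and should be dropped.
SKIP_PREFIXES = ("Category:", "Appendix:")


def clean_titles(titles, skip_prefixes=SKIP_PREFIXES):
    """Drop namespaced titles, lowercase, deduplicate, and sort."""
    cleaned = set()
    for title in titles:
        if title.startswith(skip_prefixes):
            continue
        # Assumption: case does not matter for matching idioms.
        cleaned.add(title.strip().lower())
    return sorted(cleaned)


def to_regexes(phrases):
    """Escape each phrase so it can be used as a literal regex pattern."""
    return [re.escape(phrase) for phrase in phrases]


# Hypothetical stand-in for the `idioms` list built from the API results.
sample_titles = [
    "Category:English similes",
    "Piece of cake",
    "piece of cake",
    "break the ice",
]

phrases = clean_titles(sample_titles)
patterns = to_regexes(phrases)

# JSON is one possible machine-readable format for the dataset.
json_text = json.dumps(phrases, indent=2)
```

Whether the result should be stored as JSON or pasted into the language module as a list of patterns depends on how the English language assets are maintained, which this sketch does not try to settle.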