Idioms are generally informal or otherwise problematic in formal articles.
See https://en.wiktionary.org/wiki/Category:English_idioms for a useful list.
It would be nice to have a way to automatically extract a dataset from this list.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | HAKSOAT | T247000 Add features for English Language idioms to articlequality models
Resolved | | HAKSOAT | T205545 Add English Language idioms to revscoring
Right now, I think the best next step is to write a script that extracts all of the idioms from the page and turns them into a nice machine-readable format. I think Python with mwparserfromhell should be useful for finding the wikilinks on this page. You can see where we store language assets like this here: https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/english.py
It looks like "words to watch" is most closely related to this task. See this line specifically: https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/english.py#L212
Thanks for the pointers, @Halfak. I have joined the channel on IRC. I'll look at them and get back to you.
So, I'm considering adding a function to that module that fetches the idioms using mwparserfromhell and returns them probably as a list. What do you think of this approach?
I think that makes a lot of sense. If you return them as a list, it is trivial to count them later.
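To illustrate why a plain list is convenient here, below is a minimal sketch of counting idiom occurrences in a piece of text. It is not how revscoring implements its features; the names `IDIOMS`, `build_idiom_regex`, and `find_idioms` are hypothetical, and the three sample idioms stand in for the real list from the Wiktionary category.

```python
import re

# Hypothetical sample; the real list would come from the Wiktionary category.
IDIOMS = ["piece of cake", "break the ice", "under the weather"]


def build_idiom_regex(idioms):
    """Compile one case-insensitive pattern that matches any idiom as a whole phrase."""
    # Try longer idioms first so they win over shorter ones that share a prefix.
    alternatives = sorted((re.escape(i) for i in idioms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def find_idioms(text, idiom_re):
    """Return every idiom occurrence found in the text, as a list."""
    return idiom_re.findall(text)


idiom_re = build_idiom_regex(IDIOMS)
matches = find_idioms("The exam was a piece of cake, a real Piece of Cake.", idiom_re)
print(len(matches))
```

Because the matches come back as a list, the count is just `len(matches)`, and the same list could also be inspected to see which idioms were hit.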
Hello @Halfak, I hope you are having a good time this festive season. I'm about to parse the text here: https://en.wiktionary.org/wiki/Category:English_idioms. I'd normally use the requests and beautifulsoup combo, but I believe there's a tool that does this already. I tried importing pywikibot, but it looks like it needs some initial user configuration. Is there another way of doing this, or a way to use pywikibot for this purpose that I'm not aware of yet?
I would use mwapi and get what I needed from the API directly.
    import mwapi

    session = mwapi.Session("https://en.wiktionary.org")
    results = session.get(
        action='query',
        list='categorymembers',
        cmtitle="Category:English idioms",
        formatversion=2,
        continuation=True)

    idioms = []
    for doc in results:
        for page_doc in doc['query']['categorymembers']:
            idioms.append(page_doc['title'])
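One thing to watch for: `list=categorymembers` may also return subcategories and pages outside the main namespace, so the loop above could pick up titles that are not idioms. The sketch below shows one possible way to filter on the `ns` field and write the result out as JSON. It assumes each member carries `ns` and `title` keys, as the API returns with `formatversion=2`; `collect_titles` and the `fake_results` data are hypothetical and stand in for a live session so the example runs offline.

```python
import json


def collect_titles(result_docs, namespace=0):
    """Flatten paged categorymembers responses into a sorted list of unique titles."""
    titles = set()
    for doc in result_docs:
        for page_doc in doc.get("query", {}).get("categorymembers", []):
            # Keep only one namespace (0 = main), which skips subcategories.
            if page_doc.get("ns") == namespace:
                titles.add(page_doc["title"])
    return sorted(titles)


# Hypothetical responses shaped like the API output, used instead of a live request.
fake_results = [
    {"query": {"categorymembers": [
        {"pageid": 1, "ns": 0, "title": "piece of cake"},
        {"pageid": 2, "ns": 14, "title": "Category:English similes"},
    ]}},
    {"query": {"categorymembers": [
        {"pageid": 3, "ns": 0, "title": "break the ice"},
        {"pageid": 1, "ns": 0, "title": "piece of cake"},
    ]}},
]

idioms = collect_titles(fake_results)
print(json.dumps(idioms, indent=2))
```

With a real session, `results` from the snippet above could be passed in place of `fake_results`.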
@Halfak Please take a look at my Pull Request: https://github.com/wikimedia/revscoring/pull/466