Your second task: classify statements within an article
Open, Needs Triage, Public

Description

In this task, we want to exercise simple parsing of a Wikipedia article and classifying some of its sentences.

Please write a program or script in your preferred language that:

1- Receives as input the title of an English Wikipedia article.
2- Retrieves the text of that article from the MediaWiki API. If using Python, consider using python-mwapi for this.
3- Identifies individual sentences within that text, along with the corresponding section titles. If using Python, mwparserfromhell can help you work with wiki markup.
4- Runs those sentences through the model to classify them.
5- Outputs the sentences, one per line, sorted by score given by the model.
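Step 5 can be sketched as follows. This is only an illustration: the function name `format_ranked`, the tab-separated output format, and the choice of highest score first are assumptions, since the task does not fix the sort direction.

```python
from typing import Iterable, List, Tuple


def format_ranked(scored: Iterable[Tuple[str, float]]) -> List[str]:
    """Return one output line per sentence, highest model score first.

    `scored` holds (sentence, score) pairs; how the scores are produced
    (step 4) is outside this sketch.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    # Hypothetical output format: score, a tab, then the sentence.
    return [f"{score:.3f}\t{sentence}" for sentence, score in ranked]
```

Calling `print("\n".join(format_ranked(pairs)))` then writes the sentences one per line.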

This is similar to the run_citation_need_model.py script in the model repository, but that one loads its input statements from an already structured file, and you have to extract that information directly from a Wikipedia article.

Please create a GitHub (or similar, like BitBucket) repository with your code and send us a link to it in a comment on this Phabricator entry.

Deadline: This task has no deadline of its own, other than the November 5th deadline for contributions in Outreachy. The sooner the better though, as we would like to look at your code, maybe file an issue and/or discuss design decisions before the actual deadline.

Feel free to ping @Miriam or myself if you have questions.

Event Timeline

Surlycyborg updated the task description. Fri, Oct 4, 10:13 AM

Hi @Surlycyborg,

Do we have a deadline for this task? Thanks!

The only deadline is really the Application period for Outreachy, which goes until mid-October.

What does this mean: "send us a link to it by commenting on this task." ?
Thanks

Surlycyborg updated the task description. Fri, Oct 4, 5:37 PM

Sorry, that was a bit ambiguous I guess :) I mean to make a comment on this Phabricator entry, like we're doing now, with the URL to your repository.

Yes, that's how I understood it. I just wanted to be sure, as I thought we were competing for the internship and sharing the code might not be the best strategy.

If someone copies code from another submission it will be fairly obvious to us :)

IrinaGruz added a comment. Edited Fri, Oct 4, 6:04 PM

What about the first task: how do we submit it? I asked in another thread but haven't received a response.

Achillesheel02 added a subscriber: Achillesheel02. Edited Sat, Oct 5, 10:30 PM

Hi @Samwalton9,
Which section title takes precedence? I noticed there are different levels and was wondering which one to use.

That is probably a question for @Miriam - I imagine we want the same level of section title that was used to train the model.

@Surlycyborg, do we include the content from the following sections:
'See also',
'References',
'External links',
'Further reading'

Those sections typically don't have a lot of usable text that is not links or references, so I think no, we can just ignore those.

Miriam added a comment. Mon, Oct 7, 8:45 AM

Hi @Samwalton9,
Which section title takes precedence? I noticed there are different levels and was wondering which one to use.

Hi @Achillesheel02 great question! We should use the section title at the highest level, so that we can generalize across different articles.
E.g. if you have a sentence whose main Section is "Biography", and whose sub-section is "Early life" or "Career", we should keep "Biography" as the section title used as input for the model. Does that make sense?

Also, yes, please discard anything coming from
'See also',
'References',
'External links',
'Further reading'
As @Surlycyborg said, these sections don't have a lot of content needing citations, so we can discard them.
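The two rules above (keep the highest-level section title, discard the four listed sections) can be sketched as below. This is a minimal illustration that uses a plain regular expression on raw wikitext rather than mwparserfromhell. The function name `split_top_level_sections` and the `"LEAD"` label for text before the first heading are hypothetical, and it assumes top-level sections are written as `== Title ==`.

```python
import re
from typing import List, Tuple

# Sections the thread says to discard (compared case-insensitively).
SKIPPED_SECTIONS = {"see also", "references", "external links", "further reading"}

# A wikitext heading line: "== Title ==" is level 2, "=== Title ===" is level 3, etc.
HEADING_RE = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")


def split_top_level_sections(wikitext: str, lead_title: str = "LEAD") -> List[Tuple[str, str]]:
    """Return (top_level_title, text) pairs, skipping the discarded sections.

    Text under sub-headings is attributed to the enclosing level-2 section.
    `lead_title` is a hypothetical label for text before the first heading.
    """
    sections: List[Tuple[str, str]] = []
    title, body = lead_title, []

    def flush() -> None:
        # Store the section collected so far unless it is empty or discarded.
        text = "\n".join(body).strip()
        if text and title.lower() not in SKIPPED_SECTIONS:
            sections.append((title, text))

    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if match is None:
            body.append(line)
        elif len(match.group(1)) == 2:
            # A new top-level section starts: store the previous one.
            flush()
            title, body = match.group(2), []
        # Deeper headings are dropped; their text stays under the current title.
    flush()
    return sections
```

With this approach a sentence under "Early life" inside "Biography" is paired with "Biography", matching the example given above.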

Note: mwapi requires Python 3, while the citation model code is written in Python 2.

Ghassanmas added a comment. Edited Thu, Oct 10, 7:12 AM

@Surlycyborg the deadline according to Outreachy is the 5th of Nov, right?

@Miriam @Samwalton9 So according to this task the function should take the article title, but I assume the user may not enter the title exactly as it appears on Wikipedia, so I assume I am going to use the MediaWiki API to search for articles.

So should I take the first result of the search, or should I show the results to the user and ask them to select one of the titles?

Thank you! I hope to take a look and maybe file an issue in the next couple of days :)

@Surlycyborg the deadline according to Outreachy is the 5th of Nov, right?

Looks like it, thanks for the link. The mentor's documentation didn't have a date :( I'll use your link in the task description.

@Miriam @Samwalton9 So according to this task the function should take the article title, but I assume the user may not enter the title exactly as it appears on Wikipedia, so I assume I am going to use the MediaWiki API to search for articles.
So should I take the first result of the search, or should I show the results to the user and ask them to select one of the titles?

For the purposes of this task, I think we don't need to search: just output an error if the article title as given gives no results from the API. In the actual project, we will likely need to process *all* articles on Wikipedia, or at least all within a certain category - so there is no user-input title anyway.

I think the API already normalizes article titles to some extent (e.g. if you ask for the text of an article with spaces, it normalizes spaces to underscores), so we do have a little bit of freedom already in how we take input from the user for this task.

Surlycyborg updated the task description. Thu, Oct 10, 8:38 AM

@Surlycyborg But the MediaWiki API usually returns the disambiguation page or the redirect page, not the exact article, so we would need to send another request, assuming the redirect is correct or the first result of the disambiguation is correct. But if we use the search API first, we could then get the article by pageid instead.

@Ghassanmas you’re right, this is also the path I followed.

@Surlycyborg But the MediaWiki API usually returns the disambiguation page or the redirect page, not the exact article, so we would need to send another request, assuming the redirect is correct or the first result of the disambiguation is correct. But if we use the search API first, we could then get the article by pageid instead.

Huh, interesting. To be clear, are you using the query API? Using that API, I would have tried something like this to get the text using the default title normalization of the API.

That said, using search and grabbing the first result is perfectly fine too - and if it gives us a better user experience by accepting different spellings of articles, that's all the better :)
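For reference, a query-API request that relies on the API's own title normalization and also follows redirects might look like the sketch below. It uses only the Python standard library instead of python-mwapi. The function names and the User-Agent string are hypothetical, and the response parsing assumes `formatversion=2`. Disambiguation pages are not resolved here, so they would still need separate handling.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title: str) -> str:
    """Build a query-API URL asking for the current wikitext of `title`."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "redirects": 1,       # let the API follow redirects for us
        "format": "json",
        "formatversion": 2,   # pages come back as a list
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_wikitext(response: dict) -> Optional[str]:
    """Return the wikitext from a parsed response, or None if there is no such page."""
    pages = response.get("query", {}).get("pages", [])
    if not pages or "revisions" not in pages[0]:
        return None  # missing or invalid title: the caller can report an error
    return pages[0]["revisions"][0]["slots"]["main"]["content"]


def fetch_wikitext(title: str) -> Optional[str]:
    """Fetch the article text over the network (not exercised in the checks)."""
    request = urllib.request.Request(
        build_query_url(title),
        headers={"User-Agent": "citation-task-sketch/0.1 (example)"},  # hypothetical
    )
    with urllib.request.urlopen(request) as reply:
        return extract_wikitext(json.load(reply))
```

A `None` result corresponds to the "just output an error" case discussed earlier in the thread.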

Ghassanmas added a comment. Edited Mon, Oct 14, 6:53 PM

@Surlycyborg @Miriam @Samwalton9 I am confused about something and hope you can clear it up.

First, could you define "statement" and "sentence" in the scope of this project?

I see in your paper that you did the training on random sentences. But the example in the GitHub repo does the prediction on statements, which can be composed of multiple sentences.

So when parsing the article, I am not sure whether I need to split the paragraphs into sentences or use another method.

@Surlycyborg @Miriam @Samwalton9 I am confused about something and hope you can clear it up.
First, could you define "statement" and "sentence" in the scope of this project?
I see in your paper that you did the training on random sentences. But the example in the GitHub repo does the prediction on statements, which can be composed of multiple sentences.

Hi @Ghassanmas, I am also confused about the difference between a statement and a sentence in this project. Thanks for pointing it out.

As I see in the GitHub repo, at the beginning of the function text_to_word_list, it deals with multiple sentences in a statement by selecting the first sentence (or by randomly choosing one, but that line is commented out).

So in my opinion, the paragraphs need to be split into sentences, as the third point in the task description says, and then classified.

Hi there @Miriam @Surlycyborg and @Samwalton9,

Here is my repo:
https://github.com/kendallcorner/citation-needed-paper

I forked your script and added some things. It completes all the items in this task. I was also confused on how to determine what a 'statement' was, since the example input file had multiple sentences in a statement, but I just tried something. Let me know what you think.

Thanks!
Kendall

@AikoChou @Kendallcorner Thank you! I hope to have a look in the next few days (weekend is more realistic) and leave some comments as an issue on the repository :)

@Ghassanmas Yes, please split the paragraphs into sentences.
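A very rough way to split a paragraph into sentences is sketched below. It is a naive regular-expression heuristic, not the method used in the model repository: it breaks after `.`, `!` or `?` when followed by whitespace and an uppercase letter, so abbreviations such as "Dr. Smith" are split incorrectly. The function name `split_sentences` is hypothetical, and a dedicated sentence tokenizer would likely do better on real articles.

```python
import re
from typing import List

# Split point: whitespace that follows sentence-ending punctuation and
# precedes an uppercase letter. Lookarounds keep the punctuation in place.
SENTENCE_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(paragraph: str) -> List[str]:
    """Split plain text (markup already stripped) into sentences, naively."""
    parts = SENTENCE_BOUNDARY_RE.split(paragraph)
    return [part.strip() for part in parts if part.strip()]
```

Each resulting sentence can then be paired with its section title before being passed to the model.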

Hi @Miriam @Surlycyborg
I was just wondering, as Outreachy participants, is there any kind of correspondence we need to give to Outreachy itself? Because technically we are supposed to ‘record’ the contributions in our Outreachy accounts.

@Achillesheel02: That question does not sound directly related to the topic of this task - can you please bring it up in an Outreachy channel (e.g. Zulip)? Thanks for your understanding.