Together with @Samwalton9 and @Surlycyborg, supervise an internship from the Outreachy program with the task of releasing data dumps from a classifier that detects unsourced sentences in Wikipedia. Task tracking: T233707
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Miriam | T228442 Design and implement an API for "citation needed" tag recommendation |
| Resolved | | Miriam | T242601 Supervise Outreachy Internship on Releasing data dumps for Citation Needed Classifiers |
Event Timeline
Weekly update: progress as expected. I had to double-check the output of the citation needed models because of some inconsistencies, which were resolved by inspecting some of the data. We need a text pre-processing pipeline that closely matches the one we used for training; otherwise the model might give unexpected output.
Weekly update: progress as expected. Aiko has worked on refining the input pipeline and on setting up a script to ingest the output of the model into Citation Hunt. This is useful both because Citation Hunt (CH) is one of the primary stakeholders for the project and because it lets us eyeball the surfaced "citation needed" sentences.
Weekly update:
- Aiko has almost finalized the integration with Citation Hunt. A prototype of the final ML-enhanced Citation Hunt tool can be seen here: https://tools.wmflabs.org/aiko-citationhunt
- We have finalized the next steps for the upcoming 3 weeks. Aiko will work on refining the pipeline to get weekly/monthly dumps of the MySQL database exposing sentences needing citations. I reached out to Cloud Services to see what the best way to expose this dataset to other tools would be.
- We talked about submitting a non-archival paper to Wiki Workshop, containing the idea and rationale behind the dataset and some basic analysis. We will follow up next week based on Aiko's and Miriam's capacity.
Weekly update:
- Aiko has worked on making the data pipeline cleaner and faster, and on parallelizing it with multiprocessing
- Working on understanding which category of articles we should focus on (we can't compute and store citation needed scores for all articles)
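As a rough illustration of the multiprocessing idea mentioned above, here is a minimal sketch that scores the sentences of several articles in parallel worker processes. It is not the project's actual pipeline: `score_sentence`, `score_article`, `score_articles`, and the placeholder scoring heuristic are all hypothetical names and logic, introduced only so the example runs.

```python
from multiprocessing import Pool


def score_sentence(sentence: str) -> float:
    """Hypothetical stand-in for the citation needed model.

    The real model is a trained classifier; this placeholder only maps
    sentence length to a value in [0, 1] so that the sketch is runnable.
    """
    return min(1.0, len(sentence.split()) / 20.0)


def score_article(article):
    """Score every sentence of one (title, sentences) pair."""
    title, sentences = article
    return title, [(s, score_sentence(s)) for s in sentences]


def score_articles(articles, processes=2):
    """Distribute articles across worker processes, one article per task."""
    with Pool(processes=processes) as pool:
        return pool.map(score_article, articles)


if __name__ == "__main__":
    # Made-up sample input, only for demonstration.
    sample = [
        ("Article A", ["First sentence here.", "Second one."]),
        ("Article B", ["Another sentence that is somewhat longer."]),
    ]
    for title, scored in score_articles(sample):
        print(title, scored)
```

Splitting the work per article keeps each task independent, so the number of worker processes can be adjusted without changing the scoring code.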
Weekly update:
- Aiko has made the database publicly available on Toolforge:
  - The MySQL database can be accessed under the name citationdetective_p
  - It contains a snapshot of sentences needing citations for 2% of English Wikipedia articles (due to technical limitations)
  - It is now being integrated into Citation Hunt
- Given the relevance of this work for the research community, Aiko, the other mentors, and I worked on a paper for Wiki Workshop and submitted it for the second round of submissions. The paper contains a summary of the dataset and an analysis of citation quality at scale for English Wikipedia.
We had our last meeting with Aiko today. The dataset will now be updated weekly on the Toolforge account. She will finalize the documentation on Meta and then work on advertising the dataset. Part of this is the Wiki Workshop paper, and she will also advertise the dataset and tool on various mailing lists. Her blog also reports on the final steps of the project wrap-up: https://rollingmist.home.blog/
It was a great experience from which we all learnt a lot. @AikoChou you are great!