
Outreachy Proposal: A system for releasing data dumps from a classifier detecting unsourced sentences in Wikipedia
Closed, Resolved · Public

Description

Name: Aiko Chou
GitHub: https://github.com/AikoChou/
IRC nickname: aikoChou
Location: Hsinchu, Taiwan (UTC+8)

Synopsis

The "Citation Needed" model takes as input a sentence from a Wikipedia article and its section title, and outputs a "citation needed" score reflecting whether the sentence should have a citation. The task of this internship is to scale this process up. The candidate will design an end-to-end system that automatically parses the contents of Wikipedia articles to extract unsourced sentences and their section titles, classifies these sentences with the machine learning model to detect which of them are actually missing citations, and releases the output in the form of periodic data dumps.
After creating a system capable of generating periodic data dumps, further work could include writing documentation, publicizing the data, and supporting integration into tools like Citation Hunt.
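
The parse, extract, and classify steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: the regex-based sentence splitting is deliberately naive, the names `iter_sections`, `extract_unsourced`, and `score_sentences` are hypothetical, `MAIN_SECTION` is an assumed label for the lead section, and `score_fn` is a stand-in for the real "Citation Needed" model.

```python
import re
from typing import Callable, Dict, Iterator, List, Tuple

REF_MARK = "\ue000"  # private-use character standing in for a stripped <ref>
HEADING_RE = re.compile(r"^==+[ \t]*(.*?)[ \t]*==+[ \t]*$", re.MULTILINE)
# One or more adjacent <ref .../> or <ref ...>...</ref> tags.
REF_RE = re.compile(r"(?:<ref\b[^>]*/>|<ref\b[^>]*>.*?</ref>)+",
                    re.DOTALL | re.IGNORECASE)
# Split after sentence punctuation, optionally followed by a reference marker.
SENTENCE_SPLIT_RE = re.compile(
    r"(?<=[.!?])\s+|(?<=[.!?]" + REF_MARK + r")\s+"
)


def iter_sections(wikitext: str) -> Iterator[Tuple[str, str]]:
    """Yield (section_title, section_body) pairs from raw wikitext."""
    title, pos = "MAIN_SECTION", 0  # assumed label for the lead section
    for match in HEADING_RE.finditer(wikitext):
        yield title, wikitext[pos:match.start()]
        title, pos = match.group(1), match.end()
    yield title, wikitext[pos:]


def extract_unsourced(wikitext: str) -> List[Tuple[str, str]]:
    """Return (section_title, sentence) for sentences carrying no <ref>."""
    results = []
    for title, body in iter_sections(wikitext):
        marked = REF_RE.sub(REF_MARK, body).strip()
        for sentence in SENTENCE_SPLIT_RE.split(marked):
            if sentence and REF_MARK not in sentence:
                results.append((title, sentence))
    return results


def score_sentences(
    wikitext: str, score_fn: Callable[[str, str], float]
) -> List[Dict[str, object]]:
    """Score every unsourced sentence with score_fn(sentence, section)."""
    return [
        {"section": section, "sentence": sentence,
         "score": score_fn(sentence, section)}
        for section, sentence in extract_unsourced(wikitext)
    ]


if __name__ == "__main__":
    sample = (
        "Lead sentence one.<ref>A</ref> Lead sentence two.\n"
        "== History ==\n"
        'It was founded in 1900. It grew fast.<ref name="b"/>\n'
    )
    # Toy scorer: NOT the real model, just a placeholder returning a constant.
    for row in score_sentences(sample, lambda sentence, section: 0.5):
        print(row)
```

A production version would presumably need a proper wikitext parser and sentence tokenizer, since templates, abbreviations, and references placed mid-sentence all defeat the simple regexes used here.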

Project objective:

  • Develop a system to periodically (e.g. twice a month) release dumps in Wikimedia spaces about which sentences need citations on Wikipedia

Additional work if time permits:

  • Integrate into an existing tool like Citation Hunt to let it additionally use these dumps
  • Create a bot to annotate sentences on Wikipedia with a citation needed template

Mentors: @Miriam @Samwalton9 @Surlycyborg

Project task: https://phabricator.wikimedia.org/T233707

Timeline

Week 1-2 (Dec 3rd - 16th)

  • Design research: investigate user needs for future uses such as supporting existing Wikipedia tools or creating a bot.
  • Discuss with mentors and other editors.
  • Propose a system specification that includes:
    1. Input dumps of the system (category.sql, page.sql or others)
    2. Output format and schema of the system (SQL or XML; table schema)
    3. Environment: run on a local computer or on a remote server
    4. System flowchart and algorithm: break it down into several functions or tasks and also think about test methods for the unit tests.
  • Refine the proposal based on feedback from mentors and other editors.
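
As an illustration of what the "output format and schema" item could look like if the SQL option were chosen, here is one possible shape. Everything in it is an assumption made for the sketch: the table name `sentences`, its columns, the helper names `write_dump` and `top_candidates`, and the use of SQLite (picked only so the example is self-contained). The actual schema is what the specification is meant to decide.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical table layout: one row per unsourced sentence and its score.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sentences (
    id        INTEGER PRIMARY KEY,
    page_id   INTEGER NOT NULL,
    rev_id    INTEGER NOT NULL,
    section   TEXT    NOT NULL,
    sentence  TEXT    NOT NULL,
    score     REAL    NOT NULL CHECK (score BETWEEN 0.0 AND 1.0)
)
"""

Row = Tuple[int, int, str, str, float]  # page_id, rev_id, section, sentence, score


def write_dump(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Create the table if needed and insert all rows in one transaction."""
    conn.execute(SCHEMA)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO sentences (page_id, rev_id, section, sentence, score) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )


def top_candidates(
    conn: sqlite3.Connection, threshold: float
) -> List[Tuple[str, str, float]]:
    """Return (section, sentence, score) rows at or above the threshold."""
    cursor = conn.execute(
        "SELECT section, sentence, score FROM sentences "
        "WHERE score >= ? ORDER BY score DESC",
        (threshold,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    write_dump(connection, [
        (1, 10, "History", "It was founded in 1900.", 0.9),
        (1, 10, "MAIN_SECTION", "Lead sentence two.", 0.2),
    ])
    print(top_candidates(connection, 0.5))
```

Storing the revision ID next to each sentence would let a consumer such as Citation Hunt check whether the article has changed since the dump was generated.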

Week 3-5 (Dec 17th - Jan 6th)

  • Set up the environment
  • Implement the system
    1. Open a GitHub repository
    2. Produce code according to the specification
    3. Discuss with mentors when completing a function or task or facing issues during coding

Week 6-7 (Jan 7th - 20th)

  • Test the system comprehensively, checking issues such as:
    1. Edge cases
    2. System efficiency
    3. Server resource usage
  • Set the system up as a cron job so that it runs regularly

Week 8 (Jan 21st - 27th)

  • Write thorough documentation
  • Tweak the system if necessary
  • Publish the dumps

Week 9-10 (Jan 28th - Feb 10th)

  • Become familiar with the Citation Hunt code base
  • Set up a local Citation Hunt and run the test script
  • Discuss a proposal to support integration into Citation Hunt with mentors
    • How to let Citation Hunt handle the dumps generated by the system
    • Come up with a flowchart and algorithm: break it down into several functions or tasks and also think about test methods for the unit tests.
  • Refine the proposal based on mentors' feedback

Week 11-12 (Feb 11th - 24th)

  • Fork the Citation Hunt repository
  • Produce code according to the proposal
  • Discuss with mentors when completing a function or task or facing issues during coding
  • Test the code locally

Week 13 (Feb 25th - Mar 2nd)

  • Run comprehensive tests
  • Write thorough documentation

Week 14 (Mar 3rd - 9th)

  • Deployment, support, and maintenance

Event Timeline

@AikoChou Hi! As we are almost half-way through the internship period, I would encourage you to update your progress in a comment on both the proposal and the project task. Thank you :)

We are in week 8 now but have moved to week 9 work. Just swapped Citation Hunt work with testing/regular job work. Swapped 9-12 with 6-8. :)

@AikoChou Thanks! If you could also summarize your work done between weeks 1-8, that'd be great. Also, consider leaving the same update on T233707.

Week 1-8 Summary

For Citation Detective

  • A design specification for the project
  • First script & PR merged to the GitHub repository
  • Ran Citation Detective on ~9k articles
  • Modified input pipelines and the format written to the database

For Citation Hunt

  • Set up a local Citation Hunt and studied the workflows
  • Worked on scripts to ingest data into Citation Hunt
  • Added simple UI to highlight sentences detected lacking citations

Week 9-10

  • Refined code in aiko-citationhunt
  • Added unit tests to aiko-citationhunt
  • Ran Citation Detective on ~60k articles

Week 11-12

  • Resolved data quality issues (broken sentences, etc.)
  • Worked on multiprocess version of Citation Detective
  • Implemented article filtering/ran on 100k articles

In the last week of the internship, I've been working on:

  • Creating a crontab to update the database periodically
  • Writing an announcement email
  • Documentation: Meta page, README on GitHub
  • Updating aiko-citationhunt to use the final, public citationdetective_p

Completed the wrap-up steps: