
Outreachy Proposal 21 : Create Machine Learning datasets to measure content reliability on Wikipedia.
Closed, Resolved (Public)

Description

Name : Vanilla Thulisile Sibanda
GitHub : https://github.com/thulieblack/
Email : sibanda.thulie@gmail.com
Location : Cape Town, South Africa
Time Zone : (UTC +02:00) Central African Time
Typical working hours : 9pm - 5am Central African Time

Summary
The project consists of researching, gathering, and processing Wikipedia data about the reliability of article content. It focuses on detecting the crowd-generated tags and labels that Wikipedia editors and developers currently use to signal content-integrity problems to other editors. Wikipedia templates and tools are already used to label potentially bad content, but they are usually not machine friendly. This project aims to group these templates, select the most relevant ones, and create machine-readable datasets that could allow ML systems to detect problematic content automatically. We will also test those datasets by running different ML algorithms, which will serve as baselines for future researchers. The project aims to:

  • Explore the space of templates related to content integrity using semi-automated methods
  • Download and process articles and sections with the found templates
  • Analyze the data to summarize the main statistics
  • Potentially produce visualizations of the statistics and data properties
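
As a rough illustration of the first three steps, the sketch below scans raw wikitext for a handful of maintenance templates and counts them. It is a minimal example under stated assumptions, not the project's method: the template set is an illustrative subset chosen here (the proposal does not say which templates will be selected), the function name find_reliability_templates is hypothetical, and the regular expression ignores nested templates.

```python
import re
from collections import Counter

# Illustrative subset only (assumption): the proposal does not name
# which templates the project will actually select.
RELIABILITY_TEMPLATES = {"citation needed", "pov", "unreferenced", "disputed"}

# Captures the template name that follows "{{", up to the first "|" or "}}".
# Nested templates are not handled in this sketch.
TEMPLATE_NAME = re.compile(r"\{\{\s*([^{}|]+?)\s*(?:\||\}\})")


def find_reliability_templates(wikitext):
    """Count occurrences of the listed reliability templates in raw wikitext."""
    names = (m.group(1).strip().lower() for m in TEMPLATE_NAME.finditer(wikitext))
    return Counter(name for name in names if name in RELIABILITY_TEMPLATES)


# Hypothetical article text, written for this example.
sample = (
    "{{POV|date=May 2020}}\n"
    "The city was founded in 1652.{{Citation needed|date=June 2020}} "
    "It is the largest port.{{citation needed}} {{Infobox settlement|name=X}}"
)
counts = find_reliability_templates(sample)
print(counts)
```

Counts like these, collected per article or per section, are the kind of summary statistics the analysis step would report. A wikitext parser would be a more robust choice than a regular expression for real articles.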

Mentor

Isaac Johnson @Isaac

Project Timeline

Weeks | Outcomes
November 5th to November 30th | During this period I will dedicate myself to learning and contributing to tasks on MediaWiki while improving my skills in data analysis/science and learning machine learning.
December 1st to December 28th | Week 1 to Week 4: Intensively explore the space of templates related to content integrity, using machine learning algorithms and training data in order to make predictions.
December 29th to January 25th | Week 5 to Week 8: Download and process articles and sections with the found templates. Recreate files and make amendments and changes.
January 26th to February 8th | Week 9 to Week 11: Analyze the data and summarize the main statistics using Python libraries, and create statistical data reports.
February 9th to February 16th | Produce visualizations of the analyzed data statistics and implement machine learning systems.
February 18th to February 23rd | Implement feedback and changes from the reviews.
February 24th to March 1st | Finalize, review, organize, and document necessary changes.

Participation

I will continue to communicate with Isaac Johnson via the public chat.
I will ask for help on the project's designated communication channel.

About Me

I started my tech journey in March 2020 after I lost my job in the hospitality industry. Studying tech has been a dream come true for me, as it has always been my passion. I took advantage of the lockdown and did a diploma course in Python programming at Alison. I then developed my skills in data analysis further and did some micro projects at freeCodeCamp; here are some of the repos of the projects that I did. I participated in a virtual one-month internship at Hash Analytic, where I learned to create visualizations, apply machine learning models, and give presentations. The training was life-changing for me, and I learned a lot from it that I believe will be beneficial to this project.

How did you hear about this program?

After completing my internship, I came across a post on Twitter.
Will you have other time commitments during the program?

I don't have any commitments that will interfere with the program.

What does this project mean to you?

It will be an honor and a privilege to participate in this project, as it will help me enhance my skills and experience while working with expert researchers in the area of machine learning. It will also be a privilege to get this opportunity because Wikimedia provides free educational content through its projects and a support structure for continued skills development.
This would also be a major milestone for me, as it is my first time contributing to open source.

Contributions

I joined the Wikimedia community during this Outreachy round. I have been active on the public chat, collaborated with other Outreachy applicants, and communicated with my mentor.

Contributions to MediaWiki

Event Timeline

Thulieblack added a subscriber: Isaac.

Perfect day @Isaac, please kindly review.

Isaac claimed this task.

Hey @Thulieblack -- thanks for putting this together. In the past, the guidance had been to create a Phabricator task for feedback / application, but we're now asking that you fill out your application via the Outreachy portal (for more details, see step #11 in particular: https://www.mediawiki.org/wiki/Outreachy/Participants#Application_process_steps). As I already provided feedback on your initial notebook, I likely won't be able to give you any further feedback while I prioritize applicants who haven't submitted their notebooks yet for feedback. I'm going to resolve the task, but don't hesitate to let me know if you have any further questions.