The project consists of researching, gathering, and processing Wikipedia-related data about the reliability of article content. It focuses on detecting the crowd-generated tags or labels that Wikipedia editors and developers currently use to signal content integrity problems to other editors. Many Wikipedia templates and tools are used today to label potentially bad content, but they are usually not machine-friendly. In this project, we will characterize these labels, select the most relevant ones, and create machine-readable datasets that allow ML systems to detect potentially problematic content automatically. During the project, we will also test those datasets by running different ML algorithms, which will serve as baselines for future researchers.
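To make the idea of a machine-readable dataset concrete, here is a minimal sketch of how maintenance templates in an article's wikitext could be turned into labels. It is an illustration only, not the project's method: the template list `RELIABILITY_TEMPLATES`, the helper names `extract_labels` and `to_record`, and the record layout are all assumptions, and the regular expression is a simplification that a real pipeline would likely replace with a proper wikitext parser.

```python
import re

# Hypothetical, illustrative subset of maintenance templates; selecting the
# relevant ones is part of the project itself.
RELIABILITY_TEMPLATES = {"citation needed", "unreliable sources", "pov", "disputed"}

# Captures the template name at the start of a {{...}} call, up to "|" or "}}".
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def extract_labels(wikitext):
    """Return the sorted reliability templates found in a wikitext string."""
    found = set()
    for match in TEMPLATE_NAME.finditer(wikitext):
        # Normalize case and underscores so "POV" and "Citation_needed" match.
        name = match.group(1).strip().lower().replace("_", " ")
        if name in RELIABILITY_TEMPLATES:
            found.add(name)
    return sorted(found)


def to_record(page_id, wikitext):
    """Build one dataset row (assumed layout) for a page."""
    labels = extract_labels(wikitext)
    return {"page_id": page_id, "labels": labels, "flagged": bool(labels)}


if __name__ == "__main__":
    sample = "{{POV|date=May 2020}} Some claim.{{Citation needed|date=June 2020}}"
    print(to_record(1, sample))
```

Rows of this kind could be written out as CSV or JSON lines and used as labels when training baseline classifiers.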
- Python, SQL
- Basic data analysis skills
- Plus: data visualization skills
How to Apply?
Please follow the instructions in task T263874, which also includes detailed submission instructions.