Task 1: Article Categorization
- For each article in each language, gather the article's categories, topics, and creation timestamp
- Based on distribution across topics and years, decide on sampling thresholds (with @Miriam and @diego)
- Sample articles equally by topic and by creation date (before/after 2024). This ensures that the dataset contains a share of articles that the models were not trained on.
- Using the morelike API, retrieve the 10 most similar articles for each article in the sample. Extract their categories and save them in a separate table with the relevant columns.
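The sampling and retrieval steps above could be sketched roughly as follows. This is only an illustration, not the agreed method: the record fields (`topic`, `created_year`), the helper names, the per-stratum quota, and the choice to count 2024 itself as "after" are all assumptions made for the example. The similar-article request assumes the `morelike:` keyword of the standard MediaWiki search API (`list=search`); the actual thresholds are still to be decided as noted above.

```python
import random
from collections import defaultdict
from urllib.parse import urlencode

# Assumption: articles created in 2024 or later fall in the "after" stratum.
CUTOFF_YEAR = 2024


def stratified_sample(articles, per_stratum, seed=0):
    """Draw up to `per_stratum` articles from each (topic, period) stratum.

    `articles` is a list of dicts with hypothetical keys
    "topic" and "created_year".
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for article in articles:
        period = "before" if article["created_year"] < CUTOFF_YEAR else "after"
        strata[(article["topic"], period)].append(article)

    sample = []
    for key in sorted(strata):  # sorted for a deterministic order
        group = strata[key]
        # Small strata contribute everything they have.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


def morelike_params(title, limit=10):
    """Build search-API parameters asking for pages similar to `title`."""
    return {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": f"morelike:{title}",
        "srlimit": limit,
    }


def morelike_url(title, lang="en", limit=10):
    """Full request URL for one language edition (hypothetical helper)."""
    base = f"https://{lang}.wikipedia.org/w/api.php"
    return base + "?" + urlencode(morelike_params(title, limit))
```

Sampling per (topic, period) stratum rather than over the whole pool keeps rare topics and recent articles from being crowded out. Categories for the returned titles would still need a separate lookup before they are written to the table.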
Tasks 2-3: Peacock Behavior and NPOV Detection
- Follow the code in the existing WikiReliability notebook
@MunizaA and @fkaelin can advise here; please give them a heads-up before starting this phase.
- Sample and release data for NPOV detection
- Sample and release data for Peacock detection
