
Collect Evaluation Data for SDS 1.2.1 B
Closed, Resolved (Public)

Description

Task 1 : Article Categorization

  • For each article/language, gather article categories, topics, and timestamp of creation
  • Based on distribution across topics and years, decide on sampling thresholds (with @Miriam and @diego)
  • Sample articles equally by topic and by creation date (before/after 2024), to ensure that the dataset contains a share of articles that the models were not trained on.
  • Using the morelike API, retrieve the 10 most similar articles to each article in the sample. Extract categories and save in a separate table with relevant columns.
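A minimal sketch of the similar-article lookup, assuming the "morelike API" refers to the CirrusSearch `morelike:` keyword used through the MediaWiki Action API (`action=query&list=search`). The helper names `morelike_url` and `similar_pages` are hypothetical, and the network request itself is left out so the snippet stays self-contained.

```python
from urllib.parse import urlencode


def morelike_url(lang: str, title: str, limit: int = 10) -> str:
    """Build a search URL asking CirrusSearch for pages similar to `title`."""
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": f"morelike:{title}",
        "srnamespace": 0,  # main (article) namespace only
        "srlimit": limit,
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def similar_pages(response: dict) -> list[tuple[int, str]]:
    """Extract (page_id, title) pairs from a decoded JSON search response."""
    hits = response.get("query", {}).get("search", [])
    return [(hit["pageid"], hit["title"]) for hit in hits]
```

The URL can be fetched with any HTTP client; the decoded JSON is then passed to `similar_pages`, and the categories of the returned pages can be looked up in a second step.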

Task 2-3: Peacock Behavior and NPOV Detection

  • Follow code in existing notebook from WikiReliability

@MunizaA and @fkaelin can provide consulting here - please give them a heads up ahead of starting this phase.

  • Sample and release data for NPOV detection
  • Sample and release data for Peacock detection

Event Timeline

Status Update:

  • I collected topics, categories, and section headings for all articles that are assigned a topical category (with a score > 0.5) in the 23 languages of the AYA23 model family.
  • I checked the distribution by topic in each language, available here
    • Note: one article could be counted for multiple topic categories
  • Based on the decided thresholds from the above distributions, I sampled seed articles in each language
    • I decided to sample 50 articles per topical category before 2024 and max(25, number of created articles) per topic in 2024, to oversample the most recent data.
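The per-topic, per-period sampling described above could be sketched as follows. This is an illustration under assumptions, not the code used for the task: the record keys (`page_id`, `topics`, `created_year`) and the function name are hypothetical, and the 2024 quota is simplified to a fixed value of 25 rather than reproducing the max(25, ...) rule from the update.

```python
import random
from collections import defaultdict


def sample_seed_articles(articles, quota_before=50, quota_recent=25,
                         cutoff_year=2024, seed=0):
    """Sample up to a fixed number of articles per (topic, period) group.

    `articles` is an iterable of dicts with hypothetical keys
    "page_id", "topics" (list of str) and "created_year" (int).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for art in articles:
        period = "recent" if art["created_year"] >= cutoff_year else "before"
        # An article with several topics is a candidate in each of them.
        for topic in art["topics"]:
            groups[(topic, period)].append(art["page_id"])

    sample = {}
    for (topic, period), page_ids in groups.items():
        quota = quota_recent if period == "recent" else quota_before
        k = min(quota, len(page_ids))  # cannot draw more than exist
        sample[(topic, period)] = sorted(rng.sample(page_ids, k))
    return sample
```

Because one article can belong to several topics, the same page may be drawn in more than one group; deduplicating across groups would be a separate step.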

Next steps:

  • Using morelike API, retrieve 10 similar pages for each seed article
  • In a separate table store categories for similar articles

Updates:

  • I retrieved 10 similar pages for each seed article sampled previously. The seed article dataset contains two additional fields that link to the similar pages, sim_page_ids and sim_page_titles. The similar pages and their metadata are available in a separate file.
    • Find current versions of seed articles at ai_use_cases/categories/sample_articles/seed_articles_w_similar10_v1 and similar10 articles at ai_use_cases/categories/sample_articles/similar10_metadata_v1
  • After Mykola’s initial analysis, I updated the above two datasets following his suggestions. I additionally collected revision_text for the main section and converted it to plaintext using mwedittypes (https://github.com/geohci/edit-types/blob/main/mwedittypes/utils.py#L77C5-L77C26)
  • I started looking into the previous code and publications for Task 2, NPOV detection
  • We discussed the pipeline for extracting negative and positive samples and decided to collect them from the full revision history of all articles.

Next Steps:

  • I will collect the data for Task 2
  • Based on the collected data, I will check statistics across topics, time (recency), and languages to decide on further sampling methods.

Related Code:

Updates:

  • Retrieved the templates that link to the NPOV policy using the langlinks API and page redirects
    • hewiki and plwiki do not have a dedicated page for the POV template
  • Collected all historical revisions that contain the above-mentioned templates across 23 languages and supplemented them with additional metadata
  • Extracted positive/negative pairs from each page following the previous approach
    • 5 languages (hewiki, hiwiki, idwiki, rowiki, elwiki) have fewer than 1K pairs and will be discarded from the final dataset, together with plwiki, which does not have a dedicated POV template page
    • Stratification by topic for sampling will be applied only to enwiki due to the sparse distribution by topic for non-English languages
  • Checked stats, distribution plots available at the bottom of this notebook
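As a rough illustration of the pair extraction and the 1K-pair filter above: the sketch assumes that the "previous approach" treats the revision where the template is added as the positive sample and the revision where it is later removed as the negative sample. That reading, the input layout, and both function names are assumptions, not the actual pipeline.

```python
def extract_pairs(revisions):
    """Return (positive_rev_id, negative_rev_id) pairs for one page.

    `revisions` is a chronologically ordered list of
    (rev_id, has_template) tuples.
    """
    pairs = []
    added_at = None   # revision that introduced the template, if still open
    prev_has = False
    for rev_id, has_template in revisions:
        if has_template and not prev_has:
            added_at = rev_id                  # template added: positive sample
        elif not has_template and prev_has and added_at is not None:
            pairs.append((added_at, rev_id))   # template removed: negative sample
            added_at = None
        prev_has = has_template
    return pairs


def wikis_to_keep(pair_counts, min_pairs=1000):
    """Keep only wikis with at least `min_pairs` extracted pairs."""
    return sorted(w for w, n in pair_counts.items() if n >= min_pairs)
```

A template that is added but never removed yields no pair in this sketch, since there is no matching negative revision.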

Next Steps:

  • Prepare the dataset for NPOV detection based on the filtering explained above
  • Start collection for Peacock Behavior task

Related Code:

Updates:

  • Sampled data for Task 2, find here
    • 16 wikis with at least 1K pairs + enwiki stratified by topic
  • Sampled data for Task 3, find here
    • 10 wikis with at least 1K pairs + enwiki stratified by topic
  • We are considering eventually publishing the final data under /published/datasets (https://analytics.wikimedia.org/published/datasets/)
  • Check stats for the peacock behavior detection dataset in the "Plot distributions of pos/neg samples" section of this notebook

Next Steps:

  • collect training data for baseline experiments
  • start baseline experiments for Tasks 2 & 3

Data can be found on the cluster at:

  • Eval data:

Task 2 at aitolkyn/ai_use_cases/npov/data_final/eval_npov_data.parquet
Task 3 at aitolkyn/ai_use_cases/peacock/data_final/eval_peacock_data.parquet

  • Training data:

Task 1 at aitolkyn/ai_use_cases/categories/sample_articles/train_data (fyi @diego )
Task 2 at aitolkyn/ai_use_cases/npov/data_final/train_npov_data.parquet
Task 3 at aitolkyn/ai_use_cases/peacock/data_final/train_peacock_data.parquet

Updates:

  • Data collection and processing are complete, pending any further requests
  • Started baseline experiments for Task 3 (peacock behavior detection)
    • For peacock behavior detection, 7 languages were used in training and 10 languages in testing
    • Ran experiments using XLM-R Longformer
    • An issue with training at a max length of 4096 has been resolved. Training on 46,523 train + 8,210 validation samples across 7 languages for 3 epochs with a batch size of 1 took around 13 hours.
    • Input to the model is lang + page_title + parsed content
    • The evaluation results for a sample of 50 pages from each wiki_db are as follows; detailed experiments can be checked in this notebook:
      Accuracy: 0.59
      ROC-AUC: 0.63
      PR-AUC: 0.62
      Precision: 0.57
      F1-score: 0.63

image.png (674×852 px, 83 KB)

  • Challenges: we are limited by the model's maximum token length, so the text of longer articles is truncated. 90% of the samples in the peacock dataset and 75% of the samples in the NPOV dataset fit within this limit.
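A small sketch of how the model input and the truncation check could look. The field separator and both helper names are assumptions (the update only states that the input is lang + page_title + parsed content), and the token counts are expected to come from whichever tokenizer the model uses.

```python
MAX_TOKENS = 4096  # max length mentioned in the update above


def build_input(lang, page_title, content, sep=" "):
    """Concatenate lang + page_title + parsed content into one string.

    The separator is an assumption; the update does not say how the
    fields are joined.
    """
    return sep.join([lang, page_title, content])


def share_within_limit(token_counts, max_tokens=MAX_TOKENS):
    """Fraction of samples whose token count fits without truncation."""
    if not token_counts:
        return 0.0
    return sum(n <= max_tokens for n in token_counts) / len(token_counts)
```

Running `share_within_limit` on the per-sample token counts of each dataset would give the kind of coverage figure quoted above.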

Next Steps:

  • run baseline experiments for Task 2

Miriam renamed this task from Collect Evaluation Data to Collect Evaluation Data for SDS 1.2.1 B. Nov 20 2024, 1:48 PM

@Aitolkyn I am inclined to close this task and create a new one for baselines. Would it be ok with you?

Sure! I will post updates in the new Phabricator ticket, T380569.