User Details
- User Since
- Apr 10 2021, 8:11 AM (244 w, 3 d)
- LDAP User
- Aitolkyn
Dec 30 2024
Note: My contract is finishing tomorrow. Thank you to everyone involved in this project! Below, I will report updates for the past week.
Dec 20 2024
- SHAP explanations for peacock behavior can be found at the end of this notebook. Additionally, I added bar plots of the top 20 words in the explanation for each class: mostly adjectives for the "positive" class and mainly proper nouns for the "negative" class.
Dec 13 2024
Updates:
- Exploring the explainability of small language models with SHAP values
- Writing a final report on data preparation and baseline experiments
- Added our reference need work as an additional use case. Reran full-scale evaluation experiments on reference need data in the top 10 languages by active user count.
- Examples of SHAP values for peacock behavior, in which words like "greatest" or "stunning beauty" appear to affect the model's positive label the most.
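The idea behind these word-level attributions can be illustrated with exact Shapley values on a toy additive scorer. This is a stdlib-only sketch; the words and weights are hypothetical and not taken from the actual model, which was explained with the shap library.

```python
from itertools import combinations
from math import factorial

# Toy "model": scores a set of present words; "greatest" and "stunning"
# push toward the positive (peacock) label. Weights are made up.
WEIGHTS = {"greatest": 2.0, "stunning": 1.5, "the": 0.1}

def model(words):
    return sum(WEIGHTS.get(w, 0.0) for w in words)

def shapley_values(words):
    """Exact Shapley values: each word's marginal contribution to the score,
    averaged over all subsets of the remaining words."""
    n = len(words)
    values = {}
    for w in words:
        rest = [x for x in words if x != w]
        total = 0.0
        for k in range(n):
            for subset in combinations(rest, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (model(subset + (w,)) - model(subset))
        values[w] = total
    return values

vals = shapley_values(["the", "greatest", "stunning"])
```

For a purely additive model the Shapley value of each word equals its weight, and the values sum to the model output on the full input (the efficiency property), which is why bar plots of top SHAP words are a faithful summary for such models.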
Dec 6 2024
Updates:
- Additionally fine-tuned XLM-RoBERTa with a max length of 512 for NPOV and Peacock. XLM-R performs slightly better than mBERT with the same max length.
- Updated mBERT trained on enwiki - Peacock
- The updates are reflected in the table above and in the notebooks at: for npov, for peacock
Nov 29 2024
Updates
- Baselines for tasks 2 and 3 were updated, and evaluation on the full data was reported
- All plots & results are presented here: for npov, for peacock
Nov 22 2024
Updates:
- Trained multiple models on Task 2 and Task 3
- I used multilingual BERT and XLM-R Longformer in my experiments
- I ran testing on the full evaluation dataset and reported the results in detail in the slides here
- We also tested the performance by topic in enwiki
- Findings
- Passing the page title along with the content improves accuracy
- Increasing the context size to 4K tokens with XLM-R Longformer does not improve performance significantly
- mBERT trained on all wikis performs better on other languages than mBERT trained only on English samples
- NPOV detection shows lower accuracy
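The by-topic evaluation in enwiki boils down to grouping predictions by topic and computing per-group accuracy. A minimal sketch (the record fields and values are illustrative, not the actual evaluation code):

```python
from collections import defaultdict

def accuracy_by_topic(records):
    """records: iterable of (topic, y_true, y_pred) tuples.
    Returns a dict mapping topic -> accuracy within that topic."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for topic, y_true, y_pred in records:
        totals[topic] += 1
        hits[topic] += int(y_true == y_pred)
    return {t: hits[t] / totals[t] for t in totals}

preds = [
    ("biography", 1, 1), ("biography", 0, 1),
    ("geography", 1, 1), ("geography", 0, 0),
]
acc = accuracy_by_topic(preds)
```

Slicing accuracy this way makes topic-dependent weaknesses visible that a single aggregate score hides.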
Nov 15 2024
Data can be found on the cluster at:
- Eval data:
Task 2 at aitolkyn/ai_use_cases/npov/data_final/eval_npov_data.parquet
Task 3 at aitolkyn/ai_use_cases/peacock/data_final/eval_peacock_data.parquet
Nov 8 2024
Related Code:
- Task 3 - Peacock Behavior Detection at https://gitlab.wikimedia.org/repos/research/llm_evaluation/-/tree/ait/eval-datasets/notebooks/peacock
Nov 1 2024
Related Code:
- Task 1 - Article Categorization at https://gitlab.wikimedia.org/repos/research/llm_evaluation/-/tree/ait/eval-datasets/notebooks/categories?ref_type=heads
- Task 2 - NPOV Detection at https://gitlab.wikimedia.org/repos/research/llm_evaluation/-/tree/ait/eval-datasets/notebooks/%5Bnew%5Dnpov?ref_type=heads
Updates:
- Got templates that link to the NPOV policy using the langlinks API and page redirects
- hewiki and plwiki do not have a dedicated page for POV template
- Collected all historical revisions that contain the above-mentioned templates across 23 languages, and additionally supplemented them with metadata
- Extracted positive/negative pairs from each page following the previous approach
- 5 languages (hewiki, hiwiki, idwiki, rowiki, elwiki) have fewer than 1K pairs --> they will be discarded from the final dataset (+plwiki, which doesn't have a dedicated POV template page)
- Stratification by topic for sampling will be applied only to enwiki due to the sparse distribution by topic for non-English languages
- Checked stats, distribution plots available at the bottom of this notebook
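The enwiki-only stratification described above can be sketched as a per-topic sampler. This is an illustration with made-up topics and a hypothetical per-topic quota, not the actual pipeline code:

```python
import random

def stratified_sample(pairs, per_topic, seed=0):
    """pairs: iterable of (topic, pair_id). Draw up to `per_topic` pairs
    per topic; topics with fewer pairs keep everything they have."""
    rng = random.Random(seed)
    by_topic = {}
    for topic, pid in pairs:
        by_topic.setdefault(topic, []).append(pid)
    sample = {}
    for topic, ids in by_topic.items():
        rng.shuffle(ids)          # random order within each stratum
        sample[topic] = ids[:per_topic]
    return sample

data = [("history", i) for i in range(5)] + [("science", i) for i in range(2)]
s = stratified_sample(data, per_topic=3)
```

For non-English wikis, where the topic distribution is too sparse for strata of useful size, plain random sampling is the fallback, as noted above.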
Oct 25 2024
Updates:
- I retrieved 10 similar pages for each seed article sampled previously. The seed article dataset contains additional features to link similar pages, namely sim_page_ids and sim_page_titles; the similar pages, along with their metadata, are available in a separate file.
- Find current versions of seed articles at ai_use_cases/categories/sample_articles/seed_articles_w_similar10_v1 and similar10 articles at ai_use_cases/categories/sample_articles/similar10_metadata_v1
- After Mykola’s initial analysis, I updated the above two datasets following his suggestions. I additionally collected revision_text for the main section, converting it to plaintext using mwedittypes (https://github.com/geohci/edit-types/blob/main/mwedittypes/utils.py#L77C5-L77C26)
- I started looking into the previous code and publications for Task 2, NPOV detection
- The pipeline for extracting negative and positive samples has been discussed, and it was decided to collect samples over the full history of all articles.
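The actual wikitext-to-plaintext conversion used the mwedittypes utility linked above; purely to show the kind of cleanup involved, here is a simplified regex-based stand-in that handles only a few common constructs:

```python
import re

def rough_plaintext(wikitext):
    """Very rough wikitext -> plaintext cleanup. NOT the mwedittypes
    implementation; a toy sketch covering templates, links, refs, quotes."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # simple templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)  # keep link labels
    text = re.sub(r"<ref[^>/]*/>", "", text)                       # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)    # full references
    text = re.sub(r"'{2,}", "", text)                              # bold/italic marks
    return re.sub(r"\s+", " ", text).strip()

sample = "'''Peafowl''' are [[bird]]s{{cn}} noted for beauty.<ref>src</ref>"
clean = rough_plaintext(sample)  # -> "Peafowl are birds noted for beauty."
```

Real wikitext has nested templates, tables, and parser functions that regexes cannot handle reliably, which is why a dedicated parser-based utility is the right tool in the pipeline.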
Oct 18 2024
Hello all, my contract for this project ended on September 30, and I would like to summarize our work here.
Status Update:
- I collected topics, categories, and section headings for all articles that are assigned a topical category (with a score > 0.5) in the 23 languages of the AYA23 model family.
- I checked the distribution by topic in each language, available here
- Note: one article could be counted for multiple topic categories
- Based on the decided thresholds from the above distributions, I sampled seed articles in each language
- decided to sample 50 articles per topical category before 2024 and max(25, number of created articles) per topic in 2024, to oversample the most recent data.
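The seed-sampling quota above can be sketched as follows. The grouping key and the reading of the 2024 rule (take every article created in 2024, which oversamples recent pages) are assumptions for illustration, not the actual sampling code:

```python
import random

def sample_seed_articles(articles, seed=0):
    """articles: iterable of (page_id, topic, year) tuples.
    Per topic: sample 50 pre-2024 articles; keep all 2024 articles
    (one reading of the 'max(25, number created)' rule above)."""
    rng = random.Random(seed)
    groups = {}
    for page_id, topic, year in articles:
        groups.setdefault((topic, year >= 2024), []).append(page_id)
    sample = []
    for (topic, recent), ids in groups.items():
        rng.shuffle(ids)
        quota = len(ids) if recent else 50
        sample.extend((topic, pid) for pid in ids[:quota])
    return sample

articles = ([(i, "history", 2023) for i in range(60)]
            + [(1000 + i, "history", 2024) for i in range(10)])
seeds = sample_seed_articles(articles)
```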
Oct 4 2024
I re-ran our latest reference-need model on a test set of 15K sentences. Our currently deployed model uses distilbert-base-multilingual-cased with torch dynamic quantization (column 2 - torch - in the plots below).
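The deployed model uses torch dynamic quantization; the gist of what that does to weights can be shown with a stdlib simulation of affine int8 quantization. This is a conceptual sketch, not the torch API (which is `torch.quantization.quantize_dynamic`):

```python
def quantize_int8(weights):
    """Affine int8 quantization round-trip for a list of float weights:
    map floats to 0..255 integers plus a scale/offset, then reconstruct.
    Storage drops to ~1/4 of float32 at a bounded precision cost."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0          # guard against a constant tensor
    q = [round((w - lo) / scale) for w in weights]   # ints in 0..255
    dequant = [lo + v * scale for v in q]            # reconstructed floats
    return q, dequant, scale

w = [-0.51, 0.0, 0.27, 1.02]
q, dq, scale = quantize_int8(w)
```

The reconstruction error per weight is at most half the quantization step, which is why accuracy typically degrades only slightly while inference on int8 kernels gets faster.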
Sep 27 2024
Report on ref. need model latency: https://docs.google.com/document/d/1EJbSJ7fekZvor8F-FiPVl7EGlTkckKtSIGOMXb1K2FM/edit?usp=sharing
Sep 20 2024
The ref. need model was deployed to production this week. Given the latency issue, we ran experiments to further improve the model latency.
I shared two updated binaries with the ML team: 1) a smaller model and 2) a smaller model + quantization. According to preliminary analysis on the stat machines, this should cut processing time roughly in half.
Sep 4 2024
Hello @isarantopoulos! We downgraded to match the version in the knowledge-integrity repo.
Aug 28 2024
Hi Aiko! The location on the stat1010 is /home/aitolkyn/temp/reference-quality/pretrained_models/multilingual_reference_need_128_v0.pkl
sha512: 0af0ecd12e05e7c40a0d39dd155589917130d1fa00711c3675c48d4373edca402bdc25cb85a56925deb24ebcf3c0ac01843179c86321f0991772b8963c27ed24 *multilingual_reference_need_128_v0.pkl
Aug 27 2024
Aug 23 2024
Updated the ref. need model input to
lang_code, section_name, sentence, next_sentence, prev_sentence
and reduced the input context size to 128 given the latency constraints.
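Building the model input from those five fields amounts to joining them in a fixed order and truncating to the reduced context. A sketch, where the field order, the separator token, and whitespace tokenization (as a stand-in for the real subword tokenizer) are all assumptions:

```python
def build_ref_need_input(lang_code, section_name, sentence,
                         next_sentence, prev_sentence,
                         sep=" [SEP] ", max_tokens=128):
    """Concatenate the five ref. need input fields and truncate to the
    128-token context window (whitespace tokens approximate subwords)."""
    joined = sep.join([lang_code, section_name, prev_sentence,
                       sentence, next_sentence])
    return " ".join(joined.split()[:max_tokens])

x = build_ref_need_input("en", "History", "It was founded in 1901.",
                         "It later expanded.", "The city grew fast.")
```

Keeping the surrounding sentences in the input preserves some local context even after the window shrink from 512-style lengths to 128.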
Jul 26 2024
For the second question: it is per article, so these tentative predictions were computed as the time to process an article revision.
Jul 19 2024
The ref. risk logic to compute the score follows the definition given in our WWW paper, where we account for blacklisted/deprecated sources in the perennial sources list as risky sources; this can be checked in the repo: https://gitlab.wikimedia.org/repos/research/reference-quality/-/tree/classifiers
For ref. risk, we further annotated an additional 1000 sources to use as ground truth.
Jul 12 2024
Prepared the code here: https://gitlab.wikimedia.org/repos/research/reference-quality/-/tree/classifiers
Jul 1 2024
- ref. need - model works well when tested on smaller languages (kkwiki balanced ~0.7)
- ref. need - prepared scripts to pass for production
- ref. risk - run classifier on all domains in enwiki --> save the score and pred. probability per domain
- ref. risk - article-level analysis of ref. risk score
Jun 21 2024
- ref.need - with transfer learning, results on the full enwiki dataset (F1-score 0.76, ROC-AUC 0.77, PR-AUC 0.83)
- ref. risk - results with the ground-truth on enwiki (Accuracy: 0.86, F1-score: 0.74)
The majority (58%) of misclassifications are caused by the middle 'Generally unreliable' category
Next:
- test ref. need model on smaller language editions
Jun 14 2024
- ref. need - the best-performing model so far reaches 0.76; we tested multilingual BERT and XLM-RoBERTa. The random sample has, on average, a more than twice higher ref. need (RN) score than the featured articles sample; refer to the plot below
- ref. need - compiled labeling dataset for 5 languages (en, es, de, fr, ru) and labeled ruwiki
- ref. risk - prepared three training datasets using perennial sources (PS) as ground-truth label (before PS list, after PS list, and all)
- ref. risk - binary classifier on balanced dataset, notebook with results here
Jun 2 2024
27/05-02/06:
- ref.need - ran inference with our best-performing model so far and computed the ref. need score for two samples of articles (featured and random); random articles get higher scores, meaning they are missing more references.
ref. risk - modelling approach set as below
- Binary classifier to detect risky sources
- Domain-level
- English Wikipedia
- Ground-truth labels
- Perennial source list (positive class: deprecated and blacklisted sources)
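Deriving the ground-truth labels from the perennial source list is a direct status-to-binary mapping, per the positive-class definition above. A sketch with made-up example statuses:

```python
# Positive class per the plan above: deprecated and blacklisted sources.
RISKY_STATUSES = {"deprecated", "blacklisted"}

def risky_label(status):
    """Map a perennial-source-list status to the binary 'risky' label:
    1 for deprecated/blacklisted, 0 for everything else."""
    return int(status.lower() in RISKY_STATUSES)

labels = [risky_label(s) for s in
          ("Deprecated", "Blacklisted", "Generally reliable", "No consensus")]
```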
May 24 2024
20/05-26/05:
- ref. need - experiments with additional context (such as prev/next sections, sentences, or paragraphs)
- ref. need - expanded the experiments to multilingual scenarios, training and testing on different languages with an f1-score of 0.70-0.72
- ref. risk - more experiments of the signals of reliability (on data by domain, URL, and page levels)
May 17 2024
12/05 - 19/05:
- ref.need - testing the best-performing classifier so far on the remaining languages (i.e., the 5 languages featured articles were collected from, as reported in earlier weeks); the pretrained model we're using is distilbert-base-multilingual-cased
- ref.risk - changed the target feature from the absolute number of edits survived to the survival ratio, where 1 means a reference survived all the edits after its addition and 0 means the reference didn't survive any subsequent edit. Analyzed the signals by comparing references at the article level
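The survival ratio target can be written down directly. One edge case is a reference with no edits after its addition; treating it as fully surviving is an assumption here, not something stated above:

```python
def survival_ratio(edits_survived, edits_after_addition):
    """Fraction of subsequent edits a reference survived: 1.0 means it
    survived every edit after being added, 0.0 means it was removed
    immediately. A reference with no later edits counts as surviving
    (assumed convention)."""
    if edits_after_addition == 0:
        return 1.0
    return edits_survived / edits_after_addition
```

Unlike the absolute count, the ratio is comparable across articles with very different edit volumes.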
May 10 2024
06/05 - 12/05:
- Ref. need - model improves with more context, f1-score achieved is ~0.74
- Ref. risk - experiment with assigning possible reliability labels based on the observed patterns from perennial sources with known labels (last plot here)
- Ref. risk - compared featured articles and the remaining articles in the current snapshot using our assigned labels
May 3 2024
29/04 - 05/05:
- Ref. need - add topic as a feature
- Ref. need - ran experiments with additional numerical and textual features, so far the highest performance achieved in terms of f1-score is 0.733
Apr 26 2024
22/04 - 28/04:
- Ref.need - Tested with Citation Needed data; our approach gets better results on their data
- Ref.need - Added additional inputs to the model, such as the section index, paragraph index, sentence index
- Ref.risk - Limit our data to instances of sources that are labeled in the perennial sources list, clean and preprocess the data for classification
Apr 19 2024
15/04 - 21/04:
- Ref.need - experiments with different models (refer to slide#10 for results)
- Ref.need - feature exploration; refer to slide#8 and the plots below for Logistic Regression feature importance for positive/negative labels
- Ref.risk - analysis of URL permanence vs. reliability label, example below:
Apr 16 2024
Apr 15 2024
@ssingh Thank you for checking! I get the following error when trying to access my tables:
08/04 - 14/04:
- ref.need - experiments with BERT continued with additional data cleaning; facing overfitting issues, experimenting on a sample of data following the prev. work's approach
- ref.risk - extending the data with URL permanence (lifespan of a URL on a page and number of edits)
- ref.risk - analysis of ground-truth labels from the enwiki perennial sources list
Apr 5 2024
01/04 - 07/04:
- ref.need using the dataset we prepared, started experimenting with BERT for sentence classification into cited/uncited in enwiki
Mar 30 2024
25/03 - 31/03:
- ref.need - running baseline models for classification using 1) tf-idf and 2) sentence metadata, each with logistic regression
- planning to run with LLMs next; studying Hugging Face NLP tutorials
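The tf-idf feature step of that baseline can be sketched in pure stdlib (the actual runs presumably used a library vectorizer; the documents below are toy examples):

```python
from collections import Counter
from math import log

def tfidf(docs):
    """Per-document tf-idf weights: term frequency times log inverse
    document frequency. Terms appearing in every document get weight 0."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    return [{t: (c / len(tokens)) * log(n / df[t])
             for t, c in Counter(tokens).items()}
            for tokens in tokenized]

weights = tfidf(["citation needed here", "no citation given"])
```

These sparse weights then feed a logistic regression classifier, giving a cheap, interpretable baseline before moving to transformer models.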
Mar 22 2024
18/03-22/03:
ref.need
- found and fixed an issue with tokenized sentences -> updated the dataset
- Data: ~16M featured articles and ~4.4M extracted sentences (47% accompanied by a citation)
- brainstorming and planning for the classifier model
Mar 15 2024
11/03-15/03:
ref.need
- extend dataset with sentences from FAs of dewiki, frwiki, ruwiki, ptwiki, eswiki
- the proportion of cited sentences ranges from 40-49% for enwiki, ruwiki, ptwiki, and frwiki; it is lower for eswiki (~35%) and lowest for dewiki (~20%)
- notebook available here: https://gitlab.wikimedia.org/repos/research/reference-quality/-/blob/research-notebooks/RN/extract-FAs.ipynb?ref_type=heads
Mar 8 2024
04/03 - 08/03:
- ref.need - add featured articles from dewiki, frwiki, ruwiki, ptwiki, eswiki (choice is based on the number of currently existing FAs)
- ref.need - [for enwiki] prepare dataset for the model with the following columns:
page_id, revision_id, section_name, sentence, paragraph, citation_label
where citation_label = 0, if sentence does not include a reference,
citation_label = 1, if sentence includes a reference
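A minimal labeling rule for citation_label could look like this. The ref-tag heuristic is a simplification for illustration; the actual pipeline worked from parsed wikitext:

```python
import re

# A sentence counts as cited if it carries a <ref>...</ref> tag or a
# self-closing named reference (simplified heuristic).
REF_PATTERN = re.compile(r"<ref[^>]*/>|<ref[^>]*>.*?</ref>", re.S)

def citation_label(sentence):
    """1 if the sentence includes a reference, else 0 (per the schema above)."""
    return int(bool(REF_PATTERN.search(sentence)))

labels = [citation_label(s) for s in (
    "The city was founded in 1901.<ref>source</ref>",
    'It grew quickly.<ref name="a"/>',
    "It is a city.",
)]
```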
Mar 1 2024
26/02 - 01/03:
- onboarding
- ref. need - prepare featured articles data
- ref. risk - literature review (e.g., reverted revs examples, controversiality score) for our dataset specification
Aug 5 2023
The paper has been accepted to CIKM (short paper track).
Mar 10 2023
- We uploaded our work to arxiv: https://arxiv.org/abs/2303.05227
Jun 25 2022
20/06 ~ 24/06:
- extract user lifespan & analyze the lifespan of users vs. collaboration with experts
- manually check pages containing sources from external fake websites lists
- get dominant sources on wiki from the external lists
- finish collecting citation quality scores for the top dataset
Jun 17 2022
13/06 ~ 17/06:
- classify users into exposed and non-exposed in the new random & top datasets
- PSM (propensity score matching) on collaboration between experts and non-experts on the new random and top datasets
- topic coverage of unreliable source lists (including perennials)
- finish collecting citation quality scores for the random dataset
Jun 10 2022
- add one more external source list (3. Snopes)
- visualize the intersection of the 3 lists and coverage of #3
- started collecting citation quality scores for the top2021 dataset
Jun 4 2022
- search for external lists of unreliable sources (e.g., Melissa Zimdars' fake news websites list)
- compute the coverage of wiki by external lists (1. Zimdars and 2. Daily Dot) and compare with the perennial source list
- setting up the environment for the citation quality scores collection
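The coverage computation above reduces to a set intersection over cited domains. A sketch with toy domains (not real data from the lists):

```python
def coverage(wiki_domains, external_list):
    """Share of unique domains cited on wiki that also appear in an
    external unreliable-source list."""
    cited = set(wiki_domains)
    return len(cited & set(external_list)) / len(cited)

wiki = ["example.com", "fakenews.net", "journal.org", "fakenews.net"]
zimdars_like = {"fakenews.net", "hoax.info"}
cov = coverage(wiki, zimdars_like)  # 1 of 3 unique domains
```

Comparing this number across lists (and against the perennial source list) shows how much each external list actually overlaps with what Wikipedia cites.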
May 27 2022
23/05 ~ 27/05
- transferred collection of reference need scores to the server (API calls took too long)
- analysis of active user contributors who add perennial sources
- completed the analysis of the evolution of references in the perennial source list (the recent trend is more positive compared to earlier data)
May 20 2022
16/05 ~ 20/05
May 15 2022
It looks very useful, thank you very much!! I'll check this out
May 13 2022
09/05 ~ 13/05
- collect reference risk scores for the 2 datasets: random and top2021
- start collecting all the missing reference need scores
- pageviews for 'bad' sources before and after they are classified as 'bad'
- analyze the data collected so far (significance tests, distributions, plots)
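The significance tests mentioned can be as simple as a two-sample permutation test on a group difference; here is a stdlib sketch with toy numbers (the actual analysis and statistics used are not specified above):

```python
import random

def permutation_test(a, b, n_iter=2000, seed=0):
    """Two-sample permutation test on the absolute difference of means:
    the p-value is the share of random label shufflings with a difference
    at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_iter

p = permutation_test([5.1, 5.4, 5.2, 5.3], [3.0, 3.2, 2.9, 3.1])
```

Permutation tests need no distributional assumptions, which suits skewed quantities like pageviews or quality scores.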
May 6 2022
02/05 ~ 06/05:
- re-check the pageviews data for pages in multiple namespaces (namespace_id was added to pageviews_hourly in 2017)
- analysis of the reference quality of the most viewed pages' revisions
- get pages and collect revision data for two datasets: random and top-viewed
Apr 29 2022
25/04 ~ 29/04
- extract monthly top-viewed pages and get the pages' revisions at that time
- collect reference quality scores for the revisions of top-viewed pages
- aggregate pageviews for the references from perennial source list
- prepare presentation
Apr 22 2022
18/04 ~ 22/04:
- extract and analyze data from the pageviews table
- check the extracted results with the PageviewsAPI
- check PageviewsAPI
Apr 15 2022
11/04 ~ 15/04:
- perennial source list references lifespan and pageviews
- continued exploring PySpark