
Initial task generation and ingestion to Cassandra and Search weight tags
Closed, Resolved · Public

Description

This work includes the following steps:

  • Generate initial ~10K high-quality tasks offline (en, fr, ar, and pt)
  • Bootstrap/initial ingestion to Cassandra and Search weight tags (manually, one-time)

The Search Platform team can assist with the bootstrap/initial ingestion, as they have a manual script for ingesting weighted tags. Ingestion to Cassandra needs investigation.
We ended up using Lift Wing to do initial ingestion to both Search weighted tags and Cassandra. The ingestion script we use is at https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1211159

After initial ingestion, we'll use T408538: Create a Revise Tone Task Generator in LiftWing to update the task list.

Dependent tasks:

Event Timeline

Update

I've collected articles in English (en), French (fr), Arabic (ar), and Japanese (ja), then generated paragraph data using Spark.

  • Article topics:
    "Culture.Biography.Biography*",
    "Culture.Biography.Women",
    "Culture.Sports",
  • Data cleaning
    1. Sections to skip
"en": [
    'See also',
    'References',
    'External links',
    'Further reading',
    'Notes',
    'Additional sources',
    'Sources',
    'Bibliography'
],
"fr": [
    'Notes et références',
    'Annexes',
    'Bibliographie',
    'Articles connexes',
    'Liens externes',
    'Voir aussi',
    'Notes',
    'Références'
],
"ar": [
    'وصلات خارجية',
    'قراءة موسَّعة',
    'الهوامش',
    'انظر أيضاً',
    'الاستشهاد بالمصادر',
    'انظر أيضًا',
    'مراجع',
],
"ja": [
    '脚注',
    '参考文献',
    '関連項目',
],
  2. Prefixes for links/files/categories to remove
"en": ("file:", "image:", "category:"),
"fr": ("fichier:", "image:", "catégorie:"),
"ar": ("صورة" ,"ملف" ,"تصنيف"),
"ja": ("file:", "image:", "category:"),
  3. Paragraphs/plaintext to skip, based on their starting characters
    • *: list items
    • |: table or template leftovers
    • <blockquote>
    • <ref>

We collected 2,605,050 articles for enwiki, 954,813 for frwiki, 614,594 for arwiki, and 512,887 for jawiki — around 4M articles in total. The next step is to run the Tone Check model on all of this paragraph data to get high-quality tasks. Using statbox is too slow, so we'd like to use ML-lab's GPU instead. However, there is currently no automated way to get data from HDFS to the lab machine (tracked in T380279 for ML SRE), so we'll move the data manually.
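The cleaning rules above can be sketched as a small predicate. This is an illustrative re-implementation, not the actual Spark job; the dictionaries are abbreviated and the function name is hypothetical:

```python
# Illustrative sketch of the cleaning rules described above (en only shown;
# fr/ar/ja entries would follow the lists in this task).
SKIP_SECTIONS = {
    "en": {"See also", "References", "External links", "Further reading",
           "Notes", "Additional sources", "Sources", "Bibliography"},
}
LINK_PREFIXES = {
    "en": ("file:", "image:", "category:"),
}
# Paragraphs starting with these are list items, table/template
# leftovers, blockquotes, or reference markup.
SKIP_STARTS = ("*", "|", "<blockquote>", "<ref>")

def keep_paragraph(lang, section, text):
    """Return True if a paragraph survives the cleaning rules."""
    if section in SKIP_SECTIONS.get(lang, set()):
        return False
    stripped = text.lstrip()
    if stripped.lower().startswith(LINK_PREFIXES.get(lang, ())):
        return False
    if stripped.startswith(SKIP_STARTS):
        return False
    return True
```

In the actual pipeline the same conditions would be expressed as Spark filters over the paragraph DataFrame.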

Update

We have the initial dataset for frwiki.

  • We filtered for paragraph length between 100 and 200 characters (exclusive lower bound, inclusive upper bound), which resulted in 1,861,036 paragraphs across 716,585 articles:
.filter(F.length(F.col("text")) > 100)
.filter(F.length(F.col("text")) <= 200)
  • After running the Tone Check model and selecting samples with label 1 (has tone issue) and score > 0.8, we identified 13,550 paragraphs from 11,777 articles that qualified.
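Putting the two selection steps together, the per-paragraph criterion looks roughly like this (a sketch; the function and parameter names are illustrative, not the pipeline's actual code):

```python
def qualifies(text, label, score, min_len=100, max_len=200, threshold=0.8):
    """Selection criteria described above: paragraph length in (100, 200],
    Tone Check label 1 (has tone issue), and model score above 0.8."""
    return min_len < len(text) <= max_len and label == 1 and score > threshold
```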

After meeting with @Michael today, we agreed to first enable Testwiki for more controlled experimentation with both the update pipeline and the Newcomer Task integration. This means we will (1) load the initial Testwiki dataset to staging Cassandra and Search weight tags, and (2) enable the Revise Tone Task Generator on Lift Wing for Testwiki.
cc @BWojtowicz-WMF

Hi @pfischer @dcausse, the ML team wants to follow up on the initial ingestion process. As you mentioned before, the Search Platform team has a manual script for this purpose. Can the ML team execute it on our end (e.g., on statbox)? Or can only the Search team execute it?

All the datasets (frwiki, arwiki, enwiki, ptwiki, and testwiki) will be ready soon, and we'd like the initial ingestion to happen when the Revise Tone Task Generator in Lift Wing is ready this week. Would that be feasible?

@Michael, due to the time gap between data generation and ingestion, there will inevitably be some stale data in the initial dataset. I assume Growth can use the lasteditdate filter or some similar mechanism to work around this?

> Hi @pfischer @dcausse, ML team wants to follow up on the initial ingestion process. As you mentioned before, the Search platform team has a manual script for this purpose. Can the ML team execute this on our end (e.g., in statbox)? Or can only the Search team execute it?

It is a bit cumbersome to run, unfortunately, and some adaptations would have to be made (we only used it to backfill article countries). The script is in stat1009.eqiad.wmnet:~dcausse/articlecountry:

  • backfill_articlecountry.scala: the Spark job that reads hdfs://analytics-hadoop/user/dcausse/topic_model/wiki-region-groundtruth/regions-cirrus-upload.tsv.gz and converts it to classification.prediction.articlecountry weighted tags; this one would have to be adapted to your source data
  • wiki.lst: the list of wikis to filter on
  • backfill.sh: the shell script that orchestrates all of this

It was designed to support large datasets, but if your dataset is relatively small (<100,000 pages) I suspect you can use the same strategy you used to push the initial test data via EventGate?

> [...]
> it was designed to support large datasets but I suspect that if your dataset is relatively small (<100000 pages) you can use the same strategy you used to push the initial test data via event-gate?

Thank you for the detailed info, David! Yes, we ended up using a simple script to send requests to Lift Wing for initial ingestion. :)
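For reference, such a "simple script" could look roughly like the sketch below. The endpoint URL and payload fields here are illustrative assumptions only; the actual ingestion script is the Gerrit change linked in the description:

```python
import json
import urllib.request

# Hypothetical Lift Wing endpoint; the real one differs.
LIFTWING_URL = "https://inference.example.org/v1/models/revise-tone:predict"

def build_request(wiki, page_id, rev_id):
    """Build one JSON POST request per page (assumed payload schema)."""
    payload = {"wiki_id": wiki, "page_id": page_id, "rev_id": rev_id}
    return urllib.request.Request(
        LIFTWING_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# A driver would iterate over the generated task dataset and send
# build_request(...) for each page via urllib.request.urlopen, with
# rate limiting appropriate for Lift Wing.
```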