
Analyze samples of articles to see how many structured tasks we might be able to generate
Closed, Resolved · Public

Description

This ticket pertains to "Step 2 - Analysis" in this doc. The steps are reiterated below.

  1. In the "Counts" tab of this spreadsheet, calculate how many requests we'd make to the model for each subset of articles (each row).
    1. For one large Wikipedia (EN), and one smaller Wikipedia (CS), query the content corpus to calculate how many articles and how many paragraphs we would be passing into the model.
    2. For now, we’ve chosen English and Czech because we believe both languages are supported by the model (we are still doing some evaluation of the model’s performance on Czech content). From the list of supported languages, English represents the biggest wiki and Czech represents the smallest wiki.
  2. In the "Counts" tab of this spreadsheet, calculate how many structured tasks we'd generate for each subset of articles (each row).
    1. For each Wikipedia, take a random sample of 50 articles from each article type.
    2. Parse the sample articles into plain text paragraphs, and send the paragraphs to the model.
    3. Calculate the number of positive predictions with a probability score >= 0.8.
    4. Use this number to generate an estimate of the total number of high-probability positive predictions we’d expect to see if we applied the model to all articles within that article type.
  3. For each Wikipedia and each article type (e.g. "EN - Articles with relevant page templates" or "CS - Articles about people"), make a tab in this spreadsheet containing the sample article paragraphs that receive a positive prediction with a probability score of 0.8 or higher. Include the following metadata about each paragraph:
    1. [If possible] Number of pageviews to the article (all-time)
    2. Number of pageviews to the article (last 90 days)
    3. Number of edits made to the article (all-time)
    4. Number of edits made to the article (last 90 days)
    5. Section title that the paragraph is in
    6. Age of the article (# days)
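
The extrapolation described in steps 2.3-2.4 can be sketched as follows; `estimate_total_tasks` is a hypothetical helper written for illustration, and the example numbers are made up.

```python
def estimate_total_tasks(sample_probs, sample_size, population_size, threshold=0.8):
    """Extrapolate from a random article sample to a whole article type.

    sample_probs: model probability scores for every paragraph in the
        sampled articles (one entry per paragraph).
    sample_size: number of articles sampled (e.g. 50).
    population_size: total number of articles in the article type.
    """
    positives = sum(1 for p in sample_probs if p >= threshold)
    # High-probability positive paragraphs per sampled article,
    # scaled up to the full article type.
    return round(positives / sample_size * population_size)

# E.g. 4 paragraphs at >= 0.8 across 50 sampled articles, out of
# 1,394,816 articles in the article type:
estimate_total_tasks([0.95, 0.91, 0.85, 0.83, 0.42], 50, 1_394_816)
# → 111585
```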

Feel free to replace or edit any of the tabs in the spreadsheet linked above. Please keep in mind that this list of articles is strictly for analysis, and is not meant to serve as the final list we use for the structured task.

Event Timeline

Update:

I've grouped the article types to extract into four categories based on their source data, and put the results at the end.

  1. Page/Inline Templates
    • Idea: find articles that contain certain page/inline templates in their content.
    • Source data: template list, wmf_content.mediawiki_content_current_v1 table
    • Some enwiki maintenance template ideas are listed as page templates in this spreadsheet, but after checking them, I found most are actually inline templates (e.g. {{Compared to?}}, {{Editorializing}}, {{Fact or opinion}}, {{Opinion}}, {{POV statement}}).
    • Most of the page and inline templates listed do not have corresponding templates on Czech Wikipedia (cswiki). I only found {{Šablona:Vyhýbavá_slova}}, which is similar; its page also mentions related page and inline templates, so I ended up using those for cswiki.
  2. Article Topics
    • Idea: find articles with the topics (Culture.Biography, Culture.Sports, Culture.Media.Media*, History_and_Society.Business_and_economics, History_and_Society.Politics_and_government) and probability > 0.5.
    • Source data: research.article_topics table
    • In the results, the Culture.Media.Media* topic has the second-highest count. This topic isn't just about businesses; it is a catch-all for all other items that appear within Media, e.g. Media.Books, Media.Entertainment, Media.Films, etc. It also includes any items generally related to Media that don't fit into the specific categories (see https://www.mediawiki.org/wiki/ORES/Articletopic).
  3. Articles edited by new editors
    • Idea: find articles edited by users who registered within the last 3 months ('2025-06-01' to '2025-09-01').
    • Source data: mediawiki_history table (which has the "event_user_registration_timestamp" field)
  4. Articles with relatively few pageviews (WIP)
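
A minimal sketch of the template-detection idea in category 1, assuming we already have wikitext in hand (in practice this ran as a query against wmf_content.mediawiki_content_current_v1; the helper and template set below are illustrative only):

```python
import re

# Subset of the inline maintenance templates of interest (illustrative).
# Template names in wikitext are matched case-insensitively here and may
# use spaces or underscores interchangeably.
INLINE_TEMPLATES = {
    "weasel inline", "peacock inline", "tone inline", "pov statement",
    "editorializing", "fact or opinion", "opinion", "compared to?",
}

# Capture the template name: everything after "{{" up to the first "|" or "}}".
TEMPLATE_RE = re.compile(r"\{\{\s*([^|}]+?)\s*(?:\||\}\})")

def find_inline_templates(wikitext):
    """Return which maintenance templates appear in a wikitext snippet."""
    found = set()
    for m in TEMPLATE_RE.finditer(wikitext):
        name = m.group(1).replace("_", " ").strip().lower()
        if name in INLINE_TEMPLATES:
            found.add(name)
    return found

find_inline_templates("He was a great{{POV statement|date=May 2025}} leader.")
# → {'pov statement'}
```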

Results:

  1. Page/Inline Templates
+-------+------------------+-----+
|wiki_id|template          |count|
+-------+------------------+-----+
|enwiki |tone              |5343 |
|enwiki |peacock           |770  |
|enwiki |weasel            |506  |
|enwiki |buzzword inline   |441  |
|enwiki |weasel inline     |386  |
|enwiki |tone inline       |379  |
|enwiki |opinion           |312  |
|enwiki |pov statement     |245  |
|enwiki |editorializing    |216  |
|enwiki |peacock inline    |176  |
|enwiki |technical inline  |157  |
|enwiki |promotion inline  |95   |
|enwiki |fact or opinion   |80   |
|enwiki |unbalanced opinion|55   |
|enwiki |compared to?      |38   |
|enwiki |neologism inline  |7    |
|enwiki |idiom             |2    |
|enwiki |colloquialism     |1    |
+-------+------------------+-----+

+-------+-------------------+-----+
|wiki_id|template           |count|
+-------+-------------------+-----+
|cswiki |neověřeno          |12026|
|cswiki |neověřeno část     |840  |
|cswiki |ujasnit            |656  |
|cswiki |vlastní výzkum     |176  |
|cswiki |celkově zpochybněno|110  |
|cswiki |vyhýbavá slova     |16   |
+-------+-------------------+-----+
  2. Article Topics
+-------+-------------------------------------------+-------+
|wiki_db|topic                                      |count  |
+-------+-------------------------------------------+-------+
|enwiki |Culture.Biography.Biography*               |1394816|
|enwiki |Culture.Media.Media*                       |1019480|
|enwiki |Culture.Sports                             |956776 |
|enwiki |History_and_Society.Politics_and_government|246764 |
|cswiki |Culture.Biography.Biography*               |131752 |
|enwiki |History_and_Society.Business_and_economics |104479 |
|cswiki |Culture.Sports                             |67564  |
|cswiki |Culture.Media.Media*                       |62480  |
|enwiki |Culture.Biography.Women                    |28913  |
|cswiki |History_and_Society.Politics_and_government|11151  |
|cswiki |History_and_Society.Business_and_economics |7669   |
|cswiki |Culture.Biography.Women                    |1639   |
+-------+-------------------------------------------+-------+
  3. Articles edited by new editors
+-------+-----+
|wiki_db|count|
+-------+-----+
|enwiki |31151|
|cswiki |2697 |
+-------+-----+
  4. Articles with relatively few pageviews (WIP)

Include the following metadata about each paragraph:
A. [If possible] Number of pageviews to the article (all-time)
B. Number of pageviews to the article (last 90 days)

@diego, do you know if there is any Hive table in Data Lake that contains aggregate pageviews per article? Would really appreciate your insights on this :)

You could use the pageview-hourly table, which contains aggregated number of pageviews per article (use the page_id to also get pageviews from redirects). You would only need to aggregate over a corresponding time window. (I am not Diego, but thought I would chime in when I saw the update). Hope this helps.

Thanks @MGerlach , I agree that is the main source of pageviews readable from Spark. I'm not aware of anything aggregated in larger buckets.

There is also a pageviews_daily table (significantly smaller than the hourly one), but it is only available in Turnilo and I think it doesn't have article-level information.

The knowledge gap pipeline (which is snapshot based) also aggregates historical pageviews, since the pageview_hourly dataset is sizable for larger time ranges. knowledge_gaps.pageviews_daily contains a subset of pages (Wikipedia, namespace 0, agent_type=user), has a minimal schema, and is partitioned by date (i.e. day); one day of data is ~4GB compared to ~25GB for pageview_hourly. The code is here. The DAG runs weekly.
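
For reference, a trailing-90-day aggregation over a daily-partitioned layout like knowledge_gaps.pageviews_daily can be sketched in plain Python (in practice this would be a Spark job; the tuple layout below is an assumption for illustration, not the table's actual schema):

```python
from collections import defaultdict
from datetime import date, timedelta

def pageviews_last_90_days(daily_rows, as_of):
    """Sum per-page views over the 90 days before `as_of`.

    daily_rows: iterable of (day, page_id, view_count) tuples, mimicking
    a daily-partitioned pageviews table (layout is an assumption here).
    """
    cutoff = as_of - timedelta(days=90)
    totals = defaultdict(int)
    for day, page_id, views in daily_rows:
        if cutoff <= day < as_of:
            totals[page_id] += views
    return dict(totals)

rows = [
    (date(2025, 8, 30), 123, 40),
    (date(2025, 8, 31), 123, 60),
    (date(2025, 1, 1), 123, 999),  # outside the 90-day window
]
pageviews_last_90_days(rows, date(2025, 9, 1))
# → {123: 100}
```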

@MGerlach @diego @fkaelin Wow thank you for all your input! <3 I checked out the knowledge_gaps.pageviews_daily table and it's super easy to use :D

Data collection is complete and this spreadsheet has been updated.

Summary:

  • Estimated number of structured tasks we can generate: 49,327 (enwiki; large wiki); 1,851 (cswiki; small wiki)
    • For enwiki, articles about people contribute the most estimated tasks (~28k), followed by articles about businesses (~10k)
    • For cswiki, while the number of positive paragraphs exceeds enwiki's in the sample articles, paragraphs with probability ≥ 0.8 are much rarer, sometimes absent. Lowering the threshold to 0.7 would yield many more tasks; for example, we'd have 13,339 estimated tasks with probability ≥ 0.7 in articles about people.
  • Some article types received no positive predictions with a probability score ≥ 0.8 in our random sample of 50 articles. This suggests the sample size may be too small for certain article types. For example, among 956,776 articles about sports in English, our sample of 50 articles gave 0 positive predictions with a probability score ≥ 0.8. We should try larger samples to verify this.
  • The quality of structured tasks generated needs evaluation
    • I've added sample article paragraphs that received positive predictions with high probability scores, including metadata about each article, to the EN/CS tabs in the spreadsheet.

en_vs_cs_prob_dist.png (310×456 px, 24 KB)

A comparison of the enwiki vs. cswiki probability score distributions (from T398930) may explain this phenomenon: the cswiki distribution appears denser than enwiki's, with scores concentrated more in the 0.4-0.6 range.
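
The threshold sensitivity mentioned above can be checked directly from the sampled scores; `estimates_by_threshold` is a hypothetical helper and the scores below are illustrative:

```python
def estimates_by_threshold(sample_probs, sample_size, population_size,
                           thresholds=(0.8, 0.7, 0.6)):
    """Re-run the sample extrapolation at several cutoffs, to see how
    much the estimated task count depends on the 0.8 threshold."""
    out = {}
    for t in thresholds:
        positives = sum(1 for p in sample_probs if p >= t)
        out[t] = round(positives / sample_size * population_size)
    return out

# With cswiki-like scores clustered around 0.4-0.6, the 0.8 cutoff
# captures far fewer paragraphs than 0.7:
estimates_by_threshold([0.81, 0.74, 0.72, 0.55, 0.48], 50, 131_752)
# → {0.8: 2635, 0.7: 7905, 0.6: 7905}
```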

Michael added a subscriber: Urbanecm_WMF.

Oh that is a fascinating graph, thank you! If(!) we assume that the model is at least directionally accurate, does this graph then imply that Czech Wikipedia is written in a subtly less neutral tone than English Wikipedia? Curious if anyone from Research has thoughts on this.

I'd also be curious if @Urbanecm_WMF or other native Czech speakers have a gut reaction to that. Does English Wikipedia feel like it has a more "neutral" tone than Czech Wikipedia?

Update:

  • Using a larger sample size (~1k)
    • EN: 46,910 total estimated tasks, similar to our previous estimates. Tasks in articles about sports increased (0 to 9,244), while articles about businesses decreased (10,448 to 3,913). Articles with templates and articles about people remained stable.
    • CS: Estimated numbers increased across all categories, raising the total tasks from 1,851 to 4,055. Lowering the threshold to 0.7 would increase this to 30,508 tasks.
    • Detailed category numbers can be found in the "Counts - larger sample size" tab in the spreadsheet.
  • Additional sample article paragraphs can be found in the "EN - more samples" and "CS - more samples" tabs in the spreadsheet.
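
One way to see why the 50-article samples could plausibly report 0 positives even for large article types: after observing 0 positives in n trials, the exact 95% upper confidence bound on the per-article rate is 1 − 0.05^(1/n) (the "rule of three" approximates this as 3/n). A quick sketch, with the function name chosen here for illustration:

```python
def upper_bound_rate(n, confidence=0.95):
    """Exact upper confidence bound on the positive rate after seeing
    0 positives in n independent trials: solve (1 - p)**n = 1 - confidence."""
    return 1 - (1 - confidence) ** (1 / n)

# With 0/50 positive articles, the true per-article rate could still be
# ~5.8%, which over ~1M articles leaves room for tens of thousands of
# tasks -- consistent with sports jumping from 0 to 9,244 estimated tasks
# once the sample grew to ~1k.
round(upper_bound_rate(50), 3)
# → 0.058
```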

I reposted @Sucheta-Salgaonkar-WMF's findings and recommendations here from Slack and resolved this ticket.

TLDR: We should have no problem generating 10k high-quality tasks in wikis that are comparable to English Wikipedia.

Below is a summary of my recommendations, and you can find my full notes here. Lmk if you have any questions, thoughts, etc

Article types

  • We should include articles about people, articles about sports, and articles tagged with relevant templates
  • If needed (to generate more tasks, or to serve tasks to a more experienced editor audience) we can include articles about businesses and articles written by new editors
  • We should exclude articles about politics

Filters

  • We should filter out references (e.g. using <ref>)
  • We should try to filter out block quotes and inline quotes
  • We should exclude brand new articles that are awaiting New Page Patrol review
  • We should exclude articles that are marked for deletion
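
The first two filters could be sketched as simple wikitext preprocessing before paragraphs are sent to the model; the patterns below are illustrative, not the production implementation:

```python
import re

# Strip <ref>...</ref> pairs and self-closing <ref .../> tags, plus
# {{quote|...}} templates and <blockquote> blocks.
REF_RE = re.compile(r"<ref[^>/]*>.*?</ref>|<ref[^>]*/>", re.DOTALL)
QUOTE_RE = re.compile(r"\{\{[Qq]uote\|[^{}]*\}\}|<blockquote>.*?</blockquote>",
                      re.DOTALL)

def strip_refs_and_quotes(wikitext):
    """Remove references and quoted material before model input."""
    return QUOTE_RE.sub("", REF_RE.sub("", wikitext))

strip_refs_and_quotes('Best ever.<ref name="a">Source</ref> {{quote|words}} End.')
# → 'Best ever.  End.'
```

Real wikitext has nested templates and refs inside refs, so a proper parser would be safer in production; this is just the shape of the filter.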

Analysis

  • We should be sure to evaluate not just whether the model is identifying tone issues, but also whether the edits newcomers are making are actually fixing them or just changing something that doesn't improve them (or worse, changing them in a way that hides the tone issue without fixing it).
  • We should evaluate whether newcomers who take the onboarding quiz are able to interact with this task more effectively.