Cross-Lingual Article Quality Exploration
Closed, Resolved, Public

Event Timeline

Just wanted to comment that I love this task and a few thoughts:

  • The start of some of my work around understanding sources -- specifically their geography: https://meta.wikimedia.org/wiki/Research:Analyzing_sources_on_Wikipedia
  • There are three aspects of citation quality that stand out to me:
    • Do they exist -- i.e. for sentences that need a citation, how many have one? Miriam and Aiko's work tackled this question for a few wikis.
    • The reliability/verifiability of an individual source -- e.g., does it show up on reliable source lists or in Featured Articles? Is it primary, secondary, or tertiary? Is there a digital version that doesn't sit behind a paywall?
    • The overall diversity of sources -- e.g., how many unique sources? mixture of primary/secondary/tertiary? mixture of countries? mixture of source types (books, newspapers, etc.)? mixture of dates?
  • As I mentioned in our meeting, I'm also intrigued by the question of when sources / content go stale. For example, you have an article that exists in multiple languages. In one language, we see a burst of edits adding new sources to the article that seem to be recent (e.g., newspaper articles published in the past month). How often do we see the other languages also add recent sources? When the other languages don't add these sources, which of these is true:
    • The articles are incomplete -- e.g., the person won a prize that isn't listed yet?
    • The articles have stale (old) data -- e.g., they have population data from 2010 but there's now data for 2020?
    • The articles now have misinformation / NPOV violations -- e.g., the original estimate of damage for an earthquake was wrong and by not including the new estimate, the article is under/overstating the impact? Or an even simpler case: the person has died and by not including that information, the reader would assume the person to still be alive?
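
To make the diversity and staleness ideas above a bit more concrete, here is a minimal sketch, not an existing tool. It assumes you already have the cited URLs for an article and the publication year of the newest source per language version. The function names (`source_diversity`, `lagging_languages`) and the input shapes are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def source_diversity(urls):
    """Return (number of unique domains, Shannon entropy in bits) for cited URLs."""
    domains = []
    for url in urls:
        domain = urlparse(url).netloc.lower()
        # Treat "www.example.org" and "example.org" as the same source.
        if domain.startswith("www."):
            domain = domain[4:]
        if domain:
            domains.append(domain)
    counts = Counter(domains)
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log2(p)
    return len(counts), entropy


def lagging_languages(newest_source_year):
    """Return languages whose newest source is older than the newest one seen anywhere.

    newest_source_year: hypothetical mapping of language code -> year of newest source.
    """
    if not newest_source_year:
        return []
    latest = max(newest_source_year.values())
    return sorted(lang for lang, year in newest_source_year.items() if year < latest)
```

A language flagged by `lagging_languages` is only a candidate for staleness; deciding whether it is incomplete, outdated, or misleading would still need a closer look at the content.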
diego renamed this task from Cross-Linngual Citation Quality Exploration to Cross-Linngual Article Quality Exploration. Apr 7 2022, 2:59 PM

Thanks @Isaac for these inputs. There was a mistake in the title: this work is about article quality and not specifically about citations.

Oh drat -- well then a small tweak on my comments that largely repeats what I said in our meeting. The two main gaps I see in the quality model are:

  • Better features related to the actual content (not just its quantity).
    • Martin is working on a language-agnostic measure for readability (meta)
    • Language-agnostic lists of maintenance templates so editors can flag issues that models might not detect, which I know is something Diego has already worked on
  • Better features related to the sources of the article (detailed thoughts in my previous comment) given how important sources are to article quality

Updates

  • Last week we completed the onboarding on PySpark and the cluster data.
  • As a first step, @paramita_das will work on obtaining the article quality distribution over time for enwiki.

@paramita_das please report your progress here every Friday.

Update for last week:

  1. I familiarized myself with cluster access and PySpark.
  2. I went through some projects on article quality.

Update for the week:

  1. I am getting familiar with the cluster data, especially wikitext_current and wikitext_history, and with scraping article quality ratings from namespace 1 (talk pages).
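
As a rough illustration of the scraping step, the sketch below pulls quality ratings out of talk-page wikitext. It assumes ratings appear as a `class` parameter inside WikiProject banner templates, as is common on English Wikipedia talk pages. The function name `extract_quality_classes` is hypothetical, and a real pipeline would likely need to handle more template variants.

```python
import re

# Assumption: the rating is given as "|class=<value>" inside a banner template.
CLASS_PARAM = re.compile(r"\|\s*class\s*=\s*([A-Za-z]+)", re.IGNORECASE)


def extract_quality_classes(talk_wikitext):
    """Return all class values found in the talk-page wikitext, lowercased."""
    return [value.lower() for value in CLASS_PARAM.findall(talk_wikitext)]
```

The same function could be wrapped in a PySpark UDF and applied to the talk-page rows of wikitext_current or wikitext_history.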

Update for the week 18.04.2022 - 23.04.2022:

I have prepared a yearly snapshot of the quality class assignments of English Wikipedia articles from the current dump, using PySpark and the cluster data. I found that for many articles the last quality class assessment was done years ago.
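
The yearly snapshot idea can be sketched in plain Python as follows. This is not the PySpark code used for the work above; it is a minimal illustration that assumes each assessment is a `(page, ISO date, quality class)` tuple, and the function name `yearly_snapshot` is hypothetical.

```python
from collections import Counter


def yearly_snapshot(assessments, years):
    """For each year, count pages by their most recent class as of 31 December.

    assessments: iterable of (page, "YYYY-MM-DD", quality_class) tuples.
    years: iterable of integer years to snapshot.
    """
    # ISO dates sort correctly as strings, so no date parsing is needed.
    ordered = sorted(assessments, key=lambda row: row[1])
    snapshots = {}
    for year in years:
        cutoff = f"{year}-12-31"
        latest = {}
        for page, date, quality in ordered:
            if date <= cutoff:
                # Later assessments overwrite earlier ones for the same page.
                latest[page] = quality
        snapshots[year] = Counter(latest.values())
    return snapshots
```

Because a page keeps its last known class until it is reassessed, old assessments carry forward into later years, which is how stale ratings show up in the distribution.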

FYI if you want an alternative approach for extracting quality ratings for current articles, you can use the page assessments MySQL table as well. They are available in at least English, French, Arabic, Hungarian, and Turkish but only contain the current state:

Thanks, Isaac, for the reference.

weekly update: 25.04.22 - 30.04.22

  1. Trying to generate the yearly snapshot of article quality from the English wikitext history dump.
  2. Trying to run language-agnostic quality model V2 on history data.

weekly update:

  1. Created a yearly snapshot of all articles from the English wikitext history dump.

weekly update: 08.05.22 - 14.05.22

  1. Implementing language-agnostic quality model V2 on the revision history of the English wikitext dump.

weekly update: 15.05.22 - 21.05.22

  1. Implemented language-agnostic quality model V2 for every revision in the wikitext history dump of the Spanish language version.
Aklapper renamed this task from Cross-Linngual Article Quality Exploration to Cross-Lingual Article Quality Exploration. Aug 15 2022, 10:09 AM

@diego resolving this as most of the work is done. Please reopen if that's not the case!