
Create a dashboard for tracking test wikis in the Wikimedia Incubator
Closed, Resolved, Public

Description

As a language committee member, it is important for me to see which test wikis have a lot of activity in the Wikimedia Incubator within a given timeframe. We currently check the activity in a single wiki via the catanalysis tool (example), but we don't have a good way of knowing which wikis are most active overall.

Therefore, my idea for a Hackathon 2022 project is to create a dashboard that gives an overview of all wikis in the Incubator and a quick indication of which wikis may be getting ready to be approved.

Stats that should be collected:

  • Number of edits per test wiki
  • Number of unique contributors in a test wiki
  • Bytes added/removed per test wiki
  • Anything else?

Should it give live stats, or only monthly summaries? (I'm leaning towards the latter.)

Event Timeline

@jhsoby this is a nice idea, and I am interested in collaborating on this. I will try to think of some ideas for how we can approach this before the hackathon begins.

@KCVelaga Awesome! I see you list SQL as one of your skills, and that's perfect, because that's a big gaping hole in my skills, and definitely needed for this. 😊

@jhsoby I am doing some initial data exploration and have a few questions:

  1. When a new wiki is created, are the pages and edits that belonged to it on Incubator deleted, or are they kept? Knowing this will help us filter edits out of the revision table using the rev_deleted field. From what I understand, we only need the dashboard for wikis that are currently incubating, not for all wikis historically, so we need some initial way of narrowing down to edits that belong to current test wikis rather than loading the entire edit history. We could possibly do this with a regex on the page title, but I am wondering if there is an easier way.
  2. Any thoughts on how we can exclude edits to Incubator's own pages, such as the main page, the FAQ, etc.?
  3. About the stats to be collected: how about adding the number of pages as well?
  4. Do you have specific endpoints/frameworks in mind for the dashboard?

Also, about live/monthly: depending on how big a dataset we end up with, we can set up a weekly, bi-weekly or monthly update.

  1. They are deleted. See I:SCL for the list of "ex-test-wikis".
  2. Those are all in the Project or Help namespaces. Conversely, test-wiki content is only found in the main, template, category and module namespaces (and their talk namespaces).

Maybe this query: https://quarry.wmcloud.org/query/61436 (counting raw number of actions) is a helpful reference.
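
For illustration, the namespace rule above combined with a title pattern could be sketched in Python roughly like this. It is only a sketch under stated assumptions: it assumes test-wiki pages are titled "<project prefix>/<language code>/...", for example "Wp/xyz/Main_Page", and that the prefixes in the pattern cover the projects of interest. The function name wiki_key is hypothetical. The namespace numbers are the standard MediaWiki IDs for main, template, category and module plus their talk namespaces.

```python
import re
from typing import Optional

# Assumption: test-wiki titles look like "<prefix>/<language code>/<page>",
# e.g. "Wp/xyz/Main_Page". The prefix letters listed here are illustrative.
TEST_WIKI_TITLE = re.compile(r"^(W[pbtqny])/([a-z][a-z0-9-]*)(?:/|$)")

# Standard MediaWiki namespace IDs: main (0), template (10), category (14),
# module (828), and the matching talk namespaces (1, 11, 15, 829).
CONTENT_NAMESPACES = {0, 1, 10, 11, 14, 15, 828, 829}


def wiki_key(namespace: int, title: str) -> Optional[str]:
    """Return a key such as "Wp/xyz" for test-wiki pages, otherwise None."""
    if namespace not in CONTENT_NAMESPACES:
        # Project and Help pages belong to Incubator itself.
        return None
    match = TEST_WIKI_TITLE.match(title)
    if match is None:
        return None
    return f"{match.group(1)}/{match.group(2)}"
```

Grouping revisions by the returned key would give one bucket per test wiki, and anything that returns None can be dropped before counting.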


Wonderful!

  1. Deleted, as MF-Warburg said, so we could safely disregard deleted edits.
  2. What MF-Warburg said.
  3. Yeah, number of pages would be good to have!
  4. Nothing specific, no, I am unfortunately not familiar with any specific frameworks for this – if I'd been all alone in this, I would have made something (most likely substandard, hehe) from scratch, so anything other than that is probably better. ;-)

Also, I think monthly updates would be fine, since we only really look at monthly stats anyways.

@MF-Warburg and @jhsoby, somehow I missed both your messages. Thank you for the information, that is helpful.

I am attending the local meetup in India. A couple of the participants are interested, along with me.

I am thinking of the following steps, broadly:

  • Build the required query
  • Build the required charts and outputs
  • Set things up on Toolforge (cron job, JSON storage, etc.)
  • Build and deploy the dashboard. I am thinking we can use https://dash.plotly.com/

Let me know what you think, and if you have any suggestions.

@KCVelaga Can we meet in the Workadventure space to discuss? :-) I'll be in the "Localization and small wikis" room when I'm online (like now).

Here is a query that outputs the required metrics for all incubating wikis: https://quarry.wmcloud.org/query/64738

Okay, so here's what I'm thinking for the stats/cron job (again with the disclaimer that I don't really know the limitations/capabilities of Dash especially):

  • Make the query check one calendar month at a time; save the results to e.g. stats/YYYY-MM.tsv
    • Generate similar stats for previous months of 2022?
  • Run the query as a cron job once a month (1st of each month) to create a new results file
  • In the tool (or in the query, but I think it might be easier to do in the tool?), we can do some logic to check the number of consecutive months of activity per project.

Also:

  • We should exclude some users from the user stats. Primarily IPs and bots, but also (ideally) people who edit in many different languages (because that's more likely to be maintenance than actual content contribution)
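
One possible way to express that exclusion in Python, as a sketch only. It assumes the edits are available as (username, test wiki key) pairs and that a list of bot account names is supplied from elsewhere. The max_wikis threshold of 3 is an arbitrary placeholder, not an agreed value, and both function names are hypothetical.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def is_ip(username: str) -> bool:
    """True if the username parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(username)
        return True
    except ValueError:
        return False


def excluded_users(
    edits: Iterable[Tuple[str, str]],
    bots: Set[str],
    max_wikis: int = 3,  # assumption: placeholder threshold
) -> Set[str]:
    """Return the users to leave out of the per-wiki contributor stats.

    A user is excluded if the name is an IP address, if it is in `bots`,
    or if the user edited more than `max_wikis` different test wikis.
    """
    wikis_by_user: Dict[str, Set[str]] = defaultdict(set)
    for user, wiki in edits:
        wikis_by_user[user].add(wiki)
    return {
        user
        for user, wikis in wikis_by_user.items()
        if is_ip(user) or user in bots or len(wikis) > max_wikis
    }
```

Computing the set once per month and filtering it out before counting unique contributors would keep the exclusion rule in one place, so the threshold can be tuned later.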