Request
Copying the task for this milestone from the epic task:
Milestone 2: Report for metrics NOT requiring webrequest aggregation
- First test report (in the format specified in the reporting template)
- number of active editors (5 edits per month)
- number of admins (interface-admin)
- number of bureaucrats
- number of property creators
- number of items per active editor
- The presented information should be ready for "copy & paste" to be used in the respective reports.
- For the full report we'd also like the % change from the previous year and all data from the previous quarter
Assignee Planning
Information below this point is filled out by WMDE Analytics and specifically the assignee of this task.
Sub Tasks
Full breakdown of the steps to complete this task:
- Define all tables needed and add them to the ticket
- Make baseline connections to the given tables above and explore
- Including this as this is an initial task for me
- Define the following:
- Methods used:
- Total sum in the period
- Percentage change (yearly)
- Conditions for each group:
- How do we do retroactively check if the user had a certain permission at a given time? snapshot?
- Prior data exists in a preaggregated form, so we'd use that
- Aside from this the logs can be used to determine user rights/right changes at a given time
- Are admins and bureaucrats also active? When do they not count anymore?
- As of now they are still counted as an admin, but we might want to think about this for product metrics
- How to exactly differentiate property creations and the edits we want?
- Are we including edits to user pages?
- As of now including everything
- How to define an active editor over a quarter? Do we switch to 5 edits/30 days?
- We need to look into this further, but right now it's on a given day based on 5 edits over the last 30 days
- Are there conditions to consider for when an item is counted? Like for Wikipedia we might not count stubs?
- For Wikidata there's no distinction at all, and for Wikipedia we'd also just count them
- How do we do retroactively check if the user had a certain permission at a given time? snapshot?
- Time frame for aggregation:
- Current time that notebook is ran
- When would the notebook normally be ran by in case we're doing a cron job (for later planning)?
- Ideality would be self service at time of stakeholders choosing
- Current consideration is Grafana depending on data access constraints
- Methods used:
- Define output and check with stakeholders (@Manuel)
- Aggregates for each group/value at the end of the time frame in a dataframe
- Rows:
- One row per quarter
- Columns:
- One column per group/value
- After each column, one column for yearly retrospective percentage change
- Rows:
- Are we including plots?
- For right now no
- Aggregates for each group/value at the end of the time frame in a dataframe
- Report the numbers for June 2023, May 2023 and if possible April 2023
- June 2023: 13213
- June 2023 (filtering out an ip): 7997
- May 2023: 141
- April 2023 (from the 11th): 87
Data to be used
See Analytics/Data_Lake for the breakdown of the data lake databases and tables.
The following tables will be referenced in this task:
- We need a WHERE clause of wiki_db = 'wikidatawiki'
- For user permissions (admins, bureaucrats): wmf_raw.mediawiki_user_groups
- For edit data and property creations: wmf.mediawiki_history
- For the total number of items: wmf_raw.mediawiki_page
- For normal edits we're talking about an action of event_entity of revision and event_type of create
Notes and Questions
Things that came up during the completion of this task, questions to be answered and follow up tasks:
- The active users stats sql should likely include an upper bound so that calculations are not dependent on the time that the script is ran in the morning
- Create and link a Phabricator task to fix this
- Open question of "quarterly aggregation" vs. "last month of the quarter"
- For now we're doing last month of the quarter
- @AndrewTavis_WMDE is going to check if data is still available for April
- We'll report the number of unique user_agents for May and June at least just for an overview
- Focus will then switch to getting this automated with Airflow
- This will get stated after T340648: [Airflow] Setup Airflow instance for WMDE
- We'll use a running 30 days from that point