Note that this Phabricator account is solely for my work at Wikimedia Deutschland. My private account is AndrewTavis.
User Details
- User Since
- Apr 14 2023, 9:33 AM (7 w, 6 d)
- Availability
- Available
- IRC Nick
- andrewtavis-wmde
- LDAP User
- Andrew McAllister (WMDE)
- MediaWiki User
- Andrew McAllister (WMDE) [ Global Accounts ]
Tue, Jun 6
Moved this to Needs review as I think we have a fairly good idea of what we'll be using notebooks for, @Manuel. For reporting we can check with WMF about the tools they use (Superset, public Superset or Grafana), so we wouldn't be using notebooks for that purpose. I'd say that setting them up for reporting would be prohibitive, as the documentation they'd provide could similarly be included in readme files for Python scripts, which would be easier to maintain with Airflow.
Fri, Jun 2
Another thing to consider, @Manuel, is that Quarry is apparently being migrated to a wmcloud Superset instance, so this could be a place to put reporting metrics that we want to be open to the public :)
Notes from meeting with @Manuel:
Thu, Jun 1
Thank you, @Manuel!
- Cron jobs might be possible in either case
- This is something that product analytics at WMF does
- GitLab issue about JupyterLab operators
- We could also use Airflow
- Generally the suggestion from WMF was to look into integrating Superset a bit more
- Graphite is something that people would like to sunset
Note that the above notebook contains only aggregates and the queries used to get them. In that regard, I felt it was OK to share it via Phabricator as documentation for this task.
The following is the current rendition of the quarterly reporting notebook. As the general plan is to shift to Grafana, I'd say that we're fine for now with this file. We can chat about this tomorrow, @Manuel!
Wed, May 31
Note for a task to be done:
- I need to change the active editors query to be based only on an end date, with the start date being 30 days before it. I'm unsure exactly how to do this in Presto (a sketch follows below).
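A minimal sketch of the date arithmetic in Presto, assuming a parameterized end date; the table and column names below are hypothetical placeholders, not the actual active editors query:

```sql
-- Hedged sketch: derive the 30-day window from the end date alone in Presto.
-- some_edits_table, user_name and edit_date are hypothetical placeholders.
SELECT COUNT(DISTINCT user_name) AS active_editors
FROM some_edits_table
WHERE edit_date >  date_add('day', -30, DATE '2023-05-31') -- start = end - 30 days
  AND edit_date <= DATE '2023-05-31'                       -- the chosen end date
```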
Tue, May 30
And as far as the time frame being less than months: I'd say it would make sense to do either months or days, since if we're open to a lower granularity, we should start saving the data in a way that lets us make the jump to days if need be. This would also allow for day-granularity tables on Grafana (see the sketch below).
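One way to picture this: store additive metrics (e.g. edit counts) per day, and roll them up to months when needed. A hedged sketch with placeholder names; note that a non-additive metric like distinct active editors can't be summed this way and would need to be recomputed over the longer window:

```sql
-- Hedged sketch: day-granularity rows roll up to months, not vice versa.
-- daily_metrics, metric_day and edit_count are hypothetical placeholders.
SELECT date_trunc('month', metric_day) AS metric_month,
       SUM(edit_count)                 AS monthly_edits
FROM daily_metrics
GROUP BY date_trunc('month', metric_day)
```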
@Manuel, the notebook's generally all ready. I'm not getting results from one year ago for pages or admins in the current tables (I'm assuming this is the case for other rights metrics as well). We can check the output tomorrow, and in the meantime I'll do some more deep dives into the data to see if I can figure out where more long-term data is kept :) Looks like we have a fair number of people in SWE Data now (🙌🙌), so we can also check in there after a discussion tomorrow.
Fri, May 26
Ultimately what needs to happen for the page data query:
@Manuel, current query is:
Thu, May 25
The numbers from above were using wmf_raw.mediawiki_page. What are you using to query? I can't access wikidatawiki.page via Presto.
We're back to the 103,612,928 from before, then :) Let me know if there are any other ways of subsetting that you can think of (a sketch of the current filters follows below).
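For reference, a hedged sketch of the subsetting this thread converged on; the exact query isn't shown here, and the snapshot partition value is an assumption:

```sql
-- Hedged sketch of the filters discussed in this thread, not the exact query.
-- The snapshot value is a placeholder; wmf_raw.mediawiki_page is assumed to be
-- partitioned by snapshot and wiki_db.
SELECT COUNT(*) AS wikidata_pages
FROM wmf_raw.mediawiki_page
WHERE snapshot = '2023-04'
  AND wiki_db = 'wikidatawiki'
  AND page_namespace IN (0, 120, 146) -- main content, properties, lexemes
  AND page_is_redirect = 0
```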
Thanks for the tip! Ya I was just using that at the start as I wasn't aware of namespaces :)
Accounting for namespaces brings us back up, as we need namespace 0 for main content, 120 for properties, and 146 for lexemes.
We seem to be on the right path! :)
Generally in terms of content, though, we want wikibase-items, wikibase-properties and wikibase-lexemes for this, correct?
I'm down to 103,612,928 with the value on the stats page again being 103,384,803 :)
And beyond that we'd also want page_is_redirect = 0?
So in looking at namespaces, you mean we'd also subset by namespace = 0 for the main namespace?
Current query:
Even with just wikibase-item we're getting 106,507,044 🤔 So apparently there's some other kind of subsetting that needs to happen.
Hmmmm, and it's still overestimating given a subset on the above-mentioned three 🤔 Currently getting 107,609,517 instead of 103,384,803.
@Manuel, I'm getting the baseline query of all the items on Wikidata working well, but the issue is that I'm getting too many results compared to the Wikidata statistics page. I think that the results from mediawiki_page need to be subsetted based on page_content_model, the general idea being that I'd only query pages with the following content models:
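The list attached to this comment isn't shown here; based on the rest of the thread, a sketch of the content-model filter would look something like the following (whether these three models are the exact original list is an assumption):

```sql
-- Hedged sketch of the content-model subsetting idea; the model list is
-- inferred from the thread above, and the snapshot value is a placeholder.
SELECT COUNT(*) AS wikidata_content_pages
FROM wmf_raw.mediawiki_page
WHERE snapshot = '2023-04'
  AND wiki_db = 'wikidatawiki'
  AND page_content_model IN ('wikibase-item', 'wikibase-property', 'wikibase-lexeme')
```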
Wed, May 24
Active editor notes via conversation with @Manuel: