Fri, Jan 27
Thu, Jan 26
Cool! Moving to "next 2 weeks" since I will have to wait a week or so for the Jan 2023 mediawiki_history snapshot.
I think the neatest way to deal with the pageview data loss is to wait a few more days until the start of February. Then we can do a Jan 2023 snapshot which will look at data going back to Feb 2022. The data loss ended on 27 Jan, so this will avoid it entirely without the need for any special casing.
Just updated the canonical wiki dataset, which added 4 new wikis.
Wed, Jan 25
@SBisson I'm confident it doesn't need approval. It's just a minor tweak to our instrumentation and doesn't change the scope of our data collection.
Some improvements I could potentially make in this round:
- Fix the content page count to be based on AQS or mediawiki_history so it's actually the value at the snapshot time rather than at query time.
- Add external referrer pageviews proportion
- Add Global South traffic percentage
- Add whether the project uses language variants
Tue, Jan 24
The PR to review is here: https://github.com/wikimedia-research/canonical-data/pull/3.
Wed, Jan 11
Dec 23 2022
This review led me to focus on the idea of detecting interpersonal conflict on-wiki by looking at signals such as mutual reverts. If successful, that could lead to applications like:
- quantifying the incidence of on-wiki conflict for use in high-level metrics and comparison with the number of conflict reports
- early detection and automated alerting of user conflict.
This is stalled; it's not clear when we'll be able to finish it, or who will do it at that point.
This shouldn't be assigned to me; I've never had a concrete plan to work on it.
Apart from being blocked on T316049, Wikistories is still in its infancy, so we should avoid making a major investment like an ETL job. For now, I am manually running and sharing metrics using the Wikistories dashboard.
I've consolidated my ad-hoc reporting into a spreadsheet dashboard [WMF only]. I think that's enough to count as an initial dashboard!
Dec 22 2022
Drafting ongoing in this Google doc [WMF only].
Dec 21 2022
Dec 20 2022
I've filed T325611: Add TikTok's in-app browser to ua-parser library and consolidated a bunch of the information we have about referrers (the large majority of it from @Isaac! 😄) onto Research:Referrer on Meta-Wiki.
Dec 15 2022
Dec 8 2022
Dec 5 2022
Dec 1 2022
Nov 30 2022
I still kind of like this idea, but it would be a significant amount of work for a pretty marginal benefit.
The simpler base environment is definitely real now, and in any case I've created a lot of new stacked environments in the past several months without encountering this issue.
In a conda-analytics environment, pip install -e . works just fine, so there's no need for an install script.
Nov 29 2022
Updated the description to note:
In addition, analytics-mysql is not available on an-test-client1001, which complicates the process of testing Wmfdata.
For the most part, the dependency doesn't matter.
Nov 28 2022
Thanks to T273210, Wmfdata can now recreate Spark sessions within the same notebook, which should make it easy to recover from a crashed Spark session.
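The recovery flow is essentially "stop the dead session, then create a fresh one on request." Here's a minimal, self-contained sketch of that pattern; the class and function names are invented for illustration and are not Wmfdata's actual API:

```python
# Illustrative sketch of the "recreate the session in the same notebook"
# recovery pattern. FakeSparkSession is a stand-in so the sketch runs
# anywhere; none of these names come from Wmfdata itself.

class FakeSparkSession:
    """Minimal stand-in for a Spark session."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

_current_session = None

def get_session(force_new=False):
    """Return the current session, recreating it if forced or if none exists."""
    global _current_session
    if force_new and _current_session is not None:
        _current_session.stop()  # tear down the old (possibly crashed) session
        _current_session = None
    if _current_session is None:
        _current_session = FakeSparkSession()
    return _current_session

# After a crash, a notebook user just asks for a fresh session:
first = get_session()
second = get_session(force_new=True)
print(first is second, first.stopped)  # → False True
```

The point is that nothing outside the notebook needs to restart; the old session object is stopped in place and a new one takes over.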
Nov 23 2022
Cool, thank you @xcollazo! 🎉
Okay, I've merged the documentation improvements and version 2.0.0 changes to main and sent a pre-announcement to several Slack channels and email@example.com.
Merged in PR40.
Nov 21 2022
@xcollazo a month ago, I suggested changing the default source of Conda packages in conda-analytics. Let me re-up this here so you can consider doing this before the migration. For context, I think this would be a minor improvement, so it's fine to ignore if you think it's not worth the effort.
Nov 19 2022
The pull request has been merged!
Nov 17 2022
Nov 16 2022
Nov 15 2022
The removals have been merged. This will stay open until we actually release version 2.0, likely late this week or early next.
I've verified that import pyspark just works in the new conda-analytics environment. Coincidentally, my changes for T273210 have ended up making PySpark available as wmfdata.spark.pyspark. So this is doubly solved.
Soon, we are going to be moving from anaconda-wmf to conda-analytics as the base for new Conda environments (T321088). That will contain Wmfdata-Python 2.0, so we can skip directly to that.