I need to provide a full spec so @mpopov can write it.
It's become clear that there's interest among the Product Analytics team at least, and it doesn't seem worth looking for other users at this point while we're still ironing out the kinks.
I sat down with @Tbayer and it looks like it's now working for him!
Don't we have historic data already? Or is this specific to a particular metric, like pageviews, which only seems to go back to July 2015?
Wed, Dec 12
Tue, Dec 11
This data is now flowing into Hive:
select to_date(dt) as date, count(*) as events from editattemptstep where year = 2018 and month = 12 group by to_date(dt)
Sat, Dec 8
Fri, Dec 7
@Volans should this really be world-editable? 🤔
Thu, Dec 6
Tue, Dec 4
Sat, Dec 1
Fri, Nov 30
Oh, I see, it's because his username was in the task description.
Sorry @phuedx, I don't know why you keep getting resubscribed.
@kzimmerman also pulling into Next Up as a blocker for November's movement metrics.
Thu, Nov 29
Tue, Nov 27
@JCuriel, this is done. The results are in the table below; let me know if you notice any issues.
Thu, Nov 22
Wed, Nov 21
@kzimmerman this is the task that I used as an example in our meeting today. As discussed, it's quite small and has clear value, so I'm auto-accepting it :)
@nettrom_WMF, should we plan to pair on this?
Mon, Nov 19
The last time I did these calculations, I wrapped them up in a notebook, so they'll be very easy to rerun.
I fixed this in this commit!
@nettrom_WMF mentioned two current known issues with the data:
- client side mobile events were not submitted until last week's train (the one ending on 15 November)
- there's a continuing issue with mobile init events not being submitted (fix planned for next week).
Sat, Nov 17
Fri, Nov 16
Thanks for working on this, @JAllemandou! I've just started running into this on a lot of queries, including ones that previously worked fine. The error message is "OperationalError: Error while processing statement: FAILED: Execution Error, return code 134 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask".
Sorry to weigh in so late, but why can't we simply copy the event_user_text_historical to event_user_text for IP editors at the end of the reconstruction process? It doesn't seem like it would be hard to implement, and it's annoying and hard to learn that the way to get the canonical name for any user is not event_user_text like you'd expect, but rather coalesce(event_user_text, event_user_text_historical).
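To illustrate the lookup described above, here is a minimal Python sketch of what coalesce(event_user_text, event_user_text_historical) does. The function name canonical_user_text is hypothetical, and the sketch assumes (as the comment implies, but does not confirm) that event_user_text is null for IP editors while event_user_text_historical holds the name.

```python
from typing import Optional


def canonical_user_text(
    event_user_text: Optional[str],
    event_user_text_historical: Optional[str],
) -> Optional[str]:
    """Mimic SQL coalesce(): return the first argument that is not null."""
    # Hypothetical helper: prefer the current name, fall back to the historical one.
    if event_user_text is not None:
        return event_user_text
    return event_user_text_historical
```

With this, canonical_user_text(None, "192.0.2.1") falls back to the historical value, while a non-null event_user_text always wins.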
Thu, Nov 15
Wed, Nov 14
I've now uploaded the survey lists to Qualtrics, so we can call this done.
Nov 13 2018
The primary desire here was a gated home for defining things like Global North/Global South, and that exists now that I've created the wikimedia-research/canonical-data repo.
Now that the data is easily available in Hive, I think this is done!
Nov 10 2018
Actually, I'll go ahead and upload the CSV file into the Data Lake so we can join it to the other tables. That may be just as good as writing a UDF: at least you don't have to add a jar and create a function at the start of every query.
The most challenging part of this is coming up with human-readable project names, and I've already done that as part of the wiki segmentation work. I've just started wrapping that up in a slightly more general form so it can go in the canonical-data repo, although it's not high priority, so I don't know when I'll finish.
@Tbayer, I've created a CSV file with country names, ISO codes, Global North/South classification, and MaxMind continents, tracked in a new wikimedia-research/canonical-data repo. It contains all the countries which appear in projectview_hourly, and I've carefully checked it to make sure the Global North/South classifications match the ones at meta:List of countries by regional classification.
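As a rough sketch of how a file like this could be used for lookups and joins, the Python below loads a CSV into a dictionary keyed by ISO code and aggregates counts by region. The column names (iso_code, name, economic_region, maxmind_continent), the two sample rows, and the view counts are all made up for illustration; the real file's layout may differ.

```python
import csv
import io

# Illustrative rows only: column names and contents are assumptions,
# not a copy of the real canonical-data file.
SAMPLE_CSV = """iso_code,name,economic_region,maxmind_continent
US,United States,Global North,North America
IN,India,Global South,Asia
"""


def load_countries(csv_text):
    """Parse the CSV text into a dict keyed by ISO country code."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["iso_code"]: row for row in reader}


countries = load_countries(SAMPLE_CSV)

# Hypothetical per-country counts, standing in for a table keyed by country code.
views = [("US", 120), ("IN", 80)]

# The "join": look up each country's region and sum the counts per region.
by_region = {}
for code, count in views:
    region = countries[code]["economic_region"]
    by_region[region] = by_region.get(region, 0) + count
```

Keeping the classification in one shared file means every analysis resolves a country code to the same region.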
Nov 9 2018
The list of Global North countries I've been using is:
( "AD", "AL", "AT", "AX", "BA", "BE", "BG", "CH", "CY", "CZ", "DE", "DK", "EE", "ES", "FI", "FO", "FR", "FX", "GB", "GG", "GI", "GL", "GR", "HR", "HU", "IE", "IL", "IM", "IS", "IT", "JE", "LI", "LU", "LV", "MC", "MD", "ME", "MK", "MT", "NL", "NO", "PL", "PT", "RO", "RS", "RU", "SE", "SI", "SJ", "SK", "SM", "TR", "VA", "AU", "CA", "HK", "MO", "NZ", "JP", "SG", "KR", "TW", "US" )
Nov 5 2018
Nov 3 2018
@Tbayer and I discussed this yesterday and came to the following conclusions:
- Unknown countries account for roughly:
  - 0.3% of pageviews
  - 0.9% of editors (technically, of (wiki, country, editor) entities)
  - 27% of edits (my estimate in the description was too high)
- We should investigate the small group of editors that are producing all these unknown-country edits
- Starting with October board metrics (T206895, etc.), we will treat unknown countries as a third region alongside the Global North and the Global South. I will produce some shared code to help with this.
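
The three-region split above could be sketched in Python roughly as follows. This is not the shared code mentioned in the last point, only an illustration. The Global North codes are the ones quoted elsewhere in this feed; the function name classify_region and the set of values treated as "unknown" ("--", an empty string, or a missing value) are assumptions.

```python
# Global North ISO codes, as quoted elsewhere in this feed.
GLOBAL_NORTH = frozenset(
    """AD AL AT AX BA BE BG CH CY CZ DE DK EE ES FI FO FR FX GB GG GI GL GR
    HR HU IE IL IM IS IT JE LI LU LV MC MD ME MK MT NL NO PL PT RO RS RU SE
    SI SJ SK SM TR VA AU CA HK MO NZ JP SG KR TW US""".split()
)

# Assumption: how an unknown country shows up in the data is not stated
# in the discussion; these placeholder values are a guess.
UNKNOWN_CODES = frozenset({"--", ""})


def classify_region(country_code):
    """Return 'Global North', 'Global South', or 'Unknown' for an ISO code."""
    if country_code is None:
        return "Unknown"
    code = country_code.strip().upper()
    if code in UNKNOWN_CODES:
        return "Unknown"
    if code in GLOBAL_NORTH:
        return "Global North"
    # Every known country outside the Global North list counts as Global South.
    return "Global South"
```

Checking for unknown values first keeps them from silently falling into the Global South bucket.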