User Details
- User Since
- Sep 30 2019, 9:17 PM (218 w, 5 d)
- Availability
- Available
- IRC Nick
- lexnasser
- LDAP User
- Lex Nasser
- MediaWiki User
- Lexnasser
Apr 19 2021
@Ottomata Thanks for pointing out the huge sizes of those tables. I was mainly keeping them for reference, but it seems that any future utility of those tables is dwarfed by their sizes. Feel free to delete all of the ones you mentioned.
Apr 16 2021
@elukey Oops, I totally missed checking HDFS and Hive.
In stat1007, the directories I want to preserve are /home/lexnasser/lexnasser-stat1007 and /home/lexnasser/notebook1003.
In aqs-test1001, the directory I want to preserve is /home/lexnasser/lexnasser-aqs-test1001.
Apr 15 2021
Passing this task over to Francisco to implement this data in Wikistats.
Apr 11 2021
Following up with some new developments for this task:
Mar 25 2021
Hi @Yair_rand, a pageviews top-per-country AQS endpoint was just released (docs). Does this fulfill your intended use case?
Per the parent task, the pageviews/top-per-country endpoint is now public! Take a look at that parent task for relevant info.
The pageviews/top-per-country endpoint is now public! Take a look at the documentation here. You can query data starting from January 1, 2021, and examples can be found here. Note that, at the moment, the endpoint has a stability of 'experimental', meaning that the endpoint can change in incompatible ways at any time, without incrementing the API version. However, I don't expect this to occur.
Feb 26 2021
Just finished fixing up the Hive query for the Oozie job to load the data into Cassandra for the top per-country AQS pageviews endpoint.
Feb 19 2021
@JAllemandou Thanks for finding the hive.cbo.enable option! That fixed the HiveRelDecorrelator issue, but now I'm getting another error:
Error: Error while compiling statement: FAILED: SemanticException [Error 10250]: Line 23:8 Invalid SubQuery expression '1': SubQuery can contain only 1 item in Select List. (state=42000,code=10250) org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10250]: Line 23:8 Invalid SubQuery expression '1': SubQuery can contain only 1 item in Select List.
Here's the full log: https://hue.wikimedia.org/jobbrowser/apps#!id=job_1612875249838_48256
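For reference, Hive raises Error 10250 whenever a subquery used as a predicate projects more than one item in its select list; whether that's the actual cause here would need checking against the query in the log above. A minimal, hypothetical HiveQL illustration (table and column names are made up, not taken from the actual Oozie job):

  -- Shape of query that triggers SemanticException [Error 10250]:
  SELECT page_title, view_count
  FROM pageviews_by_country
  WHERE country_code IN (
      SELECT country_code, access_method   -- two items in the subquery select list
      FROM allowed_countries
  );

  -- Rewriting the subquery so it returns exactly one column avoids the error:
  SELECT page_title, view_count
  FROM pageviews_by_country
  WHERE country_code IN (
      SELECT country_code
      FROM allowed_countries
  );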
When testing my AQS pageviews/per-country Oozie job, I'm running into an issue that didn't appear when I tested it a few weeks ago.
Feb 5 2021
Thanks so much for pointing this out -- I personally overlooked this.
Just submitted a patch set to address the first part of this task, enforcing a maximum time interval of 1 year for the pageviews/per-article and mediarequests/per-file endpoints.
Feb 2 2021
@Milimetric Thanks for clarifying! Should have that completed soon.
Feb 1 2021
Took a look at this over the weekend, and found that restricting the time interval for any given endpoint is very straightforward to implement.
Dec 21 2020
Sorry for the radio silence! I just finished up final exams, so I'm now freed up to make more progress on this.
Dec 2 2020
Hey everyone, @JFishback_WMF has completed his risk analysis of the working API design, and, from a privacy perspective, everything's a go! Thanks to all of you for bringing up various potential privacy threats - it looks like we covered all the main ones.
Nov 25 2020
Thanks to both of you for going into deeper detail about page_title vs. page_id. I'm also leaning towards using page_title because, in addition to being consistent with the other endpoints, it's simpler to implement as you both mentioned. I'm still open to considering this issue further, especially for the potential monthly data as Isaac brought up.
Nov 9 2020
Thanks for everyone's input!
Nov 5 2020
Hey everyone, I just created a table lex.pageview_ranks_with_unique, available on Hive and Superset, that holds the exploratory data (for one day) that I've been analyzing. I created this to make it easier for everyone to examine the data, and see how different thresholds (including unique pageview thresholds) would affect the data returned for different countries.
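For anyone who wants to poke at it, a query along these lines is the kind of threshold exploration the table is meant to support (the column names below are hypothetical; check the table's actual schema in Hive):

  -- Hypothetical column names; substitute the actual schema of lex.pageview_ranks_with_unique
  SELECT country,
         COUNT(*) AS articles_remaining
  FROM lex.pageview_ranks_with_unique
  WHERE total_views  >= 100    -- candidate total-pageview threshold
    AND unique_views >= 50     -- candidate unique-pageview threshold
  GROUP BY country
  ORDER BY articles_remaining ASC;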
Oct 31 2020
@JAllemandou Thanks for your thoughts and for the all-projects suggestion! I'm often unaware of those types of existing naming conventions.
Oct 30 2020
Hey everyone, I spent the last couple of days compiling data for less edge-casey countries that are relatively multilingual (India and Belgium). The metrics are below and my takeaways are at the bottom.
Oct 21 2020
@Amire80 I definitely agree that San Marino is an edge case. Do you think there are any other metrics that could help gauge what would be the best way to exclude articles (total views vs unique views vs something else)? Or do you just in general prefer one way over the others?
Oct 20 2020
@Isaac thanks for sharing these!
Oct 14 2020
Just wanted to follow up to say that I'd love for everyone to take a look at the design doc and make suggestions as you see fit.
Oct 9 2020
Yep, that's the correct email. I also confirm that I'm now able to access Turnilo and Stat1007. Thanks for your help!
Oct 8 2020
Just finished a first draft of the design doc for this project! You can find it here: https://docs.google.com/document/d/19HbdPvSHPUF9n4thFOlck0dIgvZvg50K3mcL-guoViY/edit?usp=sharing
In terms of the privacy considerations for countries with low pageview counts, I found that the most-viewed articles by project endpoint reports articles with very low pageview counts. See: https://wikimedia.org/api/rest_v1/metrics/pageviews/top/kl.wikipedia/all-access/2019/10/01. It reports 374 different articles with a single pageview, and I'd assume that the vast majority of those originate from Greenland, given that it is Greenlandic-language Wikipedia. With the relative similarity between these two projects, I was wondering what led to the decision to not perform those privacy transformations on the data, and if those same reasons would be relevant to this case.
May 14 2020
I think that the following should be saved:
- stat1007: api, byc, refinery
- notebook1003: Search_Engine_Testing.ipynb, Geoeditors.ipynb
- hive: lex.webrequest_subset, lex.geoeditors_public_monthly
May 1 2020
Handing the remainder of this task off to Dan.
Apr 27 2020
Did some final verification of pageviews for characters above 0xFFFF, and it looks like everything's working! Marking as resolved.
Apr 22 2020
Success! Will do some more testing to ensure that more cases are valid.
Apr 16 2020
Thanks for the suggestions!
Just submitted a patch with the fix and some new tests: https://gerrit.wikimedia.org/r/589383
Apr 13 2020
On @Milimetric's suggestion, I tested all 3 methods against each other to verify their consistency, and found they all behaved the same over the whole Unicode range.
This is the code I'm using: Pattern.compile("^[ %!\"$&'()*,\\-.\\/0-9:;=?@A-Z\\\\^_a-z~\\x{80}-\\x{10FFFF}\\+]+$");
Apr 12 2020
Thanks again for all your feedback!
Apr 11 2020
@elukey I don't see the "Unknown error" message anymore. Nothing in the JS console either.
@elukey I can see https://superset.wikimedia.org/superset/dashboard/73/, but still get the same "Unknown error" for https://superset.wikimedia.org/superset/sqllab.
Apr 10 2020
Thanks everyone for your input! I'm a bit busy right now, but I'll be sure to address each of your points later today.
Just wanted to write everything I figured out in the past 3 days. I would love your feedback!
Jan 13 2020
Deployed with the help of @Milimetric! Hope you find these changes helpful!
One last thing to resolve: There are a few Google Translate referers with the parameter prev=/search... (ex. prev=/search%3Fq%3DBARON%2BDE%2BHIRSCH%26hl%3Del%26rlz%3D1T4GGLL_elGR398GR398%26prmd%3Divns). Should these also be classified under the Google Translate search engine purview?
Jan 6 2020
Hi @Isaac, I got to final testing and found an issue.
Jan 3 2020
- Quick Status Update
Dec 13 2019
Thanks for the question! Nuria knows the intricacies of the data source better than I do, but I believe the other main factor limiting the amount of data from the text cache is that the text data is filtered by is_pageview.
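As a rough sketch of what that filter looks like in practice (field names are from the public webrequest schema; this is a simplification, not the exact query used to build the dataset):

  -- Only rows classified as pageviews are kept from the text cache
  SELECT uri_host, uri_path
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2019 AND month = 12 AND day = 1   -- partition predicate; illustrative date
    AND is_pageview = TRUE;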
Dec 9 2019
Started looking into the referer class, and had a few questions:
Dec 5 2019
The data has been released!
Dec 2 2019
Updated Wikitech (LINK) once again with a description about the text data. Let me know if you see any last-minute issues.
Nov 21 2019
Checking in again.
Nov 12 2019
I'm not sure if there's a public-facing way to check the frequency of submit queries. Will have to defer to @Nuria about that.
Is this for a text_cache or for upload_cache (like cp5006)? I expect that only text caches (like cp5008) would see submit queries.
The only difference would be the save column, which is 1 if uri_query %like% "action=submit" and 0 otherwise.
The upload(.wikimedia.org) uri_query field does not contain an action=submit parameter for any entry.
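In query terms, the save flag described above would be derived along these lines (a sketch under the definition given here, not the exact production query):

  -- save = 1 when the query string carries action=submit, else 0
  SELECT uri_host,
         CASE WHEN uri_query LIKE '%action=submit%' THEN 1 ELSE 0 END AS save
  FROM wmf.webrequest
  WHERE webrequest_source = 'text';   -- only text caches are expected to see submit requests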
Nov 11 2019
Are we narrowing the query to a single server, e.g., via WHERE x_cache like '%cp3033%'?
Yes. I’m using WHERE x_cache like '%cp5006%'.
Which server are we using? Ideally we'd actually create two datasets, one for a cache_text and one for a cache_upload server, but since the ATS deployment (replacing Varnish) I can't figure out the right x_cache query.
As above, I’m using 5006, which is for images only via upload.wikimedia.org.
I'm afraid that we'll have too much data, as Nuria previously pointed out. The x_cache field is one of the largest; we had it in the last dataset and no researcher / paper (afaik) used it. I think we can drop the x_cache column in the output (but keep it in the where clause).
To confirm, the remaining fields are: relative_unix, hashed_host_path_query, image_type, response_size, time_firstbyte. Is that correct?
How are we limiting the response size? It would be great to cover a longer time period (say, 4 weeks).
I’m not sure what you mean by limiting the response size - I currently have no filters on the response size. I’ll have to consider the longer time period.
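Putting this exchange together, the released dataset query would roughly take the following shape. This is only a conceptual sketch: the staging table name is hypothetical, and the derived columns would actually be computed from wmf.webrequest fields upstream.

  -- x_cache appears only in the WHERE clause (to pin the data to one cache host)
  -- and is dropped from the output columns, per the comment above.
  SELECT relative_unix,
         hashed_host_path_query,
         image_type,
         response_size,
         time_firstbyte
  FROM cache_dataset_staging            -- hypothetical intermediate table with the derived fields
  WHERE x_cache LIKE '%cp5006%';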
Nov 7 2019
Also, I saw in your 2016 dataset request (link) that you wanted a separate query field for a save flag.
Nov 6 2019
Hi @Danielsberger,
I'm almost finished compiling the data. This is what the dataset would look like:
Nov 1 2019
Hi @Danielsberger, thanks for the thorough response. I'm currently reviewing all the different configurations of the features of the dataset and will try to accommodate your needs as much as practical. And yes, the underlying timestamp uses second-granularity.
Oct 28 2019
Hi @Danielsberger, I’m working on compiling this new public dataset for your caching research. I had a few questions that I hope you could answer so that I could get a better understanding of your specific wants and needs for this new release:
Oct 23 2019
Here's another public ED25519 key: AAAAC3NzaC1lZDI1NTE5AAAAIOBTDDmL8isvso6xqOJB5qkk3n8xuM0XxFc1Q34ZnZRj
Oct 18 2019
@RStallman-legalteam Just sent an email.
Oct 16 2019
Approving as the relevant Wikimedia Foundation employee.