@Manuel, as mentioned on Mattermost, there doesn't currently seem to be a good way of deriving agent_type for those tables that don't have it. We can get spider through a UDF, but automated isn't possible at the moment. This makes the final division between desktop and API users pretty difficult. An idea I had was checking uri_path = '/w/api.php'. Some information breakdowns for that follow, with the queries being generally the same as those found directly above.
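As a rough illustration of the uri_path idea, here is a minimal Python sketch. The function name and the returned labels are hypothetical, and it assumes access_method holds values like "desktop" and "mobile web"; it isn't the query that was actually run.

```python
API_PATH = "/w/api.php"


def split_desktop_and_api(access_method: str, uri_path: str) -> str:
    """Split "desktop" requests into "api" and "desktop" by uri_path.

    Hypothetical helper: any non-desktop access_method is returned unchanged.
    """
    if access_method != "desktop":
        return access_method
    # Desktop requests that hit the Action API endpoint are treated as API use.
    return "api" if uri_path == API_PATH else "desktop"
```

The same check could be written as a CASE expression in the query itself; the Python form is just easier to test on a few sample rows.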
Hey there, @nshahquinn-wmf! One week on after starting the new computer setup and I've been using this as a means to test it all out 🚀 Has been fun! So here's where we're at:
Some general notes on this: as we're working from wmf.pageview_actor and wmf_raw.mediawiki_private_cu_changes, there might be a way to leverage their expanded agent_type field such that for at least the former we have automated as an option within agent_type :) So for views we can do a more distinct division into mobile, desktop and API users by including agent_type in it. For edits it's a bit more difficult, but maybe there's a way to add in agent_type via a UDF on another field.
Wed, Sep 20
Here's the above referer breakdown for mobile for reference, with the big difference being that we have dramatically fewer "-" requests (good for thinking that these are APIs) and a lot of extension requests:
@Manuel, re the question of what kind of referer values we have for "desktop" requests, the following query was used to get the results below.
Also, you are explicitly filtering for is_pageview = True
Fri, Sep 1
Thu, Aug 31
As far as totals for this task are concerned, @Manuel, what I'm getting is the following:
@Manuel, I think we can throw out the idea of creating an edits subset of webrequests, sadly :( The following would be where we'd find the various actions that we'd need to collect to define as edits fully: https://www.wikidata.org/w/api.php. We know at the very least that we'd want uri_query LIKE '?action=edit%' and uri_query LIKE '?action=wbsetclaim%', but figuring out what else needs to be added seems to be prohibitive given the discrepancy:
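For reference, the two known action filters could be sketched in Python as below. This is an illustration only: the EDIT_ACTIONS set holds just the two actions named above and is known to be incomplete, and the function name is hypothetical. Unlike the LIKE patterns, it matches the action parameter wherever it appears in the query string, and matches it exactly rather than by prefix.

```python
from urllib.parse import parse_qs

# Only the two actions named in the comment; the full list is the open question.
EDIT_ACTIONS = {"edit", "wbsetclaim"}


def is_edit_request(uri_query: str) -> bool:
    """Return True if the query string carries one of the known edit actions."""
    # uri_query values start with "?", which parse_qs does not expect.
    params = parse_qs(uri_query.lstrip("?"))
    return any(action in EDIT_ACTIONS for action in params.get("action", []))
```

Every additional write action from the API documentation would need to be added to the set by hand, which is the part that looks prohibitive.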
Here are the tables that break down the device_family values, @Manuel :) As before:
Wed, Aug 30
Here are the values for Tizen as well, @Manuel:
And here are the finalized heuristics (@JAllemandou, tagging you as well). The following query is saved as a temporary view as df_requests_subset:
We'd talked "Tizen" a bit this morning, @Manuel, but let's not focus on it. Did a bit of Wikipedia research and since 2021 it's mostly in use in Samsung Smart TVs. That leaves us with Android and iOS for the predominant mobile os_family values, and if we want to include a Linux based one it'd be KaiOS.
Here are the answers to the three questions we had from the daily, @Manuel:
Aug 21 2023
Thanks for the efforts on this, @Stevemunene! Please let us know if there's anything needed on our end :)
Aug 11 2023
Or am I just jumping to the question in the description and we just want to figure out mobile edits and views over the period?
I guess I'm confused what the goal here is then 🤔 As I understand it we're looking for users who are using the normal desktop UI on a mobile device. For the wmf.webrequest table we'd then use:
My understanding of access_method is that it's only related to user_agent for mobile apps:
I've already checked and there are differences between a python-user-agents derived device via user_agents.parse(ua_value).is_mobile and the access_method. Specifically we are getting users where the device from .is_mobile is mobile, but the access method is desktop implying that they're not using a m.URL.
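A minimal sketch of that comparison, assuming python-user-agents is installed and that access_method is the plain string from wmf.webrequest. Both function names are hypothetical; the comparison logic is kept separate from the parsing so it can be checked without the library.

```python
def is_mobile_on_desktop_site(device_is_mobile: bool, access_method: str) -> bool:
    """True when the parsed device is mobile but the access method is desktop."""
    return device_is_mobile and access_method == "desktop"


def flag_request(user_agent: str, access_method: str) -> bool:
    """Parse the user agent string and apply the check above."""
    # Third-party dependency (python-user-agents), imported lazily.
    from user_agents import parse

    return is_mobile_on_desktop_site(parse(user_agent).is_mobile, access_method)
```

In Spark, flag_request would have to be wrapped as a UDF over the user_agent column, with the library available on the executors.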
Aug 10 2023
@Manuel, I've been using python-user-agents and so far it's going ok in so far as the .is_mobile method seems to be working well. Are we trying then the combination of user_agent_var.is_mobile = True and access_method = "desktop" via the access_method column from wmf.webrequest? For this column:
@Manuel, just a note on using the wmf.webrequest table: now that I'm using Spark a bit more and can see the number of steps, it's definitely worth it to try to restrict the data based on the year and month as we've been doing. Selecting 30 days over two months takes dramatically longer than if we set the month column in the WHERE clause - roughly three times longer based on number of steps.
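To make the restriction concrete, here's a small sketch of a helper that turns a date range into an explicit predicate on the partition columns. It assumes year, month and day are integer partition columns on the table (day is my addition to the year and month mentioned above), and the helper name is hypothetical.

```python
from datetime import date, timedelta
from itertools import groupby


def partition_predicate(start: date, days: int) -> str:
    """Build a WHERE fragment covering `days` consecutive days from `start`."""
    dates = [start + timedelta(days=i) for i in range(days)]
    clauses = []
    # The dates are already sorted, so groupby yields one group per month.
    for (year, month), group in groupby(dates, key=lambda d: (d.year, d.month)):
        month_days = [d.day for d in group]
        clauses.append(
            f"(year = {year} AND month = {month} "
            f"AND day BETWEEN {month_days[0]} AND {month_days[-1]})"
        )
    return " OR ".join(clauses)
```

A 30-day window inside a single month yields one clause, while a window spanning two months yields two clauses joined with OR, so the extra partitions being scanned are visible in the query text.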
Thanks a lot for this, @dcausse! The reasoning of a single column with relatively few rows for caching makes a lot of sense. I think that the problems I faced were from trying to cache df_wikidata_rdf. Just ran things through again with just sa_and_sasc_ids cached and it did seem to run through a bit better. With that being said, I did end up running the notebook multiple times and saving the outputs to variables as I went along before then restarting the kernel.
Aug 9 2023
Minor question on this, @dcausse: why aren't we caching df_wikidata_rdf and sa_and_sasc_ids above? My assumption is that we should given that we're using them in multiple later calculations, but then I just tried to cache them and then a calculation that normally would finish then lost resources and stalled with three separate stages running. Did you explicitly choose not to cache them, and if so why not? :)
Is what we were thinking too, @dcausse :) I'm realizing that where I had the .distinct() was incorrect though. Edit: never mind the prior comment. Not sure why it wasn't working within the parentheses at first...
Aug 8 2023
@dcausse, do you have an idea why the direct triples for SAs and their subclasses and the direct triples for non-SAs and subclasses aren't adding up to the same amount? It was working out for the last notebook, as you saw. The only major change I've made is that it's now .where(col("object").isin(sa_and_sasc_qids)) rather than the equality, where sa_and_sasc_qids is the hard-coded QIDs from above including scholarly article's (I was getting some papers back when directly querying subclasses).
Looking at this further, it seems that AKhatun focussed more on scholarly articles and was just listing subclasses in the report itself as examples. Reference for this is this part of the report.
Notes from the call that @dcausse and I had:
Aug 7 2023
The above LATERAL VIEW EXPLODE method came up with 40,529,640 scholarly articles via the claims, @dcausse. I think that that's close enough to the amount from discovery.wikibase_rdf that we don't need to dig more into expanding the WHERE clause :) Thanks again for your help!
Aug 5 2023
Thanks, @dcausse! Really appreciate the detailed explanation :) I totally agree that serializing the full claim would be problematic, and that your method is much better. Need a bit more practice with lateral view explode so that it becomes more natural for me to use. I'll implement the above at the start of the week and see if it works properly 😊
Aug 4 2023
The UDF is up and running now, but we may need to discuss my limits as running what I'd assume to be a fairly simple UDF over wmf.wikidata_entity wasn't finishing (@dcausse, @JAllemandou). Even if it does finish, I'm fairly regularly getting:
As far as the Spark UDF issues are concerned, let me just sketch out the process here as it's in a separate notebook from the main one just linked. The general goal in this is to explore using UDFs to easily derive data via the claims column of wmf.wikidata_entity. We can easily find out how many scholarly articles we have via the discovery.wikibase_rdf table as in the example notebook I linked on people.wikimedia.org, but then the goal was to do something similar via wmf.wikidata_entity.claims so I can have a claims exploration example to work from later :)
@nshahquinn-wmf, just FYI I do have this on my radar. Sorry it's taking so long... I'm in the process of waiting for a new computer and then I'll have my full VS Code setup up and running. I'll update you when I start to work on this :)
@dcausse, just finished the people.wikimedia.org upload. An HTML for the notebook can be found at:
Aug 3 2023
@dcausse, glad to help :) Maybe doing a call to check all of this might make sense? If you have availability tomorrow I'm basically free, or if not then next week for say 25 min sometime?
I'll write some more details of the problems I'm facing tomorrow 😊
Aggregations have been added to the task description :) We'll upload the work for this to GitHub or GitLab once we have our repo set up, and I'd be happy to do a call if someone besides @Manuel wants an explanation :) Also happy to put the notebook on people.wikimedia.org for an interim presentation of the work.
Will check the following with @Manuel later today, but here are the metrics I'm getting from the 20230717 dated data from discovery.wikibase_rdf (note that I don't have access to later ones given permission restrictions that are documented in T342416):
Aug 2 2023
Thank you for the information here, @dcausse! Nice to have all this in one place where I can reference it when I need a recap 😊😊
Checking another concept with you all:
Aug 1 2023
Great to hear, @mpopov! I guess the distinction between HiveQL queries run with wmfdata.spark.run for scientists/analysts vs. dot notation for software engineering makes sense. Nice to hear that I'll be at home writing some Hive :)
Jul 31 2023
Good to know, this is definitely a lot lower than I expected, thanks!
Also for all's information, the number of duplicate triple values in discovery.wikibase_rdf is very small, as seen in the following snippet/output:
@dcausse, a general point on my end is that when I'm trying to run the code that you sent along via an HTML on people.wikimedia.org I'm getting the following as an output of Spark runs repeated over and over again:
Jul 27 2023
@Manuel, looking into cases where Q13442814 (scholarly article) is either the subject or object of a triple, it looks like we can verify that the relationships are only being saved in one way as they should be:
Jul 25 2023
Jul 24 2023
@Manuel, could you give a bit more context to "# of Items" above? Is this all distinct Wikidata entities (QIDs and PIDs), or just QIDs? The wmf.wikidata_entity table for this only has those two entity types in it, so if we're looking for other parts of the graph we'll need to look in other places.
- Let's discuss what we can do to make the metric more robust and reliable (e.g. exclude browser user agents)
@Manuel, based on the query provided in https://w.wiki/77FU (I took out the French comment at the end and regenerated the short link), it looks like the ontology is relatively clean if we keep it to the base subclasses with wdt:P279, but not if we go beyond that to the full graph with wdt:P279*. A summary:
@Manuel, moved this to needs product input as I think that we have everything that we could map out (within reason). Let me know how you'd like to prioritize things from here :)
Jul 21 2023
Updated the totals given the most recent dump to test my connection to it in relation to T342416. As expected, no major changes in terms of percentages :)
Thanks for writing, @tfmorris! :)
Jul 20 2023
Great, @Manuel! Let me know what you want to do for the documentation of this. Happy to set up a repo for us on GitHub in the coming days if that would help :)