Scope
- How many SPARQL queries are coming from Scholia?
Notes
- Scholia queries should be identifiable via HTTP user agent
Edit: from this comment, Scholia queries generally start with the following comment:
# tool: scholia
Edit: some queries are instead sent with a user agent string of "Scholia", but there are no cases where the comment and the user agent appear together.
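The two identification methods described in these notes can be sketched roughly as follows. This is only an illustration, not the code used for the task: the function name `scholia_method` is hypothetical, and the case-insensitive matching, the stripping of leading whitespace, and the substring match on the user agent are all assumptions.

```python
from typing import Optional

# Marker comment that Scholia queries generally start with (from the notes above).
SCHOLIA_COMMENT = "# tool: scholia"


def scholia_method(query: str, user_agent: str) -> Optional[str]:
    """Return "comment", "user_agent", or None for a single request.

    Hypothetical helper: matching rules here are assumptions.
    """
    # Comment based: the query text starts with the Scholia marker comment.
    if query.lstrip().lower().startswith(SCHOLIA_COMMENT):
        return "comment"
    # User agent based: the user agent string mentions Scholia.
    if "scholia" in (user_agent or "").lower():
        return "user_agent"
    # Neither signal is present, so the request is not attributed to Scholia.
    return None
```

Because the comment and the user agent never appear together, the order of the two checks should not change which method a request is assigned to.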
Desired output
Description of the desired output for this task.
- Aggregate Scholia queries over the 90 days for which we still have queries
  - 27,639,881
  - Time period: 90 days to 2.8.2024
  - Total queries: 869,239,193
- Percentage of all queries that are Scholia queries over the same 90 days
  - 3.18%
- Of these Scholia queries, the percentage identified by user agent (vs. by query comment)
  - 55.29%
- Number of unique IPs making Scholia requests, by identification method (total across both methods, plus percentages)
  - Total Scholia IPs: 28,918
  - IPs identified via comments: 28,740 (99.4%)
  - IPs identified via user agents: 178 (0.6%)
  - Total IPs in the period: 2,115,166
  - Percentage of IPs in the period that are Scholia IPs: 1.37%
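As a quick consistency check, the reported percentages can be recomputed from the counts above. The helper name `percent` and the variable names are illustrative only; the numbers are the ones reported in this task.

```python
def percent(part: int, whole: int, digits: int = 2) -> float:
    """Return part as a percentage of whole, rounded to the given digits."""
    return round(100 * part / whole, digits)


# Counts reported above for the 90 day period.
scholia_queries = 27_639_881
total_queries = 869_239_193
scholia_ips = 28_918
comment_ips = 28_740
user_agent_ips = 178
total_ips = 2_115_166

# The two methods never overlap, so their IP counts add up to the total.
assert comment_ips + user_agent_ips == scholia_ips

query_share = percent(scholia_queries, total_queries)          # 3.18
ip_share = percent(scholia_ips, total_ips)                     # 1.37
comment_ip_share = percent(comment_ips, scholia_ips, 1)        # 99.4
user_agent_ip_share = percent(user_agent_ips, scholia_ips, 1)  # 0.6
```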
Urgency
When this task should be completed by. If this task is time sensitive then please make this clear. Please also provide the date when the output will be used if there is a specific meeting or event, for example.
09.02.2024
Information below this point is filled out by the Wikidata Analytics team.
General Planning
Information is filled out by the analytics product manager.
Assignee Planning
Information is filled out by the assignee of this task.
Estimation
Estimate: 1/2 day
Actual: 1 hour snapshot -> 1/2 day for full work
Sub Tasks
Full breakdown of the steps to complete this task:
- Check queries with the given comment to mark them as being for Scholia
- Investigate the metadata for these queries to determine whether other identification methods should be included
  - I.e. OR conditions in the WHERE clause, so that a query matches on either the comment or, say, the user agent, giving greater coverage of the queries
- Set up notebook with process to derive aggregate and percentage values
- Run notebook to derive needed values
- Report values and time period considered in this task
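The aggregation in the steps above could look roughly like the following. This is a minimal pure-Python sketch over hypothetical in-memory records, not the notebook actually used, which would run against the data lake tables. The record layout, the sample values, and the `method_for` helper are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample records: (client_ip, user_agent, query_text).
records = [
    ("10.0.0.1", "Mozilla/5.0", "# tool: scholia\nSELECT ?item WHERE { ?item ?p ?o } LIMIT 1"),
    ("10.0.0.1", "Mozilla/5.0", "# tool: scholia\nSELECT ?work WHERE { ?work ?p ?o } LIMIT 1"),
    ("10.0.0.2", "Scholia", "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"),
    ("10.0.0.3", "curl/8.0", "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"),
]


def method_for(user_agent, query):
    """Hypothetical classifier: comment OR user agent marks a Scholia query."""
    if query.lstrip().lower().startswith("# tool: scholia"):
        return "comment"
    if "scholia" in user_agent.lower():
        return "user_agent"
    return None


# Count Scholia queries and collect unique IPs per identification method.
query_counts = Counter()
ips_by_method = defaultdict(set)
for ip, user_agent, query in records:
    method = method_for(user_agent, query)
    if method is not None:
        query_counts[method] += 1
        ips_by_method[method].add(ip)

scholia_total = sum(query_counts.values())
percent_scholia = 100 * scholia_total / len(records)
percent_user_agent = 100 * query_counts["user_agent"] / scholia_total
unique_ips = {method: len(ips) for method, ips in ips_by_method.items()}
```

Keeping a set of IPs per method means repeated requests from the same IP are counted once per method, which matches the "unique IPs" figures requested in the desired output.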
Data to be used
See Analytics/Data_Lake for the breakdown of the data lake databases and tables.
The following tables will be referenced in this task:
For the analysis of automated traffic, isSpiderUDF will be used.
Notes and Questions
Things that came up during the completion of this task, questions to be answered and follow up tasks:
- Note