=== Scope ===
- How many SPARQL queries are coming from [Scholia](https://scholia.toolforge.org/)?
=== Notes ===
- Scholia queries should be identifiable via HTTP user agent
- Edit: per [this comment](https://phabricator.wikimedia.org/T353453#9406748), Scholia queries generally start with the following comment:
```
# tool: scholia
```
- Edit: there are also cases where the user agent string `"Scholia"` is used, but there are no cases where the comment and the user agent appear together.
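A minimal sketch of the identification logic described above, assuming each request record exposes the raw query text and the HTTP user agent (the function and field names are illustrative, not the actual schema):

```python
def scholia_method(query: str, user_agent: str):
    """Return how a request was identified as Scholia traffic:
    'comment', 'user_agent', or None for non-Scholia requests."""
    # Comment-tagged queries generally start with "# tool: scholia".
    if query.lstrip().startswith("# tool: scholia"):
        return "comment"
    # Otherwise check the user agent; per the note above, the comment
    # and the user agent never appear together in practice.
    if "Scholia" in user_agent:
        return "user_agent"
    return None
```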
=== Desired output ===
Description of the desired output for this task.
- Aggregate count of Scholia queries for the 90 days for which queries are still retained
- `27,639,881`
- Time period: 90 days to 2.8.2024
- Total queries: `869,239,193`
- Percentage of all queries that came from Scholia over the same 90 days
- `3.18%`
- Of these Scholia queries, the percent identified via user agent (vs. via query comment)
- `55.29%`
- Number of unique IPs making Scholia requests via each method (totals and percentages)
- Total IPs is: `29,145`
- Total IPs derived via comments: `28,960` (99.4%)
- Total IPs derived via user agents: `185` (0.6%)
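As a sanity check, the reported percentages follow directly from the counts above:

```python
scholia_queries = 27_639_881
total_queries = 869_239_193
total_ips = 29_145
comment_ips = 28_960
ua_ips = 185

# Share of all queries that came from Scholia (reported as 3.18%).
print(round(100 * scholia_queries / total_queries, 2))
# Split of unique IPs by identification method (reported as 99.4% / 0.6%).
print(round(100 * comment_ips / total_ips, 1))
print(round(100 * ua_ips / total_ips, 1))
```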
=== Urgency ===
When this task should be completed by. If this task is time sensitive then please make this clear. Please also provide the date when the output will be used if there is a specific meeting or event, for example.
09.02.2024
---
**Information below this point is filled out by the Wikidata Analytics team.**
== General Planning ==
Information is filled out by the analytics product manager.
== Assignee Planning ==
Information is filled out by the assignee of this task.
=== Estimation ===
Estimate: 1/2 day
Actual: 1 hour for an initial snapshot; 1/2 day for the full work
=== Sub Tasks ===
Full breakdown of the steps to complete this task:
[x] Check queries with the given comment to mark them as being for Scholia
[x] Investigate the metadata for these queries to derive if there are other identification methods that should be included
- I.e. `OR` conditions in the `WHERE` clause so that either the query comment or the user agent can match, giving greater coverage of Scholia queries
[x] Set up notebook with process to derive aggregate and percentage values
[x] Run notebook to derive needed values
[x] Report values and time period considered in this task
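The notebook's aggregation step can be sketched as follows, assuming each request has already been tagged with its identification method (the tag names and overall shape are illustrative, not the actual notebook code):

```python
from collections import Counter

def aggregate(method_tags):
    """method_tags: one tag per request, 'comment', 'user_agent',
    or None for non-Scholia traffic. Assumes at least one Scholia query."""
    counts = Counter(method_tags)
    scholia = counts["comment"] + counts["user_agent"]
    total = sum(counts.values())
    return {
        "total_queries": total,
        "scholia_queries": scholia,
        "scholia_pct": round(100 * scholia / total, 2),
        "ua_share_of_scholia_pct": round(100 * counts["user_agent"] / scholia, 2),
    }
```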
=== Data to be used ===
See [Analytics/Data_Lake](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake) for the breakdown of the data lake databases and tables.
The following tables will be referenced in this task:
- [event.wdqs_external_sparql_query](https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,event.wdqs_external_sparql_query,PROD))
For the analysis of automated traffic, [isSpiderUDF](https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-hive/src/main/java/org/wikimedia/analytics/refinery/hive/IsSpiderUDF.java) will be used.
=== Notes and Questions ===
Things that came up during the completion of this task, questions to be answered and follow up tasks: