[Analytics] Impact of Scholia on WDQS
Closed, ResolvedPublic

Assigned To

Authored By

	Manuel
	Dec 14 2023, 2:22 PM

Description

Scope

How many SPARQL queries are coming from Scholia?

Notes

Scholia queries should be identifiable via HTTP user agent

Edit: from this comment, Scholia queries generally start with the following comment:

# tool: scholia

Edit: there are also cases where the user agent string of "Scholia" is used, but there are no cases where the comment and user agent appear together.

Desired output

Description of the desired output for this task.

Aggregate Scholia queries for the 90 days for which queries are still retained
- 27,639,881
- Time period: 90 days up to Feb 8 2024
- Total queries: 869,239,193
Percent Scholia queries for the same 90 days
- 3.18%
Of these Scholia queries, the percent identified via user agent (vs. via query comment)
- 55.29%
Number of unique IPs making Scholia requests via each method (total of either, with percents)
- Total IPs: 28,918
- Total IPs derived via comments: 28,740 (99.4%)
- Total IPs derived via user agents: 178 (0.6%)
- Total IPs in the period: 2,115,166
- Percent Scholia IPs for the period: 1.37%
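
As a quick sanity check, the percentages in this summary can be recomputed from the reported counts. The snippet below is only a sketch: the variable and function names are made up for illustration, and the numbers are copied from the list above.

```python
# Counts copied from the results summary in the task description.
scholia_queries = 27_639_881
total_queries = 869_239_193
scholia_ips = 28_918
comment_ips = 28_740
user_agent_ips = 178
total_ips = 2_115_166


def percent(part: int, whole: int, digits: int = 2) -> float:
    """Return part/whole as a percentage rounded to `digits` places."""
    return round(100 * part / whole, digits)


pct_scholia_queries = percent(scholia_queries, total_queries)
pct_scholia_ips = percent(scholia_ips, total_ips)
pct_comment_ips = percent(comment_ips, scholia_ips, digits=1)
pct_user_agent_ips = percent(user_agent_ips, scholia_ips, digits=1)

print(pct_scholia_queries, pct_scholia_ips, pct_comment_ips, pct_user_agent_ips)
```

The 55.29% user agent share is not recomputed here because the task does not list the per-method query counts for the same window.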

Urgency

When this task should be completed by. If this task is time sensitive then please make this clear. Please also provide the date when the output will be used if there is a specific meeting or event, for example.

09.02.2024

Information below this point is filled out by the Wikidata Analytics team.

General Planning

Information is filled out by the analytics product manager.

Assignee Planning

Information is filled out by the assignee of this task.

Estimation

Estimate: 1/2 day
Actual: 1 hour snapshot -> 1/2 day for full work

Sub Tasks

Full breakdown of the steps to complete this task:

Check queries with the given comment to mark them as being for Scholia
Investigate the metadata for these queries to derive if there are other identification methods that should be included
- I.e. OR conditions for WHERE clauses whereby we can use the comment or say a user agent to gain a greater coverage of the queries
Set up notebook with process to derive aggregate and percentage values
Run notebook to derive needed values
Report values and time period considered in this task

Data to be used

See Analytics/Data_Lake for the breakdown of the data lake databases and tables.

The following tables will be referenced in this task:

event.wdqs_external_sparql_query

For the analysis of automated traffic, isSpiderUDF will be used.

Notes and Questions

Things that came up during the completion of this task, questions to be answered and follow up tasks:

Note

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Manuel	T337799 [EPIC] Analytics support around splitting the WDQS graph [up to milestone 3]
		Resolved		AndrewTavis_WMDE	T353453 [Analytics] Impact of Scholia on WDQS

Event Timeline

Manuel created this task. Dec 14 2023, 2:22 PM

Restricted Application added a subscriber: Aklapper. Dec 14 2023, 2:22 PM

Manuel updated the task description. Dec 14 2023, 2:22 PM

Manuel mentioned this in T337799: [EPIC] Analytics support around splitting the WDQS graph [up to milestone 3].

Manuel added a parent task: T337799: [EPIC] Analytics support around splitting the WDQS graph [up to milestone 3]. Dec 14 2023, 2:25 PM

Manuel renamed this task from [Analytics] Impact of Scholia on WDQS to [Analytics] QUERY-Q3: Extract a set of queries known to be used by scholia. Dec 14 2023, 2:30 PM

Manuel updated the task description.

Manuel edited parent tasks, added: T349512: [Analytics] Collect multiple sets of SPARQL queries; removed: T337799: [EPIC] Analytics support around splitting the WDQS graph [up to milestone 3].

note that scholia queries generally start with the comment:

# tool: scholia

Manuel renamed this task from [Analytics] QUERY-Q3: Extract a set of queries known to be used by scholia to [Analytics] Impact of Scholia on WDQS. Dec 14 2023, 2:39 PM

Manuel updated the task description.

Manuel edited parent tasks, added: T337799: [EPIC] Analytics support around splitting the WDQS graph [up to milestone 3]; removed: T349512: [Analytics] Collect multiple sets of SPARQL queries.

Thank you, David!

Manuel edited projects, added Wikidata Analytics (Kanban); removed Wikidata Analytics. Dec 15 2023, 9:59 AM

Manuel moved this task from Incoming to Prioritized backlog on the Wikidata Analytics (Kanban) board.

Manuel moved this task from Prioritized backlog to Incoming on the Wikidata Analytics (Kanban) board. Jan 29 2024, 9:05 AM

Manuel moved this task from Incoming to Prioritized backlog on the Wikidata Analytics (Kanban) board. Jan 29 2024, 9:50 AM

AndrewTavis_WMDE moved this task from Prioritized backlog to In progress on the Wikidata Analytics (Kanban) board. Jan 31 2024, 10:32 AM

AndrewTavis_WMDE claimed this task. Feb 5 2024, 9:21 AM

AndrewTavis_WMDE moved this task from In progress to Prioritized backlog on the Wikidata Analytics (Kanban) board.

AndrewTavis_WMDE moved this task from Prioritized backlog to In progress on the Wikidata Analytics (Kanban) board. Feb 7 2024, 5:09 PM

AndrewTavis_WMDE updated the task description. Feb 8 2024, 1:20 PM

Task is refined and I'm starting work on it now. I'm assuming that event.wdqs_external_sparql_query is what I'd use for this, and thus we'd be getting aggregate/percent values within a 90 day period given the retention policy :)

Let me know if there's anything else that should be included in this!

Quick note on this:

There are two ways that need to be factored in to deriving if a query is from Scholia. Some queries do start with #tool: scholia as @dcausse suggested, but I checked for user agents and also found that the string "Scholia" is also used as a user agent. Big thing is that some of the queries have the comment and some have the user agent, but in no cases do we have both.

AndrewTavis_WMDE updated the task description. Feb 8 2024, 2:29 PM

AndrewTavis_WMDE updated the task description.

Here are some initial results for consideration. Using the following query over the full dataset from event.wdqs_external_sparql_query (last 90 days):

SELECT
    count(*) AS total_scholia_queries

FROM
    event.wdqs_external_sparql_query

WHERE
    query LIKE '%# tool: scholia%'
    OR http.request_headers['user-agent'] LIKE '%Scholia%'

Aggregate queries over the time period (count(*) with no WHERE clause): 869,239,193
Scholia queries over the time period: 27,639,881
Percent Scholia queries over the time period: 3.18%
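
For spot-checking individual rows in a notebook, the WHERE clause above can be mirrored in plain Python. This is a minimal sketch, not the code used for the task: `scholia_source` and its return labels are hypothetical names, and the two substring tests are case-sensitive, which is what `LIKE` does by default in Spark SQL.

```python
from typing import Optional

COMMENT_MARKER = "# tool: scholia"   # substring matched by the first LIKE pattern
USER_AGENT_MARKER = "Scholia"        # substring matched by the second LIKE pattern


def scholia_source(query: str, user_agent: Optional[str]) -> Optional[str]:
    """Return how a request is identified as Scholia, or None if it is not."""
    has_comment = COMMENT_MARKER in query
    has_user_agent = user_agent is not None and USER_AGENT_MARKER in user_agent
    if has_comment and has_user_agent:
        return "both"  # not observed in the data, per the note in this task
    if has_comment:
        return "comment"
    if has_user_agent:
        return "user_agent"
    return None


print(scholia_source("# tool: scholia\nSELECT ?x WHERE { ?x ?p ?o }", "Mozilla/5.0"))
print(scholia_source("SELECT ?class { wd:Q5 wdt:P31 ?class }", "Scholia"))
```

Note that the pattern requires the space after `#`: a query starting with `#tool: scholia` (the spelling used in some comments on this task) would not match `'%# tool: scholia%'`.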

AndrewTavis_WMDE changed the task status from Open to In Progress. Feb 8 2024, 2:59 PM

AndrewTavis_WMDE triaged this task as Medium priority.

AndrewTavis_WMDE updated the task description.

AndrewTavis_WMDE moved this task from In progress to Product verification on the Wikidata Analytics (Kanban) board.

In T353453#9524925, @AndrewTavis_WMDE wrote:

Quick note on this:

There are two ways that need to be factored in to deriving if a query is from Scholia. Some queries do start with #tool: scholia as @dcausse suggested, but I checked for user agents and also found that the string "Scholia" is also used as a user agent. Big thing is that some of the queries have the comment and some have the user agent, but in no cases do we have both.

Indeed, I saw these two as well. I'm not sure how to interpret this yet, but it could be that some are coming from web browsers browsing https://scholia.toolforge.org/ (#tool: scholia in the query), while the "Scholia" user agent might come from some automated tooling used by Scholia that we have yet to discover. Looking at the queries might help.
Regarding #tool: scholia, something I noted is that a non-negligible portion of the traffic is coming from automated web crawlers; this might be interesting to identify and distinguish.

AndrewTavis_WMDE moved this task from Product verification to In progress on the Wikidata Analytics (Kanban) board. Feb 9 2024, 12:20 PM

AndrewTavis_WMDE updated the task description. Feb 9 2024, 2:29 PM

The following query checks automated traffic via isSpiderUDF. The result is that 91.36% of the #tool: scholia queries are automated:

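-- Flag each comment-identified query as spider or non-spider based on its user agent.
-- Assumes the is_spider function (isSpiderUDF) is already registered in the session.
WITH automate_or_not AS (
    SELECT
        is_spider(http['request_headers']['user-agent']) AS is_spider

    FROM
        event.wdqs_external_sparql_query

    WHERE
        query LIKE '%# tool: scholia%'
)

-- Count queries per group to get the automated share.
SELECT
    is_spider,
    count(*) AS total_queries

FROM
    automate_or_not

GROUP BY
    is_spider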

@dcausse and I found the aforementioned UDF for this. Note for reporting: the UDF is based on user agents, so a similar comparison for queries that have the user agent "Scholia" will not work, as they would either all be classified as automated or none of them would be.

Shifting now to inspecting queries in the following comparisons:

#tool: scholia queries vs. user agent is "Scholia"
For #tool: scholia queries, those that are spiders and those that aren't

Quick counts as in the sampling task to check uniqueness of queries and HTTP statuses (I don't think that other measures like variance over weeks, duration or number of characters would add much). Note that percentages below are for the sub-groups, not for all Scholia queries. Period for the following is all queries 90 days to the date of posting.

All queries with the `#tool: scholia` comment

query_count	total_queries	percent_of_queries
1	3,089,287	68.0434662425413
2	725,985	15.9902708424602
3	187,262	4.124561937919904
11	68,887	1.5172789899578585
4	67,923	1.4960462908082455
10	44,083	0.9709554736642947
12	33,874	0.7460959035207295
5	27,986	0.6164090439845055
9	27,554	0.6068939754859237
13	26,602	0.5859255837946049

http_status	total_queries	percent_of_queries
200	12,283,936	99.12437793762604
500	106,062	0.8558600250620397
429	2,305	0.01860003920129737
400	122	9.84470621500338E-4
503	22	1.7752748912301178E-4

Spider Comment Queries

query_count	total_queries	percent_of_queries
1	2,884,240	68.91420417646493
2	636,348	15.204496158185558
3	166,386	3.975521723610135
11	68,714	1.6418088043233612
4	52,954	1.2652493440076154
10	43,303	1.0346544612977635
12	33,187	0.7929491630392554
9	26,470	0.6324574184364085
13	26,275	0.6277982119160043
14	25,814	0.6167833698344333

Non-spider Comment Queries

query_count	total_queries	percent_of_queries
1	292,734	66.14231938940128
2	94,555	21.364402528796926
3	15,615	3.528159753446819
4	14,178	3.203474158461031
5	4,670	1.0551716969962628
6	4,602	1.0398073125432123
8	2,513	0.5678043842722931
7	2,475	0.5592184047250001
10	1,476	0.33349752136327276
9	1,398	0.31587366860830307

All queries with the `"Scholia"` user agent

query_count	total_queries	percent_of_queries
1	3,938,171	53.733959138039936
2	1,578,024	21.531182148983962
3	808,041	11.025230259392222
4	431,361	5.885659700339077
5	234,441	3.1988055151188766
6	131,593	1.7955068189908687
7	77,142	1.0525558884636235
8	46,645	0.6364427862563288
9	28,437	0.3880056493251414
10	17,087	0.23314177058123892

http_status	total_queries	percent_of_queries
200	15,193,039	99.9989205699417
500	86	5.660425915457063E-4
429	60	3.949134359621207E-4
503	12	7.898268719242414E-5
400	6	3.949134359621207E-5
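
One way to read the query_count tables above is as a "count of counts": first count how often each distinct query text occurs, then count how many distinct queries share each occurrence count. That reading is an assumption based on the column names and percentages, not something stated in the task; the sketch below uses made-up sample data and a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def repeat_distribution(queries: Iterable[str]) -> Dict[int, Tuple[int, float]]:
    """Map 'times a query was run' to (number of distinct queries, percent of distinct queries)."""
    per_query = Counter(queries)              # query text -> number of runs
    by_repeat = Counter(per_query.values())   # number of runs -> distinct queries with that count
    distinct_total = len(per_query)
    return {
        runs: (n_queries, 100 * n_queries / distinct_total)
        for runs, n_queries in sorted(by_repeat.items())
    }


# Hypothetical sample: q1 and q4 run once, q2 twice, q3 three times.
sample = ["q1", "q2", "q2", "q3", "q3", "q3", "q4"]
dist = repeat_distribution(sample)
print(dist)
```

Under this reading, the first row of each table is the share of distinct queries that were run exactly once in the period.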

Discussion

Nothing jumps out per se from the above. Big thing is that we have a higher percentage of unique queries coming from those identified via a comment than those that are identified via a user agent. I don't think that we can say that one is for traffic/API and another is for general tooling because:

Higher unique queries from those with a comment would possibly indicate unique user searches
But then the commented queries also seem more "routine" where we have a higher percentage of queries being run 11, 10 and 12 times in the period
- The distribution order of the user agent identified queries is more in line with user behavior

As far as HTTP status, 99+% 200s across commented and user agent queries is great :)

Final thing from what's been discussed so far is looking at some examples from the above breakdowns.

Having derived quick samples (DISTRIBUTE BY rand() to mix it up, but nothing more), what I'm seeing is that the comment queries look very similar to one another regardless of whether they're spiders or non-spiders. It could be that what we're thinking of as a non-spider just isn't being picked up by the UDF. Each of them has a PREFIX target: <http://www.wikidata.org/entity/QID_TARGET> at the top that is then assigned further down the query.

A final check for this could be to see how our counts above would change if we took only the part of the query after this assignment is made, such that we just have the template query. My expectation is that the whole distribution will shift to the left such that we have dramatically more unique queries at all levels, in that the distinct part for most of these seems to be the QID in question.

The queries themselves vary based on all manner of things that could be found out about researchers: student-supervisor relationships, number of publications, etc., with default views like maps, graphs, bar charts and others being assigned at the top between the #tool: scholia comment and the prefix assignment.
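
A rough sketch of the template check proposed above: instead of cutting the query off after the prefix line, it masks the QID inside the `PREFIX target:` declaration, which should group queries by template in the same way. The regular expression, the helper name `to_template` and the two sample queries are illustrative assumptions, not taken from the dataset.

```python
import re

# Matches the target prefix declaration and captures the text around the QID.
TARGET_PREFIX = re.compile(
    r"(PREFIX\s+target:\s*<http://www\.wikidata\.org/entity/)Q\d+(>)"
)


def to_template(query: str) -> str:
    """Replace the concrete QID in the target prefix with a placeholder."""
    return TARGET_PREFIX.sub(r"\1QID_TARGET\2", query)


# Hypothetical comment-identified queries that differ only in the target QID.
query_a = (
    "# tool: scholia\n"
    "PREFIX target: <http://www.wikidata.org/entity/Q42>\n"
    "SELECT ?work WHERE { ?work wdt:P50 target: }"
)
query_b = query_a.replace("Q42", "Q80")

distinct_raw = len({query_a, query_b})
distinct_templates = len({to_template(query_a), to_template(query_b)})
print(distinct_raw, distinct_templates)
```

Counting repeats on the output of `to_template` rather than on the raw text would show how many distinct templates sit behind the comment-identified traffic.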

The user agent based queries are totally different and normally extremely small. Templates for these are:

select ?class where { wd:QID_TARGET wdt:P279+ ?class }

SELECT ?class { wd:QID_TARGET wdt:P31 ?class }

SELECT ?doi { wd:QID_TARGET wdt:P356 ?doi }

So it seems that the assumption that these are for automated tooling was correct :) Minor helper queries to check what things are.

Let me know if a check of the queries after removing the prefix assignment would be helpful. We could also check the user agent queries with just Q rather than the full QID to get an idea of their variance. At the very least this would give us an idea of the total number of query templates they have.
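
The "just Q rather than the full QID" idea for the user agent queries could look roughly like this. It is a sketch under assumptions: the sample queries are made up in the shape of the templates listed above, `mask_qids` is a hypothetical helper, and whitespace is collapsed so that formatting differences do not create extra templates.

```python
import re
from collections import Counter

QID = re.compile(r"\bwd:Q\d+\b")


def mask_qids(query: str) -> str:
    """Collapse whitespace and replace every wd:Q<number> with wd:Q."""
    normalized = " ".join(query.split())
    return QID.sub("wd:Q", normalized)


# Hypothetical user-agent-identified queries following the templates above.
sample_queries = [
    "SELECT ?class { wd:Q5 wdt:P31 ?class }",
    "SELECT ?class { wd:Q42  wdt:P31 ?class }",
    "SELECT ?doi { wd:Q123 wdt:P356 ?doi }",
]

template_counts = Counter(mask_qids(q) for q in sample_queries)
print(template_counts)
```

Keyword case is left untouched here, so `select` and `SELECT` variants of the same query would still count as separate templates.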

AndrewTavis_WMDE moved this task from In progress to Product verification on the Wikidata Analytics (Kanban) board. Feb 9 2024, 5:02 PM

AndrewTavis_WMDE updated the task description. Feb 9 2024, 5:17 PM

Hi Andrew, good idea to investigate the types of queries per source! The results seem highly relevant: Could you please add the % of user agent based queries to the results summary?

Manuel moved this task from Product verification to Prioritized backlog on the Wikidata Analytics (Kanban) board. Feb 15 2024, 7:38 PM

Credit on checking the queries goes to @dcausse :) Added the percent that are identified via a user agent to the results summary just now: 55.29%.

AndrewTavis_WMDE moved this task from Product verification to In progress on the Wikidata Analytics (Kanban) board. Feb 16 2024, 10:10 AM

AndrewTavis_WMDE updated the task description. Feb 19 2024, 11:58 AM

A follow up request from @Manuel on this was for the total IPs that are accessing Scholia. The following query was run for this:

SELECT
    count(
        DISTINCT CASE 
            WHEN query LIKE '%# tool: scholia%' THEN http.client_ip 
        END
    ) AS total_comment_ips,
    count(
        DISTINCT CASE 
            WHEN http.request_headers['user-agent'] LIKE '%Scholia%' THEN http.client_ip 
        END
    ) AS total_user_agent_ips

FROM
    event.wdqs_external_sparql_query

WHERE
    query LIKE '%# tool: scholia%'
    OR http.request_headers['user-agent'] LIKE '%Scholia%'

Results are:

Total IPs: 28,918
total_comment_ips: 28,740 (99.4%)
total_user_agent_ips: 178 (0.6%)
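
The reported total equals 28,740 + 178, which is consistent with either summing the two columns or with the two IP sets not overlapping; the task does not say which. The sketch below mirrors the two conditional distinct counts in Python and shows how a sum and a true "either method" count can differ when an IP appears under both methods. The rows use documentation-range IPs and are entirely made up.

```python
from typing import Iterable, Optional, Set, Tuple

Row = Tuple[str, Optional[str], str]  # (query text, user agent, client IP)


def distinct_ips_by_method(rows: Iterable[Row]) -> Tuple[Set[str], Set[str]]:
    """Collect distinct client IPs per identification method."""
    comment_ips: Set[str] = set()
    user_agent_ips: Set[str] = set()
    for query, user_agent, client_ip in rows:
        if "# tool: scholia" in query:
            comment_ips.add(client_ip)
        if user_agent is not None and "Scholia" in user_agent:
            user_agent_ips.add(client_ip)
    return comment_ips, user_agent_ips


# Hypothetical rows; 192.0.2.2 shows up under both methods.
rows = [
    ("# tool: scholia\nSELECT ?x WHERE { ?x ?p ?o }", "Mozilla/5.0", "192.0.2.1"),
    ("# tool: scholia\nSELECT ?x WHERE { ?x ?p ?o }", "Mozilla/5.0", "192.0.2.2"),
    ("SELECT ?doi { wd:Q1 wdt:P356 ?doi }", "Scholia", "192.0.2.2"),
    ("SELECT ?doi { wd:Q2 wdt:P356 ?doi }", "Scholia", "192.0.2.3"),
]

comment_ips, user_agent_ips = distinct_ips_by_method(rows)
summed = len(comment_ips) + len(user_agent_ips)   # counts the shared IP twice
either = len(comment_ips | user_agent_ips)        # distinct IPs across both methods
print(summed, either)
```

If an exact "either method" figure matters, a single `count(DISTINCT http.client_ip)` over the same WHERE clause would give it directly.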

AndrewTavis_WMDE updated the task description. Feb 19 2024, 12:34 PM