Page MenuHomePhabricator

[Analytics] Impact of Scholia on WDQS
Closed, ResolvedPublic

Description

Scope

  • How many SPARQL queries are coming from Scholia?

Notes

  • Scholia queries should be identifiable via HTTP user agent

Edit: from this comment, Scholia queries generally start with the following comment:

# tool: scholia

Edit: there are also cases where the user agent string of "Scholia" is used, but there are no cases where the comment and user agent appear together.

Desired output

Description of the desired output for this task.

  • Aggregate Scholia queries for the 90 days for which we have queries still
    • 27,639,881
    • Time period: 90 days to 2.8.2024
    • Total queries: 869,239,193
  • Percent Scholia queries for the 90 days for which we have queries still
    • 3.18%
  • Of these Scholia queries, the percent that are user agent based (vs. query comment based)
    • 55.29%
  • Number of unique IPs that are making requests to Scholia via each method (total of either and percents)
    • Total IPs is: 28,918
    • Total IPs derived via comments: 28,740 (99.4%)
    • Total IPs derived via user agents: 178 (0.6%)
    • Total IPs in the period: 2,115,166
    • Percent Scholia IPs for the period: 1.37%

Urgency

When this task should be completed by. If this task is time sensitive then please make this clear. Please also provide the date when the output will be used if there is a specific meeting or event, for example.

09.02.2024


Information below this point is filled out by the Wikidata Analytics team.

General Planning

Information is filled out by the analytics product manager.

Assignee Planning

Information is filled out by the assignee of this task.

Estimation

Estimate: 1/2 day
Actual: 1 hour snapshot -> 1/2 day for full work

Sub Tasks

Full breakdown of the steps to complete this task:

  • Check queries with the given comment to mark them as being for Scholia
  • Investigate the metadata for these queries to derive if there are other identification methods that should be included
    • I.e. OR conditions for WHERE clauses whereby we can use the comment or say a user agent to gain a greater coverage of the queries
  • Set up notebook with process to derive aggregate and percentage values
  • Run notebook to derive needed values
  • Report values and time period considered in this task

Data to be used

See Analytics/Data_Lake for the breakdown of the data lake databases and tables.

The following tables will be referenced in this task:

For the analysis of automate traffic, isSpiderUDF will be used.

Notes and Questions

Things that came up during the completion of this task, questions to be answered and follow up tasks:

  • Note

Event Timeline

Manuel renamed this task from [Analytics] Impact of Scholia on WDQS to [Analytics] QUERY-Q3: Extract a set of queries known to be used by scholia.Dec 14 2023, 2:30 PM
Manuel updated the task description. (Show Details)

note that scholia queries generally start with the comment:

# tool: scholia
Manuel renamed this task from [Analytics] QUERY-Q3: Extract a set of queries known to be used by scholia to [Analytics] Impact of Scholia on WDQS.Dec 14 2023, 2:39 PM
Manuel updated the task description. (Show Details)

Task is refined and I'm starting work on it now. I'm assuming that event.wdqs_external_sparql_query is what I'd use for this, and thus we'd be getting aggregate/percent values within a 90 day period given the retention policy :)

Let me know if there's anything else that should be included in this!

Quick note on this:

There are two ways that need to be factored in to deriving if a query is from Scholia. Some queries do start with #tool: scholia as @dcausse suggested, but I checked for user agents and also found that the string "Scholia" is also used as a user agent. Big thing is that some of the queries have the comment and some have the user agent, but in no cases do we have both.

AndrewTavis_WMDE updated the task description. (Show Details)
AndrewTavis_WMDE updated the task description. (Show Details)

Here are some initial results for consideration. Using the following query over the full dataset from event.wdqs_external_sparql_query (last 90 days):

SELECT
    count(*) AS total_scholia_queries

FROM
    event.wdqs_external_sparql_query

WHERE
    query LIKE '%# tool: scholia%'
    OR http.request_headers['user-agent'] LIKE '%Scholia%'

Aggregate queries over the time period (count(*) with no WHERE clause): 869,239,193
Scholia queries over the time period: 27,639,881
Percent Scholia queries over the time period: 3.18%

AndrewTavis_WMDE changed the task status from Open to In Progress.Feb 8 2024, 2:59 PM
AndrewTavis_WMDE triaged this task as Medium priority.
AndrewTavis_WMDE updated the task description. (Show Details)

Quick note on this:

There are two ways that need to be factored in to deriving if a query is from Scholia. Some queries do start with #tool: scholia as @dcausse suggested, but I checked for user agents and also found that the string "Scholia" is also used as a user agent. Big thing is that some of the queries have the comment and some have the user agent, but in no cases do we have both.

Indeed I saw these two as well, I'm not sure how to interpret this yet but it could be that some are coming from web browsers browsing https://scholia.toolforge.org/ (#tool: scholia in the query) and the "Scholia" user-agent might be from some automated tooling used by scholia that we have yet to discover? Looking at the queries might help.
Regarding #tool: scholia something I noted is a non negligible portion of the traffic is coming from automated web crawlers, this might be interesting to identify and distinguish.

Results from the following query to check automate traffic via isSpiderUDF is that 91.36% of the #tool: scholia queries are automated:

WITH automate_or_not AS (
    SELECT
        is_spider(http['request_headers']['user-agent']) AS is_spider

    FROM
        event.wdqs_external_sparql_query

    WHERE
        query LIKE '%# tool: scholia%'
)

SELECT
    is_spider AS is_spider,
    count(*) AS total_queries
    
FROM
    automate_or_not
    
GROUP BY
    is_spider

@dcausse and I found the aforementioned UDF for this. Note for reporting: the UDF is based on user agents, so a similar comparison for queries that have the user agent "Scholia" will not work as they'd either all be automate or none of them would be.

Shifting now to inspecting queries in the following comparisons:

  • #tool: scholia queries vs. user agent is "Scholia"
  • For #tool: scholia queries, those that are spiders and those that aren't

Quick counts as in the sampling task to check uniqueness of queries and HTTP statuses (I don't think that other measures like variance over weeks, duration or number of characters would add much). Note that percentages below are for the sub-groups, not for all Scholia queries. Period for the following is all queries 90 days to the date of posting.

All queries with the #tool: scholia comment

query_counttotal_queriespercent_of_queries
13,089,28768.0434662425413
2725,98515.9902708424602
3187,2624.124561937919904
1168,8871.5172789899578585
467,9231.4960462908082455
1044,0830.9709554736642947
1233,8740.7460959035207295
527,9860.6164090439845055
927,5540.6068939754859237
1326,6020.5859255837946049
http_statustotal_queriespercent_of_queries
20012,283,93699.12437793762604
500106,0620.8558600250620397
4292,3050.01860003920129737
4001229.84470621500338E-4
503221.7752748912301178E-4
Spider Comment Queries
query_counttotal_queriespercent_of_queries
12,884,24068.91420417646493
2636,34815.204496158185558
3166,3863.975521723610135
1168,7141.6418088043233612
452,9541.2652493440076154
1043,3031.0346544612977635
1233,1870.7929491630392554
926,4700.6324574184364085
1326,2750.6277982119160043
1425,8140.6167833698344333
Non-spider Comment Queries
query_counttotal_queriespercent_of_queries
1292,73466.14231938940128
294,55521.364402528796926
315,6153.528159753446819
414,1783.203474158461031
54,6701.0551716969962628
64,6021.0398073125432123
82,5130.5678043842722931
72,4750.5592184047250001
101,4760.33349752136327276
91,3980.31587366860830307

All queries with the "Scholia" user agent

query_counttotal_queriespercent_of_queries
13,938,17153.733959138039936
21,578,02421.531182148983962
3808,04111.025230259392222
4431,3615.885659700339077
5234,4413.1988055151188766
6131,5931.7955068189908687
777,1421.0525558884636235
846,6450.6364427862563288
928,4370.3880056493251414
1017,0870.23314177058123892
http_statustotal_queriespercent_of_queries
20015,193,03999.9989205699417
500865.660425915457063E-4
429603.949134359621207E-4
503127.898268719242414E-5
40063.949134359621207E-5

Discussion

Nothing jumps out per say from the above. Big thing is that we have a higher percentage of unique queries coming from those identified via a comment than those that are identified via a user agent. I don't think that we can say that one is for traffic/API and another is for general tooling because:

  • Higher unique queries from those with a comment would possibly indicate unique user searches
  • But then the commented queries also seem more "routine" where we have a higher percentage of queries being ran 11, 10 and 12 times in the period
    • The distribution order of the user agent identified queries is more in line with user behavior

As far as HTTP status, 99+% 200s across commented and user agent queries is great :)

Final thing from what's been discussed so far is looking at some examples from the above breakdowns.

Having derived quick samples (DISTRIBUTE BY rand() to mix it up, but nothing more), what I'm seeing is that the comment queries look to be very similar to one another regardless of if they're spiders or non-spiders. Could be that what we're thinking of as a non-spider just isn't being picked up by the UDF. Each of them has a PREFIX target: <http://www.wikidata.org/entity/QID_TARGET> at the top that then is assigned further on down the query. A final check for this could be to see how our counts above would change if we took the part of the query after this assignment is made such that we just have the template query. My expectation is that the whole distribution will shift to the left such that we have dramatically more unique queries at all levels in that the distinct part for most of these seems to be the QID in question. The queries themselves are varied based on all manner of things that could be found out about researchers: student-supervisor relationships, number of publications, etc, with default views like maps, graphs, bar charts and others being assigned at the top between the #tool: scholia comment and the prefix assignment.

The user agent based queries are totally different and normally extremely small. Templates for these are:

select ?class where { wd:QID_TARGET wdt:P279+ ?class }
SELECT ?class { wd:QID_TARGET wdt:P31 ?class }
SELECT ?doi { wd:QID_TARGET wdt:P356 ?doi }

So it seems that the assumption that these are for automated tooling was correct :) Minor helper queries to check what things are.

Let me know if a check of the queries after removing the the prefix assignment would be helpful. We could also check the user agent queries with just Q rather than the full QID to get an idea of their variance. At the very least this would give us an idea of the total number of query templates they have.

Hi Andrew, good idea to investigate the types of queries per source! The results seem highly relevant: Could you please add the % of user agent based queries to the results summary?

Credit on checking the queries goes to @dcausse :) Added the percent that are identified via a user agent to the results summary just now: 55.29%.

A follow up request from @Manuel on this was for the total IPs that are accessing Scholia. The following query was run for this:

SELECT
    count(
        DISTINCT CASE 
            WHEN query LIKE '%# tool: scholia%' THEN http.client_ip 
        END
    ) AS total_comment_ips,
    count(
        DISTINCT CASE 
            WHEN http.request_headers['user-agent'] LIKE '%Scholia%' THEN http.client_ip 
        END
    ) AS total_user_agent_ips

FROM
    event.wdqs_external_sparql_query

WHERE
    query LIKE '%# tool: scholia%'
    OR http.request_headers['user-agent'] LIKE '%Scholia%'

Results are:

  • Total IPs is: 28,918
  • total_comment_ips: 28,740 (99.4%)
  • total_user_agent_ips: 178 (0.6%)

Updated the above comment and description with a second run and also ran a query for the total IPs for the given period, with the result being 2,115,166. Percent Scholia queries for the period is thus 28918 / 2115166 * 100, or 1.37%.