Page MenuHomePhabricator

Get quantitative data on how the search API is currently used in Wikidata
Closed, ResolvedPublic

Description

In order to get a better understanding of the use cases, we need to find data to answer the following questions:

  • What queries are run after wbsearchentities by the same user agent/IP in the same "session"?
  • What is the % new user agents per month i.e. growth of user agents over time?
  • Search calls per month
  • How many times do people reformulate their search queries?
  • How many people do a simple text search AND wikibase cirrussearch keywords (i.e. haswbstatement)
  • Dist of simple v "complex" (wikibase cirrussearch keywords) searches

Event Timeline

Added this to the kanban board and assigned myself as discussed with @Ifrahkhanyaree_WMDE as I'll be working on the last two points:

  • How many people do a simple text search AND wikibase cirrussearch keywords (i.e. haswbstatement)
  • Dist of simple v "complex" (wikibase cirrussearch keywords) searches

The number of search calls in a month will be calculated along the way in case that's also needed. All of the numbers are coming for December 2024 :)

Initial report for feedback on this, with the following being totals for December 2024. The work took a bit more time as initial explorations honed in on some search queries that were a malformed subset of the rest of the searches. T383135: Some search entries in wmf.webrequest have their query appended to their uri_path was made to report the bug in question, and the following does include both the normal and malformed searches.

Total Searches

Point 3 above

SELECT
    count(*) AS total_searches

FROM
    wmf.webrequest

WHERE
    year = 2024
    AND month = 12
    AND webrequest_source = 'text'
    AND normalized_host.project_family = 'wikidata'
    AND (
        (
            uri_path = '/w/index.php'
            AND uri_query LIKE '?search%'
        )
        -- Accounting more malformed search queries: %3F is '?'.
        OR uri_path LIKE '/w/index.php%3Fsearch%'
    )
;
  • Result: 12,596,105

Devices Doing Simple and Complex Searches

Point 5 above

WITH wd_search_ips_and_uris AS (
    SELECT
        ip,
        uri_path,
        uri_query

    FROM
        wmf.webrequest

    WHERE
        year = 2024
        AND month = 12
        AND normalized_host.project_family = 'wikidata'
        AND (
            (
                uri_path = '/w/index.php'
                AND uri_query LIKE '?search%'
            )
            -- Accounting more malformed search queries: %3F is '?'.
            OR uri_path LIKE '/w/index.php%3Fsearch%'
        )
),

simple_search_ips AS (
    SELECT DISTINCT
        ip

    FROM
        wd_search_ips_and_uris

    WHERE
        (
            uri_path = '/w/index.php'
            AND uri_query LIKE '?search%'
            AND uri_query NOT LIKE '%haswbstatement%'
            AND uri_query NOT LIKE '%inlabel%'
            AND uri_query NOT LIKE '%incaption%'
            AND uri_query NOT LIKE '%wbstatementquantity%'
            AND uri_query NOT LIKE '%hasdescription%'
            AND uri_query NOT LIKE '%haslabel%'
            AND uri_query NOT LIKE '%hascaption%'
        )
        OR (
            uri_path LIKE '/w/index.php%3Fsearch%'
            AND uri_path NOT LIKE '%haswbstatement%'
            AND uri_path NOT LIKE '%inlabel%'
            AND uri_path NOT LIKE '%incaption%'
            AND uri_path NOT LIKE '%wbstatementquantity%'
            AND uri_path NOT LIKE '%hasdescription%'
            AND uri_path NOT LIKE '%haslabel%'
            AND uri_path NOT LIKE '%hascaption%'
        )
),

complex_search_ips AS (
    SELECT DISTINCT
        ip

    FROM
        wd_search_ips_and_uris

    WHERE
        (
            uri_path = '/w/index.php'
            AND uri_query LIKE '?search%'
            AND (
                uri_query LIKE '%haswbstatement%'
                OR uri_query LIKE '%inlabel%'
                OR uri_query LIKE '%incaption%'
                OR uri_query LIKE '%wbstatementquantity%'
                OR uri_query LIKE '%hasdescription%'
                OR uri_query LIKE '%haslabel%'
                OR uri_query LIKE '%hascaption%'
            )
        )
        OR (
            uri_path LIKE '/w/index.php%3Fsearch%'
            AND (
                uri_path LIKE '%haswbstatement%'
                OR uri_path LIKE '%inlabel%'
                OR uri_path LIKE '%incaption%'
                OR uri_path LIKE '%wbstatementquantity%'
                OR uri_path LIKE '%hasdescription%'
                OR uri_path LIKE '%haslabel%'
                OR uri_path LIKE '%hascaption%'
            )
        )
)

SELECT
    count(s.ip) AS total_simple_and_complex_searchers

FROM
    simple_search_ips AS s

INNER JOIN
    complex_search_ips AS c

ON
    s.ip = c.ip
;
  • Result: 2,029

Simple vs. Complex Search Distribution

Point 6 above

SELECT
    sum(
        CASE
            WHEN
                (
                    uri_path = '/w/index.php'
                    AND uri_query LIKE '?search%'
                    AND uri_query NOT LIKE '%haswbstatement%'
                    AND uri_query NOT LIKE '%inlabel%'
                    AND uri_query NOT LIKE '%incaption%'
                    AND uri_query NOT LIKE '%wbstatementquantity%'
                    AND uri_query NOT LIKE '%hasdescription%'
                    AND uri_query NOT LIKE '%haslabel%'
                    AND uri_query NOT LIKE '%hascaption%'
                )
                OR (
                    uri_path LIKE '/w/index.php%3Fsearch%'
                    AND uri_path NOT LIKE '%haswbstatement%'
                    AND uri_path NOT LIKE '%inlabel%'
                    AND uri_path NOT LIKE '%incaption%'
                    AND uri_path NOT LIKE '%wbstatementquantity%'
                    AND uri_path NOT LIKE '%hasdescription%'
                    AND uri_path NOT LIKE '%haslabel%'
                    AND uri_path NOT LIKE '%hascaption%'
                )
            THEN
                1
            ELSE
                0
        END
    ) AS total_simple_searches,

    sum(
        CASE
            WHEN
                (
                    uri_path = '/w/index.php'
                    AND uri_query LIKE '?search%'
                    AND (
                        uri_query LIKE '%haswbstatement%'
                        OR uri_query LIKE '%inlabel%'
                        OR uri_query LIKE '%incaption%'
                        OR uri_query LIKE '%wbstatementquantity%'
                        OR uri_query LIKE '%hasdescription%'
                        OR uri_query LIKE '%haslabel%'
                        OR uri_query LIKE '%hascaption%'
                    )
                )
                OR (
                    uri_path LIKE '/w/index.php%3Fsearch%'
                    AND (
                        uri_path LIKE '%haswbstatement%'
                        OR uri_path LIKE '%inlabel%'
                        OR uri_path LIKE '%incaption%'
                        OR uri_path LIKE '%wbstatementquantity%'
                        OR uri_path LIKE '%hasdescription%'
                        OR uri_path LIKE '%haslabel%'
                        OR uri_path LIKE '%hascaption%'
                    )
                )
            THEN
                1
            ELSE
                0
        END
    ) AS total_complex_searches

FROM
    wmf.webrequest

WHERE
    year = 2024
    AND month = 12
    AND webrequest_source = 'text'
    AND normalized_host.project_family = 'wikidata'
    AND (
        (
            uri_path = '/w/index.php'
            AND uri_query LIKE '?search%'
        )
        -- Accounting more malformed search queries: %3F is '?'.
        OR uri_path LIKE '/w/index.php%3Fsearch%'
    )
;
  • Result: total_simple_searches: 12,414,591 (98.56%); total_complex_searches: 181,514 (1.44%)
  • Check: 12414591 + 181514 == 12596105

Let me know if the above works! :)

2025 w3 review: all good!

initial requests were delivered. reopening and removing from our board. feel free to chuck it back with the new ac when/if needed

marking this as resolved, will open a new ticket when the need arises