Page MenuHomePhabricator

Some search entries in wmf.webrequest have their query appended to their uri_path
Closed, DeclinedPublicBUG REPORT

Description

Context

In working on a task to investigate searches of Wikidata I came across entries in wmf.webrequest where the uri_path includes the value that would be expected to be included in uri_query.

Steps to replicate

Running the following query will return such results:

SELECT
    *

FROM
    wmf.webrequest

WHERE
    year = 2024
    AND month = 12
    AND normalized_host.project_family = 'wikidata'
    -- %3F is '?', with the issue likely stemming from the search query including this instead.
    AND uri_path LIKE '/w/index.php%3Fsearch%'

LIMIT
    5
;

What happens?:

Results are returned, which shouldn't be the case as the normal WHERE clause for deriving search traffic is something like the following that relies on uri_path = '/w/index.php':

WHERE
    ...
    AND uri_path = '/w/index.php'
    AND uri_query LIKE '%search%'

Specifically these entries that are returned above also have empty uri_query fields. General thought from @JAllemandou on this is that the varnish-kafka log builder doesn't split by %3F which is programmatically being included in the requests (they're basically all from a single bot).

What should have happened instead?:

No results should be returned and all entries should have /w/index.php for their uri_path and the rest of the search query as the uri_query value.

Software version

Current version of wmf.webrequest.

Other information

@JAllemandou mentioned that there's varnish-kafka to HAProxy migration in the works and that this task might be something to look into to make sure that this is fixed. Let me know if there's anything else I can do to help :)

Event Timeline

Hi @AndrewTavis_WMDE, can you please associate one or more active project tags with this task (via the Add Action...Change Project Tags dropdown)? That will allow to see a task when looking at project workboards or searching for tasks in certain projects, and get notified about a task when watching a related project tag. Thanks!

Thanks @Aklapper! 🙏 Adding Data-Engineering as my best guess. Hope that this works and that others can add more specific tags for the teams who would look into this :)

SELECT
    hour,
    normalized_host.project_family as project_family,
    count(1) as count
FROM wmf.webrequest
-- FROM wmf_staging.webrequest
WHERE
    year = 2025 AND month = 3 AND day = 20 AND hour >= 13 AND hour <= 15
  AND uri_path LIKE '%\%3F%'
  AND (uri_query IS NULL OR uri_query = '')
group by hour, normalized_host.project_family
order BY hour, count desc;

-- 13      wikipedia       45451
-- 13      wikimedia       31025
-- 13      wikisource      940
-- 13      wikibooks       355
-- 13      wiktionary      325
-- 13      wikinews        301
-- 13      wikiversity     187
-- 13      wikiquote       168
-- 13      mediawiki       66
-- 13      wikivoyage      27
-- 13      wikidata        6

It feels like all families of the project are concerned.

with counts as (SELECT hour,
                       SUM(CASE
                               WHEN uri_path LIKE '%\%3F%' AND uri_query IS NULL OR uri_query = '' THEN 1
                               ELSE 0 END) AS unparsed_and_no_query,
                       SUM(CASE
                               WHEN uri_path LIKE '%\%3F%' AND (uri_query IS NOT NULL AND LENGTH(uri_query) > 0) THEN 1
                               ELSE 0 END) AS unparsed_and_query,
                       COUNT(1)            AS total
                FROM wmf.webrequest
                -- FROM wmf_staging.webrequest
                WHERE year = 2025
                  AND month = 3
                  AND day = 20
                  AND hour >= 13
                  AND hour <= 15
                GROUP BY hour
) SELECT hour,
         unparsed_and_no_query,
         unparsed_and_query,
         total,
         unparsed_and_no_query / total AS unparsed_and_no_query_ratio,
         unparsed_and_query / total AS unparsed_and_query_ratio
  FROM counts
ORDER BY hour, total DESC;
-- hour    unparsed_and_no_query   unparsed_and_query      total   unparsed_and_no_query_ratio     unparsed_and_query_ratio
-- 13      353325930       939     709915607       0.49770131339005763     1.3226924309610227E-6
-- 14      361535242       1059    712255506       0.5075920634581939      1.4868259930306527E-6
-- 15      354933452       1281    692692340       0.5123969639970323      1.8493058548907874E-6

There are ~0.5% of requests with an unparsed ? and no query.

SELECT uri_path, uri_query
FROM wmf_staging.webrequest
WHERE year = 2025 AND month = 3 AND day = 20 AND hour = 15
  AND normalized_host.project_family = 'wikipedia'
  AND uri_path LIKE '%\%3F%'
LIMIT 5 ;

-- uri_path        uri_query
-- /wiki/Worms%3F
-- /wiki/Do_They_Know_It%27s_Christmas%3F
-- /api/rest_v1/page/summary/When_We_All_Fall_Asleep%2C_Where_Do_We_Go%3F
-- /wiki/%E6%89%93%E3%81%A1%E4%B8%8A%E3%81%92%E8%8A%B1%E7%81%AB%E3%80%81%E4%B8%8B%E3%81%8B%E3%82%89%E8%A6%8B%E3%82%8B%E3%81%8B%3F_%E6%A8%AA%E3%81%8B%E3%82%89%E8%A6%8B%E3%82%8B%E3%81%8B%3F
-- /wiki/Why_Am_I_So_Single%3F
-- /api/rest_v1/page/summary/Where_Do_the_Children_Play%3F
-- /wiki/Par_o%C3%B9_t%27es_rentr%C3%A9_%3F_On_t%27a_pas_vu_sortir
-- /api/rest_v1/page/summary/Whose_Line_Is_It_Anyway%3F_(American_TV_series)
-- /api/rest_v1/page/mobile-html/%D0%97%D0%B0%D1%87%D0%B5%D0%BC%3F
-- /wiki/Am_I_Racist%3F
-- /wiki/Who_stole_the_cookie_from_the_cookie_jar%3F
-- /api/rest_v1/page/summary/Bucchigiri%3F!
-- /wiki/Whose_Line_Is_It_Anyway%3F_(British_TV_series)
-- /wiki/What_Child_Is_This%3F
-- /api/rest_v1/page/html/Nvidia_profile_inspector_force_rebar_on_for_all_games%3F
-- /wiki/Category%3AStart-Class_German_cinema_articles%3Ffrom%3DLj
-- /wiki/Fichi%C3%A8r:What_to_do%3F._Thoughts_evoked_by_the_census_of_Moscow_(IA_cu31924060289091).pdf
-- /wiki/%E0%AA%B5%E0%AA%BF%E0%AA%B6%E0%AB%87%E0%AA%B7:%E0%AA%85%E0%AA%B9%E0%AB%80%E0%AA%82_%E0%AA%B6%E0%AB%81%E0%AA%82_%E0%AA%9C%E0%AB%8B%E0%AA%A1%E0%AA%BE%E0%AA%AF_%E0%AA%9B%E0%AB%87%3F/%E0%AA%AE%E0%AA%A6%E0%AA%A8%E0%AA%B2%E0%AA%BE%E0%AA%B2_%E0%AA%A7%E0%AA%BF%E0%AA%82%E0%AA%97%E0%AA%B0%E0%AA%BE
-- /wiki/%E3%82%8F%E3%81%9F%E3%81%97%E3%81%8C%E6%81%8B%E4%BA%BA%E3%81%AB%E3%81%AA%E3%82%8C%E3%82%8B%E3%82%8F%E3%81%91%E3%81%AA%E3%81%84%E3%81%98%E3%82%83%E3%82%93%E3%80%81%E3%83%A0%E3%83%AA%E3%83%A0%E3%83%AA!%EF%BC%88%E2%80%BB%E3%83%A0%E3%83%AA%E3%81%98%E3%82%83%E3%81%AA%E3%81%8B%E3%81%A3%E3%81%9F!%3F%EF%BC%89
-- /api/rest_v1/page/html/Ctrl_delete_triggers_an_copy_event_javascript%3F
-- /api/rest_v1/page/summary/Would_I_Lie_to_You%3F_(British_game_show)
-- /wiki/What_Is_It_Like_to_Be_a_Bat%3F
-- /api/rest_v1/page/summary/Who's_Afraid_of_Virginia_Woolf%3F_(film)
-- /wiki/%28What%27s_the_Story%29_Morning_Glory%3F
-- /api/rest_v1/page/html/canadian_election_can_you_vote_online%3F
-- /api/rest_v1/page/summary/Why_Have_There_Been_No_Great_Women_Artists%3F
-- /api/rest_v1/page/summary/What_If..._Thor_Were_an_Only_Child%3F
-- /wiki/Who's_the_Boss%3F
-- /api/rest_v1/page/summary/Who_Wants_to_Be_a_Millionaire%3F
-- /api/rest_v1/page/summary/%D0%9A%D1%82%D0%BE_%D0%B7%D0%B4%D0%B5%D1%81%D1%8C_%D0%B1%D0%BE%D1%81%D1%81%3F
-- Time taken: 8.402 seconds, Fetched 30 row(s)

We can see plenty of cases where the ? is expected and part of the uri_path.

SELECT uri_path, uri_query
FROM wmf_staging.webrequest
WHERE year = 2025 AND month = 3 AND day = 20 AND hour = 15
  AND normalized_host.project_family = 'wikipedia'
  AND uri_path LIKE '%\%3F%'
  AND uri_query IS NOT NULL AND LENGTH(uri_query) > 0
LIMIT 20 ;

-- uri_path        uri_query
-- /api/rest_v1/data/citation/mediawiki/https%3A%2F%2Fwww.demorgen.be%2Fsnelnieuws%2Fanais-maes-volgt-ans-persoons-op-als-vooruit-schepen-in-brussel~bf3e6b8f%2F%3Futm_source%3Dchatgpt.com%26referrer%3Dhttps%3A%2F%2Fchatgpt.com%2F     ?action=query&format=json
-- /wiki/Scene_It%3F_Twilight      ?useskin=vector&useskin=vector
-- /wiki/Wie_is_Di-rect%3F ?action=edit&veswitched=1
-- /api/rest_v1/page/summary/Que_ferais-je_de_cette_%C3%A9p%C3%A9e_%3F     ?
-- /api/rest_v1/page/summary/Who_shot_J.R.%3F      ?redirect=true
-- /wiki/Is_That_You%3F_(Adventure_Time)   ?useskin=vector&useskin=vector
-- /api/rest_v1/page/summary/Whose_Life_Is_It_Anyway%3F    ?_(film)=
-- /wiki/Special%3ARecentChangesLinked%2FWhat_Price_Confidence%3F  ?useskin=vector&useskin=vector
-- /api/rest_v1/page/summary/%E3%82%AF%E3%83%AA%E3%82%B9%E3%83%9E%E3%82%B9%E3%81%AB%E9%9B%AA%E3%81%AF%E9%99%8D%E3%82%8B%E3%81%AE%3F      ?redirect=true
-- /api/rest_v1/page/summary/Que_font_les_rennes_apr%C3%A8s_No%C3%ABl_%3F  ?
-- /api/rest_v1/page/summary/Will_Any_Gentleman%3F ?redirect=true
-- /api/rest_v1/page/summary/Dude%2C_Where's_My_Car%3F     ?redirect=true
-- /wiki/%E8%87%AA%E7%A7%B0F%E3%83%A9%E3%83%B3%E3%82%AF%E3%81%AE%E3%81%8A%E5%85%84%E3%81%95%E3%81%BE%E3%81%8C%E3%82%B2%E3%83%BC%E3%83%A0%E3%81%A7%E8%A9%95%E4%BE%A1%E3%81%95%E3%82%8C%E3%82%8B%E5%AD%A6%E5%9C%92%E3%81%AE%E9%A0%82%E7%82%B9%E3%81%AB%E5%90%9B%E8%87%A8%E3%81%99%E3%82%8B%E3%81%9D%E3%81%86%E3%81%A7%E3%81%99%E3%82%88%3F ?oldformat=true
-- /api/rest_v1/page/mobile-html/What_Do_You_Mean%3F       ?redirect=true
-- /api/rest_v1/page/html/Mai_e_mult_p%C3%A2n%C4%83_ajungem%3F/15903289    ?redirect=no
-- /api/rest_v1/data/citation/mediawiki/https%3A%2F%2Fwww.imdb.com%2Fit%2Ftitle%2Ftt35530321%2F%3Fref_%3Dnm_knf_t_1        ?action=query&format=json
-- /wiki/Wer_wei%C3%9F_denn_sowas%3F/Kandidatenliste       ?utm_source=perplexity
-- /api/rest_v1/page/mobile-html/Czy_m%C4%99%C5%BCczy%C5%BAni_to_pedofile%3F       ?redirect=true
-- /wiki/Scene_It%3F_Twilight      ?useskin=vector&useskin=vector
-- /api/rest_v1/data/citation/mediawiki/https%3A%2F%2Fdas.nlpi.edu.tw%2Fcgi-bin%2Fgs32%2Fgsweb.cgi%2Fccd%3DzV52Ll%2Fmygraphviewer_nlpi%3Fdbid%3DZjJkcW1YWitaNDU3Z1lCUlpHU1RTbStUZGxjPQ%3D%3D%26fulltype%3Dospath%26initpage%3D1%26%23     ?action=query&format=json
-- Time taken: 20.336 seconds, Fetched 20 row(s)

So, I don't think we should try to modify the uri_path & uri_query for all rows of webrequest.

Some normalization, including URL decoding, could be performed. But I feel it would be for a restricted dataset.

What I'm sure of is that the HAProxy migration is neutral to the problem.

This is still a thing with HAProxy data:

scala> spark.sql("""
     | SELECT
     |     normalized_host.project_family as project_family,
     |     count(1) as count
     | FROM wmf.webrequest
     | -- FROM wmf_staging.webrequest
     | WHERE
     |     year = 2025 AND month = 4 AND day = 4
     |   AND uri_path LIKE '%\%3F%'
     |   AND (uri_query IS NULL OR uri_query = '')
     | group by normalized_host.project_family
     | order BY count desc
     | """).show(100, false)
+----------------+------+                                                       
|project_family  |count |
+----------------+------+
|wikipedia       |310222|
|wikimedia       |192680|
|wikisource      |31855 |
|wiktionary      |14891 |
|wikivoyage      |5267  |
|wikinews        |3932  |
|wikibooks       |2645  |
|wikiversity     |2003  |
|wikidata        |1417  |
|wikiquote       |1237  |
|mediawiki       |584   |
|atomicmail      |3     |
|evilhostav9oufub|1     |
|wikibrief       |1     |
|evilhostrsepwxep|1     |
|-               |1     |
+----------------+------+

I don't think we should alter the current value of the column uri_query. The creators of those malformed rows inserted an escaped character %3F making the whole url unexpected, but yet the current algorithm correctly (to me) split the path from the query.

@AndrewTavis_WMDE closing the ticket. Let us know if you want to reopen it.

I don't have any major preference on this being done, so if it's unlikely to be prioritized then it can be dropped :) Was opened based on the original observation and reporting it to @JAllemandou who agreed that fixing this around the time of the HAProxy migration could make sense. I have a work around on my end, so it's certainly not pressing.

Thanks all for looking into this! 😊