Context
While working on a task to investigate Wikidata searches, I came across entries in wmf.webrequest where uri_path contains the value that would be expected in uri_query.
Steps to replicate
Running the following query will return such results:
SELECT *
FROM wmf.webrequest
WHERE year = 2024
  AND month = 12
  AND normalized_host.project_family = 'wikidata'
  -- %3F is '?', with the issue likely stemming from the search query including this instead.
  AND uri_path LIKE '/w/index.php%3Fsearch%'
LIMIT 5;
What happens?:
Results are returned, which shouldn't be the case: the normal WHERE clause for deriving search traffic relies on uri_path = '/w/index.php' and looks something like the following:
WHERE ... AND uri_path = '/w/index.php' AND uri_query LIKE '%search%'
Specifically, the entries returned by the query above also have empty uri_query fields. The general thought from @JAllemandou is that the varnish-kafka log builder doesn't split on %3F, which is being included programmatically in the requests (they're basically all from a single bot).
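To illustrate the suspected mechanism, here is a minimal sketch of how a typical URL splitter treats a literal '?' versus its percent-encoded form %3F. It uses Python's standard urllib.parse, not the varnish-kafka code, and the example URLs and the helper name split_path_and_query are made up for illustration; the actual log builder may behave differently.

```python
from urllib.parse import urlsplit


def split_path_and_query(url):
    """Return (path, query) the way a standard URL splitter sees them.

    Hypothetical helper for illustration only. Only a literal '?'
    separates path from query; '%3F' is left untouched inside the path.
    """
    parts = urlsplit(url)
    return parts.path, parts.query


# Normal search request: literal '?' separates path and query.
normal = split_path_and_query(
    "https://www.wikidata.org/w/index.php?search=example"
)

# Request as the bot appears to send it: '?' is percent-encoded as %3F,
# so everything stays in the path and the query is empty.
encoded = split_path_and_query(
    "https://www.wikidata.org/w/index.php%3Fsearch=example"
)

print(normal)
print(encoded)
```

Under these assumptions the second request ends up with the whole search string in the path and an empty query, which matches the pattern seen in the returned entries.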
What should have happened instead?:
No results should be returned and all entries should have /w/index.php for their uri_path and the rest of the search query as the uri_query value.
Software version
Current version of wmf.webrequest.
Other information
@JAllemandou mentioned that there's a varnish-kafka to HAProxy migration in the works and that this task might be something to look into to make sure that this is fixed. Let me know if there's anything else I can do to help :)