Steps to replicate the issue (include links if applicable):
- On Superset (on database presto_analytics_hive) run the following query:
SELECT user_agent from wmf.webrequest WHERE webrequest_source = 'text' AND year = 2022 AND month = 10 AND day = 11 AND uri_host = 'www.wikidata.org' AND is_pageview AND namespace_id = 640 AND agent_type = 'spider' AND user_agent = '-' LIMIT 10
(replace year/date/month with a date in the last 30 days to make sure there is data)
What happens?:
Presto Error presto error: Corrupted statistics for column "[user_agent] optional binary " in Parquet file "hdfs://analytics-hadoop/wmf/data/wmf/webrequest/webrequest_source=text/year=2022/month=10/day=11/hour=7/000084_0": [min: 0xD181D182D180D0BED0B8D182D0B5D0BBD18C204D524120352E3520286275696C64203032383432293B, max: 0x7B27557365722D4167656E74273A202777696B695F7061727365725F6F6E6C696E652F302E31372E31202868747470733A2F2F646565707061766C6F762E61693B20696E666F40646565707061766C6F762E61692920646565707061766C6F762F302E31372E31277D, num_nulls: 0] This may be triggered by: Issue 1002 - The database returned an unexpected error.
That last line has a link to https://superset.apache.org/docs/miscellaneous/issue-codes#issue-1002
What should have happened instead?:
I should see results that allow my to further look into those spider requests that have a user agent of -.
Other information (browser name/version, screenshots, etc.):
Running the same query without the line AND user_agent = '-' works as expect. So selecting (and GROUP BY) works, filtering somehow doesn't.
SELECT user_agent from wmf.webrequest WHERE webrequest_source = 'text' AND year = 2022 AND month = 10 AND day = 11 AND uri_host = 'www.wikidata.org' AND is_pageview AND namespace_id = 640 AND agent_type = 'spider' LIMIT 10