Page MenuHomePhabricator

wmf.webrequest: 'presto error: Corrupted statistics for column "[user_agent] optional binary " in Parquet file ...'
Closed, ResolvedPublicBUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • On Superset (on database presto_analytics_hive) run the following query:
SELECT
  user_agent
from wmf.webrequest
WHERE  webrequest_source = 'text'
    AND year = 2022
    AND month = 10
    AND day = 11
    AND uri_host = 'www.wikidata.org'
    AND is_pageview
    AND namespace_id = 640
    AND agent_type = 'spider'
    AND user_agent = '-'
LIMIT 10

(replace year/date/month with a date in the last 30 days to make sure there is data)

What happens?:

image.png (329×906 px, 56 KB)

Presto Error

presto error: Corrupted statistics for column "[user_agent] optional binary " in Parquet file "hdfs://analytics-hadoop/wmf/data/wmf/webrequest/webrequest_source=text/year=2022/month=10/day=11/hour=7/000084_0": [min: 0xD181D182D180D0BED0B8D182D0B5D0BBD18C204D524120352E3520286275696C64203032383432293B, max: 0x7B27557365722D4167656E74273A202777696B695F7061727365725F6F6E6C696E652F302E31372E31202868747470733A2F2F646565707061766C6F762E61693B20696E666F40646565707061766C6F762E61692920646565707061766C6F762F302E31372E31277D, num_nulls: 0]

This may be triggered by:
Issue 1002 - The database returned an unexpected error.

That last line has a link to https://superset.apache.org/docs/miscellaneous/issue-codes#issue-1002

What should have happened instead?:
I should see results that allow my to further look into those spider requests that have a user agent of -.

Other information (browser name/version, screenshots, etc.):
Running the same query without the line AND user_agent = '-' works as expect. So selecting (and GROUP BY) works, filtering somehow doesn't.

SELECT
  user_agent
from wmf.webrequest
WHERE  webrequest_source = 'text'
    AND year = 2022
    AND month = 10
    AND day = 11
    AND uri_host = 'www.wikidata.org'
    AND is_pageview
    AND namespace_id = 640
    AND agent_type = 'spider'
LIMIT 10

Event Timeline

Reading from here: https://github.com/prestodb/presto/issues/12338
It seems we can disable the failure on parquet statistics by setting hive.parquet.fail-on-corrupted-statistics = false - @BTullis can you help with this please?
I think the problem is fixed on presto newer versions, but we don't update often.

@Michael: Until we get a fix, AND length(user_agent) < 2 as a workaround might be OK?

I just discovered this old ticket whilst searching for something else. Apologies for having missed the request to update the config for presto @JAllemandou

Is anyone able to verify whether or not this is still a problem please? We have updated Presto a couple of times since the ticket was created.

Gehel triaged this task as Medium priority.Feb 9 2024, 1:30 PM
Gehel moved this task from Incoming to 2024.02.12 - 2024.03.03 on the Data-Platform-SRE board.

Just tried the query in the description with some recent dates but the dates I picked didn't have any requests with '-' UA strings, so it's hard to know if the problem persists. The query runs fine with user_agent = '-' just returns no data. I just made a request to https://www.wikidata.org/wiki/EntitySchema:E1 (according to https://www.wikidata.org/wiki/Help:Namespaces namespace 640 is EntitySchema so I just picked one):

curl --user-agent '-' https://www.wikidata.org/wiki/EntitySchema:E1

I made this at year = 2024 AND month = 2 AND day = 9 AND hour = 14 so I can check in a couple of hours if such a UA string results in any issues with Parquet & Presto.

SELECT
  user_agent
from wmf.webrequest
WHERE  webrequest_source = 'text'
    AND year = 2024
    AND month = 2
    AND day = 9
    AND hour = 14
    AND uri_host = 'www.wikidata.org'
    AND is_pageview
    AND namespace_id = 640
    AND agent_type = 'spider'
    AND user_agent = '-'
LIMIT 10

No problems there. Returns the UA from my request.

BTullis claimed this task.
BTullis moved this task from Backlog to Done on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.

Thanks for checking @mpopov - I'll tentatively resolve this ticket, given that we can't reproduce the original error. If it comes back, or if we find another way to reproduce it, please do feel free to reopen.