Provide a way to save extracted query-information in parquet format
Closed, Resolved, Public

Description

Being able to save the extracted information in Parquet will be very useful, as it allows the queries to be processed automatically as they flow in (hourly or daily, for instance), which facilitates regular analysis.

Update

About 92% of the daily SPARQL queries were processed successfully (for 10 May 2021). The queries that fail to parse contain some additional prefixes, such as mwapi; not all of the queries that failed parsing were checked.
Daily data size is around 3 to 4 GB (3.9, 3.3, and 2.7 GB for 10, 13, and 17 May 2021, respectively).

In the end, the data was arranged as case classes, which form named structs in Spark. The schema of the extracted data is as follows:

root
 |-- id: string (nullable = true)
 |-- query: string (nullable = true)
 |-- query_time: long (nullable = true)
 |-- query_time_class: string (nullable = false)
 |-- ua: string (nullable = true)
 |-- q_info: struct (nullable = true)
 |    |-- queryReprinted: string (nullable = true)
... more fields ...
 |    |-- triples: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- subjectNode: struct (nullable = true)
 |    |    |    |    |-- nodeType: string (nullable = true)
 |    |    |    |    |-- nodeValue: string (nullable = true)
 |    |    |    |-- predicateNode: struct (nullable = true)
 |    |    |    |    |-- nodeType: string (nullable = true)
 |    |    |    |    |-- nodeValue: string (nullable = true)
 |    |    |    |-- objectNode: struct (nullable = true)
 |    |    |    |    |-- nodeType: string (nullable = true)
 |    |    |    |    |-- nodeValue: string (nullable = true)
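
As an illustration of how case classes map onto the named structs above, here is a minimal sketch in plain Scala. The class names (Node, Triple, QueryInfo, QueryRecord) and all sample values are hypothetical; only the field names are taken from the printed schema. The fields elided in the listing are left out, and using Option[Long] for the nullable query_time is an assumption rather than the ticket's actual definition.

```scala
// Class names are hypothetical; only the field names come from the printed schema.
final case class Node(nodeType: String, nodeValue: String)

final case class Triple(subjectNode: Node, predicateNode: Node, objectNode: Node)

// The fields elided in the schema listing are intentionally left out here.
final case class QueryInfo(queryReprinted: String, triples: Seq[Triple])

final case class QueryRecord(
  id: String,
  query: String,
  query_time: Option[Long], // assumption: Option models the nullable long
  query_time_class: String,
  ua: String,
  q_info: QueryInfo
)

object SchemaSketch {
  // All values below are made up for illustration.
  val sample: QueryRecord = QueryRecord(
    id = "example-id",
    query = "SELECT ?s WHERE { ?s ?p ?o }",
    query_time = Some(120L),
    query_time_class = "example-class",
    ua = "example-agent",
    q_info = QueryInfo(
      queryReprinted = "SELECT ?s WHERE { ?s ?p ?o }",
      triples = Seq(
        Triple(Node("var", "s"), Node("var", "p"), Node("var", "o"))
      )
    )
  )

  // Prints the value of the predicate node of the first triple.
  def main(args: Array[String]): Unit =
    println(sample.q_info.triples.head.predicateNode.nodeValue)
}
```

When a Dataset is built from case classes like these, Spark derives one struct per class, so each nested field is addressable by name.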

The schema allows easy access to the nodes, along with their types and values. We can now process and save the data as it flows in. T273854
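
To make the access pattern concrete, here is a rough sketch that reads the saved Parquet data and counts how often each predicate occurs. It assumes a DataFrame with the schema shown above; the object name PredicateCounts, the output column names, and the input path (taken from the first command-line argument) are hypothetical and are not part of the ticket.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object PredicateCounts {
  // Produces one row per triple, then counts occurrences of each predicate node.
  def predicateCounts(queries: DataFrame): DataFrame =
    queries
      .select(explode(col("q_info.triples")).as("triple"))
      .select(
        col("triple.predicateNode.nodeType").as("node_type"),
        col("triple.predicateNode.nodeValue").as("node_value")
      )
      .groupBy("node_type", "node_value")
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("predicate-counts").getOrCreate()
    // args(0) is a placeholder for wherever the Parquet output is stored.
    val queries = spark.read.parquet(args(0))
    predicateCounts(queries).orderBy(col("count").desc).show(20, truncate = false)
    spark.stop()
  }
}
```

Because the nodes are named structs, the dotted paths reach the type and value of each node directly, without any string parsing of the query text.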

Event Timeline

@AKhatun_WMF That's great! Could you please provide some info on the expected data size in Parquet (for daily data, for instance)? Many thanks.

@JAllemandou Added an estimate of the daily data size.

Great! Thanks for that :) Closing the ticket.