Being able to save the information in Parquet will be very useful, as it allows the queries to be processed automatically as they flow in (hourly or daily, for instance), facilitating regular analysis.
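As an illustration, here is a minimal sketch of how such a daily batch could be appended to a date-partitioned Parquet table; the ParsedQuery record, the "day" partition column, and the output path are all hypothetical stand-ins, not the actual job:

import org.apache.spark.sql.{Dataset, SaveMode, SparkSession}
import org.apache.spark.sql.functions.lit

// Hypothetical, trimmed-down record type; the real extraction uses richer case classes.
case class ParsedQuery(id: String, query: String, query_time: Long, ua: String)

object DailyParquetWriter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sparql-daily-parquet").getOrCreate()
    import spark.implicits._

    // Stand-in for one day's parsed SPARQL queries; in practice this would be
    // the output of the query-parsing job.
    val parsedQueries: Dataset[ParsedQuery] = spark.emptyDataset[ParsedQuery]

    // Append the day's batch to a Parquet table partitioned by date, so that
    // downstream analyses can pick up new partitions hourly or daily.
    parsedQueries
      .withColumn("day", lit("2021-05-10")) // assumed partition column
      .write
      .mode(SaveMode.Append)
      .partitionBy("day")
      .parquet("hdfs:///path/to/sparql_queries_parquet") // placeholder output path
  }
}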
Update
~92% of the daily SPARQL queries are processed successfully (for 10 May 2021). Those that fail to parse contain additional prefixes such as mwapi; not all of the queries that failed parsing have been checked yet.
Daily data size is around 3-4 GB (3.9, 3.3, and 2.7 GB for 10, 13, and 17 May 2021, respectively).
The data was finally arranged as case classes, which form named structs in Spark. The schema of the extracted data is as follows:
root
 |-- id: string (nullable = true)
 |-- query: string (nullable = true)
 |-- query_time: long (nullable = true)
 |-- query_time_class: string (nullable = false)
 |-- ua: string (nullable = true)
 |-- q_info: struct (nullable = true)
 |    |-- queryReprinted: string (nullable = true)
 |    ... more fields ...
 |    |-- triples: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- subjectNode: struct (nullable = true)
 |    |    |    |    |-- nodeType: string (nullable = true)
 |    |    |    |    |-- nodeValue: string (nullable = true)
 |    |    |    |-- predicateNode: struct (nullable = true)
 |    |    |    |    |-- nodeType: string (nullable = true)
 |    |    |    |    |-- nodeValue: string (nullable = true)
 |    |    |    |-- objectNode: struct (nullable = true)
 |    |    |    |    |-- nodeType: string (nullable = true)
 |    |    |    |    |-- nodeValue: string (nullable = true)
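A minimal sketch of case classes that would produce this kind of schema is shown below; the field names mirror the schema above, but the class names (NodeInfo, Triple, QueryInfo, QueryRecord) are assumptions for illustration, and the elided fields are omitted:

// Assumed class names; field names follow the schema above.
case class NodeInfo(nodeType: String, nodeValue: String)

case class Triple(
  subjectNode: NodeInfo,
  predicateNode: NodeInfo,
  objectNode: NodeInfo
)

case class QueryInfo(
  queryReprinted: String,
  // ... more fields ...
  triples: Seq[Triple]
)

case class QueryRecord(
  id: String,
  query: String,
  query_time: Long,
  query_time_class: String,
  ua: String,
  q_info: QueryInfo
)

With definitions like these, building a Dataset of such records (e.g. via toDS on a collection, with spark.implicits._ in scope) lets Spark's encoders derive the nested struct schema automatically.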
The schema allows easy access to the nodes as well as their types and values. We can now process and save the data as it flows in. T273854
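For example, a sketch of querying the nested structure directly (the Parquet path is a placeholder; column references are taken from the schema above), here counting the most frequent predicates:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object PredicateCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("predicate-counts").getOrCreate()

    // Read the extracted query data (placeholder path).
    val queries = spark.read.parquet("hdfs:///path/to/sparql_queries_parquet")

    // Flatten the nested triples array and count the most frequent predicate values.
    queries
      .select(explode(col("q_info.triples")).as("triple"))
      .groupBy(col("triple.predicateNode.nodeValue").as("predicate"))
      .count()
      .orderBy(col("count").desc)
      .show(20, truncate = false)
  }
}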