Being able to save the information in Parquet will be very useful, as it allows the queries to be processed automatically as they flow in (hourly or daily, for instance), facilitating regular analysis.
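As an illustration, here is a minimal sketch of how such a daily batch could be appended to a date-partitioned Parquet table; the ParsedQuery record, the "day" partition column, and the output path are all hypothetical stand-ins, not the actual job:

import org.apache.spark.sql.{Dataset, SaveMode, SparkSession}
import org.apache.spark.sql.functions.lit

// Hypothetical, trimmed-down record type; the real extraction uses richer case classes.
case class ParsedQuery(id: String, query: String, query_time: Long, ua: String)

object DailyParquetWriter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sparql-daily-parquet").getOrCreate()
    import spark.implicits._

    // Stand-in for one day's parsed SPARQL queries; in practice this would be
    // the output of the query-parsing job.
    val parsedQueries: Dataset[ParsedQuery] = spark.emptyDataset[ParsedQuery]

    // Append the day's batch to a Parquet table partitioned by date, so that
    // downstream analyses can pick up new partitions hourly or daily.
    parsedQueries
      .withColumn("day", lit("2021-05-10")) // assumed partition column
      .write
      .mode(SaveMode.Append)
      .partitionBy("day")
      .parquet("hdfs:///path/to/sparql_queries_parquet") // placeholder output path
  }
}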
Update
~92% of the daily SPARQL queries are processed successfully (for 10 May 2021). Those that fail to parse contain additional prefixes such as mwapi; not all of the queries that failed parsing have been checked yet.
Daily data size is around 3-4 GB (3.9, 3.3, and 2.7 GB for 10, 13, and 17 May 2021, respectively).
The data was finally arranged as case classes, which form named structs in Spark. The schema of the extracted data is as follows:
root
 |-- id: string (nullable = true)
 |-- query: string (nullable = true)
 |-- query_time: long (nullable = true)
 |-- query_time_class: string (nullable = false)
 |-- ua: string (nullable = true)
 |-- q_info: struct (nullable = true)
 |    |-- queryReprinted: string (nullable = true)
 |    ... more fields ...
 |    |-- triples: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- subjectNode: struct (nullable = true)
 |    |    |    |    |-- nodeType: string (nullable = true)
 |    |    |    |    |-- nodeValue: string (nullable = true)
 |    |    |    |-- predicateNode: struct (nullable = true)
 |    |    |    |    |-- nodeType: string (nullable = true)
 |    |    |    |    |-- nodeValue: string (nullable = true)
 |    |    |    |-- objectNode: struct (nullable = true)
 |    |    |    |    |-- nodeType: string (nullable = true)
 |    |    |    |    |-- nodeValue: string (nullable = true)
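A minimal sketch of case classes that would produce this kind of schema is shown below; the field names mirror the schema above, but the class names (NodeInfo, Triple, QueryInfo, QueryRecord) are assumptions for illustration, and the elided fields are omitted:

// Assumed class names; field names follow the schema above.
case class NodeInfo(nodeType: String, nodeValue: String)

case class Triple(
  subjectNode: NodeInfo,
  predicateNode: NodeInfo,
  objectNode: NodeInfo
)

case class QueryInfo(
  queryReprinted: String,
  // ... more fields ...
  triples: Seq[Triple]
)

case class QueryRecord(
  id: String,
  query: String,
  query_time: Long,
  query_time_class: String,
  ua: String,
  q_info: QueryInfo
)

With definitions like these, building a Dataset of such records (e.g. via toDS on a collection, with spark.implicits._ in scope) lets Spark's encoders derive the nested struct schema automatically.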
The schema allows easy access to the nodes as well as their types and values. We can now process and save the data as it flows in. T273854
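For example, a sketch of querying the nested structure directly (the Parquet path is a placeholder; column references are taken from the schema above), here counting the most frequent predicates:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object PredicateCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("predicate-counts").getOrCreate()

    // Read the extracted query data (placeholder path).
    val queries = spark.read.parquet("hdfs:///path/to/sparql_queries_parquet")

    // Flatten the nested triples array and count the most frequent predicate values.
    queries
      .select(explode(col("q_info.triples")).as("triple"))
      .groupBy(col("triple.predicateNode.nodeValue").as("predicate"))
      .count()
      .orderBy(col("count").desc)
      .show(20, truncate = false)
  }
}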