
Scraper should write directly to Hive
Closed, ResolvedPublic

Description

We ran into some quirks when relying on the EventGate pipeline to capture scraper per-page summaries and ingest them into Hive. Specifically, there seems to be no way to guarantee that the data has been fully imported. As a fallback plan, we'll go back to using the native Hive connector we started writing, which is roughly 80% complete. This has some drawbacks as well, but if we can make it work it would significantly simplify the scraper job in Airflow.

Be sure to safeguard against SQL injection since we're forced to use the legacy HiveServer2 protocol and it lacks query parameterization. Known special characters are \ and ', but we should look through the Hive server and client code, and read CVEs for Hive, to be as certain as possible that we're properly escaping and sanitizing inputs.
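Since HiveServer2's legacy protocol gives us no query parameterization, escaping has to happen on our side before interpolating values into SQL. A minimal sketch of what that could look like, covering only the two special characters named above (backslash and single quote); the function names are hypothetical, and a review of Hive's escaping rules and CVEs should confirm this list is complete before relying on it:

```python
def escape_hive_string(value: str) -> str:
    """Escape a value for use inside a single-quoted Hive string literal.

    Backslashes must be doubled FIRST, then single quotes escaped;
    doing it in the other order would re-escape the quote escapes.
    Only handles the two known special characters: \\ and '.
    """
    return value.replace("\\", "\\\\").replace("'", "\\'")


def quote_hive_string(value: str) -> str:
    """Wrap an escaped value in single quotes, ready to interpolate into SQL."""
    return "'" + escape_hive_string(value) + "'"
```

For example, `quote_hive_string("O'Brien")` yields the literal `'O\'Brien'`, which Hive parses back to the original string.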

Implementation is roughly,

Event Timeline

awight updated the task description.

The implementation has been smoke-tested on the Analytics testing cluster: it performs inserts and queries, and authenticates and encrypts via Kerberos.

Throughput was 20 rows/sec even after heavy batching (100 rows per statement). This is far slower than we can accept, so I'm abandoning the approach.
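For reference, the batching tried above amounts to building multi-row INSERT statements. A minimal sketch of that shape (table and column names are hypothetical, and values are assumed to be strings or numbers; Hive's `INSERT INTO ... VALUES` syntax support varies by version):

```python
from typing import Iterable, Sequence


def build_batched_inserts(table: str,
                          columns: Sequence[str],
                          rows: Iterable[Sequence],
                          batch_size: int = 100):
    """Yield multi-row INSERT statements with up to batch_size rows each."""

    def literal(v):
        # Hand-rolled escaping for the legacy protocol: double backslashes
        # first, then escape single quotes; numbers pass through unquoted.
        if isinstance(v, str):
            return "'" + v.replace("\\", "\\\\").replace("'", "\\'") + "'"
        return str(v)

    col_list = ", ".join(columns)
    rows = list(rows)
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        values = ", ".join(
            "(" + ", ".join(literal(v) for v in row) + ")" for row in batch
        )
        yield f"INSERT INTO {table} ({col_list}) VALUES {values}"
```

Even with 100 rows per statement, each statement is a full Hive query launch, which is consistent with the poor throughput observed.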

We'll now try a JSON file output and Spark ingestion of that file, to be outlined in a new subtask.
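The JSON output for Spark would most likely be newline-delimited JSON (one object per line), which is the line-delimited format `spark.read.json()` consumes. A minimal sketch of the writer side, with hypothetical field names and paths; the Spark ingestion itself belongs in the new subtask:

```python
import json


def write_page_summaries(path: str, summaries) -> None:
    """Write per-page summaries as newline-delimited JSON.

    One JSON object per line, so Spark can ingest the file with
    spark.read.json(). Field names in the summaries are up to the scraper.
    """
    with open(path, "w", encoding="utf-8") as f:
        for summary in summaries:
            f.write(json.dumps(summary, ensure_ascii=False) + "\n")


# Ingestion side would then be roughly (requires a Spark session, not runnable here;
# the HDFS path and Hive table name are hypothetical):
# df = spark.read.json("hdfs:///path/to/page_summaries.json")
# df.write.insertInto("scraper.page_summaries")
```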

Tobi_WMDE_SW claimed this task.