We ran into some quirks relying on the EventGate pipeline to capture the scraper's per-page summaries and ingest them into Hive. Specifically, there seems to be no way to guarantee that the data has been fully imported. As a fallback, we'll return to the native Hive connector we started writing, which is roughly 80% complete. It has drawbacks of its own, but if we can make it work it would significantly simplify the scraper job in Airflow.
Be sure to safeguard against SQL injection: we're forced to use the legacy HiveServer2 protocol, which lacks query parameterization. The known special characters are \ and ', but we should look through the Hive server and client code, and read the Hive CVEs, to be as certain as possible that we're properly escaping and sanitizing inputs.
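To make the escaping concrete, here's an illustrative Python sketch (the real connector is in Elixir; the function name and the exact character set are assumptions to be verified against the Hive lexer and the CVE review above):

```python
# Illustrative sketch only: escape a freeform string for use inside a
# single-quoted Hive string literal. The character set (backslash,
# single quote, and a couple of control characters) is an assumption
# to confirm against the Hive source, since HiveServer2 gives us no
# real parameterization to fall back on.

def escape_hive_string(value: str) -> str:
    # Escape backslash first, so the backslashes we add for quotes
    # below don't get doubled a second time.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    # Newlines and tabs can corrupt text-format tables; escape them too.
    escaped = escaped.replace("\n", "\\n").replace("\t", "\\t")
    return f"'{escaped}'"
```

For example, `escape_hive_string("O'Brien")` yields the literal `'O\'Brien'`, which Hive parses back to the original value.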
Implementation is roughly:
- Finish the Elixir connector for Hive, including krb5 support; verify on stat1010. Implementation:
  - Write a method which safely escapes freeform text column values.
- Switch the scraper to use the Hive connector.
- Create the per-page schema in Hive.
- Adjust the aggregation to read from this schema.
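Once the escaping method exists, the scraper's write path could look something like this Python sketch (the table and column names are hypothetical placeholders, not the real per-page schema, and the actual client will be the Elixir connector):

```python
# Hypothetical sketch of building a safe INSERT for the per-page table.
# Table and column names here are placeholders, not the real schema.

def escape_hive_string(value: str) -> str:
    # Same escaping as the connector's method: backslash first, then quote.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"

def build_insert(table: str, page_title: str, summary: str) -> str:
    # Only the values are escaped. Identifiers (table/column names) must
    # come from trusted code, since Hive quotes identifiers differently
    # from string literals (backticks, not single quotes).
    return (
        f"INSERT INTO {table} (page_title, summary) "
        f"VALUES ({escape_hive_string(page_title)}, {escape_hive_string(summary)})"
    )
```

Keeping statement construction in one place like this makes it easy to audit that every freeform value passes through the escaper before reaching HiveServer2.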