As a search platform developer, I need more than just unit test coverage to reduce the number of post-deployment issues.
We have a variety of unit tests for the Spark code in Airflow, but nothing that can actually run the system end to end outside production. Today, integration testing a script before shipping mostly means copying it to production and running it on the prod clusters with test output locations. A real integration environment would not only be useful for integration testing, but would likely also make it easier to introduce new developers to this codebase, as they would have more options for experimenting with the system in a safe environment.
Needed:
- real table schemas for our inputs
- small amounts of fake data, even if it's only 1 user session.
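
As a rough illustration of the two items above, here is a minimal sketch of generating a single fake user session and checking it against a schema. The column names, types, and helper names (SCHEMA, make_fake_session, validate) are all hypothetical placeholders; the real input table schemas would replace them.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical schema: column name -> Python type. The real input tables
# will differ; these names are placeholders for illustration only.
SCHEMA = {
    "session_id": str,
    "timestamp": str,
    "event_type": str,
    "query": str,
    "position": int,
}


def make_fake_session(session_id="test-session-0001", n_events=5, seed=0):
    """Return one fake user session as a list of rows matching SCHEMA."""
    rng = random.Random(seed)  # fixed seed keeps the fixture reproducible
    start = datetime(2020, 1, 1, tzinfo=timezone.utc)
    rows = []
    for i in range(n_events):
        rows.append({
            "session_id": session_id,
            "timestamp": (start + timedelta(seconds=30 * i)).isoformat(),
            "event_type": rng.choice(["search", "click"]),
            "query": "example query",
            "position": rng.randint(0, 19),
        })
    return rows


def validate(rows, schema=SCHEMA):
    """Check that every row has exactly the schema's columns and types."""
    for row in rows:
        if set(row) != set(schema):
            return False
        if any(not isinstance(row[col], typ) for col, typ in schema.items()):
            return False
    return True
```

Rows produced this way could then be written into the test environment's input tables; how they are loaded depends on the environment and is not covered by this sketch.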
Nice to have:
- Installs Airflow and can execute individual tasks from DAGs via airflow test
- YARN, for integration testing with Spark's cluster deploy mode and our dependency shipping
Not included:
- Being able to run DAGs end-to-end. This would be possible, but seems like a significant amount of work beyond just having a system where we can run the scripts and expect some sort of valid output.
- A test suite. While a test suite can be built on top of this environment, this task is only for the first step of creating an environment that can run most things outside production.