
Build integration environment for search platform airflow + hadoop + spark integration
Closed, Resolved · Public

Description

As a search platform developer, I need more than unit test coverage to reduce the number of post-deployment issues.

We have a variety of unit tests covering the Spark code in Airflow, but nothing that can actually run the system end to end outside production. Today, integration testing a script before shipping mostly means copying it to production and running it on the prod clusters with test output locations. A real integration environment would not only be useful for integration testing; it would likely also ease introducing new developers to this codebase, since they would have more options for experimenting with the system in a safe environment.

Needed:

  • Real table schemas for our inputs
  • Small amounts of fake data, even if it's only one user session.

Nice to have:

  • Installs Airflow and can execute individual tasks from DAGs via airflow test (see the sketch after this list)
  • YARN for integration testing with Spark's cluster deploy mode and our dependency shipping
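
To make the airflow test item concrete, here is a rough sketch of exercising a single DAG task from inside the environment. The DAG and task names are placeholders, not real DAGs in this repo; the `airflow test <dag_id> <task_id> <execution_date>` form matches the Airflow 1.x CLI:

```python
# Hypothetical helper: run one DAG task the way `airflow test` does,
# i.e. a single task instance with no scheduler or backfill state.
import subprocess

def run_task(dag_id: str, task_id: str, execution_date: str) -> None:
    # Airflow 1.x CLI form: airflow test <dag_id> <task_id> <execution_date>
    subprocess.run(
        ["airflow", "test", dag_id, task_id, execution_date],
        check=True,
    )

if __name__ == "__main__":
    # Placeholder names for illustration only.
    run_task("example_dag", "example_task", "2020-11-01")
```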

Not included:

  • Being able to run DAGs end-to-end. This would be possible, but seems like a significant amount of work beyond just having a system where we can run the scripts and expect some sort of valid output.
  • A test suite. While a test suite can be built on top of this environment, this task covers only the first step: creating an environment that can run most things outside production.

Event Timeline

One option would be faking a prod setup with some docker images. Cloudera (our Hadoop distribution) used to (and may still) provide a docker image that stands up HDFS + Hive + related services. We ought to be able to set up a second image for integration testing that can talk to this.

We could do much of this in pytest, but what we would lose is having a shell where you can execute the plain Python scripts or invoke the Airflow tasks, so they run in a manner similar to production and cover as much of the integration as possible.
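
For comparison, a rough sketch of what the pytest route might look like, assuming a session-scoped Spark fixture pointed at the dockerized Hive metastore. The fixture, table, and assertion here are illustrative, not code from any patch:

```python
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Local-mode session with Hive support, relying on the environment's
    # hive-site.xml to point at the dockerized metastore.
    session = (
        SparkSession.builder
        .master("local[2]")
        .enableHiveSupport()
        .getOrCreate()
    )
    yield session
    session.stop()

def test_fake_session_is_queryable(spark):
    # One fake user session's worth of events should be visible end to end.
    n = spark.sql(
        "SELECT COUNT(*) AS n FROM event.mediawiki_cirrussearch_request"
    ).collect()[0].n
    assert n > 0
```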

TJones renamed this task from Build integration test suite for search platform airflow + hadoop +spark integration to Build integration test suite for search platform airflow + hadoop + spark integration. (Oct 19 2020, 3:20 PM)
EBernhardson raised the priority of this task from Medium to High.
EBernhardson moved this task from needs triage to ML & Data Pipeline on the Discovery-Search board.

Been moving this forward. While initial work started with the Cloudera quickstart docker images, after talking to Analytics everything was switched over to Apache Bigtop 1.4, which will be replacing Cloudera in WMF prod. This is designed as a docker-compose project, although the first step here only uses a single image.

Current status:

  • Managing the process is all wrapped behind two commands: ./build.sh --create and ./build.sh --destroy
  • The Bigtop Hadoop image boots up; HDFS, Hive and YARN are all running
  • Spark is installed and communicates properly with Hive in local, client, and cluster deploy modes
  • Imports all relevant table schemas from the analytics/refinery and wikimedia/discovery/analytics repositories
    • Added auto-generated (eventgate) schemas to the wikimedia/discovery/analytics repository and creates those as well
  • Basic data generation scripts populate event.mediawiki_cirrussearch_request and event.mediawiki_revision_score (see the sketch after this list)
  • Airflow is installed in the Hadoop image; the webserver and scheduler *don't* (yet) start or run properly
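
To illustrate the data generation item above, a minimal stand-in sketch. The real tables are created from the imported eventgate schemas and are far wider than this two-column example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Stand-in schema only; the real table comes from the imported
# eventgate-generated DDL and has many more fields.
spark.sql("CREATE DATABASE IF NOT EXISTS event")
spark.sql("""
    CREATE TABLE IF NOT EXISTS event.mediawiki_cirrussearch_request (
        query STRING,
        hits_total INT
    )
""")

# A single fake search session is enough for end-to-end runs.
one_session = [("example query", 3)]
df = spark.createDataFrame(one_session, "query string, hits_total int")
df.write.insertInto("event.mediawiki_cirrussearch_request")
```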

Change 639876 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[search/analytics-integration@master] Search platform analytics integration environment

https://gerrit.wikimedia.org/r/639876

The attached patch provides an integration environment capable of fully running the ORES pipelines, from sourcing the data in eventgate tables to writing out the formatted elasticsearch bulk files in HDFS. Some support for manually testing the upload to Swift is included, but the integration from Airflow to Swift is incomplete. In general, anything that relies on YARN + Hadoop + Spark should be testable; the very edges of the system are less so, but can be improved in the future. There will still be significant difficulties testing jobs that have specific resource provisioning (some glent, some mjolnir), as we currently don't have a way to override those settings.
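
As a flavor of the final step the environment exercises, a toy sketch of formatting rows as elasticsearch bulk lines and writing them to HDFS. The table, field names, and output path are placeholders, not the actual pipeline code:

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def to_bulk(row):
    # elasticsearch bulk format: one action line, then one source line.
    action = json.dumps({"update": {"_id": row.page_id}})
    source = json.dumps({"doc": {"ores_score": row.score}})
    return action + "\n" + source

# "hypothetical_scores" stands in for the real ORES pipeline output table.
df = spark.table("hypothetical_scores")
df.rdd.map(to_bulk).saveAsTextFile("hdfs:///tmp/es_bulk_demo")
```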

Change 639876 merged by DCausse:
[search/analytics-integration@master] Search platform analytics integration environment

https://gerrit.wikimedia.org/r/639876

EBernhardson renamed this task from Build integration test suite for search platform airflow + hadoop + spark integration to Build integration environment for search platform airflow + hadoop + spark integration. (Nov 10 2020, 3:22 PM)
EBernhardson updated the task description.