
Build integration environment for search platform airflow + hadoop + spark integration
Closed, Resolved · Public

Description

As a search platform developer, I need more than unit test coverage to reduce the number of post-deployment issues.

We have a variety of unit tests covering the Spark code in Airflow, but nothing that can actually run the system end to end outside production. Today, integration testing a script before shipping mostly means copying it to production and running it on the prod clusters with test output locations. A real integration environment would not only be useful for integration testing; it would likely also ease introducing new developers to this codebase, since they would have more options for experimenting with the system in a safe environment.

Needed:

  • Real table schemas for our inputs
  • Small amounts of fake data, even if it's only one user session.

Nice to have:

  • Installs Airflow and can execute individual tasks from DAGs via airflow test (see the sketch after this list)
  • YARN for integration testing with Spark's cluster deploy mode and our dependency shipping
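
To make the airflow test item concrete, here is a rough sketch of exercising a single DAG task from inside the environment. The DAG and task names are placeholders, not real DAGs in this repo; the `airflow test <dag_id> <task_id> <execution_date>` form matches the Airflow 1.x CLI:

```python
# Hypothetical helper: run one DAG task the way `airflow test` does,
# i.e. a single task instance with no scheduler or backfill state.
import subprocess

def run_task(dag_id: str, task_id: str, execution_date: str) -> None:
    # Airflow 1.x CLI form: airflow test <dag_id> <task_id> <execution_date>
    subprocess.run(
        ["airflow", "test", dag_id, task_id, execution_date],
        check=True,
    )

if __name__ == "__main__":
    # Placeholder names for illustration only.
    run_task("example_dag", "example_task", "2020-11-01")
```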

Not included:

  • Being able to run DAGs end-to-end. This would be possible, but seems like a significant amount of work beyond just having a system where we can run the scripts and expect some sort of valid output.
  • A test suite. While a test suite can be built on top of this environment, this task covers only the first step: creating an environment that can run most things outside production.

Event Timeline

One option would be faking a prod setup with some docker images. Cloudera (our Hadoop distribution) used to (and may still) provide a docker image that stands up HDFS + Hive + related services. We ought to be able to set up a second image for integration testing that can talk to this.

We could do much of this in pytest, but what we would lose is having a shell where you can execute the plain Python scripts or invoke the Airflow tasks, so they run in a manner similar to production and cover as much of the integration as possible.
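
For comparison, a rough sketch of what the pytest route might look like, assuming a session-scoped Spark fixture pointed at the dockerized Hive metastore. The fixture, table, and assertion here are illustrative, not code from any patch:

```python
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Local-mode session with Hive support, relying on the environment's
    # hive-site.xml to point at the dockerized metastore.
    session = (
        SparkSession.builder
        .master("local[2]")
        .enableHiveSupport()
        .getOrCreate()
    )
    yield session
    session.stop()

def test_fake_session_is_queryable(spark):
    # One fake user session's worth of events should be visible end to end.
    n = spark.sql(
        "SELECT COUNT(*) AS n FROM event.mediawiki_cirrussearch_request"
    ).collect()[0].n
    assert n > 0
```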

TJones renamed this task from Build integration test suite for search platform airflow + hadoop +spark integration to Build integration test suite for search platform airflow + hadoop + spark integration. (Oct 19 2020, 3:20 PM)
EBernhardson raised the priority of this task from Medium to High.
EBernhardson moved this task from needs triage to ML & Data Pipeline on the Discovery-Search board.

Been moving this forward. While initial work started with the Cloudera quickstart docker images, after talking to Analytics everything was switched over to Apache Bigtop 1.4, which will be replacing Cloudera in WMF prod. This is designed as a docker-compose project, although the first step here only uses a single image.

Current status:

  • Managing the process is all wrapped behind two commands: ./build.sh --create and ./build.sh --destroy
  • The Bigtop Hadoop image boots up; HDFS, Hive and YARN are all running
  • Spark is installed and communicates properly with Hive in local, client, and cluster deploy modes
  • Imports all relevant table schemas from the analytics/refinery and wikimedia/discovery/analytics repositories
    • Added auto-generated (eventgate) schemas to the wikimedia/discovery/analytics repository and creates those as well
  • Basic data generation scripts populate event.mediawiki_cirrussearch_request and event.mediawiki_revision_score (see the sketch after this list)
  • Airflow is installed in the Hadoop image; the webserver and scheduler *don't* (yet) start or run properly
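
To illustrate the data generation item above, a minimal stand-in sketch. The real tables are created from the imported eventgate schemas and are far wider than this two-column example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Stand-in schema only; the real table comes from the imported
# eventgate-generated DDL and has many more fields.
spark.sql("CREATE DATABASE IF NOT EXISTS event")
spark.sql("""
    CREATE TABLE IF NOT EXISTS event.mediawiki_cirrussearch_request (
        query STRING,
        hits_total INT
    )
""")

# A single fake search session is enough for end-to-end runs.
one_session = [("example query", 3)]
df = spark.createDataFrame(one_session, "query string, hits_total int")
df.write.insertInto("event.mediawiki_cirrussearch_request")
```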

Change 639876 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[search/analytics-integration@master] Search platform analytics integration environment

https://gerrit.wikimedia.org/r/639876

The attached patch provides an integration environment capable of fully running the ORES pipelines, from sourcing the data in eventgate tables to writing out the formatted elasticsearch bulk files in HDFS. Some support for manually testing the upload to Swift is included, but the integration from Airflow to Swift is incomplete. In general, anything that relies on YARN + Hadoop + Spark should be testable; the very edges of the system are less so, but can be improved in the future. There will still be significant difficulties testing jobs that have specific resource provisioning (some glent, some mjolnir), as we currently don't have a way to override those settings.
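
As a flavor of the final step the environment exercises, a toy sketch of formatting rows as elasticsearch bulk lines and writing them to HDFS. The table, field names, and output path are placeholders, not the actual pipeline code:

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def to_bulk(row):
    # elasticsearch bulk format: one action line, then one source line.
    action = json.dumps({"update": {"_id": row.page_id}})
    source = json.dumps({"doc": {"ores_score": row.score}})
    return action + "\n" + source

# "hypothetical_scores" stands in for the real ORES pipeline output table.
df = spark.table("hypothetical_scores")
df.rdd.map(to_bulk).saveAsTextFile("hdfs:///tmp/es_bulk_demo")
```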

Change 639876 merged by DCausse:
[search/analytics-integration@master] Search platform analytics integration environment

https://gerrit.wikimedia.org/r/639876

EBernhardson renamed this task from Build integration test suite for search platform airflow + hadoop + spark integration to Build integration environment for search platform airflow + hadoop + spark integration. (Nov 10 2020, 3:22 PM)
EBernhardson updated the task description.