
[Spike] Review Equity Landscape Pipeline and Data Collection Approach
Closed, ResolvedPublic

Description

As part of the transfer of the Equity Landscape dashboard to the Experimentation Lab, we need to conduct a short review of its current state to make sure we have a clear understanding and can check that understanding with @ntsako before his departure.

In particular, we are interested in an analysis of the current implementation in terms of:

  • Operation
  • Any potential improvements
  • Ways it could be incorporated into a larger visualization initiative alongside other visualizations (Wikistats, the Graph extension, Graphoid, etc.)

References

Event Timeline

I have reviewed the sources and videos that were passed along (BTW, thank you @ntsako for the videos!) and here's a summary of:

  • What I've learned, so we can refer to this in the future and avoid some rework.
  • Questions that arose (I reached out to @ntsako today, but I fear it was too late for his time zone).
  • Some observations on potential improvements for each section of the project.

Learnings

The project is divided into 3 sections: data processing, workflow management, and the dashboard.

Data processing

The data processing code is in 2 different locations:

https://gitlab.wikimedia.org/repos/gdi/equity-landscape/gdi-source/-/tree/main/gdi_source/equity_landscape
This is a collection of Python scripts that implement the following functions:

  • Collecting data
load_wbgapi.py: Collects World Bank data from its API and stores it in a Hive table.
load_lua_data.py: Collects data from Lua tables in MediaWiki and stores it in a Hive table.
load_csv.py: Collects CSV data from HDFS and stores it in a Hive table (see the sketch after this list).
  • Transforming data
affiliate_input_metrics.py, input_metrics.py, regional_input_metrics.py and output_metrics.py:
Those 4 scripts read data from Hive tables, transform it, and write it to other Hive tables.
  • Generating dashboard files
dashboard_output.py: Reads the data from Hive and generates the files that back the dashboard for a given year.
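
To make the collection step concrete, here is a minimal sketch of what a loader like load_csv.py does, assuming PySpark with Hive support; the path, database, and table names are hypothetical placeholders, not the ones used in the repo.

```
# Minimal sketch (not the actual load_csv.py): load a CSV from HDFS into a Hive
# table with PySpark. Paths, table names, and schema are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("equity_landscape_load_csv_sketch")
    .enableHiveSupport()  # needed so saveAsTable writes to the Hive metastore
    .getOrCreate()
)

# Hypothetical HDFS location and target table.
source_path = "hdfs:///tmp/equity_landscape/example_input.csv"
target_table = "example_db.example_equity_landscape_input"

df = (
    spark.read
    .option("header", "true")       # first line holds column names
    .option("inferSchema", "true")  # let Spark guess column types
    .csv(source_path)
)

# Overwrite the Hive table with the freshly loaded data.
df.write.mode("overwrite").saveAsTable(target_table)
```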

https://github.com/wikimedia/analytics-refinery/tree/master/hql/gdi/equity_landscape
This is a set of Hive SQL queries that read from Data Lake tables,
as well as from GDI tables and the tables created by the Python scripts above.
They populate another set of 11 Hive tables containing GDI metrics.
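
For context, these refinery queries generally have the shape of an INSERT OVERWRITE ... SELECT aggregation. The sketch below runs such a query through spark.sql purely for illustration (the real queries live as standalone .hql files executed by the HQL DAG), and every table and column name in it is made up.

```
# Illustrative only: the refinery queries are standalone .hql files, but their
# general shape is an INSERT OVERWRITE ... SELECT aggregation like this one.
# All table and column names here are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    INSERT OVERWRITE TABLE example_gdi.example_metrics
    PARTITION (year = 2023)
    SELECT
        country_code,
        SUM(edit_count)         AS total_edits,
        COUNT(DISTINCT user_id) AS active_editors
    FROM example_wmf.example_editors_daily
    WHERE year = 2023
    GROUP BY country_code
""")
```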

Workflow management

This code configures 4 Airflow DAGs.
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/main/analytics/dags/gdi/equity_landscape

  • API DAG: Orchestrates the collection of API data (load_wbgapi.py and load_lua_data.py - see data processing section)
  • CSV DAG: Orchestrates the collection of CSV data (load_csv.py - see data processing section)
  • HQL DAG: Orchestrates the execution of the HiveSQL refinery queries and the transformation of the data.
  • APP DAG: Orchestrates the execution of the SparkSQL Python scripts that generate more tables.

These DAGs don't have a schedule; they must be executed manually.
The reason is that the API and CSV data sources don't arrive on a reliable schedule,
so the owner / data engineer must check that the right sources exist,
and then run the DAGs manually in the correct order and with the proper parameters.
I understand this only needs to happen once a year.
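
To illustrate, an unscheduled, manually triggered DAG with a year parameter can be declared roughly as below, using plain Airflow primitives rather than the WMF airflow-dags helper conventions; the dag_id, task, and command are hypothetical.

```
# Minimal sketch of an unscheduled, manually triggered Airflow DAG that takes a
# "year" parameter, roughly mirroring how these DAGs are run once a year.
# The dag_id, task, and command below are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_equity_landscape_manual",
    start_date=datetime(2023, 1, 1),
    schedule=None,          # no schedule: trigger manually (Airflow 2.4+ style;
                            # older versions use schedule_interval=None)
    catchup=False,
    params={"year": 2023},  # overridable at trigger time ("Trigger DAG w/ config")
) as dag:
    load_csv = BashOperator(
        task_id="load_csv",
        # The year parameter is templated into the command at run time.
        bash_command="echo 'would run load_csv.py for year {{ params.year }}'",
    )
```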

Dashboard

https://gitlab.wikimedia.org/repos/gdi/equity-landscape/dashboard
This is a dashboard app written in Vue.js that reads the files generated by dashboard_output.py (see data processing section),
and shows 4 metric groups: Engagement, Presence and growth, Affiliates, and Projects and Languages.
The first 3 use a similar component: a world map (which can also be viewed as a bar chart, column chart, etc.).
The last one uses a couple of tabular views.

Questions

These are questions directed to @ntsako or anyone in the GDI team that can help! :-)

  • Where do the CSV source files in HDFS come from?
  • Is the dashboard public/in production somewhere? How do we deploy it?
  • You mentioned you had the dashboard files in google docs at some point, can you elaborate?
  • dashboard_output.py is the script you use to generate the data files that back the dashboard, right? It seems to be incomplete, since it's not writing anywhere, no? What should we do with the generated files so that the dashboard can access them?
  • Is there a reason why some data transformation queries are in HiveSQL in the refinery repo, while others are in SparkSQL within pyspark scripts, in the gdi_source repo? Could we unify them?
  • Are all the generated tables used for analysis? Or are they just intermediate steps in calculating the final metrics?
  • Is the dashboard meant to be public? Or just for WMF audience?

Potential Improvements

Data processing:

  • Maybe we can unify all the data processing code in 1 technology (SQL or Spark), and put everything in 1 repository.
  • If some of the generated tables are not used for analysis, but rather serve only as intermediate steps, we could delete them after all the final metrics are computed. This way we wouldn't need to maintain them.

Workflow management:

  • We can refactor the DAGs to make use of the latest conventions, like DagProperties, artifacts, create_easy_dag, etc.
  • We can try to reduce the complexity of parameter passing through variables, especially regarding the year parameters.

Dashboard:

  • It would be cool if we had a single tool that can be reused each time a team needs a dashboard.
  • We have Dashiki, which was meant exactly for that, but hasn't been touched in a while. @Milimetric, could Dashiki help here?
  • We have Superset, which should be the go-to tool for WMF-internal dashboards. @GDI, is the equity landscape dashboard public?
  • We have Wikistats, which would be a great place to put metrics of interest to the community. I think equity landscape metrics could fit...

Hi @mforns,

Answers

Where do the CSV source files in HDFS come from?

Is the dashboard public/in production somewhere? How do we deploy it?

  • No, the dashboard is not in production yet. For deployment, we were planning on asking @Milimetric for assistance.

You mentioned you had the dashboard files in google docs at some point, can you elaborate?

  • Yes, the dashboard files are on Google Drive. You can unzip the files and execute cors_http_server.py to mimic a file server. Afterwards, you can run the dashboard locally via npm run dev, provided you first ran npm install to set up the dashboard's dependencies.

dashboard_output.py is the script you use to generate the data files that back the dashboard, right? It seems to be incomplete, since it's not writing anywhere, no? What should we do with the generated files so that the dashboard can access them?

  • The dashboard script writes to an HDFS location, although the main aim is to host the files somewhere the dashboard can read from. I was manually storing the files on my local machine and then using the cors_http_server.py script to mimic a file server (see the sketch below).
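
Since cors_http_server.py comes up a few times in this task, here is a minimal sketch of that general approach (a static file server that adds a CORS header); it is an illustration of the idea, not the contents of the actual script.

```
# Minimal sketch of a CORS-enabled static file server, similar in spirit to
# what cors_http_server.py does (this is not the actual script).
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class CORSRequestHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # Allow the locally running dashboard (npm run dev) to fetch the files.
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()


if __name__ == "__main__":
    # Serve the unzipped dashboard files from the current directory on port 8000.
    ThreadingHTTPServer(("127.0.0.1", 8000), CORSRequestHandler).serve_forever()
```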

Is there a reason why some data transformation queries are in HiveSQL in the refinery repo, while others are in SparkSQL within pyspark scripts, in the gdi_source repo? Could we unify them?

  • Yes, we could have them all in the gdi-source repo instead of also having them in refinery. It was a design decision taken when the project started, but at this point it would be better to have them all in the gdi-source repo.

Are all the generated tables used for analysis? Or are they just intermediate steps in calculating the final metrics?

  • @JAnstee_WMF can answer this question better than me as she is the one who is using the data for analysis.

Is the dashboard meant to be public? Or just for WMF audience?

  • The dashboard is meant to be public.

I hope this answers your questions. Please let me know if you have any further questions.