
[spike] Evaluate feasibility of calculating Commons Impact Metrics for all Commons categories
Closed, Resolved · Public

Description

Spike!

Description

We can implement the Commons Impact Metrics calculations in two ways:

  • Using a Commons category allow-list that reduces the number of categories we tackle (only GLAM and other institutions/affiliates)
  • Not using any allow-list and calculating them for all Commons categories regardless of who created them.

Both would use practically the same code, and the "all categories" option would bring more value to the community.
However, "all categories" represents a bigger data engineering challenge, since the source data is much bigger;
and thus the data at each step of the pipeline would also be bigger, including the resulting dumps and the base data for AQS.

In this task we aim to evaluate, as quickly as possible, how much more difficult it would be to go for all categories versus an allow-list.
This way we can decide which way to go with the time that we have for this project.

Acceptance Criteria

  • We have a sensible estimation of how much time and effort it would take to calculate the metrics for all categories

Required

  • Review the prototype code and look for major inefficiencies (we wrote it in a hurry)
  • Prepare (adapt) it to run against all categories
  • Troubleshoot execution, tweak Spark parameters, follow execution logs/graphs
  • Estimate how much time and effort it would take

Event Timeline

Experiment overview

We tried to execute the Commons Impact Metrics pipeline for all Commons categories and media files. This required some parametrization of the code and some optimizations. We discovered some caveats of the execution and got a rough sense of how difficult it would be to make this pipeline work in production for all Commons categories as opposed to just an allow-list of categories.

The Commons Impact Metrics pipeline, as built for the GLAM prototype, consists of 7 steps or tasks:

  1. Read wmf_raw.mediawiki_categorylinks and extract from it the graph of all categories and media files. It's a Spark-Scala job that uses the Pregel library to build the graph (a sketch of the idea follows this list).
  2. Read the category and media file graph created in step (1) and join it with 1 month of wmf.pageview_hourly to extract usage stats for all categories and media files, and decorate the graph with them. It's a set of SparkSql queries.
  3. Generate the final categories dataset. Read both datasets generated in (1) and (2), and process the data to calculate and format the desired metrics. It's a set of SparkSql queries.
  4. Generate the final media_files dataset. Read both datasets generated in (1) and (2), and process the data to calculate and format the desired metrics. It's a set of SparkSql queries.
  5. Generate the final article_views_by_category dataset. Reads (2) and wmf_raw.mediawiki_page and calculates and formats the desired metrics. It's a set of SparkSql queries.
  6. Generate the final article_views_by_media_file dataset. Reads (2) and (4) and calculates and formats the desired metrics. It's a set of SparkSql queries.
  7. Generate the final commons_edits dataset. Reads (1), wmf_raw.mediawiki_page, wmf_raw.mediawiki_revision and wmf_raw.mediawiki_private_actor, and calculates and formats the desired metrics. It's a set of SparkSql queries.

Note that steps 3 to 7 can be executed in any order, but require that steps 1 and 2 are done.
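
For illustration, the core of step (1) is a Pregel-style propagation of ancestor category ids over the category/media-file graph. Below is a minimal sketch of that idea using GraphX's Pregel API; it is not the prototype's actual code, and the pre-resolved edge table (tmp.category_edges) and the output table are placeholders.

import org.apache.spark.graphx.{Edge, EdgeDirection, Graph, VertexId}
import org.apache.spark.sql.SparkSession

// Minimal sketch of step (1): build the category/media-file graph and propagate
// ancestor category ids downwards with Pregel. The edge table is assumed to be
// already resolved to (child_page_id, parent_page_id); the real job reads
// wmf_raw.mediawiki_categorylinks and resolves category titles to page_ids first.
object CategoryGraphSketch {
  def run(spark: SparkSession): Unit = {
    import spark.implicits._

    val edges = spark.table("tmp.category_edges")                  // hypothetical table
      .selectExpr("child_page_id", "parent_page_id")
      .as[(Long, Long)]
      .rdd
      .map { case (child, parent) => Edge(child, parent, ()) }

    // Vertex attribute: the set of ancestor category ids known so far.
    val graph: Graph[Set[VertexId], Unit] =
      Graph.fromEdges(edges, defaultValue = Set.empty[VertexId])

    val withAncestors = graph.pregel(
      initialMsg = Set.empty[VertexId],
      activeDirection = EdgeDirection.In          // ancestors flow from parents down to children
    )(
      vprog = (_, ancestors, msg) => ancestors ++ msg,
      sendMsg = triplet => {
        // A parent's ancestors (plus the parent itself) are ancestors of the child.
        val fromParent = triplet.dstAttr + triplet.dstId
        if ((fromParent -- triplet.srcAttr).nonEmpty) Iterator((triplet.srcId, fromParent))
        else Iterator.empty
      },
      mergeMsg = _ ++ _
    )

    withAncestors.vertices
      .map { case (pageId, ancestors) => (pageId, ancestors.toSeq) }
      .toDF("page_id", "ancestors")
      .write.mode("overwrite").saveAsTable("tmp.commons_category_graph")  // placeholder output
  }
}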

Experiment log
  1. We adapted the Spark-Scala code of step (1) to: a) read just 2% of wmf_raw.mediawiki_categorylinks, b) not use any kind of allow-listing, c) adapt the input format, since the source table used in the prototype didn't exist anymore.
  2. We ran the modified code above (step 1 with 2% of data) with: executor-memory=16GB, executor-cores=4, max-executors=64, shuffle-partitions=512 (see the configuration sketch after this list). It finished successfully after a couple of trial-and-error loops.
  3. We adapted the SparkSql code of step (2) to read from and write to proper locations.
  4. We ran the modified code above (step 2 with 2% of data) with: executor-memory=16GB, executor-cores=4, max-executors=64, shuffle-partitions=512. It finished successfully after a couple of trial and error loops.
  5. Since steps (1) and (2) were theoretically the heaviest, we moved on to trying those with more data, namely 20% of wmf_raw.mediawiki_categorylinks. We modified (1) accordingly.
  6. We ran step 1 with 20% of data, with: executor-memory=32GB, executor-cores=8, max-executors=64, shuffle-partitions=1024. We had skewed-data problems, where a few tasks would take forever to finish; we also saw a lot of garbage collection.
  7. We studied the distribution of media files per Commons category, and realized that there are some categories with a lot of media files. We thought they could be the cause of the skewed tasks.
  8. We modified the code in (1) to filter out the top 100 categories with most media files.
  9. We ran step 1 with 20% of data and top-category deny list, with: executor-memory=32GB, executor-cores=8, max-executors=64, shuffle-partitions=1024. It finished successfully.
  10. We ran step 2 with 20% of data and top-category deny list, with: executor-memory=32GB, executor-cores=8, max-executors=64, shuffle-partitions=1024. It finished successfully.
  11. We ran step 3 with 20% of data and top-category deny list, with the same config as above. It failed by getting stuck at the cross join operation.
  12. We re-wrote the cross join into 3 sub-queries that do the same calculation with regular joins. Those executed successfully after a couple of trial-and-error loops.
  13. We vetted the resulting data and realized something was wrong. The way we were splitting the 20% of the data was breaking category trees in the middle. Unfortunately, to be able to split correctly, we'd have to run step 1 with 100% of the data. So we decided to go ahead and try running the pipeline with all categories and media files.
  14. At this point we tried several things to run step (1) with 100% of the input: incrementing the deny-list to the top 200, 500, 1000 and 2500 categories with most media files, and tweaking Spark configurations (memory, memory/CPU ratio, shuffle-partitions, max-executors) to give more power to the job. All without success; the same issue of supposedly skewed data would make the job get stuck after some time.
  15. We discussed further options, and agreed that going forward would require looking deeper into the code and likely making substantial modifications. We decided to stop the spike at this point.
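
For reference, the executor/partition settings mentioned in the log are plain Spark configuration. A minimal sketch of the kind of session configuration used for the 2% runs (all keys are standard Spark; the app name is a placeholder):

import org.apache.spark.sql.SparkSession

// Sketch of the Spark settings we iterated on during the 2% runs above; all keys
// are standard Spark configuration, the app name is a placeholder.
val spark = SparkSession.builder()
  .appName("commons-impact-metrics-spike")
  .config("spark.executor.memory", "16g")
  .config("spark.executor.cores", "4")
  .config("spark.dynamicAllocation.maxExecutors", "64")
  .config("spark.sql.shuffle.partitions", "512")
  .enableHiveSupport()
  .getOrCreate()
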
Observations

These facts might be helpful when developing the final version of this pipeline, regardless of the path we choose.

Categories with lots of media files. We found that some Commons categories have millions of media files associated with them. For instance, there's one Creative Commons category with ~40M media files. If we choose the all-categories path, we'll have to take care of those in some way (deny-listing them or breaking them apart); otherwise they will probably pose computational problems.
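
As in step 8 of the experiment log, one mitigation is a deny-list of the largest categories. A sketch of how such a deny-list could be applied, reusing the same hypothetical edge table and column names as in the earlier sketch:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, desc}

// Hypothetical sketch: drop all edges pointing into the N categories with the most
// members before building the graph. Table and column names are assumptions.
def withoutLargestCategories(spark: SparkSession, n: Int): DataFrame = {
  val edges = spark.table("tmp.category_edges")        // (child_page_id, parent_page_id)

  val largest = edges
    .groupBy(col("parent_page_id"))
    .agg(count("*").as("members"))
    .orderBy(desc("members"))
    .limit(n)
    .select("parent_page_id")

  // Anti-join: keep only edges whose parent is not one of the huge categories.
  edges.join(largest, Seq("parent_page_id"), "left_anti")
}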

Difficult to split the input data without an allow-list. We used a "modulo" method to split the input: we filtered out all page_ids with page_id % 5 != 0, keeping 20%. This returns an actual 20% of the data, but breaks the category trees, since most media files won't have their parent categories included in that 20%. This splitting method thus generates sub-trees that have media files without their categories, which is just incorrect data. A better approach would be to split by connected components of the graph, but for that we'd need a first pass of the Pregel algorithm over 100% of the data to determine those components, and that seems difficult to achieve. An allow-list of all "root" categories seems to work, since it greatly limits the data that is computed at each iteration of the Pregel algorithm.
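
Concretely, the split amounted to something like the following sketch (cl_from is assumed here to be the page id we took the modulo of):

import org.apache.spark.sql.SparkSession

// Sketch of the "modulo" split: keep ~20% of rows by page id. This is what breaks
// the category trees, because a kept media file's parent categories are usually
// not kept. cl_from is assumed to be the sampled page id column.
def sample20pct(spark: SparkSession) =
  spark.table("wmf_raw.mediawiki_categorylinks")
    .where("cl_from % 5 = 0")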

Instead of cross join. To calculate category metrics at step (3), instead of using a cross join we tried the following sub-queries (CTE fragments from the larger WITH clause):

files_with_ancestors_parents_and_imagelinks_immediate_parent_explode as (
    select
        page_id,
        explode(parents) as parent_page_id,
        usage_map,
        wikis_where_used,
        fqn_articles_where_used,
        used
    from files_with_ancestors_parents_and_imagelinks
),

files_with_ancestors_parents_and_imagelinks_ancestor_explode as (
    select
        page_id,
        explode(ancestors) as ancestor_page_id,
        usage_map,
        wikis_where_used,
        fqn_articles_where_used,
        used
    from files_with_ancestors_parents_and_imagelinks
),

immediate_parent_metrics as (
    select
        category.page_title as name,
        category.primary_category_names as primary_categories,
        count(1) as media_files,
        sum(if(file.used, 1, 0)) as media_files_used,
        cardinality(array_distinct(flatten(collect_list(file.wikis_where_used)))) as leveraging_wikis,
        cardinality(array_distinct(flatten(collect_list(file.fqn_articles_where_used)))) as leveraging_articles,
        '2023-12' as snapshot
   from primary_categories_with_names category
            left join
        files_with_ancestors_parents_and_imagelinks_immediate_parent_explode file
        on (category.page_id == file.parent_page_id)
  group by category.page_title,
        category.primary_category_names
),

ancestor_metrics as (
    select
        category.page_title as name,
        category.primary_category_names as primary_categories,
        count(1) as tree_media_files,
        sum(if(file.used, 1, 0)) as tree_media_files_used,
        cardinality(array_distinct(flatten(collect_list(file.wikis_where_used)))) as tree_leveraging_wikis,
        cardinality(array_distinct(flatten(collect_list(file.fqn_articles_where_used)))) as tree_leveraging_articles,
        '2023-12' as snapshot
   from primary_categories_with_names category
            left join
        files_with_ancestors_parents_and_imagelinks_ancestor_explode file
        on (category.page_id == file.ancestor_page_id)
  group by category.page_title,
        category.primary_category_names
  limit 100
)

Pregel lib is a black box. It is difficult to know what's happening inside Spark when using the Pregel library, since the behavior of the library is unclear, especially with regard to how the full graph is split into parallelized tasks. We didn't find logs that would tell us the origin of the skewed-data issues.

Connected components hypothesis. We thought that maybe the reason some tasks took forever to finish (even after filtering out huge categories) is that Pregel splits the full graph into parallel tasks by connected components; that is, connected sub-graphs might need to be computed inside the same task. This would explain the memory issues.
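
One way to check this hypothesis (we didn't get to it in the spike) would be to compute connected components directly with GraphX and look at the sizes of the largest ones. A sketch, reusing the same hypothetical edge table as above:

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

// Sketch: measure connected-component sizes of the category/media-file graph to see
// whether a few giant components could explain the stuck tasks. Edge table assumed.
def componentSizes(spark: SparkSession): Unit = {
  import spark.implicits._

  val edgeRdd = spark.table("tmp.category_edges")
    .selectExpr("child_page_id", "parent_page_id")
    .as[(Long, Long)]
    .rdd
    .map { case (child, parent) => Edge(child, parent, ()) }

  val components = Graph.fromEdges(edgeRdd, defaultValue = ())
    .connectedComponents()       // standard GraphX operator
    .vertices                    // (page_id, component_id)

  components
    .map { case (_, componentId) => (componentId, 1L) }
    .reduceByKey(_ + _)
    .sortBy(_._2, ascending = false)
    .take(10)
    .foreach(println)            // the ten largest components and their sizes
}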

Allow-list might help with graph algorithm simplicity. Even if the allow-list weren't needed for computational efficiency, having a list of root categories of interest might be useful to make the graph traversal simpler, since we'd go from the top to the bottom of each sub-tree.
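
For illustration, with an allow-list the traversal could even be a plain top-down expansion seeded from the root categories, using iterative DataFrame joins instead of Pregel. A rough sketch (the allow-list table, edge table and depth cap are assumptions):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Sketch: expand the sub-trees under the allow-listed root categories with iterative
// joins. The allow-list table, edge table and depth cap are assumptions; a real job
// would also checkpoint between iterations and stop when the frontier is empty.
def expandFromRoots(spark: SparkSession, maxDepth: Int = 20): DataFrame = {
  val edges = spark.table("tmp.category_edges")            // (child_page_id, parent_page_id)

  var frontier = spark.table("tmp.commons_allow_list")     // hypothetical root category list
    .select(col("root_page_id").as("page_id"))
  var reached = frontier

  for (_ <- 1 to maxDepth) {
    // Children of the current frontier that we have not reached yet.
    frontier = edges
      .withColumnRenamed("parent_page_id", "page_id")
      .join(frontier, "page_id")
      .select(col("child_page_id").as("page_id"))
      .distinct()
      .join(reached, Seq("page_id"), "left_anti")
    reached = reached.union(frontier)
  }
  reached
}

This only works well because the allow-list bounds each iteration to the sub-trees of interest, which is the simplicity referred to above.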

Data size model

It's difficult to estimate what the final data sizes might be, since, because of the problems explained above, we didn't get to generate much data in this experiment. Nevertheless, here's an attempt. I made some assumptions:

  • We trust the size of the prototype data.
  • We trust the ratio of subcategories per root category observed in the prototype (14 root categories, 63 categories in total).
  • The plain numbers represent measured sizes; the ones followed by a question mark represent estimated values.
  • The initial size of a full allow-list would be about 1400 root categories, x100 the size of the prototype.
  • The size of the 20% data is estimated to be from x4700 the prototype to x47K the prototype. The lower bound comes from the partial results that we obtained in this task, the upper bound comes from the mere count of categories in the 20% data vs prototype.
  • We assume that on average we'll get a 90% compression rate in csv files (dumps).
  • The last 5 rows correspond to the 5 generated datasets.
  • The columns of interest are ALLOW-LIST COMPRESSED vs. FULL COMPRESSED.
  • The listed sizes correspond to one monthly release.
CONCEPT | PROTOTYPE | ALLOW-LIST | ALLOW-LIST COMPRESSED | 20% DATA | FULL DATA | FULL COMPRESSED
#categories | 63 | 6.3K | - | 3M | 15M | -
decorated graph (intermediate dataset) | 2.7MB | 270MB | - | 12-120GB? | 60-600GB? | -
guess multiplying factor | - | prototype x 100 | allow list / 10 | prototype x 4700-47K? | 20% data x 5 | full data / 10
categories | 7.8KB | 780KB | 78KB | 36-360MB? | 180MB-1.8GB? | 18-180MB?
media files | 764KB | 76MB | 7.6MB | 3.6-36GB? | 18-180GB? | 1.8-18GB?
article views by category | 1.1MB | 110MB | 11MB | 5.2-52GB? | 25-250GB? | 2.5-25GB?
article views by media file | 35.5MB | 3.5GB | 350MB | 166GB-1.6TB? | 800GB-8TB? | 80-800GB?
commons edits | 764KB | 76MB | 7.6MB | 3.6-36GB? | 18-180GB? | 1.8-18GB?
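
As a sanity check of how the rows above are derived, here is the chain of factors for the media files row (all factors are the assumptions listed above, hence the wide ranges):

// Worked example: the "media files" row, starting from the 764 KB measured in the prototype.
val prototypeKb    = 764.0
val allowListKb    = prototypeKb * 100                              // ~76 MB   (x100 root categories)
val allowListCsvKb = allowListKb / 10                               // ~7.6 MB  (90% csv compression)
val twentyPctKb    = (prototypeKb * 4700, prototypeKb * 47000)      // ~3.6-36 GB
val fullKb         = (twentyPctKb._1 * 5, twentyPctKb._2 * 5)       // ~18-180 GB (x5 to reach 100%)
val fullCsvKb      = (fullKb._1 / 10, fullKb._2 / 10)               // ~1.8-18 GB (90% csv compression)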

In general, it seems that the monthly releases of the all categories dataset would be about x200 to x2000 bigger than the allow-list datasets.
I believe that any monthly release bigger than a few GB already poses a significant difficulty to the community.

Future work estimation - All categories
  • Solve the Pregel problem. Difficulty medium-high. 2-3 weeks of work for 2-3 people?
  • Solve any potential problems that arise in steps 2-7 of the pipeline. Risk medium-high, difficulty medium. 3-4 weeks of work for 2-3 people?
  • Solve the dumps data size problem. Maybe splitting by category group? Difficulty high. Potential sacrifices to make, i.e. remove metrics/dimensions from dataset. 2-3 weeks of work for 2-3 people?
  • Potentially increase the size of the Druid/Cassandra/Hadoop clusters to accommodate all the data. Risk medium. Difficulty low-medium. Long calendar time for 1-2 people? Other teams involved.
  • Other tasks that are the same for both 100% data and allow-list, like data vetting, Airflow dags, AQS endpoints, dumps location and docs, back-filling etc. A couple months of team work.
  • Maintenance of the heavy jobs. Risk medium. Difficulty medium-high, since the code is tightly optimized and complex.
Future work estimation - Category allow-list
  • Design and setup allow-list: Location, process, etc. Risk low. Difficulty medium. Potential long calendar time because of discussions with teams and community.
  • Organize the code, remove inefficiencies, solve any potential problems of executing with full allow-list. Risk medium, difficulty low-medium. 2-3 weeks for 2-3 people.
  • Other tasks that are the same for both 100% data and allow-list, like data vetting, Airflow dags, AQS endpoints, dumps location and docs, back-filling etc. A couple months of team work.
  • Maintaining the allow-list. Risk medium. Depending on whether we backfill new categories in the list or not, this will be quite light (minutes/new cat) or quite heavy (days/new cat).

Also, see T343131. Amir had found these categories with millions of members problematic as well from a MariaDB scalability perspective, and they have further insight over there if interested.