Draft merge request at https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines/-/merge_requests/51, marked as ready to be merged.
- What technologies are used by your project? Spark, HDFS, Hive, Cassandra
- What are the data sources? Wikidata, Commons, categories, image links, page links, properties, and revisions from all Wikis
- Where do you plan to store data? HDFS, Hive, Cassandra
- What is the pipeline schedule? Ideally weekly; each run should wait for the latest Wikidata snapshot. However, we first need to resolve T307371: Data pipeline weekly schedule is on hold.
- Privacy review? No
- Data volumes, cluster resources? Volumes not measured yet; memory resources require an assessment. See T307362: Memory errors break the expected output of some Airflow tasks and https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines/-/merge_requests/51#issues-to-be-solved-before-merge
Done
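
For the schedule item, a rough sketch of the "weekly, but wait for the latest Wikidata snapshot" gate might look like the following. This is not taken from the merge request. The function names are hypothetical, the assumption that snapshots are dated on Mondays is mine and may not match the real snapshot cadence, and `is_available` stands in for whatever check the pipeline would actually use (for example, testing that a Hive partition exists).

```python
from datetime import date, timedelta
from typing import Callable, Optional


def latest_weekly_snapshot(today: date, snapshot_weekday: int = 0) -> date:
    """Return the most recent date on or before `today` that falls on
    `snapshot_weekday` (0 = Monday; the Monday default is an assumption)."""
    offset = (today.weekday() - snapshot_weekday) % 7
    return today - timedelta(days=offset)


def snapshot_to_process(
    today: date,
    is_available: Callable[[date], bool],
    last_processed: Optional[date] = None,
) -> Optional[date]:
    """Return the snapshot date a weekly run should process, or None if the
    run should keep waiting (snapshot missing or already processed)."""
    candidate = latest_weekly_snapshot(today)
    if last_processed is not None and candidate <= last_processed:
        return None  # nothing newer than what was already handled
    if not is_available(candidate):
        return None  # hypothetical availability check says: not ready yet
    return candidate
```

A scheduler task could call `snapshot_to_process` on each attempt and retry later whenever it returns `None`, so the weekly run starts only once a new snapshot is in place.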