
Decide on data required for launch
Closed, Resolved (Public)

Description

A content gap (e.g. gender) is associated with a set of categories (female, male, non-binary, ...). The content gap metrics time series are available at four aggregation levels:

  • by category / content gap / wiki_db
  • by category / content gap
  • by content gap / wiki_db
  • by content gap

The current implementation of the content gaps integration in AQS targets the most granular level (by category / content gap / wiki_db), and then transforms the (content gap / wiki_db) level into the most granular one by using an all_categories category.
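As a rough illustration of the transform described above, a row at the (content gap / wiki_db) level can be expressed at the most granular level by filling the category field with all_categories. This is only a sketch: the row type, the month and value fields, and the sample numbers are assumptions for illustration and are not taken from the actual AQS table or ingestion job.

```python
from typing import NamedTuple

# Sentinel category named in the task description.
ALL_CATEGORIES = "all_categories"


class MetricRow(NamedTuple):
    wiki_db: str
    content_gap: str
    category: str
    month: str    # hypothetical time-bucket column
    value: float  # hypothetical metric column


def lift_to_granular(wiki_db: str, content_gap: str, month: str, value: float) -> MetricRow:
    """Express a (content gap / wiki_db) row at the most granular level."""
    return MetricRow(wiki_db, content_gap, ALL_CATEGORIES, month, value)


# Made-up values, for illustration only.
granular_rows = [
    MetricRow("enwiki", "gender", "female", "2023-06", 10.0),
    MetricRow("enwiki", "gender", "male", "2023-06", 20.0),
]
per_wiki_rows = [("enwiki", "gender", "2023-06", 30.0)]

# Both levels now share one schema and can live in one table.
combined = granular_rows + [lift_to_granular(*row) for row in per_wiki_rows]
```

With this shape, a consumer can request the per-wiki total with the same query it uses for a single category, just by passing all_categories as the category.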

All of these datasets are available for use, but the AQS API does not use all of them.

  • In particular, the aggregation level by category / content gap (i.e. aggregated across all wikis) has been used in notebooks/reports, but is not available through the current API proposal. Can and should we include this (e.g. by using a wiki called "all_wikis")?
  • How does the choice of data to use impact the work required on the wikistats side?

Event Timeline

@fkaelin all the data will be static, correct?

If yes, at what intervals will the static data be updated and who will do those updates?

@VirginiaPoundstone

  • Correct, the data is static: once data is inserted into AQS, it will not be updated.
  • The AQS data will be updated monthly using a scheduled Airflow DAG that will alert upon failure.

@Milimetric question: looking at wikistats, I see that project=all-projects is used for some metrics. Can we follow this pattern to include content gap metrics across all wikis? The data is already available in this table.

The suggestion for the data to include:

  • use the table already created (T340494)
  • pending Dan's input, include the metrics across all wikis (all-projects)

This approach would not require any changes to the existing table/schema, and only minimal updates to the existing ingestion job.
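To make the "no schema change" point concrete, here is a minimal sketch of how cross-wiki rows could be stored under the existing key by using all-projects as the project value. The key layout, the function names (add_all_wikis_rows, lookup), the month field, and the numbers are hypothetical and do not describe the real table or ingestion job.

```python
from typing import Dict, Optional, Tuple

# Project value that wikistats already uses for "all wikis".
ALL_PROJECTS = "all-projects"

# Hypothetical key layout: (project, content_gap, category, month).
Key = Tuple[str, str, str, str]


def add_all_wikis_rows(
    table: Dict[Key, float],
    cross_wiki: Dict[Tuple[str, str, str], float],
) -> Dict[Key, float]:
    """Return a copy of table with cross-wiki rows stored under all-projects."""
    merged = dict(table)
    for (content_gap, category, month), value in cross_wiki.items():
        merged[(ALL_PROJECTS, content_gap, category, month)] = value
    return merged


def lookup(table: Dict[Key, float], project: str, content_gap: str,
           category: str, month: str) -> Optional[float]:
    """Same lookup for a single wiki and for the all-wikis aggregate."""
    return table.get((project, content_gap, category, month))


# Made-up values, for illustration only.
per_wiki = {("enwiki", "gender", "female", "2023-06"): 10.0}
cross_wiki = {("gender", "female", "2023-06"): 42.0}
table = add_all_wikis_rows(per_wiki, cross_wiki)
```

Because the aggregate reuses the existing key, a client only changes the project argument to switch between one wiki and all wikis.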

Pinging @Milimetric regarding the all-projects question above.

I went ahead and implemented the data ingestion job following the suggestion in my previous comment. Please let me know if there are other questions or considerations; otherwise I will close this as resolved.

To keep the archives happy: I talked to Fabian on Monday and answered this question. Yes, all-projects means all wikis. For aggregates at the project family level, for example "all wikipedias", we use all-wikipedia-projects (see wikistats example).