Productionize geography gaps data, cultural model
Closed, Resolved, Public

Description

Context:
Our currently published Geography Gaps datasets report data based on the geospatial model only. It would be helpful to also have data dumps that report data based on the cultural model.

Request:
Productionize geography gap datasets that use the cultural model.

Use cases:

  1. Have the data available so that analysts can explore the cultural model data alongside the geospatial model data when looking at percentages of new quality articles about regional content.
  2. Have the data available so that I can create a visualization notebook in PAWS (complementary to this PAWS notebook, which uses the geospatial model) with instructions and examples for using and visualizing the geo gaps data based on the cultural model.

Event Timeline

The content gap metrics for the geography gap using the cultural model are available in Hive:

(spark.table("content_gap_metrics.by_category")
.where("content_gap='geography_cultural_region'")
.show())
+-------+--------------------+--------------------+--------------------+--------------------+-----------+
|wiki_db|            category|             metrics|           quantiles|         content_gap|time_bucket|
+-------+--------------------+--------------------+--------------------+--------------------+-----------+
| frwiki|         Afghanistan|{1, 499592, 434.8...|{{1, 1, 1, 1, 1},...|geography_cultura...|    2023-10|
| frwiki|             Albania|{13, 551487, 363....|{{1, 1, 1, 1, 1},...|geography_cultura...|    2023-10|
| frwiki|             Algeria|{13, 5211013, 503...|{{1, 1, 1, 1, 1},...|geography_cultura...|    2023-10|
  • The cultural model metrics are currently not published externally, because it wasn't clear which model is preferable and publishing multiple geography gaps might be confusing.
  • The implementation of the cultural model depends on a geography mapping that didn't include the concept of the WMF region, which would seem to be a requirement. This work is captured in T348348.
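
To illustrate the second point, here is a minimal sketch of rolling per-country counts up to WMF regions once the mapping carries a region. The mapping entries, the region labels, and the counts are assumptions made up for illustration. They are not taken from the pipeline or from T348348.

```python
from collections import defaultdict

# Hypothetical mapping from country to WMF region; the real mapping is the
# subject of T348348 and may differ.
COUNTRY_TO_WMF_REGION = {
    "Afghanistan": "South Asia",
    "Albania": "Central & Eastern Europe & Central Asia",
    "Algeria": "Middle East & North Africa",
}


def aggregate_by_region(counts_by_country, mapping):
    """Sum per-country counts into per-region totals.

    Countries missing from the mapping are collected under "UNKNOWN" so
    that holes in the mapping stay visible instead of being dropped.
    """
    totals = defaultdict(int)
    for country, count in counts_by_country.items():
        totals[mapping.get(country, "UNKNOWN")] += count
    return dict(totals)


# Made-up article counts, for illustration only.
example_counts = {"Afghanistan": 10, "Albania": 4, "Algeria": 7, "Atlantis": 1}
region_totals = aggregate_by_region(example_counts, COUNTRY_TO_WMF_REGION)
print(region_totals)
```

Keeping unmapped countries in an explicit bucket rather than dropping them makes it easier to spot where the mapping is incomplete.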

@CMyrick-WMF do the existing cultural model metrics above solve your use case? Do you recommend publishing the cultural model metrics in addition to the geospatial model, or replacing the published geospatial dataset with the cultural model? If the former, we would have to generalize the dataset names.

fkaelin changed the task status from Open to In Progress. Jan 23 2024, 2:29 PM
fkaelin closed this task as Resolved.
fkaelin claimed this task.
fkaelin moved this task from Backlog to In Progress on the Research board.

Somehow I accidentally closed this task.

Example dataset for the cultural geographic gap (aggregated by WMF region), for review: https://analytics.wikimedia.org/published/datasets/one-off/fab/content_gap/ . The code is merged, and from the next scheduled run the new gap will be published as well, in the same format as the linked file. If needed, we can easily re-run the previous pipeline to have the data sooner.

@CMyrick-WMF Since you authored this task, I'm assuming you had use cases in mind for the cultural model. Would it be possible for you to add some use cases to the description? Thanks a lot!! <3

The cultural geographical gap data is now in production, aggregated at the level of the WMF regions. The gap name is geography_cultural_wmf_region (e.g. see here). The documentation has been updated, as has the example intersections notebook.
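
For reference, a hedged sketch of how the new gap could be queried, mirroring the earlier query in this task. It assumes the new gap is stored in the same content_gap_metrics.by_category table and that a SparkSession is already available as spark (as in a notebook). gap_filter and show_gap are hypothetical helper names, not part of the pipeline.

```python
# Table name taken from the query earlier in this task; assumed to also hold
# the new gap.
TABLE = "content_gap_metrics.by_category"


def gap_filter(gap_name):
    """Build the SQL predicate that selects a single content gap."""
    return f"content_gap = '{gap_name}'"


def show_gap(spark, gap_name, n=20):
    """Print the first n rows for a gap, given an existing SparkSession."""
    spark.table(TABLE).where(gap_filter(gap_name)).show(n)


# In a session where `spark` exists:
# show_gap(spark, "geography_cultural_wmf_region")
print(gap_filter("geography_cultural_wmf_region"))
```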

I am resolving this task, though I am still interested in learning about use cases. It could be worth refining this gap further, but I would like to capture that in a separate Phabricator task.

Use cases added. Thanks @Mayakp.wiki!