
Develop metrics for Geographic gaps
Closed, Resolved (Public)


Develop metrics for quantifying geographic knowledge gaps across readers, contributors, and content. Geography has been defined as both within-country and between-country. There are at least three clear aspects of this work:

  • Determine inner categories -- e.g., urban/rural, which countries
  • Determine how to measure these categories -- e.g., surveys, webrequests
  • Determine how to convert measurements into a metric

Isaac is responsible for this task but Martin and Marc will be closely involved on (at least) the content and metric aspects.

Event Timeline

Weekly updates: nothing big. Continued development of article -> country classifier.

Weekly updates: nothing major beyond continued article -> country classifier development. First version of the model should be ready in the next few days.

Also, other resources to save:

Weekly updates: continued development. Some internet connection challenges have slowed down the internship work but I am still comfortable with the progress.

Weekly updates:

  • Linked up with GDI over country / regional classifications so I can work to connect our two lists and handle any disparities. The vast majority will be the same (as they also are pulling from our pageview / edit data).
  • Received an unrelated request for how to identify articles relevant to Nigeria so already seeing use-cases appear for this ability to identify content at the country-level

Weekly updates:

  • Reached a stable agreement on the countries list with GDI. Might be some changes if issues are discovered but we'll notify each other if that's the case.
  • I loaded country -> UN subregion, continent, Global North/South data into Hive so that it'll be easier to map countries to these more aggregated regions. I already used this to identify, for each topic in the topic classification model, how its training data is distributed across continents. This allows us to see that e.g., the Linguistics topic pulls its training data pretty uniformly from all continents but over half of the training data for the Education topic comes from North America.
  • Meeting set up on Monday w/ Marc to discuss his experience working with geographic gaps and align the metric side with the other gaps he's working on
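
A minimal sketch of the per-topic continent breakdown described above, assuming a simple country -> continent lookup. The dictionary, the function name, and the sample rows are hypothetical stand-ins for the Hive table and the real training data:

```python
from collections import Counter, defaultdict

# Toy lookup standing in for the Hive mapping table (assumption: the real
# table also carries UN subregion and Global North/South columns).
COUNTRY_TO_CONTINENT = {
    "Nigeria": "Africa",
    "Kenya": "Africa",
    "Canada": "North America",
    "United States": "North America",
    "Japan": "Asia",
    "France": "Europe",
}


def continent_shares(labelled_articles):
    """Return {topic: {continent: share of that topic's labelled rows}}.

    `labelled_articles` is an iterable of (topic, country) pairs. Rows whose
    country is missing from the lookup are skipped, not guessed.
    """
    counts = defaultdict(Counter)
    for topic, country in labelled_articles:
        continent = COUNTRY_TO_CONTINENT.get(country)
        if continent is None:
            continue
        counts[topic][continent] += 1

    shares = {}
    for topic, counter in counts.items():
        total = sum(counter.values())
        shares[topic] = {c: n / total for c, n in counter.items()}
    return shares


# Invented sample rows, for illustration only.
sample = [
    ("Education", "United States"),
    ("Education", "Canada"),
    ("Education", "Nigeria"),
    ("Linguistics", "Japan"),
    ("Linguistics", "France"),
    ("Linguistics", "Kenya"),
    ("Linguistics", "Canada"),
]
shares = continent_shares(sample)
```

With the invented rows, the Education topic is skewed toward one continent while Linguistics is spread evenly, which is the kind of imbalance the breakdown is meant to surface.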

Weekly updates:

  • Had a meeting w/ Marc about geographic gaps. We discussed a few things:
    • Agreement that urban/rural is very challenging for content because of how contextual the "objective" definitions are -- e.g., the population density that makes a place urban will vary country to country -- and how even more contextual the cultural elements are -- e.g., the degree to which agriculture or industry is associated with rural areas can vary country to country.
    • One alternative approach to capturing some of the urban vs. rural distinctions is not just assigning articles to countries but also seeking to go to more fine-grained administrative districts where possible. This will reduce coverage greatly -- e.g., many people can be assigned "geography" because they have citizenship in a certain country but there's less data that clearly ties them to individual districts. This would take substantial additional work though, so I don't think it is anything that I'll pick up anytime soon.
    • Discussion of what articles should be covered by geography. I've operationalized geography as countries -- i.e. political entities -- but include items that cover a very wide range of ties to a country -- e.g., people, places, culture, etc. No clear right decision about what to include / exclude, but it might be useful at some point to not just track which countries are relevant to an article but why -- e.g., the Wikidata property that led to the connection.
    • Not fully clear yet what the right metrics will be but hopefully can transfer what is learned about the other facets to geography as far as what aspects of selection, extent, and framing are useful.
    • We will follow up again in a few weeks when some of Marc's work on the other gaps is a bit more crystallized.
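
A minimal sketch of tracking why a country is tied to an article, not just which country, assuming each link carries the Wikidata property that produced it. The class, the function names, and the sample links are hypothetical. The property IDs are real Wikidata properties, but which ones the actual model uses is an assumption:

```python
from dataclasses import dataclass

# Illustrative labels for a few Wikidata properties that can tie an item to a
# country. Whether the real model uses exactly these is an assumption.
PROPERTY_LABELS = {
    "P17": "country",
    "P27": "country of citizenship",
    "P495": "country of origin",
}


@dataclass(frozen=True)
class CountryLink:
    """One article -> country connection plus the property that produced it."""
    article: str
    country: str
    wikidata_property: str


def reasons_by_country(links):
    """Group links as {country: set of property IDs that led to it}."""
    grouped = {}
    for link in links:
        grouped.setdefault(link.country, set()).add(link.wikidata_property)
    return grouped


def keep_only(links, allowed_properties):
    """Filter to links created by the given properties (e.g. places only)."""
    return [l for l in links if l.wikidata_property in allowed_properties]


# Invented example links for a single hypothetical article.
links = [
    CountryLink("Example article", "Nigeria", "P27"),
    CountryLink("Example article", "Nigeria", "P17"),
    CountryLink("Example article", "France", "P495"),
]
reasons = reasons_by_country(links)
places_only = keep_only(links, {"P17"})
```

Keeping the property alongside the country leaves the include/exclude decision open: a metric can later be restricted to, say, places or people without re-running the labeling.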

Trying to summarize with a few more details why mapping content to urban/rural distinctions is quite challenging:

  • The best work I know on "what is rural?" comes from Hardy et al. They describe how various researchers have operationalized rurality (Section 2.1), which fall into two main categories:
    • Descriptive rurals: observable/measurable features such as population size/density, distance to urban area, economic indicators
      • Most of these are quite difficult to do at scale -- in particular, population density, distance to urban area, economic indicators. These can often be gathered for a single country but interpreting what is urban and what is rural from these numbers varies greatly country-to-country and I don't know of good global datasets for this sort of categorization (see UN page, Section D for more details). This is what I did though in 2017 for my urban-rural research on the US/China and Wikipedia (paper).
      • As mentioned in the UN page, the best approach is based on population size though it is not comparable between countries. I do have a script for doing this based on place names or coordinates (description) but that greatly narrows the scope of articles being considered to just those with coordinates (or existing population data via Wikidata). At that point, given that it's a poor measure of urban/rural and would be greatly limited in what articles it could be applied to compared to the country approach, the value is not clear.
    • Sociocultural/symbolic rurals: defining rural based on the values people hold or cultural traditions
      • This captures the contextual/social nature of "urban vs. rural" that simple measurements like population density can miss. It is nearly impossible to delineate and would have to be captured by categories or similar tagging systems by Wikipedians. A quick skim of categories suggests that such tagging is far from complete and very focused on identifying descriptive rurals such as towns (e.g., enwiki categories starting w/ 'rural'; enwiki 'Rural tourism' category).
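
A minimal sketch of the population-size approach and its two limitations, assuming per-country cutoffs. This is not the script mentioned above. The function name, the country names, and the threshold values are invented for illustration and are not real national definitions:

```python
# Hypothetical per-country minimum populations for "urban"; the numbers are
# made up and only illustrate that cutoffs differ between countries.
URBAN_THRESHOLDS = {"Country A": 2_000, "Country B": 50_000}


def classify_by_population(population, country, thresholds=URBAN_THRESHOLDS):
    """Label a place urban/rural from population size alone.

    Returns "unknown" when the place has no population figure or the country
    has no cutoff, which is where the coverage loss shows up.
    """
    threshold = thresholds.get(country)
    if population is None or threshold is None:
        return "unknown"
    return "urban" if population >= threshold else "rural"


same_size_town = 10_000
label_a = classify_by_population(same_size_town, "Country A")
label_b = classify_by_population(same_size_town, "Country B")
label_missing = classify_by_population(None, "Country A")
```

The same town size comes out urban under one cutoff and rural under the other, so the labels cannot be compared between countries, and any article without population data falls out entirely.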

In summary, no great options, and I'd suggest leaving off an urban/rural metric for content for now:

  • Simple global metric but hard to interpret and coverage is not high
  • More nuanced metrics but would have to be collected country-by-country (and therefore would have quite a lot of gaps and the same comparison challenges)
  • Ideal case of using tagging done by Wikipedians (allows for most nuance) seems too incomplete to be of great use

Weekly update: wrapped up entry into metric schema doc for geographic gaps across readers/contributors/content. Can reasonably close out this task but some more long-term follow-up aspects:

  • Update geographic metrics based on findings from Marc's work on other metrics and feedback that GDI receives about their geographic metrics
  • Expand out model for labeling articles with relevant countries to improve content geographic metric coverage
  • Consider any improvements to geographic content approach -- e.g., also model sub-country level?

Resolving this. I also updated the prototype API for this data so that it doesn't just provide regions but also provides the aggregations to reflect the current status -- e.g., WandaVision: