Page MenuHomePhabricator

Population input metrics
Closed, ResolvedPublic

Description

Read external data to create the population input metrics

Event Timeline

ntsako changed the task status from Open to In Progress.May 27 2022, 2:31 PM

Overall Engagement (Percentile) and Total Population Presence*Growth Percentile depend on Affilate and Overall Engagement being completed

The other columns have been calculated.

Loaded data onto

SELECT *
  FROM ntsako.population_data
WHERE year = 2021

Work on including data specified on Asana
ntsako.population_data has bee replaced by ntsako.population_input_metrics

We have decided to supplement World Bank Population Gaps first with UN, then with Penn World Table, and then with IMF sources where there are missing nations across agencies.

I am continuing work to see which gaps are covered by this plan and which remain and to complete the supplementing plans which I will continue to update here

This will apply to the following World Bank Data currently in our pipeline for Build 1:

Population, total from https://data.worldbank.org/indicator/SP.POP.TOTL

Population growth (annual %) from https://data.worldbank.org/indicator/SP.POP.GROW

Individuals using the Internet (% of population) from https://data.worldbank.org/indicator/IT.NET.USER.ZS

  • Supplement with: UN database TBD may be REDUNDANT as it is sourced from World Bank (World Development Indicators) rather
  • Supplement with: TBD
  • Supplement with: TBD

GDP per capita, PPP (current international $) from https://data.worldbank.org/indicator/NY.GDP.PCAP.PP.CD

GDP per capita, PPP (constant 2017 international $) from https://data.worldbank.org/indicator/NY.GDP.PCAP.PP.KD

Also to World Bank additions to our pipeline for Build 2:
Population, female from https://data.worldbank.org/indicator/SP.POP.TOTL.FE.IN

PPP conversion factor, GDP (LCU per international $) from https://data.worldbank.org/indicator/PA.NUS.PPP

Price level ratio of PPP conversion factor (GDP) to market exchange rate from https://data.worldbank.org/indicator/PA.NUS.PPPC.RF

Official exchange rate (LCU per US$, period average) (PA.NUS.FCRF) https://data.worldbank.org/indicator/PA.NUS.FCRF

DEC Adjusted exchange rate (LCU per US$, period average) (PA.NUS.ATLS) https://data.worldbank.org/indicator/PA.NUS.ATLS

Inflation, consumer prices (annual %) from https://data.worldbank.org/indicator/FP.CPI.TOTL.ZG?view=chart (sourced from International Monetary Fund, International Financial Statistics)

Query used:

-- Run in Hue

WITH country_data AS (
   SELECT distinct iso3_country_code,
          first_value(country_area_label) over(PARTITION BY iso3_country_code)  country_name,
          iso2_country_code,
          un_continent_description      as continent,
          un_subcontinent_description   as sub_content          
     FROM ntsako.country_meta_data country
    WHERE iso3_country_code IS NOT NULL
), wb_gdp_per_capita_current AS (
     SELECT economy,
            value 
       FROM ntsako.world_bank_data_input_metrics
      WHERE series = 'NY.GDP.PCAP.PP.CD' 
        AND time = 'YR'||'2020'
), imf_gdp_per_capita_current AS (
     SELECT iso3_country_code,
            value    as gdp_per_capita_current
       FROM ntsako.imf_input_metrics
      WHERE year = '2020'
        AND weo_subject_code = 'PPPPC'
), wb_gdp_per_capita_2007 AS (
     SELECT economy,
            value
      FROM ntsako.world_bank_data_input_metrics
     WHERE series = 'NY.GDP.PCAP.PP.KD' 
       AND time = 'YR'||'2020'

), imf_gdp_per_capita_2007 AS (
     SELECT iso3_country_code,
            value as gdp_per_capita_constant
       FROM ntsako.imf_input_metrics
      WHERE year = '2020'
        AND weo_subject_code = 'NGDPRPPPPC'
), wb_population_growth AS (
     -- World Bank population growth
    SELECT economy,
           value 
      FROM ntsako.world_bank_data_input_metrics
     WHERE series = 'SP.POP.GROW'
       AND time = 'YR'||'2020'

), un_population_growth AS (
     -- UN population growth
     SELECT country_or_region   country, 
            abs(value )         value
       FROM ntsako.un_data_pop_input_metrics
      WHERE metric = 'pop_growth'
        AND 2020 BETWEEN cast(split(year, '-')[0] as int) AND cast(split(year, '-')[1] as int) - 1
), wb_population_current AS (
     -- World Bank population current
SELECT economy, 
       value
  FROM ntsako.world_bank_data_input_metrics
 WHERE series = 'SP.POP.TOTL'
   AND time = 'YR'||'2020'
), imf_input_metrics_current AS (
     -- IMF population current
SELECT iso3_country_code, 
       value * 1000000    AS population
  FROM ntsako.imf_input_metrics
 WHERE year = '2020'
   AND weo_subject_code = 'LP'
), imf_input_metrics_previous AS (
     -- IMF population previous
SELECT iso3_country_code, 
       value * 1000000  AS population
  FROM ntsako.imf_input_metrics
 WHERE year = '2019'
   AND weo_subject_code = 'LP'
), penn_world_table_input_metrics_current AS (
     -- Penn World Table population current
  SELECT country_code,
         population * 1000000                          AS population,
         current_gdp_output/(population * 1000000)     AS gdp_per_capita_current,
         real_gdp_output / (population * 1000000)      AS gdp_per_capita_constant
    FROM ntsako.penn_world_table_input_metrics
   WHERE year = '2020'
), penn_world_table_input_metrics_previous AS (
     -- Penn World Table population previous
  SELECT country_code,
         population  * 1000000 AS population
    FROM ntsako.penn_world_table_input_metrics
   WHERE year = '2019'
), un_population_current AS (
     -- UN population current
  SELECT country_or_region    AS country,
         value * 1000000      AS value
    FROM ntsako.un_data_pop_input_metrics un_data 
   WHERE year = '2020'
     AND variant = 'Medium'
     AND metric = 'population'
), un_population_previous AS (
     -- UN population previous
  SELECT country_or_region    AS country,
         value * 1000000      AS value
    FROM ntsako.un_data_pop_input_metrics un_data 
   WHERE year = '2019'
     AND variant = 'Medium'
     AND metric = 'population'
),wb_percentage_using_internet AS (
     -- World Bank percentage using internet
   SELECT economy,
          value
    FROM ntsako.world_bank_data_input_metrics
     WHERE series = 'IT.NET.USER.ZS'
       AND time = 'YR'||'2020'
), wb_population AS (
     -- World Bank population current - previous     
     SELECT curr.economy,
          curr.value                as population_thousands,
          growth.value              as population_growth_rate,
          internet.value            as percent_using_internet,
          gdp_current.value         as gdp_per_capita_current,
          gdp_2007.value            as gdp_per_capita_constant
     FROM wb_population_current curr,
          wb_population_growth growth,
          wb_percentage_using_internet internet,
          wb_gdp_per_capita_2007 gdp_2007,
          wb_gdp_per_capita_current gdp_current
     WHERE curr.economy = growth.economy
       AND growth.economy = internet.economy
       AND growth.economy = gdp_2007.economy
       AND growth.economy = gdp_current.economy
), un_population AS (
     -- UN population current - previous
     SELECT curr.country,
            curr.value    as population_thousands,
            growth.value  as population_growth_rate
       FROM un_population_current curr,
            un_population_growth  growth
      WHERE curr.country = growth.country
), imf_population AS (
     -- IMF population current - previous
     SELECT curr.iso3_country_code,
            curr.population * 1000000                  as population_thousands,
           (curr.population/prev.population) * 1000000 as population_growth_rate,
            gdp_current.gdp_per_capita_current,
            gdp_2007.gdp_per_capita_constant
     FROM imf_input_metrics_current curr,
          imf_input_metrics_previous prev,
          imf_gdp_per_capita_2007 gdp_2007,
          imf_gdp_per_capita_current gdp_current
     WHERE curr.iso3_country_code = prev.iso3_country_code
       AND prev.iso3_country_code = gdp_2007.iso3_country_code
       AND prev.iso3_country_code = gdp_current.iso3_country_code
), penn_world_table_population AS (
     -- Penn World Table population current - previous
     SELECT curr.country_code,
          curr.population                         as population_thousands,
          (curr.population/prev.population)       as population_growth_rate,
          curr.gdp_per_capita_current,
          curr.gdp_per_capita_constant
     FROM penn_world_table_input_metrics_current curr,
          penn_world_table_input_metrics_previous prev
     WHERE curr.country_code = prev.country_code
), wb_population_rank AS (

  SELECT economy                                             AS country_code,
         population_thousands,
         population_growth_rate,
         population_thousands * (1 + population_growth_rate) AS population_presence,
         percent_using_internet,
         gdp_per_capita_current,
         gdp_per_capita_constant
    FROM wb_population

), un_population_rank AS (
  SELECT country,
         population_thousands,
         population_growth_rate,
         population_thousands * (1 + population_growth_rate) AS population_presence
    FROM un_population
), imf_population_rank AS (
  SELECT iso3_country_code,
         population_thousands,
         population_growth_rate,
         population_thousands * (1 + population_growth_rate) AS population_presence,
         gdp_per_capita_current,
         gdp_per_capita_constant
    FROM imf_population
), penn_world_table_population_rank AS (
  SELECT country_code,
         population_thousands,
         population_growth_rate,
         population_thousands * (1 + population_growth_rate) AS population_presence,
         gdp_per_capita_current,
         gdp_per_capita_constant
    FROM penn_world_table_population
), wikipedia_editors AS (
 SELECT country_code,wikipedia
  FROM ntsako.geoeditor_input_metrics_pivot
 WHERE metric = 'monthly_bins'
   AND year = 2021
), average_editors AS (
  SELECT country_code,
         editorship_monthly,
         monthly_distinct_editors
    FROM ntsako.geoeditor_online_input_metrics
   WHERE year = 2021
)
--INSERT OVERWRITE TABLE ${destination_table} PARTITION (year='2021')
SELECT country.iso3_country_code,
       coalesce(pop.population_thousands,un_pop.population_thousands, pwt_pop.population_thousands, imf_pop.population_thousands)            AS population_thousands,
       coalesce(pop.population_growth_rate, un_pop.population_growth_rate, pwt_pop.population_growth_rate, imf_pop.population_growth_rate)   AS population_growth_rate, 
       wedit.wikipedia,
       avg_cal_year.editorship_monthly,
       avg_cal_year.monthly_distinct_editors,
       coalesce(pop.population_presence, un_pop.population_presence, pwt_pop.population_presence, imf_pop.population_presence)               AS population_presence,
       pop.percent_using_internet,
       coalesce(pop.gdp_per_capita_current, pwt_pop.gdp_per_capita_current, imf_pop.gdp_per_capita_current)                                        AS gdp_per_capita_current,
       coalesce(pop.gdp_per_capita_constant, pwt_pop.gdp_per_capita_constant, imf_pop.gdp_per_capita_constant)                                           AS gdp_per_capita_constant
  FROM country_data country
  LEFT JOIN wb_population_rank pop ON (country.iso3_country_code = pop.country_code)
  LEFT JOIN wikipedia_editors wedit ON (country.iso2_country_code = wedit.country_code)
  LEFT JOIN average_editors avg_cal_year ON (country.iso2_country_code = avg_cal_year.country_code)
  LEFT JOIN un_population_rank un_pop ON (lower(country.country_name) = lower(un_pop.country))
  LEFT JOIN imf_population_rank imf_pop ON (country.iso3_country_code = imf_pop.iso3_country_code)
  LEFT JOIN penn_world_table_population_rank pwt_pop ON (country.iso3_country_code = pwt_pop.country_code);

Resultant table:

SELECT *
  FROM ntsako.population_data_input_metrics
 WHERE year = 2021
JAnstee_WMF triaged this task as Medium priority.
JAnstee_WMF raised the priority of this task from Medium to High.Nov 28 2022, 3:58 PM

@KCVelaga_WMF @ntsako As discussed, for now, we will use a mirrored extract from my supplementation mapping workbook. I created a mirror of only the build 1 data inputs for the 2020 and 2021 calendar years - I pulled these data as of Jan 19, 2023 for date noting. I think we should maybe use 2020 for 2021 and then 2021 for 2022 given the timing of the updates and how it will roll-on forward, but we have the option to also just use the most up-to-date data for both - we should decide and QA.

Final input metrics table moved from ntsako.population_data_input_metrics which now stores the csv data to ntsako.population_leadership_input_metrics

@JAnstee_WMF I have QC'ed the inputs at on this sheet. As the population metrics are directly loaded from the supplements sheet, they are all good.

For editor related inputs, apart from the minor differences, the big differences (countries: AGO, VGB, ZMB, ZWE) are being caused by the unavailability of the data in the 2021 Data Reference sheet. Rest looks good.

Noting column name change from population_thousands to population_total

@ntsako There is one column needing renaming in the inputs file ntsako.population_data_input_metrics

population_thousands should be changed to population_total

@KCVelaga_WMF - With the caveat of the column name change, signing off on the QA for the data

@ntsako -ok here are the revised column names round 2:
FROM:
population_total TO population_annual_signal
population_growth_rate TO population_annual_change
population_wikipedia_editors TO wikipedia_annual_signal
population_global_editors_annual_signal TO wikimedia_annual_signal
population_global_active_editors_annual_signal TO wikimedia_active_annual_signal
internet_user_access TO internet_percent_annual_signal

Data loaded in production