Read external data to create the population input metrics
Description
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Open | None | T305468 Deploy the GDI Equity Landscape Dashboard | |||
| Open | None | T305475 Milestone: Ingest and Transform Input Data | |||
| Open | None | T306625 Extract + Transformation Raw Data into Input Metrics | |||
| Resolved | • ntsako | T309279 Population input metrics |
Event Timeline
Overall Engagement (Percentile) and Total Population Presence*Growth Percentile depend on Affilate and Overall Engagement being completed
The other columns have been calculated.
Work on including data specified on Asana
ntsako.population_data has bee replaced by ntsako.population_input_metrics
We have decided to supplement World Bank Population Gaps first with UN, then with Penn World Table, and then with IMF sources where there are missing nations across agencies.
I am continuing work to see which gaps are covered by this plan and which remain and to complete the supplementing plans which I will continue to update here
This will apply to the following World Bank Data currently in our pipeline for Build 1:
Population, total from https://data.worldbank.org/indicator/SP.POP.TOTL
- Supplement with: UN database Total population, both sexes combined (thousands) from http://data.un.org/Data.aspx?q=population&d=PopDiv&f=variableID%3a12
- Supplement with: Penn World Table database Population (in millions)(pop) from https://www.rug.nl/ggdc/productivity/pwt/ BE SURE TO NOTE DIFFERENCE IN SCALE as you will need to multiply by 1 mil
- Supplement with IMF database Population Millions of people (LP) from https://www.imf.org/external/datamapper/LP@WEO/OEMDC/ADVEC/WEOWORLD. NOTE DIFFERENCE IN SCALE as you will need to multiply by 1 mil**
Population growth (annual %) from https://data.worldbank.org/indicator/SP.POP.GROW
- Supplement with: UN database Average annual rate of population change (percentage) from http://data.un.org/Data.aspx?q=population&d=PopDiv&f=variableID%3a47 NOTE DIFFERENCE IN SCALING - NEED TO TRANSFORM TO POSITIVE RATE FOR MULTIPLICATION
- Supplement with: Penn World Table database Population (in millions) (pop) from https://www.rug.nl/ggdc/productivity/pwt/ NOTE NEED TO TRANSFORM TWO ANNUAL SCORES TO DERIVE GROWTH RATE FOR MULTIPLICATION
- Supplement with IMF database Population from World Economic Outlook via https://www.imf.org/en/Publications/WEO/weo-database/2022/April NOTE NEED TO TRANSFORM TWO ANNUAL SCORE GROWTH RATE FOR MULTIPLICATION
Individuals using the Internet (% of population) from https://data.worldbank.org/indicator/IT.NET.USER.ZS
- Supplement with: UN database TBD may be REDUNDANT as it is sourced from World Bank (World Development Indicators) rather
- Supplement with: TBD
- Supplement with: TBD
GDP per capita, PPP (current international $) from https://data.worldbank.org/indicator/NY.GDP.PCAP.PP.CD
- Supplement with: UN database GDP per capita, PPP (current international $) from http://data.un.org/Data.aspx?d=WDI&f=Indicator_Code%3ANY.GDP.PCAP.PP.CD is REDUNDANT as it is sourced from World Bank (World Development Indicators) rather
- Supplement with: Penn World Table database GDP PPP - OUT -current int'l $ (cgdpo) and Population (in millions) (pop). Dividing cgdpo by pop for approximate replacement. from https://www.rug.nl/ggdc/productivity/pwt/
- Supplement with IMF database Gross domestic product per capita, current prices from World Economic Outlook via https://www.imf.org/en/Publications/WEO/weo-database/2022/April (WEO Subject Code: PPPPC)
GDP per capita, PPP (constant 2017 international $) from https://data.worldbank.org/indicator/NY.GDP.PCAP.PP.KD
- Supplement with: UN database GDP per capita, PPP (constant international $) (GDP.PCAP.PP.KD) from http://data.un.org/Data.aspx?d=WDI&f=Indicator_Code%3ANY.GDP.PCAP.PP.KD is REDUNDANT as it is sourced from World Bank (World Development Indicators) rather
- Supplement with: Penn World Table database GDP - OUT - constant int'l $ (rgdpo) and Population (in millions) (pop). Dividing rgdpo by pop for approximate replacement. from https://www.rug.nl/ggdc/productivity/pwt/
- Supplement with IMF database Gross domestic product per capita, constant prices from World Economic Outlook via https://www.imf.org/en/Publications/WEO/weo-database/2022/April (WEO Subject Code: NGDPRPPPPC)
Also to World Bank additions to our pipeline for Build 2:
Population, female from https://data.worldbank.org/indicator/SP.POP.TOTL.FE.IN
- Supplement with: UN database Female population (thousands) from http://data.un.org/Data.aspx?q=population&d=PopDiv&f=variableID%3a11
- Supplement with: Penn World Table database — No gender % data available
- Supplement with IMF database — No gender % data available
PPP conversion factor, GDP (LCU per international $) from https://data.worldbank.org/indicator/PA.NUS.PPP
- Supplement with: UN database TBD from
- Supplement with: Penn World Table database TBD from https://www.rug.nl/ggdc/productivity/pwt/
- Supplement with IMF database Implied PPP conversion rate National currency per international dollar from https://www.imf.org/external/datamapper/PPPEX@WEO/OEMDC/ADVEC/WEOWORLD
Price level ratio of PPP conversion factor (GDP) to market exchange rate from https://data.worldbank.org/indicator/PA.NUS.PPPC.RF
- Supplement with: UN database TBD from
- Supplement with: Penn World Table database TBD from https://www.rug.nl/ggdc/productivity/pwt/
- Supplement with IMF database Implied PPP conversion rate National currency per international dollar from https://www.imf.org/external/datamapper/PPPEX@WEO/OEMDC/ADVEC/WEOWORLD transform PPPEX by dividing it by PA.NUS.ATLS (DEC Adjusted exchange rate)
Official exchange rate (LCU per US$, period average) (PA.NUS.FCRF) https://data.worldbank.org/indicator/PA.NUS.FCRF
- Supplement with: UN database UN Operational Rates of Exchange from https://treasury.un.org/operationalrates/OperationalRates.php
- Supplement with PWT database Exchange rate, national currency/USD (market+estimated) (xr) from https://www.rug.nl/ggdc/productivity/pwt/
- Supplement with IMF database Domestic Currency per U.S. Dollar, End of Period (ENDA_XDC_USD_RATE) from https://data.imf.org/regular.aspx?key=61545850 (NOTE: STILL NO TAIWAN - Maybe we can use: Taiwan Dollar to US Dollar Spot Exchange Rates for YEAR from https://www.exchangerates.org.uk/USD-ILS-exchange-rate-history.html)
DEC Adjusted exchange rate (LCU per US$, period average) (PA.NUS.ATLS) https://data.worldbank.org/indicator/PA.NUS.ATLS
- Supplement with: UN database UN Operational Rates of Exchange from https://treasury.un.org/operationalrates/OperationalRates.php
- Supplement with PWT database Exchange rate, national currency/USD (market+estimated) (xr) from https://www.rug.nl/ggdc/productivity/pwt/
- Supplement with IMF database Domestic Currency per U.S. Dollar, End of Period (ENDA_XDC_USD_RATE) from https://data.imf.org/regular.aspx?key=61545850 (NOTE STILL NO TAIWAN - Maybe we can use: Taiwan Dollar to US Dollar Spot Exchange Rates for YEAR from https://www.exchangerates.org.uk/USD-ILS-exchange-rate-history.html)
Inflation, consumer prices (annual %) from https://data.worldbank.org/indicator/FP.CPI.TOTL.ZG?view=chart (sourced from International Monetary Fund, International Financial Statistics)
- Supplement with: UN database CPI % CHANGE from Consumer price Index via http://data.un.org/Data.aspx?q=cpi+change&d=IFS&f=SeriesCode%3a64
- Supplement with: Penn World Table database TBD from https://www.rug.nl/ggdc/productivity/pwt/
- Supplement with IMF database TBD from (Redundant as this is the World Bank's source)
Query used:
-- Run in Hue
WITH country_data AS (
SELECT distinct iso3_country_code,
first_value(country_area_label) over(PARTITION BY iso3_country_code) country_name,
iso2_country_code,
un_continent_description as continent,
un_subcontinent_description as sub_content
FROM ntsako.country_meta_data country
WHERE iso3_country_code IS NOT NULL
), wb_gdp_per_capita_current AS (
SELECT economy,
value
FROM ntsako.world_bank_data_input_metrics
WHERE series = 'NY.GDP.PCAP.PP.CD'
AND time = 'YR'||'2020'
), imf_gdp_per_capita_current AS (
SELECT iso3_country_code,
value as gdp_per_capita_current
FROM ntsako.imf_input_metrics
WHERE year = '2020'
AND weo_subject_code = 'PPPPC'
), wb_gdp_per_capita_2007 AS (
SELECT economy,
value
FROM ntsako.world_bank_data_input_metrics
WHERE series = 'NY.GDP.PCAP.PP.KD'
AND time = 'YR'||'2020'
), imf_gdp_per_capita_2007 AS (
SELECT iso3_country_code,
value as gdp_per_capita_constant
FROM ntsako.imf_input_metrics
WHERE year = '2020'
AND weo_subject_code = 'NGDPRPPPPC'
), wb_population_growth AS (
-- World Bank population growth
SELECT economy,
value
FROM ntsako.world_bank_data_input_metrics
WHERE series = 'SP.POP.GROW'
AND time = 'YR'||'2020'
), un_population_growth AS (
-- UN population growth
SELECT country_or_region country,
abs(value ) value
FROM ntsako.un_data_pop_input_metrics
WHERE metric = 'pop_growth'
AND 2020 BETWEEN cast(split(year, '-')[0] as int) AND cast(split(year, '-')[1] as int) - 1
), wb_population_current AS (
-- World Bank population current
SELECT economy,
value
FROM ntsako.world_bank_data_input_metrics
WHERE series = 'SP.POP.TOTL'
AND time = 'YR'||'2020'
), imf_input_metrics_current AS (
-- IMF population current
SELECT iso3_country_code,
value * 1000000 AS population
FROM ntsako.imf_input_metrics
WHERE year = '2020'
AND weo_subject_code = 'LP'
), imf_input_metrics_previous AS (
-- IMF population previous
SELECT iso3_country_code,
value * 1000000 AS population
FROM ntsako.imf_input_metrics
WHERE year = '2019'
AND weo_subject_code = 'LP'
), penn_world_table_input_metrics_current AS (
-- Penn World Table population current
SELECT country_code,
population * 1000000 AS population,
current_gdp_output/(population * 1000000) AS gdp_per_capita_current,
real_gdp_output / (population * 1000000) AS gdp_per_capita_constant
FROM ntsako.penn_world_table_input_metrics
WHERE year = '2020'
), penn_world_table_input_metrics_previous AS (
-- Penn World Table population previous
SELECT country_code,
population * 1000000 AS population
FROM ntsako.penn_world_table_input_metrics
WHERE year = '2019'
), un_population_current AS (
-- UN population current
SELECT country_or_region AS country,
value * 1000000 AS value
FROM ntsako.un_data_pop_input_metrics un_data
WHERE year = '2020'
AND variant = 'Medium'
AND metric = 'population'
), un_population_previous AS (
-- UN population previous
SELECT country_or_region AS country,
value * 1000000 AS value
FROM ntsako.un_data_pop_input_metrics un_data
WHERE year = '2019'
AND variant = 'Medium'
AND metric = 'population'
),wb_percentage_using_internet AS (
-- World Bank percentage using internet
SELECT economy,
value
FROM ntsako.world_bank_data_input_metrics
WHERE series = 'IT.NET.USER.ZS'
AND time = 'YR'||'2020'
), wb_population AS (
-- World Bank population current - previous
SELECT curr.economy,
curr.value as population_thousands,
growth.value as population_growth_rate,
internet.value as percent_using_internet,
gdp_current.value as gdp_per_capita_current,
gdp_2007.value as gdp_per_capita_constant
FROM wb_population_current curr,
wb_population_growth growth,
wb_percentage_using_internet internet,
wb_gdp_per_capita_2007 gdp_2007,
wb_gdp_per_capita_current gdp_current
WHERE curr.economy = growth.economy
AND growth.economy = internet.economy
AND growth.economy = gdp_2007.economy
AND growth.economy = gdp_current.economy
), un_population AS (
-- UN population current - previous
SELECT curr.country,
curr.value as population_thousands,
growth.value as population_growth_rate
FROM un_population_current curr,
un_population_growth growth
WHERE curr.country = growth.country
), imf_population AS (
-- IMF population current - previous
SELECT curr.iso3_country_code,
curr.population * 1000000 as population_thousands,
(curr.population/prev.population) * 1000000 as population_growth_rate,
gdp_current.gdp_per_capita_current,
gdp_2007.gdp_per_capita_constant
FROM imf_input_metrics_current curr,
imf_input_metrics_previous prev,
imf_gdp_per_capita_2007 gdp_2007,
imf_gdp_per_capita_current gdp_current
WHERE curr.iso3_country_code = prev.iso3_country_code
AND prev.iso3_country_code = gdp_2007.iso3_country_code
AND prev.iso3_country_code = gdp_current.iso3_country_code
), penn_world_table_population AS (
-- Penn World Table population current - previous
SELECT curr.country_code,
curr.population as population_thousands,
(curr.population/prev.population) as population_growth_rate,
curr.gdp_per_capita_current,
curr.gdp_per_capita_constant
FROM penn_world_table_input_metrics_current curr,
penn_world_table_input_metrics_previous prev
WHERE curr.country_code = prev.country_code
), wb_population_rank AS (
SELECT economy AS country_code,
population_thousands,
population_growth_rate,
population_thousands * (1 + population_growth_rate) AS population_presence,
percent_using_internet,
gdp_per_capita_current,
gdp_per_capita_constant
FROM wb_population
), un_population_rank AS (
SELECT country,
population_thousands,
population_growth_rate,
population_thousands * (1 + population_growth_rate) AS population_presence
FROM un_population
), imf_population_rank AS (
SELECT iso3_country_code,
population_thousands,
population_growth_rate,
population_thousands * (1 + population_growth_rate) AS population_presence,
gdp_per_capita_current,
gdp_per_capita_constant
FROM imf_population
), penn_world_table_population_rank AS (
SELECT country_code,
population_thousands,
population_growth_rate,
population_thousands * (1 + population_growth_rate) AS population_presence,
gdp_per_capita_current,
gdp_per_capita_constant
FROM penn_world_table_population
), wikipedia_editors AS (
SELECT country_code,wikipedia
FROM ntsako.geoeditor_input_metrics_pivot
WHERE metric = 'monthly_bins'
AND year = 2021
), average_editors AS (
SELECT country_code,
editorship_monthly,
monthly_distinct_editors
FROM ntsako.geoeditor_online_input_metrics
WHERE year = 2021
)
--INSERT OVERWRITE TABLE ${destination_table} PARTITION (year='2021')
SELECT country.iso3_country_code,
coalesce(pop.population_thousands,un_pop.population_thousands, pwt_pop.population_thousands, imf_pop.population_thousands) AS population_thousands,
coalesce(pop.population_growth_rate, un_pop.population_growth_rate, pwt_pop.population_growth_rate, imf_pop.population_growth_rate) AS population_growth_rate,
wedit.wikipedia,
avg_cal_year.editorship_monthly,
avg_cal_year.monthly_distinct_editors,
coalesce(pop.population_presence, un_pop.population_presence, pwt_pop.population_presence, imf_pop.population_presence) AS population_presence,
pop.percent_using_internet,
coalesce(pop.gdp_per_capita_current, pwt_pop.gdp_per_capita_current, imf_pop.gdp_per_capita_current) AS gdp_per_capita_current,
coalesce(pop.gdp_per_capita_constant, pwt_pop.gdp_per_capita_constant, imf_pop.gdp_per_capita_constant) AS gdp_per_capita_constant
FROM country_data country
LEFT JOIN wb_population_rank pop ON (country.iso3_country_code = pop.country_code)
LEFT JOIN wikipedia_editors wedit ON (country.iso2_country_code = wedit.country_code)
LEFT JOIN average_editors avg_cal_year ON (country.iso2_country_code = avg_cal_year.country_code)
LEFT JOIN un_population_rank un_pop ON (lower(country.country_name) = lower(un_pop.country))
LEFT JOIN imf_population_rank imf_pop ON (country.iso3_country_code = imf_pop.iso3_country_code)
LEFT JOIN penn_world_table_population_rank pwt_pop ON (country.iso3_country_code = pwt_pop.country_code);Resultant table:
SELECT * FROM ntsako.population_data_input_metrics WHERE year = 2021
@KCVelaga_WMF @ntsako As discussed, for now, we will use a mirrored extract from my supplementation mapping workbook. I created a mirror of only the build 1 data inputs for the 2020 and 2021 calendar years - I pulled these data as of Jan 19, 2023 for date noting. I think we should maybe use 2020 for 2021 and then 2021 for 2022 given the timing of the updates and how it will roll-on forward, but we have the option to also just use the most up-to-date data for both - we should decide and QA.
Final input metrics table moved from ntsako.population_data_input_metrics which now stores the csv data to ntsako.population_leadership_input_metrics
@JAnstee_WMF I have QC'ed the inputs at on this sheet. As the population metrics are directly loaded from the supplements sheet, they are all good.
For editor related inputs, apart from the minor differences, the big differences (countries: AGO, VGB, ZMB, ZWE) are being caused by the unavailability of the data in the 2021 Data Reference sheet. Rest looks good.
@ntsako There is one column needing renaming in the inputs file ntsako.population_data_input_metrics
population_thousands should be changed to population_total
@KCVelaga_WMF - With the caveat of the column name change, signing off on the QA for the data
@ntsako -ok here are the revised column names round 2:
FROM:
population_total TO population_annual_signal
population_growth_rate TO population_annual_change
population_wikipedia_editors TO wikipedia_annual_signal
population_global_editors_annual_signal TO wikimedia_annual_signal
population_global_active_editors_annual_signal TO wikimedia_active_annual_signal
internet_user_access TO internet_percent_annual_signal