(creating this task for QA coordination as it was never created)
Description
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Open | None | T305468 Deploy the GDI Equity Landscape Dashboard | |||
| Open | None | T305475 Milestone: Ingest and Transform Input Data | |||
| Open | None | T306625 Extract + Transformation Raw Data into Input Metrics | |||
| Resolved | • ntsako | T324968 Access input metrics |
Event Timeline
Please run this in a Jupyter notebook
java_function = "CREATE TEMPORARY FUNCTION array_avg AS 'org.wikimedia.analytics.refinery.hive.ArrayAvgUDF' USING JAR 'hdfs://analytics-hadoop/wmf/refinery/current/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.2.9.jar';"
country_data = "ntsako.country_meta_data"
world_bank = "ntsako.world_bank_data_input_metrics"
world_bank_year = 2020
mobile_index = "ntsako.mci_input_metrics"
score_year = 2019
year = 2021
social_progress = "ntsako.social_progress_input_metrics"
spi_year = 2020
sql = f"""
WITH country_data AS (
SELECT distinct iso3_country_code,
first_value(country_area_label) over(PARTITION BY iso3_country_code) country_area_label,
iso2_country_code
FROM {country_data} country
WHERE iso3_country_code IS NOT NULL
),
world_bank AS (
SELECT *
FROM {world_bank}
WHERE time ='YR{world_bank_year}'
),
mobile_connectivity_index AS (
SELECT *
FROM (
SELECT iso_code,
connectivity_index / lag(connectivity_index) over (partition by iso_code order by score_year) as connectivity_index,
connectivity_index as curr_year_value,
lag(connectivity_index) over (partition by iso_code order by score_year) prev_year_value,
score_year
FROM {mobile_index}
WHERE score_year IN ({score_year}-1,{score_year})
) mci
WHERE score_year = {score_year}
)
SELECT country.iso3_country_code,
nvl(internet.value,0) percent_pop_using_internet,
nvl(mobile.value,0) mobile_cellular_subscriptions,
nvl(social.access_to_basic_knowledge,0) access_to_basic_knowledge,
nvl(social.access_to_information,0) access_to_information,
nvl(mci.connectivity_index,0) connectivity_index,
array_avg(array(internet.value, mci.connectivity_index)) as growth_rate
FROM country_data country
LEFT JOIN world_bank internet ON (country.iso3_country_code = internet.economy AND internet.series = 'IT.NET.USER.ZS')
LEFT JOIN world_bank mobile ON (country.iso3_country_code = mobile.economy AND mobile.series = 'IT.CEL.SETS.P2')
LEFT JOIN {social_progress} social ON (country.iso3_country_code = social.country_code AND social.year = {year} AND social.spi_year = {spi_year})
LEFT JOIN mobile_connectivity_index mci ON (country.iso3_country_code = mci.iso_code)
spark.sql(java_function)
spark.sql(sql).toPandas()
"""Where spark is the SparkSession
@ntsako and @KCVelaga_WMF - looking on track, but as I commented in the QA workbook - we have a dropped input metric - we should be extracting both the connectivity_index score as well as its growth rate from the GSMA data - I accidentally dropped it in the background but note the gap here now. In addition, there are some label adjustments to be aligned:
FROM: access_percent_pop_using_internet TO: percent_population_using_intermet FROM: access_mobile_cellular_subscriptions TO: mobile_subscriptions FROM: access_basic_knowledge TO: access_to_basic_knowledge FROM: access_connectivity_index TO: connectivity_index ** NEED TO ADD: access_connectivity_index_growth ** FROM: access_growth_internet TO: annual_internet_growth FROM: access_access_to_information TO: access_to_information
@KCVelaga_WMF Also one change to the output labels FROM: access_presence_growth TO access
I realized that we are already considering growth, however, the column title is slightly confusing
calculation for growth in SQL query
connectivity_index / lag(connectivity_index) over (partition by iso_code order by score_year) as connectivity_index
so the current access_connectivity_index has to be renamed to access_connectivity_index_growth
@ntsako -ok here are the revised column names round 2:
FROM:
mobile_subscriptions TO mobile_subscriptions_annual_signal
access_to_basic_knowledge TO access_to_basic_knowledge_annual_signal
connectivity_index TO annual_connectivity_index_annual_change
annual_internet_growth TO internet_annual_change
access_to_information TO access_to_information_annual_change