Page MenuHomePhabricator

Access input metrics
Closed, ResolvedPublic

Description

(creating this task for QA coordination as it was never created)

Event Timeline

KCVelaga_WMF added a subscriber: ntsako.

@ntsako please share the query for access inputs

Please run this in a Jupyter notebook

java_function = "CREATE TEMPORARY FUNCTION array_avg AS 'org.wikimedia.analytics.refinery.hive.ArrayAvgUDF' USING JAR 'hdfs://analytics-hadoop/wmf/refinery/current/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.2.9.jar';"

country_data = "ntsako.country_meta_data"
world_bank = "ntsako.world_bank_data_input_metrics"
world_bank_year = 2020
mobile_index = "ntsako.mci_input_metrics"
score_year = 2019
year = 2021
social_progress = "ntsako.social_progress_input_metrics"
spi_year = 2020


sql = f"""

WITH country_data AS (
   SELECT distinct iso3_country_code,
          first_value(country_area_label) over(PARTITION BY iso3_country_code)  country_area_label,
          iso2_country_code
     FROM {country_data} country
    WHERE iso3_country_code IS NOT NULL
),
 world_bank AS (
   SELECT *
     FROM {world_bank}
     WHERE time ='YR{world_bank_year}'
),
mobile_connectivity_index AS (

SELECT *
  FROM (
     SELECT iso_code,
            connectivity_index / lag(connectivity_index) over (partition by iso_code order by score_year)  as connectivity_index,
            connectivity_index as curr_year_value,
            lag(connectivity_index) over (partition by iso_code order by score_year) prev_year_value,
            score_year
       FROM {mobile_index}
      WHERE score_year IN ({score_year}-1,{score_year})
    ) mci
  WHERE score_year = {score_year}


)
SELECT country.iso3_country_code,
       nvl(internet.value,0)                            percent_pop_using_internet,
       nvl(mobile.value,0)                              mobile_cellular_subscriptions,
       nvl(social.access_to_basic_knowledge,0)          access_to_basic_knowledge,
       nvl(social.access_to_information,0)              access_to_information,
       nvl(mci.connectivity_index,0)                    connectivity_index,
       array_avg(array(internet.value, mci.connectivity_index)) as growth_rate
  FROM country_data country
  LEFT JOIN world_bank internet ON (country.iso3_country_code = internet.economy AND internet.series = 'IT.NET.USER.ZS')
  LEFT JOIN world_bank mobile ON (country.iso3_country_code = mobile.economy AND mobile.series = 'IT.CEL.SETS.P2')
  LEFT JOIN {social_progress} social ON (country.iso3_country_code = social.country_code AND social.year = {year} AND social.spi_year = {spi_year})
  LEFT JOIN mobile_connectivity_index mci ON (country.iso3_country_code = mci.iso_code)

spark.sql(java_function)

spark.sql(sql).toPandas()

"""

Where spark is the SparkSession

@ntsako and @KCVelaga_WMF - looking on track, but as I commented in the QA workbook - we have a dropped input metric - we should be extracting both the connectivity_index score as well as its growth rate from the GSMA data - I accidentally dropped it in the background but note the gap here now. In addition, there are some label adjustments to be aligned:

 FROM:   	access_percent_pop_using_internet	        TO:	percent_population_using_intermet
 FROM:   	access_mobile_cellular_subscriptions	        TO:	mobile_subscriptions
 FROM:   	access_basic_knowledge	        TO:	access_to_basic_knowledge
 FROM:   	access_connectivity_index	        TO:	connectivity_index
** NEED TO ADD:   	access_connectivity_index_growth	        **
 FROM:   	access_growth_internet	        TO:	annual_internet_growth
 FROM:   	access_access_to_information	        TO:	access_to_information

@KCVelaga_WMF Also one change to the output labels FROM: access_presence_growth TO access

@JAnstee_WMF

I realized that we are already considering growth, however, the column title is slightly confusing

calculation for growth in SQL query

connectivity_index / lag(connectivity_index) over (partition by iso_code order by score_year)  as connectivity_index

so the current access_connectivity_index has to be renamed to access_connectivity_index_growth

@JAnstee_WMF Access inputs QC is complete (sheet). Please review and sign-off.

@ntsako -ok here are the revised column names round 2:
FROM:
mobile_subscriptions TO mobile_subscriptions_annual_signal
access_to_basic_knowledge TO access_to_basic_knowledge_annual_signal
connectivity_index TO annual_connectivity_index_annual_change
annual_internet_growth TO internet_annual_change
access_to_information TO access_to_information_annual_change