
geowiki data for Global Innovation Index
Closed, Resolved, Public

Description

Data Requirements

Analytics pinging users to fill in

What data is used for

(might need to graduate to its own task later)

The Global Innovation Index (GII) is a ranking of 141 economies in terms of their innovation capabilities and results. At its core are 79 data-based indicators. These metrics can be used (at the level of the index, the sub-indices, or individual variables) to monitor performance over time and to benchmark economies against their peers. They can also help study country profiles over time and identify relative strengths and weaknesses from the GII dataset.

The report is co-published by Cornell University, INSEAD, and the World Intellectual Property Organization (WIPO, a specialized agency of the United Nations), with the collaboration of three Knowledge Partners: the Confederation of Indian Industry, du, and A.T. Kearney and IMP³rove – European Innovation Management Academy. Now in its ninth edition, the report has established itself as a premier reference among innovation metrics and as a tool to facilitate public-private dialogue and evidence-based policymaking.

Each year the GII results are presented within the framework of a top-level international event:

  • 2013 Geneva, Switzerland at the Opening Session of the United Nations Economic and Social Council (ECOSOC) High-Level Segment, organized by WIPO;
  • 2014 Sydney, Australia in the context of the B20/G20 preparations; and
  • 2015 London, United Kingdom before the Minister of Innovation and Industry.

This year the launch is scheduled for the summer in Beijing, China preceding the preparations for the 2016 G20 summit.

Recognizing the need for a broad horizontal vision of innovation applicable to developed and emerging economies alike, the GII includes indicators that go beyond the traditional measures such as expenditure in research and development. That said, an area that is of great relevance and limited to the GII is that of creative outputs. Within it, Wikipedia monthly page edits (per million population 15-69 y/o) is a key metric. This indicator, along with others that measure the number of generic top-level and country-code top-level domains and video uploads in YouTube, helps capture what we define as online creativity.

Lastly, we believe that the GII can be an important vehicle to signal that Wikipedia is a critical lever to innovation and a factor contributing to a new understanding of the digital information landscape and innovation globally.

Code for the data

The code for generating this data is maintained at http://localhost:8000/user/leila/notebooks/T131889.ipynb . The code is not readable by others atm, I'm linking it here to increase the bus factor so someone else can pick this work up if something happens to me. :)

Event Timeline

Nuria created this task. Apr 5 2016, 8:38 PM
Nuria updated the task description. Apr 5 2016, 8:41 PM
Nuria updated the task description.

The data used in previous editions of the GII focused on measuring Wikipedia Page Edits per Country. As per the definition we showed in our technical notes, the count of monthly page edits was based on a 1:1,000 sampled server log (squids), averaged over quarterly reports. Countries were included only if the number of page edits in the period exceeded 100,000 (100 matching records in the 1:1,000 sampled log). Page edits by bots were not included. Also, all IP addresses that occurred more than once on a given day were discarded for that day. A few false negatives were taken for granted. We at the GII then compiled and reported the data per million population 15–69 years old per country.
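
The counting rule described above can be sketched roughly as follows. This is only an illustration, not the original GII or Wikistats code: the function name and its inputs are hypothetical, and only the 1:1,000 sampling rate, the 100-record cutoff and the per-million scaling come from the description.

```python
SAMPLING_RATE = 1000        # 1:1,000 sampled log, per the description
MIN_SAMPLED_RECORDS = 100   # 100 records, i.e. 100,000 estimated edits


def edits_per_million(sampled_records, population_15_69):
    """Estimate page edits per million population aged 15-69.

    sampled_records: matching records in the sampled log (hypothetical input)
    population_15_69: population aged 15-69 (hypothetical input)
    Returns None when the country does not exceed the inclusion cutoff.
    """
    if sampled_records <= MIN_SAMPLED_RECORDS:
        return None
    estimated_edits = sampled_records * SAMPLING_RATE
    return estimated_edits / (population_15_69 / 1_000_000)
```

For example, 250 sampled records for a country with 5 million people aged 15-69 would come out at 50,000 edits per million under these assumptions.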

Nuria moved this task from Incoming to Dashiki on the Analytics board. Apr 18 2016, 4:59 PM
Nuria moved this task from Dashiki to Backlog (Later) on the Analytics board. Jul 25 2016, 4:38 PM
Nuria moved this task from Backlog (Later) to Wikistats on the Analytics board. Oct 10 2016, 3:39 PM

Ok, some details about what data is available. Wiki projects store IPs in a table called recentchanges. A project called geowiki [1] mines this table and computes edits per country per project since 2012. This data is private due to data security concerns [2]. For those who have signed an NDA, the data can be accessed as follows:

ssh stat1003.eqiad.wmnet
mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf -hs1-analytics-slave.eqiad.wmnet
use staging;
show tables like 'erosen%';
select * from staging.erosen_geocode_country_edits where project='en' and country='Canada' limit 10;

These tables are all the data we've collected over time. Anyone with access to stat1003 and the research user's password has access. The main question is: can we release this data while avoiding the privacy problems found in [2]?

[1] https://wikitech.wikimedia.org/wiki/Analytics/Geowiki
[2] (could not find reference to qchris's email, should be in archives somewhere, will keep looking)

leila claimed this task. Jan 23 2017, 9:17 PM
leila added a subscriber: leila.
leila added a comment. Jan 23 2017, 9:19 PM

thanks @Milimetric . Based on my discussion today with you, Nuria, Rafael and Jordan, I will check the data you linked above and work on possible ways we may be able to release a scoring of countries based on the needs of GII to them. I will follow up on this task when I have more to say.

leila moved this task from Staged to Time Sensitive on the Research board.

@Milimetric I ran the following

select country, sum(edits) from staging.erosen_geocode_country_edits where ts like '2016-%' group by country order by sum(edits) desc;

and I'd like to have a chat with you to see if you see any problem with publishing the top, for example, 100 country level results.

Also, I'm going to share with you via email a sample spreadsheet that Rafael has shared with other companies/organizations. Basically, they don't need even the raw data from us. All they need is for us to fill out that spreadsheet for at least 60-70 countries and then report the column that reports the final normalized index to them. Let's chat about this tomorrow or Thursday if you have time (I'm in UTC+1 and signing off soon.).

Here's a better query that accounts for how that table handles dates (ts I think is the run-time, start and end are the range):

select country, sum(edits) edits from erosen_geocode_country_edits where start like '2016-%-01' group by country order by edits desc;

@Milimetric @ezachte (Erik, if you need some background, please ping. I'm happy to chat about it.) I need your feedback here, please. Here is where we are with this task:

  • I used the query Dan has in T131889#3033103 and ran it for '2016-%-01', '2015-%-01' and '2014-%-01'. (We don't need to disclose anything about the 2015 or 2014 data; I queried those years to see whether edit counts by country change significantly from one year to the next in a way that could give away sensitive information about a user or a bot.)
  • Please check /home/leila/GII_2016.csv on stat1003. The columns are: country; total edits in 2016, 2015, and 2014; an indicator for whether edit counts in 2016 are less than 10K; an indicator for whether edit counts in 2016 are more than 10K but less than 100K; and the rate of increase in edit counts in 2016 compared to 2015.
  • I'm suggesting that we consider disclosing data for countries that have at least 100K edits in 2016. This will leave us with 71 countries. However, even disclosing this set is not an obvious call to me. For example, check edits from Norway in 2016 compared to 2015. What if one bot is responsible for that change in Norwegian Wikipedia, and by disclosing edit counts for Norway in 2016 we're basically disclosing the country that bot operates from?
  • I think the biggest risk in disclosing the overall edit counts in 2016 for the 71 countries is disclosing the country from which bots operate. Edits done by humans are much harder to track down, I believe, if we only focus on countries with 100K+ edits in 2016.
  • And last but not least: in this task, we're not asked to disclose the raw edit counts per country. However, @Milimetric and I discussed the request in more detail last week, and we believe that if we provide the requested index, we should assume that the raw data will become more predictable. So it's safer to solve this task assuming that we will be offering raw data per country.
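
The per-country indicators listed above could be derived along these lines. This is a sketch, not the code behind GII_2016.csv: the function name, the dictionary keys and the `swing_threshold` parameter are hypothetical, and how a count of exactly 10K is bucketed is an assumption.

```python
def summarize_country(edits_2016, edits_2015, swing_threshold=0.5):
    """Build the indicators described for one country.

    swing_threshold is a hypothetical cutoff for flagging large
    year-over-year changes that deserve a manual look.
    """
    increase_rate = None
    if edits_2015 > 0:
        increase_rate = (edits_2016 - edits_2015) / edits_2015

    large_swing = (
        increase_rate is not None and abs(increase_rate) >= swing_threshold
    )

    return {
        "under_10k": edits_2016 < 10_000,
        "between_10k_and_100k": 10_000 <= edits_2016 < 100_000,
        "disclosable": edits_2016 >= 100_000,  # "at least 100K edits in 2016"
        "increase_rate": increase_rate,
        "large_swing": large_swing,
    }
```

A country flagged as both `disclosable` and `large_swing` is the kind of case raised for Norway above: it passes the volume cutoff but its change from 2015 may still be attributable to a single account.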

First, @Rafaesrey is this data ok with you? Can you do what you need with the 71 countries? Leila, based on the spreadsheet does Rafa get what he needs here?

If that's true, then I think Leila what you mention about Norway is one way to read the data. But there are others, and there seems to be enough ambiguity that the active bots could still hide. I mean, if someone were to take the exact number of edits we disclose and the number of edits per user (including bots) and try to tile them geographically like a puzzle to see where all the pieces fit, maybe such a sophisticated approach could reveal something. But cutting the countries with < 100k edits adds some fuzziness. Maybe we can also add a bit of fuzziness on the numbers themselves, not decreasing anything significantly but adding a random +/- X% variance to each country?
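
One way to read the "+/- X%" idea is uniform multiplicative noise on each country's count. The sketch below assumes exactly that; the function name, the uniform distribution and the explicit random generator argument are all illustrative choices, not something agreed on in this task.

```python
import random


def fuzz_count(count, max_pct, rng):
    """Return count perturbed by a random factor within +/- max_pct percent.

    rng is a random.Random instance, passed in so that runs can be
    made reproducible with a seed.
    """
    factor = 1 + rng.uniform(-max_pct, max_pct) / 100
    return round(count * factor)


rng = random.Random(2016)  # arbitrary seed, for reproducibility only
noisy = fuzz_count(100_000, 5, rng)  # somewhere in 95,000..105,000
```

Noise of this kind blurs exact totals but leaves the overall ordering and any large year-over-year jump essentially intact.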

Dear Leila,

This is great news and would be extremely useful for the GII.

I have to point out, however, that the indicator would be going from
perfect coverage (128 countries) to one with low coverage (71). That said,
would it be too much to ask if the number of countries could be expanded
by, say, setting the cutoff at < 50K? Is that even possible? Or perhaps
adding a higher +/-X% variance to those countries with lower edits? Or a
combination of both? Perhaps these options can be explored for future
extractions?

Let me know and, once again, let me reiterate how thankful we are about the
possibility of having access to this data.

Sincerely,

*Rafael Escalona Reynoso, PhD, MPA. *

Lead Researcher at The Global Innovation Index

Cornell SC Johnson College of Business

389 Statler Hall, Ithaca, NY 14853

Phone: +1 (607) 262-0983 |

Email: *re32@cornell.edu <re32@cornell.edu>*

www.globalinnovationindex.org

Please consider the environment before printing this email or its
attachment(s). Please note that this message may contain confidential
information. If you have received this message in error, please notify me
and then delete it from your system.

Hi all,

Or how about giving a set number to those below 100k so that we still have
the coverage? Just another idea.

Best,

*Rafael Escalona Reynoso, PhD, MPA. *

leila added a comment. Feb 23 2017, 5:40 PM

> First, @Rafaesrey is this data ok with you? Can you do what you need with the 71 countries? Leila, based on the spreadsheet does Rafa get what he needs here?

Yeah. He said they need to have at least 50 to 60, that's how I settled at 71 for now. :) But of course, the more we can provide the better. (And note that 100K is to some extent arbitrary. I did eyeball the countries on the list for their population size to avoid obvious issues, but I can do the same for 80K, for example.)

> Maybe we can also add a bit of fuzziness on the numbers themselves, not decreasing anything significantly but adding a random +/- X% variance to each country?

I'm not sure if this helps. The main issue I can spot is when a country's number of edits changes by a large amount from one year to the next, and adding small randomness won't have an impact on that scale of change.

leila added a comment. Feb 23 2017, 5:47 PM

> I have to point out, however, that the indicator would be going from
> perfect coverage (128 countries) to one with low coverage (71). That said,
> would it be too much to ask if the number of countries could be expanded
> by, say, setting the cutoff at < 50K? Is that even possible? Or perhaps
> adding a higher +/-X% variance to those countries with lower edits? Or a
> combination of both? Perhaps these options can be explored for future
> extractions?

Let's first settle on these 71 countries. If there are no major concerns, I can look at 80K or lower threshold. In general, our goal is to give you more countries if we can assure to a reasonable extent that the risk won't be high. But please also keep in mind that at some point we may have to settle with a "good enough" solution and not an optimal one, given that we are very much under-resourced and I need to prioritize between many tasks. :/

Dear Leila,

Thank you for this reply. I understand. Let’s move on with the initial 71
and try to explore the possibility of expanding as you suggest.

I have two further questions/ideas/observations:

i) would reporting back only the scores not be enough to
codify the totals for countries with less than 100k? By this I mean would
the use of the Excel spreadsheet to estimate scores (and reporting back
only these) not be sufficient to anonymize this data?

ii) I am also thinking of a way to merge the data we have
(2014) and the new one we might get this year from you (2016) to try to
complete a full set. Generally we use a data range of 10 years for all
indicators and countries. In this case some countries would have data for
2014 and others for 2016. If we find a way to do so, would it be
recommended to have such mixed a set?

Just a few ideas. Again, thank you.

Sincerely,

*Rafael Escalona Reynoso, PhD, MPA. *

leila added a comment. Feb 24 2017, 3:20 PM

> i) would reporting back only the scores not be enough to
> codify the totals for countries with less than 100k? By this I mean would
> the use of the Excel spreadsheet to estimate scores (and reporting back
> only these) not be sufficient to anonymize this data?

Milimetric and I explored this. The issue there is that if someone knows the value of x_max (which may not be very hard to estimate), and if we include all countries, he/she can start adding one unit at a time starting with x_min=1 to get reasonably close to the actual counts per country. For countries with very few editors who contribute to small languages, this can be a problem. Therefore, the real problem we should think about is: what edit counts per country are we reasonably confident that we can share without running into issues?
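
To make the concern concrete, here is a sketch that assumes the spreadsheet applies a plain min-max rescaling to 0-100 (an assumption; the actual spreadsheet formula is not shown in this task). Under that assumption, anyone who knows or guesses x_min and x_max can invert a published score back to a count. Both function names are hypothetical.

```python
def to_score(x, x_min, x_max):
    """Min-max normalize a count to the 0-100 range."""
    return 100 * (x - x_min) / (x_max - x_min)


def from_score(score, x_min, x_max):
    """Invert the normalization, given guesses for x_min and x_max."""
    return x_min + score / 100 * (x_max - x_min)


# Made-up numbers: a small country with 1,200 edits, and a guessed range.
score = to_score(1_200, x_min=1, x_max=2_000_000)
recovered = from_score(score, x_min=1, x_max=2_000_000)
```

With the correct range, `recovered` lands back on the original count up to floating-point error, which is why reporting only the normalized scores does not by itself anonymize low-count countries.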

> ii) I am also thinking of a way to merge the data we have
> (2014) and the new one we might get this year from you (2016) to try to
> complete a full set. Generally we use a data range of 10 years for all
> indicators and countries. In this case some countries would have data for
> 2014 and others for 2016. If we find a way to do so, would it be
> recommended to have such mixed a set?

There are a couple of approaches you can try to deal with missing data in this case:

  • Using 2014's data is one option.
  • The other would be extrapolating based on all the other countries you have edit counts for or the countries that are "similar" to the country with missing data. In this case, you basically have 2014 and 2016 data for some countries, and you can use linear regression or some other technique to estimate the value of the missing edit counts.

The most important thing is to clearly state in your reports how you dealt with missing data, especially if all individual metrics will eventually be used to compute one global index per country.
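
As a rough illustration of the regression option, the sketch below fits a least-squares line from 2014 counts to 2016 counts using the countries where both are available, then applies it to a country that only has a 2014 count. The function names are hypothetical and the figures in the usage lines are made up; a real analysis would need to check whether a linear fit is reasonable at all.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_2016(edits_2014, slope, intercept):
    """Estimate a missing 2016 count from the 2014 count."""
    return slope * edits_2014 + intercept


# Made-up counts (in thousands) for countries with both years available.
known_2014 = [120, 300, 450, 900]
known_2016 = [130, 310, 480, 950]
slope, intercept = fit_line(known_2014, known_2016)
estimate = estimate_2016(200, slope, intercept)
```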

Leila,

Thanks for this reply. Let's work on the initial set of ~71 and perhaps
then explore if the <100k threshold can be lowered slightly to achieve an
increase in coverage.

As for the calculation of a 2014 and 2016 set, I would rather stay away
from estimations beyond scaling. At this point we avoid imputing missing
values throughout the index.

Let me know if this makes sense.

Best,

Rafael.

sure, @Rafaesrey . You know your context better, so please handle the missing data however works best for you.

Dear Leila,

Hope you are doing fine. I write you to follow up on the data collection
process. I also want to let you know that we now have an official launch
date for the GII 2017, June 15, 2017 in Geneva, Switzerland.

That said, I would like to ask two main questions:

  1. When will the data be available? Given the above launch date, time
is now of essence.

  2. Regarding missing data. This year missing data is one of the
issues that we are trying to tackle as much as possible. As I mentioned in
previous conversations, the Wikipedia-based variable used to have full
coverage. Given that fact, I have been thinking of ways to keep this
constant this year.

a. The first, is to explore your suggestion of reducing the threshold
from >100K to >80k and see if the coverage improves. The issue is that we
might not have enough time to explore this possibility;

b. A second option is to set a floor (lowest number) for all other
countries. This would be, say, 100k for all. Then all countries at or below
that mark will receive the same value of uploads and thus same scores and
rankings. This, however will affect our current overall rankings, given the
fact that countries currently standing at the end in this indicator (i.e.
rank below 71) would see an artificial boost of their overall ranking due
to this change;

c. A last option is for me to prepare another Excel with the scaled
values for all countries used in the GII 2014, then your team could mix
these results (meaning they would use the 2014 scaled values for those
countries not part of the newly estimated 71). This would entail that a
full data set would have values for 2014 and 2016. The set would be
normalized (0-100) by your team and the rankings and scores would then be
provided to us.

From the options above, I prefer 2c. However, we need to move fast, given
that we are now beginning to feel the pressure of time and need to close
the GII model in the next three weeks.

Please let me know if you would prefer for me to explain these options over
the phone. As I mentioned earlier, time is now of essence for us.

I appreciate your prompt reply and know that we are very thankful of your
help and that of your team.

Sincerely,

*Rafael Escalona Reynoso, PhD, MPA. *

Nuria moved this task from Wikistats to Radar on the Analytics board. Mar 16 2017, 5:16 PM
leila moved this task from Time Sensitive to In Progress on the Research board. Mar 20 2017, 9:10 PM
leila added a comment. Mar 21 2017, 7:59 PM

> 1. When will the data be available? Given the above launch date, time
> is now of essence.

I want to finish this work in the next 2 days. So let's have the data ready by Thursday end of the day PST.

> 2. Regarding missing data. This year missing data is one of the
> issues that we are trying to tackle as much as possible. As I mentioned in
> previous conversations, the Wikipedia-based variable used to have full
> coverage. Given that fact, I have been thinking of ways to keep this
> constant this year.
>
> a. The first, is to explore your suggestion of reducing the threshold
> from >100K to >80k and see if the coverage improves. The issue is that we
> might not have enough time to explore this possibility;

I checked. Reducing the threshold gives you 4 more countries. I'd rather keep the threshold at 100K.

> b. A second option is to set a floor (lowest number) for all other
> countries. This would be, say, 100k for all. Then all countries at or below
> that mark will receive the same value of uploads and thus same scores and
> rankings. This, however will affect our current overall rankings, given the
> fact that countries currently standing at the end in this indicator (i.e.
> rank below 71) would see an artificial boost of their overall ranking due
> to this change;
>
> c. A last option is for me to prepare another Excel with the scaled
> values for all countries used in the GII 2014, then your team could mix
> these results (meaning they would use the 2014 scaled values for those
> countries not part of the newly estimated 71). This would entail that a
> full data set would have values for 2014 and 2016. The set would be
> normalized (0-100) by your team and the rankings and scores would then be
> provided to us.

Option (c) sounds fine to me. I'm missing why you should send me the scaled scored data from the past though. The data from 2014 is already publicly available: https://stats.wikimedia.org/wikimedia/squids/SquidReportPageEditsPerCountryOverview2014Q4.htm I can use the raw edit count per country from 2014 directly in the spreadsheet you've given earlier.

Let me know if you approve option (c) and me using 2014 raw data for the countries where I won't be able to disclose edits-per-country data, and I'll go ahead and create the sheet for you.
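
The mixed 2014/2016 set in option (c) could be assembled roughly like this. It is a sketch under stated assumptions: the function name and the dictionary inputs are hypothetical, and the 100K cutoff is the one discussed earlier in this task. Each value carries its source year so the mix can be reported clearly.

```python
def merge_counts(counts_2016, counts_2014, threshold=100_000):
    """Use the 2016 count where it meets the cutoff, else fall back to 2014.

    Returns {country: (edit_count, source_year)}. Countries with no
    usable value in either year are left out.
    """
    merged = {}
    for country in sorted(set(counts_2016) | set(counts_2014)):
        count_2016 = counts_2016.get(country)
        if count_2016 is not None and count_2016 >= threshold:
            merged[country] = (count_2016, 2016)
        elif country in counts_2014:
            merged[country] = (counts_2014[country], 2014)
    return merged
```

Keeping the source year next to each value makes it straightforward to state in the report which countries use 2014 data.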

leila updated the task description. Mar 21 2017, 8:02 PM

Dear Leila,

Thank you very much! Yes, please use the data as suggested in c) since it
is the same that I already have. This way we can produce a more complete
set.

I am also attaching our file used for population, in case that you need
this information for scaling pre-2016 data.

I look forward to your next communication and thank you again.

Sincerely,

*Rafael Escalona Reynoso, PhD, MPA. *

Lead Researcher at The Global Innovation Index

Cornell SC Johnson College of Business

389 Statler Hall, Ithaca, NY 14853

Phone: +1 (607) 262-0983 |

Email: *re32@cornell.edu <re32@cornell.edu>*

www.globalinnovationindex.org

Please consider the environment before printing this email or its
attachment(s). Please note that this message may contain confidential
information. If you have received this message in error, please notify me
and then delete it from your system.

*From:* leila [mailto:no-reply@phabricator.wikimedia.org]
*Sent:* Tuesday, March 21, 2017 4:00 PM
*To:* re32@cornell.edu
*Subject:* [Maniphest] [Commented On] T131889: geowiki data for Global
Innovation Index

leila added a comment.

In *T131889#3085445* https://phabricator.wikimedia.org/T131889#3085445,
*@Rafaesrey* https://phabricator.wikimedia.org/p/Rafaesrey/ wrote:

  1. *When will the data be available? Given the above launch date, time is now of essence.*

I want to finish this work in the next 2 days. So let's have the data ready
by Thursday end of the day PST.

*Regarding missing data. This year missing data is one of the issues that

we are trying to tackle as much as possible. As I mentioned in previous
conversations, the Wikipedia-based variable used to have full coverage.
Given that fact, I have been thinking of ways to keep this constant this
year. a. The first, is to explore your suggestion of reducing the threshold
from >100K to >80k and see if the coverage improves. The issue is that we
might not have enough time to explore this possibility;*

I checked. Reducing the threshold gives you 4 more countries. I'd rather
keep the threshold at 100K.

*b. A second option is to set a floor (lowest number) for all other
countries. This would be, say, 100K for all. Then all countries at or
below that mark will receive the same value of uploads and thus the same
scores and rankings. This, however, will affect our current overall
rankings, given the fact that countries currently standing at the end in
this indicator (i.e. rank below 71) would see an artificial boost of their
overall ranking due to this change;*

*c. A last option is for me to prepare another Excel with the scaled
values for all countries used in the GII 2014, then your team could mix
these results (meaning they would use the 2014 scaled values for those
countries not part of the newly estimated 71). This would entail that a
full data set would have values for 2014 and 2016. The set would be
normalized (0-100) by your team and the rankings and scores would then be
provided to us.*

Option (c) sounds fine to me. I'm missing why you should send me the scaled
scored data from the past though. The data from 2014 is already publicly
available:
https://stats.wikimedia.org/wikimedia/squids/SquidReportPageEditsPerCountryOverview2014Q4.htm
I can use the raw edit count per country from 2014 directly in the
spreadsheet you've given earlier.

Let me know if you approve option (c) and me using 2014 raw data for the
countries where I won't be able to disclose edits-per-country data, and I'll
go ahead and create the sheet for you.
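
To make option (c) concrete, here is a rough sketch of the mixing step: use the newer value where a country has one, fall back to the 2014 value otherwise, then rescale the combined set to 0-100. The thread does not say which normalization is used, so min-max scaling is an assumption here, and the function name and numbers are hypothetical:

```python
def fill_and_normalize(current, fallback):
    """Prefer current values, fall back to older ones, then min-max scale to 0-100.

    Min-max scaling is an assumption; the thread only says "normalize (0-100)".
    """
    merged = dict(fallback)   # start from the older (e.g. 2014) values
    merged.update(current)    # newer values win where both exist
    lo, hi = min(merged.values()), max(merged.values())
    if hi == lo:
        # All values identical: no spread to scale, so return zeros.
        return {country: 0.0 for country in merged}
    return {country: 100.0 * (v - lo) / (hi - lo) for country, v in merged.items()}


# Made-up numbers for illustration only; not real edit counts.
newer = {"A": 300_000, "B": 150_000}
older = {"B": 120_000, "C": 50_000, "D": 100_000}

scores = fill_and_normalize(newer, older)
print(scores)
```

Note that mixing two years this way puts values from different periods on one scale, which is the trade-off accepted in option (c).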

*TASK DETAIL*

https://phabricator.wikimedia.org/T131889

*EMAIL PREFERENCES*

https://phabricator.wikimedia.org/settings/panel/emailpreferences/

*To: *leila
*Cc: *ezachte, leila, Rafaesrey, Aklapper, Ijon, csteipp, Milimetric,
QChris, Neil_P._Quinn_WMF, StudiesWorld, Nuria, Avner, JAllemandou, jeremyb

Dear Leila,

Apologies for pestering so much. Do you have an update for the data?

Thanks again for all your help and time.

Best,

Rafael.

*From:* leila [mailto:no-reply@phabricator.wikimedia.org]
*Sent:* Tuesday, March 21, 2017 4:02 PM
*To:* re32@cornell.edu
*Subject:* [Maniphest] [Edited] T131889: geowiki data for Global Innovation
Index

leila edited the task description. (Show Details)
https://phabricator.wikimedia.org/transactions/detail/PHID-XACT-TASK-c3svxf6c6rhrbn3/

*EDIT DETAILS*

...

Analytics pinging users to fill in

What data is used for

...

Lastly, we believe that the GII can be an important vehicle to signal
that Wikipedia is a critical lever to innovation and a factor contributing
to a new understanding of the digital information landscape and innovation
globally.

Code for the data

The code for generating this data is maintained at
http://localhost:8000/user/leila/notebooks/T131889.ipynb . The code is not
readable by others atm, I'm linking it here to increase the bus factor so
someone else can pick this work up if something happens to me. :)

computed data (index, scale) sent to Rafael, plus a description of how the data is generated.

Dear Leila,

If the results for the latest test are correct, I will not require an
additional data extraction. If that is OK with you, I will let you know
later today once I complete and verify these preliminary results.

Best,

Rafael

*From:* leila [mailto:no-reply@phabricator.wikimedia.org]
*Sent:* Thursday, April 13, 2017 4:54 PM
*To:* re32@cornell.edu
*Subject:* [Maniphest] [Commented On] T131889: geowiki data for Global
Innovation Index

leila added a comment.

computed data (index, scale) sent to Rafael, plus a description of how the
data is generated.

Sure. Ping if you need help, Rafael. Otherwise, we will assume this is Done.

@Rafaesrey: for future reference, when you reply here you seem to always include a huge email signature. You might be doing that over email so that makes sense, but it makes the actual phabricator thread harder to read: https://phabricator.wikimedia.org/T131889#3181864. Take a look and see if you can disable the footer on future replies.

Also, you may want to edit the footer out of all these comments because it has your contact information pasted all over, and these threads are public.

ggellerman closed this task as Resolved. Jul 17 2017, 6:34 PM
ggellerman edited projects, added Research-Archive; removed Research.
Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM