
Identify approaches for defining article importance
Closed, Resolved (Public)

Description

The goal is to identify approaches used by the Wikipedia contributor community, Wikimedia developers, and researchers for defining article importance.

There are three main components to this work:

Event Timeline

Weekly update: discussed with my colleagues at UMN two possible directions for this work that would provide some insight into the new article importance metrics:

  • Identify metrics that reasonably proxy each importance factor and measure how well existing approaches to importance capture these new factors. For example, if "political impact" is a new factor, one way to identify articles with political impact is to find articles that are under WikiProject Politics and have page protections in place (assuming that page protection means either that the article is impactful and attracted vandalism, or that it was deemed potentially impactful and was protected in advance of vandalism). Then you could look at how well pageviews or inlinks capture these new importance factors.
  • Identify important measures of bias (taking care to define them), such as gender bias, and look at how the different article importance metrics would contribute to or reduce bias if used in recommender systems.

Weekly update: didn't meet this week

Weekly update:

  • Clustering complete -- presentation to team scheduled for Tuesday to discuss current findings and next steps
  • Working on gathering the data that is currently used for determining article importance, in order to compare lists of "most important" articles via these metrics vs. what shows up in vital articles lists:
    • Generated link data for enwiki for computing pagerank
    • Gathering pageview data from month of July for enwiki

Weekly update:

In case anyone is interested, these are the top 50 articles by pagerank. Clearly, identifiers that are linked to by lots of templates dominate the top of the list, along with countries. While some may argue that template links should be filtered out to remove this noise, in practice I don't think it will affect our research: we're more interested in questions like the ratio of men's biographies to women's biographies, so the identifiers just create noise but no bias. See http://ceur-ws.org/Vol-1586/know2.pdf for more details on how different parameterizations affect the end result.

page_id       pagerank                                       page_title
14919     18539.217549               International_Standard_Book_Number
48361     12773.951173                     Geographic_coordinate_system
25175562   9884.802812             Virtual_International_Authority_File
1339871    9871.535021                                         WorldCat
1491462    8700.221976               Library_of_Congress_Control_Number
422994     7746.997877                        Digital_object_identifier
18852926   6679.594561           International_Standard_Name_Identifier
23538754   6517.643916                                  Wayback_Machine
35566739   6381.356316                        Integrated_Authority_File
3434750    5876.866751                                    United_States
234930     4634.422376             International_Standard_Serial_Number
199503     4369.373111                 Bibliothèque_nationale_de_France
35412202   4272.300880                                         Wikidata
33551951   3746.292572           Système_universitaire_de_documentation
30890      3558.718332                                        Time_zone
2855554    3487.339865                                             IMDb
30463      3402.739290                               Taxonomy_(biology)
2059097    3300.022037                 Royal_Library_of_the_Netherlands
20012045   3107.963084           National_Library_of_the_Czech_Republic
503009     2923.236514                                           PubMed
31717      2889.001945                                   United_Kingdom
2987862    2873.072664         Global_Biodiversity_Information_Facility
883885     2721.090562                                             OCLC
10568      2716.385663                             Association_football
20797      2646.225196                                      MusicBrainz
11039790   2629.919376                                           Animal
5843419    2453.356574                                           France
39736      2367.731758                            Binomial_nomenclature
59339836   2326.238994  Interim_Register_of_Marine_and_Nonmarine_Genera
14533      2309.633695                                            India
11867      2309.388769                                          Germany
373399     2261.990119                            National_Diet_Library
1117352    2254.126099                                            ISO_4
68253      2243.586521                         List_of_sovereign_states
53990144   2176.279523                                             SNAC
32927      2169.858057                                     World_War_II
30680      2092.253109                               The_New_York_Times
11143173   2083.445541                             Encyclopedia_of_Life
47548      2022.681817                             Daylight_saving_time
5042916    2021.058163                                           Canada
3198808    1980.219512                    Biblioteca_Nacional_de_España
4689264    1937.603723                                        Australia
216262     1894.087509                                            JSTOR
14532      1861.629344                                            Italy
2775325    1820.163992                                   PubMed_Central
25391      1789.633555                                           Russia
40178883   1784.755362                                      INaturalist
19827221   1782.528917                                        Arthropod
149306     1756.216763    National_Center_for_Biotechnology_Information
998826     1741.334612                                      Wikispecies
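
For context, scores like those in the table can be produced by power iteration over the link graph. The following is only a minimal sketch, not the code used for this task: the toy graph, the damping factor of 0.85, the iteration count, and the handling of pages without outlinks are all assumptions, and the scores here are normalized to sum to 1, whereas the scale of the scores in the table is not stated.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration pagerank.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict of page -> score; scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Assumption: rank held by pages without outlinks is spread evenly.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n for page in pages
        }
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: an identifier page linked from every article
# (as happens via templates) and a country page linked from some of them.
toy_links = {
    "Article_A": ["Identifier", "Country"],
    "Article_B": ["Identifier", "Country"],
    "Article_C": ["Identifier"],
    "Country": ["Identifier"],
    "Identifier": [],
}
scores = pagerank(toy_links)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[:2])
```

In this toy graph the identifier page ranks first and the country page second, which mirrors the pattern in the list above where template-linked identifiers and countries dominate.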

Weekly update:

  • Generated edit history statistics (revision count by type of editor and unique editors by type of editor where editor types are: user, bot, anonymous) for every article in enwiki as another proxy for importance but defined by editor interest. See query below. It's a very quick query (and something that years ago I would have spent a week running parallel enwiki dump history scripts to count)
  • Switched from ORES topics to Wikidata for measuring gender distribution of article lists and got much higher recall (query to make this simple courtesy of DS): https://github.com/geohci/miscellaneous-wikimedia/blob/master/wikidata-properties-spark/wikidata_gender_information.ipynb
  • Began comparing top-importance article lists based on pagerank, pageviews, and vital articles based on how well they represent each WikiProject. Early results suggest that these three lists tend to agree re: geography, history, and religion WikiProjects but that much bigger differences appear around arts/culture, sports, and equity-oriented WikiProjects. While there's no clear baseline for prevalence of a WikiProject in a top-importance list, we're also looking at gender of biographies in the list and geographic distribution, which have more agreed-upon baselines.
-- Per-article edit history statistics for enwiki
SELECT page_id,
       MAX(page_revision_count) AS num_revisions,
       -- revisions whose editor has a non-empty event_user_is_bot_by array
       SUM(IF(size(event_user_is_bot_by)>0,1,0)) AS num_bot_revisions,
       SUM(IF(event_user_is_anonymous, 1, 0)) AS num_anon_revisions,
       -- distinct editor names vs. distinct user IDs vs. distinct bot user IDs
       COUNT(DISTINCT(event_user_text)) AS num_unique_editors,
       COUNT(DISTINCT(event_user_id)) AS num_unique_users,
       COUNT(DISTINCT(IF(size(event_user_is_bot_by)>0, event_user_id, NULL))) AS num_unique_bots
  FROM wmf.mediawiki_history
 WHERE wiki_db = 'enwiki'
       AND snapshot = '2020-07'
       -- article namespace only
       AND page_namespace = 0
       -- one row per revision creation
       AND event_type = 'create'
       AND event_entity = 'revision'
       AND NOT page_is_redirect
       AND NOT page_is_deleted
 GROUP BY page_id;
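
The list comparison described in the third bullet above (how well each top-importance list represents each WikiProject) could be sketched roughly as follows. This is an illustration under assumptions, not the analysis code: the function names, the placeholder page titles, and the page -> WikiProject mapping are all hypothetical.

```python
from collections import Counter


def wikiproject_share(top_articles, article_projects):
    """Fraction of a top-importance list that is tagged by each WikiProject."""
    counts = Counter()
    for title in top_articles:
        for project in article_projects.get(title, ()):
            counts[project] += 1
    total = len(top_articles)
    return {project: n / total for project, n in counts.items()}


def share_gaps(list_a, list_b, article_projects):
    """Per-WikiProject difference in share between two lists (a minus b)."""
    share_a = wikiproject_share(list_a, article_projects)
    share_b = wikiproject_share(list_b, article_projects)
    projects = sorted(set(share_a) | set(share_b))
    return {p: share_a.get(p, 0.0) - share_b.get(p, 0.0) for p in projects}


# Hypothetical data: placeholder pages and WikiProject tags.
article_projects = {
    "P1": ["Geography"],
    "P2": ["History"],
    "P3": ["Geography"],
    "P4": ["Religion"],
    "P5": ["Sports"],
    "P6": ["Sports", "Arts"],
}
top_by_pagerank = ["P1", "P2", "P3", "P4"]
top_by_pageviews = ["P1", "P5", "P6", "P4"]

gaps = share_gaps(top_by_pagerank, top_by_pageviews, article_projects)
for project, gap in gaps.items():
    print(f"{project:10s} {gap:+.2f}")
```

A gap near zero means the two lists represent that WikiProject about equally; large positive or negative gaps flag the WikiProjects where the importance metrics disagree.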

Weekly update:

  • Generated article -> county and congressional district mappings for enwiki in case we choose to do any deeper analyses around importance vs. socioeconomic status (focusing on the US allows us to analyze aspects that are very difficult to do globally such as race)

Weekly update:

  • Paused the qualitative analysis while collaborators work toward the CHI deadline. That work has already identified importance components and is now assessing how different approaches to defining article importance capture these components, with an emphasis on the equity component.
  • Generated first dataset of page -> article importance assessments but need to do some cleaning as importance assessments are not fully standardized (e.g., both low and Low appear in the dataset as well as infrequent tags like Related or Bottom) so a mapping needs to be generated to aggregate these into more consistent ordinal categories. For example, in a sample of ~2M articles:
assessment  articles
top              149
Bottom           225
Related          388
high             585
mid             1881
low             3064
Top            19973
High           23595
NA             30793
Mid           227261
(blank)       227310
Unknown       677603
Low           699005

EDIT: updated with more levels and to indicate that I had split out High and Top

Weekly update:

  • Re-ran WikiProject importance gathering -- had been gathering just the "top" assessment but I now gather all the assessments associated with an article
  • Some data below -- I dropped some assessments and grouped the rest into Low, Mid, High, and Top. Low importance articles are by far the most common (as expected).
  • Also set up office hours in the next week or two to hear from Suggested Edits / Newcomer Tasks about how they recommend articles and what data exists to measure how users interact with the recommendations
Standardization:
Dropped: Unknown, NA, na
Low: Low, low, Bottom, Related
Mid: Mid, mid
High: High, high
Top: Top, top

Counts per category:
Low: 3,687,536 assessments
Mid:   751,592 assessments
High:  174,025 assessments
Top:    40,411 assessments
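
The standardization above amounts to a lookup table. A minimal sketch follows, assuming raw labels arrive as plain strings; the names `DROPPED`, `STANDARD_LEVEL`, and `standardize` are hypothetical, and treating any label not listed above (including the blank label) as dropped is an assumption rather than something stated here.

```python
from collections import Counter

# Raw labels that are discarded entirely.
DROPPED = {"Unknown", "NA", "na"}

# Raw label -> standardized ordinal level, as listed above.
STANDARD_LEVEL = {
    "Low": "Low", "low": "Low", "Bottom": "Low", "Related": "Low",
    "Mid": "Mid", "mid": "Mid",
    "High": "High", "high": "High",
    "Top": "Top", "top": "Top",
}


def standardize(raw_label):
    """Return the standardized level, or None if the label is dropped.

    Assumption: labels missing from the mapping are also dropped.
    """
    if raw_label in DROPPED:
        return None
    return STANDARD_LEVEL.get(raw_label)


# Hypothetical raw assessments for illustration.
raw = ["Low", "low", "Bottom", "Mid", "Top", "NA", "Unknown", "Related", ""]
levels = [standardize(label) for label in raw]
counts = Counter(level for level in levels if level is not None)
print(counts)
```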

Because an article can have multiple assessments and I'm curious about how "universal" an article's importance assessment is, I then looked at how consistent the ratings are for a given article. Data below:

  • 29.2% of articles (1,722,021) had no assessment
  • 63.2% of articles (3,721,599) had a single assessment
  • 7.6% of articles had multiple assessments. Of those:
    • 0.1% of articles (5,274) were consistently assessed at the same level
    • 6.1% of articles (358,234) were assessed at multiple levels but they were adjacent -- i.e. Low + Mid or Mid + High or High + Top
    • 1.2% of articles (71,653) were assessed at multiple levels two steps apart -- i.e. Low + High or Mid + Top
    • 0.2% of articles (12,038) were assessed across the full range -- i.e. assessed as both Low AND Top
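
The categories above follow from the spread between the lowest and highest standardized level an article received. A minimal sketch, assuming each article comes with a list of already-standardized levels; `LEVEL_RANK` and `consistency` are hypothetical names used only for illustration.

```python
# Ordinal position of each standardized importance level.
LEVEL_RANK = {"Low": 0, "Mid": 1, "High": 2, "Top": 3}


def consistency(levels):
    """Classify how consistent an article's importance assessments are."""
    if not levels:
        return "no assessment"
    if len(levels) == 1:
        return "single"
    ranks = [LEVEL_RANK[level] for level in levels]
    spread = max(ranks) - min(ranks)
    # Spread 0: all agree; 1: adjacent levels; 2: two steps apart; 3: Low and Top.
    return ("agreed", "adjacent", "two steps", "full")[spread]


print(consistency([]))                      # no assessment
print(consistency(["Mid"]))                 # single
print(consistency(["High", "High"]))        # agreed
print(consistency(["Low", "Mid", "Low"]))   # adjacent
print(consistency(["Mid", "Top"]))          # two steps
print(consistency(["Low", "High", "Top"]))  # full
```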

And finally, I looked more deeply into what types of articles were rated as both low importance and top importance -- i.e. where do we see the most ambiguity / importance of context when it comes to importance. It seems to be articles around history/society. Some of this is explained by these articles just having more WikiProjects rating them, but topics like Physics/Mathematics/Video Games have a much higher rate of agreement between ratings when they are assessed by multiple WikiProjects:

                                                   n  no assess.    single  mult assess.    agreed  adjacent  two steps      full
History and Society.Society                    64148    0.185758  0.565302      0.248940  0.000452  0.170434   0.058724  0.019330
Culture.Philosophy and religion               158108    0.095460  0.696429      0.208111  0.000829  0.150935   0.042578  0.013769
Geography.Regions.Africa.Central Africa        11458    0.266539  0.533165      0.200297  0.000175  0.155263   0.034561  0.010298
History and Society.History                   173010    0.170921  0.629796      0.199283  0.000329  0.148043   0.040298  0.010612
STEM.Medicine & Health                         76955    0.107738  0.698460      0.193802  0.000195  0.148554   0.038022  0.007030
Culture.Media.Software                         23391    0.278098  0.530033      0.191869  0.001069  0.141422   0.043265  0.006113
Geography.Regions.Africa.Eastern Africa        35776    0.266547  0.545114      0.188339       NaN  0.136432   0.041564  0.010342
Culture.Visual arts.Architecture              167657    0.076853  0.735901      0.187245  0.004754  0.148446   0.030556  0.003489
STEM.Physics                                   19164    0.013671  0.802233      0.184095  0.015028  0.126905   0.035796  0.006366
STEM.Mathematics                               23688    0.067671  0.748860      0.183468  0.078141  0.077170   0.023092  0.005066
STEM.Libraries & Information                    9965    0.080983  0.738083      0.180933  0.000803  0.138083   0.033718  0.008329
Culture.Visual arts.Visual arts*              280611    0.158746  0.694171      0.147083  0.007387  0.112248   0.024090  0.003357
History and Society.Education                  98776    0.151282  0.701982      0.146736  0.001002  0.107395   0.032467  0.005872
Geography.Regions.Africa.Africa*              147027    0.278507  0.575037      0.146456  0.000109  0.111102   0.027886  0.007359
Geography.Regions.Africa.Southern Africa       25755    0.226286  0.627412      0.146302  0.000078  0.123199   0.018210  0.004815
STEM.Earth and environment                     80403    0.077049  0.780630      0.142321  0.000211  0.107222   0.029912  0.004975
Geography.Geographical                        313358    0.073446  0.785226      0.141327  0.000109  0.116123   0.021123  0.003973
Geography.Regions.Africa.Northern Africa       30895    0.210714  0.650688      0.138598  0.000356  0.103447   0.026800  0.007995
Culture.Literature                            203361    0.183934  0.678193      0.137873  0.006402  0.107449   0.019812  0.004209
Geography.Regions.Asia.Central Asia            11776    0.115489  0.746858      0.137653       NaN  0.096637   0.033033  0.007982
Geography.Regions.Asia.South Asia             241528    0.087133  0.778270      0.134597  0.000248  0.106712   0.024001  0.003635
Culture.Visual arts.Comics and Anime           37093    0.018818  0.847222      0.133961  0.034050  0.081093   0.015879  0.002939
Culture.Biography.Women                       263244    0.211568  0.660893      0.127539  0.000874  0.107334   0.016578  0.002754
Culture.Visual arts.Fashion                    12653    0.257251  0.616850      0.125899  0.000079  0.096183   0.025844  0.003794
Geography.Regions.Asia.North Asia              85780    0.059827  0.818163      0.122010  0.000455  0.096258   0.021042  0.004255
Culture.Media.Books                            60291    0.226137  0.656682      0.117182  0.001028  0.097295   0.016553  0.002305
Geography.Regions.Oceania                     241103    0.040215  0.842760      0.117025  0.000174  0.099119   0.015782  0.001949
Geography.Regions.Europe.Northern Europe      429847    0.136248  0.748145      0.115606  0.000384  0.095529   0.016859  0.002834
Geography.Regions.Africa.Western Africa        36403    0.358899  0.526495      0.114606  0.000027  0.082933   0.024861  0.006785
STEM.Space                                     33298    0.043997  0.843474      0.112529  0.003814  0.083458   0.021293  0.003964
History and Society.Military and warfare      213470    0.365293  0.525994      0.108713  0.000309  0.082316   0.021802  0.004286
Culture.Media.Television                      116103    0.242500  0.649303      0.108197  0.003988  0.078473   0.022015  0.003721
Geography.Regions.Asia.Asia*                  785879    0.136870  0.761822      0.101308  0.000328  0.081384   0.016840  0.002756
Culture.Internet culture                       48879    0.055648  0.843859      0.100493  0.007181  0.069969   0.020029  0.003314
STEM.Chemistry                                 29487    0.131109  0.769661      0.099230  0.000441  0.074677   0.020958  0.003154
Geography.Regions.Asia.Southeast Asia          92044    0.140009  0.762385      0.097605  0.000261  0.080364   0.014776  0.002205
Geography.Regions.Americas.South America      102947    0.142996  0.759478      0.097526  0.000204  0.079439   0.015260  0.002623
Geography.Regions.Asia.East Asia              180427    0.146369  0.757276      0.096355  0.000571  0.078985   0.014787  0.002012
Culture.Performing arts                        41368    0.294068  0.610689      0.095243  0.000266  0.073414   0.017356  0.004206
Geography.Regions.Europe.Europe*             1197367    0.174059  0.731506      0.094435  0.000408  0.078784   0.012933  0.002310
STEM.Engineering                               84612    0.410261  0.496797      0.092942  0.000461  0.073595   0.016109  0.002777
Geography.Regions.Europe.Eastern Europe       267187    0.134206  0.775464      0.090330  0.000389  0.077137   0.010816  0.001987
Culture.Sports                                933564    0.181110  0.729999      0.088891  0.001055  0.074605   0.011495  0.001735
History and Society.Transportation            223236    0.315558  0.595921      0.088521  0.000287  0.074316   0.012435  0.001483
Geography.Regions.Europe.Western Europe       304552    0.209015  0.706014      0.084971  0.000581  0.071200   0.011253  0.001937
Geography.Regions.Europe.Southern Europe      209630    0.242160  0.673978      0.083862  0.000234  0.066641   0.013748  0.003239
Culture.Biography.Biography*                 1838864    0.339359  0.579076      0.081565  0.001282  0.065869   0.012143  0.002270
STEM.STEM*                                    880243    0.107787  0.811999      0.080214  0.002273  0.061126   0.014377  0.002438
All Articles                                 5890819    0.292323  0.631763      0.075915  0.000895  0.060812   0.012164  0.002044
Culture.Media.Video games                      37133    0.004632  0.927288      0.068080  0.009183  0.045539   0.011607  0.001750
Culture.Media.Media*                          924129    0.366187  0.567583      0.066230  0.001308  0.050880   0.011902  0.002139
Geography.Regions.Asia.West Asia              182848    0.224060  0.710175      0.065765  0.000175  0.051261   0.011583  0.002745
Culture.Media.Radio                            29169    0.219685  0.726696      0.053619  0.000137  0.041962   0.009634  0.001886
Culture.Media.Music                           389662    0.431751  0.518726      0.049522  0.000252  0.038626   0.008790  0.001855
Culture.Media.Films                           242294    0.429751  0.523731      0.046518  0.001284  0.035527   0.008052  0.001655
Culture.Linguistics                            98242    0.745934  0.214043      0.040024  0.000234  0.028002   0.010047  0.001741
STEM.Biology                                  462863    0.013555  0.946660      0.039785  0.000065  0.032489   0.006190  0.001041

Weekly update:

  • Some updates to the data in the previous comment
  • Talked with SN from Product Analytics and DB from Mobile Apps to understand how Suggested Edits works. Data from this schema and edits tagged with the #suggestededit tag on Wikidata/Commons will give me insight into what is edited via the tool. I was pointed to this repo and this code in particular for how recommendations on Android work. Will go through the code to figure out how articles are selected for recommendation. Additionally, there are some Superset dashboards with descriptive statistics about usage (this and this)
  • I scheduled a meeting to talk with MW from Product Analytics about recommendation data from Newcomer tasks on Growth next week.
  • In general, I have some documentation to do next week to summarize what has been found for each aspect of this project, but I have made sufficient progress on each aspect to feel like I have a strong grounding for moving to the next stage of this project. I'll close this out when I have the documentation complete and link to it.

Weekly update:

  • I'm going to close out this task as I see it as complete. I'll work with my collaborators to continue to expand the documentation on Meta.
  • Work is now moving on to quantitative analyses such as the analysis of Suggested Edits I am documenting here (I'll open a new task for this)