
Analyze results of sameAs A/B test
Closed, Resolved, Public



We are hoping to analyze the results of the sameAs A/B test to answer the following questions:

  • Was there a difference in total pageviews between the control and test groups?
  • Was there a difference between the control and test groups in traffic from search engines?
  • Were any temporal patterns observed for either of those two variables?

Acceptance Criteria

  • Produce a mediawiki page with the answers to the above questions that also includes:
  • our main takeaways
  • a summary of the analysis
  • a formal/technical report (if needed)

Related Objects

Resolved Jdlrobson
Resolved Tbayer
Resolved Tbayer
Invalid Tbayer

Event Timeline

ovasileva triaged this task as Normal priority.
ovasileva assigned this task to mpopov. Nov 20 2018, 4:41 PM

Besides determining whether there was a change, I think we should also try to assess its size (and sign ;)

And regarding temporal patterns in pageviews: Most of these are likely going to be due to factors and events that don't have to do with the sameAs feature (e.g. seasonal changes). So I don't know how much relevant information we would be getting from these patterns. If we are specifically interested in how fast the newly added structured data propagated in (e.g.) Google's index, there might be more direct methods to find that out, using Search Console data.

And to record something here from our earlier offline discussions:

One could conceivably try to focus the analysis only on pages that have an associated Wikidata item, as only these will be affected by the sameAs change. But (besides the aspect that we may mostly be interested in the overall change in pageviews for an entire project) this does not seem worth the effort, because it appears that on a typical project, the vast majority of mainspace (non-redirect) pages do have a Wikidata item. I haven't run the actual numbers, but I quickly checked this assumption by hitting "random article" 10 times on a large and a small project (English and Latin WP), both times getting pages with Wikidata item 10 out of 10 times.

(@GoranSMilovanovic , is that something that can be gleaned from WDCM ? E.g. is it true that the top 10 Wikipedias by size all have >90% Wikidata coverage?)
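The 10-for-10 spot check above is suggestive but weak evidence on its own: with n successes in n trials, the exact (Clopper-Pearson) one-sided lower confidence bound on the true proportion is simply alpha**(1/n). A minimal sketch of that calculation (the function name is mine, not from this task):

```python
def coverage_lower_bound(successes: int, trials: int, alpha: float = 0.05) -> float:
    """Exact one-sided lower confidence bound on a proportion when
    every trial succeeded: solve p**n = alpha for p."""
    if successes != trials:
        raise ValueError("shortcut only valid when all trials succeed")
    return alpha ** (1.0 / trials)

# Ten random articles, all with a Wikidata item: at 95% confidence this
# only establishes coverage above ~74%, well short of the >90% question.
print(round(coverage_lower_bound(10, 10), 3))
```

This is why an actual count against the database, as discussed below, is the better check.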

GoranSMilovanovic added a comment (edited). Nov 22 2018, 11:00 AM

@Tbayer Assuming that we are focusing on the main namespace only: part of the answer is found on the Percentage of articles making use of data from Wikidata page; however, the (S)itelink usage aspect (see wbc_entity_usage in the wikibase schema) is not accounted for. I can run a check for you if you want, including all WD item usage aspects.

But I think you really need to look for the answer in the wb_items_per_site table from the Wikibase schema, which is currently not in the scope of the WDCM analyses.

Thanks @GoranSMilovanovic! It is indeed about mainspace pages only, but about those that have an associated Wikidata item (i.e. appear in the sitelinks of said item), rather than making use of its properties.
I started drafting a query myself using wb_items_per_site, but the result for enwiki looks implausibly low. Do you happen to see what might be wrong with the query?

(CC @Niedzielski )

GoranSMilovanovic added a comment (edited). Nov 26 2018, 12:08 PM


(i.e. appear in the sitelinks of said item)

If you want sitelinks data only, see the dashboard; if you need any numbers that are not reported on the dashboard, let me know and I will see what I can do to help. Dashboard documentation:

N.B. I will accept suggestions to incorporate new data tables for download into the existing dashboard.

@Tbayer In general, when you need Wikidata usage datasets:

  • there is the wdcm_clients_wb_entity_usage table in the goransm database in Hadoop;
  • it is a result of an Apache Sqoop operation orchestrated from R (code) which concatenates all wbc_entity_usage tables from clients with client-side WD usage tracking enabled;
  • we have fresh data produced every 1st, 7th, 14th, 20th, and 27th of the month, running the script on my crontab from stat1004.

From beeline:

describe goransm.wdcm_clients_wb_entity_usage;

col_name        data_type       comment
eu_row_id       bigint
eu_entity_id    string
eu_aspect       string
eu_page_id      bigint
wiki_db         string          The wiki_db project

# Partition Information
# col_name      data_type       comment
wiki_db         string          The wiki_db project

Maybe we can just check the wikibase_item in page_props?

@Tbayer what @Niedzielski has just suggested (tested) is probably the most straightforward way to obtain the dataset.

ovasileva raised the priority of this task from Normal to High. Dec 4 2018, 6:35 PM

Thanks @Niedzielski and @GoranSMilovanovic! I ran a query based on that approach (the wikibase_item page property) for a few wikis, more out of curiosity (I guess @mpopov might incorporate a more thorough look at this in his analysis). It confirmed the assumption that the vast majority of Wikipedia articles have a Wikidata item.

SELECT SUM(IF(pp_page IS NOT NULL, 1, 0)) AS articles_with_item,
  SUM(1) AS all_articles
FROM (
  SELECT page_id
  FROM page
  WHERE page_namespace = 0
    AND page_is_redirect = 0
) AS page_titles
LEFT JOIN (
  SELECT pp_page
  FROM page_props
  WHERE pp_propname = 'wikibase_item'
    AND pp_value IS NOT NULL
) AS pp_pages
ON page_titles.page_id = pp_pages.pp_page;

(There might be some rare oddities with this data though, see e.g. T119738.)

ovasileva updated the task description. Dec 5 2018, 11:08 AM
mpopov moved this task from Triage to Backlog on the Product-Analytics board. Dec 6 2018, 9:23 PM

I've identified a few potential issues with the query I've written for the past check-ins, so I'm working on resolving those to make sure the analysis is performed on vetted, correct data. (Gotta love those joins of partitioned tables in Hive.)

I am also doing some research into different models (with pre-test and post-test measurements and treatment & control groups) to correctly infer impact of the sameAs property. For example, two competing models would be:

  • logPost = intercept + β1 * logPre + β2 * treatment + error, where treatment is an indicator variable and exp(β2) is the multiplication factor by which traffic changes due to the treatment
  • multilevel model with random, hierarchical intercepts:
    • log avg. daily search engine-referred traffic ~ N(group intercept + β * treated, σ) -- two observations per group (within wiki): one pre, one post-test
    • group intercept ~ N(wiki intercept, φ) -- two groups per wiki (control & treatment)
    • wiki intercept ~ N(overall intercept, τ) -- 200+ wikis
    • exp(β) is the multiplication factor by which traffic changes due to the treatment
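As a sanity check on the first candidate model, a minimal simulation (illustrative only: the sample size, coefficients, and noise level are made up, not the actual test data) shows that ordinary least squares on logPost ~ logPre + treatment recovers exp(β2) as the traffic multiplier:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000  # hypothetical observations; the real test involved 200+ wikis

# Simulate logPost = intercept + b1*logPre + b2*treatment + error
log_pre = rng.normal(10.0, 1.0, n)
treatment = rng.integers(0, 2, n)   # indicator: 0 = control, 1 = test
true_b2 = np.log(1.014)             # a 1.4% lift on the natural scale
log_post = 0.5 + 0.9 * log_pre + true_b2 * treatment + rng.normal(0, 0.05, n)

# Fit by ordinary least squares
X = np.column_stack([np.ones(n), log_pre, treatment])
beta, *_ = np.linalg.lstsq(X, log_post, rcond=None)

lift = np.exp(beta[2])  # multiplication factor attributed to the treatment
print(f"estimated traffic multiplier: {lift:.3f}")
```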
mpopov added a comment (edited). Mar 7 2019, 4:57 PM

Draft posted at:

Key takeaways:

  • an estimated 1.4% increase in traffic on average (95% CI: 0.7-2.1)
  • more wikis benefited from the feature than not
  • based on our decision plan, we should probably roll out to 100% and roll out to other wikis
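The link between the model's log-scale coefficient and the headline number is just exp(β) - 1. A tiny illustration (the 0.0139 input is my back-calculation from the reported figure, not a value from the report):

```python
import math

def pct_change(beta: float) -> float:
    """Percent change in traffic implied by a log-scale coefficient."""
    return (math.exp(beta) - 1.0) * 100.0

# A coefficient of ~0.0139 on the log scale corresponds to the
# reported ~1.4% average increase in traffic.
print(round(pct_change(0.0139), 1))  # 1.4
```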

Update: final draft posted at

mpopov moved this task from Backlog to Doing on the Product-Analytics board. May 9 2019, 3:14 PM
This comment was removed by mpopov.
kzimmerman closed this task as Resolved. Tue, Jun 4, 6:08 PM
kzimmerman added a subscriber: kzimmerman.

Closing this as done! As Mikhail noted, the final draft was posted at

There was some discussion about expanding the analysis to include other Wikimedia projects (as this was launched on projects besides Wikipedia), but after discussion with Jon and Mikhail I think it makes sense to close this as is. Those other projects make up a tiny % of overall traffic, and their languages are less consistent; logically speaking, we don't expect sameAs to have any negative impact on readers looking for specific knowledge.