
Analyze results of sameAs A/B test
Closed, Resolved (Public)

Description

Background

We are hoping to analyze the results of the sameAs A/B test to answer the following questions:

  • Was there a difference in total pageviews between the control and test groups?
  • Was there a difference between the control and test groups in traffic from search engines?
  • Were any temporal patterns observed for either of those two variables?

Acceptance Criteria

  • Produce a MediaWiki page with the answers to the above questions that also includes:
    • our main takeaways
    • a summary of the analysis
    • a formal/technical report (if needed)


Event Timeline

Besides determining whether there was a change, I think we should also try to assess its size (and sign ;))

And regarding temporal patterns in pageviews: most of these are likely to be due to factors and events unrelated to the sameAs feature (e.g. seasonal changes), so I don't know how much relevant information we would get from these patterns. If we are specifically interested in how fast the newly added structured data propagated into (e.g.) Google's index, there might be more direct methods to find that out, using Search Console data.

And to record something here from our earlier offline discussions:

One could conceivably try to focus the analysis only on pages that have an associated Wikidata item, as only these are affected by the sameAs change. But (beyond the fact that we are mostly interested in the overall change in pageviews for an entire project) this does not seem worth the effort, because on a typical project the vast majority of mainspace (non-redirect) pages appear to have a Wikidata item. I haven't run the actual numbers, but I quickly checked this assumption by hitting "random article" 10 times each on a large and a small project (English and Latin Wikipedia), and got a page with a Wikidata item 10 out of 10 times on both.

(@GoranSMilovanovic, is that something that can be gleaned from WDCM? E.g. is it true that the top 10 Wikipedias by size all have >90% Wikidata coverage?)

@Tbayer Assuming that we are focusing on the main namespace only: part of the answer is found on the Percentage of articles making use of data from Wikidata page; however, the (S)itelink usage aspect (see wbc_entity_usage in the Wikibase schema) is not accounted for. I can run a check for you if you want, including all WD item usage aspects.

But I think you really need to look for the answer in the wb_items_per_site table from the Wikibase schema, which is currently not in the scope of the WDCM analyses.

Thanks @GoranSMilovanovic! It is indeed about mainspace pages only, but about those that have an associated Wikidata item (i.e. appear in the sitelinks of said item), rather than making use of its properties.
I started drafting a query myself using wb_items_per_site, but the result for enwiki looks implausibly low: https://quarry.wmflabs.org/query/31482 Do you happen to see what might be wrong with the query?
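For reference, a minimal sketch of the kind of count involved (an illustration, not the linked Quarry query itself; the ips_* column names are the standard Wikibase schema, and this would run against the wikidatawiki replica):

-- Count enwiki pages that appear as sitelinks on Wikidata items
SELECT COUNT(*) AS enwiki_sitelinks
FROM wb_items_per_site
WHERE ips_site_id = 'enwiki';

Restricting that to mainspace, non-redirect articles would additionally require matching ips_site_page against the wiki's own page table, a cross-database join that is easy to get wrong.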

(CC @Niedzielski)

@Tbayer

(i.e. appear in the sitelinks of said item)

If you want sitelinks data only: http://wdcm.wmflabs.org/WDCM_SitelinksDashboard/; if you need any numbers that are not reported on the dashboard, let me know and I will see what I can do to help. Dashboard documentation: https://wikitech.wikimedia.org/wiki/Wikidata_Concepts_Monitor#WDCM_(S)itelinks_Dashboard

N.B. I will accept suggestions to incorporate new data tables for download into the existing dashboard.

@Tbayer In general, when you need Wikidata usage datasets:

  • there is the wdcm_clients_wb_entity_usage table in the goransm database in Hadoop;
  • it is a result of an Apache Sqoop operation orchestrated from R (code) which concatenates all wbc_entity_usage tables from clients with client-side WD usage tracking enabled;
  • we have fresh data produced on the 1st, 7th, 14th, 20th, and 27th of every month, via a script on my crontab on stat1004.

From beeline:

describe goransm.wdcm_clients_wb_entity_usage;

col_name        data_type       comment
eu_row_id       bigint
eu_entity_id    string
eu_aspect       string
eu_page_id      bigint
wiki_db         string          The wiki_db project

# Partition Information
# col_name      data_type       comment
wiki_db         string          The wiki_db project
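For illustration, a minimal sketch of a query against this table, assuming the standard wbc_entity_usage aspect codes (e.g. 'S' for sitelink usage):

-- Pages per wiki that use a Wikidata item via the sitelink ("S") aspect
SELECT
  wiki_db,
  COUNT(DISTINCT eu_page_id) AS pages_with_sitelink_usage
FROM goransm.wdcm_clients_wb_entity_usage
WHERE eu_aspect = 'S'
GROUP BY wiki_db;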

@Tbayer what @Niedzielski has just suggested (tested) is probably the most straightforward way to obtain the dataset.

ovasileva raised the priority of this task from Medium to High. Dec 4 2018, 6:35 PM

Thanks @Niedzielski and @GoranSMilovanovic! I ran a query based on that approach (the wikibase_item page property) for a few wikis, more out of curiosity (I guess @mpopov might incorporate a more thorough look at this in his analysis). It confirmed the assumption that the vast majority of Wikipedia articles have a Wikidata item.

wiki          articles_with_item  all_articles  Wikidata_ratio
enwiki        5736035             5761481       0.995583
frwiki        2057046             2061655       0.997764
dewiki        2239557             2244880       0.997629
zhwiki        1029501             1033831       0.995812
jawiki        1125250             1130528       0.995331
lawiki        129487              129587        0.999228
barwiki       25899               27175         0.953045
enwikivoyage  28598               28783         0.993573
commonswiki   92410               115614        0.799298
-- Run per wiki: counts mainspace, non-redirect pages and how many of
-- them have an associated Wikidata item (wikibase_item page property).
SELECT SUM(IF(pp_page IS NOT NULL, 1, 0)) AS articles_with_item,
  SUM(1) AS all_articles
FROM (
  -- all mainspace, non-redirect pages
  SELECT
    page_title,
    page_id
  FROM
    page
  WHERE
    page_namespace = 0
  AND
    page_is_redirect = 0
) AS page_titles
LEFT JOIN (
  -- pages that carry a wikibase_item page property
  SELECT
    pp_page
  FROM
    page_props
  WHERE
    pp_propname = 'wikibase_item'
  AND
    pp_value IS NOT NULL
) AS pp_pages
ON
  page_titles.page_id = pp_pages.pp_page;

(There might be some rare oddities with this data though, see e.g. https://quarry.wmflabs.org/query/31750 or T119738 .)

I've identified a few potential issues with the query I've written for the past check-ins, so I'm working on resolving them to make sure the analysis is performed on vetted, correct data. (Gotta love those joins of partitioned tables in Hive.)

I am also doing some research into different models (with pre-test and post-test measurements and treatment & control groups) to correctly infer the impact of the sameAs property. For example, two competing models (written out in full after the list) would be:

  • logPost = intercept + β1 * logPre + β2 * treatment + error, where treatment is an indicator variable and exp(β2) is the multiplication factor by which traffic changes due to the treatment
  • multilevel model with random, hierarchical intercepts:
    • log avg. daily search engine-referred traffic ~ N(group intercept + β * treated, σ) -- two observations per group (within wiki): one pre, one post-test
    • group intercept ~ N(wiki intercept, φ) -- two groups per wiki (control & treatment)
    • wiki intercept ~ N(overall intercept, τ) -- 200+ wikis
    • exp(β) is the multiplication factor by which traffic changes due to the treatment
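Written out (a restatement of the bullets above; y is log average daily search engine-referred traffic, T the treatment indicator, and the g/w indexing of groups and wikis is my own labeling):

\log y_i^{\text{post}} = \alpha + \beta_1 \log y_i^{\text{pre}} + \beta_2 T_i + \varepsilon_i

\begin{aligned}
y_{g,t} &\sim \mathcal{N}(\alpha_g + \beta\, T_{g,t},\ \sigma^2) && t \in \{\text{pre}, \text{post}\} \\
\alpha_g &\sim \mathcal{N}(\alpha_{w[g]},\ \varphi^2) && \text{two groups (control, treatment) per wiki} \\
\alpha_w &\sim \mathcal{N}(\mu,\ \tau^2) && \text{200+ wikis}
\end{aligned}

In both cases exp(β2), resp. exp(β), is the multiplicative change in traffic attributable to the treatment.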

Draft posted at: https://www.mediawiki.org/wiki/User:MPopov_(WMF)/SEO/sameAs_test

Key takeaways:

  • an estimated 1.4% increase in traffic on average (95% CI: 0.7%-2.1%)
  • more wikis benefited from the feature than not
  • based on our decision plan, we should probably roll out to 100% and roll out to other wikis

Update: final draft posted at https://www.mediawiki.org/wiki/Reading/Search_Engine_Optimization/sameAs_test


Closing this as done! As Mikhail noted, the final draft was posted at https://www.mediawiki.org/wiki/Reading/Search_Engine_Optimization/sameAs_test

There was some discussion about expanding the analysis to include other Wikimedia projects (since the feature was launched on projects besides Wikipedia), but after discussion with Jon and Mikhail I think it makes sense to close this as is. Those other projects make up a tiny percentage of overall traffic and their languages are less consistent; logically speaking, we don't expect sameAs to have any negative impact on readers looking for specific knowledge.