
Analyze results of sameAs A/B test
Closed, Resolved (Public)

Description

Background

We are hoping to analyze the results of the sameAs A/B test to answer the following questions:

  • Was there a difference in total pageviews between the control and test groups?
  • Was there a difference between the control and test groups in traffic from search engines?
  • Were any temporal patterns observed for either of those two variables?

Acceptance Criteria

  • Produce a MediaWiki page with the answers to the above questions that also includes:
    • our main takeaways
    • a summary of the analysis
    • a formal/technical report (if needed)


Event Timeline

Besides determining whether there was a change, I think we should also try to assess its size (and sign ;))

And regarding temporal patterns in pageviews: most of these are likely to be due to factors and events unrelated to the sameAs feature (e.g. seasonal changes), so I don't know how much relevant information we would get from these patterns. If we are specifically interested in how fast the newly added structured data propagated into (e.g.) Google's index, there might be more direct methods to find that out, using Search Console data.

And to record something here from our earlier offline discussions:

One could conceivably try to focus the analysis only on pages that have an associated Wikidata item, as only these are affected by the sameAs change. But (beyond the fact that we are mostly interested in the overall change in pageviews for an entire project) this does not seem worth the effort, because on a typical project the vast majority of mainspace (non-redirect) pages appear to have a Wikidata item. I haven't run the actual numbers, but I quickly checked this assumption by hitting "random article" 10 times each on a large and a small project (English and Latin Wikipedia), and got a page with a Wikidata item 10 out of 10 times on both.

(@GoranSMilovanovic, is that something that can be gleaned from WDCM? E.g. is it true that the top 10 Wikipedias by size all have >90% Wikidata coverage?)

@Tbayer Assuming that we are focusing on the main namespace only: part of the answer is found on the Percentage of articles making use of data from Wikidata page; however, the (S)itelink usage aspect (see wbc_entity_usage in the Wikibase schema) is not accounted for. I can run a check for you if you want, including all WD item usage aspects.

But I think you really need to look for the answer in the wb_items_per_site table from the Wikibase schema, which is currently not in the scope of the WDCM analyses.

Thanks @GoranSMilovanovic! It is indeed about mainspace pages only, but about those that have an associated Wikidata item (i.e. appear in the sitelinks of said item), rather than making use of its properties.
I started drafting a query myself using wb_items_per_site, but the result for enwiki looks implausibly low: https://quarry.wmflabs.org/query/31482 Do you happen to see what might be wrong with the query?
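For reference, a minimal sketch of the kind of count involved (an illustration, not the linked Quarry query itself; the ips_* column names are the standard Wikibase schema, and this would run against the wikidatawiki replica):

-- Count enwiki pages that appear as sitelinks on Wikidata items
SELECT COUNT(*) AS enwiki_sitelinks
FROM wb_items_per_site
WHERE ips_site_id = 'enwiki';

Restricting that to mainspace, non-redirect articles would additionally require matching ips_site_page against the wiki's own page table, a cross-database join that is easy to get wrong.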

(CC @Niedzielski)

@Tbayer

(i.e. appear in the sitelinks of said item)

If you want sitelinks data only: http://wdcm.wmflabs.org/WDCM_SitelinksDashboard/; if you need any numbers that are not reported on the dashboard, let me know and I will see what I can do to help. Dashboard documentation: https://wikitech.wikimedia.org/wiki/Wikidata_Concepts_Monitor#WDCM_(S)itelinks_Dashboard

N.B. I will accept suggestions to incorporate new data tables for download into the existing dashboard.

@Tbayer In general, when you need Wikidata usage datasets:

  • there is the wdcm_clients_wb_entity_usage table in the goransm database in Hadoop;
  • it is a result of an Apache Sqoop operation orchestrated from R (code) which concatenates all wbc_entity_usage tables from clients with client-side WD usage tracking enabled;
  • we have fresh data produced on the 1st, 7th, 14th, 20th, and 27th of every month, via a script on my crontab on stat1004.

From beeline:

describe goransm.wdcm_clients_wb_entity_usage;

col_name        data_type       comment
eu_row_id       bigint
eu_entity_id    string
eu_aspect       string
eu_page_id      bigint
wiki_db         string          The wiki_db project

# Partition Information
# col_name      data_type       comment
wiki_db         string          The wiki_db project
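For illustration, a minimal sketch of a query against this table, assuming the standard wbc_entity_usage aspect codes (e.g. 'S' for sitelink usage):

-- Pages per wiki that use a Wikidata item via the sitelink ("S") aspect
SELECT
  wiki_db,
  COUNT(DISTINCT eu_page_id) AS pages_with_sitelink_usage
FROM goransm.wdcm_clients_wb_entity_usage
WHERE eu_aspect = 'S'
GROUP BY wiki_db;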

@Tbayer what @Niedzielski has just suggested (tested) is probably the most straightforward way to obtain the dataset.

ovasileva raised the priority of this task from Medium to High. Dec 4 2018, 6:35 PM

Thanks @Niedzielski and @GoranSMilovanovic! I ran a query based on that approach (the wikibase_item page property) for a few wikis, more out of curiosity (I guess @mpopov might incorporate a more thorough look at this in his analysis). It confirmed the assumption that the vast majority of Wikipedia articles have a Wikidata item.

wiki          articles_with_item  all_articles  Wikidata_ratio
enwiki        5736035             5761481       0.995583
frwiki        2057046             2061655       0.997764
dewiki        2239557             2244880       0.997629
zhwiki        1029501             1033831       0.995812
jawiki        1125250             1130528       0.995331
lawiki        129487              129587        0.999228
barwiki       25899               27175         0.953045
enwikivoyage  28598               28783         0.993573
commonswiki   92410               115614        0.799298
-- Run per wiki: counts mainspace, non-redirect pages and how many of
-- them have an associated Wikidata item (wikibase_item page property).
SELECT SUM(IF(pp_page IS NOT NULL, 1, 0)) AS articles_with_item,
  SUM(1) AS all_articles
FROM (
  -- all mainspace, non-redirect pages
  SELECT
    page_title,
    page_id
  FROM
    page
  WHERE
    page_namespace = 0
  AND
    page_is_redirect = 0
) AS page_titles
LEFT JOIN (
  -- pages that carry a wikibase_item page property
  SELECT
    pp_page
  FROM
    page_props
  WHERE
    pp_propname = 'wikibase_item'
  AND
    pp_value IS NOT NULL
) AS pp_pages
ON
  page_titles.page_id = pp_pages.pp_page;

(There might be some rare oddities with this data though, see e.g. https://quarry.wmflabs.org/query/31750 or T119738 .)

I've identified a few potential issues with the query I've written for the past check-ins, so I'm working on resolving them to make sure the analysis is performed on vetted, correct data. (Gotta love those joins of partitioned tables in Hive.)

I am also doing some research into different models (with pre-test and post-test measurements and treatment & control groups) to correctly infer the impact of the sameAs property. For example, two competing models (written out in full after the list) would be:

  • logPost = intercept + β1 * logPre + β2 * treatment + error, where treatment is an indicator variable and exp(β2) is the multiplication factor by which traffic changes due to the treatment
  • multilevel model with random, hierarchical intercepts:
    • log avg. daily search engine-referred traffic ~ N(group intercept + β * treated, σ) -- two observations per group (within wiki): one pre, one post-test
    • group intercept ~ N(wiki intercept, φ) -- two groups per wiki (control & treatment)
    • wiki intercept ~ N(overall intercept, τ) -- 200+ wikis
    • exp(β) is the multiplication factor by which traffic changes due to the treatment
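Written out (a restatement of the bullets above; y is log average daily search engine-referred traffic, T the treatment indicator, and the g/w indexing of groups and wikis is my own labeling):

\log y_i^{\text{post}} = \alpha + \beta_1 \log y_i^{\text{pre}} + \beta_2 T_i + \varepsilon_i

\begin{aligned}
y_{g,t} &\sim \mathcal{N}(\alpha_g + \beta\, T_{g,t},\ \sigma^2) && t \in \{\text{pre}, \text{post}\} \\
\alpha_g &\sim \mathcal{N}(\alpha_{w[g]},\ \varphi^2) && \text{two groups (control, treatment) per wiki} \\
\alpha_w &\sim \mathcal{N}(\mu,\ \tau^2) && \text{200+ wikis}
\end{aligned}

In both cases exp(β2), resp. exp(β), is the multiplicative change in traffic attributable to the treatment.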

Draft posted at: https://www.mediawiki.org/wiki/User:MPopov_(WMF)/SEO/sameAs_test

Key takeaways:

  • an estimated 1.4% increase in traffic on average (95% CI: 0.7%-2.1%)
  • more wikis benefited from the feature than not
  • based on our decision plan, we should probably roll out to 100% and roll out to other wikis

Update: final draft posted at https://www.mediawiki.org/wiki/Reading/Search_Engine_Optimization/sameAs_test


Closing this as done! As Mikhail noted, the final draft was posted at https://www.mediawiki.org/wiki/Reading/Search_Engine_Optimization/sameAs_test

There was some discussion about expanding the analysis to include other Wikimedia projects (since the feature was launched on projects besides Wikipedia), but after discussion with Jon and Mikhail I think it makes sense to close this as is. Those other projects make up a tiny percentage of overall traffic and their languages are less consistent; logically speaking, we don't expect sameAs to have any negative impact on readers looking for specific knowledge.