Map analytics
Closed, Declined · Public · 6 Story Points

Description

We'd like to start making some analytic comparisons of the maps we currently have implemented on various wikis and projects.

We'd like to compare pages that have a map (-frame or -link) on them and determine whether anyone viewed those maps (or the articles they were on).

We'd also like to compare that to how many articles are on that wiki or project; comparing page views would be a good start.

Another interesting tangent - do people look at articles with maps in them more than they look at articles with images in them?

Here's the list of the wikis and projects that have map (-frame or -link) enabled:

  • Meta Wikimedia
  • MediaWiki
  • Wikivoyage (all languages)
  • Wikipedia
    • Catalan
    • Hebrew
    • Russian
    • Macedonian
    • French
    • Finnish
    • Norwegian
    • Swedish

Event Timeline

debt created this task. Jul 7 2017, 10:17 PM
Restricted Application added subscribers: jhsoby, Aklapper. Jul 7 2017, 10:17 PM
mpopov added a subscriber: MaxSem. Jul 11 2017, 12:32 AM

@MaxSem I'm going through your discovery-stats repo and currently taking a look at the geo_tag table. I'm noticing that sometimes there are geotags in the database that are no longer present on wiki.

For example, the geotag for File:Vossloh Euro 4000 pupitre.JPG got removed: https://commons.wikimedia.org/w/index.php?title=File%3AVossloh_Euro_4000_pupitre.JPG&type=revision&diff=88315031&oldid=86196480 but it's still in the geo_tags table. Do you know if there is any script that performs maintenance on those tables? Do you know who owns the code that adds data to geo_tags when a file on Commons has a location?

That'd be T143366. Who? In theory, Discov... oh, nevermind =)

mpopov added a comment. Edited · Jul 12 2017, 10:33 PM

@debt: first draft: https://people.wikimedia.org/~bearloga/reports/maps-usage.html

That doesn't include event logging stuff, which will be my next step.

P.S. repo: https://github.com/wikimedia-research/Discovery-Interactive-Adhoc-Usage

Change 368462 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[wikimedia/discovery/golden@master] metrics::maps: Add maplink mapframe prevalence metrics

https://gerrit.wikimedia.org/r/368462

Change 368462 merged by Bearloga:
[wikimedia/discovery/golden@master] metrics::maps: Add maplink mapframe prevalence metrics

https://gerrit.wikimedia.org/r/368462

Restricted Application added a subscriber: jeblad. Aug 29 2017, 9:03 PM

Change 377807 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[wikimedia/discovery/golden@master] Fix maplink/mapframe query

https://gerrit.wikimedia.org/r/377807

Change 377807 merged by Chelsyx:
[wikimedia/discovery/golden@master] Fix maplink/mapframe query

https://gerrit.wikimedia.org/r/377807

Update: I fixed the query for prevalence stats in https://people.wikimedia.org/~bearloga/reports/maps-usage.html -- specifically I am now counting only pages that are articles and that are not redirects. I also added an "% of sessions that activated mapframe" to https://people.wikimedia.org/~bearloga/reports/maps-interactions.html

I'm going to try to do something with the static maps service, which is an indicator for how many times maps were seen on a page.
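For anyone reproducing the prevalence fix mentioned above, here is a minimal sketch of the "articles only, no redirects" filter on toy data. It assumes that "article" means namespace 0 and relies on the `page_namespace` and `page_is_redirect` columns of the MediaWiki `page` table; the `has_map` column and all values are hypothetical, and this is not the actual query merged into golden.

```r
# Toy stand-in for rows of the MediaWiki `page` table (values invented).
# `has_map` is a hypothetical flag for "page contains a maplink or mapframe".
pages <- data.frame(
  page_title = c("Paris", "Paris", "Pariz", "Lyon"),
  page_namespace = c(0L, 1L, 0L, 0L),   # 0 = main/article namespace (assumed)
  page_is_redirect = c(0L, 0L, 1L, 0L), # 1 = the page is a redirect
  has_map = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only articles that are not redirects
articles <- pages[pages$page_namespace == 0L & pages$page_is_redirect == 0L, ]

# Prevalence = share of remaining articles that carry a map
prevalence <- mean(articles$has_map)
print(prevalence)
```

Without the filter, talk pages and redirects would count towards both the numerator and the denominator, which can distort the prevalence figure in either direction.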

mpopov added a comment. Edited · Sep 16 2017, 1:04 AM

R script & Hive query that find static map thumbnail requests, use those requests to identify the pages that have a mapframe, and then report how many pageviews those pages received alongside the total pageviews of the respective project:

library(glue)
library(magrittr)

query <- "ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION get_host_properties AS 'org.wikimedia.analytics.refinery.hive.GetHostPropertiesUDF';
USE wmf;
WITH static_maps AS (
  SELECT DISTINCT
    IF(
      get_host_properties(PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'domain')).project = '-',
      get_host_properties(PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'domain')).project_class,
      CONCAT(
        get_host_properties(PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'domain')).project,
        '.',
        get_host_properties(PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'domain')).project_class
      )
    ) AS project,
    REGEXP_REPLACE(REFLECT('java.net.URLDecoder', 'decode', PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'title'), 'UTF-8'), '\\\\s', '_') AS page_title
  FROM webrequest
  WHERE
    webrequest_source = 'upload'
    AND year = ${year} AND month = ${month} AND day = ${day}
    AND uri_host = 'maps.wikimedia.org'
    AND http_status IN('200', '304')
    AND uri_path RLIKE '^/img/.*\\.png$'
    AND PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'domain') IS NOT NULL
    AND PARSE_URL(CONCAT('http://', uri_host, uri_path, uri_query), 'QUERY', 'title') IS NOT NULL
    AND uri_query <> '?loadtesting'
),
maps_pvs AS (
  SELECT
    static_maps.project AS project,
    static_maps.page_title AS page_title,
    pvh.view_count AS pageviews
  FROM static_maps
  LEFT JOIN pageview_hourly pvh ON (
    static_maps.page_title = pvh.page_title
    AND static_maps.project = pvh.project
    AND pvh.year = ${year} AND pvh.month = ${month} AND pvh.day = ${day}
  )
),
daily_pvs AS (
  SELECT project, COUNT(DISTINCT(page_title)) AS pages_with_thumbnails, SUM(pageviews) AS pageviews
  FROM maps_pvs
  WHERE project IS NOT NULL
  GROUP BY project
)
SELECT
  daily_pvs.project AS project, pages_with_thumbnails, total_pages, pageviews, total_pageviews
FROM daily_pvs
LEFT JOIN (
  SELECT project, COUNT(DISTINCT(page_title)) AS total_pages, SUM(view_count) AS total_pageviews
  FROM pageview_hourly
  WHERE year = ${year} AND month = ${month} AND day = ${day} AND view_count > 0
  GROUP BY project
) AS project_pvs ON daily_pvs.project = project_pvs.project;"

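# Query one day at a time, from 60 days before yesterday through yesterday
end_date <- Sys.Date() - 1
start_date <- end_date - 60

# Builds " | grep -v ... | grep -v ..." to strip Hadoop/Hive log noise from the output
filters <- paste0(c("", paste(" grep -v", c("JAVA_TOOL_OPTIONS", "parquet.hadoop", "WARN:", ":WARN"))), collapse = " |")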

results <- do.call(rbind, lapply(
  seq(start_date, end_date, by = "day"),
  function(date) {
    message("Fetching data from ", format(date, "%d %B %Y"))
    year <- lubridate::year(date)
    month <- lubridate::month(date)
    day <- lubridate::mday(date)
    query <- glue(query, .open = "${", .close = "}")
    query_dump <- tempfile()
    cat(query, file = query_dump)
    results_dump <- tempfile()
    system(glue("export HADOOP_HEAPSIZE=1024 && hive -f {query_dump} 2> /dev/null {filters} > {results_dump}"))
    result <- read.delim(results_dump, sep = "\t", quote = "", as.is = TRUE, header = TRUE)
    file.remove(query_dump, results_dump)
    return(cbind(date = date, result))
  }
))

readr::write_tsv(results, "mapframe_pageviews.tsv")
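The date substitution in the script above relies on glue's custom delimiters, so that `${year}`-style placeholders in the Hive query are filled from R variables. A minimal sketch of just that step, using a shortened, hypothetical one-line template in place of the full query and a fixed example date:

```r
library(glue)

# Hypothetical short template standing in for the full Hive query
template <- "SELECT COUNT(1) FROM webrequest WHERE year = ${year} AND month = ${month} AND day = ${day}"

# Fixed example date; the script above loops over a sequence of dates instead
date <- as.Date("2017-09-15")
year <- as.integer(format(date, "%Y"))
month <- as.integer(format(date, "%m"))
day <- as.integer(format(date, "%d"))

# .open/.close replace glue's default "{" and "}" delimiters,
# so only ${...} placeholders are interpolated
filled <- glue(template, .open = "${", .close = "}")
print(filled)
```

Using `${` as the opening delimiter keeps the template readable as a Hive script with variables, while glue looks the names up in the calling environment.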

Change 379150 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[wikimedia/discovery/wetzel@develop] [WIP] Add maplink & mapframe prevalence graphs and modularize

https://gerrit.wikimedia.org/r/379150

Still need to add the logic that auto-selects "(None)" in the languages list if the user selects "Commons" in the projects list, but here's what I have so far:

Result of running T170022#3611637:

library(magrittr)
library(ggplot2)
library(dplyr)

results <- readr::read_tsv("~/Desktop/mapframe_pageviews.tsv", col_types = "Dciiii")
results$prop <- results$pageviews / results$total_pageviews
results$prop[is.na(results$prop)] <- 0
results$include <- FALSE
results$include[results$project %in% unique(results$project[results$prop >= 0.01])] <- TRUE

results %>%
  group_by(date) %>%
  summarize(prop = sum(pageviews, na.rm = TRUE) / sum(total_pageviews, na.rm = TRUE)) %>%
  ggplot(aes(x = date, y = prop)) +
  geom_bar(stat = "identity", fill = "gray40", color = "white") +
  scale_x_date(date_breaks = "1 week", date_labels = "%d %b") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    x = "Date", y = "Proportion of pageviews",
    title = "Overall proportion of pageviews that pages with mapframes are responsible for",
    subtitle = "Total across all projects with mapframes"
  )

results %>%
  group_by(date) %>%
  summarize(prop = sum(pages_with_thumbnails) / sum(total_pages)) %>%
  ggplot(aes(x = date, y = prop)) +
  geom_bar(stat = "identity", fill = "gray40", color = "white") +
  scale_x_date(date_breaks = "1 week", date_labels = "%d %b") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    x = "Date", y = "Proportion of pages viewed",
    title = "Overall proportion of pages viewed that have a mapframe",
    subtitle = "Out of total pages viewed that day"
  )

results %>%
  filter(include) %>%
  ggplot(aes(x = date, y = prop)) +
  geom_bar(stat = "identity", fill = "gray40", color = "white") +
  scale_x_date(date_breaks = "1 week", date_labels = "%d %b") +
  scale_y_continuous(labels = scales::percent_format(), breaks = seq(0, 0.1, 0.02)) +
  facet_wrap(~ project) +
  labs(
    x = "Date", y = "Proportion of pageviews",
    title = "Proportion of pageviews that pages with mapframes are responsible for",
    subtitle = "Only including projects & languages for which we've observed a share of at least 1%"
  )

results %>%
  group_by(project) %>%
  mutate(include = any(pages_with_thumbnails > 100)) %>%
  ungroup() %>%
  filter(include) %>%
  mutate(prop = pages_with_thumbnails / total_pages) %>%
  ggplot(aes(x = date, y = prop)) +
  geom_bar(stat = "identity", fill = "gray40", color = "white") +
  scale_x_date(date_breaks = "1 week", date_labels = "%d %b") +
  scale_y_continuous(labels = scales::percent_format(), breaks = seq(0, 0.1, 0.01)) +
  facet_wrap(~ project) +
  labs(
    x = "Date", y = "Proportion of pages viewed",
    title = "Proportion of pages viewed that have a mapframe out of the total pages viewed for that project",
    subtitle = "Only including projects & languages for which at least 100 pages with a mapframe have been viewed on any one day"
  )
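To make the `include` logic from the plotting script easier to follow, here is a small base-R sketch on invented numbers (not real measurements). It shows that a project is kept for every day as long as its share reached 1% on at least one day, and that days with missing pageviews are treated as a 0% share.

```r
# Toy data: two projects over two days (all values invented)
results <- data.frame(
  date = as.Date(rep(c("2017-09-01", "2017-09-02"), times = 2)),
  project = rep(c("en.wikivoyage", "ca.wikipedia"), each = 2),
  pageviews = c(300L, NA, 5L, 8L),
  total_pageviews = c(10000L, 12000L, 100000L, 100000L),
  stringsAsFactors = FALSE
)

# Share of the project's pageviews that went to pages with a mapframe
results$prop <- results$pageviews / results$total_pageviews
results$prop[is.na(results$prop)] <- 0 # no matching pageviews counts as 0

# Projects whose share reached 1% on at least one day
keep <- unique(results$project[results$prop >= 0.01])
results$include <- results$project %in% keep

print(results[, c("date", "project", "prop", "include")])
```

In this toy case the first project is included on both days, even though its second day has a share of 0, because the flag is decided per project rather than per row.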

Change 379150 merged by Chelsyx:
[wikimedia/discovery/wetzel@develop] Add maplink & mapframe prevalence graphs and modularize

https://gerrit.wikimedia.org/r/379150

MaxSem removed a subscriber: MaxSem. Sep 28 2017, 1:12 AM

There are a few more things to be done before putting these dashboards into production:

  • the shadow and the text are a bit off (when not using a really wide screen)

  • the list of projects/wikis that have maps enabled should be updated as well: https://www.mediawiki.org/wiki/Maps#Wikimedia_projects_that_have_Maps_enabled
  • the project selector is weird—selecting wikipedia and commons also shows results from mediawiki and meta and wikivoyage
  • removing 'none' for language makes it crash
  • the summary prevalence chart needs to be updated to reflect the small increases in mapframe/maplink prevalence, or replaced with a different chart; the way it is right now, it doesn't look like there is any increase (or decrease) in prevalence. This applies to the Kartographer and Kartotherian summary pages.

  • can the lang/project selectors only show the projects/languages that have mapframe—when the user selects to only show mapframe?
  • when selecting only mapframe, the top of the chart is missing (missing 100%)
  • when clicking on Wikimedia chapters, the chart crashes and doesn't recover
  • when selecting mapframe and the top 4 languages, and then switching between avg, median and overall, the numbers don't change

Change 391138 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[wikimedia/discovery/golden@master] Add Spanish Wikipedia to list of wikis that have mapframe enabled

https://gerrit.wikimedia.org/r/391138

Change 391138 merged by Bearloga:
[wikimedia/discovery/golden@master] Add Spanish Wikipedia to list of wikis that have mapframe enabled

https://gerrit.wikimedia.org/r/391138

Change 391288 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[wikimedia/discovery/wetzel@develop] Fix prevalence bugs

https://gerrit.wikimedia.org/r/391288

debt added a comment. Jan 19 2018, 2:38 AM

Hey @mpopov -- just wanted to let you know that Latvian and Arabic wikis now have mapframe too: https://www.mediawiki.org/w/index.php?title=Maps&diff=2695678&oldid=2595033

Cheers! :)

mpopov moved this task from Triage to Doing on the Product-Analytics board. Apr 23 2018, 11:08 PM
kzimmerman changed the task status from Open to Stalled. Dec 5 2018, 5:18 AM
kzimmerman moved this task from Doing to Epics on the Product-Analytics board.
kzimmerman added subscribers: jmatazzoni, kzimmerman.

Moving to Blocked for now, pending review of Wiki Maps current needs.

@jmatazzoni are you currently overseeing work on Wiki Maps? If so, are there any 2018/19 Annual Plan goals you're working toward, and do you need support from Product Analytics (either regarding this task or other needs)?

! In T170022#4799374, @kzimmerman wrote:

@jmatazzoni are you currently overseeing work on Wiki Maps? If so, are there any 2018/19 Annual Plan goals you're working toward, and do you need support from Product Analytics (either regarding this task or other needs)?

The Collaboration team worked on maps for a limited engagement, which is now completed. I think jhernandez@wikimedia.org is running maps now. Or talk to @Mholloway

@Jhernandez See the question above from @kzimmerman...

Hey @kzimmerman, Reading Infrastructure is doing only maintenance and critical bug fixing for maps, so we don't have a Product Manager that would want to look at the analytics.

I think this would be interesting to do at some point, but the lack of a PM is why it is stalled.

We could either flag this to other PMs in audiences or just decline it for now.

Whatever you prefer.

kzimmerman closed this task as Declined. Jun 18 2019, 6:50 PM

Change 391288 abandoned by Bearloga:
Fix prevalence bugs

Reason:
Related Phab task declined

https://gerrit.wikimedia.org/r/391288