Page MenuHomePhabricator

Use search log to find currently existing namespace combinations
Closed, ResolvedPublic3 Estimated Story Points

Description

Ideally we would love to have the following:
Per wiki project and language a csv of the last searches[1] with

project + language | namespaces searched (in alphabetical order) | number of times

[1] Many would be good, but the file should still be sendable via mail ;) We thought about something like 24h maybe? But if you have a better sampling strategy, feel free!

Event Timeline

mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.
mpopov set the point value for this task to 3.

Current draft of Hive query for extracting namespace and counting searches:

ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION is_spider as 'org.wikimedia.analytics.refinery.hive.IsSpiderUDF';
CREATE TEMPORARY FUNCTION ua_parse as 'org.wikimedia.analytics.refinery.hive.GetUAPropertiesUDF';
CREATE TEMPORARY FUNCTION host_parse as 'org.wikimedia.analytics.refinery.hive.GetHostPropertiesUDF';
USE wmf_raw;
WITH wiki_map AS (
  SELECT DISTINCT
    dbname AS wiki_id,
    host_parse(hostname) AS normalized_hostname,
    namespace_localized_name AS localized_namespace,
    namespace_canonical_name AS canonical_namespace
  FROM mediawiki_project_namespace_map
  WHERE snapshot = '2017-04' AND namespace_canonical_name RLIKE '^[A-Za-z\\s]+$'
),
cirrus_searches AS (
  SELECT
    wikiid AS wiki_id,
    source,
    requests[size(requests)-1].querytype AS query_type,
    REGEXP_EXTRACT(requests[size(requests)-1].query, '^([^\\:]+)\\:{1}[^\\:]+.*', 1) AS localized_namespace,
    COUNT(1) AS search_count
  FROM CirrusSearchRequestSet
  WHERE
    year = 2017 AND month = 5 AND day = 23
    AND requests[size(requests)-1].query RLIKE  '^([^\\:]+)\\:{1}[^\\:]+.*'
    -- Filter out bots (not all, but still many):
    AND NOT (
      ua_parser(useragent)['device_family'] = 'Spider'
      OR is_spider(useragent)
      OR ip = '127.0.0.1'
      OR useragent RLIKE 'https?://'
      OR INSTR(useragent, 'www.') > 0
      OR INSTR(useragent, 'github') > 0
      OR LOWER(useragent) RLIKE '([a-z0-9._%-]+@[a-z0-9.-]+\\.(com|us|net|org|edu|gov|io|ly|co|uk))'
      OR (
        ua_parser(useragent)['browser_family'] = 'Other'
        AND ua_parser(useragent)['device_family'] = 'Other'
        AND ua_parser(useragent)['os_family'] = 'Other'
      )
    )
  GROUP BY
    wikiid, source,
    requests[size(requests)-1].querytype,
    REGEXP_EXTRACT(requests[size(requests)-1].query, '^([^\\:]+)\\:{1}[^\\:]+.*', 1)
)
SELECT
  normalized_hostname.project_class AS project,
  normalized_hostname.project AS language,
  source, query_type, canonical_namespace AS namespace_searched,
  search_count
FROM cirrus_searches
INNER JOIN wiki_map
  ON cirrus_searches.wiki_id = wiki_map.wiki_id
  AND cirrus_searches.localized_namespace = wiki_map.localized_namespace;

Sample output:

projectlanguagesourcequery_typenamespace_searchedsearch_count
wikibooksenwebfull_textCookbook263
wikipediafrapiprefixProject67
wikimediacommonswebfull_textCategory64
wikipediaiaapiprefixProject57
wikipediaesapiprefixProject45
wikimediacommonswebfull_textFile31
wikipediafrapiprefixPortail31
wikipediaenwebfull_textCategory18
wikipediaenwebfull_textTemplate17
wikimediacommonsapiprefixCreator16

TODO: expand it to work with multi-namespace searches. I'm not a power searcher, so I asked for some examples in T165492#3293257

New query that actually does the thing desired (keeping previous one just for future reference):

ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION is_spider as 'org.wikimedia.analytics.refinery.hive.IsSpiderUDF';
CREATE TEMPORARY FUNCTION ua_parse as 'org.wikimedia.analytics.refinery.hive.GetUAPropertiesUDF';
CREATE TEMPORARY FUNCTION host_parse as 'org.wikimedia.analytics.refinery.hive.GetHostPropertiesUDF';
WITH searches AS (
  SELECT DISTINCT
    id, wikiid AS wiki_id,
    STR_TO_MAP(payload['queryString'], '&', '=') AS query_map
  FROM wmf_raw.cirrussearchrequestset
  WHERE
    year = 2017 AND month = 5 AND day = 31 AND hour = 23
    AND source = 'web'
    AND NOT (
      useragent IS NULL
      OR useragent = ''
      OR ua_parse(useragent)['device_family'] = 'Spider'
      OR is_spider(useragent)
      OR ip = '127.0.0.1'
      OR useragent RLIKE 'https?://'
      OR INSTR(useragent, 'www.') > 0
      OR INSTR(useragent, 'github') > 0
      OR LOWER(useragent) RLIKE '([a-z0-9._%-]+@[a-z0-9.-]+\\.(com|us|net|org|edu|gov|io|ly|co|uk))'
      OR (
        ua_parse(useragent)['browser_family'] = 'Other'
        AND ua_parse(useragent)['device_family'] = 'Other'
        AND ua_parse(useragent)['os_family'] = 'Other'
      )
    )
    AND INSTR(payload['queryString'], 'profile=advanced') > 0
    AND INSTR(payload['queryString'], 'fulltext=1') > 0
  LIMIT 100
),
exploded_searches AS (
  SELECT
    s.id, s.wiki_id,
    exp.key AS query_key,
    exp.val AS query_value
  FROM searches s
  LATERAL VIEW EXPLODE(s.query_map) exp AS key, val
),
wiki_map AS (
  SELECT DISTINCT
    dbname AS wiki,
    host_parse(hostname) AS normalized_hostname,
    CONCAT('ns', namespace) AS ns,
    namespace_canonical_name AS canonical_namespace
  FROM wmf_raw.mediawiki_project_namespace_map
  WHERE snapshot = '2017-04' AND namespace_canonical_name RLIKE '^[A-Za-z\\s]+$'
)
SELECT
  project, language,
  id AS search_id,
  namespace AS namespace_searched
FROM (
  SELECT
    es.id AS id, es.wiki_id AS wiki,
    wm.normalized_hostname.project_class AS project,
    wm.normalized_hostname.project AS language,
    wm.canonical_namespace AS namespace
  FROM exploded_searches es
  LEFT JOIN wiki_map wm
    ON es.wiki_id = wm.wiki
    AND es.query_key = wm.ns
) ns
WHERE namespace IS NOT NULL;

e.g.

project	language	search_id	namespace_searched
wikipedia	en	qphgphgx9aducocvfql7kl09	MediaWiki
wikipedia	es	elek4r79vho4h4hn8pke3pmgz	Portal
wikipedia	es	elek4r79vho4h4hn8pke3pmgz	Project
wikipedia	es	elek4r79vho4h4hn8pke3pmgz	Anexo
wikipedia	es	elek4r79vho4h4hn8pke3pmgz	Gadget definition
wikipedia	es	elek4r79vho4h4hn8pke3pmgz	Gadget
wikipedia	es	elek4r79vho4h4hn8pke3pmgz	File
wikipedia	fr	c7b6bxjm7ggbd17478yrzly90	Help
wikipedia	es	be6eyaoy1n4slvg3qvrzbso23	Project
wikipedia	es	be6eyaoy1n4slvg3qvrzbso23	Help
wikipedia	es	be6eyaoy1n4slvg3qvrzbso23	Template
wikimedia	commons	alb6fbdmyrloy04ht282scu70	Category
wikimedia	commons	alb6fbdmyrloy04ht282scu70	File

I'll run it for searches made on June 1st and aggregate/count as a separate step.

That is great! I have accessrights now, but seeing the query they are of no use in this particular issue: this is way beyond my Hive/SQL skills! :-)

Attached:

Counts of advanced Special:Search searches for each observed combination of namespaces on 1 June 2017. The proportion is how many searches out of total namespace searches within the wiki -- e.g. the 1.6K Category + File searches accounted for 20% of the advanced Special:Search searches on Commons.

Example:

projectlanglanguagenamespacessearchesproportion
wikipediaenEnglishBook, Book talk, Category, Category talk, Draft, Draft talk, Education Program, Education Program talk, File, File talk, Gadget, Gadget definition, Gadget definition talk, Gadget talk, Help, Help talk, MediaWiki, MediaWiki talk, Module, Module talk, Portal, Portal talk, Project, Project talk, Talk, Template, Template talk, TimedText, TimedText talk, User, User talk31620.5254
wikimediacommonsNACategory, File16360.2077
wikimediacommonsNACategory, Creator, File, Help, Institution11250.1428
wikimediacommonsNACategory, Creator, File, Help, Institution, MediaWiki talk11220.1425
wikipediaesSpanish; CastilianAnexo, Category, Category talk, Education Program, Education Program talk, File, File talk, Gadget, Gadget definition, Gadget definition talk, Gadget talk, Help, Help talk, MediaWiki, MediaWiki talk, Module, Module talk, Portal, Project, Project talk, Talk, Template, Template talk, User, User talk, Wikiproyecto10920.4190
wikipediadeGermanCategory, Category talk, File, File talk, Gadget, Gadget definition, Gadget definition talk, Gadget talk, Help, Help talk, MediaWiki, MediaWiki talk, Module, Module talk, Portal, Portal Diskussion, Project, Project talk, Talk, Template, Template talk, User, User talk10350.5302
wikimediacommonsNACampaign, Campaign talk, Category, Category talk, Creator, Creator talk, Data, Data talk, File, File talk, Gadget, Gadget definition, Gadget definition talk, Gadget talk, GWToolset, GWToolset talk, Help, Help talk, Institution, Institution talk, MediaWiki, MediaWiki talk, Module, Module talk, Project, Project talk, Sequence, Sequence talk, Talk, Template, Template talk, TimedText, TimedText talk, Translations, Translations talk, User, User talk9990.1268
wikipediazhChineseCategory, Category talk, Draft, Draft talk, File, File talk, Gadget, Gadget definition, Gadget definition talk, Gadget talk, Help, Help talk, MediaWiki, MediaWiki talk, Module, Module talk, Portal, Portal talk, Project, Project talk, Talk, Template, Template talk, User, User talk5750.8110
wikimediacommonsNACampaign, Campaign talk, Category, Category talk, Creator, Creator talk, Data, Data talk, File, Gadget, Gadget definition, Gadget definition talk, Gadget talk, GWToolset, GWToolset talk, Help, Help talk, Institution, Institution talk, MediaWiki, MediaWiki talk, Module, Module talk, Sequence, Sequence talk, Template, Template talk, TimedText, TimedText talk, Translations, Translations talk5580.0708
wikipediaesSpanish; CastilianAnexo, Portal5120.1965

Aggregation code:

library(tidyverse)

namespaces <- read_tsv("~/Desktop/namespace-searches-edited.tsv")
# ^ not attached in the comment

data("ISO_639_2", package = "ISOcodes")

aggregates <- namespaces %>%
  arrange(project, language, search_id, namespace_searched) %>%
  group_by(project, language, search_id) %>%
  mutate(namespaces = paste0(namespace_searched, collapse = ", ")) %>%
  group_by(project, language, namespaces) %>%
  count %>%
  mutate(proportion = round(n/sum(n), 4)) %>%
  left_join(select(ISO_639_2, c(Alpha_2, Name)),
            by = c("language" = "Alpha_2")) %>%
  select(project, lang = language, language = Name, namespaces, searches = n, proportion) %>%
  arrange(project, lang, desc(searches))

write_csv(aggregates, "~/Desktop/namespace-searches-counts.csv")

Let me know if there's anything else you'd like to know! :) Hope this helps~

Let us know if there is anything else we can help with, @Lea_WMDE :)

@mpopov was it intentional to not include ns0, the article/main namespace in the results of the query?

@mpopov Are we correct to assume that the numbers here only reflect what the users searched for when they clicked on "advanced" on SpecialPage:Search, but not if they clicked on "Content Pages", "Multimedia" or "Everything". If so, would it be possible to get the data with these searches included?

Background: We are investigating which defaults should be offered for namespace selection, and maybe the current ones are already the ones that everybody needs.

@mpopov was it intentional to not include ns0, the article/main namespace in the results of the query?

No, that was not intentional. For some reason ns0 doesn't have a name in our database, which got it excluded by accident.

FROM wmf_raw.mediawiki_project_namespace_map
WHERE snapshot = '2017-04' AND namespace_canonical_name RLIKE '^[A-Za-z\\s]+$'

Screen Shot 2017-08-09 at 10.02.51 AM.png (591×1 px, 133 KB)

@Lea_WMDE: Would it help if I fixed the query to include ns0 and uploaded a new dataset?

@mpopov Are we correct to assume that the numbers here only reflect what the users searched for when they clicked on "advanced" on SpecialPage:Search, but not if they clicked on "Content Pages", "Multimedia" or "Everything". If so, would it be possible to get the data with these searches included?

Which numbers? The proportions? If so, I think I can re-calculate it as the % of all searches those combinations represent.

@mpopov
In the current search interface, there are four option: Content articles, multimedia, everything and advanced. When you click on advanced, you get the table of all namespaces and can choose them individually. If we understand your query correctly, you only look at searches that have profile=advanced in the url. The first three options have other profiles though. For our need, we would need to have these searches included, too.

Bildschirmfoto 2017-08-09 um 21.30.47.png (580×1 px, 119 KB)

@mpopov
In the current search interface, there are four option: Content articles, multimedia, everything and advanced. When you click on advanced, you get the table of all namespaces and can choose them individually. If we understand your query correctly, you only look at searches that have profile=advanced in the url. The first three options have other profiles though. For our need, we would need to have these searches included, too.

Even though those tabs don't let you choose namespaces?

Even though those tabs don't let you choose namespaces?

They don't? I thought that profile = a bundle of namespaces? What is a profile then?

They don't? I thought that profile = a bundle of namespaces? What is a profile then?

That's fair. When I get back to this, I'll include "Content pages" (mainspace), "Multimedia" (File), "Everything" (all plus File) profiles.

@mpopov Do you think you will be able to give us insights in the next days? Our principle investigator of namespace correlations is only available until the end of next week. So if you manage to get back to it before, it would make it much easier for us to evaluate :)

@mpopov Do you think you will be able to give us insights in the next days? Our principle investigator of namespace correlations is only available until the end of next week. So if you manage to get back to it before, it would make it much easier for us to evaluate :)

Sure! I've rewritten the query and recounted the namespace combos. I skipped profiles like Translations and Discussions since they're rare and weird and I don't have time to figure out how to deal with each one, but I included logic for content, multimedia, and everything profiles. Attaching counts from 2017-08-01:

Hope this helps!

P.S. Here's the updated query (for future me's reference):

Query run via CLI:

export HADOOP_HEAPSIZE=1024
hive -S \
  -d year='2017' \
  -d month='8' \
  -d day='1' \
  -f ~/namespaces.hql \
  2> /dev/null | grep -v parquet.hadoop | grep -v WARN: \
  > ~/namespace-searches.tsv

Aggregation done in R:

library(tidyverse)

namespaces <- read_tsv("~/namespace-searches.tsv") # not attached

data("ISO_639_2", package = "ISOcodes")

aggregates <- namespaces %>%
  arrange(project, language, search_id, namespace_searched) %>%
  group_by(project, language, search_id) %>%
  mutate(namespaces = paste0(namespace_searched, collapse = ", ")) %>%
  group_by(project, language, namespaces) %>%
  count %>%
  left_join(select(ISO_639_2, c(Alpha_2, Name)),
            by = c("language" = "Alpha_2")) %>%
  select(project, lang = language, language = Name, namespaces, searches = n) %>%
  arrange(project, lang, desc(searches))

write_csv(aggregates, "~/namespace-searches-counts.csv") # attached