Use search log to find currently existing namespace combinations
Closed, Resolved | Public | 3 Estimated Story Points

Assigned To

	mpopov

Authored By

	• Lea_WMDE
	May 20 2017, 11:04 AM

Description

Ideally, we would love to have the following:
For each wiki project and language, a CSV of the most recent searches [1] with these columns:

project + language | namespaces searched (in alphabetical order) | number of times

[1] The more searches the better, but the file should still be small enough to send by email ;) We were thinking of something like the last 24 hours. If you have a better sampling strategy, feel free!

Related Objects

Status	Assigned	Task
Resolved	• Lea_WMDE	T143310 Implement a way to access keywords such as "incategory", "intitle" via the Search special page
Resolved	Tobi_WMDE_SW	T165292 Implement general UI of advanced search extension
Resolved	• Lea_WMDE	T165492 Find out which namespace combinations are used for searching
Resolved	mpopov	T165861 Use search log to find currently existing namespace combinations

Event Timeline

• Lea_WMDE created this task. May 20 2017, 11:04 AM

debt added a project: Discovery-Analysis (Current work). May 20 2017, 11:53 AM

debt added subscribers: mpopov, • chelsyx.

mpopov claimed this task. May 25 2017, 7:21 PM

mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.

mpopov set the point value for this task to 3.

Current draft of Hive query for extracting namespace and counting searches:

ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION is_spider as 'org.wikimedia.analytics.refinery.hive.IsSpiderUDF';
CREATE TEMPORARY FUNCTION ua_parse as 'org.wikimedia.analytics.refinery.hive.GetUAPropertiesUDF';
CREATE TEMPORARY FUNCTION host_parse as 'org.wikimedia.analytics.refinery.hive.GetHostPropertiesUDF';
USE wmf_raw;
WITH wiki_map AS (
  SELECT DISTINCT
    dbname AS wiki_id,
    host_parse(hostname) AS normalized_hostname,
    namespace_localized_name AS localized_namespace,
    namespace_canonical_name AS canonical_namespace
  FROM mediawiki_project_namespace_map
  WHERE snapshot = '2017-04' AND namespace_canonical_name RLIKE '^[A-Za-z\\s]+$'
),
cirrus_searches AS (
  SELECT
    wikiid AS wiki_id,
    source,
    requests[size(requests)-1].querytype AS query_type,
    REGEXP_EXTRACT(requests[size(requests)-1].query, '^([^\\:]+)\\:{1}[^\\:]+.*', 1) AS localized_namespace,
    COUNT(1) AS search_count
  FROM CirrusSearchRequestSet
  WHERE
    year = 2017 AND month = 5 AND day = 23
    AND requests[size(requests)-1].query RLIKE  '^([^\\:]+)\\:{1}[^\\:]+.*'
    -- Filter out bots (not all, but still many):
    AND NOT (
      -- ua_parse is the name registered via CREATE TEMPORARY FUNCTION above
      ua_parse(useragent)['device_family'] = 'Spider'
      OR is_spider(useragent)
      OR ip = '127.0.0.1'
      OR useragent RLIKE 'https?://'
      OR INSTR(useragent, 'www.') > 0
      OR INSTR(useragent, 'github') > 0
      OR LOWER(useragent) RLIKE '([a-z0-9._%-]+@[a-z0-9.-]+\\.(com|us|net|org|edu|gov|io|ly|co|uk))'
      OR (
        ua_parse(useragent)['browser_family'] = 'Other'
        AND ua_parse(useragent)['device_family'] = 'Other'
        AND ua_parse(useragent)['os_family'] = 'Other'
      )
    )
  GROUP BY
    wikiid, source,
    requests[size(requests)-1].querytype,
    REGEXP_EXTRACT(requests[size(requests)-1].query, '^([^\\:]+)\\:{1}[^\\:]+.*', 1)
)
SELECT
  normalized_hostname.project_class AS project,
  normalized_hostname.project AS language,
  source, query_type, canonical_namespace AS namespace_searched,
  search_count
FROM cirrus_searches
INNER JOIN wiki_map
  ON cirrus_searches.wiki_id = wiki_map.wiki_id
  AND cirrus_searches.localized_namespace = wiki_map.localized_namespace;

Sample output:

project	language	source	query_type	namespace_searched	search_count
wikibooks	en	web	full_text	Cookbook	263
wikipedia	fr	api	prefix	Project	67
wikimedia	commons	web	full_text	Category	64
wikipedia	ia	api	prefix	Project	57
wikipedia	es	api	prefix	Project	45
wikimedia	commons	web	full_text	File	31
wikipedia	fr	api	prefix	Portail	31
wikipedia	en	web	full_text	Category	18
wikipedia	en	web	full_text	Template	17
wikimedia	commons	api	prefix	Creator	16

TODO: expand it to work with multi-namespace searches. I'm not a power searcher, so I asked for some examples in T165492#3293257
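
For anyone who wants to try the prefix pattern from the draft query outside Hive, here is a minimal Python sketch. It is not part of the original analysis: the function name `extract_prefix` and the sample queries are made up for illustration, and Python's `re` is assumed to behave like Hive's Java regex for this simple pattern. It also shows the limitation behind the TODO: only a single leading prefix is captured per query.

```python
import re

# Same shape as the Hive pattern '^([^\\:]+)\\:{1}[^\\:]+.*':
# some non-colon text, one colon, then at least one non-colon character.
PREFIX_PATTERN = re.compile(r"^([^:]+):[^:]+.*")


def extract_prefix(query):
    """Return the text before the first colon, or None if the query has no prefix."""
    match = PREFIX_PATTERN.match(query)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical queries, for illustration only.
    for q in ["Category:Cats", "cats", "Help:Foo:Bar"]:
        print(repr(q), "->", extract_prefix(q))
```

The captured prefix is whatever the user typed, so it still has to be matched against the localized namespace names per wiki, as the join in the draft query does.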

Tobi_WMDE_SW added a project: Advanced-Search. May 30 2017, 3:36 PM

New query that actually does the thing desired (keeping previous one just for future reference):

ADD JAR hdfs:///wmf/refinery/current/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION is_spider as 'org.wikimedia.analytics.refinery.hive.IsSpiderUDF';
CREATE TEMPORARY FUNCTION ua_parse as 'org.wikimedia.analytics.refinery.hive.GetUAPropertiesUDF';
CREATE TEMPORARY FUNCTION host_parse as 'org.wikimedia.analytics.refinery.hive.GetHostPropertiesUDF';
WITH searches AS (
  SELECT DISTINCT
    id, wikiid AS wiki_id,
    STR_TO_MAP(payload['queryString'], '&', '=') AS query_map
  FROM wmf_raw.cirrussearchrequestset
  WHERE
    year = 2017 AND month = 5 AND day = 31 AND hour = 23
    AND source = 'web'
    AND NOT (
      useragent IS NULL
      OR useragent = ''
      OR ua_parse(useragent)['device_family'] = 'Spider'
      OR is_spider(useragent)
      OR ip = '127.0.0.1'
      OR useragent RLIKE 'https?://'
      OR INSTR(useragent, 'www.') > 0
      OR INSTR(useragent, 'github') > 0
      OR LOWER(useragent) RLIKE '([a-z0-9._%-]+@[a-z0-9.-]+\\.(com|us|net|org|edu|gov|io|ly|co|uk))'
      OR (
        ua_parse(useragent)['browser_family'] = 'Other'
        AND ua_parse(useragent)['device_family'] = 'Other'
        AND ua_parse(useragent)['os_family'] = 'Other'
      )
    )
    AND INSTR(payload['queryString'], 'profile=advanced') > 0
    AND INSTR(payload['queryString'], 'fulltext=1') > 0
  LIMIT 100
),
exploded_searches AS (
  SELECT
    s.id, s.wiki_id,
    exp.key AS query_key,
    exp.val AS query_value
  FROM searches s
  LATERAL VIEW EXPLODE(s.query_map) exp AS key, val
),
wiki_map AS (
  SELECT DISTINCT
    dbname AS wiki,
    host_parse(hostname) AS normalized_hostname,
    CONCAT('ns', namespace) AS ns,
    namespace_canonical_name AS canonical_namespace
  FROM wmf_raw.mediawiki_project_namespace_map
  WHERE snapshot = '2017-04' AND namespace_canonical_name RLIKE '^[A-Za-z\\s]+$'
)
SELECT
  project, language,
  id AS search_id,
  namespace AS namespace_searched
FROM (
  SELECT
    es.id AS id, es.wiki_id AS wiki,
    wm.normalized_hostname.project_class AS project,
    wm.normalized_hostname.project AS language,
    wm.canonical_namespace AS namespace
  FROM exploded_searches es
  LEFT JOIN wiki_map wm
    ON es.wiki_id = wm.wiki
    AND es.query_key = wm.ns
) ns
WHERE namespace IS NOT NULL;

e.g.

project	language	search_id	namespace_searched
wikipedia	en	qphgphgx9aducocvfql7kl09	MediaWiki
wikipedia	es	elek4r79vho4h4hn8pke3pmgz	Portal
wikipedia	es	elek4r79vho4h4hn8pke3pmgz	Project
wikipedia	es	elek4r79vho4h4hn8pke3pmgz	Anexo
wikipedia	es	elek4r79vho4h4hn8pke3pmgz	Gadget definition
wikipedia	es	elek4r79vho4h4hn8pke3pmgz	Gadget
wikipedia	es	elek4r79vho4h4hn8pke3pmgz	File
wikipedia	fr	c7b6bxjm7ggbd17478yrzly90	Help
wikipedia	es	be6eyaoy1n4slvg3qvrzbso23	Project
wikipedia	es	be6eyaoy1n4slvg3qvrzbso23	Help
wikipedia	es	be6eyaoy1n4slvg3qvrzbso23	Template
wikimedia	commons	alb6fbdmyrloy04ht282scu70	Category
wikimedia	commons	alb6fbdmyrloy04ht282scu70	File

I'll run it for searches made on June 1st and aggregate/count as a separate step.
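
The core idea of the new query (split the query string into key/value pairs and keep the `nsN` keys) can be sketched in Python as follows. This is only an illustration under assumptions: the function name and the example query string are hypothetical, and unlike the Hive query it does no bot filtering and no join against the namespace map.

```python
import re
from urllib.parse import parse_qsl

# Keys such as "ns0", "ns6", "ns14" mark the namespaces ticked in the form.
NS_KEY = re.compile(r"^ns(\d+)$")


def namespaces_from_query_string(query_string):
    """Return the sorted namespace IDs selected in a Special:Search query string."""
    ids = set()
    for key, _value in parse_qsl(query_string, keep_blank_values=True):
        match = NS_KEY.match(key)
        if match:
            ids.add(int(match.group(1)))
    return sorted(ids)


if __name__ == "__main__":
    # Made-up example of an advanced search restricted to File (6) and Category (14).
    example = "search=cats&profile=advanced&fulltext=1&ns6=1&ns14=1"
    print(namespaces_from_query_string(example))
```

Working with numeric IDs and mapping them to names afterwards avoids depending on how a namespace is named on a given wiki.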

That is great! I have access rights now, but having seen the query, they are of no use for this particular issue: this is way beyond my Hive/SQL skills! :-)

Attached:

namespace-searches-counts.csv (75 KB)

Counts of advanced Special:Search searches for each observed combination of namespaces on 1 June 2017. The proportion is the combination's share of all namespace searches within that wiki, e.g. the 1.6K Category + File searches accounted for about 21% of the advanced Special:Search searches on Commons.

Example:

project	lang	language	namespaces	searches	proportion
wikipedia	en	English	Book, Book talk, Category, Category talk, Draft, Draft talk, Education Program, Education Program talk, File, File talk, Gadget, Gadget definition, Gadget definition talk, Gadget talk, Help, Help talk, MediaWiki, MediaWiki talk, Module, Module talk, Portal, Portal talk, Project, Project talk, Talk, Template, Template talk, TimedText, TimedText talk, User, User talk	3162	0.5254
wikimedia	commons	NA	Category, File	1636	0.2077
wikimedia	commons	NA	Category, Creator, File, Help, Institution	1125	0.1428
wikimedia	commons	NA	Category, Creator, File, Help, Institution, MediaWiki talk	1122	0.1425
wikipedia	es	Spanish; Castilian	Anexo, Category, Category talk, Education Program, Education Program talk, File, File talk, Gadget, Gadget definition, Gadget definition talk, Gadget talk, Help, Help talk, MediaWiki, MediaWiki talk, Module, Module talk, Portal, Project, Project talk, Talk, Template, Template talk, User, User talk, Wikiproyecto	1092	0.4190
wikipedia	de	German	Category, Category talk, File, File talk, Gadget, Gadget definition, Gadget definition talk, Gadget talk, Help, Help talk, MediaWiki, MediaWiki talk, Module, Module talk, Portal, Portal Diskussion, Project, Project talk, Talk, Template, Template talk, User, User talk	1035	0.5302
wikimedia	commons	NA	Campaign, Campaign talk, Category, Category talk, Creator, Creator talk, Data, Data talk, File, File talk, Gadget, Gadget definition, Gadget definition talk, Gadget talk, GWToolset, GWToolset talk, Help, Help talk, Institution, Institution talk, MediaWiki, MediaWiki talk, Module, Module talk, Project, Project talk, Sequence, Sequence talk, Talk, Template, Template talk, TimedText, TimedText talk, Translations, Translations talk, User, User talk	999	0.1268
wikipedia	zh	Chinese	Category, Category talk, Draft, Draft talk, File, File talk, Gadget, Gadget definition, Gadget definition talk, Gadget talk, Help, Help talk, MediaWiki, MediaWiki talk, Module, Module talk, Portal, Portal talk, Project, Project talk, Talk, Template, Template talk, User, User talk	575	0.8110
wikimedia	commons	NA	Campaign, Campaign talk, Category, Category talk, Creator, Creator talk, Data, Data talk, File, Gadget, Gadget definition, Gadget definition talk, Gadget talk, GWToolset, GWToolset talk, Help, Help talk, Institution, Institution talk, MediaWiki, MediaWiki talk, Module, Module talk, Sequence, Sequence talk, Template, Template talk, TimedText, TimedText talk, Translations, Translations talk	558	0.0708
wikipedia	es	Spanish; Castilian	Anexo, Portal	512	0.1965

Aggregation code:

library(tidyverse)

namespaces <- read_tsv("~/Desktop/namespace-searches-edited.tsv")
# ^ not attached in the comment

data("ISO_639_2", package = "ISOcodes")

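aggregates <- namespaces %>%
  arrange(project, language, search_id, namespace_searched) %>%
  # Collapse each search to one row holding its alphabetized namespace list.
  # summarize() (not mutate()) is used so that the count below tallies
  # searches rather than search-namespace rows:
  group_by(project, language, search_id) %>%
  summarize(namespaces = paste0(namespace_searched, collapse = ", ")) %>%
  ungroup() %>%
  count(project, language, namespaces) %>%
  # Share of each combination within its wiki:
  group_by(project, language) %>%
  mutate(proportion = round(n / sum(n), 4)) %>%
  ungroup() %>%
  left_join(select(ISO_639_2, c(Alpha_2, Name)),
            by = c("language" = "Alpha_2")) %>%
  select(project, lang = language, language = Name, namespaces, searches = n, proportion) %>%
  arrange(project, lang, desc(searches))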

write_csv(aggregates, "~/Desktop/namespace-searches-counts.csv")
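
For readers without an R setup, the same kind of aggregation can be sketched in plain Python. This is a hedged illustration, not the code that produced the attached file: the function name and the sample rows are hypothetical, and the input is assumed to have one row per search and namespace (project, language, search_id, namespace). Each search is first collapsed to a single alphabetized combination so that it contributes exactly once to the counts.

```python
from collections import Counter, defaultdict


def count_combinations(rows):
    """Map (project, language, combination) to (searches, proportion within the wiki)."""
    # Step 1: gather the set of namespaces used by each individual search.
    per_search = defaultdict(set)
    for project, language, search_id, namespace in rows:
        per_search[(project, language, search_id)].add(namespace)

    # Step 2: count searches per alphabetized namespace combination.
    counts = Counter()
    for (project, language, _search_id), names in per_search.items():
        counts[(project, language, ", ".join(sorted(names)))] += 1

    # Step 3: total searches per wiki, used as the denominator.
    totals = Counter()
    for (project, language, _combo), n in counts.items():
        totals[(project, language)] += n

    return {key: (n, round(n / totals[key[:2]], 4)) for key, n in counts.items()}


if __name__ == "__main__":
    # Made-up rows: three searches on one wiki.
    sample = [
        ("wikimedia", "commons", "a", "File"),
        ("wikimedia", "commons", "a", "Category"),
        ("wikimedia", "commons", "b", "Category"),
        ("wikimedia", "commons", "b", "File"),
        ("wikimedia", "commons", "c", "Help"),
    ]
    for key, value in count_combinations(sample).items():
        print(key, value)
```

Note that `sorted` orders by code point, which may differ from a locale-aware sort for non-ASCII namespace names.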

Let me know if there's anything else you'd like to know! :) Hope this helps~

mpopov mentioned this in T165492: Find out which namespace combinations are used for searching. Jun 6 2017, 1:52 AM

Tobi_WMDE_SW moved this task from Incoming to Advanced Search on the TCB-Team (now WMDE-TechWish) board. Jun 6 2017, 10:08 AM

Thanks @mpopov, that looks great!

• Lea_WMDE moved this task from Backlog to Watching on the Advanced-Search board. Jun 7 2017, 2:41 PM

Let us know if there is anything else we can help with, @Lea_WMDE :)

@mpopov was it intentional to not include ns0, the article/main namespace in the results of the query?

@mpopov Are we correct to assume that the numbers here only reflect searches made after users clicked "Advanced" on Special:Search, and not searches made via "Content pages", "Multimedia", or "Everything"? If so, would it be possible to get the data with those searches included?

Background: We are investigating which defaults should be offered for namespace selection, and maybe the current ones are already the ones that everybody needs.

In T165861#3488876, @James_Budday wrote:

@mpopov was it intentional to not include ns0, the article/main namespace in the results of the query?

No, that was not intentional. For some reason ns0 doesn't have a name in our database, which got it excluded by accident.

FROM wmf_raw.mediawiki_project_namespace_map
WHERE snapshot = '2017-04' AND namespace_canonical_name RLIKE '^[A-Za-z\\s]+$'

Screen Shot 2017-08-09 at 10.02.51 AM.png (591×1 px, 133 KB)
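
The effect of that filter is easy to reproduce: a pattern that requires one or more letters or whitespace characters cannot match an empty name. A quick Python check, assuming the main namespace's canonical name is stored as an empty string (a missing value would be dropped by the filter as well); the function name is made up for illustration:

```python
import re

# Same pattern as the Hive filter '^[A-Za-z\\s]+$'.
NAME_FILTER = re.compile(r"^[A-Za-z\s]+$")


def passes_name_filter(canonical_name):
    """True if a namespace with this canonical name would survive the filter."""
    return canonical_name is not None and NAME_FILTER.search(canonical_name) is not None


if __name__ == "__main__":
    for name in ["Category", "Gadget definition", "", None]:
        print(repr(name), "->", passes_name_filter(name))
```

Keying on the numeric namespace ID and treating a missing name as the main namespace would be one way to keep ns0 in the results.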

@Lea_WMDE: Would it help if I fixed the query to include ns0 and uploaded a new dataset?

In T165861#3512071, @Lea_WMDE wrote:

@mpopov Are we correct to assume that the numbers here only reflect searches made after users clicked "Advanced" on Special:Search, and not searches made via "Content pages", "Multimedia", or "Everything"? If so, would it be possible to get the data with those searches included?

Which numbers? The proportions? If so, I think I can re-calculate it as the % of all searches those combinations represent.

@mpopov
In the current search interface, there are four options: Content pages, Multimedia, Everything, and Advanced. When you click on Advanced, you get the table of all namespaces and can choose them individually. If we understand your query correctly, it only looks at searches that have profile=advanced in the URL. The first three options use other profiles, though. For our purposes, we would need those searches included, too.

Bildschirmfoto 2017-08-09 um 21.30.47.png (580×1 px, 119 KB)

In T165861#3513857, @Lea_WMDE wrote:

@mpopov
In the current search interface, there are four options: Content pages, Multimedia, Everything, and Advanced. When you click on Advanced, you get the table of all namespaces and can choose them individually. If we understand your query correctly, it only looks at searches that have profile=advanced in the URL. The first three options use other profiles, though. For our purposes, we would need those searches included, too.

Even though those tabs don't let you choose namespaces?

Even though those tabs don't let you choose namespaces?

They don't? I thought that profile = a bundle of namespaces? What is a profile then?

In T165861#3515068, @Lea_WMDE wrote:

They don't? I thought that profile = a bundle of namespaces? What is a profile then?

That's fair. When I get back to this, I'll include "Content pages" (mainspace), "Multimedia" (File), "Everything" (all plus File) profiles.
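
One way to fold the profile tabs in is to treat each profile as a fixed bundle of namespaces. The sketch below is a hedged Python illustration of that idea, not the logic of the actual query: the function name is hypothetical, the profile values `default`, `images`, and `all` are assumed to be what the three tabs put in the URL, and "Content pages" is simplified to the main namespace only, as described above. Namespace IDs 0 (main) and 6 (File) are the standard MediaWiki ones.

```python
MAIN_NS = 0  # main/article namespace
FILE_NS = 6  # File namespace


def namespaces_for_search(profile, selected_ns, all_ns):
    """Resolve the namespace IDs a search covers, or None for unhandled profiles.

    profile     -- value of the "profile" URL parameter (assumed values, see above)
    selected_ns -- namespace IDs taken from nsN parameters (used for "advanced")
    all_ns      -- every namespace ID that exists on the wiki (used for "all")
    """
    if profile == "advanced":
        return sorted(set(selected_ns))
    if profile == "default":  # "Content pages", simplified to mainspace
        return [MAIN_NS]
    if profile == "images":  # "Multimedia"
        return [FILE_NS]
    if profile == "all":  # "Everything"
        return sorted(set(all_ns))
    return None  # other profiles are left out


if __name__ == "__main__":
    print(namespaces_for_search("images", [], [0, 1, 6]))
    print(namespaces_for_search("advanced", [14, 6, 14], [0, 1, 6, 14]))
```

Returning None for anything else keeps rarer profiles out of the counts instead of guessing at their namespace bundles.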

thanks @mpopov!

@mpopov Do you think you will be able to give us insights in the next few days? Our principal investigator of namespace correlations is only available until the end of next week. So if you manage to get back to it before then, it would make it much easier for us to evaluate :)

In T165861#3527622, @Lea_WMDE wrote:

@mpopov Do you think you will be able to give us insights in the next few days? Our principal investigator of namespace correlations is only available until the end of next week. So if you manage to get back to it before then, it would make it much easier for us to evaluate :)

Sure! I've rewritten the query and recounted the namespace combos. I skipped profiles like Translations and Discussions since they're rare and weird and I don't have time to figure out how to deal with each one, but I included logic for content, multimedia, and everything profiles. Attaching counts from 2017-08-01:

namespace-searches-counts.csv (188 KB)

Hope this helps!

P.S. Here's the updated query (for future me's reference):

namespaces.hql (3 KB)

Query run via CLI:

export HADOOP_HEAPSIZE=1024
hive -S \
  -d year='2017' \
  -d month='8' \
  -d day='1' \
  -f ~/namespaces.hql \
  2> /dev/null | grep -v parquet.hadoop | grep -v WARN: \
  > ~/namespace-searches.tsv

Aggregation done in R:

library(tidyverse)

namespaces <- read_tsv("~/namespace-searches.tsv") # not attached

data("ISO_639_2", package = "ISOcodes")

aggregates <- namespaces %>%
  arrange(project, language, search_id, namespace_searched) %>%
  # Collapse each search to one row holding its alphabetized namespace list.
  # summarize() (not mutate()) is used so that the count below tallies
  # searches rather than search-namespace rows:
  group_by(project, language, search_id) %>%
  summarize(namespaces = paste0(namespace_searched, collapse = ", ")) %>%
  ungroup() %>%
  count(project, language, namespaces) %>%
  left_join(select(ISO_639_2, c(Alpha_2, Name)),
            by = c("language" = "Alpha_2")) %>%
  select(project, lang = language, language = Name, namespaces, searches = n) %>%
  arrange(project, lang, desc(searches))

write_csv(aggregates, "~/namespace-searches-counts.csv") # attached

awesome, thanks @mpopov!

debt awarded a token. Aug 17 2017, 6:06 PM

thiemowmde removed a project: TCB-Team (now WMDE-TechWish). Jan 7 2022, 2:49 PM

Maintenance_bot added a project: Product-Analytics. Jan 7 2022, 3:46 PM

	F9000062: Bildschirmfoto 2017-08-09 um 21.30.47.png
	Aug 9 2017, 7:32 PM

	F8999194: Screen Shot 2017-08-09 at 10.02.51 AM.png
	Aug 9 2017, 5:08 PM

	F9090851: namespaces.hql
	Aug 16 2017, 9:47 PM
