Analyze access log to see whether we need to add filetype: aliases
Closed, Resolved · Public

Description

In T150887, we added the ability to define aliases for filetype: searches. We'd like to go through the access logs for sites where such searches are used (commons, maybe enwiki and others?) and see whether unsupported types are used with filetype: search. If we identify frequently used unsupported types, we should add support for them in the configuration.

We probably want to do this sometime in January, so that we have a large enough data set to rely on.

Supported types are listed here: https://www.mediawiki.org/wiki/Help:CirrusSearch#filetype

Event Timeline

Smalyshev created this task. · Dec 8 2016, 9:27 PM
Restricted Application edited projects, added Discovery-Search; removed Discovery-Search (Current work). · Dec 8 2016, 9:27 PM
Smalyshev edited subscribers, added: mpopov, chelsyx; removed: gerritbot. · Dec 8 2016, 9:29 PM
debt moved this task from Needs triage to Later on the Discovery-Analysis board. · Dec 15 2016, 9:17 PM
debt moved this task from Later to Up Next on the Discovery-Analysis board. · Jan 19 2017, 9:06 PM
debt moved this task from Up Next to Current work on the Discovery-Analysis board.

Results (source = "web"; I didn't exclude automata):

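| Filetype | Counts |
| --- | --- |
| pdf | 451 |
| video | 353 |
| ppt | 49 |
| multimedia | 48 |
| audio | 31 |
| doc | 30 |
| drawing | 26 |
| eml | 19 |
| image | 14 |
| jpg | 12 |
| svg | 12 |
| xml | 9 |
| mp3 | 8 |
| webp | 8 |
| gif | 7 |
| txt | 7 |
| PDF | 6 |
| fb | 6 |
| torrent | 6 |
| png | 5 |
| sql | 5 |
| rcf | 5 |
| swf | 5 |
| xls | 5 |
| docx | 5 |
| webm | 4 |
| bitmap | 4 |
| vector | 4 |
| wav | 4 |
| jpeg | 3 |
| mime | 3 |
| ogg | 3 |
| pdfinurl | 3 |
| WEBP | 3 |
| epub | 2 |
| multime | 2 |
| csv | 2 |
| msi | 2 |
| docs | 2 |
| pof | 2 |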
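
The counts above group on the keyword exactly as typed, so pdf and PDF (and webp and WEBP) appear as separate rows. As a rough sketch of how the picture changes if case is folded, the snippet below re-aggregates a few rows copied from the table; the helper name `fold_case` is made up for illustration, and whether case should be folded at all is an assumption, not something decided in this task.

```python
from collections import Counter

# (filetype, count) pairs copied from the results table; only rows with
# case variants (plus jpg/jpeg for comparison) are included here.
rows = [("pdf", 451), ("PDF", 6), ("webp", 8), ("WEBP", 3), ("jpg", 12), ("jpeg", 3)]


def fold_case(pairs):
    """Sum counts for keywords that differ only by letter case."""
    folded = Counter()
    for name, count in pairs:
        folded[name.lower()] += count
    return folded


folded = fold_case(rows)
print(folded["pdf"])   # 451 + 6 = 457
print(folded["webp"])  # 8 + 3 = 11
```

Folding case does not merge spelling variants such as jpg and jpeg; those would need an explicit alias.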

Query:

-- Count the keywords used with filetype: in web search queries logged in
-- December 2016 and January 2017. Each request set is exploded into one row
-- per query string; the top 40 keywords are returned, case-sensitively.
SELECT regexp_extract(ex_query, '\\bfiletype\\:(\\w+)\\b', 1) AS filetype, COUNT(*) AS Counts
FROM
(SELECT ex_query
  FROM wmf_raw.CirrusSearchRequestSet
  LATERAL VIEW explode(requests.query) exploded_tbl AS ex_query
  WHERE ((year = 2016 AND month = 12) OR (year = 2017 AND month = 1))
  AND source = 'web'
) AS tb1
WHERE ex_query RLIKE('\\bfiletype\\:(\\w+)\\b')
GROUP BY regexp_extract(ex_query, '\\bfiletype\\:(\\w+)\\b', 1)
ORDER BY Counts DESC
LIMIT 40;
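
To get a feel for what the pattern in the query captures, here is a small Python approximation. The sample queries are hypothetical (not taken from the logs), `extract_filetype` is an illustrative name, and Python's `re` is only assumed to behave like Hive's Java regex engine for this simple pattern.

```python
import re

# Approximation of the Hive pattern '\\bfiletype\\:(\\w+)\\b'. re.ASCII limits
# \w and \b to ASCII, which is Java's default (an assumption about how closely
# this mirrors the Hive behaviour).
FILETYPE_RE = re.compile(r"\bfiletype:(\w+)\b", re.ASCII)


def extract_filetype(query):
    """Return the first filetype: keyword in the query, or None if absent."""
    match = FILETYPE_RE.search(query)
    return match.group(1) if match else None


# Hypothetical queries for illustration only.
samples = [
    "filetype:pdf annual report",  # plain keyword
    "sunset filetype:PDF",         # case is preserved in the capture
    "filetype:image/png cat",      # \w+ stops at the slash
    "filetype: pdf",               # space after the colon: no capture
    "myfiletype:pdf",              # no word boundary before 'filetype'
]
for q in samples:
    print(repr(q), "->", extract_filetype(q))
```

One consequence visible here is that a MIME-style value such as image/png would be counted under image, since `\w+` does not match the slash.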
Smalyshev closed this task as Resolved. · Jan 26 2017, 7:55 PM

Thanks, this is great. Next steps are in T156413.