
[REQUEST] Extract search queries from HTTP_REFERER field for a Wikibook
Open, Medium, Public

Description

What's requested:

I am requesting recent search term data extracted from http_referer headers for HTTP requests to the popular chapters of the Wikibook on OpenSSH. For Google, this would be the part after ?q=; for Baidu, it might be the part after ?wd=. The chapters to investigate would be the following, listed in order of preference, the most interesting first:

Why it's requested:

Having search term data extracted from the http_referer header would help explain various spikes and surges in usage for the book and its chapters. With more clues explaining the spikes and surges, the book content can be improved. For example, the top three chapters in this book, as of 2016-09-05, do not follow the patterns they've otherwise held for months:

https://tools.wmflabs.org/pageviews/?project=en.wikibooks.org&platform=all-access&agent=user&range=latest-60&pages=OpenSSH|OpenSSH/Cookbook/Proxies_and_Jump_Hosts|OpenSSH/Cookbook/Multiplexing|OpenSSH/Logging_and_Troubleshooting|OpenSSH/Cookbook/File_Transfer_with_SFTP|OpenSSH/Cookbook/Host-based_Authentication|OpenSSH/Cookbook/Public_Key_Authentication|OpenSSH/Client_Configuration_Files

When it's requested:

Data from any time in 2016 would be helpful in answering questions about the current spikes and surges. However, even early 2017 might be OK for other reasons.

Other helpful information:

Google Search Console might be another option, at least for the Google referrers. However, it requires adding a "meta" element to the HTML head, or else uploading a verification file.

Event Timeline

Tbayer added subscribers: APalmer_WMF, leila, Tbayer.

(For cross-reference, the Analytics-l thread which gave rise to this request: https://lists.wikimedia.org/pipermail/analytics/2016-September/thread.html#5379 )

Hi Lars,

this is an interesting idea and if it could help editors to improve content by better addressing the main information needs of readers that come to a page, that would be awesome.

First a clerical note: I understand you filed this request following the instructions in the Research FAQ. As mentioned there, the WMF team that deals with referred traffic, search engine ranking etc. is the Discovery team, not the Reading team (for which I'm the data analyst). That does not mean that they will be able to prioritize such requests, or conversely that I or other people with access may not be able to chip in as well. But normally they should be the first point of contact for such a request.

Legal issues: By default we need to treat referrers as private. I think that it should be fine to release e.g. all search terms for a page that occur with a certain minimum frequency, say >100 times/month. But this should be confirmed with the WMF Legal team - I'm CCing @APalmer_WMF as their privacy expert. If you can negotiate their OK, I or other data analysts can then pick it up from there.

As you described on the mailing list, the referer field from the webrequest table contains the information you are interested in. If you can construct a concrete HQL query to extract it and limit the result to sufficiently frequent terms per the legal concerns mentioned above, I would be happy to run it for you this time. (Again, it's outside my core work area, and I may not have time to help a lot with debugging.) By pure coincidence, I just posted related webrequest queries here, which may help you construct yours. They are about extracting the referrer domain rather than search strings; for the latter, you should be able to use e.g. the SUBSTRING and LOCATE functions.
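
For illustration, a rough sketch of how that might look (untested; it assumes the wmf.webrequest table from the documentation, handles only ?q= as a simplification, and uses the >100/month example threshold from above):

SELECT search_term, COUNT(*) AS hits
FROM (
  SELECT
    SUBSTRING(
      referer,
      LOCATE('?q=', referer) + 3,
      -- the term runs up to the next '&', or to the end of the referer
      CASE WHEN LOCATE('&', referer, LOCATE('?q=', referer)) > 0
           THEN LOCATE('&', referer, LOCATE('?q=', referer)) - LOCATE('?q=', referer) - 3
           ELSE LENGTH(referer)
      END
    ) AS search_term
  FROM wmf.webrequest
  WHERE year = 2016
    AND month = 9
    AND uri_host = 'en.wikibooks.org'
    AND uri_path LIKE '/wiki/OpenSSH%'
    AND LOCATE('?q=', referer) > 0  -- simplification: matches ?q= but not &q=
) terms
GROUP BY search_term
HAVING COUNT(*) > 100  -- the example minimum frequency from above
ORDER BY hits DESC;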

(Also CCing @leila , who suggested on the mailing list to make this into a larger research collaboration.)

On 09/08/2016 11:10 AM, Tbayer wrote:

First a clerical note: I understand you filed this request following the instructions in the Research FAQ https://meta.wikimedia.org/wiki/Research:FAQ. ... But normally they should be the first point of contact for such a request.

Thanks. Noted.

Legal issues: By default we need to treat referrers as private.

Yes. That's pretty much expected.

I think that it should be fine to release e.g. all search terms for a page that occur with a certain minimum frequency, say >100 times/month.

No information beyond the bare search terms and the Wikibook pages they arrive at is wanted or needed for my part. I'm thinking that the search terms alone, completely decoupled from the rest of the referrer so that the address of origin and other possibly sensitive information is removed, would be more than adequately sanitized.

Many of the pages I would like to examine have just under 1,500 views per month according to the Pageviews tool, and the rest under 3,000. So it is very unlikely that any given search term would meet a limit as high as 100 occurrences per month; if a limit is needed, it should be a much lower number.

As the primary author and maintainer of a Wikibook, the edge cases are perhaps more interesting to me, since they indicate where the book needs more attention. While I expect (W.A.G.) that few of the low-frequency search terms will be highly relevant, a few should provide very good insight and pointers about which material to cover or how to cover it. The high-volume search terms could be useful (again, W.A.G.) in confirming suspicions about use, but either way some data is needed to confirm either idea.

But this should be confirmed with the WMF Legal team - I'm CCing @APalmer_WMF as their privacy expert. If you can negotiate their OK, I or other data analysts can then pick it up from there.

Thanks. Again, the search terms are the small subset of the referrer containing what the visitors say they are looking for, not where they are coming from or anything else. Brutal use of the SUBSTRING function you mention below should work to remove everything else from the HTTP_REFERER header.

The following data, I think, is pretty dry even when combined.

  • date of access (rounded to week or month is fine)
  • search term (not the whole query and definitely not the whole header)
  • Wikibook destination chapter (or book, but chapter is better)

I've thought about these and think they should be safe. However, I expect that the privacy experts have much more insight and experience regarding potential pitfalls.

As you described on the mailing list, the referer field from the webrequest table https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest contains the information you are interested in. ...

I appreciate that very much, and the links too. It may take some time for me to get up to speed on what to extract and how to express it in HQL.

Many of the pages I would like to examine have just under 1,500 views per month according to the Pageviews tool, and the rest under 3,000. ... If a limit is needed, it should be a much lower number.

The following data, I think, is pretty dry even when combined.

  • date of access (rounded to week or month is fine)
  • search term (not the whole query and definitely not the whole header)
  • Wikibook destination chapter (or book, but chapter is better)

OK, based on off-ticket input from the Legal team, my understanding is now that this should be fine even without restriction to a minimum frequency.

I appreciate that very much, and the links too. It may take some time for me to get up to speed on what to extract and how to express it in HQL.

Cool - and feel free to ask questions if you get stuck with assembling the query, either here or on the #wikimedia-analytics IRC channel on Freenode (I'm "HaeB" there). No promises that (or how soon) we will be able to help, but if there is a quick answer I am happy to give it.

PS, two more remarks:

  • A caveat about the proposed approach: many referrals from Google come without the query part of the URL, which is probably an additional argument for not imposing a frequency limit.
  • There is also the possibility that we could get some information from Google Search Console (webmaster tools). I don't have access to the en.wikibooks domain there, though, and I'm not even sure it is registered. Also, as far as I know, the information there is not complete (e.g. for en.wikipedia.org, for which I have access, it lists "openssl" as the only search term for https://en.wikipedia.org/wiki/OpenSSL , and nothing for https://en.wikipedia.org/wiki/Forward_secrecy ).

Google Search Console (webmaster tools) might be useful. Which Wikimedia group would that fall under?

About the query part of the URI being empty: that is probably the case for nearly all of the GET requests. I was able to get a small log sample from a very active web site, and only a very, very small proportion of the URIs there had any referrer info. But maybe that's enough.

Anyway, using that sample and what is described in Google's documentation ( https://support.google.com/analytics/answer/2795821?hl=en#searchEngine ), I used a Regular Expression Test Page for Java ( http://www.regexplanet.com/advanced/java/ ) to take a stab at a pattern. I'm no expert at regex, so this is just a draft for discussion:

\b(?:q=|query=|qs=|qt=|p=|k=|encquery=|rdata=|search_word=|szukaj=|terms=|text=|wd=|words=)([\w\%\+]+)\b

Then, I am guessing, that would fit into a query something like this:

pseudo code:

SELECT
  year,
  month,
  uri_path,
  referer REGEXP '\b(?:q=|query=|qs=|qt=|p=|k=|encquery=|rdata=|search_word=|szukaj=|terms=|text=|wd=|words=)([\w\+]+)\b'
FROM
  wmf_raw.webrequest_sequence_stats_hourly
WHERE
  is_pageview = true
  AND agent_type = "user"
  AND http_method = "GET"
  AND http_status = "200"
  AND uri_host = "en.wikibooks.org"
  AND uri_path LIKE "/wiki/OpenSSH%"
  AND referer REGEXP '\bq=|query=|qs=|qt=|p=|k=|encquery=|rdata=|search_word=|szukaj=|terms=|text=|wd=|words='
  AND year = 2016
  AND month >= 08
  AND referer != ""
ORDER BY
  referer, year, month
GROUP BY
  uri_path
LIMIT 1000;

Again, I am guessing about the SELECT and that REGEXP would filter in only the desired data.

Google Search Console (webmaster tools) might be useful. Which Wikimedia group would that fall under?

Again, due to our privacy policy we do not send data to third parties. So we do not use any external analytics solutions.


There may be some confusion between Google Analytics and Google Search Console here. As mentioned above, we do actually have data from Search Console, which is collected by Google on their servers before users reach our site. If the difference is still unclear, perhaps start a thread on Analytics-l - but there is also a lot of documentation about this on the web.

Milimetric triaged this task as Medium priority. Sep 19 2016, 3:49 PM
Milimetric moved this task from Incoming to Radar on the Analytics board.
Milimetric added a project: good first task.
debt subscribed.

I'm not exactly sure what the Discovery Analysis team can do right now for this ticket... removing it from our board for now. Please re-add us if there is anything we can do specifically.

Then, I am guessing, that would fit into a query something like this:

...

Again, I am guessing about the SELECT and that REGEXP would filter in only the desired data.

This query does not work, for several reasons. As I said above, I unfortunately don't have time to help a lot with debugging (or to act as a coder who turns pseudocode drafted for discussion into actual queries). A few remarks, though, to leave you not entirely without feedback:

  • My suggestion was to start simple and try SUBSTRING and LOCATE. It seems you have opted for a more elaborate approach using regexes. Can you point me to the part of the HQL documentation where you found this use of REGEXP for search and replace (i.e. resulting in a string)? SELECT a REGEXP b results in a boolean, not a string as you seem to intend.
  • What's more, the second use of REGEXP - in the WHERE clause, where a boolean is actually intended - does not work either, failing to identify the following example referer:
 SELECT 'http://www.google.com/search?q=example+search+term&btnG=&hl=en' REGEXP '\bq=|query=|qs=|qt=|p=|k=|encquery=|rdata=|search_word=|szukaj=|terms=|text=|wd=|words=';
_c0
false
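
(A likely culprit, assuming standard Hive string-literal handling: the \b is consumed by the string parser - it turns into a backspace character before the regex engine ever sees it - so the backslash would need doubling, e.g. with the alternation trimmed for brevity:

 SELECT 'http://www.google.com/search?q=example+search+term&btnG=&hl=en' REGEXP '\\bq=|query=|wd=';

which should then evaluate to true.)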

BTW, how about starting simple and making it work for "q=" first? Not sure why one would need to cover all eventualities on the first try. (Once you provide a working query, we can easily re-run it in modified form.)

  • It's unclear why the proposed query accesses the table wmf_raw.webrequest_sequence_stats_hourly instead of just the webrequest table (the documentation of which I linked above).

My suggestion was to start simple and try SUBSTRING and LOCATE. It seems you have opted for a more elaborate approach using regexes. ...

Since I just needed to look up this function for other reasons: if you prefer to extract the search terms using a full-blown regex, that can actually be done in Hive using the REGEXP_EXTRACT function. For the above-mentioned example referer:

SELECT REGEXP_EXTRACT('http://www.google.com/search?q=example+search+term&btnG=&hl=en', 'q=([^&]*)&');
_c0
example+search+term
1 row selected (0.123 seconds)

This should already be enough for piecing together a basic version of the desired query. I may take a stab at that once I have a bit more time, but feel free to have a go at it too.
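
For illustration, such a basic version might look roughly like this (an untested sketch rather than the promised stab: it restricts itself to the q= parameter and assumes the wmf.webrequest table from the documentation linked above):

SELECT
  uri_path,
  REGEXP_EXTRACT(referer, '[?&]q=([^&]*)', 1) AS search_term,
  COUNT(*) AS hits
FROM wmf.webrequest
WHERE year = 2016
  AND month = 9
  AND uri_host = 'en.wikibooks.org'
  AND uri_path LIKE '/wiki/OpenSSH%'
  AND is_pageview = true
  AND agent_type = 'user'
  AND referer RLIKE '[?&]q=[^&]+'  -- keep only referers that carry a q= term
GROUP BY
  uri_path,
  REGEXP_EXTRACT(referer, '[?&]q=([^&]*)', 1)
ORDER BY hits DESC
LIMIT 1000;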

PS: or even easier, use the PARSE_URL function described in the same section of the documentation linked above.
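
For the same example referer, that should yield the term directly (per Hive's PARSE_URL(url, 'QUERY', key) signature):

SELECT PARSE_URL('http://www.google.com/search?q=example+search+term&btnG=&hl=en', 'QUERY', 'q');

which should return example+search+term.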

Thanks.

(I don't suppose there is a test database with sample data to develop queries against? I know some of the expensive subscription databases used to offer training sets that were useless except for learning the ropes.)

If I understand the PARSE_URL function, then we could try the following against the webrequest table:

SELECT
  year,
  month,
  uri_path,
  PARSE_URL(referer, 'QUERY', 'q') AS search_term,
  COUNT(*) AS hits

FROM
  wmf.webrequest

WHERE
  is_pageview = true
  AND agent_type = "user"
  AND http_method = "GET"
  AND http_status = "200"
  AND uri_host = "en.wikibooks.org"
  AND uri_path LIKE "/wiki/OpenSSH%"
  -- PARSE_URL returns NULL when the referer has no q= parameter
  AND PARSE_URL(referer, 'QUERY', 'q') IS NOT NULL
  AND year = 2016
  AND month >= 9

GROUP BY
  year, month, uri_path, PARSE_URL(referer, 'QUERY', 'q')

ORDER BY
  year, month, uri_path

LIMIT 1000;

(I don't suppose there is a test database with sample data to develop queries against? I know some of the expensive subscription databases used to offer training sets useless except for learning the ropes.)

No, we have tried to do this on vagrant/labs a couple of times without much success. It would be easy to offer a "canned" cluster, but plugging in labs Kafka + pageviews + a testing cluster (so schemas and such evolve just as they do in prod) has proven hard operationally. It seems that our infrastructure in beta labs cannot easily sustain even a test replica of prod. cc @Ottomata