Page MenuHomePhabricator

Count keyword usage over time
Closed, InvalidPublic

Description

For the parent ticket to be realized, please implement the following:

  • For each search that
    • was made through a web request
    • happened on the wikimedia wikis
    • included keywords that are (or will be) accessible through advancedSearch: file_type, in_title, subpageof, inlanguage, hastemplate, deepcategory
    • no matter where the search was executed from

Please record:

  • Which keywords it included
  • Whether the search was submitted from special:search from a person that has advancedSearch enabled.

Note
For people who have advancedSearch enabled, we are already counting keywords when the search was submitted from specialpage:search. Here we are interested in the overall searches (by humans). It might be worth refactoring the old code to do that as well, though (without taking away any of the current functionality)

Event Timeline

Lea_WMDE triaged this task as Medium priority.Feb 21 2018, 4:45 PM
Lea_WMDE created this task.
Lea_WMDE moved this task from Backlog to Tickets ready for pickup on the Advanced-Search board.

@Lea_WMDE This could be implemented in a one single event logging table that we would then use for the (1) Advanced Search Extension dashboard as we have it now as well as for any (2) new features that we want to have there (or on some new dashboard).

However decides to develop event logging for this, please get in touch so that we can design the SQL table together before the event logging code is rolled out. Thanks!

@GoranSMilovanovic do you just want to go ahead and make a suggestion how the SQL table should look like? Or else whoever picks this up from the FUN team could send you a suggestion or you two do it together.

@Lea_WMDE It would be better if someone from the FUN team could provide the first draft of the table design.

Ratio:

  • there will be constraints on both sides (FUN team - what can be "eventLogged" from PHP/JS at all; what is difficult and time consuming and what is not; Analytics - what do I need in order to deliver the desired data product, what design features/columns/keys are a must have and what not; etc)
  • then we can go through several iterations explaining to each other our needs and motivations; in my viewpoint, this is how mutual understanding develops (especially with one of us working remotely).

@dcausse @EBernhardson @Milimetric Do you know by any chance if we are already logging keyword use continuously? And how often the Special:Search page is called?

dcausse added a comment.EditedFeb 27 2018, 1:50 PM

@Lea_WMDE yes all this data is available in hive, updated hourly (with 90 days of historical data).

For keyword usage:

SELECT
    csr.source,
    asyn as kw,
    count(*) as cnt
FROM
    CirrusSearchRequestSet csr
    LATERAL VIEW EXPLODE(requests) req AS areq
    LATERAL VIEW EXPLODE(split(areq.payload["syntax"], ",")) syn as asyn
WHERE
    year = YYYY AND month = MM AND day = DD AND -- hour = HH AND
    areq.queryType = 'full_text'
GROUP BY csr.source, asyn
ORDER BY kw, csr.source LIMIT 500;

For Special:Search usage what are you looking for? If it's simple quantitave data I think that it's available here.

@dcausse awesome, thank you!
@GoranSMilovanovic can you check the sources out and see if there is anything left that you need to be done?

@dcausse @EBernhardson @Milimetric I will need your advise on this. Please let me know if I am missing something (I was relying on the Analytics/Data Lake/Traffic/Cirrus page to study the schema).

@Lea_WMDE In respect to the description of the task, my conclusions are:

  • Which keywords it included. We can have the keywords, of course. For example:
full_text  full_text_querystring full_text_simple_match             near_match           query_string 
                    95                     18                     77                     22                     10

keeping queryType = 'full_text' as @dcausse did. The counts in the example above are derived from a small sample of observations only.

  • Whether the person had advancedSearch enabled or not. This table does not encompass any user Id information, so we cannot match it to any other table that could tells us whether the user did or did not install the Advanced Search Extension. One possibility that comes to mind is the following: if the requests.payload field records the Advanced Search Extension keywords by default, then anytime we find an extension specific keyword we know that the extension was used. This is not an answer to the question in the description, of course, because any user might have installed the extension without really using any of the extension specific keywords, I guess. However, to answer to your team's needs directly, I would need to have a user Id and a table which records the extensions installed by a specific user to join.
  • I can't see any obvious way to differentiate between human and bot searches from this table, except if the source field is indicative of bot behavior (I am guessing here), e.g. if api, cli, or web values can help determine whether a search came from a human or a bot.
  • Whether the person had advancedSearch enabled or not. This table does not encompass any user Id information, so we cannot match it to any other table that could tells us whether the user did or did not install the Advanced Search Extension. One possibility that comes to mind is the following: if the requests.payload field records the Advanced Search Extension keywords by default, then anytime we find an extension specific keyword we know that the extension was used. This is not an answer to the question in the description, of course, because any user might have installed the extension without really using any of the extension specific keywords, I guess. However, to answer to your team's needs directly, I would need to have a user Id and a table which records the extensions installed by a specific user to join.

It depends on what you want to track but currently Cirrus only receives a search request from the browser. IIRC we store the request URI in the payload, would there be something in the request sent that could help identify if AdvancedSearch was used? If no I guess you'll have to pass a markup in the URI to track this information.

  • I can't see any obvious way to differentiate between human and bot searches from this table, except if the source field is indicative of bot behavior (I am guessing here), e.g. if api, cli, or web values can help determine whether a search came from a human or a bot.

For bot filtering I'd look into the available UDFs to see if there's something you could reuse to filter bots based on the User-Agent.
But for now you can easily filter on web requests since AdvancedSearch is a web interface (unless you want all human search metrics and not only AdvancedSearch usage). We have bots using the web interfaces but those ones are extremely hard to filter since they frequently fake a web browser.

@Lea_WMDE The following:

It depends on what you want to track but currently Cirrus only receives a search request from the browser. IIRC we store the request URI in the payload, would there be something in the request sent that could help identify if AdvancedSearch was used? If no I guess you'll have to pass a markup in the URI to track this information.

needs to be answered by some mediawiki developer for the Advanced Search Extension - I can't help around this.

As of the following:

For bot filtering I'd look into the available UDFs to see if there's something you could reuse to filter bots based on the User-Agent.

The cirrusrsearchrequest.useragent field does not differentiate between bot and human users. (NOTE: typically, User-Agent in our databases has a user-is-bot field or something similar). Again, I would say the Advanced Search Extension development team should make an assessment on whether is it possible to pass something that would differentiate - at least approximately - between human and bot users in this case.

@dcausse Thank you very much for your assessment of our data and the suggestions that you have provided.

Lea_WMDE updated the task description. (Show Details)Mar 13 2018, 6:50 PM
Lea_WMDE closed this task as Invalid.Mar 13 2018, 6:53 PM

We now know how the data should look like. This ticket was meant for the backend part that might have needed to be done for @GoranSMilovanovic to display the graphs. Since all info is already available, I suggest we switch our focus to the frontend part, i.e. T187039: Measure change in keyword usage with AdvancedSearch