
Advanced Search Extension Cirrus Updates are broken since the first week of July 2018
Closed, ResolvedPublic

Description

The following two tabs of the Advanced Search Extension Dashboard:

  • Special: Search
  • Search Keywords

have not been updated since about the second week of July 2018.

This probably indicates that some changes have taken place in the Cirrus search schema (Hadoop).

  • Investigate the issue;
  • Fix the dashboard update cycle;
  • Check whether it is possible to fill in the data gaps.

Event Timeline

GoranSMilovanovic updated the task description.

This was broken for me, too, but now it is working again. Did you fix it already?

@Lea_WMDE In that case, the problem is rather strange: I still can't see any data on the Special: Search and Search Keywords tabs after the first week of July or so.
Could you please provide a screenshot showing the whole month of July on any graph from either of these two tabs? Thank you.

Oh sorry, you are right. What I saw around July 16 and after was that the graphs did not load at all (infinite loading bar). That is not happening anymore, but you are right, the most recent data is missing...

@Lea_WMDE Something has changed in the Cirrus schema (Hadoop, not us), I bet. Enjoy your summer; I will take care of this.

The SQL schema changed on 1 August 2018 to include the event_deepcategory field:

mysql:research@analytics-slave.eqiad.wmnet [log]> describe AdvancedSearchRequest_18227136;
+--------------------+---------------+------+-----+---------+-------+
| Field              | Type          | Null | Key | Default | Extra |
+--------------------+---------------+------+-----+---------+-------+
| id                 | int(11)       | NO   | PRI | NULL    |       |
| uuid               | char(32)      | YES  | UNI | NULL    |       |
| dt                 | datetime      | YES  | MUL | NULL    |       |
| timestamp          | varchar(14)   | YES  | MUL | NULL    |       |
| userAgent          | varchar(1024) | YES  |     | NULL    |       |
| webHost            | varchar(1024) | YES  |     | NULL    |       |
| wiki               | varchar(1024) | YES  |     | NULL    |       |
| event_deepcategory | tinyint(1)    | YES  |     | NULL    |       |
| event_filetype     | tinyint(1)    | YES  |     | NULL    |       |
| event_hastemplate  | tinyint(1)    | YES  |     | NULL    |       |
| event_inlanguage   | tinyint(1)    | YES  |     | NULL    |       |
| event_intitle      | tinyint(1)    | YES  |     | NULL    |       |
| event_not          | tinyint(1)    | YES  |     | NULL    |       |
| event_or           | tinyint(1)    | YES  |     | NULL    |       |
| event_phrase       | tinyint(1)    | YES  |     | NULL    |       |
| event_plain        | tinyint(1)    | YES  |     | NULL    |       |
| event_subpageof    | tinyint(1)    | YES  |     | NULL    |       |
+--------------------+---------------+------+-----+---------+-------+
17 rows in set (0.00 sec)

mysql:research@analytics-slave.eqiad.wmnet [log]> describe AdvancedSearchRequest_17841562;
+-------------------+---------------+------+-----+---------+-------+
| Field             | Type          | Null | Key | Default | Extra |
+-------------------+---------------+------+-----+---------+-------+
| id                | int(11)       | NO   | PRI | NULL    |       |
| uuid              | char(32)      | YES  | UNI | NULL    |       |
| dt                | datetime      | YES  | MUL | NULL    |       |
| timestamp         | varchar(14)   | YES  | MUL | NULL    |       |
| userAgent         | varchar(1024) | YES  |     | NULL    |       |
| webHost           | varchar(1024) | YES  |     | NULL    |       |
| wiki              | varchar(1024) | YES  |     | NULL    |       |
| event_filetype    | tinyint(1)    | YES  |     | NULL    |       |
| event_hastemplate | tinyint(1)    | YES  |     | NULL    |       |
| event_inlanguage  | tinyint(1)    | YES  |     | NULL    |       |
| event_intitle     | tinyint(1)    | YES  |     | NULL    |       |
| event_not         | tinyint(1)    | YES  |     | NULL    |       |
| event_or          | tinyint(1)    | YES  |     | NULL    |       |
| event_phrase      | tinyint(1)    | YES  |     | NULL    |       |
| event_plain       | tinyint(1)    | YES  |     | NULL    |       |
| event_subpageof   | tinyint(1)    | YES  |     | NULL    |       |
+-------------------+---------------+------+-----+---------+-------+
16 rows in set (0.00 sec)
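
The difference between the two revisions can also be checked programmatically. The following is a minimal Python sketch, not the code used for the dashboard: the field lists are copied from the two describe outputs above, while the names FIELDS_17841562, FIELDS_18227136 and diff_fields are hypothetical and introduced only for illustration.

```python
# Field names copied from the two DESCRIBE outputs in this task.
FIELDS_17841562 = [
    "id", "uuid", "dt", "timestamp", "userAgent", "webHost", "wiki",
    "event_filetype", "event_hastemplate", "event_inlanguage",
    "event_intitle", "event_not", "event_or", "event_phrase",
    "event_plain", "event_subpageof",
]
# The newer revision has the same columns plus event_deepcategory.
FIELDS_18227136 = (
    FIELDS_17841562[:7] + ["event_deepcategory"] + FIELDS_17841562[7:]
)


def diff_fields(old, new):
    """Return (added, removed) field names, each sorted alphabetically."""
    old_set, new_set = set(old), set(new)
    return sorted(new_set - old_set), sorted(old_set - new_set)


added, removed = diff_fields(FIELDS_17841562, FIELDS_18227136)
print("added:", added)
print("removed:", removed)
```

Under these assumptions the only added field is event_deepcategory and nothing was removed, which matches the 16 versus 17 rows reported above.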

It is currently unclear why the update fails for the wmf_raw.CirrusSearchRequestSet table in the Data Lake. Inspecting the issue now and manually running the update up to the current date to prevent data loss.

The features that depend on the wmf_raw.CirrusSearchRequestSet schema are now updated, but the update will still be run manually until I figure out what to do about the SQL event logging schema changes.
Possible solution: a heuristic to automatically recognize schema changes (it would also serve us well for any future dashboards).

GoranSMilovanovic lowered the priority of this task from High to Medium. Aug 23 2018, 6:49 AM

Testing the update engine now. Hopefully, this is fixed.

@Lea_WMDE @RazShuty

  • The Special: Search and Search Keywords tabs will update themselves during the day, before CEST lunchtime (I guess; it depends on when one usually has lunch).
  • Data losses were prevented.
  • @Lea_WMDE The developers from the WMDE Technical Wishlist no longer have to inform me of schema changes that add new fields beginning with event_. The new code checks for all existing relevant schemata, determines the desired time range, selects the schemata within that range, and matches all schemata on their field names, taking care of inconsistencies (i.e. missing fields in older schemata). The new solution can (and will) be generalized to work with any WMDE dashboard, so I am filing this ticket in my monthly overview and invoice under Wikidata (approx. 80% of my working hours) and not Technical Wishes.
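
The schema-matching steps described in the last point can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the actual update code: the catalogue structure, the helper names select_tables, event_fields and harmonize, the shortened column lists, and all dates in the example are hypothetical.

```python
from datetime import datetime


def select_tables(catalogue, start, end):
    """Keep tables whose event time range overlaps [start, end]."""
    return [
        name
        for name, (first, last, _columns) in catalogue.items()
        if first <= end and last >= start
    ]


def event_fields(catalogue, tables):
    """Sorted union of all event_* columns across the selected tables."""
    fields = set()
    for name in tables:
        columns = catalogue[name][2]
        fields.update(c for c in columns if c.startswith("event_"))
    return sorted(fields)


def harmonize(row, fields):
    """Project a row onto the common fields; missing ones become None."""
    return {field: row.get(field) for field in fields}


# Hypothetical catalogue: table -> (first event, last event, columns).
# The dates and the shortened column lists are made up for illustration.
catalogue = {
    "AdvancedSearchRequest_17841562": (
        datetime(2018, 3, 1), datetime(2018, 7, 31),
        ["id", "dt", "event_filetype", "event_plain"],
    ),
    "AdvancedSearchRequest_18227136": (
        datetime(2018, 8, 1), datetime(2018, 8, 23),
        ["id", "dt", "event_deepcategory", "event_filetype", "event_plain"],
    ),
}

tables = select_tables(catalogue, datetime(2018, 7, 1), datetime(2018, 8, 23))
fields = event_fields(catalogue, tables)

# A row from the older table lacks event_deepcategory; it is filled with None.
old_row = {"id": 1, "event_filetype": 0, "event_plain": 1}
print(harmonize(old_row, fields))
```

Filling missing fields with None (NULL) rather than 0 keeps "the field did not exist yet" distinguishable from "the keyword was not used", which matters when the older and newer rows are aggregated together.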

GoranSMilovanovic lowered the priority of this task from Medium to Low. Aug 23 2018, 7:52 AM

  • Tested.
  • Dashboard ergonomics improved (correlation matrices).
  • Closing as resolved.