Page MenuHomePhabricator

Queries for the edit analysis dashboard failing since December 2015 [5 pts]
Closed, ResolvedPublic

Description

The edit analysis dashboard does not show any data after 4 December 2015.

It seems like the queries started failing at that point for some reason. There's a failure rate by type query that's running right now on analytics-store (and has been for 90 minutes), but it's the query for 4 December 2015! I've been hanging out in the processlist a lot recently (partly for reasons related to T123634) and I've seen queries for this dashboard running for similar periods of time during the past couple of weeks, but I didn't have a reason to look closely at them.

It's about time to update these queries to remove references to the Edit_11448630 table, which has stopped receiving events, and I can definitely do that myself, but I don't think that has anything to do with this issue.

Anyway, I don't think this is unbreak now because we've already lived without it for a month and a half—I imagine we can keep doing that for a bit longer :) @Jdforrester-WMF, is that accurate?

Output for SHOW FULL PROCESSLIST:

|  9371156 | research | 10.64.36.103:54651 | log    | Query   | 5543 | Queried about 680000 rows   | 
select day,
        sum(if(actions like '%saveAttempt%saveFailure%', repeated, 0)) /
        sum(if(actions like '%saveAttempt%', repeated, 0)) as 'total',
        sum(if(actions like '%saveAttempt%saveFailure%' and failures like '%userBadToken%', repeated, 0)) /
        sum(if(actions like '%saveAttempt%', repeated, 0)) as 'bad-token',
        sum(if(actions like '%saveAttempt%saveFailure%' and failures like '%userNewUser%', repeated, 0)) /
        sum(if(actions like '%saveAttempt%', repeated, 0)) as 'new-user',
        sum(if(actions like '%saveAttempt%saveFailure%' and failures like '%extensionAbuseFilter%', repeated, 0)) /
        sum(if(actions like '%saveAttempt%', repeated, 0)) as 'abuse-filter',
        sum(if(actions like '%saveAttempt%saveFailure%' and failures like '%extensionCaptcha%', repeated, 0)) /
        sum(if(actions like '%saveAttempt%', repeated, 0)) as 'captcha',
        sum(if(actions like '%saveAttempt%saveFailure%' and failures like '%extensionSpamBlacklist%', repeated, 0)) /
        sum(if(actions like '%saveAttempt%', repeated, 0)) as 'spam-blacklist',
        sum(if(actions like '%saveAttempt%saveFailure%' and failures like '%responseEmpty%', repeated, 0)) /
        sum(if(actions like '%saveAttempt%', repeated, 0)) as 'response-empty',
        sum(if(actions like '%saveAttempt%saveFailure%' and failures like '%responseUnknown%', repeated, 0)) /
        sum(if(actions like '%saveAttempt%', repeated, 0)) as 'response-unknown',
        sum(if(actions like '%saveAttempt%saveFailure%' and failures like '%editPageDeleted%', repeated, 0)) /
        sum(if(actions like '%saveAttempt%', repeated, 0)) as 'page-deleted',
        sum(if(actions like '%saveAttempt%saveFailure%' and failures like '%editConflict%', repeated, 0)) /
        sum(if(actions like '%saveAttempt%', repeated, 0)) as 'conflict'
from (
   select
      day,
      actions,
      failures,
      count(*) as repeated
   from (
      select
         date(timestamp) as day,
         group_concat(event_action order by timestamp, action_order.ord separator '-') as actions,
         group_concat(`event_action.saveFailure.type` order by timestamp separator '-') as failures
      from (
         select
            timestamp,
            event_action,
            `event_action.saveFailure.type`,
            clientIp,
            event_editingSessionId
         from Edit_11448630
         where
            event_editor = 'visualeditor' and
            ('all' = 'all' or wiki = 'all') and
            timestamp >= '20151204000000' and
            timestamp < '20151205000000'

         # Add in events using the new schema. To be removed once everything
         # is switched over.
         union all

         select
            timestamp,
            event_action,
            `event_action.saveFailure.type`,
            clientIp,
            event_editingSessionId
         from Edit_13457736
         where
            event_editor = 'visualeditor' and
            ('all' = 'all' or wiki = 'all') and
            timestamp >= '20151204000000' and
            timestamp < '20151205000000'
         ) raw_events

         inner join
            (select 'init' as action, 0 as ord
                union all
             select 'ready' as action, 1 as ord
                union all
             select 'saveIntent' as action, 2 as ord
                union all
             select 'saveAttempt' as action, 3 as ord
                union all
             select 'saveFailure' as action, 4 as ord
                union all
             select 'saveSuccess' as action, 5 as ord
                union all
             select 'abort' as action, 6 as ord
            ) action_order
         on action = event_action

         # client side, the ip is set but the editingSessionId is bad
         # due to a bad library in Safari, it creates duplicates
         # server side, the ip is not set, but editingSessionId is good
         group by if(
           if( 'visualeditor' = 'wikitext', 0, 1 ),
           concat( coalesce( clientIp, '' ), event_editingSessionId ),
           event_editingSessionId
        )

         # this purposefully ignores sessions that start before "from_timestamp",
         # as well as some other sessions that legitimately do not have an "init"
         # as well as sessions that have an "init" after "ready"
         having actions like 'init%'
   ) sessions
   group by day, actions, failures
) sessions_by_occurrences

Event Timeline

nshahquinn-wmf raised the priority of this task from to Needs Triage.
nshahquinn-wmf updated the task description. (Show Details)
nshahquinn-wmf renamed this task from Data on the edit analysis dashboard stops in December 2015 to Queries for the edit analysis dashboard failing since December 2015.Jan 22 2016, 2:55 AM
nshahquinn-wmf set Security to None.

@Milimetric, I'd be happy to do some joint investigation with you if you're in the mood to transfer knowledge!

Interesting...it's now on to the same query for 5 December. So conceivably the queries aren't failing, they're just really far behind (and this failures query across all wikis takes the longest, so it's the one I keep seeing).

Hm... all of a sudden around that time we've had lots of db issues. We were busy putting out fires last week, so this seems to me like another victim of the performance problem that started around mid December. In general I think this dashboard and other reporting methods get data in a very inefficient way.

I'm not sure there's any specific cause, but I don't see anything else wrong. We've added a couple of new dashboards, but nothing that would break the system. I'll add this to our board to look into.

Milimetric triaged this task as High priority.
Milimetric edited projects, added Analytics-Kanban; removed Analytics.
Milimetric renamed this task from Queries for the edit analysis dashboard failing since December 2015 to Queries for the edit analysis dashboard failing since December 2015 [5 pts].Jan 25 2016, 5:43 PM

I've looked a bit into it.

The queries really are taking too long. The one mentioned in the task description is taking around 5-6 hours to compute a day of data. And this only for a given metric, wiki and editor. Considering we have 5 metrics, 51 wikis and 2 editors, this is not going to stabilize any time soon.

And this has been going on for some time, the majority of the metrics stopped updating on Dec 1st, 2015; but there are some files that haven't been updated since October. There is only 1 file that is slowly catching up: all-wiki failure_rates_by_user_type for visual editor, the one in the description. But this alone is hardly helpful.

The conclusion I draw, which is also confirmed by other recent issues, is that the Edit table in the analytics-slave has gotten too big to query (see T124676). We should maybe concentrate on solving that problem and stop edit analysis dashboard until then.

Change 267245 had a related patch set uploaded (by Mforns):
Disable reports until database is able to respond queries

https://gerrit.wikimedia.org/r/267245

If everybody agrees with disabling reports until the database can respond to the queries, there is the change.

From my perspective, that seems reasonable since I'm working on T118063. Any objections, @Jdforrester-WMF?

Change 267245 merged by Mforns:
Disable reports until database is able to respond queries

https://gerrit.wikimedia.org/r/267245

nshahquinn-wmf raised the priority of this task from High to Needs Triage.Mar 30 2018, 10:30 AM
nshahquinn-wmf moved this task from Backlog to Radar on the Contributors-Analysis board.