Page MenuHomePhabricator

Cron on stat1003 for mobile data is causing an avalanche of queries on dbstore1002
Closed, ResolvedPublic

Description

We had to disable temporarily on stat1003 the cron job:

0 * * * * python /srv/limn-mobile-data/generate.py /srv/limn-mobile-data/mobile/ >> /var/log/limn-data/limn-mobile-data.log 2>&1

because it is taking over 4 hours to execute, causing an avalanche of queries on dbstore1002, generating lag without an actual outcome.

The queries this script was executing were:

SELECT DATE_FORMAT(CONCAT(Month.month, '01'), '%Y-%m-%d') AS Month, COALESCE(Main.count, 0) AS Main, COALESCE(Other.count, 0) AS Other -- http://stackoverflow.com/a/6871220/365238 -- ... using MariaDB 10 SEQUENCE engine instead of information_schema.columns FROM ( SELECT EXTRACT(YEAR_MONTH FROM SUBDATE(CURDATE(), INTERVAL @num:=@num+1 MONTH)) AS month FROM seq_1_to_12, (SELECT @num:=-1) num LIMIT 12 ) AS Month LEFT JOIN ( SELECT EXTRACT(YEAR_MONTH FROM timestamp) AS month, COUNT(DISTINCT event_username) AS count FROM (SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount FROM MobileWebEditing_5644223 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount FROM MobileWebEditing_6077315 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_6637866 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_7675117 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_8599025) as MobileWebEditing WHERE event_action = 'success' AND event_namespace = 0 AND wiki != 'testwiki' GROUP BY month ) AS Main ON Month.month = Main.month LEFT JOIN ( SELECT EXTRACT(YEAR_MONTH FROM timestamp) AS month, COUNT(DISTINCT event_username) AS count FROM (SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount FROM MobileWebEditing_5644223 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount FROM MobileWebEditing_6077315 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_6637866 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_7675117 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_8599025) as MobileWebEditing WHERE event_action = 'success' AND event_namespace != 0 AND wiki != 'testwiki' GROUP BY month ) AS Other ON Month.month = Other.monthSELECT DATE_FORMAT(CONCAT(Month.month, '01'), '%Y-%m-%d') AS Month, COALESCE(Main.count, 0) AS Main, COALESCE(Other.count, 0) AS Other -- http://stackoverflow.com/a/6871220/365238 -- ... using MariaDB 10 SEQUENCE engine instead of information_schema.columns FROM ( SELECT EXTRACT(YEAR_MONTH FROM SUBDATE(CURDATE(), INTERVAL @num:=@num+1 MONTH)) AS month FROM seq_1_to_12, (SELECT @num:=-1) num LIMIT 12 ) AS Month LEFT JOIN ( SELECT EXTRACT(YEAR_MONTH FROM timestamp) AS month, COUNT(DISTINCT event_username) AS count FROM (SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount FROM MobileWebEditing_5644223 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount FROM MobileWebEditing_6077315 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_6637866 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_7675117 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_8599025) as MobileWebEditing WHERE event_action = 'success' AND event_namespace = 0 AND wiki != 'testwiki' GROUP BY month ) AS Main ON Month.month = Main.month LEFT JOIN ( SELECT EXTRACT(YEAR_MONTH FROM timestamp) AS month, COUNT(DISTINCT event_username) AS count FROM (SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount FROM MobileWebEditing_5644223 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount FROM MobileWebEditing_6077315 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_6637866 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_7675117 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_8599025) as MobileWebEditing WHERE event_action = 'success' AND event_namespace != 0 AND wiki != 'testwiki' GROUP BY month ) AS Other ON Month.month = Other.month
SELECT Month.Date, COALESCE(Web.Web, 0) AS Web -- http://stackoverflow.com/a/6871220/365238 -- ... using MariaDB 10 SEQUENCE engine instead of information_schema.columns FROM ( SELECT DATE_FORMAT( ADDDATE(CURDATE() - INTERVAL 30 - 1 DAY, @num:=@num+1), '%Y-%m-%d' ) AS Date FROM seq_1_to_100, (SELECT @num:=-1) num LIMIT 30 ) AS Month LEFT JOIN ( SELECT DATE(timestamp) AS Date, SUM(1) AS Web FROM (SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount FROM MobileWebEditing_5644223 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount FROM MobileWebEditing_6077315 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_6637866 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_7675117 UNION SELECT timestamp, wiki, event_username, event_action, event_namespace, event_userEditCount from MobileWebEditing_8599025) as MobileWebEditing WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki' GROUP BY Date ) AS Web ON Month.Date = Web.Date

Disabling cron on puppet: https://gerrit.wikimedia.org/r/220716

Event Timeline

jcrespo created this task.Jun 25 2015, 7:22 AM
jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description. (Show Details)
jcrespo added a project: Research.
jcrespo added a subscriber: jcrespo.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 25 2015, 7:22 AM

@Milimetric who should we notify about this?

These queries are maintained by the former mobile team, most of which is on the new readership team. I help with the queries from time to time and I shamefully knew about these problems for a while but I have no real alternative to offer. We should keep the job disabled and think of a proper plan going forward. Until then, the dashboards that depend on these queries will not work: http://mobile-reportcard.wmflabs.org

jcrespo set Security to None.
kevinator added a subscriber: kevinator.

adding @JKatzWMF so he is aware of some dashboards not showing in the mobile report card.

ggellerman triaged this task as Normal priority.Jun 26 2015, 3:36 PM

@kevinator thanks for the heads up, yes this is a problem. I don't care about the edit graphs (sorry), but I do care about the main menu and ui graphs. I use them occasionally to track changes or get a quick number. That being said, they have been down for a couple weeks and I haven't pushed anyone to get them fixed--we have higher priorities right now in finishing out the quarter. Are the monthly pageviews broken?

@Jdlrobson @Jhernandez I think the ui level limn graphs are helpful, but not more important than our current workload. Let me know if you feel otherwise.

@kevinator @Milimetric Just an innocent question: Would we be able to get most of what we see on those graphs from graphite + a grafana dashboard? I've been poking around and I've seen that there's event logging data in graphite.

Can you guys clarify for me the relationship between these tools?

Pardon my ignorance, thanks.

Would we be able to get most of what we see on those graphs from graphite + a grafana dashboard? I've been poking around and I've seen that there's event logging data in graphite.
Can you guys clarify for me the relationship between these tools?

Currently graphite gives you operational Event Logging data. The statistics there are about how many events each schema emits, how many events are processed and considered "valid" JSON data, etc.

The cron that this phabricator task is talking about runs SQL scripts that aggregate data from the events themselves, sometimes joining with mediawiki database data. These SQL scripts are generated from templates by the python that also executes them. That python code was written collaboratively by four people who basically wanted nothing to do with it but just tried to get it working so they could move on with their life (I'm one of the four). As a result, it's terrible and it should not be maintained going forward.

The alternatives are:

  • Use the new python runner that we wrote and reviewed more lovingly with a focus on maintainability.
  • Write a new piece of code to get aggregated / analyzed / enriched Event Logging data into graphite. You'd have to create a statsd writer that would produce the interesting metrics from the SQL databases.

In either scenario it would be much better to audit which data we're actually interested in so we don't do extra work for nothing.

@JKatzWMF, I promised to keep these dashboards maintained if anyone needed them, so I'm happy to do maintenance. I look forward to our meeting, we'll sort everything out there.

@yuvipanda, generic action metrics would be lovely but it would continue this rift where people have to go to multiple places to get data. I'm for it though, and support it as a partial solution here. Some of the metrics limn-mobile-data is computing don't fit nicely in a generic action metric, but that's what Tuesday's meeting will be about.

Milimetric closed this task as Resolved.Jun 30 2015, 8:05 PM
Milimetric claimed this task.

Closing this issue because we decided what to do with each graph here: T104379. I will re-enable the cron job once that task is done and the queries and configuration reflects the desires of the mobile team.

Milimetric reopened this task as Open.Jun 30 2015, 8:06 PM
Milimetric moved this task from Paused to In Progress on the Analytics-Kanban board.
Milimetric moved this task from In Progress to Done on the Analytics-Kanban board.
kevinator closed this task as Resolved.Jul 1 2015, 3:47 PM