Page MenuHomePhabricator

Troubleshoot Wikimetrics RAE reports
Closed, DeclinedPublic

Description

James Hare sent an email to Analytics notifying that some centralauth-expanded reports on Rolling Active Editor and newly registered did fail.

Dan tested the Edits metric on the same cohort (4703) and it worked. However NRAE metric took too long, so it indeed seems something is wrong. We should have a look at that.

Related Objects

Event Timeline

mforns raised the priority of this task from to Needs Triage.
mforns updated the task description. (Show Details)
mforns subscribed.

The Rolling New Active Editor metric did not complete for this cohort with default parameters.

[2015-03-19 16:18:46,078: INFO/Worker-29] wikimetrics.models.report_nodes.report.queue_task[df913480-654b-4564-a9ec-cfced7e7f944]: running <RunReport("3016026")> on celery as df913480-654b-4564-a9ec-cfced7e7f944
[2015-03-19 16:59:16,126: ERROR/Worker-29] Exception closing connection None
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 243, in _close_connection
    self._dialect.do_close(connection)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 401, in do_close
    dbapi_connection.close()
AttributeError: 'NoneType' object has no attribute 'close'
[2015-03-19 16:59:16,134: WARNING/Worker-29] /usr/local/lib/python2.7/dist-packages/celery/app/trace.py:343: RuntimeWarning: Exception raised outside body: AttributeError('_Connection__connection',):
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 285, in trace_task
    retval=retval, state=state)
  File "/usr/local/lib/python2.7/dist-packages/celery/utils/dispatch/signal.py", line 150, in send
    response = receiver(signal=self, sender=sender, **named)
  File "/srv/wikimetrics/wikimetrics/configurables.py", line 212, in task_postrun
    session.remove()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/scoping.py", line 86, in remove
    self.registry().close()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 999, in close
    transaction.close()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 437, in close
    connection.close()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 637, in close
    del self.__connection
AttributeError: _Connection__connection

  exc, exc_info.traceback)))
[2015-03-19 16:59:16,135: CRITICAL/MainProcess] Task wikimetrics.models.report_nodes.report.queue_task[df913480-654b-4564-a9ec-cfced7e7f944] INTERNAL ERROR: AttributeError('_Connection__connection',)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 285, in trace_task
    retval=retval, state=state)
  File "/usr/local/lib/python2.7/dist-packages/celery/utils/dispatch/signal.py", line 150, in send
    response = receiver(signal=self, sender=sender, **named)
  File "/srv/wikimetrics/wikimetrics/configurables.py", line 212, in task_postrun
    session.remove()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/scoping.py", line 86, in remove
    self.registry().close()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 999, in close
    transaction.close()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 437, in close
    connection.close()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 637, in close
    del self.__connection
AttributeError: _Connection__connection

@Milimetric, can you clarify: The title refers to RAE, but the comment says RNAE didn't complete. Are all rolling metrics not working?

@kevinator
I wrote the description of the task.
I used RAE in the first place because James' email made reference to it.
I think Dan executed RNAE to test, so that's why I used it in the second place.

Looking at the report table, I see, James executed indeed RAE metric at 2015-03-18 19:42:36.
So maybe yes, both RAE and RNAE have similar problems.

Two things:

  1. Regarding previous comment, I've executed RAE report with the cohort in question, and it worked fine. So the problem seems to exist only for RNAE.
  1. After some research I found that at some point wikimetrics sends this query to mysql:

http://pastebin.com/zVk8zAVE
I grabbed it from the logs, so it is still separated in template and instantiation like this: "blah blah %s blah %s blah" % (v1, v2, ...).

But you can still see that the query gets really big (37KB).
The point where that happens is in the "logging.log_user IN (...)" where clause.
Inside the IN list there are 2600+ values.

Maybe this represents a performance problem?
Wikimetrics is logging "MySQL server has gone away" errors,
that may be caused by a too large query or a timeout:
https://dev.mysql.com/doc/refman/5.0/en/gone-away.html

Well I executed the critical projects (with the big queries) separately and they went fine, so yea, forget it.

In fact, the whole query with the problematic cohort does work for me now, it takes some time (40mins) but it finishes and returns reasonable results.
Also, I could not isolate the problem (cohort-wise), all subsets of the cohort I have separated work on their own.
I also looked at the mysql explain select for the query and everything seems fine. Good keys are being used.

The issue seems to me like we are getting close to a limit of wikimetrics reporting.
Where the nature of the cohort (20k+ users across 700+ wikis) combined with the nature of RNAE metric makes the queries run slow enough to reach timeout and as a consequence a report failure.

kevinator lowered the priority of this task from High to Medium.Mar 31 2015, 4:38 AM

Declining because Wikimetrics is being discontinued. See: T211835.