
LabsDB problems negatively affect analytics tools like Wikimetrics, Vital Signs, Quarry, etc. {mole}
Closed, DeclinedPublic

Description

These problems have been discussed elsewhere (for example, in posts by Sean and Marc-Andre on the labs mailing list). However, since they have not been completely resolved, Analytics-Engineering is making it explicit that they affect tools we maintain, like Wikimetrics and Vital Signs, as well as tools maintained by others, like Quarry. This task is mostly a tracking task for the purpose of Scrum-of-Scrums. The problems include:

  • Missing data. We have found data that exists in the production databases but not in LabsDB, and it does not appear to be backfilled after any period of time. Furthermore, even when it is eventually fixed, we have no way of finding out, so we cannot re-generate the affected metrics. (A sketch of the kind of consistency check involved follows this list.)
  • Slow queries. We hammer the databases every night for a few hours with the Wikimetrics recurrent reports, so we are happy to help debug in case we are the ones causing a problem. However, since about the end of September, queries that used to be fast are now slow. Queries against English Wikipedia are hit especially hard, but other queries sometimes have problems too.
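
To make the first point concrete, here is a minimal sketch of the kind of consistency check involved, assuming the standard MediaWiki revision schema (the date range is just an example). The same query is run against production and against the LabsDB replica, and the two result sets are diffed offline, since the two servers cannot be joined directly:

```sql
-- Revisions per day for September 2014; run once on production and once
-- on the LabsDB replica, then diff the outputs. Days where the counts
-- disagree point at ranges with missing rows.
SELECT LEFT(rev_timestamp, 8) AS day, COUNT(*) AS revisions
FROM revision
WHERE rev_timestamp >= '20140901000000'
  AND rev_timestamp <  '20141001000000'
GROUP BY day
ORDER BY day;
```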

UPDATE: we are most likely going to work around these problems by moving some of our querying to the production databases for now. I'll keep this task around, as this will be an issue again at some point.

Event Timeline

Milimetric raised the priority of this task to Needs Triage.
Milimetric updated the task description.
Milimetric changed Security from none to None.
Milimetric added a subscriber: Milimetric.

Regarding missing data, this is still useful context:

https://lists.wikimedia.org/pipermail/labs-l/2014-November/003090.html

Re-sync has occurred for s1, s2, s4, s5, and will shortly run for s3, s6, s7. We also have a MariaDB 10.0.15 upgrade happening tomorrow (2014-12-04) which includes a fix for https://tokutek.atlassian.net/browse/DB-739

The slow queries are a harder nut to crack. Labs replicas have many poorly optimized queries that run for many hours, often with duplicates, often appearing in bursts, usually without thought given to batching. It is possible that the problem has increased recently, but it is hard to tell with the low level of monitoring currently in use on labsdb.

There have been discussions on labs-l and IRC about how to police the replicas in order to maintain a decent level of service for all. User education aside, the simple options are to kill queries based on runtime or on resource usage; we have started doing this so we at least avoid the kernel OOM killer. But this is a blunt instrument applied without any real insight into the problem: a slow query may be inherently slow, or merely slow because it is fighting for resources held by other, truly slow queries. At present both get killed arbitrarily.
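
For context, a minimal sketch of what runtime-based killing looks like in MariaDB. The one-hour threshold here is an arbitrary placeholder, not the actual labsdb policy:

```sql
-- Find user queries that have been running for more than an hour.
SELECT id, user, time, LEFT(info, 80) AS query_head
FROM information_schema.processlist
WHERE command = 'Query'
  AND time > 3600
ORDER BY time DESC;

-- Each offending thread would then be terminated individually:
-- KILL QUERY <thread_id>;
```

As noted above, this cannot distinguish an inherently slow query from one that is merely starved, which is why better monitoring would have to come first.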

Milimetric triaged this task as Low priority. Dec 3 2014, 6:36 PM
Milimetric moved this task from Scheduled to Done on the Scrum-of-Scrums board.
Milimetric moved this task from Done to Scheduled on the Scrum-of-Scrums board.
Milimetric updated the task description.
Tnegrin added a subscriber: Tnegrin. Dec 8 2014, 6:49 PM
coren added a subscriber: coren. Dec 8 2014, 7:19 PM

MariaDB 10.0.15 upgrade done. Seems good, with no *new* replication glitches found so far. About to run the resync processes again, across the board, just to be suitably paranoid.

scfc added a subscriber: scfc. Dec 16 2014, 6:07 PM
coren moved this task from Triage to Tracking on the Cloud-Services board. Feb 5 2015, 4:19 PM

@yuvipanda, queries against labsdb are faster, and we saw some back-filling going on, but it's still not fast enough to provide daily data for the Vital Signs project. We still think these metrics are very valuable; we have had other projects to worry about, but we still want a platform performant enough to compute them. Our current thinking is that we'll build this platform ourselves as an extension of our Hadoop cluster. I'd obviously love to make it available publicly, but as of right now we haven't gotten any further than high-level vision discussions.

The missing data problem has gotten uglier since I filed this issue. We found a really strange case that basically caused us to stop working, take ten steps back, and spin in circles while spitting as if we'd seen the devil (a sketch of the check involved follows the list):

  • load up a few months of rows from enwiki.revision into another data store
  • check that the rows imported match the source rows - good
  • wait an hour
  • check again that the imported rows match the source rows: bad!! The source rows *change*, and not just in the last few days, but months and months in the past!
  • take ten steps backwards and spin in a circle while spitting
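
Here is a minimal sketch of the verification step. The `imported_revision` table name is hypothetical; in reality the copy lived in a separate data store, so the comparison was done by exporting and diffing rev_id lists rather than by joining:

```sql
-- Source rev_ids in the loaded window that are absent from the copy.
-- Empty right after the import; an hour later it is not, because the
-- *source* has gained rows with timestamps months in the past.
SELECT r.rev_id
FROM revision r
LEFT JOIN imported_revision i ON i.rev_id = r.rev_id
WHERE r.rev_timestamp >= '20140101000000'
  AND r.rev_timestamp <  '20140401000000'
  AND i.rev_id IS NULL;
```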
coren added a comment. Apr 7 2015, 10:39 PM

@Milimetric: that's actually downright scary. What kind of changes are you noticing (i.e. additions, modifications, deletions)? Is there a pattern to the changes? Can you give me a couple of rows you know have changed so that I can compare with production?

@coren: this problem was actually observed while pulling data out of analytics-store, so it's happening somewhere in mediawiki and shouldn't be a labs problem (sorry if I suggested that). We didn't look into it closely, but @mforns might have some examples saved; he was the one who found the problem.

kevinator renamed this task from LabsDB problems negatively affect analytics tools like Wikimetrics, Vital Signs, Quarry, etc. to LabsDB problems negatively affect analytics tools like Wikimetrics, Vital Signs, Quarry, etc. {mole}. Apr 27 2015, 4:21 PM
kevinator edited projects, added Analytics-Kanban; removed Analytics-Engineering.
mforns added a comment. May 1 2015, 2:44 PM

@coren, @Milimetric: what I observed back in February is that the revision tables in the wiki DBs (analytics-store) are receiving inserts of edits with timestamps from several days or weeks earlier.

How did I see that? At the time we were testing some import scripts against analytics-store. The test query would check that the imported data matched the data in the wiki DBs. Right after the import, the data would actually match, but executing the script a couple of days or weeks later would show that the wiki DBs had more rows than the imported data for the queried period.

Today I executed the test query again, only for dewiki and only from 2014-12-01T00:00:00 to 2014-12-02T00:00:00. The following rev_id's exist in dewiki and not in the imported table. Note that the import was done some time in February:

140493771
139797017
140649946
139509330
140650606
138758966
140650607
141543860
138755470
141543113
141543115
139499162
139499406
140357247
140817368
141485999
138773886
140817369
140372153
140250245
141044197
139015634
140420859

Maybe this is just a problem with replication to analytics-store?
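
One quick way to sanity-check that hypothesis would be to look at replication health on analytics-store directly; these are standard MariaDB commands, and a systematic row-level comparison would need something like pt-table-checksum:

```sql
-- Inspect the replica's view of replication. Key fields:
--   Seconds_Behind_Master  -- lag; large or oscillating values would
--                             explain rows "appearing" days late
--   Last_SQL_Error         -- outright breakage
SHOW SLAVE STATUS\G
```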

kevinator closed this task as Declined. Jun 3 2015, 4:14 PM
kevinator claimed this task.
kevinator added a subscriber: kevinator.

I'm closing this task because it is very broad and there are no clear next steps to resolve the issue.
Also, the impact of the problems is fairly low right now.
We'll open new, more narrowly scoped tasks when we run into these problems again with some urgency to fix them.

kevinator moved this task from Paused to Done on the Analytics-Kanban board. Jun 3 2015, 7:07 PM