Page MenuHomePhabricator

Update MariaDB on analytics-store to a version that supports JSON functions
Closed, DeclinedPublic

Description

As discussed at T156681:

  • the recent rollout of T153207 introduced a JSON field to all EventLogging tables,
  • querying this would be much easier using MariaDB's JSON functions,
  • we currently use an old version of MariaDB (10.0.22) that does not yet support these.

Event Timeline

Restricted Application added a project: Analytics. · View Herald TranscriptMay 1 2017, 8:48 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
jcrespo closed this task as Declined.May 1 2017, 9:20 PM
jcrespo added a subscriber: jcrespo.

MariaDB 10.2 is not stable for production usage as I am writing these lines, it wasn't on T156681, and it is still not GA. We do not use under-development software on production. A more realistic approach would be to ask for JSON extension on 5.7https://dev.mysql.com/doc/refman/5.7/en/json.html , which are already stable, but of course migrating from MariaDB to MySQL is a huge step and not something are currently considering. Aa Manuel told you, we are currently considering upgrading to 10.1, 10.2 is out of the table, and probably won't be for a year.

Please trust me that database vendors release versions in a really unusable way, and normally we have to wait 1 year until they are a bit usable- the last thing you want is to do the QA for them with critical data (and I suppose you do not want to lose eventlogging data?). See bugs like: https://jira.mariadb.org/browse/MDEV-11571

Additionally, analytics-store stores more than eventlogging (mediawiki tables) and for usages other than analytics (research, mobile reports, operations checks, development testing, etc.)- it is more than ok to create this ticket to let us know that that is a function that is desired, but it would require coordination to check everything works (we do not not if even mediawiki works on 10.2, because it has not been released yet!).

Also, if you are thinking of using the JSON functions without virtual columns (functional indexes), you are probably doing it wrong- you will get horrible performance without indexes. Talk to me and explain what you REALLY need for your work, and there is probably a better way to achieve it in the current version.

@jcrespo I totally hear you on 10.2 not being ready. But there has to be some way to record this analytics need on Phabricator and prevent it from being forgotten.

Upgrading MariaDB was the DBA-suggested solution in T156681#2995733 f. So that's the task I filed (and I had thought we can simply mark it as blocked on upstream if the upgrade is not possible yet). But if there are other ways to solve this, I'm happy to convert this task into e.g. "Provide better support for querying the JSON userAgent field in EventLogging tables".

...

Also, if you are thinking of using the JSON functions without virtual columns (functional indexes), you are probably doing it wrong- you will get horrible performance without indexes. Talk to me and explain what you REALLY need for your work, and there is probably a better way to achieve it in the current version.

The example that @Nuria wrote down in the task description of T156681 is still valid and should serve most of our current needs:

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT JSON_EXTRACT(userAgent, '$device_family') FROM log.Popups_16364296 LIMIT 5;

(slightly simplified to facilitate discussion)

It's possible to extract the desired value (here, for the attribute device_family) in MariaDB using string operations. But that is cumbersome, error-prone, and I guess not very performant either.

@jcrespo I totally hear you on 10.2 not being ready. But there has to be some way to record this analytics need on Phabricator and prevent it from being forgotten.
Upgrading MariaDB was the DBA-suggested solution in T156681#2995733 f. So that's the task I filed (and I had thought we can simply mark it as blocked on upstream if the upgrade is not possible yet). But if there are other ways to solve this, I'm happy to convert this task into e.g. "Provide better support for querying the JSON userAgent field in EventLogging tables".

Hi,

I believe it was not the suggested solution, it was a comment to explain why it was failing and what our plans were and actually are (which didn't include 10.2 in a near future, only 10.1).

jcrespo added a subscriber: faidon.May 2 2017, 8:08 AM

@Tbayer Let me show you how to use a programming language to obtain the desired results:

in PHP:

$query = "SELECT userAgent FROM log.Popups_16364296 LIMIT 5";
$result = $mysqli->query($query);
while($row = $result->fetch_array()) {
    $json = json_decode($row[userAgent]);
    print $json['device_family'];
}

in Python:

# same query as before
json = json.loads(row[0])
print(json['device_family'])

@Tbayer Let me show you how to use a programming language to obtain the desired results:
in PHP:

$query = "SELECT userAgent FROM log.Popups_16364296 LIMIT 5";
$result = $mysqli->query($query);
while($row = $result->fetch_array()) {
    $json = json_decode($row[userAgent]);
    print $json['device_family'];
}

in Python:

# same query as before
json = json.loads(row[0])
print(json['device_family'])

Thanks much, but I happen to know that other languages can process JSON.

This task is about the ability to do this in SQL and get direct results without adding steps to export the data and process it elsewhere (which BTW might also come with its own performance issues of the kind you eloquently evoked above in the case of MariaDB). If there is no direct way to do this, that would be a noticeable drawback of the parsed user agent solution recently implemented in T153207.

jcrespo added a comment.EditedMay 2 2017, 10:46 AM

This task is about the ability to do this in SQL and get direct results without adding steps

That was declined at T164224#3226303 for the multiple reasons given there.

If there is no direct way to do this, that would be a noticeable drawback of the parsed user agent solution recently implemented in T153207.

You should blame whoever asked for the functionality to change and whoever approved it without taking into account infrastructure support. Operations is not going to maintain a not-yet-released database version now and probably not for a year after release being generous, and specially not for a feature that can be done easily in a different way. The baseline right now is 10.0 and we are considering 10.1 for the immediate future as that is the baseline for stretch (or maybe 5.7). If you are not ok with that, please by all means complain to the head of technical operations @faidon or ask @Nuria to maintain a special unstable database to take into account your requirements.

It seems quite strange to me that you are asking to maintain this, when analytics itself just told me you want to decommission mariadb servers with eventlogging at T156844#2990353 You should definitely discuss this internally, but please leave me out of the discussion and present me a single clear answer about the future of this service.

I will stop now answering to this ticket.

jcrespo removed a subscriber: jcrespo.May 2 2017, 10:47 AM

This task is about the ability to do this in SQL and get direct results without adding steps to export the data and process it elsewhere (which BTW might also come with its own performance issues of the kind you eloquently evoked above in the case of MariaDB). If there is no direct way to do this, that would be a noticeable drawback of the parsed user agent solution recently implemented in T153207.

It was said here: T156681#2995733 that our version didn't support this and we were nowhere near having 10.2 in our environment, so it is a bit surprising that this is now a drawback for the solution that was implemented on: T153207

faidon added a subscriber: Ottomata.EditedMay 2 2017, 12:45 PM

First off, let's tone this down a little bit and try to unpack the conversation to all of the separate contentious points :)

From this and the other tasks, this is what I understand of the situation (and correct me if I'm wrong anywhere):

  • Analytics has tried to use JSON_EXTRACT(), but couldn't because it wasn't available in our setup and inquired our DBA team about it.
  • The DBA team responded that this is 10.2 only function, which is a version that we don't run and are not planning to run in the short- and mid-term, as our next target will be 10.1.
  • Tilman filed another task (this one) about updating analytics-store to 10.2. A timeframe wasn't specified, so there is nothing wrong with this task by itself and perhaps shouldn't have been declined (but see below). On the other hand, keeping a task open in Phabricator for a year or two, mentioning some future version of MariaDB that hasn't even been GA yet, is probably not super useful. When the DBA team makes the choice to target a newer MariaDB/MySQL version, I'm sure they won't forget about analytics-store, so this task reminding them of that isn't very helpful to them.
  • Separately from this whole conversation, there's the largest strategic question of whether we are keeping the analytics-stores or not. I see T156844#2990353 linked above, in which @Ottomata mentions that Analytics wants to "try to ween people off of EventLogging MySQL altogether" and hoping to "[get] people of of EventLogging MySQL in Q1 next FY". In light of this, I don't understand why a new feature (JSON_EXTRACT) would be requested for a system that is under imminent deprecation and I also don't see how this task (which operates on a much longer timeframe than FY17-18 Q1) can be considered, so in that sense, it should probably be declined indeed.
  • Finally, and also separately from all of the above, there is a broader concern of actually using JSON-unpacking features in the database layer, rather than the application layer, even if they were readily available, from a query planning/optimization and server load standpoint. If we are keeping the EventLogging MySQL service after all, I think it'd be useful if the use case was explained a little bit more, so that we can together figure out a way to address the problem using a method that's optimal for the database layer as well.

it is a bit surprising that this is now a drawback for the solution that was implemented on: T153207

We didn't lose any functionality by converting this field to JSON. Previously it was a free form user agent string, difficult to parse in SQL as well as in regular ol' programming languages.

and hoping to "[get] people of of EventLogging MySQL in Q1 next FY".

I now think we won't meet this. T162610 will get us another way to query EventLogging by then, but it is likely not sufficient on its own to allow us to totally get rid of EventLogging MySQL. We're not sure yet what will be, but eliminating EL MySQL is an end goal.

faidon added a comment.May 2 2017, 1:47 PM

and hoping to "[get] people of of EventLogging MySQL in Q1 next FY".

I now think we won't meet this. T162610 will get us another way to query EventLogging by then, but it is likely not sufficient on its own to allow us to totally get rid of EventLogging MySQL. We're not sure yet what will be, but eliminating EL MySQL is an end goal.

We're getting a little bit off-topic but: is this "we won't meet this deadline but we're still trying to ween people off, so we shouldn't be spending our time to work on new EventLogging MySQL features" or "we won't meet this because we discovered it's an unrealistic goal and now we're doing a 180 and will keep supporting and enhancing this for the foreseeable future" or something else/in between?

The way we are talking about it, it is the former. We're sure we want T162610, so we are doing it before we think about how exactly to solve Tilman and other analyst's query latency needs. Hive works great, but it is not fun for exploratory analysis because of its high latency.

Nuria added a comment.May 2 2017, 4:00 PM

We didn't lose any functionality by converting this field to JSON. Previously it was a free form user agent string, difficult to parse in SQL as well as in regular ol' programming languages.

Indeed, a built json parser in Maria db would be nice-to-have (and that is why i filed the original ticket about this: https://phabricator.wikimedia.org/T156681) but I cannot see how that is a blocker for any SQL that prior just parsed a raw UA string. If I am missing something big time please let me know.

In this case it seems that upgrading Maria DB to add this json library was not something we are considering (per @Marostegui comments on original ticket) I agree with @faidon that i see no immediate benefit on keeping this ticket open.