Page MenuHomePhabricator

Bug behavior of QTree[Long] for quantileBounds
Closed, ResolvedPublic5 Estimated Story Points

Description

To generate mobile apps session metrics, we use Q tree in this quantiles function. However, when the data type is Long, the result is incorrect. In the last example of this documentation, the range of the median of List(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8) is (5.0,6.0), while the true median is 4.5.

Correction to the listed example, with our version this is what i get:

scala> val data = List(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8)
data: List[Int] = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8)

scala> val qtSemigroup = new QTreeSemigroup[Long](6)
qtSemigroup: com.twitter.algebird.QTreeSemigroup[Long] = com.twitter.algebird.QTreeSemigroup@7c46ceff

scala> val sum = qtSemigroup.sumOption(seqQTree).get
sum: com.twitter.algebird.QTree[Long] = QTree(0,4,16,0,Some(QTree(0,3,14,0,Some(QTree(0,2,6,0,Some(QTree(0,1,2,0,None,Some(QTree(1,0,2,2,None,None)))),Some(QTree(1,1,4,0,Some(QTree(2,0,2,4,None,None)),Some(QTree(3,0,2,6,None,None)))))),Some(QTree(1,2,8,0,Some(QTree(2,1,4,0,Some(QTree(4,0,2,8,None,None)),Some(QTree(5,0,2,10,None,None)))),Some(QTree(3,1,4,0,Some(QTree(6,0,2,12,None,None)),Some(QTree(7,0,2,14,None,None)))))))),Some(QTree(1,3,2,0,Some(QTree(2,2,2,0,Some(QTree(4,1,2,0,Some(QTree(8,0,2,16,None,None)),None)),None)),None)))

scala> sum.quantileBounds(0.5)
res22: (Double, Double) = (4.0,5.0)

See more details about this issue: https://github.com/twitter/algebird/issues/517

As a solution, we may change the data type from Long to Double.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Nuria edited projects, added Analytics-Kanban; removed Analytics.
Nuria set the point value for this task to 3.
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.

Here is a version of the script you can execute in the commandline:

https://gist.github.com/nuria/157bae27e677852c7b24c3703d8ad7c0

via spark-shell with:

spark2-shell --master yarn --jars /srv/deployment/analytics/refinery/artifacts/refinery-job-spark-2.1.jar,/srv/deployment/analytics/refinery/artifacts/refinery-core.jar -v

:load the-file.scala

This comment was removed by Nuria.

Change 408329 had a related patch set uploaded (by Nuria; owner: Nuria):
[analytics/refinery/source@master] [WIP] Changing initialization of QTree to work arround precision Bug

https://gerrit.wikimedia.org/r/408329

Ok, i think i figured out how is all this plugged together, i need to do a bit more testing.

Change 408329 merged by Nuria:
[analytics/refinery/source@master] Changing initialization of QTree to work arround precision Bug

https://gerrit.wikimedia.org/r/408329

Nuria changed the point value for this task from 3 to 5.Mar 1 2018, 4:34 PM

Change 418925 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update mobile-app session job

https://gerrit.wikimedia.org/r/418925

Change 418925 merged by Ottomata:
[analytics/refinery@master] Update mobile-app session job

https://gerrit.wikimedia.org/r/418925

I cannot put my finger on what it is the problem but I think the data from session job has issues, not having to do with this bug per se but it just looks odd.

Closing, tested with 1 day of traffic calculating percentiles by hand and values match with probabilistic calculation (they differ the most at p_99)

Still, with the more precise QTree I do not think changes are super significant for the intervals we are talking about.

Something that before was reported like (4.0, 5.0) now it would be likely reported like (4.0, 4.x)

NON PROBABILISTIC PERCENTILE (simple)

scala> val data = List(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8)
data: List[Int] = List(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8)

scala> val p_index = math.ceil((l.length - 1) * (50 / 100.0)).toInt
p_index: Int = 8

scala> data(8)
res17: Int = 5

BEFORE:
scala> val qtree = data.map(x =>QTree(x) ).reduce(qtreeSemigroup.plus)
qtree: com.twitter.algebird.QTree[Long] = QTree(0,4,16,0,Some(QTree(0,3,14,0,Some(QTree(0,2,6,0,Some(QTree(0,1,2,0,None,Some(QTree(1,0,2,2,None,None)))),Some(QTree(1,1,4,0,Some(QTree(2,0,2,4,None,None)),Some(QTree(3,0,2,6,None,None)))))),Some(QTree(1,2,8,0,Some(QTree(2,1,4,0,Some(QTree(4,0,2,8,None,None)),Some(QTree(5,0,2,10,None,None)))),Some(QTree(3,1,4,0,Some(QTree(6,0,2,12,None,None)),Some(QTree(7,0,2,14,None,None)))))))),Some(QTree(1,3,2,0,Some(QTree(2,2,2,0,Some(QTree(4,1,2,0,Some(QTree(8,0,2,16,None,None)),None)),None)),None)))

scala> qtree.quantileBounds(0.5)
res20: (Double, Double) = (4.0,5.0)

AFTER THESE CHANGES:

scala> val qtree = l.map(x =>QTree(x*16,-4,1,x,None,None) ).reduce(qtreeSemigroup.plus)
qtree: com.twitter.algebird.QTree[Long] = QTree(0,4,16,0,Some(QTree(0,3,14,0,Some(QTree(0,2,6,0,Some(QTree(0,1,2,0,None,Some(QTree(1,0,2,0,Some(QTree(2,-1,2,0,Some(QTree(4,-2,2,0,Some(QTree(8,-3,2,0,Some(QTree(16,-4,2,2,None,None)),None)),None)),None)),None)))),Some(QTree(1,1,4,0,Some(QTree(2,0,2,0,Some(QTree(4,-1,2,0,Some(QTree(8,-2,2,0,Some(QTree(16,-3,2,0,Some(QTree(32,-4,2,4,None,None)),None)),None)),None)),None)),Some(QTree(3,0,2,0,Some(QTree(6,-1,2,0,Some(QTree(12,-2,2,0,Some(QTree(24,-3,2,0,Some(QTree(48,-4,2,6,None,None)),None)),None)),None)),None)))))),Some(QTree(1,2,8,0,Some(QTree(2,1,4,0,Some(QTree(4,0,2,0,Some(QTree(8,-1,2,0,Some(QTree(16,-2,2,0,Some(QTree(32,-3,2,0,Some(QTree(64,-4,2,8,None,None)),None)),None)),None)),None)),Some(QTree(5,0,2,0,Some(QTree(10,-1,2,0,Some(QTre...
scala> qtree.quantileBounds(0.5)
res21: (Double, Double) = (4.0,4.0625)

Nuria updated the task description. (Show Details)
238482n375 triaged this task as Lowest priority.
238482n375 moved this task from Done to In Code Review on the Analytics-Kanban board.
238482n375 edited subscribers, added: 238482n375; removed: JAllemandou, Aklapper.

SG9tZVBoYWJyaWNhdG9yCk5vIG1lc3NhZ2VzLiBObyBub3RpZmljYXRpb25zLgoKICAgIFNlYXJjaAoKQ3JlYXRlIFRhc2sKTWFuaXBoZXN0ClQxOTcyODEKRml4IGZhaWxpbmcgd2VicmVxdWVzdCBob3VycyAodXBsb2FkIGFuZCB0ZXh0IDIwMTgtMDYtMTQtMTEpCk9wZW4sIE5lZWRzIFRyaWFnZVB1YmxpYwoKICAgIEVkaXQgVGFzawogICAgRWRpdCBSZWxhdGVkIFRhc2tzLiWxzKQpKQWxsZW1hbmRvdSBhZGRlZCBhIHByb2plY3Q6IEFuYWx5dGljcy1LYW5iYW4uCkpBbGxlbWFuZG91IG1vdmVkIHRoaXMgdGFzayBmcm9tIE5leHQgVXAgdG8gSW4gUHJvZ3Jlc3Mgb24gdGhlIEFuYWx5dGljcy1LYW5iYW4gYm9hcmQuCkNoYW5nZSBTdWJzY3JpYmVycwpDaGFuZ2UgUHJpb3JpdHkKQXNzaWduIC8gQ2xhaW0KTW92ZSBvbiBXb3JrYm9hcmQKQ2hhbmdlIFByb2plY3QgVGFncwpBbmFseXRpY3MtS2FuYmFuCtcKU2VjdXJpdHkK1wpXaWtpbWVkaWEtVkUtQ2FtcGFpZ25zIChTMi0yMDE4KQrXClNjYXAK1wpTY2FwIChTY2FwMy1BZG9wdGlvbi1QaGFzZTIpCtcKQWJ1c2VGaWx0ZXIK1wpEYXRhLXJlbGVhc2UK1wpIYXNodGFncwrXCkxhYnNEQi1BdWRpdG9yCtcKTGFkaWVzLVRoYXQtRk9TUy1NZWRpYVdpa2kK1wpMYW5ndWFnZS0yMDE4LUFwci1KdW5lCtcKTGFuZ3VhZ2UtMjAxOC1KYW4tTWFyCtcKSEhWTQrXCkhBV2VsY29tZQrXCkJvbGQKSXRhbGljcwpNb25vc3BhY2VkCkxpbmsKQnVsbGV0ZWQgTGlzdApOdW1iZXJlZCBMaXN0CkNvZGUgQmxvY2sKUXVvdGUKVGFibGUKVXBsb2FkIEZpbGUKTWVtZQpQcmV2aWV3CkhlbHAKRnVsbHNjcmVlbiBNb2RlClBpbiBGb3JtIE9uIFNjcmVlbgoyMzg0OZSBub3RlZDsgY29kZSBsaWNlbnNlZCB1bmRlciBHTlUgR2VuZXJhbCBQdWJsaWMgTGljZW5zZSAoR1BMKSBvciBvdGhlciBvcGVuIHNvdXJjZSBsaWNlbnNlcy4gQnkgdXNpbmcgdGhpcyBzaXRlLCB5b3UgYWdyZWUgdG8gdGhlIFRlcm1zIG9mIFVzZSwgUHJpdmFjeSBQb2xpY3ksIGFuZCBDb2RlIG9mIENvbmR1Y3QuILcgV2lraW1lZGlhIEZvdW5kYXRpb24gtyBQcml2YWN5IFBvbGljeSC3IENvZGUgb2YgQ29uZHVjdCC3IFRlcm1zIG9mIFVzZSC3IERpc2NsYWltZXIgtyBDQy1CWS1TQSC3IEdQTApZb3VyIGJyb3dzZXIgdGltZXpvbmUgc2V0dGluZyBkaWZmZXJzIGZyb20gdGhlIHRpbWV6b25lIHNldHRpbmcgaW4geW91ciBwcm9maWxlLCBjbGljayB0byByZWNvbmNpbGUu