Already resolved by Sahil and Amit, thaaanks!
Fri, Jul 13
I think a minimal number of errors is expected.
AFAIK we only shorten the source_url, but there are other potentially long fields, like source_title, that combined can still make the event overflow the 2000-char maximum.
So the errors should become much less frequent, but not disappear, no?
Thu, Jul 12
Base64 includes a-z, A-Z, 0-9, +, and /. So, all except / are 'legal'. I bet Pivot/Turnilo URI-encode the base64 string to avoid problems with the /. This would add 2 extra chars for every / (its frequency being 1/64 on average), so a ~3% increase. Theoretically, then, using base64 + URI encoding would be ~17% (not ~20%) shorter than using URI encoding only.
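To make the estimate above concrete, here's a small sketch comparing the two approaches. The event payload is made up for illustration; the real EL field layout may differ:

```python
import base64
import json
import urllib.parse

# Hypothetical event payload, just for illustration.
event = json.dumps({
    "source_title": "Example page",
    "source_url": "https://en.wikipedia.org/wiki/Example",
})
raw = event.encode("utf-8")

# Option 1: percent-encode the raw JSON directly.
uri_only = urllib.parse.quote(event, safe="")

# Option 2: base64-encode first, then percent-encode the result.
# Only '+', '/' and '=' in the base64 output need escaping, so the
# overhead is roughly 2 extra chars per '/' (frequency ~1/64).
b64 = base64.b64encode(raw).decode("ascii")
b64_then_uri = urllib.parse.quote(b64, safe="")

print(len(event), len(uri_only), len(b64_then_uri))
```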
I just merged the change, because it got a +1 from Nuria and a +1 from Chelsy.
We'll deploy that with the next refinery deployment.
I think we can close this task as resolved.
Wed, Jul 11
@chelsyx Oh, ok! Sorry for the confusion :P
Yes, I'm aware that the length of the URL may be more than 2000 chars in some extreme cases (e.g. the user selects many languages). But I don't have another solution except putting it into another schema. Do you have any suggestions?
I cannot reproduce this error; it might be a race condition.
Tue, Jul 10
When working on T195269, I saw that a new field was added to MobileWikiAppiOSUserHistory: feed_enabled_list. This field is a 2-level nested object with arrays at its leaves. While theoretically this is supported by the EL pipeline, we might see some issues. A couple of comments on it:
- As MobileWikiAppiOSUserHistory is already blacklisted for MySQL insertion, there will be no problems inserting events for this schema into MySQL or sanitizing them there.
- However, as this field can potentially become very long, it might contribute to the whole event overflowing the maximum URL length of approx. 2000 chars. In that case, the events will fail validation in the EL processors. I saw that the subfield names were shortened on purpose, so I assume you are already aware of this.
- Fields with complex types are not supported in Druid, so this schema, as is, will not be fully importable into Druid (or Turnilo).
- I think the schema does not follow the JSON Schema spec when defining the 'ena' and 'dis' sub-fields. I think the [ and ] are not supposed to be there, but I might be wrong.
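If it helps, this is roughly how I'd expect those sub-fields to be declared per the JSON Schema spec — arrays without stray brackets. The field names come from the schema; the item type and nesting depth are my assumptions:

```json
{
  "feed_enabled_list": {
    "type": "object",
    "properties": {
      "ena": { "type": "array", "items": { "type": "string" } },
      "dis": { "type": "array", "items": { "type": "string" } }
    }
  }
}
```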
Mon, Jul 9
Fri, Jul 6
- have the popup be the same component for all charts, that receives the data.
- if the first problem ("diagonal zoom") takes a lot of time to solve, don't bother
- keep 3 colors designed for sections: reading, contributing, content (we can use different shades if it looks better)
- remove black border from line charts
- maybe thicken the lines
Thu, Jul 5
We could also include here changing the colors of the bar charts, because some of the colors currently used are too strong.
This happens in both bar-chart and line-chart.
This problem was solved by another task already.
Tue, Jul 3
For this particular case, let's not whitelist the os_minor field for now. We released the new version about a week ago and I need to verify that we are collecting data from users as expected. After that, I will run another analysis using the data from this table and check the bucket size to see how small it is.
Mon, Jul 2
Fri, Jun 29
Also forgive me for the late response, your thorough report gave me a lot to think about.
Wed, Jun 27
Oh yea, I meant the tiny inconsistency created by the normalization rounding. I added a one-liner to the docs:
Yes, it's better for the map component in Superset to have standard country names, and the names in the original dataset were far from standard (e.g. for Curaçao I found three variations: Curaçao, Curacao and Cura?ao).
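A minimal sketch of this kind of name normalization; the mapping table and function are illustrative, not the actual pipeline code, and only the Curaçao example from above is shown:

```python
# Variant -> canonical name; 'Cura?ao' is the mis-encoded
# variant found in the raw data.
COUNTRY_VARIANTS = {
    "Curacao": "Curaçao",
    "Cura?ao": "Curaçao",
}

def canonical_country(name: str) -> str:
    # Unknown names pass through unchanged.
    return COUNTRY_VARIANTS.get(name, name)

print(canonical_country("Cura?ao"))  # Curaçao
```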
Tue, Jun 26
Sorry, this change was not meant to be linked to this task... please ignore.
Mon, Jun 25
Fri, Jun 22
Hey @mpopov, sure I can try.
Thu, Jun 21
I'm still a little cautious about adding additional logic to the client, even simple logic. I'm happy to be more aggressive with the cut-off, e.g. 1000 characters, and then see what happens.
Hi @mpopov :]
Did you mean to tag this task with Product-Analytics or with Analytics?
Looks good to me overall.
Now, could it be that the maximum observed source_url length is 1937 because longer source_urls are cut off by Varnish?
If so, there's the possibility that we continue to see errors even after reducing the source_url to 1400 chars in the client, no?
Wed, Jun 20
Yea, many more errors when grepping for earlier fields...
The new error dump is under stat1004.eqiad.wmnet:virtualpageview_errors_corrected.log.
Ha... I just thought that we might be ignoring lots of longer error logs...
When generating the error dumps, I grepped for 'VirtualPageView'. The problem is that, by design, EL outputs the schema name after all the event fields.
So any event long enough to push the schema name past the Varnish limit will not have been caught by my grep.
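To illustrate the point (the line format below is a simplification, not the exact EL output):

```python
# Two simulated raw event lines: one intact, one truncated by the
# ~2000-char Varnish limit before the schema name appears.
lines = [
    "?source_url=...&source_title=...&schema=VirtualPageView",
    "?source_url=...&source_title=...&sche",  # truncated mid-schema-name
]

# Grepping for the schema name misses the truncated event...
by_schema = [l for l in lines if "VirtualPageView" in l]
# ...while grepping for a field that appears early catches both.
by_early_field = [l for l in lines if "source_url" in l]

print(len(by_schema), len(by_early_field))  # 1 2
```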
@Jdlrobson I put the virtualpageview error logs under stat1004.eqiad.wmnet:/home/mforns/virtualpageview_errors.log.
Let me know if you want me to copy it over somewhere else.
In Superset the 'Geowiki legacy archive' dashboard works well and shows correct data.
I had difficulties seeing the countries though, because there's no line delimiting them (or there's a white line).
The geoeditors dashboard has a blue line that helps. Can we use that as well for geowiki?
Also, the bottom-right table shows row counts, so if you select a wiki and a cohort, the table will always show count=1.
Not sure that is of interest?
The geowiki_archive_monthly data in Turnilo looks good to me overall and super useful!
There is an empty metric called Count, though. From what I know, this metric is added by Druid, no? It should count the number of rows.
However, there's another metric called Number of rows that seems to count that.
So, yea, probably everything is fine like that.
geowiki_archive_monthly_edits_country Looks good to me overall as well.
There's another small difference in relation to geowiki_archive_monthly_country: the edits dataset has full country names, whereas the editors one has country ISO codes. Is that expected?
geowiki_archive_monthly_country Looks good to me!
There's only one small detail that we can discuss whether we want to change or not:
Sometimes, when normalizing the all cohort, its normalized value does not match the sum of the normalized values of its cohort parts.
An example of this is:
select * from geowiki_archive_monthly_country where month like '2012-08-01' and project='ca' and country='BR';

ca  BR  all     2012-08-01  16  2012-10-29 12:22:48.0
ca  BR  0-10    2012-08-01  13  2012-10-29 12:22:48.0
ca  BR  90-100  2012-08-01   1  2012-10-29 12:22:48.0
ca  BR  1       2012-08-01   8  2012-10-29 12:22:49.0
ca  BR  2       2012-08-01   3  2012-10-29 12:22:49.0
ca  BR  4       2012-08-01   1  2012-10-29 12:22:49.0
ca  BR  5+      2012-08-01   3  2012-12-05 19:29:13.0
ca  BR  50-60   2012-08-01   1  2012-10-29 12:22:49.0
ca  BR  9       2012-08-01   1  2012-10-29 12:22:49.0
The sum of 0-10, 50-60 and 90-100 is 15, but the all value is 16. If you query
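A made-up illustration of how independent rounding can produce this kind of mismatch (the numbers are invented, not from the real data):

```python
# Per-cohort values after applying some normalization factor (made up).
parts = [12.4, 1.4, 1.4]
total = sum(parts)  # 15.2 before rounding

# Rounding each cohort independently loses fractions that the
# rounded total keeps, so the parts need not add up to 'all'.
rounded_parts = [round(p) for p in parts]  # [12, 1, 1]
rounded_total = round(total)               # 15

print(sum(rounded_parts), rounded_total)   # 14 15
```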
geowiki_archive_active_editors_world looks good to me now!
Tue, Jun 19
As discussed with @fdans we should re-sqoop analytics-slave::staging::erosen_geocode_active_editors_world into hive::geowiki_archive_active_editors_world, because it had an import problem. The other 3 sqooped tables seem to be fine!
Mon, Jun 18
@Jdlrobson I'll be working both Tuesday and Wednesday if you want to change.
Dan, I already fixed that on the routing fix.
We can close.
Jun 14 2018
Also not sure if base64 would make URLs shorter; probably not! :/
Yea, I also don't know what the limit should be.
I think it depends on how large the rest of the schema is.
IIRC the limit for EL event size is 2000 chars.
@Nuria he responded in the other task.
See my comment on the other task:
Jun 13 2018
@sahil505 Hey, it was my fault, I thought it wasn't deployed yet.
Will move to done then.
Jun 11 2018
I'm assuming that you guys want to keep all fields in this schema indefinitely;
otherwise, we could just leave the skin field raw and purge it after 90 days?
Yes, in T175395 we sanitized the skin field in the end. I suggested sanitizing it because events generated by a user with an uncommon enough skin are more easily re-identified if we keep the raw skin field. The CitationUsage schema has no other identifying fields (thanks for designing it with privacy in mind!), but it has several fields that can convey user behavior/interests, like pageId, referrer, link_text, etc. Also, this schema uses the session_token provided by mw.user.sessionId(), which, if I'm not wrong, is cross-schema (meaning identifying fields of this schema could be used in combination with identifying fields in other schemas that use the same session_token). Thus, I'd also suggest sanitizing the skin field here. I think (minerva|vector|other) would be OK.
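A sketch of the suggested sanitization; the function name and allow-list handling are mine, not an existing helper:

```python
# Keep common skins as-is and bucket everything else into 'other',
# matching the suggested (minerva|vector|other) output values.
COMMON_SKINS = {"minerva", "vector"}

def sanitize_skin(skin: str) -> str:
    return skin if skin in COMMON_SKINS else "other"

print(sanitize_skin("vector"))    # vector
print(sanitize_skin("monobook"))  # other
```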
Jun 9 2018
Jun 6 2018
Jun 5 2018
OK, this looks ready to review and merge if appropriate.
I tested with real data, and looks good :D
Jun 4 2018
Jun 1 2018
May 30 2018
I'd suggest we make this decision on the basis of some specific concern or evidence this is identifying.
Yea, agree. I will try to explain my concerns and the criteria I followed when proposing to purge os_minor.
May 28 2018
May 24 2018
@mforns can you help with this? Thank you!
Please understand, though, that we in Analytics are trying our best on our side as well. And in view of this situation, where conflicts already exist between some of us, aggressive comments usually do not help reach results, but rather make the conflicts bigger.
Agreed. While we seem to have different a priori views of what counts as aggressive (I would have regarded the wording at the center of your concerns, "creatively reinterpreting this task", as perhaps unnecessarily flippant but not as an attack), the perception of the recipient and how it makes them feel is important, and I will try to be more careful about this in our future discussions. In turn, I would still love to also see improvements or at least acknowledgments regarding the IMHO problematic communication patterns from your team that I tried to describe in this one example above. But maybe we should take this conversation offline now.
May 23 2018
Oh, cool. Yea, definitely useful. Thanks!
As @Ladsgroup knows, I worked on this task during the BCN Hackathon.
It was super-interesting and I learned a lot about Wikidata :]
Thanks for the opportunity!
Here's a summary about what I did, issues I had, and next steps: