You are a user of an experimental service, welcome! :) Jokes aside, I'd be happy if you could help test the new service; it should be quicker for your use case to query data from Superset rather than Hive. If you don't have time, feel free to skip!
Just tested the deployed refinery jars:
- data+categories jnl files copied for both hosts via cookbook
- restarted nginx to pick up new mapping for categories (after David's suggestion)
- 2007's lag seems ok, 2008 still catching up
- both hosts are not pooled in LVS/Pybal
Mon, Apr 6
Opened https://github.com/apache/incubator-superset/issues/9468 upstream to get a bug found in 0.36.0rc3 fixed.
@Aklapper have you tried https://turnilo.wikimedia.org/ or http://superset.wikimedia.org/ ? The latter now allows you to explore hive data via Presto (see SQLLab), and I am wondering if it is sufficient for your use case. If so, it wouldn't require an explicit kerberos account (since you wouldn't access hadoop directly), let me know :)
Acked this morning's Icinga warnings about commonswiki_content shard size for cloudelastic.
Sat, Apr 4
https://www.preset.io/blog/2019-12-16-elasticsearch-in-superset/ looks very interesting..
In the 0.36.0 changelog I see https://github.com/apache/incubator-superset/pull/9006/commits that could be related; it would fit with what I am seeing.
Forgot to mention that the above is a Druid datasource. After some tests, the error seems to appear only when no data is returned for a query; I was able to successfully visualize some results. So the regression seems a minor one, affecting only the message displayed when no data is returned.
Superset 0.36+ requires:
- superset.app:create_app() in the ExecStart of the superset.service unit
- DRUID_IS_ACTIVE = True in superset_config.py
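For reference, a minimal sketch of what the unit change could look like (the gunicorn path, flags, and port below are illustrative placeholders, not our actual unit; 0.36+ ships an application factory, so the WSGI server has to be pointed at it):

```ini
# Hypothetical systemd drop-in for superset.service.
[Service]
# Clear the previous ExecStart, then point gunicorn at the 0.36+ app factory.
ExecStart=
ExecStart=/usr/bin/gunicorn --workers 4 --bind 127.0.0.1:9080 'superset.app:create_app()'
```

Plus the one-liner `DRUID_IS_ACTIVE = True` in superset_config.py to keep the Druid-native datasources enabled.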
Fri, Apr 3
Another example that looks very good:
Did a little test today:
@CDanis this is an old task that I opened, do you think that we could revamp it and use what you have in mind to detect bursts in bandwidth usage? It would make a big difference in managing memcached..
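To make the idea concrete, a hypothetical burst detector could be as simple as a rolling-mean threshold over per-host bandwidth samples. All names, the window size, and the factor below are illustrative, not an agreed-upon design:

```python
# Hypothetical sketch: flag bandwidth bursts by comparing each sample
# against a multiple of the rolling mean of the previous `window` samples.
from collections import deque

def detect_bursts(samples, window=5, factor=3.0, min_baseline=1.0):
    """Return indices of samples exceeding `factor` times the rolling
    mean of the preceding `window` samples."""
    history = deque(maxlen=window)
    bursts = []
    for i, value in enumerate(samples):
        if len(history) == window:
            # min_baseline avoids flagging any traffic after a quiet period
            baseline = max(sum(history) / window, min_baseline)
            if value > factor * baseline:
                bursts.append(i)
        history.append(value)
    return bursts

# Steady ~100 B/s traffic with one spike at index 7
print(detect_bursts([100, 102, 99, 101, 100, 98, 103, 900, 101, 100]))  # → [7]
```

In practice this would run over the memcached hosts' network counters; anything fancier (EWMA, per-slab breakdown) could come later.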
@Krinkle @aaron Can you give me some hints about how to figure out the key's purpose? It feels like a popular template change on nlwiki, but I'd like to triple check. We are still seeing saturation problems.. :(
Interesting that ganeti1011's mgmt interface recovered, but not the others. Adding dcops to see if we can schedule in the next days/weeks a check of msw-c6-eqiad.
On msw1 all the events I see look like the following, starting at 5:40 UTC:
Thu, Apr 2
Reason for access: As part of our work for CheckUser and IP masking projects, we need to make sure our DB queries are suitable for production, and would need a place to properly test potentially high-load queries with production data outside of the production cluster itself. It's my understanding that the place to do these queries is stat1006 which is in the analytics group.
For the moment I am happy with TLS encryption only, since we'll probably move to kerberos authentication soon and it doesn't make much sense to do TLS client auth first.
Fixed, sorry for the trouble. Really appreciated the feedback :)
Wed, Apr 1
We could start with TLS authentication only, with:
elukey@krb1001:~$ sudo manage_principals.py create itamar --email@example.com
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to firstname.lastname@example.org

elukey@krb1001:~$ sudo manage_principals.py create tarrow --email@example.com
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to firstname.lastname@example.org
Hey Diego, not sure if you have seen https://docs.google.com/document/d/1r-oqMXViWvQCqsYz0qzezZBWpip8LvkvCGF6GivFB_8/edit# but Andrew is working on a solution that will deploy Anaconda to all the worker nodes, so after that we shouldn't have any more package problems. The MW-specific ones probably just need to be installed; the first two should be covered by Anaconda in theory.
Tue, Mar 31
Got it, so basically do all the process of creating the schemas etc., but push data from pmacct directly to the kafka topic, bypassing Eventgate. Having something ad-hoc is not appealing to me, since the goal of this task was/is to avoid any special settings for netflow, but it is probably the best compromise (and later on it should be easy to move to Eventgate validation / kafka topic creation if we see any issues).
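As a sketch, the direct-to-Kafka path on the pmacct side could be just a handful of config directives (directive names from pmacct's Kafka plugin; the broker host, topic, and refresh interval are placeholders, not our actual settings):

```
! Hypothetical pmacct/nfacctd fragment: ship flow aggregates straight
! to Kafka, bypassing Eventgate. Broker and topic are placeholders.
plugins: kafka
kafka_broker_host: kafka-jumbo1001.eqiad.wmnet
kafka_broker_port: 9092
kafka_topic: netflow
kafka_refresh_time: 5
```

The schema/topic creation steps would still be done by hand up front, as described above.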
Mon, Mar 30
After T245810 we don't really need a custom partman recipe anymore, let's use partman/raid10-4dev.cfg (already configured in puppet for all druid nodes).
Fri, Mar 27
Ok so next steps are:
As FYI stat100[5,8] (Buster nodes) are running with JupyterLab 1.1.0, if anybody wants to test and provide feedback I'd be happy :)
I had a chat with Willy about racking requirements of the new hosts (16 refreshed + GPU ones). We currently have this configuration:
Removed support for PySpark and Spark R in Toree (use specific kernels)
The issue seems intermittent, and since there is a big re-index in progress we cannot really do much. The long-term fix should be to add more nodes to the cluster, waiting for rack/setup/deploy after T233720.
Thu, Mar 26
The first attempt of rollback was a disaster, I wasn't able to restore HDFS to its previous state.
We are still not sure what the issue is, but we decided to upgrade stat1008 according to T247082 to get the latest upstream ROCm version before contacting the devs.
Reopened, since today a webrequest load failed due to the same issue..
Wed, Mar 25
After the above changes the GC timings are way better, no more flapping alerts.
Tue, Mar 24
Working on the rollback in https://etherpad.wikimedia.org/p/analytics-bigtop
Mon, Mar 23
If we want to make this switch, api.svc.eqiad.wmnet will need to be whitelisted in the Analytics VLAN's firewall rules, and we'd have a problem when/if that IP changes. This task was created after a change in the webproxies (migration to buster) caused some problems in Refine IIRC, but maybe we are not adding much resiliency with this switch? Nothing against it, just thinking out loud :)
@EYener we'd prefer to do it ourselves, reporting the host and username is sufficient for us to find the problem usually :) thanks!
Hi! Do you mean notebook1003? I don't see recent activity in there, but I have restarted your notebook just in case you want to retry. Or maybe you are trying from a different host? It would help to check if there are meaningful logs, let me know :)
Fri, Mar 20
I had a chat with Miriam about this:
Today while re-installing the new version of oozie/hadoop packages I experienced a problem that I forgot to fix, namely:
The rollback of HDFS at this stage should be easy; the main question mark is the oozie/hive db schemas. We have been running the Hadoop cluster with the new version of HDFS for some days, but hive and oozie were upgraded too (together with their db schemas). During this timeframe oozie jobs were run and hive changes were made to the metastore. For the hadoop test cluster it might be a simple matter of reverting back to a known good db state (we have backups), but if this happens in production, what would be the strategy?