
Doubts and questions about Kerberos and Hadoop
Closed, ResolvedPublic

Description

This is a tracking task for user doubts and questions about what will change when Kerberos authentication is enabled for Hadoop.

Before adding any comment to this task, please read: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide

Kerberos auth will be enabled on December 16th (originally scheduled for December 2nd).
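For reference, the day-to-day change for most users is the authentication step described in the user guide: you obtain a Kerberos ticket once, and the Hadoop tools pick it up from the ticket cache. A minimal sketch, assuming your shell account maps to your Kerberos principal (the exact steps are in the user guide):

# Obtain a Kerberos ticket; you will be prompted for your password.
kinit
# Confirm the ticket was granted and check when it expires.
klist
# Hadoop clients (hdfs, hive, beeline, spark) then authenticate
# transparently against the cached ticket, e.g.:
hdfs dfs -ls /user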

Event Timeline

Ottomata moved this task from Incoming to Operational Excellence on the Analytics board.

I have no doubts about this change generally, but Fundraising and Fundraising tech rely on this data heavily and will have just started the annual fundraiser on December 2nd. Could this change be made in mid-December or early January? The risk is that we won't have access to highly needed data while thousands of donations are coming in.

We are concerned about the potential downtime to Hadoop (even 45 minutes could impact us if anything was wrong with CentralNotice and we needed access to the data during that time).

We also want to be clear on which of our systems this affects. We currently consume data from the Kafka stream, which we believe is not connected to Hadoop. Is this correct? We are also transitioning to consuming data from the EventLogging system. Will this be impacted?

@mepps the data you need should be present in kafka even if hadoop has an outage, which we hope does not happen.

We are also transitioning to consuming data from the EventLogging system.

Eventlogging hive tables will be affected. The eventlogging stream on kafka is not.
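To make the distinction concrete: a consumer that reads straight from Kafka never touches Hadoop, so it keeps working during the maintenance. A minimal sketch using kafkacat, where the broker address and topic name are illustrative placeholders rather than the real FR configuration:

# Tail an EventLogging topic directly from Kafka; this path does not
# depend on Hadoop/HDFS. BROKER and the topic name are placeholders.
kafkacat -C -b BROKER:9092 -t eventlogging_SomeSchema -o end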

Could this change be made in mid-December or early January?

Sorry, but we cannot postpone it that far; with vacations and All Hands it would mean that a pretty big security concern does not get addressed for a couple of months.

@Nuria Thanks for the quick and thorough response! To confirm, I read your comments as saying that no Kafka streams will have downtime on December 2nd or require authentication. Is that correct?

There are still concerns about having tooling for the fundraising team during the scheduled outage, but I will let folks from that team address that. As I said previously, this exact date is particularly hard because the fundraiser will have just started (it's also Giving Tuesday in the US), which means we will have the highest volume of donations coming in. I do understand how hard scheduling an upgrade like this is for a team.

@mepps Thanks for reaching out! A couple of comments:

  • Kafka and EventLogging will keep working as expected; the only part that will have some downtime is the data import to HDFS. Pulling data from Kafka will not be affected.
  • Can you tell us a bit more about which tools your team relies on other than Kafka? That way we'll know exactly what is critical for you and can figure out whether we need to postpone. As Nuria mentioned, we'd prefer not to postpone this important migration, but if necessary we'll think about an alternative plan.

To clarify the expected downtime: Is it 45 minutes or 3 to 4 hours?

On the Fundraising side, as @mepps mentioned, 12/2 at 15 or 16 UTC is the launch of a large fundraising effort on Central Notice, and the ability to diagnose potential issues through HDFS (tools below) would be a relief going into that event. Understanding the security concern, would it be possible to move the release date up a few days, to the end of November?

If diagnostics need to be run on Central Notice, Jupyter notebooks, Turnilo, and beeline would be used.

To clarify the expected downtime: Is it 45 minutes or 3 to 4 hours?

It could be 3 to 4 hours; it really depends on the issues that we find.

On the Fundraising side, as @mepps mentioned, 12/2 at 15 or 16 UTC is the launch of a large fundraising effort on Central Notice, and the ability to diagnose potential issues through HDFS (tools below) would be a relief going into that event. Understanding the security concern, would it be possible to move the release date up a few days, to the end of November?

If diagnostics need to be run on Central Notice, Jupyter notebooks, Turnilo, and beeline would be used.

I see, I was not aware that these tools were used; nobody told us (or at least me) that Hadoop would be so important for the FR season. I'll have a chat with my team and report back, but next time it would be great to sync in advance :)

We'd prefer not to do it next week since half of it will be holidays for part of my team, and given the big change we'd prefer to have more coverage. What about something like the 3rd or 4th of December?

During the whole first week of the banner campaign last year we were raising between one and three million dollars a day, so it would be really stressful to lose our ability to diagnose and debug problems in a situation where we could be losing up to a yearly salary during a 4 hour outage. I second the request to move this to either before the banners start or at least a week (preferably two weeks) after they start.

Let's clarify some things here: none of the data in Hadoop is real time; pageviews and EventLogging are delayed 3 to 4 hours from the real-time events. So an issue with CentralNotice happening at 12:00 cannot be investigated right away with data in Hadoop. Does this make sense? Since these are not real-time systems, data sent to CentralNotice at 12:00 will appear in Hadoop at 16:00 or later. To make the best decision here: could you give us some examples of issues you expect to see and troubleshoot (maybe some tickets from the past)?

Data in Kafka is real time, however. Turnilo and Jupyter work with data in Hadoop; the FR Kafka puller works with data in Kafka.
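One way to see that lag for yourself is to list the newest partitions of a table and compare the last hour against the clock. A sketch, assuming the usual year/month/day/hour partition layout (the table name is illustrative):

# Show the most recently imported partitions; the gap between the last
# hour listed and the current time is the import lag.
hive -e "SHOW PARTITIONS wmf.pageview_hourly;" | tail -n 5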

Hi, all,

could you give some examples of issues you expect to see and troubleshoot (maybe some tickets from the past?)?

There are important data points that are only available in Hive and that have been used many times in the past to debug CentralNotice.

Here are just a few examples:
T152650: Spike: Impressions abnormally low for Ireland
T176802
T174719: Investigate email: BH data storage/transfer issue for iPad donations

We don't expect any specific issues, but we need to be ready in case any arise. This is the time of year when we try to have several levels of backup and ensure that all the tools we may need are fully available.

none of the data in Hadoop is real time; pageviews and EventLogging are delayed 3 to 4 hours from the real-time events. So an issue with CentralNotice happening at 12:00 cannot be investigated right away with data in Hadoop

The concern here is access to any data at all from Hive during the upgrade window. Significant issues are often not detected right away. Once they are detected, it's quite possible that we would look at data from days or weeks previous, depending on the circumstances.

We need the systems that we've worked with in the past to be online during this crucial time. If a possible issue is noticed at 12:00 on one of our top donation days, we need to access any data we can right away, regardless of whether it's 3 or 4 hours old.

I would like to emphatically and respectfully request that this change please be made on a different day. It could be before the end-of-year FR campaign starts. It could also be after FR banners have been significantly limited or temporarily suspended, which often happens a bit after the middle of December, though the exact date will depend on how the campaign goes.

Thanks so much!!!!!! Cheers :)

We can move the date to December 16th. Now, it would not be optimal for WMF to move it much further than that; Kerberos is not a cosmetic change, it comes to solve important security issues that carry quite a lot of risk for all datasets, Fundraising ones included.

@Nuria That would be extremely helpful. That first week is incredibly tense for us and Advancement. We usually need all the tools we can get to be absolutely sure our campaigns are performing as well as they can. After the first week, we are usually in a much less risky state.

Ok, December 16th is the go day then; please be so kind as to build it into your schedule.

Ok I'll send emails to announce the new scheduled date :)

To all the FR team: please check https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide in the meantime to familiarize yourselves with what will change!
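As one concrete example of what changes: beeline connections will need a valid Kerberos ticket and the HiveServer2 principal in the JDBC URL. A hedged sketch, with HOST and REALM as placeholders for the real values documented in the user guide:

# After kinit, beeline authenticates via the ticket cache; note the
# principal parameter in the JDBC URL (HOST and REALM are placeholders).
beeline -u "jdbc:hive2://HOST:10000/default;principal=hive/HOST@REALM"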

Hey @elukey, @Nuria, and analytics team: we in online fundraising really appreciate your willingness to move this release back. We understand its importance and appreciate your sensitivity to our short term needs. Thanks a lot for your teamwork and adaptability.

@elukey @Nuria Thanks so much!!!!!!!!!!!!!!!!!!!!!!!!

Thanks also @DStrine, @mepps, @EYener , @Ejegg, @spatton! :)

For all who use the analytics cluster: Please get your hadoop credentials: https://phabricator.wikimedia.org/T237605

@Nuria @elukey Thank you so much for moving this back! I appreciate all your questions and helpful explanations about the implications of this change.

Closing this since Kerberos has been enabled and there seem to be no more questions.

Ottomata added subscribers: MMiller_WMF, Ottomata.

@MMiller_WMF was trying to run Hive queries from Hue today and got

java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

I see the same. Did we ever get Hue + Hive working with Kerberos?

Thanks for adding this, @Ottomata. Hue was working fine for me on Friday, but not today. I do rely on Hue as my main way to access ad hoc results through our team's EventLogging, so I consider this pretty urgent for reporting needs that I have before the end of the week. Please keep me posted!

@MMiller_WMF @Ottomata which query or queries don't work? Hue seems to work for me with Kerberos. I am wondering if the limits in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/561670/ might be the problem, but I'd need to repro first.

I go to https://hue.wikimedia.org/hue/editor/?type=hive and I get the error in the upper right corner; I believe this is because Hue tries to list the Hive databases when it loads the page.

If I run a query, e.g.

select country_code, count(*) as cnt from wmf.projectview_hourly where year=2017 and month=1 group by country_code order by cnt desc limit 1000;

I get the same error again.

I'm experiencing the same as @MMiller_WMF. Hue was working fine on Friday, and today it's funky.
Today I'm unable to see any tables and I am seeing error messages. On the left sidebar, the error message says "Error loading databases".
On the top right, I am intermittently getting this error message in red font:
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
If I already have a correct, working query, I can paste it in and get results, but I cannot expand the database icon to see any tables. As DataGrip is not working, I'm hoping to use Hue to test out queries and see what tables are available. To be sure, Hue has been problematic for me (there are some tables that it never shows me), so if we can ultimately get DataGrip to work, I will be prioritizing that tool.

Maybe screenshots or a screencast would help, because I do not see any errors on my end. For data analysts we recommend using Jupyter rather than Hue for ad hoc access; Hue has memory issues and is really not an optimal tool for querying. @Iflorez, if you want help getting started with Jupyter, we can help.

Hi @Nuria, I am new and was planning to use Hue as well. I'm happy to use Jupyter for ad hoc queries; if you have info on where best to access it, etc., that would be helpful for me as well.

What should we use to replace the functionality of Hue allowing us to create our own project dbs?

Thanks
Shay

What should we use to replace the functionality of Hue allowing us to create our own project dbs?

All that functionality is available through Hive's command line; @nettrom_WMF on your team can help you with that.
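For instance, creating a project database is a one-liner from the hive CLI; the database name and HDFS location below are illustrative placeholders:

# Create a personal project database; adjust the name and the location
# to your own user directory conventions.
hive -e "CREATE DATABASE IF NOT EXISTS my_project LOCATION '/user/my-username/my_project.db';"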

I would strongly recommend analysts not to use hue at all; it is really not a tool for data science. Please see https://wikitech.wikimedia.org/wiki/SWAP for Jupyter notebook information; your team uses it frequently, so anyone can advise.

@Nuria @elukey -- below is a screenshot of a query I'm running and the result that I'm getting. While I do know how to run similar queries in Jupyter, I like to use the Hue environment because of things like copy/pasting outputs, query history, the table list and the ability to see the columns in a table, autocomplete, etc. I hope that we can always have that sort of IDE for our querying.

[screenshot: Hue query and the resulting error]

Thank you @Nuria!
I've been using notebooks along with the wmf data package.
I aim to shift to using notebooks with Spark and will reach out if I have any issues. Rereading the SWAP documentation is helpful.

Ok, I am good with Jupyter but will still need login access to Hue so I can see the queries shared with me by Mikhail, and Kerberos access as well for notebooks. Are my requests going to the right place?
Shay

@SNowick_WMF you need to file an access request (this is not the right ticket). I would ping your team so they can help you with it. An example of one: https://phabricator.wikimedia.org/T241838

@Nuria -- thanks for pointing out that issue in my query. Here's a query that I'm confident doesn't have any errors, but is still not running:

[screenshot: Hue query and the resulting error]

Change 562474 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hive: store delegation tokens in the db

https://gerrit.wikimedia.org/r/562474
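For context: upstream Hive selects where delegation tokens are stored via the hive.cluster.delegation.token.store.class property (in memory, in ZooKeeper, or in the metastore database); the patch above moves ours to the database. The actual configuration is in the linked Gerrit change; as a sketch, the effective value can be inspected from a Hive session:

# Print the effective delegation token store class; after this change it
# should be the database-backed store (DBTokenStore).
hive -e "SET hive.cluster.delegation.token.store.class;"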

@elukey, @Ottomata, thank you for helping us figure out this issue! 👏

What should we use to replace the functionality of Hue allowing us to create our own project dbs?

All that functionality is available through Hive's command line; @nettrom_WMF on your team can help you with that.

I would strongly recommend analysts not to use hue at all; it is really not a tool for data science. Please see https://wikitech.wikimedia.org/wiki/SWAP for Jupyter notebook information; your team uses it frequently, so anyone can advise.

@Nuria, I disagree, and I would recommend that other analysts use Hue where appropriate. Not all "data science" needs to be done using the command line; like @Iflorez, I use SWAP extensively, but I also use Hue and Google Spreadsheets when they're appropriate tools for the job. In addition, whether or not Hue is an appropriate tool for analysts, we definitely need an accessible visual client for non-analysts like @MMiller_WMF.

If Hue has technical limitations that keep you from being able to recommend it, then we need to work together to find an accessible tool that you can recommend. @mpopov has started to use DataGrip and that's a potential alternative.

I disagree, and I would recommend that other analysts use Hue where appropriate.

I do not disagree. Hue is an OK tool for Hadoop administrators and for digging into issues with job execution. But it was never designed for the intensive multi-user use case that big-data science requires, so it will always be a less-than-optimal tool for research; the more users it has, the less optimal it will be, as it relies on a central data store.
That said, Hue should support occasional ad hoc usage fine, which I think is @MMiller_WMF's use case.

Change 562474 merged by Elukey:
[operations/puppet@production] hive: store delegation tokens in the db

https://gerrit.wikimedia.org/r/562474

@MMiller_WMF @Ottomata can you try now to use hue and see if the issue is still there?

I disagree, and I would recommend that other analysts use Hue where appropriate.

I do not disagree. Hue is an OK tool for Hadoop administrators and for digging into issues with job execution. But it was never designed for the intensive multi-user use case that big-data science requires, so it will always be a less-than-optimal tool for research; the more users it has, the less optimal it will be, as it relies on a central data store.
That said, Hue should support occasional ad hoc usage fine, which I think is @MMiller_WMF's use case.

Unlike "I would strongly recommend analysts not to use hue at all", this sounds reasonable to me 😊 I generally use it for quick queries and schema exploration, and agree that for larger projects, analysts should look elsewhere.

Out of curiosity, by "central data store" do you mean the Hive Metastore? If so, that caveat would apply to anything that uses Hive including Impyla and Beeline, right?

I think Nuria means anything that stores user and/or dataset metadata in a separate database (other than the Hive Metastore). Hue and Superset both do this. Hue could be problematic if lots of folks used it, as all HTTP requests go to a single Ganeti instance. (Not that this is much better than notebooks right now.)

In T238560#5782185, @Neil_P._Quinn_WMF wrote:

I disagree, and I would recommend that other analysts use Hue where appropriate.

I do not disagree. Hue is an OK tool for Hadoop administrators and for digging into issues with job execution. But it was never designed for the intensive multi-user use case that big-data science requires, so it will always be a less-than-optimal tool for research; the more users it has, the less optimal it will be, as it relies on a central data store.
That said, Hue should support occasional ad hoc usage fine, which I think is @MMiller_WMF's use case.

Unlike "I would strongly recommend analysts not to use hue at all", this sounds reasonable to me 😊 I generally use it for quick queries and schema exploration, and agree that for larger projects, analysts should look elsewhere.

By the way, I always use Hue to prototype my Hive queries (and then copy them to scripts to be run with the hive CLI) and for one-off queries.

Rephrasing: "I would strongly recommend analysts not to use hue at all" to do data science, that is.

I think Nuria means anything that stores user and/or dataset metadata in a separate database (other than the Hive Metastore). Hue and Superset both do this

Right.

@elukey -- yes! It is working now. Thank you.

@elukey -- yes! It is working now. Thank you.

Super, closing it again; let's open a new task if something comes up again (hopefully not! :)