
Doubts and questions about Kerberos and Hadoop
Open, High, Public

Description

This is a tracking task for user doubts and questions about what will change when Kerberos authentication is enabled for Hadoop.

Before adding any comment in the task, please read: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide

Kerberos auth will be enabled on December 16th (moved from December 2nd).

Event Timeline

elukey created this task. Mon, Nov 18, 2:57 PM
Restricted Application added a subscriber: Aklapper. Mon, Nov 18, 2:57 PM
Ottomata triaged this task as High priority. Mon, Nov 18, 4:31 PM
Ottomata moved this task from Incoming to Operational Excellence on the Analytics board.
EYener added a subscriber: EYener. Wed, Nov 20, 6:41 PM
mepps added a subscriber: mepps. Wed, Nov 20, 7:27 PM

I have no doubts about this change generally, but Fundraising and Fundraising tech rely on this data heavily and will have just started the annual fundraiser on December 2nd. Could this change be made in mid-December or early January? The risk is that we won't have access to highly needed data while thousands of donations are coming in.

We are concerned about the potential downtime to Hadoop (even 45 minutes could impact us if anything was wrong with CentralNotice and we needed access to the data during that time).

We also want to be clear on which systems of ours this affects. We currently consume data from the Kafka stream which we believe is not connected to Hadoop. Is this correct? We are also transitioning to consuming data from the EventLogging system. Will this be impacted?

Nuria added a comment. Edited Wed, Nov 20, 7:40 PM

@mepps the data you need should be present in kafka even if hadoop has an outage, which we hope does not happen.

We are also transitioning to consuming data from the EventLogging system.

Eventlogging hive tables will be affected. The eventlogging stream on kafka is not.

Could this change be made in mid-December or early January?

Sorry, but we cannot postpone it that far; with vacations and all hands, it would mean that a pretty big security concern does not get addressed for a couple of months.

mepps added a comment. Wed, Nov 20, 8:32 PM

@Nuria Thanks for the quick and thorough response! To confirm, I read your comments as saying no kafka streams will either have downtime on December 2nd or require authentication. Is that correct?

There are still concerns about having tooling for the fundraising team during the time of scheduled outage, but I will allow folks from that team to address that. As I said previously, this exact date is particularly hard because the fundraiser will have just started (it's also giving Tuesday in the US) which means we will have the highest volume of donations coming in. I do understand how hard scheduling an upgrade like this is for a team.

Nuria added a comment. Wed, Nov 20, 9:02 PM

@mepps, Right, kafka is not affected

@mepps Thanks for reaching out! A couple of comments:

  • Kafka and eventlogging will keep working as expected, the only part that will have some downtime is the data import to HDFS. Pulling data from Kafka will keep working as expected.
  • Can you tell us a bit more about which tools your team relies on other than Kafka? That way we'll know exactly what is critical for you and can figure out whether we need to postpone or not. As Nuria mentioned, we'd prefer not to postpone this important migration, but if necessary we'll think about an alternative plan.

To clarify the expected downtime: Is it 45 minutes or 3 to 4 hours?

On the Fundraising side, as @mepps mentioned, 12/2 at 15 or 16 UTC is the launch of a large fundraising effort on Central Notice, and the ability to diagnose potential issues through HDFS (tools below) would be a relief going into that event. Understanding the security concern, would it be possible to move the release date up a few days to end of November?

If diagnostics need to be run on Central Notice, Jupyter notebooks, Turnilo, and beeline would be used.

elukey added a comment. Edited Thu, Nov 21, 3:56 PM

To clarify the expected downtime: Is it 45 minutes or 3 to 4 hours?

It can be 3 to 4 hours; it really depends on the issues that we find.

On the Fundraising side, as @mepps mentioned, 12/2 at 15 or 16 UTC is the launch of a large fundraising effort on Central Notice, and the ability to diagnose potential issues through HDFS (tools below) would be a relief going into that event. Understanding the security concern, would it be possible to move the release date up a few days to end of November?
If diagnostics need to be run on Central Notice, Jupyter notebooks, Turnilo, and beeline would be used.

I see. I was not aware that these tools were used; nobody told us (or at least me) that Hadoop downtime would be a significant problem for the FR season. I will have a chat with my team and report back, but next time it would be great to sync in advance :)

We'd prefer not to do it next week since half of it will be holidays for part of my team, and given the big change we'd prefer to have more coverage. What about something like the 3rd or 4th of December?

Ejegg added a comment. Edited Thu, Nov 21, 5:10 PM

During the whole first week of the banner campaign last year we were raising between one and three million dollars a day, so it would be really stressful to lose our ability to diagnose and debug problems in a situation where we could be losing up to a yearly salary during a 4 hour outage. I second the request to move this to either before the banners start or at least a week (preferably two weeks) after they start.

Nuria added a comment. Edited Thu, Nov 21, 5:29 PM

Let's clarify some things here: none of the data in Hadoop is real time; pageviews and eventlogging are delayed 3 to 4 hours from real-time events. So an issue with CentralNotice happening at 12:00 cannot be troubleshot with data in Hadoop. Does this make sense? Not having real-time systems means that data sent to CentralNotice at 12pm will appear in Hadoop at 4pm or later. To make the best decision here: could you give some examples of issues you expect to see and troubleshoot (maybe some tickets from the past?)?
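As a rough illustration of the timing argument above, here is a minimal sketch that computes when an event would nominally become visible in Hadoop. The 3 to 4 hour bounds come from the comment; the function names and the example date are hypothetical, and the real delay may be longer ("4pm or later").

```python
from datetime import datetime, timedelta

# Assumption: the 3 to 4 hour ingestion delay quoted in the comment,
# treated here as fixed nominal bounds. The real delay can exceed this.
MIN_DELAY = timedelta(hours=3)
MAX_DELAY = timedelta(hours=4)


def hadoop_availability_window(event_time):
    """Return the nominal (earliest, latest) times at which an event
    that happened at event_time is expected to show up in Hadoop."""
    return event_time + MIN_DELAY, event_time + MAX_DELAY


def is_queryable(event_time, now):
    """Conservative check: True only once the upper end of the
    nominal delay has elapsed."""
    return now >= event_time + MAX_DELAY


# Illustrative date only; an issue noticed at 12:00.
event = datetime(2019, 12, 2, 12, 0)
earliest, latest = hadoop_availability_window(event)
print(earliest.strftime("%H:%M"), latest.strftime("%H:%M"))  # prints "15:00 16:00"
```

Under these assumptions, a query run at 13:00 would not yet see the 12:00 event, while older data (hours, days, or weeks back) would be unaffected by the ingestion delay.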

Nuria added a comment. Thu, Nov 21, 5:31 PM

Data in Kafka is real time, however. Turnilo and Jupyter work with data in Hadoop; the FR Kafka puller works with data in Kafka.

Hi, all,

could you give some examples of issues you expect to see and troubleshoot (maybe some tickets from the past?)?

There are important data points that are only available in Hive and that have been used many times in the past to debug CentralNotice.

Here are just a few examples:
T152650: Spike: Impressions abnormally low for Ireland
{T176802}
T174719: Investigate email: BH data storage/transfer issue for iPad donations

We don't expect any specific issues, but we need to be ready in case any arise. This is the time of year when we try to have several levels of backup and ensure that all the tools we may need are fully available.

none of the data in Hadoop is real time; pageviews and eventlogging are delayed 3 to 4 hours from real-time events. So an issue with CentralNotice happening at 12:00 cannot be troubleshot with data in Hadoop

The concern here is access to any data at all from Hive during the upgrade window. Significant issues are often not detected right away. Once they are detected, it's quite possible that we would look at data from days or weeks previous, depending on the circumstances.

We need the systems that we've worked with in the past to be online during this crucial time. If a possible issue is noticed at 12:00 on one of our top donation days, we need to access any data we can right away, regardless of whether it's 3 or 4 hours old.

I would like to emphatically and respectfully request that this change be made on a different day. It could be before the end-of-year FR campaign starts. It could also be after FR banners have been significantly limited or temporarily suspended, which often happens a bit after the middle of December, though the exact date will depend on how the campaign goes.

Thanks so much!!!!!! Cheers :)

Nuria added a comment. Thu, Nov 21, 9:48 PM

We can move the date to December 16th. However, it would not be optimal for WMF to move it much further than that. Kerberos is not a cosmetic change; it solves important security issues that carry quite a lot of risk for all datasets, Fundraising ones included.

@Nuria That would be extremely helpful. That first week is incredibly tense for us and Advancement. We usually need all the tools we can get to be absolutely sure our campaigns are performing as best as they can. After the first week, we are usually in a much less risky state.

Ok, December 16th is the go day then. Please be so kind as to build it into your schedule.

Ok I'll send emails to announce the new scheduled date :)

To all the FR team: please check https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide in the meantime to familiarize yourselves with what will change!

elukey updated the task description. Fri, Nov 22, 8:00 AM

Hey @elukey, @Nuria, and analytics team: we in online fundraising really appreciate your willingness to move this release back. We understand its importance and appreciate your sensitivity to our short term needs. Thanks a lot for your teamwork and adaptability.

mforns moved this task from Next Up to In Progress on the Analytics-Kanban board. Mon, Nov 25, 5:37 PM

@elukey @Nuria Thanks so much!!!!!!!!!!!!!!!!!!!!!!!!

Thanks also @DStrine, @mepps, @EYener , @Ejegg, @spatton! :)

Nuria added a comment. Mon, Nov 25, 9:21 PM

For all who use the analytics cluster: please get your Hadoop credentials: https://phabricator.wikimedia.org/T237605

mepps added a comment. Tue, Nov 26, 3:58 PM

@Nuria @elukey Thank you so much for moving this back! I appreciate all your questions and helpful explanations about the implications of this change.