Description
Status | Subtype | Assigned | Task
---|---|---|---
Invalid | | GoranSMilovanovic | T171258 WDCM: Puppetization
Resolved | | GoranSMilovanovic | T180904 Hive database for analytics-wmde user
Event Timeline
Do we have any examples, anywhere, on what goes into a manifest for a Hadoop/Hive/Sqoop user? Must be the case.
Most probably somewhere in Analytics/Refinery... hmmm... let's see: the Analytics/Systems/Data Lake/Administration/Edit/Pipeline page says "Sqoop job runs in 1003 (although that might change, check puppet) and thus far it logs to: /var/log/refinery/sqoop-mediawiki.log", but there is no link to puppet, and this would probably only give an idea of how to puppetize the future Sqoop job on stat1004.
Also I am not sure whether browsing puppet/modules/reportupdater/ helps, but from what I've seen: probably not.
Finally, I've found this example (careful: it's part of a third-party Puppet module to deploy Hadoop) which may help in figuring out the constraints that a user must satisfy in order to use Hadoop; but I am not sure whether the same or a similar approach would apply to Hive and Sqoop as well.
@Ottomata: please, how does one puppetize a Hadoop/Hive+Sqoop user, if you know of an example somewhere? Note: the user will be orchestrating his Hive/Sqoop jobs from within R on stat1004 and stat1005.
Hm, so creating a Hive database is easy; in fact, you can do it yourself!
But, a new system user that has Hadoop access is not easy for complicated reasons. See also:
https://phabricator.wikimedia.org/T174110
https://phabricator.wikimedia.org/T174465
This is halfway in progress, but is mostly complicated because it changes the way ops manages user accounts. It requires a bunch of communication and buy-in from ops folks. It can be done, but it is low priority for me at the moment. :(
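For the "creating a Hive database is easy" part, here is a minimal sketch of what that step might look like when driven from a script. It only builds the HiveQL statement and the matching `hive -e` command line; it does not run anything. The database name `wmde_example` and both helper functions are hypothetical, and the name check is a deliberately conservative assumption (letters, digits, underscores), which is why a hyphenated name like the `analytics-wmde` account name is rejected here.

```python
import re
import shlex


def create_database_statement(db_name: str) -> str:
    """Build a HiveQL statement that creates a database if it is missing.

    Assumption: only letters, digits and underscores are accepted,
    so the name can be used unquoted in the statement.
    """
    if not re.fullmatch(r"[A-Za-z0-9_]+", db_name):
        raise ValueError(f"unsupported database name: {db_name!r}")
    return f"CREATE DATABASE IF NOT EXISTS {db_name};"


def hive_command(db_name: str) -> list[str]:
    """Return the argument list for running the statement via `hive -e`."""
    return ["hive", "-e", create_database_statement(db_name)]


if __name__ == "__main__":
    # Hypothetical database name, used for illustration only.
    print(shlex.join(hive_command("wmde_example")))
```

The printed command can be reviewed and then run by whichever user actually has Hadoop access; the harder part discussed in this task, giving a system user that access, is not covered by this sketch.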
@Ottomata Please take your time if you are about to claim this task at all.
I could have recalled the Apache Sqoop/stat1005 related problem earlier in the production of the WDCM system (in fact, I have initiated that discussion).
The responsibility for the WDCM being late in production is thus mine. I am also aware of T174110 and T174465.
One last small favor I ask of you now: please provide an estimate of when you think it would be possible to have the analytics-wmde users on stat1004 and stat1005 with the access rights as requested (on stat1004: MySQL, Sqoop, Hive; on stat1005: Hive, beyond what it already has), if you can provide such an estimate at this point. Thank you.
Closing the task as (conditionally) resolved given that we already have T171258 and its branches for everything related to WDCM puppetization under the analytics-wmde user account.
That's not really a reason to close this task.
This task is a subtask of T171258.
Either we need this task to be done (and it should stay open) or we don't need it (and we should close it as declined, not resolved).
@Addshore Then leave it open. We will get back to this as soon as the labs WDCM component gets puppetized. Thanks.
@Addshore Following the introduction of Kerberos authentication, all Hive and Spark scripts needed for analytics in this case are run by analytics-privatedata. So there is no need for any Hive database for analytics-wmde user, I guess.