
WDCM_Sqoop_Clients.R fails from stat1004
Closed, Resolved · Public

Description

The main WDCM data engineering procedure:

WDCM_Sqoop_Clients.R

fails from stat1004.

Possible reasons:

Executing command: USE goransm; DROP TABLE IF EXISTS wdcm_clients_wb_entity_usage;
[WARN] Failed to create directory: /home/analytics-privatedata/.beeline
java.io.IOException: No such file or directory

Also, upon kinit on stat1004:

use goransm;
select * from wdcm_clients_wb_entity_usage where wiki_db="enwiki" limit 100;

results in:

Error: Error while compiling statement: FAILED: SemanticException Unable to determine if hdfs://analytics-hadoop/tmp/wmde/analytics/wdcm/wdcmsqoop/wdcm_clients_wb_entity_usage is encrypted: org.apache.hadoop.security.AccessControlException: Permission denied: user=goransm, access=EXECUTE, inode="/tmp/wmde":analytics-privatedata:hdfs:drwxr-x---

Inspecting the problem now.

Event Timeline

Hi Goran! We have recently introduced a stricter umask policy for HDFS: new files and directories no longer get the "other" permission bits set by default. In this case it means that if you create files on HDFS via the analytics-privatedata user, your own user will probably not be able to read them (note the hdfs group on the directory below). The /tmp/wmde directory probably needs a change:

elukey@stat1004:~$ ls -ld /mnt/hdfs/tmp/wmde
drwxr-x--- 2 analytics-privatedata hdfs 4096 Jan 28 23:31 /mnt/hdfs/tmp/wmde
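
For comparison, the same check can be run directly against HDFS rather than through the fuse mount (a minimal sketch, assuming a valid Kerberos ticket; the -d flag lists the directory entry itself instead of its contents):

hdfs dfs -ls -d /tmp/wmde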

If you are ok with it, I can run the following command to set analytics-privatedata:analytics-privatedata-users ownership recursively on the dir:

sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics-privatedata:analytics-privatedata-users /tmp/wmde

Would it be ok?

@elukey First of all: thanks for reaching out!

sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics-privatedata:analytics-privatedata-users /tmp/wmde

Would it be ok?

I don't have any detailed knowledge of how kerberos-run-command operates, so please allow me the following two questions:

  • Would that allow goransm to reach something under /tmp/wmde that was previously created by analytics-privatedata?
  • Would that allow analytics-privatedata to write to the goransm database in hdfs?

Because, as far as my understanding goes, these two questions seem to be at the core of the problem.

Thank you, Luca.

For the first question I'd say yes; for the second it really depends on how you are creating tables in Hive, i.e. whether you are using an external location (I assume /tmp/wmde) or not. I just executed the command after inspecting the files' permissions; it should be ok, let me know if it solves the problem or not!
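
To illustrate the "external location" distinction above: if the Hive table is declared EXTERNAL with a LOCATION under /tmp/wmde, its data files live in that directory and are owned by whoever wrote them, while a managed table would live under the goransm database's own warehouse directory. The following is a hypothetical sketch only, run via beeline; the column list and the AVRO storage clause are assumptions, not the actual WDCM DDL:

sudo -u analytics-privatedata kerberos-run-command analytics-privatedata beeline -e "
USE goransm;
CREATE EXTERNAL TABLE IF NOT EXISTS wdcm_clients_wb_entity_usage (
  eu_row_id BIGINT,
  eu_entity_id STRING
)
PARTITIONED BY (wiki_db STRING)
STORED AS AVRO
LOCATION 'hdfs://analytics-hadoop/tmp/wmde/analytics/wdcm/wdcmsqoop/wdcm_clients_wb_entity_usage';
"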

@elukey Unfortunately... from stat1004:

Executing command: USE goransm; DROP TABLE IF EXISTS wdcm_clients_wb_entity_usage;
[WARN] Failed to create directory: /home/analytics-privatedata/.beeline
java.io.IOException: No such file or directory

I tried to reproduce with sudo -u analytics-privatedata kerberos-run-command analytics-privatedata beeline and the commands that you posted worked, but I got the error upon exiting. It is a minor issue: beeline tries to write its history to analytics-privatedata's home dir, which doesn't exist since it is a system user, and the new version of Hive is a little more verbose about it.

@elukey Thank you. I now think that we have faced a similar problem before - your comment in T274866#6840079 has just reminded me of it... Please give me some time to search Phabricator a bit and try to remember what happened then and how it was solved. In effect, what happens is that WDCM_Sqoop_Clients.R - which just orchestrates Apache Sqoop and HiveQL calls from stat1004 - ends up producing an empty goransm.wdcm_clients_wb_entity_usage table for some reason.

@elukey I think it is the Apache Sqoop call that fails. Example:

Command:

sudo -u analytics-privatedata kerberos-run-command analytics-privatedata /usr/bin/sqoop import --connect jdbc:mysql://s1-analytics-replica.eqiad.wmnet:3311/enwiki --password-file /user/goransm/mysql-analytics-research-client-pw.txt --username research -m 16 --driver org.mariadb.jdbc.Driver --query "select * from wbc_entity_usage where \$CONDITIONS" --split-by eu_row_id --as-avrodatafile --target-dir /tmp/wmde/analytics/wdcm/wdcmsqoop/wdcm_clients_wb_entity_usage/wiki_db=enwiki --delete-target-dir

Output:

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/java/maven3-slf4j-provider.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/java/slf4j-jcl.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/java/slf4j-jdk14.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/java/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/java/slf4j-nop.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/share/java/slf4j-simple.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Detected both log4j-over-slf4j.jar AND bound slf4j-log4j12.jar on the class path, preempting StackOverflowError. 
SLF4J: See also http://www.slf4j.org/codes.html#log4jDelegationLoop for more details.
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.slf4j.impl.StaticLoggerBinder.<init>(StaticLoggerBinder.java:72)
        at org.slf4j.impl.StaticLoggerBinder.<clinit>(StaticLoggerBinder.java:45)
        at org.slf4j.LoggerFactory.bind(LoggerFactory.java:150)
        at org.slf4j.LoggerFactory.performInitialization(LoggerFactory.java:124)
        at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:412)
        at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:357)
        at org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:155)
        at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:671)
        at org.apache.sqoop.Sqoop.<clinit>(Sqoop.java:42)
Caused by: java.lang.IllegalStateException: Detected both log4j-over-slf4j.jar AND bound slf4j-log4j12.jar on the class path, preempting StackOverflowError. See also http://www.slf4j.org/codes.html#log4jDelegationLoop for more details.
        at org.slf4j.impl.Log4jLoggerFactory.<clinit>(Log4jLoggerFactory.java:54)
        ... 9 more

I can confirm that what you stated in T274866#6840079 regarding:

sudo -u analytics-privatedata kerberos-run-command analytics-privatedata beeline

is correct; all beeline and hdfs operations work just fine. The problem seems to be related to /usr/bin/sqoop, I would say.
Please advise. The code itself was not altered in any way before this problem occurred.

Thank you.

@WMDE-leszek @Lydia_Pintscher

  • The regular WDCM system updates are stopped until this is resolved, since they cannot run before the wdcm_clients_wb_entity_usage Hive table is produced anyway;
  • This will affect all analytical systems in Wikidata Analytics that rely on Wikidata reuse statistics (and many do);
  • Since the procedure failed unexpectedly, the current WDCM system update failed too, and not all of our WMDE Analytics datasets were updated, causing the following three dashboards to stop operating:

Based on my analysis over the previous days, I do not think that there is anything that can be done to fix this by intervening in the code of this WDCM procedure. It ran successfully for years, with only minor modifications (e.g. the introduction of Kerberos auth), and then stopped running. Since we did not intervene, it must be related to some change in the analytics infrastructure. I hope that @elukey will be able to find out more about this problem, and I will of course be there to provide any assistance if needed.

Very interesting, thanks a lot for the report!

/usr/bin/sqoop does

SQOOP_JARS=`ls /var/lib/sqoop/*.jar /usr/share/java/*.jar 2>/dev/null`

That contains both jars, see:

elukey@stat1004:~$ ls /var/lib/sqoop/*.jar /usr/share/java/*.jar 2>/dev/null | tr " " "\n" | grep log4j
/usr/share/java/log4j-over-slf4j-1.7.25.jar
/usr/share/java/log4j-over-slf4j.jar
/usr/share/java/slf4j-log4j12-1.7.25.jar
/usr/share/java/slf4j-log4j12.jar
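
For illustration, one way a launcher could avoid this kind of conflict is to filter the offending binding out of that glob before building the classpath. This is only a rough sketch of the idea, not the actual content of /usr/bin/sqoop nor the eventual Bigtop fix:

SQOOP_JARS=`ls /var/lib/sqoop/*.jar /usr/share/java/*.jar 2>/dev/null | grep -v log4j-over-slf4j`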

The sqoop bin script is the one provided by Bigtop:

elukey@stat1004:~$ dpkg -S /usr/bin/sqoop
sqoop: /usr/bin/sqoop

And the jars are provided by:

elukey@stat1004:~$ dpkg -S /usr/share/java/log4j-over-slf4j.jar
libslf4j-java: /usr/share/java/log4j-over-slf4j.jar

elukey@stat1004:~$ dpkg -S /usr/share/java/slf4j-log4j12.jar
libslf4j-java: /usr/share/java/slf4j-log4j12.jar

It is a little strange: the sqoop bin didn't change between CDH and Bigtop, I checked on analytics-tool1001 (the Hue host, the only one remaining with CDH packages). What changed, though, is the presence of the log4j-over-slf4j jar:

elukey@analytics-tool1001:~$ dpkg -S slf4j-log4j12.jar
hadoop-client: /usr/lib/hadoop/client-0.20/slf4j-log4j12.jar
hbase: /usr/lib/hbase/lib/slf4j-log4j12.jar
hadoop: /usr/lib/hadoop/lib/slf4j-log4j12.jar
zookeeper: /usr/lib/zookeeper/lib/slf4j-log4j12.jar
hadoop-client: /usr/lib/hadoop/client/slf4j-log4j12.jar

elukey@analytics-tool1001:~$ dpkg -S log4j-over-slf4j.jar
dpkg-query: no path found matching pattern *log4j-over-slf4j.jar*

I checked with apt-cache rdepends libslf4j-java on stat1004 to see which packages need libslf4j-java; the list is long and I don't see any known suspects.

@GoranSMilovanovic to unblock you I have done (on stat1004):

elukey@stat1004:~$ sudo mv /usr/share/java/log4j-over-slf4j.jar .

This should unblock sqoop for the moment; can you please check and report back? It is not the final solution of course, just a temporary hack :)
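
A quick way to confirm that the sqoop launcher now starts cleanly, without kicking off an actual import, is the version subcommand; it goes through the same classpath setup, so it would hit the same SLF4J error if the conflict were still there (a sketch, run from stat1004):

/usr/bin/sqoop version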

@elukey Thank you very much for a prompt reaction! - testing now and getting back to you.

@elukey It works! Thank you! Q:

It is not the final solution of course, just a temporary hack :)

There is a regular system update that depends on this procedure scheduled for tomorrow, 20 February.
Should I hold that update back until the temporary hack evolves into a final solution, or should I proceed with it?
Thank you very much again, Luca!

@GoranSMilovanovic we can keep stat1004 in this state for the weekend so your regular update goes through; we'll revert it only when a final solution is found. Would that be ok?

Thank you for the report! We missed this use case during the upgrade; it is also important for our monthly jobs!

@elukey

we can keep stat1004 in this state for the weekend so your regular update goes through; we'll revert it only when a final solution is found. Would that be ok?

Perfect. I suggest that we keep this ticket open with a lowered priority until you decide on the final state of stat1004 and of other clients that might use Sqoop?

Thank you for the report! We missed this use case during the upgrade; it is also important for our monthly jobs!

I am glad that I was able to help, but it was completely accidental : ) - The dashboards failed first, then I searched for the cause in the ETL/ML procedures, and finally spotted the Sqoop-related thing here...

Thank you, Luca.

GoranSMilovanovic lowered the priority of this task from High to Medium. Feb 19 2021, 10:40 AM

@WMDE-leszek @Lydia_Pintscher The updates are unblocked now; we expect everything to be back in the expected state by the early hours of February 21st.

I am glad that I was able to help

Indeed! We would have been caught by surprise on the next 1st of the month with our sqoop failing!
Thanks a lot @GoranSMilovanovic, and @elukey for the rapid response :)

@JAllemandou Anytime.

@WMDE-leszek @Lydia_Pintscher The Wikidata Analytics WDCM dashboards are back and operational. I will check the other systems one by one in the following hours, but I do not expect anything problematic there.

@elukey Please let me know when you decide what needs to change on the Analytics Clients, and whether I then need to change something in the way I use Apache Sqoop following your action. I suggest keeping the ticket open until then. Thank you.

@GoranSMilovanovic yep, I will ping you when a stable solution is in place. I am still trying to figure out how to fix it properly, but I fear that an upstream task for Apache Bigtop is required (will file it shortly).

Change 666356 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] bigtop::sqoop: apply workaround for BIGTOP-3508

https://gerrit.wikimedia.org/r/666356

Change 666356 merged by Elukey:
[operations/puppet@production] bigtop::sqoop: apply workaround for BIGTOP-3508

https://gerrit.wikimedia.org/r/666356

@GoranSMilovanovic I applied a more permanent fix that seems to work in my tests; please let me know if it doesn't when you have a moment :)

@elukey Do you mind if we wait for the next regular update of our Wikidata Analytics and see then? It takes place in less than a week.

@elukey All checked - all superfine. Thank you very much again for your intervention!

Change 667894 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] bigtop::sqoop: apply upstream patch/override to sqoop bin

https://gerrit.wikimedia.org/r/667894

Change 667894 merged by Elukey:
[operations/puppet@production] bigtop::sqoop: apply upstream patch/override to sqoop bin

https://gerrit.wikimedia.org/r/667894

@GoranSMilovanovic I just applied the last patch for sqoop; it shouldn't change anything on your side, but it is what upstream merged. Lemme know if you see problems :)