
asoranking timer failed on stat1007
Closed, Resolved (Public)

Description

Beginning of the log output. It looks like the call to beeline runs into a missing directory, and the output it returns is in an unexpected format:

Mar  1 12:00:02 stat1007 performance-asoranking[30544]: 2021-03-01 12:00:02,840 - ASORanking - DEBUG - Getting CPU medians per country per network type
Mar  1 12:00:02 stat1007 performance-asoranking[30544]: 2021-03-01 12:00:02,841 - ASORanking - DEBUG - Running SELECT nt.event.originCountry AS country, PERCENTILE(cb.event.score, 0.5) AS score
Mar  1 12:00:02 stat1007 performance-asoranking[30544]:             FROM event.CpuBenchmark AS cb JOIN event.NavigationTiming AS nt
Mar  1 12:00:02 stat1007 performance-asoranking[30544]:             ON cb.event.pageviewToken = nt.event.pageviewToken
Mar  1 12:00:02 stat1007 performance-asoranking[30544]:             WHERE cb.year = 2021 AND cb.month = 2 AND nt.year = 2021 AND nt.month = 2
Mar  1 12:00:02 stat1007 performance-asoranking[30544]:             AND nt.event.netinfoConnectionType  = 'cellular' AND nt.event.mobileMode = 'stable'
Mar  1 12:00:02 stat1007 performance-asoranking[30544]:             GROUP BY nt.event.originCountry;
Mar  1 12:00:04 stat1007 performance-asoranking[30544]: SLF4J: Class path contains multiple SLF4J bindings.
Mar  1 12:00:04 stat1007 performance-asoranking[30544]: SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
Mar  1 12:00:04 stat1007 performance-asoranking[30544]: SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
Mar  1 12:00:04 stat1007 performance-asoranking[30544]: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Mar  1 12:00:04 stat1007 performance-asoranking[30544]: SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Mar  1 12:00:05 stat1007 performance-asoranking[30544]: !connect jdbc:hive2://analytics-hive.eqiad.wmnet:10000/default;principal=hive/analytics-hive.eqiad.wmnet@WIKIMEDIA analytics-privatedata [passwd stripped]
Mar  1 12:01:34 stat1007 performance-asoranking[30544]: [WARN] Failed to create directory: /home/analytics-privatedata/.beeline
Mar  1 12:01:34 stat1007 performance-asoranking[30544]: java.io.IOException: No such file or directory
Mar  1 12:01:34 stat1007 performance-asoranking[30544]: #011at java.io.UnixFileSystem.createFileExclusively(Native Method)
Mar  1 12:01:34 stat1007 performance-asoranking[30544]: #011at java.io.File.createNewFile(File.java:1014)
Mar  1 12:01:34 stat1007 performance-asoranking[30544]: #011at jline.console.history.FileHistory.flush(FileHistory.java:82)
Mar  1 12:01:34 stat1007 performance-asoranking[30544]: #011at org.apache.hive.beeline.BeeLine$1.run(BeeLine.java:542)
Mar  1 12:01:34 stat1007 performance-asoranking[30544]: #011at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
Mar  1 12:01:34 stat1007 performance-asoranking[30544]: #011at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Mar  1 12:01:34 stat1007 performance-asoranking[30544]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Mar  1 12:01:34 stat1007 performance-asoranking[30544]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Mar  1 12:01:34 stat1007 performance-asoranking[30544]: #011at java.lang.Thread.run(Thread.java:748)
Mar  1 12:01:34 stat1007 performance-asoranking[30544]: b'Skipping line 11: expected 1 fields, saw 2\nSkipping line 12: expected 1 fields, saw 2\nSkipping line 13: expected 1 fields, saw 2\nSkipping line 14: expected 1 fields, saw 2\nSkipping line 15: expected 1 fields, saw 2\nSkipping line 16: expected 1 fields, saw 2\nSkipping line 17: expected 1 fields, saw 2\nSkipping line 18: expected 1 fields, saw 2\nSkipping line 19: expected 1 fields, saw 2\nSkipping line 20: expected 1 fields, saw 2\nSkipping line 21: expected 1 fields, saw 2\nSkipping line 22: expected 1 fields, saw 2\nSkipping line 23: expected 1 fields, saw 2\nSkipping line 24: expected 1 fields, saw 2\nSkipping line 25: expected 1 fields, saw 2\nSkipping line 26: expected 1 fields, saw 2\nSkipping line 27: expected 1 fields, saw 2\n
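Those "Skipping line N: expected 1 fields, saw 2" messages come from pandas' CSV parser dropping rows whose field count doesn't match the header it inferred. A minimal sketch of how that happens (assumes pandas >= 1.3 for `on_bad_lines`; older versions use `error_bad_lines=False`):

```python
import io

import pandas as pd

# The header row parses as a single column (no tab in it), so any later
# line that does contain a tab has 2 fields and is skipped with a warning
# like "Skipping line 3: expected 1 fields, saw 2".
garbled = "country score\nno tabs here\nAD\t207.0\n"
df = pd.read_csv(io.StringIO(garbled), sep="\t", on_bad_lines="warn")
print(df)  # one column named "country score", one surviving row
```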

Beginning of the log output from the last successful run (Feb 1):

Feb  1 12:00:02 stat1007 performance-asoranking[32054]: 2021-02-01 12:00:02,982 - ASORanking - DEBUG - Getting CPU medians per country per network type
Feb  1 12:00:02 stat1007 performance-asoranking[32054]: 2021-02-01 12:00:02,982 - ASORanking - DEBUG - Running SELECT nt.event.originCountry AS country, PERCENTILE(cb.event.score, 0.5) AS score
Feb  1 12:00:02 stat1007 performance-asoranking[32054]:             FROM event.CpuBenchmark AS cb JOIN event.NavigationTiming AS nt
Feb  1 12:00:02 stat1007 performance-asoranking[32054]:             ON cb.event.pageviewToken = nt.event.pageviewToken
Feb  1 12:00:02 stat1007 performance-asoranking[32054]:             WHERE cb.year = 2021 AND cb.month = 1 AND nt.year = 2021 AND nt.month = 1
Feb  1 12:00:02 stat1007 performance-asoranking[32054]:             AND nt.event.netinfoConnectionType  = 'cellular' AND nt.event.mobileMode = 'stable'
Feb  1 12:00:02 stat1007 performance-asoranking[32054]:             GROUP BY nt.event.originCountry;
Feb  1 12:00:06 stat1007 performance-asoranking[32054]: issuing: !connect jdbc:hive2://analytics-hive.eqiad.wmnet:10000/default;principal=hive/analytics-hive.eqiad.wmnet@WIKIMEDIA analytics-privatedata [passwd stripped]
Feb  1 12:01:51 stat1007 performance-asoranking[32054]: 2021-02-01 12:01:51,307 - ASORanking - DEBUG - Running SELECT nt.event.originCountry AS country, PERCENTILE(cb.event.score, 0.5) AS score
Feb  1 12:01:51 stat1007 performance-asoranking[32054]:             FROM event.CpuBenchmark AS cb JOIN event.NavigationTiming AS nt
Feb  1 12:01:51 stat1007 performance-asoranking[32054]:             ON cb.event.pageviewToken = nt.event.pageviewToken
Feb  1 12:01:51 stat1007 performance-asoranking[32054]:             WHERE cb.year = 2021 AND cb.month = 1 AND nt.year = 2021 AND nt.month = 1
Feb  1 12:01:51 stat1007 performance-asoranking[32054]:             AND nt.event.netinfoConnectionType  = 'wifi' AND nt.event.mobileMode IS NULL
Feb  1 12:01:51 stat1007 performance-asoranking[32054]:             GROUP BY nt.event.originCountry;
Feb  1 12:01:54 stat1007 performance-asoranking[32054]: issuing: !connect jdbc:hive2://analytics-hive.eqiad.wmnet:10000/default;principal=hive/analytics-hive.eqiad.wmnet@WIKIMEDIA analytics-privatedata [passwd stripped]
Feb  1 12:03:32 stat1007 performance-asoranking[32054]: 2021-02-01 12:03:32,822 - ASORanking - DEBUG - Fetching navtiming dataset
Feb  1 12:03:32 stat1007 performance-asoranking[32054]: 2021-02-01 12:03:32,822 - ASORanking - DEBUG - Running SELECT useragent.device_family, ip, event.responseStart - event.connectStart AS ttfb,
Feb  1 12:03:32 stat1007 performance-asoranking[32054]:             event.loadEventStart - event.responseStart AS plt, event.netinfoConnectionType AS type,
Feb  1 12:03:32 stat1007 performance-asoranking[32054]:             event.pageviewToken, event.transferSize, event.mobileMode FROM event.NavigationTiming
Feb  1 12:03:32 stat1007 performance-asoranking[32054]:             WHERE year = 2021 AND month = 1 AND event.originCountry = 'US'
Feb  1 12:03:32 stat1007 performance-asoranking[32054]:             SORT BY RAND() LIMIT 1000000;
Feb  1 12:03:35 stat1007 performance-asoranking[32054]: issuing: !connect jdbc:hive2://analytics-hive.eqiad.wmnet:10000/default;principal=hive/analytics-hive.eqiad.wmnet@WIKIMEDIA analytics-privatedata [passwd stripped]
Feb  1 12:06:22 stat1007 performance-asoranking[32054]: 2021-02-01 12:06:22,085 - ASORanking - DEBUG - Processing navtiming dataset

Event Timeline

It runs fine, without the pandas CSV parsing issues, when run outside of the systemd timer like this:

sudo -u analytics-privatedata /usr/local/bin/kerberos-run-command analytics-privatedata /usr/bin/python3 /srv/deployment/performance/asoranking/asoranking.py --debug --threshold 1000 --publish

This is the same command as in the timer definition:

gilles@stat1007:~$ systemctl cat performance-asoranking
# /lib/systemd/system/performance-asoranking.service
[Unit]
Description=ASO ranking report monthly run

[Service]
User=analytics-privatedata
SyslogIdentifier=performance-asoranking
Slice=user.slice
ExecStart=/usr/local/bin/kerberos-run-command analytics-privatedata /usr/bin/python3 /srv/deployment/performance/asoranking/asoranking.py --debug --threshold 1000 --publish

@elukey has anything changed in regards to the OS, systemd or timers since this last ran successfully inside the timer on Feb 1st?

@Gilles we have upgraded Hadoop to 2.10.1 and Hive to 2.3.6 during the last weeks, and we are reimaging some hadoop workers to Debian Buster, but no changes on the stat100x boxes IIRC. Are the pandas errors local?
@Gilles we have upgraded Hadoop to 2.10.1 and Hive to 2.3.6 over the last few weeks, and we are reimaging some Hadoop workers to Debian Buster, but no changes on the stat100x boxes IIRC. Are the pandas errors local?

Also, is the Hive output the one that you expect? Maybe something changed on the Hive side and the Python code needs to be adjusted, but that wouldn't explain why it works for you outside the systemd timer…
The errors showed up in the log and suggest that beeline was returning unexpected data when it ran through the timer. They made the script unable to parse the data and do its job.

Could you start performance-asoranking.service manually as a one-off? I don't seem to have the necessary privileges for that. Just to check whether this was a fluke when it ran on March 1st or if it's a problem with the script running through a systemd timer.

Mentioned in SAL (#wikimedia-analytics) [2021-03-02T08:01:28Z] <elukey> manual start of performance-asotranking on stat1007 (requested by Gilles) - T276121

@Gilles same issue, will try some more tests to see if I can narrow it down..

Tried to do the following:

  • sudo -u analytics-privatedata bash
  • /usr/local/bin/kerberos-run-command analytics-privatedata /usr/bin/python3 /srv/deployment/performance/asoranking/asoranking.py --debug --threshold 1000 --publish

As Gilles reported, no issue, the script runs fine. Beeline seems to behave differently, weird. @Gilles is there a way to check the output of one of the above queries when executed via systemd?

We can put a throwaway script into a new systemd service/timer that will serve as a repro case and output what beeline gives. You can use /home/gilles/T276121.py for that purpose. Just set up a systemd unit that runs it, and its logs should contain the beeline output when run the same way through Python (at INFO level).
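For reference, a minimal sketch of what such a repro script can look like (the actual T276121.py may differ; the `printf` command below is a stand-in for the real beeline invocation):

```python
import logging
import subprocess

logging.basicConfig(level=logging.INFO)

def run_and_log(cmd):
    """Run a command, capture its stdout, and log it at INFO level."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, _ = proc.communicate()
    text = stdout.decode("utf-8", errors="replace")
    logging.info("***Beeline stdout below***")
    logging.info(text)
    return text

# Stand-in command; the real script would run beeline with the Hive query.
out = run_and_log(["printf", r"country\tscore\nAD\t207.0\n"])
```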

gilles@stat1007:~$ sudo -u analytics-privatedata python3 /home/gilles/T276121.py 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
!connect jdbc:hive2://analytics-hive.eqiad.wmnet:10000/default;principal=hive/analytics-hive.eqiad.wmnet@WIKIMEDIA analytics-privatedata [passwd stripped]
[WARN] Failed to create directory: /home/analytics-privatedata/.beeline
java.io.IOException: No such file or directory
	at java.io.UnixFileSystem.createFileExclusively(Native Method)
	at java.io.File.createNewFile(File.java:1014)
	at jline.console.history.FileHistory.flush(FileHistory.java:82)
	at org.apache.hive.beeline.BeeLine$1.run(BeeLine.java:542)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
INFO:root:***Beeline stdout below***
INFO:root:




country	score
NULL	285.0
AD	207.0
AE	190.0

[...]



INFO:root:***Attempting to parse TSV with pandas***
INFO:root:    country  score
0       NaN  285.0
1        AD  207.0
2        AE  190.0

[...]

@Gilles I am using systemctl edit to add an override; going to paste the results in here once done.
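For reference, systemctl edit creates a drop-in override such as the following (the path and ExecStart here are illustrative, not the actual override used; the empty ExecStart= line clears the original command before replacing it):

```
# /etc/systemd/system/performance-asoranking.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/python3 /home/gilles/T276121.py
```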

The log file containing the script's output is at /var/log/performance-asoranking/asoranking.log

The output is very different, which explains the parsing failures:

Mar  2 11:08:00 stat1007 performance-asoranking[28664]: INFO:root:***Beeline stdout below***
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: INFO:root:
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: null
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: null
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: null
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: null
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: null
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: null
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: null
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: null
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: null
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: nullcountry#011score
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: NULL#011285.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AD#011207.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AE#011190.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AF#011294.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AG#011294.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AI#011334.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AL#011264.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AM#011266.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AO#011458.5
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AR#011328.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AS#011452.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AT#011230.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AU#011180.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AW#011474.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AX#011284.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: AZ#011253.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: BA#011263.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: BB#011328.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: BD#011292.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: BE#011212.0
Mar  2 11:08:00 stat1007 performance-asoranking[28664]: BF#011349.0

It seems like lines that were empty now explicitly output null, and characters that should be tabs get output as #011 (011 is \t in octal ASCII).

I presume this has to do with systemd transforming/encoding beeline's stdout somehow?
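The #011 sequences themselves are likely a rendering artifact of the syslog pipeline: rsyslog escapes control characters as #NNN, where NNN is the byte value in octal (011 octal = 9 = TAB). A quick helper for reading the escaped log lines (a sketch, assuming the default rsyslog-style escaping):

```python
import re

def unescape_syslog(line: str) -> str:
    """Turn rsyslog-style #NNN octal escapes back into raw characters.

    Note: a literal "#123" in the data would also be converted; this is
    only meant for eyeballing log excerpts.
    """
    return re.sub(r"#(\d{3})", lambda m: chr(int(m.group(1), 8)), line)

print(repr(unescape_syslog("AD#011207.0")))  # 'AD\t207.0'
```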

The stdout of beeline is captured by the script via Python's subprocess.Popen and .communicate().

A workaround might be to make the encoding/decoding explicit in the Python script. Currently it might inherit it from the shell or systemd. I'll try to modify /home/gilles/T276121.py to that effect.
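A sketch of what "explicit" can look like, with echo standing in for the beeline command: pass encoding= to Popen (Python 3.6+) so communicate() decodes with a known codec instead of the inherited locale, and optionally pin the child's locale via env.

```python
import os
import subprocess

proc = subprocess.Popen(
    ["echo", "country\tscore"],               # stand-in for the beeline call
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    encoding="utf-8",                         # decode stdout/stderr as UTF-8
    env={**os.environ, "LC_ALL": "C.UTF-8"},  # pin the child's locale
)
stdout, stderr = proc.communicate()
print(repr(stdout))  # 'country\tscore\n'
```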

@elukey please try running it again with the new version of /home/gilles/T276121.py which makes some encoding parameters explicit.

@Gilles done; with the override (and puppet disabled) the performance-asoranking unit is now executing your script, so feel free to run any test as well.

I can't find a way to start the unit myself. Is there a way I could temporarily be allowed to do that? Or maybe I tried the wrong commands. Being able to start the unit myself would let me iterate on the script until I find a fix (latest attempt failed).

Ah yes, sorry, you don't have sudo on the host, so it's difficult to run stuff, and it may need an SRE meeting to grant extra sudo perms for this use case. Ping me on IRC when you need to re-run; it's not convenient, I know, but hopefully in a few iterations we'll be good :)

More info: we applied the patch included in HIVE-19231 to /usr/lib/hive/bin and restarted the systemd unit (after removing the test override). Since this is a monthly job, we can take some time to figure out what is best to do.

The patch wasn't backported to the Hive 2.x branch for some reason; it was only merged into 3.x. Instead of rebuilding the Hive package we could apply a quick find/replace in puppet, which should be OK as well.

Mentioned in SAL (#wikimedia-analytics) [2021-03-25T08:14:32Z] <elukey> upgrade hive packages on stat100x to 2.6.3-2 - T276121

@Gilles I have updated the packages. If you want to double-check that all is good before the next run, that would be great, so we are sure that nothing will explode again :)

Sure, just kick off a manual run of the asoranking systemd timer. The issue was only happening when running inside systemd.

Mentioned in SAL (#wikimedia-analytics) [2021-03-25T13:53:25Z] <elukey> systemctl restart performance-asotranking on stat1007 for T276121

Looks good, it worked fine, thank you!