Description

The report most likely stopped working because Kerberos authentication is now a requirement to read data from the Analytics cluster.

Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Limit to 1 million random samples per country | performance/asoranking | master | +7 -1
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Gilles | T252424 Autonomous Systems report stopped working
Resolved | | Gilles | T253730 Unable to access Kerberos keytab
Event Timeline
Moved the crontab to the analytics-privatedata user (it was previously under mine), which has a keytab to access data behind Kerberos, and changed the permissions of the output directory and files accordingly.
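For illustration, here is a minimal sketch of how a script run from that crontab might authenticate from the keytab and then query Hive. The keytab path and principal are assumptions, not the actual production values; the JDBC URL is the one that appears in the log further down.

```python
# Minimal sketch (not the actual asoranking code): authenticate from a keytab,
# then run a Hive query through beeline. Keytab path and principal are assumed.
import subprocess

KEYTAB = '/etc/security/keytabs/analytics-privatedata/analytics-privatedata.keytab'  # assumed path
PRINCIPAL = 'analytics-privatedata@WIKIMEDIA'  # assumed principal
JDBC_URL = ('jdbc:hive2://an-coord1001.eqiad.wmnet:10000/default;'
            'principal=hive/_HOST@WIKIMEDIA')  # as seen in the log below


def kinit_with_keytab(keytab: str, principal: str) -> None:
    """Obtain a Kerberos ticket non-interactively so Hive access works from cron."""
    subprocess.run(['kinit', '-kt', keytab, principal], check=True)


def run_hive_query(sql: str) -> str:
    """Run a query via beeline and return its TSV output."""
    result = subprocess.run(
        ['beeline', '-u', JDBC_URL, '--outputformat=tsv2', '-e', sql],
        check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == '__main__':
    kinit_with_keytab(KEYTAB, PRINCIPAL)
    print(run_hive_query('SHOW DATABASES;'))
```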
When attempting to re-generate the data for April, though, the Python script ran out of memory when processing Spain :(
2020-05-28 07:29:41,424 - ASORanking - DEBUG - Running SELECT useragent.device_family, ip, event.responseStart - event.connectStart AS ttfb, event.loadEventStart - event.responseStart AS plt, event.netinfoConnectionType AS type, event.pageviewToken, event.transferSize, event.mobileMode FROM event.NavigationTiming WHERE year = 2020 AND month = 4 AND event.originCountry = 'ES';
issuing: !connect jdbc:hive2://an-coord1001.eqiad.wmnet:10000/default;principal=hive/_HOST@WIKIMEDIA analytics-privatedata [passwd stripped]
Traceback (most recent call last):
  File "/srv/deployment/performance/asoranking/asoranking.py", line 353, in <module>
    aso.run()
  File "/srv/deployment/performance/asoranking/asoranking.py", line 35, in run
    self.generate_report()
  File "/srv/deployment/performance/asoranking/asoranking.py", line 307, in generate_report
    navtiming_dataset = self.fetch_navigationtiming_data(country, year, month)
  File "/srv/deployment/performance/asoranking/asoranking.py", line 115, in fetch_navigationtiming_data
    navtiming_dataset = self.fetch_sql(sql)
  File "/srv/deployment/performance/asoranking/asoranking.py", line 84, in fetch_sql
    return pandas.read_csv(tsv_path, error_bad_lines=False, low_memory=False, sep='\t')
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 401, in _read
    data = parser.read()
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 939, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1508, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 851, in pandas.parser.TextReader.read (pandas/parser.c:10438)
  File "pandas/parser.pyx", line 939, in pandas.parser.TextReader._read_rows (pandas/parser.c:11607)
  File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:27037)
pandas.io.common.CParserError: Error tokenizing data. C error: out of memory
Segmentation fault
The reason Spain blows up is that we still oversample in Spain and Russia by a factor of 100x, which means the script pulls 16 million rows for Spain in April. Serialising the processing is not trivial at all.
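For context, "serialising the processing" would mean something like the sketch below: streaming the TSV through pandas in chunks and aggregating incrementally instead of loading all 16 million rows at once. This is only an illustration of the rejected alternative; the column names follow the query in the log above, and the aggregation is a placeholder.

```python
# Illustration only: chunked processing with pandas to bound peak memory.
# Column names ('device_family', 'ttfb') follow the query in the log above;
# the aggregation here is a placeholder, not what the report actually computes.
import pandas


def summarise_in_chunks(tsv_path: str, chunk_rows: int = 500_000) -> pandas.DataFrame:
    partials = []
    for chunk in pandas.read_csv(tsv_path, sep='\t', chunksize=chunk_rows):
        # Reduce each chunk to a small per-device summary before moving on.
        partials.append(chunk.groupby('device_family')['ttfb'].agg(['sum', 'count']))
    combined = pandas.concat(partials).groupby(level=0).sum()
    combined['mean_ttfb'] = combined['sum'] / combined['count']
    return combined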
I think that, given what we're trying to do here, it might be overkill to use all the available data to generate the ranking. One million random samples should be enough; we don't need 16 million samples to rank a few dozen ISPs, for example.
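A hedged sketch of what a per-country cap could look like when building the Hive query; the actual change is the Gerrit patch referenced below, and ORDER BY rand() is just one possible way to take a random sample.

```python
# Hypothetical query builder showing a 1 million row random-sample cap per country.
# The real change is in Gerrit change 604006; ORDER BY rand() is one possible
# sampling approach, used here purely for illustration.
SAMPLE_LIMIT = 1_000_000


def build_navtiming_query(country: str, year: int, month: int,
                          limit: int = SAMPLE_LIMIT) -> str:
    return (
        "SELECT useragent.device_family, ip, "
        "event.responseStart - event.connectStart AS ttfb, "
        "event.loadEventStart - event.responseStart AS plt, "
        "event.netinfoConnectionType AS type, event.pageviewToken, "
        "event.transferSize, event.mobileMode "
        "FROM event.NavigationTiming "
        f"WHERE year = {year} AND month = {month} "
        f"AND event.originCountry = '{country}' "
        f"ORDER BY rand() LIMIT {limit}"
    )
```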
Change 604006 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/asoranking@master] Limit to 1 million random samples per country
Change 604006 merged by jenkins-bot:
[performance/asoranking@master] Limit to 1 million random samples per country
Mentioned in SAL (#wikimedia-operations) [2020-06-11T19:19:25Z] <gilles@deploy1001> Started deploy [performance/asoranking@0a096c4]: T252424
Mentioned in SAL (#wikimedia-operations) [2020-06-11T19:20:13Z] <gilles@deploy1001> Finished deploy [performance/asoranking@0a096c4]: T252424 (duration: 00m 47s)
Generated the April data successfully, so I think the limit worked. I'll generate the May data now, and the cron job should hopefully just work in early July for the June data.