
Autonomous Systems report stopped working
Closed, Resolved, Public

Description

The report most likely stopped working because Kerberos authentication is now required to read data from the Analytics cluster.

Event Timeline

Gilles triaged this task as Medium priority. May 13 2020, 8:29 AM

Moved the crontab to the analytics-privatedata user (it was previously under mine), which has a keytab to access data behind Kerberos, and changed permissions of the output directory and files accordingly.
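For reference, running such a query under a keytab-equipped service user generally looks like the sketch below. This is only an illustration: the keytab path and principal are hypothetical, not taken from the actual asoranking deployment; only the JDBC URL is copied from the log further down.

# Hypothetical sketch: obtain a Kerberos ticket from a keytab, then run a
# Hive query through beeline, as a cron job under a service user would do
# once Kerberos is enforced. Keytab path and principal are placeholders.
import subprocess

KEYTAB = "/etc/security/keytabs/analytics-privatedata.keytab"  # placeholder path
PRINCIPAL = "analytics-privatedata@WIKIMEDIA"                   # placeholder principal
JDBC_URL = "jdbc:hive2://an-coord1001.eqiad.wmnet:10000/default;principal=hive/_HOST@WIKIMEDIA"

def kinit_with_keytab(keytab, principal):
    """Authenticate non-interactively using a keytab."""
    subprocess.run(["kinit", "-kt", keytab, principal], check=True)

def run_hive_query(sql, output_path):
    """Run a HiveQL query via beeline and write TSV output to a file."""
    with open(output_path, "w") as out:
        subprocess.run(
            ["beeline", "-u", JDBC_URL, "--outputformat=tsv2", "-e", sql],
            stdout=out,
            check=True,
        )

if __name__ == "__main__":
    kinit_with_keytab(KEYTAB, PRINCIPAL)
    run_hive_query("SELECT 1;", "/tmp/sanity_check.tsv")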

When attempting to re-generate the data for April, though, the Python script ran out of memory when processing Spain :(

2020-05-28 07:29:41,424 - ASORanking - DEBUG - Running SELECT useragent.device_family, ip, event.responseStart - event.connectStart AS ttfb,
            event.loadEventStart - event.responseStart AS plt, event.netinfoConnectionType AS type,
            event.pageviewToken, event.transferSize, event.mobileMode FROM event.NavigationTiming
            WHERE year = 2020 AND month = 4 AND event.originCountry = 'ES';
issuing: !connect jdbc:hive2://an-coord1001.eqiad.wmnet:10000/default;principal=hive/_HOST@WIKIMEDIA analytics-privatedata [passwd stripped] 
Traceback (most recent call last):
  File "/srv/deployment/performance/asoranking/asoranking.py", line 353, in <module>
    aso.run()
  File "/srv/deployment/performance/asoranking/asoranking.py", line 35, in run
    self.generate_report()
  File "/srv/deployment/performance/asoranking/asoranking.py", line 307, in generate_report
    navtiming_dataset = self.fetch_navigationtiming_data(country, year, month)
  File "/srv/deployment/performance/asoranking/asoranking.py", line 115, in fetch_navigationtiming_data
    navtiming_dataset = self.fetch_sql(sql)
  File "/srv/deployment/performance/asoranking/asoranking.py", line 84, in fetch_sql
    return pandas.read_csv(tsv_path, error_bad_lines=False, low_memory=False, sep='\t')
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 401, in _read
    data = parser.read()
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 939, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1508, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 851, in pandas.parser.TextReader.read (pandas/parser.c:10438)
  File "pandas/parser.pyx", line 939, in pandas.parser.TextReader._read_rows (pandas/parser.c:11607)
  File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:27037)
pandas.io.common.CParserError: Error tokenizing data. C error: out of memory
Segmentation fault

The reason Spain blows up is that we still oversample in Spain and Russia by a factor of 100x, which means the script is pulling 16 million rows for Spain in April. Serialising the processing is not trivial at all.
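To illustrate what "serialising the processing" would involve, here is a rough sketch of streaming the Hive TSV output through pandas in chunks rather than loading it whole. The column names and the aggregation (mean TTFB per device family) are assumptions for illustration only, not the actual report logic, which would have to be rewritten to work incrementally.

# Illustrative sketch only: read the TSV in chunks instead of all
# ~16 million rows at once, keeping just a small per-chunk reduction
# (sum and count) that can be combined exactly across chunks.
import pandas

def fetch_sql_chunked(tsv_path, chunksize=500_000):
    totals = None
    for chunk in pandas.read_csv(tsv_path, sep='\t', chunksize=chunksize):
        partial = chunk.groupby('device_family')['ttfb'].agg(['sum', 'count'])
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    # Mean TTFB per device family, computed without holding the full dataset.
    return totals['sum'] / totals['count']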

I think that, given what we're trying to do here, it might be overkill to use all of the available data to generate the ranking. One million random samples should be enough; we don't need 16 million samples to rank a few dozen ISPs.
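One way to express such a cap in the HiveQL the script already builds would be an ORDER BY rand() with a LIMIT, as in the sketch below. The column list is taken from the logged query above; the sampling approach itself is only a guess at how the patch below might do it, not a description of the merged change.

# Hypothetical sketch of capping the query at 1 million uniformly random
# rows per country. ORDER BY rand() forces a full sort through a single
# reducer, which is slow but simple; the actual patch may sample differently.
SAMPLE_LIMIT = 1_000_000

def build_navtiming_sql(country, year, month, limit=SAMPLE_LIMIT):
    return f"""
        SELECT useragent.device_family, ip,
               event.responseStart - event.connectStart AS ttfb,
               event.loadEventStart - event.responseStart AS plt,
               event.netinfoConnectionType AS type,
               event.pageviewToken, event.transferSize, event.mobileMode
        FROM event.NavigationTiming
        WHERE year = {year} AND month = {month}
          AND event.originCountry = '{country}'
        ORDER BY rand()
        LIMIT {limit};
    """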

Change 604006 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/asoranking@master] Limit to 1 million random samples per country

https://gerrit.wikimedia.org/r/604006

Change 604006 merged by jenkins-bot:
[performance/asoranking@master] Limit to 1 million random samples per country

https://gerrit.wikimedia.org/r/604006

Mentioned in SAL (#wikimedia-operations) [2020-06-11T19:19:25Z] <gilles@deploy1001> Started deploy [performance/asoranking@0a096c4]: T252424

Mentioned in SAL (#wikimedia-operations) [2020-06-11T19:20:13Z] <gilles@deploy1001> Finished deploy [performance/asoranking@0a096c4]: T252424 (duration: 00m 47s)

Generated the April data successfully, so I think the limit worked. I'll generate the May data now, and hopefully the cron job will just work in early July for the June data.