
Move the Analytics Refinery to Python 3
Closed, Resolved · Public · 8 Estimated Story Points

Description

Port all the Analytics Refinery code to Python 3.

The following bin scripts need to be reviewed/tested and moved to #!/usr/bin/env python3 (a quick audit sketch follows the list):

camus:#!/usr/bin/env python
download-project-namespace-map:#!/usr/bin/env python
generate-domain-abbrev-map:#!/usr/bin/env python
is-yarn-app-running:#!/usr/bin/env python
refinery-drop-banner-activity-partitions:#!/usr/bin/env python --> seems unused in puppet
refinery-drop-druid-deep-storage-data:#!/usr/bin/env python
refinery-drop-eventlogging-partitions:#!/usr/bin/env python --> seems unused in puppet
refinery-drop-hive-partitions:#!/usr/bin/env python
refinery-drop-hourly-partitions:#!/usr/bin/env python  --> seems unused in puppet
refinery-drop-mediawiki-snapshots:#!/usr/bin/env python
refinery-drop-older-than:#!/usr/bin/env python
refinery-drop-webrequest-partitions:#!/usr/bin/env python
refinery-get-add-webrequest-partition-statements:#!/usr/bin/env python --> seems unused in puppet
refinery-oozie-rerun:#!/usr/bin/env python
sequence-file:#!/usr/bin/env python
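
A quick way to audit the list above (a hypothetical helper, not part of the refinery; it just prints each script's shebang in the same script:#! format used here):

#!/usr/bin/env python3
# Hypothetical audit helper: print the shebang line of every file under
# bin/ so remaining python2 entries are easy to spot.
import os
import sys

bin_dir = sys.argv[1] if len(sys.argv) > 1 else 'bin'

for name in sorted(os.listdir(bin_dir)):
    path = os.path.join(bin_dir, name)
    if not os.path.isfile(path):
        continue
    with open(path, 'rb') as f:
        first_line = f.readline().decode('utf-8', errors='replace').rstrip()
    if first_line.startswith('#!'):
        print('{}:{}'.format(name, first_line))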

Event Timeline

elukey triaged this task as Medium priority. Sep 18 2018, 4:17 PM
elukey created this task.
Nuria added a subscriber: fdans.
mforns lowered the priority of this task from Medium to Low. Apr 22 2019, 4:00 PM

Mentioned in SAL (#wikimedia-analytics) [2019-09-16T10:04:01Z] <elukey> disable puppet on an-coord1001 and manually forcing python3 for camus - T204735

Change 537055 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] camus: switch to python3

https://gerrit.wikimedia.org/r/537055

Change 537056 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] is-yarn-app-running: move to python3

https://gerrit.wikimedia.org/r/537056

elukey raised the priority of this task from Low to Medium. Sep 16 2019, 10:15 AM
elukey updated the task description.

Change 537057 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] refinery-oozie-rerun: move to python3

https://gerrit.wikimedia.org/r/537057

@mforns can you check the scripts flagged "seems unused in puppet" in the task's description? IIUC those got replaced by refinery-drop-older-than, so we could possibly delete some of them. Also, can we run refinery-drop-older-than with python3? :)

Change 537055 merged by Elukey:
[analytics/refinery@master] camus: switch to python3

https://gerrit.wikimedia.org/r/537055

Change 537056 merged by Elukey:
[analytics/refinery@master] is-yarn-app-running: move to python3

https://gerrit.wikimedia.org/r/537056

Change 537057 merged by Elukey:
[analytics/refinery@master] refinery-oozie-rerun: move to python3

https://gerrit.wikimedia.org/r/537057

@elukey

The following scripts should be removable after migration to drop-older-than:

refinery-drop-banner-activity-partitions:#!/usr/bin/env python --> seems unused in puppet
refinery-drop-eventlogging-partitions:#!/usr/bin/env python --> seems unused in puppet
refinery-drop-hourly-partitions:#!/usr/bin/env python  --> seems unused in puppet

And these two should too, I think; I wonder why they are still in puppet. Did I forget to migrate them?

refinery-drop-hive-partitions:#!/usr/bin/env python
refinery-drop-webrequest-partitions:#!/usr/bin/env python

@mforns checked puppet with a simple grep, and:

  • refinery-drop-hive-partitions
kerberos::systemd_timer { 'refinery-drop-query-clicks':
    ensure                    => absent,
    description               => 'Drop cirrus click logs from Hive/HDFS following data retention policies.',
    command                   => "${refinery_path}/bin/refinery-drop-hive-partitions -d ${query_clicks_retention_days} -D discovery -t query_clicks_hourly,query_clicks_daily -f ${query_clicks_log_file}",
    monitoring_contact_groups => 'team-discovery',
    environment               => $systemd_env,
    interval                  => '*-*-* 03:30:00',
    user                      => 'hdfs',
    use_kerberos              => $use_kerberos,
}

kerberos::systemd_timer { 'search-drop-query-clicks':
    description               => 'Drop cirrus click logs from Hive/HDFS following data retention policies.',
    command                   => "${refinery_path}/bin/refinery-drop-hive-partitions -d ${query_clicks_retention_days} -D discovery -t query_clicks_hourly,query_clicks_daily -f ${query_clicks_log_file}",
    monitoring_contact_groups => 'team-discovery',
    environment               => $systemd_env,
    interval                  => '*-*-* 03:30:00',
    user                      => 'analytics-search',
    use_kerberos              => $use_kerberos,
}

The former is already absented, so we can just drop it, while the latter is not ours, so we'd need to migrate it :)

  • refinery-drop-webrequest-partitions - it is only in profile::analytics::refinery::job::test::data_purge, which is used by the Hadoop testing cluster, so we can migrate it anytime!

@elukey OK thanks!
Will add those as a TODO for me.

Change 537255 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] Force execution of all the (python) scripts under bin/ with python3

https://gerrit.wikimedia.org/r/537255

elukey added a project: Analytics-Kanban.
elukey moved this task from Next Up to In Code Review on the Analytics-Kanban board.

Change 537255 merged by Elukey:
[analytics/refinery@master] Force execution of all the (python) scripts under bin/ with python3

https://gerrit.wikimedia.org/r/537255

So it seems that the refinery-drop-older-than script has some issues with python3:

  1. python3-mock was not installed, but no errors were shown due to sys.stderr = open(os.devnull, 'w')
  2. after installing it, I got this:
Sep 18 16:35:58 an-coord1001 refinery-drop-older-than[167218]: TypeError: 'tuple' object cannot be interpreted as an integer
Sep 18 16:35:58 an-coord1001 refinery-drop-older-than[167218]: Traceback (most recent call last):
Sep 18 16:35:58 an-coord1001 refinery-drop-older-than[167218]:   File "/srv/deployment/analytics/refinery/bin/refinery-drop-older-than", line 450, in test_security_checksum_not_altered_by_logging_arguments
Sep 18 16:35:58 an-coord1001 refinery-drop-older-than[167218]:     get_security_checksum({'argument': 'value'}),
Sep 18 16:35:58 an-coord1001 refinery-drop-older-than[167218]:   File "/srv/deployment/analytics/refinery/bin/refinery-drop-older-than", line 267, in get_security_checksum
Sep 18 16:35:58 an-coord1001 refinery-drop-older-than[167218]:     hash_message = bytes(sorted(hash_args.items()))
Sep 18 16:35:58 an-coord1001 refinery-drop-older-than[167218]: TypeError: 'tuple' object cannot be interpreted as an integer
Sep 18 16:35:58 an-coord1001 systemd[1]: refinery-drop-apiaction-partitions.service: Main process exited, code=exited, status=1/FAILURE
Sep 18 16:35:58 an-coord1001 systemd[1]: refinery-drop-apiaction-partitions.service: Unit entered failed state.
Sep 18 16:35:58 an-coord1001 systemd[1]: refinery-drop-apiaction-partitions.service: Failed with result 'exit-code'.

Python3

>>> hash_args = {'a': 'a', 'b': 'b'}
>>> hash_message = bytes(sorted(hash_args.items()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object cannot be interpreted as an integer

>>> hash_args.items()
dict_items([('a', 'a'), ('b', 'b')])

>>> sorted(hash_args.items())
[('a', 'a'), ('b', 'b')]

>>> bytes(sorted(hash_args.items()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object cannot be interpreted as an integer

Python2

Python 2.7.13 (default, Sep 26 2018, 18:42:22)
[GCC 6.3.0 20170516] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> hash_args = {'a': 'a', 'b': 'b'}
>>> hash_message = bytes(sorted(hash_args.items()))
>>> hash_args.items()
[('a', 'a'), ('b', 'b')]
>>> sorted(hash_args.items())
[('a', 'a'), ('b', 'b')]
>>> bytes(sorted(hash_args.items()))
"[('a', 'a'), ('b', 'b')]"

help(bytes) in python2 leads to help(str), while on python3:

class bytes(object)
 |  bytes(iterable_of_ints) -> bytes
 |  bytes(string, encoding[, errors]) -> bytes
 |  bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
 |  bytes(int) -> bytes object of size given by the parameter initialized with null bytes
 |  bytes() -> empty bytes object
 |
 |  Construct an immutable array of bytes from:
 |    - an iterable yielding integers in range(256)
 |    - a text string encoded using the specified encoding
 |    - any object implementing the buffer API.
 |    - an integer
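
So python3's bytes() no longer stringifies arbitrary objects: it wants integers, a buffer, or a string plus an encoding. A python3-compatible rewrite that keeps the same byte content as python2 (a sketch of the idea, not the verbatim patch) would be:

hash_args = {'a': 'a', 'b': 'b'}
# Encode the python2-style repr explicitly: str() produces the same
# text that python2's bytes()/str() produced, so any checksum computed
# over these bytes stays unchanged.
hash_message = str(sorted(hash_args.items())).encode('utf-8')
# b"[('a', 'a'), ('b', 'b')]"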

Thanks Luca for the pastes.
I changed the syntax a bit to adapt it to python3,
and tested that everything is ok :].
Luckily, the checksums match the ones generated by python2,
so we won't need to change all the existing checksums.
Here's the change:

Change 537705 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Fix python3 incompatibility in refinery-drop-older-than

https://gerrit.wikimedia.org/r/537705

Change 537705 merged by Ottomata:
[analytics/refinery@master] Fix python3 incompatibility in refinery-drop-older-than

https://gerrit.wikimedia.org/r/537705

Nice! Thanks a lot @mforns. I think that we might be able to quickly deploy refinery today (that wouldn't even require a deploy to hdfs) and close this task.

@mforns another interesting use case:

Sep 19 10:23:17 stat1007 systemd[1]: Started Drop cirrus click logs from Hive/HDFS following data retention policies..
Sep 19 10:23:24 stat1007 refinery-drop-hive-partitions[70034]: {b'popularity_score': {}, b'feature_vectors': {}, b'labeled_query_page': {}, b'training_files': {}, b'query_clicks_hourly': {}, b'query_clustering': {}, b'go_clicks': {}, b'search_satisfac
Sep 19 10:23:24 stat1007 refinery-drop-hive-partitions[70034]: Traceback (most recent call last):
Sep 19 10:23:24 stat1007 refinery-drop-hive-partitions[70034]:   File "/srv/deployment/analytics/refinery/bin/refinery-drop-hive-partitions", line 98, in <module>
Sep 19 10:23:24 stat1007 refinery-drop-hive-partitions[70034]:     table_location = hive.table_location(table)
Sep 19 10:23:24 stat1007 refinery-drop-hive-partitions[70034]:   File "/srv/deployment/analytics/refinery/python/refinery/hive.py", line 143, in table_location
Sep 19 10:23:24 stat1007 refinery-drop-hive-partitions[70034]:     table_location = self.table_metadata(table)['Location']
Sep 19 10:23:24 stat1007 refinery-drop-hive-partitions[70034]:   File "/srv/deployment/analytics/refinery/python/refinery/hive.py", line 125, in table_metadata
Sep 19 10:23:24 stat1007 refinery-drop-hive-partitions[70034]:     if 'metadata' not in self.tables[table].keys():
Sep 19 10:23:24 stat1007 refinery-drop-hive-partitions[70034]: KeyError: 'query_clicks_hourly'

This is search-drop-query-clicks, which is still running with refinery-drop-hive-partitions, so possibly a candidate for your new script? If so, follow-up question: is python/refinery/hive.py still used? If so, we'll probably need to make some changes in there.

The issue above seems to be that self.tables contains bytes strings as keys, like {b'example': {...}, ...}, so the text string 'query_clicks_hourly' doesn't match. It might be a trivial fix, but I'd love not to add any side effect like dropping all data due to a missed if, so better to check with you first :D
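
A minimal reproduction of that mismatch:

# Under python3, subprocess output arrives as bytes, so the table map
# ends up keyed by bytes; a text-string lookup then fails.
tables = {b'query_clicks_hourly': {}}
print('query_clicks_hourly' in tables)  # False: str never matches bytes
tables['query_clicks_hourly']           # KeyError: 'query_clicks_hourly'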

Change 538235 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] util.py: return utf-8 text string for stdout/stderr

https://gerrit.wikimedia.org/r/538235

After checking the code, I think the problem is in the util.sh() function, which with python2 returns a text string while with python3 it returns a bytes string. I tested the behavior of https://gerrit.wikimedia.org/r/538235 with python3 and it works as expected.
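
The gist of the fix (a sketch, assuming util.sh() wraps subprocess.Popen; the real function also handles return codes and logging):

import subprocess

def sh(command):
    # python3's Popen.communicate() returns bytes by default, while
    # python2 returned str; decoding restores the old text behaviour.
    stdout, stderr = subprocess.Popen(
        command, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    ).communicate()
    return stdout.decode('utf-8')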

Change 538235 merged by Elukey:
[analytics/refinery@master] util.py: return text strings for stdout/stderr

https://gerrit.wikimedia.org/r/538235

@elukey
Reviewed https://gerrit.wikimedia.org/r/538235 and the fix makes sense to me!
There is a call to os.system returning a value, but I checked the python docs and it seems that both python2 and python3 return exactly the same thing for os.system.
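
For the record, on Unix both versions return the wait()-encoded exit status:

import os

os.system('true')   # 0 in both python2 and python3
os.system('false')  # 256 (exit code 1, shifted into the high byte)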

Change 538569 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] hdfs.py: use startsWith with a string when checking sh()'s output

https://gerrit.wikimedia.org/r/538569

Sep 23 08:33:35 an-coord1001 refinery-drop-older-than[40168]:   File "/srv/deployment/analytics/refinery/bin/refinery-drop-older-than", line 595, in <module>
Sep 23 08:33:35 an-coord1001 refinery-drop-older-than[40168]:     main(args)
Sep 23 08:33:35 an-coord1001 refinery-drop-older-than[40168]:   File "/srv/deployment/analytics/refinery/bin/refinery-drop-older-than", line 347, in main
Sep 23 08:33:35 an-coord1001 refinery-drop-older-than[40168]:     execute is not None)
Sep 23 08:33:35 an-coord1001 refinery-drop-older-than[40168]:   File "/srv/deployment/analytics/refinery/bin/refinery-drop-older-than", line 195, in remove_directories
Sep 23 08:33:35 an-coord1001 refinery-drop-older-than[40168]:     candidate_paths = hdfs.ls(glob, include_children=False)
Sep 23 08:33:35 an-coord1001 refinery-drop-older-than[40168]:   File "/srv/deployment/analytics/refinery/python/refinery/hdfs.py", line 59, in ls
Sep 23 08:33:35 an-coord1001 refinery-drop-older-than[40168]:     check_return_code=False
Sep 23 08:33:35 an-coord1001 refinery-drop-older-than[40168]:   File "/srv/deployment/analytics/refinery/python/refinery/hdfs.py", line 60, in <listcomp>
Sep 23 08:33:35 an-coord1001 refinery-drop-older-than[40168]:     ).splitlines() if not line.startswith(b'Found ')
Sep 23 08:33:35 an-coord1001 refinery-drop-older-than[40168]: TypeError: startswith first arg must be str or a tuple of str, not bytes

This seems easy to fix, since sh() now returns a text string and startswith needs to be given a text string too, not bytes (b'Found '). I applied the change on an-coord1001 and it worked fine. As far as I can see, the change should not alter the behavior of refinery-drop-older-than, since the function should_remove_directory filters the candidates right after it:

candidate_paths = hdfs.ls(glob, include_children=False)
for path in candidate_paths:
    if should_remove_directory(path, full_path_format, threshold):
        directories_to_remove.append(path)
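
A sketch of the corresponding fix in hdfs.ls itself (hdfs_ls_command stands in for the real command construction; the actual patch is the change linked above):

# Sketch (not the verbatim patch): with sh() now returning text,
# the prefix must be a str, not the bytes literal b'Found '.
candidate_lines = [
    line for line in sh(hdfs_ls_command).splitlines()
    if not line.startswith('Found ')
]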

@mforns: as a precautionary measure I have disabled all the drop scripts on an-coord1001; can you verify what I wrote above before re-enabling? (Better safe than sorry)

Change 538569 merged by Elukey:
[analytics/refinery@master] hdfs.py, import-mediawiki-dumps: use sh()'s output as text not binary

https://gerrit.wikimedia.org/r/538569

Change 538675 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] hive.py: open the temporary file in text mode

https://gerrit.wikimedia.org/r/538675

Change 538675 merged by Elukey:
[analytics/refinery@master] hive.py: open the temporary file in text mode

https://gerrit.wikimedia.org/r/538675
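
For context, the kind of change this implies (a sketch; the query text is illustrative): tempfile.NamedTemporaryFile defaults to binary mode 'w+b', so str writes fail under python3 unless the file is opened in text mode:

import tempfile

# Under the default 'w+b' mode, f.write('...') raises
# "TypeError: a bytes-like object is required, not 'str'" on python3;
# mode='w+t' keeps the python2-era str-based code working.
with tempfile.NamedTemporaryFile(mode='w+t') as f:
    f.write('SHOW PARTITIONS example;\n')
    f.flush()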

elukey set the point value for this task to 8. Sep 24 2019, 5:42 AM
elukey moved this task from In Code Review to Ready to Deploy on the Analytics-Kanban board.

Change 539094 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::search::jobs.pp: Move last deletion timers to drop-older-than

https://gerrit.wikimedia.org/r/539094

Change 539146 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Improve refinery-drop-older-than after python3 migration

https://gerrit.wikimedia.org/r/539146

Change 539146 merged by Mforns:
[analytics/refinery@master] Improve refinery-drop-older-than after python3 migration

https://gerrit.wikimedia.org/r/539146

Another weird use case happened: after running apt-get autoremove on an-worker1080, python2 dependencies got cleaned up, but then druid_loader.py caused some failures when executing on that node:

  File "./druid_loader.py", line 19, in <module>
    import requests
ImportError: No module named requests

The script has python3 listed in the #!, so I have no idea why this happens.

The explanation should be the following: the oozie shell action uses the druid_loader.py stored in HDFS, not the one on the host where oozie runs (of course). So, since we haven't restarted any coordinator after my change, it is still using the old python shebang (#!) at the top of the file.

Next step - restart all coordinators using druid_loader.py, and then execute apt-get autoremove on the worker nodes to clean up old python2 deps (and verify that everything works).

All the druid-related coordinators have been restarted, and I removed python-requests from all the hadoop workers. If everything goes according to plan, no issue should arise anymore.

It seems to work: I just ran apt-get autoremove on all hadoop workers and no error surfaced.

Change 540581 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] download-project-namespace-map: handle binary encoding for python3

https://gerrit.wikimedia.org/r/540581

Change 540581 merged by Fdans:
[analytics/refinery@master] download-project-namespace-map: handle binary encoding for python3

https://gerrit.wikimedia.org/r/540581

Change 539094 merged by Ottomata:
[operations/puppet@production] analytics::search::jobs.pp: Move last deletion timers to drop-older-than

https://gerrit.wikimedia.org/r/539094