Page MenuHomePhabricator

Inform Data-Engineering about removal of cuc_ip, cule_ip, and cupe_ip
Closed, ResolvedPublic

Description

Summary

cuc_ip, cule_ip, and cupe_ip columns are going to be dropped soon and so Data-Engineering owned code needs updating.

Background

Acceptance criteria

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Milimetric subscribed.

To Do:

  • Update sqoop code
  • Update hql scripts mentioned above (look for more)

Thanks for looking into this. Let me know if there is anything I can assist with.

OK so I've done some investigation work and here's what I think needs to be done:

  1. Modify python/refinery/sqoop.py to import cuc_ip_hex instead of cuc_ip into table mediawiki_private_cu_changes
  2. Implement a Hive UDF function formatHex that converts IPv4 and IPv6 addresses from hexadecimal representation into a regular one in base 10, just like https://gerrit.wikimedia.org/g/mediawiki/libs/IPUtils/+/33e6259971c914b826fc80bb21e68f53e0b7ccd6/src/IPUtils.php#396
  3. Modify hql/geoeditors/editors_daily_monthly.hql to use cuc_ip_hex instead of cuc_ip, with the help of the new Hive UDF formatHex function.

What do you think, @Ottomata @Milimetric ?

I don't know the sqooped stuff well, but that sounds about right to me!

  1. Modify hql/geoeditors/editors_daily_monthly.hql to use cuc_ip_hex instead of cuc_ip, with the help of the new Hive UDF formatHex function.

I think your solution is better, since it keeps the mediawiki_private_cu_changes the same as upstream. But, another option might be to use the UDF to apply a transform when importing (does sqoop support that?) that save the decoded IP in the old cuc_ip field? Then editors_daily_monthly.hql would be backwards compatible?

Just putting that there in case you and Dan like it better, but I think what you have is probably best!

Change #1175156 had a related patch set uploaded (by Aleksandar Mastilovic; author: Aleksandar Mastilovic):

[analytics/refinery/source@master] Implemend a GetHexDecodedIpUDF that decodes hex-encoded IPs

https://gerrit.wikimedia.org/r/1175156

Change #1175159 had a related patch set uploaded (by Aleksandar Mastilovic; author: Aleksandar Mastilovic):

[analytics/refinery@master] Stop using cuc_ip field in cu_changes, switch to cuc_ip_hex

https://gerrit.wikimedia.org/r/1175159

Change #1175156 merged by jenkins-bot:

[analytics/refinery/source@master] Implement a GetHexDecodedIpUDF that decodes hex-encoded IPs

https://gerrit.wikimedia.org/r/1175156

Change #1175556 had a related patch set uploaded (by Aleksandar Mastilovic; author: Aleksandar Mastilovic):

[analytics/refinery/source@master] Update changelog.md to reflect the latest version 0.2.68

https://gerrit.wikimedia.org/r/1175556

Change #1175556 merged by jenkins-bot:

[analytics/refinery/source@master] Update changelog.md to reflect the latest version 0.2.68

https://gerrit.wikimedia.org/r/1175556

Change #1175159 merged by Aleksandar Mastilovic:

[analytics/refinery@master] Stop using cuc_ip field in cu_changes, switch to cuc_ip_hex

https://gerrit.wikimedia.org/r/1175159

Change #1175918 had a related patch set uploaded (by Aleksandar Mastilovic; author: Aleksandar Mastilovic):

[analytics/refinery@master] Bring back the deprecated cuc_ip field in cu_changes table

https://gerrit.wikimedia.org/r/1175918

Change #1175918 merged by Aleksandar Mastilovic:

[analytics/refinery@master] Bring back the deprecated cuc_ip field in cu_changes table

https://gerrit.wikimedia.org/r/1175918

amastilovic updated the task description. (Show Details)