[M] Reduce image_suggestion HDFS files footprint
Closed, Resolved (Public)

Description

The hive tables analytics_platform_eng.image_suggestions_instanceof_cache, analytics_platform_eng.image_suggestions_title_cache and analytics_platform_eng.image_suggestions_suggestions are using many small files on HDFS, which is an anti-pattern.

HDFS path                                                                           # Files   Data size (not replicated)
/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_instanceof_cache   2624750   5421414826
/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_title_cache        2525100   7129736283
/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_suggestions        2521066   160342855877
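
To make the anti-pattern concrete, here is a minimal sketch that computes the average file size per table. It assumes each figure in the table is a file count followed by a size in bytes (for example 2624750 files and 5421414826 bytes for the first table); the byte unit is my assumption and is not stated in the task.

```python
# Average file size per table, from the figures in the table above.
# Assumption: sizes are in bytes; the task does not state the unit.
OLD_STATS = {
    "image_suggestions_instanceof_cache": (2_624_750, 5_421_414_826),
    "image_suggestions_title_cache": (2_525_100, 7_129_736_283),
    "image_suggestions_suggestions": (2_521_066, 160_342_855_877),
}


def average_file_size(num_files: int, total_bytes: int) -> float:
    """Return the mean size of one file, in bytes."""
    return total_bytes / num_files


for table, (num_files, total_bytes) in OLD_STATS.items():
    avg = average_file_size(num_files, total_bytes)
    print(f"{table}: {avg:,.0f} bytes per file")
```

Under these assumptions the averages come out at roughly 2 KB, 3 KB and 64 KB per file, far below the default HDFS block size of 128 MB.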

Two things I'd like done, plus one related question:

  • Update the code so that new data gets generated over a smaller number of files. [DONE.]
  • Devise a job to read / coalesce / replace old data to reduce the existing footprint. [Avoided by deleting old data and setting a systemd timer]
  • Related topic: Should we purge historical data? [DONE.]

Event Timeline

CBogen renamed this task from Reduce image_suggestion HDFS files footprint to [M] Reduce image_suggestion HDFS files footprint. Dec 15 2022, 5:17 PM

From the asks above in description:

Related topic: Should we purge historical data?

@mfossati, @Cparle Right now we have:

xcollazo@stat1007:/mnt/hdfs/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_suggestions$ pwd
/mnt/hdfs/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_suggestions
xcollazo@stat1007:/mnt/hdfs/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_suggestions$ ls | grep snapshot | wc -l
28

So 28 weeks ~= 7 months. Is there value in keeping this data, or should we add a job to clean this up? If so, how much should we keep? (related T317364#8225212).

It's probably worth keeping some of the data, just in case. The last 4 snapshots, perhaps? And maybe the first one from each month for the last 6 months - so a total of 10. If there's a need for other old data we can always regenerate it from the source data

If there's a need for other old data we can always regenerate it from the source data

@Cparle Actually, the table dependencies for image_suggestions are purged frequently. The monthly table deps have 6 months available, but it looks like the limiting factor would be the weekly deps, which right now have at most 6 weeks available:

hive (wmf)> show partitions wikidata_entity;
OK
partition
log4j:ERROR No output stream or file set for the appender named [console].
snapshot=2022-10-31
snapshot=2022-11-07
snapshot=2022-11-14
snapshot=2022-11-21
snapshot=2022-11-28
snapshot=2022-12-05
Time taken: 0.292 seconds, Fetched: 6 row(s)


hive (wmf)> show partitions wikidata_item_page_link;
OK
partition
snapshot=2022-10-31
snapshot=2022-11-07
snapshot=2022-11-14
snapshot=2022-11-21
snapshot=2022-11-28
snapshot=2022-12-05
Time taken: 0.106 seconds, Fetched: 6 row(s)

So recovering historical data doesn't seem to be an option. Does this change your assessment?

No - I'm probably being over-cautious. We've never needed to go back and regenerate old data so far, and I can't think why we'd need to. Doing what everyone else is doing is fine with me.

Doing what everyone else is doing is fine with me

All right, this makes the deletion logic simple: keep last 6 weeks, mimicking the other weekly jobs.

I think this also makes the work for this task easier since I don't need to make a separate job to rewrite the older data that has many small files. In a few weeks it will solve itself.
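
For illustration, the retention rule agreed above (keep the last 6 weekly snapshots, drop the rest) can be sketched as follows. This is not the refinery-drop-older-than implementation; the function name `snapshots_to_drop` and the extra 2022-10-24 partition are hypothetical, and it assumes partitions are named `snapshot=YYYY-MM-DD` as in the `show partitions` output earlier in the thread.

```python
from __future__ import annotations

from datetime import date


def snapshots_to_drop(partitions: list[str], keep: int = 6) -> list[str]:
    """Return the 'snapshot=YYYY-MM-DD' partitions outside the newest `keep`."""

    def snapshot_date(partition: str) -> date:
        # "snapshot=2022-12-05" -> date(2022, 12, 5)
        return date.fromisoformat(partition.split("=", 1)[1])

    ordered = sorted(partitions, key=snapshot_date)
    # Everything except the newest `keep` entries is eligible for deletion.
    return ordered[:-keep] if keep > 0 else ordered


partitions = [
    "snapshot=2022-10-24",  # hypothetical older snapshot, added for illustration
    "snapshot=2022-10-31",
    "snapshot=2022-11-07",
    "snapshot=2022-11-14",
    "snapshot=2022-11-21",
    "snapshot=2022-11-28",
    "snapshot=2022-12-05",
]
print(snapshots_to_drop(partitions))
```

With exactly six snapshots present the function returns an empty list, so running it weekly would only ever remove the snapshot that has just aged out.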

There were two unused tables: analytics_platform_eng.imagerec and analytics_platform_eng.imagerec_prod. They had not been written to since Jan 2022 and they were not written to by the image_suggestions Airflow DAG.

I confirmed with @Cparle that they were no longer necessary.

I deleted the Hive tables like so:

sudo -u analytics-platform-eng kerberos-run-command analytics-platform-eng hive
...
hive (analytics_platform_eng)> drop table imagerec;
OK
Time taken: 0.257 seconds
hive (analytics_platform_eng)> drop table imagerec_prod;
OK
Time taken: 0.153 seconds

Then I removed the related HDFS data like so:

sudo -u analytics-platform-eng kerberos-run-command analytics-platform-eng hdfs dfs -rm -R hdfs://analytics-hadoop/user/analytics-platform-eng/imagerec
sudo -u analytics-platform-eng kerberos-run-command analytics-platform-eng hdfs dfs -rm -R hdfs://analytics-hadoop/user/analytics-platform-eng/imagerec_prod

xcollazo updated the task description.

Change 870971 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery@master] Modify refinery-drop-older-than to support 'snapshot' partitions

https://gerrit.wikimedia.org/r/870971

Change 870974 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/puppet@production] Add a systemd timer to clean up old data related to image_suggestions

https://gerrit.wikimedia.org/r/870974

All right, we have the four changes above pending review, and a task that needs help from an SRE at T325837.

The below has been merged. Will monitor the platform_eng Airflow instance for a successful production run.

I have addressed the review comments on patch set 1 for https://gerrit.wikimedia.org/r/870971 and submitted patch set 2 for re-review.

Should this be in "code review" instead of blocked?

Should this be in "code review" instead of blocked?

Fixed!

Change 870971 merged by Xcollazo:

[analytics/refinery@master] Modify refinery-drop-older-than to support 'snapshot' partitions

https://gerrit.wikimedia.org/r/870971

Added the patch to the Weekly train: https://etherpad.wikimedia.org/p/analytics-weekly-train. Hopefully will be merged sometime tomorrow Tue Jan 10.

@xcollazo is this merged? can it be closed? thanks!

It is merged.

But we can't close this ticket until the change below is merged as well. It is blocked by T326827, which is moving along.

Change 870974 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/puppet@production] Add a systemd timer to clean up old data related to image_suggestions

https://gerrit.wikimedia.org/r/870974

Just for fun:

Old stats:

HDFS path                                                                           # Files   Data size (not replicated)
/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_instanceof_cache   2624750   5421414826
/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_title_cache        2525100   7129736283
/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_suggestions        2521066   160342855877

New stats:

HDFS path                                                                           # Files   Data size (not replicated)
/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_instanceof_cache   407611    974263141
/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_title_cache        408280    1416692542
/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_suggestions        407611    41614030183

% compared to original:

HDFS path                                                                           # Files   Data size (not replicated)
/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_instanceof_cache   16%       18%
/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_title_cache        16%       20%
/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_suggestions        16%       26%
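
The percentages above can be reproduced with a short sketch. It assumes each old and new figure reads as a file count followed by a data size (for example 407611 files and 974263141 for the new instanceof cache); that reading is my assumption, chosen because it matches the reported percentages after rounding.

```python
# (number of files, data size) per table, before and after the cleanup.
# Assumption: the figures split as "<file count><size>"; not stated in the task.
OLD = {
    "image_suggestions_instanceof_cache": (2_624_750, 5_421_414_826),
    "image_suggestions_title_cache": (2_525_100, 7_129_736_283),
    "image_suggestions_suggestions": (2_521_066, 160_342_855_877),
}
NEW = {
    "image_suggestions_instanceof_cache": (407_611, 974_263_141),
    "image_suggestions_title_cache": (408_280, 1_416_692_542),
    "image_suggestions_suggestions": (407_611, 41_614_030_183),
}


def percent_of_original(new: int, old: int) -> int:
    """Return `new` as a whole-number percentage of `old`."""
    return round(100 * new / old)


percentages = {
    table: (
        percent_of_original(NEW[table][0], OLD[table][0]),  # files
        percent_of_original(NEW[table][1], OLD[table][1]),  # data size
    )
    for table in OLD
}
for table, (files_pct, size_pct) in percentages.items():
    print(f"{table}: files {files_pct}%, size {size_pct}%")
```

The file count shrinks more than the data size does, which is what one would expect when the same kind of data is written into fewer, larger files.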

The only remaining task here is the merging of https://gerrit.wikimedia.org/r/c/operations/puppet/+/870974/, which I hope will happen in the next day or so.

Change 870974 merged by Btullis:

[operations/puppet@production] Add a systemd timer to clean up old data related to image_suggestions

https://gerrit.wikimedia.org/r/870974

Confirmed that the systemd timer is present on an-launcher1002:

xcollazo@an-launcher1002:~$ systemctl list-timers | grep drop-image-suggestions
Mon 2023-01-23 13:00:00 UTC  4 days left         n/a                          n/a                 drop-image-suggestions.timer                              drop-image-suggestions.service

We're done here! Thank you all!

xcollazo updated the task description.