Page MenuHomePhabricator

Add image table to monthly sqoop list
Closed, ResolvedPublic

Description

The image table is the authoritative source of whether a file is local to a given wiki, so to answer the questions in the parent task we'd like to have that table sqooped monthly so it can be joined with the imagelinks table.

Event Timeline

The team will review and prioritize this during our next board review meeting (October 26th).

razzi moved this task from Incoming to Datasets on the Analytics board.
razzi added a project: Analytics-Kanban.

Change 643029 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] [WIP] Update sqoop adding tables and removing timestamps

https://gerrit.wikimedia.org/r/643029

Change 643033 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Refactor oozie mediawiki-history-load job

https://gerrit.wikimedia.org/r/643033

Change 643985 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add tables to mediawiki-history-load

https://gerrit.wikimedia.org/r/643985

Change 644189 had a related patch set uploaded (by Joal; owner: Joal):
[operations/puppet@production] Import new tables to analytics datalake with sqoop

https://gerrit.wikimedia.org/r/644189

Change 643029 merged by Joal:
[analytics/refinery@master] Update sqoop adding tables

https://gerrit.wikimedia.org/r/643029

Change 643033 merged by Joal:
[analytics/refinery@master] Refactor oozie mediawiki-history-load job

https://gerrit.wikimedia.org/r/643033

Change 643985 merged by Joal:
[analytics/refinery@master] Add tables to mediawiki-history-load

https://gerrit.wikimedia.org/r/643985

Change 644189 merged by Elukey:
[operations/puppet@production] Import new tables to analytics datalake with sqoop

https://gerrit.wikimedia.org/r/644189

I checked the Data Lake and noticed we now have wmf_raw.mediawiki_image there. From a quick query of it for a handful of wikipedias and comparing the file counts to List of Wikipedias, this looks good to go from my end. Thanks for making this happen!

Ah! Forgot to mention in the task.
The image is now present indeed for almost all wikis - One however has failed: commonswiki (one for which images are not that important, are they?).
I'm currently doing exploration to try and solve that issue. Will the task up-to-date (sorry to have forgotten for this note).
t

After investigations based on @nettrom_WMF invaluable intuitions, the problem getting commonswiki image table comes from instances of img_metadata blob being so big that they cause problem.
I managed to get a distribution of sizes (using CEIL(LOG10(LENGTH(img_metadata))):

+----+--------+                                                                 
|l   |count   |
+----+--------+
|null|7425    |
|0   |3145369 |
|1   |15956   |
|2   |9715042 |
|3   |16314421|
|4   |36097170|
|5   |708376  |
|6   |552274  |
|7   |124839  |
|8   |578     |
+----+--------+

Sqooping the table without the field takes about 10 minutes, while getting it with img_metadata size took ~2h.
We probably can find a limit allowing to get the field value for length smaller than a certain amount of bytes and null otherwise, but it would still be slow.
If the value in that field is small I suggest we just don't get it :)

Thank you for digging into this @JAllemandou ! Since img_metadata is a serialized PHP array, I think the column has limited value by itself because those values are more useful when they're processed into a different format (see for example source_log_params in mediawiki_user_history). Secondly, the data is in the replicated database in the case someone needs to do analysis based on it, so I think we have meaningful ways to work around this. In summary, I don't see a strong argument for keeping it, and I agree with the proposal to not including it.

Change 647681 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Correct mediawiki_image table sqoop and creation

https://gerrit.wikimedia.org/r/647681

Change 647681 merged by Joal:
[analytics/refinery@master] Correct mediawiki_image table sqoop and creation

https://gerrit.wikimedia.org/r/647681