The image table is the authoritative source of whether a file is local to a given wiki, so to answer the questions in the parent task we'd like to have that table sqooped monthly so it can be joined with the imagelinks table.
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | cchen | T261343 Dashboard of multimedia usage on the Wikipedias |
| Resolved | | nettrom_WMF | T265768 Statistics on media usage across Wikipedias |
| Resolved | | JAllemandou | T266077 Add image table to monthly sqoop list |
Event Timeline
The team will review and prioritize this during our next board review meeting (October 26th).
Change 643029 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] [WIP] Update sqoop adding tables and removing timestamps
Change 643033 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Refactor oozie mediawiki-history-load job
Change 643985 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add tables to mediawiki-history-load
Change 644189 had a related patch set uploaded (by Joal; owner: Joal):
[operations/puppet@production] Import new tables to analytics datalake with sqoop
Change 643033 merged by Joal:
[analytics/refinery@master] Refactor oozie mediawiki-history-load job
Change 643985 merged by Joal:
[analytics/refinery@master] Add tables to mediawiki-history-load
Change 644189 merged by Elukey:
[operations/puppet@production] Import new tables to analytics datalake with sqoop
I checked the Data Lake and noticed we now have wmf_raw.mediawiki_image there. From a quick query of it for a handful of wikipedias and comparing the file counts to List of Wikipedias, this looks good to go from my end. Thanks for making this happen!
Ah! Forgot to mention in the task.
The image table is indeed now present for almost all wikis. One, however, has failed: commonswiki (a wiki for which images are not that important, are they?).
I'm currently exploring to try to solve that issue. I will keep the task up to date (sorry for having forgotten this note).
After investigation based on @nettrom_WMF's invaluable intuitions: the problem with sqooping the commonswiki image table comes from some img_metadata blobs being so big that they cause problems.
I managed to get a distribution of sizes (using `CEIL(LOG10(LENGTH(img_metadata)))`):

| l | count |
|---|---|
| null | 7425 |
| 0 | 3145369 |
| 1 | 15956 |
| 2 | 9715042 |
| 3 | 16314421 |
| 4 | 36097170 |
| 5 | 708376 |
| 6 | 552274 |
| 7 | 124839 |
| 8 | 578 |
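As an illustration (not the actual query used), the bucketing expression above can be mirrored in a short Python sketch; the function name is my own:

```python
import math

def metadata_size_bucket(blob):
    """Mirror of the SQL expression CEIL(LOG10(LENGTH(img_metadata))).

    Returns None for NULL or empty blobs (MySQL's LOG10(0) yields NULL).
    """
    if blob is None or len(blob) == 0:
        return None
    return math.ceil(math.log10(len(blob)))

# Bucket 4 (lengths 1,001-10,000 bytes) is the most common in the
# distribution above, with ~36M rows; bucket 8 (lengths up to ~100MB)
# covers only 578 rows, but those are the ones that break the sqoop.
```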
Sqooping the table without the field takes about 10 minutes, while sqooping it with img_metadata took ~2h.
We could probably find a limit, fetching the field value when its length is smaller than a certain number of bytes and NULL otherwise, but it would still be slow.
If the value of that field is small, I suggest we just not sqoop it :)
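A minimal sketch of that length-cutoff idea, assuming a hypothetical threshold (the actual cutoff and expression would live in the sqoop query, e.g. something like `IF(LENGTH(img_metadata) <= limit, img_metadata, NULL)`):

```python
# Hypothetical cutoff; the discussion does not settle on a specific value.
MAX_METADATA_BYTES = 10_000

def clipped_metadata(blob, limit=MAX_METADATA_BYTES):
    """Python mirror of a SQL expression like
    IF(LENGTH(img_metadata) <= limit, img_metadata, NULL):
    keep small blobs, replace oversized ones with NULL."""
    if blob is None or len(blob) > limit:
        return None
    return blob
```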
Thank you for digging into this @JAllemandou ! First, since img_metadata is a serialized PHP array, I think the column has limited value by itself; those values are more useful when processed into a different format (see, for example, source_log_params in mediawiki_user_history). Second, the data is available in the replicated databases in case someone needs to do analysis based on it, so we have meaningful ways to work around this. In summary, I don't see a strong argument for keeping it, and I agree with the proposal to not include it.
Change 647681 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Correct mediawiki_image table sqoop and creation
Change 647681 merged by Joal:
[analytics/refinery@master] Correct mediawiki_image table sqoop and creation