User Details
- User Since
- Jan 6 2022, 7:27 PM (100 w, 3 d)
- Availability
- Available
- LDAP User
- Marco Fossati
- MediaWiki User
- MFossati (WMF)
Thu, Dec 7
Many thanks for the valuable suggestion! Implemented in this commit with a default value of 4.
Now output files dropped to 1k! 🎉
Tue, Dec 5
Mon, Dec 4
Also CC @fkaelin .
Thanks for the heads up about production VS backup. The backup access request was merely based on my understanding that deleted images were stored there. If production access is a better option, then let's definitely opt for it, CC @MatthewVernon.
In any case, we are still blocked on the T&C approval.
Is there anything we can do from our side to unblock?
Fri, Dec 1
@jcrespo , would it be possible to use the internal reverse proxy to directly download deleted images via HTTP like here?
- The typical size is roughly 60 GB, with 207 folders holding 401 files (coalesce = 400 + _SUCCESS) each.
- no explicit partitioning
Note that the Spark job responsible for this output is the largest and most complex we have among our team's data pipelines, and usually takes 1 day of computation to complete.
The current implementation writes one parquet per wiki, which results in those 207 folders. Changing this behavior is out of scope, as it would require a lot of work: I tried a simple solution that writes to a single parquet, and it caused all sorts of trouble for the Spark executors.
Also, further reducing the coalesce value might increase the execution time, which isn't viable either.
Report
script | coalesce | files before | files after |
---|---|---|---|
sections.py | 8 | 2049 | 9 |
embeddings.py | 100 | 1025 | 101 |
features.py | 400 | 807k | 79k |
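For reference, a minimal sketch of what the coalesce change amounts to. It assumes `df` is a PySpark DataFrame; the helper names are hypothetical and the actual scripts may differ. Each output folder gets at most one part file per coalesced partition plus the `_SUCCESS` marker, which is where figures like 9, 101, and 401 come from.

```python
def max_files_per_folder(coalesce: int) -> int:
    """Upper bound on files in one output folder: part files + _SUCCESS."""
    return coalesce + 1


def write_coalesced(df, path: str, coalesce: int) -> None:
    """Hypothetical helper: reduce the partition count, then write parquet.

    `df` is assumed to be a pyspark.sql.DataFrame. Note that coalesce()
    only ever lowers the number of partitions, it never raises it.
    """
    df.coalesce(coalesce).write.parquet(path)


print(max_files_per_folder(8))
print(max_files_per_folder(100))
print(max_files_per_folder(400))
```

Since `coalesce` avoids a full shuffle, very low values squeeze the final stage into few tasks, which is the execution-time trade-off mentioned above.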
Thu, Nov 30
Report
script | coalesce | files before | files after | partition before | partition after |
---|---|---|---|---|---|
article_images.py | 100 | 298k | 101 | wiki | none |
recommendation.py | 40 | 114k | 41 | wiki | none |
Wed, Nov 29
Tue, Nov 28
With T347750#9332285 we'll actually ask for a link, thus increasing the chance of a URL.
Directly moving to verify on production: we agreed with @Etonkovidova that the latest patch is tiny enough.
Perhaps I made the task description too broad: what I essentially meant is just a regular expression that matches a website and a warning if none is found.
We're explicitly asking the user to enter a website, although I acknowledge a full URL might not occur often.
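To make the idea concrete, here's a minimal sketch of the "regular expression plus warning" check in Python. The pattern, the function name, and the warning text are all assumptions for illustration, not the actual patch.

```python
import re
from typing import Optional

# Loose, hypothetical pattern: optional scheme, dot-separated host labels,
# an alphabetic top-level domain, and an optional path.
WEBSITE_RE = re.compile(
    r"(?:https?://)?(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}(?:/\S*)?",
    re.IGNORECASE,
)


def website_warning(text: str) -> Optional[str]:
    """Return a warning message if no website is found in `text`, else None."""
    if WEBSITE_RE.search(text):
        return None
    return "Please enter a website."


print(website_warning("https://example.org/page"))
print(website_warning("my own work"))
```

A pattern this loose accepts bare domains such as `example.com`, which fits the case where users don't paste a full URL, but it also lets through things like file names (`photo.jpg`), so it can only back a warning, not a hard validation.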
Mon, Nov 27
Skipping estimation: this ticket can be tackled together with T347558: [S] Coalesce section alignment image suggestions output.
Fri, Nov 24
Weekly update
Back to work, continued where I left off.
Sent 3 incremental patches that are now ready for review.
Thu, Nov 23
Nov 2 2023
Hey @Ladsgroup , we're using that table in the image suggestions production data pipeline, see this query.
While I'll try to monitor this ticket and mailing lists announcements, it would be great if you could ping me in advance before the breaking change rolls out, so that we can take action.
You could also leave a comment in T350007: [M] Adapt image suggestions to comply with breaking database schema changes if you prefer, thanks!
Oct 30 2023
Closing as invalid: we plan to use actual data to train a machine learning model, as part of T349641: [Investigation EPIC] Machine detection for media with copyright issues in Upload Wizard on Commons.
CC @JAllemandou .
Weekly update
Made good progress, then spent some time debugging why code that handles warning messages for sub-radio buttons was never reached.
Note that the acceptance criteria came on Wed 25 and, in my opinion, increased the previously estimated effort.
Oct 27 2023
Weekly update
This week's SEAL run went fine. As a result, I checked SLIS output, which looks fine as well. Image suggestions completed, too. Closing!
Oct 26 2023
I was thinking the same 😄
Oct 24 2023
Oct 23 2023
Review done. Moving to blocked, pending community feedback integration.
Oct 20 2023
Weekly update
The patch was reviewed by @matthiasmullie, who directly integrated feedback (thanks!).
Pending some work on tests that are currently failing.
Weekly update
The SEAL pipeline is currently running in production, waiting for the last DAG task to complete. A couple of quick hotfixes were needed to ensure a proper execution.
Here are my suggestions:
- match against deletion request opening reasons, closing reasons, and file deletion reasons (AKA edit messages or revision comments). See the data lake query to extract file deletion reasons
- expand all wikilink shorthands, such as COM:DW = COM:DERIV = Commons:Derivative_works. Note that COM prefixes always expand to Commons
- compile the list by looking at wikilink and word frequencies, plus word clusters from T340546: [XL] Analysis of deletion requests on Commons. The most frequent are:
- COM:FOP = COM:PANO = Commons:Freedom_of_panorama
- COM:SS = COM:SCREENSHOT[S] = Commons:Screenshots
- COM:ALBUM = Commons:ALBUM
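The shorthand expansion could be sketched as follows. This is only an illustration: the mapping holds just the shorthands listed above, while the function name and the case-insensitive matching are assumptions.

```python
import re

# Shorthand -> full page title, taken from the list above.
SHORTHANDS = {
    "COM:DW": "Commons:Derivative_works",
    "COM:DERIV": "Commons:Derivative_works",
    "COM:FOP": "Commons:Freedom_of_panorama",
    "COM:PANO": "Commons:Freedom_of_panorama",
    "COM:SS": "Commons:Screenshots",
    "COM:SCREENSHOT": "Commons:Screenshots",
    "COM:SCREENSHOTS": "Commons:Screenshots",
    "COM:ALBUM": "Commons:ALBUM",
}

# Longest alternatives first, so COM:SCREENSHOTS is tried before COM:SS.
_PATTERN = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, SHORTHANDS), key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,  # assumption: shorthands may appear in any case
)


def expand_shorthands(reason: str) -> str:
    """Replace known wikilink shorthands in a deletion reason."""
    return _PATTERN.sub(lambda m: SHORTHANDS[m.group(1).upper()], reason)


print(expand_shorthands("Per [[COM:FOP]], no FoP in France"))
```

Normalizing every reason this way before matching means the keyword list only needs the expanded titles. Shorthands missing from the mapping are left untouched, so the dictionary would have to grow along with the frequency analysis.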
Oct 17 2023
Oct 16 2023
I can confirm Airflow variables are not updated after deployment.
Opened T348963: DagProperties don't automatically update Airflow variables, CC @xcollazo .
Thanks again for your fast reaction @xcollazo , much appreciated!
I can confirm the SEAL DAG is now deployed.
Hey @VirginiaPoundstone , I don't think there's any deadline from our side. However, please note that this was initially raised as one of the causes that put the Hadoop cluster under pressure, CC @JAllemandou .
Oct 13 2023
Yeah I know 😄 , the heavier one is a pre-trained machine learning model and the other one has a lot of machine learning dependencies, despite my slimming efforts.
Maybe we can add them manually to unblock you. Let me try.
Thanks a lot and no worries, not in a hurry.
Would it be viable to bump the memory for future deployments?
Moving away from code review, pending deployment & production monitoring.
@xcollazo , could you please give us a hand? Java is OOMing, perhaps our artifacts are too heavy (one is 1.22 GB, the other 1.64 GB)? Pasting what's happening below:
mfossati@deploy2002:/srv/deployment/airflow-dags/platform_eng$ scap deploy
17:59:22 Started deploy [airflow-dags/platform_eng@520fa55]
17:59:22 Deploying Rev: HEAD = 520fa55c45d057c39c977e7cbe652844e9f66c00
17:59:22 Started deploy [airflow-dags/platform_eng@520fa55]: (no justification provided)
17:59:22 == DEFAULT ==
:* an-airflow1004.eqiad.wmnet
17:59:23 airflow-dags/platform_eng: fetch stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
17:59:24 airflow-dags/platform_eng: config_deploy stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
17:59:49 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'airflow-dags/platform_eng', '-g', 'default', 'promote', '--refresh-config'] (ran as analytics-platform-eng@an-airflow1004.eqiad.wmnet) returned [1]: Could not chdir to home directory /nonexistent: No such file or directory
Registering scripts in directory '/srv/deployment/airflow-dags/platform_eng-cache/revs/520fa55c45d057c39c977e7cbe652844e9f66c00/scap/scripts'
Registering scripts in directory '/srv/deployment/airflow-dags/platform_eng-cache/revs/520fa55c45d057c39c977e7cbe652844e9f66c00/scap/scripts'
Executing check 'artifacts_sync'
Check 'artifacts_sync' failed: hdfsWrite: NewByteArray error: java.lang.OutOfMemoryError: Java heap space
Traceback (most recent call last):
  File "/usr/lib/airflow/bin/artifact-cache", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/airflow/lib/python3.10/site-packages/workflow_utils/artifact/cli.py", line 32, in main
    artifact.cache_put(force=args['--force'])
  File "/usr/lib/airflow/lib/python3.10/site-packages/workflow_utils/artifact/artifact.py", line 71, in cache_put
    cache.put(self, open_file)
  File "/usr/lib/airflow/lib/python3.10/site-packages/workflow_utils/artifact/cache.py", line 128, in put
    output.write(input_stream.read())
  File "pyarrow/io.pxi", line 359, in pyarrow.lib.NativeFile.write
  File "pyarrow/error.pxi", line 113, in pyarrow.lib.check_status
OSError: [Errno 12] HDFS Write failed.
Detail: [errno 12] Cannot allocate memory
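Judging from the traceback, `cache.put` reads the whole artifact in one go (`output.write(input_stream.read())`), so a file larger than 1 GB is handed over as a single buffer. Besides bumping the memory, a chunked copy might be worth a look. A minimal sketch, assuming both streams are ordinary file-like objects; `copy_in_chunks` and the chunk size are hypothetical, and this is not tested against workflow_utils or HDFS:

```python
import io
import shutil
from typing import BinaryIO

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per write; an arbitrary, assumed value


def copy_in_chunks(
    input_stream: BinaryIO, output: BinaryIO, chunk_size: int = CHUNK_SIZE
) -> None:
    """Copy input_stream to output without loading the whole file in memory."""
    shutil.copyfileobj(input_stream, output, chunk_size)


# Tiny in-memory demonstration with a 3-byte chunk size.
source = io.BytesIO(b"0123456789")
target = io.BytesIO()
copy_in_chunks(source, target, chunk_size=3)
print(target.getvalue())
```

With this approach the peak buffer size is bounded by the chunk size rather than by the artifact size.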
Oct 9 2023
Oct 5 2023
Oct 3 2023
All data pipelines succeeded.
The required dataset is now available, row counts of ALIS, SLIS and the delta are all consistent with the previous run.
First complete version & DAG ready for code review.
The upstream dependency is back, so I manually triggered all data pipelines.
Section topics and section alignment suggestions went fine, image suggestions running now.
All merged, closing.
Oct 2 2023
The wmf.wikidata_item_page_link/snapshot=2023-09-18 upstream dependency isn’t available yet.
Corresponding DAG sensors timed out, causing all DAGs to fail.
Sep 28 2023
Note that dumps will be owned by Data Engineering, Data Products team.
Work on Dumps 2.0 is also underway.
CC @JAllemandou .
Sep 22 2023
@Urbanecm_WMF , T343844: NEW BUG REPORT fiwiki’s section-level image suggestions aren’t generated in production looks fine now in terms of data.
Here's the current section-level image suggestions (SLIS) count for Vietnamese Wikipedia:
In [1]: isu = spark.read.table('analytics_platform_eng.image_suggestions_suggestions').where('snapshot="2023-09-11"')
In [2]: slis = isu.where(isu.section_index.isNotNull())
In [3]: slis.where(slis.wiki == 'viwiki').count()
Out[3]: 38210
Re the acceptance criterion of this ticket: we have at least 1 SLIS for 293 Wikipedias, not all. Here's the complete list with SLIS raw counts (sorry for the long paste!). You may want to skip Wikipedias with too few SLIS.
In [4]: slis.select(slis.wiki).distinct().count()
Out[4]: 293
In [5]: slis.groupBy(slis.wiki).count().orderBy('count', ascending=False).show(n=300)
wiki | count |
---|---|
enwiki | 237706 |
dewiki | 129877 |
ruwiki | 128102 |
frwiki | 124932 |
eswiki | 123535 |
itwiki | 118022 |
ukwiki | 95136 |
plwiki | 90826 |
nlwiki | 79564 |
huwiki | 74956 |
hewiki | 74620 |
cswiki | 68765 |
nowiki | 62103 |
cawiki | 62063 |
jawiki | 60551 |
ptwiki | 58337 |
svwiki | 57341 |
zhwiki | 56971 |
arwiki | 56372 |
fiwiki | 55348 |
srwiki | 52096 |
be_x_oldwiki | 48349 |
trwiki | 41341 |
elwiki | 41114 |
idwiki | 39665 |
bgwiki | 39489 |
viwiki | 38210 |
hrwiki | 36195 |
rowiki | 34542 |
shwiki | 33347 |
dawiki | 31348 |
simplewiki | 30523 |
skwiki | 30279 |
fawiki | 28997 |
hywiki | 24828 |
kowiki | 24236 |
mkwiki | 23015 |
glwiki | 22539 |
mswiki | 22387 |
lvwiki | 21521 |
bewiki | 21180 |
azwiki | 20708 |
slwiki | 19090 |
bswiki | 19047 |
eowiki | 18518 |
astwiki | 18457 |
euwiki | 18347 |
etwiki | 18034 |
ltwiki | 17549 |
bnwiki | 14227 |
kkwiki | 11855 |
thwiki | 11848 |
afwiki | 11295 |
hiwiki | 10680 |
tawiki | 10131 |
kawiki | 9685 |
nnwiki | 9249 |
sqwiki | 8961 |
alswiki | 8678 |
mlwiki | 7545 |
uzwiki | 7021 |
fywiki | 5864 |
jvwiki | 5744 |
urwiki | 5573 |
iswiki | 5193 |
lawiki | 4765 |
ocwiki | 4093 |
pnbwiki | 3859 |
knwiki | 3743 |
tewiki | 3707 |
cywiki | 3596 |
bawiki | 3498 |
suwiki | 3462 |
tlwiki | 3326 |
pawiki | 3279 |
scowiki | 2911 |
lmowiki | 2717 |
mrwiki | 2715 |
mnwiki | 2594 |
ttwiki | 2504 |
swwiki | 2373 |
anwiki | 2357 |
tgwiki | 2330 |
zh_classicalwiki | 2218 |
gawiki | 2204 |
ndswiki | 2186 |
zh_yuewiki | 2160 |
guwiki | 2150 |
pswiki | 2077 |
liwiki | 2061 |
newiki | 2039 |
mywiki | 1934 |
roa_tarawiki | 1927 |
lbwiki | 1851 |
arzwiki | 1806 |
hywwiki | 1707 |
mtwiki | 1668 |
kywiki | 1508 |
siwiki | 1252 |
bhwiki | 1220 |
stqwiki | 1153 |
kuwiki | 1130 |
aswiki | 1047 |
scnwiki | 952 |
cowiki | 907 |
vecwiki | 882 |
iawiki | 881 |
mgwiki | 877 |
orwiki | 869 |
minwiki | 791 |
mwlwiki | 755 |
nds_nlwiki | 739 |
brwiki | 711 |
ruewiki | 644 |
wuuwiki | 634 |
azbwiki | 633 |
sawiki | 630 |
kmwiki | 609 |
iowiki | 569 |
vepwiki | 532 |
krcwiki | 484 |
mznwiki | 462 |
ckbwiki | 437 |
yiwiki | 428 |
hawiki | 428 |
sowiki | 413 |
gdwiki | 406 |
sahwiki | 400 |
koiwiki | 393 |
rmwiki | 383 |
novwiki | 379 |
cebwiki | 371 |
scwiki | 339 |
barwiki | 333 |
bclwiki | 332 |
vlswiki | 321 |
ladwiki | 306 |
sdwiki | 288 |
tkwiki | 284 |
pmswiki | 279 |
cvwiki | 276 |
xmfwiki | 273 |
cewiki | 262 |
fiu_vrowiki | 254 |
hifwiki | 254 |
frrwiki | 253 |
ilowiki | 248 |
fowiki | 241 |
map_bmswiki | 239 |
maiwiki | 236 |
extwiki | 225 |
napwiki | 207 |
oswiki | 205 |
tcywiki | 205 |
warwiki | 201 |
dagwiki | 182 |
bxrwiki | 175 |
lezwiki | 174 |
iewiki | 163 |
igwiki | 159 |
furwiki | 140 |
bat_smgwiki | 133 |
newwiki | 125 |
htwiki | 123 |
altwiki | 117 |
hsbwiki | 113 |
arywiki | 109 |
lowiki | 108 |
dtywiki | 108 |
diqwiki | 105 |
zeawiki | 103 |
gomwiki | 102 |
zh_min_nanwiki | 92 |
gvwiki | 85 |
cbk_zamwiki | 85 |
tumwiki | 81 |
niawiki | 79 |
wawiki | 76 |
awawiki | 75 |
nsowiki | 73 |
banwiki | 73 |
avwiki | 72 |
gorwiki | 71 |
dvwiki | 71 |
snwiki | 69 |
mrjwiki | 69 |
bjnwiki | 63 |
skrwiki | 63 |
ganwiki | 60 |
papwiki | 59 |
szlwiki | 58 |
madwiki | 57 |
fjwiki | 56 |
avkwiki | 55 |
szywiki | 55 |
lldwiki | 54 |
kswiki | 52 |
gnwiki | 50 |
lijwiki | 50 |
shnwiki | 48 |
smnwiki | 47 |
lfnwiki | 47 |
pcdwiki | 46 |
glkwiki | 45 |
inhwiki | 45 |
kaawiki | 45 |
emlwiki | 44 |
pflwiki | 42 |
sewiki | 42 |
kbdwiki | 37 |
dsbwiki | 36 |
zawiki | 36 |
pcmwiki | 35 |
wowiki | 34 |
myvwiki | 34 |
xhwiki | 31 |
smwiki | 31 |
gurwiki | 30 |
kbpwiki | 30 |
gcrwiki | 30 |
ugwiki | 29 |
tyvwiki | 29 |
nrmwiki | 29 |
amwiki | 28 |
satwiki | 27 |
udmwiki | 27 |
nqowiki | 27 |
pamwiki | 26 |
kwwiki | 25 |
amiwiki | 24 |
yowiki | 24 |
hawwiki | 23 |
shiwiki | 23 |
frpwiki | 22 |
bpywiki | 22 |
mhrwiki | 22 |
acewiki | 20 |
trvwiki | 20 |
kabwiki | 19 |
roa_rupwiki | 19 |
crhwiki | 17 |
twwiki | 17 |
quwiki | 17 |
csbwiki | 16 |
zuwiki | 16 |
bowiki | 15 |
lnwiki | 14 |
omwiki | 14 |
kshwiki | 14 |
pntwiki | 13 |
mnwwiki | 12 |
nywiki | 11 |
kvwiki | 11 |
tywiki | 11 |
blkwiki | 11 |
xalwiki | 10 |
vowiki | 10 |
rwwiki | 9 |
ltgwiki | 9 |
bugwiki | 8 |
hakwiki | 8 |
pdcwiki | 8 |
nahwiki | 8 |
atjwiki | 8 |
sswiki | 7 |
miwiki | 5 |
olowiki | 5 |
tetwiki | 5 |
jamwiki | 5 |
ffwiki | 4 |
lbewiki | 4 |
bmwiki | 4 |
tpiwiki | 4 |
tiwiki | 3 |
cuwiki | 3 |
cdowiki | 3 |
lgwiki | 3 |
arcwiki | 3 |
guwwiki | 2 |
gagwiki | 2 |
mniwiki | 2 |
pagwiki | 2 |
angwiki | 2 |
gucwiki | 2 |
gotwiki | 1 |
adywiki | 1 |
tswiki | 1 |
tnwiki | 1 |
taywiki | 1 |
mdfwiki | 1 |
kcgwiki | 1 |
kiwiki | 1 |
vewiki | 1 |
@xcollazo @mforns , actually it's essential to understand the expected behavior of DagProperties: if they don't automatically update Airflow's variables, new deployments won't be sustainable.
Re-opening the ticket: from our side, we can check what happens in the next deployment, but it would be great to get your advice, too.
- Section topics:
In [1]: st = spark.read.parquet('/user/analytics-platform-eng/structured-data/section_topics/2023-09-11')
In [2]: st.where(st.wiki_db == 'fiwiki').count()
Out[2]: 2367759
- Section alignment suggestions:
In [3]: ali = spark.read.parquet('/user/analytics-platform-eng/structured-data/section-alignment-suggestions/suggestions/2023-09-11')
In [4]: ali.where(ali.target_wiki_db == 'fiwiki').count()
Out[4]: 28518
- Section-level image suggestions:
In [5]: isu = spark.read.table('analytics_platform_eng.image_suggestions_suggestions').where('snapshot="2023-09-11"')
In [6]: slis = isu.where(isu.section_index.isNotNull())
In [7]: slis.where(slis.wiki == 'fiwiki').count()
Out[7]: 55348
Awesome! Closing.