User Details
- User Since
- Jan 6 2022, 7:27 PM (127 w, 6 d)
- Availability
- Available
- LDAP User
- Marco Fossati
- MediaWiki User
- MFossati (WMF)
Yesterday
Thu, Jun 6
Tue, Jun 4
Mon, Jun 3
isu = spark.read.table('analytics_platform_eng.image_suggestions_suggestions')
alis = isu.where('section_index is null')
slis = isu.where('section_index is not null')
Fri, May 31
@Etonkovidova @Sneha FYI as of now the patch is reverted, so we won't see the change on beta until we re-merge it.
@Etonkovidova @Sneha , the reason why I haven't added that horizontal line is because another one will show up in case of multiple uploads, so I've left it out.
Thu, May 30
Wed, May 29
Hey @KStoller-WMF , chiming in while @AUgolnikova-WMF is out of office: yes, I'll pick up this ticket next week. Stay tuned!
Tue, May 28
Mon, May 27
Wed, May 22
May 15 2024
May 14 2024
See T364551: [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard
I'd suggest we proceed with a base64 encoded image for now.
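For illustration, here is a minimal sketch of what a base64 encoded image payload could look like. The `image` field name and the JSON envelope are assumptions made for the example, not the logo detection service's actual request format.

```python
import base64
import json


def build_payload(image_bytes: bytes) -> str:
    """Wrap raw image bytes in a JSON document as base64 text."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    # "image" is a hypothetical field name, not the service's real schema
    return json.dumps({"image": encoded})


def read_payload(payload: str) -> bytes:
    """Recover the raw image bytes on the receiving side."""
    return base64.b64decode(json.loads(payload)["image"])
```

Base64 inflates the payload by roughly a third compared with raw binary, which is the trade-off to keep in mind here.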
With binary being the preferred format, right?
May 13 2024
I think that the logo detection service can be exposed through an internal endpoint, so it will be inside WMF’s infrastructure.
Moreover, when an image is sent to the upload stash, there’s a set of already implemented checks including existing duplicates and previously deleted duplicates.
May 10 2024
I agree and have dug deeper into the current request being made to the Upload API: maybe the CSRF token is what we're looking for. See upload_file_in_chunks in the example request code. I can confirm that the Upload Wizard is sending a token parameter in the request.
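As a rough sketch, the parameters of a chunked upload request to the stash might be assembled like this. The parameter names follow the MediaWiki `action=upload` API; the helper itself is hypothetical and is not the Upload Wizard's code, and the chunk bytes would travel separately as the multipart `chunk` field.

```python
from typing import Dict, Optional


def upload_chunk_params(
    filename: str,
    filesize: int,
    offset: int,
    csrf_token: str,
    filekey: Optional[str] = None,
) -> Dict[str, str]:
    """Build the form fields for one chunk of a stashed upload (hypothetical helper)."""
    params = {
        "action": "upload",
        "format": "json",
        "stash": "1",
        "filename": filename,
        "filesize": str(filesize),
        "offset": str(offset),
        # CSRF token, as obtained from action=query&meta=tokens
        "token": csrf_token,
    }
    if filekey is not None:
        # Chunks after the first one reference the file key returned earlier
        params["filekey"] = filekey
    return params
```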
Chiming in: this will be done in T361061: [M] Update the 'other information' field in upload wizard.
May 9 2024
I've opened T364551: [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard to investigate the feasibility of this solution.
@isarantopoulos @kevinbazira , I think I found how to get a thumbnail from a stashed image. There you go: https://commons.wikimedia.org/wiki/Special:UploadStash/thumb/1awuam969hko.2tkfbz.10893556.png/224px-1awuam969hko.2tkfbz.10893556.png, where 1awuam969hko.2tkfbz.10893556.png is the stash file key. The 224px- prefix sets the thumbnail width in pixels.
Of course, I feel there's a caveat, as it seems that the thumbnail is generated on the fly at request time. Still not optimal, but sounds like a workable solution.
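To make the URL pattern concrete, here is a small sketch that builds the thumbnail URL from a stash file key. The function name and the default width are assumptions for illustration; only the URL shape comes from the example above.

```python
from urllib.parse import quote

STASH_BASE = "https://commons.wikimedia.org/wiki/Special:UploadStash"


def stash_thumb_url(file_key: str, width: int = 224) -> str:
    """Return the thumbnail URL for a stashed file at the given pixel width."""
    key = quote(file_key, safe="")
    # Pattern: .../thumb/<file key>/<width>px-<file key>
    return f"{STASH_BASE}/thumb/{key}/{width}px-{key}"
```

For example, `stash_thumb_url("1awuam969hko.2tkfbz.10893556.png")` reproduces the link quoted above.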
I can imagine we can tackle that from within the Upload Wizard with some JavaScript library. I can create a ticket to look into that if you think this would be the best solution.
Thinking out loud: what about sending multiple requests if the limit is reached? I speculate that 50 uploads are an edge case: if this happens, we could dispatch different requests.
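A minimal sketch of that idea: split the list of file keys into batches no larger than the limit and dispatch one request per batch. The default of 50 and the function name are assumptions for the example; the real limit would come from the service.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], limit: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `limit` entries."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    for start in range(0, len(items), limit):
        yield list(items[start:start + limit])
```

Each yielded batch would then go out as its own request, so the common case of a handful of uploads still needs a single call.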
May 8 2024
Hmm, I've just given it a try and I think it won't work for stashed images, which is a hard requirement for us.
@isarantopoulos , totally agree, makes a lot of sense.
May 7 2024
Fix deployed & pipeline resumed. Needs some monitoring.
May 6 2024
Great catch, I totally missed this!
I've just scratched the surface: it seems that the stash URL request should contain some logged-in user session ID to enable access, which is stored in a cookie. We'll have to dig into the Upload stash code base to fully understand the mechanism. For now I can see cookies like commonswikiSession that ring a bell.
That said, what if we pass image objects to the logo detection service instead (see T363506: Pass image objects to the logo detection service)? Would that not require a logged-in user? Definitely an open question.
May 2 2024
We haven't thought of this yet, mainly because pre-processing logic on the model side already handles resizing. That said, I agree it'd be better to directly send the 224x224 image object.
Change deployed:
0: jdbc:hive2://analytics-hive.eqiad.wmnet:10> describe wmf_raw.mediawiki_pagelinks;
+-------------------------+-----------+-------------------------------------------------------------------------------------+
| col_name                | data_type | comment                                                                             |
+-------------------------+-----------+-------------------------------------------------------------------------------------+
| pl_from                 | bigint    | Key to the page_id of the page containing the link                                  |
| pl_from_namespace       | int       | MediaWiki version: ? 1.24 - page_namespace of the page containing the link          |
| pl_target_id            | bigint    | Foreign key to linktarget.                                                          |
| snapshot                | string    | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) |
| wiki_db                 | string    | The wiki_db project                                                                 |
|                         | NULL      | NULL                                                                                |
| # Partition Information | NULL      | NULL                                                                                |
| # col_name              | data_type | comment                                                                             |
|                         | NULL      | NULL                                                                                |
| snapshot                | string    | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) |
| wiki_db                 | string    | The wiki_db project                                                                 |
+-------------------------+-----------+-------------------------------------------------------------------------------------+
May 1 2024
@Sneha :
- spaces before a file extension should trigger https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-space, but I can't seem to hit that, e.g., pic .jpg still triggers https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-filename
- prefixes like 666px- trigger https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-thumbnail, e.g., 666px-pic
- prefixes like 666px- together with a .svg.png suffix trigger https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-SVG-thumbnail, e.g., 666px-pic.svg.png
We're asking the user to omit the file extension, so https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-space and https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-SVG-thumbnail are less likely.
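To illustrate how the cases above differ, here is a sketch that maps a proposed title to the message it would be expected to trigger. The regular expressions are illustrative approximations written for this example, not the actual Titleblacklist entries on Commons, and they encode the intended behavior rather than what I observed for pic .jpg.

```python
import re
from typing import Optional

# Hypothetical patterns, checked from most to least specific
RULES = [
    ("Titleblacklist-custom-SVG-thumbnail", re.compile(r"^\d+px-.*\.svg\.png$", re.I)),
    ("Titleblacklist-custom-thumbnail", re.compile(r"^\d+px-")),
    ("Titleblacklist-custom-space", re.compile(r"\s\.[^.\s]+$")),
]


def expected_message(title: str) -> Optional[str]:
    """Return the name of the first matching message, or None if no rule matches."""
    for message, pattern in RULES:
        if pattern.search(title):
            return message
    return None
```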
Apr 30 2024
Correct. It's an on-wiki system message.
Could we have a custom dialog with only example text that is not linked to any of these pages? It seems there are a lot of variations of these pages, so we can't confidently rely on one. We are not showing any links or additional text. We are only showing examples (which are unlikely to change).
Yes, we could, but those messages seem to come from the community (e.g., https://commons.wikimedia.org/w/index.php?title=MediaWiki:Titleblacklist-custom-filename&action=history), so I'd opt for keeping the process intact, i.e., propose the updates on wiki. @Sannita, what do you think?
- Update the copy on the "view example" dialog as shown here
@Sneha, I think we need a Commons admin to update https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-filename. The patch has a temporary workaround so that we can test it, but I suggest removing it before merging.
FYI the following custom messages also exist:
- https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-space
- https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-double-apostrophe
- https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-thumbnail
- https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-SVG-thumbnail
Apr 29 2024
Apr 26 2024
Back on it.
Closing, see T347569#9747385.
Checked raw counts of the last 5 snapshots:
snapshots = ('2024-03-18', '2024-03-25', '2024-04-01', '2024-04-08', '2024-04-15')
tables = (
    'image_suggestions_instanceof_cache',
    'image_suggestions_lead_image_data',
    'image_suggestions_search_index_delta',
    'image_suggestions_search_index_full',
    'image_suggestions_suggestions',
    'image_suggestions_title_cache',
    'image_suggestions_wikidata_data',
)
for s in snapshots:
    print(s)
    for t in tables:
        print(t)
        ddf = spark.read.table(f'analytics_platform_eng.{t}').where(f"snapshot='{s}'")
        print(ddf.count())
    print()
2024-03-18
image_suggestions_instanceof_cache
5405468
image_suggestions_lead_image_data
8046032
image_suggestions_search_index_delta
6984930
image_suggestions_search_index_full
74311759
image_suggestions_suggestions
369129698
image_suggestions_title_cache
5228917
image_suggestions_wikidata_data
104657563
Apr 25 2024
Hey @kevinbazira , here's what a public stash URL would look like: https://commons.wikimedia.org/wiki/Special:UploadStash/file/1avpfxdmdb4c.deuia.10893556.png. The only variable would be the file key, i.e., 1avpfxdmdb4c.deuia.10893556.png.
Not 100% sure, but I guess that you can go for http://localhost:6500/wiki/Special:UploadStash/file/1avpfxdmdb4c.deuia.10893556.png, with commons.wikimedia.org as the host header.
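In case it helps, here is a sketch of such a request using Python's standard library. It only builds the request object and does not send it; the function name is made up for the example, and the localhost port is the one guessed above, so treat both as assumptions.

```python
import urllib.request


def local_stash_request(
    file_key: str,
    host: str = "commons.wikimedia.org",
    port: int = 6500,
) -> urllib.request.Request:
    """Build a GET request for a stashed file via localhost with an explicit Host header."""
    url = f"http://localhost:{port}/wiki/Special:UploadStash/file/{file_key}"
    # The Host header tells the local endpoint which wiki the request targets
    return urllib.request.Request(url, headers={"Host": host})
```

Passing the result to `urllib.request.urlopen` would actually send it; as noted earlier, access to the stash likely also needs the logged-in user's session.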
Apr 17 2024
Submitted a draft patch that needs extra pairs of eyes. Moving to code review.
Apr 16 2024
Migration will complete in roughly one week and old columns will be dropped in two weeks: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/Y4C7W4TEC7DXXTY3HKDBG7HB56QBRXPY/
If this all lands in April, wmf_raw.mediawiki_pagelinks/snapshot=2024-04 will contain the breaking changes.
Apr 10 2024
Published at https://commons.wikimedia.org/wiki/Commons:WMF_support_for_Commons/Upload_Wizard_Improvements/Logo_detection, closing. Thanks @Sannita for your work!
Apr 9 2024
Apr 8 2024
Outcome of a quick investigation on available pre-trained models that may fit our use case:
- it seems that pre-training is generally done on standard benchmark datasets, check out this list
- Keras offers models pre-trained on the following datasets:
Apr 5 2024
According to T345771#9526320:
- The old columns have been dropped in testwiki and will be dropped soon (this and next week) on commonswiki and testcommonswiki.
- The rest of the wikis will keep the old schema until all wikis have been migrated (or at least almost all of them if we realize wikidata is taking way too long).
Apr 4 2024
The prototype looks good to me, I'm excited to see this effort move to the next level!
@kevinbazira, I've especially appreciated the tightness of our development iterations 😄 .
@kevinbazira , I can confirm that inputs and outputs are fine.
FYI, I've fixed the expected type of the image dataset, so please use the latest commit.