https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/29 reviewed, looks great to me.
Yesterday
Mon, Jun 24
Fri, Jun 21
Wed, Jun 19
Thu, Jun 6
Tue, Jun 4
Mon, Jun 3
isu = spark.read.table('analytics_platform_eng.image_suggestions_suggestions')
alis = isu.where('section_index is null')
slis = isu.where('section_index is not null')
Fri, May 31
@Etonkovidova @Sneha FYI as of now the patch is reverted, so we won't see the change on beta until we re-merge it.
@Etonkovidova @Sneha, I've left out that horizontal line because another one would show up in case of multiple uploads.
Thu, May 30
Wed, May 29
Hey @KStoller-WMF , chiming in while @AUgolnikova-WMF is out of office: yes, I'll pick up this ticket next week. Stay tuned!
Tue, May 28
May 27 2024
May 22 2024
May 15 2024
May 14 2024
In T363506#9794241, @isarantopoulos wrote:We concluded that we will figure out the format after the team figures out the spike (accessing the image and sending a thumbnail to Lift Wing).
See T364551: [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard
I'd suggest we proceed with a base64 encoded image for now.
With binary being the preferred format, right?
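The base64 option suggested above could look roughly like the sketch below. The payload field names (`filename`, `image`) are hypothetical, since the request format was still under discussion; only the base64 encoding step itself comes from the thread.

```python
import base64
import json


def build_payload(image_bytes: bytes, filename: str) -> str:
    """Serialize raw image bytes into a JSON string with a base64 field.

    The "filename" and "image" keys are assumptions for illustration,
    not the agreed request format.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"filename": filename, "image": encoded})


# Round trip: the receiving side can recover the original bytes.
original = b"\x89PNG\r\n\x1a\n"  # PNG signature, standing in for a real thumbnail
payload = build_payload(original, "example.png")
decoded = base64.b64decode(json.loads(payload)["image"])
```

Base64 adds about one third to the payload size compared with raw binary, which is worth keeping in mind if binary remains the preferred format.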
May 13 2024
In T362749#9786269, @isarantopoulos wrote:In T362749#9786161, @Ladsgroup wrote:Yes, Upload stash shouldn't be accessed directly or indirectly. It is internal to mediawiki and private.
Having it private makes total sense from a user privacy point of view. This would also mean that sending image thumbnails from the stash to Lift Wing is out of the question.
I think that the logo detection service can be exposed through an internal endpoint, so it will be inside WMF’s infrastructure.
Moreover, when an image is sent to the upload stash, there’s a set of already implemented checks including existing duplicates and previously deleted duplicates.
In T362749#9789915, @Ladsgroup wrote:you can just send over the file to liftwing maybe? (we should consider alternative designs and so on).
See T363506: Pass image objects to the logo detection service.
May 10 2024
In T362749#9785333, @isarantopoulos wrote: @mfossati is there any other way to access the images in the upload stash other than using a cookie? Using a user cookie to access an API doesn't seem like the right way for a production application both from a design as well as a security point of view. An API key/token would seem more appropriate (if there is such an option available).
I agree and have dug deeper into the current request being made to the Upload API: maybe the CSRF token is what we're looking for. See upload_file_in_chunks in the example request code. I can confirm that the Upload Wizard is sending a token parameter in the request.
In T361049#9784859, @Etonkovidova wrote:(2) I have some problems testing these two AC:
- Pre-fill the title using file name if it matches the descriptive criteria, if not leave it blank
- Update the copy for the current error message for when the user has not entered a descriptive title as shown here.
In T361049#9784859, @Etonkovidova wrote:(1) the scope of re-designing Describe step presently doesn't include Additional information from the figma mockup
Chiming in: this will be done in T361061: [M] Update the 'other information' field in upload wizard.
May 9 2024
In T363506#9757394, @isarantopoulos wrote:@mfossati I am in favor of passing the image object in some serialized form.
We would need the upload wizard to send a resized image (224x224) instead of the whole file.
I've opened T364551: [SPIKE] Send an image thumbnail to the logo detection service within Upload Wizard to investigate the feasibility of this solution.
In T363506#9781491, @mfossati wrote:In T363506#9781301, @isarantopoulos wrote:@mfossati We noticed that the user can define the width in the url like in this example http://commons.wikimedia.org/w/index.php?title=Special:FilePath&file=Cambia_logo.png&width=224. If we can use this then it would be sufficient and we can stick with using urls in the request.
Hmm, I've just given it a try and I think it won't work for stashed images, which is a hard requirement for us.
@isarantopoulos @kevinbazira , I think I found how to get a thumbnail from a stashed image. There you go: https://commons.wikimedia.org/wiki/Special:UploadStash/thumb/1awuam969hko.2tkfbz.10893556.png/224px-1awuam969hko.2tkfbz.10893556.png, where 1awuam969hko.2tkfbz.10893556.png is the stash file key. The 224px- prefix is the width size.
Of course, I feel there's a caveat, as it seems that the thumbnail is generated on the fly at request time. Still not optimal, but sounds like a workable solution.
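The stash thumbnail URL pattern described above can be sketched as a small helper. The function name and the default width are illustrative; the URL layout is taken from the example in this comment, and the helper only builds the string without making any request.

```python
from urllib.parse import quote

STASH_THUMB_BASE = "https://commons.wikimedia.org/wiki/Special:UploadStash/thumb"


def stash_thumb_url(file_key: str, width: int = 224) -> str:
    """Build the thumbnail URL for a stashed file.

    Pattern: <base>/<file key>/<width>px-<file key>
    """
    if width <= 0:
        raise ValueError("width must be a positive number of pixels")
    key = quote(file_key, safe="")  # escape anything unsafe in a URL path segment
    return f"{STASH_THUMB_BASE}/{key}/{width}px-{key}"


url = stash_thumb_url("1awuam969hko.2tkfbz.10893556.png")
print(url)
```

Fetching that URL would still need whatever session or token mechanism the stash requires, which is the open question in T362749.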
In T363506#9757394, @isarantopoulos wrote:We would need the upload wizard to send a resized image (224x224) instead of the whole file.
I can imagine we can tackle that from within the Upload Wizard with some JavaScript library. I can create a ticket to look into that if you think this would be the best solution.
In T363506#9780991, @kevinbazira wrote:If one user sends a request with 50 image URLs and another sends a request with 50 serialized images objects, the latter is likely to exceed the server's request body size limit faster.
Thinking out loud: what about sending multiple requests if the limit is reached? I speculate that 50 uploads are an edge case: if this happens, we could dispatch different requests.
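A minimal sketch of the "dispatch different requests" idea: group serialized images into batches that each stay under a size budget. The function name and the `max_bytes` limit are assumptions; the actual request body limit of the server is not stated in this thread.

```python
from typing import Iterable, Iterator, List


def batch_by_size(items: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Yield lists of items whose combined UTF-8 size stays within max_bytes."""
    batch: List[str] = []
    size = 0
    for item in items:
        item_size = len(item.encode("utf-8"))
        if item_size > max_bytes:
            raise ValueError("a single item exceeds the size limit")
        # Start a new batch when adding this item would overflow the budget.
        if batch and size + item_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(item)
        size += item_size
    if batch:
        yield batch


batches = list(batch_by_size(["aaaa", "bbbb", "cc"], max_bytes=8))
```

Each batch would then go out as its own request. The sketch ignores the overhead of the JSON envelope, so a real budget should leave some headroom.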
May 8 2024
In T363506#9781301, @isarantopoulos wrote:@mfossati We noticed that the user can define the width in the url like in this example http://commons.wikimedia.org/w/index.php?title=Special:FilePath&file=Cambia_logo.png&width=224. If we can use this then it would be sufficient and we can stick with using urls in the request.
Hmm, I've just given it a try and I think it won't work for stashed images, which is a hard requirement for us.
@isarantopoulos , totally agree, makes a lot of sense.
May 7 2024
Fix deployed & pipeline resumed. Needs some monitoring.
May 6 2024
In T362749#9774553, @kevinbazira wrote:@achou pointed out that files might not be accessible since the upload stash docs state: files not be public, and only writable/accessible by the uploader.
@mfossati, if images uploaded to the stash are private to the user, how will the tool you build to do logo-detection be able to access these image URLs or serialized image objects and send them to the LiftWing API to get a prediction?
Great catch, I totally missed this!
I've just scratched the surface: it seems that the stash URL request should contain some logged-in user session ID to enable access, which is stored in a cookie. We'll have to dig into the Upload stash code base to fully understand the mechanism. For now I can see cookies like commonswikiSession that ring a bell.
That said, what if we pass image objects to the logo detection service instead (see T363506: Pass image objects to the logo detection service)? Would that not require a logged-in user? Definitely an open question.
May 2 2024
In T363506#9757394, @isarantopoulos wrote:We would need the upload wizard to send a resized image (224x224) instead of the whole file. Is that something you are already considering or think it would be easy to try?
We haven't thought of this yet, mainly because pre-processing logic on the model side already handles resizing. That said, I agree it'd be better to directly send the 224x224 image object.
Change deployed:
0: jdbc:hive2://analytics-hive.eqiad.wmnet:10> describe wmf_raw.mediawiki_pagelinks;
+-------------------------+-----------+-------------------------------------------------------------------------------------+
| col_name                | data_type | comment                                                                             |
+-------------------------+-----------+-------------------------------------------------------------------------------------+
| pl_from                 | bigint    | Key to the page_id of the page containing the link                                  |
| pl_from_namespace       | int       | MediaWiki version: ? 1.24 - page_namespace of the page containing the link          |
| pl_target_id            | bigint    | Foreign key to linktarget.                                                          |
| snapshot                | string    | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) |
| wiki_db                 | string    | The wiki_db project                                                                 |
|                         | NULL      | NULL                                                                                |
| # Partition Information | NULL      | NULL                                                                                |
| # col_name              | data_type | comment                                                                             |
|                         | NULL      | NULL                                                                                |
| snapshot                | string    | Versioning information to keep multiple datasets (YYYY-MM for regular labs imports) |
| wiki_db                 | string    | The wiki_db project                                                                 |
+-------------------------+-----------+-------------------------------------------------------------------------------------+
May 1 2024
@Sneha :
- spaces before a file extension should trigger https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-space, but I can't seem to hit that, e.g., pic .jpg still triggers https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-filename
- prefixes like 666px- trigger https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-thumbnail, e.g., 666px-pic
- prefixes like 666px- together with a .svg.png suffix trigger https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-SVG-thumbnail, e.g., 666px-pic.svg.png
We're asking the user to omit the file extension, so https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-space and https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-SVG-thumbnail are less likely.
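To make the expected mapping between title patterns and system messages easier to check, here is a rough sketch. The regular expressions are guesses based only on the examples in this comment; they are not the actual Titleblacklist entries on Commons, and the function name is made up.

```python
import re
from typing import Optional

# Order matters: the SVG thumbnail rule is a special case of the thumbnail rule.
RULES = [
    ("Titleblacklist-custom-SVG-thumbnail", re.compile(r"^\d+px-.*\.svg\.png$", re.IGNORECASE)),
    ("Titleblacklist-custom-thumbnail", re.compile(r"^\d+px-")),
    ("Titleblacklist-custom-space", re.compile(r"\s\.\w+$")),
]


def expected_message(title: str) -> Optional[str]:
    """Return the system message a title is expected to trigger, if any."""
    for name, pattern in RULES:
        if pattern.search(title):
            return name
    return None


for sample in ("pic .jpg", "666px-pic", "666px-pic.svg.png", "pic"):
    print(repr(sample), "->", expected_message(sample))
```

Note that this encodes the expected behaviour only: as reported above, `pic .jpg` currently triggers Titleblacklist-custom-filename on Commons rather than the space message.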
Apr 30 2024
In T361049#9757575, @Sneha wrote: @mfossati If I understood this correctly, it seems the dialog is currently pulling text from that first link, and any changes on that page would be reflected in our dialog?
Correct. It's an on-wiki system message.
Could we have a custom dialog with only example text that is not linked to any of these pages? It seems there are a lot of variations of these pages, so we can't confidently rely on one. We are not showing any links or additional text. We are only showing examples (which are unlikely to change).
Yes, we could, but those messages seem to come from the community (e.g., https://commons.wikimedia.org/w/index.php?title=MediaWiki:Titleblacklist-custom-filename&action=history), so I'd opt for keeping the process intact, i.e., propose the updates on wiki. @Sannita, what do you think?
- Update the copy on the "view example" dialog as shown here
@Sneha, I think we need a Commons admin to update https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-filename. The patch has a temporary workaround so that we can test it, but I suggest removing it before merging.
FYI the following custom messages also exist:
- https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-space
- https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-double-apostrophe
- https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-thumbnail
- https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist-custom-SVG-thumbnail
Apr 29 2024
Apr 26 2024
Back on it.
Closing, see T347569#9747385.
Checked raw counts of the last 5 snapshots:
snapshots = ('2024-03-18', '2024-03-25', '2024-04-01', '2024-04-08', '2024-04-15')
tables = (
    'image_suggestions_instanceof_cache',
    'image_suggestions_lead_image_data',
    'image_suggestions_search_index_delta',
    'image_suggestions_search_index_full',
    'image_suggestions_suggestions',
    'image_suggestions_title_cache',
    'image_suggestions_wikidata_data',
)
for s in snapshots:
    print(s)
    for t in tables:
        print(t)
        ddf = spark.read.table(f'analytics_platform_eng.{t}').where(f"snapshot='{s}'")
        print(ddf.count())
    print()

2024-03-18
image_suggestions_instanceof_cache
5405468
image_suggestions_lead_image_data
8046032
image_suggestions_search_index_delta
6984930
image_suggestions_search_index_full
74311759
image_suggestions_suggestions
369129698
image_suggestions_title_cache
5228917
image_suggestions_wikidata_data
104657563
Apr 25 2024
In T362749#9729294, @kevinbazira wrote:@mfossati, when a model-server is deployed within the WMF k8s infrastructure it has to be configured to enable it to access external resources like wikimedia, wikipedia, and wikidata (see details here). Is it possible for the Structured content team to provide sample URLs from the commons upload stash? This will enable us to configure the logo-detection model-server to access them from LiftWing. Thanks in advance.
Hey @kevinbazira, here's what a public stash URL would look like: https://commons.wikimedia.org/wiki/Special:UploadStash/file/1avpfxdmdb4c.deuia.10893556.png. The only variable would be the file key, i.e., 1avpfxdmdb4c.deuia.10893556.png.
Not 100% sure, but I guess that you can go for http://localhost:6500/wiki/Special:UploadStash/file/1avpfxdmdb4c.deuia.10893556.png, with commons.wikimedia.org as the host header.
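As a sketch of the guess above, this is how a request to the local endpoint with an overridden Host header could be prepared using only the standard library. The function name is made up, and whether port 6500 and this header are the right combination for Lift Wing is exactly the uncertain part; the example only builds the request and does not send it.

```python
from urllib.request import Request


def internal_stash_request(
    file_key: str,
    endpoint: str = "http://localhost:6500",  # local endpoint guessed in the comment above
    host: str = "commons.wikimedia.org",
) -> Request:
    """Prepare a GET request for a stashed file with an explicit Host header."""
    url = f"{endpoint}/wiki/Special:UploadStash/file/{file_key}"
    return Request(url, headers={"Host": host})


req = internal_stash_request("1avpfxdmdb4c.deuia.10893556.png")
print(req.full_url)
print(req.get_header("Host"))
```

Passing `req` to `urllib.request.urlopen` would send it; the privacy constraints on the upload stash discussed in T362749 would still apply.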
Apr 17 2024
Submitted a draft patch that needs extra pairs of eyes. Moving to code review.
Apr 16 2024
Migration will complete in roughly one week and old columns will be dropped in two weeks: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/Y4C7W4TEC7DXXTY3HKDBG7HB56QBRXPY/
If this all lands in April, wmf_raw.mediawiki_pagelinks/snapshot=2024-04 will contain the breaking changes.
Apr 10 2024
Published at https://commons.wikimedia.org/wiki/Commons:WMF_support_for_Commons/Upload_Wizard_Improvements/Logo_detection, closing. Thanks @Sannita for your work!