Fri, Jul 1
@EBernhardson re-imported the data, and now for ptwiki at least we have ~130k articles again. Still trying to see what went wrong
Thu, Jun 30
Here's what's involved in doing a Cassandra-based solution
One other option:
Mon, Jun 27
LGTM too, thanks everyone
I thought @hnowlan's patch meant that this was deployed, but it's still not working, so I guess not?
Thu, Jun 23
Tue, Jun 7
This can probably be closed now?
May 30 2022
May 25 2022
Ok looks like the data has imported correctly, hooray!
Done, processed 58495 files
May 24 2022
Ideal, except that we'd have to re-rewrite a bunch of code ...
I have no idea why that is ... we're just using df.write.saveAsTable(). Is there any config we can do to improve this?
May 23 2022
Waiting for the patch to be merged before closing this
yeah, basically it's one dataset - we didn't think of it that way at the start, but it turns out the data is the same shape for both so it's all in the same table
May 20 2022
Ok so now the data is being written to the tables image_suggestions_search_index_full and image_suggestions_search_index_delta in the Hive db analytics_platform_eng, partitioned by a snapshot column in yyyy-mm-dd format
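To make the full/delta split concrete, here's a minimal sketch of how a delta between two weekly snapshots could be derived. This is illustrative only: the key shape and function names are assumptions, not the real pipeline code; only the snapshot naming (yyyy-mm-dd) and the full/delta distinction come from the comment above.

```python
from datetime import date

def snapshot_name(d: date) -> str:
    """Format a snapshot partition value as yyyy-mm-dd."""
    return d.strftime("%Y-%m-%d")

def compute_delta(previous_full: set, current_full: set) -> set:
    """Rows that are new or changed since the previous full snapshot."""
    return current_full - previous_full

# Illustrative keys: (wiki, page_id, suggested image)
prev = {("enwiki", 123, "Example.jpg")}
curr = {("enwiki", 123, "Example.jpg"), ("enwiki", 456, "Other.jpg")}

print(snapshot_name(date(2022, 5, 20)))  # 2022-05-20
print(compute_delta(prev, curr))         # {('enwiki', 456, 'Other.jpg')}
```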
May 12 2022
Still, my personal feeling is that we should target the overall Commons search system effectiveness for users, rather than focusing on eventual recall changes due to the activation of a feature.
May 11 2022
Marco sampled the search terms from the logs using a mixture of popularity-based and random sampling, but looking at the sampled search terms for French, for example, very few of them match up with Wikidata labels and therefore won't have any synonyms. Seeing as the point of this exercise is to capture the effect of the synonyms patch, we've probably been barking up the wrong tree.
Update on this ticket - looking at the data I'm not sure that what we've gathered is capturing the effect of the synonyms patch, and I think we might need to curate it more carefully.
May 10 2022
One limitation of the current import scripts is that they expect everything to be sourced from partitioned Hive tables. Typically we partition by a date column set to the Airflow execution date. Would it take much to arrange these into a partitioned table? Saving to Hive partitions might also resolve the permissions issues by way of different defaults, although I'm not entirely sure.
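For reference, a partitioned Hive table can be produced from the existing saveAsTable() call by adding partitionBy() to the writer. This is a sketch under assumptions: it presumes a DataFrame `df` with a `snapshot` column, and the table name shown is just the one mentioned elsewhere in this thread.

```python
# Sketch only: assumes an existing SparkSession and a DataFrame `df`
# that already has a `snapshot` column (yyyy-mm-dd strings).
# partitionBy() before saveAsTable() writes Hive-style partitions,
# which the import scripts described above expect.
(
    df.write
      .mode("overwrite")
      .partitionBy("snapshot")
      .saveAsTable("analytics_platform_eng.image_suggestions_search_index_full")
)
```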
May 9 2022
@Cparle which confidence level are we using in the current iteration of the data pipeline?
May 4 2022
Ok to resolve this @CBogen ?
Confidence >= 90%
May 3 2022
Having spoken to @Eevans about this, I'm going to close this ticket. Because the script runs only once a week, there's no way to completely prevent out-of-date data from being served to users, and keeping the has-suggestion flags up to date in the wiki search indices should prevent the particular problem we were trying to fix with this ticket from ever reaching users
Apr 28 2022
Apr 27 2022
Apr 26 2022
Patch written https://gitlab.wikimedia.org/repos/generated-data-platform/datapipelines/-/tree/T299890-exclude-rejections but can't fully test it until the schema is deployed (see https://gitlab.wikimedia.org/repos/generated-data-platform/topics/image-suggestions-feedback/-/merge_requests/1)
Apr 25 2022
As an alternative to a public API, we could provide a flat file containing all image suggestions for any wiki quite easily (in .csv format or similar)
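A flat-file export like the one described could be as simple as dumping rows to CSV. The column names and values below are purely illustrative assumptions; the comment above only commits to "all image suggestions for a wiki" in CSV or similar.

```python
import csv

# Hypothetical flat-file export: all image suggestions for one wiki.
# Field names are illustrative, not the real schema.
suggestions = [
    {"wiki": "ptwiki", "page_id": 123, "image": "Example.jpg", "confidence": 90},
    {"wiki": "ptwiki", "page_id": 456, "image": "Other.jpg", "confidence": 80},
]

with open("image_suggestions_ptwiki.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["wiki", "page_id", "image", "confidence"]
    )
    writer.writeheader()
    writer.writerows(suggestions)
```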
Apr 21 2022
... aaand merged. Closing the ticket
Ok, ran the old IMA again with the data-loss part fixed (I think!), and now we're getting suggestions from it for 82399 articles, 81177 of which are also suggested by the new pipeline
Apr 20 2022
I think all remaining refactoring work has been done, so closing
Apr 19 2022
Ok this is pretty much done. We're writing to Hive instead of hdfs, to make it easier to export to Cassandra
Apr 11 2022
We've changed our approach to calculating confidence scores, and are now estimating them before storing image suggestions. This ticket is therefore no longer necessary for image suggestions, as we don't have another use case for getting images for a particular Q-id
Apr 8 2022
Apr 7 2022
Apr 6 2022
We have example data files on hdfs
Apr 4 2022
Mar 29 2022
Also ... I don't think the way the data is being stored allows for that anyway. We store the user who has rejected an image, not the tool they were using at the time, see P21420. Perhaps this is what the comment field is intended for? Not sure.
It would mean that, yes
Mar 21 2022
Moved this into blocked - it should be quite easy to do once we have T299789 done, so there's no point in wasting effort doing it before then
Mar 14 2022
In the source of truth for image suggestions (Cassandra, see T293808) we'll be storing the value of "instance of" (P31) for each article. This means we can exclude articles with instance of == Q5 (human)
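The exclusion itself is just a filter over the stored instance-of values. A minimal sketch, with illustrative data shapes (the real Cassandra rows will look different); Q5 is Wikidata's item for "human":

```python
# Each article carries its "instance of" (P31) values from the
# source of truth. Shape here is illustrative only.
articles = [
    {"page_id": 1, "instance_of": ["Q5"]},    # a person: exclude
    {"page_id": 2, "instance_of": ["Q515"]},  # a city (Q515): keep
]

# Drop any article that is an instance of Q5 (human).
non_human = [a for a in articles if "Q5" not in a["instance_of"]]
print([a["page_id"] for a in non_human])  # [2]
```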
We'll need at least a preliminary dataset to do this work
This ticket was a change to an interface that never made it to production and is no longer in development, so closing
Mar 8 2022
Mar 3 2022
Will there be another API with some business logic to complement the generic API?
Feb 23 2022
After running queries on the labeled data, it turns out the most reliable confidence score is simply based on the source of the match
Note that we currently have a notebook for gathering the data, but we don't yet have agreement on how to get the data into Cassandra
Feb 22 2022
Another source of ground truth might be images that were added then reverted within e.g. a day?
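That heuristic is straightforward to express: label an image addition as a negative example if it was reverted within some window. A sketch under assumptions (the event shape and the 24-hour default are illustrative; the comment above only proposes "e.g. a day"):

```python
from datetime import datetime, timedelta
from typing import Optional

def reverted_quickly(added_at: datetime,
                     reverted_at: Optional[datetime],
                     window: timedelta = timedelta(days=1)) -> bool:
    """True if the image addition was undone within the window."""
    return reverted_at is not None and reverted_at - added_at <= window

added = datetime(2022, 2, 14, 10, 0)
print(reverted_quickly(added, datetime(2022, 2, 14, 18, 0)))  # True
print(reverted_quickly(added, None))                          # False
```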
Feb 14 2022
Hmmm, ok, so you have no dump-and-reload mechanism? If not, we'll have to keep the data from the previous run in order to work out the __DELETE_GROUPING__ part
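The bookkeeping described here amounts to a set difference between runs: anything present last time but absent now needs a delete. A sketch with illustrative key shapes (the __DELETE_GROUPING__ naming follows the comment above; nothing else here is from the real code):

```python
# Without dump-and-reload, keep the previous run's keys so that rows
# which disappeared can be emitted as deletes (__DELETE_GROUPING__).
# Keys here are illustrative (wiki, page_id) pairs.
previous_run = {("enwiki", 123), ("enwiki", 456)}
current_run = {("enwiki", 123), ("enwiki", 789)}

to_delete = previous_run - current_run  # rows to tombstone
to_upsert = current_run                 # rows to (re)write

print(sorted(to_delete))  # [('enwiki', 456)]
```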
Feb 8 2022
Feb 7 2022
@Multichill is the bot just using wbsetclaim then a null edit? Are you getting this with any of your other bots?
Feb 4 2022
@Zbyszko this isn't API documentation as such, but it explains how MediaSearch works: https://www.mediawiki.org/wiki/Extension:WikibaseMediaInfo/MediaSearch

MediaSearch just uses the standard search API, with a media-specific profile loaded if you're searching in the File namespace. There are a couple of features that were developed specifically for media, though, namely haswbstatement and wbstatementquantity (in WikibaseCirrusSearch) and custommatch (in WikibaseMediaInfo)
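Since MediaSearch goes through the standard search API, a query exercising haswbstatement can be built like any other Action API search request. A sketch: the P180=Q146 statement ("depicts: house cat") is just an example value, and the snippet only constructs the URL rather than calling the API.

```python
from urllib.parse import urlencode

# Standard MediaWiki Action API search request using the
# haswbstatement keyword from WikibaseCirrusSearch.
params = urlencode({
    "action": "query",
    "list": "search",
    "srsearch": "haswbstatement:P180=Q146",  # depicts: house cat (example)
    "srnamespace": "6",                      # File namespace
    "format": "json",
})
url = "https://commons.wikimedia.org/w/api.php?" + params
print(url)
```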