User Details
- User Since
- Aug 21 2017, 4:16 PM (291 w, 7 h)
- Availability
- Available
- IRC Nick
- cormacparle
- LDAP User
- Cparle
- MediaWiki User
- CParle (WMF) [ Global Accounts ]
Thu, Mar 16
We're adding a single (un-indexed) attribute named section_heading of type text, correct?
Correct - section_heading is the only thing clients need, so we're keeping it as simple as possible for now
Tue, Mar 14
Good catch @xcollazo - I edited the task description
Merged, closing. Follow-up work in T330848
Mon, Mar 13
I propose that we just add this to the image-suggestions DAG rather than it having its own DAG. @mfossati @xcollazo @matthiasmullie what do you think?
Ok, @mfossati and I agree that this is no longer blocked.
Fri, Mar 10
Mon, Mar 6
Fri, Mar 3
Feb 10 2023
Update on the suggestions part ... I have altered the image_suggestions_suggestions table in my own hive db (the hql query to do so is here) to hold section_heading, but I'm getting an error writing the altered table when I run the pipeline script on stat1007
Feb 8 2023
Already covered by T323505 (and done!)
Reopening as we want to split up T311814
Already implemented as part of T311829 (in progress)
As we have no API, we (SD) have no control over how the data is read, only how it is written
Feb 7 2023
Rating summary by wiki on Tues Feb 7
@Ankan_WMF there should be more suggestions available for bnwiki now
Feb 3 2023
Jan 30 2023
Did we decide definitively which fields need to be added to the data model? If not then we ought to asap ...
Jan 27 2023
The existing production workflow is probably not easily adapted to beta cluster (cc @Cparle -- does that sound correct?)
Jan 19 2023
Jan 16 2023
Let's begin by checking if this materially affects us, and if so create separate tickets for updating of our products that are affected
We're not inside MW though - we need to be able to do this from a Python script running on an Airflow machine
Jan 13 2023
We can figure out which templates are infoboxes for a particular wiki by extracting the must_not queries on template.keyword in the response to https://<wiki>.wikipedia.org/w/index.php?title=Special:Search&cirrusDumpQuery=&ns0=1&search=hasrecommendation%3Aimage+-hastemplatecollection%3Ainfobox
No, I don't think so. Our problem is we don't know which templates are in the template collection ... actually though I see now (looking at searchDebugUrls) that we can figure that out for any wiki by picking out the must_not bits from a query like https://cs.wikipedia.org/w/index.php?title=Speci%C3%A1ln%C3%AD:Hled%C3%A1n%C3%AD&cirrusDumpQuery=&ns0=1&search=hasrecommendation%3Aimage+-hastemplatecollection%3Ainfobox
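A minimal Python sketch of the "pick out the must_not bits" step described above. It assumes the cirrusDumpQuery response has already been fetched and parsed into Python objects; the exact nesting of that dump is not shown in this thread, so the code walks the whole structure rather than relying on a fixed path. The function names and `sample_dump` are hypothetical and for illustration only.

```python
from typing import Any, Iterator, List


def iter_must_not_clauses(node: Any) -> Iterator[Any]:
    """Yield every clause found under any `must_not` key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "must_not":
                # must_not may hold a single clause or a list of clauses
                yield from (value if isinstance(value, list) else [value])
            yield from iter_must_not_clauses(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_must_not_clauses(item)


def iter_strings(value: Any) -> Iterator[str]:
    """Yield template names from the common shapes a field value can take."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)
    elif isinstance(value, dict):
        # assumed shapes: {"query": "..."} (match) or {"value": "..."} (term)
        for key in ("query", "value"):
            if isinstance(value.get(key), str):
                yield value[key]


def iter_field_values(node: Any, field: str) -> Iterator[str]:
    """Yield values stored under `field` anywhere inside a clause."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == field:
                yield from iter_strings(value)
            else:
                yield from iter_field_values(value, field)
    elif isinstance(node, list):
        for item in node:
            yield from iter_field_values(item, field)


def infobox_templates(dump: Any) -> List[str]:
    """Return de-duplicated template.keyword values negated via must_not."""
    seen = dict.fromkeys(
        name
        for clause in iter_must_not_clauses(dump)
        for name in iter_field_values(clause, "template.keyword")
    )
    return list(seen)


# Hypothetical, hand-written fragment standing in for a real dump.
sample_dump = {
    "query": {"bool": {
        "must": [{"match": {"weighted_tags": {"query": "recommendation.image/exists"}}}],
        "must_not": [{"bool": {"should": [
            {"match": {"template.keyword": {"query": "Template:Infobox person"}}},
            {"match": {"template.keyword": {"query": "Template:Infobox settlement"}}},
        ]}}],
    }}
}

templates = infobox_templates(sample_dump)
```

Walking the whole structure is a deliberate choice here: it keeps the sketch usable even if the dump wraps the query in extra layers, at the cost of also picking up any unrelated must_not clause that happens to mention template.keyword.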
Jan 12 2023
Added numbers for intersections to the table above (https://phabricator.wikimedia.org/T315976#8456730) so I think this can be closed now @mfossati ?
Jan 10 2023
Jan 9 2023
Should this be in "code review" instead of blocked?
Dec 20 2022
No - I'm probably being over-cautious, we've never needed to go back and regenerate old data so far, and I can't think why we'd need to. Doing what everyone else is doing is fine with me
Ok, so I don't think there's a bug after all
It's probably worth keeping some of the data, just in case. The last 4 snapshots, perhaps? And maybe the first one from each month for the last 6 months - so a total of 10. If there's a need for other old data we can always regenerate it from the source data
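The retention rule suggested above (the last 4 snapshots, plus the first snapshot of each of the last 6 months) could be sketched in Python roughly as follows. This is an illustration only: `snapshots_to_keep` is a hypothetical helper, "last 6 months" is assumed to mean the current calendar month and the five before it, and the weekly snapshot dates are made up. Note that 10 is an upper bound, since a recent snapshot can also be the first of its month.

```python
from datetime import date, timedelta


def snapshots_to_keep(snapshots, today, recent=4, months=6):
    """Return the sorted snapshot dates that the retention rule would keep."""
    ordered = sorted(set(snapshots))

    # Rule 1: the most recent `recent` snapshots.
    keep = set(ordered[-recent:])

    # Rule 2: the first snapshot in each of the last `months` calendar months
    # (assumption: the current month counts as one of them).
    month_keys = set()
    year, month = today.year, today.month
    for _ in range(months):
        month_keys.add((year, month))
        year, month = (year, month - 1) if month > 1 else (year - 1, 12)

    first_of_month = {}
    for snap in ordered:
        key = (snap.year, snap.month)
        if key in month_keys and key not in first_of_month:
            first_of_month[key] = snap

    keep.update(first_of_month.values())
    return sorted(keep)


# Made-up weekly snapshots from 2022-06-06 through 2022-12-19.
weekly = [date(2022, 6, 6) + timedelta(days=7 * i) for i in range(29)]
kept = snapshots_to_keep(weekly, today=date(2022, 12, 20))
to_delete = sorted(set(weekly) - set(kept))
```

Everything in `to_delete` would be a candidate for removal, on the basis that older data can be regenerated from the source data if it is ever needed.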
Dec 19 2022
Code for the tool here https://gitlab.wikimedia.org/toolforge-repos/section-image-suggestions-test
Dec 14 2022
Note that the above notebooks have been combined in https://gitlab.wikimedia.org/cparle/notebooks/-/blob/main/section_image_suggestions_data.ipynb ... still needs to be productionized though
FYI research's code for section-alignment already generates a parquet with sections-with-images, so we can probably use this as an input
So ... can we count this as done?
Discussed with @AUgolnikova-WMF and we agreed that what we have already is adequate for this stage of the project, and we can revisit the community config after the MVP stage
Can we add lists to this too? Some sections are entirely enclosed with <ul></ul> tags
Yeah fair point, maybe we should bring @MunizaA 's code into our repo and call it from our DAG
Dec 13 2022
In October we sent ~18.5k notifications and 264 images were added as a result, so the impact/work ratio for notifications-for-experienced-users has so far been rather low. Are we sure about expending more effort on something that has made so little impact so far?
Dec 12 2022
Notebook on which this might be based can be found here https://gitlab.wikimedia.org/cparle/notebooks/-/blob/main/section_image_suggestions_data.ipynb
Dec 9 2022
Here's a sample of the data generated by T315976 (approx 2000 suggestions per wiki, 1000 generated via section topics and 1000 via section alignment)
@Eevans ... I won't close this if you're still working on it, but I might take it off our board if that's ok?
wiki | section-alignment suggestions | section-topics-plus-p18 suggestions | intersection |
enwiki | 248035 | 50337151 | 14536 |
ptwiki* | 148838 | 147934 | 584 |
idwiki | 75618 | 1677378 | 2103 |
ruwiki | 267413 | 11865098 | 7743 |
arwiki | 97886 | 3226347 | 2828 |
bnwiki | 28796 | 406662 | 213 |
eswiki | 215593 | 11747916 | 10621 |
cswiki | 124834 | 3901333 | 4644 |
frwiki | 259604 | 16446381 | 10244 |
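To put the intersection column in proportion, here is a small Python sketch that expresses each intersection as a share of the two suggestion sets. The counts are copied from the table above; the `overlap_shares` helper and the choice of denominators are assumptions made for illustration, not part of the original analysis.

```python
# (section-alignment, section-topics-plus-p18, intersection) per wiki,
# copied from the table above.
counts = {
    "enwiki": (248035, 50337151, 14536),
    "ptwiki": (148838, 147934, 584),
    "idwiki": (75618, 1677378, 2103),
    "ruwiki": (267413, 11865098, 7743),
    "arwiki": (97886, 3226347, 2828),
    "bnwiki": (28796, 406662, 213),
    "eswiki": (215593, 11747916, 10621),
    "cswiki": (124834, 3901333, 4644),
    "frwiki": (259604, 16446381, 10244),
}


def overlap_shares(alignment, topics, intersection):
    """Return the intersection as a fraction of each suggestion set."""
    return intersection / alignment, intersection / topics


shares = {wiki: overlap_shares(*row) for wiki, row in counts.items()}

for wiki, (of_alignment, of_topics) in shares.items():
    print(f"{wiki}: {of_alignment:.2%} of section-alignment, "
          f"{of_topics:.3%} of section-topics-plus-p18")
```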
Dec 8 2022
Sample dataset for enwiki
Dec 6 2022
Dec 5 2022
so ... can this be closed @aaron ?
@matthiasmullie who signs off on this? is it @Krinkle ?
Dec 2 2022
Nov 29 2022
All pages have schema.org information, and the following pages have additional schema.org information about some of their sections
Nov 28 2022
Yeah we should be able to suss it out from Hive I think ... I don't think we can truncate because Growth are using the data all the time
Nov 25 2022
Out of scope: Excluding suggestions based on custom community configuration (like excluding articles with certain categories or templates). Can be done on the frontend. It's unrelated to the API, specific to the Growth use case, and easy to implement within the frontend's search query construction logic.
As a follow-on from @kostajh 's comment above ... I think this data is just what we need - if a section should be excluded from getting link recommendations it's probably a safe bet to assume it shouldn't have images either. Perhaps we could grab the data from there via an api call for each relevant wiki at the start of the data pipeline? It'd be easy to grab the json from a call like this, and parse it to get the sections we want to exclude
OK all working now, hooray!
It's in /home/cparle on stat1008, also in hdfs:///user/cparle/all_page_with_suggestions_20221027.csv
Nov 23 2022
I assume that whatever process loaded data into the suggestions table also loaded data into instanceof_cache and title_cache. Is this assumption safe?
Nov 22 2022
New ticket for cleaning up the other tables T323561