Steps to reproduce
- Access Corriere Della Sera Archives
- Search for a term
- Click the Eye next to a result to view the newspaper
- Sometimes the paper loads without a problem; other times it partially loads and then the screen goes blank grey.
| Samwalton9-WMF | |
| Aug 28 2023, 6:07 PM |
Putting a brief pause on this until we get staging back up so I can test some config there.
Adding login flow to Corriere
In staging, navigate to the Access collection button for Corriere Della Sera Archives and update the URL parameter to point to the login URL. Set the landing page to our proxied URL.
<a href="https://wikipedialibrary.idm.oclc.org:9443/login?auth=staging&url=https://www.corriere.it/account/login?landing=https://wikipedialibrary.idm.oclc.org:9443/login?auth=staging&url=https://archivio.corriere.it/" class="btn btn-sm twl-btn access-apply-button" type="button" name="button" target="_blank" rel="noopener"> Access collection </a>
This takes you to a page that is proxied for the login flow. When I set the landing parameter to anything outside their domain (such as our proxy URL), it won't let me sign in: the button is disabled, which stops me from continuing through the flow.
If I set the landing parameter to their archive URL instead:
<a href="https://wikipedialibrary.idm.oclc.org:9443/login?auth=staging&url=https://www.corriere.it/account/login?landing=https://archivio.corriere.it/" class="btn btn-sm twl-btn access-apply-button" type="button" name="button" target="_blank" rel="noopener"> Access collection </a>
Then the flow continues to the site https://archivio.corriere.it/ without being proxied.
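One thing that may be worth ruling out is how the nested URLs parse, since none of the inner URLs above are percent-encoded. Below is a minimal sketch, using only Python's standard library, that compares what a standard query-string parser sees for the raw nesting against a layer-by-layer encoded version. The helper names are hypothetical, and whether EZproxy or the Corriere login page accepts encoded values is an assumption that would need testing in staging.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# URLs taken from the thread; the helper names below are hypothetical.
PROXY_LOGIN = "https://wikipedialibrary.idm.oclc.org:9443/login"
PARTNER_LOGIN = "https://www.corriere.it/account/login"
ARCHIVE = "https://archivio.corriere.it/"


def raw_nested_url() -> str:
    """Build the nested URL exactly as in the button href (no encoding)."""
    inner = f"{PROXY_LOGIN}?auth=staging&url={ARCHIVE}"
    partner = f"{PARTNER_LOGIN}?landing={inner}"
    return f"{PROXY_LOGIN}?auth=staging&url={partner}"


def encoded_nested_url() -> str:
    """Build the same nesting, percent-encoding each layer before embedding it."""
    inner = f"{PROXY_LOGIN}?{urlencode({'auth': 'staging', 'url': ARCHIVE})}"
    partner = f"{PARTNER_LOGIN}?{urlencode({'landing': inner})}"
    return f"{PROXY_LOGIN}?{urlencode({'auth': 'staging', 'url': partner})}"


def outer_params(url: str) -> dict:
    """Return the query parameters a standard parser sees on the outermost URL."""
    return parse_qs(urlsplit(url).query)


if __name__ == "__main__":
    # Raw nesting: the inner "&url=..." is read as a second outer "url" value.
    print(outer_params(raw_nested_url())["url"])
    # Encoded nesting: a single outer "url" value that still contains "landing".
    print(outer_params(encoded_nested_url())["url"])
```

With the raw form, a generic parser splits the inner `&url=` off as a second outer parameter, so the landing value never includes the archive URL. A proxy or login page may parse this differently from Python, so this only shows where ambiguity could arise, not what either service actually does.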
@Kgraessle
Can you please share a WIP github PR with the ezproxy config changes you were testing to get these results? It's okay if some parts are commented out (with notes) if you tried a few different things.
I would expect that JavaScript validation is what disables the button, which you would also need to capture with a rewrite rule to amend the allowed domains/hostnames. I'm surprised that the login URL is being used here instead of the proxied endpoint (e.g. https://archivio-corriere-it.wikipedialibrary.idm.oclc.org:9443).
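For reference, the proxied endpoints in this thread follow a dots-to-hyphens hostname pattern. A minimal sketch of that mapping is below, assuming the pattern holds for these two Corriere hosts; the function name is hypothetical, and origin hostnames that already contain hyphens are not handled because their treatment is not shown in the thread.

```python
from urllib.parse import urlsplit, urlunsplit

# Proxy host and port as they appear in the URLs in this thread.
PROXY_SUFFIX = "wikipedialibrary.idm.oclc.org:9443"


def proxied_url(url: str) -> str:
    """Rewrite an origin URL to the proxy-by-hostname form seen in this thread.

    Assumption: dots in the origin hostname become hyphens and the proxy
    suffix is appended. Query strings are passed through unchanged.
    """
    parts = urlsplit(url)
    host = parts.hostname.replace(".", "-")
    netloc = f"{host}.{PROXY_SUFFIX}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(proxied_url("https://archivio.corriere.it/"))
```

This could be handy for building a landing value that points at the proxied archive host rather than at the proxy login URL.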
Unfortunately, none of the config I updated was helpful. It's also hard to tell when the config goes out, so it's been tedious testing config without knowing whether it's actually deployed.
I don't really have anything substantial to commit as a result.
I can tell you that my initial thoughts are that the image is being corrupted when the content provider is sending the data to the ezproxy server:
Initially, I thought these images were just designed to fill half of the height, but looking more closely at one of the impacted images, I'm not so sure now:
I pivoted: what I outlined in this spike was an attempt to use the Corriere della Sera native login flow through the proxy, as we discussed in the engineering weekly, to see if that helps with the corrupted JPEGs.
I didn't want to update the database partner URL for a spike, so I hacked it together in dev tools, which taught me the following:
- The login URL domain is not the archive domain: https://www-corriere-it.wikipedialibrary.idm.oclc.org:9443/account/login?landing=https://archivio.corriere.it/Archivio/interface/landing.html
- I think the sign-in flow may not be working because of Google reCAPTCHA, which EZproxy does not support
- If we remove the landing parameter from the login URL, it sends us to https://www.corriere.it/ and not https://archivio.corriere.it/, because the login domain is not the archive domain
We have run into a number of smaller partners that mix content across individually authorized domains. It might make sense to concatenate the config for archivio.corriere.it and corriere.it into a single config stanza since they have the same authorization criteria.
- I'm not sure using the login flow will even fix the corrupted image issues
Agree; the idea behind going this direction is that users who aren't logged into an individual account may not always get the full image; that was only a hunch, though.
I saw this error when trying to login with the landing parameter to our proxy:
Some things I might try next:
- Asking Corriere Della Sera to disable reCAPTCHA for EZproxy users so we can test the login flow and see if that fixes the corrupt files.
100% agree, as this has resolved issues for us with other partners.
- Go back to investigating config when I can more easily test.
I'll send out a request to ezproxy to see if I can get all of us on the notification list for config changes. My initial idea to use one of our lists is flawed because of the (sometimes lengthy) delay it can add.
@sjvipin could you reach out on the partner side to request disabling of captchas for Corriere Della Sera and Corriere Della Sera Archives?
Yes! They do, and there's quite a lot of them. I tested this by checking the same images across different sessions and they are consistently breaking.
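If it helps when comparing sessions, a quick heuristic for spotting truncated downloads is to check the JPEG start and end markers on the saved bytes. This is only a sketch: it assumes the affected files are plain JPEGs saved from dev tools, the function names are hypothetical, and a file can pass this check and still be corrupted mid-stream.

```python
import hashlib

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def jpeg_status(data: bytes) -> str:
    """Classify raw bytes as 'not-jpeg', 'truncated', or 'complete'.

    Heuristic only: a complete JPEG starts with SOI and ends with EOI,
    ignoring trailing padding bytes that some servers append.
    """
    if not data.startswith(SOI):
        return "not-jpeg"
    if not data.rstrip(b"\x00\r\n ").endswith(EOI):
        return "truncated"
    return "complete"


def digest(data: bytes) -> str:
    """SHA-256 of the bytes, for comparing copies of the same image."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    sample = SOI + b"image-data" + EOI
    print(jpeg_status(sample), jpeg_status(sample[:-2]))
```

Comparing the digest of the same image fetched directly and through the proxy would show whether the bytes differ at all, which might help narrow down where the corruption happens.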
Here are a couple of articles I tested that consistently break, in case we want to send them a list; I know it won't be even close to comprehensive.
I also found a few articles that consistently worked:
We have run into a number of smaller partners that mix content across individually authorized domains. It might make sense to concatenate the config for archivio.corriere.it and corriere.it into a single config stanza since they have the same authorization criteria.
I agree with this as well. Since the login flow is shared across the two content providers, it would make sense to combine them if we do go down that route. Also, when you click on the navigation bar banner in the archive, it takes you back to the non-archive domain, which can be a really confusing user experience.
Thanks for pinging Vipin to get this rolling.
I'll send out a request to ezproxy to see if I can get all of us on the notification list for config changes. My initial idea to use one of our lists is flawed because of the (sometimes lengthy) delay it can add.
Thanks! I appreciate you!
@Kgraessle let's put a pause on this while @sjvipin talks with the partner about the options of whitelisting recaptcha or switching to voucher codes instead of proxy.
The partner has given us voucher codes that we can use instead. We will test this out in the coming week and share an update.
We got voucher codes for CdS, but not for archives. We never managed to fix the proxy implementation, so marking this declined for now. We'll revisit archives in the future when we speak to them.