
Encode sharing portion of reading lists in URL
Closed, Duplicate | Public

Description

Background

The Android team is working on allowing reading lists to be shared (see epic for details). To achieve this, the sharing portion needs to be encoded in a URL.

Must Haves (Technical Details)
User Stories
  • As a Wikipedia Android app user and student in Morocco, I want to export my reading lists, so that I can use them at the Mohammed V University school library.
  • As a Wikipedia Android app user in Ghana, I want to share my reading list with a family member in the US who has an iOS device, so they can read the articles I've saved about Accra ahead of their trip home in December.
  • As a Wikipedia Android app user and organizer in South Asia, I want to share a reading list via WhatsApp after an event, so that people who attended know which articles are in need of contributions.

Event Timeline

JTannerWMF triaged this task as Medium priority. Aug 31 2022, 9:56 PM
JTannerWMF created this task.
JTannerWMF added a subscriber: Dbrant.

@Dbrant to provide technical instructions in task

I'm coming from a position of not knowing anything at all about how much data you need to share, so I want to highlight some limitations to encoding things into the URL that might turn out to not be relevant.

It sounds like you want to share a blob of JSON which contains some amount of data about a list-of-pages. At a minimum this would presumably be an array of page IDs, but it might also be page titles and perhaps metadata (time-saved, notes, etc?). I don't have any idea whether you impose a limit on the allowed size of these shared lists -- if you don't, then even the most minimal representation is going to run into issues eventually.

The major issue is that browsers and web servers impose some limits on how long URLs can be. Internet Explorer is the worst, with a limit of around 2k characters -- but if we don't care about these links working in IE then we get a bit more room. I've seen various claims for Edge over the years; it certainly spent a few years hovering around 4k, but the limit might have been increased since then. Most of the other browsers are well over this, at least 64k but sometimes more.

The next big hurdle is: Wikipedia uses Apache as its application webserver (I believe), and Apache's default URL length limit is 8177 characters. You'd need to ask someone in ops whether we override the default LimitRequestLine, but if not then that's your big limit on shared list size. (You could get around this by putting the JSON in the fragment, since that's not sent to the server, but that would completely lock you in to writing a client-side JavaScript single-page app to display the list.)
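
The point about the fragment can be illustrated with a minimal Python sketch. It uses the standard library's urllib.parse.urlsplit to show which part of a share URL would end up on the HTTP request line (and so count against a server-side limit such as LimitRequestLine) and which part stays in the browser. The example URLs and the function name request_target are illustrative assumptions, not an agreed format.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path-plus-query part of a URL, i.e. what a browser sends to the server.

    The fragment (everything after '#') is deliberately left out, because
    browsers do not include it in the HTTP request.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


# Hypothetical share URLs carrying the same tiny payload in two places.
in_path = "https://en.wikipedia.org/wiki/Special:ReadingLists/%5B1%2C2%5D"
in_fragment = "https://en.wikipedia.org/wiki/Special:ReadingLists#%5B1%2C2%5D"

print(len(request_target(in_path)))      # grows with the payload
print(len(request_target(in_fragment)))  # independent of the payload size
```

With the payload in the path, the server-visible length grows with every page ID; with the payload in the fragment, it stays fixed, at the cost of needing client-side code to read it.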

It's worth considering, as well, that even a 2k URL is going to be really awkward to share. To give you a fun example, here's one:

https://en.wikipedia.org/wiki/Special:ReadingLists/%5B21515%2C94331%2C69335%2C674972%2C206737%2C884491%2C300813%2C362124%2C272648%2C224949%2C238552%2C779171%2C200052%2C624379%2C252515%2C172407%2C283423%2C947908%2C777525%2C458931%2C74569%2C574394%2C533559%2C529195%2C865742%2C114756%2C162305%2C271536%2C634429%2C678288%2C996240%2C184742%2C498344%2C582946%2C987392%2C63194%2C622200%2C948615%2C447227%2C798334%2C593714%2C839612%2C331280%2C751757%2C568939%2C746067%2C200724%2C108544%2C870940%2C840622%2C690792%2C541578%2C939551%2C659776%2C970525%2C979847%2C556856%2C62050%2C621661%2C528494%2C707802%2C283610%2C681668%2C306426%2C538813%2C976380%2C235514%2C986093%2C264232%2C644674%2C362565%2C487417%2C225310%2C735711%2C275397%2C416887%2C587379%2C710828%2C741107%2C257357%2C630676%2C681439%2C697655%2C479375%2C150662%2C233569%2C497875%2C13738%2C897969%2C238300%2C43985%2C532173%2C431351%2C311146%2C776900%2C465002%2C184505%2C181404%2C871139%2C870707%2C50199%2C496758%2C363752%2C745226%2C553319%2C974920%2C233824%2C712605%2C940234%2C323830%2C776088%2C580893%2C97738%2C322470%2C504062%2C319102%2C959337%2C286478%2C391398%2C884740%2C626991%2C658260%2C524782%2C397521%2C987469%2C33367%2C405076%2C713614%2C409130%2C432715%2C189114%2C179393%2C920242%2C482685%2C702507%2C821697%2C944284%2C676733%2C796333%2C4290%2C975863%2C907376%2C409025%2C83760%2C831720%2C535550%2C11340%2C914627%2C45644%2C840455%2C545314%2C60634%2C284955%2C943208%2C302%2C186930%2C949995%2C930635%2C521066%2C517938%2C321758%2C931607%2C839612%2C781622%2C423606%2C980545%2C206926%2C139483%2C227548%2C635255%2C936613%2C848372%2C210383%2C350587%2C726322%2C809474%2C193001%2C255466%2C840634%2C772308%2C367159%2C9735%2C16971%2C537971%2C325772%2C33666%2C251494%2C186943%2C942488%2C190412%2C522239%2C305821%2C967960%2C819362%2C412158%2C915180%2C755243%2C92081%2C321779%2C72835%2C23483%2C416996%2C687247%2C775414%2C83334%2C554142%2C997280%2C681758%2C858298%2C144129%2C999396%2C72633%2C817752%2C586008%2C2322%2C515019%2C517519%2C512585%2C937784%2C654264
%2C40486%2C755567%2C91663%2C35250%2C378976%2C356187%2C475082%2C988410%2C984649%2C959929%2C246823%2C988351%2C956545%2C398359%2C212727%5D

That's 2kb of JSON stuck on the end of a wiki URL, sharing 235 pageid-like numbers. 🤩
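
For anyone who wants to reproduce a URL of this shape, here is a small sketch in Python. It assumes the payload is a compact JSON array of page IDs, percent-encoded and appended to the Special:ReadingLists path exactly as in the example above; the function name build_share_url is made up for illustration.

```python
import json
from urllib.parse import quote

# Base path copied from the example URL above; treated here as an assumption.
BASE = "https://en.wikipedia.org/wiki/Special:ReadingLists/"


def build_share_url(page_ids):
    """Serialize page IDs as compact JSON and percent-encode the result into the path."""
    payload = json.dumps(page_ids, separators=(",", ":"))
    # safe="" forces '[', ']' and ',' to be escaped as %5B, %5D and %2C.
    return BASE + quote(payload, safe="")


url = build_share_url([21515, 94331, 69335])
print(url)
print(len(url))
```

Because each comma becomes %2C and each bracket becomes %5B or %5D, every separator costs three characters, which is a large part of why the URL balloons so quickly.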

Thanks, @DLynch
The idea we're exploring so far is to pass this encoded URL into our URL shortener (w.wiki), and the shortened URL will be what's shared.
We're aware that the URL shortener itself imposes a limit of ~1500 characters (smaller than all the other browser-side and Apache-side constraints you listed), and we're fine if it means that the maximum number of page IDs we can share is on the order of 100.

There is some further trickery we can explore to squeeze more data into the URL fragment, e.g. gzipped JSON + base64, packed binary structure + base64, etc. But in any case, a limit on the number of shareable articles is something we've accepted.
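
A rough Python sketch of the two options mentioned, for comparison purposes only. The assumptions are: zlib (DEFLATE with a zlib header) stands in for gzip to keep the output deterministic, page IDs fit in an unsigned 32-bit integer, and the URL-safe base64 alphabet is used with padding stripped. All function names are hypothetical.

```python
import base64
import json
import struct
import zlib


def b64url_encode(data: bytes) -> str:
    """URL-safe base64 without the trailing '=' padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Inverse of b64url_encode: restore the padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def encode_compressed_json(page_ids):
    """Option 1: compact JSON, compressed with zlib, then base64."""
    raw = json.dumps(page_ids, separators=(",", ":")).encode("utf-8")
    return b64url_encode(zlib.compress(raw, 9))


def decode_compressed_json(token):
    return json.loads(zlib.decompress(b64url_decode(token)).decode("utf-8"))


def encode_packed(page_ids):
    """Option 2: 4 bytes per ID (big-endian, assumes IDs < 2**32), then base64."""
    return b64url_encode(struct.pack(f">{len(page_ids)}I", *page_ids))


def decode_packed(token):
    data = b64url_decode(token)
    return list(struct.unpack(f">{len(data) // 4}I", data))


ids = [21515, 94331, 69335]
print(len(encode_compressed_json(ids)), len(encode_packed(ids)))
```

The packed form has a predictable cost (4 bytes per ID before base64), while the compressed-JSON form depends on how repetitive the data is, so it would be worth measuring both against real lists before choosing.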

The structure will basically be:

{
  "title": "My list",
  "description": "Description of list",
  "pages": [
    { "en": 123456 },
    { "en": 345678 },
    { "de": 789243 },
    { "ru": 912378 },
    ...
  ]
}

(Note that the page IDs can come from arbitrary language wikis, not just a single wiki.)
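
To make the shape of that payload concrete, here is a hedged Python sketch that round-trips a list with this structure through a URL fragment and checks it against the roughly 1500-character shortener limit mentioned above. The base URL, the constant SHORTENER_LIMIT, the zlib + base64 encoding, and all function names are assumptions for illustration; nothing here reflects a decided format.

```python
import base64
import json
import zlib

# Approximate limit quoted earlier in this thread; an assumption, not a verified value.
SHORTENER_LIMIT = 1500
# Hypothetical landing page that would read the payload from the fragment.
BASE_URL = "https://en.wikipedia.org/wiki/Special:ReadingLists"


def encode_list(reading_list: dict) -> str:
    """Compact JSON -> UTF-8 -> zlib -> URL-safe base64."""
    raw = json.dumps(reading_list, separators=(",", ":"), ensure_ascii=False)
    return base64.urlsafe_b64encode(zlib.compress(raw.encode("utf-8"), 9)).decode("ascii")


def decode_list(token: str) -> dict:
    """Inverse of encode_list."""
    return json.loads(zlib.decompress(base64.urlsafe_b64decode(token)).decode("utf-8"))


def share_url(reading_list: dict) -> str:
    """Put the encoded payload in the fragment so it is not sent to the server."""
    return BASE_URL + "#" + encode_list(reading_list)


def fits_shortener(reading_list: dict, limit: int = SHORTENER_LIMIT) -> bool:
    return len(share_url(reading_list)) <= limit


example = {
    "title": "My list",
    "description": "Description of list",
    "pages": [{"en": 123456}, {"en": 345678}, {"de": 789243}, {"ru": 912378}],
}

token = encode_list(example)
print(share_url(example))
print(decode_list(token) == example, fits_shortener(example))
```

Note that the title and description are free text, so they count against the same budget as the page IDs.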

I suppose it's a little bit silly to wind up basically using the URL shortener as blind JSON-blob storage rather than making something more focused for the task, but I can see how project constraints might lead you to it.

As I said in the meeting, this is not a good idea. As David said, the URL shortener is not JSON blob storage. You already have the instrumentation and the infrastructure: you can add a column to the table to make a reading list public and share its ID instead.

I can totally see how the idea of doing it this way raises an eyebrow, but like you say, this is a consequence of our project constraints.

The big constraint is that the reading lists in a user's app are not necessarily synced to the server (the user doesn't need to be logged in at all, and even if they are, syncing of reading lists is optional), so the reading lists effectively exist on the client only, and therefore the only way to "share" them is to encode them into a literal payload. We're certainly open to other ideas for how this could be done.

If we suppose that we start requiring the user to log in and sync their lists before sharing, this would certainly allow us to share the list by its id by marking the list as public. However, that would open an even larger can of worms with regard to moderation and patrolling of lists.

> However, that would open an even larger can of worms with regard to moderation and patrolling of lists.

I don't agree with this -- I think the moderation burden is the same either way. Both options involve a person visiting a Wikipedia-branded page and seeing (as I understand it) the user-provided list title and description. If someone shares something full of racial slurs or death threats, we're not going to get away with saying "technically that's not stored in our database". (The drawback here, of course, is that it's almost impossible for us to actually moderate things shared in the JSON-blob method...)

For that matter, even ignoring the title and description which we could run through content blacklists, I'm confident I could construct something horrifically offensive purely with a list of articles...

Hi @DLynch, thank you for your product-related concerns. We will take them into consideration when we discuss this with Adam and Josh on Tuesday. I hope you have confidence that we are evaluating the risks of the project, especially mitigating things like racial slurs, which can be quite triggering, as you can imagine. When we settle on an approach, I will be sure to ping you on the project page, since you are an eagerly interested party.

@Ladsgroup, happy to have an additional circle-up regarding the use of the URL shortener, since it seems the initial conversations we had with you and Ed got lost in translation and we should probably revisit what was discussed.