
Add a first-class representation of pages (title-associated information) in restbase
Closed, DeclinedPublic

Description

We have a need to track several bits of data per *title*, independent of revisions corresponding to this title:

  • Page deletions and other protection information. A page corresponding to a given title can be deleted many times, which a versioned page table can capture.
  • The title's rename history (so that we can reconstruct linear histories), possibly in the form of renamed_from and renamed_to fields.
  • The MediaWiki page_id and other bits from the page table. Note that a single page title can map to multiple page ids over time if the content was deleted & re-created or restored, or renamed to & from other titles. Again, a versioned page table can capture this with one row per version.
  • Efficient and (ideally) ordered listings of all titles (T89564). The order requirement is actually not so trivial to support for large data sets without more secondary index work, so maybe worth deferring?
  • Possibly, at a point in the (far) future, a synchronization point for atomic moves and renames.

We should start to support this properly in RESTBase. We should probably expose this information at /page/title/{title}. This will also give us a natural resource path for page-related events like page creation or deletion.

Logically we can then check whether a page is deleted on each revision access. This would bring the number of queries per revision request to three for old revisions (one additional check for revision deletion), and two for new revisions. In local testing, an extra revision metadata request currently roughly halves throughput, so there is a big advantage in retrieving all protection information in a single request.

We might be able to avoid the extra page metadata request by also storing page deletion information in a static column in each key_rev_value bucket. The static column is shared between all revisions of a given domain & title, so it only needs to be updated once on page deletion, for each content type. Static columns can be added & removed with a schema upgrade. In order to keep old revisions of a page hidden while allowing new content creation at the same title, we might need to store a timestamp or timeuuid indicating the time before which content should be suppressed. If we also denormalized revision deletion information per-row, then we could reliably load all suppression information in a single I/O.
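A minimal sketch of that single-read visibility check, assuming millisecond timestamps in place of Cassandra timeuuids and hypothetical field names (suppress_before for the static column, rev_deleted for the denormalized per-row flag):

```javascript
// Decide whether a revision row is visible, using only data returned by
// a single read: the static suppression timestamp shared by all revisions
// of the title, plus the denormalized per-row revision deletion flag.
function isVisible(row) {
  // Content at or before the last page deletion is hidden, even though
  // newer content may exist at the same title.
  if (row.suppress_before !== null && row.tid < row.suppress_before) {
    return false;
  }
  // Per-row revision deletion (e.g. suppressed individual revisions).
  return !row.rev_deleted;
}
```

With both pieces of information in the same row, no extra page metadata request is needed on the hot path.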

Strawman design (WIP)

Page table schema (yaml syntax):

attributes:
  title: string
  event: string
  good_after: timeuuid
  page_id: int
  # more attributes
  tid: timeuuid
index:
  - type: hash
    attribute: title
  - type: static
    attribute: good_after
  - type: range
    attribute: tid
    order: desc

Notes:

  • one row per change with tid; *all* changes (including deletions or renames) create new rows with new tids
  • event can be "creation", "move_to:<title>", "move_from:<title>", "deletion"; alternatively, we could consider separate rows or a map; a rename will insert two rows (one each for the source and destination titles) with the same tid.
  • The handling of deletions / restrictions likely isn't quite right yet; we should go through the restriction & query use cases to figure that out:
    • Page is deleted (currently: the last event is "deletion"): all revisions with this title should be hidden.
    • Page is not current: the last event is not "creation" or "move_from:*" (not nice to check!). Queries asking for the 'latest' revision using this title shouldn't return anything, but old revisions can still be retrieved (up until the next deletion entry in the page table, which matches good_after).
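The page-state rules in the notes above can be sketched as a small classifier (hypothetical helper; event values as in the strawman schema):

```javascript
// Derive the page state from the most recent page-table event for a title.
function pageState(latestEvent) {
  if (latestEvent === 'deletion') return 'deleted';        // hide all revisions
  if (latestEvent === 'creation' ||
      latestEvent.startsWith('move_from:')) {
    return 'current';                                      // title is live
  }
  // e.g. 'move_to:<title>': old revisions stay readable, but 'latest'
  // queries under this title should return nothing.
  return 'not_current';
}
```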

Event Timeline

GWicke raised the priority of this task to Needs Triage.
GWicke updated the task description.
GWicke added a subscriber: GWicke.
Restricted Application added a subscriber: Aklapper. · Jul 22 2015, 12:31 AM
GWicke renamed this task from "Add a first-class representation of title-based pages in restbase" to "Add a first-class representation of pages (title-associated information) in restbase". · Jul 22 2015, 12:33 AM
GWicke triaged this task as Medium priority.
GWicke updated the task description.
GWicke added a project: RESTBase.
GWicke set Security to None.
GWicke removed a subscriber: Aklapper.
GWicke updated the task description. · Jul 22 2015, 12:40 AM
GWicke updated the task description.
GWicke updated the task description. · Jul 22 2015, 10:54 PM
GWicke updated the task description. · Jul 22 2015, 10:57 PM
GWicke updated the task description. · Jul 22 2015, 11:51 PM
GWicke updated the task description. · Jul 24 2015, 11:16 PM
Pchelolo claimed this task. · Jul 29 2015, 5:55 PM

Yup, I think we should implement this ASAP. Most notably, MW bases article actions on page ids, so we should really have that stored. Some comments/suggestions follow.

We might be able to avoid the extra page metadata request by also storing page deletion information in a static column in each key_rev_value bucket.

I can see the value in that, but if we are to store various page info bits in its own bucket, then we'll have information fragmentation. Unless we replicate that in both the new page table and key_rev bucket, which would speed up access detection.

  • the title's rename history (so that we can reconstruct linear histories), possibly in the form of a renamed_from field

Good idea. Perhaps old_titles would be more appropriate, and keep it as a time-ordered array of older page names. For the first iteration it might be enough to simply query for a specific page id and get the listing of old titles.

GWicke updated the task description. · Jul 30 2015, 10:09 PM

We might be able to avoid the extra page metadata request by also storing page deletion information in a static column in each key_rev_value bucket.

I can see the value in that, but if we are to store various page info bits in its own bucket, then we'll have information fragmentation. Unless we replicate that in both the new page table and key_rev bucket, which would speed up access detection.

Yeah, the latter is what I was thinking about. The catch is that we'd need to update *each* key_rev_value instance on each page deletion, or at least only use the optimization for the key_rev_value instances that we do keep up to date (basically a whitelist). Default behavior would be to check with the page table for latest-revision requests. For old-revision requests, we could probably continue to check only the revision table by adding and updating a static column indicating 'page deleted' in there.

  • the title's rename history (so that we can reconstruct linear histories), possibly in the form of a renamed_from field

Good idea. Perhaps old_titles would be more appropriate, and keep it as a time-ordered array of older page names.

Maybe, yeah. It might just be harder to make sure that such an array remains consistent in case of rapid renames. Maybe we can have both, with the second being rebuilt asynchronously?

GWicke updated the task description. · Sep 1 2015, 8:17 PM
GWicke added a subscriber: aaron.
GWicke updated the task description. · Sep 1 2015, 8:27 PM
GWicke updated the task description.
GWicke added a comment (edited). · Sep 1 2015, 8:35 PM

@aaron made several good points about this in a recent discussion in the office. I added some in the summary (like the need to keep old, page-deleted revisions hidden). Another I remember is that it might be hard to reconstruct the historical rename history, as the information MediaWiki keeps about renames is minimal. The best information we have is probably in wikitext-formatted & localized comments in the log tables. Alternatively, we could consult archive.org.

Pchelolo added a comment (edited). · Sep 2 2015, 10:33 AM

Regarding the page deletion, the current plan is to store a static deletedTimestamp in all buckets, so that all revisions before the timestamp are hidden. However, this approach has some problems:

  1. The most critical issue is that, according to the CQL ALTER TABLE docs, adding a static column to an existing table is not allowed.

These additions to a table are not allowed:

  • Adding a column having the same name as an existing column
  • A static column
  2. On undeleting we need to revert the deletedTimestamp to its previous value, so a history of deletion timestamps needs to be stored, for example in a deletionHistory static set, or even only the previous value, since we don't need a full history.
  3. Currently the PHP hook sends the same 'edit' event for undeleting a page as for any other edit. We need a clear way to distinguish them, because undeleting is a complex operation which involves reverting deletedTimestamp fields across all buckets, and it shouldn't be done on every 'edit' event.
  4. Static columns in Cassandra are per-partition, but when a page is renamed it moves to a different partition, effectively nulling out deletedTimestamp, previousDeletedTimestamp and others. We need to keep that in mind while making decisions on page renaming.

Regarding the rename history, the current plan is to have both a renamed_from static field and an old_titles static set. Problems:

  1. The PHP hook sends 'delete' + 'edit' events on page rename; it should be changed to signal the rename in a special way. Currently we could do that by providing a special header with the old page title, but once change propagation is ready we will need a dedicated event.
  2. Same issue with per-partition statics: we would need to manually copy old_titles on rename, as the page definitely moves to a new partition.
  3. Reconstructing older rename history might be a problem.

Regarding the MediaWiki page_id and other bits from the page table: we already have all the bits in the table, but need to verify they are all set correctly.

Regarding the page deletion, the current plan is to store a static deletedTimestamp in all buckets, so that all revisions before the timestamp are hidden. However, this approach has some problems:

  1. The most critical issue is that, according to the CQL ALTER TABLE docs, adding a static column to an existing table is not allowed.

As a pleasant surprise, the docs seem to be wrong: https://gist.github.com/gwicke/35bbce4161852525859b

  2. On undeleting we need to revert the deletedTimestamp to its previous value, so a history of deletion timestamps needs to be stored, for example in a deletionHistory static set, or even only the previous value, since we don't need a full history.

We should be able to keep track of previous page deletion events in the new page table.

  3. Currently the PHP hook sends the same 'edit' event for undeleting a page as for any other edit. We need a clear way to distinguish them, because undeleting is a complex operation which involves reverting deletedTimestamp fields across all buckets, and it shouldn't be done on every 'edit' event.

Yeah, definitely. We should also make sure that the events in the new event bus expose the right information.

  4. Static columns in Cassandra are per-partition, but when a page is renamed it moves to a different partition, effectively nulling out deletedTimestamp, previousDeletedTimestamp and others. We need to keep that in mind while making decisions on page renaming.

Yeah, good point. Deleting renamed pages will require us to update the static column for the entire page history, across renames. Again, this shouldn't be too hard as long as we note rename events in the page table.

Regarding the rename history, the current plan is to have both a renamed_from static field and an old_titles static set. Problems:

  1. The PHP hook sends 'delete' + 'edit' events on page rename; it should be changed to signal the rename in a special way. Currently we could do that by providing a special header with the old page title, but once change propagation is ready we will need a dedicated event.

Yes, I think a rename event is non-controversial.

  2. Same issue with per-partition statics: we would need to manually copy old_titles on rename, as the page definitely moves to a new partition.

Yup. The alternative is not to bother with such a roll-up for now, and instead follow the renamed_from chain dynamically.

  3. Reconstructing older rename history might be a problem.

Regarding the MediaWiki page_id and other bits from the page table: we already have all the bits in the table, but need to verify they are all set correctly.

We have the page_id, but we don't have a clean history of the title -> page_id mapping. The only remains are hidden in a comment field in the log tables, which is localized & will require some parsing effort to extract usable information.

GWicke updated the task description. · Sep 8 2015, 1:42 AM
GWicke updated the task description. · Sep 8 2015, 1:57 AM

@Pchelolo, I started a strawman schema for the page table in the description. Especially the check for whether the page represents the "HEAD" of a page still looks a bit icky. Might be worth adding a boolean for that (is_current?).

I can't escape the feeling we are overcomplicating things here. The main confusion point seems to be the discrepancy in storing stuff - for MW the reference field is the page ID, while for RESTBase that's its title. Considering that (a) a page's ID never changes; and (b) renames, deletes and their reverts are exceptionally rare; it may be worth breaking the logic here and storing information based on the page ID. When a relevant property (visibility, title) changes, the row can be altered. Denormalising this and keeping a secondary title-to-pageID mapping can increase efficiency. That way, we can simply have:

page_id: int,
deleted: boolean,
current_title: string,
current_revid: int,
past_titles: set<string>
# here other attrs ...

The mapping table might be simple as well:

title: string,
page_id: int,
tid: timeuuid

When a page is renamed, the old name is added at the end of the past_titles set and current_title is changed accordingly. At the same time, a new record is created in the mapping table reflecting that change, which allows us to find a page's ID simply by looking at the latest record for a given page name.
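The rename flow described above could look roughly like this (an in-memory sketch; in Cassandra the page record and the title-to-pageID mapping would be separate tables and separate writes):

```javascript
// Hypothetical by-page_id storage: one mutable page record per page_id,
// plus an append-only title -> page_id mapping with a tid per entry.
const pages = new Map();   // page_id -> { deleted, current_title, past_titles }
const titleIndex = [];     // append-only { title, page_id, tid } entries

function createPage(pageId, title, tid) {
  pages.set(pageId, { deleted: false, current_title: title, past_titles: [] });
  titleIndex.push({ title, page_id: pageId, tid });
}

function renamePage(pageId, newTitle, tid) {
  const page = pages.get(pageId);
  page.past_titles.push(page.current_title); // keep a time-ordered history
  page.current_title = newTitle;
  titleIndex.push({ title: newTitle, page_id: pageId, tid });
}

function resolveTitle(title) {
  // The latest mapping entry for a title wins; old titles still resolve.
  for (let i = titleIndex.length - 1; i >= 0; i--) {
    if (titleIndex[i].title === title) return titleIndex[i].page_id;
  }
  return null;
}
```

Note the data dependency GWicke mentions below: every by-title request must first go through resolveTitle before the page record can be read.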

GWicke added a comment.EditedSep 8 2015, 2:33 PM

We have been weighing the advantages and disadvantages of by-id vs. by-title storage since starting work on RESTBase, so I don't think this is a case of confusion. It really is a conscious choice based on a good amount of reasoning. This reasoning should definitely be re-evaluated as we learn more, but we should do so thoroughly and without haste.

Since basically all interaction with our content is via titles, the most important disadvantage of by-id storage is probably the introduction of a data dependency in each request (resolve title to id, then request by id). This has significant costs for latency and throughput. There are other issues too, like much higher costs for resolving requests for items under their original name, which in turn is important for stable citations and avoiding race conditions in API interactions.

By-id storage avoids the need to walk rename chains during page deletion and history display. It also slightly simplifies the handling of renames by only requiring updates in one place, rather than two. All of these operations are relatively rare and not performance-sensitive.

In both solutions, a title will be home to several pages over time. Every delete, undelete or rename changes the page id associated with a title.

I've been prototyping this for quite a while. I have a working implementation of a page_data table as proposed by @GWicke, but there are a number of questions about what we want to achieve with it:

  1. Do we want to expose the full rename history as an ordered list in the /page/title/{title} endpoint? Reconstructing a history could be a very complex operation, requiring N reads from the page table, where N === number of renames, and I don't see much value in it. Another possibility is to expose only renamed_from and renamed_to fields; however, these could get cycled. Alternatively, we could expose a latest_page_name field: it would allow clients to go to the latest version really fast, but again, it's pretty hard to construct.
  2. We cannot detect an undelete event. The PHP hook just sends a plain 'edit', but that's not the same: a page can be undeleted, in which case we need to roll back the good_after field, or a new page can be created, in which case good_after should be left intact. Adding a custom header to mark an undelete isn't an option, as an attacker could just supply that header and see deleted content in the next request, so it looks like we can't support undelete until we have a change propagation system.

I've been prototyping this for quite a while. I have a working implementation of a page_data table as proposed by @GWicke, but there are a number of questions about what we want to achieve with it:

  1. Do we want to expose the full rename history as an ordered list in the /page/title/{title} endpoint? Reconstructing a history could be a very complex operation, requiring N reads from the page table, where N === number of renames, and I don't see much value in it. Another possibility is to expose only renamed_from and renamed_to fields; however, these could get cycled. Alternatively, we could expose a latest_page_name field: it would allow clients to go to the latest version really fast, but again, it's pretty hard to construct.

IMHO, this list should be pre-computed on the go as changes happen.

  2. We cannot detect an undelete event. The PHP hook just sends a plain 'edit', but that's not the same: a page can be undeleted, in which case we need to roll back the good_after field, or a new page can be created, in which case good_after should be left intact. Adding a custom header to mark an undelete isn't an option, as an attacker could just supply that header and see deleted content in the next request, so it looks like we can't support undelete until we have a change propagation system.

There are ArticleUndelete and ArticleRevisionUndeleted hooks which our update extension should listen to for this. These could then just send a no-cache request to /page/title/{title} and /page/revision/{revision} endpoints, which would pick up the new setting from the MW API without exposing RESTBase to any eventual vulnerabilities.

IMHO, this list should be pre-computed on the go as changes happen.

Sure, but what exactly are we going to expose: a whole list of renames, or just the current title?

  2. We cannot detect an undelete event. The PHP hook just sends a plain 'edit', but that's not the same: a page can be undeleted, in which case we need to roll back the good_after field, or a new page can be created, in which case good_after should be left intact. Adding a custom header to mark an undelete isn't an option, as an attacker could just supply that header and see deleted content in the next request, so it looks like we can't support undelete until we have a change propagation system.

There are ArticleUndelete and ArticleRevisionUndeleted hooks which our update extension should listen to for this. These could then just send a no-cache request to /page/title/{title} and /page/revision/{revision} endpoints, which would pick up the new setting from the MW API without exposing RESTBase to any eventual vulnerabilities.

We do listen for ArticleUndelete in the update hook, but sending it to RB just with a no-cache header doesn't let us distinguish an undelete from the creation of a new page with the same title. In the former case we need to undelete all previous revisions of the title, while in the latter case we don't. So we need some special marking for an undelete event, but it should be secure (verifiable that it came from the hook and not from an attacker). Another option is to iterate over previous revisions and check whether they got undeleted with the MW API, but that's a really inefficient thing to do on every request with a no-cache header.
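Assuming a trusted undelete signal exists, the good_after rollback itself is cheap: scan the page table for delete events and reset good_after to the tid of the delete before the one being undone. A sketch (hypothetical helper; rows newest-first, plain numbers standing in for timeuuids):

```javascript
// Given the page-table rows for one title (newest first), compute the
// value good_after should be rolled back to on undelete: the tid of the
// previous delete event, or null if there was none.
function rolledBackGoodAfter(pageRows) {
  const deletes = pageRows.filter(r => r.event === 'delete');
  // deletes[0] is the deletion being undone; deletes[1], if present,
  // is the one whose tid good_after should point at again.
  return deletes.length > 1 ? deletes[1].tid : null;
}
```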

For the history, we could eventually support paging through the history backwards. That paging interface should transparently follow renames. However, I don't think implementing this is particularly high priority. The revision information in RESTBase will remain incomplete pending reconstruction of historic renames, and the full linear history is available from the Action API. As long as we make sure that we can follow renames in the future, we should be all set.

Re undelete, I see two main options:

  1. restrict purges to local IPs and send an undelete header from our extension, and
  2. use the event bus to supply undelete events.

We might want to do 1) in any case, as the updates from 2) will probably be HTTP requests as well.

So, with the proposed page_data design, here's a list of actions that should be done to get it out:

  • Tracking renames: On article rename, the hook sends a delete event for the previous title and an edit event for the new title. The edit event contains an X-Restbase-ParentRevision header, so in RB we can check whether the title of the previous revision matches the new title and, if not, append two items to the page table: a rename_to event for the old title and a rename_from event for the new title. The delete event should be removed from the hook. Also, parsing of the X-Restbase-ParentRevision header should be done only if the request came from an internal IP, because otherwise an attacker could completely mess up the database. Additional requests:
    • +1 read for the parent revision
    • +2 writes to the page table (if the article was indeed renamed)
  • Tracking deletes: On page delete, RB is signalled on the /title/{title} endpoint with a no-cache header. Then, in case the MW API returns 404, we assume the article was deleted and add a page_deleted restriction to the revision. This should be changed to insert a delete event into the page table and update the good_after static field. Additional requests:
    • -1 write to update the revisions table and set page_deleted
    • +1 write to update the page table
  • Tracking undeletes: Currently, we have no way to signal an undelete event, so a custom header should be introduced, like x-restbase-mode: undelete. Parsing of this header should be restricted to local requests only, as again, an attacker could mess up the database by providing this header. When it is received, we should iterate through the page table until the previous delete event is found, insert an undelete event, and update good_after to the tid of the previous delete event. Additional requests:
    • +1 read to iterate the page table
    • +1 write to update good_after
  • Checking if the page is deleted/content access: Look up the latest entry in the page table for a given title, take the good_after tid and compare it with the revision tid or the current time. The delete check should be done for every bucket access, not only revision info. Additional requests:
    • +1 read to check the page table; could be done in parallel.
  • Checking for latest: Add a static title_latest boolean and update it on every event: on rename_to set it to false, on all others set it to true. The check could be done on the same requests as the delete check, so no additional requests.
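The delete check in the list above amounts to a single comparison once good_after is in hand (sketch, with plain numbers standing in for timeuuids):

```javascript
// goodAfter is the static column from the latest page-table row for the
// title (null if the page was never deleted). For 'latest' reads, pass
// the current time as revisionTid.
function revisionVisible(goodAfter, revisionTid) {
  // Revisions at or before the last deletion are hidden; anything newer
  // (i.e. re-created content) is visible.
  return goodAfter === null || revisionTid > goodAfter;
}
```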

Updated page table design

attributes:
  title: string
  event_type: string # create|rename_to|rename_from|delete|undelete
  event_data: string # for renames, this represents other article's data, for others: null
  good_after: timeuuid
  is_title_latest: boolean
  page_id: int # Not sure we actually need this here.
  tid: timeuuid
index:
  - type: hash
    attribute: title
  - type: static
    attribute: good_after
  - type: static
    attribute: is_title_latest
  - type: range
    attribute: tid
    order: desc

Problems to keep in mind

  • Currently we use the page_deleted revision restriction to track deletions. In the new model this becomes outdated, but going over all articles to build up the correct state of deleted pages took us about 2-3 weeks, so we don't really want to rebuild it; the old check logic should be left in place.
  • More importantly, we can't rely on any data being in the page table, because reconstructing the table for all articles throughout all history seems impossible from a performance/time perspective.

Some ideas on rename history

  • With the current schema, construction of the rename history list is extremely inefficient. To speed it up, we could have a rename_history static set that would be updated on every rename event. However, the sets of all affected titles would need to be updated, so on rename we would need to take all titles in the current rename_history list and append the new title to the rename_history set of every one of them. Renames don't happen too often, but this would take N writes, where N is the number of entries in the rename history.
  • Another problem is understanding where we are in the rename history. Suppose the page was renamed like 'A -> B -> A -> B -> ...'. When we take a historic revision and see this rename history list, how do we understand where we are now? Is this revision first in the history, or third? To solve this, there are two possible approaches:
    • Add a non-static rename_history_idx int field to the page table that represents where this entry is in the rename history list. It should only be updated when the entry is inserted, and when exposing the list it could be split in two: a rename_from list and a rename_to list.
    • Have non-static rename_from and rename_to set fields, copy them over on each event, and reconstruct the lists on rename.
  • Another possibility: assign an internal unique id to each page. As the MW page_id is not stable across renames/deletes/undeletes, we can create our own. Set it up on the create event and never change it on moves/deletes/undeletes. Then it's easy to create a secondary index with this internal id as a hash key, ordered by tid, which would contain an ordered list of all page renames, and by tid we would always know where we are in the rename history. This approach is good from a performance/coding perspective, but I really don't like introducing yet another id.
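For comparison, following the renamed_from chain dynamically could be sketched as below. The cycle guard stops at the first repeated title, which sidesteps the 'A -> B -> A' cycling problem at the cost of truncating such legitimate back-and-forth histories (lookupRenamedFrom is a hypothetical accessor over the page table):

```javascript
// Walk the renamed_from chain from the latest title back to the oldest.
// lookupRenamedFrom(title) returns the previous title, or null at the
// start of the chain.
function renameHistory(latestTitle, lookupRenamedFrom) {
  const history = [latestTitle];
  const seen = new Set([latestTitle]);
  let title = latestTitle;
  for (;;) {
    const prev = lookupRenamedFrom(title);
    if (prev === null || seen.has(prev)) break; // end of chain, or a cycle
    history.push(prev);
    seen.add(prev);
    title = prev;
  }
  return history; // newest to oldest
}
```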

Basically, the only major thing left is considerations on how to support rename_history reconstruction. @GWicke, @mobrovac, @Eevans ?

So, with the proposed page_data design, here's a list of actions that should be done to get it out:

  • Checking if the page is deleted/Content access: Look up the latest entry on the page_table for a given title, take good_after tid and compare it with a revision tid or current time. The delete check should be done for every bucket access, not only revision info. Additional requests:
    • +1 read to check the page table, could be done in parallel.

This would replace the existing revision table check on access to the latest revision. With a static column to track page deletion in high-traffic tables like HTML, we should be able to handle HTML reads with a single Cassandra query.

  • Checking for latest: Add a static title_latest boolean, and update it on every event.

I don't think this is needed just yet. The use case for this would be checking for edit conflicts, and those could either be handled via RAMP transactions, or a single CAS on a static column in the revision table.

On renamed_to set to false, on all others set to true. Check could be done on the same requests as delete check, so no additional requests.

Isn't this the same as checking for 'event' being 'renamed_to' ? Why do we need title_latest?

Updated page table design

attributes:
  title: string
  event_type: string # create|rename_to|rename_from|delete|undelete
  event_data: string # for renames, this represents other article's data, for others: null

^^ Much cleaner than before.

good_after: timeuuid
is_title_latest: boolean

^^ Is this really needed?

page_id: int # Not sure we actually need this here.
tid: timeuuid

index:
  - type: hash
    attribute: title
  - type: static
    attribute: good_after
  - type: static
    attribute: is_title_latest
  - type: range
    attribute: tid
    order: desc

Problems to keep in mind

  • Currently we use the page_deleted revision restriction to track deletions. In the new model this becomes outdated, but going over all articles to build up the correct state of deleted pages took us about 2-3 weeks, so we don't really want to rebuild it; the old check logic should be left in place.
  • More importantly, we can't rely on any data being in the page table, because reconstructing the table for all articles throughout all history seems impossible from a performance/time perspective.

It is not something we will do immediately, but we should structure things so that we have the option of filling in the history gradually.

Some ideas on rename history

  • With the current schema, construction of rename history list is extremely inefficient.

I don't think this is a given. Most pages have no or few renames over their entire history, and following zero to a handful of pointers isn't that expensive. Keep in mind that those renames are typically spread over a very long history, so somebody paging through the history is unlikely to encounter more than one rename per history page.

  • Another problem is understanding where we are in the rename history. Suppose the page was renamed like 'A -> B -> A -> B -> ...'. When we take a historic revision and see this rename history list, how do we understand where we are now? Is this revision first in the history, or third?

Each page table row has a tid, which makes it possible to select the page table entry corresponding to a revision using its tid. This relies on MediaWiki sending us re-render triggering events using the current page name, and the current tid being used for such re-renders. After a rename, this means that the revision will then be added under the new name, pointing to a new page table entry. To avoid race conditions with renames after such a re-render triggering event was emitted, we should make sure that a high-resolution timestamp (ideally, timeuuid) is included in rename events on the Event-Platform, and is then used for the re-render. This way, a re-render under the old name will still point to the correct tid range in the page table (assuming the re-render succeeds).
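Selecting the page-table entry that corresponds to a revision then reduces to finding the newest entry at or before the revision's tid (sketch; rows newest-first as the tid desc range index would return them, plain numbers standing in for timeuuids):

```javascript
// pageRows: page-table rows for one title, newest first.
// Returns the entry whose tid range covers the revision, or null if the
// revision predates the recorded page history.
function entryForRevision(pageRows, revisionTid) {
  for (const row of pageRows) {
    if (row.tid <= revisionTid) return row;
  }
  return null;
}
```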

  • Another possibility: assign an internal unique id to each page. As the MW page_id is not stable across renames/deletes/undeletes, we can create our own. Set it up on the create event and never change it on moves/deletes/undeletes. Then it's easy to create a secondary index with this internal id as a hash key, ordered by tid, which would contain an ordered list of all page renames, and by tid we would always know where we are in the rename history. This approach is good from a performance/coding perspective, but I really don't like introducing yet another id.

We could use this on the revision table as well, to flatten the history of a logical page across renames. The difficulty, however, will be in piecing together such a logical history across renames. For that, we'll need the ability to walk the history backwards in any case, so I think it makes sense to do that first and consider such a surrogate page id as an optimization later.

  • Checking for latest: Add a static title_latest boolean, and update it on every event.

I don't think this is needed just yet. The use case for this would be checking for edit conflicts, and those could either be handled via RAMP transactions, or a single CAS on a static column in the revision table.

Indeed it's not really needed. Will remove.

Additionally, I've been thinking about the create event and I can't find any use cases for it. If we generated an internal stable page_id, this would be the place to do it, but as we don't generate them (at least for now), there's no other use for this event type, so I think we could remove it.

Pchelolo added a comment (edited). · Oct 7 2015, 1:09 PM

I've been playing with a WIP in Vagrant, and with a little update to the hook it handles deletes/undeletes perfectly.

However, moves are still problematic. The implementation relies on a title mismatch between the current revision title and the x-restbase-parentrevision title; however, MW creates 2 revisions on article move: first it creates an empty page with the new title, and then it moves the content. The hook gets notified only about the second revision, and its parent revision points to the previous blank rev, so we can't detect the rename.
There are a couple of options to resolve this better:

  • Every moved page contains a comment like "Pchelolo moved page [[User:Pchelolo/Before Rename]] to [[User:Pchelolo/After Rename]]"; we could rely on that to detect renames. On the one hand, this doesn't look very safe, but on the other hand it would greatly simplify loading up/rebuilding the historic data: we would just need to load revisions without parsing the links table etc.
  • Somehow get notified about the first blank revision creation. I've tried several hooks which seemed useful, but none of them is called.

Once we get change propagation and an EventBus, the solution we develop now will become obsolete. What are your thoughts?

mobrovac moved this task from Backlog to Under discussion on the RESTBase board. · Oct 13 2015, 1:07 PM
mobrovac moved this task from Under discussion to In progress on the RESTBase board.

@Pchelolo, relying on the comment sounds brittle. I believe those comments are localized, so a pattern that works for English might not work in many other languages.

Does the blank revision actually exist beyond the rename? If so, could we fetch revision information for it?

Pchelolo added a comment (edited). · Nov 11 2015, 3:35 PM

@GWicke Actually, that comment is outdated; I've already resolved the issue by adding an x-restbase-parenttitle header to the hook notification.

GWicke added a comment (edited). · Nov 25 2015, 7:50 PM

I actually wonder if we should additionally store deletes & suppressions in a dedicated table. A table containing only this information would be very small & thus very likely in page cache, which in turn speeds up accesses a lot.

We could still complement this with the static column optimization discussed in the task description, but given the reduced complexity (especially in the update handling) I think we should look into the dedicated 'suppression' table first.

Edit: See also T120409: RESTBase should honor wiki-wide deletion/suppression of users.

I actually wonder if we should additionally store deletes & suppressions in a dedicated table. A table containing only this information would be very small & thus very likely in page cache, which in turn speeds up accesses a lot.

Given the assumption that page renames are rare, we could do so for those as well.

@mobrovac: Maybe for MediaWiki #redirects, yeah. We'll need to think about the semantics we are shooting for there, as a redirect is really a property of a revision. We might not want to redirect all title-associated properties based on a redirect configured in the last revision's page content.

GWicke moved this task from Backlog to designing on the Services board. · Jul 11 2017, 10:55 PM
GWicke edited projects, added Services (designing); removed Services.
Pchelolo closed this task as Declined. · Jul 17 2019, 3:07 AM