
T5. Determine database dump architecture and file format (spike)
Closed, Resolved · Public · 2 Story Points

Description

Notes from meeting with @MattFlaschen, @EBernhardson, and @matthiasmullie (originally at https://etherpad.wikimedia.org/p/flow-db-dumps)


Logs

Do we need this, or is it redundant to all history?

Not sure

This is more denormalized data that will exist within the full revision dump, but could still be useful to spell out explicitly.

All history

  • For this, we discussed having a list of revisions inside each object (this part needs a little more discussion).
  • Moderation

What about moderation and unmoderating topics?

Creates a revision of the topic title.

How do we represent topics that were part of a board at one time, but not later? E.g. moved between boards, or added to additional boards (topic in multiple boards simultaneously).

Haven't decided how to represent that in the db.

Board could just be a query

Or it could be a human-curated list of topics.

Since neither move nor "topic in multiple boards" is implemented yet, the "all history" version can include the end-state list of topics (which should also be the super-set). To see when they were created, deleted, etc., you just check the history of the topic title.

Attaching and detaching from a board (those two actions should be able to represent moving as well as simultaneous attachment) could be represented as part of the history of the topic title.

This should allow us to just add it to the <topic> part of the "all history" later when it's implemented.
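
As a rough illustration of why attach and detach alone may be enough, here is a minimal sketch that replays such events to recover a topic's boards at a given time. Nothing here is implemented in Flow: `BoardEvent`, `boards_at`, the integer timestamps, and the board ids are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoardEvent:
    # Hypothetical record; the real dump format is still undecided.
    timestamp: int   # any sortable timestamp
    action: str      # "attach" or "detach"
    board_id: str


def boards_at(events, when):
    """Return the set of boards the topic is attached to at time `when`."""
    boards = set()
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.timestamp > when:
            break
        if event.action == "attach":
            boards.add(event.board_id)
        elif event.action == "detach":
            boards.discard(event.board_id)
        else:
            raise ValueError(f"unknown action: {event.action!r}")
    return boards


history = [
    BoardEvent(1, "attach", "boardA"),
    # A move from boardA to boardB is a detach plus an attach.
    BoardEvent(5, "detach", "boardA"),
    BoardEvent(5, "attach", "boardB"),
    # Simultaneous attachment to a second board is just another attach.
    BoardEvent(9, "attach", "boardC"),
]
```

In this model an orphan topic is simply one whose set of boards is empty, which is why the orphan question below matters.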

Do we have to worry about orphan topics, or should we forbid that?

However, we definitely need to revisit this before adding support for either of those features.

Should the same mechanism be used for deleted topics, or should that just show the moderation on the topic title?


Current version

  • Don't need to represent full history of moderation, but should only include visible items and show current moderation/lock status.
  • Same format as the document topic XML
<board id="sasd..." title="...">

    <header id="sasc1234">

    <![CDATA[

    <p data-parsoid="..."></p>

    ]]>

    </header>

    <topic id="sasd3423" />

    <topic id="sasd3424" />

</board>

<topic id="sasd3423" dumpVersion="1">

    <title>...</title>

     <!-- summary also contains categories -->

    <summary id="sasd3239">

    <![CDATA[

    <p data-parsoid="..."></p>

    ]]>

    </summary>

    <post id="sasd3423" timestamp="..." user="..." lastEditUser="" lastModerationUser="">

    <!-- Should we inline this as XML, or as CDATA?  Parsoid guarantees HTML, but not any particular version, nor XML.  Maybe XHTML?  Or just CDATA. -->

    <![CDATA[

    <p data-parsoid="..."></p>

    ]]>

    <status isHidden="1" user="" reason="" />

    <children>

    <post id="...">...</post>

    </children>

    </post>

    <moderatedPost isSuppressed="1" moderatingUser="" title="Something" /> <!-- As long as we properly document all nodes, I'm happy with either (<moderatedPost/> or <post><moderation/></post>) -->

    Suppression shouldn't reveal anything at all, so it just vanishes from the dump.

    If you suppress a board, it should suppress all the topics, so they should also not be in dumps.

    If you suppress a topic (directly or indirectly), all posts should be suppressed.

    Need to look into what happens to post visibility if you replied to a moderated post before it was moderated (i.e. I reply, then the parent post is later hidden, deleted, or suppressed); we think it varies between types, but this needs to be checked.

</topic>

<moderatedTopic isDeleted="1" moderatingUser="" title="Topic title" />
Doing it this way lets topics be part of multiple boards.
data-parsoid and data-mw will probably be stripped (data-parsoid already has been). But they both might be useful to reusers. Add back in?
How to handle moderated posts? The idea is that the current version exposes all data that an anonymous user would see (for example, you can still see the title of a deleted post).
It's public who created a deleted topic, and posted to it, but not clear if it should be in the current version.

Yes.

moderatedPost/moderatedTopic vs. status/moderated sub-element. Not sure; there are advantages to both.

status might be better given that some moderations like lock might be fairly common.
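
To make the trade-off concrete, here is a small sketch of what a dump consumer would do under each option. It only assumes the attribute names used in the XML above (`isHidden`, `isDeleted`, `isSuppressed`, `moderatingUser`, `user`); the helper `moderation_state` and the sample user name are hypothetical, and neither format is final.

```python
import xml.etree.ElementTree as ET

# Flag attributes taken from the sketches above, mapped to a state name.
FLAGS = {"isHidden": "hidden", "isDeleted": "deleted", "isSuppressed": "suppressed"}


def moderation_state(node):
    """Return (state, moderating user) for either representation, or (None, None)."""
    if node.tag in ("moderatedPost", "moderatedTopic"):
        # Option 1: the element name itself marks the node as moderated.
        carrier, user_attr = node, "moderatingUser"
    else:
        # Option 2: an ordinary node with an optional <status> child.
        carrier, user_attr = node.find("status"), "user"
    if carrier is None:
        return None, None
    for attr, state in FLAGS.items():
        if carrier.get(attr) == "1":
            return state, carrier.get(user_attr, "")
    return None, None


option1 = ET.fromstring('<moderatedPost isSuppressed="1" moderatingUser="ExampleUser" title="Something" />')
option2 = ET.fromstring('<post id="p1"><status isHidden="1" user="ExampleUser" reason="" /></post>')
plain = ET.fromstring('<post id="p2" />')
```

With the sub-element, every post is read the same way and moderation is one optional lookup; with separate element names, consumers have to branch on the tag first.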

Event Timeline

Mattflaschen-WMF updated the task description.
Mattflaschen-WMF raised the priority of this task from to Normal.
Mattflaschen-WMF removed a project: Notice.
Mattflaschen-WMF set Security to None.
DannyH renamed this task from Determine database dump architecture and file format (spike) to T5. Determine database dump architecture and file format (spike). · Mar 25 2015, 7:33 PM
DannyH raised the priority of this task from Normal to High.

I invited @EBernhardson and @matthiasmullie to a meeting about this on Friday.

Restricted Application added a project: Collaboration-Team-Triage. · Apr 2 2015, 2:12 AM

I'm posting our notes from the meeting about this. We have a general architecture, but feedback would be welcome.

EBernhardson closed this task as Resolved. · Apr 6 2015, 5:37 PM

I see a split <board> & <topic> node, with references to topic nodes inside <board>.
Inside <topic>, however, all children are nested.

IMHO it makes more sense to either nest everything (<board><topic><post></post></topic></board>) or flatten everything (<board></board><topic></topic><post></post>).
I have a preference for nesting.
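
For comparison, here is a minimal sketch of turning the split layout into the fully nested one. The `<dump>` wrapper element, the ids, and the helper `nest_topics` are assumptions made for the example; only the `<board>`, `<topic>`, `<title>` and `<post>` element names come from the description.

```python
import xml.etree.ElementTree as ET

# Split layout: the board holds references, full topics live next to it.
FLAT = """
<dump>
  <board id="b1" title="Example board">
    <topic id="t1" />
    <topic id="t2" />
  </board>
  <topic id="t1"><title>First</title><post id="p1" /></topic>
  <topic id="t2"><title>Second</title><post id="p2" /></topic>
</dump>
"""


def nest_topics(root):
    """Replace each <topic id="..."/> reference in a board with the full topic."""
    full_topics = {t.get("id"): t for t in root.findall("topic")}
    nested = set()
    for board in root.findall("board"):
        for index, ref in enumerate(list(board)):
            topic_id = ref.get("id")
            if ref.tag == "topic" and topic_id in full_topics:
                board[index] = full_topics[topic_id]
                nested.add(topic_id)
    # Remove only the top-level copies that were actually nested.
    for topic_id in nested:
        root.remove(full_topics[topic_id])
    return root


nested_root = nest_topics(ET.fromstring(FLAT))
```

The extra lookup table is the cost of the split layout for a consumer; the nested layout needs no such resolution step.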

I think that may have been based on:

  1. The idea that topics might one day belong to multiple boards, and then you don't have to repeat it in each board.
  2. So the board XML file doesn't become insanely long.

At this point, we'll probably never do topics belonging to multiple boards, so #1 is moot. #2 is probably not really an issue with a streaming parser.
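
A minimal sketch of the streaming point, using Python's standard `xml.etree.ElementTree.iterparse` on a fully nested board. The sample document and the helper `count_posts_per_topic` are made up for the example; a real dump would be read from a file rather than an in-memory buffer.

```python
import io
import xml.etree.ElementTree as ET

NESTED = b"""<board id="b1" title="Example board">
  <topic id="t1"><post id="p1"><children><post id="p2" /></children></post></topic>
  <topic id="t2"><post id="p3" /></topic>
</board>"""


def count_posts_per_topic(stream):
    """Yield (topic id, number of posts) one topic at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "topic":
            yield elem.get("id"), sum(1 for _ in elem.iter("post"))
            # Drop the finished topic's subtree so memory use stays
            # proportional to one topic, not the whole board.
            elem.clear()


counts = list(count_posts_per_topic(io.BytesIO(NESTED)))
```

Each topic is handled as soon as its closing tag is seen, so the length of the board file does not determine how much has to be held in memory at once.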

DannyH removed a subscriber: DannyH. · Oct 6 2015, 12:18 AM