- anyone who wants to reviews the code and comments on it
- apply agreed changes to the code
|Resolved||None||T120037 Vital Signs: Please provide an "all languages" de-duplicated stream for the Community/Content groups of metrics|
|Resolved||None||T120036 Vital Signs: Please make the data for enwiki and other big wikis less sad, and not just be missing for most days|
|Open||None||T130256 Wikistats 2.0.|
|Duplicate||Milimetric||T141536 Compare early results of Wikistats 2.0 with Wikistats 1.0|
|Resolved||mforns||T143321 Create clean simplewiki output from edit history reconstruction|
|Resolved||mforns||T143322 Edit History: Review scala code functionality and make page and user output uniform|
I reviewed the whole code looking for the things I believe we should tackle/decide before the vetting of the data, and made this list. Please comment if you think they are not important or should be tackled later on.
- Both sides (user/page) of the code should output the errors or unexpected situations (or just statistics about them, such as counts). For example: parsing errors, events that cannot join any history chain, history chains that get cropped because of conflicting events, etc. This would help measure how much of the edit history we actually reconstruct and what differences to expect when comparing the data of Wikistats 1.0 vs 2.0.
- Decide whether to use normalized titles/usernames or original ones in the output. Currently user->original, page->normalized.
- Apply admin username historification on the page side.
- Decide when to flush name-conflicting states: only when the current event joins with some state (as on the page side), or always, regardless of whether the current event joins or not (as on the user side).
- Decide what default values to give to startTimestamp, registration/creation, causedByUserId, causedByUserName and causedByEventType when:
  - flushing a state because its creation/registration is later than the current point in time.
  - flushing a state because of a name conflict with an upcoming event.
- Confirm that we want to store deletion states (they kind of violate the unique title invariant, but since they have type = delete, they can be filtered out).
- Decide whether we want to use the code structure of the user side (with processingStatus) on the page side as well. This does not change the output, but may help apply all the other bullets in this list.
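To make the first bullet concrete, here is a minimal sketch of what the error/statistics output could look like. All names (`ErrorReportSketch`, `ReconstructionError`, `report`, the three error kinds) are hypothetical and chosen only for illustration; they are not taken from the existing code.

```scala
object ErrorReportSketch {
  // Hypothetical error kinds, mirroring the examples in the list above.
  sealed trait ReconstructionError { def kind: String }
  case class ParsingError(line: String) extends ReconstructionError { val kind = "parsing" }
  case class UnjoinedEvent(eventId: String) extends ReconstructionError { val kind = "unjoined" }
  case class CroppedChain(entityId: Long) extends ReconstructionError { val kind = "cropped" }

  // Returns one "kind: count" line per error kind, sorted alphabetically.
  // When verbose is true, the full errors are appended after the counts.
  def report(errors: Seq[ReconstructionError], verbose: Boolean): Seq[String] = {
    val counts = errors
      .groupBy(_.kind)
      .map { case (kind, es) => s"$kind: ${es.size}" }
      .toSeq
      .sorted
    if (verbose) counts ++ errors.map(_.toString) else counts
  }
}
```

Both sides could share one such report so that the user and page outputs stay comparable.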
Following up on the conversation we had on the previous comment, these are the actions to implement in this task:
- Outputting errors: by default, output error counts, but add a flag argument that, when set, writes the full errors to a file or some other location.
- Normalized vs. original titles/usernames: leave it as it is, because that is how MediaWiki stores the username and pageTitle.
- We should remove the causedByUserName field from both sides of the algorithm and populate it in the SQL query that writes to the denormalized table.
- Always flush conflicting states. Better incomplete than incorrect.
- When flushing, default to type="create", start=evt.ts, creation=evt.ts, causedByUserId=None, and add a new field that stores that some values are a guess.
- Leave delete states as they are.
- Do not change the structure of the pageHistoryBuilder; if needed, we can change it when more event types come.
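As a rough illustration of the flushing defaults agreed above, the sketch below applies them to a simplified page state. The case classes, the `flushConflicting` function and the `inferredFrom` field name are assumptions made for this example; the real state classes have more fields and may name them differently.

```scala
object FlushSketch {
  // Hypothetical, simplified event; timestamps are kept as plain strings here.
  case class PageEvent(
    pageId: Long,
    title: String,
    timestamp: String,
    eventType: String,
    causedByUserId: Option[Long]
  )

  // Hypothetical, simplified state. There is no causedByUserName field,
  // since that one is to be populated later in the SQL query.
  case class PageState(
    pageId: Long,
    title: String,
    startTimestamp: Option[String],
    creationTimestamp: Option[String],
    causedByEventType: String,
    causedByUserId: Option[Long],
    inferredFrom: Option[String] // assumed name for the new "values are a guess" field
  )

  // Flushes a state that conflicts by name with an upcoming event, using the
  // agreed defaults: type = "create", start = evt.ts, creation = evt.ts,
  // causedByUserId = None, and a marker that these values are a guess.
  def flushConflicting(state: PageState, evt: PageEvent): PageState =
    state.copy(
      startTimestamp = Some(evt.timestamp),
      creationTimestamp = Some(evt.timestamp),
      causedByEventType = "create",
      causedByUserId = None,
      inferredFrom = Some("name-conflict")
    )
}
```

Keeping the guess marker as an `Option` means rows with fully known values stay distinguishable from flushed ones when vetting the data.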