| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | awight | T144100 Pageview dumps incorrectly formatted, need to escape special characters |
| Resolved | | None | T156656 Review parent task for any potential pageview definition improvements |
Event Timeline
See parent task and see if there's anything to change on the pageview definition (but not fixing mediawiki's problem of returning 200s for malformed requests).
@Milimetric Would you mind pointing me to the definition this task will update? If there are formatting changes to how fields are delimited and escaped, we will need to find documentation to update, and write release notes for downstream consumers of the dump files.
Sorry to have missed this ping @awight, and thanks for the work! The pagecounts-raw data is the older stuff, where you updated the docs as mentioned in T144100#5053676. But the issue you're fixing will improve the pageviews data. That's documented here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews. I noticed the same edit you made on the pagecounts-raw's version of page_title is there on the pageviews schema detail, so I think the docs are up to date. Though it is confusing, ping me if you think I'm confused :)
I noticed the same edit you made on the pagecounts-raw's version of page_title is there on the pageviews schema detail
Glad to be confused in good company! It's just because the table of column definitions is transcluded from the "raw" page. I can't say whether that's a good idea or not, though...