
[Abstract Wikipedia data science] Move data storage to database which can be accessed from outside of Toolforge
Closed, Resolved, Public

Description

Using a database instead of CSV files helps to achieve several things:

  1. Remove any race conditions caused by multiple programs working with one file
  2. Creating the database in the Toolforge namespace, as explained here, might solve the problem of downloading the data for further analysis outside of Toolforge (or maybe we can use some bot/service for downloading it?)

Tasks
  • Look into file downloading from Toolforge
  • Design and create a custom database in the Toolforge space
  • Mirror all the existing csv usage with database usage
    • Meta table parser
    • Database fetcher
      • page_id + dbname is now the primary key
    • API fetcher
    • Comparison code for API and db
  • Switch to db as the main source
  • Remove csv code copies
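
A minimal sketch of what the "page_id + dbname is the primary key" item could look like. Python's built-in sqlite3 module is used here only as a local stand-in for the Toolforge database; the table name `scripts`, the `title` column and the `save_page` helper are hypothetical and not taken from the project code. The upsert syntax shown is SQLite's and would likely need adapting for another database engine.

```python
import sqlite3

# Assumption: sqlite3 stands in for the real Toolforge database.
# Only page_id and dbname come from the task; everything else is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE scripts (
        page_id INTEGER NOT NULL,
        dbname  TEXT    NOT NULL,
        title   TEXT,
        PRIMARY KEY (page_id, dbname)
    )
    """
)


def save_page(conn, page_id, dbname, title):
    """Insert a page, or update its title if (page_id, dbname) already exists."""
    conn.execute(
        """
        INSERT INTO scripts (page_id, dbname, title)
        VALUES (?, ?, ?)
        ON CONFLICT (page_id, dbname) DO UPDATE SET title = excluded.title
        """,
        (page_id, dbname, title),
    )
    conn.commit()


save_page(conn, 42, "enwiki", "Module:Example")
save_page(conn, 42, "dewiki", "Modul:Beispiel")  # same page_id, different wiki
save_page(conn, 42, "enwiki", "Module:Renamed")  # updates the existing row

rows = conn.execute(
    "SELECT page_id, dbname, title FROM scripts ORDER BY dbname"
).fetchall()
```

Because the database enforces the key, two fetchers writing the same page end up with one row instead of clobbering each other's lines in a shared csv.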

Event Timeline

For file downloading: scp seems to be working just fine, but I wasn't able to get ssh tunneling from here to work

Faulty Toolforge update today slows things down, sadly...

We could use dbname, but that wasn't saved by the content fetcher. When loading from the database I guess that won't matter, so yes, we can use dbname for sure.

I just have some prejudice against composite keys in databases (and that might matter, as this table will hold a lot of content), so I'll still try to avoid one somehow.

For the table that stores the url, I made dbname the key, as it's guaranteed to be unique
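
One possible way to avoid a composite primary key while still keeping (page_id, dbname) unique is a surrogate integer key plus a UNIQUE constraint. The sketch below assumes that approach; it is not necessarily what was implemented. sqlite3 again stands in for the real database, and the table names `wikis` and `pages` and the `id` column are hypothetical.

```python
import sqlite3

# Assumption: sqlite3 is a local stand-in; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    -- One row per wiki: dbname is unique, so it can be the key on its own.
    CREATE TABLE wikis (
        dbname TEXT PRIMARY KEY,
        url    TEXT NOT NULL
    );
    -- Surrogate key instead of a composite primary key;
    -- the UNIQUE constraint still rejects duplicate (page_id, dbname) pairs.
    CREATE TABLE pages (
        id      INTEGER PRIMARY KEY,
        page_id INTEGER NOT NULL,
        dbname  TEXT    NOT NULL REFERENCES wikis (dbname),
        UNIQUE (page_id, dbname)
    );
    """
)

conn.execute("INSERT INTO wikis VALUES ('enwiki', 'https://en.wikipedia.org')")
conn.execute("INSERT INTO pages (page_id, dbname) VALUES (42, 'enwiki')")

try:
    conn.execute("INSERT INTO pages (page_id, dbname) VALUES (42, 'enwiki')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

page_count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
```

The trade-off is an extra index for the UNIQUE constraint, so whether this is actually cheaper than a composite primary key would need to be measured on the real table.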

LostEnchanter updated the task description.