Hello! I would like to start the discussion around creating a new database for a service.
The service is the [[https://github.com/geohci/similar-users | sockpuppet detection service]] (also known as the similar-users service). The Platform Engineering team are currently assisting with the migration of this service towards prod. It will be used infrequently to query information about similar users/activity on-wiki. Its data is currently stored in ~1.5GB of CSV-like files, which makes shipping the service a problem, and SQL storage is ideal for this kind of data. The service is expected to be lightly used (and only by people with CheckUser access).
One complicating factor that we need to iron out, and would appreciate guidance on, is that the data used by the service will be refreshed on a monthly basis. A PySpark model currently generates the CSV files, and the application needs to be restarted to reload them. Ideally the process that creates these files would instead update the database in place. Is this a feasible model? Currently all data is recreated rather than updated, both because it is stored in files and because it is based on the previous month's activity. Happy to open another ticket for this if the discussion needs more space.
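To make the question more concrete, here is a minimal sketch of one way a full monthly refresh could work without restarting the application: load the new data into a staging table, then swap it in for the live table. This is only an illustration under stated assumptions, not the planned implementation. It uses Python's built-in `sqlite3` as a stand-in for the production database, and the table name `similar_users`, the column names, and the helper `refresh_table` are all hypothetical.

```python
import sqlite3


def refresh_table(conn, table, columns, rows):
    """Replace the contents of `table` by loading a staging table and swapping it in.

    Hypothetical helper: names and schema are placeholders, and sqlite3
    stands in for the real database.
    """
    staging, old = f"{table}_staging", f"{table}_old"
    col_defs = ", ".join(f"{name} TEXT" for name in columns)
    marks = ", ".join("?" for _ in columns)

    # 1. Build the new month's data off to the side; readers still see the live table.
    conn.execute(f"DROP TABLE IF EXISTS {staging}")
    conn.execute(f"CREATE TABLE {staging} ({col_defs})")
    conn.execute("BEGIN")
    conn.executemany(f"INSERT INTO {staging} VALUES ({marks})", rows)
    conn.execute("COMMIT")

    # 2. Swap staging into place inside a single transaction.
    conn.execute("BEGIN")
    try:
        conn.execute(f"DROP TABLE IF EXISTS {old}")
        exists = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table,),
        ).fetchone()
        if exists:
            conn.execute(f"ALTER TABLE {table} RENAME TO {old}")
        conn.execute(f"ALTER TABLE {staging} RENAME TO {table}")
        conn.execute(f"DROP TABLE IF EXISTS {old}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None lets the sketch control transactions explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
cols = ["user_a", "user_b", "score"]
refresh_table(conn, "similar_users", cols, [("Alice", "Bob", "0.9")])
refresh_table(
    conn,
    "similar_users",
    cols,
    [("Carol", "Dave", "0.7"), ("Erin", "Frank", "0.4")],
)
rows = conn.execute("SELECT user_a FROM similar_users ORDER BY user_a").fetchall()
```

After the second call, only the second month's rows remain in `similar_users`. On MySQL/MariaDB the swap step would presumably be a single multi-table `RENAME TABLE` statement, which is atomic, but whether that approach suits the production setup is exactly the guidance being asked for here.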
QPS: Not clear at present, but likely fewer than a few hundred queries an hour.
DB Name: sockpuppet
Accessed from server(s): kubernetes*.<eqiad|codfw>.wmnet, kubestage*.<eqiad|codfw>.wmnet, analytics VLANs hosts to load data (or an intermediary host if the envisioned process asks for that)
Backup Policy: Probably not needed - data for the database is generated from PySpark models and can be regenerated
Grants needed: All access to the database from the application is read-only. However, the process that loads the data on a monthly basis (see above) will also need write access.