Page MenuHomePhabricator

Load synthethic data into the growthbook_test database
Closed, ResolvedPublic

Description

I have created a postgresql_test database/user on the postgresql-growthbook database so that @mpopov can start testing Growthbook itself. However the generation of such synthetic data requires a python environment with psycopg2, sqlalchemy or ADBC (Arrow Database Connector), along with polars. This is probably best suited for a jupyter environment, which would require access to the PG cluster from outside the Kubernetes cluster itself.

For this to work, we'd need to make ingress work for our CloudnativePG clusters, which would be a nice feature to have! Let's see whether we can make it work in a non-encrypted fashion.

Links

Event Timeline

brouberol triaged this task as Medium priority.
mpopov awarded a token.
mpopov updated the task description. (Show Details)

Change #1203381 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] dse-k8s-worker: allow stat boxes to egress to the growthbook PG service

https://gerrit.wikimedia.org/r/1203381

Change #1203381 merged by Brouberol:

[operations/puppet@production] dse-k8s-worker: allow stat boxes to egress to the growthbook PG service

https://gerrit.wikimedia.org/r/1203381

I was able to grant egress access to the PG database from the stat boxes. To illustrate that point, I created a dummy table in the growthbook_test database

growthbook_test=# CREATE TABLE books (id int primary key, name text);
CREATE TABLE
growthbook_test=# INSERT INTO books VALUES (1, 'The Fellowship of the Ring');
INSERT 0 1
growthbook_test=# INSERT INTO books VALUES (2, 'The Two Towers');
INSERT 0 1
growthbook_test=# INSERT INTO books VALUES (3, 'The Return of the King');
INSERT 0 1
growthbook_test=# ALTER TABLE public.books OWNER TO growthbook_test;
ALTER TABLE

And with this, I was able to read from that table in a jupyter notebook:

Screenshot 2025-11-10 at 12.00.39.png (1×2 px, 294 KB)

@mpopov I've put the growtbook password in a file only readable by yourself, in your stat1008 home directory. Feel free to create a jupyter environment with all your required dependencies, to generate and load that data in the growthbook_test database.

Change #1203487 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] Define the synthetic data PG data source in configuration

https://gerrit.wikimedia.org/r/1203487

Change #1203487 merged by Brouberol:

[operations/deployment-charts@master] Define the synthetic data PG data source in configuration

https://gerrit.wikimedia.org/r/1203487

@brouberol: Ran into an interesting error:

ProgrammingError: (psycopg2.errors.InsufficientPrivilege) permission denied for schema public
LINE 2: CREATE TABLE games (

Perhaps related to https://stackoverflow.com/a/75876944

PostgreSQL 15 also revokes the CREATE permission from all users except a database owner from the public (or default) schema.

Here is the code I am trying to run:

import polars as pl
import getpass
import psycopg2
import sqlalchemy

pgpass = getpass.getpass()

connection_args = {
    "host": "10.67.28.76",
    "database": "growthbook_test",
    "user": "growthbook_test",
    "password": pgpass
}

engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://user:password@host/database",
    connect_args = connection_args
)

games_df = pl.DataFrame({
    "name": ["Dead Space", "Alan Wake 2", "SIGNALIS"],
    "year": [2023, 2023, 2022]
})

games_df.write_database(table_name = "games", connection=engine)

References

Also, when trying to edit the "Synthetic data" data source in GrowthBook I was prevented from doing so:

Cannot update. Data sources managed by config.yml

I was trying to add subject_id identifier and delete the prefilled user_id and anonymous_id identifiers.

I then tried to add growthbook_test as a data source via UI (called "Synthetic data 2") but got:

Cannot add. Data sources managed by config.yml

I think – for now – while we are still in exploring & learning mode, let's make it as easy as possible to make changes by switching over to UI-managed data sources. Once we've had our fun and go into production mode we can go back to config.yml-managed data sources.

Gotcha. Let me see how I can make UI datasource editing work. And I think we're missing GRANTs for the growthbook_test user.

Change #1205167 had a related patch set uploaded (by Bearloga; author: Bearloga):

[operations/deployment-charts@master] growthbook: remove data source

https://gerrit.wikimedia.org/r/1205167

Seems like this is not configurable: as soon as you start using config.yml to configure datasources, further UI changes are prevented (cf https://github.com/growthbook/growthbook/issues/268#issuecomment-1048415724). As such, I think I'll remove the datasource from configuration, and re-create it from the UI itself.

Just uploaded a patch that does that :)

Change #1205167 merged by Brouberol:

[operations/deployment-charts@master] growthbook: remove data source

https://gerrit.wikimedia.org/r/1205167

growthbook_test=# GRANT ALL PRIVILEGES ON SCHEMA public TO growthbook_test;
GRANT

Change #1205176 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] growthbook: temporarily avoid to mount the config.yml file in the backend pod

https://gerrit.wikimedia.org/r/1205176

Change #1205176 merged by Brouberol:

[operations/deployment-charts@master] growthbook: temporarily avoid to mount the config.yml file in the backend pod

https://gerrit.wikimedia.org/r/1205176

Simulated experiments now available at https://growthbook.wikimedia.org/experiments#all

Synthetic data is in growthbook_test.events

Notebook for simulating experiments and loading the data into the db: https://gitlab.wikimedia.org/bearloga/growthbook-synthetic-data/-/blob/main/T409591.ipynb

mpopov moved this task from Doing to Done on the Product-Analytics (Kanban) board.

Change #1210616 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] dse-k8s: delete the stat-> PG on k8s ingress firewall rule

https://gerrit.wikimedia.org/r/1210616

Change #1210616 merged by Brouberol:

[operations/puppet@production] dse-k8s: delete the stat-> PG on k8s ingress firewall rule

https://gerrit.wikimedia.org/r/1210616