
AQS 2.0: Extract production testing data for Druid-based endpoints
Closed, Resolved, Public

Description

The tests for the AQS 2.0 Druid-based endpoints will need data from our production data lake to execute against. See T316849: Audit tests for Druid-based endpoints for a list of tests and an initial analysis of the necessary data.

Write queries to extract this data in a format suitable for ingestion into the Druid testing environment described in T311190: Establish testing procedure for Druid-based endpoints. Document the resulting queries, including the procedure required to execute them and ingest the data into the testing environment.

This task does not cover publishing and making accessible a Docker image for the Druid testing environment containing the ingested data. That work should be covered in a separate task.

Completion criteria:

  • queries are written and successfully executed against production
  • testing data is ingested into the Druid testing environment (locally, not published anywhere)
  • all steps required to perform the above are documented

Event Timeline

The purpose of this comment is to document what I've learned so far about writing and executing Druid queries against WMF's Druid installation. The target audience is other AQS 2.0 developers who may (like me) be unfamiliar with Druid and how to query it. This is much more of a "just enough to get started" explanation than a full technical description.

First, it isn't strictly necessary to query Druid for our testing data. The data is also available in Hadoop/HDFS and can be queried via Hive/Spark/Hue (although Hue is being deprecated). However, I chose to query actual Druid by posting Druid native queries from the command line. There's a fair amount of discussion on T311190: Establish testing procedure for Druid-based endpoints about the various options, as well as numerous related pages on Wikitech. The rest of this comment focuses on the approach I chose.

Druid can accept queries via either:

  • Druid SQL, which looks a lot like the SQL I was already familiar with from relational databases like MySQL/MariaDB
  • Native queries, in which the query is described as a JSON object

The existing production AQS code uses native queries. I found them convenient and understandable (at least for my relatively simple use cases). The ability to save queries as .json files was nice when working from the command line, and they are also convenient to represent in code, either as embedded strings or as external asset files. It is also trivial to save the output to a file.
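
As an aside, native queries are also easy to post from code. Here's a minimal sketch of posting a query held as an embedded string, assuming the Python requests library and the same broker host/port as the curl examples below (an illustration only, not the production approach):

import json

import requests

# A trivial native query held as an embedded dict; it could just as easily
# be loaded from an external .json asset file. "timeBoundary" is used here
# only because it's about the simplest possible query type.
QUERY = {
    "queryType": "timeBoundary",
    "dataSource": "mediawiki_history_reduced_2022_08",
}

# Druid broker endpoint, same host/port as the curl examples below.
# Assumption: this runs somewhere that can reach the broker (e.g. a stat server).
DRUID_URL = "http://druid1004.eqiad.wmnet:8082/druid/v2/"

response = requests.post(DRUID_URL, json=QUERY, timeout=60)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))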

Manually executing native queries from the command line requires a developer account and production shell access. Anyone needing shell access should coordinate with their manager, and is welcome to use my request for shell access as an example (hopefully that's still the right procedure...)

Here are the steps I took to execute a Druid native query:

  • ssh stat1005.eqiad.wmnet (I assume other stat servers would work too, I just happened to use 1005)
  • kinit (I'm not absolutely sure this is needed for Druid; it may just be for Hadoop, but it doesn't hurt)
  • vim my-query.json, then write/paste your query and save the file (or use your favorite command line editor, or scp your locally-created .json query files to your home directory on the stat server)
  • curl -L -X POST "http://druid1004.eqiad.wmnet:8082/druid/v2/?pretty" -H "content-type: application/json" -d @my-query.json

That will display the output directly in your terminal, which may be fine for quick analysis, confirming your query works, etc. If you'd like to save the results, you can instead tell curl to write to a file, like this:

  • curl -L -X POST "http://druid1004.eqiad.wmnet:8082/druid/v2/?pretty" -H "content-type: application/json" -o bill-scan-three-output.json -d @bill-scan-three.json

Disclaimer: as with most command line operations, there are multiple ways to get similar results. I'm not claiming that what I documented above is ideal for every use case, just that it worked for mine. If you prefer a different approach that accomplishes your goals, go for it.

So, what do queries look like? Druid supports a number of native query types, including aggregation queries such as "topN" and raw-row queries such as "scan". I won't pretend to understand the subtleties of all of them, but it appears to me that for purposes of this task - extracting raw rows to ingest as testing data - the "scan" query type will be very useful.

Here are some example queries:

top_revisions_per_project_2022_07.json

{
  "queryType": "topN",
  "dataSource": "mediawiki_history_reduced_2022_08",
  "dimension": "project",
  "granularity": "month",
  "metric": "revisions_created",
  "threshold": 10,
  "filter": {
    "type": "and",
    "fields": [{
        "type": "selector",
        "dimension": "event_entity",
        "value": "revision"
      },
      {
        "type": "selector",
        "dimension": "event_type",
        "value": "create"
      }
    ]
  },
  "aggregations": [{
    "type": "longSum",
    "name": "revisions_created",
    "fieldName": "events"
  }],
  "intervals": "2022-07-01/2022-08-01"
}

Output from the above is:

[ {
  "timestamp" : "2022-07-01T00:00:00.000Z",
  "result" : [ {
    "revisions_created" : 23914911,
    "project" : "wikidata"
  }, {
    "revisions_created" : 9116716,
    "project" : "commons.wikimedia"
  }, {
    "revisions_created" : 5649871,
    "project" : "en.wikipedia"
  }, {
    "revisions_created" : 915244,
    "project" : "en.wiktionary"
  }, {
    "revisions_created" : 760184,
    "project" : "fr.wikipedia"
  }, {
    "revisions_created" : 739002,
    "project" : "de.wikipedia"
  }, {
    "revisions_created" : 736428,
    "project" : "ru.wikipedia"
  }, {
    "revisions_created" : 582137,
    "project" : "es.wikipedia"
  }, {
    "revisions_created" : 550489,
    "project" : "zh.wikipedia"
  }, {
    "revisions_created" : 480457,
    "project" : "ja.wikipedia"
  } ]
} ]

create-page events on en.wikipedia during July 1, 2022 for namespace 0:

"bill-scan-four.json" 
{
   "queryType": "scan",
   "dataSource": "mediawiki_history_reduced_2022_08",
   "resultFormat": "list",
   "columns":[],
  "intervals": "2022-07-01/2022-07-02",
   "batchSize":20480,
  "filter": {
    "type": "and",
    "fields": [{
        "type": "selector",
        "dimension": "page_namespace",
        "value": "0"
      },
      {
        "type": "selector",
        "dimension": "project",
        "value": "en.wikipedia"
      },
      {
        "type": "selector",
        "dimension": "event_type",
        "value": "create-page"
      }
    ]
  }
 }

The above found 1059 occurrences. One example:

{
    "__time" : 1656633600000,
    "project" : "en.wikipedia",
    "event_entity" : "page",
    "event_type" : "create-page",
    "user_text" : "Zyxw",
    "user_type" : "user",
    "page_title" : "TeamNL",
    "page_namespace" : "0",
    "page_type" : "content",
    "other_tags" : null,
    "revisions" : "0",
    "text_bytes_diff" : null,
    "text_bytes_diff_abs" : null,
    "text_bytes_diff_sum" : 0,
    "text_bytes_diff_abs_sum" : 0,
    "revisions_sum" : 0,
    "events" : 1
  }

Next up: write the actual queries needed to extract our testing data.

I ended up writing this as one query, as follows:

{
  "queryType": "scan",
  "dataSource": "mediawiki_history_reduced_2022_08",
  "resultFormat": "compactedList",
  "columns":[],
  "intervals": "2022-07-01/2022-07-02",
  "batchSize":20480,
  "filter": {
    "type": "or",
    "fields": [
      {
        "type": "and",
        "fields": [
          {
            "type": "in",
            "dimension": "page_title",
            "values": [
              "Jess_Wells",
              "Věra_Lišková",
              "Chipolin",
              "Karmarong_dialect",
              "Project:WikiProject_Bangladesh_Premier_League/Article_alerts/Archive",
              "Timeline_of_the_Joe_Biden_presidency_(2022_Q2)"
            ]
          },
          {
            "type": "selector",
            "dimension": "project",
            "value": "en.wikipedia"
          }
        ]
      },
      {
        "type": "and",
        "fields": [
          {
            "type": "in",
            "dimension": "page_title",
            "values": [
              "Julie_Lemieux",
              "36_Ursae_Majoris",
              "Manaus_slender-legged_tree_frog",
              "TANS_Perú_Flight_204",
              "Category:1925_comedy_movies"
            ]
          },
          {
            "type": "selector",
            "dimension": "project",
            "value": "simple.wikipedia"
          }
        ]
      },
      {
        "type": "and",
        "fields": [
          {
            "type": "in",
            "dimension": "user_text",
            "values": [
              "-Alabama-",
              "AAlertBot",
              "AnomieBOT"
            ]
          },
          {
            "type": "selector",
            "dimension": "project",
            "value": "en.wikipedia"
          }
        ]
      }
    ]
  }
}

I'll get all this into a proper repository in WMF GitLab once I finish up a few more details. But for now, here's the Python script I wrote to convert the Druid compactedList output JSON into a CSV for ingestion into the Druid testing environment:

https://gist.github.com/bpirkle/461e952ea39b2ef43cfd082d2b871ad9

Maybe there's a way I could have fed the Druid JSON response directly back in, but this approach wasn't too hard, the CSV is a little more compact, and (probably most important) it is easier to review the testing data outside Druid using tools like an everyday spreadsheet or text editor.
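
To show roughly what that conversion involves, here's a minimal sketch (the gist above is the real script). It assumes the scan query's compactedList output format, where each result batch carries a "columns" list plus an "events" list of rows, each row being a list of values in column order; the file names are just placeholders.

import csv
import json

# Placeholder file names; adjust to match whatever curl wrote and whatever
# the Druid test environment expects.
INPUT_JSON = "query-output-compacted-list.json"
OUTPUT_CSV = "aqs_druid_test_data.csv"

with open(INPUT_JSON) as f:
    batches = json.load(f)

with open(OUTPUT_CSV, "w", newline="") as f:
    writer = csv.writer(f)
    header_written = False
    for batch in batches:
        # Each compactedList batch has a "columns" list and an "events"
        # list of rows whose values are in that same column order.
        if not header_written:
            writer.writerow(batch["columns"])
            header_written = True
        for row in batch["events"]:
            writer.writerow(row)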

The Druid environment is currently at https://github.com/bpirkle/aqs-docker-druid-test-env . As noted above, I'll get it into a better repository. But I'll do that work under a different phab task.

Steps to query and ingest Druid data into a local env:

  1. ssh stat1005.eqiad.wmnet (presumably, this could be done from several servers. This is just the one I happened to use.)
  2. get the above query onto the stat server, by scp'ing it, running vim on the stat server and pasting it in, or whatever you prefer
  3. curl -L -X POST "http://druid1004.eqiad.wmnet:8082/druid/v2/?pretty" -H "content-type: application/json" -o query-output-compacted-list.json -d @test-data-query.json
  4. copy query-output-compacted-list.json to your local machine (again, however you prefer)
  5. python3 druid-to-csv.py (or however you prefer to run the script that converts the json to csv)
  6. clone the Druid environment repository
  7. copy in the .csv to that repository's directory under the name aqs_druid_test_data.csv
  8. docker build -t bpirkle/aqs-docker-druid-test-env . (this creates the Docker image with the data ingested)
  9. docker run -p 8081:8081 -p 8082:8082 -p 8888:8888 -it bpirkle/aqs-docker-druid-test-env (this actually runs the image in a container)
  10. hit http://localhost:8888/ in your browser and confirm data is present (or use the command-line check sketched just after this list). Note that it may take several minutes for Druid to fully launch. In my experience, the Druid interface came up almost immediately, but I had to wait a bit for the data to appear, even though it was already ingested in the build step.
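
If you'd rather confirm the data from the command line than the browser, a quick check like the following should also work. It assumes the container's coordinator is reachable on the 8081 port mapped in step 9, and it may return an empty list until Druid finishes loading:

import requests

# Ask the locally running Druid which datasources it has loaded.
# Assumption: the coordinator listens on localhost:8081, as mapped in step 9.
resp = requests.get(
    "http://localhost:8081/druid/coordinator/v1/datasources", timeout=10
)
resp.raise_for_status()
print(resp.json())  # expect a non-empty list once loading completes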

The above is still a little more fiddly and manual than I'd like. I'll try to clean it up when I create the real repository, but I'll do that under another task. The above is sufficient to get a working env with test data, and therefore meets the completion criteria of this task.

BPirkle moved this task from Backlog to QA/Review on the API Platform board.