
Create sample manuscript(s) for integration testing
Closed, Resolved (Public)

There is now a first version of the output format; see the details below. Note that the format may change once we finalize the API/config setup for running the manuscript tool. Let us know if you have any issues with, or comments on, the format.

The documentation below can be found in the README file in the manuscript git repository (which will be made public later on).

Exporting scripts

Output is printed as JSON, with the following structure:

  • root
    • printed (timestamp)
    • stats (number of scripts, total number of sentences)
    • scripts
      • script #1
        • batch_metadata (metadata about the batch from which the script was generated: timestamp [when generated], number of sentences, options used for generation, etc.)
        • script_metadata (timestamp [when generated], number of sentences, options used for generation, etc.)
        • sentences
          • sentence
            • id (database id)
            • text (the sentence)
            • source (source reference, e.g., Wikipedia article)
      • script #2
        • batch_metadata (metadata about the batch from which the script was generated: timestamp [when generated], number of sentences, options used for generation, etc.)
        • script_metadata (timestamp [when generated], number of sentences, options used for generation, etc.)
        • sentences
          • sentence
            • id (database id)
            • text (the sentence)
            • source (source reference, e.g., Wikipedia article)
      • script #3 ...

The Go structs:

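// ScriptSet is the root object of the export.
type ScriptSet struct {
	Printed string           `json:"printed"` // timestamp
	Stats   map[string]int64 `json:"stats"`   // number of scripts, total number of sentences
	Scripts []Script         `json:"scripts"`
}

// Script is one generated script together with its metadata.
type Script struct {
	BatchMetadata  protocol.ScriptMetadata `json:"batch_metadata"`  // metadata about the batch the script was generated from
	ScriptMetadata protocol.ScriptMetadata `json:"script_metadata"` // metadata about the script itself
	Sentences      []text.Sentence         `json:"sentences"`
}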
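
For consumers who want to read the export without access to the manuscript repository, the following is a minimal sketch of decoding it in Go. The `ScriptMetadata` and `Sentence` types here are hypothetical stand-ins for `protocol.ScriptMetadata` and `text.Sentence`; their fields are guessed from the sample output and may not match the real definitions. `ParseScriptSet` and the shortened `sample` payload (one sentence reused from the sample, counts reduced to match) are likewise illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ScriptMetadata is a hypothetical stand-in for protocol.ScriptMetadata.
// Only fields visible in the sample output are included.
type ScriptMetadata struct {
	Options    map[string]interface{} `json:"options"`
	InputSize  int64                  `json:"input_size"`
	OutputSize int64                  `json:"output_size"`
	Timestamp  string                 `json:"timestamp"`
}

// Sentence is a hypothetical stand-in for text.Sentence.
type Sentence struct {
	ID     int64  `json:"id"`     // database id
	Text   string `json:"text"`   // the sentence
	Source string `json:"source"` // source reference
}

// Script and ScriptSet mirror the structs above, using the stand-in types.
type Script struct {
	BatchMetadata  ScriptMetadata `json:"batch_metadata"`
	ScriptMetadata ScriptMetadata `json:"script_metadata"`
	Sentences      []Sentence     `json:"sentences"`
}

type ScriptSet struct {
	Printed string           `json:"printed"`
	Stats   map[string]int64 `json:"stats"`
	Scripts []Script         `json:"scripts"`
}

// ParseScriptSet decodes an exported JSON document.
// Fields that are absent in the input keep their zero values.
func ParseScriptSet(data []byte) (ScriptSet, error) {
	var set ScriptSet
	err := json.Unmarshal(data, &set)
	return set, err
}

// sample is a shortened, illustrative payload in the shape of the export.
const sample = `{
 "printed": "2021-03-18 18:34:35",
 "stats": {"scripts": 1, "size": 1},
 "scripts": [
  {
   "script_metadata": {
    "options": {"script_name": "test_script_2"},
    "input_size": 40000,
    "output_size": 1,
    "timestamp": "2021-03-18 17:52:29"
   },
   "sentences": [
    {
     "id": 3869,
     "text": "Alnön är känd för sin jakt på vilt då det är gott om vilda djur.",
     "source": "https://sv.wikipedia.org/wiki?curid=112"
    }
   ]
  }
 ]
}`

func main() {
	set, err := ParseScriptSet([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, s := range set.Scripts {
		name := s.ScriptMetadata.Options["script_name"]
		fmt.Printf("%v: %d sentence(s)\n", name, len(s.Sentences))
	}
}
```

Since the format may still change, decoding into narrow local types like these keeps a consumer tolerant of added fields: `encoding/json` ignores keys that have no matching struct field.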

Sample output (edited for readability):

{
 "printed": "2021-03-18 18:34:35",
 "stats": {
  "scripts": 2,
  "size": 60
 },
 "scripts": [
  {
   "script_metadata": {
    // ... omitted metadata
    "options": {
     "script_name": "test_script_1"
    },
    "input_size": 40000,
    "output_size": 20,
    "timestamp": "2021-03-18 17:51:17"
   },
   "sentences": [
    {
     "id": 50809,
     "text": "I Sverige fick organisationen negativ uppmärksamhet efter en reklamfilm innehållande det svenska kungaparet utnyttjades utan dess tillåtelse samt att filmen innehöll arrangerad och felaktig fakta.",
     "source": "https://sv.wikipedia.org/wiki?curid=1668"
    },
    // ... omitted sentences
    {
     "id": 16481,
     "text": "Precis som i längdhopp och i tresteg har varje kastare i mästerskap tre kast på sig och det är det längsta kastet som räknas.",
     "source": "https://sv.wikipedia.org/wiki?curid=576"
    }
   ]
  },
  {
   "script_metadata": {
    // ... omitted metadata
    "options": {
     "script_name": "test_script_2"
    },
    "input_size": 40000,
    "output_size": 40,
    "timestamp": "2021-03-18 17:52:29"
   },
   "sentences": [
    {
     "id": 3869,
     "text": "Alnön är känd för sin jakt på vilt då det är gott om vilda djur.",
     "source": "https://sv.wikipedia.org/wiki?curid=112"
    },
    // ... omitted sentences
    {
     "id": 5304,
     "text": "\"Men du tog ju ett glas vin till maten.\"",
     "source": "https://sv.wikipedia.org/wiki?curid=187"
    }
   ]
  }
 ]
}
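
For integration tests, it may be useful to sanity-check an export before using it. The sketch below assumes two things that hold in the sample above but are not stated guarantees of the format: that `stats.size` equals the sum of each script's `output_size`, and that timestamps follow the `2021-03-18 18:34:35` layout. The names `exportSummary`, `checkExport` and `timestampLayout` are hypothetical, and the input is a metadata-only copy of the sample (real exports must be plain JSON, without the `//` comments used above for readability).

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// timestampLayout matches timestamps as they appear in the sample output.
const timestampLayout = "2006-01-02 15:04:05"

// exportSummary is a hypothetical, narrowed view of the export that
// only picks out the fields needed for the checks below.
type exportSummary struct {
	Printed string           `json:"printed"`
	Stats   map[string]int64 `json:"stats"`
	Scripts []struct {
		ScriptMetadata struct {
			OutputSize int64  `json:"output_size"`
			Timestamp  string `json:"timestamp"`
		} `json:"script_metadata"`
	} `json:"scripts"`
}

// checkExport returns an error if the timestamps do not parse or if the
// stats disagree with the scripts actually present in the document.
func checkExport(data []byte) error {
	var sum exportSummary
	if err := json.Unmarshal(data, &sum); err != nil {
		return fmt.Errorf("decode: %w", err)
	}
	if _, err := time.Parse(timestampLayout, sum.Printed); err != nil {
		return fmt.Errorf("printed timestamp: %w", err)
	}
	if got, want := int64(len(sum.Scripts)), sum.Stats["scripts"]; got != want {
		return fmt.Errorf("stats.scripts is %d, found %d scripts", want, got)
	}
	var total int64
	for i, s := range sum.Scripts {
		if _, err := time.Parse(timestampLayout, s.ScriptMetadata.Timestamp); err != nil {
			return fmt.Errorf("script %d timestamp: %w", i, err)
		}
		total += s.ScriptMetadata.OutputSize
	}
	if want := sum.Stats["size"]; total != want {
		return fmt.Errorf("stats.size is %d, output sizes sum to %d", want, total)
	}
	return nil
}

func main() {
	// Metadata-only copy of the sample output (sentences left out).
	data := []byte(`{
 "printed": "2021-03-18 18:34:35",
 "stats": {"scripts": 2, "size": 60},
 "scripts": [
  {"script_metadata": {"output_size": 20, "timestamp": "2021-03-18 17:51:17"}},
  {"script_metadata": {"output_size": 40, "timestamp": "2021-03-18 17:52:29"}}
 ]
}`)
	if err := checkExport(data); err != nil {
		fmt.Println("inconsistent export:", err)
		return
	}
	fmt.Println("export looks consistent")
}
```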

A complete sample output can be found in this file: