Exporting a terminology to JSON
To produce a JSON file from a terminology with TermSuite, use:
- the option --json if you are using the command line,
- JsonExporter if you are using the Java API,
- Export to JSON… from the menu of the terminology editor if you are using the graphical user interface of TermSuite.
File structure
Overview
{
"metadata" : { ... },
"input_sources" : { ... },
"words" : [ ... ],
"terms" : [ ... ],
"relations" : [ ... ],
"occurrences" : [ ... ]
}
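As a rough illustration of the layout above, the sketch below parses an export with Python's standard json module and counts the entries under each top-level key. The sample string is a hypothetical stand-in for a real export, and the helper name summarize is an assumption made for this example, not part of TermSuite.

```python
import json

# Hypothetical, heavily shortened stand-in for an exported terminology.
# A real export would be read from disk with json.load(...) instead.
sample = """
{
  "metadata": {"name": "demo", "lang": "en", "occurrence_storage": "embedded"},
  "input_sources": {"1": "/tmp/file-1.txt"},
  "words": [{"lemma": "energy", "stem": "energi"}],
  "terms": [],
  "relations": [],
  "occurrences": []
}
"""

def summarize(terminology):
    """Return the number of entries found under each top-level key."""
    return {key: len(value) for key, value in terminology.items()}

terminology = json.loads(sample)
summary = summarize(terminology)
print(summary)
```

Counting entries per key is a quick sanity check that an export contains the sections a downstream script expects.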
metadata
The key metadata holds general information about the current terminology:
"metadata" : {
"name" : "txt-en-2017-07-19_14-14-22",
"lang" : "en",
"occurrence_storage" : "embedded",
"wordsNum" : 368322,
"spottedTermsNum" : 226145
}
occurrence_storage
Two possible values:
- embedded: the term occurrences are contained in the current JSON file, under the key occurrences,
- disk: the term occurrences are stored in a separate file on the file system. The path to this file is given by the key persistent_store_path.
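The two storage modes could be handled as in the sketch below. It assumes that persistent_store_path sits next to occurrence_storage inside metadata, which this page does not state explicitly. The function name occurrence_source and the sample path are hypothetical.

```python
def occurrence_source(terminology):
    """Tell where the occurrences of a parsed export live.

    Returns ("embedded", list_of_occurrences) or ("disk", path_or_None).
    """
    metadata = terminology["metadata"]
    storage = metadata["occurrence_storage"]
    if storage == "embedded":
        return ("embedded", terminology.get("occurrences", []))
    if storage == "disk":
        # ASSUMPTION: the path is stored in metadata under persistent_store_path.
        return ("disk", metadata.get("persistent_store_path"))
    raise ValueError(f"unexpected occurrence_storage value: {storage!r}")

embedded_export = {
    "metadata": {"occurrence_storage": "embedded"},
    "occurrences": [{"tid": "nn: wind energy", "begin": 0, "end": 11, "file": 1}],
}
disk_export = {
    # Hypothetical path, for illustration only.
    "metadata": {"occurrence_storage": "disk",
                 "persistent_store_path": "/tmp/occurrences.store"},
}

print(occurrence_source(embedded_export)[0])
print(occurrence_source(disk_export))
```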
input_sources
The key input_sources maps an id to the path of each document in the input corpus.
Example:
"input_sources" : {
"1" : "/home/cram-d/Corpora/TS/wind-energy/English/txt/file-1.txt",
"2" : "/home/cram-d/Corpora/TS/wind-energy/English/txt/file-2.txt",
"3" : "/home/cram-d/Corpora/TS/wind-energy/English/txt/file-3.txt",
"4" : "/home/cram-d/Corpora/TS/wind-energy/English/txt/file-4.txt",
"5" : "/home/cram-d/Corpora/TS/wind-energy/English/txt/file-5.txt",
...
}
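A lookup in this map might look like the sketch below. The ids are JSON object keys, hence strings, whereas the file field of an occurrence is shown as a number elsewhere in this page, so the sketch normalizes the id with str(). The helper name source_path is an assumption made for this example.

```python
from pathlib import PurePosixPath

# Two entries copied from the example above.
input_sources = {
    "1": "/home/cram-d/Corpora/TS/wind-energy/English/txt/file-1.txt",
    "2": "/home/cram-d/Corpora/TS/wind-energy/English/txt/file-2.txt",
}

def source_path(input_sources, file_id):
    """Return the document path for a file id (int or str), or None if unknown."""
    return input_sources.get(str(file_id))

path = source_path(input_sources, 2)
print(PurePosixPath(path).name)
```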
words
The key words is an array that contains all morphological information about words contained in terms. Words are identified with the unique key lemma. See TermSuite data model for more information on how words interact with terms.
Example with a word that is not a compound:
{
"lemma" : "energy",
"stem" : "energi"
}
Example with a compound word:
{
"lemma" : "highspeed",
"stem" : "highspe",
"compound_type" : "NATIVE",
"components" : [ {
"lemma" : "high",
"begin" : 0,
"end" : 5,
"substring" : "highs",
"neoAffix" : false
}, {
"lemma" : "pe",
"begin" : 5,
"end" : 9,
"substring" : "peed",
"neoAffix" : false
} ]
}
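Since lemma is the unique key of a word, a natural way to consume this array is to index it by lemma. The sketch below does that with the two examples above. It assumes, based on those examples only, that non-compound words simply lack the components key. The helper names are hypothetical.

```python
# The two word entries shown above, as Python dictionaries.
words = [
    {"lemma": "energy", "stem": "energi"},
    {
        "lemma": "highspeed",
        "stem": "highspe",
        "compound_type": "NATIVE",
        "components": [
            {"lemma": "high", "begin": 0, "end": 5,
             "substring": "highs", "neoAffix": False},
            {"lemma": "pe", "begin": 5, "end": 9,
             "substring": "peed", "neoAffix": False},
        ],
    },
]

# Index the words by their unique key.
words_by_lemma = {word["lemma"]: word for word in words}

def is_compound(word):
    """A word is treated as a compound when it lists components (assumption)."""
    return "components" in word

def component_substrings(word):
    """Return the substrings covered by each component, in order."""
    return [component["substring"] for component in word.get("components", [])]

print(is_compound(words_by_lemma["energy"]))
print(component_substrings(words_by_lemma["highspeed"]))
```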
terms
The key terms is an array that holds all terms contained in the terminology. Terms are identified by their unique key props.key.
Example with term nn: wind energy:
{
"props" : {
"rank" : 3,
"dFreq" : 32,
"fNorm" : 1.1240164855751218,
"gfNorm" : 3.477238863707884E-5,
"spec" : 4.509551591070845,
"freq" : 414,
"ortho" : 1.0,
"iFreq" : 87,
"ind" : 0.21014492753623187,
"pilot" : "wind energy",
"lemma" : "wind energy",
"tfIdf" : 12.9375,
"key" : "nn: wind energy",
"pattern" : "N N",
"rule" : "nn",
"swtSize" : 2,
"depth" : 2
},
"words" : [ {
"syn" : "N",
"swt" : true,
"lemma" : "wind"
}, {
"syn" : "N",
"swt" : true,
"lemma" : "energy"
} ]
}
- The key props holds all properties of the current term, as a JSON object.
- The key words gives the sequence of words that the current term is made of.
  - The sub key lemma points to the word object in the list of words and makes it possible to retrieve the stem and the morphological composition of the word.
  - The sub key syn gives the syntactic label of the word for this term.
  - The sub key swt is true when the word, taken with its syntactic label, is a single-word term.
Example 1: the entry {"syn" : "N", "swt" : true, "lemma" : "wind"} is also a single-word term because there is an entry n: wind in the terminology.
Example 2: the term npn: storage of hydrogen has the following words:
"words" : [ {
"syn" : "N",
"swt" : true,
"lemma" : "storage"
}, {
"syn" : "P",
"swt" : false,
"lemma" : "of"
}, {
"syn" : "N",
"swt" : true,
"lemma" : "hydrogen"
}
]
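The words of a term can be traversed as in the sketch below, which rebuilds the syntactic pattern from the syn labels and lists the keys of the single-word terms that the swt flags point to. The key format (lowercased label, colon, space, lemma) is inferred from the n: wind example above and is an assumption, as are the helper names.

```python
# Term from Example 2; only the key is reproduced in props.
term = {
    "props": {"key": "npn: storage of hydrogen"},
    "words": [
        {"syn": "N", "swt": True, "lemma": "storage"},
        {"syn": "P", "swt": False, "lemma": "of"},
        {"syn": "N", "swt": True, "lemma": "hydrogen"},
    ],
}

def syntactic_pattern(term):
    """Join the syntactic labels of the term's words, e.g. 'N P N'."""
    return " ".join(word["syn"] for word in term["words"])

def single_word_term_keys(term):
    """Guess the keys of the single-word terms flagged by swt.

    ASSUMPTION: keys look like '<lowercased label>: <lemma>', as in 'n: wind'.
    """
    return [
        f"{word['syn'].lower()}: {word['lemma']}"
        for word in term["words"]
        if word["swt"]
    ]

print(syntactic_pattern(term))
print(single_word_term_keys(term))
```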
relations
The key relations is an array that holds all types of links between terms. All relations are directed.
Here is an example of a relation:
{
"from" : "nn: pressure decrease",
"to" : "npan: decrease of total pressure",
"type" : "var",
"props" : {
"vRank" : 1,
"vRules" : [ "S-PI-NN-PN" ],
"vScore" : 0.99,
"isExt" : false,
"vBagFreq" : 2,
"srcGain" : 0.0,
"nSrcGain" : 1.0,
"isInfered" : false,
"isGraph" : false,
"isDeriv" : false,
"isPref" : false,
"isSyntag" : true,
"isMorph" : false,
"isSem" : false
}
}
- the keys from and to are the term keys of the relation boundaries. They indicate that the relation starts from term nn: pressure decrease and goes to term npan: decrease of total pressure.
- the key type gives the relation type. There are as many allowed values as there are relation types in the data model.
- the key props holds all properties of the current relation, as a JSON object.
Note: among all relations set by TermSuite between terms, the most interesting ones are the variations, i.e. those relations having type = "var".
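Selecting the variations of a given term could be sketched as follows. The first relation is the example above, with props shortened. The second one is a hypothetical non-variation relation whose type name is invented for this demo. The function name variations_from is also an assumption.

```python
relations = [
    {
        "from": "nn: pressure decrease",
        "to": "npan: decrease of total pressure",
        "type": "var",
        "props": {"vRank": 1, "vRules": ["S-PI-NN-PN"], "vScore": 0.99},
    },
    {
        # Hypothetical relation; "other" is a placeholder, not a real type name.
        "from": "nn: pressure decrease",
        "to": "n: pressure",
        "type": "other",
        "props": {},
    },
]

def variations_from(relations, term_key):
    """Return the keys of the terms that are variations of term_key.

    Relations are directed, so only the 'from' side is matched.
    """
    return [
        relation["to"]
        for relation in relations
        if relation["type"] == "var" and relation["from"] == term_key
    ]

print(variations_from(relations, "nn: pressure decrease"))
```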
occurrences
The key occurrences, when set (see key occurrence_storage in metadata), holds the list of occurrences of all terms.
"occurrences" : [ {
"tid" : "nnn: turbine generator connection",
"begin" : 2397,
"end" : 2425,
"file" : 17,
"text" : "Turbine generator connection"
}, {
"tid" : "nnn: turbine generator connection",
"begin" : 23234,
"end" : 23262,
"file" : 17,
"text" : "Turbine generator connection"
},
...
]
Each occurrence entry has five keys:
- tid: the term id, pointing to the value of props.key in the term’s entry,
- file: the id of the file where the occurrence was spotted. See input_sources to get the file URL from the file id,
- begin, end: the offsets of the occurrence within the file,
- text: (optional) the plain text covered by the occurrence. It can also be retrieved from the file URL, begin and end.
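The sketch below groups occurrences by term and resolves each file id through input_sources. It also shows how text might be recovered from a document, assuming that begin and end are character offsets with an exclusive end. That reading is consistent with the example (2425 - 2397 equals the 28 characters of the covered text) but is not confirmed here. The input path and helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical path for file id 17; ids are string keys in input_sources.
input_sources = {"17": "/path/to/file-17.txt"}

occurrences = [
    {"tid": "nnn: turbine generator connection", "begin": 2397, "end": 2425,
     "file": 17, "text": "Turbine generator connection"},
    {"tid": "nnn: turbine generator connection", "begin": 23234, "end": 23262,
     "file": 17, "text": "Turbine generator connection"},
]

def occurrences_by_term(occurrences, input_sources):
    """Group (path, begin, end) triples under each term id."""
    grouped = defaultdict(list)
    for occurrence in occurrences:
        # The file id is a number here but a string key in input_sources.
        path = input_sources.get(str(occurrence["file"]))
        grouped[occurrence["tid"]].append(
            (path, occurrence["begin"], occurrence["end"])
        )
    return dict(grouped)

def covered_text(document_text, occurrence):
    """Slice the occurrence out of the document.

    ASSUMPTION: offsets count characters and 'end' is exclusive.
    """
    return document_text[occurrence["begin"]:occurrence["end"]]

grouped = occurrences_by_term(occurrences, input_sources)
print(len(grouped["nnn: turbine generator connection"]))
```

When text is absent from an entry, covered_text can be applied to the content of the file found through input_sources.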
