Warning: This documentation page is out of date. Some of its information cannot be applied to current version of TermSuite 3.0.3
This page guides you the steps to get TermSuite running and extracting the terminology from a short example corpus.
- 1. Requirements
- 2. Installation of TreeTagger
- 3. Download TermSuite
- 4. Prepare your corpus
- 5. Run terminology extraction
- 6. Understanding the TSV output
- 8. Enjoy TermSuite
For the sake of simplicity, create a directory where you will download all files required by TermSuite. In the following sections, we refer to this directory as
Make sure you have Java installed on your OS (at least version 8), or follow the official installation instructions.
To check if Java is installed properly and see its current version, open a command line prompt and type:
$ java -version
2. Installation of TreeTagger
TermSuite requires a POS Tagger and lemmatizer to run terminology extraction pipelines. In this guide, we install TreeTagger, but TermSuite also supports Mate. TreeTagger must be installed apart from TermSuite, due to license concerns.
To install TreeTagger on your OS: (See the install instructions for details)
1. Download TreeTagger from the official site to
TERMSUITE_WORKSPACE and uncompress it.
2. Creates a subdirectory named
models within the TreeTagger directory.
3. Download to dir
models/ the english utf-8-encoded (very important) model from the official site and rename it to
Encoding and naming of TreeTagger models is important for TermSuite to run correctly. See detailed instructions for all languages.
3. Download TermSuite
Download the last stable version of TermSuite’s jar for project
termsuite-core from Maven Central to directory
Currently : termsuite-core-3.0.3.jar
4. Prepare your corpus
Download the example multilingual corpus Wind Energy to
TERMSUITE_WORKSPACE and uncompress it.
5. Run terminology extraction
TERMSUITE_WORKSPACE folder should look like this: (non exhaustive)
TERMSUITE_WORKSPACE/ wind-energy/ README.txt English/ txt/ file1.txt file3.txt [...] file38.txt French/ [...] tree-tagger/ bin/ [...] lib/ [...] models/ english.par # Should be the `utf-8` model ! [...] termsuite-core-3.0.3.jar
Run the terminology extraction on the Wind Energy corpus and language
$ java -cp termsuite-core-3.0.3.jar eu.project.ttc.tools.cli.TermSuiteTerminoCLI \ -t ./tree-tagger/ \ -c ./wind-energy/English/txt/ \ -l en \ --tsv ./wind-energy-en.tsv
6. Understanding the TSV output
TermSuite created the file
wind-energy-en.tsv showed below.
* 1st column is the term id, or the base term id in case of a variant,
* 2nd column:
T indicates a term,
V indicates a variant,
* 3rd column: the term pilot,
* 4th column: the term frequency,
* 5th column: the term specificity (weirdness ratio logarithm).
baseId type p pilot wrlog f 1 T N N wind turbine 5,16 1852 1 V N N N horizontal-axis wind turbines 3,52 42 1 V A N N N horizontal axis wind turbine 3,50 41 1 V A N N N vertical axis wind turbines 3,62 53 1 V A N N N modern horizontal-axis wind turbines 2,59 5 1 V A N N smaller-scale wind turbines 2,20 2 1 V A N N on-shore wind turbines 1,90 1 1 V A N N pre-manufactured wind turbine 1,90 1 1 V A N N repowred wind turbines 1,90 1 1 V A N N N conventional horizontal-axis wind turbines 1,90 1 1 V A N N N potential campus wind turbines 1,90 1 1 V A N N N typical horizontal-axis wind turbine 1,90 1 1 V A N N N unconventional horizontal-axis wind turbines 1,90 1 1 V N N N N hawts horizontal-axis wind turbines 1,90 1 1 V N N N N utility scale wind turbine 1,90 1 1 V N N N N lift type wind turbines 1,90 1 1 V A N N domestic wind turbines 3,35 29 1 V N N N wind turbine syndrome 3,21 21 [...] 2 T N rotor 4,82 848 3 T N N wind energy 4,51 414 3 V A N N californian wind energy 1,90 1 3 V A N N offshore wind energy 3,56 47 3 V N N N wind energy conversion 3,32 27 3 V N N N wind energy conf 2,59 5 3 V A N N N significant contribution wind energy 1,90 1 3 V N N N N activity plan wind energy 1,90 1 3 V N N N N title ge wind energy 1,90 1 3 V N N N wind energy easements 1,90 1 4 T N N wind speed 4,41 331 4 V N P N speed of the wind 2,50 4 4 V N C N P N speed and direction of the wind 1,90 1 4 V N N N integer wind speed 1,90 1 4 V A N N average wind speed 3,29 25 4 V A N N undisturbed wind speed 2,59 5 4 V N N N cutoff wind speed 2,37 3 4 V N N N N terrain score wind speed 1,90 1 4 V N N N wind speed cutoff 2,20 2 4 V N N N cut-out wind speed 2,50 4 4 V N N N cut-off wind speeds 1,90 1 4 V N N N incision wind speed 1,90 1 5 T N N wind power 4,34 278 5 V N P N power of the wind 2,97 12 5 V N N N wind turbine power 2,89 10 5 V N N N wind power plant 3,76 74 5 V A N N developable wind power 1,90 1 5 V A N N N environmental engineering wind power 1,90 1 5 V N N N wind power stations 3,27 24 6 T N airfoil 4,26 236 ...
This output can be configured.
json are also posssible. See command line options.
8. Enjoy TermSuite
You can now run TermSuite on your own corpus.
Full documentation of available features and options from the command line: command line API.
Embed TermSuite directly to your Java project: Java API.