To extract the terminology of a corpus with TermSuite, you need to install a third-party part-of-speech (POS) tagger and lemmatizer.
TermSuite currently supports two POS taggers and lemmatizers:
* TreeTagger (recommended)
* Mate (better POS tagging performance, but slower)
Option 1 (recommended): Installing TreeTagger and its models
TreeTagger is a very fast POS tagger and lemmatizer with acceptable performance on all TermSuite languages. Unfortunately, its license excludes commercial use. As a consequence, TreeTagger cannot be included as a third-party dependency in TermSuite and needs to be installed manually by end users.
Download and install TreeTagger on your OS following the official instructions.
The TreeTagger home directory contains three subdirectories, including bin and lib. The TreeTagger executable program should be in the bin sub-directory. Note that the executable is called tree-tagger on Linux and tree-tagger.exe on Windows.
The language model parameter files should be in the models sub-directory. If the parameter files are in the lib sub-directory instead, please create a symbolic link named models that points to it, as follows:
$ cd /path/to/tree-tagger-home-directory
$ ln -s lib models
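The symbolic link step can also be scripted so that it is safe to run more than once. The following is a minimal sketch, not part of TermSuite or TreeTagger: the function name ensure_models_link is hypothetical, and it assumes a POSIX shell and that the parameter files already sit in the lib sub-directory.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with TermSuite or TreeTagger).
# Usage: ensure_models_link <tree-tagger-home>
ensure_models_link() {
    home="$1"
    if [ -e "$home/models" ]; then
        # A models directory or link is already there: leave it untouched.
        echo "models already present in $home"
    elif [ -d "$home/lib" ]; then
        # Relative target, equivalent to running `ln -s lib models` inside $home.
        ln -s lib "$home/models"
        echo "created $home/models -> lib"
    else
        echo "no lib directory found in $home" >&2
        return 1
    fi
}
```

It would be called as ensure_models_link /path/to/tree-tagger-home-directory. A second call reports that models is already present and changes nothing.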
Pay attention to parameter file names and encoding!
In the models directory, TermSuite expects a specific file name for each language parameter file. Please name your parameter files accordingly, and make sure they all use the encoding that TermSuite expects.
On Linux, you can check that TreeTagger is correctly installed by running these two commands:
$ cd tree-tagger-home-directory
$ ./bin/tree-tagger ./models/english.par
Exit the program with the keyboard shortcut Ctrl+D.
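Because the check above is interactive, a non-interactive look at the directory layout can be handy before launching TermSuite. The sketch below only tests that the files described in this section are where they should be; it does not run the tagger. The function name check_treetagger_home is hypothetical, and english.par is simply the parameter file used in the command above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with TermSuite or TreeTagger).
# Usage: check_treetagger_home <tree-tagger-home> [parameter-file]
check_treetagger_home() {
    home="$1"
    par="${2:-english.par}"
    status=0

    # The executable is tree-tagger on Linux and tree-tagger.exe on Windows.
    if [ ! -x "$home/bin/tree-tagger" ] && [ ! -x "$home/bin/tree-tagger.exe" ]; then
        echo "no TreeTagger executable found in $home/bin" >&2
        status=1
    fi

    # Parameter files are expected in the models sub-directory.
    if [ ! -f "$home/models/$par" ]; then
        echo "missing parameter file: $home/models/$par" >&2
        status=1
    fi

    return "$status"
}
```

The function returns 0 when both the executable and the parameter file are found, and 1 otherwise, so it can be used as a guard in a larger script.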
Option 2: Installing Mate models
Mate has slightly better POS tagging performance than TreeTagger in the context of terminology extraction, but it also has a few disadvantages:
* only three language models are public,
* parsing language models is very slow and results in a quite long and constant loading time for each run,
* tagging and lemmatizing raw text is also very slow compared to TreeTagger.
The main advantage of Mate is that it is embedded into TermSuite. You do not need to install it on your OS. You just need to download (or train) the required language models. See the official page for models of these three languages:
Mate requires two different models for each language: the parser+tagger model and the lemmatizer model.
Pay attention to parameter file names!
TermSuite expects a specific file name for each Mate model. The model files must follow these patterns (where xx is the two-letter language code):
|Model|File name pattern|