Fork TermSuite on GitHub.

Option 1: manual installation

Create a directory where you will download all files required by TermSuite. In the following sections, we refer to this directory as TERMSUITE_WORKSPACE.

Java

Make sure you have Java installed on your OS (at least version 8), or follow the official installation instructions.

To check if Java is installed properly and see its current version, open a command line prompt and type:

$ java -version

Install TreeTagger

TermSuite requires a POS Tagger and lemmatizer to run terminology extraction pipelines. In this guide, we install TreeTagger, but TermSuite also supports Mate. The tagger/lemmatizer must be installed apart from TermSuite, due to license concerns.

To install TreeTagger on your OS: (See the install instructions for details)
1. Download TreeTagger from the official site and install it to TERMSUITE_WORKSPACE/treetagger with the help of official instructions.
2. Creates a subdirectory named models at TERMSUITE_WORKSPACE/treetagger.
3. Download to dir models/ the english utf-8-encoded (very important) model from the official site and rename it to english.par.

Download TermSuite

Download the last stable version of TermSuite’s jar for project termsuite-core from Maven Central to directory TERMSUITE_WORKSPACE.

Currently : termsuite-core-3.0.10.jar

Prepare your corpus

Download the example corpus Wind Energy to TERMSUITE_WORKSPACE and uncompress it.

Otherwise, you could prepare your own corpus in the form of a collection of *.txt files within a any directory.

Run terminology extraction

Now your TERMSUITE_WORKSPACE folder should look like this: (non exhaustive)

TERMSUITE_WORKSPACE/
  wind-energy/
    README.txt
    English/
      txt/
        file1.txt
        file3.txt
        [...]
        file38.txt
    French/
    [...]
  treetagger/
    bin/
      [...]
    lib/
      [...]
    models/
      english.par # Should be the `utf-8` model !
    [...]
  termsuite-core-3.0.10.jar

Run the terminology extraction on the Wind Energy corpus and language en:

$ java -cp termsuite-core-3.0.10.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
  -t ./treetagger/ \
  -c ./wind-energy/English/txt/ \
  -l en \
  --tsv ./wind-energy-en.tsv

Understanding the TSV output

The wind-energy-en.tsv file produced by this command line, should look like the excerpt below. To understand the TSV format, please refer the TSV output documentation

Two other output formats are also available: tbx and json. See command line options.

#	type	p	pilot	spec	f
1	T	N N	wind turbine	5,16	1852
1	V	N N N	horizontal-axis wind turbines	3,52	42
1	V	A N N N	horizontal axis wind turbine	3,50	41
1	V	A N N N	vertical axis wind turbines	3,62	53
1	V	A N N N	modern horizontal-axis wind turbines	2,59	5
1	V	A N N	smaller-scale wind turbines	2,20	2
1	V	A N N	on-shore wind turbines	1,90	1
1	V	A N N	pre-manufactured wind turbine	1,90	1
1	V	A N N	repowred wind turbines	1,90	1
1	V	A N N N	conventional horizontal-axis wind turbines	1,90	1
1	V	A N N N	potential campus wind turbines	1,90	1
1	V	A N N N	typical horizontal-axis wind turbine	1,90	1
1	V	A N N N	unconventional horizontal-axis wind turbines	1,90	1
1	V	N N N N	hawts horizontal-axis wind turbines	1,90	1
1	V	N N N N	utility scale wind turbine	1,90	1
1	V	N N N N	lift type wind turbines	1,90	1
1	V	A N N	domestic wind turbines	3,35	29
1	V	N N N	wind turbine syndrome	3,21	21
[...]
2	T	N	rotor	4,82	848
3	T	N N	wind energy	4,51	414
3	V	A N N	californian wind energy	1,90	1
3	V	A N N	offshore wind energy	3,56	47
3	V	N N N	wind energy conversion	3,32	27
3	V	N N N	wind energy conf	2,59	5
3	V	A N N N	significant contribution wind energy	1,90	1
3	V	N N N N	activity plan wind energy	1,90	1
3	V	N N N N	title ge wind energy	1,90	1
3	V	N N N	wind energy easements	1,90	1
4	T	N N	wind speed	4,41	331
4	V	N P N	speed of the wind	2,50	4
4	V	N C N P N	speed and direction of the wind	1,90	1
4	V	N N N	integer wind speed	1,90	1
4	V	A N N	average wind speed	3,29	25
4	V	A N N	undisturbed wind speed	2,59	5
4	V	N N N	cutoff wind speed	2,37	3
4	V	N N N N	terrain score wind speed	1,90	1
4	V	N N N	wind speed cutoff	2,20	2
4	V	N N N	cut-out wind speed	2,50	4
4	V	N N N	cut-off wind speeds	1,90	1
4	V	N N N	incision wind speed	1,90	1
5	T	N N	wind power	4,34	278
5	V	N P N	power of the wind	2,97	12
5	V	N N N	wind turbine power	2,89	10
5	V	N N N	wind power plant	3,76	74
5	V	A N N	developable wind power	1,90	1
5	V	A N N N	environmental engineering wind power	1,90	1
5	V	N N N	wind power stations	3,27	24
6	T	N	airfoil	4,26	236
...

Enjoy TermSuite

You can now run TermSuite on your own corpus.

See the full documentation of available features and options of terminology extraction from the command line at command line API, and its examples.

You could also have chosen to run PreprocessorCLI instead of TerminologyExtractorCLI and apply only TermSuite’s NLP preprocessings to your corpus without extracting the terminology. See also the preprocessing examples.

You can also embed TermSuite directly into your Java project with TermSuite’s Java API.

Option 2: docker container

TermSuite’s third-party dependency on TreeTagger or Mate might be discouraging, because it is one difficult step in installation process described above and also an external path to tagger’s installation directory to specify explicitely at every single run.

To overcome this issue, we have made TermSuite work with Docker container technology.

Unfortunately, we cannot publish any TermSuite pre-built container image, for the same licensing reasons. Hoever, the docker image can be easily built once for all by the user with Git and Docker

Follow the guide below to launch TermSuite tools without installing nor configuring any POS tagger.

See TermSuite’s Docker project for more detailed instructions.

Prerequisites

  • Docker
  • Git

Clone TermSuite’s docker project

$ git clone https://github.com/termsuite/termsuite-docker.git

Build the image

$ cd termsuite-docker
$ bin/build

Prepare your corpus

Same step as with manual installation. See above.

Run terminology extraction from the image

You can known run TermSuite tools (preprocessings, terminology extraction, and alignment) with bin/termsuite script. See TermSuite’s Docker project for more informtaion.

To extract the terminology from the

$ bin/termsuite extract
  -c ./wind-energy/English/txt/ \
  -l en \
  --tsv ./wind-energy-en.tsv

Understanding the TSV output

Same TSV output as with manual installation. See above for details.

See also

Preprocessing and extraction pipelines

See the exhaustive list of analysis engines and linguistic resources that TermSuite uses for terminology extraction.

Bilingual alignment

TermSuite also supports terminology alignment, i.e. bilingual domain-specific term translation. See how to extract your source and target terminologies for alignment and how to run the aligner with command line or with Java API.

See also alignment architecture overview and more theoritical concerns on TermSuite alignment.

The Java API

TermSuite is a Java software and can easily be embedded into your Java projects as a Maven or Gradle dependency. There is a Java API for: (not exhaustive)

The graphical user interface (GUI)

Another way of running TermSuite is its graphical user interface (GUI). Note that the current version of TermSuite’s GUI is 2.3, while other API’s current version is 3.0. Be aware that you might not benefit from last TermSuite’s features and improvements within the GUI.