TermSuite is a toolbox for terminology extraction and multilingual term alignment.

It provides multiword and compound term detection, morphosyntactic analysis, term variant detection, term specificity computation, and more. See the features overview.

Language Support

French, English, Russian, Italian, German, Spanish

Command Line - Graphical User Interface - Java API

The current version of TermSuite is 3.0.10. See Changelog.

Get it running!

Prepare your system for TermSuite, download, install and get it running on an example corpus quickly.

List of all TermSuite's features, analysis engines, and configuration parameters. Java API.

Build it from source with Gradle, or use it as a Maven dependency.

Command Line

$ java -cp termsuite-core-3.0.10.jar \ \
            -t /path/to/treetagger/ \
            -c /path/to/corpus/ \
            -l en \
            --tsv my-termino.tsv

NLP preprocessing, terminology extraction, and multilingual alignment from the command line. See Getting Started [using Docker].

Java API

Embed TermSuite in your Java application and get access to all functionalities and full control.

// Create the corpus object
TXTCorpus corpus = new TXTCorpus(
    Paths.get("wind-energy", "documents"));

// Run the NLP preprocessing (a distinct variable name avoids redeclaring `corpus`)
IndexedCorpus indexedCorpus = TermSuite.preprocessor()
    .setTaggerPath(Paths.get("path", "to", "treetagger"))
    .toIndexedCorpus(corpus, 500000);

// Extract the terminology

// Keep only top 1000 terms by specificity with their variants

// Export the terminology to TSV
TermSuiteFactory.createTsvExporter(new TsvOptions())
    .export(indexedCorpus, Paths.get("my-termino.tsv"));

Graphical User Interface

A graphical user interface for terminology extraction and multilingual alignment, with an in-context term occurrence viewer and a linguistic resource editor.


Features Overview

Word tokenization
POS Tagging (3rd party: with TreeTagger or Mate)
Lemmatization (3rd party: with TreeTagger or Mate)
Stemming (Snowball)
Terminology extraction
Efficient multiword term detection
Term morphology extraction
Term syntactic variants detection
Term graphic variants detection
Variant detection based on term derivations and term prefixation
Term semantic variants detection
Term morphosyntactic variants detection
Term specificity (Weirdness Ratio) computation and other term measures: WR log, term frequency, etc.
Term alignment (distributional and compositional, multilingual and monolingual)
Terminology export in multiple formats: `json`, `tsv`, `tbx`
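
To make the term specificity measure listed above more concrete, here is a minimal sketch of the textbook Weirdness Ratio: a term's relative frequency in the specialized corpus divided by its relative frequency in a general-language corpus. It does not use the TermSuite API and is not TermSuite's implementation. The class `WeirdnessRatio`, its method names, the floor of 1 applied to the general-corpus count, and the base-10 logarithm for "WR log" are all assumptions made for illustration.

```java
import java.util.Locale;

/** Illustrative (hypothetical) helper; not part of the TermSuite API. */
public final class WeirdnessRatio {

    private WeirdnessRatio() {}

    /**
     * Relative frequency of a term in the specialized corpus divided by
     * its relative frequency in the general corpus.
     */
    public static double weirdnessRatio(long specializedCount, long specializedTotal,
                                        long generalCount, long generalTotal) {
        if (specializedTotal <= 0 || generalTotal <= 0) {
            throw new IllegalArgumentException("Corpus sizes must be positive");
        }
        // Assumption: floor the general-corpus count at 1 so that a term
        // unseen in the general corpus does not cause a division by zero.
        long flooredGeneralCount = Math.max(generalCount, 1L);
        double specializedFrequency = (double) specializedCount / specializedTotal;
        double generalFrequency = (double) flooredGeneralCount / generalTotal;
        return specializedFrequency / generalFrequency;
    }

    /** Assumption: "WR log" is taken here as the base-10 logarithm of the ratio. */
    public static double logWeirdnessRatio(long specializedCount, long specializedTotal,
                                           long generalCount, long generalTotal) {
        return Math.log10(
            weirdnessRatio(specializedCount, specializedTotal, generalCount, generalTotal));
    }

    public static void main(String[] args) {
        // Made-up counts: 50 occurrences in a 10,000-token specialized corpus,
        // 20 occurrences in a 1,000,000-token general corpus.
        double wr = weirdnessRatio(50, 10_000, 20, 1_000_000);
        double logWr = logWeirdnessRatio(50, 10_000, 20, 1_000_000);
        System.out.println(String.format(Locale.ROOT, "WR = %.1f, log WR = %.3f", wr, logWr));
    }
}
```

With the made-up counts in `main`, the term is about 250 times more frequent, relative to corpus size, in the specialized corpus than in the general one. A high ratio suggests the term is domain-specific, which is the idea behind ranking terms by specificity.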