
TermSuite is a toolbox for terminology extraction and multilingual term alignment.

Features include multiword and compound term detection, morphosyntactic analysis, term variant detection, and term specificity computation. See features.

Language Support

French, English, Russian, Italian, German, Spanish

Command Line - Graphical User Interface - Java API

The current version of TermSuite is 3.0.10. See Changelog.

Get it running!

Prepare your system, then download and install TermSuite and run it on an example corpus.

Getting Started

Documentation

List of all TermSuite's features, analysis engines, and configuration parameters. Java API.

User Manual Javadoc

Developers

Build it from source with Gradle, or use it as a Maven dependency.

View on GitHub - Maven / Gradle

Command Line

$ java -cp termsuite-core-3.0.10.jar \
        fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t /path/to/treetagger/ \
            -c /path/to/corpus/ \
            -l en \
            --tsv my-termino.tsv

NLP preprocessing, terminology extraction, and multilingual alignment from the command line. Getting Started [using Docker]

Java API

Embed TermSuite in your Java application for full control and access to all of its functionality.

// Create the corpus object
TXTCorpus txtCorpus = new TXTCorpus(
    Lang.EN,
    Paths.get("wind-energy", "documents"));

// Run the NLP preprocessing (the raw corpus and the indexed corpus
// need distinct variable names for this to compile)
IndexedCorpus corpus = TermSuite.preprocessor()
    .setTaggerPath(Paths.get("path", "to", "treetagger"))
    .toIndexedCorpus(txtCorpus, 500000);

// Extract the terminology
TermSuite.terminoExtractor().execute(corpus);

// Keep only top 1000 terms by specificity with their variants
TermSuite.terminologyFilterer()
    .by(TermProperty.SPECIFICITY)
    .keepTopN(1000).keepVariants()
    .filter(corpus);

// Export the terminology to TSV
TermSuiteFactory.createTsvExporter(new TsvOptions())
    .export(corpus, Paths.get("my-termino.tsv"));
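
Once the terminology has been exported, the TSV file can be read back with plain Java. The sketch below is not part of the TermSuite API: the class name `TsvRows` is hypothetical, and the code makes no assumption about which columns the exporter writes. It only splits each non-empty line on tab characters.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a TermSuite class: reads a tab-separated file
// into a list of rows, each row being an array of cell values.
class TsvRows {

    static List<String[]> read(Path tsv) throws IOException {
        List<String[]> rows = new ArrayList<>();
        for (String line : Files.readAllLines(tsv, StandardCharsets.UTF_8)) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // Limit -1 keeps trailing empty cells instead of dropping them
            rows.add(line.split("\t", -1));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Path taken from the command line, e.g. my-termino.tsv
        for (String[] row : read(Paths.get(args[0]))) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Whether the first row is a header, and what each column means, depends on the export options, so check the file before relying on column positions.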

Graphical User Interface

A graphical user interface for terminology extraction and multilingual alignment, with an in-context term occurrence viewer and a linguistic resource editor.

Download

(click here if the link above fails)

Documentation

Tutorial for TermSuite Graphical User Interface


Publications

ACL2016

Damien Cram and Béatrice Daille.
Terminology Extraction with Term Variant Detection.
Proceedings of ACL-2016 System Demonstrations.
PDF

IJCNLP2011

Jérôme Rocheteau and Béatrice Daille.
TTC TermSuite: A UIMA Application for Multilingual Terminology Extraction from Comparable Corpora.
Proceedings of the 5th International Joint Conference on Natural Language Processing.
PDF

BENJAMINS2017

Béatrice Daille.
Term Variation in Specialised Corpora: Characterisation, automatic discovery and applications.
Vol. 19. John Benjamins Publishing Company.
URL

Features Overview

Word tokenization
POS Tagging (3rd party: with TreeTagger or Mate)
Lemmatization (3rd party: with TreeTagger or Mate)
Stemming (Snowball)
Terminology extraction
Efficient multiword term detection
Term morphology extraction
Syntactic term variant detection
Graphic term variant detection
Variant detection based on term derivations and term prefixation
Semantic term variant detection
Morphosyntactic term variant detection
Term specificity (Weirdness Ratio) computation and other term measures: WR log, term frequency, etc.
Term alignment (distributional and compositional, multilingual and monolingual)
Terminology export in multiple formats: `json`, `tsv`, `tbx`
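
To make the specificity measure in the list above concrete, here is a minimal sketch of the weirdness ratio as it is usually defined: the relative frequency of a term in the specialised corpus divided by its relative frequency in a general-language corpus. This illustrates the measure only and is not TermSuite's implementation. The class and method names are hypothetical, and two choices are assumptions of the sketch: a term unseen in the general corpus is counted as one occurrence, and "WR log" is taken to be the base-10 logarithm.

```java
// Hypothetical illustration, not a TermSuite class.
class WeirdnessRatio {

    /**
     * Relative frequency in the specialised corpus divided by relative
     * frequency in the general corpus. Corpus sizes are token counts.
     */
    static double weirdnessRatio(long termFreqSpecialised, long specialisedSize,
                                 long termFreqGeneral, long generalSize) {
        if (specialisedSize <= 0 || generalSize <= 0) {
            throw new IllegalArgumentException("corpus sizes must be positive");
        }
        double relSpecialised = (double) termFreqSpecialised / specialisedSize;
        // Assumption of this sketch: a term unseen in the general corpus is
        // counted as one occurrence to avoid dividing by zero.
        double relGeneral = (double) Math.max(termFreqGeneral, 1) / generalSize;
        return relSpecialised / relGeneral;
    }

    /** Log-scaled ratio; base 10 is an assumption of this sketch. */
    static double logWeirdnessRatio(long termFreqSpecialised, long specialisedSize,
                                    long termFreqGeneral, long generalSize) {
        return Math.log10(weirdnessRatio(
                termFreqSpecialised, specialisedSize, termFreqGeneral, generalSize));
    }

    public static void main(String[] args) {
        // Made-up counts: 50 occurrences in a 100,000-token specialised corpus,
        // 10 occurrences in a 10,000,000-token general corpus.
        double wr = weirdnessRatio(50, 100_000, 10, 10_000_000);
        double wrLog = logWeirdnessRatio(50, 100_000, 10, 10_000_000);
        System.out.printf("WR = %.1f, log WR = %.3f%n", wr, wrLog);
    }
}
```

A ratio well above 1 means the term is much more frequent in the specialised corpus than in general language, which is what makes it a candidate domain term. The log form compresses the range so that terms can be ranked and thresholded more easily.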