TermSuite

TermSuite is a toolbox for terminology extraction and multilingual term alignment.

It provides multiword and compound term detection, morphosyntactic analysis, term variant detection, term specificity computation, and more. See features.

Language Support

Command Line - Graphical User Interface - Java API

The current version of TermSuite is 3.0.10. See Changelog.

Get it running!

Prepare your system for TermSuite, download, install and get it running on an example corpus quickly.

Getting Started

Documentation

List of all TermSuite's features, analysis engines, and configuration parameters. Java API.

User Manual - Javadoc

Developers

Build it from source with Gradle, or use it as a Maven dependency.

View on GitHub - Maven / Gradle

Command Line

$ java -cp termsuite-core-3.0.10.jar \
        fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t /path/to/treetagger/ \
            -c /path/to/corpus/ \
            -l en \
            --tsv my-termino.tsv

NLP preprocessing, terminology extraction, and multilingual alignment from the command line. Getting Started [using Docker]

Download

termsuite-core-3.0.10.jar

Documentation

NLP Preprocessing - Terminology Extraction - Alignment

Java API

Embed TermSuite in your Java application and get access to all functionalities and full control.

Download

termsuite-core-3.0.10.jar - Developer instructions

Documentation

NLP Preprocessing - Terminology Extraction - Alignment

// Create the corpus object from a directory of plain-text documents
TXTCorpus txtCorpus = new TXTCorpus(
    Lang.EN,
    Paths.get("wind-energy", "documents"));

// Run the NLP preprocessing (requires a local TreeTagger installation)
IndexedCorpus corpus = TermSuite.preprocessor()
    .setTaggerPath(Paths.get("path", "to", "treetagger"))
    .toIndexedCorpus(txtCorpus, 500000);

// Extract the terminology
TermSuite.terminoExtractor().execute(corpus);

// Keep only top 1000 terms by specificity with their variants
TermSuite.terminologyFilterer()
    .by(TermProperty.SPECIFICITY)
    .keepTopN(1000).keepVariants()
    .filter(corpus);

// Export the terminology to TSV
TermSuiteFactory.createTsvExporter(new TsvOptions())
    .export(corpus, Paths.get("my-termino.tsv"));

Graphical User Interface

A graphical user interface for terminology extraction and multilingual alignment, with an in-context term occurrence viewer and a linguistic resource editor.

Download


Documentation

Tutorial for TermSuite Graphical User Interface

Publications

ACL2016	Damien Cram and Béatrice Daille. Terminology Extraction with Term Variant Detection. Proceedings of ACL-2016 System Demonstrations.
IJCNLP2011	Jérôme Rocheteau and Béatrice Daille. TTC TermSuite: A UIMA Application for Multilingual Terminology Extraction from Comparable Corpora. Proceedings of the 5th International Joint Conference on Natural Language Processing.
BENJAMINS2017	Béatrice Daille. Term Variation in Specialised Corpora: Characterisation, automatic discovery and applications. Vol. 19. John Benjamins Publishing Company.

Features Overview

Word tokenization
POS Tagging (3rd party: with TreeTagger or Mate)
Lemmatization (3rd party: with TreeTagger or Mate)
Stemming (Snowball)
Terminology extraction
Efficient multiword term detection
Term morphology extraction
Syntactic term variant detection
Graphic term variant detection
Variant detection based on term derivation and prefixation
Semantic term variant detection
Morphosyntactic term variant detection
Term specificity (Weirdness Ratio) computation and other term measures: WR log, term frequency, etc.
Term alignment (distributional and compositional, multilingual and monolingual)
Terminology export in multiple formats: `json`, `tsv`, `tbx`
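
To illustrate the term specificity measure listed above, here is a minimal sketch of a weirdness ratio in its commonly cited form: the relative frequency of a term in the specialized corpus divided by its relative frequency in a general-language corpus. This is not TermSuite's own implementation. The class name `WeirdnessRatio`, its methods, the base-10 logarithm, and the handling of terms absent from the general corpus are all assumptions made for the example.

```java
/**
 * Illustrative weirdness ratio computation (hypothetical helper, not part of TermSuite).
 */
final class WeirdnessRatio {

    private WeirdnessRatio() {}

    /**
     * Relative frequency in the specialized corpus divided by
     * relative frequency in the general corpus.
     * Assumption: a term never seen in the general corpus gets +Infinity.
     */
    static double ratio(long specFreq, long specSize, long genFreq, long genSize) {
        if (specSize <= 0 || genSize <= 0) {
            throw new IllegalArgumentException("Corpus sizes must be positive");
        }
        if (specFreq < 0 || genFreq < 0) {
            throw new IllegalArgumentException("Frequencies must not be negative");
        }
        if (genFreq == 0) {
            return Double.POSITIVE_INFINITY;
        }
        double specRelative = (double) specFreq / specSize;
        double genRelative = (double) genFreq / genSize;
        return specRelative / genRelative;
    }

    /** Logarithm of the ratio; base 10 is an assumption of this sketch. */
    static double logRatio(long specFreq, long specSize, long genFreq, long genSize) {
        return Math.log10(ratio(specFreq, specSize, genFreq, genSize));
    }

    public static void main(String[] args) {
        // A term seen 50 times in a 10,000-word specialized corpus
        // and 5 times in a 1,000,000-word general corpus.
        double wr = ratio(50, 10_000, 5, 1_000_000);
        double wrLog = logRatio(50, 10_000, 5, 1_000_000);
        System.out.printf("WR = %.1f, WR log = %.2f%n", wr, wrLog);
    }
}
```

In this example the term is about 1000 times more frequent, relative to corpus size, in the specialized corpus than in general language, which is what makes it a strong term candidate. Ranking candidates by such a score is the idea behind keeping the top N terms by specificity.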