TermSuite

Prerequisites

Java 8
An installed external POS tagger (its home directory is passed with `--tagger-home`)
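
Both prerequisites can be checked before launching a long run. The following is a minimal sketch, not part of TermSuite: `check_prerequisites` is a hypothetical helper that only verifies that a `java` executable is on the `PATH` and that the tagger home is an existing directory. It does not verify the Java version.

```shell
# Hypothetical pre-flight check (not part of TermSuite).
# $1: path to the POS tagger's home directory.
check_prerequisites() {
    tagger_home=$1
    # A java executable must be reachable (its version is not checked here).
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found on PATH" >&2
        return 1
    fi
    # The tagger home must be an existing directory.
    if [ ! -d "$tagger_home" ]; then
        echo "tagger home is not a directory: $tagger_home" >&2
        return 1
    fi
    return 0
}
```

It could be called as `check_prerequisites "$TREETAGGER_HOME" || exit 1` at the top of a launcher script.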

PreprocessorCLI

Usage

java [-Xms256m -Xmx8g] -cp termsuite-core-3.0.2.jar \
	 fr.univnantes.termsuite.tools.PreprocessorCLI OPTIONS

Description

Applies TermSuite’s preprocessing pipeline to a given text corpus.

Mandatory options

`--tagger-home`, `-t` FILE

Path to POS tagger’s home

`--from-text-corpus`, `-c` DIR

Path to the corpus directory (containing a set of .txt documents)

`--language`, `-l` LANG

Language of the input corpus

`--json-anno`, `--tsv-anno`, `--json`, `--xmi-anno`

At least one of the options --json-anno, --tsv-anno, --json, --xmi-anno must be set.
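
The rule on output options can be checked in a wrapper script before the JVM starts. This is a minimal sketch under that assumption: `has_output_option` is a hypothetical helper, not part of TermSuite, that scans an argument list for at least one of the four output flags.

```shell
# Hypothetical helper (not part of TermSuite): succeeds if at least one
# of the four output options appears among the arguments.
has_output_option() {
    for arg in "$@"; do
        case "$arg" in
            --json-anno|--tsv-anno|--json|--xmi-anno) return 0 ;;
        esac
    done
    return 1
}

# Example: an argument list without any output option is rejected.
if has_output_option -t /path/to/tagger -c /path/to/corpus -l en; then
    echo "ok"
else
    echo "missing output option"
fi
```

The paths in the example call are placeholders.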

Other options

`--capped-size` INT

The maximum number of terms to keep in memory while spotting. Allows processing larger volumes of input text.

`--encoding`, `-e` ENC

Encoding of the input corpus

`--json` FILE

Path to the JSON indexed corpus file into which all occurrences are imported

Warning: At least one of the options --json-anno, --tsv-anno, --json, --xmi-anno must be set.

`--json-anno` DIR

Path to JSON export directory of all spotted term annotations

Warning: At least one of the options --json-anno, --tsv-anno, --json, --xmi-anno must be set.

`--no-occurrence` (no arg)

Do not store occurrence offsets in memory while spotting. Allows processing larger volumes of input text.

`--resource-dir` DIR

Custom resource directory

`--resource-jar` FILE

Custom resource jar

`--resource-url-prefix` STRING

Custom resource URL prefix

`--tagger` STRING

Which POS tagger to use. Allowed values are: mate, tt

`--tsv-anno` DIR

Path to TSV export directory of all spotted term annotations

Warning: At least one of the options --json-anno, --tsv-anno, --json, --xmi-anno must be set.

`--watch` TERM_LIST

List of terms (grouping keys or lemmas) to log to output

`--xmi-anno` DIR

Path to XMI export directory of all spotted term annotations

Warning: At least one of the options --json-anno, --tsv-anno, --json, --xmi-anno must be set.
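
Two of the options above, `--capped-size` and `--no-occurrence`, are described as ways to process larger volumes of text. The sketch below shows how they might be combined on one command line. It only prints the command instead of running it. The cap of 500000 terms is an arbitrary illustrative value, `large_corpus_cmd` is a hypothetical helper, and this page does not state whether both options are compatible with every output format.

```shell
# Sketch only: prints (does not run) a PreprocessorCLI command line
# intended for a large corpus.
# $1: tagger home, $2: corpus directory, $3: language, $4: XMI output directory
large_corpus_cmd() {
    tagger_home=$1; corpus=$2; lang=$3; out=$4
    # printf repeats the '%s ' format for each argument in turn.
    printf '%s ' java -Xms1g -Xmx8g \
        -cp termsuite-core-3.0.2.jar \
        fr.univnantes.termsuite.tools.PreprocessorCLI \
        -t "$tagger_home" -c "$corpus" -l "$lang" \
        --xmi-anno "$out" \
        --capped-size 500000 \
        --no-occurrence
    printf '\n'
}
```

Printing the command first makes it easy to review the memory settings before committing to a long run.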

Examples

Example launcher scripts can be found at:

https://github.com/termsuite/termsuite-core/tree/develop/examples/cmd
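
The example commands on this page rely on shell variables that the page itself does not define. A minimal sketch of how they might be set is shown here. Every path is a hypothetical placeholder to adapt to the local installation, and the version number is the one that appears in the Usage section.

```shell
# All paths below are hypothetical placeholders.
TS_VERSION=3.0.2                      # version shown in the Usage section
TS_HOME=/opt/termsuite                # directory containing the TermSuite jar
TREETAGGER_HOME=/opt/treetagger       # home of the external POS tagger
CORPUS_PATH=/data/corpus/en           # directory of .txt documents
PREPARED_CORPUS_PATH_XMI=/data/prepared/xmi
PREPARED_CORPUS_PATH_JSON=/data/prepared/json
PREPARED_CORPUS_AS_TERMINOLOGY_PATH=/data/prepared/terminology.json

# The classpath entry used by the example commands then resolves to:
TS_JAR="$TS_HOME/termsuite-core-$TS_VERSION.jar"
echo "$TS_JAR"
```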

Preprocessing a `.txt` corpus and exporting it to a `.xmi` preprocessed corpus

This example shows how to execute the preprocessing pipeline on a textual corpus and export the preprocessed documents to their *.xmi format for later reuse (e.g. by TerminoExtractorCLI).

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.PreprocessorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --xmi-anno $PREPARED_CORPUS_PATH_XMI \
            --info

Docker

termsuite preprocess \
            -c $CORPUS_PATH \
            -l en \
            --xmi-anno $PREPARED_CORPUS_PATH_XMI \
            --info

Preprocessing a `.txt` corpus and exporting it to a `.json` preprocessed corpus

This example shows how to execute the preprocessing pipeline on a textual corpus and export the preprocessed documents to their *.json format for later reuse (e.g. by TerminoExtractorCLI).
Note: The prepared corpus in *.json format (one output directory containing one file-i.json annotation file per input file-i.txt text document) differs from the prepared corpus in imported terminology format (imported-termino.json: one single JSON file for the whole input corpus).

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.PreprocessorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --json-anno $PREPARED_CORPUS_PATH_JSON \
            --info

Docker

termsuite preprocess \
            -c $CORPUS_PATH \
            -l en \
            --json-anno $PREPARED_CORPUS_PATH_JSON \
            --info
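
Once the export has finished, the one-file-per-document layout described in the note above can be verified. This is a minimal sketch, not part of TermSuite: `missing_annotations` is a hypothetical helper that assumes each `file-i.txt` maps to a `file-i.json` placed directly in the output directory.

```shell
# Hypothetical check (not part of TermSuite): prints the name of every
# input document that has no matching .json file in the output directory.
# $1: corpus directory, $2: JSON annotation output directory
missing_annotations() {
    corpus_dir=$1
    out_dir=$2
    for txt in "$corpus_dir"/*.txt; do
        [ -e "$txt" ] || continue            # directory holds no .txt file
        name=$(basename "$txt" .txt)
        [ -f "$out_dir/$name.json" ] || echo "$name.txt"
    done
}
```

A call such as `missing_annotations "$CORPUS_PATH" "$PREPARED_CORPUS_PATH_JSON"` prints nothing when every input document has an annotation file.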

Preprocessing a `.txt` corpus and exporting it as a `json` terminology

This example shows how to execute the preprocessing pipeline on a textual corpus and export the preprocessed corpus as a terminology in its native format (one single output .json) for later reuse (e.g. by TerminoExtractorCLI).
Note: The prepared corpus in *.json format (one output directory containing one file-i.json annotation file per input file-i.txt text document) differs from the prepared corpus in imported terminology format (terminology.json: one single JSON file for the whole input corpus).

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.PreprocessorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --json $PREPARED_CORPUS_AS_TERMINOLOGY_PATH \
            --info

Docker

termsuite preprocess \
            -c $CORPUS_PATH \
            -l en \
            --json $PREPARED_CORPUS_AS_TERMINOLOGY_PATH \
            --info
