Prerequisites

  1. Java 8
  2. An external POS tagger, installed locally
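
The two prerequisites can be checked before launching the tool. The following is a minimal sketch, not part of TermSuite: the function name check_prereqs and the JAVA_BIN variable are hypothetical, and the check only confirms that a Java launcher is on the PATH and that the tagger home directory exists. It does not verify the Java version.

```shell
# Hypothetical helper (not shipped with TermSuite).
# $1: path to the POS tagger's home directory.
# JAVA_BIN: optional override of the Java launcher name (defaults to "java").
check_prereqs() {
  tagger_home="$1"
  java_bin="${JAVA_BIN:-java}"
  # Prerequisite 1: a Java launcher must be reachable.
  if ! command -v "$java_bin" >/dev/null 2>&1; then
    echo "Java not found: $java_bin" >&2
    return 1
  fi
  # Prerequisite 2: the tagger home must be an existing directory.
  if [ ! -d "$tagger_home" ]; then
    echo "POS tagger home not found: $tagger_home" >&2
    return 2
  fi
  echo "prerequisites look OK"
}
```

A typical call would be check_prereqs "$TREETAGGER_HOME". Running java -version separately shows which Java version is installed.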

PreprocessorCLI

Usage

java [-Xms256m -Xmx8g] -cp termsuite-core-3.0.2.jar \
	 fr.univnantes.termsuite.tools.PreprocessorCLI OPTIONS

Description

Applies TermSuite’s preprocessing pipeline to a given text corpus.

Mandatory options

--tagger-home, -t FILE

Path to the POS tagger’s home directory

--from-text-corpus, -c DIR

Path to the corpus directory (containing the .txt documents)

--language, -l LANG

Language of the input corpus

--json-anno, --tsv-anno, --json, --xmi-anno

At least one of these output options must be set. Each one is described under Other options.

Other options

--capped-size INT

The maximum number of terms to keep in memory while spotting. Allows processing larger volumes of input text.

--encoding, -e ENC

Encoding of the input corpus

--json FILE

Path to the JSON indexed corpus file into which all occurrences are imported

--json-anno DIR

Path to JSON export directory of all spotted term annotations

--no-occurrence (no arg)

Do not store occurrence offsets in memory while spotting. Allows processing larger volumes of input text.

--resource-dir DIR

Custom resource directory

--resource-jar FILE

Custom resource jar

--resource-url-prefix STRING

Custom resource URL prefix

--tagger STRING

Which POS tagger to use. Allowed values are: mate, tt

--tsv-anno DIR

Path to TSV export directory of all spotted term annotations

--watch TERM_LIST

List of terms (grouping keys or lemmas) to log to output

--xmi-anno DIR

Path to XMI export directory of all spotted term annotations
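
The --capped-size and --no-occurrence options listed above both target large corpora, so they are natural candidates to combine with a larger Java heap. The sketch below only prints such a command line instead of running it. The function name build_large_corpus_cmd is hypothetical, the default of 500000 for --capped-size is an arbitrary placeholder rather than a recommended value, and whether the two options work well together on a given corpus is an assumption to verify.

```shell
# Hypothetical helper (not shipped with TermSuite): prints, but does not
# run, a PreprocessorCLI command line for a large corpus.
# $1: TermSuite jar   $2: tagger home   $3: corpus directory
# $4: language code   $5: output JSON file
# $6: optional --capped-size value (placeholder default, not a recommendation)
build_large_corpus_cmd() {
  jar="$1"; tagger_home="$2"; corpus="$3"; lang="$4"; out="$5"
  capped_size="${6:-500000}"
  # printf repeats the '%s ' format for every argument, so the command
  # is printed on one line with single spaces between tokens.
  printf '%s ' java -Xms1g -Xmx8g -cp "$jar" \
    fr.univnantes.termsuite.tools.PreprocessorCLI \
    -t "$tagger_home" -c "$corpus" -l "$lang" \
    --json "$out" \
    --capped-size "$capped_size" --no-occurrence
  printf '\n'
}
```

Printing the command first makes it easy to review the options before committing to a long run on a big corpus.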

Examples

Example launcher scripts can be found at:

https://github.com/termsuite/termsuite-core/tree/develop/examples/cmd

Preprocessing a *.txt corpus and exporting it to a *.xmi preprocessed corpus

This example shows how to execute the preprocessing pipeline on a textual corpus and export the preprocessed documents to their *.xmi format for later reuse (e.g. by TerminoExtractorCLI).

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.PreprocessorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --xmi-anno $PREPARED_CORPUS_PATH_XMI \
            --info

Docker

termsuite preprocess \
            -c $CORPUS_PATH \
            -l en \
            --xmi-anno $PREPARED_CORPUS_PATH_XMI \
            --info

Preprocessing a *.txt corpus and exporting it to a *.json preprocessed corpus

This example shows how to execute the preprocessing pipeline on a textual corpus and export the preprocessed documents to their *.json format for later reuse (e.g. by TerminoExtractorCLI).
Note: The prepared corpus in *.json format (i.e. one output directory containing one file-i.json annotation file per file-i.txt input document) is different from the prepared corpus in imported terminology imported-termino.json format (one single JSON file for the whole input corpus).

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.PreprocessorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --json-anno $PREPARED_CORPUS_PATH_JSON \
            --info

Docker

termsuite preprocess \
            -c $CORPUS_PATH \
            -l en \
            --json-anno $PREPARED_CORPUS_PATH_JSON \
            --info
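
The one-to-one layout described in the note above (one file-i.json per file-i.txt) can be checked after a --json-anno run. The sketch below is an illustration only: the function name missing_annotations is hypothetical, it assumes that output file names mirror input base names exactly as the note describes, and it looks only at .txt files directly inside the corpus directory.

```shell
# Hypothetical helper (not shipped with TermSuite): prints the name of
# every input document that has no matching annotation file.
# $1: corpus directory (file-i.txt)   $2: --json-anno output directory
missing_annotations() {
  corpus_dir="$1"
  anno_dir="$2"
  for txt in "$corpus_dir"/*.txt; do
    # When the glob matches nothing it stays literal; skip that case.
    [ -e "$txt" ] || continue
    base=$(basename "$txt" .txt)
    # Report the input file if its annotation counterpart is absent.
    [ -f "$anno_dir/$base.json" ] || echo "$base.txt"
  done
}
```

Calling missing_annotations "$CORPUS_PATH" "$PREPARED_CORPUS_PATH_JSON" prints nothing when every input document has an annotation file.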

Preprocessing a *.txt corpus and exporting it as a json terminology

This example shows how to execute the preprocessing pipeline on a textual corpus and export the preprocessed corpus as a terminology in its native format (one single output .json) for later reuse (e.g. by TerminoExtractorCLI).
Note: The prepared corpus in *.json format (i.e. one output directory containing one file-i.json annotation file per file-i.txt input document) is different from the prepared corpus in imported terminology terminology.json format (one single JSON file for the whole input corpus).

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.PreprocessorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --json $PREPARED_CORPUS_AS_TERMINOLOGY_PATH \
            --info

Docker

termsuite preprocess \
            -c $CORPUS_PATH \
            -l en \
            --json $PREPARED_CORPUS_AS_TERMINOLOGY_PATH \
            --info