Prerequisites
PreprocessorCLI
Usage
java [-Xms256m -Xmx8g] -cp termsuite-core-3.0.2.jar \
fr.univnantes.termsuite.tools.PreprocessorCLI OPTIONS
Description
Applies TermSuite’s preprocessing pipeline to a given text corpus.
Mandatory options
--tagger-home, -t FILE
Path to POS tagger’s home
--from-text-corpus, -c DIR
Path to the corpus directory (containing a list of .txt documents)
--language, -l LANG
Language of the input corpus
--json-anno, --tsv-anno, --json, --xmi-anno
At least one option in --json-anno, --tsv-anno, --json, --xmi-anno must be set.
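The "at least one output option" rule can be checked in a launcher script before the JVM is started. The following is a minimal sketch, not part of TermSuite: the function name has_output_option and the paths in the example call are hypothetical.

```shell
# Return 0 if at least one of the four output options appears in the arguments,
# 1 otherwise. (Hypothetical helper; TermSuite performs its own validation.)
has_output_option() {
  for arg in "$@"; do
    case "$arg" in
      --json-anno|--tsv-anno|--json|--xmi-anno) return 0 ;;
    esac
  done
  return 1
}

# Example: validate a placeholder argument list before calling java.
if has_output_option -t /opt/treetagger -c ./corpus -l en --tsv-anno ./out/tsv; then
  echo "ok: an output option is set"
else
  echo "error: set one of --json-anno, --tsv-anno, --json, --xmi-anno" >&2
fi
```

Failing early in the script avoids paying the JVM start-up cost only to get an option error.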
Other options
--capped-size INT
The maximum number of terms to keep in memory while spotting. Allows processing of larger volumes of input text.
--encoding, -e ENC
Encoding of the input corpus
--json FILE
Path to the JSON indexed corpus file into which all occurrences are imported
Warning: At least one option in --json-anno, --tsv-anno, --json, --xmi-anno must be set.
--json-anno DIR
Path to JSON export directory of all spotted term annotations
Warning: At least one option in --json-anno, --tsv-anno, --json, --xmi-anno must be set.
--no-occurrence (no arg)
Do not store occurrence offsets in memory while spotting. Allows processing of larger volumes of input text.
--resource-dir DIR
Custom resource directory
--resource-jar FILE
Custom resource jar
--resource-url-prefix STRING
Custom resource URL prefix
--tagger STRING
Which POS tagger to use. Allowed values are: mate, tt
--tsv-anno DIR
Path to TSV export directory of all spotted term annotations
Warning: At least one option in --json-anno, --tsv-anno, --json, --xmi-anno must be set.
--watch TERM_LIST
List of terms (grouping keys or lemmas) to log to output
--xmi-anno DIR
Path to XMI export directory of all spotted term annotations
Warning: At least one option in --json-anno, --tsv-anno, --json, --xmi-anno must be set.
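For large corpora, the two memory-related options above (--capped-size and --no-occurrence) can be combined in a small wrapper. This is only a sketch under assumptions: the function name preprocess_large and the variables TS_JAR and JAVA_BIN are hypothetical, and whether both options are needed together depends on the corpus.

```shell
# Assumption: the jar sits in the current directory; override TS_JAR otherwise.
TS_JAR="${TS_JAR:-termsuite-core-3.0.2.jar}"
# Assumption: "java" is on the PATH; override JAVA_BIN to use another JVM.
JAVA_BIN="${JAVA_BIN:-java}"

# Run PreprocessorCLI with both memory-saving options enabled.
#   $1     maximum number of terms kept in memory (value for --capped-size)
#   $2...  passed through unchanged (-t, -c, -l, one output option, ...)
preprocess_large() {
  capped_size=$1
  shift
  "$JAVA_BIN" -Xms1g -Xmx8g -cp "$TS_JAR" \
    fr.univnantes.termsuite.tools.PreprocessorCLI \
    --capped-size "$capped_size" \
    --no-occurrence \
    "$@"
}
```

A call could look like preprocess_large 500000 -t "$TREETAGGER_HOME" -c "$CORPUS_PATH" -l en --tsv-anno ./out/tsv, where 500000 is an arbitrary illustrative value, not a recommended setting.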
Examples
Example launcher scripts can be found at:
https://github.com/termsuite/termsuite-core/tree/develop/examples/cmd
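The examples in this section rely on environment variables such as $TS_HOME, $TS_VERSION, $TREETAGGER_HOME and $CORPUS_PATH. A possible way to define and sanity-check them is sketched here; every path is a placeholder, and the helpers termsuite_jar and check_setup are hypothetical, not TermSuite commands.

```shell
# Placeholder locations (assumptions); adjust to your installation.
TS_HOME="${TS_HOME:-/opt/termsuite}"
TS_VERSION="${TS_VERSION:-3.0.2}"
TREETAGGER_HOME="${TREETAGGER_HOME:-/opt/treetagger}"
CORPUS_PATH="${CORPUS_PATH:-/data/corpus/en}"

# Print the jar path that the example commands expect.
termsuite_jar() {
  printf '%s/termsuite-core-%s.jar\n' "$TS_HOME" "$TS_VERSION"
}

# Report every missing prerequisite on stderr; return 1 if any is missing.
check_setup() {
  rc=0
  [ -f "$(termsuite_jar)" ] || { echo "missing jar: $(termsuite_jar)" >&2; rc=1; }
  [ -d "$TREETAGGER_HOME" ] || { echo "missing tagger home: $TREETAGGER_HOME" >&2; rc=1; }
  [ -d "$CORPUS_PATH" ] || { echo "missing corpus directory: $CORPUS_PATH" >&2; rc=1; }
  return "$rc"
}
```

Running check_setup before any of the commands in this section reports all missing paths at once rather than one per failed run.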
Preprocessing a *.txt corpus and exporting it to a *.xmi preprocessed corpus
This example shows how to execute the preprocessing pipeline on a textual corpus and export the preprocessed documents to their *.xmi format for later reuse (e.g. by TerminoExtractorCLI).
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.PreprocessorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--xmi-anno $PREPARED_CORPUS_PATH_XMI \
--info
Docker
termsuite preprocess \
-c $CORPUS_PATH \
-l en \
--xmi-anno $PREPARED_CORPUS_PATH_XMI \
--info
Preprocessing a *.txt corpus and exporting it to a *.json preprocessed corpus
This example shows how to execute the preprocessing pipeline on a textual corpus and export the preprocessed documents to their *.json format for later reuse (e.g. by TerminoExtractorCLI).
Note: The prepared corpus in *.json format (one output directory containing one file-i.json annotation file per input file-i.txt text document) is different from the prepared corpus in imported terminology format, imported-termino.json (one single JSON file for the whole input corpus).
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.PreprocessorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--json-anno $PREPARED_CORPUS_PATH_JSON \
--info
Docker
termsuite preprocess \
-c $CORPUS_PATH \
-l en \
--json-anno $PREPARED_CORPUS_PATH_JSON \
--info
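Since this export is expected to produce one file-i.json per input file-i.txt, the result of a run can be checked with a short script. The sketch below assumes that annotation files share the base name of their input document and sit directly in the output directory; the function name missing_annotations is hypothetical.

```shell
# Print the name of every input document that has no matching JSON annotation
# file. Assumption: file-i.txt maps to file-i.json in a flat output directory.
#   $1  corpus directory (the value passed to -c)
#   $2  annotation directory (the value passed to --json-anno)
missing_annotations() {
  corpus_dir=$1
  anno_dir=$2
  for txt in "$corpus_dir"/*.txt; do
    [ -e "$txt" ] || continue   # the corpus directory holds no .txt files
    base=$(basename "$txt" .txt)
    [ -f "$anno_dir/$base.json" ] || echo "$base.txt"
  done
}
```

For example, missing_annotations "$CORPUS_PATH" "$PREPARED_CORPUS_PATH_JSON" prints nothing when every document has been annotated.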
Preprocessing a *.txt corpus and exporting it as a json terminology
This example shows how to execute the preprocessing pipeline on a textual corpus and export the preprocessed corpus as a terminology in its native format (one single output .json) for later reuse (e.g. by TerminoExtractorCLI).
Note: The prepared corpus in *.json format (one output directory containing one file-i.json annotation file per input file-i.txt text document) is different from the prepared corpus in imported terminology format, terminology.json (one single JSON file for the whole input corpus).
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.PreprocessorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--json $PREPARED_CORPUS_AS_TERMINOLOGY_PATH \
--info
Docker
termsuite preprocess \
-c $CORPUS_PATH \
-l en \
--json $PREPARED_CORPUS_AS_TERMINOLOGY_PATH \
--info
