Prerequisites
PreprocessorCLI
Usage
java [-Xms256m -Xmx8g] -cp termsuite-core-3.0.2.jar \
fr.univnantes.termsuite.tools.PreprocessorCLI OPTIONS
Description
Applies TermSuite’s preprocessing to a given text corpus.
Mandatory options
--tagger-home, -t FILE
Path to POS tagger’s home
--from-text-corpus, -c DIR
Directory to corpus (containing a list of .txt documents)
--language, -l LANG
Language of the input corpus
--json-anno, --tsv-anno, --json, --xmi-anno
At least one of --json-anno, --tsv-anno, --json, --xmi-anno must be set.
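The "at least one output option" rule can be checked in a launcher script before the JVM is started. The following is a minimal sketch; the function name has_output_option is hypothetical and is not part of TermSuite, and it only inspects the four option names listed above.

```shell
# Hypothetical helper (not part of TermSuite): succeeds if at least one of
# the output options documented above appears among the arguments.
has_output_option() {
  for arg in "$@"; do
    case "$arg" in
      --json-anno|--tsv-anno|--json|--xmi-anno) return 0 ;;
    esac
  done
  return 1
}

# Possible use inside a launcher script:
#   has_output_option "$@" || { echo "no output option given" >&2; exit 1; }
```

Checking up front avoids paying the JVM startup cost only to be told that an output option is missing.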
Other options
--capped-size INT
The maximum number of terms to keep in memory while spotting. Makes it possible to process larger volumes of input text.
--encoding, -e ENC
Encoding of the input corpus
--json FILE
Path to the JSON indexed corpus file into which all occurrences are imported
Warning: At least one of --json-anno, --tsv-anno, --json, --xmi-anno must be set.
--json-anno DIR
Path to JSON export directory of all spotted term annotations
Warning: At least one of --json-anno, --tsv-anno, --json, --xmi-anno must be set.
--no-occurrence (no arg)
Do not store occurrence offsets in memory while spotting. Makes it possible to process larger volumes of input text.
--resource-dir DIR
Custom resource directory
--resource-jar FILE
Custom resource jar
--resource-url-prefix STRING
Custom resource URL prefix
--tagger STRING
Which POS tagger to use. Allowed values are: mate, tt
--tsv-anno DIR
Path to TSV export directory of all spotted term annotations
Warning: At least one of --json-anno, --tsv-anno, --json, --xmi-anno must be set.
--watch TERM_LIST
List of terms (grouping keys or lemmas) to log to output
--xmi-anno DIR
Path to XMI export directory of all spotted term annotations
Warning: At least one of --json-anno, --tsv-anno, --json, --xmi-anno must be set.
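For large corpora, the two memory-related options described above (--capped-size and --no-occurrence) can be combined. The sketch below only prints the resulting command line instead of running it. The value 500000 is an arbitrary assumption for illustration, not a recommended setting, and the variable PREPARED_CORPUS_PATH_TSV is a hypothetical name chosen by analogy with the examples in this page.

```shell
# Assumed value for illustration only; tune it to the memory available.
CAPPED_SIZE=500000

# Print (rather than run) a PreprocessorCLI command line for a large corpus.
# Remove the leading "echo" to actually launch the JVM.
large_corpus_cmd() {
  echo java -Xms1g -Xmx8g -cp "$TS_HOME/termsuite-core-$TS_VERSION.jar" \
    fr.univnantes.termsuite.tools.PreprocessorCLI \
    -t "$TREETAGGER_HOME" \
    -c "$CORPUS_PATH" \
    -l en \
    --tsv-anno "$PREPARED_CORPUS_PATH_TSV" \
    --capped-size "$CAPPED_SIZE" \
    --no-occurrence
}
```

Printing the command first makes it easy to review the options before committing to a long-running job.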
Examples
Example launcher scripts can be found at:
https://github.com/termsuite/termsuite-core/tree/develop/examples/cmd
Preprocessing a *.txt corpus and exporting it to a *.xmi preprocessed corpus
This example shows how to execute the preprocessing pipeline on a textual corpus and export the preprocessed documents in *.xmi format for later reuse (e.g. by TerminoExtractorCLI).
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.PreprocessorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--xmi-anno $PREPARED_CORPUS_PATH_XMI \
--info
Docker
termsuite preprocess \
-c $CORPUS_PATH \
-l en \
--xmi-anno $PREPARED_CORPUS_PATH_XMI \
--info
Preprocessing a *.txt corpus and exporting it to a *.json preprocessed corpus
This example shows how to execute the preprocessing pipeline on a textual corpus and export the preprocessed documents in *.json format for later reuse (e.g. by TerminoExtractorCLI).
Note: The prepared corpus in *.json format (i.e. one output directory containing one file-i.json annotation file per input file-i.txt text document) is different from the prepared corpus in imported terminology format, imported-termino.json (one single JSON file for the whole input corpus).
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.PreprocessorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--json-anno $PREPARED_CORPUS_PATH_JSON \
--info
Docker
termsuite preprocess \
-c $CORPUS_PATH \
-l en \
--json-anno $PREPARED_CORPUS_PATH_JSON \
--info
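After such a run, the output directory can be checked against the input corpus. The sketch below lists every input document that has no matching annotation file. It assumes the file-i.txt to file-i.json naming described in the note above; the function name missing_annotations is hypothetical and not part of TermSuite.

```shell
# Hypothetical helper (not part of TermSuite): print the name of each
# input .txt document that has no same-named .json file in the
# annotation directory.
missing_annotations() {
  corpus_dir=$1
  anno_dir=$2
  for txt in "$corpus_dir"/*.txt; do
    [ -e "$txt" ] || continue            # the glob matched nothing
    base=$(basename "$txt" .txt)
    [ -f "$anno_dir/$base.json" ] || echo "$base.txt"
  done
}

# Possible use:
#   missing_annotations "$CORPUS_PATH" "$PREPARED_CORPUS_PATH_JSON"
```

An empty result suggests that every input document was exported; any listed file is worth re-checking in the preprocessor logs.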
Preprocessing a *.txt corpus and exporting it as a json terminology
This example shows how to execute the preprocessing pipeline on a textual corpus and export the preprocessed corpus as a terminology in its native format (one single output .json file) for later reuse (e.g. by TerminoExtractorCLI).
Note: The prepared corpus in *.json format (i.e. one output directory containing one file-i.json annotation file per input file-i.txt text document) is different from the prepared corpus in imported terminology format, terminology.json (one single JSON file for the whole input corpus).
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.PreprocessorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--json $PREPARED_CORPUS_AS_TERMINOLOGY_PATH \
--info
Docker
termsuite preprocess \
-c $CORPUS_PATH \
-l en \
--json $PREPARED_CORPUS_AS_TERMINOLOGY_PATH \
--info