- Prerequisites
- TerminologyExtractorCLI
- Usage
- Description
- Mandatory options
- Other options
  - --capped-size INT
  - --context-assoc-rate INT or FLOAT
  - --context-coocc-th INT or FLOAT
  - --context-scope INT
  - --contextualize (no arg)
  - --disable-derivative-splitting (no arg)
  - --disable-gathering (no arg)
  - --disable-merging (no arg)
  - --disable-morphology (no arg)
  - --disable-native-splitting (no arg)
  - --disable-post-processing (no arg)
  - --disable-prefix-splitting (no arg)
  - --enable-semantic-gathering (no arg)
  - --encoding, -e ENC
  - --from-prepared-corpus DIR
  - --from-text-corpus, -c DIR
  - --graphical-similarity-th INT or FLOAT
  - --json FILE
  - --language, -l LANG
  - --nb-semantic-candidates INT
  - --no-occurrence (no arg)
  - --post-filter-keep-variants (no arg)
  - --post-filter-max-variants INT
  - --post-filter-property STRING
  - --post-filter-th INT or FLOAT
  - --post-filter-top-n INT
  - --postproc-affix-score-th INT or FLOAT
  - --postproc-affix-score-th INT or FLOAT
  - --postproc-independance-th INT or FLOAT
  - --postproc-variation-score-th INT or FLOAT
  - --pre-filter-max-variants INT
  - --pre-filter-property STRING
  - --pre-filter-th INT or FLOAT
  - --pre-filter-top-n INT
  - --ranking-asc STRING
  - --ranking-desc STRING
  - --resource-dir DIR
  - --resource-jar FILE
  - --resource-url-prefix STRING
  - --semantic-dico-only (no arg)
  - --semantic-distance INT or FLOAT
  - --semantic-similarity-th INT or FLOAT
  - --synonyms-dico FILE
  - --tagger STRING
  - --tagger-home, -t FILE
  - --tbx FILE
  - --tsv FILE
  - --tsv-hide-headers (no arg)
  - --tsv-hide-variants (no arg)
  - --tsv-properties STRING
  - --watch TERM_LIST
- Examples
- Filtering: cleaning by threshold value on any property
- Filtering: keeping only top n terms
- Filtering: keeping term variants
- Exporting terminology to TSV
- Extract a terminology ready for alignment
- Deactivating variant gathering
- Activating semantic variant detection
- Semantic variant detection with custom dictionary
- Semantic variant detection without distributional variants (dico only)
- Extracting a terminology from a very large corpus
- Extracting a terminology from preprocessed corpus
- Customizing Terminology Post-Processing
- Filtering: cleaning terminology before gathering step (--pre-filter-*)
- Debugging terminology extraction
- Using customized linguistic resources (Advanced users)
Prerequisites
TerminologyExtractorCLI
Usage
java [-Xms256m -Xmx8g] -cp termsuite-core-3.0.2.jar \
fr.univnantes.termsuite.tools.TerminologyExtractorCLI OPTIONS
Description
Extracts terminology from a domain-specific textual corpus (or preprocessed corpus).
Mandatory options
--from-text-corpus, --from-prepared-corpus
Exactly one of --from-text-corpus and --from-prepared-corpus must be set.
--tsv, --tbx, --json
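The exactly-one rule for the corpus options can be checked before the JVM is even started. The following is a minimal sketch in POSIX shell. The function names are hypothetical and not part of TermSuite, and the check only recognizes the option spellings listed on this page (it does not try to tell options from option values).

```shell
# Hypothetical helpers (not part of TermSuite).

# count_corpus_options ARGS...: prints how many corpus source options
# (--from-text-corpus, its alias -c, --from-prepared-corpus) appear in ARGS.
count_corpus_options() {
  n=0
  for arg in "$@"; do
    case "$arg" in
      --from-text-corpus|-c|--from-prepared-corpus) n=$((n + 1)) ;;
    esac
  done
  echo "$n"
}

# has_exactly_one_corpus ARGS...: succeeds only when exactly one corpus
# source option is present.
has_exactly_one_corpus() {
  [ "$(count_corpus_options "$@")" -eq 1 ]
}
```

A launcher script could call `has_exactly_one_corpus "$@" || exit 1` before invoking java, so that a misconfigured command fails fast.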
Other options
--capped-size
INT
The maximum number of terms to keep in memory while spotting. Allows processing larger volumes of input text.
--context-assoc-rate
INT or FLOAT
Association rate measure used to normalize context vectors. Allowed values are: MutualInformation, LogLikelihood
Warning: This option can only be set when option --contextualize is already set.
--context-coocc-th
INT or FLOAT
Sets a minimum frequency threshold for co-terms to appear in context vectors.
Warning: This option can only be set when option --contextualize is already set.
--context-scope
INT
Radius of the single-word term window used during contextualization.
Warning: This option can only be set when option --contextualize is already set.
--contextualize
(no arg)
Activates the contextualizer
--disable-derivative-splitting
(no arg)
Disable morphological derivative splitting
--disable-gathering
(no arg)
Disable variant term gathering
--disable-merging
(no arg)
Disable graphical term merging
--disable-morphology
(no arg)
Disable morphology analysis (native, prefix, derivation splitting)
--disable-native-splitting
(no arg)
Disable morphological native splitting
--disable-post-processing
(no arg)
Disable post-gathering scoring and filtering
--disable-prefix-splitting
(no arg)
Disable morphological prefix splitting
--enable-semantic-gathering
(no arg)
Enable semantic term gathering (monolingual alignment)
--encoding, -e
ENC
Encoding of the input corpus
--from-prepared-corpus
DIR
A file or directory path. Starts the terminology extraction pipeline from an XMI corpus or an imported terminology JSON file instead of a text corpus.
Warning: Exactly one of --from-text-corpus and --from-prepared-corpus must be set.
--from-text-corpus, -c
DIR
Path to the corpus directory (containing a list of .txt documents)
Warning: Exactly one of --from-text-corpus and --from-prepared-corpus must be set.
--graphical-similarity-th
INT or FLOAT
Graphical similarity threshold
--json
FILE
Outputs terminology to JSON file
--language, -l
LANG
Language of the input corpus
--nb-semantic-candidates
INT
Maximum number of semantic variants for each term
Warning: This option can only be set when option --enable-semantic-gathering is already set.
--no-occurrence
(no arg)
Do not store occurrence offsets in memory while spotting. Allows processing larger volumes of input text.
--post-filter-keep-variants
(no arg)
Keep variants during post-gathering filtering even if they would otherwise be filtered out
Warning: This option can only be set when option --post-filter-property is already set.
--post-filter-max-variants
INT
The maximum number of variants to keep during post-gathering filtering
Warning: This option can only be set when option --post-filter-property is already set.
--post-filter-property
STRING
Enables post-gathering filtering based on the given property. Allowed values are: rank, documentFrequency, frequencyNorm, generalFrequencyNorm, specificity, frequency, OrthographicScore, IndependantFrequency, Independance, tf-idf, spec-idf, SwtSize, Depth
--post-filter-th
INT or FLOAT
Threshold value of post-gathering filter
Warning: This option can only be set when option --post-filter-property is already set.
Warning: At most one of --post-filter-th and --post-filter-top-n may be set.
--post-filter-top-n
INT
N value for post-gathering filtering over top N terms
Warning: This option can only be set when option --post-filter-property is already set.
Warning: At most one of --post-filter-th and --post-filter-top-n may be set.
--postproc-affix-score-th
INT or FLOAT
Minimum affix score. Variations under that threshold are filtered out.
Warning: This option cannot be set when option --disable-post-processing is already set.
--postproc-affix-score-th
INT or FLOAT
Minimum variation orthographic score. Variations under that threshold are filtered out.
Warning: This option cannot be set when option --disable-post-processing is already set.
--postproc-independance-th
INT or FLOAT
Term independence score threshold. Terms under the threshold are filtered out.
Warning: This option cannot be set when option --disable-post-processing is already set.
--postproc-variation-score-th
INT or FLOAT
Filters out variations with scores under the given threshold
Warning: This option cannot be set when option --disable-post-processing is already set.
--pre-filter-max-variants
INT
The maximum number of variants to keep during pre-gathering filtering
Warning: This option can only be set when option --pre-filter-property is already set.
--pre-filter-property
STRING
Enables pre-gathering filtering based on the given property. Allowed values are: rank, documentFrequency, frequencyNorm, generalFrequencyNorm, specificity, frequency, OrthographicScore, IndependantFrequency, Independance, tf-idf, spec-idf, SwtSize, Depth
--pre-filter-th
INT or FLOAT
Threshold value of pre-gathering filter
Warning: This option can only be set when option --pre-filter-property is already set.
Warning: At most one of --pre-filter-top-n and --pre-filter-th may be set.
--pre-filter-top-n
INT
N value for pre-gathering filtering over top N terms
Warning: This option can only be set when option --pre-filter-property is already set.
Warning: At most one of --pre-filter-top-n and --pre-filter-th may be set.
--ranking-asc
STRING
Sets the output ranking property in ASCENDING order. Allowed values are: rank, documentFrequency, frequencyNorm, generalFrequencyNorm, specificity, frequency, OrthographicScore, IndependantFrequency, Independance, tf-idf, spec-idf, SwtSize, Depth
Warning: At most one of --ranking-asc and --ranking-desc may be set.
--ranking-desc
STRING
Sets the output ranking property in DESCENDING order. Allowed values are: rank, documentFrequency, frequencyNorm, generalFrequencyNorm, specificity, frequency, OrthographicScore, IndependantFrequency, Independance, tf-idf, spec-idf, SwtSize, Depth
Warning: At most one of --ranking-asc and --ranking-desc may be set.
--resource-dir
DIR
Custom resource directory
--resource-jar
FILE
Custom resource jar
--resource-url-prefix
STRING
Custom resource url prefix
--semantic-dico-only
(no arg)
Find semantic variants with the help of the dictionary only, without alignment.
Warning: This option can only be set when option --enable-semantic-gathering is already set.
--semantic-distance
INT or FLOAT
Similarity measure used for semantic alignment. Allowed values are: Cosine, Jaccard
Warning: This option can only be set when option --enable-semantic-gathering is already set.
--semantic-similarity-th
INT or FLOAT
Minimum semantic similarity threshold for semantic gathering (monolingual alignment)
Warning: This option can only be set when option --enable-semantic-gathering is already set.
--synonyms-dico
FILE
Custom synonyms dictionary for semantic variant detection.
Warning: This option can only be set when option --enable-semantic-gathering is already set.
--tagger
STRING
Which POS tagger to use. Allowed values are: mate, tt
--tagger-home, -t
FILE
Path to POS tagger’s home
Warning: This option can only be set when option --from-text-corpus is already set.
--tbx
FILE
Outputs terminology to TBX file
--tsv
FILE
Outputs terminology to TSV file
--tsv-hide-headers
(no arg)
Hide column headers
Warning: This option can only be set when option --tsv is already set.
--tsv-hide-variants
(no arg)
Do not show the variants of each term
Warning: This option can only be set when option --tsv is already set.
--tsv-properties
STRING
The comma-separated list of columns of the TSV file. Allowed values are: rank, isSingleWord, documentFrequency, frequencyNorm, generalFrequencyNorm, specificity, frequency, OrthographicScore, IndependantFrequency, Independance, pilot, lemma, tf-idf, spec-idf, groupingKey, pattern, spottingRule, isFixedExpression, SwtSize, Filtered, Depth, VariationRank, VariationRule, DerivationType, GraphSimilarity, Score, AffixGain, AffixSpec, AffixRatio, AffixScore, NormalizedAffixScore, AffixOrthographicScore, ExtensionScore, NormalizedExtensionScore, HasExtensionAffix, IsExtension, VariantBagFrequency, SourceGain, NormalizedSourceGain, IsInfered, IsGraphical, IsDerivation, IsPrefixation, IsSyntagmatic, IsMorphological, IsSemantic, Distributional, SemanticSimilarity, Dico, SemanticScore
Warning: This option can only be set when option --tsv is already set.
--watch
TERM_LIST
List of terms (grouping keys or lemmas) to log to output
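Many of the options above carry warnings of the form "can only be set when option X is already set" or "at most one of X and Y". A launcher script can mirror those rules before calling TermSuite. The sketch below is in POSIX shell. All three function names are hypothetical and not part of TermSuite, and the checks compare argument strings literally, so option aliases such as -c have to be checked separately.

```shell
# Hypothetical helpers (not part of TermSuite) mirroring the option warnings.

# has_option NAME ARGS...: succeeds when NAME appears among ARGS.
has_option() {
  name=$1; shift
  for arg in "$@"; do
    if [ "$arg" = "$name" ]; then return 0; fi
  done
  return 1
}

# check_requires DEPENDENT REQUIRED ARGS...: fails when DEPENDENT is given
# without REQUIRED.
check_requires() {
  dependent=$1; required=$2; shift 2
  if has_option "$dependent" "$@" && ! has_option "$required" "$@"; then
    echo "$dependent requires $required" >&2
    return 1
  fi
  return 0
}

# check_at_most_one A B ARGS...: fails when both A and B are given.
check_at_most_one() {
  a=$1; b=$2; shift 2
  if has_option "$a" "$@" && has_option "$b" "$@"; then
    echo "at most one of $a and $b may be set" >&2
    return 1
  fi
  return 0
}
```

For instance, `check_requires --post-filter-th --post-filter-property "$@"` and `check_at_most_one --ranking-asc --ranking-desc "$@"` reproduce two of the warnings listed above.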
Examples
Example launcher scripts can be found at:
https://github.com/termsuite/termsuite-core/tree/develop/examples/cmd
Filtering: cleaning by threshold value on any property
Terminology can be filtered on any property after variant gathering with the --post-filter-property option. When given together with the --post-filter-th option, TermSuite filters by threshold on the given property.
Any numeric property can be used for post-processing filtering. See all TermSuite properties.
For example, this launcher filters out all terms having dFreq < 3, i.e. appearing in fewer than 3 different documents of the source corpus.
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--post-filter-property dFreq \
--post-filter-th 3 \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pattern,pilot,spec,freq,dFreq" \
--info
Docker
termsuite extract \
-c $CORPUS_PATH \
-l en \
--post-filter-property dFreq \
--post-filter-th 3 \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pattern,pilot,spec,freq,dFreq" \
--info
Output TSV file (2673 lines)
See how to interpret the terminology TSV output file.
# type pattern pilot spec freq dFreq
1 T N rotor 4,82 848 30
2 T N N wind turbine 4,56 1855 37
2 V[s] N N N wind turbine rotor 3,38 31 12
2 V[s] N N N WIND TURBINE APPLICATIONS 3,83 86 3
2 V[s] N N N wind turbine blades 3,57 48 12
2 V[s] A N N offshore wind turbine 3,26 47 7
2 V[s] N N N wind turbine noise 3,53 43 3
2 V[s] N N N wind turbine concepts 3,46 37 4
2 V[s] N N N wind turbine generator 3,02 27 8
2 V[s] A N N small wind turbines 3,41 33 8
2 V[s] N N N MW wind turbine 2,89 10 5
2 V[s]+ N N N wind turbine technology 3,34 28 10
2 V[s] N N N wind turbine system 3,40 32 7
2 V[s] A N N modern wind turbines 2,82 17 7
2 V[s] N N N wind turbine tower 3,07 15 9
2 V[s] A N N large wind turbines 3,12 17 10
3 T N N wind energy 4,51 414 32
3 V[s] N N N wind energy potential 3,07 15 5
3 V[s] A N N offshore wind energy 3,56 47 7
3 V[s] N N N wind energy development 3,29 25 5
3 V[s] N N N Wind Energy Facility 3,04 14 5
3 V[s] N N N wind energy projects 2,97 12 8
3 V[s] N N N wind energy technology 2,67 6 5
3 V[s] N N N wind energy research 2,67 6 5
3 V[s] N N N wind energy application 2,50 4 3
3 V[s] N N N wind energy resources 2,50 4 3
4 T N N wind power 4,34 278 26
4 V[s] N N N wind turbine power 2,89 10 6
4 V[s] N N N Wind Power Plant 3,76 74 9
4 V[s] A N N offshore wind power 3,01 13 4
--- lines 31 to 2662 ---
2524 T N round 0,03 5 4
2525 T N Street 0,03 8 4
2526 T N News 0,03 3 3
2527 T N house 0,03 9 4
2528 T N trust 0,02 4 4
2529 T N London 0,02 4 4
2530 T N money 0,02 6 4
2531 T N England 0,01 3 3
2532 T N Scotland 0,01 11 4
2533 T N weeks 0,01 4 4
--- END OF FILE
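One way to double-check a threshold filter is to count the rows of the output file that still fall below the threshold. The sketch below uses awk from a POSIX shell function. It assumes that the real file is tab-separated, that its first line is the header row shown above, and that numbers may use a decimal comma as in the sample. The function name count_below is hypothetical.

```shell
# count_below FILE COLUMN THRESHOLD
# Prints the number of data rows whose COLUMN value is below THRESHOLD.
# Assumes a tab-separated file whose first line is the header row.
# Exits with status 2 when COLUMN is not found in the header.
count_below() {
  awk -F '\t' -v col="$2" -v th="$3" '
    NR == 1 {                         # locate the column by its header name
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { missing = 1; exit 2 }
      next
    }
    {
      v = $idx
      sub(/,/, ".", v)                # "4,82" -> "4.82"
      if (v + 0 < th + 0) n++
    }
    END { if (!missing) print n + 0 }
  ' "$1"
}
```

Under these assumptions, `count_below "$TSV_OUTPUT_FILE" dFreq 3` gives the number of rows that remain below the threshold used in the launcher above.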
Filtering: keeping only top n terms
Terminology can be filtered on any property after variant gathering with the --post-filter-property option. When given together with the --post-filter-top-n option, TermSuite keeps only the top n terms sorted by the given property (in descending order).
Any numeric property can be used for post-processing filtering. See all TermSuite properties.
For example, this launcher keeps only the 500 most specific terms.
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--post-filter-property spec \
--post-filter-top-n 500 \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pattern,pilot,spec,freq,dFreq" \
--info
Docker
termsuite extract \
-c $CORPUS_PATH \
-l en \
--post-filter-property spec \
--post-filter-top-n 500 \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pattern,pilot,spec,freq,dFreq" \
--info
Output TSV file (569 lines)
See how to interpret the terminology TSV output file.
# type pattern pilot spec freq dFreq
1 T N rotor 4,82 848 30
2 T N N wind turbine 4,68 1854 37
2 V[s] N N N wind turbine rotor 3,38 31 12
2 V[s] N N N WIND TURBINE APPLICATIONS 3,83 86 3
2 V[s] N N N WIND TURBINE SOUND 3,74 71 2
2 V[s] N N N wind turbine blades 3,57 48 12
2 V[s] A N N offshore wind turbine 3,26 47 7
2 V[s] N N N DFIG wind turbine 3,25 23 2
2 V[s] N N N wind turbine noise 3,53 43 3
2 V[s] N N N wind turbine concepts 3,46 37 4
2 V[s] N N N wind turbine generator 3,02 27 8
2 V[s] A N N Domestic Wind Turbines 3,35 29 1
2 V[s] A N N small wind turbines 3,41 33 8
2 V[s] N N N wind turbine technology 3,34 28 10
2 V[s]+ N N N wind turbine system 3,40 32 7
2 V[s] N N N wind turbine syndrome 3,21 21 2
2 V[s] N N N wind turbine tower 3,07 15 9
2 V[s] A N N large wind turbines 3,12 17 10
3 T N N wind energy 4,51 414 32
3 V[s] N N N wind energy potential 3,07 15 5
3 V[s] A N N offshore wind energy 3,56 47 7
3 V[s] N N N wind energy development 3,29 25 5
3 V[s] N N N Wind Energy Facility 3,04 14 5
3 V[s] N N N wind energy projects 2,97 12 8
4 T N N wind power 4,34 278 26
4 V[s] N N N Wind Power Plant 3,76 74 9
4 V[s] N N N wind power stations 3,27 24 2
4 V[s] A N N offshore wind power 3,01 13 4
4 V[s] N N N Wind Power Project 3,17 19 5
4 V[s] N N N wind power development 3,04 14 5
--- lines 31 to 558 ---
491 T N RMS 2,93 11 1
492 T N N migration routes 2,93 11 4
493 T A N electrical power 2,93 11 9
494 T N Sediment 2,93 11 2
495 T N N N blade flow field 2,93 11 1
496 T N Sadock 2,93 11 1
497 T A N electric lines 2,93 11 1
498 T N Guajira 2,93 11 1
499 T N N design procedure 2,93 11 2
500 T R dynamically 2,93 11 4
--- END OF FILE
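The output file mixes term rows (type T) and variant rows (types starting with V), so its line count is larger than the requested top n. A small awk sketch can count both kinds of rows separately. It assumes a tab-separated file with one header line and the type in the second column, as in the sample above. The function name count_types is hypothetical.

```shell
# count_types FILE: prints "<terms> <variants>" for a TermSuite-style TSV.
# Assumes tab-separated columns, a single header line, and the row type
# ("T", "V[s]", "V[s]+", ...) in the second column.
count_types() {
  awk -F '\t' '
    NR == 1 { next }                  # skip the header row
    $2 == "T" { t++ }                 # term rows
    $2 ~ /^V/ { v++ }                 # variant rows of any kind
    END { print t + 0, v + 0 }
  ' "$1"
}
```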
Filtering: keeping term variants
Filtering a terminology often leaves few term variants in the output, because most term variants have low frequencies. You can configure TermSuite to keep all variants of a term even if the filter would otherwise remove them.
Any numeric property can be used for post-processing filtering. See all TermSuite properties.
This launcher script keeps the top 500 terms by frequency, plus all 1-degree variants of these 500 terms.
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--post-filter-property freq \
--post-filter-top-n 500 \
--post-filter-keep-variants \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pattern,pilot,spec,freq,dFreq" \
--info
Docker
termsuite extract \
-c $CORPUS_PATH \
-l en \
--post-filter-property freq \
--post-filter-top-n 500 \
--post-filter-keep-variants \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pattern,pilot,spec,freq,dFreq" \
--info
Output TSV file (709 lines)
See how to interpret the terminology TSV output file.
# type pattern pilot spec freq dFreq
1 T N rotor 4,82 848 30
2 T N N wind turbine 4,56 1855 37
2 V[s] A N N ARI-450 wind turbine 1,90 2 1
2 V[s] N N N vertical-axis wind turbine 2,37 6 2
2 V[s] N N N wind turbine rotor 3,38 31 12
2 V[s] N N N WIND TURBINE APPLICATIONS 3,83 86 3
2 V[s] N N N WIND TURBINE SOUND 3,74 71 2
2 V[s] N N N wind turbine blades 3,57 48 12
2 V[s] A N N offshore wind turbine 3,26 47 7
2 V[s] N N N DFIG wind turbine 3,25 23 2
2 V[s] N N N wind turbine noise 3,53 43 3
2 V[s] N N N wind turbine concepts 3,46 37 4
2 V[s] N N N wind turbine generator 3,02 27 8
2 V[s] A N N Domestic Wind Turbines 3,35 29 1
2 V[s] N N N direct-drive wind turbines 2,77 15 1
2 V[s] A N N small wind turbines 3,41 33 8
2 V[s] N N N MW wind turbine 2,89 10 5
2 V[s] N N N wind turbine technology 3,34 28 10
2 V[s] N N N wind turbine system 3,40 32 7
2 V[s] N N N wind turbine placement 2,89 20 1
2 V[s] N N N wind turbine syndrome 3,21 21 2
2 V[s] A N N small-scale wind turbines 2,50 4 1
2 V[s] N N N wind turbine configuration 2,63 11 2
2 V[s] A N N modern wind turbines 2,82 17 7
2 V[s] A N N three-bladed wind turbines 2,80 8 2
2 V[s] N N N wind turbine tower 3,07 15 9
2 V[s] A N N large wind turbines 3,12 17 10
3 T N N wind energy 4,51 414 32
3 V[s] N N N wind energy potential 3,07 15 5
3 V[s] A N N offshore wind energy 3,56 47 7
--- lines 31 to 698 ---
684 T N place 0,15 70 19
685 T N team 0,14 57 11
686 T N life 0,13 62 22
687 T N years 0,13 300 27
688 T N business 0,13 48 15
689 T N parties 0,10 45 6
690 T N government 0,10 58 17
691 T N people 0,09 101 11
692 T N months 0,09 52 10
693 T N company 0,09 66 19
--- END OF FILE
Exporting terminology to TSV
Export the extracted terminology to TSV simply by setting the --tsv option.
You can select the term and variant properties that appear as columns in the TSV file with the --tsv-properties option. The given value must be a comma-separated list of valid property names. See all available properties.
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "key,pattern,pilot,spec,freq,ind" \
--info
Docker
termsuite extract \
-c $CORPUS_PATH \
-l en \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "key,pattern,pilot,spec,freq,ind" \
--info
Output TSV file (13352 lines)
See how to interpret the terminology TSV output file.
# type key pattern pilot spec freq ind
1 T n: rotor N rotor 4,82 848 0,17
2 T nn: wind turbine N N wind turbine 4,56 1855 0,28
2 V[s] ann: ari-450 wind turbine A N N ARI-450 wind turbine 1,90 2 1,00
2 V[s]+ nnn: vertical-axis wind turbine N N N vertical-axis wind turbine 2,37 6 0,50
2 V[s] nnn: wind turbine rotor N N N wind turbine rotor 3,38 31 0,13
2 V[s] nnn: wind turbine application N N N WIND TURBINE APPLICATIONS 3,83 86 0,59
2 V[s] nnn: wind turbine sound N N N WIND TURBINE SOUND 3,74 71 0,87
2 V[s] nnn: wind turbine blade N N N wind turbine blades 3,57 48 0,29
2 V[s] ann: offshore wind turbine A N N offshore wind turbine 3,26 47 0,43
2 V[s] nnn: dfig wind turbine N N N DFIG wind turbine 3,25 23 0,30
2 V[s]+ nnn: wind turbine noise N N N wind turbine noise 3,53 43 0,60
2 V[s]+ nnn: wind turbine concept N N N wind turbine concepts 3,46 37 0,43
2 V[s]+ nnn: wind turbine generator N N N wind turbine generator 3,02 27 0,44
2 V[s] ann: domestic wind turbine A N N Domestic Wind Turbines 3,35 29 0,24
2 V[s] nnn: direct-drive wind turbine N N N direct-drive wind turbines 2,77 15 0,67
--- END OF FILE
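The sample output above writes numbers with a decimal comma (4,82), which many downstream tools do not parse as numbers. The following awk sketch rewrites such fields with a decimal point. It assumes tab-separated input and only touches fields that consist entirely of digits, one comma, and digits, so term text containing commas is left alone. The function name normalize_decimals is hypothetical.

```shell
# normalize_decimals: reads TSV on stdin and writes it to stdout with the
# decimal comma replaced by a dot in purely numeric fields ("4,82" -> "4.82").
normalize_decimals() {
  awk 'BEGIN { FS = OFS = "\t" }
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^-?[0-9]+,[0-9]+$/) sub(/,/, ".", $i)
      print
    }'
}
```

Typical use would be `normalize_decimals < "$TSV_OUTPUT_FILE" > normalized.tsv`, where normalized.tsv is an arbitrary output name.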
Extract a terminology ready for alignment
To extract a terminology that is reusable by TermSuite’s aligner, you only need to set the --contextualize option and export the terminology to its native format (JSON) with option --json.
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--contextualize \
--json $JSON_OUTPUT_FILE \
--info
Docker
termsuite extract \
-c $CORPUS_PATH \
-l en \
--contextualize \
--json $JSON_OUTPUT_FILE \
--info
Deactivating variant gathering
Variant gathering can be deactivated with the --disable-gathering option. This should make the terminology extraction pipeline terminate significantly faster.
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--disable-gathering \
--tsv $TSV_OUTPUT_FILE \
--info
Docker
termsuite extract \
-c $CORPUS_PATH \
-l en \
--disable-gathering \
--tsv $TSV_OUTPUT_FILE \
--info
Output TSV file (11931 lines)
See how to interpret the terminology TSV output file.
# type key freq
1 T nn: wind turbine 1852
2 T n: rotor 848
3 T nn: wind energy 414
4 T nn: wind speed 331
5 T nn: wind power 278
6 T n: airfoil 236
7 T n: voltage 214
8 T a: due 185
9 T n: hydrogen 156
10 T n: natura 144
11 T n: vortex 132
12 T n: wecs 116
13 T n: hub 114
14 T n: deis 110
15 T nn: wind generator 105
16 T nn: pitch angle 100
17 T npn: angle of attack 97
18 T nn: blade element 96
19 T n: infrasound 96
20 T nn: power coefficient 94
21 T nn: rotor blade 94
22 T n: tsr 91
23 T nn: turbine blade 91
24 T nnn: wind turbine application 86
25 T n: annex 83
26 T nn: turbine sound 80
27 T nn: farm development 79
28 T ann: low frequency sound 78
29 T n: ordinance 77
30 T nn: blade design 77
--- END OF FILE
Activating semantic variant detection
You can activate semantic variant detection with option --enable-semantic-gathering. Semantic variant gathering can be significantly slower. In the TSV output, semantic variants are flagged as V[h]. The TSV properties isDico and isDistrib tell whether the variant was detected with a synonym dictionary or with distributional alignment.
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--enable-semantic-gathering \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pilot,freq,spec,semScore,isDico,isDistrib" \
--info
Docker
termsuite extract \
-c $CORPUS_PATH \
-l en \
--enable-semantic-gathering \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pilot,freq,spec,semScore,isDico,isDistrib" \
--info
Output TSV file (15553 lines)
See how to interpret the terminology TSV output file.
# type pilot freq spec semScore isDico isDistrib
1 T rotor 848 4,82
2 T wind turbine 1855 4,56
2 V[h]+ wind power-plant 2 1,90 0,97 0 1
2 V[h]+ wind channel 2 2,20 0,97 0 1
2 V[h]+ Wind turbines-a 2 2,20 0,97 0 1
2 V[h]+ wind farm 488 3,20 0,89 0 1
2 V[s] wind turbine rotor 31 3,38
2 V[s]+ vertical-axis wind turbine 6 2,37
2 V[s] WIND TURBINE APPLICATIONS 86 3,83
2 V[s] wind turbine blades 48 3,57
2 V[h]+ Enfield-Andreau turbine 3 2,37 0,54 0 1
2 V[s]+ wind turbine concepts 37 3,46
2 V[s]+ wind turbine generator 27 3,02
2 V[s]+ Domestic Wind Turbines 29 3,35
2 V[s]+ small wind turbines 33 3,41
2 V[s] MW wind turbine 10 2,89
2 V[s]+ wind turbine technology 28 3,34
2 V[s]+ wind turbine system 32 3,40
2 V[s]+ small-scale wind turbines 4 2,50
2 V[s]+ wind turbine configuration 11 2,63
2 V[s] three-bladed wind turbines 8 2,80
2 V[s] wind turbine tower 15 3,07
2 V[s]+ large wind turbines 17 3,12
2 V[s]+ smaller-scale wind turbines 2 2,20
2 V[s]+ wind turbine operations 14 3,04
2 V[s] wind turbine performance 14 3,04
2 V[s] wind turbine component 9 2,85
2 V[s] ARI wind turbine 4 2,20
2 V[s]+ land-based wind turbines 4 2,50
2 V[s]+ wind turbine foundations 9 2,85
--- END OF FILE
Semantic variants are denoted as V[h]. Consider filtering on the values of properties isDistrib and semScore for better results.
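A filter on semScore can be applied to the exported file after the fact. The sketch below keeps the header, every row that is not a semantic variant, and the semantic variants whose semScore reaches a minimum. It assumes a tab-separated file with a header row, a semScore column exported through --tsv-properties, and the row type in the second column. The function name filter_semantic and the 0.8 threshold in the usage note are illustrative only.

```shell
# filter_semantic FILE MIN_SCORE
# Prints the header, all rows that are not semantic variants, and the
# semantic variants (type starting with "V[h]") with semScore >= MIN_SCORE.
# If the header has no semScore column, all semantic variants are dropped.
filter_semantic() {
  awk -F '\t' -v min="$2" '
    NR == 1 {                                # find semScore in the header
      for (i = 1; i <= NF; i++) if ($i == "semScore") idx = i
      print; next
    }
    index($2, "V[h]") != 1 { print; next }   # not a semantic variant
    idx {
      v = $idx
      sub(/,/, ".", v)                       # "0,97" -> "0.97"
      if (v + 0 >= min + 0) print
    }
  ' "$1"
}
```

With the sample rows above, `filter_semantic "$TSV_OUTPUT_FILE" 0.8` would keep "wind farm" (0,89) and drop "Enfield-Andreau turbine" (0,54).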
Semantic variant detection with custom dictionary
Extracting semantic variants requires a synonym dictionary. Due to licensing issues, the one packaged with TermSuite can be very poor for some languages. You might prefer to use your own dictionary with option --synonyms-dico. See the English dictionary packaged with TermSuite for an example of the dictionary format.
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--enable-semantic-gathering \
--synonyms-dico $SYNONYMS_DICO \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pilot,freq,spec,semScore,isDico,isDistrib" \
--info
Docker
termsuite extract \
-c $CORPUS_PATH \
-l en \
--enable-semantic-gathering \
--synonyms-dico $SYNONYMS_DICO \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pilot,freq,spec,semScore,isDico,isDistrib" \
--info
Semantic variant detection without distributional variants (dico only)
Semantic variants found distributionally (with property isDistrib=true) are generally of lower quality and slower to find than variants found with the dictionary. You can tell TermSuite to deactivate distributional gathering with option --semantic-dico-only.
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--enable-semantic-gathering \
--semantic-dico-only \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pilot,freq,spec,semScore,isDico,isDistrib" \
--info
Docker
termsuite extract \
-c $CORPUS_PATH \
-l en \
--enable-semantic-gathering \
--semantic-dico-only \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pilot,freq,spec,semScore,isDico,isDistrib" \
--info
Extracting a terminology from a very large corpus
This launcher extracts a terminology from a very large corpus by deactivating occurrence indexing and setting a capped terminology size. The capped size is the maximum number of terms allowed to be kept in memory. Every time the number of terms exceeds the capped size, TermSuite filters the ongoing terminology by frequency.
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--no-occurrence \
--capped-size 500000 \
--tsv $TSV_OUTPUT_FILE \
--info
Docker
termsuite extract \
-c $CORPUS_PATH \
-l en \
--no-occurrence \
--capped-size 500000 \
--tsv $TSV_OUTPUT_FILE \
--info
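To make the idea of a capped terminology concrete, here is a toy simulation in awk. It is not TermSuite's implementation: the eviction policy is an assumption for illustration (whenever the number of distinct terms exceeds the cap, all terms sharing the lowest current frequency are dropped), and the function name cap_by_frequency is hypothetical.

```shell
# cap_by_frequency CAP: reads one term per line on stdin and prints
# "<term><TAB><count>" for the terms that survive capping, sorted by term.
# Toy model only: evicts every term with the lowest current frequency each
# time the number of distinct terms exceeds CAP.
cap_by_frequency() {
  awk -v cap="$1" '
    {
      if (!($0 in freq)) n++          # a new distinct term
      freq[$0]++
      while (n > cap && n > 0) {      # over the cap: evict the rarest terms
        min = 0
        for (t in freq) if (min == 0 || freq[t] < min) min = freq[t]
        for (t in freq) if (freq[t] == min) { delete freq[t]; n-- }
      }
    }
    END { for (t in freq) print t "\t" freq[t] }
  ' | sort
}
```

The sketch also shows the trade-off of capping: a rare term that is evicted early loses its count, so a low cap biases the result toward terms that are frequent from the start of the corpus.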
Extracting a terminology from preprocessed corpus
It is not mandatory to start a terminology extraction pipeline from a textual corpus (with option -c). You may want to run the preprocessing once and start extraction pipelines from an already preprocessed corpus with option --from-prepared-corpus instead.
Note that since the corpus is already prepared (including POS tags), there is no need to set option -t. TermSuite will complain if both options are set together.
A valid preprocessed corpus is a directory containing as many *.xmi files as there are *.txt files in the input corpus, where file-i.xmi holds the preprocessed annotations of file-i.txt.
Another valid preprocessed corpus type is a TermSuite .json terminology into which all spotted terms have been imported but on which no further terminology extraction processing has been run.
See TermSuite Preprocessor for how to produce a valid preprocessed corpus in either of these formats.
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
--from-prepared-corpus $PREPARED_CORPUS_PATH \
-l en \
--tsv $TSV_OUTPUT_FILE \
--info
Docker
termsuite extract \
--from-prepared-corpus $PREPARED_CORPUS_PATH \
-l en \
--tsv $TSV_OUTPUT_FILE \
--info
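Before launching from a prepared XMI corpus, it can help to verify the one-to-one naming between file-i.txt and file-i.xmi described above. The sketch below lists the .xmi files that are missing. The function name missing_xmi and its two directory arguments are hypothetical; it checks file names only and does not validate the XMI content.

```shell
# missing_xmi TXT_DIR XMI_DIR
# Prints the name of every expected .xmi file that is absent from XMI_DIR,
# following the file-i.txt -> file-i.xmi naming convention.
missing_xmi() {
  for txt in "$1"/*.txt; do
    [ -e "$txt" ] || continue         # the glob matched no .txt file
    base=$(basename "$txt" .txt)
    [ -e "$2/$base.xmi" ] || echo "$base.xmi"
  done
  return 0
}
```

An empty output means that every text document has a matching preprocessed file.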
Customizing Terminology Post-Processing
The terminology post-processing step (after variant gathering) can be customized with the --postproc-* options. Option --disable-post-processing disables this step.
This launcher sets a threshold of 0.20 for the term property Independance. This means that, for each term, at least 20% of its occurrences must not be in the form of any of its variants.
See default TermSuite config for the default English post-processing values.
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--postproc-independance-th 0.20 \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pilot,freq,spec,ind" \
--info
Docker
termsuite extract \
-c $CORPUS_PATH \
-l en \
--postproc-independance-th 0.20 \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pilot,freq,spec,ind" \
--info
Output TSV file (12557 lines)
See how to interpret the terminology TSV output file.
# type pilot freq spec ind
1 T wind turbine 1854 4,68 0,28
1 V[s] ARI-450 wind turbine 2 1,90 1,00
1 V[s]+ vertical-axis wind turbine 6 2,37 0,50
1 V[s] WIND TURBINE APPLICATIONS 86 3,83 0,59
1 V[s] WIND TURBINE SOUND 71 3,74 0,87
1 V[s] wind turbine blades 48 3,57 0,29
1 V[s] offshore wind turbine 47 3,26 0,43
1 V[s] DFIG wind turbine 23 3,25 0,30
1 V[s]+ wind turbine noise 43 3,53 0,60
1 V[s]+ wind turbine concepts 37 3,46 0,43
1 V[s]+ wind turbine generator 27 3,02 0,44
1 V[s] Domestic Wind Turbines 29 3,35 0,24
1 V[s] direct-drive wind turbines 15 2,77 0,67
1 V[s] small wind turbines 33 3,41 0,61
1 V[s] MW wind turbine 10 2,89 0,80
1 V[s]+ wind turbine technology 28 3,34 0,25
1 V[s]+ wind turbine placement 20 2,89 0,55
1 V[s] wind turbine syndrome 21 3,21 1,00
1 V[s]+ small-scale wind turbines 4 2,50 0,25
1 V[s] wind turbine configuration 11 2,63 0,36
1 V[s] modern wind turbines 17 2,82 1,00
1 V[s] three-bladed wind turbines 8 2,80 0,63
1 V[s] wind turbine tower 15 3,07 0,20
1 V[s] large wind turbines 17 3,12 0,65
1 V[s]+ wind turbine design 16 2,80 0,31
1 V[s]+ smaller-scale wind turbines 2 2,20 1,00
1 V[s] wind turbine operations 14 3,04 0,86
1 V[s] wind turbine performance 14 3,04 0,50
1 V[s] wind turbine component 9 2,85 0,22
1 V[s] wind turbine project 12 2,67 0,42
--- END OF FILE
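The 20% rule can be worked through with a little arithmetic. The sketch below assumes, based on the description above, that the independence of a term is the share of its occurrences that are not part of any variant (independent occurrences divided by total occurrences). This reading and both function names are assumptions, not TermSuite's documented formula.

```shell
# independence INDEPENDENT_OCCURRENCES TOTAL_OCCURRENCES
# Prints the share of occurrences that are not part of any variant.
# TOTAL_OCCURRENCES must be greater than zero.
independence() {
  awk -v i="$1" -v f="$2" 'BEGIN { printf "%.2f\n", i / f }'
}

# passes_independence INDEPENDENT TOTAL THRESHOLD
# Succeeds when the share reaches THRESHOLD, i.e. when the term would be
# kept under the assumed reading of --postproc-independance-th.
passes_independence() {
  awk -v i="$1" -v f="$2" -v th="$3" 'BEGIN { exit !(i / f >= th) }'
}
```

Under this reading, a term seen 100 times of which 15 are independent has an independence of 0.15 and would be filtered out at a threshold of 0.20, while a term with 25 independent occurrences out of 100 would be kept.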
Filtering: cleaning terminology before gathering step (--pre-filter-*)
Filtering the terminology after the variant gathering step (--post-filter-*) is what most users need by default. Sometimes it is useful to filter the terminology before variant gathering with the --pre-filter-property option. One of the main use cases for pre-filtering is to reduce the size of the terminology ahead of variant gathering, which speeds up the whole variant gathering step.
Any numeric property can be used for pre-gathering filtering. See all TermSuite properties.
For example, this launcher filters out all terms having freq < 2 (hapaxes) before variant gathering. As a consequence, term variants occurring exactly once can never be detected. There is no --pre-filter counterpart of --post-filter-keep-variants.
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--pre-filter-property freq \
--pre-filter-th 2 \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pattern,pilot,spec,freq,dFreq" \
--info
Docker
termsuite extract \
-c $CORPUS_PATH \
-l en \
--pre-filter-property freq \
--pre-filter-th 2 \
--tsv $TSV_OUTPUT_FILE \
--tsv-properties "pattern,pilot,spec,freq,dFreq" \
--info
Debugging terminology extraction
You can activate logging to the console with either --info or --debug (more verbose).
You can also get term-centered information with option --watch. The following launcher prints all pipeline information about the term “offshore wind energy”, even if it has been filtered out. This is convenient when a term does not turn out as expected (missing variants, missing term, etc.).
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--debug \
--tsv $TSV_OUTPUT_FILE \
--watch "offshore wind energy"
Docker
termsuite extract \
-c $CORPUS_PATH \
-l en \
--debug \
--tsv $TSV_OUTPUT_FILE \
--watch "offshore wind energy"
Using customized linguistic resources (Advanced users)
TermSuite terminology extraction relies on multiple linguistic resources of many types and formats (multi-word term spotting patterns, variation gathering rules, term frequencies in general language, morphology rules, etc.). The main use case for this option is to develop linguistic resources for a new language. Use customized linguistic resources with options --resource-dir or --resource-jar.
See TermSuite resources documentation and see how to easily produce your own custom resource directory or jar.
Command Line
java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
-t $TREETAGGER_HOME \
-c $CORPUS_PATH \
-l en \
--resource-dir $CUSTOM_RESOURCE_DIR \
--tsv $TSV_OUTPUT_FILE \
--info
Docker
termsuite extract \
-c $CORPUS_PATH \
-l en \
--resource-dir $CUSTOM_RESOURCE_DIR \
--tsv $TSV_OUTPUT_FILE \
--info