TermSuite

Prerequisites
TerminologyExtractorCLI
Examples

Prerequisites

Java 8
A prepared corpus, or an external POS tagger installed (needed when starting from a text corpus)

TerminologyExtractorCLI

Usage

java [-Xms256m -Xmx8g] -cp termsuite-core-3.0.2.jar \
	 fr.univnantes.termsuite.tools.TerminologyExtractorCLI OPTIONS

Description

Extracts terminology from a domain-specific textual corpus (or preprocessed corpus).

Mandatory options

`--from-text-corpus`, `--from-prepared-corpus`

Exactly one option in --from-text-corpus, --from-prepared-corpus must be set.

`--tsv`, `--tbx`, `--json`

At least one option in --tsv, --tbx, --json must be set.
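The two constraints above can be checked before the JVM is started. The following is a minimal sketch in POSIX shell, not part of TermSuite: `check_mandatory` is a hypothetical helper that only counts option names in an argument list, treating `-c` as the short form of `--from-text-corpus`.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not shipped with TermSuite).
# Succeeds when exactly one input option and at least one output option
# appear in the argument list.
check_mandatory() {
    inputs=0
    outputs=0
    for arg in "$@"; do
        case "$arg" in
            --from-text-corpus|-c|--from-prepared-corpus) inputs=$((inputs + 1)) ;;
            --tsv|--tbx|--json) outputs=$((outputs + 1)) ;;
        esac
    done
    if [ "$inputs" -ne 1 ]; then
        echo "exactly one input option must be set" >&2
        return 1
    fi
    if [ "$outputs" -lt 1 ]; then
        echo "at least one output option must be set" >&2
        return 1
    fi
    return 0
}
```

A wrapper script could call `check_mandatory "$@"` first and only launch `java` when it succeeds. The sketch does not distinguish option names from option values, so a value that happens to equal an option name would be miscounted.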

Other options

`--capped-size` INT

The maximum number of terms to keep in memory while spotting. Allows processing larger volumes of input text.

`--context-assoc-rate` STRING

Association rate measure used to normalize context vectors. Allowed values are: MutualInformation, LogLikelihood

Warning: This option can only be set when option --contextualize is already set.

`--context-coocc-th` INT or FLOAT

Sets a minimum frequency threshold for co-terms to appear in context vectors

Warning: This option can only be set when option --contextualize is already set.

`--context-scope` INT

Radius of single-word term window used during contextualization

Warning: This option can only be set when option --contextualize is already set.

`--contextualize` (no arg)

Activates the contextualizer

`--disable-derivative-splitting` (no arg)

Disable morphological derivative splitting

`--disable-gathering` (no arg)

Disable variant term gathering

`--disable-merging` (no arg)

Disable graphical term merging

`--disable-morphology` (no arg)

Disable morphology analysis (native, prefix, derivation splitting)

`--disable-native-splitting` (no arg)

Disable morphological native splitting

`--disable-post-processing` (no arg)

Disable post-gathering scoring and filtering

`--disable-prefix-splitting` (no arg)

Disable morphological prefix splitting

`--enable-semantic-gathering` (no arg)

Enable semantic term gathering (monolingual alignment)

`--encoding`, `-e` ENC

Encoding of the input corpus

`--from-prepared-corpus` DIR

A file or directory path. Starts the terminology extraction pipeline from an XMI corpus or an imported terminology json file instead of a txt corpus.

Warning: Exactly one option in --from-text-corpus, --from-prepared-corpus must be set.

`--from-text-corpus`, `-c` DIR

Path to the corpus directory (containing a list of .txt documents)

Warning: Exactly one option in --from-text-corpus, --from-prepared-corpus must be set.

`--graphical-similarity-th` INT or FLOAT

Graphical similarity threshold

`--json` FILE

Outputs terminology to JSON file

Warning: At least one option in --tsv, --tbx, --json must be set.

`--language`, `-l` LANG

Language of the input corpus

`--nb-semantic-candidates` INT

Maximum number of semantic variants for each term

Warning: This option can only be set when option --enable-semantic-gathering is already set.

`--no-occurrence` (no arg)

Do not store occurrence offsets in memory while spotting. Allows processing larger volumes of input text.

`--post-filter-keep-variants` (no arg)

Keep variants during post-gathering filtering even if they would otherwise be filtered out

Warning: This option can only be set when option --post-filter-property is already set.

`--post-filter-max-variants` INT

The maximum number of variants to keep during post-gathering filtering

Warning: This option can only be set when option --post-filter-property is already set.

`--post-filter-property` STRING

Enables post-gathering filtering based on given property. Allowed values are: rank, documentFrequency, frequencyNorm, generalFrequencyNorm, specificity, frequency, OrthographicScore, IndependantFrequency, Independance, tf-idf, spec-idf, SwtSize, Depth

`--post-filter-th` INT or FLOAT

Threshold value of post-gathering filter

Warning: This option can only be set when option --post-filter-property is already set.

Warning: At most one option in --post-filter-th, --post-filter-top-n must be set.

`--post-filter-top-n` INT

N value for post-gathering filtering over top N terms

Warning: This option can only be set when option --post-filter-property is already set.

Warning: At most one option in --post-filter-th, --post-filter-top-n must be set.

`--postproc-affix-score-th` INT or FLOAT

Minimal score for affix-score. Variations under that threshold are filtered out.

Warning: This option cannot be set when option --disable-post-processing is already set.

`--postproc-affix-score-th` INT or FLOAT

Minimal score for variation orthographic score. Variations under that threshold are filtered out.

Warning: This option cannot be set when option --disable-post-processing is already set.

`--postproc-independance-th` INT or FLOAT

Term independence score threshold. Terms under the threshold are filtered out.

Warning: This option cannot be set when option --disable-post-processing is already set.

`--postproc-variation-score-th` INT or FLOAT

Filters out variations with scores under given threshold

Warning: This option cannot be set when option --disable-post-processing is already set.

`--pre-filter-max-variants` INT

The maximum number of variants to keep during pre-gathering filtering

Warning: This option can only be set when option --pre-filter-property is already set.

`--pre-filter-property` STRING

Enables pre-gathering filtering based on given property. Allowed values are: rank, documentFrequency, frequencyNorm, generalFrequencyNorm, specificity, frequency, OrthographicScore, IndependantFrequency, Independance, tf-idf, spec-idf, SwtSize, Depth

`--pre-filter-th` INT or FLOAT

Threshold value of pre-gathering filter

Warning: This option can only be set when option --pre-filter-property is already set.

Warning: At most one option in --pre-filter-top-n, --pre-filter-th must be set.

`--pre-filter-top-n` INT

N value for pre-gathering filtering over top N terms

Warning: This option can only be set when option --pre-filter-property is already set.

Warning: At most one option in --pre-filter-top-n, --pre-filter-th must be set.

`--ranking-asc` STRING

Sets the output ranking property in ASCENDING order. Allowed values are: rank, documentFrequency, frequencyNorm, generalFrequencyNorm, specificity, frequency, OrthographicScore, IndependantFrequency, Independance, tf-idf, spec-idf, SwtSize, Depth

Warning: At most one option in --ranking-asc, --ranking-desc must be set.

`--ranking-desc` STRING

Sets the output ranking property in DESCENDING order. Allowed values are: rank, documentFrequency, frequencyNorm, generalFrequencyNorm, specificity, frequency, OrthographicScore, IndependantFrequency, Independance, tf-idf, spec-idf, SwtSize, Depth

Warning: At most one option in --ranking-asc, --ranking-desc must be set.

`--resource-dir` DIR

Custom resource directory

`--resource-jar` FILE

Custom resource jar

`--resource-url-prefix` STRING

Custom resource url prefix

`--semantic-dico-only` (no arg)

Find semantic variants with the help of dictionary only, no alignment.

Warning: This option can only be set when option --enable-semantic-gathering is already set.

`--semantic-distance` STRING

Similarity measure used for semantic alignment. Allowed values are: Cosine, Jaccard

Warning: This option can only be set when option --enable-semantic-gathering is already set.

`--semantic-similarity-th` INT or FLOAT

Minimum semantic similarity threshold for semantic gathering (monolingual alignment)

Warning: This option can only be set when option --enable-semantic-gathering is already set.

`--synonyms-dico` FILE

Custom synonyms dictionary for semantic variant detection.

Warning: This option can only be set when option --enable-semantic-gathering is already set.

`--tagger` STRING

Which POS tagger to use. Allowed values are: mate, tt

`--tagger-home`, `-t` FILE

Path to POS tagger’s home

Warning: This option can only be set when option --from-text-corpus is already set.

`--tbx` FILE

Outputs terminology to TBX file

Warning: At least one option in --tsv, --tbx, --json must be set.

`--tsv` FILE

Outputs terminology to TSV file

Warning: At least one option in --tsv, --tbx, --json must be set.

`--tsv-hide-headers` (no arg)

Hide column headers

Warning: This option can only be set when option --tsv is already set.

`--tsv-hide-variants` (no arg)

Do not show the variants for each term

Warning: This option can only be set when option --tsv is already set.

`--tsv-properties` STRING

The comma-separated list of columns of the TSV file. Allowed values are: rank, isSingleWord, documentFrequency, frequencyNorm, generalFrequencyNorm, specificity, frequency, OrthographicScore, IndependantFrequency, Independance, pilot, lemma, tf-idf, spec-idf, groupingKey, pattern, spottingRule, isFixedExpression, SwtSize, Filtered, Depth, VariationRank, VariationRule, DerivationType, GraphSimilarity, Score, AffixGain, AffixSpec, AffixRatio, AffixScore, NormalizedAffixScore, AffixOrthographicScore, ExtensionScore, NormalizedExtensionScore, HasExtensionAffix, IsExtension, VariantBagFrequency, SourceGain, NormalizedSourceGain, IsInfered, IsGraphical, IsDerivation, IsPrefixation, IsSyntagmatic, IsMorphological, IsSemantic, Distributional, SemanticSimilarity, Dico, SemanticScore

Warning: This option can only be set when option --tsv is already set.

`--watch` TERM_LIST

List of terms (grouping keys or lemmas) to log to output

Examples

Example launcher scripts can be found at:

https://github.com/termsuite/termsuite-core/tree/develop/examples/cmd

Filtering: cleaning by threshold value on any property

Terminology can be filtered on any property after variant gathering with option --post-filter-property. When given together with option --post-filter-th, TermSuite filters by threshold on the given property.
Any numeric property can be used for post-processing filtering. See all TermSuite properties.
For example, this launcher filters out all terms having dFreq < 3, i.e. appearing in fewer than 3 different documents in the source corpus.

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --post-filter-property dFreq \
            --post-filter-th 3 \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pattern,pilot,spec,freq,dFreq" \
            --info

Docker

termsuite extract \
            -c $CORPUS_PATH \
            -l en \
            --post-filter-property dFreq \
            --post-filter-th 3 \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pattern,pilot,spec,freq,dFreq" \
            --info

Output TSV file (2673 lines)

See how to interpret the terminology TSV output file.

#	type	pattern	pilot	spec	freq	dFreq
1	T	N	rotor	4,82	848	30
2	T	N N	wind turbine	4,56	1855	37
2	V[s]	N N N	wind turbine rotor	3,38	31	12
2	V[s]	N N N	WIND TURBINE APPLICATIONS	3,83	86	3
2	V[s]	N N N	wind turbine blades	3,57	48	12
2	V[s]	A N N	offshore wind turbine	3,26	47	7
2	V[s]	N N N	wind turbine noise	3,53	43	3
2	V[s]	N N N	wind turbine concepts	3,46	37	4
2	V[s]	N N N	wind turbine generator	3,02	27	8
2	V[s]	A N N	small wind turbines	3,41	33	8
2	V[s]	N N N	MW wind turbine	2,89	10	5
2	V[s]+	N N N	wind turbine technology	3,34	28	10
2	V[s]	N N N	wind turbine system	3,40	32	7
2	V[s]	A N N	modern wind turbines	2,82	17	7
2	V[s]	N N N	wind turbine tower	3,07	15	9
2	V[s]	A N N	large wind turbines	3,12	17	10
3	T	N N	wind energy	4,51	414	32
3	V[s]	N N N	wind energy potential	3,07	15	5
3	V[s]	A N N	offshore wind energy	3,56	47	7
3	V[s]	N N N	wind energy development	3,29	25	5
3	V[s]	N N N	Wind Energy Facility	3,04	14	5
3	V[s]	N N N	wind energy projects	2,97	12	8
3	V[s]	N N N	wind energy technology	2,67	6	5
3	V[s]	N N N	wind energy research	2,67	6	5
3	V[s]	N N N	wind energy application	2,50	4	3
3	V[s]	N N N	wind energy resources	2,50	4	3
4	T	N N	wind power	4,34	278	26
4	V[s]	N N N	wind turbine power	2,89	10	6
4	V[s]	N N N	Wind Power Plant	3,76	74	9
4	V[s]	A N N	offshore wind power	3,01	13	4
--- lines 31 to 2662 ---
2524	T	N	round	0,03	5	4
2525	T	N	Street	0,03	8	4
2526	T	N	News	0,03	3	3
2527	T	N	house	0,03	9	4
2528	T	N	trust	0,02	4	4
2529	T	N	London	0,02	4	4
2530	T	N	money	0,02	6	4
2531	T	N	England	0,01	3	3
2532	T	N	Scotland	0,01	11	4
2533	T	N	weeks	0,01	4	4
--- END OF FILE
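The threshold rule described above (rows under the threshold are dropped) can be mimicked on an already exported TSV file. This is a minimal sketch with a hypothetical helper, `filter_by_threshold`, and it is not how TermSuite filters internally: it only compares one integer-valued column, such as freq or dFreq, against a threshold. It assumes the first line is the header and that columns are tab-separated.

```shell
#!/bin/sh
# Hypothetical helper: keep the header line and every row whose column
# number $1 holds a value greater than or equal to threshold $2.
# Intended for integer columns (freq, dFreq); decimal-comma values such as
# "4,82" would be read by awk as 4.
filter_by_threshold() {
    awk -F '\t' -v col="$1" -v th="$2" 'NR == 1 || $col + 0 >= th + 0'
}
```

With the column layout of the sample output above, dFreq is column 7, so `filter_by_threshold 7 3 < "$TSV_OUTPUT_FILE"` keeps the rows with dFreq of at least 3. Unlike TermSuite, this sketch treats terms and variants as independent rows.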

Filtering: keeping only top n terms

Terminology can be filtered on any property after variant gathering with option --post-filter-property. When given together with option --post-filter-top-n, TermSuite keeps only the top n terms sorted by the given property in descending order.
Any numeric property can be used for post-processing filtering. See all TermSuite properties.
For example, this launcher keeps only the 500 most specific terms.

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --post-filter-property spec \
            --post-filter-top-n 500 \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pattern,pilot,spec,freq,dFreq" \
            --info

Docker

termsuite extract \
            -c $CORPUS_PATH \
            -l en \
            --post-filter-property spec \
            --post-filter-top-n 500 \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pattern,pilot,spec,freq,dFreq" \
            --info

Output TSV file (569 lines)

See how to interpret the terminology TSV output file.

#	type	pattern	pilot	spec	freq	dFreq
1	T	N	rotor	4,82	848	30
2	T	N N	wind turbine	4,68	1854	37
2	V[s]	N N N	wind turbine rotor	3,38	31	12
2	V[s]	N N N	WIND TURBINE APPLICATIONS	3,83	86	3
2	V[s]	N N N	WIND TURBINE SOUND	3,74	71	2
2	V[s]	N N N	wind turbine blades	3,57	48	12
2	V[s]	A N N	offshore wind turbine	3,26	47	7
2	V[s]	N N N	DFIG wind turbine	3,25	23	2
2	V[s]	N N N	wind turbine noise	3,53	43	3
2	V[s]	N N N	wind turbine concepts	3,46	37	4
2	V[s]	N N N	wind turbine generator	3,02	27	8
2	V[s]	A N N	Domestic Wind Turbines	3,35	29	1
2	V[s]	A N N	small wind turbines	3,41	33	8
2	V[s]	N N N	wind turbine technology	3,34	28	10
2	V[s]+	N N N	wind turbine system	3,40	32	7
2	V[s]	N N N	wind turbine syndrome	3,21	21	2
2	V[s]	N N N	wind turbine tower	3,07	15	9
2	V[s]	A N N	large wind turbines	3,12	17	10
3	T	N N	wind energy	4,51	414	32
3	V[s]	N N N	wind energy potential	3,07	15	5
3	V[s]	A N N	offshore wind energy	3,56	47	7
3	V[s]	N N N	wind energy development	3,29	25	5
3	V[s]	N N N	Wind Energy Facility	3,04	14	5
3	V[s]	N N N	wind energy projects	2,97	12	8
4	T	N N	wind power	4,34	278	26
4	V[s]	N N N	Wind Power Plant	3,76	74	9
4	V[s]	N N N	wind power stations	3,27	24	2
4	V[s]	A N N	offshore wind power	3,01	13	4
4	V[s]	N N N	Wind Power Project	3,17	19	5
4	V[s]	N N N	wind power development	3,04	14	5
--- lines 31 to 558 ---
491	T	N	RMS	2,93	11	1
492	T	N N	migration routes	2,93	11	4
493	T	A N	electrical power	2,93	11	9
494	T	N	Sediment	2,93	11	2
495	T	N N N	blade flow field	2,93	11	1
496	T	N	Sadock	2,93	11	1
497	T	A N	electric lines	2,93	11	1
498	T	N	Guajira	2,93	11	1
499	T	N N	design procedure	2,93	11	2
500	T	R	dynamically	2,93	11	4
--- END OF FILE

Filtering: keeping term variants

Filtering often leaves few term variants in the output terminology, because most term variants have low frequencies. You can configure TermSuite to keep all variants of a term, even those that the filter would otherwise remove, with option --post-filter-keep-variants.
Any numeric property can be used for post-processing filtering. See all TermSuite properties.
This launcher script keeps the top 500 terms by frequency, plus all 1-degree variants of these 500 terms.

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --post-filter-property freq \
            --post-filter-top-n 500 \
            --post-filter-keep-variants \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pattern,pilot,spec,freq,dFreq" \
            --info

Docker

termsuite extract \
            -c $CORPUS_PATH \
            -l en \
            --post-filter-property freq \
            --post-filter-top-n 500 \
            --post-filter-keep-variants \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pattern,pilot,spec,freq,dFreq" \
            --info

Output TSV file (709 lines)

See how to interpret the terminology TSV output file.

#	type	pattern	pilot	spec	freq	dFreq
1	T	N	rotor	4,82	848	30
2	T	N N	wind turbine	4,56	1855	37
2	V[s]	A N N	ARI-450 wind turbine	1,90	2	1
2	V[s]	N N N	vertical-axis wind turbine	2,37	6	2
2	V[s]	N N N	wind turbine rotor	3,38	31	12
2	V[s]	N N N	WIND TURBINE APPLICATIONS	3,83	86	3
2	V[s]	N N N	WIND TURBINE SOUND	3,74	71	2
2	V[s]	N N N	wind turbine blades	3,57	48	12
2	V[s]	A N N	offshore wind turbine	3,26	47	7
2	V[s]	N N N	DFIG wind turbine	3,25	23	2
2	V[s]	N N N	wind turbine noise	3,53	43	3
2	V[s]	N N N	wind turbine concepts	3,46	37	4
2	V[s]	N N N	wind turbine generator	3,02	27	8
2	V[s]	A N N	Domestic Wind Turbines	3,35	29	1
2	V[s]	N N N	direct-drive wind turbines	2,77	15	1
2	V[s]	A N N	small wind turbines	3,41	33	8
2	V[s]	N N N	MW wind turbine	2,89	10	5
2	V[s]	N N N	wind turbine technology	3,34	28	10
2	V[s]	N N N	wind turbine system	3,40	32	7
2	V[s]	N N N	wind turbine placement	2,89	20	1
2	V[s]	N N N	wind turbine syndrome	3,21	21	2
2	V[s]	A N N	small-scale wind turbines	2,50	4	1
2	V[s]	N N N	wind turbine configuration	2,63	11	2
2	V[s]	A N N	modern wind turbines	2,82	17	7
2	V[s]	A N N	three-bladed wind turbines	2,80	8	2
2	V[s]	N N N	wind turbine tower	3,07	15	9
2	V[s]	A N N	large wind turbines	3,12	17	10
3	T	N N	wind energy	4,51	414	32
3	V[s]	N N N	wind energy potential	3,07	15	5
3	V[s]	A N N	offshore wind energy	3,56	47	7
--- lines 31 to 698 ---
684	T	N	place	0,15	70	19
685	T	N	team	0,14	57	11
686	T	N	life	0,13	62	22
687	T	N	years	0,13	300	27
688	T	N	business	0,13	48	15
689	T	N	parties	0,10	45	6
690	T	N	government	0,10	58	17
691	T	N	people	0,09	101	11
692	T	N	months	0,09	52	10
693	T	N	company	0,09	66	19
--- END OF FILE

Exporting terminology to TSV

Export the extracted terminology to TSV simply by setting option --tsv.
You can select the term and variant properties that appear as columns in the TSV file with option --tsv-properties. The given value must be a comma-separated list of valid property names. See all available properties.

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "key,pattern,pilot,spec,freq,ind" \
            --info

Docker

termsuite extract \
            -c $CORPUS_PATH \
            -l en \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "key,pattern,pilot,spec,freq,ind" \
            --info

Output TSV file (13352 lines)

See how to interpret the terminology TSV output file.

#	type	key	pattern	pilot	spec	freq	ind
1	T	n: rotor	N	rotor	4,82	848	0,17
2	T	nn: wind turbine	N N	wind turbine	4,56	1855	0,28
2	V[s]	ann: ari-450 wind turbine	A N N	ARI-450 wind turbine	1,90	2	1,00
2	V[s]+	nnn: vertical-axis wind turbine	N N N	vertical-axis wind turbine	2,37	6	0,50
2	V[s]	nnn: wind turbine rotor	N N N	wind turbine rotor	3,38	31	0,13
2	V[s]	nnn: wind turbine application	N N N	WIND TURBINE APPLICATIONS	3,83	86	0,59
2	V[s]	nnn: wind turbine sound	N N N	WIND TURBINE SOUND	3,74	71	0,87
2	V[s]	nnn: wind turbine blade	N N N	wind turbine blades	3,57	48	0,29
2	V[s]	ann: offshore wind turbine	A N N	offshore wind turbine	3,26	47	0,43
2	V[s]	nnn: dfig wind turbine	N N N	DFIG wind turbine	3,25	23	0,30
2	V[s]+	nnn: wind turbine noise	N N N	wind turbine noise	3,53	43	0,60
2	V[s]+	nnn: wind turbine concept	N N N	wind turbine concepts	3,46	37	0,43
2	V[s]+	nnn: wind turbine generator	N N N	wind turbine generator	3,02	27	0,44
2	V[s]	ann: domestic wind turbine	A N N	Domestic Wind Turbines	3,35	29	0,24
2	V[s]	nnn: direct-drive wind turbine	N N N	direct-drive wind turbines	2,77	15	0,67
--- END OF FILE
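The sample output above writes decimal values with a comma (for example 4,82), which many spreadsheet and scripting tools do not read as numbers. The following is a minimal sketch of a post-processing step, using a hypothetical helper named `normalize_decimals`. It assumes tab-separated columns and a header on the first line, and it rewrites only fields that consist entirely of digits, one comma, and digits.

```shell
#!/bin/sh
# Hypothetical helper: rewrite decimal commas as decimal points in every
# purely numeric field, leaving the header line and text fields untouched.
normalize_decimals() {
    awk 'BEGIN { FS = OFS = "\t" }
         NR > 1 {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^-?[0-9]+,[0-9]+$/) sub(",", ".", $i)
         }
         { print }'
}
```

Typical use would be `normalize_decimals < "$TSV_OUTPUT_FILE" > normalized.tsv`, where `normalized.tsv` is an arbitrary output name. A pilot form that itself looks like a number with a comma would also be rewritten, so check the text columns if that matters for your corpus.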

Extract a terminology ready for alignment

To extract a terminology that is reusable by TermSuite’s aligner, you only need to set the --contextualize option and export terminology to its native format (JSON) with option --json.

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --contextualize \
            --json $JSON_OUTPUT_FILE \
            --info

Docker

termsuite extract \
            -c $CORPUS_PATH \
            -l en \
            --contextualize \
            --json $JSON_OUTPUT_FILE \
            --info

Deactivating variant gathering

Variant gathering can be deactivated with option --disable-gathering. This should make the terminology extraction pipeline terminate significantly faster.

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --disable-gathering \
            --tsv $TSV_OUTPUT_FILE \
            --info

Docker

termsuite extract \
            -c $CORPUS_PATH \
            -l en \
            --disable-gathering \
            --tsv $TSV_OUTPUT_FILE \
            --info

Output TSV file (11931 lines)

See how to interpret the terminology TSV output file.

#	type	key	freq
1	T	nn: wind turbine	1852
2	T	n: rotor	848
3	T	nn: wind energy	414
4	T	nn: wind speed	331
5	T	nn: wind power	278
6	T	n: airfoil	236
7	T	n: voltage	214
8	T	a: due	185
9	T	n: hydrogen	156
10	T	n: natura	144
11	T	n: vortex	132
12	T	n: wecs	116
13	T	n: hub	114
14	T	n: deis	110
15	T	nn: wind generator	105
16	T	nn: pitch angle	100
17	T	npn: angle of attack	97
18	T	nn: blade element	96
19	T	n: infrasound	96
20	T	nn: power coefficient	94
21	T	nn: rotor blade	94
22	T	n: tsr	91
23	T	nn: turbine blade	91
24	T	nnn: wind turbine application	86
25	T	n: annex	83
26	T	nn: turbine sound	80
27	T	nn: farm development	79
28	T	ann: low frequency sound	78
29	T	n: ordinance	77
30	T	nn: blade design	77
--- END OF FILE

Activating semantic variant detection

You can activate semantic variant detection with option --enable-semantic-gathering. Semantic variant gathering can be significantly slower. In TSV output, semantic variants are flagged as V[h]. TSV properties isDico and isDistrib tell whether the variant was detected with a synonym dictionary or with distributional alignment.

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --enable-semantic-gathering \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pilot,freq,spec,semScore,isDico,isDistrib" \
            --info

Docker

termsuite extract \
            -c $CORPUS_PATH \
            -l en \
            --enable-semantic-gathering \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pilot,freq,spec,semScore,isDico,isDistrib" \
            --info

Output TSV file (15553 lines)

See how to interpret the terminology TSV output file.

#	type	pilot	freq	spec	semScore	isDico	isDistrib
1	T	rotor	848	4,82			
2	T	wind turbine	1855	4,56			
2	V[h]+	wind power-plant	2	1,90	0,97	0	1
2	V[h]+	wind channel	2	2,20	0,97	0	1
2	V[h]+	Wind turbines-a	2	2,20	0,97	0	1
2	V[h]+	wind farm	488	3,20	0,89	0	1
2	V[s]	wind turbine rotor	31	3,38			
2	V[s]+	vertical-axis wind turbine	6	2,37			
2	V[s]	WIND TURBINE APPLICATIONS	86	3,83			
2	V[s]	wind turbine blades	48	3,57			
2	V[h]+	Enfield-Andreau turbine	3	2,37	0,54	0	1
2	V[s]+	wind turbine concepts	37	3,46			
2	V[s]+	wind turbine generator	27	3,02			
2	V[s]+	Domestic Wind Turbines	29	3,35			
2	V[s]+	small wind turbines	33	3,41			
2	V[s]	MW wind turbine	10	2,89			
2	V[s]+	wind turbine technology	28	3,34			
2	V[s]+	wind turbine system	32	3,40			
2	V[s]+	small-scale wind turbines	4	2,50			
2	V[s]+	wind turbine configuration	11	2,63			
2	V[s]	three-bladed wind turbines	8	2,80			
2	V[s]	wind turbine tower	15	3,07			
2	V[s]+	large wind turbines	17	3,12			
2	V[s]+	smaller-scale wind turbines	2	2,20			
2	V[s]+	wind turbine operations	14	3,04			
2	V[s]	wind turbine performance	14	3,04			
2	V[s]	wind turbine component	9	2,85			
2	V[s]	ARI wind turbine	4	2,20			
2	V[s]+	land-based wind turbines	4	2,50			
2	V[s]+	wind turbine foundations	9	2,85			
--- END OF FILE

Semantic variants are denoted as V[h]. Consider filtering on the values of properties isDistrib and semScore to obtain cleaner results.
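One way to act on semScore is to post-filter the exported TSV file. This is a minimal sketch, assuming the column layout of the sample output above (type in column 2, semScore in column 6, decimal commas) and a hypothetical helper named `filter_semantic`. It drops only V[h] rows whose score is under the threshold and passes every other row through unchanged.

```shell
#!/bin/sh
# Hypothetical helper: remove semantic variants (type starting with "V[h]")
# whose semScore (column 6) is below threshold $1; keep all other rows.
filter_semantic() {
    LC_ALL=C awk -F '\t' -v th="$1" '
        index($2, "V[h]") == 1 {
            score = $6
            sub(",", ".", score)      # "0,89" -> "0.89"
            if (score + 0 < th + 0) next
        }
        { print }
    '
}
```

For example, `filter_semantic 0.8 < "$TSV_OUTPUT_FILE"` would keep the "wind farm" variant (semScore 0,89) from the sample and drop "Enfield-Andreau turbine" (semScore 0,54). The threshold 0.8 is an arbitrary illustration, not a recommended value.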

Semantic variant detection with custom dictionary

Extracting semantic variants requires a synonyms dictionary. Due to licensing issues, the one packaged with TermSuite can be very poor for some languages. You might prefer to use your own dictionary with option --synonyms-dico. See the English dictionary packaged with TermSuite for an example of the dictionary format.

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --enable-semantic-gathering \
            --synonyms-dico $SYNONYMS_DICO \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pilot,freq,spec,semScore,isDico,isDistrib" \
            --info

Docker

termsuite extract \
            -c $CORPUS_PATH \
            -l en \
            --enable-semantic-gathering \
            --synonyms-dico $SYNONYMS_DICO \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pilot,freq,spec,semScore,isDico,isDistrib" \
            --info

Semantic variant detection without distributional variants (dico only)

Semantic variants found distributionally (with property isDistrib=true) are generally of lower quality and slower to find than variants found with a dictionary. You can tell TermSuite to deactivate distributional gathering with option --semantic-dico-only.

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --enable-semantic-gathering \
            --semantic-dico-only \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pilot,freq,spec,semScore,isDico,isDistrib" \
            --info

Docker

termsuite extract \
            -c $CORPUS_PATH \
            -l en \
            --enable-semantic-gathering \
            --semantic-dico-only \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pilot,freq,spec,semScore,isDico,isDistrib" \
            --info

Extracting a terminology from a very large corpus

Extract a terminology from a very large corpus by deactivating occurrence indexing (--no-occurrence) and setting a capped terminology size (--capped-size). The capped size is the maximum number of terms kept in memory. Every time the number of terms exceeds the capped size, TermSuite filters the in-progress terminology by frequency.

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --no-occurrence \
            --capped-size 500000 \
            --tsv $TSV_OUTPUT_FILE \
            --info

Docker

termsuite extract \
            -c $CORPUS_PATH \
            -l en \
            --no-occurrence \
            --capped-size 500000 \
            --tsv $TSV_OUTPUT_FILE \
            --info

Extracting a terminology from preprocessed corpus

It is not mandatory to start a terminology extraction pipeline from a textual corpus (with option -c). You may want to run preprocessing once and then start extraction pipelines from an already preprocessed corpus with option --from-prepared-corpus instead.
Note that since the corpus is prepared (including POS tags), option -t is no longer needed. TermSuite will complain if both options are set together.
A valid preprocessed corpus is a directory containing as many *.xmi files as there are *.txt files in the input corpus, where file-i.xmi holds the preprocessed annotations of file-i.txt.
Another valid preprocessed corpus type is a TermSuite .json terminology into which all spotted terms have been imported but on which no further terminology extraction processing has been run.
See TermSuite Preprocessor for how to produce a valid preprocessed corpus in either of these formats.
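The one-to-one correspondence between *.txt and *.xmi files can be verified before launching an extraction. The following is a minimal sketch, not a TermSuite tool. `check_prepared_corpus` is a hypothetical helper that assumes both directories are flat and that file-i.txt maps to file-i.xmi by swapping the extension.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when every *.txt file in directory $1 has a
# same-named *.xmi file in directory $2; reports missing files on stderr.
check_prepared_corpus() {
    txt_dir=$1
    xmi_dir=$2
    missing=0
    for txt in "$txt_dir"/*.txt; do
        [ -e "$txt" ] || continue      # glob did not match: no .txt files
        name=$(basename "$txt" .txt)
        if [ ! -f "$xmi_dir/$name.xmi" ]; then
            echo "missing: $xmi_dir/$name.xmi" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

It could be called as `check_prepared_corpus "$CORPUS_PATH" "$PREPARED_CORPUS_PATH"`. If your corpus stores documents in subdirectories, or your preprocessing names the XMI files differently, adapt the mapping accordingly.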

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            --from-prepared-corpus $PREPARED_CORPUS_PATH \
            -l en \
            --tsv $TSV_OUTPUT_FILE \
            --info

Docker

termsuite extract \
            --from-prepared-corpus $PREPARED_CORPUS_PATH \
            -l en \
            --tsv $TSV_OUTPUT_FILE \
            --info

Customizing Terminology Post-Processing

The terminology post-processing step (after variant gathering) can be customized with the --postproc-* options. Option --disable-post-processing disables this step.
This launcher sets a threshold of 0.20 for term property Independance. This means that, for each term, at least 20% of its occurrences must not be in the form of any of its variants.
See the default TermSuite config for the default English post-processing values.
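The 20% rule can be read as a ratio. The sketch below assumes that Independance is the number of occurrences outside any variant divided by the total frequency. This is an interpretation of the sentence above, not a confirmed TermSuite formula. The helper names and the counts used with them are hypothetical.

```shell
#!/bin/sh
# Illustrative only: share of a term's occurrences that are not part of any
# of its variants ($1 = independent occurrences, $2 = total occurrences).
independence() {
    LC_ALL=C awk -v indep="$1" -v total="$2" 'BEGIN { printf "%.2f\n", indep / total }'
}

# Succeeds when score $1 reaches threshold $2; terms under the threshold
# would be filtered out.
passes_threshold() {
    LC_ALL=C awk -v ind="$1" -v th="$2" 'BEGIN { exit !(ind + 0 >= th + 0) }'
}
```

Under this reading, a term seen 200 times, 50 of them outside any variant, has an independence of 0.25 (`independence 50 200`) and passes a 0.20 threshold, while a term with 30 independent occurrences out of 200 (0.15) would be filtered out.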

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --postproc-independance-th 0.20 \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pilot,freq,spec,ind" \
            --info

Docker

termsuite extract \
            -c $CORPUS_PATH \
            -l en \
            --postproc-independance-th 0.20 \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pilot,freq,spec,ind" \
            --info

Output TSV file (12557 lines)

See how to interpret the terminology TSV output file.

#	type	pilot	freq	spec	ind
1	T	wind turbine	1854	4,68	0,28
1	V[s]	ARI-450 wind turbine	2	1,90	1,00
1	V[s]+	vertical-axis wind turbine	6	2,37	0,50
1	V[s]	WIND TURBINE APPLICATIONS	86	3,83	0,59
1	V[s]	WIND TURBINE SOUND	71	3,74	0,87
1	V[s]	wind turbine blades	48	3,57	0,29
1	V[s]	offshore wind turbine	47	3,26	0,43
1	V[s]	DFIG wind turbine	23	3,25	0,30
1	V[s]+	wind turbine noise	43	3,53	0,60
1	V[s]+	wind turbine concepts	37	3,46	0,43
1	V[s]+	wind turbine generator	27	3,02	0,44
1	V[s]	Domestic Wind Turbines	29	3,35	0,24
1	V[s]	direct-drive wind turbines	15	2,77	0,67
1	V[s]	small wind turbines	33	3,41	0,61
1	V[s]	MW wind turbine	10	2,89	0,80
1	V[s]+	wind turbine technology	28	3,34	0,25
1	V[s]+	wind turbine placement	20	2,89	0,55
1	V[s]	wind turbine syndrome	21	3,21	1,00
1	V[s]+	small-scale wind turbines	4	2,50	0,25
1	V[s]	wind turbine configuration	11	2,63	0,36
1	V[s]	modern wind turbines	17	2,82	1,00
1	V[s]	three-bladed wind turbines	8	2,80	0,63
1	V[s]	wind turbine tower	15	3,07	0,20
1	V[s]	large wind turbines	17	3,12	0,65
1	V[s]+	wind turbine design	16	2,80	0,31
1	V[s]+	smaller-scale wind turbines	2	2,20	1,00
1	V[s]	wind turbine operations	14	3,04	0,86
1	V[s]	wind turbine performance	14	3,04	0,50
1	V[s]	wind turbine component	9	2,85	0,22
1	V[s]	wind turbine project	12	2,67	0,42
--- END OF FILE

Filtering: cleaning terminology before gathering step (`--pre-filter-*`)

Filtering the terminology after the variant gathering step (--post-filter-*) is what most users need by default. Sometimes it is useful to filter the terminology before variant gathering, with the --pre-filter-property option. One of the main use cases for pre-filtering is to reduce the size of the terminology ahead of variant gathering, thus speeding up the whole variant gathering step.
Any numeric property can be used for pre-filtering. See all TermSuite properties.
For example, this launcher filters out all terms having freq < 2 (hapaxes) before variant gathering. As a consequence, term variants occurring exactly once can never be detected. There is no *-keep-variants option for --pre-filter, unlike --post-filter-keep-variants.

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --pre-filter-property freq \
            --pre-filter-th 2 \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pattern,pilot,spec,freq,dFreq" \
            --info

Docker

termsuite extract \
            -c $CORPUS_PATH \
            -l en \
            --pre-filter-property freq \
            --pre-filter-th 2 \
            --tsv $TSV_OUTPUT_FILE \
            --tsv-properties "pattern,pilot,spec,freq,dFreq" \
            --info
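
The effect of pre-filtering on hapaxes can be illustrated with a toy Python sketch. It does not reproduce TermSuite's gathering rules: the `gather` function is a stand-in that simply treats any longer term containing the base term as a variant, and the frequencies (including the hapax "floating wind turbine") are made up for illustration.

```python
def pre_filter(freqs, threshold):
    """Keep terms whose frequency is at least `threshold` (drops freq < threshold)."""
    return {term: f for term, f in freqs.items() if f >= threshold}


def gather(freqs, base):
    """Toy stand-in for variant gathering: any longer term containing `base`."""
    return sorted(t for t in freqs if t != base and base in t)


# Hypothetical frequencies, for illustration only.
freqs = {
    "wind turbine": 1854,
    "offshore wind turbine": 47,
    "floating wind turbine": 1,  # hapax
}

# Gathering on the full terminology sees the hapax variant.
without_pre_filter = gather(freqs, "wind turbine")
# Gathering after a freq >= 2 pre-filter can no longer see it.
with_pre_filter = gather(pre_filter(freqs, 2), "wind turbine")
```

In this sketch the hapax is already gone when gathering runs, which is why it cannot be attached to "wind turbine" as a variant afterwards.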

Debugging terminology extraction

You can activate console logging with either --info or --debug (more verbose).
You can also get term-centered information with the --watch option. The following launcher prints all pipeline information about the term “offshore wind energy”, even if it has been filtered out. This is very convenient when the results for a term are not what you expected (missing variants, missing term, etc.).

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --debug \
            --tsv $TSV_OUTPUT_FILE \
            --watch "offshore wind energy"

Docker

termsuite extract \
            -c $CORPUS_PATH \
            -l en \
            --debug \
            --tsv $TSV_OUTPUT_FILE \
            --watch "offshore wind energy"

Using customized linguistic resources (Advanced users)

TermSuite terminology extraction relies on multiple linguistic resources of various types and formats (multi-word term spotting patterns, variation gathering rules, term frequencies in general language, morphology rules, etc.). The main use case for this option is to develop linguistic resources for a new language. Use customized linguistic resources with the options --resource-dir or --resource-jar.
See the TermSuite resources documentation and how to easily produce your own custom resource directory or jar.

Command Line

java -Xms1g -Xmx8g -cp $TS_HOME/termsuite-core-$TS_VERSION.jar fr.univnantes.termsuite.tools.TerminologyExtractorCLI \
            -t $TREETAGGER_HOME \
            -c $CORPUS_PATH \
            -l en \
            --resource-dir $CUSTOM_RESOURCE_DIR \
            --tsv $TSV_OUTPUT_FILE \
            --info

Docker

termsuite extract \
            -c $CORPUS_PATH \
            -l en \
            --resource-dir $CUSTOM_RESOURCE_DIR \
            --tsv $TSV_OUTPUT_FILE \
            --info
