In TermSuite, alignment is the process of grouping together terms that share the most similar contexts. There are two sorts of alignment in TermSuite, bilingual and monolingual. This page covers:

- Contextualization: building context vectors
- General alignment methods
- Bilingual alignment methods (term translation)
- Monolingual alignment methods (synonym detection)

### Contextualization: building context vectors

#### Context scope

The *scope* of a context for a term *T* is the number of words considered just before *T* (to its left) and just after *T* (to its right).

Example with the following sentence:

Data were acquired with the blades rotating at zero yaw, for a range of wind speeds.

When the *scope* is *1*, the context of term “*wind*” is {range: 1, speed: 1}.

When the *scope* is *2*, the context of term “*wind*” is {yaw: 1, range: 1, speed: 1}.

**Note 1:** co-terms in context vectors are lemmatized.

**Note 2:** determiners and prepositions are skipped when computing the context’s scope.
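The scope rule and the two notes above can be sketched in a few lines of Python. This is only an illustration, not TermSuite's implementation: the function name `context_vector`, the hand-written `SKIPPED` list and the hand-lemmatized sentence are all assumptions made for the example.

```python
from collections import Counter

# Assumption: a tiny hand-written list of determiners and prepositions.
SKIPPED = {"the", "a", "of", "for", "with", "at"}

def context_vector(lemmas, term, scope):
    """Count the co-terms found within `scope` words on each side of `term`."""
    # Dropping skipped words first means they never use up a slot in the window.
    words = [w for w in lemmas if w not in SKIPPED]
    vector = Counter()
    for i, word in enumerate(words):
        if word == term:
            vector.update(words[max(0, i - scope):i])    # left context
            vector.update(words[i + 1:i + 1 + scope])    # right context
    return vector

# Example sentence, lemmatized by hand for the purpose of this sketch.
lemmas = ["data", "be", "acquire", "with", "the", "blade", "rotate", "at",
          "zero", "yaw", "for", "a", "range", "of", "wind", "speed"]

scope_1 = dict(context_vector(lemmas, "wind", 1))
scope_2 = dict(context_vector(lemmas, "wind", 2))
```

With this input, `scope_1` and `scope_2` reproduce the two context vectors given in the example for scopes 1 and 2.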

#### Normalization of context vectors

The idea behind normalization is to lower the impact of co-terms that appear in multiple context vectors.

Given two terms *T1* and *T2*, having respectively *{co-term1: 2, co-term2: 1, co-term3: 2}* and *{co-term1: 1, co-term2: 1, co-term4: 3}* as their context vectors, we can observe that:

- *co-term2* is not very specific to either *T1* or *T2*, as it appears in both contexts with the same frequency (*frequency=1*),
- *co-term1* is a bit more specific to *T1* than to *T2*, as it appears twice in *T1*’s context and only once in *T2*’s,
- *co-term3* is very specific to *T1*, as it appears twice in its context and is absent from *T2*’s,
- *co-term4* is even more specific to *T2*, as it appears three times in its context and is absent from *T1*’s.

Two normalization algorithms are available in TermSuite for context vector normalization:

- *LogLikelyhood*
- *MutualInformation*
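The formulas used by these two algorithms are not given here. To give an intuition of what such a normalization does, the sketch below applies the textbook pointwise mutual information to the *T1*/*T2* example. The function name `pmi_normalize` and the formula itself are assumptions and may differ from TermSuite's *MutualInformation* implementation.

```python
import math

def pmi_normalize(vectors):
    """Replace raw co-occurrence counts with pointwise mutual information.

    Assumed formula: log(f(term, coterm) * N / (f(term) * f(coterm))),
    where N is the sum of all counts in all context vectors.
    """
    total = sum(sum(v.values()) for v in vectors.values())
    term_totals = {term: sum(v.values()) for term, v in vectors.items()}
    coterm_totals = {}
    for v in vectors.values():
        for coterm, freq in v.items():
            coterm_totals[coterm] = coterm_totals.get(coterm, 0) + freq
    return {
        term: {
            coterm: math.log(freq * total / (term_totals[term] * coterm_totals[coterm]))
            for coterm, freq in v.items()
        }
        for term, v in vectors.items()
    }

vectors = {
    "T1": {"co-term1": 2, "co-term2": 1, "co-term3": 2},
    "T2": {"co-term1": 1, "co-term2": 1, "co-term4": 3},
}
normalized = pmi_normalize(vectors)
```

Under this assumed formula, *co-term2* gets a weight of zero in both vectors, *co-term1* weighs more for *T1* than for *T2*, and *co-term3* outweighs *co-term1* in *T1*'s vector.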

### General alignment methods

The methods described in this section apply to both bilingual alignment and monolingual alignment.

In general, alignment requires:

- at least one *terminology* where terms have been contextualized (in the case of bilingual alignment, there must be two terminologies),
- a *dictionary* (a synonym dictionary for monolingual alignment, a bilingual dictionary for bilingual alignment).

#### Distributional alignment

*Distributional alignment* is the process of computing the closeness of two terms based on the similarity of their normalized context vectors.

*Distributional alignment* applies to single-word terms only, *i.e.* to *length-1 terms*.

Two context vector similarity measures are currently implemented in TermSuite.
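The measures themselves are not named in this section. Purely as an illustration of what a context vector similarity measure computes, the sketch below uses cosine similarity, a common choice for sparse vectors. Its use here is an assumption, not a statement about which measures TermSuite ships.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v2[coterm] for coterm, weight in v1.items() if coterm in v2)
    norm1 = math.sqrt(sum(weight * weight for weight in v1.values()))
    norm2 = math.sqrt(sum(weight * weight for weight in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm1 * norm2)
```

Two vectors with no co-term in common score 0, and two vectors pointing in the same direction score 1 whatever their magnitudes.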

#### Compositional alignment

*Compositional alignment* applies to *length-2 terms*, *i.e.* to terms composed of two words, like *wind energy*, *breast cancer*, *chemotherapy* (*chemo* + *therapy*, see the special case of compound words), etc.

**Note:** Determiners and prepositions are ignored when computing the term’s length. For example, the term *energy of the wind* is length-2 because *of* and *the* are ignored.

Say the term to align is made of *word1* and *word2*. In TermSuite’s compositional alignment algorithm:

1. TermSuite gets *C1*, the set of all length-1 alignment candidates for *word1* from the dictionary,
2. TermSuite gets *C2*, the set of all length-1 alignment candidates for *word2* from the dictionary,
3. TermSuite combines all length-1 candidates from *C1* and *C2* to produce the set *C=C1xC2* of all length-2 alignment candidates that occur in the target terminology,
4. TermSuite scores and ranks the candidates in *C*.

The maximum number of alignment candidates is *|C1| x |C2|*. If either *word1* or *word2* has no entry in the dictionary, then there can be no alignment candidate and the compositional method fails.

**Example:** Translation of the term *wind power* from English to French with the compositional method.

- word1=*wind*, C1={*vent*, *souffle*, *gaz*}
- word2=*power*, C2={*énergie*, *puissance*, *vertu*}
- The combination of *C1* and *C2* gives *9* candidates, but only *énergie du vent* and *puissance du vent* exist in the target terminology: C={*énergie du vent*, *puissance du vent*}
- If candidates are ranked according to their target frequency or specificity, then the best candidate is *énergie du vent*.

For details on target candidate ranking, see this publication. Currently, the default target candidate ranking strategy is by *decreasing specificity*.
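The compositional method can be sketched as follows on the *wind power* example. This is a simplified illustration under stated assumptions: target terms are indexed by the set of their content words (so *du* is ignored), and the specificity values are invented for the example. The function name `compositional_candidates` is hypothetical.

```python
from itertools import product

def compositional_candidates(word1, word2, dictionary, target_terms):
    """Return the target terms built from translations of word1 and word2."""
    c1 = dictionary.get(word1, [])   # step 1
    c2 = dictionary.get(word2, [])   # step 2
    found = []
    for t1, t2 in product(c1, c2):   # step 3: at most |C1| x |C2| combinations
        entry = target_terms.get(frozenset((t1, t2)))
        if entry is not None:
            found.append(entry)
    # Step 4: rank by decreasing specificity.
    found.sort(key=lambda entry: entry[1], reverse=True)
    return [term for term, _ in found]

dictionary = {
    "wind": ["vent", "souffle", "gaz"],
    "power": ["énergie", "puissance", "vertu"],
}
# Assumption: made-up specificity values, keyed by the term's content words.
target_terms = {
    frozenset(("énergie", "vent")): ("énergie du vent", 3.2),
    frozenset(("puissance", "vent")): ("puissance du vent", 2.1),
}

ranked = compositional_candidates("wind", "power", dictionary, target_terms)
```

When one of the two words has no dictionary entry, the product is empty and the function returns an empty list, which corresponds to the failure case described above.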

#### Semi-distributional alignment

Like compositional alignment, *semi-distributional alignment* applies to *length-2 terms* only. Semi-distributional alignment works in a very similar manner to *compositional alignment*.

The only difference lies at step 2, where alignment candidates for *word2* are computed with the distributional method instead of from the dictionary. Step 1 (*word1* candidates computation), step 3 (combination), and step 4 (ranking) stay unchanged. Symmetrically, we could apply the *distributional* alignment on *word1* and leave steps 2, 3, and 4 unchanged.

Usually, the distributional method has lower precision than the compositional method. This is why semi-distributional alignment is only invoked in TermSuite as a fallback, when one of the two words is missing from the dictionary. When both of the term’s words are missing from the dictionary, it would be theoretically feasible to apply a *pure distributional* alignment, but the precision is too low, so it is not implemented.
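The fallback logic described above can be summarized by the following sketch. The function `candidate_sets` and the `distributional` callable (standing for a distributional aligner that returns candidates for one word) are hypothetical names introduced for the example.

```python
def candidate_sets(word1, word2, dictionary, distributional):
    """Return (C1, C2) for a length-2 term, or None if no method applies."""
    if word1 not in dictionary and word2 not in dictionary:
        # Pure distributional alignment is not attempted.
        return None

    def lookup(word):
        # Dictionary first; distributional alignment only as a fallback.
        if word in dictionary:
            return dictionary[word]
        return distributional(word)

    return lookup(word1), lookup(word2)
```

The two returned sets are then combined and ranked exactly as in steps 3 and 4 of the compositional method.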

#### Aligning longer terms (length > 2)

Longer terms are aligned recursively.

If **length is n**, the term is of the form *T=word1 word2 … wordn*. There are n-1 alternatives for splitting *T* into two shorter terms:

* alternative 1: *T’=word1* and *T”=word2 … wordn*

* alternative 2: *T’=word1 word2* and *T”=word3 … wordn*

* …

* alternative n-1: *T’=word1 word2 … wordn-1* and *T”=wordn*.

For each splitting alternative *i*:

1. TermSuite produces:

    * candidate set *C’* by aligning *T’* with the compositional method, or with the semi-distributional method if the compositional method is not applicable,

    * candidate set *C”* from *T”* likewise.

2. TermSuite produces candidate set *Ci* by combining *C’* and *C”* in the same manner as step 3 of the compositional method.

Finally, TermSuite produces the final candidate set *C = C1 U C2 U … U Cn-1* and ranks *C* in the same manner as step 4 of the compositional method.
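The n-1 splitting alternatives are easy to enumerate. The helper below is a minimal sketch (the name `split_alternatives` is an assumption) that only covers the splitting step, not the alignment of each half.

```python
def split_alternatives(words):
    """Return the n-1 ways of splitting a length-n term into (T', T'')."""
    return [(words[:i], words[i:]) for i in range(1, len(words))]

alternatives = split_alternatives(["word1", "word2", "word3", "word4"])
```

For a length-4 term this yields three alternatives, from `(["word1"], ["word2", "word3", "word4"])` to `(["word1", "word2", "word3"], ["word4"])`.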

#### Aligning compound terms

A *compound term* is a single-word term composed of at least two different words. For example, the term *windpower* is composed of *wind+power*.

##### Size-2 compounds

Aligning a size-2 compound term like *windpower* is a four-step process:

1. producing the candidate set *C1* by aligning *windpower* as a regular single-word term (ignoring its composition) with the help of the dictionary or with the [distributional](#distributional) method,
2. producing the candidate set *C2* by aligning *windpower* as a length-2 term with word1=wind and word2=power, with the compositional method or with the semi-distributional method as fallback,
3. producing the candidate set *C=C1 U C2*,
4. ranking *C* (cf. step 4 of the compositional method).

##### Size-n compounds, n>2

If the compound term is made of at least three words, as often happens in German (*e.g.* *windenergienutzung=wind+energie+nutzung*), the same principle applies to four different candidate sets. For the sake of simplicity, we denote *T=A+B+C*, where *A*, *B*, and *C* are the sub-words:

- Candidate set *C1*, obtained by aligning the term *T* as a single-word term.
- Candidate sets *C2’* and *C2”*, obtained by aligning the term *ABC* as a size-2 compound as in the previous section. *C2’* is obtained by assuming that *T* is made of the two words *A+BC*; *C2”* is obtained by assuming that *T* is made of the two words *AB+C*.
- Candidate set *C3*, obtained by aligning the term *T* as a size-3 compound *A+B+C*, by applying the alignment algorithm for *length-n* terms on word1=A, word2=B, and word3=C.

Finally, we build the union candidate set *C = C1 U C2’ U C2” U C3* and rank it as usual.

The table below gives all the possible composition patterns for alignment depending on the actual composition size.

| Composition size | Actual composition pattern | List of candidate composition patterns |
|---|---|---|
| 1 | A | {A} |
| 2 | AB | {AB, A+B} |
| 3 | ABC | {ABC, AB+C, A+BC, A+B+C} |
| 4 | ABCD | {ABCD, ABC+D, AB+CD, A+BCD, AB+C+D, A+BC+D, A+B+CD, A+B+C+D} |
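The candidate patterns in this table correspond to every way of placing (or not placing) a `+` between consecutive sub-words, which gives 2^(n-1) patterns for a compound of n sub-words. A minimal sketch, with the hypothetical helper name `composition_patterns` and assuming a non-empty list of sub-words:

```python
from itertools import product

def composition_patterns(subwords):
    """List every candidate composition pattern of a compound."""
    patterns = []
    # One boolean per gap between sub-words: True means "cut here".
    for cuts in product([False, True], repeat=len(subwords) - 1):
        pattern = subwords[0]
        for subword, cut in zip(subwords[1:], cuts):
            pattern += ("+" if cut else "") + subword
        patterns.append(pattern)
    return patterns

size_3 = composition_patterns(["A", "B", "C"])
size_4 = composition_patterns(["A", "B", "C", "D"])
```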

#### Generalization to all types of terms of all lengths

The most general and most difficult situation to handle is the case when the term to align has a length > 1 and at least one of its words is a compound. For example *offshore windpower* is a length-2 term made of the simple word *offshore* and the compound word *wind+power*.

Suppose we have a length-3 term *A+B C D+E+F*, *i.e.* word1 is the compound *A+B*, word2 is the simple word *C*, and word3 is the compound word *D+E+F*.

As illustrated in the table of candidates composition patterns, we have to consider the following composition alternatives:

| # | word1 | word2 | word3 |
|---|---|---|---|
| 1 | AB | C | DEF |
| 2 | AB | C | DE+F |
| 3 | AB | C | D+EF |
| 4 | AB | C | D+E+F |
| 5 | A+B | C | DEF |
| 6 | A+B | C | DE+F |
| 7 | A+B | C | D+EF |
| 8 | A+B | C | D+E+F |

For each of these composition alternatives, we take all its components and apply the [length-n algorithm](#length-n) to them, as if they all formed one regular length-n term (*i.e.* one having no compound words).

For example, with alternative 7, we consider a length-5 term *T=A B C D EF*.
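The composition alternatives are the Cartesian product of the candidate patterns of each word. The sketch below enumerates them for the term *A+B C D+E+F* and flattens each alternative into the component list handed to the length-n algorithm. The function names are hypothetical and the term is represented, as an assumption, by a list of sub-word lists.

```python
from itertools import product

def segmentations(subwords):
    """Every way of grouping consecutive sub-words of one word."""
    result = []
    for cuts in product([False, True], repeat=len(subwords) - 1):
        parts, current = [], subwords[0]
        for subword, cut in zip(subwords[1:], cuts):
            if cut:
                parts.append(current)
                current = subword
            else:
                current += subword
        parts.append(current)
        result.append(parts)
    return result

def composition_alternatives(term):
    """`term` is a list of words, each given as its list of sub-words."""
    alternatives = []
    for choice in product(*(segmentations(word) for word in term)):
        # Flatten the chosen segmentation of every word into one component list.
        alternatives.append([part for parts in choice for part in parts])
    return alternatives

term = [["A", "B"], ["C"], ["D", "E", "F"]]
alternatives = composition_alternatives(term)
```

Here `alternatives` holds eight component lists, and the seventh one is `["A", "B", "C", "D", "EF"]`, the length-5 term mentioned above.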

**Note:** This overall algorithm might look very expensive, but in practice:

- every expensive sub-alignment invocation can be cached and reused very quickly for other combinations,
- there are very few terms of interest with such length and complexity of composition.

### Bilingual alignment methods (term translation)

Bilingual alignment is the process of finding the best translation candidates of a source term in a target language.

#### Requirements

It requires:

- a *source terminology* where terms have been contextualized,
- a *target terminology* where terms have been contextualized,
- a *bilingual dictionary* from the source language to the target language.

#### Translation of source term’s context vector

The translation of a source term into a target language works as described in the general case, but computing the similarity of the source and target vectors is not straightforward, since they contain words from different languages.

Let’s denote:

* *Ts* the source term to translate,

* *V={co-term1: 2, co-term2: 5, co-term3: 1}* its context vector.

In order to be able to apply our similarity measures on context vectors, we translate the source context vector *V* into *V’* as follows. For each co-term *t* in *V*, we get the set of all candidate translations *C* for *t* from the dictionary.

- If *C* is empty, then the co-term could not be found in the dictionary and we skip it in *V’*.
- If *C* has exactly one element *t’*, we put *t’* in *V’* with the same frequency as *t*.
- If *C* has several elements, we distribute the original frequency of co-term *t* among all its candidate translations, proportionally to their frequencies in the target terminology.

**Example:** Let’s suppose that:

- the source language is **en**,
- the target language is **fr**,
- the source context vector is *{wind: 3, blade: 1, darius: 2}*,
- the dictionary contains the following entries:

```
wind vent
wind gaz
wind air
blade pâle
```

- the frequencies in target terminology are:

```
vent: 17
gaz: 2
air: 5
pâle: 12
```

Then the translation of the context vector gives:

- for co-term **wind**: `{vent: 3*17/24, air: 3*5/24, gaz: 3*2/24}`
- for co-term **blade**: `{pâle: 1*12/12}`
- for co-term **darius**: `{}`

The final translated source context vector is *{vent: 2.13, air: 0.63, gaz: 0.25, pâle: 1}*.
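The three translation rules can be sketched as follows and checked against the example. The function name `translate_context_vector` is hypothetical. Two behaviours are assumptions not stated in the text: shares are summed when two co-terms translate to the same target word, and a co-term is skipped when none of its several candidates occurs in the target terminology.

```python
def translate_context_vector(vector, dictionary, target_freq):
    """Translate a source context vector into the target language."""
    translated = {}
    for coterm, freq in vector.items():
        candidates = dictionary.get(coterm, [])
        if not candidates:
            continue  # co-term absent from the dictionary: skipped
        if len(candidates) == 1:
            shares = {candidates[0]: freq}  # same frequency as the source co-term
        else:
            total = sum(target_freq.get(c, 0) for c in candidates)
            if total == 0:
                continue  # assumption: nothing to distribute the frequency on
            shares = {c: freq * target_freq.get(c, 0) / total for c in candidates}
        for candidate, share in shares.items():
            # Assumption: shares coming from different co-terms are summed.
            translated[candidate] = translated.get(candidate, 0.0) + share
    return translated

dictionary = {"wind": ["vent", "gaz", "air"], "blade": ["pâle"]}
target_freq = {"vent": 17, "gaz": 2, "air": 5, "pâle": 12}
source_vector = {"wind": 3, "blade": 1, "darius": 2}

translated = translate_context_vector(source_vector, dictionary, target_freq)
```

Before rounding, the values are 3\*17/24 = 2.125 for *vent*, 3\*5/24 = 0.625 for *air*, 3\*2/24 = 0.25 for *gaz*, and 1 for *pâle*.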

### Monolingual alignment methods (synonym detection)

Monolingual alignment is the process of finding the best candidate synonyms of a term in a terminology.

#### Requirements

It requires:

- a *source terminology* where terms have been contextualized,
- a dictionary of synonyms.

#### Synonym detection algorithm

All the methods described in the general case could be applied to synonym detection, but in order to reduce computation time and increase precision, we limit the search for synonyms to:

- length-2 terms,
- having the same syntactic pattern,
- sharing at least one word in common.

To adapt this process to each supported language, a language-specific list of synonymic variant rules is defined.

**Example:** The following rule finds length-2 synonyms having the pattern `A N` by fixing the first word (`s[0]==t[0]`) and looking for synonyms between the two second words (`synonym(t[1],s[1])`). The `A` is called the **fixed part** of the synonymic rule because it must be the same in the two terms, and the `N` is called the **synonymic part**.

```
"AN-AsynN":
  source: A N
  target: A N
  rule: s[0]==t[0] && synonym(t[1],s[1])
```

For each synonymic rule defined, TermSuite applies the following algorithm:

- select all terms having the required syntactic pattern (here: `A N`) into a set named *C1*,
- group the terms in *C1* by their *fixed part* (here we group terms whenever they have the same first word, because we have `s[0]==t[0]`),
- in each group, set a synonym score for every pair of terms *(t1,t2)* in the group based on the *synonymic part*. In the example rule above, the synonymic part of the rule is the second word, so we compare for each pair their second words (their `N`). Let’s name *w1* the second word of *t1* and *w2* the second word of *t2*:
    - compute the distributional similarity between the context vector of *w1* and the context vector of *w2*,
    - add *0.5* to that score if the pair *w1-w2* is present in the dictionary,
- finally, for all pairs in the group having a score above a given threshold that depends on the language (see the language configuration file), set a semantic variation.
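The algorithm above can be sketched for the `AN-AsynN` rule as follows. This is an illustration under stated assumptions, not TermSuite's code: cosine similarity stands in for the distributional similarity, the terms are already filtered to the `A N` pattern and given as `(adjective, noun)` pairs, and the names `detect_synonyms`, `contexts` and `synonym_pairs` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

def cosine(v1, v2):
    """Assumed similarity measure between two sparse context vectors."""
    dot = sum(w * v2[c] for c, w in v1.items() if c in v2)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def detect_synonyms(terms, contexts, synonym_pairs, threshold):
    """Return (t1, t2, score) for every pair scoring above `threshold`."""
    groups = defaultdict(list)
    for adjective, noun in terms:
        groups[adjective].append((adjective, noun))  # group by fixed part
    variations = []
    for group in groups.values():
        for t1, t2 in combinations(group, 2):
            w1, w2 = t1[1], t2[1]                    # synonymic parts
            score = cosine(contexts.get(w1, {}), contexts.get(w2, {}))
            if frozenset((w1, w2)) in synonym_pairs:
                score += 0.5                         # dictionary bonus
            if score > threshold:
                variations.append((t1, t2, score))
    return variations

# Toy data, invented for the example.
terms = [("renewable", "energy"), ("renewable", "power"), ("offshore", "turbine")]
contexts = {"energy": {"wind": 2, "solar": 1}, "power": {"wind": 1, "solar": 1}}
synonym_pairs = {frozenset(("energy", "power"))}

variations = detect_synonyms(terms, contexts, synonym_pairs, threshold=1.0)
```

With this toy data, only the pair *renewable energy* / *renewable power* passes the threshold, and it does so thanks to the dictionary bonus. Without the dictionary entry the same pair stays below 1.0.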