Next version (3.0)
- TermSuite will now require Java 8
- Detection of synonymic variations
- Improve multilingual alignment results for complex terms and compounds
Version 2.3 focuses on migrating to Java 8 and facilitating the use of TermSuite by simplifying the
Java API and improving the documentation of TermSuite in French and English.
- Require Java 8,
- Possibility to use TermSuite only as text preprocessor (tokenizer, tagger, lemmatizer, stemmer, filters),
- New simplified Java API,
- Linguistic resources are embedded by default,
- Improved test coverage,
- Extension of linguistic specification for derivates and prefix variation to
TermSuite 2.2 is the last version supporting Java 7. The next TermSuite version will require Java 8.
- #51 Reorganized TermSuite resources (see TermSuite Resources on Github)
- #48 Support detection of derivative variations
- #49 Implement prefix exceptions in prefix splitting
- #54 Update variation rules with prefix detection and derivatives detection
- #46 Possibility to add custom UIMA AE and CollectionReader to TermSuitePipeline
- #41 Serialize and deserialize TermSuite tokens (the UIMA CAS) to JSON (thanks to Simon Méoni)
- #42 Fixed expression detection
- #43 Added support for ISTEX API as a collection reader.
- #71 Improved morphosyntactic analysis performance
- Several bug fixes. See issues on Github.
In TermSuite GUI:
- Some bug fixes and minor UI improvements
- Improved alignment performance (improved response time and support for multi-word terms and compounds)
- Version 2.2 has abandoned support for TEI corpora, due to issue #24
- Graphical User Interface
- Observable pipelines
- UIMA profiler dependency removed
- Added a streaming-API (beta)
- Exportable annotations to JSON
- Support for the compositional and semi-distributional aligner is back
- Morphosyntactic analysis: added lemmatization of components
- Allow the capping of terminology from TermSuiteTerminoCLI with options
- Scalability: can now handle large corpora with MongoDB
- Changed the IO API. See JsonTermIndexIO class
- Removed term index measures
- Added ranker AE
- Added Merger AE
- Added Scorer AE (scores variations)
Version 2.0 is a major release.
IMPORTANT 1: In 2.0, there is no longer a distinction between the Spotter and the Indexer phases. They have been merged into one unique highly configurable
Broken until 2.1
- Graphical User Interface
- Added support for Mate POS Tagger and Lemmatizer (fr only) in addition to TreeTagger.
- Major refactoring of TermSuite’s data model. See [[TermSuiteDataModel]]
- Simplified TermSuite Type system (see [[TermSuiteDataModel]] for details)
- Dropped all UIMA XML resource descriptions. Moved to uimaFit completely.
- Split the huge TermSuite’s Gradle multi-project config into separate and independent Github/Maven projects.
- Published TermSuite and all its dependent subprojects as Maven artefacts.
- Moved source code and documentation to Github and Github’s wiki.
- Added AE primary occurrence detector.
- Added AE for term class grouping.
- Added AE for producing tsv variant evaluation files.
- TSV Export AE made configurable. (you can select the Term properties you want to get in your TSV)
- Added Morphology analysis. (with CompostAE)
- Improved syntactic variant resources + support for morphosyntactic variants detection.
- Improved and debugged Weirdness Ratio (old specificity) computing (GeneralLanguage.* resources being retrained on a larger corpus with new TermSuite standards)
- Migrated syntactic variant resources from *.yaml (better readability + improved rule indexing and more efficient rule matching detection)
- Added JSON Serializing for terminologies. Now terminologies can be saved into *.json and loaded from
- Terminology export in *.xmi format is not supported anymore. (*.xmi terminology files were too big, not human readable, and Exporter/Parser were too slow)
- Added several JUnit tests for several AEs and intelligent helper methods.
IMPORTANT : The new default encoding for TreeTagger resource interpretation in TermSuite is UTF-8. Please make sure that all your TreeTagger models are UTF-8 encoded.
- Clean UIMA multi-word terms based on a per-language allowed characters list
- New UIMA AE for rule-based pattern detection : UIMA Tokens Regex (with tests and doc)
- New UIMA AE for term gathering : GroovyMultiWordGatherer
- Add script TermSuiteTerminoCLI for terminology extraction in TermSuite (with spotting and gathering rules as parameter)
- Add CollectionReader and SimplePipeline TermSuite launchers.
- Debugged Txt collection reading : some annoying characters are replaced (annoying apostrophes, annoying unusual whitespaces, etc)
- Add 2 collection readers : StringCollectionReader (for inline TerminoCLI) and TEI CollectionReader (for corpora in tei format)
- More stats and examples about term spotting and gathering in TermSuite’s logs.
- Integration of uimaFit : all spotter AE refactored into uimaFit + uimaFit launcher for launcher
- Faster and multi-word term spotting with UIMA Tokens Regex for 4 languages : fr, en, de, ru
- Easier definition of multi-word term rules with UIMA Tokens Regex for 4 languages : fr, en, de, ru
- Faster AE for single word term spotting : eu.project.ttc.engines.SingleWordTermSpotter for all languages
- Easier definition of single-word term rules with eu.project.ttc.engines.SingleWordTermSpotter for all languages
- Faster and multi-word term gathering with UIMA Tokens Regex for two languages : fr, en
- Easier definition of multi-word term rules with UIMA Tokens Regex for two languages : fr, en
- Debugged script TermSuiteAlignerCLI
2014/03/17: Update of the TermSuite 1.5 UserGuide
- New GUI, improved wording and software architecture.
- Improved scalability: TermSuite can handle large corpora of several million words.
- Moved to gradle build tool.
- Moved to git based repository.
- Added TSV output as an option
- List of occurrences with frequencies in TBX output
- Added menu for loading existing processed data on the spotter results view.
- Several small bugfixes and enhancements.
- New tabbed menus for parameters.
- Several menus to distinguish between different parameters
- Variant detection parameters were separated for better comprehension
- Separation of parameters in basic and advanced options.
- The different alignment methods were clearly separated in different options.
- TSV output for spotter, indexer, aligner
- File parameters for input of CLI
- User manual (pdf format) to be downloaded
- Java program to prepare the directory of terms to be translated as input of the Aligner
- Pilot form in TBX output
- List of occurrences with frequencies in TBX output
- compositional and semi-compositional methods added for MWT alignment
- Added the Chinese general lexicon required for specificity calculation
- Enhanced MWT recognition rules for Ru/Lv/Es
- Enhanced MWT conflation rules for Ru/Lv/Es
- langSet identifier added in the XMI files
- Dissociation of XMI files (input of the aligner) and TBX files (output of the monolingual terminology extraction)
- Detection of graphic variants for MWTs (Ignore Diacritics in Conflation settings)
- New diacritics-insensitive edit distance in Conflation settings
- Specificity score in TBX output
- Verbs and other categories removed from TBX output (unless Keep verbs and others is specified)
- Statistical filtering of monolingual candidates: by cut-off rank (e.g. top-100, top-250, etc.) or threshold.
- TBX candidates are ranked according to the filter criteria, or alphabetically if no filtering is done.
- Bilingual TBX output following specifications
- Cut-off rank for translation candidates in bilingual output
- Progress bar color forced to green
- Windows shell CLI fixed
- Parameter group support for indexer settings (Advanced settings/TBX settings)
- Added the Latvian and Spanish general lexicons required for specificity calculation (the Latvian lexicon needs to be cleaned)
- Enhanced MWT recognition rules for En/Fr/De
- Enhanced MWT conflation rules for En/Fr/De
- Fail-fast edit distances
- graphical interface improved
- Latvian processing improved
- Chinese processing improved
- compound and multi-word alignment compositional method improved
- term conflation improved (no more extensions)
- result displays refactored in tabbed panels
- no more CPE and CR, only AE
- reshaping in Spotter, Indexer and Aligner
- added Relater for computing similarity distances between context vectors of a monolingual contextual terminology
- fixed issue 7
- Danish support for Tagger and Termer
- XML format for resources (tokenizer, tree-tagger, stop-word, rule-based term detection)
- neoclassical compound alignment
- refactoring taggers: a tagger button and one tagger engine per tab
- refactoring converters: a converter button and one converter engine per tab
- bug fix: the Indexer annotator was running as many times as there are index listeners and not only once
- bug fix: removing hapax both by filtering raw terms and by filtering their lemma
- bug fix: enabling term indexation according to their annotation types
- adding a Converter launcher for converting XMI files into other formats e.g. TSV
- sorting the terminology view of the Termer tool by frequencies
- adding a Stemmer for English, French, German, Russian and Spanish into the TreeTagger analysis engine
- adding Tilde Tagger for processing Latvian
- adding a term context result viewer for the Contextualizer tool
- splitting Ziggurat into 2 tools: Contextualizer and Aligner
- renaming Acabit to Termer
- CPE workflow split into 3 respectively called TreeTagger, Acabit and Ziggurat
- GUI refactoring: 1 tool per CPE
- version presented at IJCNLP
- TBX export bug fix
- Term Bank XMI serialization and deserialization removed after an “out of memory” exception was thrown.
- Term Bank binary serialization and deserialization added instead.
- Term Context Indexer added.
- multi-word rules for Spanish, Latvian and Russian added.
- initial release of the new TTC TermSuite GUI and CLI interfaces