Suitable Training Data
What makes suitable training data? This is a question our members often ask. Four sources of training data can be used to develop high-quality KantanMT engines.
Training data from a domain or subject matter similar to the content you want to translate generally yields better KantanMT engines, which improves translation quality and reduces post-editing effort.
The four types of training data supported by KantanMT.com are:
- Translation Memory Files: This is the best source of training data since the source and target texts are aligned. The optimal format for use with KantanMT.com is TMX (Translation Memory Exchange); however, text files can also be used.
- Monolingual Translated Text Files: Monolingual text files are used to create language models for a KantanMT engine. Language models are used for word and phrase selection and have a direct impact on the fluency of KantanMT engines. Suitable formats are DOCX, PDF, and TXT.
- KantanLibrary Training Data: KantanLibrary training data can be used as language foundation data for KantanMT engines. Many different training data sets are available.
- Terminology Files: Terminology files or glossary files in TBX (TermBase eXchange) format can be used as training material. They ensure that your KantanMT engine uses your clients' correct terminology, improving translation consistency and quality.
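To illustrate why aligned translation memory data is so convenient as training material, here is a minimal Python sketch that pulls source/target segment pairs out of TMX content. It is not how KantanMT.com processes uploads; the function name `read_tmx_pairs`, the language codes, and the sample segments are assumptions made up for this example, and only the basic TMX elements (`tu`, `tuv`, `seg`) are handled.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs found in TMX content.

    Hypothetical helper for illustration only.
    """
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):  # one translation unit per aligned segment
        segments = {}
        for tuv in tu.findall("tuv"):  # one variant per language
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


# Made-up sample content; a real TMX file would hold many more units.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour le monde</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save the file</seg></tuv>
      <tuv xml:lang="fr"><seg>Enregistrez le fichier</seg></tuv>
    </tu>
  </body>
</tmx>"""

for source, target in read_tmx_pairs(SAMPLE_TMX, "en", "fr"):
    print(source, "->", target)
```

Because each translation unit already couples a source segment with its target, no separate alignment step is needed before the data can be used.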
You can also upload test data, which is used to calculate KantanBuildAnalytics™ quality measurements.
- Test Data: This data should be stored in aligned text files called source.test.src and source.test.trg. Each file should be UTF-8 encoded and contain one test segment per line.
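Before uploading, the test files can be checked locally against the requirements above. The following Python sketch is one possible way to do this, not an official KantanMT tool: the function name `check_test_pair` and the wording of its messages are assumptions for this example. It verifies that both files decode as UTF-8, that no line is empty, and that both files contain the same number of segments (a necessary condition for line-by-line alignment, though it cannot prove the segments are actual translations of each other).

```python
def check_test_pair(src_path, trg_path):
    """Return a list of problems found in an aligned test file pair.

    Hypothetical helper for illustration; an empty list means no
    problems were detected.
    """
    problems = []
    counts = []
    for path in (src_path, trg_path):
        try:
            # Reading in text mode splits only on real line endings.
            with open(path, encoding="utf-8") as handle:
                segments = [line.rstrip("\n") for line in handle]
        except UnicodeDecodeError as err:
            problems.append(f"{path}: not valid UTF-8 ({err.reason})")
            continue
        for number, segment in enumerate(segments, start=1):
            if not segment.strip():
                problems.append(f"{path}: line {number} is empty")
        counts.append(len(segments))
    # Compare segment counts only if both files could be read.
    if len(counts) == 2 and counts[0] != counts[1]:
        problems.append(
            f"segment counts differ: {counts[0]} source vs {counts[1]} target"
        )
    return problems


if __name__ == "__main__":
    # File names as given in the text above.
    for problem in check_test_pair("source.test.src", "source.test.trg"):
        print(problem)
```

Run from the folder that contains both files, the script prints one line per detected problem and nothing if the pair passes these checks.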