The rules for the term checker are in 4 XML files:
- Disambiguation-ste6.xml tells LanguageTool which STE terms are approved and which STE terms are not approved. Disambiguation rules identify the part of speech that a term has. For disambiguation, sequence is important. Thus, disambiguation-ste6.xml includes the content of disambiguation-projectterms.xml before the disambiguation rules.
- Disambiguation-projectterms.xml tells LanguageTool which project terms are approved and which project terms are not approved.
- Grammar-ste6.xml tells LanguageTool to find STE terms that are not approved and STE terms that are used incorrectly.
- Grammar-projectterms.xml tells LanguageTool to find project terms that are not approved and project terms that are used incorrectly.
The disambiguation files specify the terms, and the grammar files tell LanguageTool which terms to find.
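One way that an XML file can include the content of another file before its own rules is an external entity. The sketch below shows that mechanism only. It is an assumption that the term checker files use it. The entity name ProjectTerms and the file path are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rules [
  <!-- Hypothetical entity name and path. Change the path to the
       location of your copy of the project terms file. -->
  <!ENTITY ProjectTerms SYSTEM "file:///C:/termchecker/disambiguation-projectterms.xml">
]>
<rules lang="en">
  <!-- The project terms are included first, because sequence is important. -->
  &ProjectTerms;

  <!-- The STE disambiguation rules follow here. -->
</rules>
```

With this method, the included file contains only rules. It does not have its own root element, because the parser puts its content in the position of the entity reference.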
The files for the term checker are outside the LanguageTool installation directory
grammar.xml contain all the rules for the grammar checks and the style checks. The 2 XML files are in the installation directory. If you use the stand-alone version of LanguageTool and the OpenOffice version, then you have 2 sets of files.
The term checker files are outside the installation directories. This method has these advantages:
- Only 1 set of files is necessary. All the changes that you make in a file are available in all the different versions of LanguageTool.
- You can easily use different sets of rules for different projects.
- If you update LanguageTool, the term checker files are not deleted.
When you install the term checker, you will replace
grammar.xml with files from TechScribe. The new files contain links to the files for the term checker.
If ASD-STE100 shows a term as approved for one part of speech, and unapproved for a different part of speech, then the term checker finds the term if it is used incorrectly. For example, work is approved as a noun, but not as a verb:
In the text, "You must work quickly," work is a verb, not a noun. Thus, the term checker finds the term.
The design of disambiguation-ste6.xml
The rules that are in disambiguation-ste6.xml analyse text, and apply a part of speech if a term is identified unambiguously as one part of speech.
The analysis uses pattern matching. For example, in the text "the X was," X is a noun. X cannot be a verb or some other part of speech. Each rule is applied in sequence. If text matches a pattern, the rule adds a part of speech to the matched term. If the pattern is not matched, the next rule is used.
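The pattern "the X was" can be written as a disambiguation rule. The sketch below is an illustration, not a rule from disambiguation-ste6.xml. The rule id and the rule name are hypothetical. The postag IS_NOUN is the postag that the grammar rules in this document use. The sketch assumes the LanguageTool disambiguation syntax in which action="add" adds a reading to the marked token.

```xml
<!-- Hypothetical rule, for illustration only. -->
<rule id="EXAMPLE_THE_X_WAS" name="the X was: X is a noun">
  <pattern>
    <token>the</token>
    <!-- The marker selects the token that gets the part of speech.
         An empty token matches all words. -->
    <marker>
      <token/>
    </marker>
    <token>was</token>
  </pattern>
  <!-- Add the reading IS_NOUN to the marked token. -->
  <disambig action="add"><wd pos="IS_NOUN"/></disambig>
</rule>
```

If the text does not match the 3 tokens, the rule does nothing, and LanguageTool applies the next rule in the file.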
For technical details, refer to these documents:
- Developing a Disambiguator (http://wiki.languagetool.org/developing-a-disambiguator)
- Patterns in language for POS disambiguation in a style checker (www.techscribe.co.uk/ta/patterns-in-language-tcuk-2013.pdf)
The design of grammar-ste6.xml
The rules that are in grammar-ste6.xml tell LanguageTool the text to find. For example, the pattern that finds the verb work is as follows:
<token>work<exception postag_regexp="yes" postag="IS_NOUN|PROJECTTERM"/></token>
The XML code means, "find the word work, unless work is a noun or a project term." Token is the LanguageTool term for word.
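In a rule file, a token pattern is inside a rule element that also supplies the message. The sketch below shows one possible structure for such a rule. The rule id, the rule name, and the message text are hypothetical. They are not copied from grammar-ste6.xml.

```xml
<!-- Hypothetical rule id, name, and message, for illustration only. -->
<rule id="EXAMPLE_STE_WORK_VERB" name="work (verb) is not approved">
  <pattern>
    <!-- Find "work", unless a disambiguation rule gave the word the
         postag IS_NOUN or the postag PROJECTTERM. -->
    <token>work<exception postag_regexp="yes" postag="IS_NOUN|PROJECTTERM"/></token>
  </pattern>
  <message>In STE, 'work' is approved as a noun, but not as a verb.</message>
</rule>
```

A complete LanguageTool rule usually also contains example sentences. The sketch does not show them, because the syntax for an incorrect example is different in different versions of LanguageTool.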
To prevent 2 errors for 1 problem, rule STE_RULE_1_1_USE_APPROVED_WORDS tells LanguageTool to find only the unknown terms. Other rules in grammar-ste6.xml cause LanguageTool to find the unapproved terms and the terms that are used incorrectly.
For an STE_NOT_APPROVED term, the exception is applicable to a multi-word project term. The postags prevent a rule from finding an unapproved STE term that is in a multi-word project term or in a multi-word unapproved project term.
- The postag PROJECTTERM means that the word is part of a project term. For an example, refer to
disambiguation-projectterms.xml, PROJECT_TVb_2_WORD_BASE-CONFORM-TO. (In STE, the verb conform is not approved. However, the verb conform to is a Technical Verb (STE issue 6, Rule 1.13.3.c, regulatory language). As a test, the verb conform to is in
disambiguation-projectterms.xml, not in
- The postag PROJECTTERM_NOT_APPROVED means that the word is part of an unapproved project term. For example, if a rule for the unapproved verb log in tells you to use the verb log on, the message about the STE unapproved term log (in the phrase log on) is not necessary.
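To show how a multi-word term can get one of these postags, the sketch below adds PROJECTTERM_NOT_APPROVED to the 2 words of log in. The rule id and the rule name are hypothetical. The rule is not copied from disambiguation-projectterms.xml. The sketch assumes that, for action="add", LanguageTool needs one wd element for each marked token.

```xml
<!-- Hypothetical rule, for illustration only. -->
<rule id="EXAMPLE_PROJECT_NOT_APPROVED_LOG_IN" name="log in (unapproved project term)">
  <pattern>
    <!-- The 2 tokens in the marker get the new postag. -->
    <marker>
      <token>log</token>
      <token>in</token>
    </marker>
  </pattern>
  <!-- One wd element for each marked token. -->
  <disambig action="add">
    <wd pos="PROJECTTERM_NOT_APPROVED"/>
    <wd pos="PROJECTTERM_NOT_APPROVED"/>
  </disambig>
</rule>
```

A grammar rule for an STE unapproved term can then have an exception for this postag. Thus, LanguageTool gives only the message about the project term.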
Refer also to these documents:
- Building a controlled language lexicon for Danish (http://rauli.cbs.dk/index.php/LSP/article/download/2069/2068)
- A specification and validating parser for simplified technical Spanish (www.csis.ul.ie/staff/Richard.Sutcliffe/ruiz_cascales_thesis03.pdf)
- A simple rule-based part of speech tagger (www.aclweb.org/anthology/H92-1022)