The design of the term checker for ASD-STE100

This term checker is a customized version of LanguageTool. The grammar and the terms that the term checker validates are specified in 4 XML files. (The word 'term' means the same as the word 'word' in ASD-STE100.)

An important difference between a grammar checker and the term checker is as follows:

A grammar checker finds incorrect text.
The term checker ignores correct text.

The term checker for ASD-STE100 is not Software as a Service (SaaS). You install LanguageTool on your computers. All the processing is on your computers. LanguageTool does not send your data to TechScribe.

The structure of LanguageTool

LanguageTool has rules for many languages. For each language, the rules are in the files disambiguation.xml and grammar.xml. (Some languages also have style rules in style.xml.) The term checker uses the English disambiguation.xml and grammar.xml to specify the location of the rules for the term checker.

The disambiguation files specify the terms and the grammar files tell LanguageTool which terms to find

The rules for the term checker are in 4 XML files:

disambiguation-ste9.xml tells LanguageTool which STE terms are approved and which STE terms are not approved. Disambiguation rules identify the part of speech that a term has. For disambiguation, sequence is important. Thus, disambiguation-ste9.xml includes the content of disambiguation-projectterms.xml before the disambiguation rules.
disambiguation-projectterms.xml tells LanguageTool which project terms are approved and which project terms are not approved. You will edit this file to include your organization's technical terms.
grammar-ste9.xml tells LanguageTool to find STE terms that are not approved and STE terms that are used incorrectly.
grammar-projectterms.xml tells LanguageTool to find project terms that are not approved and project terms that are used incorrectly. You can edit this file to give guidance to your technical writers about your technical terms.

[Figure: Disambiguation files and grammar files in the term checker]

Changes to LanguageTool

Each day, a new version of LanguageTool is available from https://languagetool.org/download/snapshots/. Refer to https://forum.languagetool.org/t/ann-changes-in-the-lt-release-process/11015.

To make LanguageTool into a checker for STE, TechScribe gets a snapshot and makes these changes to LanguageTool:

Change the directory name LanguageTool-6.n-SNAPSHOT to ste9-ltyyyy-mm-dd.
Replace \org\languagetool\resource\en\disambiguation.xml with TechScribe's version, which has a password-protected link to disambiguation-ste9-202y-mm-dd.xml on the TechScribe website.
Replace \org\languagetool\resource\en\grammar.xml with TechScribe's version, which has a password-protected link to grammar-ste9-202y-mm-dd.xml on the TechScribe website.
Replace \org\languagetool\rules\en\en-GB\grammar.xml with TechScribe's version of grammar.xml. This change is for spelling rules that are applicable only to British English. Refer to 'For British English, the term checker uses the Oxford spelling'.
Replace \org\languagetool\rules\en\en-US\grammar.xml with TechScribe's version of grammar.xml. This change removes the rules that are not applicable to STE.
Delete \org\languagetool\rules\en\style.xml.
- The rules are not applicable to STE.
- The parts of speech in the term checker have an unwanted effect on the LanguageTool rules. Refer to https://github.com/languagetool-org/languagetool/issues/8414.
Delete \org\languagetool\rules\en\en-GB\style.xml. The rules are not applicable to STE.
Delete \org\languagetool\rules\en\en-US\style.xml. The rules are not applicable to STE.
Delete some terms from \org\languagetool\resource\en\multiwords.txt. LanguageTool gives each term in this file only one part of speech, which can cause an error with the analysis of text. Refer to https://github.com/languagetool-org/languagetool/issues/7779.
Delete terms from \org\languagetool\rules\en\en-GB\replace.txt. The file is necessary for LanguageTool, but the contents are not applicable to STE.
Delete terms from \org\languagetool\rules\en\en-US\replace.txt. The file is necessary for LanguageTool, but the contents are not applicable to STE.
Add TechScribeSTEchecker.bat. In Microsoft Windows, you can use this batch file to start the term checker with Java SDK 17 (or a later version).
Add jaxp-strict.properties. Refer to https://github.com/languagetool-org/languagetool/issues/11558.
Change testrules.bat to use jaxp-strict.properties.
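
The three 'Delete ... style.xml' steps in the list above can be checked with a short script. This is only a sketch. The function name remaining_style_files and the parameter root are hypothetical. The sketch assumes that root is the directory that contains the org directory, and it uses forward slashes so that it also runs on systems other than Microsoft Windows.

```python
from pathlib import Path

# The three style files that the list above says to delete.
STYLE_FILES = [
    "org/languagetool/rules/en/style.xml",
    "org/languagetool/rules/en/en-GB/style.xml",
    "org/languagetool/rules/en/en-US/style.xml",
]


def remaining_style_files(root):
    """Return the style files that are still below root (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in STYLE_FILES if (root / rel).exists()]
```

If the function returns an empty list, none of the three style files is in the directory.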

The disambiguation of terms

If ASD-STE100 shows a term as approved for one part of speech, and unapproved for a different part of speech, then the term checker finds the term if it is used incorrectly. For example, work is approved as a noun, but not as a verb:

[Figure: The STE term checker finds the word 'work' if it is a verb]

In the text, "You must work quickly," work is a verb, not a noun. Thus, the term checker finds the term.

The design of disambiguation-ste9.xml

The disambiguation rules that are in disambiguation-ste9.xml analyse text and add a POS tag to show the part of speech that a word has.

The analysis uses pattern matching. For example, in the text "the X was," X is a noun. X cannot be a verb or some other part of speech. Each rule is applied in sequence. If text matches a pattern, the rule adds a part of speech to the matched term. If the pattern is not matched, the next rule is used.
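
The sequence of pattern rules can be shown with a small sketch. The sketch is not the code of LanguageTool and it is not the content of disambiguation-ste9.xml. The class Token, the tag names NOUN and VERB, and the rule 'must X' are assumptions that are used only for the example. Only the pattern 'the X was' comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    tags: set = field(default_factory=set)  # parts of speech that are still possible


def rule_the_x_was(tokens):
    """Pattern 'the X was': X is a noun."""
    for i in range(1, len(tokens) - 1):
        if tokens[i - 1].text.lower() == "the" and tokens[i + 1].text.lower() == "was":
            tokens[i].tags = {"NOUN"}


def rule_must_x(tokens):
    """Hypothetical pattern 'must X': if X can be a verb, X is a verb."""
    for i in range(1, len(tokens)):
        if tokens[i - 1].text.lower() == "must" and "VERB" in tokens[i].tags:
            tokens[i].tags = {"VERB"}


# The rules are applied in this sequence. A rule that does not match changes nothing.
RULES = [rule_the_x_was, rule_must_x]


def disambiguate(tokens):
    for rule in RULES:
        rule(tokens)
    return tokens


sentence = [Token("You"), Token("must"), Token("work", {"NOUN", "VERB"}), Token("quickly")]
print(disambiguate(sentence)[2].tags)  # prints {'VERB'}
```

In the example, the first rule does not match, so the second rule is used.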

The analysis is not always correct. The examples that follow are not STE. They are standard English:

The analysis uses patterns (n-grams). For some text, full-sentence parsing is necessary:
- A person's right to work to earn money. ['s = verb, right=adjective]
- A person's right to work to earn money is not disputed. ['s = possessive, right=noun]
Semantics has an effect on the analysis:
- To close the Awards Ceremony, the former cricket captain was praised. [former=adjective]
- To improve the structural stability, the former attachment point was moved. [former=noun in a multi-word noun]

The disambiguator is different to the LanguageTool disambiguator

LanguageTool has a disambiguator, but the term checker cannot use it:

A word in ASD-STE100 does not always have the same parts of speech that the word has in standard English. For example, in standard English, the word your is a possessive pronoun, but in ASD-STE100, it is an adjective.
The disambiguation of ASD-STE100 is not the same as the disambiguation of standard English. For example, think about the sentence "You can see the key positions in Figure 7." In standard English, the word key can be parsed as an adjective that modifies the noun positions. In ASD-STE100, key is parsed as a noun, because key is an approved technical name (noun) and is unknown as an adjective.

For more information about disambiguation, refer to these documents:

Developing a Disambiguator (https://dev.languagetool.org/developing-a-disambiguator)
Patterns in language for POS disambiguation in a style checker (www.techscribe.co.uk/ta/patterns-in-language-tcuk-2013.pdf)

The design of grammar-ste9.xml

The rules that are in grammar-ste9.xml tell LanguageTool the text to find. For example, the pattern that finds the verb work is as follows:

<token>work<exception postag_regexp="yes" postag="IS_(NOUN|NNP)|PROJECT_TN_NOUN_MULTI_WORD.*"/></token>

The XML code means, "find the word work, unless work is a noun, a proper noun, or a multi-word project noun."
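
The logic of the token and its exception can be shown as a small Python sketch. The sketch is not the code of LanguageTool. It assumes that the regular expression must match a full POS tag, that the comparison of the word ignores case, and that one matching POS tag is sufficient for the exception. The function name finds_work and the tag IS_VERB are hypothetical.

```python
import re

# The value of the postag attribute in the rule above.
EXCEPTION = re.compile(r"IS_(NOUN|NNP)|PROJECT_TN_NOUN_MULTI_WORD.*")


def finds_work(word, pos_tags):
    """Return True if the rule finds the word (hypothetical helper)."""
    if word.lower() != "work":
        return False
    # The exception applies if a POS tag fully matches the regular expression.
    return not any(EXCEPTION.fullmatch(tag) for tag in pos_tags)


print(finds_work("work", ["IS_VERB"]))  # prints True
print(finds_work("work", ["IS_NOUN"]))  # prints False
```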

To prevent 2 errors for 1 problem, rule STE_RULE_1_1_USE_APPROVED_WORDS tells LanguageTool to find only the unknown terms. Other rules in grammar-ste9.xml cause LanguageTool to find the unapproved terms and the terms that are used incorrectly.

The term checker is for safety-critical documentation. Thus, a design principle is that each rule must find all errors. As a result, there are false positive results.

Limits and defects in LanguageTool prevent the correct analysis of some text

LanguageTool does not always find the end of a sentence

LanguageTool does not always find the end of a sentence: https://github.com/languagetool-org/languagetool/issues/6318.

This defect in LanguageTool causes an error with the STE disambiguation of sentences in a list, if the sentences do not end with a full stop (period) and are not separated by empty lines:

[Figure: If list items are not separated by an empty line, the term checker gives incorrect warnings.]
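
One possible workaround is to put an empty line between the list items before you check the text. The sketch that follows is an assumption, not a function of the term checker. The function name separate_list_items is hypothetical. The sketch assumes that each line of the text is one list item or one paragraph, so do not use it on text that has line breaks in the middle of a sentence.

```python
def separate_list_items(text):
    """Add an empty line after each line that does not end a sentence (hypothetical helper)."""
    lines = text.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        is_last = i + 1 == len(lines)
        if is_last or not line.strip() or not lines[i + 1].strip():
            continue  # an empty line is there already, or no line follows
        if not line.rstrip().endswith((".", "!", "?")):
            out.append("")
    return "\n".join(out)


print(separate_list_items("Remove the cover\nClean the filter\nInstall the cover"))
```

The example prints the three list items with an empty line between them.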

LanguageTool incorrectly removes POS tags

In the conditions that follow, LanguageTool incorrectly removes POS tags from a word, adds the VBG POS tag to the word, and makes the quote mark part of the word:

The word ends with 'in'.
The word plus the letter g can be a verb. Examples: basing, bulleting, tinting.
The word is immediately followed by a single quote mark.
There is no single quote mark immediately before the word.

These changes are done in the LanguageTool Java code. It is not possible to correct these errors in disambiguation-ste9.xml. These changes can cause the term checker to give an incorrect analysis:

[Figure: Changes to POS tags cause an incorrect analysis]

For more examples, refer to https://github.com/languagetool-org/languagetool/issues/9001.
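
The four conditions in this section can be written as one test. The sketch that follows is not the Java code of LanguageTool. The function name may_lose_pos_tags and the small set of verbs are hypothetical, and the sketch assumes that the quote mark is the straight character ('). You can use a test such as this to find words in your text that possibly have the problem.

```python
# A small, hypothetical list. LanguageTool uses its own dictionary.
KNOWN_ING_VERBS = {"basing", "bulleting", "tinting"}


def may_lose_pos_tags(word, char_before, char_after, ing_verbs=KNOWN_ING_VERBS):
    """Return True if all four conditions apply to the word (hypothetical helper)."""
    return (
        word.endswith("in")              # the word ends with 'in'
        and (word + "g") in ing_verbs    # the word plus 'g' can be a verb
        and char_after == "'"            # a single quote mark follows immediately
        and char_before != "'"           # no single quote mark immediately before
    )


print(may_lose_pos_tags("basin", " ", "'"))  # prints True
print(may_lose_pos_tags("basin", "'", "'"))  # prints False
```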

LanguageTool does not have POS tags for all English words

LanguageTool does not have POS tags for all the words that are in English. Because the analysis of text uses n-grams, a missing POS tag can cause an incorrect analysis.

If you find a word that has a missing POS tag, make a report of the error at https://github.com/languagetool-org/languagetool/issues.

Precision and recall

ASD-STE100 is for safety-critical documentation. The best possible analysis is if the term checker finds all the errors in a text (recall=1.0) and does not give incorrect warnings (precision=1.0). For an introduction to precision and recall, refer to Classification: Precision and Recall (https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall).

The precision and the recall are dependent on the text. (The rules for semantics almost always give a warning.) Typical values are as follows:
Precision: 0.86.
Recall: 0.98.
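
To show how the two values are calculated, here is a small sketch. The counts are hypothetical. They are not measurements from the term checker, and they were selected only because they give values that are near to the typical values above. The sketch assumes that the denominators are not zero.

```python
def precision(true_positives, false_positives):
    """Fraction of the warnings that are correct."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Fraction of the errors in the text that are found."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts: 98 correct warnings, 16 incorrect warnings, 2 errors not found.
tp, fp, fn = 98, 16, 2
print(f"Precision: {precision(tp, fp):.2f}")  # prints Precision: 0.86
print(f"Recall: {recall(tp, fn):.2f}")        # prints Recall: 0.98
```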

Refer also to

Building a controlled language lexicon for Danish (https://rauli.cbs.dk/index.php/LSP/article/download/2069/2068)

A specification and validating parser for simplified technical Spanish (https://aclanthology.org/2003.eamt-1.15.pdf)

A simple rule-based part of speech tagger (https://aclanthology.org/H92-1022/)
