The design of the term checker for ASD-STE100

This term checker is a customized version of LanguageTool. The grammar and the terms that the term checker validates are specified in 4 XML files. (The word 'term' means the same as the word 'word' in ASD-STE100.)

An important difference between a grammar checker and the term checker is as follows:

The term checker for ASD-STE100 is not Software as a Service (SaaS). You install LanguageTool on your computers. All the processing is on your computers.

The structure of LanguageTool

LanguageTool has rules for many languages. For each language, the rules are in the files disambiguation.xml and grammar.xml. (Some languages also have style rules in style.xml.) The term checker uses the English disambiguation.xml and grammar.xml to specify the location of the rules for the term checker.

The disambiguation files specify the terms and the grammar files tell LanguageTool which terms to find

The rules for the term checker are in 4 XML files:

Disambiguation files and grammar files in the term checker

Changes to LanguageTool

To make LanguageTool into a checker for STE, TechScribe makes these changes to files in LanguageTool:

The files for the term checker are not in the LanguageTool directory

You will create the installation directory in the installation step 'Create the directories and install LanguageTool'.

In LanguageTool, disambiguation.xml and grammar.xml contain all the rules for the grammar checks and the style checks. The 2 XML files are in the LanguageTool-n.n directory. Thus, if you use the stand-alone version of LanguageTool and the OpenOffice version, you have 2 sets of files.

The term checker files are not in the LanguageTool-n.n directory. This method has these advantages:

When you install the term checker, you will replace disambiguation.xml and grammar.xml with files from TechScribe. The new files contain links to the files for the term checker (disambiguation-ste8.xml, grammar-ste8.xml, disambiguation-projectterms.xml, and grammar-projectterms.xml).

The files disambiguation-ste8.xml and grammar-ste8.xml are in these locations:

The disambiguation of terms

If ASD-STE100 shows a term as approved for one part of speech, and unapproved for a different part of speech, then the term checker finds the term if it is used incorrectly. For example, 'work' is approved as a noun, but not as a verb:

The STE term checker finds the word 'work' if it is a verb

In the text "You must work quickly," the word 'work' is a verb, not a noun. Thus, the term checker finds the term.

The design of disambiguation-ste8.xml

The disambiguation rules that are in disambiguation-ste8.xml analyse text and add a POS tag to show the part of speech that a word has.

The analysis uses pattern matching. For example, in the text "the X was," X is a noun. X cannot be a verb or some other part of speech. Each rule is applied in sequence. If text matches a pattern, the rule adds a part of speech to the matched term. If the pattern is not matched, the next rule is used.
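The sequence of pattern rules can be shown with a minimal sketch in Python. The sketch is not the code of the term checker, which uses XML rules in LanguageTool. The rule names, the regular expressions, and the tag IS_VERB are hypothetical. They are only for illustration.

```python
import re

# Hypothetical rules: (rule name, pattern, POS tag to add to the group "x").
# The real rules are XML rules in disambiguation-ste8.xml.
RULES = [
    ("NOUN_BETWEEN_THE_AND_WAS", re.compile(r"\bthe (?P<x>\w+) was\b", re.I), "IS_NOUN"),
    ("VERB_AFTER_MUST", re.compile(r"\bmust (?P<x>\w+)\b", re.I), "IS_VERB"),
]


def disambiguate(text):
    """Apply each rule in sequence and return {matched word: [POS tags]}."""
    tags = {}
    for _name, pattern, tag in RULES:
        # If the pattern does not match, the loop continues with the next rule.
        for match in pattern.finditer(text):
            tags.setdefault(match.group("x").lower(), []).append(tag)
    return tags


print(disambiguate("The work was good."))
print(disambiguate("You must work quickly."))
```

In the first text, the word 'work' gets the noun tag. In the second text, the word 'work' gets the hypothetical verb tag.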

The analysis is not always correct. The examples that follow are not STE. They are standard English:

The disambiguator is different to the LanguageTool disambiguator

LanguageTool has a disambiguator, but the term checker cannot use it:

For more information about disambiguation, refer to these documents:

The design of grammar-ste8.xml

The rules that are in grammar-ste8.xml tell LanguageTool the text to find. For example, the pattern that finds the verb 'work' is as follows:

<token>work<exception postag_regexp="yes" postag="IS_(NOUN|NNP)|PROJECT_TN_NOUN_MULTI_WORD.*"/></token>

The XML code means, "find the word 'work', unless 'work' is a noun, a proper noun, or a multi-word project noun."
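The logic of the token and its exception can be shown with a minimal sketch in Python. The regular expression is copied from the XML pattern. The function name, the tag IS_VERB, and the suffix _START are hypothetical. The sketch assumes that the regular expression must match the full POS tag and that the comparison of the word ignores case.

```python
import re

# Regular expression copied from the exception in the XML pattern.
EXCEPTION_POSTAG = re.compile(r"IS_(NOUN|NNP)|PROJECT_TN_NOUN_MULTI_WORD.*")


def token_matches(word, postags):
    """Return True if the word is 'work' and no POS tag matches the exception."""
    if word.lower() != "work":
        return False
    # Assumption: the exception applies if one or more tags match fully.
    return not any(EXCEPTION_POSTAG.fullmatch(tag) for tag in postags)


print(token_matches("work", ["IS_VERB"]))  # the word is found
print(token_matches("work", ["IS_NOUN"]))  # the exception prevents a match
```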

To prevent 2 errors for 1 problem, rule STE_RULE_1_1_USE_APPROVED_WORDS tells LanguageTool to find only the unknown terms. Other rules in grammar-ste8.xml cause LanguageTool to find the unapproved terms and the terms that are used incorrectly.

Limits and defects in LanguageTool prevent the correct analysis of some text

LanguageTool does not always find the end of a sentence

LanguageTool does not always find the end of a sentence: https://github.com/languagetool-org/languagetool/issues/6318.

This defect in LanguageTool causes an error with the STE disambiguation of sentences in a list, if the sentences do not end with a full stop (period) and are not separated by empty lines:

If list items are not separated by an empty line, the term checker gives incorrect warnings.
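One possible method to prepare text is to put an empty line between list items before the check. The minimal sketch in Python that follows is not a part of the term checker or of LanguageTool. The function name is hypothetical. The sketch assumes that each non-empty line of the text is one list item or one paragraph.

```python
def separate_lines(text):
    """Put one empty line between consecutive non-empty lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    return "\n\n".join(lines)


items = "Remove the cover\nClean the filter\nInstall the cover"
print(separate_lines(items))
```

Do not use the sketch on text in which one sentence continues on more than one line, because the sketch breaks such a sentence into separate paragraphs.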

LanguageTool incorrectly removes POS tags

In the conditions that follow, LanguageTool incorrectly removes POS tags from a word, adds the VBG POS tag to the word, and makes the quote mark part of the word:

These changes are done in the LanguageTool Java code. It is not possible to correct these errors in disambiguation-ste8.xml. These changes can cause the term checker to give an incorrect analysis:

Changes to POS tags cause an incorrect analysis

For more examples, refer to https://github.com/languagetool-org/languagetool/issues/9001.

Precision and recall

ASD-STE100 is for safety-critical documentation. The best possible analysis occurs when the term checker finds all the errors in a text (recall=1.0) and gives no incorrect warnings (precision=1.0). For an introduction to precision and recall, refer to Classification: Precision and Recall (https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall).

The precision and the recall are dependent on the text. (The rules for semantics usually give a warning.) Typical values are as follows:
Precision: 0.86.
Recall: 0.98.
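
The calculation of the 2 values can be shown with a minimal sketch in Python. The counts are hypothetical. They are not measured data. They are selected only because they give the typical values.

```python
def precision(correct_warnings, incorrect_warnings):
    """Fraction of the warnings that are correct."""
    return correct_warnings / (correct_warnings + incorrect_warnings)


def recall(correct_warnings, missed_errors):
    """Fraction of the errors in the text that the checker finds."""
    return correct_warnings / (correct_warnings + missed_errors)


# Hypothetical counts for a text that contains 50 errors.
correct_warnings = 49    # errors that the checker finds
incorrect_warnings = 8   # warnings for text that is correct
missed_errors = 1        # errors that the checker does not find

print(round(precision(correct_warnings, incorrect_warnings), 2))  # 0.86
print(round(recall(correct_warnings, missed_errors), 2))          # 0.98
```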

Refer also to these documents:

Building a controlled language lexicon for Danish (https://rauli.cbs.dk/index.php/LSP/article/download/2069/2068)

A specification and validating parser for simplified technical Spanish (https://aclanthology.org/2003.eamt-1.15.pdf)

A simple rule-based part of speech tagger (https://aclanthology.org/H92-1022/)