The design of the term checker for ASD-STE100

The rules for the term checker are in 4 XML files:

The disambiguation files specify the terms, and the grammar files tell LanguageTool which terms to find.

Disambiguation files and grammar files in the term checker

The files for the term checker are outside the LanguageTool installation directory


In LanguageTool, disambiguation.xml and grammar.xml contain all the rules for the grammar checks and the style checks. The 2 XML files are in the installation directory. If you use the stand-alone version of LanguageTool and the OpenOffice version, then you have 2 sets of files.

The term checker files are outside the installation directories. This method has these advantages:

When you install the term checker, you will replace disambiguation.xml and grammar.xml with files from TechScribe. The new files contain links to the files for the term checker.

The disambiguation of terms

If ASD-STE100 shows a term as approved for one part of speech, and unapproved for a different part of speech, then the term checker finds the term if it is used incorrectly. For example, 'work' is approved as a noun, but not as a verb:

The STE term checker finds the word 'work' if it is a verb

In the text, "You must work quickly," the word 'work' is a verb, not a noun. Thus, the term checker finds the term.

The design of disambiguation-ste7.xml

The rules that are in disambiguation-ste7.xml analyse text, and apply a part of speech if a term is identified unambiguously as one part of speech.

The analysis uses pattern matching. For example, in the text "the X was," X is a noun. X cannot be a verb or some other part of speech. Each rule is applied in sequence. If text matches a pattern, the rule adds a part of speech to the matched term. If the pattern is not matched, the next rule is used.
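The sequence of rules can be shown with a minimal Python sketch. The sketch is not the LanguageTool implementation and it does not use the rule syntax of disambiguation-ste7.xml. The names Token, RULES, and disambiguate are hypothetical. The sketch assumes that the rule adds the tag IS_NOUN, which is the tag that the grammar rules examine.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """A word and the part-of-speech tags that the rules added to it."""
    text: str
    tags: set = field(default_factory=set)


# Hypothetical rule table: (word before, word after, tag to add).
# The one entry models the pattern "the X was", in which X is a noun.
RULES = [
    ("the", "was", "IS_NOUN"),
]


def disambiguate(words, rules=RULES):
    """Apply each rule in sequence and add a tag where a pattern matches."""
    tokens = [Token(word) for word in words]
    for before, after, tag in rules:
        # Examine each token that has a neighbour on its two sides.
        for i in range(1, len(tokens) - 1):
            if (tokens[i - 1].text.lower() == before
                    and tokens[i + 1].text.lower() == after):
                tokens[i].tags.add(tag)
    return tokens
```

For the text "the work was slow", the sketch adds IS_NOUN to 'work'. For the text "You must work quickly", no pattern matches, and 'work' has no tags.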

For technical details, refer to these documents:

The design of grammar-ste7.xml

The rules that are in grammar-ste7.xml tell LanguageTool the text to find. For example, the pattern that finds the verb 'work' is as follows:

<token>work<exception postag_regexp="yes" postag="IS_NOUN|PROJECTTERM"/></token>

The XML code means, "find the word 'work', unless 'work' is a noun or a project term." 'Token' is the LanguageTool term for a word.
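The logic of the token and its exception can be shown with a minimal Python sketch. The sketch is not LanguageTool code. The function name matches_work_token is hypothetical. The sketch assumes that the word is compared without regard to case, and that the regular expression must match a full tag.

```python
import re


def matches_work_token(text, postags, exception_regexp=r"IS_NOUN|PROJECTTERM"):
    """Return True if the token is 'work' and no tag matches the exception.

    text: the word in the sentence.
    postags: the tags that disambiguation added to the word.
    """
    # Assumption: the comparison of the word ignores case.
    if text.lower() != "work":
        return False
    # The exception removes the match if one or more tags match the regexp.
    return not any(re.fullmatch(exception_regexp, tag) for tag in postags)
```

If 'work' has no tags, the function returns True, and the term checker finds the word. If 'work' has the tag IS_NOUN or the tag PROJECTTERM, the function returns False, and the word is ignored.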

To prevent 2 errors for 1 problem, rule STE_RULE_1_1_USE_APPROVED_WORDS tells LanguageTool to find only the unknown terms. Other rules in grammar-ste7.xml cause LanguageTool to find the unapproved terms and the terms that are used incorrectly.

For an STE_NOT_APPROVED term, the exception is applicable to a multi-word project term. The postags prevent a rule from finding an unapproved STE term that is in a multi-word project term or in a multi-word unapproved project term.

Refer also to

Building a controlled language lexicon for Danish (http://rauli.cbs.dk/index.php/LSP/article/download/2069/2068)

A specification and validating parser for simplified technical Spanish (www.csis.ul.ie/staff/Richard.Sutcliffe/ruiz_cascales_thesis03.pdf)

A simple rule-based part of speech tagger (www.aclweb.org/anthology/H92-1022)
