
Machine Readable Glossary Generation Tool

Editor's note

Documentation needs to be adjusted for:

  • Converting formPhrases: MRGT will write expanded formPhrase macros into MRGEntry formPhrases field

The Machine Readable Glossary Generation Tool (MRGT) generates Machine Readable Glossaries (MRGs) for one specific terminology version, or for all terminology versions, that are curated within a specific scope. MRGs come in a specific, well-defined format. They contain some meta-data, followed by a list of so-called MRG entries, one for every term in the scope, which represent concepts and other semantic units that are known within that scope.

The (newly generated) MRG(s) are meant to be processed by the other tools in the toolbox, regardless of whether such tools are called from within the context of another scope. As it contains every term that is used in the scope, and includes all the relevant meta-data, an MRG serves as the single, authoritative source of that (version of the) scope's terminology.

Installing the Tool

The tool can be installed from the command line and made globally available by executing

npm install -g @tno-terminology-design/mrgt
Before running the tool from the command line, make sure you have met the necessary prerequisites depending on your operating environment.

  1. Node.js and NPM: Ensure Node.js and NPM are installed.
  2. Global Installation: If you have installed the package globally, confirm the global NPM modules path by running npm config get prefix. The global modules are usually stored under <prefix>/node_modules.
  3. Environment Variables: Add the path to global NPM binaries to your system's PATH environment variable. This should be <prefix> on Windows. To add to PATH, you can edit your environment variables or run set PATH=%PATH%;<prefix> in the CMD.

Calling the Tool

The behavior of the MRGT can be configured per call e.g. by a configuration file and/or command-line parameters. The command-line syntax is as follows:

mrgt [ <paramlist> ]

where <paramlist> is an (optional) list of parameters.

Legend

The columns in the following table are defined as follows:

  1. Parameter specifies the parameter and further specifications
  2. Req'd specifies whether (Y) or not (n) the field is required to be present when the tool is being called. If required, it MUST either be present in the configuration file, or as a command-line parameter.
  3. Description specifies the meaning of the Value field, and other things you may need to know, e.g. why it is needed, a required syntax, etc.

If a configuration file is used, the long version of the parameter must be used (without the preceding --).

| Key | Req'd | Description |
|-----|-------|-------------|
| -c, --config <path> | n | Path (including the filename) of the tool's (YAML) configuration file. |
| -h, --help | n | Display help for command. |
| -o, --onNotExist <action> | n | The action in case a vsntag was specified, but wasn't found in the SAF. |
| -s, --scopedir <path> | n | Path of the scope directory from which the tool is called. |
| -v, --vsntag <vsntag> | n | Versiontag for which the MRG needs to be (re)generated. |
| -V, --version | n | Output the version number of the tool. |

The <action> parameter can take the following values:

| <action> | Description |
|----------|-------------|
| 'throw' | an error is thrown (an exception is raised), and processing will stop. |
| 'warn' | a message is displayed (and logged) and processing continues. |
| 'log' | a message is written to a log(file) and processing continues. |
| 'ignore' | processing continues as if nothing happened. |
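
The four actions can be read as a small dispatch on the value of onNotExist. The following Python sketch only illustrates that reading; it is not the MRGT's own code. The names on_not_exist, VersionNotFoundError and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("mrgt-sketch")  # hypothetical logger name


class VersionNotFoundError(Exception):
    """Hypothetical exception for a vsntag that is missing from the SAF."""


def on_not_exist(action: str, vsntag: str) -> bool:
    """Apply an onNotExist action; return True when processing may continue."""
    message = f"vsntag '{vsntag}' was not found in the SAF"
    if action == "throw":
        # An exception is raised, so processing stops.
        raise VersionNotFoundError(message)
    if action == "warn":
        print(f"WARNING: {message}")  # displayed ...
        logger.warning(message)       # ... and logged
        return True
    if action == "log":
        logger.info(message)          # written to the log only
        return True
    if action == "ignore":
        return True                   # continue as if nothing happened
    raise ValueError(f"unknown onNotExist action: {action!r}")
```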
info

Some parameters can only be set through a configuration file. The macros parameter, which is used in form phrases, is one example. See the configuration file page for details.

Running the Tool

One run of the MRGT either

  • generates an MRG for one specific terminology version within the current scope (which is the case when the vsntag parameter was specified), or it
  • generates multiple MRGs, i.e., one for every version of the terminology that is curated within the current scope (which is the case when the vsntag parameter is omitted).

Running the tool comprises the following phases (see note 1 under Notes):

  1. Constructing a provisional MRG;
  2. Post-processing the entries in that provisional MRG;
  3. Creating/overwriting MRG file(s) in the glossarydir of the current scope.

Phase 1: constructing a provisional MRG

Generating an MRG for a particular version of a terminology starts by reading the SAF of the scope within which that terminology is curated. The SAF resides in the scopedir that was provided as one of the calling parameters. If a vsntag argument is provided, the MRGT searches the versions section of the SAF for the corresponding entry, i.e. the entry that has the value of the vsntag parameter either in its vsntag field or as one of the elements of its altvsntags field. If the SAF does not have a corresponding entry, the action specified in the onNotExist parameter determines whether or not (and how) to proceed.
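
The lookup described above can be sketched in Python as follows. This is a minimal illustration, assuming the SAF has already been parsed into a dictionary whose versions key holds a list of entries. The function name find_version_entry and the sample data are hypothetical.

```python
from typing import Optional


def find_version_entry(saf: dict, vsntag: str) -> Optional[dict]:
    """Return the versions entry matching vsntag, or None if there is none.

    An entry matches when vsntag equals its vsntag field or is one of the
    elements of its altvsntags field.
    """
    for entry in saf.get("versions") or []:
        if entry.get("vsntag") == vsntag:
            return entry
        if vsntag in (entry.get("altvsntags") or []):
            return entry
    return None


# Hypothetical sample data, shaped like a parsed SAF.
sample_saf = {
    "versions": [
        {"vsntag": "v1.0", "altvsntags": ["latest"]},
        {"vsntag": "v0.9"},
    ]
}
```

When the function returns None, the caller would apply the action given by the onNotExist parameter.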

In this phase, for every terminology version that is to be created, one provisional MRG is created, that contains a provisional MRG entry for every term contained in the particular version of the terminology. This provisional MRG entry either contains:

Creating a provisional MRG

Creating a provisional MRG starts with an empty set of MRG entries - we use the term "provisional MRG" to refer to this set.

Then, the list of term selection instructions as specified in the appropriate entry of the versions section of the SAF is processed. Each instruction is processed in turn, in the order in which the instructions are specified.

Instructions exist for:

Processing FormPhrases

Form phrases that are specified in a curated text may include uppercase characters, special characters, spaces etc., all of which make their use by tools cumbersome. In order to make it easier for TEv2 tools to use them, they need to be converted into regularized form phrases.

Converting the set of form phrases (as specified in the formPhrases field from a curated text) into regularized form phrases (for storage in an MRG entry) is done as follows:

  1. every form phrase (in the set of form phrases) that contains a form phrase macro, is replaced with one or more form phrases that are the result of processing that macro - see Form Phrase Macro Expansion for the details and examples.
  2. as a single form phrase may contain multiple macros, step 1 must be repeated until all macros are processed and the set of form phrases no longer contains any macro.
  3. all form phrases in the resulting set are now regularized, i.e., turned into regularized form phrases.
  4. a regularized form phrase is added, the value of which is the same as the value of the term field of the curated text. Thus, tools that work with form phrases from MRG entries can find all forms, including that of the term itself, as an element in the formPhrases field of the MRG entry.
  5. finally, the resulting set of regularized form phrases is pruned, such that every regularized form phrase appears only once in the end result.
  6. this end-result is then written into the formPhrases field of the MRG entry.
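
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the MRGT's implementation. The macro table MACROS is hypothetical (the real macros are defined on the Form Phrase Macro Expansion page and in the configuration file), and the regularize rule (lowercase, with runs of other characters replaced by a hyphen) is an assumption as well.

```python
import re

# Hypothetical macro table: each macro maps to the alternatives it expands to.
MACROS = {"{ss}": ["", "s"], "{yies}": ["y", "ies"]}


def regularize(text: str) -> str:
    """Assumed rule: lowercase, runs of non-alphanumerics become one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def expand_macros(phrases):
    """Steps 1-2: keep expanding until no phrase contains a known macro."""
    pending = list(phrases)
    done = []
    while pending:
        phrase = pending.pop(0)
        macro = next((m for m in MACROS if m in phrase), None)
        if macro is None:
            done.append(phrase)
        else:
            # Replace one occurrence; the results are re-examined later.
            pending.extend(phrase.replace(macro, alt, 1) for alt in MACROS[macro])
    return done


def build_form_phrases(term, form_phrases):
    expanded = expand_macros(form_phrases)       # steps 1-2
    regular = [regularize(p) for p in expanded]  # step 3
    regular.append(regularize(term))             # step 4 (term assumed regularizable)
    return list(dict.fromkeys(regular))          # step 5: prune duplicates


# Step 6 would store this list in the formPhrases field of the MRG entry.
example = build_form_phrases("party", ["part{yies}", "Third Party"])
```

Here the duplicate "party" (once from the macro, once from the term itself) appears only once in the result.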
tip

An MRG SHOULD NOT have two (or more) MRG entries that have a same element in their formPhrases field, because that would mean that the form phrase is ambiguous, as it refers to two different semantic units.
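
A check for such ambiguous form phrases could look like the sketch below. It assumes MRG entries are available as dictionaries with termid and formPhrases keys. The function name ambiguous_form_phrases and the sample entries are hypothetical.

```python
from collections import defaultdict


def ambiguous_form_phrases(entries):
    """Map each form phrase used by more than one MRG entry to their termids."""
    owners = defaultdict(set)
    for entry in entries:
        for phrase in entry.get("formPhrases") or []:
            owners[phrase].add(entry["termid"])
    return {p: sorted(ids) for p, ids in owners.items() if len(ids) > 1}


# Hypothetical sample entries: "party" is claimed by two different terms.
sample_entries = [
    {"termid": "concept:party", "formPhrases": ["party", "parties"]},
    {"termid": "concept:actor", "formPhrases": ["actor", "party"]},
]
```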

Storing a provisional MRG in the glossarydir

When the creation of a provisional MRG is complete, a filename mrg.<scopetag>.<vsntag>.yaml is constructed, where:

If a file with that name already exists in the glossarydir of the current scope, it will be deleted. Then, a new file with that name will be created, which will contain:

Then, if the <vsntag> part of the filename equals the value of the defaultvsn field in the scope section of the SAF, a copy of that file is created in the glossarydir whose filename is mrg.<scopetag>.yaml, which is the name by which the default MRG of the current scope is referred to.

Next, the MRGT will create a copy of the MRG file for every versiontag that exists in the altvsntags-field of the element in the versions section of the SAF from which the MRG was generated. The copy will contain the same MRG as the file that has just been written. The name of this copied file is mrg.<scopetag>.<altvsntag>.yaml, which is the same name as the MRG file, except that the <vsntag> part of that filename is replaced with the value of the versiontag found in the altvsntags-field.
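
The naming rules for the MRG file and its copies can be summarized in a short Python sketch. It only computes file names and does no file I/O. The function name mrg_filenames is hypothetical, and the version entry is assumed to be a dictionary parsed from the versions section of the SAF.

```python
def mrg_filenames(scopetag, version_entry, defaultvsn=None):
    """Return (primary file name, list of file names for the copies)."""
    vsntag = version_entry["vsntag"]
    primary = f"mrg.{scopetag}.{vsntag}.yaml"
    copies = []
    if vsntag == defaultvsn:
        # The default MRG of the scope carries no vsntag in its name.
        copies.append(f"mrg.{scopetag}.yaml")
    for alt in version_entry.get("altvsntags") or []:
        copies.append(f"mrg.{scopetag}.{alt}.yaml")
    return primary, copies
```

For example, with scopetag "tev2", vsntag "v1", altvsntags ["latest"] and defaultvsn "v1", this yields mrg.tev2.v1.yaml plus the copies mrg.tev2.yaml and mrg.tev2.latest.yaml.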

Phase 2: post processing Synonyms

Editor's note

We may want to deprecate the use of Synonyms as they have been specified now, because it is a complex thing, and most often, its uses can also be realized in a different way (particularly since we can now generate HRGs using multiple converters)

This phase starts only after all provisional MRGs that the MRGT was instructed to build in this run have been created, and the corresponding files have been added to the glossarydir of the current scope. This allows post processing, e.g. of synonyms, to use the newly generated provisional MRG entries.

When a provisional MRG entry in (one of) the created provisional MRGs has a synonymOf field that contains a term identifier, this will now refer to either

Effectively, this means that whenever a term is defined as a synonym of some other term, the corresponding MRG entry will have all fields of this other term, except for those that were specified in the header of the term that is defined as a synonym of that other term.
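
In terms of data, this amounts to a merge in which the fields of the synonym's own header take precedence. A minimal Python sketch, assuming both the header and the referenced MRG entry are plain dictionaries (the function name resolve_synonym is hypothetical):

```python
def resolve_synonym(synonym_header: dict, target_entry: dict) -> dict:
    """Build the MRG entry for a term that is a synonym of another term.

    Start from all fields of the referenced entry, then let the fields that
    were specified in the synonym's own header override them.
    """
    merged = dict(target_entry)
    merged.update(synonym_header)
    return merged
```

How the MRGT treats the synonymOf field itself after merging is not specified here; this sketch simply keeps it.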

Phase 3: post processing other fields

Now, all provisional MRG entries in all provisional MRGs are processed so that they become usable from the context within which they have been selected. All fields that are required for regular MRG entries are processed as specified in the following table:

| Field | Value(s) that are assigned to the fields |
|-------|------------------------------------------|
| scopetag | ensure the contents of this field has the value of the scopetag field as found in the scope section of the SAF. |
| vsntag | ensure the contents of this field contains the versiontag that identifies the version of the terminology from which the contents of the MRG entry is obtained. If the contents of the MRG entry was constructed from a curated text, its value equals the value of the vsntag field in the terminology-section of the MRG that this MRG entry is a part of. As a result, scopetag:versiontag identifies the terminology from which this MRG entry has originated. |
| locator | ensure that the contents of this field is the path, relative to scopedir/curatedir/, of the file that contains the (header of) the curated text. |
| navurl | ensure that the contents of this field is the (localized) path to which browsers navigate in order to see the rendered version of the curated text. |
| termType | ensure that the contents of this field exists, and that it is regularized; if it does not, it must be created and its value shall be the same as the value of the defaulttype field in the scope section in the SAF, or, if that doesn't exist, its value should be concept. |
| term | ensure that the contents of this field exists, and that it is regularized. An exception must be raised if this field does not exist. |
| termid | ensure that the value of this field is "<termType>:<term>", where <termType> and <term> are the values of the corresponding fields in this MRG entry. There MUST NOT be another MRG entry within the MRG that has a termid field with the same value. |
| formPhrases | ensure that this field contains an array of regularized form phrases, and that one of its elements has the same value as the term field. |
| headingids | ensure that the contents of this field is a list of the markdown headings and/or heading ids that are found in the body of the curated text. Note that this body can be either in the curated text file or in a separate body file. This is explained in more detail in a subsection below. |
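
As an illustration of the rules for scopetag, termType, term and termid, consider the following Python sketch. It assumes entries and the scope section of the SAF are dictionaries, and it leaves out regularization of the values. The function names assign_term_fields and duplicate_termids are hypothetical.

```python
from collections import Counter


def assign_term_fields(entry: dict, saf_scope: dict) -> dict:
    """Return a copy of entry with scopetag, termType and termid filled in."""
    if not entry.get("term"):
        # The term field is mandatory: raise an exception if it is missing.
        raise ValueError("MRG entry has no 'term' field")
    result = dict(entry)
    result["scopetag"] = saf_scope["scopetag"]
    # Fall back to defaulttype from the SAF, and to 'concept' after that.
    result["termType"] = (
        entry.get("termType") or saf_scope.get("defaulttype") or "concept"
    )
    result["termid"] = f"{result['termType']}:{result['term']}"
    return result


def duplicate_termids(entries):
    """List termid values that occur in more than one MRG entry."""
    counts = Counter(e["termid"] for e in entries)
    return sorted(t for t, n in counts.items() if n > 1)
```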

The following sections elaborate on the construction of (the contents) of some of these fields.

The navurl field is constructed by concatenating website/navpath/curatedir/id, where

  • website, navpath and curatedir are given by the contents of the respective fields in the scope section of the SAF.
    However, if the bodyFile field in the header of the curated text file is set, the path to the body file is used instead of the navpath and curatedir, so navurl will then be website/bodyFile

  • The id part is one of the following:

    1. if the scope section of the SAF contains the field navid, then its contents specify the name of the field in the header of the curated text or body file that will be used to create the id part. Thus, static site generators such as Docusaurus, which use the id field to specify this value, can be accommodated.
    2. if the SAF does not specify the navid field, or the navid field in a curated text or body file is not set, then id will be based on the name of the curated text file or the name of the body file.
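
The construction can be sketched in Python as follows. The sketch reads the rules above literally and makes two assumptions that the text does not confirm: when bodyFile is set, the result is simply website/bodyFile, and an id that is "based on" a file name is that file name without its extension. The function name build_navurl is hypothetical.

```python
from pathlib import PurePosixPath


def build_navurl(saf_scope: dict, header: dict, curated_file: str) -> str:
    """Construct navurl from the SAF scope section and a curated text header."""
    website = saf_scope["website"].rstrip("/")
    body_file = header.get("bodyFile")
    if body_file:
        # A body file replaces the navpath/curatedir part of the URL.
        return f"{website}/{body_file.lstrip('/')}"
    navid = saf_scope.get("navid")
    page_id = header.get(navid) if navid else None
    if not page_id:
        # Assumption: fall back to the file name without its extension.
        page_id = PurePosixPath(curated_file).stem
    parts = [saf_scope.get("navpath", ""), saf_scope.get("curatedir", ""), str(page_id)]
    path = "/".join(p.strip("/") for p in parts if p and p.strip("/"))
    return f"{website}/{path}"
```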

Constructing the headingids field

The headingids field is constructed by finding all markdown headings in the body-file (or the curated text file if there is no separate body file), and making a list out of them.

Example of Markdown Headers and their `headingid` fields

Markdown headings are only recognized when they are preceded with number signs (#) at the beginning of a line.

Here is an example of a markdown header:


## This is a Markdown Header

This header will result in the text this-is-a-markdown-header being added as an element in the headingids field.
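
A minimal Python sketch of this extraction is given below. The slug rule (lowercase, with runs of non-alphanumeric characters replaced by a hyphen) is an assumption that reproduces the example above; the MRGT may treat special characters differently. The sketch does not handle explicit heading ids, nor lines starting with # inside fenced code blocks. The function name heading_ids is hypothetical.

```python
import re

# A heading is a line that starts with one to six '#' signs and a space.
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def heading_ids(markdown_body: str) -> list:
    """Collect slugs for all markdown headings found in the body text."""
    ids = []
    for match in HEADING.finditer(markdown_body):
        slug = re.sub(r"[^a-z0-9]+", "-", match.group(1).lower()).strip("-")
        if slug:
            ids.append(slug)
    return ids


sample_body = "## This is a Markdown Header\n\nText with a # sign.\n### Second one\n"
```

For sample_body, the function returns this-is-a-markdown-header and second-one; the # in the middle of the text line is ignored because it is not at the beginning of a line.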

Phase 4: checking the result

The last step consists of checking crucial properties that MRGs are relied upon to have, and raising appropriate exceptions in case something is wrong. This helps curators who check the log outputs to become aware of things they may need to fix before these MRGs are used further (or published).

In this step, the following checks are done (as a minimum):

Exceptions, Warnings, and Logging

Editor's note

This section needs to be reviewed/revised so as to enable a consistent way of error checking and logging, similar to what is done in the TRRT

The general principle is that the MRGT helps its users to do their jobs. This means that errors that terminate processing are kept to a minimum, that warnings (perhaps at different 'levels' of detail/severity) are output whenever possible (yet may be limited by command-line options), and that message texts are tailored to the envisaged users of the tool.

The MRGT logs conditions that prevent it from properly:

Also, the MRGT provides suggestions that help tool-operators (curators) to not only identify, but also fix any problems.

The MRGT comes with documentation that enables developers to ascertain its correct functioning (e.g. by using a test set of files, test scripts that exercise its parameters, etc.), and also enables them to deploy the tool in a git repo and author/modify CI-pipes to use that deployment.

Notes


  1. The MRGT MUST NOT start by overwriting files that contain an MRG, as they should remain available as a (possible) source for copying MRG entries from during the construction of one or more provisional MRGs. Writing the actual files should be done after all provisional MRGs have been constructed.