Machine Readable Glossary Generation Tool
Documentation needs to be adjusted for:
- Converting formPhrases: MRGT will write expanded formPhrase macros into the MRGEntry `formPhrases` field.
The Machine Readable Glossary generation Tool (MRGT) generates Machine Readable Glossaries (MRGs) for one specific terminology version, or for all terminology versions, that are curated within a specific scope. MRGs come in a specific, well-defined format. They contain some meta-data, followed by a list of so-called MRG entries, one for every term in the scope; these entries represent concepts and other semantic units that are known within that scope.
The (newly generated) MRG(s) are meant to be processed by the other tools in the toolbox, regardless of whether such tools are called from within the context of another scope. Because an MRG contains every term that is used in the scope, and includes all the relevant meta-data, it serves as the single, authoritative source of that (version of the) scope's terminology.
Installing the Tool
The tool can be installed from the command line and made globally available by executing
npm install -g @tno-terminology-design/mrgt
Before running the tool from the command line, make sure you have met the necessary prerequisites for your operating environment.
- CMD.exe (Windows)
  - Node.js and NPM: Ensure Node.js and NPM are installed.
  - Global Installation: If you have installed the package globally, confirm the global NPM modules path by running `npm config get prefix`. The global modules are usually stored under `<prefix>/node_modules`.
  - Environment Variables: Add the path to global NPM binaries to your system's PATH environment variable. This should be `<prefix>` on Windows. To add to PATH, you can edit your environment variables or run `set PATH=%PATH%;<prefix>` in the CMD.
- PowerShell (Windows)
  - Node.js and NPM: Ensure Node.js and NPM are installed.
  - Global Installation: Check the global NPM modules path as in CMD.
  - Environment Variables: Update the PATH environment variable as in CMD. You can also use `$env:Path += ";<prefix>"` to update the PATH temporarily in the current PowerShell session.
- Bash (Linux/Mac)
  - Node.js and NPM: Ensure Node.js and NPM are installed.
  - Global Installation: If globally installed, run `npm config get prefix` to get the global modules path, usually `<prefix>/lib/node_modules`.
  - Environment Variables: Add the `<prefix>/bin` directory to your `PATH` if it's not already there. You can do this by adding `export PATH=$PATH:<prefix>/bin` to your `~/.bashrc` or `~/.zshrc` file.
Calling the Tool
The behavior of the MRGT can be configured per call e.g. by a configuration file and/or command-line parameters. The command-line syntax is as follows:
mrgt [ <paramlist> ]
where `<paramlist>` is an (optional) list of parameters.
Legend
The columns in the following table are defined as follows:
- `Parameter` specifies the parameter and further specifications.
- `Req'd` specifies whether (`Y`) or not (`n`) the field is required to be present when the tool is being called. If required, it MUST be present either in the configuration file or as a command-line parameter.
- `Description` specifies the meaning of the `Value` field, and other things you may need to know, e.g. why it is needed, a required syntax, etc.
If a configuration file is used, the long version of the parameter must be used (without the preceding `--`).
Key | Req'd | Description |
---|---|---|
`-c`, `--config <path>` | n | Path (including the filename) of the tool's (YAML) configuration file. |
`-h`, `--help` | n | Display help for the command. |
`-o`, `--onNotExist <action>` | n | The action in case a `vsntag` was specified, but wasn't found in the SAF. |
`-s`, `--scopedir <path>` | n | Path of the scope directory from which the tool is called. |
`-v`, `--vsntag <vsntag>` | n | Versiontag for which the MRG needs to be (re)generated. |
`-V`, `--version` | n | Output the version number of the tool. |
The `<action>` parameter can take the following values:
<action> | Description |
---|---|
'throw' | an error is thrown (an exception is raised), and processing will stop. |
'warn' | a message is displayed (and logged) and processing continues. |
'log' | a message is written to a log(file) and processing continues. |
'ignore' | processing continues as if nothing happened. |
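The four actions in the table above can be read as a small dispatch. The following is a minimal Python sketch of that reading, not the MRGT's own code; the function name `handle_missing_vsntag`, the logger name and the message text are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger("mrgt-sketch")  # hypothetical logger name

VALID_ACTIONS = ("throw", "warn", "log", "ignore")


def handle_missing_vsntag(vsntag: str, action: str) -> bool:
    """Apply an onNotExist action; return True if processing may continue."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown onNotExist action: {action!r}")
    message = f"vsntag '{vsntag}' was not found in the SAF"  # assumed wording
    if action == "throw":
        raise LookupError(message)    # an exception is raised, processing stops
    if action == "warn":
        print(f"WARNING: {message}")  # displayed ...
        logger.warning(message)       # ... and logged
    elif action == "log":
        logger.info(message)          # written to the log only
    # "ignore": nothing is reported
    return True
```

Only `throw` interrupts the run; the other three differ solely in how visible the message is.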
Some parameters can only be set through a configuration file. An example is `macros`, which is used in form phrases. See the configuration file page for details.
Running the Tool
One run of the MRGT either
- generates an MRG for one specific terminology version within the current scope (which is the case when the `vsntag` parameter was specified), or
- generates multiple MRGs, i.e., one for every version of the terminology that is curated within the current scope (which is the case when the `vsntag` parameter is omitted).
Running the tool comprises the following phases (see note 1 under Notes):
- Constructing a provisional MRG;
- Post-processing the entries in that provisional MRG;
- Creating/overwriting MRG file(s) in the glossarydir of the current scope.
Phase 1: constructing a provisional MRG
Generating an MRG for a particular version of a terminology starts by reading the SAF of the scope within which that terminology is curated, which exists in the scopedir that was provided as one of the calling parameters. If a `vsntag` argument is provided, the MRGT searches the versions section of the SAF to find the corresponding entry. This corresponding entry has the value of the `vsntag` parameter either in its `vsntag` field, or as one of the elements of its `altvsntags` field. If the SAF does not have a corresponding entry, the action specified in the `onNotExist` parameter determines whether or not (and how) to proceed.
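The lookup described above can be sketched as follows. This is an illustration only: it assumes the SAF has already been parsed from YAML into a Python dict, and the function name `find_version_entry` and the sample version tags are hypothetical.

```python
from typing import Optional


def find_version_entry(saf: dict, vsntag: str) -> Optional[dict]:
    """Return the entry in the SAF's versions section that matches vsntag, or None."""
    for entry in saf.get("versions", []):
        if entry.get("vsntag") == vsntag:
            return entry  # matched on the primary versiontag
        if vsntag in (entry.get("altvsntags") or []):
            return entry  # matched on one of the alternative versiontags
    return None


# Hypothetical SAF content, reduced to the fields used by the lookup.
saf = {
    "versions": [
        {"vsntag": "v1.0", "altvsntags": ["latest", "stable"]},
        {"vsntag": "v0.9"},
    ]
}

entry = find_version_entry(saf, "stable")
```

A result of `None` is the situation in which the `onNotExist` action would come into play.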
In this phase, for every terminology version that is to be created, one provisional MRG is created, that contains a provisional MRG entry for every term contained in the particular version of the terminology. This provisional MRG entry either contains:
- all fields in the header of the curated text that documents its term, or
- all fields in the MRG entry that comes from another MRG (typically, but not necessarily, from another scope).
Creating a provisional MRG
Creating a provisional MRG starts with an empty set of MRG entries - we use the term "provisional MRG" to refer to this set.
Then, the list of term selection instructions as specified in the appropriate entry of the `versions` section of the SAF is processed, one instruction at a time, in the order in which they are specified. An instruction can be for:
- adding MRG entries to the provisional MRG; these can either be entries that have been created from curated texts, or entries whose contents are obtained from an MRG other than the one that is being created (see note 1 under Notes);
- removing MRG entries from the provisional MRG;
- modifying attributes of a specific MRG entry in the provisional MRG, e.g. for renaming a term that originated from another scope.
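To illustrate why the order of instructions matters, here is a minimal Python sketch of the add/remove/modify loop. The tuple representation of an instruction is purely hypothetical (the actual instruction syntax is defined elsewhere in the TEv2 documentation), as are the function name and the sample terms.

```python
def apply_instructions(instructions: list) -> list:
    """Process (hypothetical) term selection instructions in the given order."""
    provisional = []  # the provisional MRG starts as an empty set of entries
    for kind, *args in instructions:
        if kind == "add":
            (entries,) = args
            provisional.extend(dict(e) for e in entries)  # copy each entry
        elif kind == "remove":
            (term,) = args
            provisional = [e for e in provisional if e["term"] != term]
        elif kind == "modify":
            term, changes = args
            for e in provisional:
                if e["term"] == term:
                    e.update(changes)  # e.g. rename a term from another scope
        else:
            raise ValueError(f"unknown instruction kind: {kind!r}")
    return provisional


result = apply_instructions([
    ("add", [{"term": "party"}, {"term": "actor"}]),
    ("add", [{"term": "agent", "scopetag": "other"}]),
    ("remove", "actor"),
    ("modify", "agent", {"term": "digital-agent"}),
])
```

Because each instruction works on the result of the previous ones, swapping the `remove` and the first `add` above would leave `actor` in the provisional MRG.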
Processing FormPhrases
Form phrases that are specified in a curated text may include uppercase characters, special characters, spaces etc., all of which make their use by tools cumbersome. In order to make it easier for TEv2 tools to use them, they need to be converted into regularized form phrases.
Converting the set of form phrases (as specified in the `formPhrases` field from a curated text) into regularized form phrases (for storage in an MRG entry) is done as follows:
1. every form phrase (in the set of form phrases) that contains a form phrase macro is replaced with one or more form phrases that are the result of processing that macro; see Form Phrase Macro Expansion for the details and examples.
2. as a single form phrase may contain multiple macros, step 1 must be repeated until all macros are processed and the set of form phrases no longer contains any macro.
3. all form phrases in the resulting set are now regularized, i.e., turned into regularized form phrases.
4. a regularized form phrase is added, the value of which is the same as the value of the `term` field of the curated text. Thus, tools that work with form phrases from MRG entries can find all forms, including that of the term itself, as an element in the `formPhrases` field of the MRG entry.
5. finally, the resulting set of regularized form phrases is pruned, such that every regularized form phrase appears only once in the end result.
6. this end result is then written into the `formPhrases` field of the MRG entry.
An MRG SHOULD NOT have two (or more) MRG entries that have the same element in their `formPhrases` field, because that would mean that the form phrase is ambiguous, as it refers to two different semantic units.
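The conversion steps above can be sketched in Python as follows. Two things in this sketch are assumptions rather than facts from this page: the macro table `MACROS` (the real macros are described under Form Phrase Macro Expansion and in the configuration file), and the regularization rule, which is taken here to mean lowercasing and replacing every run of other characters with a hyphen.

```python
import re

# Hypothetical macro table: each macro maps to its alternative expansions.
MACROS = {"{ss}": ["", "s"], "{yies}": ["y", "ies"]}


def regularize(text: str) -> str:
    """Assumed rule: lowercase; runs of non-alphanumerics become one '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def expand_macros(phrases: list) -> list:
    """Expand macros repeatedly until no phrase contains one."""
    pending, done = list(phrases), []
    while pending:
        phrase = pending.pop(0)
        macro = next((m for m in MACROS if m in phrase), None)
        if macro is None:
            done.append(phrase)
        else:
            # Replace one occurrence; the results are examined again later.
            pending.extend(phrase.replace(macro, alt, 1) for alt in MACROS[macro])
    return done


def build_form_phrases(term: str, form_phrases: list) -> list:
    regularized = [regularize(p) for p in expand_macros(form_phrases)]
    regularized.append(regularize(term))     # the term itself is also a form
    return list(dict.fromkeys(regularized))  # prune duplicates, keep order


forms = build_form_phrases("party", ["Part{yies}", "Legal Part{yies}"])
# forms == ["party", "parties", "legal-party", "legal-parties"]
```

The term `party` is produced twice (once by macro expansion, once from the `term` field), and the final pruning step keeps only one copy.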
Storing a provisional MRG in the glossarydir
When the creation of a provisional MRG is complete, a filename `mrg.<scopetag>.<vsntag>.yaml` is constructed, where:
- `<scopetag>` is the scopetag that is used within the current scope to refer to itself. Its value can be found in the `scopetag` field in the `scope` section of the SAF.
- `<vsntag>` is the versiontag that identifies the version of the terminology for which the MRG contains entries. Its value must be equal to that found in the `vsntag` field of the element in the versions section of the SAF from which the MRG was generated.
If a file with that name already exists in the glossarydir of the current scope, it will be deleted. Then, a new file with that name will be created, which will contain:
- a `terminology` section, the contents of which is obtained by copying relevant fields from the `terminology` section in the SAF;
- a `scopes` section, the contents of which is obtained by copying relevant fields from the `scopes` section in the SAF;
- an `entries` section, the contents of which consists of the provisional MRG entries of the provisional MRG.
Then, if the `<vsntag>` part of the filename equals the value of the `defaultvsn` field in the `scope` section of the SAF, a copy of that file is created in the glossarydir whose filename is `mrg.<scopetag>.yaml`, which is the name by which the default MRG of the current scope is referred to.
Next, the MRGT will create a copy of the MRG file for every versiontag that exists in the `altvsntags` field of the element in the versions section of the SAF from which the MRG was generated. The copy will contain the same MRG as the file that has just been written. The name of this copied file is `mrg.<scopetag>.<altvsntag>.yaml`, which is the same name as the MRG file, except that the `<vsntag>` part of that filename is replaced with the value of the versiontag found in the `altvsntags` field.
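As an illustration of the naming rules above, the following Python sketch lists the filenames that one version entry would give rise to. It assumes the SAF is available as a parsed dict; the function name `mrg_filenames` and the sample scopetag `demo` are hypothetical.

```python
def mrg_filenames(saf: dict, version_entry: dict) -> list:
    """List the MRG filenames to be written for one entry of the versions section."""
    scopetag = saf["scope"]["scopetag"]
    vsntag = version_entry["vsntag"]
    names = [f"mrg.{scopetag}.{vsntag}.yaml"]       # the MRG file itself
    if vsntag == saf["scope"].get("defaultvsn"):
        names.append(f"mrg.{scopetag}.yaml")        # copy for the default MRG
    for alt in version_entry.get("altvsntags") or []:
        names.append(f"mrg.{scopetag}.{alt}.yaml")  # one copy per altvsntag
    return names


# Hypothetical SAF fragment and version entry.
saf = {"scope": {"scopetag": "demo", "defaultvsn": "v1.0"}}
version = {"vsntag": "v1.0", "altvsntags": ["latest"]}

names = mrg_filenames(saf, version)
# names == ["mrg.demo.v1.0.yaml", "mrg.demo.yaml", "mrg.demo.latest.yaml"]
```

All files in the returned list hold the same MRG contents; only the names differ.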
Phase 2: post processing Synonyms
We may want to deprecate the use of Synonyms as they have been specified now, because it is a complex thing, and most often, its uses can also be realized in a different way (particularly since we can now generate HRGs using multiple converters).
This phase starts only after all provisional MRGs that the MRGT was instructed to build in this run have been created, and the corresponding files have been added to the glossarydir of the current scope. This allows post processing, e.g. of synonyms, to use the newly generated provisional MRG entries.
When a provisional MRG entry in (one of) the created provisional MRGs has a `synonymOf` field that contains a term identifier, this will now refer to either
- an MRG entry in one of the MRGs that already existed, or
- a provisional MRG entry in a provisional MRG that has just been created.
This (possibly provisional) MRG entry is then copied, after which all fields in the provisional MRG entry that contained the term identifier are added thereto, overwriting any already existing fields, or adding fields that did not yet exist. Then, the resulting data is used to replace the provisional MRG entry that contained the term identifier.
Effectively, this means that whenever a term is defined as a synonym of some other term, the corresponding MRG entry will have all fields of this other term, except for those that were specified in the header of the term that is defined as a synonym of that other term.
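The copy-then-overwrite behaviour can be sketched in a few lines of Python. This is a simplified illustration: it assumes that the term identifier in `synonymOf` can be used directly as a key into a dict of already resolved entries, which glosses over how term identifiers are actually resolved across scopes and versions. The function name and sample entries are hypothetical.

```python
import copy


def resolve_synonym(synonym_entry: dict, entries_by_id: dict) -> dict:
    """Copy the entry referred to, then overlay the synonym's own fields."""
    target = entries_by_id[synonym_entry["synonymOf"]]
    merged = copy.deepcopy(target)              # copy of the referenced entry
    merged.update(copy.deepcopy(synonym_entry)) # overwrite or add fields
    return merged


entries_by_id = {
    "concept:party": {
        "term": "party",
        "termType": "concept",
        "glossaryText": "An entity that sets its own objectives.",
    }
}
synonym = {"term": "actor-party", "synonymOf": "concept:party"}

merged = resolve_synonym(synonym, entries_by_id)
```

In this example, `merged` keeps the `glossaryText` and `termType` of the referenced entry, while `term` and `synonymOf` come from the synonym's own header.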
Phase 3: post processing other fields
Now, all provisional MRG entries in all provisional MRGs are processed so as to become useable from the context within which they have been selected. All fields that are required for regular MRG entries will be processed, as specified in the following table:
Field | Value(s) that are assigned to the fields |
---|---|
`scopetag` | ensure the contents of this field has the value of the `scopetag` field as found in the `scope` section of the SAF. |
`vsntag` | ensure the contents of this field contains the versiontag that identifies the version of the terminology from which the contents of the MRG entry is obtained. If the contents of the MRG entry was constructed from a curated text, its value equals the value of the `vsntag` field in the `terminology` section of the MRG that this MRG entry is a part of. As a result, `scopetag`:`versiontag` identifies the terminology from which this MRG entry has originated. |
`locator` | ensure that the contents of this field is the path, relative to `scopedir`/`curatedir`/, of the file that contains the (header of) the curated text. |
`navurl` | ensure that the contents of this field is the (localized) path to which browsers navigate in order to see the rendered version of the curated text. |
`termType` | ensure that the contents of this field exists, and that it is regularized; if it does not, it must be created and its value shall be the same as the value of the `defaulttype` field in the `scope` section in the SAF, or, if that doesn't exist, its value should be `concept`. |
`term` | ensure that the contents of this field exists, and that it is regularized. An exception must be raised if this field does not exist. |
`termid` | ensure that the value of this field is `<termType>:<term>`, where `<termType>` and `<term>` are the values of the corresponding fields in this MRG entry. There MUST NOT be another MRG entry within the MRG that has a `termid` field with the same value. |
`formPhrases` | ensure that this field contains an array of regularized form phrases, and that one of its elements has the same value as the `term` field. |
`headingids` | ensure that the contents of this field is a list of the markdown headings and/or heading ids that are found in the body of the curated text. Note that this body can be either in the curated text file or in a separate body file. This is explained in more detail in a subsection below. |
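A few of the rows in this table (`scopetag`, `termType`, `term`, `termid` and `formPhrases`) lend themselves to a compact Python sketch. As before, the regularization rule used here (lowercase, hyphens for other characters) is an assumption, and `finalize_entry` is a hypothetical name; the remaining fields are left out to keep the example short.

```python
import re


def regularize(text: str) -> str:
    """Assumed rule: lowercase; runs of non-alphanumerics become one '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def finalize_entry(entry: dict, saf: dict) -> dict:
    scope = saf["scope"]
    result = dict(entry)
    if not result.get("term"):
        raise ValueError("MRG entry has no term field")  # term is mandatory
    result["term"] = regularize(result["term"])
    # termType falls back to the scope's defaulttype, then to 'concept'.
    term_type = result.get("termType") or scope.get("defaulttype") or "concept"
    result["termType"] = regularize(term_type)
    result["termid"] = f"{result['termType']}:{result['term']}"
    result["scopetag"] = scope["scopetag"]
    # formPhrases must be regularized and must include the term itself.
    forms = [regularize(p) for p in result.get("formPhrases", [])]
    if result["term"] not in forms:
        forms.append(result["term"])
    result["formPhrases"] = forms
    return result


entry = finalize_entry(
    {"term": "Legal Entity", "formPhrases": ["Legal Entities"]},
    {"scope": {"scopetag": "demo"}},  # hypothetical SAF fragment
)
# entry["termid"] == "concept:legal-entity"
```

The uniqueness requirement on `termid` cannot be checked per entry; it needs a pass over the whole MRG.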
The following sections elaborate on the construction of (the contents) of some of these fields.
Constructing the `navurl` field
The `navurl` field is constructed by concatenating `website`/`navpath`/`curatedir`/`id`, where
- `website`, `navpath` and `curatedir` are given by the contents of the respective fields in the `scope` section of the SAF. However, if the `bodyFile` field in the header of the curated text file is set, the path to the body file is used instead of the `navpath` and `curatedir`, so `navurl` will then be `website`/`bodyFile`.
- The `id` part is one of the following:
  - if the `scope` section of the SAF contains the field `navid`, then its contents specify the name of the field in the header of the curated text or body file that will be used to create the `id` part. Thus, static site generators such as Docusaurus, which use the `id` field to specify this value, can be accommodated.
  - if the SAF does not specify the `navid` field, or the `navid` field in a curated text or body file is not set, then `id` will be based on the name of the curated text file or the name of the body file.
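One way to read these rules is sketched below in Python. Several points are assumptions: that "based on the name of the file" means the filename without its extension, that the `bodyFile` case yields exactly `website`/`bodyFile` with nothing appended, and that the SAF `scope` section and the curated text header are available as dicts. The function name and the example site are hypothetical.

```python
from pathlib import PurePosixPath


def build_navurl(scope: dict, header: dict, text_file: str) -> str:
    """scope: SAF scope section; header: curated text header; text_file: its path."""
    website = scope["website"].rstrip("/")
    body_file = header.get("bodyFile")
    if body_file:
        # The body file path replaces navpath and curatedir.
        return f"{website}/{body_file.lstrip('/')}"
    navid_field = scope.get("navid")  # names the header field that holds the id
    ident = header.get(navid_field) if navid_field else None
    if not ident:
        ident = PurePosixPath(text_file).stem  # assumed: filename sans extension
    parts = [website, scope["navpath"].strip("/"), scope["curatedir"].strip("/"), ident]
    return "/".join(p for p in parts if p)


scope = {
    "website": "https://example.org",
    "navpath": "/docs",
    "curatedir": "terms",
    "navid": "id",
}
url = build_navurl(scope, {"id": "party"}, "terms/party-text.md")
# url == "https://example.org/docs/terms/party"
```

With `navid` set to `id`, the Docusaurus-style `id` field in the header takes precedence over the filename.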
Constructing the `headingids` field
The `headingids` field is constructed by finding all markdown headings in the body file (or the curated text file if there is no separate body file), and making a list out of them.
Example of Markdown Headers and their `headingids` fields
- Default Markdown Headers
  Markdown headings are only recognized when they are preceded with number signs (#) at the beginning of a line.
  Here is an example of a markdown header:
  `## This is a Markdown Header`
  This header will result in the text `this-is-a-markdown-header` being added as an element in the `headingids` field.
- Custom Heading IDs
  A markdown heading may also contain a (custom) heading id that allows you to link directly to headings and modify them with CSS.
  Here is an example of a markdown header with a custom heading id:
  `# This is a Markdown Header {#custom-id}`
  This header will result in the text `custom-id` being added as an element in the `headingids` field.
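Both examples above can be reproduced with a short Python sketch. The slug rule used for headings without a custom id (lowercase, drop punctuation, hyphens for whitespace) is an assumption that happens to match the example; the MRGT's exact rule may differ, and this sketch does not skip headings that appear inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*?)\s*$")    # '#' signs at the start of a line
CUSTOM_ID = re.compile(r"\{#([^}\s]+)\}\s*$")   # trailing {#custom-id}


def heading_ids(markdown: str) -> list:
    """Collect one heading id per markdown heading, in document order."""
    ids = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if not match:
            continue
        text = match.group(1)
        custom = CUSTOM_ID.search(text)
        if custom:
            ids.append(custom.group(1))  # a custom id is used as is
        else:
            slug = re.sub(r"[^a-z0-9\s-]", "", text.lower())
            ids.append(re.sub(r"[\s-]+", "-", slug).strip("-"))
    return ids


body = """## This is a Markdown Header
Some body text, not a heading.
# This is a Markdown Header {#custom-id}
"""
ids = heading_ids(body)
# ids == ["this-is-a-markdown-header", "custom-id"]
```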
Phase 4: checking the result
The last step consists of checking crucial properties that MRGs are relied on to have, and raising appropriate exceptions in case something is wrong. This helps curators that check the log outputs to become aware of things they may need to fix before these MRGs are further used (or published).
In this step, the following checks are done (as a minimum):
- The value of the `termid` field in one MRG entry differs from the value of the `termid` field of all other MRG entries. This ensures that `termid` contains a unique identifier (primary key) within the context of the MRG.
- When a regularized form phrase is an element of the `formPhrases` field of an MRG entry, there MUST NOT be another MRG entry in the same MRG that has this regularized form phrase in its `formPhrases` field.
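The two checks above amount to looking for duplicates across all entries of one MRG. The Python sketch below collects the problems in a list rather than raising on the first one, which is a design choice made for this illustration so that a curator would see all issues at once; the function name `check_mrg` and the message wording are hypothetical.

```python
from collections import defaultdict


def check_mrg(entries: list) -> list:
    """Return a description of every duplicate termid or form phrase."""
    by_termid = defaultdict(list)
    by_phrase = defaultdict(set)
    for index, entry in enumerate(entries):
        by_termid[entry["termid"]].append(index)
        for phrase in entry.get("formPhrases", []):
            by_phrase[phrase].add(index)  # a set: repeats within one entry are fine
    problems = []
    for termid, indexes in by_termid.items():
        if len(indexes) > 1:
            problems.append(f"termid '{termid}' occurs in {len(indexes)} entries")
    for phrase, indexes in by_phrase.items():
        if len(indexes) > 1:
            problems.append(f"form phrase '{phrase}' occurs in {len(indexes)} entries")
    return problems


problems = check_mrg([
    {"termid": "concept:party", "formPhrases": ["party", "parties"]},
    {"termid": "concept:actor", "formPhrases": ["actor", "party"]},
])
# problems == ["form phrase 'party' occurs in 2 entries"]
```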
Exceptions, Warnings, and Logging
This section needs to be reviewed/revised so as to enable a consistent way of error checking and logging, similar to what is done in the TRRT.
The general principle is that the MRGT helps its users to do their jobs. This means that errors that terminate the processing are kept to a minimum, that warnings (perhaps at different 'levels' of detail/severity) are output whenever possible (yet may be limited by command-line options), and that texts are tailored for the envisaged users of the tool.
The MRGT logs conditions that prevent it from properly:
- obtaining the scopedir from a scopetag;
- parsing a curated text (e.g. because it is not in the expected format);
- resolving terms, scope tags, group tags, or version tags;
- writing the output (e.g. because it has no write-permission for the designated location);
- etc.;
Also, the MRGT provides suggestions that help tool-operators (curators) to not only identify, but also fix any problems.
The MRGT comes with documentation that enables developers to ascertain its correct functioning (e.g. by using a test set of files, test scripts that exercise its parameters, etc.), and also enables them to deploy the tool in a git repo and author/modify CI-pipes to use that deployment.
Notes
1. The MRGT MUST NOT start by overwriting files that contain an MRG, as they should remain available as a (possible) source for copying MRG entries from during the construction of one or more provisional MRGs. Writing the actual files should be done after all provisional MRGs have been constructed.