Machine Readable Glossary Generation Tool
Documentation needs to be adjusted for:
- Converting formPhrases: MRGT will write expanded formPhrase macros into the MRGEntry `formPhrases` field
The Machine Readable Glossary generation Tool (MRGT) generates Machine Readable Glossaries (MRGs) for either one specific terminology version or all terminology versions that are curated within a specific scope. MRGs come in a specific, well-defined format. They contain some metadata, followed by a list of so-called MRG entries, one for every term in the scope. These entries represent the concepts and other semantic units that are known within that scope.
The (newly generated) MRGs are meant to be processed by the other tools in the toolbox, regardless of whether such tools are called from within the context of another scope. Because an MRG contains every term that is used in the scope and includes all the relevant metadata, it serves as the single, authoritative source of that (version of the) scope's terminology.
Installing the Tool
The tool can be installed from the command line and made globally available by executing
npm install -g @tno-terminology-design/mrgt
Before running the tool from the command line, make sure you have met the necessary prerequisites depending on your operating environment.
CMD.exe (Windows)
- Node.js and NPM: Ensure Node.js and NPM are installed.
- Global Installation: If you have installed the package globally, confirm the global NPM modules path by running `npm config get prefix`. The global modules are usually stored under `<prefix>/node_modules`.
- Environment Variables: Add the path to global NPM binaries to your system's PATH environment variable. This should be `<prefix>` on Windows. To add to PATH, you can edit your environment variables or run `set PATH=%PATH%;<prefix>` in the CMD.

PowerShell (Windows)
- Node.js and NPM: Ensure Node.js and NPM are installed.
- Global Installation: Check the global NPM modules path as in CMD.
- Environment Variables: Update the PATH environment variable as in CMD. You can also use `$env:Path += ";<prefix>"` to update the PATH temporarily in the current PowerShell session.

Bash (Linux/Mac)
- Node.js and NPM: Ensure Node.js and NPM are installed.
- Global Installation: If globally installed, run `npm config get prefix` to get the global modules path, usually `<prefix>/lib/node_modules`.
- Environment Variables: Add the `<prefix>/bin` directory to your `PATH` if it's not already there. You can do this by adding `export PATH=$PATH:<prefix>/bin` to your `~/.bashrc` or `~/.zshrc` file.
Calling the Tool
The behavior of the MRGT can be configured per call e.g. by a configuration file and/or command-line parameters. The command-line syntax is as follows:
mrgt [ <paramlist> ]
where `<paramlist>` is an (optional) list of parameters.
Legend
The columns in the following table are defined as follows:
- `Parameter` specifies the parameter and further specifications.
- `Req'd` specifies whether (`Y`) or not (`n`) the field is required to be present when the tool is being called. If required, it MUST be present either in the configuration file or as a command-line parameter.
- `Description` specifies the meaning of the parameter's value, and other things you may need to know, e.g. why it is needed, a required syntax, etc.

If a configuration file is used, the long version of the parameter must be used (without the preceding `--`).
Parameter | Req'd | Description |
---|---|---|
`-h`, `--help` | n | Display help for the command. |
`-c`, `--config <path>` | n | Path (including the filename) of the tool's (YAML) configuration file. |
`-s`, `--scopedir <path>` | Y | Path of the scope directory where the SAF is located. |
`-v`, `--vsntag <vsntag>` | n | Version tag for which the MRG needs to be (re)generated. If omitted, MRGs for all versions will be generated. |
`-V`, `--version` | n | Output the version number of the tool. |
`-o`, `--outputdir <path>` | n | Directory where the generated MRG files will be stored. Defaults to the glossarydir of the scopedir if not provided. |
`--altvsntag <vsntag>` | n | Create additional MRGs with an alternative version tag (overrides settings in the SAF). |
`-e`, `--onNotExist <action>` | n | The action to take if a specified MRG file does not exist. Possible values are `throw`, `warn`, `log`, and `ignore`. |
`-d`, `--debug` | n | Enable debug mode to provide more detailed output and logging for troubleshooting purposes. |
Debug Levels
The `-d` (`--debug`) option may not yet work as specified.
Debug Level | Description |
---|---|
info | General informational output about the tool's operation, such as high-level actions. (Default) |
warn | Shows warning messages indicating potential issues or non-critical problems. |
debug | Provides detailed output, including internal variables, stack traces, and low-level function calls. |
error | Displays error messages for critical problems that prevent the tool from running correctly. |
trace | The most verbose output, including trace-level logs for in-depth debugging and step-by-step details. |
`-e`, `--onNotExist` Actions
<action> | Description |
---|---|
'throw' | an error is thrown (an exception is raised), and processing will stop. |
'warn' | a message is displayed (and logged) and processing continues. |
'log' | a message is written to a log(file) and processing continues. |
'ignore' | processing continues as if nothing happened. |
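The four actions in the table above can be pictured as a small dispatch function. The following is a minimal sketch, not the MRGT's actual implementation: the names `OnNotExist`, `Reporter`, and `handleNotExist` are hypothetical, and the reporter is injected so that the sketch does not assume any particular logging library.

```typescript
// Hypothetical names: OnNotExist, Reporter and handleNotExist are illustrative
// and are not taken from the MRGT source code.
type OnNotExist = "throw" | "warn" | "log" | "ignore";

interface Reporter {
  warn(message: string): void; // displayed to the user (and logged)
  log(message: string): void; // written to the log only
}

function handleNotExist(action: OnNotExist, message: string, reporter: Reporter): void {
  switch (action) {
    case "throw":
      // An error is thrown and processing stops.
      throw new Error(message);
    case "warn":
      // A message is displayed and processing continues.
      reporter.warn(message);
      return;
    case "log":
      // A message is logged and processing continues.
      reporter.log(message);
      return;
    case "ignore":
      // Processing continues as if nothing happened.
      return;
  }
}
```

Because `OnNotExist` is a closed union of string literals, the compiler can flag a call site that passes an action that is not in the table.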
Some parameters can only be set through a configuration file. The `macros` parameter, for use in form phrases, is an example. See the configuration file page for details.
Running the Tool
One run of the MRGT either
- generates an MRG for one specific terminology version within the current scope (which is the case when the `vsntag` parameter is specified), or
- generates multiple MRGs, i.e., one for every version of the terminology that is curated within the current scope (which is the case when the `vsntag` parameter is omitted).
Running the tool comprises the following phases [1]:
- Constructing a provisional MRG;
- Post-processing the entries in that provisional MRG;
- Creating/overwriting MRG file(s) in the glossarydir of the current scope.
Phase 1: Constructing a provisional MRG
During Phase 1, the MRGT constructs a provisional MRG for each specified version of a terminology. This step involves reading the Scope Administration File (SAF) and gathering all relevant entries to form an initial, provisional MRG. The goal is to prepare an intermediate representation of the MRG that will be refined and finalized in subsequent phases.
Step-by-Step Process
Reading the SAF:
- The tool begins by reading the SAF file from the specified `--scopedir` directory. The SAF contains metadata and configuration details about the scope, terminology versions, and their corresponding tags (`vsntag` and `altvsntags`).
- If a `--vsntag` parameter is provided, the tool looks for the corresponding version in the `versions` section of the SAF and extracts relevant information, such as the `vsntag`, `altvsntags`, and the list of term selection instructions.
Processing the `vsntag` Argument:
- If the `vsntag` argument is provided on the command line, the tool searches for the entry in the SAF's `versions` section whose `vsntag` matches, or that has a matching element in its `altvsntags` field.
- If the `vsntag` is not found, the action specified by the `--onNotExist` parameter (`throw`, `warn`, `log`, or `ignore`) determines how the tool handles this situation.
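The version lookup described above could look roughly as follows. This is a sketch under assumptions: `SafVersion` is a simplified, hypothetical shape for an entry in the SAF's `versions` section, and `selectVersions` is an illustrative name rather than a function of the MRGT.

```typescript
// Assumed, simplified shape of one entry in the SAF `versions` section.
interface SafVersion {
  vsntag: string;
  altvsntags?: string[];
}

// Returns the versions for which an MRG is to be generated:
// every version when no vsntag is given, otherwise the single match (if any).
function selectVersions(versions: SafVersion[], vsntag?: string): SafVersion[] {
  if (vsntag === undefined) return versions;
  const match = versions.find(
    (v) => v.vsntag === vsntag || (v.altvsntags ?? []).includes(vsntag)
  );
  return match ? [match] : [];
}
```

An empty result for a given `vsntag` is the situation in which the `--onNotExist` action would apply.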
Determining the MRG Entries Using Term Selection Instructions:
- The tool processes the list of term selection instructions found in the SAF's `versions` section for the corresponding `vsntag` (or `altvsntag`, as may be the case).
- These instructions specify which entries are added to, removed from, or modified in the provisional MRG.
- Term selection instructions include:
  - Adding Entries: specify the source from which a new provisional MRG entry is to be created (see the next step). This can be either a particular curated text or a particular MRG entry from an MRG that already exists.
  - Removing Entries: specify which MRG entries in the provisional MRG are to be removed from it.
  - Modifying Attributes: specify which MRG entries in the provisional MRG are to have fields modified, and which fields are to be modified and how. This allows, e.g., for renaming terms or adjusting other metadata fields.
Creating a Provisional MRG for Each Version:
- For every version of the terminology that is to be generated (based on the presence or absence of `vsntag`), the tool creates a provisional MRG. This provisional MRG is essentially a collection of provisional MRG entries.
- A provisional MRG entry is created as the result of a term selection instruction that specifies its source. This can be:
  - A curated text (that documents a term). The provisional MRG entry will then consist of all fields from the header of the curated text.
  - An MRG entry from an existing MRG (often, but not necessarily, from a different scope). The provisional MRG entry will then consist of all fields from that MRG entry.
- NOTE: Two (or more) MRG entries cannot have the same value in their `termid` fields. Therefore, if an MRG entry is added whose `termid` value exists in an MRG entry that is already in the provisional MRG, the entry that was already there is discarded, after which the new entry is added.
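To make the interplay of adding, removing, and modifying entries concrete, here is a minimal sketch. It does not use the SAF's actual term selection syntax: the `Instruction` type is a hypothetical, already-parsed representation, and `MrgEntry` is reduced to a `termid` plus arbitrary other fields. The sketch keeps the `termid` fixed when modifying an entry, which is a simplification.

```typescript
// Hypothetical, simplified types; real MRG entries carry many more fields.
type MrgEntry = { termid: string; [field: string]: unknown };

type Instruction =
  | { kind: "add"; entries: MrgEntry[] } // entries taken from curated texts or existing MRGs
  | { kind: "remove"; termids: string[] }
  | { kind: "modify"; termids: string[]; changes: Record<string, unknown> };

function applyInstructions(instructions: Instruction[]): MrgEntry[] {
  // Keyed by termid, so two entries can never share a termid.
  const provisional = new Map<string, MrgEntry>();
  for (const ins of instructions) {
    if (ins.kind === "add") {
      for (const entry of ins.entries) {
        provisional.delete(entry.termid); // discard the entry that was already there
        provisional.set(entry.termid, entry); // then add the new one
      }
    } else if (ins.kind === "remove") {
      for (const id of ins.termids) provisional.delete(id);
    } else {
      for (const id of ins.termids) {
        const current = provisional.get(id);
        if (!current) continue;
        // Simplification: the termid itself is kept as-is in this sketch.
        const updated: MrgEntry = { ...current, ...ins.changes, termid: current.termid };
        provisional.set(id, updated);
      }
    }
  }
  return Array.from(provisional.values());
}
```

Keying the collection by `termid` makes the "later entry replaces earlier entry" rule a property of the data structure rather than a separate check.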
Normalizing MRG Entries: After adding entries to the provisional MRG, each entry is normalized, which means that various fields are modified to ensure consistency and standardization when they are further processed. Normalization consists of:
- Regularization of fields that are meant to be processed by tools. They include `term`, `termType`, and `formPhrases`.
- Expansion of Form Phrase Macros, which consists of replacing such macros with their expanded equivalents, resulting in multiple possible alternatives. The tool recursively processes the form phrases until every macro is expanded. This results in a list of regularized form phrases that replaces the original list of `formPhrases`.
An MRG SHOULD NOT have two (or more) MRG entries that have the same element in their `formPhrases` field, because that would mean that the form phrase is ambiguous, as it refers to two different semantic units.
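The recursive macro expansion described above can be sketched as follows. The macro table is passed in as a parameter because the actual macros come from the tool's configuration; the macro names and expansions used in the usage note are purely illustrative. The sketch also assumes that an expansion never contains a macro itself, since otherwise the recursion would not terminate.

```typescript
// Hypothetical macro table: macro text -> its alternative expansions.
type Macros = Record<string, string[]>;

// Replaces the first macro found by each of its alternatives and recurses
// until no macro is left. Returns every resulting alternative, without duplicates.
// Assumption: expansions do not themselves contain macros.
function expandFormPhrase(phrase: string, macros: Macros): string[] {
  for (const macro of Object.keys(macros)) {
    const at = phrase.indexOf(macro);
    if (at === -1) continue;
    const before = phrase.slice(0, at);
    const after = phrase.slice(at + macro.length);
    const results = new Set<string>();
    for (const alternative of macros[macro] ?? []) {
      for (const expanded of expandFormPhrase(before + alternative + after, macros)) {
        results.add(expanded);
      }
    }
    return Array.from(results);
  }
  return [phrase]; // no macros left
}
```

With an illustrative table such as `{ "{yies}": ["y", "ies"] }`, the form phrase `part{yies}` expands to `party` and `parties`, and a phrase without macros is returned unchanged.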
Resulting Provisional MRG:
- The output of Phase 1 is a provisional MRG for each version specified. These provisional MRGs serve as intermediate representations that will be refined, validated, and finalized in subsequent phases.
Phase 2: Synonym Processing
In Phase 2, the MRGT processes synonyms in the provisional MRG to ensure that terms defined as synonyms of other terms are correctly handled and represented. This phase can only begin after all provisional MRGs have been fully constructed and stored in the glossarydir of the current scope during Phase 1, because only then are all (provisional) MRG entries available to which `synonymOf` fields refer.
Step-by-Step Process
Identifying Synonyms:
- The tool searches through all provisional MRG entries in each provisional MRG and identifies those that have a `synonymOf` field containing a term identifier.
- The `synonymOf` field indicates that the term in this MRG entry is a synonym of another term, and its entry should be derived from that term's entry.
Locating the Original MRG Entry:
- For each provisional MRG entry with a `synonymOf` field, the tool locates the original MRG entry that it refers to. This entry could be:
  - An MRG entry in one of the existing MRGs.
  - A provisional MRG entry in the current provisional MRG that was just created.
Copying and Merging Fields:
- Once the original MRG entry is located, its data is copied into the provisional MRG entry that has the `synonymOf` field, but
- any fields already present in the provisional MRG entry that contained the `synonymOf` reference will overwrite the corresponding fields copied from the original MRG entry.
- This ensures that the resulting MRG entry for the synonym has all the fields of the original term it is synonymous with, except for the fields explicitly defined in its own entry.
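The copy-then-override rule above amounts to a field-wise merge in which the synonym's own fields win. A minimal sketch follows, assuming that `synonymOf` holds a plain `termid` and that a hypothetical `lookup` function finds the original entry in either an existing MRG or the current provisional MRG.

```typescript
// Simplified, hypothetical entry type for illustration.
type MrgEntry = { termid: string; synonymOf?: string; [field: string]: unknown };

function resolveSynonym(
  synonym: MrgEntry,
  lookup: (termid: string) => MrgEntry | undefined
): MrgEntry {
  if (!synonym.synonymOf) return synonym; // not a synonym, nothing to do
  const original = lookup(synonym.synonymOf);
  if (!original) return synonym; // unresolved reference: left unchanged in this sketch
  // Fields of the original are copied first; fields that the synonym
  // defines itself overwrite the copied ones.
  return { ...original, ...synonym };
}
```

The order of the two spreads is what implements the rule: whatever appears later in the object literal takes precedence.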
Ensuring Consistency and Avoiding Ambiguity:
- The tool checks to ensure that no two (or more) MRG entries in the same MRG have the same regularized form phrase in their `formPhrases` field. If two entries end up having the same form phrase, an exception is raised to avoid ambiguity in referencing semantic units.
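One way to detect such clashes is to record, for every form phrase, which entries contain it. The sketch below only reports the ambiguous phrases; raising the exception is left to the caller. The function name and the reduced `MrgEntry` type are assumptions made for illustration.

```typescript
// Reduced, hypothetical entry type: only the fields needed for the check.
type MrgEntry = { termid: string; formPhrases: string[] };

// Returns each form phrase that occurs in more than one entry,
// mapped to the termids of the entries that contain it.
function findAmbiguousFormPhrases(entries: MrgEntry[]): Map<string, string[]> {
  const owners = new Map<string, string[]>();
  for (const entry of entries) {
    // De-duplicate within one entry so that an entry cannot clash with itself.
    for (const phrase of Array.from(new Set(entry.formPhrases))) {
      const ids = owners.get(phrase) ?? [];
      ids.push(entry.termid);
      owners.set(phrase, ids);
    }
  }
  const ambiguous = new Map<string, string[]>();
  owners.forEach((ids, phrase) => {
    if (ids.length > 1) ambiguous.set(phrase, ids);
  });
  return ambiguous;
}
```

Returning the offending `termid` values alongside each phrase gives an error message enough context for a curator to decide which entry to change.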
Resulting Provisional MRG after Synonym Processing:
- After processing synonyms, the provisional MRG contains updated entries where all synonyms are correctly linked to their originals. This is crucial for maintaining a consistent and unambiguous terminology within the scope.
- These refined provisional MRGs are ready for further processing in the next phases, where other fields and checks will be finalized.
Phase 3: Storing provisional MRGs in the glossarydir
In Phase 3, the MRGT tool finalizes the provisional MRG by ensuring that all necessary fields for each MRG entry are correctly populated, standardized, and consistent. This phase comes after all synonyms have been processed in Phase 2 and ensures that the provisional MRG entries are fully prepared for storage as the final MRG files.
Step-by-Step Process
Populating and Validating Fields:
- Each provisional MRG entry must have specific fields populated and validated to conform to the required structure. The tool ensures that these fields are either filled with correct values or generated if missing:
  - `scopetag`: This field is filled with the value from the `scopetag` field in the `scope` section of the SAF. It uniquely identifies the scope within which the terminology is curated.
  - `vsntag`: This field is set to the version tag that identifies the version of the terminology for which the MRG entry is generated. If the entry is derived from a curated text, its value is taken from the `vsntag` field in the `terminology` section of the MRG.
  - `termType`: The `termType` field must exist and be regularized. If it does not exist, it is created with a value equal to the `defaulttype` field in the `scope` section of the SAF, or `concept` if `defaulttype` is absent.
  - `term`: This field must exist and is regularized. If it does not exist, an exception is raised, as this is a critical field.
  - `termid`: The value of this field is set to "`<termType>:<term>`", combining the regularized values of `termType` and `term`. Each MRG entry must have a unique `termid` within the MRG to avoid conflicts.
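The field rules above can be summarized in a short sketch. Note the assumptions: `regularize` is a stand-in that only trims and lowercases, whereas the actual regularization rules are defined elsewhere; `ScopeSection`, `Header`, and `populate` are hypothetical names.

```typescript
// Assumed, simplified shapes for the SAF `scope` section and a curated text header.
interface ScopeSection {
  scopetag: string;
  defaulttype?: string;
}
type Header = { term?: string; termType?: string; [field: string]: unknown };

// Stand-in for the real regularization rules (assumption: trim and lowercase only).
function regularize(text: string): string {
  return text.trim().toLowerCase();
}

function populate(header: Header, scope: ScopeSection, vsntag: string) {
  // `term` is critical: without it, no entry can be produced.
  if (!header.term) throw new Error("MRG entry has no `term` field");
  const term = regularize(header.term);
  // Fall back to the scope's defaulttype, and to `concept` if that is absent too.
  const termType = regularize(header.termType ?? scope.defaulttype ?? "concept");
  return {
    ...header,
    scopetag: scope.scopetag,
    vsntag,
    termType,
    term,
    termid: `${termType}:${term}`,
  };
}
```

For a header that only contains `term: "Party"` in a scope without `defaulttype`, this sketch yields the `termid` value `concept:party`.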
Setting Up Navigation and Locators:
- The `locator` and `navurl` fields are populated to ensure correct referencing and navigation within the documentation system:
  - `locator`: This field contains the path (relative to `scopedir`/`curatedir`) of the file that contains the header of the curated text. It ensures traceability back to the original curated document.
  - `navurl`: The `navurl` is constructed by concatenating `website`/`navpath`/`curatedir`/`id`, where these elements are defined in the `scope` section of the SAF. If the `bodyFile` field in the header of the curated text file is set, `navurl` becomes `website`/`bodyFile`. The `id` part is determined based on the presence of a `navid` field:
    - If `navid` is specified in the `scope` section of the SAF, it specifies which field from the curated text or body file is used for `id`.
    - If `navid` is not specified, `id` defaults to the name of the curated text file or body file.
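The `navurl` rules above leave some details open, so the following is only one possible reading, with the assumptions stated explicitly: path segments are joined with single slashes, the file name is used without its extension, and a `navid` that names a missing header field falls back to the file name. `NavScope`, `joinUrl`, and `buildNavurl` are hypothetical names.

```typescript
// Assumed subset of the SAF `scope` section that is relevant for navurl.
interface NavScope {
  website: string;
  navpath: string;
  curatedir: string;
  navid?: string;
}

// Joins URL parts with single slashes, ignoring empty parts.
function joinUrl(...parts: string[]): string {
  return parts
    .map((part) => part.replace(/^\/+|\/+$/g, ""))
    .filter((part) => part !== "")
    .join("/");
}

function buildNavurl(scope: NavScope, header: Record<string, unknown>, fileName: string): string {
  // If the header sets bodyFile, navurl becomes website/bodyFile.
  const bodyFile = header["bodyFile"];
  if (typeof bodyFile === "string" && bodyFile !== "") {
    return joinUrl(scope.website, bodyFile);
  }
  // Otherwise: website/navpath/curatedir/id, where id comes from the header field
  // named by navid, or defaults to the file name (assumption: without extension).
  const fromHeader = scope.navid ? header[scope.navid] : undefined;
  const id = typeof fromHeader === "string" ? fromHeader : fileName.replace(/\.[^./]+$/, "");
  return joinUrl(scope.website, scope.navpath, scope.curatedir, id);
}
```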
Generating `formPhrases` and `headingids`:
- The tool completes the setup of all MRG entries by processing fields that support additional navigation and search capabilities:
  - `formPhrases`: This field is populated with an array of regularized form phrases. One of the elements must be the same as the `term` field, ensuring that tools can find all relevant forms.
  - `headingids`: This field is constructed by extracting all markdown headings found in the body file or curated text file, and normalizing them into a list. Custom heading IDs, if present, are also included as-is. This supports both default headers and custom-defined ones, ensuring accurate navigation.
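Extracting `headingids` could be sketched as below. The normalization used here (lowercase, punctuation dropped, whitespace turned into hyphens) is an assumption, as is the reading that a heading with a custom ID of the form `{#my-id}` contributes that ID instead of a normalized one. The sketch handles ATX-style (`#`) headings only and does not skip fenced code blocks.

```typescript
// Assumed normalization: lowercase, drop punctuation, whitespace to hyphens.
function extractHeadingIds(markdown: string): string[] {
  const ids: string[] = [];
  for (const line of markdown.split(/\r?\n/)) {
    // Match ATX headings such as "## Some Heading".
    const heading = /^#{1,6}\s+(.*?)\s*$/.exec(line);
    if (!heading) continue;
    const text = heading[1] ?? "";
    // A custom heading ID such as "{#my-id}" is taken over as-is.
    const custom = /\{#([^}]+)\}\s*$/.exec(text);
    if (custom) {
      ids.push(custom[1] ?? "");
      continue;
    }
    ids.push(
      text
        .toLowerCase()
        .replace(/[^a-z0-9\s-]/g, "")
        .trim()
        .replace(/\s+/g, "-")
    );
  }
  return ids;
}
```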
Final Consistency Check:
- Before concluding this phase, the MRGT performs a final consistency check to ensure the integrity of the MRG entries:
  - All `termid` values must be unique within the MRG.
  - No two MRG entries should have the same regularized form phrase in their `formPhrases` field, to avoid ambiguity.
Resulting MRG Ready for Storage:
- After Phase 3 is complete, each provisional MRG has all required fields accurately populated and validated. The entries are consistent, standardized, and ready to be stored as final MRG files in the glossarydir of the current scope.
Key Points to Remember:
- Phase 3 ensures that all fields necessary for MRG entries are correctly set up, validated, and standardized.
- The `termid`, `locator`, `navurl`, `formPhrases`, `headingids`, and other fields are crucial for the correct functionality of the generated MRG files.
- A final consistency check ensures the uniqueness of identifiers and prevents any ambiguities in the glossary.
By the end of Phase 3, the MRG is complete and ready for storage or use by other tools in the toolbox.
Exceptions, Warnings, and Logging
This section needs to be reviewed/revised so as to enable a consistent way of error checking and logging, similar to what is done in the TRRT.
The MRGT is designed to assist its users (primarily curators and developers) by providing informative feedback that helps them identify and resolve issues encountered during the generation of MRGs. The tool follows a principle of minimizing errors that halt processing, favoring warnings and informative messages whenever possible. The verbosity of these messages can be adjusted through command-line options.
Error Handling Strategy
The MRGT employs a robust error-handling strategy that focuses on:
- Limiting Terminating Errors: Errors that stop the entire process are kept to a minimum. They only occur in critical scenarios where further processing would lead to invalid results or corrupt data. Examples include missing required parameters, encountering unreadable files, or insufficient write permissions for output directories.
- Providing Warnings with Varying Severity Levels: Warnings are issued to inform users of potential problems that do not immediately stop the process but may affect the output or require user attention. These warnings can be controlled through the `--debug` flag, which allows users to choose between the `info`, `warn`, `debug`, `error`, and `trace` levels of verbosity, based on their needs. For example:
  - `info`: General information about the processing.
  - `warn`: Non-critical issues that need attention.
  - `debug`: Detailed output for diagnosing problems.
  - `error`: Critical errors that prevent proper execution.
  - `trace`: Most verbose output for step-by-step troubleshooting.

The `-d` (`--debug`) option may not yet work as specified.
- Helpful Suggestions for Resolution: Whenever an error or warning is generated, the MRGT provides context and actionable suggestions to help the user resolve the issue. This includes potential fixes for file format errors, missing fields, and configuration issues.
Common Logging Scenarios
The MRGT logs conditions that prevent it from properly executing tasks, such as:
- Obtaining the `scopedir` from a `scopetag`: If the `scopetag` does not resolve to a valid directory, a warning or error is logged.
- Parsing a curated text: Issues may arise if the text is not in the expected format (e.g., invalid YAML front matter or markdown errors), which will be logged with details to assist the user in correcting the format.
- Resolving terms, scope tags, group tags, or version tags: If these elements cannot be resolved due to mismatches or missing entries in the SAF, warnings or errors are logged.
- Writing the output: Problems such as lacking write permissions for the designated location are logged as errors.
Leveraging Logging for Troubleshooting
- Adjust Logging Levels: Use the `--debug` flag to set the desired verbosity level when running the tool. For detailed debugging, use the `debug` or `trace` levels to see internal state information, variable values, and detailed stack traces.
- Review Log Messages: Analyze the log messages to pinpoint where issues occur. For example, a message like "Failed to parse curated text at `path/to/file`" not only indicates the file but often provides the line or character position where the parsing failed.
- Follow Suggestions: Each warning or error message includes suggestions for resolving the problem. These may involve correcting file paths, adjusting configurations, or ensuring dependencies are met.
Developer Support and Continuous Integration (CI)
- The MRGT comes with comprehensive documentation that enables developers to verify its correct functioning. This includes guidelines for setting up test environments, using test scripts to validate parameter handling, and examples of common use cases.
- The tool is designed to be easily integrated into a CI/CD pipeline, allowing for automated testing and deployment in git repositories. Developers can configure CI pipelines to run the MRGT with various configurations and ensure that any updates or changes do not introduce new issues.
By effectively using the error handling, warnings, and logging mechanisms provided by the MRGT, users can efficiently identify and resolve issues, ensuring smooth and reliable generation of MRGs.
[1] The MRGT does NOT overwrite files that contain an MRG until all content has been constructed. Thus, the 'old' MRGs remain available as a (possible) source from which MRG entries can be copied during the construction of one or more provisional MRGs.