Business Glossary
This plugin pulls business glossary metadata from a yaml-formatted file. An example of one such file is located in the examples directory here.
CLI based Ingestion
Install the Plugin
pip install 'acryl-datahub[datahub-business-glossary]'
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
  type: datahub-business-glossary
  config:
    # Coordinates
    file: /path/to/business_glossary_yaml
    enable_auto_id: true # recommended to set to true so datahub will auto-generate guids from your term names
# sink configs if needed
Config Details
Note that a period (.) is used to denote nested fields in the YAML recipe.
Field | Type | Required | Default | Description |
---|---|---|---|---|
file | One of string, string(path) | Yes | (none) | File path or URL to business glossary file to ingest. |
enable_auto_id | boolean | No | False | Generate guid urns instead of a plaintext path urn with the node/term's hierarchy. |
The JSONSchema for this configuration is inlined below.
{
  "title": "BusinessGlossarySourceConfig",
  "type": "object",
  "properties": {
    "file": {
      "title": "File",
      "description": "File path or URL to business glossary file to ingest.",
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "string",
          "format": "path"
        }
      ]
    },
    "enable_auto_id": {
      "title": "Enable Auto Id",
      "description": "Generate guid urns instead of a plaintext path urn with the node/term's hierarchy.",
      "default": false,
      "type": "boolean"
    }
  },
  "required": [
    "file"
  ],
  "additionalProperties": false
}
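To make the constraints in the schema concrete, here is a minimal sketch of the same checks in plain Python. The function name `validate_glossary_config` is hypothetical and is not part of DataHub; the real source validates its config internally, and this only mirrors the three rules visible in the schema (required `file`, boolean `enable_auto_id` defaulting to false, no extra keys).

```python
def validate_glossary_config(config):
    """Check a source `config` mapping against the documented schema.

    Hypothetical helper for illustration only. Returns a new dict with
    defaults filled in and raises ValueError when a rule is violated.
    """
    allowed = {"file", "enable_auto_id"}

    # "additionalProperties": false -> reject keys the schema does not list.
    unknown = set(config) - allowed
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")

    # "required": ["file"] -> the file path or URL must be present.
    if "file" not in config:
        raise ValueError("'file' is required")
    if not isinstance(config["file"], str):
        raise ValueError("'file' must be a string (path or URL)")

    # "enable_auto_id" is an optional boolean with default false.
    enable_auto_id = config.get("enable_auto_id", False)
    if not isinstance(enable_auto_id, bool):
        raise ValueError("'enable_auto_id' must be a boolean")

    return {"file": config["file"], "enable_auto_id": enable_auto_id}
```

Calling it with only `file` set returns the config with `enable_auto_id` filled in as `False`, which matches the documented default.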
Business Glossary File Format
The business glossary source file should be a .yml file with the following top-level keys:
Glossary: the top level keys of the business glossary file
Example Glossary:
version: 1 # the version of the business glossary file format this file conforms to. Currently the only version released is `1`.
source: DataHub # the source format of the terms. Currently only supports `DataHub`
owners: # owners contains two nested fields
  users: # (optional) a list of user IDs
    - njones
  groups: # (optional) a list of group IDs
    - logistics
url: "https://github.com/datahub-project/datahub/" # (optional) external url pointing to where the glossary is defined externally, if applicable
nodes: # list of child **GlossaryNode** objects. See **GlossaryNode** section below
  ...
GlossaryNode: a container of GlossaryNode and GlossaryTerm objects
Example GlossaryNode:
- name: Shipping # name of the node
  description: Provides terms related to the shipping domain # description of the node
  owners: # (optional) owners contains 2 nested fields
    users: # (optional) a list of user IDs
      - njones
    groups: # (optional) a list of group IDs
      - logistics
  nodes: # list of child **GlossaryNode** objects
    ...
  knowledge_links: # (optional) list of **KnowledgeCard** objects
    - label: Wiki link for shipping
      url: "https://en.wikipedia.org/wiki/Freight_transport"
GlossaryTerm: a term in your business glossary
Example GlossaryTerm:
- name: FullAddress # name of the term
  description: A collection of information to give the location of a building or plot of land. # description of the term
  owners: # (optional) owners contains 2 nested fields
    users: # (optional) a list of user IDs
      - njones
    groups: # (optional) a list of group IDs
      - logistics
  term_source: "EXTERNAL" # one of `EXTERNAL` or `INTERNAL`. Whether the term is coming from an external glossary or one defined in your organization.
  source_ref: FIBO # (optional) if external, what is the name of the source the glossary term is coming from?
  source_url: "https://www.google.com" # (optional) if external, what is the url of the source definition?
  inherits: # (optional) list of **GlossaryTerm** that this term inherits from
    - Privacy.PII
  contains: # (optional) a list of **GlossaryTerm** that this term contains
    - Shipping.ZipCode
    - Shipping.CountryCode
    - Shipping.StreetAddress
  custom_properties: # (optional) a map of key/value pairs of arbitrary custom properties
    is_used_for_compliance_tracking: true
  knowledge_links: # (optional) a list of **KnowledgeCard** related to this term. These appear as links on the glossary term's page
    - url: "https://en.wikipedia.org/wiki/Address"
      label: Wiki link
  domain: "urn:li:domain:Logistics" # (optional) domain name or domain urn
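The `inherits` and `contains` entries above refer to other terms by a dotted path such as `Shipping.ZipCode`. As a rough sketch of how those paths relate to the node hierarchy, the snippet below walks an already-parsed glossary (a plain Python dict, as a YAML loader would produce) and reports references that match no term in the same file. Several things here are assumptions rather than documented behavior: that terms sit under a `terms` key on each node, that a path is simply the node names and term name joined with `.`, and the helper names `iter_terms` and `unresolved_references`, which are hypothetical. The actual source may resolve references differently.

```python
def iter_terms(nodes, prefix=""):
    """Yield (dotted_path, term_dict) for every term nested under `nodes`."""
    for node in nodes or []:
        node_path = prefix + node["name"]
        # Assumption: a node lists its terms under a `terms` key.
        for term in node.get("terms") or []:
            yield node_path + "." + term["name"], term
        # Recurse into child nodes, extending the dotted prefix.
        yield from iter_terms(node.get("nodes"), node_path + ".")


def unresolved_references(glossary):
    """Return (term_path, reference) pairs whose reference matches no term path."""
    terms = dict(iter_terms(glossary.get("nodes")))
    missing = []
    for path, term in terms.items():
        refs = (term.get("inherits") or []) + (term.get("contains") or [])
        for ref in refs:
            if ref not in terms:
                missing.append((path, ref))
    return missing


# A tiny glossary in the shape described above (illustrative data only).
glossary = {
    "version": 1,
    "source": "DataHub",
    "nodes": [
        {
            "name": "Shipping",
            "description": "Provides terms related to the shipping domain",
            "terms": [
                {
                    "name": "FullAddress",
                    "description": "Location of a building or plot of land.",
                    "inherits": ["Privacy.PII"],
                    "contains": ["Shipping.ZipCode"],
                },
                {"name": "ZipCode", "description": "Postal code."},
            ],
        }
    ],
}

for term_path, ref in unresolved_references(glossary):
    print(f"{term_path} references {ref}, which is not defined in this file")
```

In this sample, `Shipping.ZipCode` resolves, while `Privacy.PII` is reported because no `Privacy` node is defined in the same dict.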
To see how these all work together, check out this comprehensive example business glossary file below:
Example business glossary file
Source file linked here.
Generating custom IDs for your terms
IDs are normally inferred from the glossary term or node's name (see the enable_auto_id config). If you need a stable identifier, you can instead specify a custom ID for your term. It must be unique across the entire glossary.
Here's an example ID:
id: "urn:li:glossaryTerm:41516e310acbfd9076fffc2c98d2d1a3"
A note of caution: once you select a custom ID, it cannot be easily changed.
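One way to produce an ID in the same shape as the example above (a `urn:li:glossaryTerm:` prefix followed by 32 hexadecimal characters) is to generate a random UUID once and paste the result into your glossary file. This is only a sketch: the function name `new_glossary_term_id` is hypothetical, and it is not necessarily how DataHub computes its own guids when `enable_auto_id` is set.

```python
import uuid


def new_glossary_term_id():
    """Return a glossaryTerm urn with a random 32-character hex suffix.

    Hypothetical helper: run it once per term and store the result in the
    term's `id` field, since the value changes on every call.
    """
    return f"urn:li:glossaryTerm:{uuid.uuid4().hex}"


print(new_glossary_term_id())
```

Because the value is random, uniqueness across the glossary is overwhelmingly likely, but the ID carries no meaning, so keep it alongside the term name in the file.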
Compatibility
This source is compatible with version 1 of the business glossary format. It will evolve as newer versions of the format are published.
Code Coordinates
- Class Name: datahub.ingestion.source.metadata.business_glossary.BusinessGlossaryFileSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Business Glossary, feel free to ping us on our Slack.