Knowledge sources
A knowledge source is a connector to an external source of data. Generally speaking, knowledge sources provide sets of records (i.e. atomic elements with multiple properties) that CogTL converts into small graphs, which are then merged into the global knowledge graph (or core brain, in CogTL terms).
For example, a "row" in a database is a record, with all its columns as properties. Other examples are: a line in a CSV file, an item in a JSON array, an element and all its descendants in an XML stream, or an entry in an LDAP directory. The principle is always the same: you have a set of properties as input (which can be multivalued), and the knowledge source describes how to convert this set into a graph.
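As an illustration of the record concept (a hypothetical Python sketch, not CogTL's actual API), a CSV row can be read as a plain set of named properties:

```python
import csv
import io

# A "record" is just a set of named properties, whatever the source format.
# Here, each CSV row becomes one record with its columns as properties.
csv_data = "name,country\nAlice,CH\nBob,FR\n"
records = list(csv.DictReader(io.StringIO(csv_data)))

print(records[0])  # {'name': 'Alice', 'country': 'CH'}
```

The same idea applies to JSON array items, XML elements, or LDAP entries: the parser's job is only to produce such property sets, which the knowledge source then maps to a graph.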
The editing screen of a knowledge source is composed of the following parts:
- Basic elements: its name and the agent on which the knowledge source will be acquired
- Input configuration: how to access the source data. This includes the connectivity parameters, the location of the files, and the configuration of the data parsers
- Preloading actions: actions executed before the data is loaded (e.g. invoking a script to copy data from somewhere else)
- Adapters: a set of pipelines that modify/merge/split the properties of a record to create new properties
- Filter: an expression that selects which records to import or discard
- Records preview: an informative test panel to preview the input records
- Modelling: the definition of the graph that will be created for every record
Depending on the source type, an additional automation panel may be displayed, allowing you to define how to watch the source and automatically reload it when it has changed.
Input configuration
Input configuration depends on the type of knowledge source. For file-based sources (e.g. CSV, Excel, XML, JSON, LDIF, generic...), you can choose between three main options:
- Remote file: Fetch the source file(s) from a remote directory. You can configure the path, the file mask used to select files in the source path, and an optional username/password and domain to read the files as a specific user. When watching the source, the last-modified timestamp of the files is used to determine whether the data should be reloaded.
- URL: Fetch the source data from a URL. You can configure the URL, the requested MIME type, and whether GZip compression should be used. You can also specify an alternate URL to check for updates when watching the source.
- Upload a file: Load the source file(s) directly from the CogTL server staging area. This lets you upload a file (or a new version of it) directly to the CogTL server. Please refer to the integration guide to set up the staging properly. By default, the staging area is located inside the CogTL server container, which may not survive a container upgrade. This option can only be used if the knowledge source is deployed on the main CogTL server (not on an agent).
Preloading actions
Preloading actions define what must happen when the source is about to be reloaded, just before data starts being imported. The following actions are available:
Execute an external program
Use this preloading action to execute an external binary before loading begins, by specifying the path and filename of the program to run, optional parameters, and an optional working directory.
Optionally, you can also specify an expected exit code. If the actual exit code doesn't match, the loading of the source is interrupted.
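The exit-code check can be sketched as follows (a hypothetical Python equivalent, not CogTL's implementation; the function name and signature are illustrative):

```python
import subprocess
import sys

def run_preloading_action(program, args=(), cwd=None, expected_exit_code=0):
    """Run an external program and abort the load if the exit code
    does not match the expectation."""
    result = subprocess.run([program, *args], cwd=cwd)
    if result.returncode != expected_exit_code:
        raise RuntimeError(
            f"preloading action failed: exit code {result.returncode}, "
            f"expected {expected_exit_code}; source loading interrupted"
        )
    return result.returncode

# A program that exits with 0 lets the loading continue:
run_preloading_action(sys.executable, ["-c", "pass"])
```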
Adapters
Adapters are "pipelines" that modify an input set of parameters. An adapter always has three mandatory parts:
- The input parameter name or index
- A chain (pipeline) of functions, that will be executed in order
- The output parameter name (which can be the same as the input parameter name, to overwrite its content)
Optionally, a multivalued group name can be specified to group multivalued fields together (more on that below).
If multiple adapters are specified, they are also executed in order, meaning that every adapter can use as its input parameter the output of an adapter defined higher up.
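The ordering semantics can be sketched like this (assumed behavior, in hypothetical Python rather than CogTL's internals):

```python
# Each adapter reads one input parameter, applies its functions in order,
# and writes an output parameter that later adapters can use in turn.
def apply_adapters(record, adapters):
    for input_name, functions, output_name in adapters:
        value = record.get(input_name)
        for fn in functions:          # the chain is executed in order
            value = fn(value)
        record[output_name] = value   # may overwrite the input parameter
    return record

adapters = [
    ("name", [str.strip, str.upper], "name"),        # Trim then Case, in place
    ("name", [lambda v: f"user:{v}"], "user_key"),   # uses the previous output
]
print(apply_adapters({"name": "  alice  "}, adapters))
# {'name': 'ALICE', 'user_key': 'user:ALICE'}
```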
Functions
The following functions are available:
Type | Description |
---|---|
Default value | Defines a value (of any type) to use when the input is not defined |
Switch | To be used with an explicit set of codes as input (enumerations). Replaces any value with something else, for example a code with its description |
Case | Adapts the case of the input: converts it to UPPERCASE, lowercase, or Camel Case |
Trim | Removes leading and trailing spaces, tabs, and newlines |
Reg.exp | Defines a regular expression with one capturing group, which will be converted to something else (potentially including the capturing group) |
Date/time string | Parses a date/time string and converts it to a date within CogTL. Please refer to [this page](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html) for the syntax to use |
Numeric timestamp | Parses a numeric timestamp to a date in CogTL. Available numeric input types include UNIX (Epoch) time (number of seconds since Jan 1st, 1970), UNIX time in milliseconds (also used by most modern platforms, e.g. Java, .NET), LDAP timestamps, and Excel dates/times stored as doubles |
IPv4 Subnet | Creates an IPv4 subnet from two distinct fields (typically the IP and the mask) |
Splitter | Converts a single-value parameter to a multivalued parameter by splitting it with a regular expression (e.g. with `\s*,\s*`, the input is split on commas, with any number of spaces before or after each comma) |
Merger | Merges multiple parameters into a single value |
Expression | Transforms the input using a CogTL expression. The expression can use the input alone, or other available attributes |
MV creator | Transforms multiple single- or multivalued fields into a single multivalued field, optionally removing duplicates |
MV filter | Keeps only specific elements of a multivalued parameter, using a CogTL expression. Can stop on the first match (i.e. transform the multivalued attribute into a single value), or keep all elements that pass the filter |
MV merger | Merges a multivalued field into a single value, using either concatenation or a mathematical function (sum, min, max) for numeric values |
Hash function | Calculates the hash of the input parameter using a specified algorithm and encoding |
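The numeric timestamp conversions listed above follow standard formulas, sketched below in Python (a hedged illustration of the usual conversions; CogTL's internals may differ):

```python
from datetime import datetime, timedelta, timezone

def from_unix(seconds):
    # UNIX (Epoch) time: seconds since 1970-01-01 UTC.
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

def from_unix_ms(millis):
    # UNIX time in milliseconds, as used by Java, .NET, JavaScript, etc.
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

def from_ldap(ldap_ts):
    # LDAP/Active Directory timestamps count 100-nanosecond intervals
    # since 1601-01-01 UTC.
    return datetime(1601, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=ldap_ts / 10)

def from_excel(serial):
    # Excel stores dates/times as (fractional) days since 1899-12-30.
    return datetime(1899, 12, 30, tzinfo=timezone.utc) + timedelta(days=serial)

print(from_unix(0))       # 1970-01-01 00:00:00+00:00
print(from_excel(25569))  # 1970-01-01 00:00:00+00:00 (Excel serial of the UNIX epoch)
```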
Multivalued groups
CogTL handles multivalued parameters by creating clones of the entities concerned by those parameters in the modelling graph. However, there may be situations where multiple parameters are multivalued but represent different collections of different sizes.
Example: a `Country` could contain x `Region`s and y `Province`s. Depending on how the information is stored in the data source, it could come from two different parameters, each converted to a multivalued parameter using a Splitter adapter. However, they belong to different collections and should therefore NOT be used together. In that situation, we would set two different multivalued group names for those collections.
In other cases, two collections obtained from two different parameters may be related. In the example above, a `Country` record could contain a parameter `Region`, with a list of regions separated by commas, and another parameter `Population`, with the population of every region also separated by commas. In that situation, both collections must be aligned, and therefore we would set the same multivalued group name for both adapter pipelines.
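The alignment behavior of a shared multivalued group can be pictured like Python's `zip()` (an illustrative sketch with made-up sample data, not CogTL code):

```python
import re

# Two comma-separated parameters that belong to the SAME multivalued group
# stay aligned position by position.
record = {"Region": "Bern, Vaud, Geneva", "Population": "1000000, 800000, 500000"}

def split(value):
    # A Splitter adapter with the pattern \s*,\s*
    return re.split(r"\s*,\s*", value)

regions = split(record["Region"])
populations = split(record["Population"])

# Same multivalued group name => values are paired by index:
pairs = list(zip(regions, populations))
print(pairs)  # [('Bern', '1000000'), ('Vaud', '800000'), ('Geneva', '500000')]
```

With two *different* group names, no such pairing would happen: each collection would clone entities independently.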
Filter
The filter is a boolean expression that will be applied to every record. If the expression evaluates to "false", then the record will be discarded and not imported.
You can refer to all record parameters using the syntax `@"<fieldname>"`; for example, `@"age" > 18` keeps only records where the numeric field "age" is above 18.
Please refer to the documentation of expressions for more information about the expressions syntax.
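As a rough Python equivalent of the filter `@"age" > 18` (a hypothetical sketch; CogTL evaluates its own expression language, not Python):

```python
# A filter is a boolean predicate evaluated against each record's parameters;
# records for which it returns False are discarded.
def passes_filter(record):
    return int(record.get("age", 0)) > 18

records = [{"name": "Alice", "age": "34"}, {"name": "Bob", "age": "12"}]
kept = [r for r in records if passes_filter(r)]
print([r["name"] for r in kept])  # ['Alice']
```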
Modelling
Modelling is probably the most important part of a knowledge source. This is where every record is converted to a graph.
Create an entity
In edit mode (toggle it with the small pencil button in the upper right), click anywhere on the template to create an entity. A popover opens asking for the entity type to create; as soon as you have selected it, you can fill in data values for this entity type.
Every data value has two options:
- The "@" button makes the value refer to a field parameter. If it is not checked, the value is treated as a constant.
- The "*" button marks the field as mandatory: if no value is found for it, the entity won't be created.
The "Options" panel also allows you to define a probability for the entity (in case the knowledge source has some uncertainty). When entities are merged, their probability helps determine which one has priority. This panel also contains a checkbox, Ignored if not already existing: this option imports an entity only if it already exists in the knowledge graph (typically used for a knowledge source that contains some old/dirty data).
Create a relation
Also in edit mode, you can drag and drop from one entity to another to create a relation. Here again, a popover opens, asking for the relation type to create and its probability.
Edit or delete elements
In edit mode, simply click on any entity or relation to open an editing popover, allowing you to modify some of its elements, or delete it.
Merge process of a knowledge source
Whenever a knowledge source is loaded, CogTL iterates over all source records to build a knowledge graph of this particular source. After application of the adapters and filters, every record is transformed into a subgraph defined by the template you have set. Entity constraints are already applied at that stage, to immediately merge similar entities.
When all data is loaded, the imported graph is compared with the existing graph of all entities and relations known from this knowledge source. If the new graph is drastically smaller than before, the loading is aborted unless safe mode is deactivated. Otherwise, the two graphs are merged using all entity constraints, and objects that have disappeared from the source are updated or deleted, depending on whether they are known from other knowledge sources.
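The safe-mode size check can be sketched as follows (a hedged illustration: the threshold ratio and function name are assumptions, as the actual cutoff used by CogTL is not documented here):

```python
def check_safe_mode(previous_size, new_size, safe_mode=True, min_ratio=0.5):
    """Abort the load if the newly imported graph is drastically smaller
    than the graph previously known from this source."""
    if safe_mode and previous_size > 0 and new_size < previous_size * min_ratio:
        raise RuntimeError(
            f"loading aborted: new graph has {new_size} objects, "
            f"previous had {previous_size} (safe mode)"
        )
    return True

check_safe_mode(previous_size=1000, new_size=950)  # sizes comparable: load proceeds
```

This guard protects the knowledge graph against an accidentally truncated or empty source file wiping out previously imported entities.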