Knowledge sources

A knowledge source is a connector to an external source of data. Generally speaking, a knowledge source provides sets of records (i.e. atomic elements with multiple properties) that CogTL converts to small graphs, which are then merged into the global knowledge graph (the "core brain" in CogTL terms).

For example, a row in a database is a record whose columns are its properties. Other examples are: a line in a CSV file, an item in a JSON array, an element and all its descendants in an XML stream, or an entry in an LDAP directory. The principle is always the same: you have a set of properties as input (some of which can be multivalued), and the knowledge source describes how to convert this set to a graph.
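
To make the notion of a record concrete, here is a minimal Python sketch that reads records from CSV text and from a JSON array. It only illustrates the idea of "one set of properties per record"; the function names are hypothetical and this is not how CogTL parses sources internally.

```python
import csv
import io
import json


def records_from_csv(text):
    """Each CSV line becomes one record: a dict of column name -> value."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def records_from_json_array(text):
    """Each item of a top-level JSON array becomes one record."""
    return [dict(item) for item in json.loads(text)]


csv_records = records_from_csv("name,country\nAlice,CH\nBob,FR\n")
# A JSON property holding a list is a natural example of a multivalued property.
json_records = records_from_json_array('[{"name": "Alice", "roles": ["admin", "dev"]}]')
```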

The editing screen of a knowledge source is composed of the following parts:

Basic elements: its name and the agent on which the knowledge source will be acquired
Input configuration: how to access the source data. This includes the connectivity parameters, the location of the files, and the configuration of the data parsers
Preloading actions: actions that should be executed prior to loading the data (e.g. invoking a script to copy data from somewhere else)
Adapters: a set of pipelines that modify, merge, or split the properties of a record to create new properties
Filter: an expression that selects which records to import or discard
Records preview: an informative test panel for previewing the input records
Modelling: the definition of the graph that will be created for every record

Depending on the source type, an additional automation panel is displayed, where you can define how to watch the source and automatically reload it when it has changed.

Input configuration

Input configuration depends on the type of knowledge source. For file-based sources (e.g. CSV, Excel, XML, JSON, LDIF, generic...), you can choose between three main options:

Remote file: Fetch the source file(s) from a remote directory. You can configure the path, the file mask used to select files in the source path, and an optional username/password and domain to read the files as a specific user. If the source is watched, the last-modified timestamp of the files is used to determine whether the data should be reloaded.
URL: Fetch the source data from a URL. You can configure the URL, the requested MIME type, and whether GZip compression should be used. You can also specify an alternate URL to check for updates if the source is watched.
Upload a file: Load the source file(s) directly from the CogTL server staging. This option lets you upload a file (or a new version of it) directly to the CogTL server. Please refer to the integration guide to set up the staging properly. By default, the staging is located inside the CogTL server container, which may not survive a container upgrade. This option can only be used if the knowledge source is deployed on the main CogTL server (not on an agent).
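
The "Remote file" watching behaviour described above (a file mask plus last-modified timestamps) can be sketched as follows. This is only an illustration under assumptions: the function names and the `last_loaded_at` parameter are hypothetical, and CogTL's actual change detection may differ.

```python
import fnmatch
import os
import tempfile


def latest_mtime(directory, mask):
    """Return the newest modification time among files matching the mask, or None."""
    times = [
        os.path.getmtime(os.path.join(directory, name))
        for name in os.listdir(directory)
        if fnmatch.fnmatch(name, mask)
    ]
    return max(times) if times else None


def needs_reload(directory, mask, last_loaded_at):
    """True if at least one matching file was modified after the last load."""
    newest = latest_mtime(directory, mask)
    return newest is not None and newest > last_loaded_at


# Demo against a temporary directory containing a single CSV file.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "people.csv")
    with open(path, "w") as handle:
        handle.write("name\nAlice\n")
    os.utime(path, (1000, 1000))  # force a known modification time
    changed = needs_reload(directory, "*.csv", last_loaded_at=500)
    unchanged = needs_reload(directory, "*.csv", last_loaded_at=2000)
    no_match = needs_reload(directory, "*.xml", last_loaded_at=0)
```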

Preloading actions

Preloading actions define things that must happen when the source is reloaded, just before data starts to be imported. The following actions are available:

Execute an external program

Use this preloading action to execute an external binary before the loading begins, by specifying the path and filename of the program to run, optional parameters and an optional starting directory.

Optionally, you can also specify an expected exit code. If the exit code doesn't match this expectation, the loading of the source is then interrupted.
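
The behaviour of this action is similar to the following Python sketch, which runs a program with optional parameters and starting directory, then compares the exit code with the expected one. The function and parameter names are hypothetical; this is not CogTL code.

```python
import subprocess
import sys


def run_preloading_action(program, args=(), cwd=None, expected_exit_code=None):
    """Run an external program and return True if loading may proceed."""
    completed = subprocess.run([program, *args], cwd=cwd)
    if expected_exit_code is not None and completed.returncode != expected_exit_code:
        return False  # mismatch: the loading would be interrupted
    return True


# Demo: use the current Python interpreter as the "external program".
exit_3 = ["-c", "import sys; sys.exit(3)"]
matches = run_preloading_action(sys.executable, exit_3, expected_exit_code=3)
mismatch = run_preloading_action(sys.executable, exit_3, expected_exit_code=0)
unchecked = run_preloading_action(sys.executable, exit_3)
```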

Adapters

Adapters are "pipelines" designed to modify an input set of parameters. An adapter always has three mandatory parts:

The input parameter name or index
A chain (pipeline) of functions, that will be executed in order
The output parameter name (which can be the same as the input parameter name, to overwrite its content)

Optionally, a multivalued group name can be specified, to group multivalued fields together (more on that topic below).

If multiple adapters are specified, they are also executed in order, meaning that every adapter can use as its input parameter the output of any adapter defined above it.
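
The ordering rules above can be sketched in Python as follows: each adapter reads one input parameter, runs its functions in order, and writes one output parameter, and adapters themselves run in order. The helper names (`make_adapter`, `run_adapters`) and the sample fields are hypothetical and only illustrate the principle.

```python
def make_adapter(input_name, functions, output_name):
    """Build an adapter: input parameter -> chain of functions -> output parameter."""
    def apply(record):
        value = record.get(input_name)
        for function in functions:  # functions are executed in order
            value = function(value)
        result = dict(record)
        result[output_name] = value  # may overwrite the input parameter
        return result
    return apply


def run_adapters(record, adapters):
    """Adapters are executed in order, so later ones can read earlier outputs."""
    for adapter in adapters:
        record = adapter(record)
    return record


adapters = [
    make_adapter("name", [str.strip, str.upper], "name_clean"),
    # This adapter consumes the output of the adapter defined above it.
    make_adapter("name_clean", [lambda value: "user:" + value], "login"),
]
result = run_adapters({"name": "  alice "}, adapters)
```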

Functions

The following types of functions are available in an adapter pipeline:

Type	Description
Default value	Defines a value (of any type) to use when the input is not defined
Switch	To be used with an explicit set of codes as input (enumerations). Replaces any value with something else, for example a code with its description
Case	Adapts the case of the input: converts it to UPPERCASE, lowercase, or Camel Case
Trim	Removes leading and trailing spaces, tabs, or newlines
Reg.exp	Defines a regular expression with one capturing group; the match is converted to something else (potentially including the capturing group)
Date/time string	Parses a date/time string and converts it to a date within CogTL. Please refer to [this page](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html) for the pattern syntax to use
Numeric timestamp	Parses a numeric timestamp to a date in CogTL. Available numeric input types include UNIX (Epoch) time (number of seconds since Jan 1st, 1970), UNIX time in milliseconds, also used by most modern languages (e.g. Java, .NET), LDAP timestamps, and Excel date/times stored as doubles.
IPv4 Subnet	Creates an IPv4 subnet from two distinct fields (typically the IP address and the mask)
Splitter	Converts a single-value parameter to a multivalued parameter by splitting it using a pattern described by a regular expression (e.g. if you use `\s*,\s*`, you will split values separated by commas, with any number of spaces before or after each comma)
Merger	Merges multiple parameters into a single value
Expression	Transforms the input to something else using a CogTL expression. The expression can use the input alone, or other available attributes as well
MV creator	Transforms multiple single- or multivalued fields into a single multivalued field, optionally removing duplicates
MV filter	Keeps only specific elements of a multivalued parameter using a CogTL expression. It can stop on the first match (i.e. transform the multivalued attribute to a single value), or keep all elements that pass the filter
MV merger	Merges a multivalued field into a single value, using either a concatenation or a mathematical function (sum, min, max) for numeric values
Hash function	Calculates the hash of the input parameter using a specified algorithm and encoding
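
As an illustration of the numeric input types handled by the "Numeric timestamp" function, the Python sketch below converts each of them to a date. It rests on assumptions that should be checked against your data: LDAP timestamps are taken to be Active Directory style values (100-nanosecond intervals since Jan 1st, 1601 UTC), Excel values are taken to use the 1900 date system (day 0 is Dec 30th, 1899), and all results are expressed in UTC. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
LDAP_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)    # assumption: AD-style timestamps
EXCEL_EPOCH = datetime(1899, 12, 30, tzinfo=timezone.utc)  # assumption: 1900 date system


def from_unix_seconds(value):
    return UNIX_EPOCH + timedelta(seconds=value)


def from_unix_millis(value):
    return UNIX_EPOCH + timedelta(milliseconds=value)


def from_ldap(value):
    # 100-nanosecond intervals -> microseconds (sub-microsecond precision is dropped)
    return LDAP_EPOCH + timedelta(microseconds=value // 10)


def from_excel(value):
    # The integer part counts days, the fractional part is the time of day.
    return EXCEL_EPOCH + timedelta(days=value)
```

All four functions map the same instant to the same date; for example `from_unix_seconds(0)`, `from_ldap(116444736000000000)` and `from_excel(25569)` all return Jan 1st, 1970 at midnight UTC.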

Multivalued groups

CogTL deals with multivalued parameters by creating clones of the entities concerned by those parameters in the modelling graph. However, there may be situations where multiple parameters are multivalued, but represent different collections of different sizes.

Example: a Country could contain x Regions, and y Provinces. Depending on the way the information is stored in the data source, it could come from two different parameters, that should be converted to multivalued parameters using a Splitter adapter. However, they belong to different collections and should therefore NOT be used together. In that situation, we would set two different multivalued group names for those collections.

In other cases, two collections obtained from two different parameters may be related. In the example above, a Country record could contain a parameter Region, with a list of regions separated by commas, and another parameter Population, with the population of every region also separated by commas. In that situation, both collections must be aligned, and therefore we would set the same multivalued group name for both adapter pipelines.
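
The difference between aligned and independent collections can be sketched in Python as follows. Parameters that share a group name are combined position by position, while parameters in different groups are kept apart. The group names and the `expand_groups` helper are hypothetical; the sketch does not reproduce how CogTL clones entities internally.

```python
import re


def split_values(text, pattern=r"\s*,\s*"):
    """Splitter-like helper: split on commas surrounded by optional spaces."""
    return re.split(pattern, text.strip())


def expand_groups(record, groups):
    """For each group, build one dict per position, aligning its parameters.

    `groups` maps a group name to the list of parameter names it contains.
    zip() stops at the shortest collection, so aligned parameters are
    expected to have the same number of values.
    """
    expanded = {}
    for group_name, params in groups.items():
        columns = [record[param] for param in params]
        expanded[group_name] = [dict(zip(params, values)) for values in zip(*columns)]
    return expanded


record = {
    "Region": split_values("North , South,East"),
    "Population": split_values("100, 200 ,300"),
    "Province": split_values("P1,P2"),
}
expanded = expand_groups(
    record,
    {"regions": ["Region", "Population"], "provinces": ["Province"]},
)
```

Here Region and Population share the group `regions`, so the first region is paired with the first population, and so on. Province sits in its own group and is never combined with the regions.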

Filter

The filter is a boolean expression that will be applied to every record. If the expression evaluates to "false", then the record will be discarded and not imported.

You can refer to any record parameter using the syntax @"<fieldname>". For example, @"age" > 18 keeps only records where the numeric field "age" is above 18.

Please refer to the documentation of expressions for more information about the expression syntax.
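
In Python terms, the filter @"age" > 18 behaves roughly like the predicate below, applied to every record. How CogTL evaluates an expression when the field is missing is not stated here; discarding such records is an assumption of this sketch, and the function name is hypothetical.

```python
def keep_record(record):
    """Rough equivalent of the filter @"age" > 18."""
    age = record.get("age")
    # Assumption: a record without an "age" value does not pass the filter.
    return age is not None and age > 18


records = [
    {"name": "Ann", "age": 17},
    {"name": "Bob", "age": 42},
    {"name": "Cid"},
]
kept = [record for record in records if keep_record(record)]
```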

Modelling

Modelling is probably the most important part of a knowledge source. This is where every record is converted to a graph.

Create an entity

In edit mode (you can toggle it using the small pencil button in the upper right), click anywhere on the template to create an entity. A popover opens, asking for the entity type to create; as soon as you have selected it, you can fill in data values for this entity type.

Every data value has two options:

The "@" button makes the value refer to a field parameter. If it is not checked, the value is treated as a constant.
The "*" button means that this field is mandatory. If no value is found for it, the entity won't be created.
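
The effect of these two options can be sketched in Python as follows. Each data value is either a reference to a record field or a constant, and a missing mandatory value prevents the entity from being created. The data layout and the `build_entity` helper are hypothetical and only illustrate the rule.

```python
def build_entity(entity_type, value_specs, record):
    """Build an entity from a record, or return None if a mandatory value is missing.

    `value_specs` maps a property name to (value, is_field_ref, is_mandatory):
    is_field_ref mirrors the "@" button, is_mandatory mirrors the "*" button.
    """
    properties = {}
    for prop, (value, is_field_ref, is_mandatory) in value_specs.items():
        resolved = record.get(value) if is_field_ref else value
        if resolved is None:
            if is_mandatory:
                return None  # mandatory value missing: no entity is created
            continue
        properties[prop] = resolved
    return {"type": entity_type, "properties": properties}


specs = {
    "name": ("full_name", True, True),       # "@" and "*" checked
    "source": ("HR export", False, False),   # constant, optional
}
created = build_entity("Person", specs, {"full_name": "Alice"})
skipped = build_entity("Person", specs, {"department": "IT"})
```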

The "Options" panel also lets you define a probability for the entity (in case the knowledge source has some uncertainty). When entities are merged, their probabilities help determine which one takes priority. This panel also contains a checkbox "Ignored if not already existing". When it is checked, the entity is imported only if it already exists in the knowledge graph (typically used for a knowledge source that contains some old/dirty data).

Create a relation

Also in edit mode, you can drag and drop from one entity to another to create a relation. Here again, a popover appears, asking for the relation type to create and its probability.

Edit or delete elements

In edit mode, click on any entity or relation to open an editing popover, where you can modify the element or delete it.

Merge process of a knowledge source

Any time a knowledge source is loaded, CogTL iterates over all source records to build a knowledge graph of this particular source. After the adapters and the filter have been applied, every record is transformed to a subgraph defined by the template that you have set. Entity constraints are already applied at that point, to immediately merge similar entities.

When all data is loaded, the imported graph is compared with the existing graph of all entities and relations known by this knowledge source. If the new graph is drastically smaller than before, the loading is aborted unless safe mode is deactivated. Otherwise, the two graphs are merged using all entity constraints, and objects that have disappeared from the source are updated or deleted, depending on whether they are known by other knowledge sources.
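
The final comparison step can be sketched in Python as follows. The threshold that counts as "drastically smaller" is not documented here, so `min_ratio` is a purely hypothetical parameter. The sketch also assumes that a disappeared object is updated when other knowledge sources still know it and deleted otherwise; the function names are hypothetical as well.

```python
def should_abort_load(previous_size, new_size, safe_mode=True, min_ratio=0.5):
    """Abort when safe mode is on and the new graph shrank below the threshold."""
    if not safe_mode or previous_size == 0:
        return False
    return new_size < previous_size * min_ratio  # min_ratio is an assumed threshold


def reconcile(previous_ids, new_ids, known_elsewhere):
    """Classify objects that have disappeared from the source."""
    disappeared = previous_ids - new_ids
    return {
        # Assumption: still known by another source -> update, otherwise delete.
        "update": disappeared & known_elsewhere,
        "delete": disappeared - known_elsewhere,
    }


aborted = should_abort_load(previous_size=1000, new_size=100)
forced = should_abort_load(previous_size=1000, new_size=100, safe_mode=False)
plan = reconcile({"a", "b", "c"}, {"a"}, known_elsewhere={"b"})
```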