EntityStream Schema Instructions

How to configure EntityStream

Contents

  1. Schemas

  2. Tables

  3. Purposes

    1. Purpose Columns
    2. Purpose Column Maps
  4. Metadata Rules

  5. Indexes

  6. Dynamic Rules

    1. ER – Equivalent in Rule
    2. NR – Not Equivalent in Rule
    3. IR – Ignore in Rule
    4. TR – Translate in Rule
    5. EK – Equivalent in Key
    6. IK – Ignore in Key
    7. TK – Translate in Key

Schemas

Within the EntityStream product we use the concept of a schema definition to allow us to define a group of inter-related sources of data, such as LEI and Company or Employee and Payroll. Each schema is distinct from others and there is no sharing of information, metadata or objects between them. A schema file is defined as below:

{ "SchemaName": "GLEIF", 
  "System": "GLEIF",
  "commitSize": 1000,
  "cacheSize": 10000,
  "threadSize": 1, 
  "Rules": [...],
  "Tables": [...],
  "Indexes": [...],
  "Purposes": [...] }

This definition gives the name of the schema and its associated values: the System that the data should be tagged with, and the commitSize, cacheSize and threadSize used for processing (these options are only available for EntityStream Custodian, but are kept for compatibility). Schema definitions are supplied to EntityStream in a file called something.json, placed in the schemas directory specified by the environment variable schemaDir.

Rules, Indexes, Purposes and Tables are defined further on in this documentation.
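As a quick pre-flight check before handing files to EntityStream, the sketch below parses every .json file in the schemaDir directory and reports any top-level key from the example above that is missing. The function name load_schemas and the decision to treat all listed keys as expected are assumptions made for illustration; this is not how EntityStream itself loads schemas.

```python
import json
import os
from pathlib import Path

# Top-level keys taken from the example schema above. Treating every one of
# them as expected is an assumption made for this check.
EXPECTED_KEYS = ["SchemaName", "System", "Rules", "Tables", "Indexes", "Purposes"]


def load_schemas(schema_dir=None):
    """Parse each *.json file in the schema directory, keyed by SchemaName."""
    schema_dir = Path(schema_dir or os.environ["schemaDir"])
    schemas = {}
    for path in sorted(schema_dir.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            schema = json.load(handle)
        missing = [key for key in EXPECTED_KEYS if key not in schema]
        if missing:
            raise ValueError(f"{path.name}: missing keys {missing}")
        schemas[schema["SchemaName"]] = schema
    return schemas
```

Calling load_schemas() with no argument reads the directory from the schemaDir environment variable; passing a path is convenient for testing.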


Tables

Each source system must be defined as a table (sometimes referred to as a subject or data object). The table is a nested columnar description of a complex object or a flat file record, depending on the type of data you ask the system to match. For example, with an XML source the table would be defined as a nested structure, but with a CSV record it would be better defined as a flat record structure. There is no material difference between the metadata for each type, but for a nested structure you will note the dot notation in the column names, and the displayType of each column signifies the type of data stored in the object.

From a high level the table consists of a few key fields:

{"keyThreshold": 100,
"candidateThreshold": 60,
"isHistory": false,
"tableName": "Company",
"tableDisplayName": "Company",
"useInPaths": true,
"isInternal": true,
"icon": "images/icons/Company.png",
"columns": [ ... ]
}

For the purposes of the Mars project only tableName and columns are needed; the others are optional. However, tableDisplayName and icon are often useful if you are building a UI application on top of the Mars project.

“columns”:

Each table needs at least 1 column to be valid, and at least 2 to be useful 🙂 The minimum of 1 column is because we need as a minimum a key field to represent a primary key of the record. Without a primary key the record will not be identifiable and the EntityStream schema will refuse to accept it.

A column definition should look like this:

{
"action": "latest",
"display": "Source Key",
"isEID": true|false,
"labelPos": 0,
"colName": "SRC_PKEY",
"order": 0,
"primaryKey": true|false,
"displayType": "readonly|text|number|date|structure|list",
"displayPage": "not used in mars"
}

action: not used in mars

display: alternative display name used for the column, optional

isEID: not used in mars

labelPos: positive integer – signifies the order of output of the columns within the display label (where used)

colName: actual name by which the column should be referred to in the input and output to mars

order: positive integer – signifies the order of output of the columns within the owning table

primaryKey: is this column the primary key – true can only occur once per table. Combined keys are not suitable; please resolve them outside EntityStream.

displayType: readonly, text, number, date, structure, list – the physical type of the column. list signifies a repeating group of child items; structure is a group of child items (fields).

displayPage: not used in mars
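The column requirements above (at least one column, and exactly one primaryKey) can be checked before a schema is submitted. The following is a minimal sketch; validate_table is a hypothetical helper name, and the checks only cover what this section states.

```python
def validate_table(table):
    """Return a list of problems found in a table definition (empty if none)."""
    problems = []
    columns = table.get("columns", [])
    if not table.get("tableName"):
        problems.append("tableName is missing")
    if not columns:
        problems.append("table needs at least one column")
    # primaryKey may be true on one column only; combined keys are not handled.
    primary_keys = [c.get("colName") for c in columns if c.get("primaryKey")]
    if len(primary_keys) != 1:
        problems.append(
            f"expected exactly one primaryKey column, found {len(primary_keys)}"
        )
    return problems
```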

Note on colName:

The colName is a special attribute of the table. Its value can be simple text, or it can use dot notation such as registration.names.name.firstname and registration.names.name.lastname. For this to be successful, the metadata must also define all the upper path components, such as:

  • registration,
  • registration.names,
  • registration.names.name,

In this case registration would be a structure, registration.names a list, and registration.names.name a structure grouping registration.names.name.firstname and registration.names.name.lastname together. There are several examples in the GLEIF schema.
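The requirement that every upper path component is defined can be checked mechanically. This sketch, using a hypothetical helper missing_parent_paths, lists any parent path that a dotted colName relies on but that is not itself present in the column list.

```python
def missing_parent_paths(col_names):
    """Return the parent paths implied by dotted colNames but not defined."""
    defined = set(col_names)
    missing = set()
    for name in col_names:
        parts = name.split(".")
        # Every proper prefix of a dotted name must exist as its own column.
        for depth in range(1, len(parts)):
            parent = ".".join(parts[:depth])
            if parent not in defined:
                missing.add(parent)
    return sorted(missing)
```

For a column list containing only registration and registration.names.name.firstname, the function reports registration.names and registration.names.name as missing.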


Purposes

In the EntityStream schema we do not match using tables and columns directly, as this would involve manipulating all data sources to look the same, which isn’t really true to life. Instead, we do not insist on data normalisation before you match records from different sources. The main aim is that you can introduce new fields and structures into the matching system without having to change the match rules and indexing rules. In the full Custodian product you can even dynamically create tables and have the product guess the mappings between the new table and all the other sources of data.

Purposes are groupings of inter-related Purpose Columns. Simply put, they are a convenience for defining Indexes where more than one (purpose) column type is needed in a single index. For example (and this will be discussed later), indexing Legal Entities or Companies works better if you use both the company name and the country, to avoid collisions across countries. We would therefore define a single purpose for indexing that groups together the company name and country “purpose columns”.

In most instances where a purpose is used in a match rule, we would define the purpose with a single purpose column.

Examples of Purposes are: Company_Name_Country (2x PCs), Company_Name, Address, Registered_Company_ID, Person_Name etc

Indexes and Metadata Rules can only refer to Purposes and not Purpose Columns in their definition.

{
  "purposeName": "RegCoNum",
  "targetAlgo": 0,
  "purposeColumns": [ ... ]
}

purposeName is the key field for the purpose; remember that a purpose is simply a grouping of purposeColumns. targetAlgo should always be 0, as it is not supported by Mars.

Purpose Columns

Purpose columns are defined as a logical group of fields used in a match rule. For example, where one system has PersonName as a single field and another system has FirstName, MiddleName and LastName, we would define a purpose column (conceptual column) called Person_Name. In the first system it groups only one field; in the other it groups all three fields, used in order, to make up the person’s name.

{
"column": "RegCoNum",
"gradientType": 0,
"mandatory": false,
"matchClass": "MatchString",
"maxWidth": 3,
"minWidth": 1,
"purposeName": "RegCoNum",
"purposeColumnMaps": [ ... ]
}

matchClass – MatchString, MatchCompanyName, MatchPersonName, etc see reference here

mandatory – is this purposeColumn mandatory in an index it is used in.

gradientType – 0,1,2,3 represents linear, leftHigh, rightHigh, middleHigh, the character weighting used in scoring

minWidth – minimum number of tokens to be used in the generation of an index key

maxWidth – maximum number of tokens used in index key.
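To get a feel for how minWidth and maxWidth affect the number of keys, the sketch below builds one candidate key per contiguous run of tokens whose length lies between the two widths. This is an assumed interpretation for illustration only; the actual key generation in EntityStream is not described here, and candidate_keys is a hypothetical name.

```python
def candidate_keys(value, min_width, max_width):
    """List contiguous token runs of each width from min_width to max_width."""
    tokens = value.upper().split()
    keys = []
    for width in range(min_width, max_width + 1):
        # Slide a window of `width` tokens across the value.
        for start in range(len(tokens) - width + 1):
            keys.append(" ".join(tokens[start:start + width]))
    return keys
```

Under this assumption, "Acme Trading Company" yields six keys with minWidth 1 and maxWidth 3, but only two keys with both widths set to 2, which shows why a wide range inflates the key count.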

Purpose Column Maps

Purpose Column Maps are a somewhat inconvenient convenience: they provide the mapping between the physical table columns and the purpose columns. This can be an inconvenient exercise, because the step is needed before we can match on any field in the system. They are convenient because, once the mappings are done, the rules can be changed and the change will affect all the mapped tables of source data.

{
   "columnOrder": 1,
   "purposeColumn": "RegCoNum",
   "purposeName": "RegCoNum",
   "tableColumn": "Entity.RegistrationAuthority.RegistrationAuthorityEntityID",
   "tableName": "GLEIF"
},{
   "columnOrder": 1,
   "purposeColumn": "RegCoNum",
   "purposeName": "RegCoNum",
   "tableColumn": "RegCoNum",
   "tableName": "Company"
}

Mappings above represent pairs of table/tableColumns with purpose/purposeColumn in a particular order. Order is relevant where more than one tableColumn maps to a purposeColumn, ie Address purposeColumn should be made up from tableColumns addr1, addr2, city, postcode in this order.
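The role of columnOrder can be illustrated by assembling a purpose column value from a record. The sketch below assumes the record is a flat dictionary keyed by tableColumn name and that the parts are joined with a single space; both are assumptions for illustration, and purpose_value is a hypothetical helper.

```python
def purpose_value(record, table_name, purpose_name, purpose_column, maps):
    """Join the mapped table columns of one record in columnOrder sequence."""
    relevant = [
        m for m in maps
        if m["tableName"] == table_name
        and m["purposeName"] == purpose_name
        and m["purposeColumn"] == purpose_column
    ]
    relevant.sort(key=lambda m: m["columnOrder"])
    # Skip columns that are absent or empty in this record.
    parts = [str(record[m["tableColumn"]]) for m in relevant
             if record.get(m["tableColumn"])]
    return " ".join(parts)
```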


Metadata Rules

Metadata Rules (Rules) are a group of purposes with additional attributes applied to enable them to score between two records:

{
"action": "EID",
"actionText": "",
"active": true,
"highScore": 95.0,
"lowScore": 80.0,
"matchSameSystem": true,
"order": 0,
"rulePurpose": [ ... ],
"systemMatchType": 0
}

Where:

  • highScore – is the indicator of the percentage considered to be a good match that can be accepted without manual intervention.
  • lowScore – is the indication of the percentage where a match is possible but the user may need to review it.
  • matchSameSystem – true|false, controls whether two records from the same System may be matched with each other.
  • order – the order of precedence to process each rule, the rules must be numbered sequentially from 0 upwards.
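The two thresholds split scores into three bands. A minimal sketch of that banding follows; the outcome labels and the inclusive comparisons at the boundaries are assumptions, not documented behaviour.

```python
def classify(score, rule):
    """Band a percentage score using the rule's highScore and lowScore."""
    if score >= rule["highScore"]:
        return "accept"   # good match, no manual intervention
    if score >= rule["lowScore"]:
        return "review"   # possible match, user may need to review
    return "reject"
```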

In order for a rule to work it needs one or more rulePurpose elements:

{
"acceptWeight": 1.0,
"mandatory": true,
"negate": false,
"purposeName": "Company_Search",
"rejectWeight": 1.0
},
{
"acceptWeight": 1.0,
"mandatory": true,
"negate": false,
"purposeName": "RegCoNum",
"rejectWeight": 1.0
}

purposeName is the purpose defined in the metadata earlier, and negate will invert the rule, so a good match can be a bad match!

Weightings (acceptWeight, rejectWeight and mandatory)

Weighting ratios are the key to scoring with multiple rulePurposes, the general rule is this:

When there is a positive effect from the rulePurpose (i.e. some similarity is detected), its share of the score is weighted according to the acceptWeight of that rulePurpose as a proportion of the total of all mandatory acceptWeights. If there is no match for the rulePurpose, the rejectWeight is used instead. Generally the acceptWeight and the rejectWeight can be the same (rejectWeight defaults to acceptWeight if you leave it out). If a rulePurpose is very important and the lack of a match on it should count heavily, the rejectWeight can be increased. Similarly it can be reduced, so that a rulePurpose only contributes to a match and does not prevent one; mandatory=false will also help you here.
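One possible reading of the weighting description is a weighted average, sketched below. The exact formula used by EntityStream is not given in this document, so treat this as an assumption: a rulePurpose with some similarity is weighted by acceptWeight, a mandatory one with no similarity by rejectWeight, and a non-mandatory one with no similarity is skipped. The function name combined_score and the similarities input are hypothetical.

```python
def combined_score(rule_purposes, similarities):
    """Weighted average of per-purpose similarity percentages (assumed formula)."""
    weighted_sum = 0.0
    total_weight = 0.0
    for rp in rule_purposes:
        similarity = similarities.get(rp["purposeName"], 0.0)
        accept = rp["acceptWeight"]
        if similarity > 0:
            weight = accept
        elif rp.get("mandatory", True):
            # rejectWeight falls back to acceptWeight when omitted.
            weight = rp.get("rejectWeight", accept)
        else:
            continue  # optional purpose with no match does not count against
        weighted_sum += weight * similarity
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0
```

Under this reading, raising rejectWeight on a purpose that fails to match pulls the combined score down more sharply, while lowering it lets the other purposes dominate.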


Indexes

Indexes are used to create a key that can be used to collide records with similar profiles. Keys are not unique to each record, and many records will share the same keys. Index key generation is central to reducing the number of records that need to be compared, and makes the speed at which you can match far more predictable than comparing all records with each other.

Often this technique is referred to as Pigeon-Holing or bucketing – the key generated tells you how to group records to compare, then you only compare records in the same bucket or pigeon-hole.
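The bucketing idea can be sketched in a few lines: records are grouped by key, and only records sharing a bucket are paired for comparison. The key used here (first name token plus country) is a made-up stand-in for a real index key, and the record fields id, name and country are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def bucket_key(record):
    """Hypothetical key: first token of the name plus the country."""
    tokens = record["name"].upper().split()
    first = tokens[0] if tokens else ""
    return f"{first}|{record['country'].upper()}"


def candidate_pairs(records):
    """Pair up record ids that fall into the same bucket."""
    buckets = defaultdict(list)
    for record in records:
        buckets[bucket_key(record)].append(record["id"])
    pairs = []
    for ids in buckets.values():
        pairs.extend(combinations(ids, 2))
    return pairs
```

With four records of which only two share a key, this yields one pair to compare rather than the six pairs an all-against-all comparison would need.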

{
  "exactIndex": false,
  "indexName": "CompanyNameIdx",
  "instance": 0,
  "keyThreshold": 1000,
  "match": false,
  "purposeName": "Company_Search",
  "search": true
}

where:

exactIndex – true|false, is this index an exact index type. You will need to configure a Metadata Rule with the same Purpose to accept matches where this index produces a suggested match key for two similar records. Be careful in the purposeColumn definitions that you don’t introduce too much fuzziness into the matching, or you will end up auto-matching records that are only similar and not the same.

indexName – the textual name of the index. It should be a short string without special characters. It is returned with the index key definitions so you can understand which index produced the key, and it can also be used with the index method in the REST interface, as an optional indexName qualifier to the POST call.

instance – you could have multiple indexes with the same name, and this will make them unique. It is old functionality and not recommended.

keyThreshold – this is not used by Mars. In the Custodian product it sets the limit on the number of keys produced when loading data into an index, which stops the system from producing a key too many times.

match – true|false, is this index useful for matching, if set the index will generate keys when the index method is called specifying that an index key is required. Be careful with match indexes that the purposeColumn has an appropriate setting for minWidth and maxWidth – it is recommended that the minWidth should not be too low and maxWidth not too high or else too many keys would be generated.

search – true|false, is this index useful for searching

purposeName – links to the purpose definition


Dynamic Rules

Dynamic rules (not the Metadata Rules defined in the metadata section) are defined in the “rules” section of the schema. Mostly these are supplied to EntityStream in a file called “mapset.custom” under the schemas directory specified by the environment variable schemaDir. Metadata Rules, as discussed, are useful for scoring record similarities; dynamic rules manipulate that scoring.

There are 7 classifications of dynamic rules, described here: 4 for Metadata Rules and 3 for indexing.

ER – Equivalent in Rule

ER dynamic rules persuade the matching process that two (or more) terms are functionally equivalent. They are often used to signify where the matching algorithm should consider the terms to have the same meaning.

{
"rulePurpose": "MatchCompanyName",
"type": "ER",
"parent": "",
"items": [
"ACCOUNTING",
"ACCOUNTANCY",
"ACCOUNTANT",
"ACCOUNTANTS",
"ACCT"
]
}

In this example the rulePurpose is MatchCompanyName. Please excuse the incorrect property name: the value is actually one of the matchClass types defined here.

items is a list of equivalent terms – please use UPPERCASE as the system will process all matching comparisons in uppercase and this aids in the process.

parent is not used in ER rules and should be included as an empty string.
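The effect of an ER rule can be pictured as a lookup from each term to its equivalence group. The sketch below is illustrative only and does not reflect how the matching engine stores rules; build_equivalence_index and equivalent are hypothetical names.

```python
def build_equivalence_index(rules, match_class):
    """Map each upper-cased ER term to the position of the rule that lists it."""
    index = {}
    for group_id, rule in enumerate(rules):
        if rule["type"] == "ER" and rule["rulePurpose"] == match_class:
            for item in rule["items"]:
                index[item.upper()] = group_id
    return index


def equivalent(term_a, term_b, index):
    """True if the terms are identical or listed in the same ER rule."""
    a, b = term_a.upper(), term_b.upper()
    if a == b:
        return True
    group = index.get(a)
    return group is not None and group == index.get(b)
```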

NR – Not Equivalent in Rule

NR dynamic rules are very important in preventing false positives in the match system. Often you will see similar names such as “Robert Haynes 1st” and “Robert Haynes II”; clearly, from a human understanding of these two person names, they are not the same person, and in fact they are most likely father and son. Matching rules often fail to notice the differences, which can lead to false matching. Consider also a more subtle example, “Haynes LLP” and “Haynes LLC”: I think we can agree that these are very similar company names and could cause a positive match, but in reality they are different legal forms and there is absolutely no case for considering them as a match. They are in fact 0% similar. NR rules are used to facilitate this.

Please note some more common rules are already pre-built into the match system ie Mr & Mrs and there is no need to manipulate the system in these more obvious cases.

{
  "rulePurpose": "MatchCompanyName",
  "type": "NR",
  "parent": "",
  "items": [
    "LTD",
    "LLP",
    "PLC",
    "INC",
    "LLC",
    "SA",
    "BV",
    "CORP"
  ]
}

items is a list of non-equivalent terms – please use UPPERCASE as the system will process all matching comparisons in uppercase and this aids in the process.

parent is not used in NR rules and should be included as an empty string.

Scoring note: when two records are compared and only one of them has an NR item in a value for a given dynamic rule, they can still be considered a match. If neither has an NR item, they are likewise still considered. However, if both have NR items from the same dynamic rule and the item values are not the same, the match will be discarded.
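The scoring note can be expressed as a small predicate. This sketch assumes values are split into whitespace-separated tokens and compared in uppercase; nr_discards is a hypothetical helper and not part of the product.

```python
def nr_discards(value_a, value_b, nr_items):
    """True when both values contain NR items from the rule and they differ."""
    items = {item.upper() for item in nr_items}
    found_a = items & set(value_a.upper().split())
    found_b = items & set(value_b.upper().split())
    # Only discard when BOTH sides carry an NR item and the items differ.
    return bool(found_a) and bool(found_b) and found_a != found_b
```

With the legal-form list above, "Haynes LLP" against "Haynes LLC" is discarded, while "Haynes LLP" against plain "Haynes" is still allowed through to scoring.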

IR – Ignore in Rule

Ignore in Rule is far more subtle than NR: it will not reject potential matches, but it suggests to the match algorithm that a certain term is not important and is likely a noise word. Please do not confuse the concept of noise words with equivalencies in matching. People often use IR as a way to ignore commonly abbreviated words such as LTD and Limited; this should not be done with an IR rule, as it would cause the system to ignore key tokens in the value being prepared. Use an ER or TR rule instead.

{
  "rulePurpose": "MatchCompanyName",
  "type": "IR",
  "parent": "",
  "items": [
    "MR", "MRS", "MS", "DR"
  ]
}

TR – Translate in Rule

Translate in Rule is used to replace values in the record with other values. It can be used to replace commonly misspelt terms that should be standardised, such as England -> United Kingdom. Unlike the ER rule, where the values are considered to be the same for match comparison, with TR the values are actually replaced in the record with the parent (or stem).

{
  "rulePurpose" : "MatchCountry",
  "type" : "TR",
  "parent" : "Algeria",
  "items" : [ 
    "Algeria", 
    "DZ", 
    "DZA", 
    "012"
  ]
}

items is the list of values to search for; parent is the value to replace them with.
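A TR rule behaves like a lookup that rewrites a value to its parent. The sketch below assumes the whole field value is compared, case-insensitively, against the items; whether the product matches whole values or individual tokens is not stated here, and translate is a hypothetical name.

```python
def translate(value, rule):
    """Return the rule's parent if the value matches any item, else the value."""
    wanted = {item.upper() for item in rule["items"]}
    return rule["parent"] if value.strip().upper() in wanted else value
```

Given the Algeria rule above, both "dz" and "DZA" would come back as "Algeria", while an unlisted value such as "France" is returned unchanged.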

EK – Equivalent in Key

EK dynamic rules persuade the indexing process that two (or more) terms are functionally equivalent. They are often used to signify where the indexing algorithm should consider the terms to have the same meaning. This will generally result in multiple output index keys for the record.

{
"rulePurpose": "MatchCompanyName",
"type": "EK",
"parent": "",
  "items": [
    "LTD",
    "LIMITED"
  ]
}

In this example the rulePurpose is MatchCompanyName. Please excuse the incorrect property name: the value is actually one of the matchClass types defined here.

items is a list of equivalent terms – please use UPPERCASE as the system will process indexing comparisons in uppercase and this aids in the process.

parent is not used in EK rules and should be included as an empty string.

IK – Ignore in Key

Ignore in Key will suggest to the index algorithm that a certain term is not important and likely a noise word.

{
  "rulePurpose": "MatchCompanyName",
  "type": "IK",
  "parent": "",
  "items": [
    "GROUP"
  ]
}

TK – Translate in Key

Translate in Key is used to replace values in the record with other values. It can be used to replace commonly misspelt terms that should be standardised, such as England -> United Kingdom. Unlike the EK rule, where the system would generate multiple keys for each record, with TK the term is replaced with the parent value before the key is generated.

{
  "rulePurpose" : "MatchCountry",
  "type" : "TK",
  "parent" : "Algeria",
  "items" : [ 
    "Algeria", 
    "DZ", 
    "DZA", 
    "012"
  ]
}

items is the list of values to search for; parent is the value to replace them with.
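The difference between EK and TK can be seen by comparing what each does to a value before a key is built. In this sketch, EK expands a value into one variant per equivalent term, while TK rewrites the term to its parent so only one variant remains. The token-level handling and the function names expand_ek and apply_tk are assumptions for illustration.

```python
def expand_ek(value, ek_items):
    """EK: produce one variant of the value per equivalent term."""
    items = [item.upper() for item in ek_items]
    variants = [[]]
    for token in value.upper().split():
        options = items if token in items else [token]
        variants = [v + [option] for v in variants for option in options]
    return [" ".join(v) for v in variants]


def apply_tk(value, parent, tk_items):
    """TK: replace any listed term with the parent, giving a single variant."""
    items = {item.upper() for item in tk_items}
    return " ".join(
        parent.upper() if token in items else token
        for token in value.upper().split()
    )
```

With the LTD/LIMITED rule, expand_ek("Acme Ltd", ["LTD", "LIMITED"]) gives two variants, "ACME LTD" and "ACME LIMITED", whereas apply_tk always gives exactly one.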

Grouping rules together

Rules are defined in groups in the “rules” section of the schema, all MatchCompanyName rules should be grouped in a list:

rules: {MatchCompanyName: [ ... ], MatchPersonName: [ ... ]}
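If dynamic rules are maintained as a flat list, they can be grouped into the shape shown above with a few lines of code. This is a sketch only; group_rules is a hypothetical helper, and the output key "rules" simply mirrors the line above.

```python
from collections import defaultdict


def group_rules(rules):
    """Group a flat list of dynamic rules by their rulePurpose value."""
    grouped = defaultdict(list)
    for rule in rules:
        grouped[rule["rulePurpose"]].append(rule)
    return {"rules": dict(grouped)}
```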