Concept Groups (AKA Purposes)
Custodian does not match on tables and columns directly, since that would require either reshaping every data source to look the same or defining many match rules. Nor does it insist that data be normalized before records from different sources are matched. The aim is that you can introduce new fields and structures into the matching system without having to change the match and indexing rules.
Concept Groups are groupings of inter-related Concepts (AKA Purpose Columns). They are a convenience for defining Indexes and Match Rules, removing the need to define a match rule for each table of data in the system.
A concept group is loosely defined as a single conceptual element of the data, such as Address, and can usually be broken down into Concepts such as Address Lines and PostCode:
Note: Items below the dotted line signify the table and column mappings to the conceptual layer.
Concept Group Properties:
|Inheritance Algorithm:||Dont, Latest Value, Any Value, Longest Value||What is the Inheritance Type for the Concept Group. Not currently in use.|
|Purpose Type:||Matching Type, Payload Type, Report Uniques, Report Concepts||What the Concept Group is used for, i.e. matching, a payload attribute, or reporting (Dashboard).|
Normally Concepts are defined as a logical group of fields used in a match rule. For example, PersonName would have physical columns from each table mapped to it, i.e. FirstName, MiddleName and LastName. In some tables the person name may be represented differently, e.g. as one or two combined fields; in that case the mapping to the PersonName concept reflects this.
However, for the purposes of indexing data it is often useful to define a combined Concept Group in which two different conceptual elements are grouped together and given a suitable name, such as PersonMatching (concept group), which would contain two concepts such as PersonName and PostCode. Each Concept is then defined as a matchable element and assigned a data purpose such as “Fuzzy String”. Having two concepts within a concept group like this means that you can define a more granular index and avoid comparing too many similar records together.
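To illustrate why a two-concept group gives a more granular index, the following is a minimal sketch and not Custodian's implementation. It assumes, purely for illustration, that an index groups records by exact equality on the chosen concept values (a real index would use fuzzy keys). The record layout and the `candidate_pairs` helper are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records already projected onto concepts.
records = [
    {"id": 1, "PersonName": "SMITH", "PostCode": "SW1A"},
    {"id": 2, "PersonName": "SMITH", "PostCode": "SW1A"},
    {"id": 3, "PersonName": "SMITH", "PostCode": "M1"},
    {"id": 4, "PersonName": "SMITH", "PostCode": "EH1"},
]

def candidate_pairs(records, concepts):
    """Group records by the values of the given concepts and
    return every pair of record ids that share a group."""
    buckets = defaultdict(list)
    for rec in records:
        key = tuple(rec[c] for c in concepts)
        buckets[key].append(rec["id"])
    pairs = []
    for ids in buckets.values():
        pairs.extend(combinations(ids, 2))
    return pairs

# Indexing on one concept compares every SMITH with every other SMITH.
name_only = candidate_pairs(records, ["PersonName"])
# Indexing on both concepts only compares records that also share a PostCode.
name_and_postcode = candidate_pairs(records, ["PersonName", "PostCode"])

print(len(name_only), len(name_and_postcode))
```

With these four records the single-concept index produces six candidate comparisons, while the combined index produces one.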
|Data Type||Fuzzy String, Person Name, Chinese First Name, Stock Ticker, Color, Postal Code, Address, Person Age, Fuzzy Date, Person Name Using Syllable Coding, Simple String, Age Range, End Address, Formula, Country, UK Police Beat Code, Code, Company Name, Height, Sorted Fuzzy String, Text, Email Address, Column To Concept, Phonetic String|
|Match Variation Type||Linear Transform, Both Ends are More Valuable, Address Type, Left End is Higher Value, Right End is Higher Value, Both Ends are Lower Value|
|Minimum Key Width (Optional)||Numeric Value, minimum number of tokens to be used in the generation of an index key|
|Maximum Key Width (Optional)||Numeric Value, maximum number of tokens used in index key.|
|Mandatory Key Component||Numeric Value, whether this Concept (purposeColumn) is mandatory in any index that uses it.|
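One possible reading of the key width and mandatory settings is sketched below. This is an assumption-laden illustration, not Custodian's key generation: it assumes tokens are whitespace-separated, that a concept with fewer tokens than the minimum width contributes nothing, and that a missing mandatory component means no key is produced at all. The `key_part` and `index_key` functions are hypothetical names.

```python
def key_part(value, min_width=1, max_width=2):
    """Build a key fragment from at most max_width tokens.
    Returns None when the value has fewer than min_width tokens."""
    tokens = value.upper().split()
    if len(tokens) < min_width:
        return None
    return "".join(tokens[:max_width])

def index_key(record, components):
    """components is a list of (concept, min_width, max_width, mandatory).
    Assumed rule: a missing mandatory component suppresses the whole key."""
    parts = []
    for concept, min_w, max_w, mandatory in components:
        part = key_part(record.get(concept, ""), min_w, max_w)
        if part is None:
            if mandatory:
                return None
            continue
        parts.append(part)
    return "|".join(parts) if parts else None

components = [
    ("PersonName", 1, 2, True),   # up to two name tokens, mandatory
    ("PostCode", 1, 1, False),    # first postcode token only, optional
]

full = index_key({"PersonName": "Ann Mary Lee", "PostCode": "sw1a 1aa"}, components)
no_postcode = index_key({"PersonName": "Ann Mary Lee"}, components)
no_name = index_key({"PostCode": "sw1a 1aa"}, components)
print(full, no_postcode, no_name)
```

Under these assumptions a narrower maximum width gives shorter, more inclusive keys, and a wider one gives longer, more selective keys.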
Concept maps provide a mapping between the physical table columns and the concepts. This can be an inconvenient exercise, as the step is needed before we can match on any field in the system. However, once the mappings are done, the rules can be changed in one place and the change affects all the mapped tables of source data.
The mappings above pair table columns with Concepts in a particular order. The order matters where more than one table column maps to a Concept.
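The idea of an ordered mapping from table columns to Concepts can be sketched as follows. The table names, column names and the `to_concepts` helper are hypothetical, and joining mapped columns with a space is an assumption made for the example rather than documented Custodian behaviour.

```python
from collections import defaultdict

# Hypothetical mapping entries: (table, column, concept, order).
CONCEPT_MAP = [
    ("customers", "FirstName", "PersonName", 1),
    ("customers", "MiddleName", "PersonName", 2),
    ("customers", "LastName", "PersonName", 3),
    ("customers", "Zip", "PostCode", 1),
    ("leads", "FullName", "PersonName", 1),
    ("leads", "PostalCode", "PostCode", 1),
]

def to_concepts(table, row):
    """Project a physical row onto concepts, joining the mapped
    columns in mapping order and skipping empty values."""
    parts = defaultdict(list)
    for tbl, column, concept, _order in sorted(CONCEPT_MAP, key=lambda m: m[3]):
        if tbl == table and row.get(column):
            parts[concept].append(row[column])
    return {concept: " ".join(values) for concept, values in parts.items()}

customer = to_concepts(
    "customers",
    {"FirstName": "Ann", "MiddleName": "", "LastName": "Lee", "Zip": "90210"},
)
lead = to_concepts("leads", {"FullName": "Ann Lee", "PostalCode": "90210"})
print(customer == lead)
```

Two differently shaped tables end up with the same conceptual record, so a single rule written against PersonName and PostCode applies to both.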
Concept Data Types
All concepts have a functional Data Type. Much as a table specifies whether a column is text, numeric, list and so on, the Data Type defines the actual purpose of the concept. The types are defined below.
|Fuzzy String||Fuzzy String Matching|
|Person Name||Fuzzy Person Name Matching|
|Chinese First Name||Matching Chinese Names using Transliteration|
|Stock Ticker||US Stock Ticker|
|Color||Person Hair Color|
|Postal Code||UK, US, Australian PostCodes|
|Address||Address excluding country|
|Person Age||Person Age Matching|
|Fuzzy Date||Fuzzy Date Matching|
|Person Name Using Syllable Coding||Person Name Fuzzy matching using Specialised English Syllablization|
|Simple String||Simple String Matching|
|Age Range||Person Age ranges|
|End Address||Last Part of the Address, i.e. PostCode, Country|
|UK Police Beat Code||UK Police Beat Code|
|Company Name||Fuzzy Company name|
|Height||Dimensions for Person Height|
|Sorted Fuzzy String||Fuzzy String using Alphabetic Order|
|Text||Long Text Matching|
|Email Address||Email Address Matching|
|Column To Concept||Maps the value to a Concept|
|Phonetic String||Fuzzy Phonetic String matching (NYSIIS)|
Match Variations enable the user to alter how the similarity algorithm weights different parts of a value, which is useful for tuning the match process. Regional variations in demographic data often make these settings especially useful. For example, some nationalities always put the family name first and some put the given name first; this setting enables the matching to deal with such variations better.
|Linear Transform||Use all parts of the value in equal proportion|
|Both Ends are More Valuable||Tokens at the start and end of the value have more significance|
|Address Type||Treat the value as an address, where numerics and street names matter more|
|Left End is Higher Value||The start of the string has more significance|
|Right End is Higher Value||The end of the string has more significance|
|Both Ends are Lower Value||The center of the value has more significance|
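The effect of the position-based variations in the table above can be sketched with simple token weights. This is an illustrative assumption only: the actual weighting curves used by Custodian are not described here, the linear ramps below are invented for the example, and Address Type is omitted because it depends on token content rather than position. The `position_weights` and `weighted_similarity` functions are hypothetical.

```python
def position_weights(n, variation):
    """Return n weights (summing to 1) for token positions, left to right."""
    if n <= 1:
        return [1.0] * n
    raw = []
    for i in range(n):
        pos = i / (n - 1)            # 0.0 at the left end, 1.0 at the right end
        edge = abs(pos - 0.5) * 2    # 1.0 at either end, 0.0 at the centre
        if variation == "Linear Transform":
            w = 1.0
        elif variation == "Left End is Higher Value":
            w = 2.0 - pos
        elif variation == "Right End is Higher Value":
            w = 1.0 + pos
        elif variation == "Both Ends are More Valuable":
            w = 1.0 + edge
        elif variation == "Both Ends are Lower Value":
            w = 2.0 - edge
        else:
            raise ValueError(f"unsupported variation: {variation}")
        raw.append(w)
    total = sum(raw)
    return [w / total for w in raw]

def weighted_similarity(a, b, variation):
    """Sum the weights of positions where both values have the same token."""
    ta, tb = a.lower().split(), b.lower().split()
    n = max(len(ta), len(tb))
    if n == 0:
        return 0.0
    weights = position_weights(n, variation)
    return sum(w for w, x, y in zip(weights, ta, tb) if x == y)

# The family name matches but the given name does not.
linear = weighted_similarity("John Smith", "Jon Smith", "Linear Transform")
left = weighted_similarity("John Smith", "Jon Smith", "Left End is Higher Value")
right = weighted_similarity("John Smith", "Jon Smith", "Right End is Higher Value")
print(linear, left, right)
```

In this sketch a pair that agrees only on the last token scores highest under Right End is Higher Value and lowest under Left End is Higher Value, which is the kind of tuning the setting is meant to allow for name-order conventions.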