Modeling Concepts

Concept Groups (AKA Purposes)

Custodian does not match directly on tables and columns, because that would require either manipulating all data sources to look the same or defining many match rules. Instead, Custodian does not insist on data normalization before you match records from different sources. The main aim is that you can introduce new fields and structures into the matching system without having to change the match and indexing rules.

Concept Groups are groupings of inter-related Concepts (AKA Purpose Columns). They are a convenience for defining Indexes and Match Rules, and they remove the need to define a match rule for each table of data in the system.

A Concept Group is loosely defined as a single conceptual element of the data, such as Address, and can often be broken down into Concepts such as Address Lines and PostCode:

Figure: Example of the Address Concept Group, mapping Address Lines to two tables.
Figure: Mapping of PostCode to two tables.

Note: In the diagrams, items below the dotted line represent the table and column mappings to the conceptual layer.
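The relationship between a Concept Group, its Concepts and the physical columns mapped to them can be pictured as a small data model. The Python sketch below is illustrative only: the class names, the table names (customers, suppliers) and the column names are assumptions made for this example and are not Custodian's actual API or schema.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A matchable element, e.g. Address Lines or PostCode."""
    name: str
    # (table, column) pairs mapped to this concept, in mapping order.
    mappings: list[tuple[str, str]] = field(default_factory=list)


@dataclass
class ConceptGroup:
    """A single conceptual element of the data, e.g. Address."""
    name: str
    concepts: list[Concept] = field(default_factory=list)

    def columns_for(self, table: str) -> dict[str, list[str]]:
        """Return, per concept, the columns of `table` mapped to it."""
        return {
            c.name: [col for tbl, col in c.mappings if tbl == table]
            for c in self.concepts
        }


# Hypothetical tables "customers" and "suppliers" with differently named columns.
address = ConceptGroup(
    name="Address",
    concepts=[
        Concept("Address Lines", [
            ("customers", "addr1"), ("customers", "addr2"),
            ("suppliers", "street"),
        ]),
        Concept("PostCode", [
            ("customers", "postcode"),
            ("suppliers", "zip"),
        ]),
    ],
)

print(address.columns_for("customers"))
print(address.columns_for("suppliers"))
```

In this sketch a rule written against the Address group never names a table: adding a third source only means adding more (table, column) pairs to the existing Concepts.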

Concept Group Properties:

Inheritance Algorithm
    Possible Values: Dont, Latest Value, Any Value, Longest Value
    Usage: The Inheritance Type for the Concept Group. Not currently in use.
Purpose Type
    Possible Values: Matching Type, Payload Type, Report Uniques, Report Concepts
    Usage: What the Concept Group is used for, i.e. whether it is for matching, is a payload attribute, or is used for reporting (Dashboard).

Concepts

Normally a Concept is defined as a logical group of fields used in a match rule. For example, PersonName would have physical columns from each table mapped to it, i.e. FirstName, MiddleName and LastName. In some tables the person name may be represented differently, i.e. as two fields or as one combined field; in that case the mapping to the PersonName Concept reflects that representation.

Indexing Considerations

For the purposes of indexing data, however, it is often useful to define a combined Concept Group in which two different conceptual elements are grouped together under a suitable name. An example is PersonMatching (Concept Group), which would contain two Concepts such as PersonName and PostCode. Each Concept is defined as a matchable element and is assigned a data purpose such as “Fuzzy String”. Having two Concepts within one Concept Group like this means that you can define a more granular index and avoid comparing too many similar records with each other.
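The effect of a more granular index can be sketched by grouping records by key and comparing only records that share a key. The example below is a simplified illustration, not Custodian's key generation: the record layout, the key functions `name_key` and `name_postcode_key`, and the sample data are all assumptions.

```python
from collections import defaultdict

# Hypothetical records already projected onto the two concepts.
records = [
    {"id": 1, "PersonName": "John Smith", "PostCode": "AB1 2CD"},
    {"id": 2, "PersonName": "Jon Smith", "PostCode": "AB1 2CD"},
    {"id": 3, "PersonName": "Jane Smith", "PostCode": "ZZ9 9ZZ"},
    {"id": 4, "PersonName": "Mary Jones", "PostCode": "AB1 2CD"},
]


def name_key(rec):
    # Assumed key: last token of the name, lower-cased.
    return rec["PersonName"].split()[-1].lower()


def name_postcode_key(rec):
    # Assumed combined key: name key plus the postcode without spaces.
    return name_key(rec) + "|" + rec["PostCode"].replace(" ", "").lower()


def build_index(recs, key_fn):
    """Group record ids by index key; only records sharing a key are compared."""
    index = defaultdict(list)
    for rec in recs:
        index[key_fn(rec)].append(rec["id"])
    return dict(index)


by_name = build_index(records, name_key)
by_both = build_index(records, name_postcode_key)

print(by_name)
print(by_both)
```

With the name-only key, all three Smith records fall into one group; with the combined key, the Smith in a different postcode is separated, so fewer candidate pairs need a full comparison.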

Concept Options:

Data Type: Fuzzy String, Person Name, Chinese First Name, Stock Ticker, Color, Postal Code, Address, Person Age, Fuzzy Date, Person Name Using Syllable Coding, Simple String, Age Range, End Address, Formula, Country, UK Police Beat Code, Code, Company Name, Height, Sorted Fuzzy String, Text, Email Address, Column To Concept, Phonetic String
Match Variation Type: Linear Transform, Both Ends are More Valuable, Address Type, Left End is Higher Value, Right End is Higher Value, Both Ends are Lower Value
Minimum Key Width (Optional): Numeric value. The minimum number of tokens to be used in the generation of an index key.
Maximum Key Width (Optional): Numeric value. The maximum number of tokens used in an index key.
Mandatory Key Component: Numeric value. Whether this purposeColumn is mandatory in an index it is used in.

Concept Maps

Concept Maps provide a mapping between the physical table columns and the Concepts. This can be an inconvenient exercise, because the step is needed before we can match on any field in the system. However, once the mappings are done, the rules can be changed and the change will affect all the mapped tables of source data.

Mappings above represent pairs of table/columns with Concepts in a particular order. Order is relevant where more than one table column maps to a Concept.
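To illustrate why order matters when several columns map to one Concept, the sketch below projects rows from two differently shaped tables onto the same Concepts. It is a minimal example under stated assumptions: the map is represented as a list of (table, column, concept) triples, multiple column values are joined with a single space, and all table and column names are hypothetical.

```python
# Ordered (table, column, concept) triples; all names are hypothetical.
CONCEPT_MAP = [
    ("customers", "first_name", "PersonName"),
    ("customers", "middle_name", "PersonName"),
    ("customers", "last_name", "PersonName"),
    ("customers", "postcode", "PostCode"),
    ("contacts", "full_name", "PersonName"),
    ("contacts", "zip", "PostCode"),
]


def to_concepts(table, row):
    """Project a physical row onto concepts, keeping the mapping order."""
    parts = {}
    for tbl, column, concept in CONCEPT_MAP:
        if tbl != table:
            continue
        value = row.get(column)
        if value:  # skip missing or empty columns
            parts.setdefault(concept, []).append(str(value).strip())
    # Assumption: multiple columns are joined with a single space.
    return {concept: " ".join(values) for concept, values in parts.items()}


a = to_concepts("customers", {"first_name": "Ada", "middle_name": "",
                              "last_name": "Lovelace", "postcode": "W1 1AA"})
b = to_concepts("contacts", {"full_name": "Ada Lovelace", "zip": "W1 1AA"})

print(a)
print(b)
```

Both rows end up with the same conceptual shape, so a single rule written against PersonName and PostCode can apply to either table. Swapping the order of the first_name and last_name entries in the map would change the combined PersonName value for the customers table.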

Concept Data Types

All Concepts have a functional Data Type. Much as a table specifies the type of a column as text, numeric, list, etc., the Data Type defines the actual purpose of the Concept. The Data Types are defined below.

Fuzzy String: Fuzzy String Matching
Person Name: Fuzzy Person Name Matching
Chinese First Name: Matching Chinese Names using Transliteration
Stock Ticker: US Stock Ticker
Color: Person Hair Color
Postal Code: UK, US, Australian PostCodes
Address: Address excluding country
Person Age: Person Age Matching
Fuzzy Date: Fuzzy Date Matching
Person Name Using Syllable Coding: Person Name Fuzzy Matching using Specialised English Syllabification
Simple String: Simple String Matching
Age Range: Person Age ranges
End Address: Last Part of the Address, i.e. PostCode, Country
Formula: Formula used to generate column values using JavaScript
Country: Country
UK Police Beat Code: UK Police Beat Code
Code: Code matching
Company Name: Fuzzy Company name
Height: Dimensions for Person Height
Sorted Fuzzy String: Fuzzy String using Alphabetic Order
Text: Long Text Matching
Email Address: Email Address Matching
Column To Concept: Maps the value to a Concept
Phonetic String: Fuzzy Phonetic String matching (NYSIIS)
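As an illustration of how two of these Data Types might differ, the sketch below contrasts a plain fuzzy comparison with one that sorts tokens alphabetically first. This is one possible reading of "Fuzzy String using Alphabetic Order", not a description of Custodian's algorithm. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure, and the function names `fuzzy` and `sorted_fuzzy` are invented for this example.

```python
from difflib import SequenceMatcher


def fuzzy(a, b):
    """Plain similarity ratio on the lower-cased strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sorted_fuzzy(a, b):
    """Similarity ratio after sorting each value's tokens alphabetically."""
    sa = " ".join(sorted(a.lower().split()))
    sb = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, sa, sb).ratio()


plain = fuzzy("Smith John", "John Smith")
ordered = sorted_fuzzy("Smith John", "John Smith")

print(plain, ordered)
```

Sorting the tokens makes the comparison insensitive to word order, so the two values compare as identical in the sorted variant while the plain comparison scores them lower.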

Match Variation

Match Variations enable the user to alter how the similarity algorithm weights different parts of a value, which is useful when tuning the match process. Regional variations in demographic data are often where these settings are most useful. For example, some nationalities always put the family name first and others put the given name first; this setting enables the matching to deal with such variations better.

Match Variation: Description
Linear Transform: Use all parts of the value in equal proportion
Both Ends are More Valuable: Tokens at the start and end of the value have more significance
Address Type: Treat the value as an address, where numerics and street names matter more
Left End is Higher Value: The start of the string has more significance
Right End is Higher Value: The end of the string has more significance
Both Ends are Lower Value: The center of the value has more significance
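The idea of position-dependent weighting can be sketched as follows. This is a toy model, not Custodian's similarity algorithm: the weight shapes (a simple linear ramp between 1 and 2), the exact-token comparison and the function names are all assumptions, and Address Type is left out because its rules are not described here.

```python
def position_weights(n, variation):
    """Return one normalised weight per token position for `variation`."""
    if n == 0:
        return []
    if n == 1:
        return [1.0]
    # Position scaled to 0.0 (left end) .. 1.0 (right end).
    pos = [i / (n - 1) for i in range(n)]
    if variation == "Linear Transform":
        raw = [1.0] * n
    elif variation == "Left End is Higher Value":
        raw = [2.0 - p for p in pos]
    elif variation == "Right End is Higher Value":
        raw = [1.0 + p for p in pos]
    elif variation == "Both Ends are More Valuable":
        raw = [1.0 + abs(2 * p - 1) for p in pos]
    elif variation == "Both Ends are Lower Value":
        raw = [2.0 - abs(2 * p - 1) for p in pos]
    else:
        raise ValueError(f"unsupported variation: {variation}")
    total = sum(raw)
    return [w / total for w in raw]


def weighted_similarity(a, b, variation):
    """Weighted share of positions where the tokens of a and b agree."""
    ta, tb = a.lower().split(), b.lower().split()
    n = max(len(ta), len(tb))
    weights = position_weights(n, variation)
    return sum(w for i, w in enumerate(weights)
               if i < len(ta) and i < len(tb) and ta[i] == tb[i])


for v in ("Linear Transform", "Both Ends are More Valuable",
          "Both Ends are Lower Value"):
    print(v, round(weighted_similarity("John Paul Smith", "John Peter Smith", v), 3))
```

In this toy model, a mismatch in the middle token costs less when both ends are weighted higher and more when the center is weighted higher, which is the kind of tuning the settings above are meant to provide.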