The Mural Matching Engine joins records that refer to the same entity across two or more disparate data sets. In the absence of a shared unique key, record matching must compare groups of partially identifying, non-unique data items between pairs of records. Data items with variable formats, such as names and addresses, must first be transformed and normalized for these comparisons to be valid. This process is commonly referred to as "standardization". This mini-talk presents Mural's approach to standardization, which combines lexicon-based tokenization with finite-state-machine-driven parsing in which the statistical distribution of input symbol sequences is learned from training data sets.
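To make the two stages concrete, here is a minimal Python sketch of the general idea. It is not Mural's implementation. The lexicon entries, symbol classes (NUM, DIR, WORD, TYPE), state names, and training sequences are all hypothetical, and the probabilistic state machine is modeled here as a simple hidden-Markov-style model with add-one smoothing and Viterbi decoding, which is an assumption about how such a parser could work rather than a description of Mural's parser.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical lexicon: maps known tokens to input-symbol classes.
LEXICON = {
    "st": "TYPE", "street": "TYPE", "ave": "TYPE", "avenue": "TYPE", "rd": "TYPE",
    "n": "DIR", "s": "DIR", "e": "DIR", "w": "DIR",
}
ALPHABET = ["NUM", "DIR", "WORD", "TYPE"]

# Hypothetical training data: sequences of (input symbol, parser state).
TRAINING = [
    [("NUM", "HouseNumber"), ("WORD", "StreetName"), ("TYPE", "StreetType")],
    [("NUM", "HouseNumber"), ("DIR", "Direction"), ("WORD", "StreetName"),
     ("TYPE", "StreetType")],
    [("NUM", "HouseNumber"), ("WORD", "StreetName"), ("WORD", "StreetName"),
     ("TYPE", "StreetType")],
]

def tokenize(text):
    """Split text into tokens and tag each with a symbol class via the lexicon."""
    tagged = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        key = token.lower()
        if key in LEXICON:
            symbol = LEXICON[key]
        elif key.isdigit():
            symbol = "NUM"
        else:
            symbol = "WORD"
        tagged.append((token, symbol))
    return tagged

def train(examples):
    """Count start, transition, and emission frequencies from labeled sequences."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sequence in examples:
        previous = None
        for symbol, state in sequence:
            if previous is None:
                start[state] += 1
            else:
                trans[previous][state] += 1
            emit[state][symbol] += 1
            previous = state
    return start, trans, emit

def log_prob(counts, key, n_outcomes):
    """Log probability with add-one smoothing over n_outcomes possible values."""
    return math.log((counts[key] + 1) / (sum(counts.values()) + n_outcomes))

def parse(symbols, model, states):
    """Viterbi decoding: return the most likely state sequence for the symbols."""
    if not symbols:
        return []
    start, trans, emit = model
    n_s, n_a = len(states), len(ALPHABET)
    # best[s] = (log probability, path) of the best path ending in state s
    best = {s: (log_prob(start, s, n_s) + log_prob(emit[s], symbols[0], n_a), [s])
            for s in states}
    for symbol in symbols[1:]:
        step = {}
        for s in states:
            score, path = max(
                ((best[p][0] + log_prob(trans[p], s, n_s), best[p][1]) for p in states),
                key=lambda candidate: candidate[0],
            )
            step[s] = (score + log_prob(emit[s], symbol, n_a), path + [s])
        best = step
    return max(best.values(), key=lambda candidate: candidate[0])[1]

MODEL = train(TRAINING)
STATES = sorted({state for seq in TRAINING for _, state in seq})

def standardize(text):
    """Tokenize, then label each token with its most likely field."""
    tagged = tokenize(text)
    labels = parse([symbol for _, symbol in tagged], MODEL, STATES)
    return [(token, label) for (token, _), label in zip(tagged, labels)]
```

With this toy model, `standardize("123 N Main St")` labels the four tokens as HouseNumber, Direction, StreetName, and StreetType. Once both records have been split into the same fields, matching can compare field against field instead of comparing raw strings.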