uniserv

Methods and Algorithms in Address Management

 

Error tolerance

Optimized methods and algorithms for error tolerance are very important functions for Customer Data Quality (CDQ). The reason? Reading, hearing or spelling errors are highly likely to occur during the visual or acoustic transmission or acquisition of address data, and terms may be entered in a different order or simply abbreviated in a different way.

Amongst other things, this can lead to the following:
  • Data entry errors are present: "rown" instead of "Brown", or "Feild" instead of "Field"
  • Words sound different from the way they are spelled: "Meier" instead of "Mayer", or "Steven" instead of "Stephen"
  • Words are exchanged: "Brown George" instead of "George Brown"
  • Words are abbreviated differently: "G. Brown" instead of "George Brown", or "Road" instead of "Rd."
These deviations are not the exception but almost the rule. Studies have shown that 10 to 30 percent of addresses are altered in some way when they are transmitted visually or acoustically, or recorded, several times.

Human intelligence, i.e. knowledge about the meaning of the relevant terms combined with the ability to associate, usually makes it easy to recognize two addresses that differ only because of transmission errors and to judge whether or not they refer to the same person or company.

For a computer, this problem is much more difficult to solve without standard customer and prospective customer reference numbers. Nevertheless, there are several methods which lead to very good results:
  • Typical reading or data entry errors are recognized with technologies that are based on fuzzy logic or that use special algorithms which, similar to the Hamming or Levenshtein distance, calculate the distance between two character strings.
  • Fuzzy logic is only suitable up to a point for recognizing errors in acoustic transmission, e.g. in a call centre. In this case, special phonetic methods are necessary to evaluate letter combinations which sound the same. Different phonetic methods are necessary depending on the language and the country in order to obtain optimum results.
  • It is important that the two methods described above can also work in combination, because a name can easily be misunderstood over the telephone and then entered into the database with a typing error.
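
The combination described above can be illustrated with a minimal Python sketch. It is not Uniserv's implementation: it swaps in the classic Levenshtein distance for typing errors and American Soundex, a phonetic code designed for English names, for hearing errors. The function names and the max_edits parameter are assumptions made for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Consonant classes of American Soundex (English-oriented).
_SOUNDEX_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Four-character phonetic key: first letter plus three digits."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last_code = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != last_code:
            result += code
        last_code = code  # a vowel resets last_code to ""
    return (result + "000")[:4]


def names_similar(a: str, b: str, max_edits: int = 1) -> bool:
    """Hypothetical combination rule: accept a pair if either the
    edit distance is small or the phonetic keys are identical."""
    a, b = a.strip().lower(), b.strip().lower()
    if levenshtein(a, b) <= max_edits:
        return True
    return soundex(a) == soundex(b)
```

With these definitions, names_similar("rown", "Brown") succeeds through the edit distance, while names_similar("Meier", "Mayer") succeeds through the phonetic key. For other languages a different phonetic method would be needed, as noted in the list above.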
 
 

Address analysis

An error-tolerant string comparison alone is still not enough for a usable address management solution. An additional module with a lexical, syntactic and semantic comparison is required for optimized address analysis because, in the same way as humans, the software must be able to make decisions about the meaning of the terms which are being compared. This is the only way the software can make "sensible" decisions. Some examples:
In spite of the close match, there is only a low probability that this is the same person:
  • George Brown jun, 14 Abbeydale Road, Sheffield, S8 0ZL
  • George Brown sen, 14 Abbeydale Road, Sheffield, S8 0ZL
 
In spite of not being a close match, it is almost certain that this is the same company:
  • General Insurance Brokers Ltd.
  • Gen Insurance Brokers Ltd.
 
Addresses which consist of the same words can be very different because of syntax:
  • Frank King Ltd., FAO Robert Smith
  • Robert Smith Ltd., FAO Frank King

 
Addresses with a different syntax can nevertheless be a close match:
  • ABC Ltd., FAO Martin Jones, Head of IT
  • Mr Jones, Data Processing Dept., c/o ABC Ltd.
 
To solve this problem, the relevant Uniserv products include an internal, country-specific database which contains important terms for names and addresses, their meaning and frequency, as well as an ambiguous, context-sensitive set of rules describing the name and address elements for that country.

On the basis of this internal database, lexical, syntactic and semantic comparisons take place with the aid of an error-tolerant parser for ambiguous grammars. Although this is a rather complicated method, it can be used to obtain results with a significantly higher quality than can be obtained with simple stop-word lists, especially in critical cases.
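
The idea of comparing meanings rather than raw strings can be sketched in a few lines of Python. This is only a toy illustration, far simpler than a parser for ambiguous grammars: the LEXICON table, the element types and the FAO splitting rule are all hypothetical and stand in for the country-specific database and rule set described above.

```python
from difflib import SequenceMatcher

# Hypothetical mini-lexicon: term -> (element type, canonical meaning).
LEXICON = {
    "jun": ("GENERATION", "junior"),
    "jr": ("GENERATION", "junior"),
    "sen": ("GENERATION", "senior"),
    "sr": ("GENERATION", "senior"),
    "ltd": ("LEGAL_FORM", "limited"),
    "limited": ("LEGAL_FORM", "limited"),
    "gen": ("WORD", "general"),
    "general": ("WORD", "general"),
}


def analyse(line: str) -> list[tuple[str, str]]:
    """Lexical step: map every token to (type, canonical meaning).
    Unknown tokens are kept as plain words."""
    elements = []
    for raw in line.replace(",", " ").split():
        token = raw.strip(".").lower()
        elements.append(LEXICON.get(token, ("WORD", token)))
    return elements


def split_roles(line: str):
    """Syntactic step (assumed rule): text before 'FAO' is the company,
    text after it is the contact person."""
    company, _, contact = line.partition("FAO")
    return analyse(company), analyse(contact)


def string_similarity(a: str, b: str) -> float:
    """Plain character-level similarity between 0 and 1, for contrast."""
    return SequenceMatcher(None, a, b).ratio()


# Close string match, but different meaning (junior vs. senior).
print(string_similarity("George Brown jun", "George Brown sen"))
print(analyse("George Brown jun") == analyse("George Brown sen"))

# Different strings, same meaning once "Gen" is resolved to "General".
print(analyse("General Insurance Brokers Ltd.")
      == analyse("Gen Insurance Brokers Ltd."))

# Same words, but company and contact have swapped roles.
a = "Frank King Ltd., FAO Robert Smith"
b = "Robert Smith Ltd., FAO Frank King"
print(sorted(analyse(a)) == sorted(analyse(b)), split_roles(a) == split_roles(b))
```

A real lexicon would also have to deal with ambiguity, for example "Gen" as an abbreviation of a military rank rather than of "General" in a company name, which is why the text above speaks of an ambiguous, context-sensitive rule set.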
 
 

Data access

Processing the examples described above in an acceptable length of time against a database of several million addresses is far beyond what a human can manage. It is also a problem for many address management programs, which supply acceptable results at a low address volume but fail miserably when it comes to large address databases. Here, either the performance is unacceptable or there is a clear loss of quality.

Uniserv has developed in-house methods for accessing data, which combine error-tolerant comparisons and address analysis with technologies integrated in the database systems. Separate data access technologies have been developed for sequential mass processing (mass matching n:n) on the one hand and interactive online processing (individual matching 1:n) on the other. These methods are optimized for the respective application and ensure a high throughput per hour in mass matching and a fast response time in individual matching. Uniserv's technology guarantees a high data access performance both in mass processing and individual processing without the necessity of segmentation.
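
To give a feeling for the scale of the problem, the following Python sketch first counts the comparisons a naive n:n matching would need and then shows one generic way of narrowing a 1:n lookup to a few candidates, a character trigram index. This is a textbook technique used here purely for illustration; it is not a description of Uniserv's in-house access methods, and the class name TrigramIndex and the min_shared parameter are assumptions.

```python
from collections import defaultdict


def pair_count(n: int) -> int:
    """Number of record pairs a naive n:n comparison has to examine."""
    return n * (n - 1) // 2


def trigrams(text: str) -> set[str]:
    """Set of overlapping three-character substrings of the padded text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class TrigramIndex:
    """Inverted index from trigram to the records that contain it."""

    def __init__(self) -> None:
        self._postings = defaultdict(set)
        self._records = []

    def add(self, record: str) -> int:
        record_id = len(self._records)
        self._records.append(record)
        for gram in trigrams(record):
            self._postings[gram].add(record_id)
        return record_id

    def candidates(self, query: str, min_shared: int = 2) -> list[str]:
        """Records sharing at least min_shared trigrams with the query.
        Only these need the expensive error-tolerant comparison."""
        counts = defaultdict(int)
        for gram in trigrams(query):
            for record_id in self._postings.get(gram, ()):
                counts[record_id] += 1
        return [self._records[rid]
                for rid, shared in sorted(counts.items())
                if shared >= min_shared]


index = TrigramIndex()
for name in ("George Brown", "Georg Brown", "Martin Jones"):
    index.add(name)

print(pair_count(5_000_000))             # pairs for five million addresses
print(index.candidates("George Brwon"))  # typo in the query
```

For five million addresses the naive approach would need roughly 1.25 × 10^13 pairwise comparisons, which is why some form of optimized data access is needed before any detailed comparison takes place.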
 
 

Parameterization

However, any method is only as good as its adaptability to the respective task. In Uniserv products, this is achieved by "customizing" the relevant parameters to meet specific customer requirements. For example, the conditions under which two addresses are considered to be similar or not can be specified individually. It is also possible to specify the cases in which there is a very high certainty that the addresses are the same, so that processing can take place automatically, as well as the cases in which there is only a suspicion which must be clarified with the help of additional information.
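
A minimal Python sketch of such a parameterization is shown below. The class MatchPolicy, its two thresholds and their default values are hypothetical and only illustrate the three outcomes described above; they are not actual Uniserv parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchPolicy:
    """Hypothetical customer-specific settings for a similarity score
    between 0.0 (completely different) and 1.0 (identical)."""
    auto_threshold: float = 0.95    # at or above: process automatically
    review_threshold: float = 0.80  # at or above: suspicion, needs clarification

    def decide(self, score: float) -> str:
        if score >= self.auto_threshold:
            return "match"
        if score >= self.review_threshold:
            return "suspect"
        return "no match"


default_policy = MatchPolicy()
strict_policy = MatchPolicy(auto_threshold=0.99, review_threshold=0.90)

for score in (0.97, 0.85, 0.50):
    print(score, default_policy.decide(score), strict_policy.decide(score))
```

The same score of 0.97 is processed automatically under the default policy but only flagged as a suspicion under the stricter one, which is the kind of customer-specific adjustment the paragraph above refers to.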
 
 

Unicode

Correct character interpretation is crucial to securing the quality of customer and address data during the transmission, acquisition and storage of address information, especially in an era of globalization and internationalization. The Uniserv products for postal validation and duplicate checking are Unicode-capable, so that problems with different character sets and their display are reliably excluded from the very beginning. The Uniserv products therefore also support scripts such as Latin, Arabic, Greek, Cyrillic, Hebrew, Katakana, Hiragana and Hangul.

Unicode itself is an international standard in which a digital code is permanently assigned to each meaningful character and text element of all known writing cultures and character systems. The aim is to eliminate the problem of different, incompatible encodings in different countries. Conventional computer character sets consist of either 128 characters (7 bit), such as the very well-known ASCII standard, or 256 characters (8 bit), such as ISO Latin-1. After deduction of the control characters, 95 elements (including the space) can be displayed as characters and special characters in ASCII, and roughly 190 to 224 elements in the 8-bit character sets. These character encodings permit the simultaneous display of only a few languages in the same text, unless different fonts with different character sets are used. This hinders international data exchange to a considerable extent. Unicode, by contrast, gives each character its own code, independent of the system, program and language. As a result, the characters of all scripts covered by the standard are supported in a Unicode system. The Unicode Consortium is responsible for the standard (www.unicode.org).
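
The limitation of the 7-bit and 8-bit encodings can be demonstrated with a short Python sketch. The sample address text and the helper can_encode are made up for this illustration. The last part goes slightly beyond the text above: it shows Unicode normalization, which matters for duplicate checking because the same visible name can be stored as different code point sequences.

```python
import unicodedata


def can_encode(text: str, encoding: str) -> bool:
    """True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


german = "J\u00fcrgen M\u00fcller"                   # Jürgen Müller
mixed = german + ", \u0391\u03b8\u03ae\u03bd\u03b1"  # plus the Greek word for Athens

for encoding in ("ascii", "latin-1", "utf-8"):
    print(encoding, can_encode(german, encoding), can_encode(mixed, encoding))

# The same visible name stored in two different ways:
composed = "M\u00fcller"     # u with diaeresis as a single code point
decomposed = "Mu\u0308ller"  # plain u followed by a combining diaeresis

print(composed == decomposed)                                # raw comparison
print(unicodedata.normalize("NFC", decomposed) == composed)  # after normalization
```

ASCII cannot represent the umlaut, Latin-1 cannot represent the Greek letters, and only a Unicode encoding such as UTF-8 can hold both in the same text.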

 

If you have high quality requirements, large address databases and a need for excellent performance with few resources, there's no getting round Uniserv solutions!

 
 


www.uniserv.com | 2012-02-04 | © 2011 Uniserv GmbH