UNISERV GmbH

   
Rastatter Straße 13   
75179 Pforzheim
Germany   
Tel. +49 (0) 7231 / 936 - 0   
Fax +49 (0) 7231 / 936 - 2500

Methods and Algorithms in Address Management

 

Error tolerance

One of the most important functions of many address management products is error tolerance, implemented through optimized methods and algorithms. Why? Whenever address data is transferred or entered visually or acoustically, reading, hearing or spelling errors occur very easily. Information is also entered in a different order or simply abbreviated differently.

Here are some examples of what this can lead to:

  • Data entry errors: mith instead of Smith, or Francks instead of Franks
  • Words sound different from the way they are spelled: Smiths instead of Smith, or Steven instead of Stephen
  • Words are exchanged: Franks John instead of John Franks
  • Words are abbreviated differently: J. Franks instead of John Franks, or Avenue instead of Ave.

These deviations are not the exception but almost the rule. Studies have shown that 10 to 30 percent of addresses are altered in some way when they are transferred visually or acoustically, or recorded, several times.

Human intelligence can grasp the significance of the respective concepts due to prior knowledge. This ability, paired with the ability to associate, usually makes it easy for human beings to recognize two different addresses which were "changed" because of transmission errors and to judge whether they refer to the same person or company or not.

For a computer, this problem is much more difficult to solve when there are no uniform customer and prospect numbers. Nevertheless, there are several methods that lead to very good results:

  • To recognize typical reading or data entry errors, technologies are used that are based on fuzzy logic or on special algorithms, such as Hamming or Levenshtein, which calculate the distance between two strings.
  • For recognizing errors in acoustic transmission, e.g. in a call center, fuzzy logic is suited only to a very limited extent. Here special phonetic methods are necessary, which evaluate same-sounding letter combinations as similar. Note that different phonetic methods are needed for different languages and countries to obtain optimal results.
  • It is important that the two methods described above can also work in combination. How easily a name is misunderstood on the telephone and then, on top of that, entered into the database with a typo!
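The combination of string distance and phonetic comparison described above can be sketched roughly as follows. This is a minimal illustration, not Uniserv's implementation: the function names, the `max_edits` limit and the choice of American Soundex as the phonetic method are assumptions made for this example, and Soundex is tuned to English names only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; only defined for strings of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


# American Soundex letter groups (one possible phonetic method, English only).
_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Four-character Soundex key, e.g. 'S315' for both Stephen and Steven."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    last_code = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != last_code:
            result += code
        if ch not in "HW":  # H and W do not separate equal codes, vowels do
            last_code = code
    return (result + "000")[:4]


def probably_same_name(a: str, b: str, max_edits: int = 1) -> bool:
    """Hypothetical combined check: close in spelling OR identical in sound."""
    a, b = a.strip().lower(), b.strip().lower()
    return levenshtein(a, b) <= max_edits or soundex(a) == soundex(b)


print(levenshtein("Smith", "mith"))            # 1
print(soundex("Stephen"), soundex("Steven"))   # S315 S315
print(probably_same_name("Stephen", "Stevn"))  # True
```

In the last call, "Stephen" was misheard as "Steven" and then mistyped as "Stevn". The edit distance alone exceeds the limit of one edit, but the phonetic key still matches, which is why the two methods are combined.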
 
 

Address analysis

An error-tolerant string comparison alone is not enough for a usable address management solution. Optimized address analysis requires an additional building block that performs lexical, syntactic and semantic comparison. Like a human reader, the software must be able to decide what the terms being compared mean. Only then can it make "meaningful" decisions. Some examples:

Despite the close match, there is only a small probability that this is the same person:

  • John Franks jun, 17 Kings Road, LONDON N22 5SN
  • John Franks sen, 17 Kings Road, LONDON N22 5SN
 
Despite the fact that it is not a close match, there is great certainty that this is the same person or company:
  • General Brewery Insurance Company Ltd.
  • Genral Brewry Insurance Ltd.
 
Addresses that consist of the same words can be very different because of syntax:
  • John Franks Ltd. attn. Mr Tim Smith
  • Tim Smith Ltd. attn. Mr John Franks
 
Syntactically different addresses can nevertheless show great agreement:
  • ABC Ltd., attn. John Franks Head of Department IT
  • Mr Franks Department Data Processing, c/o ABC Ltd.

To solve this problem, the respective Uniserv products have an internal database containing important terms for the particular country (names and addresses, their meaning and frequency) as well as an ambiguous, context-sensitive set of rules that describes the name and address elements for the respective country.

On the basis of this internal database, lexical, syntactic and semantic comparisons are carried out with the aid of an error-tolerant parser for ambiguous grammars. This is a rather sophisticated method, but it yields much better results than simple stop-word lists, especially in critical cases.
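Why the meaning of each term matters can be shown with a deliberately tiny sketch. It is not the parser described above: the word lists, the `ParsedName` structure and the function names are hypothetical, and a real lexicon would have to handle ambiguity (for example "Sen" as a surname), which this sketch ignores.

```python
import re
from dataclasses import dataclass

# Hypothetical miniature lexicon; a real one would be country-specific.
LEGAL_FORMS = {"ltd", "inc", "gmbh"}
SALUTATIONS = {"mr", "mrs", "ms"}
CONTACT_MARKERS = {"attn"}
GENERATIONS = {"jun": "junior", "jr": "junior", "sen": "senior", "sr": "senior"}


@dataclass(frozen=True)
class ParsedName:
    company: str = ""
    person: str = ""
    generation: str = ""


def tokenize(text: str) -> list[str]:
    """Lower-case alphabetic tokens; punctuation is dropped."""
    return re.findall(r"[a-z]+", text.lower())


def parse(text: str) -> ParsedName:
    """Assign each token a role based on the lexicon and its position."""
    tokens = tokenize(text)
    head, tail = tokens, []
    for i, tok in enumerate(tokens):
        if tok in CONTACT_MARKERS:  # everything after "attn." is the contact
            head, tail = tokens[:i], tokens[i + 1:]
            break
    head_is_company = any(tok in LEGAL_FORMS for tok in head)
    company, person, generation = [], [], ""
    for part, target in ((head, company if head_is_company else person),
                         (tail, person)):
        for tok in part:
            if tok in LEGAL_FORMS or tok in SALUTATIONS:
                continue
            if tok in GENERATIONS:
                generation = GENERATIONS[tok]
                continue
            target.append(tok)
    return ParsedName(" ".join(company), " ".join(person), generation)


def may_be_same(a: ParsedName, b: ParsedName) -> bool:
    """Conflicting generation suffixes veto a match; otherwise compare roles."""
    if a.generation and b.generation and a.generation != b.generation:
        return False
    return a.company == b.company and a.person == b.person


first = parse("John Franks Ltd. attn. Mr Tim Smith")
second = parse("Tim Smith Ltd. attn. Mr John Franks")
print(first)   # company='john franks', person='tim smith'
print(second)  # company='tim smith', person='john franks'
print(may_be_same(first, second))                                    # False
print(may_be_same(parse("John Franks jun"), parse("John Franks sen")))  # False
```

Both company lines consist of exactly the same words, yet the roles assigned to those words differ, so the records are kept apart. The junior/senior pair differs by a single token, but that token carries enough meaning to rule out a match.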
 
 

Data access

In any case, processing cases like those described above within an acceptable time across a database of several million addresses is beyond human capability. Many address management programs fail at exactly this problem as well. With low address volumes their results are quite acceptable, but they fail badly with large address databases: either the performance is unacceptable or there is a clear loss in quality.

Uniserv has developed its own proprietary methods for accessing data, which combine error-tolerant comparisons and address analysis with technologies that are integrated into the database systems. Separate data access technologies were developed for sequential batch processing (batch comparison n:n) on the one hand and interactive online processing (individual comparison 1:n) on the other. These methods are optimized for their specific use and ensure a high number of records processed per hour in batch comparisons and fast response times in individual comparisons. Uniserv's technology guarantees high performance in accessing the data in both batch and individual processing, without the necessity of segmentation.
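The scale of the problem can be made concrete with a short sketch. Uniserv's access methods are proprietary and not described here, so the example swaps in a generic, well-known technique, a q-gram inverted index, purely to show how a 1:n lookup can avoid comparing the query against every record. All names and the `min_shared` parameter are assumptions for illustration.

```python
from collections import defaultdict


def pair_count(n: int) -> int:
    """Number of record pairs a naive n:n comparison would have to examine."""
    return n * (n - 1) // 2


def qgrams(text: str, q: int = 3) -> set[str]:
    """Set of overlapping q-character substrings, padded at both ends."""
    pad = "#" * (q - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


class QGramIndex:
    """Generic error-tolerant lookup structure (not Uniserv's method)."""

    def __init__(self, q: int = 3) -> None:
        self.q = q
        self.records: list[str] = []
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, text: str) -> int:
        record_id = len(self.records)
        self.records.append(text)
        for gram in qgrams(text, self.q):
            self.postings[gram].add(record_id)
        return record_id

    def candidates(self, query: str, min_shared: int = 2) -> list[str]:
        """Records sharing at least min_shared q-grams, best overlap first."""
        counts: dict[int, int] = defaultdict(int)
        for gram in qgrams(query, self.q):
            for record_id in self.postings.get(gram, ()):
                counts[record_id] += 1
        hits = [(n, rid) for rid, n in counts.items() if n >= min_shared]
        hits.sort(key=lambda hit: (-hit[0], hit[1]))
        return [self.records[rid] for _, rid in hits]


index = QGramIndex()
for name in ("John Franks", "Tim Smith",
             "General Brewery Insurance Company Ltd."):
    index.add(name)

print(pair_count(1_000_000))           # 499999500000
print(index.candidates("Jon Franks"))  # ['John Franks']
```

A naive batch comparison of one million records already means roughly half a trillion pairs, which is why only a small set of plausible candidates should ever reach the expensive error-tolerant comparison and address analysis.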
 
 

Parametrization

But all methods are only as good as their fit to the problem at hand. In Uniserv products this fit is achieved by "customizing" the parameters to meet specific customer requirements. For instance, you can define individually and completely under which conditions two addresses are to be considered similar and under which they are not. You can also specify in which cases the certainty that the addresses are the same is high enough for fully automatic processing, and in which cases there is only a suspicion that must be clarified by gathering additional information.
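Such a parameter set could look something like the following sketch. The parameter names, the default values and the assumption that a comparison yields a similarity score between 0 and 1 are all hypothetical; they are not taken from any Uniserv product.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_MATCH = "same address, process automatically"
    REVIEW = "possible match, gather additional information"
    NO_MATCH = "different addresses"


@dataclass(frozen=True)
class MatchParameters:
    """Hypothetical customer-specific thresholds on a 0..1 similarity score."""
    auto_threshold: float = 0.92
    review_threshold: float = 0.75

    def __post_init__(self) -> None:
        if not 0.0 <= self.review_threshold <= self.auto_threshold <= 1.0:
            raise ValueError("need 0 <= review_threshold <= auto_threshold <= 1")


def decide(score: float, params: MatchParameters = MatchParameters()) -> Decision:
    """Map a similarity score to one of three processing paths."""
    if score >= params.auto_threshold:
        return Decision.AUTO_MATCH
    if score >= params.review_threshold:
        return Decision.REVIEW
    return Decision.NO_MATCH


# A cautious customer might demand near-certainty before merging records.
strict = MatchParameters(auto_threshold=0.98, review_threshold=0.80)
print(decide(0.95))          # Decision.AUTO_MATCH
print(decide(0.95, strict))  # Decision.REVIEW
print(decide(0.50, strict))  # Decision.NO_MATCH
```

The same score leads to different handling depending on the customer's parameters, which is the point of making the thresholds configurable rather than fixed.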
 
 

Unicode

Correct character interpretation is crucial to any initiative for securing the quality of customer and address data during transmission, acquisition and storage, especially in an era of globalization and internationalization. The Uniserv products, e.g. for postal validation and duplicate checking, are Unicode-capable in order to reliably preclude problems with different character sets and their display from the very beginning. The Uniserv products therefore also support scripts such as Latin, Arabic, Greek, Cyrillic, Hebrew, Katakana, Hiragana, Hangul, etc.

Unicode itself is an international standard in which a permanent digital code is assigned to each meaningful character and text element of all known literate cultures and character systems. The aim is to eliminate the problem of different, incompatible encodings in different countries. Conventional computer character sets consist of either 128 characters (7 bit), such as the very well-known ASCII standard, or 256 characters (8 bit), such as ISO Latin-1. After deduction of the control characters, 95 printable characters (including the space) remain in ASCII and up to 191 in the 8-bit ISO 8859 character sets. These character encodings permit the simultaneous display of only a few languages in the same text, unless different fonts with different character sets are used in a text. This hinders international data exchange to a considerable extent. Unicode, on the other hand, provides each character with its own code, independent of the system, program and language. As a result, all known characters are supported as standard in the Unicode system. The Unicode Consortium is responsible for the standard (www.unicode.org).
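The limitation of the older encodings is easy to demonstrate. The following sketch uses only Python's built-in string handling; the helper names `can_encode` and `code_points` are made up for this example.

```python
def can_encode(text: str, encoding: str) -> bool:
    """True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def code_points(text: str) -> list[str]:
    """Unicode code point of each character in the usual U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


street = "Rastatter Straße"
cities = "Москва, Αθήνα, 東京"

print(can_encode(street, "ascii"))    # False: 'ß' is not an ASCII character
print(can_encode(street, "latin-1"))  # True
print(can_encode(cities, "latin-1"))  # False: no Cyrillic, Greek or Kanji
print(can_encode(cities, "utf-8"))    # True: UTF-8 covers all of Unicode
print(code_points("ßЖ"))              # ['U+00DF', 'U+0416']
```

ISO Latin-1 handles the German street name but none of the three city names, whereas a Unicode encoding such as UTF-8 stores all of them in the same text without switching character sets.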

 
If you have high quality demands, large address databases and the desire for excellent performance with few resources, Uniserv solutions are what you need!
 
 




www.uniserv.com | 16.05.2008 | © 2008 Uniserv GmbH