Processes and Algorithms in Address Management

When developing its software solutions for address management, Uniserv places special emphasis on the underlying methods and technologies. The aim is to provide you with the best possible software for every address management task.

Error tolerance

Optimized procedures and algorithms for error tolerance are particularly important for high-quality customer master data. Why? Whenever address data is transmitted visually or acoustically, or is recorded, reading, hearing and writing errors occur easily, and terms end up in a different order or are abbreviated differently.

This then leads, for example, to:

Data entry errors -> üller instead of Müller or Wiedner instead of Weidner
Different spellings of words that sound the same or similar -> Meier instead of Mayer or Stefan instead of Stephan
Interchanged words -> Müller Hans instead of Hans Müller
Inconsistent abbreviations -> H. Müller instead of Hans Müller or Straße instead of Str.

These deviations are not the exception but almost the rule. Studies have shown that 10 to 30 percent of addresses are altered in some way when they are transmitted or recorded several times, whether visually or acoustically.

Human intelligence, that is, knowledge of what the terms mean coupled with the ability to make associations, usually makes it easy to recognize two addresses that differ only because of such transmission errors and to judge whether they refer to the same person or company.

For a computer that has no unique customer or prospect number to rely on, this task is much harder. Nevertheless, some procedures achieve very good results:

To detect typical reading or data entry errors, for example, one uses techniques based on fuzzy logic or specialized algorithms similar to the Hamming or Levenshtein distance, which calculate the distance between two strings.
For error detection in acoustic transmission, for example in a call center, fuzzy logic is suitable only to a limited extent. Here, special phonetic methods are required that treat similar-sounding letter combinations as similar. Note that different phonetic procedures are needed depending on the language and country to achieve optimal results.
The two procedures described above can also be combined effectively. After all, a name is easily misheard on the telephone and then mistyped as well during data entry.
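
The two approaches above, and their combination, can be sketched as follows. This is a minimal illustration, not the method used in Uniserv products: the Hamming and Levenshtein distances are the standard textbook definitions, while `PHONETIC_RULES`, `phonetic_key` and `similar` are hypothetical names with a deliberately tiny, German-flavoured rule set chosen only to cover the examples in this article.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


# Assumption: a toy rule set, far from a complete phonetic algorithm.
PHONETIC_RULES = [("ph", "f"), ("ay", "ei"), ("ai", "ei"), ("ey", "ei"), ("th", "t")]


def phonetic_key(name: str) -> str:
    """Map similar-sounding letter combinations to one representation."""
    key = name.lower()
    for pattern, replacement in PHONETIC_RULES:
        key = key.replace(pattern, replacement)
    return key


def similar(a: str, b: str, max_distance: int = 1) -> bool:
    """Combine both ideas: edit distance measured on the phonetic keys."""
    return levenshtein(phonetic_key(a), phonetic_key(b)) <= max_distance


if __name__ == "__main__":
    print(levenshtein("üller", "Müller"))     # missing first letter
    print(hamming("Wiedner", "Weidner"))      # two swapped letters
    print(phonetic_key("Mayer") == phonetic_key("Meier"))
    print(similar("Stephn", "Stefan"))        # phonetic variant plus a typo
```

Note that plain Levenshtein counts a transposition such as Wiedner/Weidner as two edits; variants that treat a swap of adjacent letters as a single edit exist, and a production system would also need language-specific phonetic rules rather than this short list.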

Address analysis / data analysis

Error-tolerant string comparison alone does not make a useful address management solution. Optimized data or address analysis requires a further module that performs lexical, syntactic and semantic comparison, because the software must be able to decide, much as a human would, what the terms being compared mean. Only then can the program make "meaningful" decisions. Here are some examples:

Despite a high degree of correspondence, there is only a low probability that these are the same person:

Hans Müller jun., Hubertusallee 16, 76135 Karlsruhe
Hans Müller sen., Hubertusallee 16, 76135 Karlsruhe

Despite a low degree of correspondence, there is a high degree of certainty that these are the same company:

Münchener Allgemeine Brauereiversicherungsgesellschaft mbH
Münchner Allgemeine Brauereiversicherungen GmbH

Addresses consisting of the same words can nevertheless be very different because of the syntax:

Alfons Meier GmbH z. Hd. Herrn Otto Müller
Otto Müller GmbH z. Hd. Herrn Alfons Meier

Syntactically different addresses can still have a high match:

ABC GmbH, z. Hd. Manfred Schwarz Abteilungsleiter EDV
Herrn Schwarz Abt. Datenverarbeitung, c/o ABC GmbH

To solve this task, the corresponding Uniserv products have an internal database containing the name and address terms that are important for the respective country, together with their meanings and frequencies. It also contains an ambiguous, context-sensitive set of rules that describes how name and address elements are formed in that country.

Based on this internal database, the lexical, syntactic and semantic comparisons are performed by an error-tolerant parser for ambiguous grammars. Although this procedure is rather complex, it yields much better results than simple stop word lists, especially in critical cases.
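
To illustrate why the meaning and position of terms matter, here is a deliberately small sketch. It is not the error-tolerant parser described above: `LEXICON`, `ATTENTION_MARKER`, `analyse` and `same_addressee` are hypothetical names, the lexicon holds only a handful of entries, and the rule "everything after z. Hd. is the contact person" is a simplifying assumption.

```python
# Hypothetical mini-lexicon: token -> (category, normalized meaning).
LEXICON = {
    "gmbh": ("legal_form", "gmbh"),
    "jun.": ("generation", "junior"),
    "sen.": ("generation", "senior"),
    "herrn": ("salutation", "mr"),
}
ATTENTION_MARKER = "z. Hd."  # German "zu Händen": for the attention of


def analyse(line: str) -> dict:
    """Split a line at the attention marker and tag the tokens it knows."""
    main_part, _, contact_part = line.partition(ATTENTION_MARKER)
    result = {"addressee": [], "attention_of": [],
              "legal_form": None, "generation": None}
    for role, text in (("addressee", main_part), ("attention_of", contact_part)):
        for token in text.replace(",", " ").split():
            category, meaning = LEXICON.get(token.lower(), ("name", token.lower()))
            if category == "name":
                result[role].append(meaning)
            elif category in ("legal_form", "generation"):
                result[category] = meaning
            # Salutations carry no identity information and are dropped.
    return result


def same_addressee(a: str, b: str) -> bool:
    """Compare by role and meaning instead of by raw characters."""
    x, y = analyse(a), analyse(b)
    if x["generation"] != y["generation"]:
        return False  # jun. vs. sen.: different people despite similar strings
    return (x["addressee"] == y["addressee"]
            and x["attention_of"] == y["attention_of"])


if __name__ == "__main__":
    a = "Alfons Meier GmbH z. Hd. Herrn Otto Müller"
    b = "Otto Müller GmbH z. Hd. Herrn Alfons Meier"
    # A bag-of-words view cannot tell the two lines apart ...
    print(sorted(a.lower().split()) == sorted(b.lower().split()))
    # ... while a role-aware comparison can.
    print(same_addressee(a, b))
    print(same_addressee("Hans Müller jun.", "Hans Müller sen."))
```

The sketch compares names exactly; in practice the role-aware comparison would be combined with error-tolerant matching so that spelling variants within each role are still recognized.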

Data access

In any case, it is beyond human capability to find cases like the examples above in a data collection of several million addresses within an acceptable time. Yet this is precisely where many address management programs fail: they deliver perfectly acceptable results with a small address volume, but with large address databases either performance becomes unacceptable or quality drops significantly.
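
A short calculation shows why volume is the critical factor. The sketch below assumes the naive approach of comparing every address with every other address once; `pairwise_comparisons` is a hypothetical helper name and the address counts are illustrative.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n addresses: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n in (10_000, 1_000_000, 5_000_000):
        print(f"{n:>9,} addresses -> {pairwise_comparisons(n):,} pairs")
```

At five million addresses the naive approach already requires roughly 12.5 trillion comparisons, which is why the way the data is accessed decides whether mass matching is feasible at all.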

Uniserv has developed its own data access methods, which combine error-tolerant comparisons and address analysis with technologies used in database systems. Separate data access technologies have been developed for sequential mass processing (mass matching n:n) on the one hand and interactive online processing (individual matching 1:n) on the other. These methods are optimized for their respective application and guarantee high throughput per hour in mass matching and fast response times in individual matching, in both cases without the need for segmentation.

Parameterization

However, any process is only as good as its ability to be adapted to the task at hand. In Uniserv products, this is done by means of parameters, which are used to customize the software to each customer's requirements. For example, you can specify individually under which conditions two addresses are to be considered similar and under which they are not. You can also define when the certainty that two addresses are the same is high enough for processing to run fully automatically, and when there is only a suspicion that must be clarified by consulting further information.
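
The distinction between fully automatic processing and cases that need clarification can be sketched with two thresholds. The names `MatchParameters`, `auto_match`, `suspect` and `classify`, the default values and the 0-to-1 similarity score are all assumptions made for illustration; they are not actual Uniserv parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchParameters:
    """Hypothetical customer-specific thresholds on a 0..1 similarity score."""
    auto_match: float = 0.95  # at or above: treat as the same address
    suspect: float = 0.80     # at or above: flag for manual clarification

    def __post_init__(self) -> None:
        if not 0.0 <= self.suspect <= self.auto_match <= 1.0:
            raise ValueError("expected 0 <= suspect <= auto_match <= 1")


def classify(score: float, params: MatchParameters = MatchParameters()) -> str:
    """Turn a similarity score into one of three processing decisions."""
    if score >= params.auto_match:
        return "match"     # processed fully automatically
    if score >= params.suspect:
        return "review"    # only a suspicion, needs further information
    return "no_match"


if __name__ == "__main__":
    strict = MatchParameters()
    lenient = MatchParameters(auto_match=0.85, suspect=0.60)
    for score in (0.97, 0.85, 0.50):
        print(score, classify(score, strict), classify(score, lenient))
```

The same score can therefore lead to different decisions depending on the parameter set, which is the point of customizing per task.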

Unicode

Particularly in times of globalization and internationalization, interpreting characters correctly when address information is transmitted, recorded and stored is decisive for any initiative to ensure the quality of customer and address data. To rule out problems with different character sets and their representation from the outset, Uniserv products such as address validation and duplicate matching are Unicode-capable. They therefore support scripts such as Latin, Arabic, Greek, Cyrillic, Hebrew, Katakana, Hiragana and Hangul, among others.
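
One concrete representation problem is that Unicode allows the same visible text to be encoded in more than one way. The following sketch uses Python's standard `unicodedata` module to show this for the name Müller; the helper name `canonical` is hypothetical, and whether Uniserv products normalize in this way is not stated here.

```python
import unicodedata


def canonical(text: str) -> str:
    """Normalize to composed form (NFC), then case-fold for comparison."""
    return unicodedata.normalize("NFC", text).casefold()


if __name__ == "__main__":
    composed = "M\u00fcller"     # "ü" as one precomposed code point
    decomposed = "Mu\u0308ller"  # "u" followed by a combining diaeresis

    # Both render as "Müller" but are different code point sequences.
    print(composed == decomposed)
    print(len(composed), len(decomposed))

    # After normalization they compare as equal.
    print(canonical(composed) == canonical(decomposed))

    # Case folding also maps the German sharp s to "ss".
    print(canonical("Straße") == canonical("STRASSE"))
```

Without such a step, two records for the same person can fail an exact comparison simply because they were captured by systems that emit different Unicode forms.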