Processes and Algorithms in Address Management
Uniserv places particular importance on the core technologies utilized in the development of its software solutions.
This corresponds with our aim of providing you with the best possible software solutions for all the tasks of data quality and address management.
Optimised processes and algorithms for error tolerance are particularly important for high-quality customer master data. Why? Because whenever address data is transferred visually or acoustically, or collected in some other way, reading, hearing and typing mistakes can occur, and terms can be written in a different order or abbreviated in different ways.
This can lead, for example, to:
- Data input errors: mith instead of Smith, or Ried instead of Reid
- Words with the same or similar phonetics being written differently: Callahan instead of Callaghan, or Steven instead of Stephen
- Transposed words: Smith John instead of John Smith
- Words that are not uniformly abbreviated: H. Smith instead of Henry Smith, or Street instead of St.
These deviations are not the exception, but almost the rule. Research has shown that 10 to 30 percent of addresses change in some way after being transferred visually or acoustically several times, or after being collected repeatedly by other means.
Human intelligence (the knowledge we have about the importance of particular terms, paired with the ability to associate) usually lets us easily recognise two addresses that were altered during transfer, and judge whether or not they refer to the same person or company.
Without a unified customer or enquirer number, however, this task is far more difficult for a computer. Nevertheless, several processes provide very good results here:
- Techniques related to fuzzy logic are used to identify typical reading or data entry errors. Alternatively, specialised algorithms similar to the Hamming or Levenshtein distance measure how closely two character strings are related.
- For identifying errors during acoustic transfer (e.g. in a call centre), fuzzy logic is suitable only to a certain extent. Special phonetic processes are needed to evaluate same-sounding letter combinations as similar. Note that different phonetic processes are needed for different languages and countries to achieve optimal results.
- It is important that both processes described above also work in combination. It is easy to misunderstand a name spoken on the telephone, and then to make a further error when typing the entry.
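As a minimal sketch of how the two approaches above might work together, the following Python example pairs a Levenshtein edit distance with classic American Soundex. Soundex is a stand-in chosen purely for illustration, not the phonetic process used in Uniserv products, and it is tuned to English names. The function names and the `max_edits` threshold are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


# Classic American Soundex consonant groups.
_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _letter in _letters:
        _SOUNDEX_CODES[_letter] = _digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code, or '' if name has no A-Z letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    last_digit = _SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate equal consonant codes
        digit = _SOUNDEX_CODES.get(c, "")
        if digit and digit != last_digit:
            code += digit
        last_digit = digit  # vowels reset the "previous code" to ''
    return (code + "000")[:4]


def is_probable_match(a: str, b: str, max_edits: int = 1) -> bool:
    """Illustrative rule: accept if the names sound alike OR are within
    max_edits typing errors of each other (max_edits is an assumed default)."""
    code_a = soundex(a)
    if code_a and code_a == soundex(b):
        return True
    return levenshtein(a.lower(), b.lower()) <= max_edits


print(soundex("Steven"), soundex("Stephen"))   # S315 S315
print(levenshtein("mith", "Smith"))            # 1
print(is_probable_match("Smith", "Brown"))     # False
```

Under these assumptions, Steven and Stephen are caught by the phonetic code, while mith and Smith are caught by the edit distance. A German data set would likely call for a different phonetic scheme, such as Kölner Phonetik.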
An error-tolerant string match alone is not an effective address management solution. Optimised address analysis needs a further component to provide lexical, syntactic and semantic matching. Similar to a human reaction, the software must be able to decide the importance of the terms being matched – because only then can the program make “meaningful” decisions. Here are a few examples:
Despite high conformity, there is only a low probability that these are the same person:
- Henry Smith jun., Hubertusallee 16, 76135 Karlsruhe
- Hans Smith sen., Hubertusallee 16, 76135 Karlsruhe
Despite low conformity, there is high certainty that these are the same company:
- Münchener Allgemeine Brauereiversicherungsgesellschaft mbH
- Münchner Allgemeine Brauereiversicherungen GmbH
Addresses consisting of the same words can nevertheless be quite different, due to syntax:
- John Smith Ltd., attn. Mr Otto Brown
- Otto Brown Ltd., attn. Mr John Smith
Syntactically differing addresses can nevertheless show high conformity:
- ABC Ltd., attn. John Smith Department Head EDV
- Mr Smith Data processing dept., c/o ABC Ltd
To solve these tasks, Uniserv products have an internal database for each country that contains important name and address terms together with their importance and frequency, as well as an ambiguous, context-sensitive rule set that describes the layout of name and address elements in that country.
Based upon this internal database, a lexical, syntactic and semantic match is performed with the aid of an error-tolerant parser for ambiguous grammars. Although this is a relatively elaborate process, it achieves considerably better results than simple stop word lists, particularly in critical cases.
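To illustrate why term importance matters, here is a small Python sketch that compares two addresses as sets of words, once with every word weighted equally and once with a weight table. The `TERM_WEIGHTS` table and its values are invented for this example and are not taken from the Uniserv database. The sketch covers only the lexical weighting idea and ignores syntax and word order.

```python
import re

# Hypothetical term weights: frequent, barely distinctive terms get a low
# weight; terms that separate otherwise similar records get a high weight.
TERM_WEIGHTS = {
    "gmbh": 0.1, "mbh": 0.1, "ltd": 0.1,  # legal forms
    "jun": 3.0, "sen": 3.0,               # generation suffixes
}
DEFAULT_WEIGHT = 1.0


def tokenize(text: str) -> set:
    """Lower-case the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def similarity(a: str, b: str, weights=None) -> float:
    """Weighted Jaccard similarity: weight of shared terms divided by the
    weight of all terms. Without a weight table, every term counts as 1."""
    weights = weights or {}
    tokens_a, tokens_b = tokenize(a), tokenize(b)

    def weight(term: str) -> float:
        return weights.get(term, DEFAULT_WEIGHT)

    shared = sum(weight(t) for t in tokens_a & tokens_b)
    total = sum(weight(t) for t in tokens_a | tokens_b)
    return shared / total if total else 0.0


junior = "Henry Smith jun., Hubertusallee 16, 76135 Karlsruhe"
senior = "Hans Smith sen., Hubertusallee 16, 76135 Karlsruhe"

print(round(similarity(junior, senior), 3))                # 0.556
print(round(similarity(junior, senior, TERM_WEIGHTS), 3))  # 0.385
```

With the assumed weights, the differing suffixes jun. and sen. pull the score down from 5/9 to 5/13, which is the direction a human reader would expect. Because it treats an address as an unordered set of words, this sketch cannot tell "John Smith Ltd., attn. Mr Otto Brown" from "Otto Brown Ltd., attn. Mr John Smith", so a syntactic analysis is still required.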
The human mind is simply not capable of finding examples like those above in a database of several million addresses within an acceptable time. But this is also a problem for many address management programs. Although they deliver acceptable results for low address volumes, they fail abysmally when dealing with large address databases, either because performance is unacceptable or because there is a noticeable loss of quality.
Uniserv has developed its own data access methods, in which error-tolerant matching and address analysis are combined with the technology used in database systems. Separate data access technologies have been developed for sequential mass processing (mass matching n:n) on the one hand, and interactive online processing (single matching 1:n) on the other. These methods are specifically optimized for each respective operation and guarantee high throughput per hour for mass matching, and a rapid response time for individual matching. Uniserv’s data access technology guarantees high performance in both individual and mass processing, without the necessity for segmentation.
All processes are only as good as their ability to adapt to the respective task. Uniserv products are adapted by applying appropriate parameters, thereby customizing the product to the customer’s demands. Entirely individual rules can be defined for when two addresses are considered most probably identical and are processed automatically, and for when a pair is classed as uncertain and requires more information for clarification.
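One common way to express such rules is a pair of score thresholds. The Python sketch below assumes a similarity score between 0 and 1. The names `auto_match` and `review` and their default values are hypothetical and do not correspond to actual Uniserv parameters.

```python
from enum import Enum


class Decision(Enum):
    MATCH = "process automatically as identical"
    REVIEW = "uncertain, needs more information"
    NO_MATCH = "treat as different"


def classify(score: float, auto_match: float = 0.9, review: float = 0.7) -> Decision:
    """Map a similarity score in [0, 1] onto one of three decisions.

    auto_match and review are illustrative, customer-specific thresholds.
    """
    if not 0.0 <= review <= auto_match <= 1.0:
        raise ValueError("expected 0 <= review <= auto_match <= 1")
    if score >= auto_match:
        return Decision.MATCH
    if score >= review:
        return Decision.REVIEW
    return Decision.NO_MATCH


print(classify(0.95).name)                   # MATCH
print(classify(0.80).name)                   # REVIEW
print(classify(0.80, auto_match=0.75).name)  # MATCH
```

Lowering `auto_match` sends more pairs to automatic processing at the cost of more false merges, while raising `review` shrinks the set of pairs that need manual clarification.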
Especially in these times of internationalisation and globalisation, correct character interpretation is decisive for any initiative that ensures customer and address data quality during the collection, transfer and storage of information. To reliably eliminate the problems of different character sets and their representation from the outset, Uniserv products, such as the postal check and duplicate matching, are Unicode compatible. As a result, Uniserv products also support scripts such as Latin, Arabic, Greek, Cyrillic, Hebrew, Katakana, Hiragana, Hangul, etc.
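As an illustration of why consistent character interpretation matters for matching, the following Python sketch normalises text with the standard `unicodedata` module before comparison. The choice of NFKC plus case folding and the helper name `normalize_text` are assumptions made for this example, not a description of how Uniserv products handle Unicode.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Bring visually identical strings to one comparable form.

    NFKC merges composed/decomposed and compatibility variants,
    casefold() removes case differences (and maps the German sharp s
    to 'ss'), and the split/join collapses runs of whitespace.
    """
    composed = unicodedata.normalize("NFKC", text)
    return " ".join(composed.casefold().split())


decomposed = "Mu\u0308ller"   # 'u' followed by a combining diaeresis
precomposed = "M\u00fcller"   # single code point for u with diaeresis

print(decomposed == precomposed)                                   # False
print(normalize_text(decomposed) == normalize_text(precomposed))   # True
print(normalize_text("STRASSE") == normalize_text("Stra\u00dfe"))  # True
```

Without such a step, two records that look identical on screen can fail an exact comparison simply because they were entered on systems that encode accented characters differently.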