We often want to match words that were referring to the same thing, but they were written somewhat different, or with a typographical error. This is a common problem we experience in a wide range of scenarios starting from human errs to joining data from different databases where it coded differently. Alteryx dispense fuzzy match tool to address these scenarios with ease. Fuzzy matching is a process that enables the identification of duplicates or matches that are not the same.
Data preparation is the key to success!
In order to perform a successful fuzzy matching, it is essential to prep the data for it. To extract the most out of a fuzzy matching tool, data must be properly cleaned. Also, all unneeded fields can be removed to keep the matching process simple and clean. Please keep in mind that, we can always use a Join tool to join back the required fields at the end of the fuzzy matching process.
The very first step is setting the record ID for our data. The fuzzy match tool uses the Record ID to output the list of matched records. We can either use an existing field as Record ID or create one from scratch. In case we are comparing records from different sources (Merge Process), it is always good to keep an eye on the Record ID field to make sure that these IDs are not getting overlapped and duplicated.
The next step in the preparation process is to cleanse the data. A comprehensive set of tools is available in Alteryx to perform this step. Better success rates, it is always important to understand the data and correcting corrupt or inaccurate records from the data set. Data cleansing activities may also involve activities like, harmonization of data and standardization of data. After cleansing activities, the data model should be consistent in most of the aspects with different sources.
Following is a set of possible tools to use in the Data Preparation Process, but it is not only limited to this set. These tools are just frequently used.
- Record ID
- Data Cleansing Tool
- Multi-field formula tool
- Select Tool
- Unique Tool
- Formula Tool
Fuzzy Matching is an Art
Once the data preparation process is successful, then and only then, performing the fuzzy matching on top of this cleansed data sets would yield the required outcome.
Unlike the other tools in the Join family, the Fuzzy matching tool has only one input stream available. So, it always necessary to union the different data streams into one before Fuzzy matching.
There are two match modes available in this tool.
- Purge Mode
Records from a single source will be compared against each other to identify the potential duplicates.
- Merge Mode
Records from different sources will be compared to identify the duplicates across different input files. It is always good to de-dupe the input files before the merge mode because Merge mode does not detect duplicate records within the same source. And, each source must contain a Source ID. Source ID column helps us to locate the stream of data where a record is coming from.
Once we decided on the matching mode, specify the record ID field on the tool. And then, the match threshold as a percentage should be specified on the tool. If the match score is less than this specified value, then the record will not be considered as a potential match. Under the match fields section, select the required field and appropriate match style for the selected field. Match style can be either pre-configured or custom style.
Once these configurations are done in the fuzzy matching tool and the workflow is executed, the Tool will output the list of potential matches based on the configured match styles and threshold percentage.
The fuzzy matching tool will output the Record ID fields only. It is always expected that there will be duplicated in the result set, its all related to the way how keys are generated in the Fuzzy matching process. So, it is very important that every time the Fuzzy matching tool is used in the workflow, we should always de-dupe the result set right after it.