Record linkage is the task of identifying similar or duplicate records in a database and consolidating them into a single set. In statistics, finding similar objects within a dataset and grouping them is often referred to as clustering. Each cluster is composed of objects that are similar with respect to a specific, measurable characteristic.
The example below shows a record database that contains two records for the same individual: both carry the same name, but one has a misspelled address. In this context, record linkage focuses on forming the link between these records (green line).
In this lab, you will work on a record-linkage problem and learn one way to solve it. Specifically, you will focus on:
Blocking: using BigQuery to reduce the number of comparisons that you need to make to identify duplicates.
Comparing attributes: using common text comparison methods to measure how similar names and address fields are in a database.
Detection: forming a simple classifier by thresholding similarity metrics.
Clustering: applying hierarchical clustering to group duplicates.
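To make the four steps above concrete, here is a minimal sketch in plain Python. It is not the lab's implementation: the lab performs blocking in BigQuery, whereas this sketch uses an in-memory dictionary. The toy records, the blocking key (first letter of the last name), the `difflib` similarity measure, and the 0.85 threshold are all assumptions chosen for illustration. The clustering step is also a simplification: it merges linked pairs with union-find, which corresponds to cutting a single-linkage hierarchy at the threshold within each block.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records (id, name, address); not data from the lab.
RECORDS = [
    (1, "John Smith", "12 Oak Street"),
    (2, "John Smith", "12 Oak Stret"),
    (3, "Jane Doe", "7 Elm Avenue"),
    (4, "Jon Smith", "12 Oak Street"),
    (5, "Mary Jones", "99 Pine Road"),
]


def block_key(record):
    """Blocking: only records that share this key are ever compared."""
    _, name, _ = record
    return name.split()[-1][0].lower()  # first letter of the last name


def similarity(a, b):
    """Comparing attributes: case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(records):
    """Yield every pair of records that falls into the same block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for members in blocks.values():
        yield from combinations(members, 2)


def is_duplicate(r1, r2, threshold=0.85):
    """Detection: average the name and address similarity, then threshold."""
    score = (similarity(r1[1], r2[1]) + similarity(r1[2], r2[2])) / 2
    return score >= threshold


def cluster(records, threshold=0.85):
    """Clustering: merge linked records into groups with union-find."""
    parent = {rec[0]: rec[0] for rec in records}

    def find(x):
        # Follow parent pointers to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for r1, r2 in candidate_pairs(records):
        if is_duplicate(r1, r2, threshold):
            parent[find(r1[0])] = find(r2[0])

    groups = defaultdict(list)
    for rec in records:
        groups[find(rec[0])].append(rec[0])
    return sorted(sorted(g) for g in groups.values())


clusters = cluster(RECORDS)
print(clusters)  # [[1, 2, 4], [3], [5]]
```

With these toy records, blocking reduces the ten possible pairs to three, all within the "Smith" block. In practice the threshold and the blocking key need tuning: a key that is too strict can separate true duplicates before they are ever compared.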