Record Linkage

1 hour, 7 Credits


Google Cloud Self-Paced Labs


Record linkage is the task of identifying similar or duplicated records in a database and merging them into a consolidated set. In statistics, finding similar objects within a dataset and grouping them is often referred to as clustering. Each cluster is composed of objects that are similar with respect to a specific, measurable characteristic.

In the lab's example figure, a record database contains two entries for the same individual: both have the same name, but one has a misspelled address. In this context, record linkage is focused on forming the link between these two records (the green line in the figure).
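
The case of two records with the same name and a misspelled address can be made concrete with a small sketch. The records and field values below are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for whichever text comparison method the lab itself uses.

```python
from difflib import SequenceMatcher

# Hypothetical records for illustration only; not data from the lab.
record_a = {"name": "John Smith", "address": "123 Main Street"}
record_b = {"name": "John Smith", "address": "123 Main Stret"}  # misspelled


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two strings (1.0 means identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


name_score = similarity(record_a["name"], record_b["name"])
address_score = similarity(record_a["address"], record_b["address"])

print(f"name similarity:    {name_score:.3f}")
print(f"address similarity: {address_score:.3f}")
```

An identical name and a near-identical address both score close to 1, which is the signal a record-linkage method can use to link the two records.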


In this lab, you will work on a record-linkage problem and learn one way to solve it. Specifically, you will focus on:

  • Blocking: using BigQuery to reduce the number of comparisons that you need to make to identify duplicates.

  • Comparing attributes: using common text comparison methods to measure how similar names and address fields are in a database.

  • Detection: forming a simple classifier by thresholding similarity metrics.

  • Clustering: applying hierarchical clustering to group duplicates.
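
The four steps above can be sketched end to end on a toy dataset. This is a minimal illustration under stated assumptions, not the lab's implementation: the records, the surname-initial blocking key, and the 0.85 threshold are all hypothetical; blocking is done in memory rather than in BigQuery; and clustering is done with a union-find over the matched pairs, which gives the same groups as single-linkage hierarchical clustering cut at the threshold, restricted to the pairs that survive blocking.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records (id, name, address); not data from the lab.
RECORDS = [
    (0, "John Smith", "123 Main Street"),
    (1, "John Smith", "123 Main Stret"),
    (2, "Jane Doe", "9 Elm Road"),
    (3, "Jane Doe", "9 Elm Rd"),
    (4, "Jack Stone", "77 Oak Avenue"),
]
THRESHOLD = 0.85  # assumed cut-off, chosen only for this example


def similarity(a, b):
    """Comparing attributes: string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(records):
    """Blocking: only compare records that share a cheap key (surname initial)."""
    blocks = defaultdict(list)
    for rec in records:
        surname_initial = rec[1].split()[-1][0].lower()
        blocks[surname_initial].append(rec)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def is_duplicate(rec_a, rec_b, threshold=THRESHOLD):
    """Detection: threshold the mean of name and address similarity."""
    name_sim = similarity(rec_a[1], rec_b[1])
    address_sim = similarity(rec_a[2], rec_b[2])
    return (name_sim + address_sim) / 2 >= threshold


def cluster(records, pairs):
    """Clustering: merge matched pairs into groups with a union-find."""
    parent = {rec[0]: rec[0] for rec in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for rec_a, rec_b in pairs:
        if is_duplicate(rec_a, rec_b):
            parent[find(rec_a[0])] = find(rec_b[0])

    groups = defaultdict(list)
    for rec in records:
        groups[find(rec[0])].append(rec[0])
    return sorted(sorted(group) for group in groups.values())


pairs = candidate_pairs(RECORDS)
clusters = cluster(RECORDS, pairs)
print(f"compared {len(pairs)} of {len(RECORDS) * (len(RECORDS) - 1) // 2} possible pairs")
print("clusters of record ids:", clusters)
```

Blocking is what keeps this approach tractable on large tables: without it, the number of comparisons grows quadratically with the number of records. The trade-off is that a duplicate pair whose blocking keys differ is never compared, so the key should be chosen to be robust to the kinds of errors expected in the data.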
