Prevent Breaches of Confidentiality by Triangulation

By the term “triangulation,” I mean the risk that a reader of a survey report pulls together data from different parts of it and thereby figures out an individual’s response. No single plot or table gives the respondent away, but several of them read together can breach anonymity.

Here is an example of triangulation from a survey of law firm COOs about their compensation. One scatter plot shows total compensation increasing from a point for the lowest paid COO on the left to a point for the highest paid on the right. On its own, with no other information available, each point represents the data of an individual respondent but no one can figure out who it is. However, what if a second scatter plot shows years in position on one axis against the number of practicing lawyers in a firm on the second axis? If the largest firm has 3,300 practicing lawyers and that person has served 12 years, research online could let a detective guess that the firm is Kirkland & Ellis, and therefore that the point likely represents the COO of the firm! LinkedIn could complete the confirmation.

Here is another way triangulation might work. If the report shows a scatter plot of total comp figures by metropolitan area, it runs the same risk of a data breach. Combining the city with the law firm size would confirm that the firm is Kirkland & Ellis. Tables carry the same risk: if key data from the plots were presented in tables with a column for maximum compensation, the same vectors of firm size, years served, and location could out the Kirkland & Ellis COO.

The risk of disclosure by triangulation does not exist at the other end of the range of law firm sizes because many firms could be in that group. The risk arises with scatter plots or with detailed tables of values where maximum values are displayed alongside other clues.
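The point about group size can be checked before publishing. The following is a minimal sketch in Python, not a method taken from the survey itself: it counts respondents per firm-size band and flags any band with too few members to hide in. The band boundaries, the `min_group` threshold of 3, and the function names are all assumptions chosen for illustration.

```python
from collections import Counter


def size_band(lawyers):
    """Assign a firm to a size band (band boundaries are hypothetical)."""
    if lawyers < 100:
        return "Under 100"
    if lawyers < 1000:
        return "100 to 999"
    if lawyers < 3000:
        return "1,000 to 2,999"
    return "3,000 or more"


def risky_bands(lawyer_counts, min_group=3):
    """Return the bands that contain fewer than min_group respondents.

    A band with only one or two firms offers little cover, so any
    value plotted for it is a candidate for triangulation.
    """
    counts = Counter(size_band(n) for n in lawyer_counts)
    return sorted(band for band, c in counts.items() if c < min_group)


# Made-up firm sizes: four small firms and one very large one.
print(risky_bands([40, 60, 80, 90, 3300]))
```

With these made-up numbers, the small firms shelter one another while the single very large firm stands alone, which is the asymmetry described above.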

One way to protect against inadvertent disclosure is to drop the highest total comp value from the plot or table, or to drop high values to the point where no one can confidently suss out the respondent. You might call out the highest value in the text but not associate it with the values it is plotted against in the chart. Another method is to jitter the points at the high end so that the reader can’t precisely match a point to a value on the axis. Or the survey report could combine metropolitan areas into several categories, such as “Huge,” “Large,” “Medium size,” and so forth. Another technique that I have used is to create a category at the top end, such as “Firms with more than 3,000 lawyers,” which then obfuscates which firms correspond to which value on the other axis.
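Three of these protections (dropping the top value, jittering the high end, and creating a top-end category) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 3,000-lawyer threshold used as a default, and the jitter range are hypothetical choices, not settings from any actual survey.

```python
import random


def drop_top(values, n=1):
    """Return the values in ascending order without the n largest."""
    ordered = sorted(values)
    return ordered[:-n] if n > 0 else ordered


def jitter_high_end(values, cutoff, max_shift, seed=None):
    """Add uniform random noise to values at or above cutoff.

    Values below the cutoff are returned unchanged. The seed is
    optional and only makes the noise reproducible.
    """
    rng = random.Random(seed)
    return [
        v + rng.uniform(-max_shift, max_shift) if v >= cutoff else v
        for v in values
    ]


def top_code(lawyer_count, threshold=3000):
    """Replace an exact firm size above the threshold with a category label."""
    if lawyer_count > threshold:
        return f"More than {threshold:,} lawyers"
    return f"{lawyer_count:,}"


# Made-up firm sizes for illustration.
sizes = [450, 1200, 3300]
print([top_code(s) for s in sizes])
print(drop_top(sizes))
print(jitter_high_end(sizes, cutoff=3000, max_shift=200, seed=1))
```

Jitter only helps if the shift is large relative to the gap between the top point and its nearest neighbor. If the largest firm sits far from all the others, dropping it or folding it into a top-end category is likely the safer choice.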

This divination of confidential information by assembling data from two or three perspectives also strengthens the argument for not disclosing the names of law firms (let alone individuals) that have entrusted the survey with their personal, highly private information. Citing participants’ firms makes the detective work of triangulation much easier.