The decision to add a citizenship question to the 2020 census, a controversial move the Supreme Court has just agreed to review, is not the only issue facing the U.S. Census Bureau. The bureau has also adopted a new math-based method to protect the privacy of personal information collected during the upcoming census, after internal tests showed a worrisome vulnerability in its privacy protections for the previous census.
The new methodology, called differential privacy, was described by a key Census Bureau official during a February 16 AAAS news briefing on the science behind the census. John Abowd, the bureau’s chief scientist and associate director for research and methodology, said new approaches were called for after tests by the bureau on 2010 census data showed a higher risk than expected that outsiders potentially could use some of the agency’s thousands of information tables to “reconstruct” someone’s personal data.
The concern has been driven by the “data reconstruction theorem,” first published by two researchers in 2003, that lays out a method using large amounts of data from census summary tables to produce records on individuals.
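To give a sense of how summary tables can reveal individual records, here is a minimal Python sketch. It is a brute-force toy, not the method of the 2003 researchers or of the Census Bureau, and every number in `PUBLISHED` is invented for illustration: a hypothetical block of three people for which only a handful of summary statistics has been released.

```python
from itertools import combinations_with_replacement
from statistics import median

# Hypothetical published statistics for one small block (invented numbers).
PUBLISHED = {
    "total": 3,              # persons on the block
    "mean_age": 44,          # mean age of all persons
    "median_age": 36,        # median age of all persons
    "females": 2,            # number of females
    "mean_age_females": 33,  # mean age of females
}

def consistent(records):
    """Return True if a candidate set of (sex, age) records matches every published statistic."""
    ages = [age for _, age in records]
    # Cheapest check first: the mean fixes the sum of ages.
    if sum(ages) != PUBLISHED["mean_age"] * len(ages):
        return False
    female_ages = [age for sex, age in records if sex == "F"]
    return (
        len(female_ages) == PUBLISHED["females"]
        and sum(female_ages) == PUBLISHED["mean_age_females"] * len(female_ages)
        and median(ages) == PUBLISHED["median_age"]
    )

def reconstruct():
    """Enumerate every possible block of records and keep those that fit the tables."""
    candidates = [(sex, age) for sex in ("F", "M") for age in range(100)]
    return [
        combo
        for combo in combinations_with_replacement(candidates, PUBLISHED["total"])
        if consistent(combo)
    ]

if __name__ == "__main__":
    for solution in reconstruct():
        print(solution)
```

In this toy case only one set of records fits all five statistics, so the "aggregate" tables pin down each person's sex and age. Linking such records to names and addresses would still require an outside data source, which is the matching step Abowd describes.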
Abowd, reading from a statement posted on the Census Bureau website the night before the AAAS briefing, said that agency specialists had been able to match nearly half of the 308,745,538 people counted in the 2010 census to commercial and other online information that led to names and addresses.
“We know that the amount and accuracy of online personal information as well as the computational power to analyze that information continues to grow,” Abowd said. The accuracy of the reconstructed data was limited, however, and confirmation required access to confidential Census Bureau information. When the bureau checked its test findings against that confidential data, the number of identified persons fell from 138 million to 52 million, or 17% of the total population.
Still, that figure is troubling enough. As Abowd said in remarks for presentation at an afternoon symposium following the news briefing, “the data we are showing today aren’t the end of the story. They simply show that we cannot accept the status quo. We cannot presume that what worked a decade ago will work again in 2020.”
The Census Bureau has long introduced “noise” into its data to prevent “re-identification” of individuals and the compromise of their privacy. A primary tool is swapping – moving households from one location to another before tabulating data – which introduces uncertainty about whether any household allegedly re-identified from published data has been identified correctly. For example, as The New York Times has reported, the ethnicity of a couple living on the same island as the Statue of Liberty was swapped with that of another couple living elsewhere in New York.
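The idea behind swapping can be sketched in a few lines of Python. This is an illustration only: the record layout, the `swap_rate` parameter and the random pairing rule are all hypothetical, and the bureau's actual swapping rules are not described here.

```python
import random

def swap_locations(households, swap_rate, seed=0):
    """Return a copy of `households` in which randomly chosen pairs trade locations.

    Each household is a dict with a "block" key (its location) plus any
    other characteristics. Only the "block" values are exchanged.
    """
    rng = random.Random(seed)
    swapped = [dict(h) for h in households]  # leave the input untouched
    # Pick households to swap, each with probability `swap_rate`.
    chosen = [i for i in range(len(swapped)) if rng.random() < swap_rate]
    rng.shuffle(chosen)
    # Pair up the chosen households; an odd one out is left in place.
    for i, j in zip(chosen[::2], chosen[1::2]):
        swapped[i]["block"], swapped[j]["block"] = swapped[j]["block"], swapped[i]["block"]
    return swapped

if __name__ == "__main__":
    homes = [{"id": i, "block": f"B{i % 3}", "size": i + 1} for i in range(6)]
    for before, after in zip(homes, swap_locations(homes, swap_rate=0.5)):
        print(before["id"], before["block"], "->", after["block"])
```

Because households only trade places, the number of households in each block is unchanged, but the characteristics tabulated for a given block may now belong to a household that actually lives somewhere else.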
Differential privacy uses mathematical algorithms to predict the likelihood that sensitive information will “leak” from publicly available census data as varying amounts of “noise” are introduced. It produces a numerical value, called epsilon, that can be tweaked. Introduce more “noise” into the data and privacy is more robustly protected, but at some loss of accuracy. Introduce less noise and the accuracy of the data goes up, but privacy may be less secure.
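The trade-off can be seen in a small Python sketch of the Laplace mechanism, a textbook way to make a single count differentially private. It is an assumption made for illustration, not a description of the algorithm the Census Bureau will use, and the function names and the example count of 100 are hypothetical. In the standard formulation, a smaller epsilon means more noise and stronger privacy.

```python
import random

def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centered on zero.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return `true_count` plus Laplace noise with scale sensitivity / epsilon.

    A count changes by at most 1 when one person is added or removed,
    so the sensitivity defaults to 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)

def mean_abs_error(true_count, epsilon, trials=20000, seed=0):
    """Estimate the average absolute error of the noisy count by simulation."""
    rng = random.Random(seed)
    total = sum(abs(noisy_count(true_count, epsilon, rng) - true_count)
                for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: average error {mean_abs_error(100, eps):.2f}")
```

For a simple count the average error is roughly 1 divided by epsilon, so the three settings above are off by about 10, 1 and 0.1 persons respectively. That is the accuracy side of the trade-off; the privacy side is the guarantee that weakens as epsilon grows.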
Jerry Reiter, a professor of statistics at Duke University, said the process is like turning knobs as noise is introduced, allowing specialists at the Census Bureau to estimate and manage the risks of unacceptable data disclosures and set an acceptable level of risk.
“The methods used by the Census Bureau to protect confidentiality of census participants’ responses in earlier censuses no longer can be counted on to provide adequate protection for the 2020 census,” Reiter said.
Joe Salvo, director of the population division of the New York City Department of City Planning, said his department is “keeping an open mind” on the differential privacy regime being introduced by the Census Bureau. He said there is a concern that, for some areas of the city, the introduction of noise into the data “may be an issue.”
Salvo said that in the 2010 census there were some New York neighborhoods where housing units were incorrectly classified as vacant. The key to a successful census, he said, is to get adequate self-response rates from residents from the outset rather than having to do follow-up searches for non-responders.
With the prospect that a citizenship question may be included in the 2020 census – a step critics say will lead undocumented residents to avoid filling out the census forms and lead to an undercount – some backers of differential privacy say it will be more important than ever to offer assurances to vulnerable populations that the Census Bureau is doing all it can to protect the privacy of respondents.
[Associated image: US Census Bureau/Flickr CC BY 2.0]