Janith Weerasinghe, Kediel Morales and Rachel Greenstadt
Recent studies have shown that machine learning can identify individuals with mental illnesses by analyzing their social media posts. Topics and words related to mental health are among the top predictors. These findings have implications for early detection of mental illnesses, but they also raise numerous privacy concerns. To fully evaluate the implications for privacy, we analyze the performance of different machine learning models in the absence of tweets that talk about mental illnesses. Our results show that machine learning can make predictions even when users do not actively talk about their mental illness. To understand what makes these predictions possible, we analyze bag-of-words, word clusters, part-of-speech n-gram features, and topic models, both to interpret the machine learning model and to discover language patterns that differentiate individuals with mental illnesses from a control group. This analysis confirmed some known language patterns and uncovered several new ones. We then discuss possible applications of machine learning to identify mental illnesses, the feasibility of such applications, the associated privacy implications, and the feasibility of potential mitigations.
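To make the feature-analysis idea concrete, here is a minimal sketch of a bag-of-words linear classifier whose learned weights can be inspected to see which words drive a prediction. The toy tweets and labels below are entirely hypothetical and are not the study's dataset, features, or model:

```python
from collections import Counter

# Hypothetical toy data -- not the study's dataset.
# Label 1 marks the (simulated) mental-illness group, 0 the control.
train = [
    ("cant sleep again tonight so tired of everything", 1),
    ("no energy stayed home alone all weekend", 1),
    ("great workout this morning then brunch with friends", 0),
    ("so excited for the concert with friends tonight", 0),
]

def featurize(text):
    """Bag-of-words: map a tweet to its word counts."""
    return Counter(text.split())

def train_perceptron(docs, epochs=10):
    """Vanilla perceptron over bag-of-words features."""
    w = Counter()
    for _ in range(epochs):
        for text, y in docs:
            x = featurize(text)
            pred = 1 if sum(w[t] * c for t, c in x.items()) > 0 else 0
            if pred != y:  # mistake-driven update
                for t, c in x.items():
                    w[t] += (y - pred) * c
    return w

def predict(w, text):
    return 1 if sum(w[t] * c for t, c in featurize(text).items()) > 0 else 0

w = train_perceptron(train)
print(predict(w, "cant sleep no energy"))  # 1
```

Inspecting `w.most_common()` then surfaces the highest-weighted words, mirroring in miniature the paper's analysis of which features make the predictions possible.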
We study both the practical and theoretical efficiency of private information retrieval (PIR) protocols in a model wherein several untrusted servers work to obliviously service remote clients’ requests for data and yet no pair of servers colludes in a bid to violate said obliviousness. In exchange for such a strong security assumption, we obtain new PIR protocols exhibiting remarkable efficiency with respect to every cost metric—download, upload, computation, and round complexity—typically considered in the PIR literature.
The new constructions extend a multiserver PIR protocol of Shah, Rashmi, and Ramchandran (ISIT 2014), which exhibits a remarkable property of its own: to fetch a b-bit record from a collection of r such records, the client need only download b + 1 bits total. We find that allowing “a bit more” download (and optionally introducing computational assumptions) yields a family of protocols offering very attractive trade-offs. In addition to Shah et al.’s protocol, this family includes as special cases (2-server instances of) the seminal protocol of Chor, Goldreich, Kushilevitz, and Sudan (FOCS 1995) and the recent DPF-based protocol of Boyle, Gilboa, and Ishai (CCS 2016). An implicit “folklore” axiom that dogmatically permeates the research literature on multiserver PIR posits that the latter protocols are the “most efficient” protocols possible in the perfectly and computationally private settings, respectively. Yet our findings soundly refute this supposed axiom: These special cases are (by far) the least performant representatives of our family, with essentially all other parameter settings yielding instances that are significantly faster.
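As background, the perfectly private 2-server protocol of Chor et al. admits a very short sketch. The simplified version below retrieves a single small record by XOR; the actual protocols discussed here operate over longer bit-strings and richer query structures:

```python
import secrets

def make_queries(n, i):
    """Client: a uniformly random bit-vector for server A, and the
    same vector with bit i flipped for server B. Each vector alone
    is uniform, so a non-colluding server learns nothing about i."""
    qa = [secrets.randbelow(2) for _ in range(n)]
    qb = qa[:]
    qb[i] ^= 1
    return qa, qb

def answer(db, q):
    """Server: XOR of the records selected by the query bits."""
    out = 0
    for record, bit in zip(db, q):
        if bit:
            out ^= record
    return out

db = [0b1011, 0b0110, 0b1110, 0b0001]  # four 4-bit records
qa, qb = make_queries(len(db), 2)
record = answer(db, qa) ^ answer(db, qb)  # all terms cancel except db[2]
print(bin(record))  # 0b1110
```

The two answers differ in exactly one selected record, so their XOR recovers the requested entry while each server sees only a uniformly random query.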
Geoffrey Alexander, Antonio M. Espinoza and Jedidiah R. Crandall
We present a novel attack for detecting the presence of an active TCP connection between a remote Linux server and an arbitrary client machine. The attack takes advantage of side-channels present in the Linux kernel’s handling of the values used to populate an IPv4 packet’s IPID field and applies to kernel versions 4.0 and higher. We implement and test this attack and evaluate its real-world effectiveness and performance when used on active connections to popular web servers. Our evaluation shows that the attack is capable of correctly detecting the IP-port 4-tuple representing an active TCP connection in 84% of our mock attacks. We also demonstrate how the attack can be used by the middle onion router in a Tor circuit to test whether a given client is connected to the entry guard associated with a given circuit.
In addition, we discuss the practical issues an attacker would face when scaling the attack to the real world, as well as possible mitigations. Our attack does not exhaust any global resource, and therefore challenges the notion that there is a direct one-to-one correspondence between shared, limited resources and non-trivial network side-channels. This means that simply enumerating global shared resources and considering the ways in which they can be exhausted will not suffice to certify a kernel TCP/IP network stack as free of side channels that pose privacy risks.
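The core intuition behind such counter-based side channels can be illustrated with a toy simulation. This is not the kernel's actual IPID logic (which hashes connections into per-bucket counters and is more subtle to observe); it only shows how bracketing probes around a shared counter can reveal another party's traffic:

```python
class SharedCounter:
    """Toy stand-in for a kernel packet-ID counter shared across
    flows; the real Linux implementation hashes connections into
    counter buckets, which this sketch ignores."""
    def __init__(self):
        self.value = 0

    def next_id(self):
        self.value += 1
        return self.value

def probe_gap(counter, victim_packets):
    """Attacker bracket: two probe packets, counting how many IDs
    were consumed in between by other traffic on the same counter."""
    first = counter.next_id()
    for _ in range(victim_packets):  # victim traffic to be detected
        counter.next_id()
    second = counter.next_id()
    return second - first - 1

c = SharedCounter()
idle = probe_gap(c, 0)    # no connection activity -> gap of 0
active = probe_gap(c, 3)  # victim sent 3 packets -> gap of 3
print(idle, active)  # 0 3
```

A nonzero gap between an attacker's own probes signals that some other flow shares the counter, which is the observable the attack builds on.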
Qiaozhi Wang, Hao Xue, Fengjun Li, Dongwon Lee and Bo Luo
With the growing popularity of online social networks, a large amount of private or sensitive information has been posted online. In particular, studies show that users sometimes reveal too much information or unintentionally release regretful messages, especially when they are careless, emotional, or unaware of privacy risks. As such, there is a great need to identify potentially sensitive online content so that users can be alerted before posting it. In this paper, we propose a context-aware, text-based quantitative model for private information assessment, namely PrivScore, which is expected to serve as the foundation of a privacy-leakage alerting mechanism. We first solicit diverse opinions on the sensitivity of private information from crowdsourcing workers, and examine the responses to discover a perceptual model behind the consensus and disagreements. We then develop a computational scheme using deep neural networks to compute a context-free PrivScore (i.e., the “consensus” privacy score among average users). Finally, we integrate tweet histories, topic preferences, and social contexts to generate a personalized, context-aware PrivScore. This privacy scoring mechanism could be employed to identify potentially private messages and alert users to think again before posting them to OSNs.
Stylometric authorship attribution aims to identify an anonymous or disputed document’s author by examining its writing style. The development of powerful machine learning-based stylometric authorship attribution methods presents a serious privacy threat for individuals such as journalists and activists who wish to publish anonymously. Researchers have proposed several authorship obfuscation approaches that try to make appropriate changes (e.g., word/phrase replacements) to evade attribution while preserving semantics. Unfortunately, existing authorship obfuscation approaches are lacking because they either require some manual effort, require significant training data, or do not work for long documents. To address these limitations, we propose a genetic algorithm-based random search framework called Mutant-X which can automatically obfuscate text to successfully evade attribution while keeping the semantics of the obfuscated text similar to the original text. Specifically, Mutant-X sequentially makes changes to the text using mutation and crossover techniques while being guided by a fitness function that takes into account both attribution probability and semantic relevance. While Mutant-X requires black-box knowledge of the adversary’s classifier, it does not require any additional training data and also works on documents of any length. We evaluate Mutant-X against a variety of authorship attribution methods on two different text corpora. Our results show that Mutant-X can decrease the accuracy of state-of-the-art authorship attribution methods by as much as 64% while preserving the semantics much better than existing automated authorship obfuscation approaches. While Mutant-X advances the state-of-the-art in automated authorship obfuscation, we find that it does not generalize to a stronger threat model where the adversary uses a different attribution classifier than what Mutant-X assumes.
Our findings motivate future research on improving the generalizability (or transferability) of automated authorship obfuscation approaches.
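The fitness-guided search loop at the heart of such an obfuscator can be sketched as follows. The synonym table and the two scoring functions below are hypothetical stand-ins for the word-embedding substitutions and the black-box attribution classifier the real Mutant-X uses, and crossover is omitted for brevity:

```python
import random

# Hypothetical substitution table; Mutant-X derives replacements
# from word embeddings instead.
SYNONYMS = {"big": ["large", "huge"], "quick": ["fast", "rapid"],
            "said": ["stated", "remarked"]}

def attribution_prob(words):
    """Stand-in for P(author | text): this 'author' overuses tics."""
    tics = {"big", "quick", "said"}
    return sum(w in tics for w in words) / len(words)

def semantic_sim(orig, cand):
    """Stand-in for embedding similarity: fraction of unchanged words."""
    return sum(a == b for a, b in zip(orig, cand)) / len(orig)

def mutate(words):
    """Replace one replaceable word with a random synonym."""
    cand = words[:]
    idxs = [i for i, w in enumerate(cand) if w in SYNONYMS]
    if idxs:
        i = random.choice(idxs)
        cand[i] = random.choice(SYNONYMS[cand[i]])
    return cand

def obfuscate(text, steps=50, alpha=0.75):
    """Random search: keep a mutation only if it improves the
    weighted mix of attribution evasion and semantic similarity."""
    orig = text.split()
    def fitness(c):
        return (alpha * (1 - attribution_prob(c))
                + (1 - alpha) * semantic_sim(orig, c))
    best = orig[:]
    for _ in range(steps):
        cand = mutate(best)
        if fitness(cand) > fitness(best):
            best = cand
    return " ".join(best)

out = obfuscate("he said the quick fix was a big win")
print(out)  # every author 'tic' word has been replaced
```

Because each accepted mutation must raise the combined fitness, the search lowers attribution probability while bounding how far the text drifts from the original.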
Gerry Wan, Aaron Johnson, Ryan Wails, Sameer Wagh and Prateek Mittal
The popularity of Tor has made it an attractive target for a variety of deanonymization and fingerprinting attacks. Location-based path selection algorithms have been proposed as a countermeasure to defend against such attacks. However, adversaries can exploit the location-awareness of these algorithms by strategically placing relays in locations that increase their chances of being selected as a client’s guard. Being chosen as a guard facilitates website fingerprinting and traffic correlation attacks over extended time periods. In this work, we rigorously define and analyze the guard placement attack. We present novel guard placement attacks and show that three state-of-the-art path selection algorithms—Counter-RAPTOR, DeNASA, and LASTor—are vulnerable to these attacks, overcoming defenses considered by all three systems. For instance, in one attack, we show that an adversary contributing only 0.216% of Tor’s total bandwidth can attain an average selection probability of 18.22%, 84× higher than it would be under Tor currently. Our findings indicate that existing location-based path selection algorithms allow guards to achieve disproportionately high selection probabilities relative to the cost required to run the guard. Finally, we propose and evaluate a generic defense mechanism that provably defends any path selection algorithm against guard placement attacks. We run our defense mechanism on each of the three path selection algorithms, and find that our mechanism significantly enhances the security of these algorithms against guard placement attacks with only minimal impact on the goals or performance of the original algorithms.
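A generic defense of this flavor can be sketched as capping each relay's selection probability at a multiple of its bandwidth-proportional share. The constants and the renormalization step below are illustrative only and differ from the paper's actual mechanism:

```python
def bandwidth_probs(bw):
    """Selection probabilities proportional to relay bandwidth."""
    total = sum(bw)
    return [b / total for b in bw]

def clip_selection(probs, bw, c=2.0):
    """Cap each relay at c times its bandwidth-proportional share,
    then renormalize. Sketch only: the renormalization here can
    slightly re-inflate capped relays, which a real mechanism must
    handle, and the paper's defense differs in its details."""
    base = bandwidth_probs(bw)
    capped = [min(p, c * b) for p, b in zip(probs, base)]
    total = sum(capped)
    return [p / total for p in capped]

bw = [100, 100, 100, 2]                # relay bandwidths
loc_probs = [0.27, 0.27, 0.27, 0.19]   # location-aware selection favors relay 3
safe = clip_selection(loc_probs, bw)
print(round(safe[3], 3))               # 0.016, far below the original 0.19
```

The cap ties a relay's maximum selection probability to the bandwidth it contributes, so an attacker can no longer gain outsized guard probability at low cost.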
Jeremy Martin, Douglas Alpuche, Kristina Bodeman, Lamont Brown, Ellis Fenske, Lucas Foppe, Travis Mayberry, Erik Rye, Brandon Sipes and Sam Teplov
We investigate Apple’s Bluetooth Low Energy (BLE) Continuity protocol, designed to support interoperability and communication between iOS and macOS devices, and show that the price for this seamless experience is leakage of identifying information and behavioral data to passive adversaries. First, we reverse engineer numerous Continuity protocol message types and identify data fields that are transmitted unencrypted. We show that Continuity messages are broadcast over BLE in response to actions such as locking and unlocking a device’s screen, copying and pasting information, making and accepting phone calls, and tapping the screen while it is unlocked. Laboratory experiments reveal a significant flaw in the most recent versions of macOS that defeats BLE Media Access Control (MAC) address randomization entirely by causing the public MAC address to be broadcast. We demonstrate that the format and content of Continuity messages can be used to fingerprint the type and Operating System (OS) version of a device, as well as behaviorally profile users. Finally, we show that predictable sequence numbers in these frames can allow an adversary to track Apple devices across space and time, defeating existing anti-tracking techniques such as MAC address randomization.
Benjamin Hilprecht, Martin Härterich and Daniel Bernau
We present two information leakage attacks that outperform previous work on membership inference against generative models. The first attack allows membership inference without assumptions on the type of the generative model. Contrary to previous evaluation metrics for generative models, like Kernel Density Estimation, it only considers samples of the model which are close to training data records. The second attack specifically targets Variational Autoencoders, achieving high membership inference accuracy. Furthermore, previous work mostly considers membership inference adversaries who perform single-record membership inference. We argue for considering regulatory actors who perform set membership inference to identify the use of specific datasets for training. The attacks are evaluated on two generative model architectures, Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), trained on standard image datasets. Our results show that the two attacks yield success rates superior to previous work on most datasets while at the same time making only very mild assumptions. We envision the two attacks, in combination with our formalization of membership inference attack types, as especially useful for enforcing data privacy standards and automatically assessing model quality in machine-learning-as-a-service setups. In practice, our work motivates the use of GANs, since they prove less vulnerable to information leakage attacks while producing detailed samples.
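The intuition behind the first attack, scoring a candidate record by how much generated probability mass falls near it, can be sketched in one dimension. The Gaussian "model" below is a stand-in for an actual GAN or VAE, and the candidates are hypothetical:

```python
import random

random.seed(0)

def generate_samples(n):
    """Stand-in generative model: in a real attack these would be
    samples drawn from the target GAN or VAE."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def membership_score(candidate, samples, eps=0.05):
    """Fraction of generated samples within eps of the candidate.
    Training members tend to score higher because generative models
    concentrate probability mass near their training data."""
    return sum(abs(s - candidate) < eps for s in samples) / len(samples)

samples = generate_samples(10_000)
member = membership_score(0.0, samples)      # near the model's mode
non_member = membership_score(4.0, samples)  # far out in the tail
print(member > non_member)  # True
```

Summing such scores over all records of a candidate dataset yields the set membership test advocated for regulatory actors.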
Jonathan Rusert, Osama Khalid, Dat Hong, Zubair Shafiq and Padmini Srinivasan
There is a natural tension between the desire to share information and the desire to keep sensitive information private on online social media. Privacy-seeking social media users may try to keep their location private by avoiding mentions of location-revealing words such as points of interest (POIs), believing this to be enough. In this paper, we show that it is possible to uncover the location of a social media user’s post even when it is not geotagged and does not contain any POI information. Our proposed approach Jasoos achieves this by exploiting the shared vocabulary between users who reveal their location and those who do not. To this end, Jasoos uses a variant of the Naive Bayes algorithm to identify location-revealing words or hashtags from both temporal and atemporal perspectives. Our evaluation using tweets collected from four different states in the United States shows that Jasoos can accurately infer the locations of close to half a million tweets corresponding to more than 20,000 distinct users (i.e., more than 50% of the test users) from the four states. Our work demonstrates that location privacy leaks do occur despite due precautions by a privacy-conscious user. We design and evaluate countermeasures based on Jasoos to mitigate location privacy leaks.
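Since Jasoos builds on Naive Bayes, its core idea can be sketched with a toy model. The two-state corpus below is hypothetical, and the real system additionally exploits temporal patterns in word usage:

```python
import math
from collections import Counter

# Hypothetical geotagged tweets from two states; the model is learned
# from users who do reveal their location and applied to those who do not.
geotagged = [
    ("cold morning on the lakefront heavy snow again", "IL"),
    ("snow delays on the blue line this morning", "IL"),
    ("surfing before work then tacos on the beach", "CA"),
    ("another sunny beach day traffic on the 405", "CA"),
]

def train(docs):
    counts = {}
    for text, state in docs:
        counts.setdefault(state, Counter()).update(text.split())
    return counts

def vocab_of(counts):
    v = set()
    for c in counts.values():
        v |= set(c)
    return v

def location_revealing(counts, word, ratio=3.0):
    """A word is location-revealing if it is much more likely under
    one state's smoothed model than under the others."""
    n_vocab = len(vocab_of(counts))
    probs = [(c[word] + 1) / (sum(c.values()) + n_vocab)
             for c in counts.values()]
    return max(probs) / min(probs) >= ratio

def infer(counts, text):
    """Naive Bayes with add-one smoothing and uniform priors."""
    n_vocab = len(vocab_of(counts))
    best, best_lp = None, float("-inf")
    for state, c in counts.items():
        denom = sum(c.values()) + n_vocab
        lp = sum(math.log((c[w] + 1) / denom) for w in text.split())
        if lp > best_lp:
            best, best_lp = state, lp
    return best

counts = train(geotagged)
print(infer(counts, "snow on the lakefront"))  # IL
```

Note that ordinary words like "snow" and "lakefront" suffice for the inference even though neither is a POI, which is exactly the leak the paper demonstrates.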