Associate Editor: Sarah Nadi, Technische Universität Darmstadt, Germany (@sarahnadi)
How do we know that the apps on our mobile devices actually access and collect the private information we are told they do? This question is particularly important for mobile devices because their many sensors can produce private information. Typically, end users can read the privacy policy provided by an app’s publisher to get details on what private information the app collects. Even so, it is difficult to verify that the app’s code does indeed adhere to the promises made in the policy. This is an important problem not only for end users who care about their right to privacy, but also for developers who have moral and legal obligations to be honest about their code.
In order to aid developers and end users in answering these questions, we have created an approach that connects the natural language used in privacy policies with the code used to access sensitive information on Android devices [4]. This connection, or mapping, allows for a fully-automated violation detection process that can check for consistency between a compiled Android application and its corresponding natural language privacy policy.
Privacy Policies
If you look for an app on the Google Play store, you’ll commonly find a link to a legal document disclosing the private information that is accessed or collected through the app. Perhaps the biggest hindrance in understanding and analyzing these privacy policies is their lack of a canonical format. Privacy policies exist in all lengths and levels of detail, yet under United States law, they must all provide the end user with enough information to be able to make an informed decision on the app’s access to their private information [3].
Sensors and Code
As mentioned above, mobile devices often provide access to various sensors, including GPS, Bluetooth, cameras, networking devices, and many others. For an app’s code to access data from these sensors, it must invoke methods from an application programming interface (API). On the Android operating system, this means invoking the appropriate methods, such as android.location.LocationManager.getLastKnownLocation(), directly in the app’s code. For an app to be consistent with its privacy policy, these invocations must align with what the policy discloses.
Bridging the Gap
For our approach, we created associations between the API methods used for accessing private data and the natural language used in privacy policies to describe that data.
First, we used the popular crowd-sourcing tool, Amazon Mechanical Turk, to identify commonly used phrases in privacy policies that describe information that can be produced by Android’s API. The tasks involved users reading through short excerpts from a set of 50 random privacy policies and annotating the phrases used to describe information that was collected. For example, “IP address”, “location”, and “device identifier” were among the most frequently found phrases. The resulting privacy policy lexicon represented the general language used in privacy policies when referencing sensitive data.
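To make the aggregation step concrete, here is a minimal sketch of how annotated phrases might be combined into a lexicon. It is an illustration under assumptions, not the procedure we actually followed: the class name LexiconBuilder, the method buildLexicon, and the minAnnotators agreement threshold are all hypothetical and do not come from the study.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: combine crowd annotations into a phrase lexicon. */
class LexiconBuilder {

    /**
     * Keeps every phrase that at least minAnnotators distinct annotators marked.
     * Each element of annotationsPerAnnotator holds the phrases one annotator marked.
     * The agreement threshold is an assumption made for this example.
     */
    static Set<String> buildLexicon(List<Set<String>> annotationsPerAnnotator,
                                    int minAnnotators) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> annotations : annotationsPerAnnotator) {
            // Normalize case and whitespace so "Location " and "location" match,
            // and count each phrase at most once per annotator.
            Set<String> normalized = new HashSet<>();
            for (String phrase : annotations) {
                normalized.add(phrase.trim().toLowerCase(Locale.ROOT));
            }
            for (String phrase : normalized) {
                counts.merge(phrase, 1, Integer::sum);
            }
        }
        Set<String> lexicon = new TreeSet<>();
        counts.forEach((phrase, n) -> {
            if (n >= minAnnotators) {
                lexicon.add(phrase);
            }
        });
        return lexicon;
    }

    public static void main(String[] args) {
        // Made-up annotations from three annotators.
        List<Set<String>> annotations = List.of(
                Set.of("IP address", "Location"),
                Set.of("location", "device identifier"),
                Set.of("ip address", "cookies"));
        System.out.println(buildLexicon(annotations, 2));
    }
}
```

Requiring agreement between annotators is one simple way to filter out idiosyncratic annotations; other aggregation rules would work equally well in this sketch.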
Next, we used a similar approach to identify words descriptive of the data produced from all of the publicly-accessible API methods that are sources [2] of private information. Tasks for this portion consisted of individual methods with their descriptions from the API documentation. Users annotated phrases in the description that described the information being produced by the method. This created a natural language representation of the methods’ data to which we could associate phrases from the privacy policy lexicon. The result was a many-to-many mapping of 154 methods to 76 phrases.
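A many-to-many mapping like this can be pictured as two indexes kept in sync, one from methods to phrases and one from phrases to methods. The sketch below is only an illustration: the class ApiPhraseMapping and its methods are hypothetical names, and the associations in main are examples rather than entries taken from the published mapping.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of a many-to-many mapping between API methods and policy phrases. */
class ApiPhraseMapping {
    // Method name -> phrases describing the data it produces.
    private final Map<String, Set<String>> phrasesByMethod = new TreeMap<>();
    // Phrase -> methods that produce that kind of data.
    private final Map<String, Set<String>> methodsByPhrase = new TreeMap<>();

    /** Records one association in both directions. */
    void associate(String method, String phrase) {
        phrasesByMethod.computeIfAbsent(method, k -> new TreeSet<>()).add(phrase);
        methodsByPhrase.computeIfAbsent(phrase, k -> new TreeSet<>()).add(method);
    }

    /** Phrases associated with a method, or an empty set if the method is unmapped. */
    Set<String> phrasesFor(String method) {
        return phrasesByMethod.getOrDefault(method, Set.of());
    }

    /** Methods associated with a phrase, or an empty set if the phrase is unmapped. */
    Set<String> methodsFor(String phrase) {
        return methodsByPhrase.getOrDefault(phrase, Set.of());
    }

    public static void main(String[] args) {
        ApiPhraseMapping mapping = new ApiPhraseMapping();
        String lastKnown = "android.location.LocationManager.getLastKnownLocation";
        // Illustrative associations only, not the real mapping data.
        mapping.associate(lastKnown, "location");
        mapping.associate(lastKnown, "geographic location");
        mapping.associate("android.location.Location.getLatitude", "location");

        System.out.println(mapping.phrasesFor(lastKnown));
        System.out.println(mapping.methodsFor("location"));
    }
}
```

Keeping both directions makes it cheap to ask either question: which phrases would disclose a given method call, and which method calls a given phrase covers.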
Detecting Violations
The resulting mapping between API methods and the language used in privacy policies made violation detection possible. We used the information flow analysis tool FlowDroid [1] to detect API invocations that produce sensitive information and then relay it to the network. We considered such invocations probable instances of data collection. If such a method invocation did not have a corresponding phrase in the app’s privacy policy, it was flagged as a potential privacy policy violation.
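The flagging rule can be sketched in a few lines. This is a simplified illustration under stated assumptions, not our implementation: the set leakingMethods stands in for the results of an information flow analysis (no FlowDroid API is called here), phrases are matched by plain case-insensitive substring search, and the names ViolationCheck and potentialViolations are hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the consistency check between code and policy. */
class ViolationCheck {

    /**
     * Returns the leaking methods for which none of the mapped phrases
     * occurs in the policy text. Methods without any mapped phrase are
     * skipped, since there is nothing to compare them against.
     */
    static Set<String> potentialViolations(Set<String> leakingMethods,
                                           Map<String, Set<String>> phrasesByMethod,
                                           String policyText) {
        String policy = policyText.toLowerCase(Locale.ROOT);
        Set<String> flagged = new TreeSet<>();
        for (String method : leakingMethods) {
            Set<String> phrases = phrasesByMethod.get(method);
            if (phrases == null) {
                continue; // unmapped method: no basis for a verdict
            }
            // Assumption: substring matching is enough to count as disclosure.
            boolean disclosed = phrases.stream()
                    .anyMatch(p -> policy.contains(p.toLowerCase(Locale.ROOT)));
            if (!disclosed) {
                flagged.add(method);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Illustrative mapping entries and policy text, not real study data.
        Map<String, Set<String>> mapping = Map.of(
                "android.location.LocationManager.getLastKnownLocation",
                Set.of("location"),
                "android.telephony.TelephonyManager.getDeviceId",
                Set.of("device identifier"));
        Set<String> leaking = Set.of(
                "android.location.LocationManager.getLastKnownLocation",
                "android.telephony.TelephonyManager.getDeviceId");
        String policy = "We collect your location to show nearby results.";

        // Only getDeviceId is flagged: the policy never mentions a device identifier.
        System.out.println(potentialViolations(leaking, mapping, policy));
    }
}
```

A real checker would need more careful phrase matching than a substring search, but the structure of the decision stays the same: a sensitive flow with no matching disclosure is reported.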
Using the above technique, we were able to discover 341 violations among the top 477 Android applications. We believe this points to the lack of a policy verification system for developers and end users alike.
Implications for Developers
Based on our results, we believe that this information and framework can be used to aid developers in ensuring consistency for their own privacy policies. To this end, we are extending our work with an IDE plugin to aid developers in consistency verification as well as a web-based tool for checking compiled apps against their policies. We believe that such tools could be invaluable especially to smaller development teams that may not have the legal resources available to more established development firms. Ultimately, access to such tools could lead to not only a better development experience, but a better product for the end user.
References
[1] S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel, J. Klein, Y. Le Traon, D. Octeau, and P. McDaniel. FlowDroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. In 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2014.
[2] S. Rasthofer, S. Arzt, and E. Bodden. A machine-learning approach for classifying and categorizing Android sources and sinks. In Network and Distributed System Security Symposium, 2014.
[3] J. R. Reidenberg, T. D. Breaux, L. F. Cranor, B. French, A. Grannis, J. T. Graves, F. Liu, A. M. McDonald, T. B. Norton, R. Ramanath, et al. Disagreeable privacy policies: Mismatches between meaning and users’ understanding. Berkeley Tech. LJ 30 (2014): 39.
[4] R. Slavin, X. Wang, M. Hosseini, W. Hester, R. Krishnan, J. Bhatia, T. D. Breaux, and J. Niu. Toward a framework for detecting privacy policy violation in Android application code. In 38th ACM/IEEE International Conference on Software Engineering, 2016, Austin, Texas.