An Automated Approach for Complementing Ad Blockers’ Blacklists

Open access


Privacy on the Web has become a major concern, resulting in the popular use of various tools for blocking tracking services. Most of these tools rely on manually maintained blacklists, which need to be kept up to date to protect Web users’ privacy efficiently. Keeping pace with today’s quickly evolving advertisement and analytics landscape is challenging. To support blacklist maintainers in this task, we identify a set of Web traffic features for identifying privacy-intrusive services. Based on these features, we develop an automatic approach that learns the properties of advertisement and analytics services listed by existing blacklists and proposes new services for inclusion on blacklists. We evaluate our technique on real traffic traces of a campus network and find on the order of 200 new privacy-intrusive Web services that are not listed by the most popular Firefox plug-in, Adblock Plus. The proposed Web traffic features are easy to derive, allowing a distributed implementation of our approach.
