Minhui Xue, Gabriel Magno, Evandro Cunha, Virgilio Almeida and Keith W. Ross
Due to the recent “Right to be Forgotten” (RTBF) ruling, for queries about an individual, Google and other search engines now delist links to web pages that contain “inadequate, irrelevant or no longer relevant, or excessive” information about that individual. In this paper we take a data-driven approach to study the RTBF in the traditional media outlets, its consequences, and its susceptibility to inference attacks. First, we do a content analysis on 283 known delisted UK media pages, using both manual investigation and Latent Dirichlet Allocation (LDA). We find that the strongest topic themes are violent crime, road accidents, drugs, murder, prostitution, financial misconduct, and sexual assault. Informed by this content analysis, we then show how a third party can discover delisted URLs along with the requesters’ names, thereby putting the efficacy of the RTBF for delisted media links in question. As a proof of concept, we perform an experiment that discovers two previously-unknown delisted URLs and their corresponding requesters. We also determine 80 requesters for the 283 known delisted media pages, and examine whether they suffer from the “Streisand effect,” a phenomenon whereby an attempt to hide a piece of information has the unintended consequence of publicizing the information more widely. To measure the presence (or lack of presence) of a Streisand effect, we develop novel metrics and methodology based on Google Trends and Twitter data. Finally, we carry out a demographic analysis of the 80 known requesters. We hope the results and observations in this paper can inform lawmakers as they refine RTBF laws in the future.
This study aims at constructing a microblog influence prediction model and revealing how the user, time, and content features of microblog entries about public health emergencies affect the influence of microblog entries. Microblog entries about the Ebola outbreak are selected as data sets. The BM25 latent Dirichlet allocation model (LDA-BM25) is used to extract topics from the microblog entries. A microblog influence prediction model is proposed by using the random forest method. Results reveal that the proposed model can predict the influence of microblog entries about public health emergencies with a precision rate reaching 88.8%. The individual features that play a role in the influence of microblog entries, as well as their influence tendencies are also analyzed. The proposed microblog influence prediction model consists of user, time, and content features. It makes up the deficiency that content features are often ignored by other microblog influence prediction models. The roles of the three features in the influence of microblog entries are also discussed.
The research question addressed here is whether the semantic value implicit in environmental terms in an activity description text string, can be translated into economic value for firms in the construction sector. We address this question using a relatively new applied statistical method called Latent Dirichlet Allocation (LDA). We first identify a satellite register of firms in construction sector that engage in some form of environmental work. From these we construct a vocabulary of meaningful words. Then, for each firm in turn on this satellite register we take its activity description text string and process this string with LDA. This softly-classifies the descriptions on the satellite register into just seven environmentally relevant topics. With this seven-topic classification we proceed to extract a statistically meaningful weight of evidence associated with environmental terms in each activity description. This weight is applied to the associated firm’s overall output value recorded on our national Business Register to arrive at a supply side estimate of the firm’s EGSS value. On this basis we find the EGSS estimate for construction in Ireland in 2013 is about EURO 229m. We contrast this estimate with estimates from other countries obtained by demand side methods and show it compares satisfactorily, thereby enhancing its credibility. Our method also has the advantage that it provides a breakdown of EGSS output by EU environmental classifications (CEPA/CReMA) as these align closely to discovered topics. We stress the success of this application of LDA relies greatly on our small vocabulary which is constructed directly from the satellite register.