## Abstract

Healthcare communication on Twitter is challenging because the space available in a tweet is limited, while the topics are often too sophisticated to be stated concisely. Comparing medical-terminology hashtags with lay-language hashtags, this paper explores the characteristics of healthcare hashtags using an entropy matrix derived from information theory. The entropy matrix comprises six components used for constructing a tweet and serves as a framework for structural analysis at the granularity of tweet composition. These granular components are image(s), text with semantic meaning, hashtag(s), @username(s), hyperlink, and unused space. The entropy matrix proposed in this paper contributes a new approach to visualizing the complexity level of hashtag collections. In addition, the calculated entropy could be an indicator of the diversity of a user’s choices across those tweet components. Furthermore, the visualizations (radar graph and scatterplot) illustrate the statistical structures and dynamics of the hashtag collections as measured by entropy. The results of this study demonstrate a manifest relationship between tweet composition and the number of retweets.

## 1 Introduction

Effective and efficient communication on healthcare topics is challenging for healthcare professionals because health-related topics involve communicating sophisticated and sometimes confusing messages. This is especially the case on the Twitter platform, where the messages (*also known as tweets*) are usually too short to make a point on healthcare-related topics. Composing a tweet involves choosing a combination of typical components, such as photos, video clips, and textual content, which may include hashtags, hyperlinks, and the @username “mention” function. There are two options for addressing the insufficient space in tweets: (1) use a hyperlink to direct the audience to a webpage where more space is available, or (2) use image(s) and/or video to enhance the tweet content. Either approach increases the complexity of the structure (*the way different components are organized*) of the tweet. Therefore, the more components a tweet contains, the more complex its structure appears. An orchestrated presentation of tweet content usually improves the usability, effectiveness, and perceived quality of a campaign message. Houts, Doak, Doak, and Loscalzo (2006) found that a well-crafted balance of words, numbers, images, and other illustrations can improve comprehension more than text alone. However, to the best of our knowledge, no study has examined how composing a tweet with diversified components influences its chances of gaining attention and being retweeted. Primarily adopted in engineering and computer science, Shannon’s entropic equation has been used to evaluate the level of predictability (Miller, 1953), redundancy (Hayes, 1993), and degree of randomness/complexity/uncertainty (Ritchie, 1986) in a well-defined system. The concept of entropy in information theory (Shannon, 1948) provides researchers with a means to examine a variety of combinations of typical components in a tweet.

Healthcare professionals face challenges when communicating campaign messages to the general public on Twitter, where the limited space (140-character limit) must be leveraged to deliver an efficient message. The trade-off is that assigning multiple hashtags to a tweet increases its chances of being found and retweeted. However, the opportunity cost (*the loss of potential gain from other alternatives when one choice is made*) associated with this choice deserves further consideration because hashtags inevitably consume part of the precious space. Would simpler tweets (i.e., pure text or a sole image) attract more attention, or would tweets with more hashtags or hyperlinks? In terms of tweet structure, this study compares whether healthcare information is communicated differently on Twitter using medical-terminology hashtags and lay-language hashtags. Therefore, we classified the healthcare hashtags into two main categories: (1) medical-terminology hashtags, which originate from Latin or Ancient Greek, and (2) lay-language hashtags for medical/healthcare terms. For instance, #glucose, #hypertension, and #influenza are medical-terminology hashtags, while #bloodsugar, #bloodpressure, and #flu are examples of lay-language hashtags. Some medical-terminology hashtags and lay-language hashtags have the same semantic meaning while others do not. As an example, glucose, a medical-terminology word meaning monosaccharide, is derived from the Latin word glucosium; in lay language, glucose is called blood sugar, although the term does not refer to cane sugar in human blood (they are different molecules). Nevertheless, glucose and blood sugar share the same semantic meaning.

Although the content of a healthcare message on Twitter is highly constrained by the character limit, users are creative in composing tweets by combining components such as text, hashtags, hyperlinks, and images/videos. Therefore, the primary motivation of this study is to investigate the creative variety of recombining different components in tweets in order to gain insights into tweeting/retweeting behavior. Additionally, the choice between medical-terminology hashtags and lay-language hashtags remains an under-researched topic in the area of efficient healthcare communication on Twitter.

This study explores the characteristics of tweets by comparing semantically similar healthcare hashtags and focuses on understanding the effect of tweet content diversity on the efficiency of healthcare communication. In this study we not only review the relevant studies that apply information theory to social science and social media but also introduce the concept of entropy from Claude Shannon’s information theory as a measure of structural complexity in a tweet. Using entropy as an indicator detecting the various ways to compose a tweet, we extend the entropy equation in Claude Shannon’s information theory to a multi-dimensional analysis tool for content structure comparison. In terms of research methodology, we conducted a comparative case study to examine the tweets associated with three pairs of semantically similar hashtags, namely #glucose versus #bloodsugar, #hypertension versus #bloodpressure, and #influenza versus #flu. We report the analysis results and conclude that the extended entropy equation has the potential to be an automatic tool for sensing tweet structure.

## 2 Related Works

### 2.1 Application of Shannon’s Information Theory in Social Science

Although Claude Shannon has been called the “Father of Information Theory” (Horgan, 2016) and credited with “single-handedly” developing classical information theory (Graham, 2002), Shannon’s predecessors in communication systems had been working on the nature of electronic signal transmission for decades before the birth of his information theory in 1948. For instance, Nyquist (1924) argued that the transmission rate of a signal is proportional to the logarithm of the number of signal levels in a unit duration. Hartley (1928) concluded that channel capacity is proportional to bandwidth and used the letter H to denote the amount of information associated with finite selections. Rice (1944) introduced the random process into communication studies. Shannon’s information theory in communication systems also has a direct relationship with disciplines of natural science, such as cryptography and statistical mechanics, and the genesis of the original concept of entropy can even be traced back to thermodynamics in the 19th century. Clausius (1867) was the first to give entropy its definition. Thereafter, Boltzmann (1877) gave entropy its statistical equation S = k·log W and developed his famous H theorem; it was also Boltzmann who assigned the symbol H to entropy. Boltzmann’s entropy equation deals with a thermodynamic situation called equilibrium, in which the microstates of the system have equal probability. In the meantime, Gibbs (1878) defined his entropy as the sum of the entropies of all the individual microstates in the system:
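In standard modern notation, which is assumed here rather than taken from the original typesetting, with k_{B} the Boltzmann constant and p_{i} the probability of microstate i, Gibbs entropy is commonly written as:

$$S = -k_{B} \sum_{i} p_{i} \ln p_{i}$$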

Unlike Boltzmann’s entropy as a function of the number of microstates, Gibbs entropy is a function of probabilities of microstates and his equation has direct connection to Shannon’s entropy equation:
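Writing p_{i} for the probability of the i-th symbol (a notation assumed here), Shannon’s entropy in bits is commonly written as:

$$H = -\sum_{i} p_{i} \log_{2} p_{i}$$

It has the same form as the Gibbs expression, apart from the constant factor and the base of the logarithm.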

The information theory developed by Claude E. Shannon was originally applied to communication systems of electronic signals and transmission channels, but it has since reached disciplines of natural science (physics, biology, chemistry, etc.) and sub-disciplines of computer science (data compression, source coding, digital imaging, etc.). Initially suggested by Weaver, attempts by scholars and researchers to apply Shannon’s information theory in studies of social phenomena, psychology, humanities, anthropology, economics, and education have expanded continually (Verdu, 1998). However, the distinction between these two directions of application should be clearly specified. Shannon’s information theory is made up of two parts. Part one is its twenty-three theorems on the nature of electronic signal transmission; these theorems have been applied only in disciplines of natural science and built the theoretical foundation of data compression principles in computer science. The other part is the mathematical method, also known as the entropic equation. It was the entropy equation (and only this method for calculating the value of entropy) that went beyond the scope of natural science and into the realm of the disciplines that analyze society and culture. Shannon was originally concerned with applying his theory to engineering communication and clearly considered applications of information theory outside of communication to be a problem (Tribus, 1983). He strongly preferred the term communication theory to information theory (Luce, 2003).

In his article for the New York Times, Johnson (2001) quoted Shannon’s comment on the bandwagon effect of his work, which had spread beyond the fields of communications engineering and computer science: “Information theory has perhaps ballooned to an importance beyond its actual accomplishments.” It is worth mentioning that in Shannon’s original ideas, the engineering aspect of communication was irrelevant to semantic meaning. On the other hand, Weaver (1953), coauthor of the book *Mathematical Theory of Communication*, was more concerned with information theory’s biological application to central nervous system phenomena and connected the concept of signal channel capacity to the capacity of the audience. Ever since, studies that apply information theory to social phenomena have been criticized because they focus not on the twenty-three theorems but on the entropic concept and the mathematical method.

The incompatibility between Shannon’s information theory and most of the social systems is caused by the fact that information theory requires elements of choice to be absolutely neutral and transmitted continuously in a long string. These initial conditions are rarely possible for the study of social phenomena and human behaviors. However, every organism on earth is, in the sense of thermodynamics, a negentropy system, as Brillouin (1953) called it. Therefore, the question arises, how can a society which was made up of all these living systems be seen as a closed system with positive and ever-increasing entropy? Popper (1963) was aware that the social problem cannot be condensed to well-isolated, stationary, and recurrent systems because these systems are very rare in nature and the modern society is not one of them. Hayek (1967) also argued that theories and techniques of investigation and interpretation of observed facts cannot help in complex phenomena of society. Prigogine and Stengers (1984) concluded and warned that making models of human situations was risky due to the incompleteness of information in complex systems. However, social media systems can be viewed as closed social systems and fulfill the assumptions of information theory.

### 2.2 Social Media as Closed Social Systems

The advent of social media has been a driving force for scholars and researchers to model the emerging digital platforms as closed systems simulating real-world society. The entropic method works well in closed systems, such as thermal, mechanical, and chemical systems. However, it encounters obstacles when applied to social (open) systems, in which the initial conditions do not matter and the equilibrium state is dynamic due to the input of matter or energy from outside. According to Bailey (1990), the crux of the problem of modeling complex social systems can be summarized as follows:

- Lack of an adequate definition of a social system.
- Lack of specification of the boundaries of the social system as a whole.
- Lack of a quantitative measure of system state, either at the macro (system) level or at the micro (component) level.
- Lack of justification of the isomorphism between the theoretical system model and the empirical, complex system in the real world.
- Lack of a defense of the determination and selection of a suitable set of explanatory variables out of the almost infinite number that could be identified in a complex social system.

Social media platforms largely resolve these problems. First, social media (e.g., Twitter) can be viewed as a closed system because, at any given time, the total amount of data on the Twitter platform is finite, and the total number of variables (data and metadata) that can be collected from the platform is fixed. Second, social media is an information-seeking and information-sharing platform on which users engage on a daily basis. Information on social media platforms covers almost every aspect of human society. An analytical model built on social media platforms is more practical and easier to control than one built directly on social phenomena and human behaviors. Variables and their relationships are easily identified and quantified when they are inherited directly from data and metadata on social media platforms. The entropic method of information theory is more justifiable and more efficient when applied to models built on a closed system such as social media than to models built on social (open) systems. On the other hand, social media brings several challenges along with the opportunities. Although the number of variables and the amount of data on social media are finite, they still demand nearly unlimited computing power to be processed in a timely manner. The data change rate, moreover, approaches infinity because the values of variables change within fractions of a second. Zunde (1987) summarized the nature of regularities in information science and concluded that the laws which control human social actions and interactions are subject to rapid change. Nowadays social media platforms are magnifying the dimensions of big data (volume, variety, and velocity) on an unprecedented scale. However, the methods available for studies in social science are lacking in their capabilities. Statistical methods, which dominate research in sociology, must face the unavoidable challenge of incomplete data and insufficient variables for representation. They also must sacrifice response speed for the ability to decipher content when the volume and velocity of data from social media approach those of an infinite set. This leaves scholars with a series of questions: Can mathematical statistics effectively capture the characteristics of social media? Are sophisticated statistical techniques (clustering, scaling methods, vector distance, regression, etc.) cost-efficient for automatic analysis that matches the velocity of the social media data stream? If not, is there a better solution? Brillouin (1962) asserted that the methods of information theory cannot be introduced to investigate the process of thought due to the elimination of the human element. The entropic method has indeed been applied as a standalone analytical tool for structural properties for more than half a century, provided it was not misused for the interpretation of semantic value.

## 3 Research Methodology

### 3.1 Research Design: A Case Study with Entropy Method

The Twitter platform connects its users (*personal or organizational accounts that write and/or read tweets*) across the world through their information-sharing behavior (i.e., tweeting/retweeting activities) and their information-seeking behavior via the hashtag search feature of the platform. According to Shannon (1948), although these messages on Twitter have meanings, the semantic aspects of communication are irrelevant to the engineering problem (the structure of the message). The first step in understanding the structural complexity of the tweet samples is to define the level of granularity. In this study, the granular levels of a tweet are listed below:

The left end of the spectrum (*i.e., letter*) represents smaller granularity whereas the right end of the spectrum (*i.e., tweet*) represents greater granularity. There are six typical components available to construct a tweet, and the ways of combining these components are unlimited, depending only upon the choice of the tweet creator. In this study, the granular components for composing a tweet are categorized as (1) image(s) (short for img), (2) text with semantic meaning (short for txt), (3) hashtag(s) (short for #), (4) @username(s) (short for @), (5) hyperlink (short for HL), and (6) unused space (short for spc). These six components serve as the fundamental elements of the coding scheme used to build the so-called alphabets (Kinsner, 2004) for calculating entropy based on Shannon’s entropy equation. In this study, tweeting/retweeting behavior is defined as the choices made among the six typical components to construct a tweet. Figure 1 illustrates the structure of a tweet.

The calculation of entropy in this study is based on the following premises: (1) All the entities in each component alphabet are independent of each other. An entity refers to a single component, which could be one of the six decomposed components, in a specific tweet in a collection. Although the choice among different components to compose a single tweet is restrained by the 140-character limit, this restriction does not affect the independence of entities in each component alphabet. For example, in a hypothetical sample of 100 tweets, there could be 75 tweets with hyperlink and all these 75 hyperlinks are independent of each other because they come from different tweets with different users. Figure 2 illustrates the structure of a tweet-collection and the relationship between entity and tweet. (2) Each alphabet has a finite number of variables. In this empirical study, each collection contained a finite number of tweets, and therefore there were a finite number of entities of each component. (3) All the entities in each alphabet were discrete variables. (4) The empirical frequency of an entity in each alphabet served as the probability of a variable in Shannon’s equation. (5) The logarithm of the probability distribution is additive for independent sources.
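The premises above can be illustrated with a minimal Python sketch of the classical Shannon entropy of one component alphabet, using empirical frequencies in place of probabilities (premise 4). The function name `shannon_entropy` and the sample hashtags are hypothetical and only serve as an illustration; the weighted H’ values used in the matrix of this study may be computed differently.

```python
from collections import Counter
from math import log2


def shannon_entropy(alphabet):
    """Entropy in bits of a list of discrete entities (premises 2 and 3).

    The empirical frequency of each distinct entity stands in for its
    probability (premise 4).
    """
    counts = Counter(alphabet)
    total = sum(counts.values())
    # Sum of -p * log2(p) over the distinct entities in the alphabet.
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Hypothetical hashtag alphabet pooled from a small tweet collection.
hashtags = ["#flu", "#flu", "#health", "#vaccine"]
h_hashtag = shannon_entropy(hashtags)
```

An alphabet whose entities are all identical yields zero bits, while an alphabet of W distinct, equally frequent entities yields log_{2} W bits.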

The entropy calculation is a straightforward process because it involves only one coding scheme at a time and generates only one entropy value for each coding scheme. Inspired by the work of Kearns and O’Connor (2004) and their approach to calculating form complexity in moving image documents, this study conducted a comparative case study and extended Shannon’s original entropy equation to a multidimensional matrix. We integrated the six tweet components with their own coding schemes to examine the complexity of the statistical structure (Shannon, 1948) in our tweet samples. We sampled tweets associated with three pairs of semantically similar healthcare hashtags as six hashtag collections: #glucose versus #bloodsugar, #hypertension versus #bloodpressure, and #influenza versus #flu. *Table 1* illustrates an example of this multi-dimensional matrix for calculating the entropy value of each component in a sample (along the *vertical direction*) and the synthetic value of H’_{(tweet-x)} for each tweet in that sample (along the *horizontal direction*). The operational definitions of the variables in the matrix and their notations in this study are as follows: H_{(x)} is the general notation of the matrix for entropy calculation. H_{(sample)} is the cumulative sum of all components’ entropy values in the H_{(x)} matrix. H’_{(hashtag)}, short for H’_{(#)}, is the weighted entropy value of component *Hashtag*. H’_{(hyperlink)}, short for H’_{(HL)}, is the weighted entropy value of component *Hyperlink*. H’_{(@username)}, short for H’_{(@)}, is the weighted entropy value of component *@username*. H’_{(space)} is the weighted entropy value of component *Unused Space*. H’_{(text)}, short for H’_{(txt)}, is the weighted entropy value of component *Text with Semantic Meaning*. H’_{(red)}, H’_{(green)}, and H’_{(blue)} are, respectively, the weighted entropy values of each color component in *Image’s RGB color*. For each tweet in a sample, H’_{(tweet)} is the horizontal sum of each entity’s P(x_{i}) × log_{2} P(x_{i}) value in that tweet.

Table 1. Example of H(x) Matrix and Entropic Equation for Each Component in a Tweet

The columns Hashtag(s) through Semantic Text form the text-based content of a tweet; the Red, Green, and Blue Color columns form the image component in a tweet.

| | Hashtag(s) | Hyperlink | @username(s) | Unused Space | Semantic Text | Red Color | Green Color | Blue Color | H_{(x)} |
|---|---|---|---|---|---|---|---|---|---|
| Tweet 1 | P(x_{i}) × log_{2} P(x_{i}) | P(x_{j}) × log_{2} P(x_{j}) | P(x_{k}) × log_{2} P(x_{k}) | P(x_{l}) × log_{2} P(x_{l}) | P(x_{m}) × log_{2} P(x_{m}) | P(x_{n-r}) × log_{2} P(x_{n-r}) | P(x_{n-g}) × log_{2} P(x_{n-g}) | P(x_{n-b}) × log_{2} P(x_{n-b}) | H’_{(tweet-1)} |
| Tweet 2 | P(x_{i}) × log_{2} P(x_{i}) | - | P(x_{k}) × log_{2} P(x_{k}) | - | P(x_{m}) × log_{2} P(x_{m}) | - | - | - | H’_{(tweet-2)} |
| Tweet 3 | P(x_{i}) × log_{2} P(x_{i}) | P(x_{j}) × log_{2} P(x_{j}) | - | P(x_{l}) × log_{2} P(x_{l}) | P(x_{m}) × log_{2} P(x_{m}) | P(x_{n-r}) × log_{2} P(x_{n-r}) | P(x_{n-g}) × log_{2} P(x_{n-g}) | P(x_{n-b}) × log_{2} P(x_{n-b}) | H’_{(tweet-3)} |
| Tweet n | P(x_{i}) × log_{2} P(x_{i}) | P(x_{j}) × log_{2} P(x_{j}) | - | P(x_{l}) × log_{2} P(x_{l}) | P(x_{m}) × log_{2} P(x_{m}) | - | - | - | H’_{(tweet-n)} |
| SUM_{(component)} | H’_{(#hashtag)} | H’_{(Hyperlink)} | H’_{(@username)} | H’_{(Space)} | H’_{(Text)} | H’_{(red)} | H’_{(green)} | H’_{(blue)} | H_{(sample)} |
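The layout of the matrix can be sketched in Python as follows. This is one possible reading of Table 1, not the authors’ implementation: each cell holds the absolute value of P(x) × log_{2} P(x) for the entity a tweet contributes to a component, row sums give H’_{(tweet-x)}, column sums give H’_{(component)}, and the grand total gives H_{(sample)}. The function name `entropy_matrix`, the dictionary representation of a tweet, and the sample data are assumptions made for illustration.

```python
from collections import Counter
from math import log2

# Column order of Table 1 (abbreviations as defined in the text).
COMPONENTS = ["#", "HL", "@", "spc", "txt", "red", "green", "blue"]


def entropy_matrix(tweets):
    """tweets: list of dicts mapping component -> entity; a missing key is a '-' cell."""
    freq = {}
    for comp in COMPONENTS:
        column = [t[comp] for t in tweets if comp in t]
        counts = Counter(column)
        # Empirical frequency of each entity within its component alphabet.
        freq[comp] = {e: c / len(column) for e, c in counts.items()}
    # Cell value: |P(x) * log2 P(x)| for the entity the tweet holds in that column.
    cells = [
        {comp: abs(freq[comp][t[comp]] * log2(freq[comp][t[comp]]))
         for comp in COMPONENTS if comp in t}
        for t in tweets
    ]
    row_sums = [sum(row.values()) for row in cells]              # H'(tweet-x)
    col_sums = {comp: sum(row.get(comp, 0.0) for row in cells)   # H'(component)
                for comp in COMPONENTS}
    return cells, row_sums, col_sums, sum(col_sums.values())     # last item: H(sample)


# Two hypothetical tweets: same hashtag, different text, one hyperlink.
sample = [
    {"#": "#flu", "txt": "get your shot", "HL": "https://t.co/a"},
    {"#": "#flu", "txt": "stay home"},
]
cells, h_tweet, h_component, h_sample = entropy_matrix(sample)
```

In this toy sample the shared hashtag contributes nothing, while the two distinct texts contribute 0.5 bits each, so only the text column carries structural variety.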

Figure 2. Structure of a Tweet-collection with Granular Components.

Citation: Data and Information Management 2, 3; 10.2478/dim-2018-0011

The nomenclature in this study complies with the following rules: (1) The use of the letter H to denote entropy is inherited from information theory (Shannon, 1948). H_{(x)} and H_{(sample)} are both derived from the original entropy concept, regardless of whether it is applied to a thermodynamically closed system or to a social media data stream. (2) The notation H’_{(…)} indicates that these variables are not the same as Claude Shannon’s original entropy concept. According to Shannon’s information theory, the calculation of the logarithm should use the probability of occurrence of each entity. However, in a real-world scenario, especially in a study of social media data like this one where the theoretical probability is unavailable, the empirical frequency was used instead.

#### 3.1.1 Calculating the weighted entropy values for the textual content

Five different components can be used to construct the textual content of a tweet: (1) text with semantic meaning, (2) hashtag(s), (3) @username(s) mentions, (4) hyperlink, and (5) unused space. The NodeXL Pro software automatically collected Twitter network information, recording the hashtag component in the “Hashtags in Tweet” column and the hyperlink component in the “URLs in Tweet” column. The *@username* component was identified from the vertices of each edge in the NodeXL dataset. The *Unused Space* component of each tweet was calculated by the formula: *unused space equals 140 characters minus the length of the tweet*. The *Text with Semantic Meaning* component was the textual content of a tweet excluding the *Hashtag*, *@username*, and *Hyperlink* components.
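For readers working from raw tweet text rather than NodeXL columns, the decomposition could be approximated as in the sketch below. The regular expressions and the function name `decompose` are assumptions made for illustration and are not part of NodeXL or of the procedure used in this study; only the unused-space formula is taken from the text.

```python
import re

LIMIT = 140  # character limit in force when the data were collected

HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")


def decompose(tweet):
    """Split raw tweet text into the five textual components."""
    urls = URL.findall(tweet)
    # Remove hyperlinks first so that '#' fragments inside URLs are not
    # mistaken for hashtags.
    rest = URL.sub(" ", tweet)
    hashtags = HASHTAG.findall(rest)
    mentions = MENTION.findall(rest)
    # What remains after stripping hashtags and mentions is the semantic text.
    text = " ".join(MENTION.sub(" ", HASHTAG.sub(" ", rest)).split())
    return {
        "#": hashtags,
        "@": mentions,
        "HL": urls,
        "txt": text,
        "spc": LIMIT - len(tweet),  # unused space = 140 - tweet length
    }


parts = decompose("Check your #bloodpressure daily @DrSmith https://t.co/abc123")
```

The resulting lists can then be pooled across all tweets in a collection to form the alphabet of each component.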

For each previously identified component, the configuration of all its entities is called the coding alphabet (Kinsner, 2004). In this study, each alphabet was generated by pooling all the entities of each component. The next step was to calculate the frequency of occurrence of each entity (each cell in Table 1) of a specific component. To calculate the P(x_{i}) × log_{2} P(x_{i}) value in each cell, this study used 2 as the base of the logarithm and then multiplied the frequency of the entity in that cell by its corresponding logarithm. Choosing 2 as the base of the logarithm gives the result in units of “bits”, as recommended by Tukey to Claude Shannon (1948). In this study, all the calculated P(x_{i}) × log_{2} P(x_{i}) values were converted to positive values, because the sign does not carry any meaning.
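As a small worked example with hypothetical numbers, suppose an entity accounts for one quarter of the entries in its component alphabet. The cell value is then:

$$P(x_{i}) \times \log_{2} P(x_{i}) = \tfrac{1}{4} \times \log_{2} \tfrac{1}{4} = \tfrac{1}{4} \times (-2) = -0.5$$

After the sign is dropped, the value recorded in the cell would be 0.5 bits.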

Each individual tweet in the samples has a unique H’_{(tweet-x)} value. However, H’_{(tweet-x)} is a synthetic entropy value, because the classical entropy concept is a measure of the overall property of a closed system and therefore cannot be applied at the entity level. The entropy value of each component in the tweet-sample was denoted as H’_{(component)} and calculated using the following equation:
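A form consistent with the column sums of Table 1 is given here as a reconstruction under stated assumptions, not as a verbatim copy of the original equation. It assumes that i runs over the n entities (cells) of the component’s column and that the leading minus sign makes the result positive:

$$H'_{(component)} = -\sum_{i=1}^{n} P(x_{i}) \log_{2} P(x_{i})$$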

There was a special case with the hyperlink component. Each hyperlink is unique because it is a 23-character string with no semantic meaning, and there can be only one hyperlink in each tweet. As a result, the alphabet of hyperlinks was made up of unique entities with equal probability. The appropriate entropy equation for a situation like this is Boltzmann’s equation, S = k·log W, which saves calculation work. The Boltzmann constant is omitted here because it is used only in thermodynamic situations; Boltzmann’s equation is a special case of the Gibbs equation, and the Gibbs equation is mathematically the same as Shannon’s equation. This guarantees that the results calculated with Boltzmann’s equation are the same as those calculated with Shannon’s. The entropy of the textual content of the tweet-sample was denoted as H’_{(content)} and was the integrated value calculated by summing the entropy values of the five different components as follows:
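In the notation defined for the matrix, and as a reconstruction from the surrounding definitions rather than a verbatim copy of the original equation, the sum of the five textual components can be written as:

$$H'_{(content)} = H'_{(hashtag)} + H'_{(hyperlink)} + H'_{(@username)} + H'_{(space)} + H'_{(text)}$$

The equivalence claimed for the hyperlink component can also be checked directly. For W unique hyperlinks, each with empirical frequency 1/W and with the logarithm taken to base 2:

$$H'_{(hyperlink)} = -\sum_{i=1}^{W} \frac{1}{W} \log_{2} \frac{1}{W} = \log_{2} W$$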

#### 3.1.2 Calculating the entropy value for the image component

An image in a tweet can be numerically represented in many ways. According to Marr (1982), representation is used to clarify certain characteristics of an entity within a system and to provide a scheme for coding. Knowledge about patterns of the characteristics is crucial for determining a functional and appropriate representation for coding scheme in a system. Anderson and O’Connor (2009) used RGB data to map the color distribution of each frame in the Bodega Bay scene for structural analysis of the sequence of Hitchcock’s movie *The Birds*. Likewise, in this study, a set of three numbers, namely the average RGB values (*from 0 to 255*), was used to represent each image in a single tweet for the entropy calculation of the image component.

Each of the red, green, and blue channels of an image can take 256 shades. In total, over 16 million (256^{3}) combinations are available to represent a single image file. For cases where a particular tweet contained more than one image, the set of weighted average RGB values of all images in that tweet served as the numerical representation. This approach provided an objective way to tokenize an image without human intervention. In repeated tests with more than 700 images, this approach appeared to be effective and adequate: no identical set was assigned to different images. Sometimes tweets contain the same image but different textual content; at other times tweets share both content and image, but the image files have different resolutions. In those cases, the numerical set remains the same across different tweets regardless of the size or resolution of the image(s), which provides consistency for the study.
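The tokenization step can be sketched as follows. The sketch assumes that each image has already been decoded into a list of (R, G, B) pixel tuples, and it weights multiple images by their pixel counts; the weighting scheme is not specified in the text, so this choice and the function names `average_rgb` and `tweet_rgb` are assumptions.

```python
def average_rgb(pixels):
    """Mean (R, G, B) of one image given as a list of (r, g, b) tuples in 0..255."""
    n = len(pixels)
    return tuple(sum(p[channel] for p in pixels) / n for channel in range(3))


def tweet_rgb(images):
    """Average RGB over all images in one tweet, weighted by pixel count.

    Weighting by pixel count is equivalent to pooling the pixels of all
    images and averaging once.
    """
    pooled = [p for image in images for p in image]
    return average_rgb(pooled)


# Hypothetical tweet with a 3-pixel red image and a 1-pixel blue image.
token = tweet_rgb([[(255, 0, 0)] * 3, [(0, 0, 255)]])
```

The three averages can be rounded to integers if the alphabet of each color is meant to contain whole values from 0 to 255.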

All the red values in the RGB sets constituted the alphabet of the red color for that tweet-sample, and likewise for the green and blue colors. As shown in the following equation, the frequency of each value of the red, green, and blue color was calculated, multiplied by its own logarithm, and then summed to obtain the entropy of each color:
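Using the cell notation of Table 1, and as a reconstruction from the description above rather than a verbatim copy of the original equation, the entropy of the red color can be written as below, with the sum running over the entities in the red column; the green and blue colors follow by replacing x_{n-r} with x_{n-g} and x_{n-b}:

$$H'_{(red)} = -\sum P(x_{n-r}) \log_{2} P(x_{n-r})$$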

The current solution of assigning a set of average RGB color to each image has a unique tendency. A dark image, in general, has relatively lower average RGB values than a bright one. However, the final effect of this tendency is minute because the image component only takes 3/8 of the total proportion; the ratio of the weighted entropy value of the textual content and the entropy value of the image component is countless. The value of entropy of the image component of the tweet-sample was denoted as H’_{(image)} and was calculated by summing up all the values of entropy for each of the three colors as expressed by the following equation:
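Following the definitions given for the matrix, and offered as a reconstruction rather than a verbatim copy of the original equation, the sum of the three color entropies can be written as:

$$H'_{(image)} = H'_{(red)} + H'_{(green)} + H'_{(blue)}$$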

The image component is configured as the sum of three different entropy values because an image in a tweet takes up a considerable amount of space on any display device. The Twitter default image size (*440 × 220*) is usually larger than the space for the textual content (*140 characters*) of the tweet, and tweets with image(s) are more competitive for readers’ attention. A regular cell phone screen can fit at most two tweets with image(s) at a time, which means these tweets stay on screen longer while the user scrolls and browses. In the study conducted by Houts, Doak, Doak, and Loscalzo (2006) to evaluate the effect of pictures on health communication, the investigators found that pictures linked to text can increase attention to health information compared with text alone. As a multimedia supplement to textual communication messages, an image plays a crucial role not only in visualizing the main idea of the content but also in attracting users’ attention to increase the probability of being retweeted. Therefore, it makes sense that the image component accounts for a larger proportion of the H_{(x)} matrix than any other single component. In this study, the final calculated result of H_{(sample)} was used as an indicator of the complexity of the structure of a tweet-sample. In addition, the complexity of the structure is an indicator of the variety of tweeting behaviors in terms of choices for tweet composition. For example, individual users might engage in more point-to-point communication using the @username mention function, while healthcare agencies might tend to embed a hyperlink in their tweets to direct network traffic to target webpages. For this reason, the structure of a medical-terminology hashtag collection could be distinct from that of a lay-language hashtag collection. This implies that users with different profiles have different preferences regarding the choice of medical-terminology or lay-language hashtags. The final product of the calculation matrix is H_{(x)} and is calculated by the following formula:
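Following the definition of H_{(sample)} as the cumulative sum of all components’ entropy values in the H_{(x)} matrix, and offered as a reconstruction rather than a verbatim copy of the original formula, the total can be written as:

$$H_{(sample)} = H'_{(content)} + H'_{(image)}$$

Expanded into its eight terms, this is:

$$H_{(sample)} = H'_{(hashtag)} + H'_{(hyperlink)} + H'_{(@username)} + H'_{(space)} + H'_{(text)} + H'_{(red)} + H'_{(green)} + H'_{(blue)}$$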

The calculated result of H_{(sample)} was made up of eight entropy values from the six typical components (if present) in a tweet sample, because the image component was composed of three color subsets: red, green, and blue. These components sit at a single level of granularity of the tweet-sample and represent the diversity of the statistical structure in terms of choosing different components.

### 3.2 Data

Using the hashtag-search function supported by NodeXL Pro software, version 1.0.1.378, we retrieved three pairs of medical/healthcare hashtags versus their corresponding lay-language counterparts. They were #glucose versus #bloodsugar, #hypertension versus #bloodpressure, and #influenza versus #flu. The timeframe for investigation was set to 48 hours on random days in 2017 because it is long enough to gather sufficient tweets in each hashtag collection for longitudinal comparison. This study first extracted six random 48-hour samples, namely, #glucose Jan-25 to Jan-26, #glucose Feb-16 to Feb-17, #bloodsugar Feb-07 to Feb-08, #bloodsugar Feb-15 to Feb-16, #hypertension Feb-11 to Feb-12, and #bloodpressure Feb-11 to Feb-12. On top of those, four more samples (#glucose Oct-07 to Oct-08, #bloodsugar Oct-07 to Oct-08, #hypertension Oct-07 to Oct-08, and #bloodpressure Oct-07 to Oct-08) were collected to make longitudinal comparisons, and another pair of medical vs. lay-language hashtags, namely #influenza (Apr-18 to Apr-19 sample and Oct-07 to Oct-08 sample) versus #flu (Apr-18 to Apr-19 sample and Oct-07 to Oct-08 sample), was included in the study. In total, there were 1,458 tweets in 14 samples from January, February, April, and October 2017.

During the data collection process, a variation of #bloodsugar was found: #bloodsuger. However, the tweets containing #bloodsuger were eventually excluded from this study for the sake of consistency in comparison. This phenomenon implies that #bloodsugar was used by users who occasionally misspell words. On the other hand, no spelling variation was identified during data gathering for the #glucose collection, suggesting that people who use medical-terminology hashtags may be less likely to make spelling errors.

Although Twitter officially expanded its character limit from 140 to 280 on November 7^{th}, 2017, we collected the data before this change. Moreover, the method proposed in this study is independent of the character limit: it deals only with the proportion of each component in a tweet rather than its absolute length in characters. Therefore, there is no inconsistency issue in the data characteristics. Regarding data filtering and cleaning, to calculate entropy values in a consistent way, the inclusion and exclusion criteria were:

All the tweets must be written in English, meaning the value in the NodeXL Language column of the Edge sheet must be “en”.

All the tweets must contain the investigated hashtag. For example, all tweets in the #glucose collection must contain #glucose; tweets that only have the word “glucose” in the text component did not qualify.

Tweets that contain a video or GIF image were excluded from this study because calculating the entropy value of a video clip or a GIF image demands a much more complicated technique; such tweets will be included in future studies.

Tweets that contain emoji and/or special characters were excluded because these symbols depend on the display device. In other words, they do not look the same across different cellphone operating systems, and they cannot fit into any of the six components that this study defines.
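The four screening rules above can be sketched as a single predicate. This is a minimal illustration rather than the procedure actually used with NodeXL: the record keys (`text`, `language`, `has_video_or_gif`) are hypothetical, hashtag matching is assumed to be case-insensitive, and treating any non-ASCII character as an emoji or special character is a deliberately rough proxy.

```python
import re


def is_eligible(tweet, hashtag):
    """Return True if a tweet record passes the four screening rules.

    `tweet` is a dict with hypothetical keys: 'text', 'language',
    and 'has_video_or_gif'.
    """
    text = tweet["text"]
    # Rule 1: English tweets only.
    if tweet["language"] != "en":
        return False
    # Rule 2: the hashtag itself must appear, not just the bare word
    # and not a longer hashtag that merely starts with it.
    pattern = re.compile(r"(?<!\w)" + re.escape(hashtag) + r"(?!\w)", re.IGNORECASE)
    if not pattern.search(text):
        return False
    # Rule 3: no video clips or GIF images.
    if tweet["has_video_or_gif"]:
        return False
    # Rule 4 (rough proxy, an assumption): reject any non-ASCII character.
    if any(ord(ch) > 127 for ch in text):
        return False
    return True


example = {"text": "Check your #glucose daily", "language": "en", "has_video_or_gif": False}
print(is_eligible(example, "#glucose"))
```

A stricter emoji test would need a Unicode-aware rule, which is outside the scope of this sketch.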

The sample size and descriptive statistics of the tweets are summarized in *Table 2*. Each tweet in our sample was reviewed and labeled as either a retweet or an original tweet. A retweet is a reposted or forwarded message on Twitter. The NodeXL software tool uses “Retweet ID” to mark all the retweets in the datasheet. Tweets with an empty “Retweet ID” value were categorized as original tweets in this study. Each tweet has a chance to be retweeted by the followers of the original tweet creator. The “Retweet Count” in NodeXL records the number of times a specific tweet (whether a retweet or an original tweet) gets retweeted. “Retweet Count” is the key variable in this study because we sought to identify the characteristics (especially from the perspective of structural complexity) of the tweets that received a higher “Retweet Count”, using the H_{(x)} matrix as the analysis tool. Healthcare topics are sophisticated, and healthcare communication messages usually rely on rich media such as image(s)/videos to visualize ideas and/or on an external hyperlink to direct the audience to a destination webpage with further explanation. The percentage of tweets that contain each component (if any) in each sampled hashtag collection is also listed in *Table 2*.
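As a small illustration of this labeling step, the sketch below splits a sample into retweets and original tweets by whether the retweet identifier is empty and totals the retweet counts. The dictionary keys `retweet_id` and `retweet_count` are hypothetical stand-ins for the NodeXL columns, and the toy rows are invented.

```python
def summarize_sample(rows):
    """Summarize a sample of tweet records.

    Each row is a dict with hypothetical keys 'retweet_id' (empty or None
    for an original tweet) and 'retweet_count'.
    """
    retweets = [r for r in rows if r["retweet_id"]]
    originals = [r for r in rows if not r["retweet_id"]]
    n = len(rows)
    return {
        "rt_count": len(retweets),
        "ot_count": len(originals),
        "rt_pct": round(100 * len(retweets) / n, 2),
        "ot_pct": round(100 * len(originals) / n, 2),
        "total_times_retweeted": sum(r["retweet_count"] for r in rows),
    }


# Invented toy sample: two retweets and two original tweets.
toy_rows = [
    {"retweet_id": "123", "retweet_count": 5},
    {"retweet_id": "", "retweet_count": 2},
    {"retweet_id": None, "retweet_count": 0},
    {"retweet_id": "456", "retweet_count": 1},
]
print(summarize_sample(toy_rows))
```

The same percentage arithmetic reproduces the reported shares; for example, 41 retweets out of 75 tweets in the #glucose Jan-25 to Jan-26 sample gives 54.67%.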

Table 2. Summary of Collected Tweets for All Samples

Columns Image % through Unused Space % give the weight of each component ^{1}. RTs are retweets and OTs are original tweets in each sample.

| Investigated Hashtag | Sample Period | Sample Size | Image % | Hyperlink % | @username % | Hashtags % ^{2} | Unused Space % | Total Number of Being Retweeted | RT Count | RT % | RTs: Number of Being Retweeted | OT Count | OT % | OTs: Number of Being Retweeted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #glucose | Jan-25 to Jan-26 | 75 | 44.00% | 52.00% | 64.00% | 90.67% | 50.67% | 309 | 41 | 54.67% | 266 | 34 | 45.33% | 43 |
| #glucose | Feb-16 to Feb-17 | 42 | 40.48% | 50.00% | 54.76% | 90.48% | 52.38% | 275 | 21 | 50.00% | 251 | 21 | 50.00% | 24 |
| #glucose | Oct-07 to Oct-08 | 23 | 26.09% | 43.48% | 91.30% | 100.00% | 26.09% | 176 | 18 | 78.26% | 173 | 5 | 21.74% | 3 |
| #bloodsugar | Feb-07 to Feb-08 | 43 | 46.51% | 65.12% | 48.84% | 86.05% | 65.12% | 45 | 16 | 37.21% | 24 | 27 | 62.79% | 21 |
| #bloodsugar | Feb-15 to Feb-16 | 48 | 45.83% | 58.33% | 47.92% | 79.17% | 58.33% | 114 | 18 | 37.50% | 103 | 30 | 62.50% | 11 |
| #bloodsugar | Oct-07 to Oct-08 | 42 | 71.43% | 80.95% | 64.29% | 83.33% | 45.24% | 169 | 25 | 59.52% | 151 | 17 | 40.48% | 18 |
| #flu | Apr-18 to Apr-19 | 151 | 26.49% | 77.48% | 43.71% | 77.48% | 81.46% | 490 | 40 | 26.49% | 440 | 111 | 73.51% | 50 |
| #flu | Oct-07 to Oct-08 | 502 | 52.19% | 56.77% | 70.32% | 73.90% | 53.39% | 5751 | 315 | 62.75% | 5637 | 187 | 37.25% | 114 |
| #influenza | Apr-18 to Apr-19 | 71 | 29.58% | 70.42% | 61.97% | 63.38% | 69.01% | 177 | 28 | 39.44% | 125 | 43 | 60.56% | 52 |
| #influenza | Oct-07 to Oct-08 | 73 | 38.36% | 71.23% | 69.86% | 79.45% | 64.38% | 323 | 42 | 57.53% | 302 | 31 | 42.47% | 21 |
| #bloodpressure | Feb-11 to Feb-12 | 94 | 67.02% | 77.66% | 62.77% | 89.36% | 46.81% | 1582 | 55 | 58.51% | 1531 | 39 | 41.49% | 51 |
| #bloodpressure | Oct-07 to Oct-08 | 129 | 39.53% | 79.07% | 62.79% | 88.37% | 56.59% | 1469 | 75 | 58.14% | 1404 | 54 | 41.86% | 65 |
| #hypertension | Feb-11 to Feb-12 | 61 | 22.95% | 78.69% | 31.15% | 67.21% | 73.77% | 69 | 14 | 22.95% | 51 | 47 | 77.05% | 18 |
| #hypertension | Oct-07 to Oct-08 | 104 | 21.15% | 63.46% | 81.73% | 81.73% | 57.69% | 959 | 81 | 77.88% | 898 | 23 | 22.12% | 61 |

Table 3. Summary of the H_{(x)} Matrix Values for All the Tweet Collections with Investigated Hashtags

| Investigated Hashtag | Sample Period | Sample Size | H’_{(Text)} | H’_{(@username)} | H’_{(Hashtags)} | H’_{(Hyperlink)} | H’_{(Space)} | H’_{(Images)} | H’_{(Content)} | H_{(Sample)} | Median H’_{(tweet)} | Average H’_{(tweet)} | Average H’_{(tweet)}, Retweets | Average H’_{(tweet)}, Original Tweets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #glucose | Jan-25 to Jan-26 | 75 | 14.7680 | 1.9358 | 5.7227 | 2.7652 | 2.4169 | 14.7516 | 27.6086 | 42.3602 | 0.5144 | 0.5648 | 0.5644 | 0.5653 |
| #glucose | Feb-16 to Feb-17 | 42 | 11.1515 | 1.2305 | 6.3318 | 2.1120 | 2.2738 | 12.2151 | 23.0996 | 35.3147 | 0.7132 | 0.8408 | 0.8941 | 0.7876 |
| #glucose | Oct-07 to Oct-08 | 23 | 11.0209 | 1.7989 | 3.1497 | 1.7536 | 0.8608 | 7.5271 | 18.5838 | 26.1109 | 0.9286 | 1.1353 | 1.0436 | 1.4651 |
| #bloodsugar | Feb-07 to Feb-08 | 43 | 9.8025 | 1.4000 | 6.9842 | 2.7478 | 3.0449 | 12.5872 | 23.9793 | 36.5665 | 0.7939 | 0.8504 | 0.9724 | 0.7780 |
| #bloodsugar | Feb-15 to Feb-16 | 48 | 12.7465 | 1.2858 | 5.3805 | 2.4894 | 2.5008 | 13.0723 | 24.4030 | 37.4753 | 0.7081 | 0.7807 | 0.8810 | 0.7206 |
| #bloodsugar | Oct-07 to Oct-08 | 42 | 10.8982 | 1.4028 | 6.8090 | 3.1866 | 0.9884 | 13.5666 | 23.2849 | 36.8515 | 0.8944 | 0.8774 | 0.8486 | 0.9198 |
| #flu | Apr-18 to Apr-19 | 151 | 16.1055 | 1.5098 | 4.3035 | 4.4093 | 7.0226 | 15.6001 | 33.3507 | 48.9508 | 0.2497 | 0.3242 | 0.4045 | 0.2952 |
| #flu | Oct-07 to Oct-08 | 502 | 23.2375 | 3.4049 | 5.1701 | 4.3799 | 5.3952 | 23.7550 | 41.5877 | 65.3427 | 0.1205 | 0.1302 | 0.1382 | 0.1166 |
| #influenza | Apr-18 to Apr-19 | 71 | 14.2852 | 1.6310 | 4.1715 | 3.4779 | 4.5480 | 12.7894 | 28.1136 | 40.9030 | 0.4644 | 0.5761 | 0.6735 | 0.5127 |
| #influenza | Oct-07 to Oct-08 | 73 | 13.4464 | 2.2456 | 5.0927 | 3.5751 | 4.0274 | 14.2077 | 28.3872 | 42.5948 | 0.4873 | 0.5835 | 0.6485 | 0.4954 |
| #bloodpressure | Feb-11 to Feb-12 | 94 | 11.5257 | 4.6161 | 6.2584 | 3.8381 | 3.8287 | 17.1517 | 30.0670 | 47.2187 | 0.5036 | 0.5023 | 0.5108 | 0.4904 |
| #bloodpressure | Oct-07 to Oct-08 | 129 | 15.0363 | 2.5713 | 7.1544 | 4.3415 | 3.2591 | 16.0772 | 32.3627 | 48.4398 | 0.3037 | 0.3755 | 0.4013 | 0.3397 |
| #hypertension | Feb-11 to Feb-12 | 61 | 11.8774 | 1.9041 | 5.3826 | 3.4632 | 3.5440 | 10.6938 | 26.1714 | 36.8651 | 0.4353 | 0.6043 | 0.7731 | 0.5541 |
| #hypertension | Oct-07 to Oct-08 | 104 | 14.2654 | 3.1008 | 6.8957 | 3.5835 | 3.1380 | 13.1202 | 30.9835 | 44.1037 | 0.3523 | 0.4241 | 0.4427 | 0.3583 |

## 4 Data Analysis and Results

Healthcare communication tweets refer to tweets written to communicate health information in support of health education or public health campaigns. This study examined how healthcare communication *tweets* were composed by analyzing the complexity in structural components and the variety of tweet composition in three pairs of medical hashtags that share similar semantic meanings. *Table 3* summarizes the calculated results of the H_{(x)} matrix for all the tweet collections. The observed tweeting behaviors were associated with different types of Twitter users, meaning that tweeting behaviors varied across Twitter accounts with diversified profiles. In addition, the components used to compose a tweet are independent of each other. Each tweet has limited space (i.e., 140 characters) to express its main idea. Therefore, a user’s choice between medical-terminology hashtags and lay-language hashtags requires consideration of the opportunity cost of the different options. Interestingly, the percentage of tweets that contained both a medical-terminology hashtag and its lay-language counterpart was very low in all the samples. This suggests that most users tend to reduce redundancy in hashtag usage by avoiding hashtags with similar or identical semantic meanings.

### 4.1 Data Analysis with Visualizations: Radar Graph and Scatterplot

This study employs radar graphs and scatterplots as data visualization aids to provide an intuitive demonstration. They visualize the complexity in the structure of a tweet-sample and reveal the pattern of the characteristics of each individual tweet in the sample. To compare each pair of semantically similar healthcare hashtags, we mapped the entropy values in the last row of *Table 1* onto six vectors on a radar graph by their weighted average proportion in the tweet-sample. Then, the radar graphs for each tweet-sample were placed together to build a combined radar graph for the given hashtag. A radar graph shows the weight of each component in the tweet-sample.
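The geometry of such a radar graph can be sketched as follows: each of the six component values is placed on its own evenly spaced axis, and the resulting points form the polygon. The axis order and the use of raw entropy values as radii are assumptions for illustration; the values themselves are the component entropies of the #hypertension Feb-11 to Feb-12 sample reported in *Table 3*.

```python
import math


def radar_vertices(values):
    """Place each value on its own evenly spaced axis; return (x, y) vertices."""
    n = len(values)
    return [
        (v * math.cos(2 * math.pi * k / n), v * math.sin(2 * math.pi * k / n))
        for k, v in enumerate(values)
    ]


# Component entropy values for the #hypertension Feb-11 to Feb-12 sample.
# The axis order below is arbitrary (an assumption of this sketch).
hypertension_feb = {
    "Text": 11.8774,
    "@username": 1.9041,
    "Hashtags": 5.3826,
    "Hyperlink": 3.4632,
    "Space": 3.5440,
    "Images": 10.6938,
}
vertices = radar_vertices(list(hypertension_feb.values()))
for name, (x, y) in zip(hypertension_feb, vertices):
    print(f"{name:10s} x={x:8.4f} y={y:8.4f}")
```

Overlaying the polygons of two samples drawn this way gives the combined radar graph for a hashtag.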

In the calculation process for the textual content of a tweet, namely H’_{(txt)}, H’_{(#)}, H’_{(@)}, H’_{(space)}, and H’_{(HL)}, the value of P(x_{i}) × log_{2} P(x_{i}) for each cell within the matrix is calculated first to generate the entropy value of each component (*vertical direction*); then all the P(x_{i}) × log_{2} P(x_{i}) values for each component (if any) in a single tweet are added together to generate the value of H’_{(tweet)}. Finally, the value of H’_{(tweet)} is reallocated to each component according to its own proportion in the 140-character textual content of that tweet, for the following reasons:

The weight of each granular component is a crucial factor that a user must consider when composing a tweet. It is assumed that different choices among the various combinations of the six components have a conspicuous impact on the efficiency of communication on the Twitter platform. For instance, the component of text with semantic meaning conveys an idea or makes a point. The @username component is usually viewed as specifying the recipient of the message. A hyperlink is a string with no semantic meaning at all, but it can direct the audience outside the Twitter platform to other web resources. A hashtag is a hybrid feature: sometimes it serves as a phrase with grammatical value in a sentence; at other times it serves as a navigation aid (keyword) for information retrieval. This study assumes hashtags serve only as a feature for information retrieval. When hashtag(s) form part of the sentence, data preparation involves more manual effort or a more sophisticated algorithm, and the calculation of the H_{(x)} matrix would be more complex due to the duality of hashtags.

The user’s choice in allocating the 140-character space is the cause of the different empirical frequencies of the typical components in a sample. As a result, these H’_{(…)} values are weighted entropy values unique to the design of the H_{(x)} matrix.
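The row-and-column structure of this calculation can be sketched in a few lines. The sketch makes several assumptions that are not stated in the paper: P(x_{i}) is taken as a cell's share of all characters in the toy matrix, the negative sign follows Shannon's standard definition, and the character counts are invented. It is therefore not expected to reproduce the values in Table 3; it only shows that the same cell terms can be summed vertically per component and horizontally per tweet, and that H’_{(tweet)} can then be reallocated by proportion.

```python
import math

COMPONENTS = ["text", "hashtag", "username", "hyperlink", "space"]


def entropy_terms(matrix):
    """Return -P*log2(P) per cell, with P = cell / grand total (an assumption)."""
    total = sum(sum(row) for row in matrix)
    return [
        [-(c / total) * math.log2(c / total) if c else 0.0 for c in row]
        for row in matrix
    ]


def row_sums(terms):
    """Horizontal sums: one value per tweet."""
    return [sum(row) for row in terms]


def column_sums(terms):
    """Vertical sums: one value per component."""
    return [sum(col) for col in zip(*terms)]


def reallocate(h_tweet, row):
    """Split a tweet's value across components by their share of its characters."""
    total = sum(row)
    return [h_tweet * c / total for c in row]


# Invented character counts per component for two 140-character tweets.
matrix = [
    [70, 14, 0, 28, 28],
    [100, 10, 10, 0, 20],
]
terms = entropy_terms(matrix)
h_tweets = row_sums(terms)
h_components = column_sums(terms)
print(dict(zip(COMPONENTS, h_components)))
print(reallocate(h_tweets[0], matrix[0]))
```

Because both directions sum the same cells, the per-tweet totals and the per-component totals add up to the same sample-level value.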

Unlike the weighting technique for the textual content, it is unnecessary to reallocate the calculated entropy value for the image component because there is no limit on the combination of RGB colors that form an image. There is no time variable in the entropy calculation. However, each tweet in a sample has its own timestamp. The timestamp of each tweet is paired with its H’_{(tweet)} value from the H_{(x)} matrix to form a scatterplot for each sample (the data points are connected by a line to show their trace along the timeline). As shown in *Table 1*, the value of H’_{(Component)} was calculated separately and then aggregated into H_{(x)}. For each tweet in the matrix, its H’_{(tweet)} was calculated by summing all the P(x_{i}) × log_{2} P(x_{i}) entries for each component in that tweet (if present). The rationale behind the summation is twofold. First, according to information theory (Shannon, 1948), the entropy of a joint event equals the sum of the individual uncertainties when the events are independent. Second, all the cells within the matrix have the same unit, bits, because the value of each cell is the frequency of an entity multiplied by the logarithm of that frequency. The values in columns H’_{(#hashtag)}, H’_{(Hyperlink)}, H’_{(@username)}, H’_{(Space)}, H’_{(Text)}, and H’_{(Images)} were the results of the entropy calculation for each typical component in each sample. H’_{(Images)} in the table equals the sum of H’_{(red)}, H’_{(green)}, and H’_{(blue)}. H’_{(Content)} equals the sum of H’_{(#hashtag)}, H’_{(Hyperlink)}, H’_{(@username)}, H’_{(Space)}, and H’_{(Text)}. H_{(sample)} equals the sum of H’_{(Images)} and H’_{(Content)}.
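These summation relationships can be checked directly against the published figures. The short sketch below re-adds the component values of three rows from *Table 3* and compares the results with the reported content and sample totals; the tolerance of 0.0005 is an assumption chosen to absorb rounding to four decimal places.

```python
# Each tuple: Text, @username, Hashtags, Hyperlink, Space, Images, Content, Sample
TABLE3_ROWS = {
    "#glucose Jan-25 to Jan-26": (14.7680, 1.9358, 5.7227, 2.7652, 2.4169, 14.7516, 27.6086, 42.3602),
    "#flu Oct-07 to Oct-08": (23.2375, 3.4049, 5.1701, 4.3799, 5.3952, 23.7550, 41.5877, 65.3427),
    "#hypertension Feb-11 to Feb-12": (11.8774, 1.9041, 5.3826, 3.4632, 3.5440, 10.6938, 26.1714, 36.8651),
}


def sums_agree(row, tol=5e-4):
    """Check Content = sum of five textual components and Sample = Images + Content."""
    text, username, hashtags, hyperlink, space, images, content, sample = row
    content_ok = abs((text + username + hashtags + hyperlink + space) - content) <= tol
    sample_ok = abs((images + content) - sample) <= tol
    return content_ok and sample_ok


for label, row in TABLE3_ROWS.items():
    print(label, sums_agree(row))
```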

Figure 3. Radar Graph and Scatterplots for Tweet Collections with #hypertension

Citation: Data and Information Management 2, 3; 10.2478/dim-2018-0011

### 4.2 Analysis Results of the #hypertension Collections and the #bloodpressure Collections

#### 4.2.1 The #hypertension Collections

The two collections of tweets with #hypertension were collected individually from February 11^{th} to 12^{th} (61 tweets in total) and from October 7^{th} to 8^{th} (104 tweets in total). Figure 3.1 illustrates the comparative radar graphs for the February sample (*in blue*) versus the October sample (*in orange*). The October sample is 70% larger than the February sample, resulting in its greater size in the radar graph. However, the radar graph of the October sample was only 20% bigger than that of the February sample because the calculated entropy value is not linearly proportional to the total number of entities. As shown in *Table 3*, the H_{(sample)} values are 44.10 for the October sample and 36.87 for the February sample. The advantage of this feature is that radar graphs of samples with very different sizes (ranging from 10 to 100 times difference) can easily fit into one single figure. The shapes of the two radar graphs were similar, indicating that the tweeting/retweeting behaviors of Twitter users associated with #hypertension were stable over the two observation periods.
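The two percentages in this comparison follow from simple arithmetic on the reported figures. The sketch below assumes that the relative size of the radar graphs corresponds to the ratio of the H_{(sample)} values in *Table 3*; this reading is an inference from the numbers, not something the text states explicitly.

```python
def percent_larger(a, b):
    """Return how much larger a is than b, in percent."""
    return 100 * (a - b) / b


# #hypertension: October versus February.
size_growth = percent_larger(104, 61)               # sample sizes
entropy_growth = percent_larger(44.1037, 36.8651)   # H(sample) values

print(f"sample size: {size_growth:.1f}% larger")
print(f"H(sample):   {entropy_growth:.1f}% larger")
```

Rounded to whole percentages, these values match the 70% and 20% figures quoted for this pair of samples.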

Figure 3.2 and Figure 3.3 depict the distribution of individual tweets in the two samples, each plotted by its H’_{(tweet)} value along the timeframe. H’_{(tweet)} is the synthetic entropy value of a tweet because it is the sum of P(x_{i}) × log_{2} P(x_{i}) for all components in that tweet. Although the October sample had 70% more tweets than the February sample, their scatterplots showed similar patterns in terms of density and distribution. In the February sample (Figure 3.2), 31 tweets were categorized as high complexity (by either the median value or the average value). Of these, 17 tweets were distributed in the middle-to-top area, and their H’_{(tweet)} values were higher than both the median and the average value of H’_{(tweet)}. These 31 tweets contributed 67% of the total number of being retweeted for all tweets in the sample. In the October sample (Figure 3.3), 54 tweets were categorized as high complexity, and 23 of them were distributed in the middle-to-top area. These 54 tweets contributed 78% of the total number of being retweeted for all tweets in the sample.

#### 4.2.2 The #bloodpressure Collections

The two collections of tweets with #bloodpressure were collected individually from February 11^{th} to 12^{th} (94 tweets in total) and from October 7^{th} to 8^{th} (129 tweets in total). Although the October sample was 37% larger than the February sample, the radar graph of the October sample was only 3% greater than that of the February sample, as shown in Figure 4.1. The shapes of the two radar graphs showed a slight disparity. This indicates that Twitter users who used #bloodpressure were making different choices among the six typical components during the two observation periods. The values of H_{(sample)} for the February and October samples showed only a minor difference (47.22 versus 48.44, as shown in *Table 3*), suggesting that these two collections had a similar degree of complexity in their structures.

Figure 4. Radar Graph and Scatterplots for Tweet Collections with #bloodpressure


While a radar graph depicts the statistical structure of tweets, a scatterplot delineates the dynamics of tweet characteristics over time. The October sample has 37% more tweets than the February sample. However, their scatterplots showed very different patterns in terms of density and distribution. The February sample (Figure 4.2) had more intense tweeting/retweeting activities during the morning of February 12^{th}, 2017. The tweets in the October sample (Figure 4.3) were more evenly distributed. This finding suggests that although these two tweet collections had similar structures in the data stream, the tweeting/retweeting activities associated with #bloodpressure thrived in different timeframes. Even so, the cause of this phenomenon cannot be explained solely by the structural analysis. Therefore, this issue will be further investigated in future studies.
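One simple way to quantify such differences in density over time is to count tweets per clock hour. The sketch below is an illustration only: it assumes timestamps are available as ISO 8601 strings, and the four timestamps are invented rather than taken from the collected data.

```python
from collections import Counter
from datetime import datetime


def tweets_per_hour(timestamps):
    """Count tweets in each clock hour; timestamps are ISO 8601 strings (assumed)."""
    hours = [
        datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        for ts in timestamps
    ]
    return Counter(hours)


# Invented timestamps for illustration.
toy_timestamps = [
    "2017-02-11T21:30:00",
    "2017-02-12T08:05:00",
    "2017-02-12T08:40:00",
    "2017-02-12T09:10:00",
]
counts = tweets_per_hour(toy_timestamps)
busiest_hour, n_tweets = counts.most_common(1)[0]
print(busiest_hour, n_tweets)
```

A burst of activity such as the one seen in the February sample would appear as a few hours with counts well above the rest.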

In the February collection (Figure 4.2), 60 tweets were categorized as high complexity (by either the median value or the average value). All of them were distributed in the middle-to-top area, and their H’_{(tweet)} values were higher than both the median and the average value of H’_{(tweet)}. These 60 tweets contributed 95% of the total number of being retweeted in the sample. In the October collection (Figure 4.3), 79 tweets were categorized as high complexity, and 51 of them were distributed in the middle-to-top area. These 79 tweets contributed 92% of the total number of being retweeted in the sample.

#### 4.2.3 The Comparative Analysis: #hypertension versus #bloodpressure

This pair of healthcare hashtags has some statistics in common. First, their October collections had bigger sample sizes than their February collections, indicating that the intensity of tweeting/retweeting activities with both hashtags was higher in the fall (October) than in late winter (February). This finding could be related to the phenomenon of seasonal variations in blood pressure, which has been examined by several medical studies (Frohlich, 2004; Rosenthal, 2004). Additionally, investigating the seasonality of healthcare hashtags could provide more indirect evidence for seasonal diseases. This study did not include a #hypotension tweet collection because #hypotension was not a popular hashtag: fewer than 10 tweets containing #hypotension were collected by NodeXL during February 2017, which was insufficient for a meaningful comparison.

The slight difference in the shape of the radar graphs was consistent with the findings of Zhang and Chang (2018, January), and it can be explained by the difference in semantic meaning. The disparity in the perception of semantic meaning influenced users to make different choices among the six components when composing tweets. Hypertension means high blood pressure, and its opposite is hypotension, low blood pressure. The #bloodpressure collection could contain tweets about both hypertension and hypotension. Therefore, the users’ choices among the typical components might be different. Another advantage of investigating healthcare hashtag samples is that they provide multi-dimensional information on the subject. For example, the radar graphs and scatterplots of the #hypertension collections showed similar patterns, whereas the #bloodpressure collections demonstrated a slight difference in radar graphs and a very distinct distribution in scatterplots.

Figure 5. Radar Graph and Scatterplots for Tweet Collections with #influenza


### 4.3 Analysis Results of the #influenza Collections and the #flu Collections

#### 4.3.1 The #influenza Collections

The two collections of tweets with #influenza were collected individually from April 18^{th} to 19^{th} (71 tweets in total) and from October 7^{th} to 8^{th} (73 tweets in total). The October sample was nearly the same size as the April sample, and the radar graph of the October sample closely overlapped that of the April sample (Figure 5.1). The shapes of the two radar graphs were almost identical because the two samples had approximately the same variation in the composition of the components in their respective structures (*shape and size*). This indicates that the tweeting/retweeting behaviors of Twitter users who used #influenza were stable over these two observation periods because the users made very similar choices in selecting components when composing their tweets.

The scatterplots of these two collections showed similar patterns in terms of density and distribution. In the April sample (Figure 5.2), 37 tweets were categorized as high complexity (by either the median value or the average value). Of these, 22 tweets were distributed in the middle-to-top area, and their H’_{(tweet)} values were higher than both the median and the average value of H’_{(tweet)}. These 37 tweets contributed 75% of the total number of being retweeted for all tweets in the sample. In the October collection (Figure 5.3), 37 tweets were categorized as high complexity, and 28 of them were distributed in the middle-to-top area. These 37 tweets contributed 80% of the total number of being retweeted in the sample.

#### 4.3.2 The #flu Collections

The two collections of tweets with #flu were collected individually from April 18^{th} to 19^{th} (151 tweets in total) and from October 7^{th} to 8^{th} (502 tweets in total). The October sample is about 230% larger than the April sample, resulting in its greater size in the radar graph. However, the radar graph of the October collection was merely 33% bigger than that of the April collection (Figure 6.1). In addition, as shown in *Table 3*, the value of H_{(sample)} for the April sample is clearly lower than that of the October sample (48.95 versus 65.34). There were only 151 tweets in the April sample in contrast to the 502 tweets in the October sample. This finding was consistent with the values of H’_{(Images)} for this pair of samples (*15.60 versus 23.76*) and of H’_{(Content)} (*33.35 versus 41.59*), as shown in *Table 3*. This revealed that the total number of tweets in each collection influences the calculated results of H’_{(Component)}, H’_{(Content)}, H’_{(Images)}, and H_{(sample)}. On the other hand, the shapes of the two radar graphs were similar, signifying that the tweeting/retweeting behaviors of Twitter users associated with #flu were stable over these two observation periods.

Although the October collection had about 230% more tweets than the April one, their scatterplots showed similar patterns in terms of density and distribution. In the April sample (Figure 6.2), 76 tweets were categorized as high complexity (by either the median value or the average value). Of these, 43 tweets were distributed in the middle-to-top area, and their H’_{(tweet)} values were higher than both the median and the average value of H’_{(tweet)}. These 76 tweets contributed 60% of the total number of being retweeted in the sample. In the October collection (Figure 6.3), there were 215 tweets categorized as high complexity, and 28 tweets of them were distributed in the middle-to-top area. These 252 tweets contributed 71% of the total number of being retweeted in the sample.

Figure 6. Radar Graph and Scatterplots for Tweet Collections with #flu


#### 4.3.3 The Comparative Analysis: #influenza versus #flu

The shapes of the radar graphs of the #influenza collections were similar to those of the #flu collections. This shows that users made similar choices among the six components when tweeting with these two hashtags because their semantic meanings are the same. Nevertheless, the #flu collections demonstrated obvious seasonal variation. The October sample was much bigger than the April one because the flu season starts in October every year (Figure 6.1). The intensive tweeting/retweeting activities associated with #flu in October indicated that flu topics were more popular on Twitter in October than in April (Figure 6.3 versus Figure 6.2). At the same time, the #influenza collections did not demonstrate obvious seasonality (Figure 5.1).

### 4.4 Analysis Results of the #glucose Collections and the #bloodsugar Collections

#### 4.4.1 The #glucose Collections

The three collections of tweets with #glucose were collected individually from January 25^{th} to 26^{th} (75 tweets in total), from February 16^{th} to 17^{th} (42 tweets in total), and from October 7^{th} to 8^{th} (23 tweets in total). The January sample is 79% larger than the February sample and 226% larger than the October sample. The radar graph of the January sample was 20% bigger than that of the February sample and 62% bigger than that of the October sample (Figure 7.1). As shown in *Table 3*, the H_{(sample)} values for these collections are 42.36 for the January sample, 35.31 for the February sample, and 26.11 for the October sample. The shapes of these three radar graphs were similar, indicating that the tweeting/retweeting behaviors of Twitter users associated with #glucose were stable over these three observation periods.

In the January collection (Figure 7.2), 39 tweets were categorized as high complexity (by either the median or the average value). Of these, 31 tweets were distributed in the middle-to-top area, and their H’_{(tweet)} values were higher than both the median and the average value of H’_{(tweet)}. These 39 tweets contributed 68% of the total number of being retweeted in that sample. In the February collection (Figure 7.3), 21 tweets were categorized as high complexity, and 17 of them were distributed in the middle-to-top area. These 21 tweets contributed 77% of the total number of being retweeted in that sample. In the October collection (Figure 7.4), 12 tweets were categorized as high complexity, and 8 of them were distributed in the middle-to-top area. These 12 tweets contributed 19% of the total number of being retweeted in that sample.

#### 4.4.2 The #bloodsugar Collections

The three collections of tweets with #bloodsugar were collected individually from February 7^{th} to 8^{th} (43 tweets in total), from February 15^{th} to 16^{th} (48 tweets in total), and from October 7^{th} to 8^{th} (42 tweets in total). These three samples were about the same size, and the three radar graphs were approximately similar in terms of shape and size (Figure 8.1), indicating that the tweeting/retweeting behaviors of Twitter users associated with #bloodsugar were stable over these three observation periods.

In the early February collection (Figure 8.2), 22 tweets were categorized as high complexity (by either the median or the average value). Of these, 17 tweets were distributed in the middle-to-top area, and their H’_{(tweet)} values were higher than both the median and the average value of H’_{(tweet)}. These 22 tweets contributed 44% of the total number of being retweeted in that sample. In the mid-February collection (Figure 8.3), 24 tweets were categorized as high complexity, and 22 of them were distributed in the middle-to-top area. These 24 tweets contributed 80% of the total number of being retweeted in that sample. In the October collection (Figure 8.4), 24 tweets were categorized as high complexity, and 22 of them were distributed in the middle-to-top area. These 24 tweets contributed 78% of the total number of being retweeted in that sample.

Figure 7. Radar Graph and Scatterplots for Tweet Collections with #glucose


Figure 8. Radar Graph and Scatterplots for Tweet Collections with #bloodsugar


#### 4.4.3 The Comparative Analysis: #glucose versus #bloodsugar

An observed trend in the #glucose collections was the decreasing number of tweets across the data collection periods. This periodic tendency on the Twitter platform might provide indirect evidence of the impact of seasonality on the average glucose level in diabetes patients. A study by Kershenbaum et al. (2011) reported that diabetes patients’ glucose levels tend to be higher in winter than in summer.

Figure 9. Scatterplots with Captures for the #glucose October Collection and the #bloodsugar October Collection


## 5 Discussions

The #glucose October collection and the #bloodsugar October collection were selected to gain insights from a more intuitive presentation. As shown in Figure 9, each tweet in the trail is plotted at its timestamp together with the actual screen capture of that tweet. After each tweet was mapped along the timeline by its H’_{(tweet)} value together with its screenshot, a pattern of characteristics emerged from the scattered tweets. The tweets with higher H’_{(tweet)} values had a higher chance of staying in the middle-to-top area of the scatterplot, while the tweets with lower complexity were more likely to be found in the middle-to-bottom area. Most tweets with high H’_{(tweet)} values contained image(s) or were retweets. The tweets in the middle-to-bottom area had lower H’_{(tweet)} values because of the low complexity of their structure (low variety in the combination of tweet components). Those were mainly original textual tweets without any image attached. All other conditions being equal, a retweet tends to have a higher synthetic H’_{(tweet)} value.

Based on the entropy results, Figure 10 summarizes the types of tweets. The major factors that differentiate tweets in our samples are (1) the complexity level of the tweet structure, and (2) the originality of the tweet (*i.e., whether the tweet is an original tweet or a retweet*). Throughout this study, a simple structure is defined as one with a low level of variation in the combination of different components, whereas a complex structure is one with a high level of variation in the combination of different components. As a result, each tweet in our sample was categorized into one of the following subgroups for further investigation: (1) original tweet with simple structure, (2) retweet with simple structure, (3) original tweet with complex structure, and (4) retweet with complex structure.

The criteria for the high/low level of structural complexity can be set up according to the specific goal. In this study, we considered either (1) the median value of H’_{(tweet)} in each sample, or (2) the average value of H’_{(tweet)} in each sample. One of our goals is to examine whether high structural complexity is associated with a higher number of being retweeted. The statistics of all tweets from the four subgroups in each collection are summarized in *Table 4*. The subgroup of retweets-of-complex-structured-tweets contributed most to *the total number of being retweeted* (a statistic collected by NodeXL in the column “Retweet Count”) among all four subgroups across all 14 tweet collections. On the other hand, the subgroup of original-tweet-with-simple-structure contributed least to *the total number of being retweeted* across all 14 collections.
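The subgroup assignment and the contribution percentages described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the field names `h_tweet`, `is_retweet`, and `retweet_count` are hypothetical, and treating a tweet as "high" complexity only when its value is strictly above the threshold is an assumption, since the text does not say how ties are handled.

```python
from statistics import mean, median


def classify_tweets(tweets, criterion="median"):
    """Split tweets into four subgroups by originality and complexity.

    Each tweet is a dict with hypothetical keys:
      h_tweet        - the tweet's H'(tweet) value
      is_retweet     - True for a retweet, False for an original tweet
      retweet_count  - the "Retweet Count" value for the tweet
    """
    values = [t["h_tweet"] for t in tweets]
    # The threshold is the sample median or the sample average.
    threshold = median(values) if criterion == "median" else mean(values)

    groups = {}
    for t in tweets:
        # Assumption: strictly above the threshold counts as high complexity.
        level = "high" if t["h_tweet"] > threshold else "low"
        kind = "retweet" if t["is_retweet"] else "original"
        groups.setdefault((kind, level), []).append(t)
    return groups


def retweet_share(groups):
    """Percentage of the sample's total retweet count held by each subgroup."""
    total = sum(t["retweet_count"] for g in groups.values() for t in g)
    if total == 0:
        return {key: 0.0 for key in groups}
    return {
        key: 100.0 * sum(t["retweet_count"] for t in g) / total
        for key, g in groups.items()
    }
```

Running both functions once with `criterion="median"` and once with `criterion="average"` would yield the two percentage columns reported for each subgroup.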

Contribution to Total Number of Being Retweeted for Four Subgroups of Tweets in Each Sample *

* The criteria of complexity are based on either (1) the median of each sample, or (2) the average value of H’_{(tweet)} of each sample.

| Investigated Hashtag | Collection Period | Sample Size | Total Number Being Retweeted for Tweets in Each Sample | Retweets with High Complexity (Median, %) | Retweets with High Complexity (Average, %) | Retweets with Low Complexity (Median, %) | Retweets with Low Complexity (Average, %) | Original Tweets with High Complexity (Median, %) | Original Tweets with High Complexity (Average, %) | Original Tweets with Low Complexity (Median, %) | Original Tweets with Low Complexity (Average, %) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| #glucose | Jan-25 to Jan-26 | 75 | 309 | 59.55 | 50.49 | 26.54 | 35.60 | 8.41 | 7.12 | 5.50 | 6.80 |
| #glucose | Feb-16 to Feb-17 | 42 | 275 | 70.91 | 70.91 | 20.36 | 20.36 | 6.55 | 6.18 | 2.18 | 2.55 |
| #glucose | Oct-07 to Oct-08 | 23 | 176 | 17.05 | 15.34 | 81.25 | 82.95 | 1.70 | 0.57 | 0.00 | 1.14 |
| #bloodsugar | Feb-07 to Feb-08 | 43 | 45 | 35.56 | 31.11 | 17.78 | 22.22 | 8.89 | 6.67 | 37.78 | 40.00 |
| #bloodsugar | Feb-15 to Feb-16 | 48 | 114 | 76.32 | 76.32 | 14.04 | 14.04 | 2.63 | 2.63 | 7.02 | 7.02 |
| #bloodsugar | Oct-07 to Oct-08 | 42 | 169 | 75.15 | 77.51 | 14.20 | 11.83 | 0.00 | 0.00 | 10.65 | 10.65 |
| #flu | Apr-18 to Apr-19 | 151 | 490 | 53.27 | 36.73 | 36.53 | 53.06 | 7.14 | 6.33 | 3.06 | 3.88 |
| #flu | Oct-07 to Oct-08 | 502 | 5751 | 69.69 | 64.70 | 28.33 | 33.32 | 0.83 | 0.73 | 1.15 | 1.25 |
| #influenza | Apr-18 to Apr-19 | 71 | 177 | 53.67 | 42.94 | 16.95 | 27.68 | 20.90 | 19.21 | 8.47 | 10.17 |
| #influenza | Oct-07 to Oct-08 | 73 | 323 | 77.40 | 71.52 | 16.10 | 21.98 | 2.79 | 0.93 | 3.72 | 5.57 |
| #bloodpressure | Feb-11 to Feb-12 | 94 | 1582 | 94.37 | 94.37 | 2.40 | 2.40 | 0.32 | 0.32 | 2.91 | 2.91 |
| #bloodpressure | Oct-07 to Oct-08 | 129 | 1469 | 90.67 | 45.20 | 4.90 | 50.37 | 1.70 | 1.70 | 2.72 | 2.72 |
| #hypertension | Feb-11 to Feb-12 | 61 | 69 | 57.97 | 17.39 | 15.94 | 56.52 | 8.70 | 8.70 | 17.39 | 17.39 |
| #hypertension | Oct-07 to Oct-08 | 104 | 959 | 77.37 | 8.55 | 16.27 | 85.09 | 0.52 | 0.42 | 5.84 | 5.94 |

Classification of Tweet Types


Table 5 organizes the contribution of each component to the *total number of being retweeted* across all 14 collections. In the most contributive subgroup, namely retweets-with-complex-structure, 100% of tweets contained the *@username* component. Although @username has the highest opportunity cost, this point-to-point communication feature was the most efficient way to get a reply (a reply is labeled as a type of retweet in NodeXL). The second most influential component was *#hashtags*, because the more hashtags a tweet uses, the higher the chance that it can be found through the Twitter search function. The percentage of tweets with image(s) is higher in the retweets-with-complex-structure subgroup, which indicates that a tweet with image(s) is more likely to be associated with a greater value in the NodeXL column “*Retweet Count*”. The unused-space component, on the other hand, is an inverse variable: the less space left unused, the more content a tweet has. The retweets-with-complex-structure subgroup had the lowest percentage of unused space, which means that the tweets in this subgroup made the most efficient use of the Twitter character limit. However, a celebrity effect could also boost the number of retweets. In this study, a celebrity effect is defined as a Twitter account having far more followers than average, so that its tweets receive more retweets owing to the enthusiasm of its fans. In essence, assuming all other conditions remain the same and there is no celebrity effect, the best strategy for increasing a healthcare message’s chances of being retweeted on Twitter is to (1) use @username to mention an account with a celebrity effect, (2) incorporate more hashtags, (3) incorporate image(s) to draw users’ attention and compete with purely text-based tweets, and (4) use as much of the available space as possible to be more informative.
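The per-component percentages discussed above can be illustrated with a short Python sketch. It is only an assumption about how such a tabulation might be done: the boolean flags `image`, `unused_space`, `username`, `hyperlink`, and `hashtags` are hypothetical field names, and representing each component as present or absent is a simplification made for illustration.

```python
# Hypothetical component flags; TEXT is omitted because every tweet has text.
COMPONENTS = ("image", "unused_space", "username", "hyperlink", "hashtags")


def component_percentages(tweets):
    """Percentage of tweets in one subgroup that contain each component.

    Each tweet is a dict mapping a component name to True/False.
    A missing key is treated as "component absent".
    """
    n = len(tweets)
    if n == 0:
        return {c: 0.0 for c in COMPONENTS}
    return {
        c: 100.0 * sum(1 for t in tweets if t.get(c, False)) / n
        for c in COMPONENTS
    }
```

Applying the function to each of the four subgroups in turn would give one row of percentages per subgroup, which is the kind of comparison the paragraph above draws between subgroups.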

This study introduces the H_{(x)} matrix for analyzing the structural complexity of tweet collections. However, readers should note the following characteristics of the H_{(x)} matrix before drawing conclusions from its calculated results.

- The H_{(x)} index only reflects the relative degree of complexity in statistical structures. According to information theory, the statistical structure of a message is irrelevant to the semantic aspect of communication, which means that structural complexity does not necessarily lead to richer information in content. H_{(x)} is not designed as an indicator for evaluating the value of tweet content.
- The synthetic value of H’_{(tweet)} mainly depends on the sample size. For example, H’_{(tweet)} ranges from 0 to 3 in the #glucose October collection (23 tweets in total), while it ranges only from 0 to 0.23 in the #flu October collection (502 tweets in total). For this reason, comparing H’_{(tweet)} among different tweets is valid only within a sample. H_{(sample)}, on the other hand, can be used across samples because it is, by definition, close to the classical entropy concept. It can serve as an indicator of the variety of composition because it is based on the complexity of the statistical structure.
- The objective of this study is not to find factors in tweets that influence the probability of being retweeted. The observation that a higher level of complexity is associated with a higher number of being retweeted does not guarantee a causal relationship. The method in this study is intended neither for statistical inference nor for linear regression modeling; therefore, a causal relationship cannot be determined by this method alone.

Our work attempts to develop an alternative tool for social science researchers. According to Kolmogorov (1968), the dominance of statistical methods in social science has become its distinct feature. Information theory is an important branch of probability theory and shares similarities with the statistics used in social science research. Mathematical statistics tries to describe the overall situation with variables that theoretically represent the majority, and then infers the whole picture of a phenomenon from a small sample under the assumption that all real-world data follow the same distribution. Information theory, on the other hand, analyzes the structure and properties of a system on the premise that all the conditions in the system, or all the variations of all its elements, are known in advance (theoretically or empirically).
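For reference, the classical entropy concept mentioned above is Shannon's H = −Σ p_i log2 p_i. The sketch below computes it for a list of component counts. It illustrates only the classical formula, under the assumption that counts over the six tweet components are available; it is not the paper's definition of H_{(sample)} or H’_{(tweet)}, which are given in earlier sections.

```python
import math


def shannon_entropy(counts):
    """Classical Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:  # a zero count contributes nothing to the sum
            p = c / total
            h -= p * math.log2(p)
    return h
```

With six components, the value is largest when all six are equally frequent (log2 6, about 2.585 bits) and is zero when a single component accounts for everything, which is why entropy can be read as a measure of variety in composition.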

The major difference between the proposed method and other information-theoretical approaches is that past endeavors tried to apply the entropy equation directly to models built from real-world phenomena. The variables in those models are derived from intuitive concepts, so oversimplification is a concern. The proposed method for analyzing tweet collections is built on data taken directly from the Twitter platform, so the objectivity of the variables’ representation is no longer an issue.

Contribution of Each Typical Component to the Total Number of Being Retweeted in All Samples^{3}^{4}

Columns from Image(s) through Hashtags report the percentage of the component in each category.

| Category | Weight in All Samples (Median, %) | Weight in All Samples (Average, %) | Contribution to Total Number of Being Retweeted in All Samples (Median, %) | Contribution to Total Number of Being Retweeted in All Samples (Average, %) | Image(s) (Median, %) | Image(s) (Average, %) | Unused Space (Median, %) | Unused Space (Average, %) | @username (Median, %) | @username (Average, %) | Hyperlink (Median, %) | Hyperlink (Average, %) | Hashtags (Median, %) | Hashtags (Average, %) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Retweets with High Complexity | 34.64 | 26.47 | 69.15 | 53.72 | 78.02 | 98.45 | 31.68 | 16.84 | 100.00 | 100.00 | 62.77 | 56.22 | 84.16 | 84.97 |
| Original Tweets with High Complexity | 17.90 | 13.17 | 3.75 | 3.29 | 72.03 | 93.23 | 84.67 | 79.69 | 32.18 | 22.92 | 84.29 | 84.38 | 78.16 | 79.69 |
| Retweets with Low Complexity | 19.48 | 27.64 | 22.21 | 37.65 | 11.62 | 11.91 | 23.59 | 40.20 | 100.00 | 100.00 | 33.10 | 47.64 | 69.37 | 72.95 |
| Original Tweets with Low Complexity | 27.98 | 32.72 | 4.88 | 5.35 | 3.43 | 4.61 | 84.07 | 86.16 | 10.78 | 17.61 | 78.92 | 79.66 | 80.39 | 79.45 |
| Retweets | 54.12 | 54.11 | 91.36 | 91.37 | 29.29 | 29.35 | 15.57 | 15.57 | 54.12 | 54.11 | 28.19 | 28.05 | 42.67 | 42.66 |
| Original Tweets | 45.88 | 45.89 | 8.63 | 8.64 | 13.85 | 13.79 | 38.68 | 38.69 | 8.78 | 8.78 | 37.17 | 37.18 | 36.48 | 36.49 |
| High Complexity | 52.54 | 39.64 | 72.90 | 57.01 | 18.12 | 19.17 | 13.02 | 19.53 | 48.77 | 56.92 | 21.71 | 28.35 | 36.60 | 43.25 |
| Low Complexity | 47.46 | 60.36 | 27.09 | 43.00 | 7.32 | 7.84 | 41.27 | 45.95 | 7.04 | 9.79 | 39.14 | 43.13 | 39.23 | 42.74 |

## 6 Conclusion

This study examines the use of medical-terminology versus lay-language hashtags on Twitter by introducing an entropy-related indicator, H_{(x)}, to quantify the level of structural complexity in each tweet collection. The visualizations (the radar graph and the scatterplot) are intuitive demonstrations and provide insights into the tweet structure for healthcare communication. Given its exploratory nature, this case study has limitations that future studies could address. First, it investigated only three pairs of hashtags, so the results might reflect only part of the whole picture. More pairs of medical-terminology and lay-language hashtags with similar semantic meanings should be compared to reach a more generalizable conclusion. Second, video and emoji are important features commonly incorporated in tweets. Future studies should therefore consider adding video and emoji as components in the current coding scheme and integrating more sophisticated methods of representation.

The H_{(x)} matrix and its data visualizations (radar graph and scatterplot) unveil patterns within the structure of hashtag collections. The entropic method presented in this study has the potential to be automated. It also allows researchers to examine Twitter data from a new perspective based on information theory. Finally, beyond the mainstream Twitter studies that focus on semantic analysis, this study offers a novel way to probe Twitter data by concentrating on what Claude Shannon called the engineering aspect of communication.

The practical implications of this study are twofold. First, the proposed entropy matrix can be mapped to the radar graph visualization, illustrating the composition of the six tweet components in any given tweet collection for intuitive comparison. Second, information theory accounts not for semantic meanings but for transmission structures in the communication process. The entropy-based scatterplot visualization can assist in differentiating more informative tweets from less informative ones without considering semantic meanings.

## References

Anderson, R. L., & O’Connor, B. C. (2009). Reconstructing Bellour: Automating the semiotic analysis of film. Bulletin of the American Society for Information Science and Technology, 35(5), 31–40.

Bailey, K. D. (1990). Social entropy theory. New York, USA: SUNY Press.

Boltzmann, L. (1877). On the nature of gas molecules. The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, 3(18), 320.

Brillouin, L. (1953). The negentropy principle of information. Journal of Applied Physics, 24(9), 1152–1163.

Brillouin, L. (1962). Science and information theory. New York, USA: Academic Press.

Clausius, R. (1867). The mechanical theory of heat: With its applications to the steam-engine and to the physical properties of bodies. London: J. van Voorst.

Frohlich, E. D. (2004). Seasonal variations in blood pressure. The American Journal of Geriatric Cardiology, 13(5), 274–275.

Gibbs, J. W. (1878). ART. LII.--On the Equilibrium of Heterogeneous Substances. American Journal of Science and Arts (1820-1879), 16(96), 441.

Graham, P, C. (2002, October 14). Claude E. Shannon: Founder of Information Theory. Scientific American. Retrieved from https://www.scientificamerican.com/article/claude-e-shannon-founder/

Hartley, R. V. (1928). Transmission of information. Bell Labs Technical Journal, 7(3), 535–563.

Hayes, R. M. (1993). Measurement of information. Information Processing & Management, 29(1), 1–11.

Hayek, F. A. (1967). Studies in philosophy, politics and economics. London, England: Routledge & Kegan Paul.

Horgan, J. (2016, April 27). Claude Shannon: Tinkerer, Prankster, and Father of Information Theory. IEEE Spectrum. Retrieved from https://spectrum.ieee.org/tech-history/cyberspace/claude-shannon-tinkerer-prankster-and-father-of-information-theory

Houts, P. S., Doak, C. C., Doak, L. G., & Loscalzo, M. J. (2006). The role of pictures in improving health communication: A review of research on attention, comprehension, recall, and adherence. Patient Education and Counseling, 61(2), 173–190.

Johnson, G. (2001, February 27). Claude Shannon, Mathematician, Dies at 84. The New York Times. Retrieved from http://www.nytimes.com/2001/02/27/nyregion/claude-shannon-mathematician-dies-at-84.html

Kearns, J., & O’Connor, B. (2004). Dancing with entropy: Form attributes, children, and representation. The Journal of Documentation, 60(2), 144–163.

Kershenbaum, A., Kershenbaum, A., Tarabeia, J., Stein, N., Lavi, I., & Rennert, G. (2011). Unraveling seasonality in population averages: An examination of seasonal variation in glucose levels in diabetes patients using a large population-based data set. Chronobiology International, 28(4), 352–360.

Kinsner, W. (2007). Is entropy suitable to characterize data and signals for cognitive informatics? International Journal of Cognitive Informatics and Natural Intelligence, 1(2), 34–57.

Kolmogorov, A. (1968). Logical basis for information theory and probability theory. IEEE Transactions on Information Theory, 14(5), 662–664.

Luce, R. D. (2003). Whatever happened to information theory in psychology? Review of General Psychology, 7(2), 183–188.

Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. San Francisco, CA: W.H. Freeman.

Miller, G. A. (1953). What is information measurement? The American Psychologist, 8(1), 3–11.

Nyquist, H. (1924). Certain factors affecting telegraph speed. Transactions of the American Institute of Electrical Engineers, XLIII, 412–422.

Popper, K. (1963). Conjectures and refutations: The growth of scientific knowledge. London, England: Routledge & Kegan Paul.

Prigogine, I., & Stengers, I. (1984). Order out of chaos: Man’s new dialogue with nature. New York, USA: Bantam Books.

Rice, S. O. (1944). Mathematical analysis of random noise. Bell Labs Technical Journal, 23(3), 282–332.

Ritchie, D. (1986). Shannon and Weaver: Unravelling the paradox of information. Communication Research, 13(2), 278–298.

Rosenthal, T. (2004). Seasonal variations in blood pressure. The American Journal of Geriatric Cardiology, 13(5), 267–272.

Shannon, C. E. (1948). A mathematical theory of communication, Part I, Part II. The Bell System Technical Journal, 27(4), 623–656.

Tribus, M. (1983). Thirty years of information theory. In F. Machlup (Ed.), The Study of Information: Interdisciplinary Messages (pp. 475–513). New York, USA: Wiley.

Verdu, S. (1998). Fifty years of Shannon theory. IEEE Transactions on Information Theory, 44(6), 2057–2078.

Weaver, W. (1953). Recent contributions to the mathematical theory of communication. Etc.; a Review of General Semantics, 10(4), 261–281.

Zhang, Y., & Chang, H. C. (2018, January). Selfies of Twitter Data Stream through the Lens of Information Theory: A Comparative Case Study of Tweet-trails with Healthcare Hashtags. In Proceedings of the 51st Hawaii International Conference on System Sciences.

Zunde, P. (1987). Information Science Laws and Regularities: A Survey. In J. Rasmussen & P. Zunde (Eds.), Empirical Foundations of Information and Software Science III (pp. 243–270). New York, USA: Springer US.

## Footnotes

^{1} *The typical component of TEXT is not included in this table because every tweet has textual content*.

^{2} *Percentage of tweets that contain hashtag(s) other than the investigated hashtag, because all the tweets in each sample contain the investigated hashtag of that sample according to the data cleaning criteria*.

^{3} The component TEXT is not included in this table because every tweet has textual content.

^{4} The criteria of complexity for each sample are based on either (1) the median value of all H’_{(tweet)}, or (2) the average value of all H’_{(tweet)}.