Social tagging provides an easy and intuitive way to annotate, share, and retrieve resources on the Web. In recent years, several studies have focused on constructing tag trees (Almoqhim, Millard, & Shadbolt, 2013; Benz et al., 2010; Candan, Di Caro, & Sapino, 2008; Helic & Strohmaier, 2011; Heymann & Garcia-Molina, 2006; Li et al., 2007; Luo & Chen, 2013; Tsai et al., 2009; Verma et al. 2015). Some of them depended on expert thesauri and tried to derive a model of knowledge domain with precise concepts and relationships. Other studies applied hierarchical clustering algorithms to classify tags based on their similarity. While much work has been centered on the structure of social tagging systems, little is known about the ways of how users navigate such systems.
A tag tree can serve as a navigation tool in many cases, especially when users explore an unfamiliar domain, or their information needs are not specific enough. From the users’ perspective, they may not care whether the semantic relationships are as canonical as those defined by domain experts; yet they would be deeply impressed if the navigation experience with a tag tree is smooth.
In order to adapt to multiple complex domains, some foundational algorithms have been put forward to generate tag trees without expert thesauri (e.g. Benz et al., 2010; Heymann & Garcia-Molina, 2006; Strohmaier et al., 2012b). However, a problem of semantic drift along a path, such as “
The contributions of this study are twofold: (1) an algorithm was developed to construct tag trees with consideration given to both semantic coherence and structural balance; and (2) the effectiveness of a node generality metric,
The paper is organized as follows: Section 1 introduces the background and the contributions of this study. Section 2 reviews the related work. Section 3 discusses the desired features of a tag tree from the perspective of navigation of tagging systems, followed by Section 4, which proposes the algorithm for constructing a desired tag tree from a tagging network. Then Section 5 evaluates the algorithm as well as the node generality metric based on Open Directory Project (ODP) classification. Finally Section 6 is the conclusion as well as suggestions for the future work.
Organizing resources into a hierarchical structure for subject browsing has been recognized as an important tool in information-seeking processes (Golub & Lykke, 2009). A hierarchy can offer searchers information on the collection being searched before any interaction happens. Chen et al. (2012) explored Web resource organization structures based on open-access Web FTP sites. They found that users usually have an upper limit for the depth of the hierarchical structures when organizing resources. With the increasing amount of resources, users preferred to have a flatter structure instead of a deeper one. According to the study, structural complexity needs to be controlled.
Tags are non-hierarchical keywords given by different users to describe information resources. Although unavoidably noisy and ambiguous, these tags reflect the context of the resource domain and the evolution of the knowledge. Thus tags have been widely used in digital resource retrieval, classifications, and summarization. The relationships between tags also facilitate peoples’ understanding of the knowledge of the resource domain. Some studies have tried to generate taxonomies by finding “is-a” or “a-part-of” relationships among concept tags (Si, Liu, & Sun, 2010; Tsui et al., 2010, Verma et al., 2015). Many tags were removed if they were not taken as concepts. Naturally, these taxonomies were not suitable for the purpose of navigation. On the other hand, other studies aimed to develop more loosely structured hierarchies for the purpose of navigating through a large number of resources (e.g. Candan, Di Caro, & Sapino, 2008; Helic et al., 2010; Heymann, & Garcia-Molina, 2006; Huang et al., 2013; Sinclair & Cardew-Hall, 2008). These hierarchies usually involved various tags and were more practical when users were not familiar with the resource domain. This study belongs to this stream and goes further on semantic coherence and structural balance of the hierarchy for the purpose of navigation.
Two aspects are important in developing a tag tree: selecting quality tags as nodes of the hierarchy and inferring reasonable parent-children relationship among the selected tags.
An empirical study on complex dynamics of tagging systems (Halpin, Robu, & Shepherd, 2007) has shown that consensus around stable tag distributions and shared vocabularies did exist, even if there was no central controlled vocabulary. It is suggested that the popular tags used by different people were valuable in understanding a given domain. Therefore many studies used tag frequency as a simple but effective tag selection criterion (e.g. Strohmaier, Körner, & Kern, 2012; Suchanek, Vojnovic, & Gunawardena, 2008).
The hierarchy generation methods need to detect the hidden relationships among tags and organize them to represent the resource domain. The previous approaches proposed in the literature include three types: (1) developing heuristic rules based on natural language processing (NLP) (Tsui et al., 2010); (2) depending on a thesaurus such as Wordnet (Verma et.al., 2015); and (3) using unsupervised methods (Gemmell et al., 2008; Huang et al., 2013; Heymann, & Garcia-Molina, 2006; Luo, & Chen, 2013; Si, Liu, & Sun, 2010; Zhou et al., 2007). The first two types have limitations in some circumstances because of manual cost or the scope of thesauri.
Among the unsupervised methods, hierarchical clustering-based methods (Begelman, Keller, & Smadja, 2006; Gemmell et al., 2008; Zhou et al., 2007) and tag generality-based methods (Heymann, & Garcia-Molina, 2006; Luo & Chen, 2013) were widely used. Clustering-based methods iteratively partitioned the tags into groups and built the tag tree using either bottom-up or top-down approaches. In contrast generality-based methods first ranked the tags by their generality in a flat tag network and then added a tag as a child of its most similar tag that had already existed in the hierarchy. A typical problem in clustering-based methods is the interpretability of a node’s label. Although related tags were hierarchically clustered, it was difficult to determine which one would represent a category naturally. The generality-based methods were mainly derived from the idea of Heymann and Garcia-Molina (2006), who proposed a simple, efficient algorithm for converting a large set of tags into a navigable tag tree (Camiña, 2010).
According to Heymann and Garcia-Molina (2006), tags were first linked in an undirected graph if their similarity is above a predefined similarity threshold. They were then ranked in descending order by their generality, and these ranked tags were added to the tree through comparing their similarity with the present tags in the tree. Note that, in their study,
In generality-based methods, tag networks could be built based on tag similarity (Heymann & Garcia-Molina, 2006) and co-occurring frequency (Benz et al., 2010), denoted as Cos and Cooc, respectively; the generality of a tag could be measured by its closeness or degree centrality in the network, represented as ClosCen and DegCen. According to Helic et al. (2011) and Strohmaier et al. (2012), the performance of DegCen/Cooc and ClosCen/Cos had no significant difference. However, the DegCen/Cooc combination has a lower computing cost and therefore is more practical. In this case, this study adopted DegCen/Cooc combination as baseline.
Although Heymann’s algorithm and its variations of other studies were considered effective in the above evaluation experiments, there are two issues that have not been solved by either DegCen/Cooc or ClosCen/Cos.
Semantic drift refers to the phenomenon that the semantics of nodes along a path are not coherent; for example, “ The path was taken from a tree generated with Heymann’s algorithm on Social-ODP-2k9 dataset (Luo & Chen, 2013).
One possible reason for this problem is that the similarity between a child and its parent is not the global optimal value in Heymann’s algorithm. Markines et al. (2009) noticed this issue in their qualitative evaluation. In the study of Tsai et al. (2009), the combination of the proposed sibling independence rule and the hierarchical feature was close to a solution to prevent semantic drift from happening, yet they did not test semantic coherence in their experiment. In our previous study (Luo & Chen, 2013), we have proposed a hybrid method combining clustering and generality, aiming at the semantic coherence inside a sub-tree. However, that work did not consider the representativeness of nodes when deciding siblings for a parent. In the current study, we consider representativeness to keep the diversity among sibling nodes of one parent. This is also a complement to avoid semantic drift.
The methods have a risk of structural skew since there is no measure to control the structural balance between branches. When the distribution of tags’ similarity relationships is uneven, this method may result in extremely deep paths and dense branches. According to Wiesman, van den Herik, and Hasman (2004), a hierarchical structure could create usability problems if the breadth and depth of the structure is not well designed. A searcher needs to be able to understand the relationship being presented along the structure to avoid any potential confusion. Song, Qiu, and Farooq (2011) have proposed the idea that the optimal position for a tag in a tree should minimize the change of the current tree in its depth and breadth growth. In Heymann and Garcia-Molina’s approach (2006), if a tag was not similar to an existing tag in the tree above a certain threshold, it would be taken as a child of the root. In some cases, many trivial nodes are set right under the root. Such a structure is harmful to navigation.
The tag tree constructed by the proposed method is supposed to be a user-friendly navigable tool as well as a domain knowledge markup tool. When exploring in a tag tree, users will always track along the paths that most likely lead them to find their desired information. Thus in each decision, they would take the branch whose name most closely meets their search intention. If none of the names indicates a feasible direction, they probably roll back to an ancestor and restart from another branch, or just abandon the exploration.
Based on the above understanding, we define several desired features:
The nodes in the tree should be representative. A tag is selected as a node in the tree if, (a) it can represent a group of similar tags in terms of meaning; and (b) it has the capability to associate many other nodes in terms of its role in the network. According to the selection, all the tree nodes collectively will represent the main content of the resource domain as complete as possible. The meaning of sibling nodes should have the least overlap. The judgement cost for users is thus lowered when they face multiple choices. This idea is borrowed from the field of search engines, where diversified retrieval results are kept to illuminate a searcher’s vague searching intention (Agrawal et al., 2009; Carbonell & Goldstein, 1998; Rafiei, Bharat, & Shukla, 2010). The detailed implementation is shown in Algorithm 1 of Section 4. Get representative nodes from tag set The relationship between nodes should be reasonable and intuitive. This means that the semantic meaning along a path keeps coherent with as little drift as possible. In addition, the meaning from parent nodes to their children should be increasingly fine-grained. For arbitrary node To keep semantic coherence, the similarity between the nodes in level Equation (1) denotes that the similarity between In order to enhance the navigability of the tag tree, representative nodes need to be selected by similarity and generality. The structure is desired to have a relative balance between breadth and depth, as well as a balance of density of each branch. In an extremely dense or deep hierarchy, users will have to spend considerable time deciding the exploration path. To solve this problem, in this study, the structural balance is controlled by specifying the maximum number of children for each parent by their generality scores. 1. Initialize 2. Get generality for each tag 3. 4. for 5. select the tag 6. 7. for 8. generality according to their similarity to the selected node 9. end for 10. end for 11. return to
Given a resource set
In this article, a tag network was built by co-occurrence in order to compare with the well-established solution DegCen/Cooc, according to Helic et al. (2011) and Strohmaier et al. (2012). And the generality of a tag was measured by
In a weighted network, the
Given a tag
The difference between
Representativeness in Feature (1) in Section 3.1 is different from the generality. It is determined by generality and similarity in the proposed method. Nodes of high representativeness will be assigned close to the root. Yet a general node is not necessarily representative if it is very similar to an existing representative node.
The basic idea of the proposed algorithm is as follows. First, the nodes are divided recursively to at most
Selecting the representative nodes involves the judgement of both similarity and generality. Initially, the node of max generality is chosen from a given tag set, and the rest of the nodes are punished according to their similarity to it. The punishment is to keep diversity among siblings and avoid overlapping semantic boundary in each level, as discussed in Feature (1) in Section 3.1.
In Algorithm 1,
At first, the input tag set
In Algorithm 2,
Construct a tag tree.1. 2. 3. if 4. 5. for 6. if ( 7. 8. if ( 9. 10. end for 11. for 12. build a node 13. 14. add 15. end for 16. return to
In each iteration, the representative tags (at most
If the generality of a node
The tag similarity comparison has been involved in Line 7 of Algorithm 2 and Line 8 in Algorithm 1. Cosine similarity is applied to tag frequency vectors to compare their relationships.
The experiment was conducted using dataset PINTS
The baseline algorithm was derived from Heymann and Garcia-Molina (2006), and thus it is denoted as
However, Helic et al. (2011) and Strohmaier et al. (2012) did not mention the value of a threshold, above which a link was permitted to be a child of an existing tag other than the root. The higher the threshold was, the more trivial nodes were taken as the children of the root. For detail on the function of the threshold, readers may refer to Section B of the origin literature (Heymann & Garcia-Molina, 2006). In this study, the comparison between a child and an existing tag was based on their co-occurrence.
To choose a reasonable threshold in
The semantic evaluation results of different similarity threshold setting values in Threshold Degree centrality TP TR F1 TP TR F1 0 0.0448 0.0368 0.0404 0.0509 0.0405 0.0451 1 0.0448 0.0368 0.0404 0.0509 0.0405 0.0451 2 0.0448 0.0368 0.0404 0.0509 0.0405 0.0451 3 0.0448 0.0368 0.0404 0.0510 0.0405 0.0451 4 0.0448 0.0368 0.0404 0.0510 0.0405 0.0451 5 0.0448 0.0368 0.0404 0.0509 0.0404 0.0451 6 0.0448 0.0368 0.0404 0.0509 0.0404 0.0451 7 0.0447 0.0367 0.0403 0.0509 0.0404 0.0450 8 0.0446 0.0365 0.0402 0.0508 0.0402 0.0449 9 0.0445 0.0365 0.0401 0.0507 0.0402 0.0448 10 0.0443 0.0364 0.0400 0.0506 0.0401 0.0447 20 0.0417 0.0339 0.0374 0.0479 0.0376 0.0421 30 0.0382 0.0309 0.0342 0.0446 0.0346 0.0389 40 0.0344 0.0273 0.0305 0.0406 0.0310 0.0351 50 0.0316 0.0249 0.0279 0.0378 0.0286 0.0325 60 0.0298 0.0232 0.0261 0.0360 0.0267 0.0307 70 0.0279 0.0218 0.0245 0.0343 0.0252 0.0290 80 0.0264 0.0205 0.0231 0.0328 0.0238 0.0276 90 0.0252 0.0198 0.0222 0.0319 0.0230 0.0268 100 0.0241 0.0187 0.0211 0.0308 0.0221 0.0257
Table 1 also denotes that the
The biggest challenge in evaluating the navigation of a tag tree is the lack of a baseline. The difficulty is mainly because the resources may come from different domains, and the hierarchies can be generated for different purposes. Researchers compared the auto-generated tag hierarchy against those well-accepted taxonomies in certain resource domains, such as Wordnet, Yago, and Wikitaxonomy. (Strohmaier et al., 2012). In comparing taxonomy generation techniques, Camiña (2010) has proposed several criteria to help develop the evaluation methodology. However, we need to keep in mind that taxonomy serves to represent a logical model of the knowledge domain with precise concepts, instances, and their relationships, rather than facilitating users’ browsing behavior by adapting to a given resource set.
The baseline hierarchy is the Open Directory Project (ODP)
Each
Given node
For the whole tree
We calculated the
Since a tag might be found in multiple ODP branches (e.g. the tag “Sports” appears in both paths “Arts/Movies/Genres/Sports” and “Business/Arts_and_ Entertainment/Sports”), the final
The
Since the
Figure 2 also shows the comparison of degree and
The possible reason why
What should be noted for the purpose of navigation is that precision is more important than recall. The reason is that precision reflects both the semantic and structural features of a tag tree, which is essential to the users. Although the
Finally, the reason why the values of
Two visualized fragments of
In addition, the algorithm of
This study proposed an algorithm for developing a navigation-oriented tag tree from a tag dataset. The goal of the algorithm is to build a tree characterized by: (1) the ancestors should be more representative than the descendants; (2) the semantic meaning along the node paths needs to be coherent; (3) the children of one parent are collectively exhaustive and mutually exclusive in describing their parent; (4) last but not least, tags are roughly evenly distributed to their upper-level parents to avoid structural skew. The proposed algorithm as well as the
The proposed algorithm will benefit the development of resource navigation systems. It can also be used in managing different online resources, such as academic publications, government documents, and medical communities based on a navigable hierarchy.
What should be highlighted is that the algorithm can be extended to applications in multiple resource domains to generate a flexible domain knowledge structure that is easy to navigate. Easy navigation is possible because the tags mentioned in the algorithm can be not only the social annotations, but also keywords created by authors or experts, as well as automatically extracted from text.
We did a preliminary evaluation in this study. More details are expected on the usability of the hierarchy. As our future work, a thorough investigation into the evaluation methodology is needed, including user studies and comprehensive metrics for navigation performance.