Finding sequential patterns with TF-IDF metrics in health-care databases

Open access

Abstract

Frequent sequential pattern mining is usually defined as finding ordered lists of items that occur in a database more often than a user-defined threshold. For large, dense databases that contain very long sequences and large item sets, such as medical case histories, algorithms built on counting occurrences output an enormous number of highly redundant frequent sequences and are therefore impractical. There is thus a need for algorithms that perform frequent pattern search and prefiltering simultaneously. In this paper, we propose an algorithm that reinterprets the notion of support on a text mining basis, using TF-IDF metrics. Experiments show that our method not only eliminates redundancy among the output sequences, but also scales much better with very large input sizes. We apply the algorithm to mining medical databases, to find which diagnoses are likely to lead to a certain future health condition.
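To make the two notions in the abstract concrete, the following minimal Python sketch contrasts classical support counting with a textbook TF-IDF weight. It is an illustration under stated assumptions, not the algorithm proposed in the paper: the function names, the toy case histories, and the choice of treating each patient sequence as a "document" are all hypothetical, and the TF-IDF formula is the standard one from text mining rather than the paper's redefinition of support.

```python
import math
from collections import Counter


def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is respected.
    return all(item in it for item in pattern)


def support(pattern, database):
    """Classical support: number of sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def tf_idf(item, sequence, database):
    """Textbook TF-IDF of `item`, treating each sequence as a document.

    Assumption: this is the standard text-mining formula, used here only
    to illustrate the idea; the paper's own definition may differ.
    """
    tf = Counter(sequence)[item] / len(sequence)
    df = sum(item in seq for seq in database)
    return tf * math.log(len(database) / df) if df else 0.0


# Hypothetical toy case histories (one list of diagnoses per patient).
histories = [
    ["checkup", "flu", "cough", "pneumonia"],
    ["checkup", "flu", "checkup", "cough", "pneumonia"],
    ["checkup", "flu"],
    ["checkup", "cough"],
]

frequent = support(["flu", "pneumonia"], histories)        # counted in 2 histories
ubiquitous = support(["checkup"], histories)               # counted in all 4
weight_common = tf_idf("checkup", histories[0], histories)  # idf = log(4/4) = 0
weight_rare = tf_idf("pneumonia", histories[0], histories)  # 0.25 * log(4/2)
```

In this toy data, "checkup" has the highest classical support yet a TF-IDF weight of zero, because it appears in every history. This is the kind of uninformative, redundancy-inducing item that a TF-IDF-style measure downweights, whereas a pure occurrence count would rank it first.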


Acta Universitatis Sapientiae, Informatica

The Journal of "Sapientia" Hungarian University of Transylvania
