Augmenting Statistical Data Dissemination by Short Quantified Sentences of Natural Language

Open access

Abstract

Data from National Statistical Institutes is generally considered an important source of credible evidence for a variety of users. Summarizing and disseminating this data by traditional methods is a convenient way to provide the evidence, but the result is usually comprehensible only to users with a considerable level of statistical literacy. A promising alternative lies in augmenting the summarization linguistically. Less statistically literate users (e.g., domain experts and the general public), as well as disabled people, can benefit from such summarization. This article studies the potential of summaries expressed in short quantified sentences. A summary such as “most visits from remote countries are of a short duration” can be immediately understood by diverse users. Linguistic summaries are not intended to replace existing dissemination approaches, but to augment them by providing alternatives for the benefit of diverse users of official statistics. Linguistic summarization can be achieved by formalizing linguistic terms and relative quantifiers mathematically as fuzzy sets. To avoid summaries based on outliers or on data with low coverage, a quality criterion is applied. The concept is demonstrated on test interfaces that interpret summaries from real municipal statistical data. The article identifies a number of further research opportunities and demonstrates ways to explore them.
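To make the idea of a short quantified sentence concrete, the following minimal Python sketch evaluates the example summary “most visits from remote countries are of a short duration” on a handful of invented records. It follows the commonly used relative-quantifier scheme in which the validity of “Q R are S” is the quantifier's membership degree of the proportion sum(min(R, S)) / sum(R). Everything specific here is an assumption made for illustration and is not taken from the article: the membership breakpoints for "remote", "short" and "most", the sample data, the function names, and the simple coverage threshold that stands in for the quality criterion.

```python
def rising(x, zero, full):
    """Membership that is 0 up to `zero` and rises linearly to 1 at `full`."""
    if x <= zero:
        return 0.0
    if x >= full:
        return 1.0
    return (x - zero) / (full - zero)


def falling(x, full, zero):
    """Membership that is 1 up to `full` and falls linearly to 0 at `zero`."""
    return 1.0 - rising(x, full, zero)


def summarize(records, mu_r, mu_s, mu_q, min_coverage=0.1):
    """Return (validity, coverage) of the sentence "Q R-records are S".

    coverage is taken here as the average membership in R, a simplifying
    assumption; sentences below `min_coverage` get validity 0 so that a
    summary is not built on a few outlying records.
    """
    if not records:
        return 0.0, 0.0
    r = [mu_r(rec) for rec in records]
    # minimum t-norm models the conjunction "R and S"
    r_and_s = [min(m, mu_s(rec)) for m, rec in zip(r, records)]
    total_r = sum(r)
    if total_r == 0:
        return 0.0, 0.0
    coverage = total_r / len(records)
    proportion = sum(r_and_s) / total_r
    validity = mu_q(proportion) if coverage >= min_coverage else 0.0
    return validity, coverage


# Hypothetical visits: (distance of origin country in km, nights stayed)
visits = [(5000, 1), (4000, 2), (3500, 1), (200, 10), (6000, 7)]

remote = lambda v: rising(v[0], 1000, 3000)   # assumed breakpoints
short = lambda v: falling(v[1], 2, 5)         # assumed breakpoints
most = lambda p: rising(p, 0.5, 0.85)         # assumed relative quantifier

validity, coverage = summarize(visits, remote, short, most)
```

With these invented records, four of the five visits count as remote and three of those four are short, so the proportion passed to the quantifier is 0.75. A validity close to 1 would mean the sentence describes the data well; a dissemination interface could show only sentences whose validity exceeds some chosen cut-off.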

Journal of Official Statistics

The Journal of Statistics Sweden
