Approach of the bloom filter application for real time text data multi-class classification

  • V. Yaremenko National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"
  • D. Budonnyi National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"
Keywords: streaming analytics, Bloom filter, text data analysis, text classification

Abstract

This paper examines the Bloom filter, a data structure that addresses the problem of filtering streaming data, and proposes a new approach that applies it to text classification. Existing work in this area and current problems in text data classification are reviewed. The theoretical background is presented together with a practical example of constructing a Bloom filter. Model training is demonstrated by constructing a Bloom filter for multi-class classification. The possibility of increasing the number of classes is discussed, along with the associated limitations and problems. A training method and a word selection criterion that improve the learning process are presented, as well as a procedure for retraining an existing model during operation using these criteria. The text preprocessing stages used to increase model accuracy are described. Real-time text data arriving from multiple sources is used as input, and a solution is presented for processing incoming data without losing part of it while the system is running. The model is evaluated in terms of classification accuracy, training speed, memory consumption, and classification speed. The influence of the Bloom filter's components on the final result is examined, as is the false-positive probability for various system parameters. Conclusions about the proposed approach are drawn, its problems are identified, and ways to solve or mitigate them are suggested. Prospects for further research on the model are outlined.
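The idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one Bloom filter per class, a simple "count the words found in each filter" scoring rule, whitespace tokenization as a stand-in for preprocessing, and double hashing derived from SHA-256. All names (`BloomFilter`, `train`, `classify`, `false_positive_rate`) and default parameters are hypothetical. The false-positive estimate uses the standard approximation (1 - e^(-kn/m))^k for m bits, k hash functions, and n inserted items.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter; hash positions come from double hashing."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Assumption: two 64-bit values taken from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Standard approximation of the false-positive probability."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def tokenize(text: str) -> list[str]:
    # Stand-in for real preprocessing: lowercase, keep alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def train(labelled_texts, num_bits: int = 1 << 16, num_hashes: int = 4):
    """Build one filter per class label (an assumption of this sketch)."""
    filters: dict[str, BloomFilter] = {}
    for label, text in labelled_texts:
        if label not in filters:
            filters[label] = BloomFilter(num_bits, num_hashes)
        for word in tokenize(text):
            filters[label].add(word)
    return filters


def classify(filters, text: str) -> str:
    """Return the label whose filter matches the most words in the text."""
    words = tokenize(text)
    scores = {
        label: sum(w in bf for w in words) for label, bf in filters.items()
    }
    # Ties resolve to the label that was inserted first.
    return max(scores, key=scores.get)


filters = train([
    ("sports", "team match goal player"),
    ("finance", "bank market stock loan"),
])
print(classify(filters, "player scored goal"))
print(false_positive_rate(num_bits=1 << 16, num_hashes=4, num_items=4))
```

Adding a class in this sketch only requires one more filter, so memory grows linearly with the number of classes, and retraining on new labelled text amounts to calling `add` on an existing filter.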

Published
2019-11-28
How to Cite
Yaremenko, V., & Budonnyi, D. (2019). Approach of the bloom filter application for real time text data multi-class classification. COMPUTER-INTEGRATED TECHNOLOGIES: EDUCATION, SCIENCE, PRODUCTION, (36), 153-159. https://doi.org/10.36910/6775-2524-0560-2019-36-24
Section
Computer science and computer engineering