ANALYSIS OF THE TEXT PREPROCESSING METHODS INFLUENCE ON THE DESTRUCTIVE MESSAGES CLASSIFIER

Oleksandr Orlovskyi
http://orcid.org/0000-0003-4782-566X
Sergey Ostapov
http://orcid.org/0000-0002-4139-4152

Abstract

Problem. Social networks are increasingly becoming an environment for threats, insults, profanity and other destructive forms of human communication. A huge number of people now use online platforms, and the volume of content created and of reactions to it keeps breaking records. This creates a need to automate the detection and counteraction of antisocial influences. One important area of this work is the detection of toxic comments, which contain threats, insults, profanity, contempt for others and similar content. For this task, researchers usually build a classifier based on neural networks and train it on a collected or publicly available data set. The article investigates how different methods of preprocessing the input data affect the final accuracy of the classifier. Previous studies in this direction confirmed that preprocessing affects the result but did not allow definitive conclusions about its effectiveness. Goal. To study how text preprocessing methods influence a classifier of destructive messages. Results. The effect of a particular method is shown to depend considerably on the content of the data set. In some cases the impact is insignificant, and in others it even worsens the result. The need to check in advance what percentage of the data set's elements a particular method affects is also justified. Originality. The preprocessing methods are evaluated on English and Russian data sets. Practical significance. The results support better decisions about which preprocessing methods to use to improve the accuracy of a destructive messages classifier.
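
The pre-check mentioned in the results can be illustrated with a minimal Python sketch. It is not the authors' code: the function affected_share, the three preprocessing functions and the sample comments are hypothetical and chosen only for illustration. The sketch measures what fraction of comments a given preprocessing method actually changes.

```python
import re
from typing import Callable, Iterable


def affected_share(texts: Iterable[str], method: Callable[[str], str]) -> float:
    """Return the fraction of texts that a preprocessing method changes."""
    items = list(texts)
    if not items:
        return 0.0
    changed = sum(1 for text in items if method(text) != text)
    return changed / len(items)


# Hypothetical preprocessing methods, used here only as examples.
def lowercase(text: str) -> str:
    return text.lower()


def strip_urls(text: str) -> str:
    return re.sub(r"https?://\S+", "", text)


def collapse_repeats(text: str) -> str:
    # Shorten runs of three or more identical characters to two.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


# Made-up sample comments; a real check would run over the full data set.
sample = [
    "You are SO wrong",
    "see http://example.com for details",
    "nooooo way",
    "plain comment",
]

for method in (lowercase, strip_urls, collapse_repeats):
    print(f"{method.__name__}: {affected_share(sample, method):.0%}")
```

A method that changes only a small share of the comments can have little effect on classifier accuracy, so a check of this kind may help decide whether the method is worth applying to a particular data set.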

Article Details

Section
Information systems research
Author Biographies

Oleksandr Orlovskyi, Yuriy Fedkovych Chernivtsi National University, Chernivtsi

Graduate student, Department of Software Engineering

Sergey Ostapov, Yuriy Fedkovych Chernivtsi National University, Chernivtsi

Doctor of Physical and Mathematical Sciences, Professor, Department of Software Engineering
