METHOD FOR DETERMINING THE SEMANTIC SIMILARITY OF ARBITRARY LENGTH TEXTS USING THE TRANSFORMERS MODELS

Сергій Олізаренко
В’ячеслав Радченко

Abstract

The paper presents the results of developing a method for determining the semantic similarity of arbitrary-length texts based on their vector representations. These vector representations are obtained with a multilingual Transformer model, and the problem of determining the semantic similarity of arbitrary-length texts is treated as a text sequence pair classification problem solved with a Transformer model. A comparative analysis was performed to select the Transformer model best suited to this class of problems. The main stages of the method considered here are: fine-tuning the Transformer model on the second pretraining task (next sentence prediction), and selecting and implementing a summarization method for text sequences longer than 512 (1024) tokens, so that semantic similarity can be determined for texts of arbitrary length.
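As an illustration only, the sketch below (not the authors' published implementation) shows how the two stages described in the abstract could be wired together with the Hugging Face transformers library: a multilingual BERT checkpoint with a sequence-pair classification head scores two texts as semantically similar or not, and texts that exceed the 512-token input limit are first shortened by a placeholder summarization step. The checkpoint name, the leading-sentence summarizer, and the helper names are assumptions; in the paper's setting the classification head would first be fine-tuned on labelled text pairs (building on the pretrained model's next sentence prediction task), whereas here it is randomly initialized and only the inference path is shown.

```python
# Minimal sketch, not the authors' published pipeline. Assumptions: the
# "bert-base-multilingual-cased" checkpoint, the leading-sentence summarizer,
# and the helper names are illustrative; the classification head is randomly
# initialized here and must be fine-tuned on a paraphrase/similarity corpus
# before its scores are meaningful.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"   # assumed multilingual Transformer
MAX_TOKENS = 512                              # BERT input limit noted in the abstract

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()


def summarize_if_needed(text: str, budget: int = MAX_TOKENS // 2) -> str:
    """Placeholder for the summarization stage: keep leading sentences until the
    token budget (half the model limit, since two texts share one input) is spent.
    The paper's actual summarizing method may differ."""
    if len(tokenizer.tokenize(text)) <= budget:
        return text
    kept = []
    for sentence in text.split(". "):
        cost = len(tokenizer.tokenize(sentence))
        if cost > budget:
            break
        kept.append(sentence)
        budget -= cost
    return ". ".join(kept) if kept else text.split(". ")[0]


def semantic_similarity(text_a: str, text_b: str) -> float:
    """Probability that the two texts are semantically similar, taken from the
    sequence-pair classification head (pair classification, as in the abstract)."""
    a, b = summarize_if_needed(text_a), summarize_if_needed(text_b)
    inputs = tokenizer(a, b, truncation=True, max_length=MAX_TOKENS, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()


print(semantic_similarity("The cat sat on the mat.", "A cat was sitting on a rug."))
```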

Article Details

How to Cite
Олізаренко, С., & Радченко, В. (2021). METHOD FOR DETERMINING THE SEMANTIC SIMILARITY OF ARBITRARY LENGTH TEXTS USING THE TRANSFORMERS MODELS. Advanced Information Systems, 5(2), 126–130. https://doi.org/10.20998/2522-9052.2021.2.18
Section
Intelligent information systems
Author Biographies

Сергій Олізаренко, Kharkiv National University of Radio Electronics, Kharkiv

Doctor of Technical Sciences, Senior Researcher, Professor of the Electronic Computers Department

В’ячеслав Радченко, Kharkiv National University of Radio Electronics, Kharkiv

Senior Lecturer of the Electronic Computers Department
