RESEARCH ON THE SPECIFIC FEATURES OF DETERMINING THE SEMANTIC SIMILARITY OF ARBITRARY-LENGTH TEXT CONTENT USING MULTILINGUAL TRANSFORMER-BASED MODELS

Serhii Olizarenko
http://orcid.org/0000-0002-7762-6541
Vladimir Argunov
http://orcid.org/0000-0002-2505-1969

Abstract

This study investigates how the semantic similarity of multilingual text content of arbitrary length can be determined from vector representations produced by several multilingual models based on the Transformer architecture. A comparative analysis of these models was performed to select the one best suited to this class of problems. In addition, two new approaches to determining the semantic similarity of multilingual text content were developed for use in the HIPSTO Open AI Information Discovery Platform, where the main challenge is to support texts of arbitrary length. Experimental results are presented in support of the new approaches as a solution to the semantic similarity problem.
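The abstract does not describe the two proposed approaches, so the following Python sketch only illustrates the general problem setting: comparing texts of arbitrary length through fixed-size vector representations. It uses a generic baseline (split into chunks, embed each chunk, average, compare by cosine similarity) and should not be read as the authors' method. The function `toy_embed` is a hypothetical, non-multilingual placeholder for a real Transformer encoder, and the chunk size of 50 words is an arbitrary assumption.

```python
import hashlib
import math


def toy_embed(text, dim=64):
    """Hypothetical stand-in for a multilingual Transformer encoder.

    Hashes each lowercased token into one of `dim` buckets and
    L2-normalizes the result. A real encoder would replace this.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def chunk_words(text, max_words=50):
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def embed_long_text(text, embed=toy_embed, max_words=50):
    """Embed each chunk and average the chunk vectors (mean pooling)."""
    chunks = chunk_words(text, max_words) or [""]  # keep empty input valid
    vectors = [embed(chunk) for chunk in chunks]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_similarity(text_a, text_b, embed=toy_embed, max_words=50):
    """Similarity score for two texts of arbitrary length."""
    return cosine_similarity(
        embed_long_text(text_a, embed, max_words),
        embed_long_text(text_b, embed, max_words),
    )
```

Because the encoder is passed in as the `embed` argument, the chunking and pooling logic can be kept unchanged when the placeholder is swapped for an actual multilingual model. Mean pooling is only one possible choice and discards chunk order.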

Article Details

Section
Information systems research
Author Biographies

Serhii Olizarenko, Kharkiv National University of Radio Electronics, Kharkiv

Doctor of Technical Sciences, Senior Researcher, Professor at the Department of Electronic Computers

Vladimir Argunov, HIPSTO, Kharkiv

Head of AI Research
