Maiia BOCHAROVA, PhD Student
ORCID ID: 0009-0004-3875-5019
e-mail: bocharova.maiia@gmail.com
Odesa I. Mechnikov National University
Eugene MALAKHOV, DSc (Engin.), Prof.
ORCID ID: 0000-0002-9314-6062
e-mail: eugene.malakhov@onu.edu.ua
Odesa I. Mechnikov National University
Abstract
DOI: https://doi.org/10.17721/AIT.2024.1.01
B a c k g r o u n d . Learning high-quality text embeddings typically requires large corpora of labeled data, which can be challenging to obtain for many languages and domains. This study proposes a novel adaptation of cross-lingual knowledge transfer that employs a cosine-similarity-based loss calculation to enhance the alignment of learned representations.
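The core of the proposed loss can be illustrated with a minimal PyTorch sketch, assuming a multilingual-distillation setup over parallel sentence pairs; the names below (teacher_emb, student_emb) are illustrative, not the authors' released implementation:

import torch
import torch.nn.functional as F

def cosine_distillation_loss(teacher_emb: torch.Tensor,
                             student_emb: torch.Tensor) -> torch.Tensor:
    # Align the student's embedding of each translated sentence with the
    # teacher's embedding of the source sentence by maximizing cosine
    # similarity (rather than minimizing mean squared error).
    sim = F.cosine_similarity(teacher_emb, student_emb, dim=-1)
    return (1.0 - sim).mean()

# Assumed usage: a frozen teacher encodes English sentences, the trainable
# student encodes their Ukrainian translations.
# loss = cosine_distillation_loss(teacher_emb.detach(), student_emb)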
M e t h o d s . The impact of teacher model selection on the quality of learned text representations is investigated. Specifically, the correlation between the distribution of cosine-similarity scores among vectors of randomly selected sentences and the transferability of representations to another language is explored. Additionally, recognizing the need for effective evaluation methodologies and the limited availability of Ukrainian resources within existing benchmarks, a comprehensive general-purpose benchmark for assessing Ukrainian text representation learning is curated.
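As a hedged illustration of this teacher-selection diagnostic, a candidate model can be scored by embedding a random sample of sentences and inspecting the upper tail of the pairwise cosine-similarity distribution; the sentence-transformers API and the 90th-percentile statistic below are assumptions drawn from the Results, not a verified excerpt of the study's code:

import numpy as np
from sentence_transformers import SentenceTransformer

def similarity_p90(model_name: str, sentences: list[str]) -> float:
    # 90th percentile of pairwise cosine similarities between
    # embeddings of randomly selected sentences.
    model = SentenceTransformer(model_name)
    emb = model.encode(sentences, normalize_embeddings=True)
    sims = emb @ emb.T                    # cosine similarity: rows are unit-norm
    iu = np.triu_indices_from(sims, k=1)  # distinct pairs only, no self-pairs
    return float(np.percentile(sims[iu], 90))

A lower value indicates a wider spread of similarities, i.e. a teacher that separates unrelated sentences more sharply, which the Results associate with better transfer.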
R e s u l t s . A cosine-similarity-based loss calculation yields a 14.2% absolute improvement in Normalized Mutual Information (NMI) score compared to mean squared error loss when distilling knowledge from an English-language teacher model into a Ukrainian student model. The findings demonstrate a strong correlation between the distribution of cosine similarities among the teacher model’s representations of random sentences and the quality of the learned text embeddings: Pearson’s correlation between the 90th percentile of the cosine-similarity distribution and the average NMI score is -0.96, a strong negative correlation.
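Both reported statistics correspond to standard metrics and can be reproduced with common tooling; a minimal sketch, where embeddings, labels, p90_scores, and nmi_scores are placeholders for benchmark data rather than artifacts released with the paper:

import numpy as np
from scipy.stats import pearsonr
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_nmi(embeddings: np.ndarray, labels: np.ndarray) -> float:
    # NMI between k-means clusters of the embeddings and gold labels.
    k = len(np.unique(labels))
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    return normalized_mutual_info_score(labels, pred)

# Correlation between the teacher diagnostic and embedding quality across
# candidate teachers; the paper reports r = -0.96.
# r, p = pearsonr(p90_scores, nmi_scores)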
C o n c l u s i o n s . This research advances the theory of cross-lingual knowledge distillation, illustrating that cosine-similarity-based loss functions deliver superior performance. It underscores the importance of selecting a teacher model with a wide distribution of cosine-similarity scores. Furthermore, a pioneering broad-scale benchmark covering five distinct domains for Ukrainian text representation learning is introduced. The source code, pretrained model, and the newly created Ukrainian text embeddings benchmark are publicly available at https://github.com/maiiabocharova/UkrTEB.
K e y w o r d s : Natural Language Processing, text embeddings, Deep Learning, Data Mining, multilingual language models, knowledge transfer, domain adaptation.
Authors’ contribution. Maiia Bocharova – literature overview, development of methods and methodologies of the research, empirical data collection, analysis of results and conclusions. Eugene Malakhov – consultation, ideas and guidance.
Published
2024-12-20
How to Cite
Maiia BOCHAROVA, Eugene MALAKHOV, “GENERAL-PURPOSE TEXT EMBEDDINGS LEARNING FOR UKRAINIAN LANGUAGE,” Advanced Information Technology, vol. 1(3), pp. 6–12, 2024.
Issue
Advanced Information Technology № 1 (3), 2024
Section
Applied information systems and technology