GENERAL-PURPOSE TEXT EMBEDDINGS LEARNING FOR UKRAINIAN LANGUAGE
DOI: https://doi.org/10.17721/AIT.2024.1.01

Keywords: Natural Language Processing, text embeddings, Deep Learning, Data Mining, multilingual language models, knowledge transfer, domain adaptation

Abstract
Background. Learning high-quality text embeddings typically requires large corpora of labeled data, which can be challenging to obtain for many languages and domains. This study proposes a novel adaptation of cross-lingual knowledge transfer that employs a cosine similarity-based loss to improve the alignment of learned representations.

Methods. The impact of teacher model selection on the quality of learned text representations is investigated. Specifically, the correlation between the distribution of cosine similarity scores among vectors of randomly selected sentences and the transferability of representations into another language is explored. Additionally, recognizing the need for effective evaluation methodologies and the limited coverage of Ukrainian in existing benchmarks, a comprehensive general-purpose benchmark for assessing Ukrainian text representation learning is curated.

Results. The cosine similarity-based loss yields a 14.2% improvement in absolute Normalized Mutual Information (NMI) score over mean squared error loss when distilling knowledge from an English teacher model into a Ukrainian student model. The findings demonstrate a strong correlation between the distribution of cosine similarities of the teacher model's representations of random sentences and the quality of the learned text embeddings: Pearson's correlation between the 90th percentile of the cosine similarity score distribution and the average NMI score is -0.96, a strong negative correlation.

Conclusions. This research advances cross-lingual knowledge distillation, showing that cosine similarity-based loss functions outperform mean squared error. It underscores the importance of selecting a teacher model whose cosine similarity scores are widely distributed. Furthermore, a pioneering broad-scale benchmark covering five distinct domains of Ukrainian text representation learning is introduced. The source code, pretrained model, and the newly created Ukrainian text embeddings benchmark are publicly available at https://github.com/maiiabocharova/UkrTEB.
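To make the distillation objective concrete, the following is a minimal sketch (not the authors' released code, which lives in the linked repository) of the two loss variants compared in the abstract, assuming PyTorch tensors of student and teacher sentence embeddings; the encode calls in the usage comment are illustrative placeholders.

import torch
import torch.nn.functional as F

def cosine_distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    # Penalize angular misalignment, 1 - cos(student, teacher), averaged over the batch.
    return (1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)).mean()

def mse_distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    # The baseline compared against: plain mean squared error over coordinates.
    return F.mse_loss(student_emb, teacher_emb)

# One distillation step on a parallel English-Ukrainian batch (hypothetical encode API):
#   t = teacher.encode(english_sentences)    # frozen English teacher
#   s = student.encode(ukrainian_sentences)  # trainable Ukrainian student
#   loss = cosine_distillation_loss(s, t)

Unlike the MSE objective, the cosine loss constrains only the direction of the student vector, which is what downstream retrieval and clustering metrics computed on cosine similarity actually measure.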
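The teacher-selection diagnostic from the Results can likewise be sketched: sample random sentences, embed them with a candidate teacher, take the 90th percentile of their pairwise cosine similarities, and correlate that statistic with average NMI across candidate teachers. A minimal NumPy/SciPy sketch, with the per-teacher arrays left as hypothetical placeholders:

import numpy as np
from scipy.stats import pearsonr

def p90_cosine_similarity(embeddings: np.ndarray) -> float:
    # Normalize rows so the dot product equals cosine similarity.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    pairs = sims[np.triu_indices_from(sims, k=1)]  # unique pairs, excluding self-similarity
    return float(np.percentile(pairs, 90))

# Across candidate teacher models (hypothetical arrays, one entry per teacher):
#   p90_scores = [p90_cosine_similarity(embed_random_sentences(t)) for t in teachers]
#   r, _ = pearsonr(p90_scores, avg_nmi_scores)  # the abstract reports r = -0.96

A lower 90th percentile indicates a teacher that spreads random sentences more widely in embedding space, which, per the reported negative correlation, transfers better into Ukrainian.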
Published: 2024-12-20

Section: Applied information systems and technology
How to Cite
GENERAL-PURPOSE TEXT EMBEDDINGS LEARNING FOR UKRAINIAN LANGUAGE. (2024). Advanced Information Technology, 1(3), 6-12. https://doi.org/10.17721/AIT.2024.1.01