Contents
Akhmadeeva Irina, Kononenko Irina, Sidorova Elena, Shestakov Vladimir
Using Rhetorical Structures to Analyze Argumentation in Scientific Communication Texts
Alibekov Andrey, Migal Alexander, Matenkov Andrey, Muryshev Andrey, Bolshakov Vladislav et al.
RuTaR – A Dataset in Russian for Reasoning about Taxes
Baiuk Ilia, Baiuk Alexandra, Petrova Maria
CoBaLD Parser: Joint Morphosyntactic and Semantic Annotation
Baranov Anatoly
Embedded quotation as a factor of idiomaticity
Boguslavsky Igor
Collectivity & distributivity
Bonch-Osmolovskaya Anastasiya, Gladilin Sergey, Kozerenko Anastasiya, Lyashevskaya Olga, Morozov Dmitry et al.
Russian National Corpus 2.0: corpus platform, analysis tools, neural network models of data markup
Burov Daniil, Panich Maria, Sadkovskii Fedor, Fedorova Olga, Shevelev Sergey
Lexical Decision Task: Modality matters
Chuikova Oxana, Gorbova Elena
Derivation of Aktionsarten from (non-)directional verbs of motion: posing restrictions on restrictions
Galitsky Boris, Ilvovsky Dmitry, Morkovkin Anton
Enhancing RAG and Knowledge Graphs with Discourse
Gromenko Elizaveta, Kalacheva Daria, Klokova Ksenia, Krongauz Maxim, Moroz Oksana, Shulginov Valery, Yudina Tatiana
Cultural Evaluation of LLMs in Russian: Catchphrases and Cultural Types
Inkova Olga
Multi-senses connectives: some precisions
Ivleva Anna, Solovyev Valery
An Algorithm for Genre Imbalance Correction in the Russian Subcorpus of the Google Books Ngram Corpus
Khomenko Anna, Komratova Anastasia, Isakov Danila, Balba Daria, Shishkovskaya Tatiana, Khudyakova Maria
Mental disorders diagnostic model by oral speech
Kiose Maria, Rzheshevskaya Anastasia, Petrov Andrey
Local discourse structure of descriptive discourse in the speech of adult learners of Russian as a foreign language
Korotaev Nikolay
Contexts for overlapping talk in Russian conversations
Kotov Artemiy, Nosovets Zakhar, Filatov Alexander, Arinkin Nikita
The development of a natural language reasoning system for a companion robot
Kovalev Grigory, Tikhomirov Mikhail, Kozhevnikov Evgeny, Kornilov Max, Loukachevitch Natalia
RusBEIR – Russian Benchmark for Zero-shot Evaluation of Information Retrieval Models
Lee George, Loukachevitch Natalia, Khokhlov Alexey
Generating Encyclopedic Articles Based on a Collection of Scientific Publications
Lepekhin Mikhail, Sharoff Serge
Causal Models and Adversarial Training: Selecting the right properties for robust non-topical text classification
Loukachevitch Natalia, Tkachenko Natalia, Lapanitsyna Anna, Tikhomirov Mikhail, Rusnachenko Nikolay
RuOpinionNE-2024: Extraction of Opinion Tuples from Russian News Texts
Mamontova Angelina, Ischenko Roman, Vorontsov Konstantin
RuTermEval-2024: Cross-genre and Cross-domain Automatic Term Extraction and Classification
Morozov Dmitry, Glazkova Anna, Garipov Timur
BERT-like Models for Automatic Morpheme Segmentation of the Russian Language
Podlesskaya Vera
“Oh, no! Not on Wednesday! On Thursday!”: interjection in the context of speech disfluency
Prokofyeva Olga, Kiose Maria, Leonteva Anna, Smirnova Evgeniya
Intensity and its manifestations in speech and recurrent gestures in spontaneous dialogue
Rossyaykin Petr
Structured sentiment analysis using few-shot prompting of an ensemble of LLMs
Rozhkov Igor, Loukachevitch Natalia
Methods for Recognizing Nested Terms
Satdarov Konstantin, Kharlamova Darya, Yakuboy Andrey, Lezina Alisa
Faroese Corpus Development: Strategies for Low-Resource Languages Corpora
Semak Vladislav, Bolshakova Elena
Comparing Transformer-Based Approaches for Term Recognition in Russian texts
Sherstinova Tatiana, Melnik Aleksey, Petrova Irina, Azarevich Karina, Melkozerova Valeria, Chepovetskaya Sofia
“Okie dokie, here’s the no-cap truth!”: Everyday Russian Youth Speech in Corpus Representation (Structure and Application of the ESC Sound Corpus)
Shulginov Valery, Şimşek Hasan Berkcan, Kudriashov Sergei, Randautsova Renata, Shevela Sofya
Evaluating the Pragmatic Competence of Large Language Models in Detecting Mitigated and Unmitigated Types of Disagreement
Sidorova Elena, Ivanov Alexander, Ilina Daria, Ovchinnikova Kristina, Osmushkin Nikita, Sery Alexey
An Approach to Information Extraction from Texts of a Limited Subject Domain Based on a Chain of Large Language Models
Sorokin Alexey, Nasyrova Regina
LORuGEC: the Linguistically Oriented Rule-annotated corpus for Grammatical Error Correction of Russian
Studenikina Kseniia, Lyutikova Ekaterina, Gerasimova Anastasia
Gradual Acceptability Judgments with LLMs: evidence from agreement variation in Russian
Tatevosov Sergey, Kisseleva Xenia
Aspectual pairs in Russian idioms: conditions on and mechanisms of aspectual reduction
Timoshenko Svetlana, Serdobolskaya Natalia, Kobozeva Irina
Clause Linkers for the Ruscon Database: Selecting Criteria and Statistical Evaluation
Vatolin Aleksei
Structured Sentiment Analysis with Large Language Models: A Winning Solution for RuOpinionNE-2024
Vatolin Aleksey, Gerasimenko Nikolai, Loukachevitch Natalia, Ianina Anastasia, Vorontsov Konstantin
RuSciFact: Open Benchmark for Verifying Scientific Facts in Russian
Yanko Tatiana
Upstep in Russian echo-questions
Zimmerling Anton, Baiuk Alexandra
Operator Words Versus MT Systems and LLMs
Akhmadeeva Irina, Kononenko Irina, Sidorova Elena, Shestakov Vladimir
Using Rhetorical Structures to Analyze Argumentation in Scientific Communication Texts
This article explores the role of rhetorical structures in the argument mining task on the material of scientific Internet communication texts in Russian. Two approaches to argumentative relation prediction are proposed and studied: the first constructs segment vector representations using a Graph Neural Network (GNN) over the rhetorical structure, and the second uses multitask learning that combines argumentation extraction with rhetorical relation prediction. Based on the proposed approaches, three models were implemented: two variants of the GNN-based model and one model employing the multitask approach. These models were compared with a simple baseline using the Longformer model on a dataset annotated with both argumentative and rhetorical structures. Argumentative annotation was performed manually by four experts, while rhetorical markup was obtained with existing resources and tools. The experiments showed that the approaches using additional rhetorical information improve the quality of argumentative relation prediction, particularly for long-distance relations. The best performance, with an F1 score of 72.32%, was achieved by a model incorporating GNN-enhanced statement representations.
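The paper's architecture is not reproduced here, but the general idea of refining segment representations by message passing over a rhetorical-structure graph can be sketched as follows; the layer choice, dimensions, and class name are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: refine discourse-segment embeddings with a two-layer GCN
# over the rhetorical-structure graph before relation classification.
import torch
from torch_geometric.nn import GCNConv


class RhetoricGNN(torch.nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.conv1 = GCNConv(dim, dim)
        self.conv2 = GCNConv(dim, dim)

    def forward(self, x, edge_index):
        # x: [num_segments, dim] encoder embeddings of discourse segments
        # edge_index: [2, num_edges] rhetorical links between segments
        h = self.conv1(x, edge_index).relu()
        return self.conv2(h, edge_index)


# Toy usage: 4 segments connected by 3 rhetorical relations.
x = torch.randn(4, 768)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]], dtype=torch.long)
segment_repr = RhetoricGNN()(x, edge_index)  # input to a relation classifier
```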
Alibekov Andrey, Migal Alexander, Matenkov Andrey, Muryshev Andrey, Bolshakov Vladislav et al.
RuTaR – A Dataset in Russian for Reasoning about Taxes
In 2024, reasoning emerged as a new frontier for artificial intelligence and computational linguistics. Reasoning models are typically evaluated either on STEM-related datasets or on synthetic datasets, which ignores a huge area of human thought: the humanities. To partially bridge this gap, we present a new open dataset, RuTaR (Russian Tax Reasoning). The dataset consists of modestly modified content of 199 selected letters of the Ministry of Finance of Russia and the Russian Federal Tax Service that typically reason their way to an answer to some taxpayer question. Despite the apparent simplicity of yes/no questions, both off-the-shelf Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) systems struggle to achieve high results on the dataset, with the best RAG system studied achieving 77% accuracy.
Baiuk Ilia, Baiuk Alexandra, Petrova Maria
CoBaLD Parser: Joint Morphosyntactic and Semantic Annotation
Dependency parsing is a common task in modern NLP, and Universal Dependencies (De Marneffe et al., 2021) is widely acknowledged nowadays as a morphosyntactic annotation standard. Yet, its dependency relations are rather generalized; therefore, in order to take more syntactic details into account, the Enhanced UD standard was proposed. The newly developed CoBaLD annotation standard elaborates the E-UD principles by enriching them with a semantic level. It aims at structural simplicity and compatibility with UD in all possible respects. Currently, several datasets are annotated in the CoBaLD standard, but until now there has been no appropriate tool for automatic parsing of data in the CoBaLD format. In this paper, we present a neural joint parser capable of automatic annotation both in E-UD and in CoBaLD, including the ellipsis restoration presupposed by these standards. Additionally, we provide a qualitative analysis of automatic annotation errors.
Baranov Anatoly
Embedded quotation as a factor of idiomaticity
The report examines one of the types of idiomaticity: embedded quotation. This type is peripheral and is implemented in a small set of idioms, slightly more than twenty elements. Idioms with embedded quotations include, in particular, the expressions po samoye ne baluysya (lit. to the very don’t play) ≈ ‘to the maximum extent possible’, po samoye ne khochu (lit. to the very don’t want) ≈ ‘to the maximum extent possible’, na otskrebis’ (lit. on fuck off) ≈ ‘to do something quickly and badly’, prikazat’ dolgo zhit’ (lit. to order to live long) ≈ ‘to die’, derzhat’sya na chestnom slove (lit. to keep one’s word of honor) ≈ ‘be poorly connected to the main part of the object’, pod chestnoye slovo (lit. under a word of honor) ≈ ‘to do something to someone or allow him to do something under the guarantee of this person’, za Khrista radi (lit. for Christ’s sake) ≈ ‘to beg’, [i] pominay kak zvali (lit. [and] remember his name) ≈ ‘disappear without a trace’, vot tebe / te [i] na (lit. here’s for you) ≈ ‘expression of surprise at failure’, za spasibo (lit. for thanks) ≈ ‘doing something for free’, za [prosto] tak (lit. for no reason) ≈ ‘doing something for free’, and some others. Embedded quotation is a relevant factor of idiomaticity, leading to the emergence of new idioms in the language. As a factor of idiomaticity, it is directly related to the phenomena of quotation (citation) and generalized quotation. The existence of generalized quotation in speech provides the very possibility of forming idioms whose idiomaticity is based on embedded quotation; it is a necessary intermediate link in the sequence: quotation, generalized quotation, embedded quotation.
Boguslavsky Igor
Collectivity & distributivity
The opposition ‘separately’ / ‘together’ manifests itself in different parts of the linguistic system: in the interpretation of noun phrases (distributive vs. collective), in syntax (200 roubles apiece, 100 kilometres per hour), in the lexicon (together, jointly, collectively, in aggregate, common, collective; apart, separate, separately, each, etc.), and in word formation (coauthor). We discuss these meanings in the context of developing a computational model of semantic analysis, which highlights the following issues: 1) What information about the meaning of distributive/collective NPs should be available to the parser for reasoning? 2) What valency slots are characteristic of the meanings ‘together’ / ‘separately’, and how are they instantiated in syntactic constructions and lexical items?
Bonch-Osmolovskaya Anastasiya, Gladilin Sergey, Kozerenko Anastasiya, Lyashevskaya Olga, Morozov Dmitry et al.
Russian National Corpus 2.0: corpus platform, analysis tools, neural network models of data markup
The Russian National Corpus has existed for over 20 years and is a unique linguistic tool. However, the technical limitations of the software platform on which it was implemented significantly narrowed its development prospects. In 2020, work was launched on a comprehensive update of the RNC software platform, as a result of which the National Corpus switched to a new-generation 2.0 platform. The deep changes implemented concerned both the development of functionality that meets modern approaches to corpus linguistics and a fundamental restructuring of the platform architecture as a whole, from data preparation and indexing systems to the user interface. A separate line of development concerned neural network models used for metadata tagging, disambiguation, word-formation markup, etc. This article provides a short description of the new corpus platform as of 2024: the key parameters of changes in the architecture of the RNC platform and its user interface, descriptions of new corpus data analysis services and the specifics of their implementation, and an account of the experience of using neural network models for tasks related to corpus data markup. The purpose of the article is to describe the technological layer of changes implemented in the Russian National Corpus as part of the large-scale update carried out in recent years.
Burov Daniil, Panich Maria, Sadkovskii Fedor, Fedorova Olga, Shevelev Sergey
Lexical Decision Task: Modality matters
The lexical decision task is one of the most widespread methods used in psycholinguistic experiments, performed in either the visual or the auditory modality. The comparability of results acquired in the two modalities, however, has only recently received attention. In this paper, we present the results of two parallel experiments, one in the visual modality and one in the auditory modality, run on the same lexical material (words and pseudo-words). The reaction time distributions differ between the two modalities: mean reaction time is longer in the auditory modality. The mean reaction time for pseudo-words is significantly longer than the mean reaction time for real words.
Chuikova Oxana, Gorbova Elena
Derivation of Aktionsarten from (non-)directional verbs of motion: posing restrictions on restrictions
The paper examines prefixal derivatives of Russian motion verbs that represent perfective Aktionsarten. The focus is on the statistical analysis of motion verbs as a specific subset of Aktionsarten characterized by their derivational features. Both sets (verbs of motion and Aktionsarten) are included in the database of Russian prefixed verbs developed by the authors on the basis of the Dictionary of the Russian Language. One of the aims of the study is to verify the claim defended in previous works that the stem of a motion verb imposes restrictions on the possible Aktionsarten derived from it. This verification is based on dictionary data as well as data from the Russian National Corpus and the Runet.
Galitsky Boris, Ilvovsky Dmitry, Morkovkin Anton
Enhancing RAG and Knowledge Graphs with Discourse
We consider a number of Retrieval Augmented Generation (RAG) architectures that address the lack of specific information and the hallucination issues of Large Language Model (LLM)-based question answering. We start with conformal prediction, which acts on top of the LLM, maintains a set of generations instead of a single one, and attempts to find the best element of this set, assumed to be the “most average” one. We then proceed to a series of self-reflective RAG architectures in which the LLM predicts the multi-hop question answering session before the actual search for an answer. After that, we propose a mechanism for the LLM to filter out answers inappropriate with respect to style. All these components need discourse-level analysis for more robust functioning. Knowledge graph (KG) and Abstract Meaning Representation (AMR)-based knowledge graph construction follow. We evaluate the contribution of all of these components to overall answer relevance and also zoom in on the role of the discourse-based subsystem in each of them. There is a substantial improvement in performance due to the four-component architecture introduced in this paper; the contribution of the discourse-based subsystems is fairly modest.
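The “most average” element of a generation set can be approximated in several ways; the following minimal sketch (an illustration under an assumed TF-IDF similarity, not the authors' conformal-prediction machinery) picks the candidate with the highest mean pairwise similarity to the rest of the set.

```python
# Hedged sketch: select the "most average" LLM generation as the candidate
# with maximal mean pairwise cosine similarity over TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def most_average(candidates: list[str]) -> str:
    tfidf = TfidfVectorizer().fit_transform(candidates)
    sim = cosine_similarity(tfidf)            # [n, n] pairwise similarities
    return candidates[sim.mean(axis=1).argmax()]


answers = ["Paris is the capital.", "The capital is Paris.", "Berlin."]
print(most_average(answers))  # one of the two mutually similar paraphrases
```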
Gromenko Elizaveta, Kalacheva Daria, Klokova Ksenia, Krongauz Maxim, Moroz Oksana, Shulginov Valery, Yudina Tatiana
Cultural Evaluation of LLMs in Russian: Catchphrases and Cultural Types
This study addresses the gap in evaluating large language models’ (LLMs) cultural awareness and alignment within the Russian sociocultural context by introducing a structured framework comprising 8 Cultural Types (e.g., Spiritual Practitioner, Soviet Intellectual) and 5 catchphrase groups (e.g., memes, proverbs). A 400-question evaluation dataset was developed to probe 10 multilingual LLMs, including GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, across fact-based cultural knowledge and nuanced linguacultural understanding in a zero-shot setting. Results show that the closed-source models GPT-4o and Claude 3.5 Sonnet outperform the other models, with one of the smallest models (Mistral NeMo 12B) achieving the lowest result. Performance disparities were observed when Cultural Type tasks and catchphrase tasks were evaluated separately. Model-specific skews emerged, with lower-ranked models showing an inclination toward specific cultural types. Qualitative analysis revealed common errors, such as selecting synonymous but incorrect answers or failing to grasp culturally specific logic. The contribution outlines the limitations of LLMs in interpreting cultural context and lays the groundwork for further research in assessing the cultural-linguistic alignment of LLMs.
Inkova Olga
Multi-senses connectives: some precisions
The article discusses multi-sense connectives and their annotation in text corpora. The author clarifies the concept, which is usually applied to three different phenomena: i) the uncertainty of annotators in choosing one of the meanings of a polysemic connective; ii) the possibility of establishing more than one relation, both explicit and implicit, between two fragments of text; iii) the “combination” of several meanings by a connective. The author then considers how these cases are annotated in the Penn Discourse Treebank and in the Supracorpora Database of Connectives, which use multi-label annotation for discourse relations. The study also raises a number of theoretical questions: what information can we obtain from double labels; which relations are distributionally close, i.e. can be established in the same contexts; how to separate the contribution of the connective and of the context to the overall meaning of a sequence of sentences; how legitimate it is to talk about the combination of meanings by a connective; and how to annotate the features of the context of connectives so that these data can be used for Natural Language Processing. Resolving these questions is important both for theoretical (cognitive) research and for applied research (in particular, machine learning and machine translation).
Ivleva Anna, Solovyev Valery
An Algorithm for Genre Imbalance Correction in the Russian Subcorpus of the Google Books Ngram Corpus
Recently, the latest version of the Google Books Ngram corpus was shown to be imbalanced. Since this corpus is an important and widely used database, the imbalance can affect the results of research in various areas. In this paper, we present an algorithm for correcting the dynamics of word frequency in the Google Books Ngram corpus. The algorithm takes into account the discovered imbalance of the main literary styles: fiction, publicistic prose, and nonfiction. The rationale for the algorithm is given, as well as examples of correction.
Khomenko Anna, Komratova Anastasia, Isakov Danila, Balba Daria, Shishkovskaya Tatiana, Khudyakova Maria
Mental disorders diagnostic model by oral speech
The integration of automated speech analysis into diagnosing mental health disorders is becoming increasingly significant in both clinical and computational linguistics. This study aims to construct linguistic profiles for individuals with neurocognitive and affective mental disorders. Using speech transcriptions and computational techniques relevant to the study, such as lexical clustering and stylostatistical analysis, this research looks for characteristics capable of distinguishing speech patterns indicative of various mental health conditions. The research used a text corpus of oral speech from 136 people diagnosed with schizophrenia, schizotypal disorder, schizoaffective disorder, borderline personality disorder, other personality disorders, a primary depressive episode, recurrent depressive disorder, or bipolar affective disorder, and from a control group of 210 participants with no diagnosed disorders. As a result of the study, it was shown that people with mental disorders display specific features in oral speech that can be used to create an automatic diagnostic model for mental disorders.
Kiose Maria, Rzheshevskaya Anastasia, Petrov Andrey
Local discourse structure of descriptive discourse in the speech of adult learners of Russian as a foreign language
This paper explores the local structure, expressed in rhetorical, communicative and narrative relations and in speech disfluencies, of the descriptive discourse of adult learners of Russian as a foreign language. In the experiment, two learner groups with B1/B2 and C1/C2 levels of proficiency in Russian as a foreign language had to read two descriptive texts and relate their contents to an interested listener. The study contrasts the distribution of local discourse structure in the stimulus texts (59 EDUs) and in the speech of experiment participants in the two groups (629 EDUs). The results reveal that the speech of the C1/C2 group manifests more frequent use of the narrative relation Setting, the communicative relation Metadiscourse, and the rhetorical relation Interpretation. Meanwhile, the speech of the B1/B2 group demonstrates the predominance of the communicative relation Agreement and the rhetorical relation Joint. The most common type of speech disfluency among C1/C2 speakers is Self-correction, whereas for B1/B2 learners it is Repetition. Additionally, as opposed to the rhetorical structure of the stimuli, Joints are used far more frequently in the speech of both groups, which points to problems with rhetorical coherence for both B1/B2 and C1/C2 learners.
Korotaev Nikolay
Contexts for overlapping talk in Russian conversations
The paper analyses instances of overlap in Russian triadic conversations. Based on the “Russian Pear Chats and Stories” corpus, I provide a multimodal treatment of 364 overlapping episodes. I mostly focus on two conversational parameters: discourse contexts for overlaps and the respective epistemic statuses of the interlocutors. The vast majority of overlaps in the data can be attributed to one of four basic discourse contexts: Sideline Turn, Turn Transition, Turn Interception, and Competitive Development. These contexts differ in the degree of cooperation vs. competition exhibited by the participants. I show that overlaps appearing in more competitive contexts tend to be longer and more often include truncated turns. Epistemic statuses of participants are fixed throughout the studied conversations. When statuses are considered equal, competitive overlaps are more frequent than when one participant has a higher epistemic status.
Kotov Artemiy, Nosovets Zakhar, Filatov Alexander, Arinkin Nikita
The development of a natural language reasoning system for a companion robot
The natural language inference system allows the robot to go from the meaning of an incoming text (or of a visual or tactile stimulus) to the derived meaning: the inference. The system uses a rule-based parser, and pairs of semantic representations constructed by the parser are combined into a scenario. In this paper, we represent the robot’s natural language inference space as a graph, where the robot can move from the premise of a scenario to the consequence, and from the consequence of one scenario to the premise of another. We involve annotators who propose derived sentences (semantic components) for a given premise and, in the annotator interface, immediately evaluate the proximity of the proposed sentence to the available scenarios. This procedure allows us to develop the graph of scenarios, to evaluate its connectivity and the absence of dead ends (deadlock vertices), as well as the adequacy of the analysis of incoming texts by scenarios within this graph. The graph contains 5,000 scenarios and approximately 22,000 nodes. We estimate that a graph consisting of 7,000 scenarios can be sufficient for modelling the mechanism of human natural language reasoning.
Kovalev Grigory, Tikhomirov Mikhail, Kozhevnikov Evgeny, Kornilov Max, Loukachevitch Natalia
RusBEIR – Russian Benchmark for Zero-shot Evaluation of Information Retrieval Models
We introduce RusBEIR, a comprehensive benchmark designed for zero-shot evaluation of information retrieval (IR) models in the Russian language. Comprising 17 datasets from various domains, it integrates adapted, translated, and newly created datasets, enabling systematic comparison of lexical and neural models. Our study highlights the importance of preprocessing for lexical models in morphologically rich languages and confirms BM25 as a strong baseline for full-document retrieval. Neural models, such as mE5-large and BGE-M3, demonstrate superior performance on most datasets, but face challenges with long-document retrieval due to input size constraints. RusBEIR offers a unified, open-source framework that promotes research in Russian-language information retrieval. The benchmark is available for public use on GitHub.
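To illustrate why preprocessing matters for lexical models in a morphologically rich language, here is a minimal BM25 sketch with lemmatization, assuming the rank_bm25 and pymorphy2 packages; the benchmark's actual preprocessing pipeline may differ.

```python
# Hedged sketch: BM25 over lemmatized Russian text, so that inflected query
# and document forms ("кошка" vs "кошки") match. Not the RusBEIR reference
# implementation.
from rank_bm25 import BM25Okapi
import pymorphy2

morph = pymorphy2.MorphAnalyzer()


def lemmatize(text: str) -> list[str]:
    return [morph.parse(tok)[0].normal_form for tok in text.lower().split()]


docs = ["Кошки спят на диване", "Собака бежала по парку"]
bm25 = BM25Okapi([lemmatize(d) for d in docs])
scores = bm25.get_scores(lemmatize("кошка спала"))  # doc 0 scores highest
```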
Lee George, Loukachevitch Natalia, Khokhlov Alexey
Generating Encyclopedic Articles Based on a Collection of Scientific Publications
Generating texts that demand high factual accuracy and strict formatting, such as encyclopedic articles, presents numerous challenges: how and where to gather relevant information, how to structure it into a coherent and well-formatted text, and how to ensure that the compiled article does not contain factual errors. We propose a solution to these problems for the Russian language by developing a system for generating encyclopedic articles that extracts the most recent and relevant knowledge from scientific publications in the online library eLIBRARY.RU and structures it as a single context for input into a generative model. To evaluate both the impact of the extracted knowledge on the content of the final texts and the overall quality of generation, we considered several prompting strategies, some of which do not use the context found in publications, and compared these approaches using automatic metrics and human expert evaluation. We hope that the created framework will provide reliable reference material for scientists researching new and relevant topics in their fields of expertise.
Lepekhin Mikhail, Sharoff Serge
Causal Models and Adversarial Training: Selecting the right properties for robust non-topical text classification
The vast majority of datasets for non-topical classification of texts contain distribution shifts, in most cases topical ones. Their presence in the data forces classifiers to fit topic-related features instead of focusing on those relevant to the target class, which causes a dramatic decrease in the accuracy of the trained models when the test data are taken from a different data source. To address this problem, we experiment with two techniques: causal models and adversarial domain adaptation. In our work, we apply CausalLM, Adversarial Domain Adaptation (ADA), and Energy-based ADA (EADA) to gender classification and compare the results. The results are novel for the non-topical classification task. We show that both causal and adversarial methods manage to make the model more resilient to distribution shifts, although this causes a decrease in accuracy when the model is tested on the domain prevailing in the training dataset. Moreover, we describe the first attempt to reduce the impact of topical shifts in the task of non-topical classification using causal methods. We also provide a link to the GitHub repository with the code of our experiments to ensure their reproducibility: https://github.com/MikeLepekhin/CausalAndAdversarialMethods.
Loukachevitch Natalia, Tkachenko Natalia, Lapanitsyna Anna, Tikhomirov Mikhail, Rusnachenko Nikolay
RuOpinionNE-2024: Extraction of Opinion Tuples from Russian News Texts
In this paper, we introduce the Dialogue Evaluation shared task on extraction of structured opinions from Russian news texts. The task of the contest is to extract opinion tuples for a given sentence; the tuples are composed of a sentiment holder, its target, an expression, and the sentiment from the holder to the target. In total, the task received more than 100 submissions. The participants experimented mainly with large language models in zero-shot, few-shot and fine-tuning formats. The best result on the test set was obtained by fine-tuning a large language model. We also compared 30 prompts and 11 open-source language models with 3-32 billion parameters in the 1-shot and 10-shot settings and found the best models and prompts.
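For illustration, an opinion tuple of the kind targeted by the task might look as follows; this is a constructed English example of the format, not an item from the dataset.

```python
# Constructed illustration of the target structure: one tuple per opinion,
# with holder, target, and expression anchored to spans of the sentence.
sentence = "Analysts praised the company's quarterly results."
opinion_tuple = {
    "holder": "Analysts",
    "target": "the company's quarterly results",
    "expression": "praised",
    "polarity": "positive",
}
```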
Mamontova Angelina, Ischenko Roman, Vorontsov Konstantin
RuTermEval-2024: Cross-genre and Cross-domain Automatic Term Extraction and Classification
Automatic Term Extraction (ATE) is a critical NLP task for identifying domain-specific terms, which are essential for tasks like information retrieval, machine translation, and ontology construction. Cross-domain nested term extraction further complicates the task, as traditional methods often fail to handle hierarchical term structures and domain variability. This paper introduces both the CL-RuTerm3 dataset, a novel resource featuring nested term annotations across six domains (primarily computational linguistics, along with mathematics, medicine, economics, literary studies, and agrochemistry), and the RuTermEval-2024 competition, designed to evaluate term extraction systems on this data. The CL-RuTerm3 dataset, comprising 1270 abstracts and 15 full-text articles (over 165k tokens with over 37k annotated entities), is the largest of its kind for Russian scientific texts. Terms are classified into three categories based on lexical and domain specificity: specific terms, common terms, and nomens. The dataset’s unique features, such as nested term markup and cross-domain coverage, enable more realistic evaluation of ATE systems. The paper concludes with an analysis of participant approaches in the RuTermEval-2024 competition, emphasizing the effectiveness of contrastive learning. This work aims to advance ATE research by providing a robust dataset and fostering discussion of term extraction methodologies.
Morozov Dmitry, Glazkova Anna, Garipov Timur
BERT-like Models for Automatic Morpheme Segmentation of the Russian Language
Current approaches to automatic morpheme segmentation for the Russian language rely on machine learning, primarily neural network methods. Among the architectures presented, the best results have been achieved using convolutional neural networks and LSTM networks. However, the quality of automatic annotation is far from ideal, especially when dealing with roots that were not present in the training dataset. In this work, we present a new approach to morpheme segmentation based on fine-tuning BERT-like models. Through comparisons using two morpheme dictionaries with different segmentation paradigms, we demonstrated the superiority of our approach over previous ones, including when working with unfamiliar roots. The best result was achieved by fine-tuning the RuRoBERTa-large model: when working with random words, the share of completely correct segmentations increased from 88.5-90.8% to 92.5-93.5%, and when working with unfamiliar roots, it improved from 70.5-72.6% to 74.9-77.2%. Error analysis of the model showed that root nests not encountered in the training dataset can be distributed into two groups during testing: “recognizable”, meaning those for which more than 90% of the words are correctly analyzed, and “unknown”, meaning those for which the proportion of correct segmentations is less than 10%.
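Casting morpheme segmentation as sequence labeling requires a per-character target encoding; the sketch below shows a common BIO-style scheme one could use when fine-tuning a BERT-like token classifier. The tag inventory and the helper function are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: turn a segmented word into per-character BIO-style labels,
# the usual target format for fine-tuning a token-classification model.
def to_labels(morphemes: list[tuple[str, str]]) -> list[str]:
    labels = []
    for text, mtype in morphemes:
        labels.append(f"B-{mtype}")                    # morpheme-initial char
        labels.extend(f"I-{mtype}" for _ in text[1:])  # remaining chars
    return labels


# pod-vod-n-yj ("underwater"): prefix + root + suffix + ending
print(to_labels([("pod", "PREF"), ("vod", "ROOT"),
                 ("n", "SUFF"), ("yj", "END")]))
# ['B-PREF', 'I-PREF', 'I-PREF', 'B-ROOT', 'I-ROOT', 'I-ROOT',
#  'B-SUFF', 'B-END', 'I-END']
```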
Podlesskaya Vera
“Oh, no! Not on Wednesday! On Thursday!”: interjection in the context of speech disfluency
The subject of the study is a special class of uses of the Russian interjection OY, namely, uses in the context of speech disfluencies. The study is based on data from three corpora: the oral and the multimedia (MURCO) subcorpora of the Russian National Corpus and the pilot version of the corpus of oral personal stories “What I saw”. The paper investigates the main types of speech disfluencies in the context of which speakers resort to the interjection OY, the structure of constructions with OY, and the features of their phonetic (segmental and suprasegmental) implementation. From a quantitative point of view, the use of OY in oral public and oral non-public speech is compared. It is shown that although OY occurs, as could be expected, many times more often in oral non-public speech, the proportion of OY as a marker of speech disfluency relative to the total number of uses of OY is almost the same in the public and non-public varieties of oral speech. These quantitative data suggest that the general distribution of the use of OY and the distribution of OY as a marker of speech disfluency are not regulated by the same factors. The general distribution is associated with the degree of spontaneity and publicity of speech: interjections as exclamatory discourse markers become more appropriate when speakers are not constrained by disciplinary guidelines and can allow themselves to express emotions and give assessments more freely. The distribution of OY as a marker of speech disfluency, in turn, is also associated with the density of speech disfluencies themselves. Of course, this density varies among individual speakers, but at the same time it largely depends on the general principles regulating the production of unprepared discourse.
Prokofyeva Olga, Kiose Maria, Leonteva Anna, Smirnova Evgeniya
Intensity and its manifestations in speech and recurrent gestures in spontaneous dialogue
The present article explores the category of intensity in multimodal spontaneous dialogues in Russian. We regard intensity as a cognitive category expressed in notional, referential and sign meanings, i.e. in its manifestation degrees (high, medium, and low) and in the interaction of quantitative and qualitative meanings, referent types and types of lexical meaning. We study multimodal intensity patterns as revealed in the spontaneous dialogue in question, with the main focus on recurrent co-speech gestures, where the latter are attributed stability of form and function. The research material consists of about 3 hours of recordings featuring 20 participants, 1082 speech intensifiers, and 392 recurrent co-speech gestures of 9 groups. The results show that, despite susceptibility to individual preferences, it is the notional meaning of intensity in speech that is most consistently revealed in recurrent gestures. The presenting gesture group is the most frequent; nevertheless, it is the enhancing and locating gestures that help distinguish between pure quantity and merged quantity-quality cases. In general, exploring the use of gestures as mediated by intensity meanings allowed us to specify their discourse functions attributed to intensity.
Rossyaykin Petr
Structured sentiment analysis using few-shot prompting of an ensemble of LLMs
This paper describes our participation in the RuOpinionNE-2024 shared task (Loukachevitch et al., 2025). The objective of this task is to extract opinion tuples of the form <holder, target, polarity expression, polarity> from news texts in Russian. We approached this task with few-shot prompting of super-large language models (LLMs). The quality of the LLMs’ predictions was improved in two ways. In the first stage, we used prompts with examples whose text embeddings were similar to that of the target text. In the second stage, we augmented prompts with answers of LLMs from the previous stage, achieving the second-best F1 score in the competition in the post-evaluation stage. Our results show that the addition of answer suggestions to the prompt is particularly useful if they provide novel and variable information.
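Selecting few-shot demonstrations by embedding similarity, as in the first stage, can be sketched as follows; the embedding model and helper names are assumptions for illustration, not the authors' exact setup.

```python
# Hedged sketch: pick the k labeled examples whose embeddings are closest
# to the target text, to be inserted into the few-shot prompt.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")  # assumed model


def pick_examples(target: str, pool: list[str], k: int = 5) -> list[str]:
    target_emb = model.encode(target, convert_to_tensor=True)
    pool_emb = model.encode(pool, convert_to_tensor=True)
    scores = util.cos_sim(target_emb, pool_emb)[0]  # similarity to each
    return [pool[i] for i in scores.topk(k).indices.tolist()]
```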
Rozhkov Igor, Loukachevitch Natalia
Methods for Recognizing Nested Terms
In this paper, we describe our participation in the RuTermEval competition devoted to extracting nested terms. We apply the Binder model, which was previously successfully applied to the recognition of nested named entities, to the extraction of nested terms. We obtained the best results of term recognition in all three tracks of the RuTermEval competition. In addition, we study the new task of recognizing nested terms from flat training data annotated with terms without nestedness. We conclude that several approaches proposed in this work are viable enough to retrieve nested terms effectively without nested labeling.
Satdarov Konstantin, Kharlamova Darya, Yakuboy Andrey, Lezina Alisa
Faroese Corpus Development: Strategies for Low-Resource Languages Corpora
This paper presents the development of a comprehensive morphologically annotated corpus for Faroese, a low-resource Germanic language. We describe the creation of a large corpus of contemporary news texts automatically annotated using a custom-trained SpaCy model. The study demonstrates the effectiveness of creating linguistic resources for low-resource languages using minimal initial data. We trained a Transformer-based morphological parsing model on the small but high-quality OFT treebank using 5-fold cross-validation, achieving high accuracy in morphological tagging and lemmatization. Manual evaluation confirms satisfactory performance of the automatic annotation, though certain challenges remain in distinguishing homonymous word forms across different parts of speech. This research provides a methodological framework for developing comprehensive linguistic resources for other low-resource languages with minimal initial data requirements.
Semak Vladislav, Bolshakova Elena
Comparing Transformer-Based Approaches for Term Recognition in Russian texts
The paper describes an experimental comparative study of three transformer-based approaches to automatic term recognition (ATR) in Russian texts. The approaches include sequential labeling of word tokens in a given text, phrase classification with the context sentence enclosing the phrase, and text span prediction using vector representations obtained by contrastive learning. The BERT-based models were trained and evaluated on the data of the RuTermEval-2024 competition for nested term identification and classification, which encompassed three tasks: binary term identification, term recognition with classification (into one of the predefined types), and cross-domain term recognition. The experiments have shown that the span prediction models based on contrastive learning outperform the other models across all three RuTermEval tasks, but at the same time demonstrate the most significant decrease in quality in the cross-domain task.
Sherstinova Tatiana, Melnik Aleksey, Petrova Irina, Azarevich Karina, Melkozerova Valeria, Chepovetskaya Sofia
“Okie dokie, here’s the no-cap truth!”: Everyday Russian Youth Speech in Corpus Representation (Structure and Application of the ESC Sound Corpus)
The article focuses on the creation and potential applications of the Everyday Student Conversations Russian speech corpus (ESC corpus), referred to as the KURS corpus in Russian. The corpus is being developed based on a modified methodology of the “one day of speech” recording approach, adopted from the ORD Corpus, with an emphasis on capturing contemporary youth speech. The article highlights the key aspects of the corpus creation process, with special attention to its structure, data collection and processing methodology, as well as the functionality of the corpus’s online demo version. The primary aim of the project is to explore linguistic changes in the youth environment, create a resource for scientific and applied research, and develop a big data collection of everyday speech for machine learning and advanced artificial intelligence systems. For linguists, the corpus offers unique opportunities to study sociolinguistic, pragmatic, and phonetic features of speech, making it a valuable tool for analyzing contemporary discourse.
Shulginov Valery, Şimşek Hasan Berkcan, Kudriashov Sergei, Randautsova Renata, Shevela Sofya
Evaluating the Pragmatic Competence of Large Language Models in Detecting Mitigated and Unmitigated Types of Disagreement
This study presents a framework for evaluating the effectiveness of large language models (LLMs) in detecting disagreement across a wide range of pragmatic strategies, from mitigated forms to overt verbal aggression. Special attention is given to complex cases of implicit manifestations of irony and sarcasm, which pose significant challenges for both automated analysis and interpersonal communication. Experimental testing of LLMs was conducted on two types of tasks: binary classification for identifying disagreement and classification of the specific strategies for expressing it. The results showed that large multilingual models outperformed the other models, especially in binary classification. However, models that focus primarily on the Russian language, such as GigaChat and YaGPT, tend to interpret irony and sarcasm more accurately and show a higher density of results. Comparative analysis with human judgments revealed that, despite progress, the accuracy of sarcasm detection by LLMs still lags significantly behind human performance. The results suggest a need for further optimization of LLMs to improve their pragmatic competence in real communicative situations.
Sidorova Elena, Ivanov Alexander, Ilina Daria, Ovchinnikova Kristina, Osmushkin Nikita, Sery Alexey
An Approach to Information Extraction from Texts of a Limited Subject Domain Based on a Chain of Large Language Models
The paper considers an approach to extracting information from texts of limited subject domains based on a chain of neural language models. The task is decomposed into three subtasks solved sequentially: (1) term extraction and classification; (2) coreference resolution; (3) extraction of relations between the entities named by the terms. The dataset was based on texts on computational linguistics from the Habr forum. In the markup for term classification and relation extraction, 17 classes of terms and 51 relations were used in accordance with the ontology of computational linguistics. Prompt-chain methods were used to apply the LLMs, where each subsequent query to the LLM is based on the results of the previous step. Six types of prompt templates were developed: for extracting, classifying, and verifying terms, for extracting coreference relations and the relations specified by the ontology, and a specialized template for relations linking entities of the same class. Sentence-BERT, GPT-4 and Mistral-based models were used at different steps of the study; a comparison with the SFT approach (ruRoBERTa) was made; hybrid approaches, which showed the best results, were also developed. For term extraction and classification, F1 = 0.77 was obtained; for coreference resolution, F1 = 0.897; and for relation extraction, F1 = 0.847.
Sorokin Alexey, Nasyrova Regina
LORuGEC: the Linguistically Oriented Rule-annotated corpus for Grammatical Error Correction of Russian
We release LORuGEC, the first rule-annotated corpus for Russian grammatical error correction. The sentences in it are accompanied by the grammar rules governing their spelling. In total, we collected 48 rules with 348 sentences for validation and 612 for testing. LORuGEC appears to be challenging for open-source LLMs: the best F0.5-score, achieved by Qwen2.5-7B with two-stage finetuning, is only 50%. The closed YandexGPT4 Pro model achieves a score of 75%. Using a rule-informed retriever for few-shot example selection, we improve these scores up to 57% for Qwen and 81% for YandexGPT4 Pro.
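For reference, the F0.5 measure used here is the F-beta score with beta = 0.5, the usual choice in grammatical error correction because it weights precision more heavily than recall:

```latex
% F_beta with beta = 0.5: precision P counts more than recall R.
F_{0.5} = \frac{(1 + 0.5^2)\, P \cdot R}{0.5^2 \cdot P + R}
        = \frac{1.25\, P R}{0.25\, P + R}
```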
Studenikina Kseniia, Lyutikova Ekaterina, Gerasimova Anastasia
Gradual Acceptability Judgments with LLMs: evidence from agreement variation in Russian
This study examines LLMs’ performance on the gradual acceptability judgments task. Previously, the linguistic competence of LLMs was evaluated using binary acceptability scales, which contradicts the theoretical concept of acceptability. We present a new benchmark, KVaS (Korpus Variativnogo Soglasovanija ‘Corpus of Variable Agreement’), derived from syntactic experiments on variable agreement in Russian. Our dataset contains multiple phenomena of agreement variation, ideal for modeling diverse acceptability levels, and compiles 7013 sentences rated by native speakers on a 1–7 Likert scale. We evaluated two LLMs, the primarily Russian-trained GigaChat-Pro and the multilingual Mistral Large, comparing their capability to treat acceptability as a scale against the reference human scores from KVaS. We prompted the models with benchmark sentences in two modes: the zero-shot mode included only instructions, while the few-shot mode added training sentences and their scores. The results show that GigaChat-Pro underperformed compared to Mistral Large. GigaChat-Pro improved significantly in the few-shot mode, while Mistral Large exhibited more stable behavior. The case study shows that Mistral can detect nearly all significant contrasts in an experiment, whereas GigaChat performed near-randomly. The corpus may be useful for ranking LLMs, fine-tuning, and enhancing Russian text generation quality.
Tatevosov Sergey, Kisseleva Xenia
Aspectual pairs in Russian idioms: conditions on and mechanisms of aspectual reduction
The paper describes the aspectual behavior of verbal idioms in Russian. The goal of the study is to determine what information about the structure and interpretation of the aspectual system can be obtained from phraseological material. The study shows that aspectual reduction is observed in pairs of the type ‘non-derived imperfective / prefixed perfective’, but not in pairs of the type ‘perfective / secondary imperfective’. According to our hypothesis, it arises through two mechanisms. Reduction to the perfective occurs when the idiom incorporates a lexical prefix that specifies a result state. Reduction to the imperfective occurs through an actional shift, as a result of which the idiom describes a situation incapable of developing towards a culmination.
Timoshenko Svetlana, Serdobolskaya Natalia, Kobozeva Irina
Clause Linkers for the Ruscon Database: Selecting Criteria and Statistical Evaluation
The study is devoted to compiling a list of clause linkers, or connectives, for the Ruscon database (available at http://ruscon.belyaev.io/). It aims at differentiating between complex connectives (like a ne to ‘else’) and occasional combinations of connectives (a vsledstvie etogo ‘and as a consequence of this’). There is a large number of cases in which the phonetic, morphosyntactic and semantic properties of connectives do not suffice to make this differentiation. In these cases we propose to use the MMI metric (Modified Mutual Information). It rests on the rule of probability multiplication, which allows one to estimate the joint probability of independent events.
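The paper's exact MMI variant is not reproduced here, but metrics of this family modify pointwise mutual information, which compares the observed probability of a word combination with the probability it would have if its parts were independent:

```latex
% Pointwise mutual information: under independence P(x,y) = P(x)P(y),
% so PMI = 0; a stable complex connective co-occurs far above chance,
% giving PMI >> 0, while an occasional combination stays near 0.
\mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}
```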
Vatolin Aleksei
Structured Sentiment Analysis with Large Language Models: A Winning Solution for RuOpinionNE-2024
This paper presents the winning solution for the RuOpinionNE-2024 competition on structured sentiment analysis in Russian news texts. We propose a novel pipeline with large language models (LLMs) and adapter-based fine-tuning, demonstrating how modern LLMs can be effectively adapted to complex opinion tuple extraction tasks. Our method addresses three key challenges: (1) alignment of model predictions with original text spans through a fuzzy substring matching algorithm, (2) robustness to generation variability via multi-prediction aggregation strategies, and (3) efficient domain adaptation using QLoRA fine-tuning. The proposed approach achieved first place with a test F1 score of 0.405. Experimental results reveal that adapter-based fine-tuning of open-source 70B parameter models (Llama-3.3) surpasses prompt-engineered proprietary models like GPT-4o. Our analysis provides practical insights into adapting LLMs for structured information extraction in morphologically rich languages, showing that targeted fine-tuning with 4-bit quantization enables state-of-the-art performance without task-specific architectures.
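A fuzzy substring matcher of the kind mentioned in challenge (1) can be sketched with the standard library; this sliding-window illustration is an assumption about the general technique, not the authors' exact algorithm.

```python
# Hedged sketch: align a generated span to the closest substring of the
# source text via difflib similarity over a sliding character window.
from difflib import SequenceMatcher


def align_span(prediction: str, text: str) -> tuple[int, int]:
    n = len(prediction)
    best, best_span = 0.0, (0, 0)
    for start in range(max(1, len(text) - n + 1)):
        window = text[start:start + n]
        score = SequenceMatcher(None, prediction, window).ratio()
        if score > best:
            best, best_span = score, (start, start + n)
    return best_span  # character offsets of the best-matching span


start, end = align_span("quarterly results",
                        "Analysts praised the company's quarterly results.")
```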
Vatolin Aleksey, Gerasimenko Nikolai, Loukachevitch Natalia, Ianina Anastasia, Vorontsov Konstantin
RuSciFact: Open Benchmark for Verifying Scientific Facts in Russian
Against the backdrop of active LLM development, their tendency to hallucinate, and the growing volume of texts they generate, the validation of facts has become increasingly important and relevant. We propose RuSciFact, a new benchmark for fact-checking scientific claims in Russian. RuSciFact is structured as an NLI task: the goal is to verify whether a fact is confirmed by a given abstract. To generate facts, we used a 3-step pipeline based on LLaMA-405B, validating the resulting sentences with the help of assessor-terminologists. The RuSciFact dataset consists of 1128 pairs in the format <abstract, claim>, which we are releasing as open source together with the benchmark code. Additionally, we are open-sourcing the fact-generation pipeline, which facilitates the expansion of the dataset to specific scientific domains. We evaluated several popular language models on RuSciFact, including text embedders and generative models. The results show that this benchmark makes it possible to effectively assess the fact-checking capabilities of LLMs in Russian.
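Framing verification as NLI means scoring each <abstract, claim> pair for entailment; a minimal sketch with an off-the-shelf NLI cross-encoder follows. The model name and its (contradiction, entailment, neutral) label order are assumptions for illustration; a Russian benchmark would require a Russian or multilingual NLI model.

```python
# Hedged sketch: score an <abstract, claim> pair with an NLI cross-encoder.
from sentence_transformers import CrossEncoder

nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")  # assumed model
abstract = "The study reports a 12% accuracy gain from lemmatization."
claim = "Lemmatization improved accuracy."
logits = nli.predict([(abstract, claim)])  # one row of NLI class scores
```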
Yanko Tatiana
Upstep in Russian echo-questions
The paper gives an account of Russian echo questions about expressions. The prosodic upstep model is viewed as a means of expressing the illocutionary force of such echo questions. Upstep can be construed as a fundamental frequency (F0) contour aligned with segment material long enough to implement a gradual rise from one step of an echo question to another, up to the end of the speech act. It is proposed to consider a generalized step of the upstep as a rise of F0 on the stressed syllable of a phonetic word followed by high post-tonics. At the same time, it is shown that the prosody of the echo question varies; in particular, it depends on the length of the segment material of the echo question. In addition, the upstep model can also be understood in another sense: as a function that converts the F0 contour of a wh-question into the F0 contour of the corresponding echo question. Achieving average F0 values significantly higher than the average F0 of any other speech act for each particular speaker is viewed as the “prosodic target” of the upstep model.
Zimmerling Anton, Baiuk Alexandra
Operator Words Versus MT Systems and LLMs
This paper investigates the meanings of two classes of Russian discourse words, defined by the modal operators VER and AFF, and examines their translation equivalents in nine target languages using machine translation (MT) systems, large language models (LLMs), and human expert translations. VER (verification) indicates confirmation of a hypothesis, while AFF (affirmation) expresses a strengthened belief in a hypothesis. The study uses a set of 17 Russian discourse words and evaluates their translations in English, German, Danish, Swedish, Icelandic, Ukrainian, Bulgarian, Ossetic, and Arabic across 85 test sentences. The primary goal was to test the universality of the VER and AFF distinction, hypothesizing that these classes remain distinct across languages despite the lack of direct one-to-one translation equivalents. The study assumed that VER and AFF operators, corresponding to DE RE and DE DICTO attitudes respectively, differ fundamentally in semantics and distributional behavior. The study confirms the semantic and distributional independence of VER and AFF operators, supporting their universality. LLMs, despite not being specialized for MT tasks, showed remarkable adaptability and context awareness compared to traditional MT systems. The findings highlight the potential of LLMs in nuanced translation tasks and underscore the complexity of translating modal discourse words. Future work will explore custom models and further refine evaluation metrics for translation accuracy.
Collection of proceedings