Contents (SCOPUS)
ББК 81.1
К63
Editorial board: В. П. Селегей (editor-in-chief), В. И. Беликов, И. М. Богуславский, Б. В. Добров, Д. О. Добровольский, Л. М. Захаров, Л. Л. Иомдин, И. М. Кобозева, Е. Б. Козеренко, М. А. Кронгауз, Н. И. Лауфер, Н. В. Лукашевич, Д. Маккарти, П. Наков, Й. Нивре, Г. С. Осипов, А. Ч. Пиперски, В. Раскин, Э. Хови, С. А. Шаров, Т. Е. Янко
Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (Moscow, May 30 to June 2, 2018). Issue 17(24), 2018.
For specialists in theoretical and applied linguistics and intelligent technologies.
© Editorial board of the collection “Computational Linguistics and Intellectual Technologies” (compiler), 2018
Intra-Text Coherence as a Measure of Topic Models Interpretability
Anastasyev D. G., Gusev I. O., Indenbom E. M.
Improving Part-of-speech Tagging Via Multi-task Learning and Character-level Word Representations
Andriyanets V., Daniel M., Pakendor B.
Discovering Dialectal Differences Based on Oral Corpora
Апресян В. Ю., Шмелев А. Д.
The Kettle Takes a Long Time (Not) to Boil, the Computer Takes a Long Time (Not) to Boot…
Апресян В. Ю.
Resolving Scope Ambiguity in Written Texts (Based on English Data)
Arefyev N., Ermolaev P., Panchenko A.
How much does a word weigh? Weighting word embeddings for word sense induction
Arefyev N. V., Gratsianova T. Y., Popov K. P.
Morphological Segmentation with Sequence to Sequence Neural Network
B
Belyy A. V., Dubova M. A.
Framework for Russian plagiarism detection using sentence embedding similarity and negative sampling
Belyy A. V., Seleznova M. S., Sholokhov A. K., Vorontsov K. V.
Quality Evaluation and Improvement for Hierarchical Topic Modeling
Boguslavsky I. M., Frolova T. I., Iomdin L. L., Lazursky A. V., Rygaev I. P., Timoshenko S. P.
Semantic Analysis with Inference: High Spots of the Football Match
Bolshakova E. I., Ivanov K. M.
Term Extraction for Constructing Subject Index of Educational Scientific Text
Bulygin M. V., Sharoff S. A.
Using Machine Translation for Automatic Genre Classification in Arabic
D
Denisova V. A., Cienki A., Iriskhanova O. K.
Boundary Expression in Verbs and Gesture: Differences between L1 and L2 Speakers
Добровольский Д. О., Зализняк Анна А.
German Constructions with Modal Verbs and Their Russian Equivalents: a Supracorpora Database Project
E
Егорова М. А.
The Discourse Marker tipa According to the Russian National Corpus: Origin, Semantics and Pragmatics
F
Fomin V. V., Bondarenko I. Yu.
A study of machine learning algorithms applied to GIS queries spelling correction
G
Galitsky B., Taylor R.
Discovering and Assessing Heated Arguments at the Discourse Level
Гращенков П. В., Кириллова А. А., Смирнова О. С.
The Influence of Syntax on Prosody: Data from an Experiment on a Russian Written Text
I
Инькова О. Ю.
A Supracorpora Database as a Tool for Studying the Formal Variation of Connectors
Инькова О. Ю., Нуриев В. А.
How Language-Specific Is the Conjunction khotya?
Иомдин Л. Л.
Once Again on Microconstructions Formed by Function Words: to i delo
Ivanov V. V., Solnyshkina M. I., Solovyev V. D.
Efficiency of Text Readability Features in Russian Academic Texts
K
Khristoforova E. A., Kimmelman V. I.
Corpus-based investigation of quotation in Russian Sign Language
Kibrik A. A., Fedorova O. V.
Language production and comprehension in face-to-face multichannel communication
Klyshinsky E. S., Lukashevich N. Y., Kobozeva I. M.
Creating a Corpus of syntactic co-occurrences for Russian
Konovalov V. P., Tumunbayarova Z. B.
Learning Word Embeddings for Low Resource Languages: the Case of Buryat
Коротаев Н. А.
Intonational Structure of Oral Narrative in the Context of Incompleteness
Kotov A. A., Zaidelman L. Y., Arinkin N. A., Zinina A. A., Filatov A. A.
Frames Revisited: Automatic Extraction of Semantic Patterns from a Natural Text
Кривнова О. Ф., Смирнова О. С.
A Database of Discourse Features of Word Boundaries in Spoken Russian: Structure, Composition and Application Experience
Кустова Г. И.
Second-Person Mental Predicates in Metatextual Constructions
Kutuzov A. B.
Russian Word Sense Induction by Clustering Averaged Word Embeddings
L
Laposhina A. N., Veselovskaya T. S., Lebedeva M. U., Kupreshchenko O. F.
Automated Text Readability Assessment for Russian Second Language Learners
Levin I., Andriyanets V., Iomdin B., Ambartsumian A.
Lexical Variation: Word Knowledge and Polysemy in Russian Everyday Life Lexicon
Левонтина И. Б.
On a Case of Non-Canonical Use of Interjections (a Corpus Study)
Левонтина И. Б., Шмелев А. Д.
Aby: a Corpus Study from Synchronic and Diachronic Perspectives
Лобанов Б. М., Соломенник А. И., Житко В. А.
An Attempt at Objective Evaluation of the Intonational Quality of Synthesized Russian Speech
Loukachevitch N. V., Rusnachenko N.
Extracting Sentiment Attitudes from Analytical Texts
Лютикова Е. А., Татевосов С. Г.
Reinterpretation of an Event: Observations on a Russian Linguistic Innovation
M
Miftahutdinov Z., Tutubalina E.
Leveraging Deep Neural Networks and Semantic Similarity Measures for Medical Concept Normalisation in User Reviews
Mikhalkova E. V., Ganzherli N. V., Karyakin Y. E., Grigoryev D. A.
Machine Learning Classification of User Interests Across Languages and Social Networks
N
Nedoluzhko A., Novak M., Ogrodniczuk M.
Analysis of coreferential expressions in PAWS (English-Czech-Russian-Polish Parallel Treebank with Anaphoric Relations)
Nedoluzhko A., Lapshinova-Koltunski E.
Pronominal Adverbs in German and their Equivalents in English, Czech and Russian: Evidence from the Parallel Corpus
P
Падучева Е. В.
Suspended Assertion and Nonveridicality
Panchenko A., Lopukhina A., Ustalov D., Lopukhin K., Arefyev N., Leontyev A., Loukachevitch N.
RUSSE2018: a Shared Task on Word Sense Induction for the Russian Language
Пекелис О. Е.
Illocutionary Use of Conjunctions: a Scale of Illocutionarity and Its Reflection in Grammar
Petrova M. A., Druzhkina A. A., Garashchuk R. V., Yudina M. V.
Semi-automatic Integration of a new Language into a multilingual NLP model: the case of Japanese
Piperski A. Ch.
Corpus Size and the Robustness of Measures of Corpus Distance
Подлесская В. И.
“A u nas v kvartire gaz! A u vas?” (‘We have gas in our flat! And you?’): Constructions with the Conjunction a in a Prosodically Annotated Corpus
R
Rygaev I. P.
Referring Expression Generation for Question Answering and Graph Visualization
S
Шерстинова Т. Ю.
The Structure of Everyday Dialogue as a Sequence of Speech Acts
Skachkov N. A., Vorontsov K. V.
Improving topic models with segmental structure of texts
Skorinkin D., Fischer F., Palchikov G.
Building a Corpus for the Quantitative Research of Russian Drama: Composition, Structure, Case Studies
Слабодкина Т. А., Федорова О. В.
Analysis of Speech Disfluencies in the Discourse of Russian-Speaking Children Aged 10–12
Slioussar N. A.
Gender, Declension and Stem-final Consonants: an experimental Study of Gender Agreement in Russian
Sorokin A. A.
Improving neural morphological Tagging using Language Models
T
Тискин Д. Б.
Interpretation of Russian Pronouns in Contexts of Counterfactual Identity: a Corpus Study
Toldova S., Pisarevskaya D., Kobozeva M., Vasilyeva M.
The cues for rhetorical relations in Russian: “Cause—Effect” relation in Russian Rhetorical Structure Treebank
U
Урысон Е. В.
The Syntax of Preposition-Like Adverbs: Some Difficult Cases
V
Вилинбахова Е. Л.
Chto budet, to (i) budet (‘What will be, will be’): on a Class of Tautological Constructions in Russian
Y
Z
Зализняк Анна А., Денисова Г. В., Микаэлян И. Л.
Russian kak-nibud' According to Parallel Corpora
Циммерлинг А. В.
Two Dialects of Russian Grammar: Corpus Data and a Model
Зинина А. А., Аринкин Н. А., Зайдельман Л. Я., Котов А. А.
Developing a Model of Communicative Behavior for the Robot F-2 Based on the Multimodal Corpus “REC”
Intra-Text Coherence as a Measure of Topic Models Interpretability
The article addresses the problem of automatically measuring the interpretability of topic models. Several new intra-text approaches to estimating topic interpretability are proposed. Computational experiments are conducted on texts from “PostNauka”, a collection of popular science content.
Anastasyev D. G., Gusev I. O., Indenbom E. M.
Improving Part-of-speech Tagging Via Multi-task Learning and Character-level Word Representations
In this paper, we explore ways to improve POS tagging using various types of auxiliary losses and different word representations. As a baseline, we used a BiLSTM tagger, which is able to achieve state-of-the-art results on sequence labelling tasks. We developed a new method for character-level word representation using a feedforward neural network. This representation gave us better results in terms of both the speed and the performance of the model. We also applied a novel technique of pretraining such word representations with existing word vectors. Finally, we designed a new variant of auxiliary loss for sequence labelling tasks: an additional prediction of the neighbour labels. This loss forces the model to learn the dependencies inside a sequence of labels and accelerates training. We test these methods on English and Russian.
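The neighbour-label auxiliary loss mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below rests on assumptions: at each position the tagger is imagined to emit three probability distributions (over the current, previous and next tag), and the auxiliary terms are added to the main cross-entropy with a hypothetical weight `aux_weight`. The function names are illustrative only.

```python
import math


def cross_entropy(probs, gold):
    """Negative log-probability that a distribution assigns to the gold label."""
    return -math.log(probs[gold])


def tagging_loss_with_neighbours(main_probs, prev_probs, next_probs,
                                 gold_tags, aux_weight=0.5):
    """Average per-token loss: main tag loss plus weighted neighbour-tag losses.

    main_probs[t], prev_probs[t], next_probs[t] are dicts mapping tags to
    probabilities for the tag at position t, t - 1 and t + 1 respectively.
    aux_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    n = len(gold_tags)
    loss = 0.0
    for t in range(n):
        # Main objective: predict the tag of the current token.
        loss += cross_entropy(main_probs[t], gold_tags[t])
        # Auxiliary objectives: predict the tags of the neighbouring tokens.
        if t > 0:
            loss += aux_weight * cross_entropy(prev_probs[t], gold_tags[t - 1])
        if t < n - 1:
            loss += aux_weight * cross_entropy(next_probs[t], gold_tags[t + 1])
    return loss / n


gold = ["DET", "NOUN"]
uniform = {"DET": 0.5, "NOUN": 0.5}
preds = [uniform, uniform]
loss = tagging_loss_with_neighbours(preds, preds, preds, gold)
```

Setting `aux_weight=0.0` reduces the function to the plain per-token cross-entropy, which makes the contribution of the neighbour terms easy to isolate.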
Andriyanets V., Daniel M., Pakendor B.
Discovering Dialectal Differences Based on Oral Corpora
This paper discusses a method to detect statistically significant linguistic differences between corpora while factoring in possible variability within the very corpora to be compared. Specifically, we compare two small corpora of dialects of Even, Bystraja and Lamunkhin Even, in an attempt to identify morphemes that are more frequent in either of the corpora. To investigate whether this difference might be due to an over-representation of a speaker who happens to be an outlier in terms of using a particular morpheme, we use DP, a measurement of evenness of the distribution of a specific linguistic feature across subcorpora of the same corpus.
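As an illustration of the dispersion measure named in this abstract, here is a minimal sketch that assumes DP refers to Gries's "deviation of proportions": half the sum of absolute differences between each subcorpus's share of the feature's occurrences and its share of the total corpus size. The function name and the toy counts are hypothetical and are not taken from the Even corpora.

```python
def deviation_of_proportions(feature_counts, part_sizes):
    """Return DP for one feature across the subcorpora of a corpus.

    feature_counts[i] is the number of occurrences of the feature in
    subcorpus i (for example, the texts of one speaker); part_sizes[i] is
    the size of that subcorpus in tokens. Values near 0 mean the feature is
    spread in proportion to subcorpus size; values near 1 mean it is
    concentrated in a small part of the corpus.
    """
    if len(feature_counts) != len(part_sizes):
        raise ValueError("counts and sizes must describe the same subcorpora")
    total_count = sum(feature_counts)
    total_size = sum(part_sizes)
    if total_count == 0 or total_size == 0:
        raise ValueError("feature and corpus must both be non-empty")
    return 0.5 * sum(
        abs(count / total_count - size / total_size)
        for count, size in zip(feature_counts, part_sizes)
    )


# Toy data: three equally sized speaker subcorpora.
even_dp = deviation_of_proportions([10, 10, 10], [100, 100, 100])
skewed_dp = deviation_of_proportions([30, 0, 0], [100, 100, 100])
```

In the skewed case all occurrences come from a single speaker, which is the kind of outlier effect the abstract describes wanting to rule out.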
Апресян В. Ю., Шмелев А. Д.
The Kettle Takes a Long Time (Not) to Boil, the Computer Takes a Long Time (Not) to Boot…
The paper deals with a curious phenomenon of quasi-synonymy that occurs in Russian between sentences with non-negated and negated predicates in the construction with the adverb dolgo ‘for a long time’. Consider sentences like Chainik dolgo zakipal ‘It took the kettle a long time to boil, lit. Kettle for a long time boiled’ vs. Chainik dolgo ne zakipal ‘It took the kettle a long time to boil, lit. Kettle for a long time not boiled’. The paper attempts to define the semantic and pragmatic mechanisms of such quasi-synonymy, as well as the semantic and aspectual classes of predicates where it occurs. It also considers the subtle semantic, pragmatic and communicative differences between the non-negated and negated constructions. Such quasi-synonymy occurs primarily when the predicate belongs to the aspectual class of accomplishments and denotes a telic process or action with a desired result (‘to boil’, ‘to cool down’, ‘to warm up’, ‘to grow up’, ‘to finish’, etc.). Such predicates include two major semantic components: a lasting process or action and an instant result. In the imperfective aspect they allow at least two interpretations, that of a process and that of a result. The similar interpretations of sentences with such predicates arise from different scope assignments of negation and dolgo. In sentences with a non-negated predicate, dolgo has scope over the ‘process’ component of the verb; in sentences with a negated predicate, negation has scope over the ‘result’ component of the verb while itself falling within the scope of dolgo. The former type of sentence describes long-lasting processes, whereas the latter describes long-awaited results, which pragmatically amount to the same thing.
Апресян В. Ю.
Resolving Scope Ambiguity in Written Texts (Based on English Data)
The paper is a corpus study of the factors involved in resolving potential scope ambiguity in written sentences with negation and the universal quantifier all, such as I cannot visit all these universities, which, depending on topic-focus assignment, can mean either ‘I cannot visit any of these universities’ (cannot is focus) or ‘I cannot visit some of these universities’ (all is focus). The factors at play in scope disambiguation are the syntactic function of the constituent containing all (subject, direct complement, adjunct); the status of the main predicate and all with respect to the information structure of the utterance (topic vs. focus); veridical vs. nonveridical context; sentence type (unreal conditional, rhetorical question); and pragmatic implicatures pertaining to the situations described in the utterances. The paper also demonstrates differences in the frequency distribution of the various scope readings, discusses their underlying causes, and formulates typical contexts for each scope interpretation.
Arefyev N., Ermolaev P., Panchenko A.
How much does a word weigh? Weighting word embeddings for word sense induction
The paper describes our participation in RUSSE’2018, the first shared task on word sense induction and disambiguation for the Russian language (Panchenko et al., 2018). For each of several dozen ambiguous words, the participants were asked to group text fragments containing the word according to its senses, which were not provided beforehand, hence the “induction” part of the task. For instance, given the word “bank” and a set of text fragments (also known as “contexts”) in which it occurs, e.g. “bank is a financial institution that accepts deposits” and “river bank is a slope beside a body of water”, a participant was asked to cluster the contexts into a number of clusters not known in advance, corresponding in this case to the “company” and the “area” senses of the word “bank”. The organizers proposed three evaluation datasets of varying complexity and text genres, based respectively on texts from Wikipedia, Web pages, and a dictionary of the Russian language. We present two experiments, one positive and one negative, based respectively on clustering contexts represented as a weighted average of word embeddings and on machine translation using two state-of-the-art production neural machine translation systems. Our team achieved the second best result on two datasets and the third best result on the remaining dataset among 18 participating teams. We substantially outperformed competitive state-of-the-art baselines from previous years based on sense embeddings.
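The idea of representing a context as a weighted average of word embeddings can be sketched as follows. This is only an illustration under stated assumptions: the two-dimensional embeddings, the per-word weights (here a low weight for a function word) and the function names are hypothetical, and the abstract does not say which weighting scheme or clustering algorithm the authors used.

```python
import math


def weighted_average(tokens, embeddings, weights, default_weight=1.0):
    """Average the embeddings of known tokens, scaling each by its weight."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in embeddings:
            continue  # skip out-of-vocabulary tokens
        w = weights.get(token, default_weight)
        for i, value in enumerate(embeddings[token]):
            total[i] += w * value
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known tokens: return the zero vector
    return [value / weight_sum for value in total]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


# Hypothetical toy embeddings and weights.
embeddings = {
    "money": [1.0, 0.0], "deposit": [0.9, 0.1],
    "river": [0.0, 1.0], "water": [0.1, 0.9],
    "the": [0.5, 0.5],
}
weights = {"the": 0.1}

finance_1 = weighted_average(["the", "money", "deposit"], embeddings, weights)
finance_2 = weighted_average(["deposit", "money"], embeddings, weights)
nature = weighted_average(["the", "river", "water"], embeddings, weights)

same_sense = cosine(finance_1, finance_2)
other_sense = cosine(finance_1, nature)
```

Context vectors built this way can then be passed to any standard clustering algorithm; contexts of the same sense should end up closer to each other than to contexts of a different sense.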
Arefyev N. V., Gratsianova T. Y., Popov K. P.
Morphological Segmentation with Sequence to Sequence Neural Network
Morphological segmentation is an important task in natural language processing, as it can significantly improve the handling of unfamiliar and rare words in various tasks that involve text data. In this paper we present English and Russian datasets for training and evaluating morphological segmentation algorithms, describe a method based on a sequence-to-sequence neural model, and show that the proposed approach outperforms other existing methods of morpheme segmentation. We start with an English dataset, which was already available and required only minor preprocessing, and then experiment with Russian, for which we could not obtain prepared data and therefore had to address more serious preprocessing issues. Moreover, we demonstrate how morphological segmentation can improve another natural language processing task: evaluating the semantic similarity of words. To this end, we first try to reproduce the best results of the participants in the Russian word semantic similarity competition (RUSSE), which was held at the Dialogue 2015 conference. We then show how these results can be improved with the help of morpheme segmentation.
B
Belyy A. V., Dubova M. A.
Framework for Russian plagiarism detection using sentence embedding similarity and negative sampling
In this paper, we propose a new approach to advanced plagiarism detection in Russian. It is based on a classifier that uses two types of sentence similarity measures: token set similarity and cosine similarity between sentence embeddings (based on pre-trained RusVectores, unsupervised fastText, and supervised StarSpace models). The diversity of the feature space makes it possible to detect different types of plagiarism, from simple copy-and-paste cases to complex manual paraphrases. The approach can be focused on identifying a particular plagiarism type while still allowing a universal model to be trained. The method performs well at detecting different types of plagiarism and outperforms the previous approach.
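The two kinds of similarity features named in this abstract can be sketched for a single sentence pair. The sketch makes two assumptions that the abstract does not confirm: token set similarity is computed as the Jaccard index, and sentence embeddings are already available as plain lists of floats (in the paper they come from RusVectores, fastText or StarSpace models). The function names are hypothetical, and the classifier that would consume the features is left out.

```python
import math


def jaccard(tokens_a, tokens_b):
    """Share of distinct tokens that the two sentences have in common."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def similarity_features(tokens_a, tokens_b, embedding_a, embedding_b):
    """Feature vector for one sentence pair: [token overlap, embedding cosine]."""
    return [jaccard(tokens_a, tokens_b), cosine(embedding_a, embedding_b)]


# Toy pair: one substituted word, identical (hypothetical) sentence embeddings.
features = similarity_features(
    ["the", "cat", "sat"], ["the", "cat", "ran"],
    [1.0, 0.0], [1.0, 0.0],
)
```

A paraphrase typically shows low token overlap but high embedding similarity, whereas a copy-and-paste case scores high on both, which is why combining the two feature types helps separate plagiarism types.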
Belyy A. V., Seleznova M. S., Sholokhov A. K., Vorontsov K. V.
Quality Evaluation and Improvement for Hierarchical Topic Modeling
Generic topics of large-scale document collections can often be divided into more specific subtopics. Topic hierarchies provide a model for such topic relation structure. These models can be especially useful for exploratory search systems. Various approaches to building hierarchical topic models have been proposed so far. However, there is no agreement on a standard approach, largely due to the lack of quality metrics to compare existing models. To bridge this gap we propose automated evaluation metrics which measure the quality of topic-subtopic relations (edges) of a topic hierarchy. We compare automated evaluations with human assessment to validate the proposed metrics. Finally, we show how the proposed metrics can be used to control and to improve the quality of existing hierarchical models.
Boguslavsky I. M., Frolova T. I., Iomdin L. L., Lazursky A. V., Rygaev I. P., Timoshenko S. P.
Semantic Analysis with Inference: High Spots of the Football Match
The paper describes a new version of the semantic analyzer SemETAP. Our approach is based on the assumption that the depth of understanding grows with the number of inferences we can draw from the text. The salient features of SemETAP include: 1) Intensive use of both linguistic and background knowledge: the former is incorporated in the Combinatorial Dictionary and the Grammar, and the latter is stored in the Ontology and the Repository of Individuals. 2) Words and concepts of the ontology may be supplied with explicit decompositions for inference purposes. 3) Two levels of semantic structure are distinguished: the Basic semantic structure (BSemS) interprets the text in terms of ontological elements, and the Enhanced semantic structure (EnSemS) extends BSemS by means of a series of inferences. 4) All inference rules are written in a newly developed logical formalism, Etalog. Semantic analysis with inference allows us to extract implicit information. The analyzer is tested on the task of interpreting the high spots of a football match.
Bolshakova E. I., Ivanov K. M.
Term Extraction for Constructing Subject Index of Educational Scientific Text
A subject index, or back-of-the-book index, is a device intended to provide easy access to relevant fragments of a text document. Subject indexes usually contain particular single-word and multi-word terms from the corresponding documents. Such indexes are especially useful for reading large documents with specialized terminology, as well as educational texts in difficult scientific and technical areas. The central problem of back-of-the-book indexing is recognizing the terms to be included in the index. The paper describes a method for extracting and filtering terms from a given educational scientific text, with the purpose of reliable term selection in computer indexing systems. The method is primarily based on rules with lexico-syntactic patterns that represent linguistic information about terms and the typical contexts of their usage in Russian scientific and educational texts; simple term occurrence statistics are used as well. Experimental evaluation of the method has shown a considerable increase in the precision and recall of term extraction compared with widely used standard techniques.
Bulygin M. V., Sharoff S. A.
Using Machine Translation for Automatic Genre Classification in Arabic
This paper addresses the task of automatic genre classification for Arabic within the Functional Text Dimensions (FTD) framework, which gives texts a reliable genre description while maintaining an adequate number of genre labels. Our aim in this study is to build an automatic classification model that can annotate any Web text in Standard Arabic in terms of genres. To build the training corpus, we translated annotated English and Russian texts into Arabic using Google MT. To build the model, we experimented with various machine learning approaches, such as Logistic Regression, SVM and LSTM, and with different features, such as words, character n-grams and embedding vectors. To test the classification models, we collected our own corpus of Arabic Web texts and annotated it in terms of FTDs. The best performing model offers reasonable classification accuracy in spite of being based on a training corpus produced by MT.
D
Denisova V. A., Cienki A., Iriskhanova O. K.
Boundary Expression in Verbs and Gesture: Differences between L1 and L2 Speakers
The notion of event boundaries is closely connected with the category of aspect. Aspectual forms show different views of the “internal temporal constituency of a situation” (Comrie 1976:3) and, consequently, construe events in different ways. Recently scholars have started looking into the core of the aspectual distinction through multimodality, considering hand gestures. On the basis of Russian and French oral narratives produced by native speakers, we conducted a study testing our hypothesis that there is a direct correlation between the expression of boundaries in verbs and in gestures. The means of boundary expression considered on the verbal level were perfective (soversennyj vid) and imperfective (nesoversennyj vid) verbs for Russian, and passé composé and imparfait for French. On the kinesthetic level we distinguished between bounded gestures (i.e., involving a pulse of movement) and unbounded gestures (i.e., smooth by nature). While for French L1 we found a direct correlation between gesture boundary schemas and aspectual forms, the results for Russian L1 did not support our hypothesis. In view of these differences between the two languages, we studied the boundedness correlation in oral narratives produced by Russians speaking French as L2 (CEFR levels B2-C1). The comparison between L1 and L2 narratives revealed a certain change of gestural patterns: the Russian speakers of French L2 used almost the same number of unbounded and bounded gestures with the perfective verb forms and more unbounded gestures with the imperfective forms, thus moving closer towards French L1 speakers’ verb-gesture patterns. The use of gestures can be accounted for by a series of noise factors related to language peculiarities, the cognitive mechanism of profiling and the challenges of speaking in L2.
Добровольский Д. О., Зализняк Анна А.
German Constructions with Modal Verbs and Their Russian Equivalents: a Supracorpora Database Project
The paper sets out the principles of a contrastive corpus study of German and Russian modal constructions. The aim is, first, to refine the inventory of meanings of German modal verbs and the conditions under which they are realized, and second, to identify and describe the means of expressing modal meanings in Russian by analyzing the set of constructions that serve as functional equivalents when constructions with German modal verbs are translated into Russian. The analysis is to be carried out by building, on the basis of a representative set of parallel German-Russian texts from the Russian National Corpus (RNC), a supracorpora database of translation correspondences, in which both the German construction with a modal verb and its Russian translation equivalent are annotated with a set of values of relevant features. On the one hand, such a database will be a valuable linguistic resource that can be used, among other things, to create a new generation of interactive electronic German-Russian and Russian-German dictionaries. On the other hand, the inventory of Russian construction types with (potential) modal meaning built on this database will make an important contribution to the construction grammar of Russian, confirming the fundamental continuity between lexicon and grammar.
E
Егорова М. А.
The Discourse Marker tipa According to the Russian National Corpus: Origin, Semantics and Pragmatics
The discourse marker tipa became widespread in colloquial Russian in the 1990s–2000s. However, until recently it has received little attention. In this paper we use data from the Russian National Corpus to pursue the following goals: 1) to trace the origin of the discourse marker tipa from the noun tip ‘type’, 2) to describe the semantics of the discourse marker tipa as well as that of the partly grammaticalized element tipa as part of parametric constructions. Our approach is based mainly on the results achieved by Susanne Fleischman and Marina Yaguello.
F
Fomin V. V., Bondarenko I. Yu.
A study of machine learning algorithms applied to GIS queries spelling correction
The problem of spelling correction is crucial for search engines, as misspellings have a negative effect on their performance. It gets even harder when search queries relate to a specific area that standard spell checkers do not cover well, such as geographic information systems (GIS). Moreover, standard spell checkers are interactive, i.e. they can notice a misspelled word and suggest candidate corrections, but picking one of them is up to the user. This is why we decided to develop a spelling correction unit for 2GIS, a cartographic search company. To do this, we extracted and manually annotated a corpus of GIS lookup queries, trained a language model, ran various experiments to find the best feature extractor, then fitted a logistic regression using an approach suggested in SpellRuEval and applied it iteratively to get a better result. We then measured the resulting performance by means of cross-validation, compared it against two baseline algorithms and observed a substantial increase. We also interpret the result by calculating and discussing the importance of specific features and analyzing the output of the model.
G
Galitsky B., Taylor R.
Discovering and Assessing Heated Arguments at the Discourse Level
The problem of detecting heated arguments in text such as political debates and customer complaints is formulated as tree kernel learning of discourse structures. Affective argumentation structure is discovered in the form of discourse trees extended with edge labels for communicative actions. Extracted argumentation structures are then encoded as defeasible logic programs and are subject to dialectical analysis, to establish the validity of the main claim being communicated. We evaluate the accuracy of each step of this affect processing pipeline as well as overall performance.
Гращенков П. В., Кириллова А. А., Смирнова О. С.
The Influence of Syntax on Prosody: Data from an Experiment on a Russian Written Text
The paper examines dependencies between syntactic and prosodic structure, with particular attention to pausing and to different levels of prosodic boundary strength. The research is based on prosodic markup of a spoken Russian text and manual tagging of this text with the relevant syntactic constituent boundaries. Two types of structures, the finite clause and asyndetic coordination, exhibit a strong positive correlation with the appearance of a pause and a perceptual prosodic boundary. We also demonstrate a substantial correlation between syntactic embedding depth and prosodic boundaries. The results of our research show a significant connection between some of the initially proposed syntactic factors and prosodic structure. We thus anticipate that the prosodic modules of TTS systems can benefit from taking certain syntactic information into account.
I
Инькова О. Ю.
A Supracorpora Database as a Tool for Studying the Formal Variation of Connectors
The article describes the formal variation of Russian connectors on the basis of a cognitive-semantic approach. Every discourse variant (DV) of a connector K, i.e. the specific form assumed by K in a discourse section, is singled out and registered in the supracorpora database of connectors (SCDB). The database uses a system of intersecting clusters, which allows the same DV to be assigned to different structural clusters in the course of annotation. In the next phase, on the basis of further semantic analysis, the DVs with a common element are combined into a structural-semantic complex around a basic form: the minimal linguistic unit that enables the speaker to express a certain logical-semantic relation and the listener to identify it. In conclusion, criteria for describing the formal variation of connectors are proposed, along with examples of the “profiles” of the basic forms. These profiles reflect the range of linguistic means that the speaker has at their disposal to express a given logical-semantic relation or a combination of such relations.
Инькова О. Ю., Нуриев В. А.
How Language-Specific Is the Conjunction khotya?
The paper describes the Russian connective khotya (‘although’) from a contrastive perspective. First, it focuses on the semantic description of the connective and proposes to distinguish four meanings: concessive propositional, concessive illocutionary, adversative propositional and adversative illocutionary. The paper analyzes the functioning of the connective khotya (the prototypical marker of concessive relations) and that of the connective no (‘but’, the prototypical marker of adversative relations). It concludes that the adversative meaning of khotya develops on the basis of its concessive meaning as the connection between the situations presented in the linked textual fragments becomes less logical. Conversely, as the logical connection between the situations becomes stronger, utterances with no acquire a concessive interpretation. Further, the paper takes a closer look at the French equivalents that khotya receives in each of its four meanings. The concluding section attempts to define the degree of language-specificity of khotya. To this end, several parameters are considered: (1) cases where the connective has a zero equivalent, (2) cases of divergent translation (the connective is translated by a non-connective), (3) the number of translation patterns. The supracorpora database of connectives is used to perform the contrastive analysis and to collect statistical data. The database is built upon the parallel Russian-French and French-Russian subcorpora of the RNC.
Иомдин Л. Л.
Once Again on Microconstructions Formed by Function Words: to i delo
The paper continues a series of studies on the microsyntax of Russian that the author has been conducting for quite a long time. It focuses on the adverbial microsyntactic unit to i delo (‘every now and then’), which is interesting and instructive because it combines a number of implicit semantic features with a unique set of syntactic properties, some of which come to light only when diachronic as well as synchronic language data are considered. The unit is studied against the background of other microsyntactic elements that are its neighbours in the dictionary but have a substantially different set of linguistically relevant properties. The paper discusses issues related to the adequate representation of phraseological units like to i delo in the Microsyntactic Dictionary of Russian, which is being compiled by the author and his colleagues, and in a corpus of texts with microsyntactic annotation.
Ivanov V. V., Solnyshkina M. I., Solovyev V. D.
Efficiency of Text Readability Features in Russian Academic Texts
This paper addresses the problem of readability assessment for Russian texts and investigates the impact of 24 lexical, syntactic and frequency features. The research was conducted on the Russian Readability Corpus, which contains two sub-corpora: two sets of social studies textbooks for grades 5–11, written for native speakers of Russian. The sub-corpora were collected for research purposes, annotated and labeled BOG and NIK. Ridge regression demonstrated a connection between readability and average sentence length, average number of coordinating chains, average number of sub-trees, and frequency and lexical features. The results of the study can potentially be applied in a wide variety of areas, primarily education, but also webpage design and document management.
K
Khristoforova E. A., Kimmelman V. I.
Corpus-based investigation of quotation in Russian Sign Language
This paper presents corpus-based research on quotation constructions in Russian Sign Language (RSL). Quotation constructions have been examined from different perspectives in different signed and spoken languages [Brendel, Meibauer, Steinbach 2011]; [Litvinenko et al. 2009]. Based on the corpus of spontaneous narratives recorded from RSL signers [Burkova 2015], we conducted a quantitative analysis of these constructions. We analyzed the constituents of a quotation construction, such as the indication of the source (the author of the utterance), the introducing matrix predicate, and the quote. Our investigation of non-manual markers in the corpus revealed that non-manual marking of quotation is optional for RSL quotations. We distinguished direct and indirect quotations in our data based on the reference of indexical elements, the use of a subordinating conjunction, and the imperative mood. We found that in RSL non-manuals do not mark the direct/indirect type of quotation. Our data show that RSL signers tend to use direct quotation much more frequently than indirect quotation. In addition, we compared our findings with the data on quotation constructions in some other sign languages and with studies of quotation in natural discourse of spoken languages. This comparison showed that RSL quotations share core properties with quotations in spoken and signed languages [Litvinenko et al. 2009].
Kibrik A. A., Fedorova O. V.
Language production and comprehension in face-to-face multichannel communication
Although language production and comprehension are parts of one and the same linguistic capacity, they have been studied separately for a long time. A key issue in present-day research is how the two processes are related, and whether transitions from thought to language and vice versa are accomplished by a single system or by two separate systems. Important progress in this area has been achieved in the field of psycho- and neurolinguistics; a brief review is provided in Section 1. In this paper we explore the production-comprehension relationship on the basis of our multichannel resource “Russian Pear Chats and Stories”. In Section 2 we describe this resource, including the stimulus material, data collection setup, participants and corpus size, and technical aspects. Section 3 lays out two main theoretical notions: a model of face-to-face multichannel communication and a scheme of the production-comprehension interweaving in each interlocutor. In subsequent sections we discuss three case studies of production-comprehension relationships: relative contributions of kinetic channels to discourse understanding (Section 4), turn-taking and eye gaze (Section 5), and multichannel continuity (Section 6). The evidence of the multichannel corpus suggests a cognitive architecture that integrates language production and comprehension.
Klyshinsky E. S., Lukashevich N. Y., Kobozeva I. M.
Creating a Corpus of syntactic co-occurrences for Russian
In the paper we discuss methods used to create CoSyCo, a corpus of syntactic co-occurrences, which provides information on syntactically related words in Russian. We describe a list of shallow parsing templates, which were used to collect data for CoSyCo. The paper includes an overview of the corpora collected for CoSyCo creation and an outline of how the noun ‘virus’ is used in its subcorpora as an example of the information which can be obtained from this online resource.
Konovalov V. P., Tumunbayarova Z. B.
Learning Word Embeddings for Low Resource Languages: the Case of Buryat
Word-vector representations have been extensively studied for resource-rich languages with large text datasets. However, only a few studies analyze semantic representations of low-resource languages, for which only a small corpus is available. In this study we introduce a methodology and compare techniques for learning semantic representations of low-resource languages. The proposed methodology consists of defining accurate preprocessing steps, applying a language-independent stemmer and learning word-vector representations. In addition, we propose a simple word embedding evaluation scheme that can be easily adapted to any language. Using this methodology, we learn word-vector representations for the Buryat language. To promote further research, we make the source code and the resulting word embeddings publicly available.
Коротаев Н. А.
Intonational Structure of Spoken Narrative in the Context of Non-Finality
Topic-focus articulation in Russian has mainly been studied on isolated utterances. In a categorical sentence, this communicative opposition is reflected in the linear-accentual structure [Paducheva 2015]. For a simple declarative sentence, that would normally mean that the topic (theme) comes first and has a rising phrasal accent, while the focus (rheme) completes the utterance and is pronounced with a falling accent. At the same time, these formal features do more than just differentiate between topics and foci; they also mark the discourse-semantic category of phase [Kodzasov 2009]. In syntactically simple utterances, topics tend to correlate with anticipated continuation, hence non-final phase; foci are usually phase-final. As I intend to show in this paper, the non-final phase provides a variety of contexts that challenge the topic-focus distinction. The study is based on “Stories about presents and skiing”, a collection of prosodically annotated spoken narratives. In Section 1, I concentrate on issues within a simple clause, where non-final verbal elements often have a fuzzy communicative interpretation. In Section 2, I analyze complex syntactic structures. The data show that non-final clauses may demonstrate both thematic and rhematic properties with regard to their intonation patterns, internal structure and discourse function. Hence, one can claim that some non-final clauses are topics, while others are foci. However, a majority of non-final clauses in the analyzed corpus cannot be unambiguously attributed to either of these categories. Section 3 provides a pilot study of complex intonation patterns. When only the phase distinction is considered, utterances with more than one accentual phrase may follow either (i) the basic adaptation strategy (comprising a non-final rising accent and a final falling accent), or, more often, (ii) a complicated strategy: (a) multiple parallel adaptation, (b) consecutive adaptation, or (c) a parenthetical strategy.
Kotov A. A., Zaidelman L. Y., Arinkin N. A., Zinina A. A., Filatov A. A.
Frames Revisited: Automatic Extraction of Semantic Patterns from a Natural Text
Our project aims to design a syntactic parser that constructs a semantic representation in a frame format: a clause is represented as a table of valencies filled in with semantic markers. This representation is compared against a list of scripts, which are used to disambiguate and classify the semantic representation as well as to select an appropriate reaction for the companion robot F-2.
Кривнова О. Ф., Смирнова О. С.
A Database of Discursive Features of Word Breaks in Spoken Russian: Structure, Composition and Application Experience
The paper discusses the most important results of the project “Hierarchy of prosodic phrasing in spoken language: controlling factors and means of realization”. The project aimed to expand the empirical base of research on phrasal prosody, whose inadequacy is noted in many scientific areas: discourse theory, syntax, intonational phonology, general phonetics, speech synthesis and recognition, etc. The introduction briefly describes the background of the study and formulates the tasks that had to be solved to reach the ultimate goal of the three-year project. The first section describes the speech corpora created in the project to build a complex linguistic-prosodic database required for studying and modeling prosodic phrasing in Russian speech, which takes into account, as far as possible, all controlling factors and means of realization. The second section describes the structure and composition of the database of discursive features of word breaks (BDF), obtained from annotated, prosodically graduated and acoustically analyzed speech corpora. The format and structure of the database as a computer resource are universal and flexible, so its feature set can be freely extended and the parametric characteristics of the features refined. The third section illustrates the application of the BDF to theoretical and statistical modelling of the inter-level correlations between syntax and linguistic prosody and between linguistic prosody and the speech signal (acoustic speech), in both directions. The conclusion summarizes the results of the research and discusses some promising directions for further studies on relevant topics.
Кустова Г. И.
Second-Person Mental Predicates in Metatextual Constructions
The paper examines metatextual (parenthetical) constructions with mental verbs in the 2nd person. It shows that while the propositions associated with 1st-person parentheticals (dumayu ‘I think’; boyus' ‘I am afraid’; znayu ‘I know’, etc.) and 3rd-person parentheticals (schitayut ‘they believe’, etc.) belong to the speaker and to a third party respectively, the propositions associated with 2nd-person parentheticals (dumaesh' ‘you think’, predstavlyaesh' ‘you imagine’, znaesh' ‘you know’, etc.) usually do not belong to the addressee. The following questions are addressed: whether there is a semantic correlation between the proposition and the metatextual construction (MC), and what illocutionary functions the MC and the proposition have. It is shown that some MCs are used only in interrogative sentences.
Kutuzov A. B.
Russian Word Sense Induction by Clustering Averaged Word Embeddings
The paper reports our participation in the shared task on word sense induction and disambiguation for the Russian language (RUSSE’2018). Our team was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th for the bts-rnc and active-dict datasets (containing mostly polysemous words) among all 19 participants. The method we employed was extremely naive. It consisted in representing the contexts of ambiguous words as averaged word embedding vectors, using off-the-shelf pre-trained distributional models. These vector representations were then clustered with mainstream clustering techniques, producing groups that correspond to the senses of the ambiguous word. As a side result, we show that word embedding models trained on small but balanced corpora can be superior to those trained on large but noisy data, not only in intrinsic evaluation but also in downstream tasks like word sense induction.
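The approach summarized above (average the word vectors of each context, then cluster the averages) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two-dimensional `TOY_VECTORS` stand in for a pre-trained model, the English contexts are invented, and plain k-means is swapped in for whichever clustering technique the authors actually used.

```python
from math import dist

# Hypothetical toy 2-d "embeddings"; a real system would load a pre-trained model.
TOY_VECTORS = {
    "money": (1.0, 0.1), "deposit": (0.9, 0.2), "loan": (0.8, 0.0),
    "river": (0.1, 1.0), "water": (0.0, 0.9), "shore": (0.2, 0.8),
}

def average_vector(tokens, vectors):
    """Represent a context as the mean of the vectors of its known words."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known words in context")
    dims = len(known[0])
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dims))

def kmeans(points, k, iterations=10):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # next centroid: the point farthest from all centroids chosen so far
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: nearest centroid for every point
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

# Invented contexts of the ambiguous word "bank"; unknown words are skipped.
contexts = [
    ["deposit", "money", "in", "the", "bank"],
    ["the", "bank", "of", "the", "river"],
    ["bank", "loan", "money"],
    ["water", "near", "the", "shore", "bank"],
]
points = [average_vector(c, TOY_VECTORS) for c in contexts]
labels = kmeans(points, k=2)
```

With these toy vectors the two financial contexts end up in one cluster and the two river contexts in the other. In practice the number of clusters is not known in advance, so a method that infers it, or a sweep over `k`, would be needed.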
L
Laposhina A. N., Veselovskaya T. S., Lebedeva M. U., Kupreshchenko O. F.
Automated Text Readability Assessment for Russian Second Language Learners
This paper outlines the construction of a readability assessment system for learners of Russian. The system is designed to help educators easily obtain information about the difficulty level of reading materials. The estimation task is posed as a regression problem on a data set of 600 texts with a range of lexico-semantic and morphological features. The choice of scale and the issues of collecting annotated texts are also discussed. Finally, we present the results of an experiment with learners of Russian as a foreign language that evaluates the quality of the predictive model.
Levin I., Andriyanets V., Iomdin B., Ambartsumian A.
Lexical Variation: Word Knowledge and Polysemy in Russian Everyday Life Lexicon
Many words that according to the dictionaries have just one meaning are in fact understood in different ways by different speakers. In this article we deal with Russian nouns denoting everyday life objects which are subject to much variation by age, gender, and region and are poorly described by the existing dictionaries. We report the results of a multilevel survey, propose some possible metrics of word knowledge and show to what extent the words we studied are known among a certain population. We also claim that different speakers possess different sets of meanings for each word, propose ways to discover the distribution patterns for these sets and introduce the notion of disperse polysemy. We believe that our findings may be useful in lexicography (providing detailed information on current word usage in different social groups), lexical semantics (researching meaning shifts and patterns of its distribution among speakers), and language testing (more precise detection of the vocabulary sizes both in native speakers and in language learners).
Левонтина И. Б., Шмелев А. Д.
Aby: a Corpus Study from Synchronic and Diachronic Perspectives
The paper deals with the Russian aby as a marker of “free choice” (or, rather, of unspecified choice criteria) within indefinite pronouns, against the background of other markers of “free choice” such as ugodno, popalo, pridetsia. It pays attention not only to the synchronic semantics of aby but also to its history, and claims that the modern meaning of aby is related to its usage as a conjunction. The paper makes use of corpus data (the Russian National Corpus as well as Internet data) to follow the changes in the use of the particle in question over the last two hundred years. It investigates the range of K-words that can collocate with aby: the most typical are collocations with kto, chto, kak and kakoi; however, collocations with other K-words are also present in the corpora. In addition, it discusses the question of the negative polarity of aby and the increasing degree of its polarization.
Левонтина И. Б.
On One Case of Non-Canonical Use of Interjections (a Corpus Study)
The paper deals with Russian interjections (oj, oh, aj, ogo, uh, etc.), namely their non-canonical use in collocations with K-words (Wh-words), mostly kak and kakoj. This type of use demonstrates a sort of syntactic recomposition: collocations such as oj kak, oh kakoj, etc. function as lexical units with the meaning of high degree, high quality or large quantity, although with very specific semantic shades. The paper makes use of corpus data (the Russian National Corpus as well as Internet data) to discover individual properties of interjections and their historical changes. Primary interjections are described against the background of interjections derived from words of different parts of speech. It turns out that in the non-canonical use of primary interjections the K-word can hardly be omitted, whereas derived interjections can function the same way even without a K-word. Non-canonical use of derived interjections, with and without K-words, is very popular in contemporary Russian, especially in slang.
Лобанов Б. М., Соломенник А. И., Житко В. А.
An Attempt at Objective Evaluation of the Intonational Quality of Synthesized Russian Speech
The paper describes an experiment on the instrumental evaluation of the intonation quality of synthesized Russian speech using the “Inton@Trainer” computer system. The system was originally designed to train learners in producing the basic intonation patterns of Russian speech. It is based on comparing the melodic portraits of a reference sentence and a sentence pronounced by the learner. Our approach to assessing the intonational quality of speech makes it possible to hold synthesized speech to the same strict requirements that are applied to students learning Russian as a second language. We describe the technology used for the instrumental evaluation and the acoustic database of reference phrases on which the assessment relies. The paper presents the results of testing the intonation quality of two Russian synthetic voices. We discuss the results of the experiment and outline ways of improving methods for the objective evaluation of the prosodic quality of synthesized speech, as well as the possibility of applying the developed system to other linguistic tasks.
Loukachevitch N. V., Rusnachenko N.
Extracting Sentiment Attitudes from Analytical Texts
In this paper we present the RuSentRel corpus including analytical texts in the sphere of international relations. For each document we annotated sentiments from the author to mentioned named entities, and sentiments of relations between mentioned entities. In the current experiments, we considered the problem of extracting sentiment relations between entities for the whole documents as a three-class machine learning task. We experimented with conventional machine-learning methods (Naive Bayes, SVM, Random Forest).
Лютикова Е. А., Татевосов С. Г.
Reinterpreting the Event: Observations on One Russian Language Innovation
The paper explores the distribution and interpretation of the discourse marker po(-)xodu (PX) and addresses a possible path of its diachronic development. We argue that the range of uses of PX attested in the corpora supports an analysis that identifies three meanings / functions of this item labeled eventive PX, epistemic PX and discourse-level PX throughout this paper. We propose that the latter two are the products of re-interpretation of the former. We argue for a presuppositional analysis of the eventive PX whereby it requires there be a set of background events that show a temporal overlap with the asserted event and add up to the integral whole. We analyze the epistemic PX as resulting from inferential reinterpretation of the relationship between background and asserted events, with the abductive reasoning being the key ingredient of this reinterpretation. Finally, we treat the discourse-level PX as a counterpart of the eventive PX in the domain of speech acts. We speculate that Krifka’s (2014) recent view of speech acts as index changers opens a way of accounting for this parallelism in a principled way. On the diachronic side, we identify PX as the product of diachronic development of the construction in which the argument of the noun xod ‘move’ is expressed by an overt DP. In the course of development, this DP was first replaced by pro, which gave rise to the eventive PX, and later on developed epistemic and discourse-level meanings / functions.
M
Miftahutdinov Z., Tutubalina E.
Leveraging Deep Neural Networks and Semantic Similarity Measures for Medical Concept Normalisation in User Reviews
A new and powerful tool for drug repurposing and hypothesis generation has recently emerged: text mining of sources such as scientific libraries or social media has proven reliable in this application. One particular task in this area is medical concept normalization, i.e. mapping a disease mention to a concept in a controlled vocabulary such as the Unified Medical Language System (UMLS). This task is challenging due to the differences between the language of health care professionals and that of social media users. To bridge this gap, we developed end-to-end architectures based on bidirectional Long Short-Term Memory and Gated Recurrent Units. In addition, we combined an attention mechanism with our model. We conducted an exploratory study of the hyperparameters of the proposed architectures and compared them with an effective classification baseline based on convolutional neural networks. Both a qualitative examination of mentions in a dataset of user reviews collected from popular online health information platforms and a quantitative evaluation show improvements in the semantic representation of health-related expressions in user reviews about drugs.
Mikhalkova E. V., Ganzherli N. V., Karyakin Y. E., Grigoryev D. A.
Machine Learning Classification of User Interests Across Languages and Social Networks
Being a matter of cognition, user interests should be apt to classification independent of the language of users, social network and the essence of interest itself. To prove it, we built a collection of English and Russian Twitter and Vkontakte community pages manually classified according to the interests of their followers. First, we created a model of Major Interests (MaIs) with the help of expert analysis and then classified the mentioned set of pages using machine learning algorithms (SVM, Neural Network, Naive Bayes, Logistic Regression, Decision Trees, k-Nearest Neighbors) trying different optimization techniques. We take three interest domains that are typical of both English and Russian-speaking communities: football, rock music, vegetarianism. The results of classification show a greater correlation between Russian-Twitter and English-Twitter pages. The Logistic Regression with Bernoulli bag-of-words model proves to be the most effective classification algorithm.
N
Nedoluzhko A., Novak M., Ogrodniczuk M.
Analysis of coreferential expressions in PAWS (English-Czech-Russian-Polish Parallel Treebank with Anaphoric Relations)
In this paper, we describe the coreference annotation of a multilingual parallel treebank (PAWS), a portion of the Wall Street Journal translated into Czech, Russian and Polish, which continues the tradition of multilingual treebanks with coreference annotation. The paper focuses on language-specific differences. We analyse syntactic structures concerning anaphoric relations in the languages under analysis, such as personal and impersonal constructions in polypredicative constructions and pro-drop qualities.
Nedoluzhko A., Lapshinova-Koltunski E.
Pronominal Adverbs in German and their Equivalents in English, Czech and Russian: Evidence from the Parallel Corpus
The paper presents a contrastive analysis of pronominal adverbs in German (dabei, darauf, damit etc.) and their equivalents in English, Czech and Russian. The analysis is based on an empirical study of parallel news texts. Our main focus is to show the interplay between cohesive devices expressed through German pronominal adverbs in text and explore their equivalents in English, Czech and Russian. As the dataset at hand contains translations, we also focus on the influence of the translation factor in parallel texts.
P
Падучева Е. В.
Suspended Assertion and Nonveridicality
The talk deals with suspended assertion (Russian: snyataya utverditel'nost'). It shows that the term suspended assertion, introduced by U. Weinreich in 1963, covers the same range of phenomena as the term nonveridicality (the proposed Russian translation is neveridikativnost'), which has become widespread in the formal semantics literature thanks to the work of A. Giannakidou, F. Zwarts and others. The paper examines facts of Russian that call for the notion of suspended assertion: pronouns of the kakoy-nibud' type, negative polarity pronouns, the disappearance of a semantic argument of verbs in the direct (non-parametric) diathesis, the mirror symmetry of past and future, negation with extended scope, -nibud' pronouns in the scope of negation, and the interchangeability of eshche (‘still’) and uzhe (‘already’). The author expresses the conviction that the notion of suspended assertion will find application in other contexts as well.
Panchenko A., Lopukhina A., Ustalov D., Lopukhin K., Arefyev N., Leontyev A., Loukachevitch N.
RUSSE2018: a Shared Task on Word Sense Induction for the Russian Language
The paper describes the results of the first shared task on word sense induction (WSI) for the Russian language. While similar shared tasks were conducted in the past for some Romance and Germanic languages, we explore the performance of sense induction and disambiguation methods for a Slavic language that shares many features with other Slavic languages, such as rich morphology and virtually free word order. The participants were asked to group contexts of a given word in accordance with its senses, which were not provided beforehand. For instance, given the word “bank” and a set of contexts for this word, e.g. “bank is a financial institution that accepts deposits” and “river bank is a slope beside a body of water”, a participant was asked to cluster such contexts into a number of clusters not known in advance, corresponding in this case to the “company” and the “area” senses of the word “bank”. For the purpose of this evaluation campaign, we developed three new evaluation datasets based on sense inventories that have different sense granularity. The contexts in these datasets were sampled from texts of Wikipedia, the academic corpus of Russian, and an explanatory dictionary of Russian. Overall, 18 teams participated in the competition, submitting 383 models. Multiple teams managed to substantially outperform competitive state-of-the-art baselines from previous years based on sense embeddings.
Пекелис О. Е.
Illocutionary Use of Conjunctions: the Illocutionarity Scale and its Reflection in Grammar
The paper examines the illocutionary use of conjunctions, in which a conjunction links the proposition of one clause with the illocutionary modality of another. A scalar approach to this phenomenon is argued for: alongside indisputably illocutionary and indisputably non-illocutionary uses, there is a class of constructions with intermediate properties. Criteria for distinguishing degrees of illocutionarity are formulated. In particular, it is demonstrated that imperative sentences, unlike interrogative ones, are never indisputably illocutionary. Evidence is presented that the proposed approach is supported by the grammar: different conjunctions are compatible with different kinds of illocutionary use, and the correlate togda (‘then’) does not occur in indisputably illocutionary constructions.
Petrova M. A., Druzhkina A. A., Garashchuk R. V., Yudina M. V.
Semi-automatic Integration of a new Language into a multilingual NLP model: the case of Japanese
This paper deals with the integration of the Japanese language into a multilingual NLP model, namely the Compreno model. The formalism includes morphological, syntactic and semantic patterns, covering all possible semantic and syntactic dependencies a word can attach. The architecture of the model allows us to acquire nearly all semantic links of a word through its proper positioning in a thesaurus-like semantic hierarchy, where words are linked through semantic dependencies. The inheritance principle of the hierarchy simplifies the syntactic description of a newly added language as well. Unlike the traditional approach to Japanese parsing based on chunks, or bunsetsus, we suggest a Japanese parser based on constituents. Special attention is given to the tools that allow us to automate the language description process and significantly speed it up. The work on the Japanese model is still in progress; we therefore show the results achieved so far and point out problems that remain to be solved.
Piperski A. Ch.
Corpus Size and the Robustness of Measures of Corpus Distance
This paper studies the impact of corpus size on the robustness of various frequency-based measures of corpus distance (or, conversely, similarity), such as Euclidean distance, Manhattan distance, cosine distance, χ², Spearman’s ρ, and Simple-Maths Keyword distance. An experiment performed on the British National Corpus shows that Euclidean distance is least influenced by corpus size and is thus best suited for comparing corpora.
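Three of the measures named above can be sketched on relative word frequencies as follows. This is only an illustration under stated assumptions: the two toy "corpora" are invented, the frequency vectors are built over the full joint vocabulary, and the paper's exact definitions (for instance, which words are included) may differ.

```python
from collections import Counter
from math import sqrt

def relative_frequencies(tokens, vocabulary):
    """Relative frequency of every vocabulary word in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocabulary]

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_distance(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = sqrt(sum(a * a for a in p))
    norm_q = sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

# Invented toy corpora, used only to exercise the functions.
corpus_a = "the cat sat on the mat".split()
corpus_b = "the dog sat on the log".split()
vocabulary = sorted(set(corpus_a) | set(corpus_b))

freq_a = relative_frequencies(corpus_a, vocabulary)
freq_b = relative_frequencies(corpus_b, vocabulary)

d_euclidean = euclidean(freq_a, freq_b)
d_manhattan = manhattan(freq_a, freq_b)
d_cosine = cosine_distance(freq_a, freq_b)
```

Robustness to corpus size could then be probed by recomputing each distance on samples of different sizes drawn from the same pair of corpora and checking how much the value changes.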
Подлесская В. И.
“We Have Gas in Our Flat! And You?”: Constructions with the Conjunction A in a Prosodically Annotated Corpus
The paper focuses on Russian constructions with clauses (or VPs) combined by means of the discourse marker A, which behaves as a conjunction or as a particle in different contexts. Prosodically, the construction may come in two forms: (a) as a single illocution, with the first clause pronounced with a rising pitch that projects discourse continuation, and (b) as two separate illocutions, with the first clause pronounced with a falling pitch that projects no continuation. Based on data from the Prosodically Annotated Corpus of Spoken Russian, the prosody and grammar of (a) and (b) were analyzed qualitatively and quantitatively. Type (b) turned out to be as frequent as type (a) and systematically favored in pragmatically marked contexts.
R
Rygaev I. P.
Referring Expression Generation for Question Answering and Graph Visualization
This paper describes a practical solution to the task of referring expression generation (REG) in the context of a question-answering system. When an answer to a question is found in the knowledge base, the system has to decide how to present the answer to the user, that is, which properties uniquely distinguish the object found from other objects in the knowledge base. Another task where referring expressions are useful is semantic graph visualization. Building on the graph-based approach presented by Krahmer et al. (2003), this paper makes some practical improvements to the algorithm: (1) instead of depth-first graph search we use breadth-first search, which is dramatically faster when the scene graph is big but the description graph to be found is small; (2) we limit the size (the number of edges) of the resulting description graph to increase performance and avoid uselessly long descriptions. A sketch of the linguistic realization of referring expressions is also given.
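The two improvements mentioned above, breadth-first search and a size limit on the description, can be illustrated with a deliberately simplified sketch. It is not the graph-based algorithm of the paper: objects are flattened to plain property sets instead of scene graphs, and the function name, the `scene` data and the `max_size` parameter are all hypothetical.

```python
from collections import deque

def distinguishing_description(target, scene, max_size=3):
    """Breadth-first search for a smallest property set true of `target` only.

    `scene` maps object names to sets of properties. Returns a tuple of
    properties, or None if no description of at most `max_size` properties
    rules out every other object.
    """
    properties = sorted(scene[target])
    queue = deque([()])  # start from the empty description
    while queue:
        description = queue.popleft()
        distractors = [name for name, props in scene.items()
                       if name != target and set(description) <= props]
        if not distractors:
            return description
        if len(description) < max_size:
            # extend only with properties later in the sorted order,
            # so that every subset is generated exactly once
            start = properties.index(description[-1]) + 1 if description else 0
            for prop in properties[start:]:
                queue.append(description + (prop,))
    return None

# Invented scene with three objects.
scene = {
    "d1": {"dog", "small", "brown"},
    "d2": {"dog", "large", "brown"},
    "c1": {"cat", "small", "brown"},
}
description_d1 = distinguishing_description("d1", scene)
description_d2 = distinguishing_description("d2", scene)
too_short = distinguishing_description("d1", scene, max_size=1)
```

Because shorter descriptions are dequeued before longer ones, the first description without distractors is also a shortest one, and `max_size` cuts the search off before descriptions become uselessly long.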
S
Шерстинова Т. Ю.
The Structure of Everyday Dialogue as a Sequence of Speech Acts
The structure of everyday dialogue was studied on 73 microdialogues of everyday spoken communication from the corpus of spoken Russian “One Speech Day” (the ORD corpus). The aim of the study was to find out which types of speech acts most often open and close a dialogue, and to identify the most typical sequences of speech acts in dialogue structure. The speech of 30 people (6 informants and 24 interlocutors) was analyzed, amounting to 2230 speech acts from both professional and everyday conversations. N-gram analysis was used to count the most frequent sequences of speech acts. The results show that dialogues are most often opened by representatives, i.e. speech acts related to the exchange of information (38% of cases); an “etiquette” opening (greetings, vocatives) occurs in 23% of dialogues, and in 19% of cases the conversation begins with a regulative form. Speech acts that close a dialogue are more diverse: representatives (16% of cases), evaluative judgments (valuatives) (14%), regulative forms (14%), directives, commissives and etiquette forms (8% each), and expressives (7%). The most typical binary sequences of speech acts turned out to be: two representatives in a row (22.35%), a regulative form followed by a representative (6.93%), a representative and a regulative form (6.0%), a valuative followed by a representative (5.21%), a representative and an evaluative judgment (4.77%), and the two-way combination of a directive with a representative (2.77% each).
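The n-gram counting of speech-act sequences mentioned in the abstract above can be sketched as follows. The dialogues and labels are invented for illustration and are not taken from the ORD corpus; the function name is hypothetical.

```python
from collections import Counter

def ngram_shares(dialogues, n=2):
    """Share of every speech-act n-gram among all n-grams in the dialogues."""
    counts = Counter()
    for acts in dialogues:
        # zip over n shifted copies of the sequence yields its n-grams;
        # n-grams never cross a dialogue boundary
        counts.update(zip(*(acts[i:] for i in range(n))))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# Invented toy dialogues, each a sequence of speech-act labels.
dialogues = [
    ["etiquette", "representative", "representative", "valuative"],
    ["representative", "representative", "regulative", "representative"],
]

bigram_shares = ngram_shares(dialogues, n=2)
openers = Counter(acts[0] for acts in dialogues)   # acts that open a dialogue
closers = Counter(acts[-1] for acts in dialogues)  # acts that close a dialogue
```

The same counters give the shares of opening and closing act types once divided by the number of dialogues.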
Skachkov N. A., Vorontsov K. V.
Improving topic models with segmental structure of texts
Probabilistic topic modeling is a powerful tool of text analysis that reveals topics as distributions over words and then softly assigns documents to the topics. Even though the aggregated distributions can be good with basic models, a sequential topic representation of each document is often unsatisfactory. This work introduces a method that makes it possible to increase the quality of the topical representation of each single text using its segmental structure. Our approach is based on Additive Regularization of Topic Models (ARTM), which is a technique for imposing additional criteria on the model. The proposed method efficiently avoids the bag-of-words assumption by considering the topical connections of words that co-occur in a local segment. We assume that consecutive sentences are topically and semantically coherent, while the number of topics in each particular text fragment is low. We apply our model to the topic segmentation task and achieve better quality than the current state-of-the-art TopicTiling algorithm. In further experiments we demonstrate that the proposed technique reveals an interpretable sequential structure of documents while keeping the number of topics low, i.e. the sparsity of the model increases. Apart from topic segmentation, the constructed topical text embeddings can be used in any other application where analysis of the document structure is desirable.
Skorinkin D., Fischer F., Palchikov G.
Building a Corpus for the Quantitative Research of Russian Drama: Composition, Structure, Case Studies
In this paper we introduce RusDraCor, an open corpus of Russian drama for digital literary and linguistic research. The corpus (rus.dracor.org) contains plays from the mid-18th century to the first third of the 20th century, provided with structural (plus some semantic) markup and metadata. Texts are encoded in the XML-based TEI standard, widely used in building corpora for the humanities. We describe the contents and annotation layers of our corpus, provide some details on its development and enrichment, and finally describe three research cases. Each case demonstrates the use of RusDraCor to answer specific questions about the composition, structural features and historical evolution of Russian drama.
Слабодкина Т. А., Федорова О. В.
An Analysis of Speech Disfluencies in the Discourse of Russian-speaking Children Aged 10–12
This paper continues a line of research on speech disfluencies that has become traditional for the Dialogue conferences (see, in particular, Подлесская, Комарова 2010; Лауринавичюте, Федорова 2010; Подлесская 2013; Богданова-Бегларян 2013; Подлесская 2014; Потанина и др. 2016). Here we address the issue by comparing the linguistic behavior of Russian-speaking children aged 10–12 (Section 1) with that of adult native speakers, using the tangram corpus (Section 2). Section 3 presents a classification of speech disfluencies, and Section 4 reports the results of the study. Finally, Section 5 discusses the results and prospects for further work. We show that, with respect to speech disfluencies, the discourse behavior of a 10–12-year-old child differs from that of adult native speakers, which supports our hypothesis of late discourse development in children.
Slioussar N. A.
Gender, Declension and Stem-final Consonants: an experimental Study of Gender Agreement in Russian
Every adult native speaker of Russian knows that kon’ is masculine and lan’ is feminine, although 3rd declension nouns present some difficulties in first and second language acquisition. However, will the fact that these nouns are less frequent than masculine nouns ending in a consonant or feminine nouns ending in -a/ja play a role in online subject-predicate agreement processing? Or will subject-predicate agreement processing be more problematic with subjects of a certain gender? Finally, some final consonants are more characteristic of feminine gender, while others are more characteristic of masculine gender. Are speakers sensitive to this? We present two experiments addressing these questions. We found that all three factors play a role, but for different tasks (online agreement processing or determining the gender of a novel word) and at different processing stages.
Sorokin A. A.
Improving neural morphological Tagging using Language Models
We offer a new neural architecture for character-level morphological tagging, combining character-level networks with the output of a neural language model on morphological tags. Our proposal reduces tagging error by up to 10% in comparison with the baseline model and achieves state-of-the-art performance on both the ru_syntagrus and MorphoRuEval datasets.
Stoynova N. M.
Differential object marking in contact-influenced Russian Speech: evidence from the Corpus of Contact-influenced Russian Speech of Russian Far East and Northern Siberia
The paper deals with differential object marking (DOM) in the Russian speech of Nanai-Russian bilingual speakers, namely variation such as принес рыбу ~ принес рыба (‘{he} brought fish-acc ~ fish-nom’). The puzzle is that this peculiarity can result from a number of different processes: morphosyntactic borrowing from Nanai, penetration of dialectal features into the speech of bilinguals, or under-acquisition or reinterpretation of the Standard Russian system. Data from a small corpus of contact-influenced Russian speech are used to test all these hypotheses. The results are as follows. Nominative forms are used in direct object position quite systematically, and such uses cannot be regarded as occasional “errors”. The main factors that influence the NOM~ACC distribution are a) information structure and b) the accentual type of the noun stem. The latter fact supports the hypothesis of a systematic reinterpretation of the Standard Russian system in a situation of incomplete acquisition. No significant correlations with animacy, definiteness, verb form or word order were attested. The DOM pattern of Nanai Russian differs from those of Russian dialects and shows some similarity to that of Nanai. However, it cannot be considered a full morphosyntactic calque.
T
Тискин Д. Б.
Interpreting Russian Pronouns in Counterfactual Identity Contexts: A Corpus Study
The paper attempts a corpus-based analysis of the semantics of Russian personal and possessive pronouns in intensional contexts, taking counterfactual identity contexts as an example. The aim was to determine whether pronouns of different types can be interpreted de se or de re in such contexts and which interpretation is preferred. Counterfactual identity contexts are syntactic positions within the scope of a modifier or clause that introduces an unreal condition concerning the identity of individuals who are in fact non-identical (cf. на твоём месте ‘in your place’, English if I were you). In such contexts a pronoun may denote the real person (as in Я бы на их месте таких должников, как я, в хвост и гриву гоняла ‘If I were them, I would hound debtors like me mercilessly’; de re) or the unreal one (Я бы на их месте поставил парочку шалашиков в любом приглянувшемся мне месте ‘If I were them, I would put up a couple of huts in any spot I liked’; de se, i.e. the one from whose point of view the unreal situation is viewed). Using GICR (ГИКРЯ, about 20 billion tokens), we show that the pronouns я ‘I’ and мой ‘my’ allow both the de re and the de se interpretation, with the former preferred; that the reflexive pronoun себя ‘oneself’ also allows both interpretations, with de se preferred; and that the reflexive possessive свой ‘one’s own’ is interpreted de se without exception. In addition, we make some qualitative observations on the identification of an atomic individual with a plural one, as in я бы на вашем месте не стала морочить себе голову, вы молодые люди у вас ещё всё впереди ‘if I were you, I wouldn’t trouble myself over it; you are young people, you have everything ahead of you’.
Toldova S., Pisarevskaya D., Kobozeva M., Vasilyeva M.
The cues for rhetorical relations in Russian: “Cause—Effect” relation in Russian Rhetorical Structure Treebank
The purpose of the paper is to investigate cues signalling the relations between discourse units in Russian. Building a lexicon of discourse connectives is an indispensable subtask in many discourse parsing applications as well as an essential issue in theoretical research on text coherence. In order to develop such a resource for Russian, we have conducted a corpus-based study of discourse connectives that were manually extracted from the Russian Rhetorical Structure Treebank (Ru-RSTreebank). The Treebank includes 79 texts annotated within the RST framework [Mann, Thompson 1988]. In order to provide a deeper analysis of connectives in Russian, we focus on causal relations only, namely the ‘Cause-Effect’ relation. Some of the connectives (primary connectives) are enumerated in grammars and dictionaries. They primarily mark intra-sentential relations. However, there is an extensive class of less grammaticalized items (secondary connectives) that have so far received less attention. Some of them are based on content words (e.g. по причине ‘for the cause’). Secondary connectives often serve as linking devices for inter-sentential relations. We suggest a scheme for annotating connectives in Russian. We specify the basic patterns that can be used for mining less grammaticalized connectives in an unannotated corpus. In addition, we compare the two classes of connectives (primary vs. secondary). Our research has shown that these two classes differ in their properties. There is a statistically significant difference between them with respect to nucleus/satellite position, intra- vs. inter-sentential relations and some other properties.
U
Урысон Е. В.
The Syntax of Preposition-like Adverbs: Some Difficult Cases
This paper deals with the so-called adverbial prepositions of Russian; cf. vokrug (kostra) ‘around smth.’, daleko ot (doma) ‘far from smth.’, etc. By definition, an adverbial preposition either coincides with an adverb (cf. vokrug) or contains an adverb and a preposition (cf. daleko ot). As I have demonstrated in my previous works, an adverbial preposition and the underlying adverb have the same meaning, the only difference between them being the mode of expression of the main semantic actant; cf. Gorel koster, vokrug (preposition) kostra stojali liudi ‘A fire was burning, people were standing around it’ vs. Gorel koster, vokrug (adverb) stojali liudi ‘A fire was burning, people were standing around’. From the modern point of view, a syntactic distinction is insufficient for interpreting such cases as different words (or different meanings of a word). So, an adverbial preposition and the underlying adverb should be interpreted as the same meaning of a given word. I argue that this word is an adverb (or a prepositional adverb). This paper deals with the syntax of these adverbs. Such adverbs have one or more semantic actants, at least one of them being expressed by a noun or a prepositional group. The problem is that in some cases it is not clear whether the prepositional group is governed by the adverb or by the verb governing this adverb (in the latter case the adverb and the prepositional group are co-governed by the verb). A criterion for distinguishing government by the adverb from government by the verb is discussed. Two Russian adverbs, zadolgo ‘for a long time before smth.’ and nezadolgo ‘shortly before smth.’, are described from this point of view.
V
Вилинбахова Е. Л.
Что будет, то (и) будет: On a Class of Tautological Constructions in Russian
The paper examines correlative tautological constructions of the type что будет, то (и) будет ‘what will be will be’, in which the subordinate clause precedes the main clause and the content of the two parts is materially identical. Analysis of data from the Russian National Corpus and internet sources reveals a number of non-trivial properties of these constructions. For instance, some tautologies convey opposite meanings in different contexts: что было, то было ‘what was, was’ can be interpreted both as ‘the fact that this really happened cannot be denied’ [Булыгина, Шмелев 1997] and as a readiness to forget the past for the sake of the future [Активный словарь русского языка]. Further, the particle и in the main clause is acceptable in some tautologies but not in others. The paper proposes an explanation of these facts by distinguishing four possible meanings on the basis of two oppositions: (a) whether the situation described is in the speaker’s focus of attention or is removed from it; (b) whether the construction has a generic or a specific-referential reading.
Y
Янко Т. Е.
Speech Acts in the Structure of Coherent Discourse: Markers of Incompleteness in Spoken Corpora Data
One of the means of signaling coherence in spoken discourse is to show that the current utterance is not terminal. Every step of a narrative consisting of a chain of statements can be marked as non-final. The prosodic cues for incompleteness applied to the speech act of a statement have been studied in detail in the linguistic literature. In this paper, discourse incompleteness is analyzed as combining not only with statements but also with questions, imperatives, and vocatives. The results of the investigation are as follows. Wh-questions, imperatives, and vocatives can be freely combined with the meaning of discourse continuity, and they have specific prosodic cues for marking this combination of meanings, whereas yes-no questions do not accept prosodic incompleteness marking. The prosodic patterns of incompleteness and the accent placement in questions, vocatives, and imperatives are exemplified here by dialogues taken from the Multimodal corpus of the Russian National Corpus, the Prosodically Annotated Corpus of Spoken Russian (spokencorpora.ru), and a small working collection of Russian speech recordings set up specifically for this investigation. The software program Praat was used to analyze the audio data.
Z
Зализняк Анна А., Денисова Г. В., Микаэлян И. Л.
Russian как-нибудь in the Light of Parallel Corpora Data
The paper offers a semantic analysis of the Russian indefinite adverb как-нибудь based on data from the French, Italian and English parallel subcorpora of the Russian National Corpus, as well as from the database of Russian discourse words and their French equivalents. The study uses a unidirectional method of contrastive analysis, in which the way a professional translator renders the meaning of the analyzed unit of the source text is treated as its quasi-definition, revealing possible implicit components of its meaning. The study confirmed the high degree of language specificity of this word, which shows both in the large share of zero equivalents among the “models” and the “stimuli” of translation and in the wide range of different translation “models” and “stimuli”. The word как-нибудь was found to have the meaning of a “marker of uncontrollability”, functionally similar in some contexts to the subjunctive in Romance languages, which is not recorded in monolingual or bilingual dictionaries; on the other hand, the purely evaluative meaning ‘anyhow, badly’ was found to have considerably narrowed its sphere of use in the modern language compared to the 19th century and to be realized predominantly together with the main meaning of indefiniteness of manner.
Циммерлинг А. В.
Two Dialects of Russian Grammar: Corpus Data and a Model
This paper addresses the problem of parametric variation in Russian grammar, with a focus on copular constructions with agreeing and non-agreeing adjectival predicates. Based on the Russian National Corpus, I reconstruct two dialects of Russian morphosyntax. They differ in the assignment of the predicative instrumental case, raising conditions and the distribution of agreeing vs. non-agreeing predicates after быть ‘be’, стать ‘become’ and казаться ‘seem’. Russian-A only licenses the predicative instrumental on adjectives after SEEM (казалось странным, что P) and non-agreeing predicatives after non-zero forms of BE or BECOME (было странно, что P). Russian-B allows non-agreeing forms after SEEM (казалось странно, что P) and forms of the predicative instrumental case after non-zero forms of BE and BECOME (было странным, что P). I argue that the differences between Russian-A and Russian-B must be explained in terms of parametric settings and claim that Russian predicatives lack forms of the predicative instrumental. The assignment of the predicative instrumental to adjectival heads can be explained as subject control in all dialects, but only Russian-B allows raising of sentential arguments to the position of the matrix subject.
Зинина A. А., Аринкин Н. А., Зайдельман Л. Я., Котов А. А.
Developing a Model of Communicative Behavior for the F-2 Robot Based on the REC Multimodal Corpus
The paper describes an architecture under development for modeling natural communicative behavior on the F-2 robot. An important part of our work is a corpus study of human communicative behavior and the subsequent transfer of this behavior to the robot. Drawing on the REC multimodal corpus, we describe features of natural communication and develop an architecture that takes these features into account. In this architecture the robot can express a given communicative function in different ways, using one or several effectors: for example, it can signal an appeal with facial expressions, head movements or hand gestures. The architecture also allows gestures with different communicative functions to be flexibly combined. Using the split, join and single modes, it can combine tags from different BML packages and synchronize tags within a single BML package. These features are key to producing plausible behavior for the F-2 robot and are necessary for making communication between the robot and the user more effective.
Additional
B
Bakarov A., Kutuzov A., Nikishina I.
Russian computational linguistics: topical structure in 2007-2017 conference papers
Blinova O.V.
Positional properties of Russian appellatives (based on speech corpus data)
D
Dikonov V.G.
Simulation of background knowledge and bridging in Russian
Dmitrin Y., Botov D., Klenin J., Nikolaev I.
Comparison of deep neural network architectures for authorship attribution of Russian social media texts
G
Grashchenkov Pavel, Smirnova Olga, Kirillova Anastasia
Syntactic factors affecting prosody
K
Kolmogorova A.V., Kalinin A.
Frequency and combinatorics of somatisms in texts of different sentiment tonality
Kuchina S.A.
Linguo-semiotic specifics of electronic literary texts
Kouznetsov V.B.
Phonetic component of integral language model: modeling coarticulation processes in Russian by means of locus equations
M
Mihailov S.A., Shersheneva D.M.
The online dictionary aggregator resource Вышка. Словари
Mittal S., Rani P.
RUSSE’2018: word sense induction method based on context-based lists
R
Ryzhova D.A., Melnik A.A., Yershov I.A., Panteleeva I. M., Paperno D.A., Singh Ya., Sobolev M.
Automatic data collection in lexical typology
S
Sadov M., Kutuzov A.
Use of morphology in distributional word embedding models: Russian language case
Shavrina T.O.
Differential approach to web-corpus construction
Y
Yakovenko O., Bondarenko I., Borovikova M., Vodolazsky D.
Algorithms for accentuation and phonemic transcription of Russian texts in speech recognition systems
Z
Zhornik D.O., Sizov F.O.
Linguistic variation and minor languages corpora: a case study of Mansi dialects
A dictionary of evaluative vocabulary from consumer reviews based on the internet resource Яндекс.Маркет
E
Erdyneeva Saryun
Russian semi-phraseological preposition-noun constructions with the word сила ‘force’ from the standpoint of automatic semantic analysis
F
Feldman Daniil
Using subject-predicate-object triplets for opinion mining
G
Gomzin Andrey, Turdakov Denis
Detection of Author’s Educational Level and Age based on Comments Analysis
K
Kotliarova Ekaterina
Text classifier for Hierarchical Attention Networks for Document Classification
Kudryavtseva Angelina
Referential conflict filter in a model of referential choice
L
Linnik Iuliia
His Fault, but Ihre Wunschen [Her Desires]: A Corpus-based Study on Language and Gender
T
Teslenko Denis, Ustalov Dmitry
Word sense disambiguation using semi-supervised learning
V
Vodolazsky Daniil
Algorithm of an automatic building of a derivational morphology resource for Russian
Z
Земичева Светлана
The Tomsk Dialect Corpus: balance and representativeness