ARTICLE | doi:10.20944/preprints202111.0378.v1
Subject: Engineering, Other Keywords: NCM classification; natural language processing; transformers; multilingual BERT; portuguese BERT; NLP; BERT
Online: 22 November 2021 (10:59:43 CET)
The classification of goods involved in international trade in Brazil is based on the Mercosur Common Nomenclature (NCM). Classifying these goods is a real challenge due to the complexity of assigning the correct category codes, especially considering the legal and fiscal implications of misclassification. This work focuses on training a classifier based on Bidirectional Encoder Representations from Transformers (BERT) for the tax classification of goods with NCM codes. In particular, this article presents results from using a BERT model tuned specifically for Portuguese as well as results from using a multilingual BERT. Experimental results justify the use of these models in the classification process and show that the language-specific model performs slightly better.
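A minimal sketch of this kind of classifier, assuming the HuggingFace transformers library. The two checkpoint names are the publicly available BERTimbau (Portuguese) and multilingual BERT models; the label count and the classify helper are hypothetical, not the paper's exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Swap between the Portuguese-specific and multilingual checkpoints:
# "neuralmind/bert-base-portuguese-cased" (BERTimbau) or
# "bert-base-multilingual-cased".
MODEL_NAME = "neuralmind/bert-base-portuguese-cased"
NUM_NCM_CODES = 10_000  # hypothetical count of distinct NCM labels

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_NCM_CODES
)

def classify(description: str) -> int:
    """Predict the NCM label index for a goods description in Portuguese."""
    inputs = tokenizer(description, truncation=True, max_length=128,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))
```

Note that from_pretrained initializes the classification head randomly, so the model would still need fine-tuning on labeled (description, NCM code) pairs before use.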
ARTICLE | doi:10.20944/preprints201905.0029.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: multilingual; open information extraction; parallel corpus
Online: 6 May 2019 (06:14:07 CEST)
The number of documents published on the Web in languages other than English grows every year. As a consequence, the need to extract useful information from different languages also grows, underlining the importance of research on Open Information Extraction (OIE) techniques. Most OIE methods deal with features from a single language; few approaches tackle multilingual aspects. In such approaches, multilinguality is treated only as an extraction setting, which results in low precision due to the use of general rules. Multilingual methods have been applied to a vast range of problems in Natural Language Processing, achieving satisfactory results and demonstrating that knowledge acquired for one language can be transferred to other languages to improve the quality of the extracted facts. We argue that a multilingual approach can enhance OIE methods, being ideal for evaluating and comparing OIE systems and, as a consequence, for application to the collected facts. In this work, we discuss how transferring knowledge between languages can increase the acquisition of facts in multilingual approaches. We provide a roadmap of the Multilingual Open IE area covering state-of-the-art studies. Additionally, we evaluate the transfer of knowledge to improve the quality of the facts extracted in each language. Moreover, we discuss the importance of a parallel corpus for evaluating and comparing multilingual systems.
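As a purely illustrative sketch (not the paper's method), one simple way cross-lingual transfer can strengthen extracted facts is to raise the confidence of a (subject, relation, object) triple that is independently extracted in several languages of a parallel corpus. The function and data below are hypothetical and assume triples already aligned to a shared surface form.

```python
from collections import defaultdict

def merge_multilingual_facts(facts_by_language):
    """Score each triple by the fraction of languages that attest it."""
    support = defaultdict(set)
    for lang, facts in facts_by_language.items():
        for triple in facts:
            support[triple].add(lang)
    n = len(facts_by_language)
    return {triple: len(langs) / n for triple, langs in support.items()}

facts = {
    "en": {("Ada Lovelace", "born_in", "London")},
    "pt": {("Ada Lovelace", "born_in", "London")},  # aligned translation
    "es": {("Ada Lovelace", "wrote", "Notes")},
}
# The born_in fact is attested by 2 of 3 languages, so it scores higher
# than the fact extracted from a single language.
print(merge_multilingual_facts(facts))
```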
BRIEF REPORT | doi:10.20944/preprints202105.0046.v1
Subject: Behavioral Sciences, Applied Psychology Keywords: language mixing; code-switching; multilingual learning; bilingual schooling
Online: 5 May 2021 (12:26:32 CEST)
In bilingual communities, social interactions take place in both single- and mixed-language contexts. Some of the information shared in multilingual conversations, such as interlocutors' personal information, is often required in subsequent social encounters. In this study we explored whether autobiographical information provided in a single-language context is remembered better than in an equivalent mixed-language situation. More than 400 Basque-Spanish bilingual (pre-)teenagers were presented with new persons who introduced themselves using only Spanish, only Basque, or by inter-sententially mixing both languages. Different memory measures were collected immediately after the initial exposure to the new pieces of information (immediate recall and recognition) and on the day after (delayed recall and recognition). At no time point was the information provided in a mixed-language fashion remembered worse than that provided in a strict one-language context. Interestingly, the variability across participants in their sociodemographic and linguistic variables had a negligible impact on the effects. These results are discussed considering their social and educational implications for bilingual communities.
ARTICLE | doi:10.20944/preprints202210.0480.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: Speech Recognition; Automatic Speech Recognition; Language Identification; Wav2Vec2; Multilingual
Online: 31 October 2022 (10:06:34 CET)
This paper documents the development of a special case of a multilingual Automatic Speech Recognition model, tailored to serve the two languages spoken by the majority of Latin America: Portuguese and Spanish. The bilingual model combines Language Identification and Speech Recognition, is built on the Wav2Vec2.0 architecture, and is trained on several open and private speech datasets. In this model, the feature encoder is trained jointly for all tasks, and a separate context encoder is trained for each task. The model is evaluated separately on the two tasks: language identification and speech recognition. The results indicate that the model achieves good performance on speech recognition and average performance on language identification despite training on a small quantity of speech material. The average accuracy of the language identification module on the MLS dataset is 66.75%. The average Word Error Rate in the same scenario is 13.89%, which is better than the average of 22.58% achieved by the commercial speech recognizer developed by Google.
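A minimal sketch of the described architecture, assuming PyTorch: a shared feature encoder feeds two task-specific context encoders, one producing CTC logits over a joint Portuguese/Spanish character vocabulary and one producing language-identification logits. Layer counts, dimensions, and names are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BilingualASR(nn.Module):
    def __init__(self, feat_dim=512, vocab_size=60, n_langs=2):
        super().__init__()
        # Shared convolutional feature encoder over raw waveforms,
        # trained jointly for both tasks (Wav2Vec2-style front end).
        self.feature_encoder = nn.Sequential(
            nn.Conv1d(1, feat_dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, stride=2), nn.GELU(),
        )
        # Task-specific context encoders (stacks of Transformer layers in
        # Wav2Vec2; single layers here to keep the sketch short).
        self.asr_context = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True)
        self.lid_context = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True)
        self.ctc_head = nn.Linear(feat_dim, vocab_size)  # speech recognition
        self.lid_head = nn.Linear(feat_dim, n_langs)     # language ID

    def forward(self, waveform):  # waveform: (batch, samples)
        feats = self.feature_encoder(waveform.unsqueeze(1)).transpose(1, 2)
        ctc_logits = self.ctc_head(self.asr_context(feats))       # per frame
        lid_logits = self.lid_head(self.lid_context(feats).mean(dim=1))
        return ctc_logits, lid_logits
```

During training, a CTC loss on ctc_logits and a cross-entropy loss on lid_logits would both backpropagate through the shared feature encoder, which is what "trained jointly for all tasks" amounts to in this sketch.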
ARTICLE | doi:10.20944/preprints201907.0336.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: natural language processing; semantics; word embeddings; multilingual embeddings; translation; artificial neural networks
Online: 29 July 2019 (11:05:16 CEST)
A novel method for finding linear mappings among word embeddings for several languages, taking as pivot a shared, universal embedding space, is proposed in this paper. Previous approaches learn translation matrices between two specific languages, whereas this method learns translation matrices between a given language and a shared, universal space. The system was trained first on bilingual and later on multilingual corpora. In the bilingual case, two different training sets were used: Dinu's English-Italian benchmark data and English-Italian translation pairs extracted from the PanLex database. In the multilingual case, only the PanLex database was used. With its best setting, the system performs significantly better on English-Italian than the baseline system of Mikolov et al., and it provides performance comparable with the more sophisticated systems of Faruqui and Dyer and of Dinu et al. Exploiting the richness of the PanLex database, the proposed method makes it possible to learn linear mappings among an arbitrary number of languages.
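A sketch of the core idea, assuming NumPy: rather than one translation matrix per language pair, learn one least-squares mapping per language into a shared pivot space, so any pair of languages is connected through the pivot. The embeddings and dimensions below are random placeholders standing in for real word vectors and PanLex translation pairs.

```python
import numpy as np

def fit_mapping(X, Z):
    """Least-squares W minimizing ||XW - Z||_F (language -> pivot space)."""
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return W

rng = np.random.default_rng(0)
n_pairs, d_lang, d_pivot = 5000, 300, 300
Z = rng.standard_normal((n_pairs, d_pivot))    # shared universal vectors

mappings = {}
for lang in ("en", "it"):                      # one mapping per language
    X = rng.standard_normal((n_pairs, d_lang)) # that language's embeddings
    mappings[lang] = fit_mapping(X, Z)

# To translate en -> it, map an English vector into the pivot space and
# retrieve its nearest neighbour among Italian vectors mapped the same way.
x_en = rng.standard_normal((1, d_lang))
pivot_vec = x_en @ mappings["en"]
```

Because every language shares the same pivot, adding an (n+1)-th language requires fitting only one new matrix rather than n new pairwise ones.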