Building Carolina: Metadata for Provenance and Typology in a Corpus of Contemporary Brazilian Portuguese

Marcelo Finger; Maria Clara Paixão de Sousa; Cristiane Namiuti; Vanessa Martins do Monte; Aline Silva Costa; Felipe Ribas Serras; Mariana Lourenço Sturzeneker; Miguel de Mello Carpi; Mayara Feliciano Palma; Gabriela Lachi

doi:10.20944/preprints202501.1911.v1

Submitted: 26 January 2025

Posted: 26 January 2025

Abstract

This paper presents the challenges of building Carolina, a large open corpus of Brazilian Portuguese texts developed since 2020 using the Web-as-Corpus methodology enhanced with provenance and typology concerns (WaC-wiPT). The corpus aims at being used both as a reliable source for research in Linguistics and as an important resource for Computer Science research on language models. Above all, this endeavor aims at removing Portuguese from the set of “low-resource languages”. This paper details Carolina's construction methodology, with special attention to the issue of describing provenance and typology according to international standards, while briefly describing its relationship with other existing corpora, its current state of development and its future directions.

Keywords: Brazilian Portuguese; Corpus; WaC; Typology; Provenance; WaC-wiPT

Subject: Arts and Humanities - Humanities

Introduction

Carolina is an open corpus for Linguistics and Artificial Intelligence, with a robust and unprecedented volume of texts of varied typology in Brazilian Portuguese. The current version, Carolina 1.3 Ada, comprises 802 million tokens and 2 million texts, amounting to more than 11 GB. All the texts were originally written in Brazilian Portuguese between 1970 and 2024, and are available for free, open access download at https://sites.usp.br/corpuscarolina. Carolina was conceived and is currently being developed by a multidisciplinary research team at the Digital Humanities Virtual Lab (‘Laboratório Virtual de Humanidades Digitais’, LaViHD) as part of the Natural Language Processing of Portuguese (NLP2) Project of the Center for Artificial Intelligence (C4AI) of the University of São Paulo (USP).

C4AI-USP endeavors to produce advanced research in Artificial Intelligence in Brazil, disseminate and debate the main results, train students and professionals, and transfer technology to society. The NLP2 Project, one of C4AI’s challenges, seeks to develop systems that advance the state of the art of Natural Language Processing (henceforth NLP) for Brazilian Portuguese, targeting a new level of quality and performance compared to existing solutions. In this process, the Center aims to create opportunities for developing state-of-the-art language models and to remove Portuguese from the group of languages with “low NLP resources”. With this aim, C4AI-USP, via the NLP2 Project, is currently building several Brazilian Portuguese corpora, including CORAA, the Corpus of Annotated Audios of spoken Portuguese, and the Portinari annotated corpus of Portuguese. Carolina is C4AI’s “mother ship” corpus and will in the future incorporate CORAA’s audio transcripts, Portinari’s raw unlabelled texts, and other corpora.

The corpus team, composed of computer science, linguistics and philology researchers, has worked to develop a methodology for building corpora that can be used in a variety of ways and that comply with rigorous data control criteria regarding origin/provenance and typology, a fundamental requirement not only for computer science research but also for linguistic research, among others. We aim at building a robust resource with state-of-the-art features both for research in the field of Artificial Intelligence and in the field of Linguistics, focusing on the importance of provenance and a rich typology of information as fundamental assets in modern data availability.

Carolina was named in honor of Carolina Michaëlis de Vasconcelos (1851-1925), a German philologist and linguist based in Portugal, and the first woman appointed as a professor at the Faculty of Letters of the University of Lisbon, in 1911¹. This tribute symbolizes the aims of our team: to advance knowledge of the Portuguese language and its history, and to encourage scientific research by women².

In this paper we present the challenges of building Carolina. Section 1 presents the foundations on which this construction was based. Section 2 shows how and why to develop a new methodology to build a giant corpus, highlighting the problems involved in the “Web as Corpus” idea. Section 3 presents the current stage of the project, and section 4 concludes the paper with some final considerations and the indication of future steps.

1. Fundamentals and Related Works

1.1. Fundamentals

Since long before the emergence of the digital world, humanity has developed means and techniques to meet the need to organize, locate and retrieve documents and information. Knowing the source of a text is directly related to the trust placed in its content. Thus, the provenance and typology of documents are essential information both for research in the Humanities and for data reliability in Computer Science, especially in the construction of large collections of documents that serve to store knowledge in a recoverable, searchable and accessible way.

Linguistics research has greatly benefited from digital technology, given that automation in the processing of large volumes of data strongly supports formulating hypotheses about grammars. In addition, the reliability of linguistic studies has been enhanced by the development of scientific techniques and methods for annotation and control of data from the sources of a natural language corpus. Paixão de Sousa (2014) grounds linguistic studies based on electronic corpora in a global approach to the text, in conceptual and technological terms, which is reflected in the interaction of different levels of analysis. Based on this global approach, Carolina has the potential to contribute to the development of research on Brazilian Portuguese, since it is being built aiming at reliability guarantees, assured by the provenance control provided by structured metadata.

According to Santos and Namiuti (2019), scientific metadata control, such as information about the provenance and typology of documents, is essential both for research in the Humanities and for data reliability in Computer Science; in addition, it serves other areas, such as History and Social Memory. To this end, the authors advocate a structured metadata apparatus (AME, ‘Aparato de metadados estruturados’) as a solution for reliability.

In Computer Science, NLP research has been dominated in the past years by a succession of language models, that is, a set of machine-learning architectures, mostly based on neural networks. Starting from sequence-to-sequence encoder-decoder (Kalchbrenner; Blunsom, 2013) configurations, it incorporated neural attention (Bahdanau et al., 2014), leading to the attention-only Transformer architecture (Vaswani et al., 2017). Different ways of assembling and training transformers have led to a multitude of very successful language models, such as BERT (Devlin et al., 2019) and its derivatives (Liu et al., 2019; Sanh et al., 2019; Lan et al., 2019) for text classification, GPT (Radford et al., 2018) for text generation, and T5 (Raffel et al., 2019) for machine translation. In such a rapidly changing environment, in which today’s best model is condemned to short-term obsolescence, one must put forward a language model training pipeline, to be ready to generate “the next” proposed model. And the fundamental raw material for this production pipeline is a large, open and reliable general corpus such as ours.

Carolina was conceived within the web-as-corpus view (Baroni et al., 2009), extended with provenance and typology information, which we call the WaC-wiPT view³. The web-as-corpus (WaC) view of corpus building (Fletcher, 2007) has been dominant in recent developments in linguistic resource building, but future applications may require a more cautious approach to data collecting. Data provenance refers to the process of tracing and recording the origins of data and, thus, its movement. It allows one to answer questions such as “where did this piece of text come from?”, and “is it a part or the whole of one document?”. Therefore, if a future application reveals that a corpus may carry some open or hidden biases, provenance is the mechanism that allows us to trace back the origins of the bias. It is also important to know if the data was transferred in total, or in part, and the size of the part. Data provenance provides the audit trail of the data and thus it is a source of reliability on data and on applications derived from it. Additionally, this work understands typology in a broad sense, as free from theoretical commitment as possible, and as a crucial methodological tool in the development of such a large collection of texts, organizing the search, selection and balancing of texts, as will be shown in the Methodology section.

Providing an open, large and diverse corpus for Brazilian Portuguese, with provenance and typology information has the potential of directly impacting research both on Linguistics and Computer Science. This is the intended goal of this work. We hope that provenance and typology information will be helpful to researchers. Control of information regarding digital documents produced or posted on the Web is necessary to meet the potential uses of a large collection of documents. This control also makes it possible to cater for a very wide range of research areas of interest, such as Social Memory and History, as well as Linguistics and Computing.

1.2. Related Works

Given the widespread availability of online content in the last decades, many researchers turned to the Web as their main source for corpus building. Examples of corpora that were built using the Web as a source are the Terabyte corpus (Clarke et al., 2002) (53 billion tokens) and the Web Text Corpus (Liu; Curran, 2006) (10 billion tokens), both built using web-crawlers. The Terabyte Corpus targets the English language and is formed by HTML content obtained in mid-2001 from a set of URLs of the main sites of 2,392 universities and other educational organizations. The Web Text Corpus is also an English-language corpus composed of a collection of texts on various subjects. Unlike most corpora created for NLP use, this corpus employs a linguistic search process instead of the traditional use of web search engines, which are based on scores. The general objective of the corpus was to measure the accuracy of NLP-learning software using it in comparison to training using other corpora (Liu; Curran, 2006).

The TenTen Corpus Family is an initiative by Sketch Engine⁴ for building Web corpora for all major languages in the world (Jakubíček et al., 2013). The corpus family was also created using web-crawling techniques and presents texts validated under exclusively linguistic criteria, using specialized technology for this purpose. The very name of the corpus family (TenTen, 10¹⁰) indicates the large size of each corpus that composes it, starting from a minimum size of 10 billion words for each language⁵.

In this context, several large corpora were built adopting the WaCky (Web-As-Corpus Kool Yinitiative) methodology, following the emergence of the first WaCky corpora: the ukWaC, deWaC, itWaC (Bernardini et al., 2006; Baroni et al., 2009), and the frWaC (Ferraresi et al., 2010), which targeted English, German, Italian, and French respectively and include more than 1 billion words each. This methodology comprises four steps which are: identification of different sets of seed URLs, post-crawl cleaning, removal of duplicate content, and annotation. One of the corpora built following this framework is the Brazilian Portuguese Web as Corpus (brWaC), of great relevance as it was already considered the “biggest Brazilian Portuguese corpus available” during its construction (Boos et al., 2014)⁶.

Regarding other existing Portuguese-language corpora, the virtual organization Linguateca (Santos, 2000) stands out as a center for resources focused on the computational processing of this language. Its objective was to contribute to the development of new computational and linguistic resources, facilitating the access of new researchers to existing tools. Of the corpora available at Linguateca that specifically target Brazilian Portuguese, the ones that stand out as the most significant in size are: Brazilian Corpus (Sardinha et al., 2010), Lácio-Web (Aluísio et al., 2003), and Corpus do Português, subcorpora NOW and Web/Dialects (Davies; Ferreira, 2016; Davies; Ferreira, 2018).

The Brazilian Corpus has approximately one billion Brazilian Portuguese words, syntactically annotated with the parser PALAVRAS (Bick, 2000).⁷ Lácio-Web was developed by USP as a project whose objective is to make fully and freely available its linguistic and computational tools as well as several corpora of contemporary Brazilian Portuguese. This set of corpora prioritizes whole-content texts and a variety of genres, text typologies, domains of knowledge, and means of distribution (Pinheiro; Aluísio, 2003; Aluísio et al., 2004).⁸ Corpus do Português was developed by Brigham Young University and Georgetown University. The NOW (News on the Web) subcorpus has approximately 1.1 billion words of four different Portuguese varieties (Angola, Brazil, Mozambique and Portugal), gathered from daily searches for articles in magazines and newspapers through Google News between 2012 and 2019. It is not possible to easily retrieve the source and copyright information of the texts, nor to know how much of the data refers specifically to Brazilian Portuguese. The subcorpus Web/Dialects, in turn, has approximately one billion words of the same four Portuguese varieties, of which 656 million words are in Brazilian Portuguese, mainly extracted from Blog-type sites (Davies; Ferreira, 2016)⁹.

2. Building a Methodology

The construction of a billion-token corpus requires a considerable amount of preparation and coordination. First of all, we had to define the metadata scheme, which was adjusted after the initial surveys and tests. The goals of the corpus must remain clear at all times, and a mechanism for tracing sources, completion levels and data balancing must be followed diligently. Such an endeavor required three important stages: a detailed analysis of existing resources, the development of a methodological framework in line with our goals, and the development of techniques for post-processing. The main methodological decisions in this process are described in 2.1 and 2.2 below, with special attention to aspects related to Provenance and Typology, and the processing stages are presented briefly in 2.3 and 2.4 further on.

2.1. The Issue of Provenance

Initially, significant effort was made to analyze the pre-existing resources for natural language processing in Brazilian Portuguese, with the aim of supporting the development of our methodology and exploring the possibility of incorporating some of these resources into Carolina. That enabled us to assess the benefits and drawbacks of their methodologies, as well as which niches of contemporary text were already corpus-indexed, and which were still fertile sources for us. In doing so, we decided on a web-based corpus, but against the adoption of the WaCky framework.

Although the WaCky method claims to facilitate an automatic balancing of content without bias, and although the brWaC achieved a reduction of limited-relevance and duplicate content in comparison to other WaCky corpora (Wagner et al., 2018), the methodology presents some drawbacks. As its own creators acknowledge, the automated methods allow for limited control over the contents that end up in the final corpus, which therefore need to be investigated post hoc (Baroni et al., 2009). For example, the brWaC researchers only provide the annotated categories of the 100 most frequent websites (Wagner et al., 2018), and, unlike for the other WaCky corpora mentioned in the previous section, the lists of bigrams, seeds and total URLs used in the construction of the brWaC are not easily accessible.

These challenges concerning content quality, provenance tracking and rights of use are central to Carolina’s goals, and our methodology was developed around avoiding such problems. In line with Paixão de Sousa (2014) and Santos and Namiuti (2019), we hold that scientific control of the memory processes involved in building a corpus and of the memory of texts, together with the definition of the set of essential metadata needed to control provenance and guarantee the documents’ reliability, is essential for research in the Humanities and in Computer Science. Furthermore, as we intend to openly distribute the contents of the Corpus online, under terms akin to the Apache license and similarly permissive ones, assuring data provenance beforehand is crucial to determine the original terms of use of the content.

For instance, Davies and Ferreira (2018) recognize that the texts used in the Brazilian Corpus may be copyrighted and, for this reason, they rely on the American Fair Use Law, which states that texts under copyright can be used freely as long as their format is remodeled and there is no foreseeable impact on their potential market for their legal holders (Stim, 2016a; Stim, 2016b). Thus, to avoid copyright-related problems in the Brazilian Corpus, 10 out of every 200 words of text were replaced by “@”, which amounts to excluding 5% of the original text. The authors claim that, as the words were removed regardless of context, all words would be affected equally, so the frequency and usage counts would not be affected and the corpus would still be suitable for linguistic studies¹⁰.
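The arithmetic of the masking scheme described above can be illustrated with a minimal Python sketch. The function name and the choice of which ten positions in each block are masked (here, the last ten) are assumptions made only for illustration; the source states only that 10 of every 200 words are replaced by “@”.

```python
def mask_every_block(text: str, block: int = 200, masked: int = 10,
                     symbol: str = "@") -> str:
    """Replace `masked` words in every run of `block` words with `symbol`.

    Assumption: the last `masked` positions of each block are the ones
    replaced; the actual positions used by the corpus are not specified here.
    """
    words = text.split()
    out = []
    for i, word in enumerate(words):
        # Position inside the current block decides whether the word is masked.
        out.append(symbol if i % block >= block - masked else word)
    return " ".join(out)


# Toy input: 400 distinct pseudo-words, i.e. exactly two blocks of 200.
sample = " ".join(f"w{i}" for i in range(400))
masked_text = mask_every_block(sample)
masked_ratio = masked_text.split().count("@") / 400
print(masked_ratio)
```

With two full blocks, 20 of the 400 words are masked, which matches the 5% figure given in the text.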

However, depending on local legislation and on the purpose for the collected material, corpora might be violating the law even if publishing only fragmented or highly processed versions of the texts when copyrighted content is included. In addition, when crawling based on seed URLs and search engines, there is no control over the copyright nature of texts. According to Cardoso (2007), Brazilian law acknowledges copyright establishment at the exact moment an intellectual work is created, without the need for further legal requests or paperwork. Therefore, being published and openly accessible online is no waiver of copyright limitations. For this reason, while building Carolina we avoided collecting random samples from the web to ensure both the openness of the information crawled and compliance with Brazil’s recently enforced personal data protection law, LGPD: “Lei Geral de Proteção de Dados”¹¹.

As for the possibility of incorporating pre-existing corpora into Carolina, there are some obstacles to be considered. Firstly, many corpora listed at Linguateca were discarded from our list for not fitting into our date range or for having content that may have reproduction and distribution restrictions due to possible copyright attributed to the texts or to the corpus itself. Many corpora that took copyright limitations into consideration chose to work solely with fragments or excerpts of the original texts, choosing greater ease of access to the text at the expense of the completeness of its content, as is the case of Corpus do Português. In the construction of the Corpus, we prioritize the use of integral or minimally modified texts, as we understand that a fragmentary nature of the content can be detrimental to inter-phrase or inter-text associations in both linguistic studies and software development for Natural Language Processing, in which the most recent alternatives, such as attention-based algorithms, require the processing of the sentences of a text in their entirety (Devlin et al., 2019). Another obstacle to corpus incorporation is that a large part of the datasets and smaller corpora gathered at Linguateca use the European variety of the Portuguese language, or more than one variety. This constitutes a limitation to their incorporation, considering that Carolina intends to be an open corpus of contemporary Brazilian Portuguese.

Thus, we concluded that it would be more productive not to include the large existing corpora of Brazilian Portuguese in Carolina, but rather use them as theoretical guidance and control parameters for the development of a new methodology based on provenance, typology and free distribution of the texts: the WaC-wiPT.

However, since Carolina is C4AI’s “mother-ship” corpus, we intend to incorporate some of the smaller corpora in the future. We are interested mainly in those corpora whose content is unique or not easily independently recoverable, such as corpora of transcribed spontaneous speech, like the corpora developed by Project TaRSila, at C4AI, already described in Santos et al. (2022). We believe that these unique-content corpora will be important sources to guarantee a greater representation of dialectal and typological varieties in the Corpus.

2.2. Defining Typologies

Having determined the objectives and philosophy for the construction of Carolina, since we want to build it aiming for reliability guarantees, ensured by the provenance control provided by structured metadata, we focused on conducting surveys by broad typologies, defined by us as a way to gather several related web domains that had similar content. After defining a typology methodology, we started the downloading step, followed by a pre-processing phase and, finally, we proposed the categorization of metadata headers and the metadata scheme.

The surveys started from a broad typology (Figure 1), divided into seven types, as detailed below. The seven broad types first defined were categories that allowed us to group all the domains researched up to that point: judicial branch of government; legislative branch of government; datasets and other corpora; public domain works; wikis; university domains; and journalistic texts. As we chose the sources to be surveyed for the broad typology, we gave priority to those with open data and a large volume of files, since the process of requesting rights of use would only take place in later stages of the project. Therefore, the sources that had copyright-protected data (for instance, the journalistic texts) were not prioritized at first (Crespo et al., 2022).

The surveys consist of in-depth research into each broad type chosen for the construction of the corpus and an investigation of the web domains that comprise them. Thus, we surveyed information about the license of the texts and the basic directory structure of the investigated sites, as well as authorship, date, and other information that we deemed relevant for each broad type. All the data collected was of great importance to the download process and has been essential for the insertion of the predefined metadata and their revision. The surveys are therefore continuously ongoing, as research must be conducted or supplemented for each new web domain we wish to incorporate into Carolina.

In addition, given that throughout the surveying process we came across various types – often within a single web domain –, we defined a narrow typology, formed by subdivisions of the broad typology that take into account the structural similarity of the extracted files. Therefore, we created the distinction between broad and narrow typology: the former is an initial web-domain grouping by similar content, and the latter, a more detailed label for the types of texts found in each section or file of a surveyed web domain. Narrow typology was also included in most surveys and is a metadata category.
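The relation between the two levels can be pictured as a small data structure. The sketch below is only illustrative: the seven broad types are those listed above, while the class names, the example domain and the narrow label "court ruling" are hypothetical and do not reproduce Carolina's actual label inventory.

```python
from dataclasses import dataclass
from enum import Enum


class BroadType(Enum):
    """The seven broad types first defined in the surveys."""
    JUDICIAL = "judicial branch of government"
    LEGISLATIVE = "legislative branch of government"
    DATASETS = "datasets and other corpora"
    PUBLIC_DOMAIN = "public domain works"
    WIKIS = "wikis"
    UNIVERSITY = "university domains"
    JOURNALISTIC = "journalistic texts"


@dataclass(frozen=True)
class TextLabel:
    domain: str        # web domain the text was collected from
    broad: BroadType   # initial grouping of web domains by similar content
    narrow: str        # finer label for a section or file (hypothetical value)


# Hypothetical example: one section of a judicial web domain.
label = TextLabel("tribunal.example.org", BroadType.JUDICIAL, "court ruling")
print(label.broad.value, "/", label.narrow)
```

Keeping the narrow label as free text nested under a closed set of broad types mirrors the idea that narrow types are subdivisions discovered during the surveys, often within a single web domain.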

2.3. The Downloading and Preprocessing Stages

After the initial survey by broad typology, the downloading and preprocessing stage begins to take place. It is important to note that those procedures, which could be called the ‘final’ stage of the corpus construction, rely heavily on our principles of Provenance and Typology. As discussed in section 2.1, text provenance is the baseline criterion for a text to be selected for the corpus; and as shown in section 2.2, the broad typologies of the texts are the guidelines over which the building process begins. The downloading and preprocessing stages described here were the basis for the production of versions Carolina 1.0 Ada (2021), Carolina 1.1 Ada (2022), Carolina 1.2 Ada (2023) and Carolina 1.3 Ada (2024). Carolina 2.0 Bea is being prepared for publication in March 2025.

The files were mostly obtained through Wget, the chosen software for this process. As the Raw corpus¹² (which precedes text preprocessing) aims at safekeeping the entirety of the selected web domains, thus avoiding any future problems in case they are partially or completely removed from the Internet, the mirror command was used. This command crawls entire web pages, with infinite recursivity inside a web domain, creating by default a mirror of its directory structure, complying with our intention of archiving a copy of most of the sources used in the corpus.

The detailed inspection of each type in the broad typology facilitated the process of downloading the files. Accordingly, in some cases, pages whose content was irrelevant or outside the proposed frame were ignored, such as the public domain works¹³ published prior to 1970. In those cases, the files were assessed one by one and downloaded with Wget by feeding it a URL list in a TXT file.
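The two download modes described above, mirroring a whole web domain and fetching a curated list of URLs, correspond to GNU Wget's `--mirror` and `--input-file` options. The Python sketch below only assembles the command lines; the helper names, the example URL and the directory and file names are hypothetical, and the project may well have used additional options not shown here.

```python
import shlex
from typing import List


def build_mirror_command(url: str, dest_dir: str) -> List[str]:
    """Command line that mirrors a whole web domain into dest_dir."""
    # --mirror turns on recursion with infinite depth and timestamping.
    return ["wget", "--mirror", f"--directory-prefix={dest_dir}", url]


def build_list_command(url_list_file: str, dest_dir: str) -> List[str]:
    """Command line that downloads only the URLs listed in a text file."""
    return ["wget", f"--input-file={url_list_file}",
            f"--directory-prefix={dest_dir}"]


# Hypothetical values, for illustration only.
mirror_cmd = build_mirror_command("https://example.org/", "raw/example.org")
list_cmd = build_list_command("selected_urls.txt", "raw/public_domain")

print(shlex.join(mirror_cmd))
print(shlex.join(list_cmd))
# To actually run a command: subprocess.run(mirror_cmd, check=True)
```

Building the argument list separately from running it makes it easy to log the exact command used for each web domain, which is itself a small piece of provenance information.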

The filtering of texts is part of the process of building each corpus version and is based on the surveys of each type within the broad typology. Care was taken to exclude anything outside the proposed time span (1970–present) and pages with little or no textual content. As these previous inspections enable a closer understanding of the structure of the surveyed websites, the desired sections can be easily tracked and selected for preprocessing among the downloaded files. In the preprocessing stage, we extract the text from the Raw corpus and, after that, the metadata insertion process takes place.
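The two filtering criteria named above, time span and amount of textual content, can be sketched as a single predicate. This is a minimal illustration, not the project's filter: the function name, the 20-token threshold and the decision to set aside texts of unknown date are all assumptions.

```python
from typing import Optional


def keep_text(text: str, year: Optional[int],
              min_year: int = 1970, min_tokens: int = 20) -> bool:
    """Return True if a text passes both filtering criteria.

    Assumptions: texts with an unknown year are not kept, and "little or
    no textual content" is approximated by a whitespace-token threshold.
    """
    if year is None or year < min_year:
        return False
    return len(text.split()) >= min_tokens


long_text = "palavra " * 30
print(keep_text(long_text, 1995))   # inside the time span, enough text
print(keep_text(long_text, 1950))   # outside the time span
print(keep_text("curto", 2001))     # too little textual content
```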

That methodology was also relevant when the mirror command did not retrieve all the targeted files of a website. As the initial survey allowed for the learning of which pages or directories were desired for the corpus, other download methods had to be employed in the cases where they were not automatically crawled by the mirror command. This difficulty was present especially in the Brazilian Federal Government public websites, which required alternatives to obtain their content, and many resources were used for that. For different sections of the Brazilian Supreme Court (STF) website¹⁴, for example, we built tools to generate URLs based on file naming patterns, to extract URLs and save pages with an HTML parser, and to access and click links recursively, using the Python¹⁵ library BeautifulSoup and the Selenium WebDriver. In addition to that, a large volume of judiciary documents was kindly provided by Jackson José de Souza, crawled with a tool he developed using Scrapy.¹⁶
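Two of the fallback techniques mentioned above, generating URLs from file naming patterns and extracting links from a saved page, can be sketched as follows. The sketch uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup so that it has no external dependencies; the URL template, the function names and the sample HTML are hypothetical and are not taken from the STF website.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


def urls_from_pattern(template: str, start: int, stop: int) -> List[str]:
    """Generate URLs by filling a numeric slot {n} in a naming pattern."""
    return [template.format(n=n) for n in range(start, stop + 1)]


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html: str, base_url: str) -> List[str]:
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical pattern and page content, for illustration only.
generated = urls_from_pattern("https://example.org/docs/doc_{n:04d}.pdf", 1, 3)
found = extract_links('<a href="/a.html">A</a><a>no link</a>',
                      "https://example.org/dir/")
print(generated)
print(found)
```

Pages that only expose their links after user interaction would still require a browser automation tool such as the Selenium WebDriver mentioned in the text; that part is not covered by this sketch.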

2.4. Defining Metadata

The following stage after preprocessing is metadata insertion. The conception and development of appropriate Metadata categories has been a core task in building Carolina. The identification of basic metadata for the objectives of the Corpus was guided by the classification of information into two broad categories. The first category groups objective information contained in the source document of the text, not having been generated from any type of analysis. Following Santos and Namiuti (2019), we name this category “Dossier”. The second category, which we name “Carolina”, includes processing information and information generated from the analysis of the text contained in the source document. From these two categories, eight information groups were identified: Primary Identification, Authorship, Dating, Location, Size, Acquisition, Licenses and Typology.

Table 1 lists each piece of metadata identified as necessary for the Corpus text header. The first column shows the information category (“Dossier” or “Carolina”); the second column identifies the information group within each category; the third column specifies the item of metadata within each group. The last column specifies the cardinality of each metadata item, that is, the minimum and maximum (or the exact) number of occurrences of that item in each file: a minimum cardinality of “zero” indicates that the item is optional, while a minimum cardinality of “one” (as in “one” or “one or more”) indicates that it is mandatory.
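Cardinality constraints of this kind are straightforward to check mechanically. The sketch below is a minimal illustration under stated assumptions: the three item names and their cardinalities are hypothetical placeholders, not the contents of Table 1, and `None` is used to represent an unbounded maximum ("or more").

```python
from typing import Dict, List, Optional, Tuple

# (minimum, maximum) occurrences; None means "no upper bound".
Cardinality = Tuple[int, Optional[int]]

# Hypothetical items and cardinalities, for illustration only.
SCHEMA: Dict[str, Cardinality] = {
    "title": (1, 1),       # exactly one: mandatory
    "author": (0, None),   # zero or more: optional
    "license": (1, None),  # one or more: mandatory
}


def check_header(header: Dict[str, List[str]],
                 schema: Dict[str, Cardinality] = SCHEMA) -> List[str]:
    """Return a list of cardinality violations (empty if the header is valid)."""
    problems = []
    for item, (lowest, highest) in schema.items():
        count = len(header.get(item, []))
        if count < lowest:
            problems.append(f"{item}: expected at least {lowest}, found {count}")
        elif highest is not None and count > highest:
            problems.append(f"{item}: expected at most {highest}, found {count}")
    return problems


print(check_header({"title": ["A"], "license": ["CC0"]}))
print(check_header({"title": ["A", "B"]}))
```

In this representation an item is mandatory exactly when its minimum is at least one, regardless of whether its maximum is bounded.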

The texts in Carolina are represented as TEI Documents, encoded in XML in accordance with the specifications “TEI P5: TEI Guidelines for Electronic Text Encoding and Interchange”, developed and maintained by the Text Encoding Initiative Consortium (TEI Consortium et al., 2021). A single XML file encodes several texts of the Corpus, with a hierarchy of elements that can be validated against a schema derived from the TEI Guidelines, aiming to ensure greater interoperability (TEI Consortium, 2024).

Each text included in the Corpus contains a <TEI> element, which includes the descendant node <teiHeader>, mandatory in a TEI-conformant document. The metadata items listed in Table 1 are encoded in the <teiHeader> element of each text. Figure 2 presents the general hierarchical structure of the XML header of each individual text. Carolina’s text header was structured based on the AnnoTEI Schema, proposed by Costa (2024), which recommends encoding each item of metadata classified in the “Dossier” category within the <sourceDesc> (Source Description) element. Given the importance of the origin of the texts for the Carolina project, even though they are native digital documents, the data referring to the source document or file are encoded in <sourceDesc>. The <fileDesc> element (File Description) is a mandatory element of the <teiHeader> and is designed for encoding the file description. Since <fileDesc> contains <sourceDesc>, the AnnoTEI Schema defines that <sourceDesc> holds a complete bibliographic description of the source document, while the remaining metadata items, which include information about the distribution and handling of the corpus XML file, are inserted into other child elements of <fileDesc> that are not nested in <sourceDesc>.

Building on this, the final schema for the texts in the Corpus was defined observing the specificity of the project. Carolina’s <teiHeader> also contains two elements defined as optional by the TEI Guidelines: <encodingDesc> (Encoding Description) and <profileDesc> (Text-Profile Description). The <encodingDesc> element includes information about the encoding of the text. Finally, <profileDesc> contains the text classification according to the typology established by the project team. The decisions about which elements to use and where to place them were based on the objectives of the Corpus, creating a schema in accordance with the “Corpus” customization provided by the TEI.
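The overall shape of such a header can be sketched with Python's standard `xml.etree.ElementTree` module. This is a simplified skeleton under stated assumptions, not Carolina's actual schema: the element names are standard TEI P5, but the function name, the sample values, the use of <bibl> inside <sourceDesc> and the use of <textClass>/<keywords>/<term> for the typology label are illustrative choices, and the <text> element required in a complete TEI document is omitted.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def tei(tag: str) -> str:
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_header(title: str, source_url: str, typology: str) -> ET.Element:
    root = ET.Element(tei("TEI"))
    header = ET.SubElement(root, tei("teiHeader"))

    # <fileDesc>: mandatory; holds titleStmt, publicationStmt and sourceDesc.
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    pub_stmt = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(pub_stmt, tei("p")).text = "Distribution information goes here."

    # <sourceDesc>: "Dossier" metadata about the source document (simplified).
    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    bibl = ET.SubElement(source_desc, tei("bibl"))
    ET.SubElement(bibl, tei("ref"), {"target": source_url})

    # Optional header components used by the Corpus.
    encoding_desc = ET.SubElement(header, tei("encodingDesc"))
    ET.SubElement(encoding_desc, tei("p")).text = "Encoding information goes here."
    profile_desc = ET.SubElement(header, tei("profileDesc"))
    text_class = ET.SubElement(profile_desc, tei("textClass"))
    keywords = ET.SubElement(text_class, tei("keywords"))
    ET.SubElement(keywords, tei("term")).text = typology  # hypothetical label
    return root


# Hypothetical values, for illustration only.
doc = build_header("Sample text", "https://example.org/page.html", "wikis")
header_children = [child.tag.split("}")[1] for child in doc[0]]
print(header_children)
print(ET.tostring(doc, encoding="unicode"))
```

The skeleton makes the nesting explicit: source-related metadata sits in <sourceDesc> inside <fileDesc>, while encoding and classification information live in sibling elements of <fileDesc> within the header.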

Since TEI P5 is a modular and flexible system, whose infrastructure enables users to create a specific encoding schema appropriate to their needs without compromising data interoperability, a customized schema was defined for the Carolina Corpus. The final schema meets the conformance requirements outlined by the TEI standard, ensuring that documents validated by it are “TEI-Conformant”. The customized schema follows the TEI Abstract Model and is generated from an ODD (One Document Does It All) file, as recommended by the guidelines. To achieve greater interoperability, the customization is a subset of the “TEI-All” schema, which makes it a “clean modification”, according to the guidelines (TEI Consortium, 2024).

3. Current State

Since 2022, four versions of Carolina Ada have been published, each with a few updates or corrections relative to the previous one. Table 2 below shows the release date and size of each version; more information about all of them can be found on the Corpus’ webpage (https://sites.usp.br/corpuscarolina).

The current version of the Corpus (Carolina 1.3 Ada), published in October 2024, is organized by the types in the broad typology established up to the present, plus an additional Social Media typology, and it shows the following numbers (Table 3).

The information presented in Table 3 concerns the XML corpus, the final stage of the Carolina 1.3 Ada version, which contains texts extracted from open source web domains, balanced data, and their respective encoded metadata. The texts in the XML version have already been pre-processed, filtered, and deduplicated. It is also worth noting that some websites, or even entire broad types (such as the journalistic texts), require the explicit authorization of their copyright owners to be made available; since the rights of use are still being requested, these texts are not counted in the numbers.

Work on Carolina is ongoing, and a new version with its own search interface is being prepared: Carolina 2.0 Bea will be released in March 2025.

4. Conclusion and Future Steps

Carolina has an important distinguishing feature: it is conceived with an original methodology developed by the LaViHD-C4AI team, which we call WaC-wiPT (Web as Corpus with Provenance and Typology information). We consider provenance to be a crucial aspect to strive for in web-based corpora, alongside typology and balance management. Apart from facilitating copyright compliance and typology labeling, it allows one to answer questions about the origin of texts and increases the scope of uses for the corpus.

As shown in our state-of-the-art revisit (a non-exhaustive list of openly available Brazilian Portuguese corpora and other relevant web-based corpora), many recent corpora were built adopting the WaCky methodology. Because this methodology does not envision provenance as we defend here for Carolina, and because most guidelines from other corpora emphasize only “corpus balance”, for which typology serves just as one criterion, most of these corpora were not incorporated into the Corpus; instead, they had an important role in the conception of our own methodology.

Therefore, in line with the provenance proposition, the LaViHD team at C4AI, as part of its NLP2 challenge, has built a large corpus with a robust and unprecedented volume of texts of various typologies in Brazilian Portuguese. In this paper we presented the current state and the next steps of the Corpus construction, defending the importance of provenance and of a detailed typology scheme as fundamental assets in modern data availability. As related products developed during the construction of Carolina, we presented the WaC-wiPT methodology, based on provenance and typology, aiming to make as much data as possible openly available online (in its “beta version”). This also includes the building of metadata to describe provenance and typology, forming the first version of Carolina’s header scheme, using “TEI P5” in accordance with the reuse principle. Additionally, the Raw Corpus was created, currently totaling 1,779 GB and 124,084,164,722 tokens.

As Carolina approaches its fifth anniversary, we want to remain aware of its limitations as well as its progress. In this regard, one of the main challenges at the current phase is balancing the Corpus in terms of typologies; in particular, as mentioned before, the sources of texts with copyright-protected data (for instance, the journalistic texts) were not prioritized at this first stage. We are aware that this limitation must be overcome, and this is part of our goals for the next versions. Interestingly, this problem stems from our principle of guaranteed Provenance; but rather than compromise on this fundamental aspect, we opted to wait until we can obtain the correct licences that would allow us to offer quality, whole texts free of copyright liability.

Finally, another important challenge in the current phase of development is the availability of a more user-friendly interface, in particular bearing in mind users outside the realm of Computer Science. In its current state, Carolina is fully available through the main website, leading to dedicated platforms which allow bulk download¹⁸; in the near future, we will make it available through a searchable interface, which will complement the possibility of downloading the whole Corpus. A prototype of this new visual interface will accompany Carolina 2.0 Bea, to be released on March 8th, 2025.

Funding

IBM Corporation. São Paulo Research Foundation (FAPESP), grants 2019/07665-4, 2020/06443-5 (SPIRA), 2014/12236-1 (Animals). Coordination for the Improvement of Higher Education Personnel (CAPES), Finance Code 001. National Council for Scientific and Technological Development (CNPq), 303609/2018-4 (PQ). Bahia Research Foundation (FAPESB), grants 0007/2016, 0014/2016.

Acknowledgments

This work was carried out at the Center for Artificial Intelligence (C4AI-USP), with support by the São Paulo Research Foundation (FAPESP grant #2019/07665-4) and by the IBM Corporation. This work was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior -- Brasil (CAPES) -- Finance Code 001. Marcelo Finger received partial support from FAPESP 2020/06443-5 (SPIRA), 2014/12236-1 (Animals) and CNPq 303609/2018-4 (PQ), Cristiane Namiuti received partial support from Bahia Research Foundation (FAPESB 0007/2016, 0014/2016), and Maria Clara Paixão de Sousa and Vanessa Martins do Monte received partial support from the São Paulo Research Foundation (FAPESP grant #2021/15133-2). We would like to thank the researchers who were involved in the earlier phases of Carolina but are no longer part of the project today: Maria Clara Ramos Morales Crespo, Maria Lina de Souza Jeannine Rocha, Guilherme Lamartine de Mello, Raquel de Paula Guets, and Renata Morais Mesquita. Their contribution was essential to getting us to where we are now.

Conflicts of Interest

Not applicable.

Pre-Registration and Equator Network Research Standards

Not applicable.

Availability of Data and Material

This research is conducted as an Open Access Project.

Code Availability

This research is conducted as an Open Source Project; the code will be openly available.

Notes

¹: Carolina Michaëlis de Vasconcelos holds the distinction of being the first woman appointed as a university professor in Portugal, in June 1911, at the Faculty of Letters of Lisbon, although she never taught there. Preferring to remain in Porto, she requested and obtained a transfer to the Faculty of Letters of Coimbra, where she engaged in intensive teaching activity, leading the courses in Romance and Portuguese Philology (Sales, 2025).
²: After naming the corpus, it came to our knowledge that Carolina's father, Dr. Gustav Michaëlis, was a mathematician, which brings an unexpected and felicitous relation to our work.
³: A summarized version of the methodology developed can be found in Sturzeneker et al. (2022).
⁴: https://www.sketchengine.eu/
⁵: The TenTen Corpus Family is available for consultation in more than 40 languages, including a corpus of 4 billion tokens for Portuguese (ptTenTen), which covers the European and Brazilian variants. However, none of these corpora is openly available; they are accessible only through the Sketch Engine platform (Jakubíček et al., 2013; Wagner et al., 2018).
⁶: Published in 2017, it contains 2.68 billion tokens that were crawled from the Web in 24 hours, by initially employing queries to a search engine with random pairs of content words, according to the description of its development in Wagner et al. (2018). Its importance for advances in Brazilian Portuguese research in multiple areas is illustrated by its employment in NLP model training, as substantiated in Souza et al. (2020).
⁷: It was developed by the Applied Linguistics and Language Studies Program of the Pontifical Catholic University of São Paulo (LAEL/PUC-SP) and its version 6.0 of February 2, 2020 is available for online searches at Linguateca. The full corpus can be downloaded upon approval of a requisition form, as long as the user agrees not to distribute or use it for profit purposes. Although the primary sources of the texts are not explicit, there are several works on the construction of the Brazilian Corpus that describe the set of typologies and textual genres that compose it (Sardinha et al., 2010; Vianna; de Oliveira, 2010; de Oliveira; Dias, 2009; de Brito et al., 2007).
⁸: The Lácio-Web project comprises six corpora, four of which are currently available online (Lácio-Ref, Mac-Morpho, Par-C, and Comp-C) and whose content is described on its website: http://143.107.183.175:22180/lacioweb/descricao.html.
⁹: In both corpora, the texts were processed after the download for boilerplate removal, duplicates detection, lemmatization and tagging. The complete Web/Dialects corpus is available for purchase, with different selling prices depending on the final license chosen by the user (academic or commercial) at https://www.corpusdata.org/purchase.asp. The NOW Corpus can be accessed free of charge for online searches on its website (https://www.corpusdoportugues.org/now/), but it cannot be downloaded in full.
¹⁰: https://www.corpusdata.org/limitations.asp
¹¹: Available in Portuguese at http://www.planalto.gov.br/ccivil_03/_ato2015-2018/2018/lei/l13709.htm.
¹²: Raw corpus (‘Corpus Cru’) is a concept created to describe a product derived from the Lapelinc method for building corpora (Namiuti; Santos, 2021), which consists of an unpublicized version of the corpus that holds more information about itself, serving as a mirror of the original sources of the corpus. This notion has been adapted to the methodology used to build Carolina.
¹³: For the time being, all of the files of this broad typology were extracted from http://www.dominiopublico.gov.br/.
¹⁴: http://portal.stf.jus.br/
¹⁵: All the tools were developed using Python 3.
¹⁶: The tool he developed for his Master's degree at the University of São Paulo is available at https://github.com/mayara-melo/analise-juridica.
¹⁷: The token count was obtained with the wc -w Linux command, which counts “whitespace-separated tokens”: https://www.gnu.org/software/coreutils/manual/html_node/wc-invocation.html#wc-invocation. This means that these sums, as well as the number of files, will decrease significantly after the preprocessing phase; yet we expect the reduction to be smaller than the 95% of crawled documents discarded by the brWaC (Wagner et al., 2018), for instance. This same observation is valid for Table 3.
¹⁸: Portulan Clarin (https://portulanclarin.net/repository/browse/carolina-general-corpus-of-contemporary-brazilian-portuguese-with-provenance-and-typology-information/f3751b34e36611ecaa5802420a870112f00a37650c304dbda703d85e14a2e945/) and Hugging Face (https://huggingface.co/datasets/carolina-c4ai/corpus-carolina).
¹⁹: Cf. the full list of all versions available for download at https://sites.usp.br/corpuscarolina/corpus.
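The whitespace-based token count mentioned in note 17 can be approximated in Python as in the sketch below. This is an illustration under stated assumptions rather than the project's counting script: the function names are hypothetical, and Python's str.split() treats any Unicode whitespace as a separator, so its result may differ slightly from the command-line count on texts containing unusual space characters.

```python
from typing import Iterable


def count_whitespace_tokens(text: str) -> int:
    """Count maximal runs of non-whitespace characters in a string."""
    # str.split() with no argument splits on runs of whitespace and
    # ignores leading and trailing whitespace.
    return len(text.split())


def count_tokens_in_lines(lines: Iterable[str]) -> int:
    """Count tokens over an iterable of lines, e.g. an open text file."""
    return sum(count_whitespace_tokens(line) for line in lines)


sample = "O  corpus\tCarolina\n é aberto"
print(count_whitespace_tokens(sample))
print(count_tokens_in_lines(sample.splitlines()))
```

Counting line by line gives the same total as counting the whole text, since a line break is itself whitespace; the line-based variant simply avoids loading a large file into memory at once.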

References

Aluísio, S. M., Pinheiro, G. M., Finger, M., Nunes, M. das G. V.; Tagnin, S. E. O. (2003). The Lacio-Web Project: overview and issues in Brazilian Portuguese corpora creation. In D. Archer, P. Rayson, A. Wilson; T. McEnery (Eds.), Proceedings of the Corpus Linguistics 2003 conference (CL2003) (Lancaster, UK, 28-31 March 2003) (pp.
Aluísio, S. M., Pinheiro, G. M., Manfrin, A. M., de Oliveira, L. H., Genoves Jr, L. C., & Tagnin, S. E. (2004). The Lácio-Web: Corpora and Tools to Advance Brazilian Portuguese Language Investigations and Computational Linguistic Tools. In M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa; R. Silva (Eds.), LREC 2004 Fourth International Conference On Language Resources And Evaluation. (pp. 1779-1782). European Language Resources Association.
Bahdanau, D., Cho, K.; Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Baroni, M.; Bernardini, S.; Ferraresi, A.; Zanchetta, E. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Lang. Resour. Evaluation 2009, 43, 209–226.
Bernardini, S., Baroni, M.; Evert, S. (2006). A WaCky introduction. In: M. Baroni; S. Bernardini (eds.). WaCky! working papers on the web as corpus. Bologna: GEDIT.
Bick, E. (2000). The parsing system palavras: Automatic grammatical analysis of Portuguese in a constraint grammar framework. Aarhus Universitetsforlag.
Boos, R., Prestes, K., Villavicencio, A., Padró, M. (2014). brWaC: A WaCky Corpus for Brazilian Portuguese. In: Baptista J., Mamede N., Candeias S., Paraboni I., Pardo T.A.S., Volpe Nunes, M.G. (eds). Computational Processing of the Portuguese Language. PROPOR 2014, Lecture Notes in Computer Science, vol 8775. Springer, Cham. (pp. 201-206).
Cardoso, J. A. (2007). Direitos Autorais no Trabalho Acadêmico. REVISTA JURÍDICA DA PRESIDÊNCIA, 9(86), 58-86.
Clarke, C. L., G. V., Laszlo, M., Lynam, T. R.; Terra, E. L. (2002). The impact of corpus size on question answering performance. In K. Järvelin, M. Beaulieu, R. Baeza-Yates, S. H. Myaeng (Eds.), Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval (pp.
Costa, A. S. (2024). Um sistema de anotação de múltiplas camadas para corpora históricos da língua portuguesa baseados em manuscritos (Doctoral dissertation). Universidade Estadual do Sudoeste da Bahia.
Crespo, M. C. R. M.; Rocha, M. L. S. J.; Sturzeneker, M. L.; Serras, F. R.; Mello, G. L.; Costa, A. S.; Palma, M. F.; Mesquita, R. M.; Guets, R. P.; Silva, M. M.; Finger, M.; Paixão de Sousa, M. C.; Namiuti, C.; Martins do Monte, V. Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information. Manuscript, September 2022. Preprint, 2023; arXiv:2303.16098v1.
Davies, M., Ferreira, M. (2016). Corpus do Português: 1,1 billion words, Web/Dialects. Brigham Young University: Provo-UT, 2016. Retrieved 26 May, 2021, from https://www.corpusdoportugues.
Davies, M., Ferreira, M. (2018). Corpus do Português: 1,1 billion words, NOW. Brigham Young University: Provo-UT, 2018. Retrieved 26 May, 2021, from https://www.corpusdoportugues.
de Brito, M. G., Valério, R. G., de Almeida, G. P.; de Oliveira, L. P. (2007). CORPOBRAS PUC-RIO: Desenvolvimento e análise de um corpus representativo do português. PUC-Rio. Retrieved May 26, 2021.
de Oliveira, L.; Dias, M. C. (2009). Compilação de corpus: representatividade e o CORPOBRAS. 7.
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805.
Ferraresi, A.; Bernardini, S.; Picci, G.; Baroni, M. (2010). Web corpora for bilingual lexicography: A pilot study of English/French collocation extraction and translation. In: Using Corpora in Contrastive and Translation Studies. Newcastle: Cambridge Scholars Publishing. (pp. 337-362).
Fletcher, W. H. (2007). Concordancing the Web: promise and problems, tools and techniques. In M. Hundt, N. Nesselhaulf; C. Biewer (Eds.), Corpus Linguistics and the Web (pp. 25-45). Rodopi.
Jakubíček, M.; Kilgarriff, A.; Kovář, V.; Rychlý, P.; Suchomel, V. (2013). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).
Kalchbrenner, N.; Blunsom, P. (2013). Two recurrent continuous translation models. In Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), (pp. 1700–1709). Association for Computational Linguistics.
Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. (2019). Albert: A lite bert for self-supervised learning of language representations. arXiv:1909.11942.
Liu, V.; Curran, J. R. (2006). Web text corpus for natural language processing. In D. McCarthy; S. Wintner (Eds.), 11th Conference of the European Chapter of the Association for Computational Linguistics.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv:1907.11692.
Namiuti, C.; Santos, J.V. (2021). "Novos desafios para antigas fontes: a experiência DOViC na nova linguística histórica". In: Humanidades digitais e o mundo lusófono. Organização Ricardo M. Pimenta, Daniel Alves. – Rio de Janeiro: Editora FGV, 2021, págs.
de Sousa, M.C.P. O Corpus Tycho Brahe: contribuições para as humanidades digitais no Brasil. 16, 53.
Pinheiro, G. M.; Aluísio, S. M. (2003). Córpus Nilc: descrição e análise crítica com vistas ao projeto Lacio-Web. Núcleo Interinstitucional de Lingüística Computacional. Retrieved 27 May, 2021, from http://143.107.183.175:22180/lacioweb/publicacoes.
Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI. Retrieved 10 June, 2021, from https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P. J. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683. from https://www.jmlr.
Sales, Joana; Sales, Teresa. Carolina Michaëlis de Vasconcelos (1851-1925). Centro de Documentação e Arquivo Feminista Elina Guimarães, 2025. Disponível em: https://www.cdocfeminista.org/carolina-michaelis-de-vasconcelos-1851-1925/. Acesso em: 25 jan. 2025.
Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.
Santos, V.G.; Alves, C.A.; Carlotto, B.B.; Dias, B.A.P.; Gris, L.R.S.; Izaias, R.d.L.; de Morais, M.L.A.; de Oliveira, P.M.; Sicoli, R.; Svartman, F.R.F.; et al. CORAA NURC-SP Minimal Corpus: a manually annotated corpus of Brazilian Portuguese spontaneous speech. IberSPEECH 2022.
Santos, D. (2000). O projecto Processamento Computacional do Português: Balanço e perspectivas. In M. das Graças (Ed.), V Encontro para o processamento computacional da língua portuguesa escrita e falada (PROPOR 2000) (pp. 105-113).
Santos, J. V.; Namiuti, C. O futuro das humanidades digitais é o passado. In: CARRILHO, E. et al. Estudos Linguísticos e Filológicos oferecidos a Ivo Castro. Lisboa: Centro de Linguística da Universidade de Lisboa, 2019 (pp. 1381-1404). ISBN 978-989-98666-3-8.
Sardinha, T. B.; Filho, J. L. M.; Alambert, E. (2010). Manual Córpus Brasileiro. PUCSP, FAPESP. Retrieved 26 May, 2021, from https://www.linguateca.pt/Repositorio/manual_cb.
Simmhan, Y.L.; Plale, B.; Gannon, D. A survey of data provenance in e-science. ACM SIGMOD Rec. 2005, 34, 31–36.
Souza, F.; Nogueira, R.; Lotufo, R. (2020). BERTimbau: Pretrained BERT Models for Brazilian Portuguese. In: Brazilian Conference on Intelligent Systems. Springer, Cham. (pp. 403-417), from https://link.springer.com/chapter/10.1007%2F978-3-030-61377-8_28.
Stim, R. (2016). Fair Use. Stanford Libraries; NOLO. Retrieved 27 May, 2021, from https://fairuse.stanford.
Stim, R. (2016). Measuring Fair Use: The Four Factors. Stanford Libraries; NOLO. Retrieved 27 May, 2021, from https://fairuse.stanford.
Sturzeneker, M. L.; Crespo, M. C. R. M.; Rocha, M. L. S. J.; Finger, M.; Paixão de Sousa, M. C.; Martins do Monte, V.; Namiuti, C. ‘Carolina’s Methodology: building a large corpus with provenance and typology information’. Proceedings of the Second Workshop on Digital Humanities and Natural Language Processing (2nd DHandNLP 2022). CEUR-WS, Vol. 3128, 2022. Available at http://ceur-ws.org/Vol-3128.
TEI Consortium, Burnard, L.; Sperberg-McQueen, C. M. (2021). TEI P5: Guidelines for electronic text encoding and interchange. Version 4.2.2. Last updated on 9th April 2021, revision 609a109b1. Retrieved May 20, 2021 from https://tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems. arXiv:1706.03762.
Vianna, A. E. P. B.; de Oliveira, L. P. (2010). CORPOBRAS PUC-Rio: Análise de corpus e a metáfora gramatical. PUC-Rio. Retrieved 26 May, 2021, from http://www.puc-rio.br/ensinopesq/ccpg/Pibic/relatorio_resumo2010/relatorios/ctch/let/LET-%20Ana%20Elisa%20Piani%20Besserman%20Vianna.
Wagner Filho, J. A.; Wilkens, R.; Idiart, M.; Villavicencio, A. (2018). The BrWac corpus: A new open resource for Brazilian Portuguese. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). (pp. 4339-4344), from https://www.aclweb.org/anthology/L18-1686.

Figure 1. Surveys by broad typology.

Figure 2. General hierarchical structure of the Carolina XML file.

Table 1. Identification of metadata for Carolina (1.3 - Ada Version).

Category	Group	Item of Metadata	Cardinality
Carolina	Primary Identification	Name of the file created in the corpus	1
		Corpus description	1
	Authorship	Creator of the file in the corpus; Responsibilities for the file in the corpus	1+
	Dating	Date of creation of the file in the corpus; Source file download date	1
	Licenses	License type of the file created in the corpus	1
		Access conditions for the file created in the corpus	1
	Extent	Number of words in the text from the source document	1
	Typology	Carolina typology	1
Dossier	Primary Identification	Title of the source document	1
	Authorship	Source-document author	0+
		Source-document translator	0+
		Sponsor (Institution responsible for the source document)	1+
		Authority	1
		Publisher	0/1
	Date	Year of publication of the source document	1
		Full date or period (start and end) of the source document	0/1
	Licenses	License of source document (Public domain, Commons, etc.)	0/1
		License URL	0/1
		Usage permissions	0/1
		Access conditions of source document (public, restricted, etc.)	0/1
	Localization	Domain	1
		Subdomain	0/1
		Source document access URL	1
		Regional origin of source document	0/1
	Acquisition	File format of the source document (pdf, html, etc.)	1
		Constitution (integral, fragmented, etc.)	1
		Nature of acquisition (native digital, scanned printout, OCR)	1
		Part (part of the collection the document represents, if app.)	0/1
		Collection (of which the document is part of, if app.)	0/1
	Size	File size of the source document (in bytes)	1
		Number of pages of the source document	0/1
	Typology	Document type declared in the source document	0/1
		Linguistic variety (regional) indicated in the source document	0/1
		Written or oral text (transcribed)	1
		Domain of use	1
		Degree of preparation	0/1
		User-generated	0/1

Table 2. Published Carolina Versions.

Date	Carolina Version	Size (GB)	Number of extracted texts	Number of tokens¹⁷
2022, Mar	1.0 - Ada	39.23	1,745,234	653,346,569
2022, Apr	1.1 - Ada	7.2	1,745,187	653,322,577
2023, Mar	1.2 - Ada	11.36	2,107,045	823,198,893
2024, Oct	1.3 - Ada	11.16	2,076,205	802,146,069

Table 3. XML Carolina Corpus (Version 1.3 Ada) in numbers.

Broad typology	Size (GB)	Number of extracted texts	Number of words
Datasets and other corpora	4.3	1,074,032	196,524,339
Judicial branch	1.5	40,398	196,228,167
Legislative branch	0.025	13	3,162,474
Public domain works	0.005	26	601,465
Social Media	0.017	3,294	1,231,881
University domains	0.011	941	1,078,967
Wikis	5.3	957,501	403,318,776
Journalistic texts	0	0	0
Current total	11.16	2,076,205	802,146,069
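As a consistency check on Table 3, the per-typology figures can be summed and compared with the “Current total” row and with the 1.3 Ada row of Table 2. The short Python sketch below does this using only the numbers printed in the tables; the variable names are illustrative.

```python
import math

# (size in GB, number of extracted texts, number of words), copied from Table 3.
TABLE_3 = {
    "Datasets and other corpora": (4.3, 1_074_032, 196_524_339),
    "Judicial branch": (1.5, 40_398, 196_228_167),
    "Legislative branch": (0.025, 13, 3_162_474),
    "Public domain works": (0.005, 26, 601_465),
    "Social Media": (0.017, 3_294, 1_231_881),
    "University domains": (0.011, 941, 1_078_967),
    "Wikis": (5.3, 957_501, 403_318_776),
    "Journalistic texts": (0.0, 0, 0),
}

# "Current total" row of Table 3, which matches the 1.3 Ada row of Table 2.
REPORTED_TOTAL = (11.16, 2_076_205, 802_146_069)

total_size = math.fsum(row[0] for row in TABLE_3.values())
total_texts = sum(row[1] for row in TABLE_3.values())
total_words = sum(row[2] for row in TABLE_3.values())

print(round(total_size, 2), total_texts, total_words)
print((round(total_size, 2), total_texts, total_words) == REPORTED_TOTAL)
```

The sizes are rounded to two decimal places before comparison because the table reports them at mixed precision.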

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.