Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: geographic information fusion; data quality; data consistency checking; historic GIS; railway network; patrimonial data; crowdsourced open data; volunteer geographic information (VGI); Wikipedia geo-spatial information extraction.
Online: 17 August 2020 (14:51:04 CEST)
Transportation of goods is as old as human civilization: past networks and their evolution shed light on long-term trends. The impact of transportation on climate change is recognized as major, as is its role in spreading a pandemic. These two reasons motivate the importance of providing relevant and reliable historical geographic datasets of these networks. This paper focuses on reconstructing the railway network in France at its maximal extent, a century ago. The active stations and lines are well documented by the French SNCF in open public data. However, that information ignores past stations (before 1980), which probably represent more than what is recorded in public data. Additional open data, individual or collaborative (e.g. Wikipedia), are particularly valuable, but they are not always geo-coded, and two more sources are necessary to complete that geo-coding: ancient maps and aerial photography. Remote sensing and volunteer geographic information are therefore the two pillars of past railway reconstruction. The methods developed are adapted to the extraction of information from these sources: automated parsing of Wikipedia Infoboxes, data extraction from simple tables, and even from plain text. This series of sparse procedures can be merged into a comprehensive computer-assisted process. Beyond this, a substantial quality-control effort is necessary when merging these data: automated wherever possible, otherwise visually controlled against remote sensing information. The main output is a reliable dataset, under ODbL, of more than 9,100 stations, which can be combined with information about the 35,000 communes of France for a large variety of studies. This work demonstrates two theses: (a) it is possible to reconstruct transport network data from the past, and generic computer-assisted methods can be developed; (b) the value of remote sensing and volunteered geographic information is considerable (as archaeologists already know).
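The Infobox-parsing step mentioned above can be illustrated with a short sketch (not the authors' pipeline): it fetches an article's wikitext through the MediaWiki API and reads coordinates from a station infobox. The infobox and parameter names ("Infobox Gare", "latitude", "longitude") and the example page title are assumptions for illustration only.

```python
# Minimal sketch of automated Wikipedia Infobox parsing for station geo-coding.
# Infobox/parameter names and the example page are assumptions, not the paper's code.
import requests
import mwparserfromhell

API = "https://fr.wikipedia.org/w/api.php"

def fetch_wikitext(title):
    """Download the raw wikitext of a French Wikipedia article via the MediaWiki API."""
    params = {"action": "parse", "page": title, "prop": "wikitext", "format": "json"}
    data = requests.get(API, params=params, timeout=30).json()
    return data["parse"]["wikitext"]["*"]

def extract_station_coordinates(title):
    """Return (latitude, longitude) strings from the station infobox, if present."""
    code = mwparserfromhell.parse(fetch_wikitext(title))
    for template in code.filter_templates():
        name = str(template.name).strip().lower()
        if name.startswith("infobox gare"):  # assumed infobox name for railway stations
            lat = str(template.get("latitude").value).strip() if template.has("latitude") else None
            lon = str(template.get("longitude").value).strip() if template.has("longitude") else None
            return lat, lon
    return None, None

if __name__ == "__main__":
    print(extract_station_coordinates("Gare de Lyon-Perrache"))  # hypothetical example page
```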
ARTICLE | doi:10.20944/preprints201709.0130.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Wikipedia; information quality; WikiRank; DBpedia
Online: 26 September 2017 (15:34:18 CEST)
Despite the fact that Wikipedia is often criticized for its poor quality, it remains one of the most popular knowledge bases in the world. Articles in this free encyclopedia on various topics can be created and edited in about 300 different language versions independently. Our research showed that, for language-sensitive topics, the quality of information can be relatively better in the relevant language versions. However, in most cases it is difficult for Wikipedia readers to determine the language affiliation of the described subject. Additionally, each language edition of Wikipedia can have its own rules for manually assessing content quality. This makes the automatic quality comparison of articles between languages a challenging task. The paper presents the results of a relative quality and popularity assessment of over 28 million articles in 44 selected language versions. In addition, a comparative analysis of the quality and popularity of articles on selected topics was conducted. The proposed method makes it possible to find articles with better-quality information that can be used to automatically enrich other language editions of Wikipedia.
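As a rough illustration of how such per-language quality scores can drive enrichment (this is not the WikiRank method itself), the following sketch ranks the language editions whose article on a topic scores higher than a given target edition; the scores are placeholders on an assumed 0-100 scale.

```python
# Minimal sketch: given per-language quality scores for one topic, list the
# editions worth using to enrich a weaker target edition. Scores are invented.
def best_source_editions(quality_by_language, target_language):
    """Return language codes with better quality than the target, best first."""
    target_score = quality_by_language.get(target_language, 0.0)
    better = {lang: score for lang, score in quality_by_language.items()
              if lang != target_language and score > target_score}
    return sorted(better, key=better.get, reverse=True)

scores = {"en": 87.5, "de": 62.0, "pl": 41.3}    # hypothetical scores for one topic
print(best_source_editions(scores, "pl"))        # ['en', 'de']
```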
ARTICLE | doi:10.20944/preprints202003.0460.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Wikipedia; reference; source; reliability; popularity; Wikidata; DBpedia
Online: 31 March 2020 (22:18:51 CEST)
One of the most important factors impacting the quality of content in Wikipedia is the presence of credible sources. By following references, readers can verify facts or find more details about the described topic. A Wikipedia article can be edited independently in any of over 300 languages, even by anonymous users, so information about the same topic may be inconsistent. This also applies to the use of references in different language versions of a particular article: the same statement can have different sources. In this paper we analyzed over 40 million articles from the 55 most developed language versions of Wikipedia to extract information about nearly 200 million references and find the most popular and reliable sources. We presented 10 models for assessing the popularity and reliability of sources, based on the analysis of meta-information about the references in Wikipedia articles, page views, and the authors of the articles. Using DBpedia and Wikidata we automatically identified the alignment of the sources to specific domains. Additionally, we analyzed changes in popularity and reliability over time and identified growth leaders in each considered month. The results can be used to improve the quality of content in different language versions of Wikipedia.
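The extraction step can be sketched as follows; this shows only the simplest frequency-of-use measure over <ref> tags in wikitext, not the paper's 10 models, and the regular expressions are simplifying assumptions.

```python
# Simplified sketch: extract cited source domains from <ref>...</ref> tags and
# count how often each domain is used across articles (a basic popularity model).
import re
from collections import Counter
from urllib.parse import urlparse

REF_PATTERN = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
URL_PATTERN = re.compile(r"https?://[^\s|\]}<]+")

def reference_domains(wikitext):
    """Domains of all URLs cited inside <ref> tags of one article's wikitext."""
    domains = []
    for ref in REF_PATTERN.findall(wikitext):
        for url in URL_PATTERN.findall(ref):
            domains.append(urlparse(url).netloc.lower())
    return domains

def domain_popularity(articles_wikitext):
    """Count how many references point to each source domain across a corpus."""
    counts = Counter()
    for wikitext in articles_wikitext:
        counts.update(reference_domains(wikitext))
    return counts
```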
ARTICLE | doi:10.20944/preprints201807.0560.v1
Subject: Social Sciences, Library & Information Science Keywords: semantic web; wikipedia; conceptual evolution; negotiated meanings
Online: 30 July 2018 (05:55:08 CEST)
Wikipedia, as a "social machine", is a privileged place to observe the collective construction of concepts without central control. Based on Dahlberg's theory of concept, and anchored in the pragmatism of Hjørland, in which concepts are socially negotiated meanings, the evolution of the concept of the Semantic Web (SW) was analyzed in the English version of Wikipedia. An exploratory, descriptive and qualitative study was designed, and we identified 26 different definitions (between 7-12-2001 and 12-31-2017), of which 8 are of particular relevance for their duration, the last two being those recorded at the end of the analyzed period. According to them, the SW "is an extension of the web" and "is a Web of Data"; the latter, used as a complementary definition, links to Berners-Lee's publications. In Wikipedia, the evolution of the SW concept appears to be driven by the search for non-technical vocabulary and by the control of authority carried out through debate. As a space for the collective negotiation of meanings, studying Wikipedia may bring relevant contributions to a community's understanding of a particular concept and of how it evolves over time.
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Text summarization; Fine-tuning; Transformers; SMS; Gateway; French Wikipedia.
Online: 14 September 2021 (10:48:55 CEST)
Text summarization remains a challenging task in the Natural Language Processing field despite the plethora of applications in enterprises and daily life. One of the common use cases is the summarization of web pages, which has the potential to provide an overview of web pages to devices with limited features. In fact, despite the increasing penetration rate of mobile devices in rural areas, the bulk of those devices offer limited features, and these areas are often covered only by limited connectivity such as the GSM network. Summarizing web pages into SMS therefore becomes an important way to deliver information to such devices. This work introduces WATS-SMS, a T5-based French Wikipedia Abstractive Text Summarizer for SMS, built through a transfer learning approach. The T5 English pre-trained model is used to generate a French text summarization model by retraining it on 25,000 Wikipedia pages, and the result is compared with different approaches in the literature. The objective is twofold: (1) to check the assumption made in the literature that abstractive models provide better results than extractive ones; and (2) to evaluate the performance of our model compared to other existing abstractive models. A score based on ROUGE metrics gave us a value of 52% for articles up to 500 characters long, against 34.2% for Transformer-ED and 12.7% for seq2seq-attention, and a value of 77% for longer articles, against 37% for Transformer-DMCA. Moreover, an architecture including a software SMS gateway has been developed to allow owners of mobile devices with limited features to send requests and receive summaries through the GSM network.
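For orientation, a minimal inference-and-evaluation sketch is shown below. It uses the generic "t5-small" checkpoint from Hugging Face as a stand-in for the fine-tuned WATS-SMS model, together with the rouge_score package; none of this is the authors' training code, and the SMS-length limit is an assumed parameter.

```python
# Minimal sketch: abstractive summarization with a T5 checkpoint and ROUGE-L scoring.
# "t5-small" is a stand-in; the paper fine-tunes T5 on 25,000 French Wikipedia pages.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from rouge_score import rouge_scorer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def summarize(text, max_new_tokens=64):
    """Generate an abstractive summary short enough to fit in a few SMS messages."""
    inputs = tokenizer("summarize: " + text, return_tensors="pt",
                       truncation=True, max_length=512)
    ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

def rouge_l(reference, candidate):
    """ROUGE-L F1 between a reference summary and a generated one."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, candidate)["rougeL"].fmeasure
```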
ARTICLE | doi:10.20944/preprints201905.0144.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Wikipedia; Information quality; Popularity; Topics identification; Wikidata; DBpedia; WikiRank
Online: 5 August 2019 (12:26:34 CEST)
In Wikipedia, articles about various topics can be created and edited independently in each language version. Therefore, the quality of information about the same topic depends on the language. Any interested user can improve an article, and that improvement may depend on the popularity of the article. The goal of this study is to show which topics are best represented in different language versions of Wikipedia, using the results of a quality assessment of over 39 million articles in 55 languages. In this paper, we also analyze how popular selected topics are among readers and authors in various languages. We used two approaches to assign articles to topics. First, we selected 27 main multilingual categories and analyzed all their connections with sub-categories, based on information extracted from over 10 million categories in 55 language versions. To classify the articles into one of the 27 main categories we took into account over 400 million links from articles to over 10 million categories and over 26 million links between categories. In the second approach we used data from DBpedia and Wikidata. We also showed how the results of the study can be used to build local and global rankings of Wikipedia content.
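The first, category-based approach can be illustrated with a small in-memory sketch: main topics are propagated down the sub-category graph and articles are assigned through their category links. The toy category and article names are invented, and the real pipeline operates on hundreds of millions of links rather than Python dicts.

```python
# Simplified sketch of the category-based topic assignment: propagate each main
# topic down the sub-category graph (BFS), then label articles via their categories.
from collections import deque

def propagate_topics(main_topics, subcategory_links):
    """Map every reachable category to the main topic it descends from."""
    topic_of = {cat: cat for cat in main_topics}
    queue = deque(main_topics)
    while queue:
        parent = queue.popleft()
        for child in subcategory_links.get(parent, []):
            if child not in topic_of:          # keep the first (shortest-path) assignment
                topic_of[child] = topic_of[parent]
                queue.append(child)
    return topic_of

def classify_articles(article_categories, topic_of):
    """Assign each article to the main topic of its first matching category."""
    assignment = {}
    for article, cats in article_categories.items():
        for cat in cats:
            if cat in topic_of:
                assignment[article] = topic_of[cat]
                break
    return assignment

# Toy example with invented names
subcats = {"Science": ["Physics"], "Physics": ["Quantum mechanics"]}
articles = {"Qubit": ["Quantum mechanics"], "Mona Lisa": ["Paintings"]}
topics = propagate_topics(["Science", "Culture"], subcats)
print(classify_articles(articles, topics))     # {'Qubit': 'Science'}
```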
ARTICLE | doi:10.20944/preprints202104.0770.v1
Subject: Social Sciences, Library & Information Science Keywords: Wikipedia; knowledge equity; Wikimedia; open culture; visual arts; cultural bias
Online: 29 April 2021 (09:16:07 CEST)
We explore gaps in Wikipedia's coverage of the visual arts by comparing the representation of 100 artists and 100 artworks from the Western canon against corresponding sets of notable artists and artworks from non-Western cultures. We measure the coverage of these two sets of topics across Wikipedia as a whole and for its individual language versions. We also compare the coverage for Wikimedia Commons and Wikidata, sister projects of Wikipedia that host digital media and structured data. We show that all these platforms strongly favour the Western canon, giving many times more coverage to Western art. We highlight specific examples of differing coverage of visual art inside and outside the Western canon. We find that European language versions of Wikipedia are generally more "Western" in their coverage and Asian language versions more "global", with interesting exceptions. We suggest how both Wikipedia and the wider cultural sector can address this gap in content and thus give Wikipedia a truly global perspective on the visual arts.
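One simple coverage proxy, counting how many Wikipedia language editions link to an artist's Wikidata item, can be sketched as follows; it is only an approximation of the measures used in the study, and the example QID (Q5582, Vincent van Gogh) is illustrative.

```python
# Minimal sketch of a coverage proxy: how many Wikipedia editions have an article
# about an item, derived from its Wikidata sitelinks. Only one possible measure.
import requests

API = "https://www.wikidata.org/w/api.php"

def wikipedia_editions(qid):
    """Approximate number of Wikipedia language editions linking to a Wikidata item."""
    params = {"action": "wbgetentities", "ids": qid, "props": "sitelinks", "format": "json"}
    sitelinks = requests.get(API, params=params, timeout=30).json()["entities"][qid]["sitelinks"]
    # Wikipedia sitelinks end in "wiki" (e.g. "enwiki"); this filter is approximate
    # and still admits a few non-Wikipedia sites such as "specieswiki".
    return sum(1 for site in sitelinks
               if site.endswith("wiki") and not site.startswith("commons"))

print(wikipedia_editions("Q5582"))   # illustrative item: Vincent van Gogh
```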
ARTICLE | doi:10.20944/preprints201801.0017.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Wikipedia; Polish; information quality; linguistic features; linguistics; data mining; NLP
Online: 3 January 2018 (02:03:51 CET)
Wikipedia is the most popular and the largest user-generated source of knowledge on the Web. The quality of the information in this encyclopedia is often questioned. Therefore, Wikipedians have developed an award system for high-quality articles, which must follow specific style guidelines. Nevertheless, more than 1.2 million articles in the Polish Wikipedia are unassessed. This paper considers over 100 linguistic features to determine the quality of Wikipedia articles in the Polish language. We evaluate our models on 500,000 articles of the Polish Wikipedia. Additionally, we discuss the importance of linguistic features for quality prediction.
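A minimal sketch of the feature-based setup is given below: a handful of simple linguistic features (the paper uses over 100) feeding a standard classifier. The feature set, labels, and toy corpus are illustrative assumptions, not the authors' exact configuration.

```python
# Simplified sketch: compute a few linguistic features per article and train a
# classifier to predict quality. Features and toy data are illustrative only.
import re
from sklearn.ensemble import RandomForestClassifier

def linguistic_features(text):
    """A tiny subset of possible linguistic features for a plain-text article."""
    words = re.findall(r"\w+", text, flags=re.UNICODE)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [
        len(words),                                        # article length in words
        len(sentences),                                    # number of sentences
        len(words) / max(len(sentences), 1),               # average sentence length
        sum(len(w) for w in words) / max(len(words), 1),   # average word length
    ]

# Toy training data: (article text, is_high_quality)
corpus = [
    ("Krótki artykuł.", 0),
    ("Dłuższy, starannie napisany artykuł z wieloma zdaniami. " * 20, 1),
]
X = [linguistic_features(text) for text, _ in corpus]
y = [label for _, label in corpus]
model = RandomForestClassifier(n_estimators=100).fit(X, y)
```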