World Health Organization (WHO) COVID-19 Database: Who needs it?

Introduction: The large number of COVID-19 publications has created a need to collect all research-related material in practical and reliable centralized databases. The aim of this study was to evaluate the functionality and quality of the compiled World Health Organization (WHO) COVID-19 database and to compare it with PubMed and Scopus. Methods: Article metadata for COVID-19 articles and for articles on 8 specific topics related to COVID-19 were exported from the WHO global research database, Scopus and PubMed. The analysis was conducted in R to investigate the number and overlap of articles between the databases and the missingness of values in the metadata. Results: The WHO database contained the largest number of COVID-19-related articles overall, but it retrieved a similar number of articles on the 8 specific topics as Scopus and PubMed. Although Scopus had the smallest number of exclusive articles overall, it returned the highest number of exclusive articles on specific COVID-19-related topics. Further investigation revealed that PubMed and Scopus have a more comprehensive structure than the WHO database and fewer missing values in the categories searched by the information retrieval systems. Discussion: This study suggests that the WHO COVID-19 database, even though it is compiled from multiple databases, has a very simple and limited structure and significant problems with data quality. As a consequence, relying on this database as a source of articles for systematic reviews or bibliometric analyses is undesirable.


INTRODUCTION
In response to the coronavirus disease (COVID-19), an unprecedented number of articles were published 1,2. Publishers adapted to the situation, both to avoid hindering progress and to secure a large number of articles likely to be well cited. Many started publishing open access, submission-to-publication times were reduced, and many articles were published ahead of print to make them available sooner 3. The surge in publications created the need for a systematic database of all COVID-19-related articles in order to take full advantage of the research. The leader in world health, the World Health Organization (WHO), created one of the largest COVID-19 research-related databases, described as a "comprehensive multilingual source of current literature on the topic" 4. Several similar noteworthy attempts exist, including the database maintained by the Centers for Disease Control and Prevention (CDC) 5 and the LitCovid database created by the National Library of Medicine 6. Although the incentive to gather COVID-19-related research data in one place and provide an information platform to accelerate and foster research is praiseworthy, a systematic and thoughtful approach is imperative. This is especially important for databases organized and maintained by distinguished and respected healthcare organizations such as the WHO and the CDC, as it is to be expected that at least some researchers will use their platforms without thoroughly questioning the quality of the content. For this reason, we explored the functionality and quality of the WHO database (WHOdb) by comparing it with the widely used PubMed and Scopus databases.

Data acquisition
The whole global research database on COVID-19 maintained by the WHO 4 was downloaded on May 19th and on June 26th, 2020. The PubMed (using pubmedR 7, search term "COVID-19") and Scopus (search phrase "COVID-19") databases were accessed on June 26th and June 27th, and the accompanying article metadata were stored. The procedure was repeated for 8 specific, unrelated topics: the terms azithromycin, chloroquine, depression, diabetes, hypertension, quarantine, shock and tocilizumab were used as individual search terms in the WHOdb, and each term was combined with "AND COVID-19" to form the search term in PubMed and Scopus.
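The topic queries described above can be written down compactly. The following Python sketch is illustrative only: the topic list and the "AND COVID-19" suffix come from the text, while the helper name build_queries and the dictionary layout are hypothetical and are not taken from the authors' R code.

```python
# Topics as listed in the text.
TOPICS = [
    "azithromycin", "chloroquine", "depression", "diabetes",
    "hypertension", "quarantine", "shock", "tocilizumab",
]


def build_queries(topic):
    """Return the search string used in each database for one topic.

    Hypothetical helper: the WHOdb is searched with the bare term,
    PubMed and Scopus with the term combined with "AND COVID-19".
    """
    combined = f"{topic} AND COVID-19"
    return {"WHOdb": topic, "PubMed": combined, "Scopus": combined}


queries = {topic: build_queries(topic) for topic in TOPICS}
```

The bare term suffices for the WHOdb because that database is already restricted to COVID-19 literature.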

Data analysis
In the first step, URLs were converted to DOIs and harmonized in the WHOdb. DOIs in all databases were deduplicated and used to compare the databases with respect to a) the total number of retrieved articles, b) the number of overlapping articles, and c) the number of unique articles (not contained in any other database), both for the "umbrella" term (COVID-19) and for each of the 8 specific topics. In the next step, the quality of the databases was assessed by evaluating the retrieved metadata using two indicators: missing information (all categories of metadata were considered) and duplicate data in the categories in which duplicates are not expected (DOI, ID, URL, abstract, and the combination of the title, authors and journal categories). Missing values in the abstract, title and keywords categories were hierarchically clustered on complete data. To assess how much each category contributed to the retrieval of articles, the search terms for the specific topics were located in the metadata returned by the corresponding searches, and the matches were then clustered to investigate the overlap between categories. The analysis was conducted in R 8, and the entire code and data are available on GitHub 9.
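The first step (DOI extraction, deduplication and overlap counting) can be illustrated with a minimal Python sketch. It is not the authors' R implementation. The regular expression is an assumed approximation of the DOI syntax, the function names are hypothetical, and the toy records are invented for illustration; URL-encoded DOIs and query strings are not handled.

```python
import re

# Assumed DOI shape "10.<registrant>/<suffix>"; real data may need looser rules.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+", re.IGNORECASE)


def extract_doi(value):
    """Return a lower-cased DOI found in a URL or DOI string, or None."""
    if not value:
        return None
    match = DOI_PATTERN.search(value)
    if match is None:
        return None
    return match.group(0).rstrip(".,;").lower()


def compare(databases):
    """Compare databases given as {name: iterable of raw DOI/URL strings}."""
    # Building sets deduplicates the harmonized DOIs within each database.
    dois = {
        name: {d for d in map(extract_doi, values) if d}
        for name, values in databases.items()
    }
    shared_by_all = set.intersection(*dois.values())
    exclusive = {}
    for name, own in dois.items():
        others = set().union(*(s for n, s in dois.items() if n != name))
        exclusive[name] = own - others
    return dois, shared_by_all, exclusive


# Invented toy records, for illustration only.
toy = {
    "WHOdb": [
        "https://doi.org/10.1000/a1",
        "https://dx.doi.org/10.1000/A1",  # same article, different URL form
        "https://doi.org/10.1000/b2",
        "https://example.org/no-doi",     # URL not derived from a DOI
    ],
    "PubMed": ["10.1000/a1", "10.1000/c3"],
    "Scopus": ["10.1000/a1", "10.1000/b2"],
}
dois, shared, exclusive = compare(toy)
```

In this toy example one article is shared by all three databases and only PubMed holds an exclusive article. Entries whose URL does not contain a DOI drop out of the comparison.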

Terminology
The term "entry" refers to any article in the databases we analysed. A single data point in any column of the databases is considered a value, and columns are referred to as categories. Exclusive articles are articles contained in only one database, while shared articles are present in more than one. The terms keywords and descriptors are used interchangeably in the context of the WHOdb.

RESULTS
The WHOdb contained the largest number of COVID-19-related articles (36838), followed by PubMed (25700) and Scopus (19451). Following the exclusion of duplicate entries, the total number of articles with a full overlap across the databases was 15302 (Fig 1A). Each database contained a number of exclusive articles not included in the other databases, with the largest number (6865) in the WHOdb (Fig 1A). The total number of such exclusive articles across all databases was 9146. However, when we searched each database on specific topics, the number of retrieved results was similar in each database (Fig 1B). Paradoxically, the WHOdb provided only a modest number of articles not available in the other databases (Fig 1C), despite having the highest number of total and exclusive articles (Fig 1A). In contrast, Scopus had the smallest number of total and exclusive articles (Fig 1A), but in 7 out of 8 specific queries it provided the highest number of exclusive articles not available in the other databases (Fig 1C).
Further investigation of this phenomenon revealed that all databases suffer from a significant number of missing values (Fig 2A). The distribution of the missing values across categories of data in the databases is displayed in Fig 2B. Special attention was directed at the abstract, keywords (called descriptors in the WHOdb) and title categories, as information retrieval systems (IRS) often depend on them. A substantial number of values were missing in these categories in all the examined databases (Fig 2B). Next, hierarchical clustering of the missing values was performed (Fig 3A), revealing that a substantial proportion of articles in the WHOdb are missing both the abstract and the keywords (descriptors). In contrast, the proportion of articles missing both the abstract and the keywords (author and indexed) is much smaller in the PubMed and Scopus databases.
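The two views of missingness described above (the share of missing values per category, and which categories are missing together) can be sketched in Python as follows. Counting exact missingness patterns is used here as a simpler stand-in for the hierarchical clustering shown in Fig 3A, not as a reproduction of it. The column names, the definition of "missing" and the toy entries are assumptions made for illustration.

```python
from collections import Counter

CATEGORIES = ("title", "abstract", "keywords")  # assumed column names


def is_missing(value):
    """Treat None and blank strings as missing (an assumption)."""
    return value is None or str(value).strip() == ""


def missing_share(entries, categories=CATEGORIES):
    """Fraction of entries with a missing value, per category."""
    total = len(entries)
    if total == 0:
        return {c: 0.0 for c in categories}
    return {
        c: sum(is_missing(e.get(c)) for e in entries) / total
        for c in categories
    }


def missing_patterns(entries, categories=CATEGORIES):
    """Count entries by the exact set of categories they are missing."""
    return Counter(
        tuple(c for c in categories if is_missing(e.get(c)))
        for e in entries
    )


# Invented toy entries, for illustration only.
toy_entries = [
    {"title": "A", "abstract": "text", "keywords": "x; y"},
    {"title": "B", "abstract": "", "keywords": None},
    {"title": "C", "abstract": None, "keywords": ""},
    {"title": "D", "abstract": "text", "keywords": ""},
]
shares = missing_share(toy_entries)
patterns = missing_patterns(toy_entries)
```

An entry counted under the pattern ("abstract", "keywords") can only be found by a search through its title, which is why the co-occurrence of missing values matters more for retrieval than the per-category totals.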
Additionally, the clustering of matching search terms in the abstract, title and keywords categories of the search results on the specific topics showed that only a small proportion of articles were discovered as a result of matching search terms in more than one category (Fig 3B).
In the WHOdb and PubMed, the search terms were predominantly matched in the abstract category, whereas in Scopus they were predominantly matched in the ID category (indexed keywords).
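A minimal Python sketch of the category-matching idea follows: for each retrieved entry, record the categories in which the search term literally occurs, then count the resulting combinations. The column names, the case-insensitive substring match and the toy entries are assumptions, and the tally is a simplification of the clustering shown in Fig 3B.

```python
from collections import Counter


def matched_categories(entry, term,
                       categories=("title", "abstract", "keywords")):
    """Return the categories whose text contains the term (case-insensitive)."""
    needle = term.lower()
    return tuple(
        c for c in categories
        if needle in str(entry.get(c) or "").lower()
    )


# Invented toy search results for the term "tocilizumab".
results = [
    {"title": "Tocilizumab in COVID-19",
     "abstract": "We studied tocilizumab.", "keywords": ""},
    {"title": "IL-6 blockade",
     "abstract": "Patients received tocilizumab.", "keywords": None},
    {"title": "Cytokine storm",
     "abstract": "", "keywords": "Tocilizumab; IL-6"},
    {"title": "Unrelated", "abstract": "No match here", "keywords": "other"},
]
match_patterns = Counter(
    matched_categories(entry, "tocilizumab") for entry in results
)
```

Entries that fall under the empty pattern were returned by the search although the term does not appear in any exported category.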
Since missing values were expected in certain categories for some publication types (letters, comments, opinions, etc.), we filtered the entries by document type, excluding all types other than original research articles. This was performed only for Scopus and PubMed because the WHOdb does not have a category that specifies the type of entry. This analysis indicated that abstracts were still missing for a significant proportion of entries in PubMed and Scopus (37.998% and 19.62%, respectively).
However, a closer look at the filtered articles revealed that a significant proportion of them were in fact letters, opinions, etc. that had been wrongly classified as journal articles.
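The filtering step can be sketched in Python as follows. The column name document_type, the label "Article" and the toy entries are assumptions made for illustration; the actual PubMed and Scopus exports use their own field names and type labels.

```python
def missing_abstract_share(entries, keep_types=("Article",)):
    """Percentage of entries of the kept types that have no abstract.

    Returns None when no entry matches the kept document types.
    """
    kept = [e for e in entries if e.get("document_type") in keep_types]
    if not kept:
        return None
    missing = sum(1 for e in kept if not (e.get("abstract") or "").strip())
    return 100.0 * missing / len(kept)


# Invented toy entries, for illustration only.
toy_typed = [
    {"document_type": "Article", "abstract": "Background ..."},
    {"document_type": "Article", "abstract": ""},
    {"document_type": "Article", "abstract": "Methods ..."},
    {"document_type": "Article", "abstract": "Results ..."},
    {"document_type": "Letter", "abstract": None},
]
share = missing_abstract_share(toy_typed)
```

Such a filter is only as reliable as the document type label itself, so entries mislabelled as journal articles inflate the share of missing abstracts.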
Finally, it appeared that a considerable number of entries in the WHOdb were duplicates.
Additionally, unique identifiers must be provided for all articles to allow the identification of duplicates and the derivation of the full-text URL. The latter is important not only for the users of the database but also for the retrieval of full text for text mining. This is especially relevant because it has been hypothesised that full-text mining might facilitate and simplify the identification of topic-relevant articles in bibliographic databases when used as an alternative to, or in conjunction with, classic Boolean search strategies 14,15. Several such noteworthy attempts exist 16,17, and some of them build on the WHOdb. Thus, changing the structure of the database might affect the other databases that depend on it.
Finally, we want to draw attention to the inconsistencies and duplicate entries in the database and to emphasize the need for caution when using it as a source of articles for bibliometric analysis. The problem is especially prominent in older versions of the WHOdb, in which many inconsistencies were noticed in the use of delimiters and in the formatting of author names, journal names and DOIs.
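Inconsistent delimiters and name formats hide duplicates from an exact comparison. The following Python sketch shows one possible way to detect duplicates by the combination of title, authors and journal after a crude normalization. The normalization rules, the column names and the toy entries are assumptions. In particular, stripping everything except ASCII letters and digits discards accented characters, which a production pipeline would need to handle.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lower-case and collapse punctuation and whitespace (crude, ASCII only)."""
    text = (text or "").lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return " ".join(text.split())


def find_duplicates(entries):
    """Return groups of entry indices sharing normalized title, authors, journal."""
    groups = defaultdict(list)
    for index, entry in enumerate(entries):
        key = (
            normalize(entry.get("title")),
            normalize(entry.get("authors")),
            normalize(entry.get("journal")),
        )
        groups[key].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


# Invented toy entries, for illustration only.
toy_records = [
    {"title": "COVID-19 and Diabetes",
     "authors": "Smith, J.; Lee, K.", "journal": "J. Med."},
    {"title": "Covid-19 and diabetes.",
     "authors": "Smith J, Lee K", "journal": "J Med"},
    {"title": "Another study", "authors": "Doe, A.", "journal": "Lancet"},
]
duplicate_groups = find_duplicates(toy_records)
```

The first two toy entries differ only in capitalization, punctuation and delimiters, so they collapse to the same key and are reported as one duplicate group.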

Limitations
The present work suffers from several limitations: a) since DOIs in the DOI category are missing, DOIs were extracted from the "FullText URL" category in the WHOdb to identify duplicates and investigate overlap (this is not ideal, as approximately 6.95% of full-text URLs are missing and 0.7% are not derived from DOIs); b) the analysis displayed in Fig 3B shows that some articles do not have the search term mentioned in any category, suggesting that (i) the exported article metadata are partial, or (ii) the IRS searches other fields (despite abstract, title and keywords being specifically selected) that are not exported from the WHOdb; c) restrictions imposed by the Scopus website limited the dataset used for assessing quality to 4000 articles (the 2000 topmost articles sorted by source title alphabetically and the 2000 topmost in reverse order were merged).
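The merging described in limitation c) can be sketched in Python as follows. The key name "id" and the toy records are hypothetical; the point is only that an article appearing in both exports is kept once.

```python
def merge_exports(ascending, descending, key="id"):
    """Merge two exports, keeping the first record seen for each identifier."""
    merged = {}
    for record in list(ascending) + list(descending):
        merged.setdefault(record[key], record)
    return list(merged.values())


# Invented toy exports: record 2 appears in both sort orders.
first_export = [{"id": 1, "source": "Acta"}, {"id": 2, "source": "BMJ"}]
second_export = [{"id": 3, "source": "Zoonoses"}, {"id": 2, "source": "BMJ"}]
merged_records = merge_exports(first_export, second_export)
```

When the full result set is smaller than the combined export limit, the two exports overlap and the merged dataset contains fewer records than the sum of both.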

Conclusion
In conclusion, under the circumstances of the COVID-19 pandemic, centralization of all pertinent research-related material would be beneficial, as it would facilitate the dissemination of information and simplify access to it. The attempt to accomplish such a task is first and foremost brave and admirable. Still, it stands to reason that the real challenge is not to merge the data from different sources but to design a good database structure and keep the data clean, as this affects functionality. From the standpoint of a researcher interested in using the WHOdb as a bibliographic database, it is worrisome that more results were retrieved with queries on PubMed and Scopus than from the WHO global research database. Thus, we conclude that the WHOdb alone is not sufficient as a source of information, even though it is compiled from multiple sources by a very respected and trustworthy organization.