Version 1
: Received: 14 April 2023 / Approved: 14 April 2023 / Online: 14 April 2023 (04:41:04 CEST)
Version 2
: Received: 16 April 2023 / Approved: 17 April 2023 / Online: 17 April 2023 (08:12:18 CEST)
Version 3
: Received: 17 August 2023 / Approved: 18 August 2023 / Online: 18 August 2023 (11:19:23 CEST)
How to cite:
Hamed, A.A. Improving Detection of ChatGPT-Generated Fake Science Using Real Publication Text: Introducing xFakeBibs a Supervised Learning Network Algorithm. Preprints 2023, 2023040350. https://doi.org/10.20944/preprints202304.0350.v1
APA Style
Hamed, A.A. (2023). Improving Detection of ChatGPT-Generated Fake Science Using Real Publication Text: Introducing xFakeBibs a Supervised Learning Network Algorithm. Preprints. https://doi.org/10.20944/preprints202304.0350.v1
Chicago/Turabian Style
Hamed, A.A. 2023. "Improving Detection of ChatGPT-Generated Fake Science Using Real Publication Text: Introducing xFakeBibs a Supervised Learning Network Algorithm." Preprints. https://doi.org/10.20944/preprints202304.0350.v1
Abstract
Background: ChatGPT is becoming a new reality. Where do we go from here? Objective: To show how ChatGPT-generated publications can be distinguished from counterparts produced by scientists. Methods: By means of a newly devised algorithm, called xFakeBibs, we show that the contents and structure of bigram networks generated from ChatGPT fake publications differ significantly from those of real publications. Specifically, we prompted ChatGPT to generate 100 publications related to Alzheimer’s and comorbidity. Using TF-IDF, we constructed a network of bigrams and compared it with 10 other networks constructed from real PubMed publications. Each of those networks was constructed from exactly one of 10 folds. Each fold comprised 100 publications to ensure fairness. We trained the xFakeBibs algorithm using the 10 folds, which were then used to test the ChatGPT fake publications. The algorithm successfully assigned the POSITIVE label to real publications and the NEGATIVE label to fake ones. Results: When comparing the bigrams of the training set against each of the 10 folds, we found that the similarities fluctuated between 19% and 21%. In contrast, the bigram similarity for the ChatGPT set was only 8%. Additionally, when testing how the bigrams alter the structure of the training model, we found that each of the 10 folds contributed 51%-70% new bigrams, while ChatGPT contributed only 23%, which is less than half of the contribution of any of the 10 folds. Upon calibrating the xFakeBibs algorithm using the 10 folds of real publications, we learned that they contribute between 21.96 and 24.93 edges on average. When xFakeBibs classified the individual articles, 98 of the 100 ChatGPT-generated publications were detected as fake, while 2 articles failed the test and were classified as real publications. Conclusions: It is indeed possible to distinguish, in bulk, a dataset of ChatGPT-generated publications from a counterpart of real publications (as was the case for the Alzheimer’s dataset). Although individual articles can also be classified as fake with a high degree of accuracy, it may not be possible to detect every ChatGPT-generated article automatically. ChatGPT may seem to be a useful tool, but it certainly presents a threat to our knowledge of authentic and real science. This work is a step in the right direction to counter fake science and misinformation.
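As a rough illustration of the bigram-network comparison described in the abstract, the following Python sketch builds a network as a set of bigram edges, measures how much of a candidate network already appears in a training network, and counts the new edges a single document would contribute. It is not the xFakeBibs implementation: the function names are hypothetical, the TF-IDF selection step is omitted for brevity, and the decision rule (label a document POSITIVE when its new-edge count falls inside the range calibrated on real folds) is an assumption made only for this example.

```python
import re
from itertools import chain


def bigrams(text):
    """Return the set of adjacent-word pairs (bigrams) found in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    return set(zip(words, words[1:]))


def network(docs):
    """Build a network as the union of bigram edges over all documents."""
    return set(chain.from_iterable(bigrams(d) for d in docs))


def overlap(candidate, training):
    """Fraction of candidate edges that already exist in the training network."""
    return len(candidate & training) / len(candidate) if candidate else 0.0


def new_edges(doc, training):
    """Number of edges one document would add to the training network."""
    return len(bigrams(doc) - training)


def calibrate(folds, training):
    """Lowest and highest mean new-edge count per document across real folds."""
    means = [
        sum(new_edges(d, training) for d in fold) / len(fold) for fold in folds
    ]
    return min(means), max(means)


def classify(doc, training, low, high):
    """Assumed rule: POSITIVE (real) inside the calibrated range, else NEGATIVE."""
    return "POSITIVE" if low <= new_edges(doc, training) <= high else "NEGATIVE"
```

In use, `network` would be called once on the training publications, `calibrate` on the folds of real publications to obtain a range comparable in spirit to the reported 21.96-24.93 edges, and `classify` on each candidate article. A production version would likely filter bigrams by TF-IDF weight before building the network, as the abstract indicates.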
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.