Submitted: 05 August 2024
Posted: 06 August 2024
Abstract
Keywords:
1. Introduction
1.1. Preprocessing:
1.2. Parsing:
1.3. Stop Word Removal:
1.4. Stemming:
1.5. Cascade Filter:
- This study proposes a novel approach based on sentence-level hash-code extraction and a thresholding mechanism to identify duplicate web pages.
- An adaptive threshold is employed, allowing the proposed model to be effective in both large- and small-scale settings.
- Benchmark datasets, including collections of Shakespeare’s works, free text, job descriptions, and Reuters-21578, are used to test the proposed approach.
- The proposed technique achieves an accuracy of 0.99 and an F1-score of 0.97, outperforming the existing methods it is compared against.
2. Literature Review
| Paper | Title | Algorithm / Model | Segmentation | Results |
|---|---|---|---|---|
| 1 | Near-Duplicate Detection in Web App Model Inference | Simhash | Threshold-based | F1 = 0.45 |
| 2 | Allign: Aligning All-Pair Near-Duplicate Passages in Long Texts | Allign/min-hash | Window pairs of size O(n) | F1-score from 0.595 to 0.672 |
| 3 | CopyCat: Near-Duplicates Within and Between the ClueWeb and the Common Crawl | Simhash/ChatNoir-CopyCat-21 | Page based | F1 = 0.94 |
| 4 | An improved Simhash algorithm based malicious mirror website detection method | Simhash | 128-bit strings | Indicates degree of similarity |
| 5 | Online Near-Duplicate Detection of News Articles | Shingling/min-hash | N-grams N= 3,4 | F1 = 0.955 for 3-gram |
| 6 | A Duplication Reduction Approach for Unstructured Data using Machine Learning Method | SIFT features for images | 8x8 grids | Effective comparison |
| 7 | Research on Information Retrieval Algorithm Based on TextRank | Semi-supervised learning/TextRank | 5 and 15 GB blocks | Up to 56% hit rate |
| 8 | Near-duplicate handwritten document detection without text recognition | Series analysis/DTW | Fast DTW | Recall 87-96% for DTW |
| 9 | Normalization of duplication records from multiple sources | Weighted Borda | Record, field and value level | N/A |
| 10 | Data De-Duplication Engine for Efficient Storage Management | De-dupe engine | 128 KB chunks | N/A |
| 11 | Web Information Retrieval Using Island Genetic Algorithm | Island genetic algorithm | Multiple | Similarity measure greater than 0.8 |
| 12 | A near-duplicate detection algorithm to facilitate document clustering | Simhash | Words segmentation | Similarity less than 60% allowed |
| 13 | Near-duplicate web page detection: an efficient approach using clustering, sentence feature and fingerprinting | K-mode clustering, fingerprint extraction | Sentence level | F1 = 0.80 |
| 14 | Efficient near duplicate document detection for specialized corpora | Simhash | Page level | N/A |
| 15 | Informational Retrieval on the Web | Boolean models | Words segmentation | N/A |
| 16 | Challenges in Web Information Retrieval | SVM | Whole Page | An Overview |
| 17 | Information Retrieval on the Web and its Evaluation | Shingling | Canonical sequence of tokens | Precision = 0.8, Recall = 0.05 |
| 18 | Fuzzy logic based similarity measure for information retrieval system performance improvement | FLBSM, Cosine, Euclidean and Okapi | Page level | Best by FLBSM = 0.13 |
| 19 | Web searching and information retrieval | Vector space model | Page level | Web is not a Digital Library |
| 20 | Webpage relationships for information retrieval within a structured domain | Hyperlink Structure | Page level | Prec@1 for retrieval = 0.668 |
| 21 | Preface to special issue on user modeling for web information retrieval | WIFS | Keywords | N/A |
| 22 | Link analysis in web information retrieval | Hyperlink | Words segmentation | ranking query |
| 23 | A PSO Algorithm Based Web Page Retrieval System | PSO algorithm | | Avg Accuracy 91% |
| 24 | State of the art in Web Information Retrieval | Boolean models, Fuzzy Model, Vector Space Models and Probabilistic Models | Words segmentation | Temporal analysis is supported |
| 25 | Research on information retrieval model based on ontology | Domain ontology model | Words segmentation | Threshold of 0.55 gives Precision and Recall = 95% |
| 26 | After the Dot-Bomb | Classification | Index segmentation | Theoretical |
| 27 | What is this page known for? Computing Web page reputations | Random Walk | Term on page | Needed improvement |
| 28 | Learning to Understand the Web | Hidden Markov model | Words segmentation | N/A |
| 29 | Context in Web Search | Context model | Keywords | Context is better than one-size-fits-all |
| 30 | Next Generation Web Search: Setting Our Sites | Hyperlinks | Word Segmentation | N/A |
3. Proposed Methodology
3.1. Datasets
- Free text: During development, the model was tested on free-form text, such as clean paragraphs taken from Wikipedia or any other writing already in a clean, normalized form.
- Shakespeare: For model evaluation, scenes are taken from Shakespeare’s plays, which are available online. All three categories of the plays (comedies, histories, and tragedies) are included in this dataset. In total there are 34,895 sentences consisting of 884,421 words, of which 28,829 are unique.
- Job Description: A new dataset [36] is also used in the experiments and is tested both before and after normalization. It contains HTML tags, dashes, extra white space, and similar noise, and therefore required thorough preprocessing.
- Reuters-21578: The Reuters-21578 dataset [37] is a well-known benchmark for duplicate detection because it is available with ground truth. It is a document collection consisting of news articles. The original collection contains 10,369 articles and 29,930 unique words. The dataset has three splits, each created by a different author. Ground truth is available for the Lewis split, which is originally distributed in SGM format; this split is therefore used to evaluate the proposed technique.
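
The kind of normalization the Job Description dataset calls for can be illustrated with a minimal sketch. It assumes that removing HTML tags, decoding HTML entities, replacing dashes with spaces, and collapsing white space is sufficient; the function name `normalize` and the exact rules are illustrative assumptions, not the preprocessing used in this study.

```python
import html
import re

# Hypothetical cleaning rules (assumptions for illustration only).
TAG_RE = re.compile(r"<[^>]+>")            # anything that looks like an HTML tag
DASH_RE = re.compile(r"[-\u2010-\u2015]+")  # ASCII hyphen and common Unicode dashes
SPACE_RE = re.compile(r"\s+")              # runs of white space, including newlines


def normalize(raw: str) -> str:
    """Strip tags, decode entities, drop dashes, and collapse white space."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    text = DASH_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


sample = "<p>Senior  Data-Engineer &amp; Analyst</p>\n<br/> -- Full   time "
print(normalize(sample))  # Senior Data Engineer & Analyst Full time
```

Running the detector on both the raw and the normalized text, as described for this dataset, would then show how much of the noise affects the sentence hashes.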

3.2. Preprocessing:

3.3. Hash Value Comparison:


3.4. Duplicate Detection:
Algorithm 1: Text to Hash
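
A sentence-level text-to-hash step followed by a threshold check can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not the listing of Algorithm 1: SHA-256 is assumed as the member of the Secure Hash Algorithm family, the sentence splitter is a naive regular expression, the overlap measure is one plausible choice, and the fixed default threshold of 0.8 is a placeholder standing in for the adaptive threshold. All function names are hypothetical.

```python
from __future__ import annotations

import hashlib
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by white space."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def text_to_hashes(text: str) -> set[str]:
    """Map each sentence to a SHA-256 hex digest (assumed hash function)."""
    hashes = set()
    for sentence in split_sentences(text):
        # Lower-case and collapse white space so trivial variations match.
        canonical = " ".join(sentence.lower().split())
        hashes.add(hashlib.sha256(canonical.encode("utf-8")).hexdigest())
    return hashes


def overlap_ratio(a: set[str], b: set[str]) -> float:
    """Shared sentence hashes relative to the smaller document."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def is_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    """Flag a pair as duplicate when the overlap reaches the threshold.

    The fixed default is a placeholder; the adaptive rule is not modeled here.
    """
    return overlap_ratio(text_to_hashes(doc_a), text_to_hashes(doc_b)) >= threshold


doc_a = "The cat sat. The dog barked. It rained."
doc_b = "the cat sat.  The dog barked. It snowed."
print(is_duplicate(doc_a, doc_b))                  # False: 2 of 3 sentences match
print(is_duplicate(doc_a, doc_b, threshold=0.6))   # True
```

Because a cryptographic hash changes completely when a sentence changes by a single character, matching is exact at the sentence level; tolerance for partial edits comes only from the threshold on the share of matching sentences.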
3.5. Evaluation Metrics
4. Results

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | 388 | 12 |
| Actual Negative | 7 | 10 |

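
The standard metrics can be derived from the confusion matrix above with a short sketch. It assumes the cells read as TP = 388, FN = 12, FP = 7, and TN = 10 (rows are actual classes, columns are predicted classes); the function name is illustrative.

```python
def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts as read from the confusion matrix (assumed cell assignment).
metrics = classification_metrics(tp=388, fn=12, fp=7, tn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```
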
| Source ID | Duplicate IDs | TP | FN |
|---|---|---|---|
| 519 | 11422,1120 | 2 | 0 |
| 522 | 3164,7769,3735 | 3 | 0 |
| 3729 | 6044,10859,9972 | 2 | 1 |
| 5344 | 9857 | 1 | 0 |
| 7025 | 1969 | 0 | 1 |
| 7204 | 8343,7764 | 2 | 0 |
| 10459 | 2678 | 1 | 0 |
| 12456 | 1971,12471 | 2 | 0 |
| 5123 | 5281 | 1 | 0 |
| 16090 | 16199 | 0 | 1 |
| 16094 | 16357 | 1 | 0 |
| 16624 | 6236 | 1 | 0 |
5. Discussion
| Dataset | No. of Docs | Accuracy | Precision | Recall | F-measure |
|---|---|---|---|---|---|
| Shakespeare | 34 | 0.98 | 0.93 | 0.99 | 0.96 |
| Job Desc. | 50 | 0.97 | 0.98 | 0.99 | 0.98 |
| Reuters | 316 | 0.97 | 0.99 | 0.99 | 0.99 |
| Average | 400 | 0.97 | 0.98 | 0.99 | 0.99 |
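
The Average row appears consistent with a mean weighted by the number of documents per dataset rather than a simple mean of the three rows (a simple mean of the precision column, for instance, would round to 0.97, not 0.98). The following sketch checks this reading using only the values in the table above; the variable names are illustrative.

```python
# (documents, accuracy, precision, recall, f_measure) per dataset, from the table.
rows = {
    "Shakespeare": (34, 0.98, 0.93, 0.99, 0.96),
    "Job Desc.": (50, 0.97, 0.98, 0.99, 0.98),
    "Reuters": (316, 0.97, 0.99, 0.99, 0.99),
}

total_docs = sum(r[0] for r in rows.values())


def weighted_mean(column: int) -> float:
    """Document-weighted mean of one metric column (1 = accuracy ... 4 = F)."""
    return sum(r[0] * r[column] for r in rows.values()) / total_docs


averages = [round(weighted_mean(c), 2) for c in range(1, 5)]
print(total_docs, averages)  # 400 [0.97, 0.98, 0.99, 0.99]
```
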

| Title | Dataset | Algorithm / Model | Segmentation | F1-Score |
|---|---|---|---|---|
| Near-Duplicate Detection in Web App Model Inference [31] | Randomly crawled websites | Simhash | Threshold-based | 0.45 |
| Allign: Aligning All-Pair Near-Duplicate Passages in Long Texts [38] | Pan 11 and News | Allign/min-hash | Window pairs of size O(n) | 0.59 - 0.67 |
| CopyCat: Near-Duplicates Within and Between the ClueWeb and the Common Crawl [32] | ClueWeb09, ClueWeb12 | Simhash/ChatNoir-CopyCat-21 | Page based | 0.94 |
| Online Near-Duplicate Detection of News Articles [35] | SpotSigs dataset | Shingling/min-hash | N-grams N= 3,4 | 0.95 |
| Proposed Technique | Shakespeare acts, Job description (SpotSigs), Reuters-21578 | Secure Hash Algorithm | Sentence level | 0.97 |
6. Conclusion
7. Future Work and Direction
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Pamulaparty, L.; Rao, C.G.; Rao, M.S. A near-duplicate detection algorithm to facilitate document clustering. International Journal of Data Mining & Knowledge Management Process 2014, 4, 39.
2. Kumar, J.P.; Govindarajulu, P. Near-duplicate web page detection: an efficient approach using clustering, sentence feature and fingerprinting. International Journal of Computational Intelligence Systems 2013, 6, 1–13.
3. Naseer, A.; Tamoor, M.; Azhar, A. Computer-aided COVID-19 diagnosis and a comparison of deep learners using augmented CXRs. Journal of X-ray Science and Technology 2022, 30, 89–109.
4. Tamoor, M.; Younas, I. Automatic segmentation of medical images using a novel Harris Hawk optimization method and an active contour model. Journal of X-ray Science and Technology 2021, 29, 721–739.
5. Pokorny, J. Web searching and information retrieval. Computing in Science & Engineering 2004, 6, 43–48.
6. Chughtai, I.T.; Naseer, A.; Tamoor, M.; Asif, S.; Jabbar, M.; Shahid, R. Content based image retrieval via transfer learning. Journal of Intelligent & Fuzzy Systems 2023, 44, 8193–8218.
7. Naseer, A.; Zafar, K. Comparative analysis of raw images and meta feature based Urdu OCR using CNN and LSTM. International Journal of Advanced Computer Science and Applications 2018, 9.
8. Naseer, A.; Zafar, K. Meta features-based scale invariant OCR decision making using LSTM-RNN. Computational and Mathematical Organization Theory 2019, 25, 165–183.
9. Naseer, A.; Hussain, S.; Zafar, K.; Khan, A. A novel normal to tangent line (NTL) algorithm for scale invariant feature extraction for Urdu OCR. International Journal on Document Analysis and Recognition (IJDAR) 2022, 25, 51–66.
10. Nasreen, G.; Haneef, K.; Tamoor, M.; Irshad, A. A comparative study of state-of-the-art skin image segmentation techniques with CNN. Multimedia Tools and Applications 2023, 82, 10921–10942.
11. Wali, A.; Ahmad, M.; Naseer, A.; Tamoor, M.; Gilani, S. Stynmedgan: medical images augmentation using a new GAN model for improved diagnosis of diseases. Journal of Intelligent & Fuzzy Systems 2023, 44, 10027–10044.
12. Rafiei, D.; Mendelzon, A.O. What is this page known for? Computing web page reputations. Computer Networks 2000, 33, 823–835.
13. Cohen, W.W.; McCallum, A.; Quass, D. Learning to understand the web. IEEE Data Eng. Bull. 2000, 23, 17–24.
14. Naseer, A.; Zafar, K. Meta-feature based few-shot Siamese learning for Urdu optical character recognition. Computational Intelligence 2022, 38, 1707–1727.
15. Cabanac, G.; Chevalier, M.; Chrisment, C.; Julien, C.; Soulé-Dupuy, C.; Tchienehom, P.L. Web information retrieval: Towards social information search assistants. In Social information technology: Connecting society and cultural issues; IGI Global, 2008; pp. 218–252.
16. Chuklin, A.; Markov, I.; De Rijke, M. Click models for web search; Springer Nature, 2022.
17. Cambazoglu, B.B.; Baeza-Yates, R. Scalability challenges in web search engines; Springer Nature, 2022.
18. Yu, B. Research on information retrieval model based on ontology. EURASIP Journal on Wireless Communications and Networking 2019, 2019, 1–8.
19. Lawrence, S. Context in web search. IEEE Data Eng. Bull. 2000, 23, 25–32.
20. Deo, A.; Gangrade, J.; Gangrade, S. A PSO Algorithm Based Web Page Retrieval System. Proceedings of Recent Advances in Interdisciplinary Trends in Engineering & Applications (RAITEA), 2019.
21. Gupta, Y.; Saini, A.; Saxena, A.; Sharan, A. Fuzzy logic based similarity measure for information retrieval system performance improvement. International Conference on Distributed Computing and Internet Technology. Springer, 2014, pp. 224–232.
22. Kobayashi, M.; Takeda, K. Informational Retrieval on the Web. IBM Japan 2000, 47.
23. Arora, M.; Kanjilal, U.; Varshney, D. Challenges in Web Information Retrieval. In Innovations in Computing Sciences and Software Engineering; Springer, 2010; pp. 141–146.
24. Garg, D.; Sharma, D. Information Retrieval on the Web and its Evaluation. International Journal of Computer Applications 2012, 40, 26–31.
25. Tam, V.W.; Shepherd, J. Webpage relationships for information retrieval within a structured domain. Proceedings of the 21st ACM conference on Hypertext and hypermedia, 2010, pp. 307–308.
26. Henzinger, M.R.; et al. Link analysis in web information retrieval. IEEE Data Eng. Bull. 2000, 23, 3–8.
27. Hearst, M.A. Next generation web search: Setting our sites. IEEE Data Eng. Bull. 2000, 23, 38–48.
28. Brusilovsky, P.; Tasso, C. Preface to special issue on user modeling for web information retrieval. User Modeling and User-Adapted Interaction 2004, 14, 147–157.
29. Xu, C. Research on Information Retrieval Algorithm Based on TextRank. 2019 34th Youth Academic Annual Conference of Chinese Association of Automation (YAC). IEEE, 2019, pp. 180–183.
30. Qian, L.; Yu, J.; Zhu, G.; Mei, F.; Lu, W.; Ge, B.; Wang, L.; Mei, Z.; Pang, H.; Xu, M.; et al. A Duplication Reduction Approach for Unstructured Data Using Machine Learning Method. 2019 International Conference on Intelligent Computing, Automation and Systems (ICICAS). IEEE, 2019, pp. 515–519.
31. Yandrapally, R.; Stocco, A.; Mesbah, A. Near-duplicate detection in web app model inference. Proceedings of the ACM/IEEE 42nd international conference on software engineering, 2020, pp. 186–197.
32. Fröbe, M.; Bevendorff, J.; Gienapp, L.; Völske, M.; Stein, B.; Potthast, M.; Hagen, M. CopyCat: Near-Duplicates within and between the ClueWeb and the Common Crawl. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 2398–2404.
33. Chen, G.; Chen, G.; Wu, D.; Liu, Q.; Zhang, L.; Fan, X. An improved Simhash algorithm based malicious mirror website detection method. Journal of Physics: Conference Series. IOP Publishing, 2021, Vol. 1971, p. 012067.
34. Seshasai, S. Efficient near duplicate document detection for specialized corpora. PhD thesis, Massachusetts Institute of Technology, 2009.
35. Rodier, S.; Carter, D. Online near-duplicate detection of news articles. Proceedings of The 12th Language Resources and Evaluation Conference, 2020, pp. 1242–1249.
36. Burk, H.; Javed, F.; Balaji, J. Apollo: Near-duplicate detection for job ads in the online recruitment domain. 2017 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2017, pp. 177–182.
37. Reuters-21578 Dataset.
38. Feng, W.; Deng, D. Allign: Aligning all-pair near-duplicate passages in long texts. Proceedings of the 2021 International Conference on Management of Data, 2021, pp. 541–553.



Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).