An Empirical Study of Deep Web based on Graph Analysis

The internet can broadly be divided into three parts: the surface web, the deep web and the dark web, of which the last offers anonymity to its users and hosts. The Dark Web refers to an encrypted network that is not indexed by search engines such as Google. Users must use Tor to visit sites on the dark web. Ninety-six percent of the web is considered deep web because it is hidden from search engines. It is like an iceberg: people can see only the small portion above the surface, while the largest part is hidden under the sea. Basic methods of graph theory and data mining that deal with social network analysis can be used together to understand the Deep Web and detect cyber threats. Since the internet is evolving rapidly and it is nearly impossible to censor the deep web, there is a need to develop standard mechanisms and tools to monitor it. In this proposed study, our focus will be to develop a standard research mechanism for understanding the Deep Web, which will support researchers, academicians and law enforcement agencies in strengthening social stability and ensuring peace locally and globally.


Introduction
The Dark Web, a conglomerate of services hidden from search engines and regular users, is used by cyber criminals to offer all kinds of illegal services and goods [35]. Cybercriminal activity on the dark web can be considered one of the critical problems for societies around the world [5]. Web mining techniques such as content analysis and structure analysis can be useful for detecting and averting terrorist threats all over the world [7]. Social network analysis (SNA) is now used to study a variety of economic and organizational phenomena and processes [6, 8, 9, 10], and it is used effectively to counter money laundering, identity theft, online fraud, cyber-attacks, and other crimes. In particular, SNA methods are used in the investigation of many illegal operations involving securities and investments, and in the prevention of riots [6, 11, 12]. Graph theory has long been a favored tool for analyzing social relationships [13,14] as well as for quantifying engineering properties such as searchability [13,15]. For both reasons, there have been numerous graph-theoretic analyses of the World Wide Web (WWW), from the seminal [13, 16-20] to the modern [13,21]. Graph theory can likewise be used to analyze social relationships on the dark web [13].
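As a minimal sketch of the kind of graph-theoretic measure referred to above, the following Python snippet computes normalized in-degree and out-degree centrality for a small directed hyperlink graph. The site names and edges are hypothetical placeholders chosen for illustration, not crawled data, and degree centrality is only one of many measures that could be used.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return (in_centrality, out_centrality) dicts for a directed graph.

    Each value is the node's in- or out-degree divided by (n - 1), the
    usual normalisation. Assumes the edge list has no duplicate edges.
    """
    nodes = set()
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in edges:
        nodes.update((src, dst))
        out_deg[src] += 1
        in_deg[dst] += 1
    scale = max(len(nodes) - 1, 1)
    in_c = {n: in_deg[n] / scale for n in nodes}
    out_c = {n: out_deg[n] / scale for n in nodes}
    return in_c, out_c


# Hypothetical hyperlink edges: (source site, target site).
EDGES = [
    ("siteA.onion", "hub.onion"),
    ("siteB.onion", "hub.onion"),
    ("siteC.onion", "hub.onion"),
    ("hub.onion", "siteA.onion"),
]

in_c, out_c = degree_centrality(EDGES)
most_linked = max(in_c, key=in_c.get)  # node with the highest in-degree
print(most_linked, in_c[most_linked])
```

A node with high in-degree centrality is linked to by many other sites, which is one simple way to flag candidate hubs for closer inspection.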
SNA [34] is a graph-based method for analyzing social relationships and their impact on individual behavior and organizational structure. It was developed by sociologists and has been applied in many academic fields, such as epidemiology and computer-mediated communication (CMC). After the captured data have been classified and clustered, the characteristics of particular participants can be extracted. Through social network analysis, the mode of social interaction with other cybercrime actors, the type of content published, and the frequency with which participants discuss given topics can be obtained [27].
The dark web provides anonymity by means of onion routers, which encrypt communication and bounce it through a network of relays run by volunteers around the world [5]. The United States Naval Research Laboratory developed the onion router (Tor) to provide anonymity and to protect sensitive information and networks. The Tor program was released to Internet users in 2004 [5]. It provides privacy and encryption and directs Internet traffic through a series of virtual tunnels, and it can help users to reach blocked content and destinations. Tor website addresses end with .onion, whereas other web domains end with .net, .com, .edu, .org, etc., and they can be opened only with the Tor software [5,23]. Other programs that provide anonymity with encryption include ZeroNet, GNUnet, FAI (Free Anonymous Internet), and Freenet [5,23].

Research Problem
One area that has not received adequate attention in the vast academic literature on extremist movements and their use of the Internet is the Dark Web, whose websites are vaguely assumed to work as hubs for terrorists, drug traffickers, and gangs [49]. With the rise of technology, cyber criminals are becoming increasingly empowered, whereas law enforcement agencies lack adequate resources and technologies to fight cybercrime and monitor activities on the dark web. One of the primary challenges that the Dark Web poses to national security professionals is separating the "noise" from issues of legitimate national security concern. With annual cybercrime revenue estimated at approximately $1.5 trillion and an estimated 7,000-30,000 Tor sites in existence, knowing where to look requires us to bound our focus to specific subject areas [50]. To find the latest research on the dark web, an online query was executed on IEEE Xplore. The query returned only 250 resources on the dark web: 111 conference papers, 96 journal articles, 23 magazines, 13 books and 7 other resources. We therefore argue that the deep web needs more academic attention if cybercrime is to be fought in this information age. By conducting research on the "Deep Web", our primary focus is to deliver a comprehensive road map for fighting cybercrime and to devise new strategies for monitoring the deep and dark web while developing standard software systems.

Related Research Study
The Latent Dirichlet Allocation (LDA) technique was applied by [25] to discover latent topics in the content of dark Web pages. LDA is a generative model that detects topics in a text corpus by estimating the likelihood of each document, and it treats words and documents as exchangeable. Finding threatening topics can assist in detecting key members of a community. The work in [26] extracts key group members by using LDA to find terror-related topics, integrating LDA into dark Web portals to enhance Social Network Analysis (SNA). The method can help to measure how radical a member is and to classify members as experts or key members based on the selected topic. This work is limited to dark websites that use English as the language of communication, and it is based on only one forum.
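To make the idea of topic discovery more concrete, the sketch below fits LDA to a toy corpus with collapsed Gibbs sampling, one common inference scheme for the model. It is an illustration under stated assumptions, not the procedure of [25] or [26]: the documents are invented, and the number of topics, the priors `alpha` and `beta`, and the iteration count are arbitrary choices.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents.

    Returns (theta, topic_word): per-document topic proportions and
    per-topic word counts. alpha and beta are symmetric Dirichlet priors.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word


# Invented toy posts, already tokenised.
DOCS = [
    "exploit malware payload exploit".split(),
    "malware payload botnet exploit".split(),
    "market vendor escrow shipping".split(),
    "vendor escrow market price".split(),
]

theta, topic_word = lda_gibbs(DOCS, num_topics=2, iters=50)
for t, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print("topic", t, top)
```

Each row of `theta` gives the estimated share of each topic in one document; members whose posts load heavily on a topic of interest could then be passed to an SNA step.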
Zhang Xuan, a member of the Shandong Police College, and Professor Jinpei ou, Secretary of the Department of Information Security and Cryptography (CISC) of the University of Hong Kong, co-published the Dark Net Threat Intelligence Analysis Framework, which proposes the concept of a hidden threat intelligence analysis framework to help analyze crime traces in the dark network [27,28].
In a recent work [29,32], Qin et al. performed an empirical study of different global extremist organizations on the Web and showed how sophisticated their propagation of ideology is. Several studies have focused on sentiment analysis, opinion mining and affect analysis of user posts in Web forums [30], and the discovery of user roles and the ties between users has also been examined [31].
Yang et al. [33] proposed a spectral-coherence-based clustering approach to identify dark Web clusters, which takes the temporal coherence of user activity, rather than content or links, as the primary information. They represented a group of users as an m-dimensional multivariate process, from which the spectral density matrix is derived; a spectral coherence score is then computed to identify the clusters [32].
Pastrana et al. [36] recently built a system that looks at cybercrime outside the Dark Web. The authors discuss challenges in crawling underground forums and analyze four English-speaking communities on the Surface Web. In contrast, Nunes et al. [37] mine Dark Web and Deep Web forums and marketplaces for cyber threat intelligence. They show that it is possible to detect zero-day exploits, map user/vendor relationships and conduct topic classification on English-language forums, results that have also been reproduced with BlackWidow [35].
Al-Nabki et al. [40] presented a classification pipeline, based on web text content, for illegal activities on the Tor darknet. They used two well-known text representation techniques (Term Frequency-Inverse Document Frequency and Bag-of-Words) together with three supervised classifiers (Logistic Regression, SVM, and Naive Bayes). With the help of Uniform Resource Locators (URLs), Kan et al. [41] classified web pages by extracting features from URLs segmented into tokens using information-theoretic measures. Noor et al. [42] proposed an automatic deep web classification technique, named "Query Probing", in which content is extracted from deep web data sources. Besides query probing, "Visible Form Features" are commonly used with supervised learning algorithms [39].
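The text representations and one of the classifiers named above can be illustrated with a short, self-contained sketch. It computes an unsmoothed TF-IDF weighting and trains a multinomial Naive Bayes classifier with Laplace smoothing on Bag-of-Words counts. The training texts and the labels "illegal" and "legal" are invented for illustration, and the code is not the pipeline of [40], which should be consulted for the actual features and settings.

```python
import math
from collections import Counter


def tfidf(docs):
    """Unsmoothed TF-IDF: (term count / doc length) * log(N / doc frequency)."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return [
        {w: (c / len(doc)) * math.log(n / df[w]) for w, c in Counter(doc).items()}
        for doc in docs
    ]


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on bag-of-words counts."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    return vocab, class_docs, word_counts


def predict_nb(model, doc):
    """Return the class with the highest log-posterior for a tokenised doc."""
    vocab, class_docs, word_counts = model
    total_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for c in sorted(class_docs):
        total_words = sum(word_counts[c].values())
        score = math.log(class_docs[c] / total_docs)
        for w in doc:
            if w in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing over the vocabulary.
                score += math.log(
                    (word_counts[c][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best, best_score = c, score
    return best


# Invented, hand-labelled toy examples.
TRAIN_DOCS = [
    "buy stolen credit cards cheap".split(),
    "stolen accounts and cards for sale".split(),
    "privacy guide for journalists".split(),
    "forum rules and privacy policy".split(),
]
TRAIN_LABELS = ["illegal", "illegal", "legal", "legal"]

weights = tfidf(TRAIN_DOCS)
model = train_nb(TRAIN_DOCS, TRAIN_LABELS)
print(predict_nb(model, "stolen cards sale".split()))
print(predict_nb(model, "privacy policy guide".split()))
```

In a fuller pipeline the TF-IDF vectors would typically feed a classifier such as Logistic Regression or an SVM, while Naive Bayes works directly on the raw counts as shown here.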
Nunes et al. [43] discovered 16 zero-day exploits by monitoring forum posts in darknet marketplaces. To reduce the need for labeled training data, their binary classification method combined supervised with semi-supervised classifiers (e.g., Label Propagation and Co-Training). In [44], unsupervised k-means clustering was applied to character-level n-gram features, partitioning Dark Web marketplace products into 34 clusters.
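The clustering step described above can be sketched as follows: product titles are turned into character-level n-gram count vectors and grouped with a plain k-means loop. This is a simplified illustration, not the implementation of [44]; the product titles are invented, and the choices of trigrams, `k = 2`, squared Euclidean distance and random initialization are assumptions made for brevity.

```python
import random
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def kmeans(vectors, k, iters=20, seed=0):
    """Basic k-means on sparse count dicts; returns one label per vector."""
    rng = random.Random(seed)
    dims = sorted({key for v in vectors for key in v})
    points = [[v.get(d, 0) for d in dims] for v in vectors]
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum(
                    (a - b) ** 2 for a, b in zip(p, centroids[c])
                ),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented product titles for illustration only.
PRODUCTS = [
    "remote access trojan builder",
    "remote access trojan source",
    "stolen credit card dump",
    "credit card dump fresh",
    "remote access trojan builder",
]

features = [char_ngrams(title) for title in PRODUCTS]
labels = kmeans(features, k=2)
print(labels)
```

Character n-grams are robust to misspellings and obfuscated product names, which is one plausible reason for preferring them to word-level features on marketplace listings.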
Thomas et al. analyzed how cybercriminals communicate and what they exchange in forums [45]. Pastrana et al. focused on finding cybercrime actors in a large underground forum [46]. To evaluate private interactions, Overdorf et al. developed a method for automatically labelling threads that are likely to trigger private messages [47]. These studies were used to explore the market of underground forums and the social relationships of their members.
Masashi et al. [48] conducted a study on efficiently extracting threat intelligence from the dark web by using machine learning as an "active defense" against cyber-attacks. Noting that a myriad of forums now operate on the dark web, they also proposed a method to identify the characteristics of these forums. Their experiment showed that "doc2vec", a neural-network-based tool, performs well as a method of natural language processing and feature extraction for machine learning. A multilayer perceptron (MLP) achieved classification performance of 90% or more, based on the number of datasets used in the experiment. This indicates that doc2vec vectorization accurately represents the features of the posts and that machine learning is effective for analyzing posts on the dark web [48].

Future Research Scope
Future research on the Deep Web has enormous potential. With the rise of artificial intelligence and web technology, research on the Deep Web could be the next game changer. Furthermore, this area of research involves emerging fields such as big data and advanced intelligent computing. More precisely, it will play a significant role in global e-governance.