A Multi-Object Simulated Annealing Approach to Identify Likely Functional Genomic pairs

Metaheuristic algorithms have been frequently using to tackle optimization problems, however such algorithms in the analysis of health-related data is not commonly used as developing metaheuristic algorithms that work well on health-related data is a difficult task due to complexity of the health data in particular genomics and epigenetics data. One of the important tasks in genomics is to predict genomic elements that are incorporating together to regulate a diseaserelated genes. Predicting such elements are important as they can be used to develop a personalized cure. In this study, we present for the first time, a multiobject simulated annealing algorithm to identify enhancer-promoter like interactions from Hi-C (chromosome conformation capture) data. These regulatory elements can potentially play vital roles as promoters and/or enhancers in appearance and exacerbation of the regulation of gene.s To evaluate the efficiency of the proposed method, we applied our proposed method and traditional methods on the Hi-C data from mice and compared together. Our results show that the interacting elements identified by our new method are more likely to be functional. The source code of the method is publicly available.


Introduction
Medical scientists have always been confronted with issues such as prediction, diagnosis, survival, and relapse of the diseases [1]. However, in the field of genetic diagnosis, gene regulation may be the main factor in the incidence of the disease [2] Gene regulation is an activity which could lead to gene expression or bring it to a halt. Factors such as silencers, enhancers, and promoters influence on gene regulation which are referred to as regulatory elements.
Tools and technologies known to date can identify the regulatory elements along genome. Also, they are capable of understanding the various interactions among the regulatory elements. For instance, the Hi-C technology was introduced by Aidin et al [3], to capture all the chromosome interactions [4,5,6]. The Hi-C method altogether with its tools represents three-dimensional interactions between different points of the genome in the regulatory elements. In other words, it determines what region of the genome interacts with the others. The extensive information obtained in this way provides a lot of knowledge concerning genome interactions.
Using Hi-C, many studies have engaged in detecting promoters-enhancers pairs and how they regulate the genes and how they interact with each other [7,8,9]. During the past decade, a wide range of tools have been proposed to effectively analysis Hi-C data and detect informative interacting pairs in the genome. One of these tools is Homer which was developed by [10]. This tool is capable of detecting valid interactions among the Hi-C data set [10]. Later, HiCUP was introduced in [11]. Similar to Homer, it can perform the detection of valid interactions. It is also able to execute mapping and Allele specific analysis [11]. The three tools of HiCdat [12], HiClib [13], and Hic-Box [14] that were introduced later were able to carry out the operations of mapping, detection of valid interactions, binning, and correction of systematic noise. Different to previously methods, HiC-Pro that was introduced by [15] is able to do parallel implementation of mapping, quality checking and interaction calling. Note that all the tools mentioned above could enumerate the interactions between different points in the regulatory elements.
The methods mentioned above are only able to detect significant interacting pairs through dividing the genome to fix-bin size fragments, which is a timeconsuming task and also identify many non-functional interactions. Here, we used a heuristic based technique based on our previous frameworks [16,17,18,19,20] to extracts the Hi-C interacting pairs and counts the interactions among them through a bi-object algorithm depending on the simulated annealing. Such interacting pairs have the potential to be functional interactions that play a key role in regulating the gene expression. The rest of this article is organized as follows. In the second section we will introduce the data set used in this study as well as our proposed method. In Section 3 we present our results and discussion. Finally, we present our conclusion and future work in the final section.

Dataset
The Hi-C data set used here is the set of mouse embryonic stem cells (ESCs) [1,21]. Specifications of the dataset show in table 1. The length of each fragment used in the dataset is 5kb.

Pre-Processing
We used a recently developed tool MHiC [22] for mapping, filtering and interaction calling of Hi-C data. FASTQ paired-end reads were aligned to the mm9 genome and filtered through the HiC-Pro [15] pipeline developed. To remove duplicate reads, filter for valid interactions, and generate Hi-C interaction matrices (5kb and 10kb bin-sizes), we set the following parameters: MIN_FRAG_SIZE = 230; MAX_FRAG_SIZE = 1100; MIN_INSERT_SIZE = 120; and MAX_INSERT_SIZE = 990. All experimental artifacts, such as circularized reads and re-ligations, singletons have been filtered out using HiC-Pro. We also excluded self-interactions from the downstream analyses. For getting real Hi-C interactions, we used a recently developed background correction model, MaxHiC [8] to identify significant Hi-C interactions. Here, we included those significant interactions with P-value < 0.01, read-count >=5.

The Technique
The simulated annealing (SA) algorithm is a simple and efficient heuristic algorithm which is used in the optimization problems [23,24,25]. The method is employed in metallurgy to reach a state in which the energy of solid matter is minimized [23,25]. This technique includes putting the material at high temperature and then bringing down the temperature, slowly.
The SA algorithm starts at an initial solution and then moves towards the neighboring solutions through a repeating loop. If the neighboring solution would be better than the present solution, the algorithm would set it as the solution (that is, tends to it), otherwise, the algorithm recognizes that as the solution with probability exp(-ΔF/T), where ΔF is the difference between the objective function of the present solution and the neighboring solution, and T is the parameter dubbed as temperature [23]. In each temperature, several repetitions are performed and then the temperature is slowly reduced [24].
The SA is works in the following manner. In the early steps, the temperature is taken very high to be more likely to accept worse answers. With gradual decrease of temperature, there would remain fewer probabilities for acceptance of worse solutions, and hence the algorithm would converge to a favorite solution [26][27][28]. In each step, the SA algorithm regards several cases at the neighborhood of the current mode S, and decides probably whether transforms the system from the S mode or stays the same. These probabilities lead the system eventually to the mode of lower energy level. Approaching to this solution is happens in the following conditions: 1. The new solution is better than the previous one.  [26]. The value of the movement probability function is computed in every step from the relation exp(-ΔF/T). The difference between this equation and the object function is the value for the current and the new solution [26,29,30]. Theoretically, overcoming local optimization, the SA technique is able to find the absolute optimized solution. The main idea in the SA method is to generate a solution at the neighborhood of the current solution and to effectively calculate the variation in the object function ΔF. Then, after saving the best solution, the new solution will be accepted if Δf <=0. Otherwise, if Δf >0 the new solution would be accepted with a probability [24].

Multi-objective detection of interacting pairs problem
Take the Hi-C dataset with n items of data as follows: where is a region of genome in a chromosome. The aim of the problem is to find repetitive items RE1~ RE2, where RE1 and RE2 are as follows: Here, RE1 is an interacting pair of the genome in a chromosome defining a potential promoter, and RE2 is another interacting pair of the genome in a chromosome which defines a potential enhancer. The dataset used in this research is Hi-C which is used to show the interactions among different points of the genome in the cancer patients. Our aim here is to find the interaction RE1~ RE2, where RE1 and RE2 are potential promoters and enhancers elements.

Detection of multi-objective interacting pairs
In this section, our proposed method in detecting and extracting interacting pairs from the Hi-C data. The approach is based on the bi-objective simulated annealing algorithm. To explain the objectives of the algorithm, let RE1~ RE2 be a solution to the problem as it was specified in the previous section.

Criterion
The criterion is defined as the proportion of the sum of all interactions of any item in RE1 with any item in RE2 to the sum of all the interactions of the items in RE1 with any item in Hi-C Data. The value falls in the interval [0, 1]. The higher this value, the better criteria would be attained.

2.4.1.2.Ψ Criterion
The Ψ criterion is defined as the size of the interacting pair detected in the genome. As in the following formula, the Ψ of RE1 is the summation of the size of all ξ belonging to RE1.
The fitness function for two potential promoter and enhancer (RE1, RE2) in a chromosome is, in accordance with equation (3), the proportion of ratio criterion to the sum of all the sizes of RE1 and RE2.
The value of the fitness function belongs to the interval [0,1]. The higher interaction between these interacting pairs, the closer this value would be to 1. In this formula, for the value of fitness function to be maximum, the ratio criterion (the first objective) should be maximum and the length criterion (the second objective) should be minimum. These conditions tell us that for the minimum length sum of the detected interacting pairs in the genome we will have the most interactions. Also, the fitness function being maximum means that the detected interacting pairs must have strong interaction with each other.

The proposed method
In this section, we present our suggested algorithm. The algorithm is a multiobjective model based on SA to detect and extract likely functional interacting pairs in the Hi-C dataset. Besides, it is able to enumerate all the interactions among the interacting pairs. Overcoming the local optimization, the proposed method could find the absolute optimal solution as well. The main idea in this method is to generate a solution at the neighborhood of the current solution and to effectively calculate the variation in the object function Δf. Afterwards, saving the best solution, the new solution will be certainly accepted if Δf<=0. Otherwise, if Δf >0, the new solution would be accepted. In Figure 1, the input of the algorithm would be the Hi-C dataset including the information about the inter-genome interactions in each chromosome, the maximum iteration (MaxIt), and the maximum sub-iteration (SubIt). At the pre-processing stage, the operations of sequencing, mapping, and trimming are done on the input dataset. Then, the initial population is created randomly the same extent as MaxIt to detect and extract the interacting pairs. At the next stage, each individual from the input population would be assessed via the fitness function. In the next step, 10 neighborhoods (the same number as SubIt) around the best individual are generated from the population of the previous stage before each assessed using the function in formulae 3. After that, the best solution found in each repetition with the most favorite assessment function will be saved. Consequently, the best solutions in each iteration represent the optimal detected regulatory elements in the dataset. Finally, all the detected interacting pairs in the RE set will be displayed as the output (Figure 1).  Table 2 shows the parameters set in the proposed method. According to Table  2, we investigated different values for these parameters and choose those that obtained the best results. The fitness for each individual is calculated via the fitness function. After the computations of the fitness function, the individual from among the input population which has the interacting pairs with the best fitness function is considered as the best actual answer. We repeat the following steps the same number of times as Maxit. First, we create the new population of 20 neighbors close to the resulted present optimal solution. Then, the fitness of each is compared to the optimal solution using the assessment function. If the new solution is better than the best one obtained so far, then we take it as the best new solution [25]. Afterwards, the best solution found will be updated followed by the temperature drop coefficient as the final action. This process would be repeated for the same number of the iterations and at each iteration, the optimal solution (including the best detected interacting pairs) will be recorded. The structure of each individual of the input population in the problem of discovering and extracting the promoters/enhancers in the Hi-C dataset is as follows. We put in plain words each part of this structure.

Rand-Seq Right Side
Right Side

Rand-Seq Left Side
Left Side Chr name ChrName: is the name of the chromosome. Left Side: determines the beginning part of the genome containing the interacting pair at the left side of the interaction. Rand-Seq Left Side: is the length of the interacting pair at the left side of the interaction. Right Side: determines the beginning part of the genome containing the interacting pair at the right side of the interaction. Rand-Seq Right Side: is the length of the interacting pair at the right side of the interaction.

Measuring the Performance
The efficiency of the proposed algorithm is evaluated by two decisive factors. The first is the number of interactions between detected interacting pairs, and the second is the value of the fitness function evaluating the rate of inter-genome interactions with regard to the size of genome regions. Figure 3 shows the plot box related to the interactions detected in the suggested method compared to traditional methods (HiC-Pro [15], HiCUP [11]). As shown in this Figure, the number of detected interactions in our proposed method is more than that in traditional methods. The same holds for the number of average interactions detected in the two methods. The following figure confirms that our suggested method behaves more efficiently than the traditional methods strategy in the detection of likely functional Hi-C interactions.     , demonstrates the relation between the number of detected interacting pairs and the distances between them. According to this figure, 50% of the detected interactions occur among those interacting pairs being 0kb apart. Also, 30% of the interactions belong to those which are 5kb away. The interacting pairs being 10kb, 15kb, and 20kb away from each other constitute 15% of the interactions, and the rest 5% comprise those interacting pairs positioned more than 25kb apart. As shown in Figure 5a, the number of interactions between two interacting pairs has a reverse relationship with the distance between them. Figure 5b also illustrates the number of interacting pairs per chromosome in the input Hi-C dataset detected by our proposed method of which the traditional methods have not been able to detect them. Among them, the highest number of the detected interacting pairs is related to chromosome 2, and the least belongs to the two chromosomes 18 and 19. Our proposed method does not detect any new interacting pairs in chromosomes 20, 21, 22, and Y. Table 3 demonstrates three new interacting pairs detected by our proposed method. The interacting pairs in row 1 is the outcome of merging the regions 155,580,000-155,585,000 and 155,585,000-155,590,000 in chromosome 1. The corresponding region interacts with the region 155,565,000-155,570,000 from chromosome 1. The fitness function gives the value 1.10E-05 for the resulted interacting pair. As shown in the second and third row of this table, this value of fitness function is more favorable than the values of those interacting pairs identified by the traditional methods. The row 4 shows the interacting pairs resulted from the merge of the three regions (174,285,000-17,4290,000, 174,290,000-174,295,000, and 174,295,000-174,300,000) of chromosome 2. The merged region interacts with the region 174,235,000-174,240,000 of the same chromosome. The fitness value of this new interacting pairs is also better than the values of those detected by the traditional methods (the rows 5, 6, and 7).

Fitness Function Value
The row 8 of the table is dedicated to the result of merging the four regions of 42,240,000-42,245,000, 42,245,000-42,250,000, 42,250,000-42,255,000, and 42,255,000-42,260,000 from chromosome 10. The resulting region interacts with the region 42,220,000-42,225,000 of chromosome 10. The value attributed to the fitness function at the detected interacting pairs is much better than all the values obtained from the traditional methods (the rows 9, 10, 11, and 12).
The diagram in Figure 6, outlines the average value of fitness function in the proposed method along with enumeration of their interactions per chromosome compared to the traditional methods. As shown in this figure, the average fitness per chromosome is much better in our proposed compared to the traditional methods for determining the interacting pairs. As it was explained earlier in Section 2, the fitness function being better means that the number of interactions among the detected interacting pairs is more than that in the traditional approaches.

Validating the Method and Results
This section compares all detected interacting pairs with the regions reported as promoters by Stefan et al. along with their interactions with other regions [1,21]. Our results illustrated in Figure 7 demonstrates that 70% of the interacting pairs recognized by our proposed method overlap the enhancer-promoter pairs reported in [21], and the rest 30% does not. Figure 7 shows the percentage of overlapping and non-overlapping of the new interacting pairs detected by the suggested method with significant enhancer-promoter regions identified in Stefan et al. This proves the ability of our proposed method in detection and extraction of the interacting pairs.  Figure 7, the discovered areas of the proposed method was able to match 70% with significant promoters. This indicates that the proposed method correctly detects promoter regions.

Conclusion
Here, we proposed a multi-objective approach to identify likely functional genomic interacting pairs. The average value of the fitness function related to the suggested algorithm per chromosome is much better than the traditional method in identifying genomic interacting pairs. Indeed, the suggested algorithm derives those interacting pairs that were not identified by the traditional methods. This "being better" of fitness values entails that the number of interactions among the extracted interacting pairs by the proposed method are more than of the interactions identified by the traditional methods.
As the final point, the detected interacting pairs in our method with the ESCs data went under comparison with the significant enhancer-promoter regions reported by Stefan et al [1,21]. The comparison proved that 70% of the interacting pairs recognized via our suggested method have overlapped the Stephan's enhancer-promoter pairs. This percentage represents the appropriate successfulness of the suggested approach. This means that our technique has strongly enumerated the detected and extracted regulatory elements.
Authors' contributions: MH developed the concept. MH wrote the manuscript; MH and HP revised the manuscript. HP and MH provided the data. MH carried out tool implementation. MH, HP generated all figures and tables. MH, HP performed statistical analyses. All authors have read and approved the final version of the paper.