Parallel simulation of a multi-span DWDM system limited by FWM

Abstract: One of the non-linear phenomena affecting high-bandwidth, long-reach communication systems is Four-Wave Mixing (FWM). Unfortunately, simulating such systems for parameter optimization takes longer as the number of channels increases. In this paper, we propose a new high-performance computational model to obtain the optimal design parameters of a multi-span Dense Wavelength Division Multiplexing (DWDM) system limited by FWM and by the intrinsic Amplified Spontaneous Emission (ASE) noise of the optical amplifiers employed in each segment. The simulation provides a complete optical design characterization and compares the efficiency and speed improvement of the proposed parallelization model against an earlier sequential model. Additionally, we analyse the computational complexity of two parallel implementations: first, Open MultiProcessing (OpenMP), which relies on a multi-core central processing unit, and second, the Compute Unified Device Architecture (CUDA), which relies on a Graphics Processing Unit (GPU). Results show that nested parallelization with CUDA improves simulation performance by up to 40 times over the sequential method and by up to 6 times over the OpenMP implementation using 12 processors. With our parallel implementation, it is possible to simulate a larger number of channels than was practical with the sequential simulation.


Introduction
Non-linear phenomena occur in optical fibre systems, especially when they are used at their maximum [...]. It is a limitation for multi-segment, high-bandwidth and long- [...] Brillouin Scattering (SBS), Self-Phase Modulation (SPM) and Cross-Phase Modulation (XPM); each appears under certain design circumstances. In any design analysis, basic linear effects will also be present, such as fibre attenuation and dispersion, defined by the parameters a and D, respectively. Erbium-Doped Fiber Amplifiers (EDFAs) have been used to compensate for fibre losses and to increase the transmission distance, at the same time increasing amplified spontaneous emission (ASE) noise and nonlinearities [1,2]. Different kinds of optical fibres have been designed to reduce the effects of dispersion and nonlinearity [3][4][5][6]. Numerous studies on non-linear impairments show the importance and complexity of such phenomena for DWDM [7][8][9][10][11][12].

The sequential model presented in [13] can be used on a multi-span DWDM system limited by FWM/ASE to obtain the optimal power ($P_{opt}$) and the maximum link length ($L_{max}$) with respect to the dispersion $D$. The model is highly parallelizable, so different parallelization models can be used to reduce the computational complexity. Parallelization can significantly reduce the processing or execution time of a sequential algorithm [14], including the simulation of non-linear phenomena [15]. Several tools exist to implement parallelization, such as MPI (Message Passing Interface) [16], OpenMP [17] and, more recently, GPU cards [18]. The latter are programmable using the libraries provided by NVIDIA, in particular the CUDA-C language. Other examples using GPU cards are mentioned in [19,20], where the use of this tool significantly improves the execution of algorithms.

In this work we use two schemes to significantly reduce the time complexity of the DWDM simulation and to simulate more channels than in [13]. First, we show a multiprocessing parallel paradigm using OpenMP, since it only uses the processors located in the Central Processing Unit (CPU) and does not require a distributed system as MPI does. Its efficiency has been proven [21], but it is limited by the number of cores in the processor. Second, we use Graphics Processing Units for their cost-benefit ratio [22][23][24]. For CUDA, we use Dynamic Parallelism (DM) [22]. Dynamic parallelism consists of hierarchical kernels in a tree structure, where a thread is defined as a father kernel. This father kernel can generate new child kernels to execute tasks in the GPU context. The height of the tree depends on the characteristics of the GPU card. Finally, an analysis of both implementations using metrics such as speedup, efficiency and performance ratio is presented. The paper is divided into the following sections: Section 2 describes the theory of a

The power dependence of the refractive index is denoted by $\chi_3$, and FWM is described using $\chi_3$. If three optical fields with carrier frequencies $f_1$, $f_2$ and $f_3$ copropagate inside the fibre simultaneously, FWM generates a fourth field whose frequency $f_4$ is related to the other frequencies by $f_4 = f_1 \pm f_2 \pm f_3$. All frequencies correspond- [...] transmitted. The C band is the most commonly used because of its low losses, and the EDFA fits this system [2].

The total spectral width in the C band is $W = f_N - f_1 = 3.75$ THz (30 nm). The multichannel density $\Delta f$ is a function of the total number of channels $N$ and can be expressed as: [...]

The central frequency can be expressed as: [...]

An FWM signal appears by mixing three frequency channels $f_i$, $f_j$ and $f_k$, so that [...]

The power at $f_n$ is given by [11,12,25]: [...]

for $k = 32\pi^{3}\chi_3 / (\eta^{2} c \lambda)$, where $\lambda$ is the wavelength, $c$ is the speed of light in vacuum, $\eta$ is the core refractive index, $a$ is the linear loss coefficient, $P_i$, $P_j$, $P_k$ are the launched input power levels [...] FWM efficiency, which is given by: [...]

where $\Delta\beta_{i,j,k}$ is the phase mismatch, which is expressed as: [...]

$D$ is the fibre chromatic dispersion coefficient.
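As a rough illustration of the channel grid and of how FWM products accumulate on the channels, the sketch below assumes a uniform grid, so that N channels spanning W are separated by W/(N - 1), and assumes the common mixing relation f_n = f_i + f_j - f_k, which on a uniform grid reduces to the index relation n = i + j - k. Neither form is taken from the equations of this paper, and the function names are hypothetical.

```python
C_BAND_WIDTH_HZ = 3.75e12  # W = f_N - f_1, as stated in the text


def channel_spacing(n_channels, width_hz=C_BAND_WIDTH_HZ):
    """Spacing of an assumed uniform grid: N channels leave N - 1 gaps across W."""
    if n_channels < 2:
        raise ValueError("at least two channels are needed to define a spacing")
    return width_hz / (n_channels - 1)


def fwm_products_per_channel(n_channels):
    """Count in-band FWM products landing on each channel index n.

    Assumption: products obey n = i + j - k on a uniform grid.
    Terms with k == i or k == j are skipped because they fall back on an
    original channel without mixing three distinct roles, and each unordered
    pair (i, j) is counted once (i <= j).
    """
    counts = {n: 0 for n in range(1, n_channels + 1)}
    for i in range(1, n_channels + 1):
        for j in range(i, n_channels + 1):
            for k in range(1, n_channels + 1):
                if k == i or k == j:
                    continue
                n = i + j - k
                if 1 <= n <= n_channels:
                    counts[n] += 1
    return counts


if __name__ == "__main__":
    n_channels = 16  # hypothetical channel count, small enough to run instantly
    counts = fwm_products_per_channel(n_channels)
    worst = max(counts, key=counts.get)
    print(f"spacing: {channel_spacing(n_channels) / 1e9:.2f} GHz")
    print(f"most affected channel index: {worst} ({counts[worst]} products)")
```

The triple loop visits on the order of N^3 index combinations, which gives some intuition for why the run time of a sequential evaluation grows quickly with the number of channels.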
In a DWDM system with N number [...] This includes the sum of all cases that satisfy [...] For simplicity, we consider equal transmission power in each channel ($P_S = P_i = P_j = P_k$); therefore, the power of the most affected channel, the one with the highest FWM ($n = w$), can be expressed as: [...]

where $Y$ indicates the maximum value of the summation $\sum \eta_{i,j,k} d_{i,j,k}^{2}$ for average channel as follows: [...]

FWM can be considered an interfering signal, so if we consider a typical maximum value of 20 dB for the optical signal-to-interference ratio at the receiver [1], the maximum power for the most interfered channel is: [...]

This should be compared with the total ASE noise power at the receiver, which increases with the number of amplifiers and can be estimated as [26]: [...]

where $h$ is Planck's constant ($6.626 \times 10^{-34}$ J·s), $f$ is the centre frequency, $G$ is the gain, $B_0$ is the bandwidth of the optical filter, which can be approximated by $2B$, $2\eta_{sp}$ is the amplifier noise factor, whose minimum value is 2 for an ideal amplifier, and $\eta_{sp}$ is the population inversion parameter [1]. For equal amplifier spacing ($M = L/L_a$) and an amplifier gain $G$ that compensates for fibre loss as: [...]

The minimum power per channel that ensures the required SNR (20 dB) can be obtained as: [...]

The FWM effect imposes an upper limit on the power per channel, while ASE imposes a lower limit as the transmission distance increases. Therefore, by intersecting $P$ and $P_m$, the maximum transmission distance $L = L_{max\_fwm}$ is obtained, together with the optimal transmission power per channel $P_{o\_fwm} = P = P_m$. They can be expressed as follows, $L_{max\_fwm}$: [...] and $P_{o\_fwm}$: [...]

Equations (13) and (14) only represent a system limited by FWM-ASE noise. The values given by equation (14) must be lower than the power thresholds of the non-linear phenomena of stimulated Raman [27] and Brillouin [28][29][30][31] scattering.
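The lower power bound set by ASE can be sketched numerically. The block below assumes the common textbook form P_ASE = 2 η_sp h f (G - 1) B_0 M, with G = exp(a L_a) and M = L/L_a. This matches the symbols defined above but is not copied from this paper's equations, and every numeric value and function name is a hypothetical placeholder.

```python
import math

PLANCK_J_S = 6.626e-34  # Planck's constant in J*s


def ase_noise_power_w(total_length_km, amp_spacing_km, loss_per_km,
                      centre_freq_hz, optical_bw_hz, n_sp):
    """Total ASE power after M = L / L_a amplifiers (assumed textbook form).

    loss_per_km is the linear loss coefficient a in 1/km (not dB/km).
    Each amplifier gain is assumed to offset one span: G = exp(a * L_a).
    """
    n_amplifiers = total_length_km / amp_spacing_km
    gain = math.exp(loss_per_km * amp_spacing_km)
    return (2.0 * n_sp * PLANCK_J_S * centre_freq_hz
            * (gain - 1.0) * optical_bw_hz * n_amplifiers)


def min_power_per_channel_w(ase_power_w, snr_db=20.0):
    """Smallest signal power that keeps the signal-to-ASE ratio at snr_db."""
    return ase_power_w * 10.0 ** (snr_db / 10.0)


if __name__ == "__main__":
    # Hypothetical link: 400 km, 80 km spans, about 0.2 dB/km (0.046 1/km),
    # 193.4 THz carrier, 20 GHz optical filter, ideal amplifier (n_sp = 1).
    ase = ase_noise_power_w(400.0, 80.0, 0.046, 193.4e12, 20e9, 1.0)
    print(f"ASE power: {ase:.3e} W")
    print(f"minimum power per channel: {min_power_per_channel_w(ase):.3e} W")
```

Under these assumptions the ASE power, and hence the minimum power per channel, grows linearly with the link length, which is why it eventually meets the FWM-imposed upper limit.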

By analyzing equations (13), (15) and (16), where equation (15) represents the maximum transmission distance due to dispersion and equation (16) is the maximum transmission distance due to the SRS-ASE effect [32,33], we can obtain equation (17) for $D$ as a function of $N$, which limits the system to the FWM-ASE effect only, without the need to analyze both non-linearities [34], so: [...]

Obtaining the graphs for $L_{max\_fwm}(D)$ and $P_{o\_fwm}(D)$ using equations (13), (14) and (17) is time consuming. In Section 3 we describe the main difficulties in obtaining these values using the sequential model. We [...]

Let Eq. (24) to obtain $D_{upper\_limit}$ define the set of intervals of dispersion as

OpenMP uses a master-slave architecture [16]. In a master-slave architecture, the parallelism is geared toward coarse-grained parallelism [37]. [...] grids, and kernels [38]. In CUDA, the GPU is named the device, and the CPU is referred to as the host. To execute a parallel process in the device, first the process must be

The experiments consisted of applying the parallel methods and the sequential method to the parameters shown above ($N = 240$ and $\mu = 144$).

We use the Speedup to measure how much the parallel algorithm outperforms the best sequential algorithm by comparing their execution times [39]. The Speedup is defined as:

$$S_p(n) = \frac{T^{*}(n)}{T_p(n)}$$

where $T^{*}(n)$ is the time taken by the best sequential algorithm to solve a problem of size $n$ and $T_p(n)$ is the time taken by the parallel algorithm with $p$ processors to solve a problem of the same size $n$. Optimal Speedup is defined by $S_p(n) \approx p$ and Efficiency is [...]

Efficiency is a measure of the effective utilization of the $p$ processors [39] in the parallel algorithm. [...] parallel method was up to six times faster than the sequential method using ≥ 8 pro- [...]

The run using GPUs with nested CUDA-C takes 20.58 seconds, which is 6.46 times faster than the best time obtained with the OpenMP implementation and more than 40 times faster than the sequential equivalent. It is possible to decrease the time by 2.6 times using the CUDA SFUs (Special Function Units) for Sin and Sqrt evaluation [40,41]; however, the accuracy of the results will be significantly affected.
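The two metrics above can be computed from measured wall-clock times as in the following minimal sketch. It assumes that Speedup is the best sequential time divided by the parallel time, and that Efficiency takes the usual form of Speedup divided by the number of processors, which the surviving text suggests but does not show. The timings in the usage lines are hypothetical, not measurements from this work.

```python
def speedup(t_best_sequential_s, t_parallel_s):
    """Speedup of a parallel run relative to the best sequential run."""
    if t_best_sequential_s <= 0 or t_parallel_s <= 0:
        raise ValueError("execution times must be positive")
    return t_best_sequential_s / t_parallel_s


def efficiency(t_best_sequential_s, t_parallel_s, processors):
    """Assumed form: Speedup divided by the number of processors p."""
    if processors < 1:
        raise ValueError("at least one processor is required")
    return speedup(t_best_sequential_s, t_parallel_s) / processors


if __name__ == "__main__":
    # Hypothetical timings: 600 s sequential, 100 s on 12 processors.
    print(speedup(600.0, 100.0))
    print(efficiency(600.0, 100.0, 12))
```

An efficiency well below 1 for a given processor count indicates that adding processors beyond that point yields diminishing returns.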

Note that the number of CPU processors in our implementation is much lower than the number of GPU processors. Here we use the Performance Ratio (PR), a measure that quantifies CPU speed for a specific quantity of input data (computational task) of size $n$, using a constant number of processors, relative to the total time taken by the GPU on the same input [19]. The Performance Ratio (PR) is defined as follows: [...]

where $T(n,\nu)_{CPU}$ is the time taken by an algorithm using $\nu$ CPU processors to solve a problem of size $n$, and $T(n,\zeta)_{GPU}$ is the time taken by an algorithm using $\zeta$ GPU cores to solve a problem of size $n$.

[...] (17). It can be seen that at a dispersion of 9 ps/km-nm, the maximum permitted distance for the system is approximately 430 km. Figure 6 shows the optimum transmission power: as the dispersion increases, the optimum power also increases, up to 0.11 mW for a dispersion of 9 ps/km-nm.
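The Performance Ratio introduced earlier in this section can be evaluated as in the sketch below, which assumes that PR is the CPU time divided by the GPU time for the same input. The direction of the ratio is an assumption, since the defining equation is not shown here. The CPU time in the usage lines is not a reported measurement: it is only the value implied by the reported 20.58 s GPU run and the reported factor of 6.46.

```python
def performance_ratio(t_cpu_s, t_gpu_s):
    """Assumed form: time on the CPU processors over time on the GPU cores."""
    if t_cpu_s <= 0 or t_gpu_s <= 0:
        raise ValueError("execution times must be positive")
    return t_cpu_s / t_gpu_s


if __name__ == "__main__":
    t_gpu_s = 20.58           # nested CUDA-C run time reported in the text
    t_cpu_s = t_gpu_s * 6.46  # implied best OpenMP time, not a measured value
    print(f"PR = {performance_ratio(t_cpu_s, t_gpu_s):.2f}")
```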

Finally, note that due to restrictions of the algorithm, we could not report results for 240 channels in [13]. Consequently, the greater the number of channels, the longer the execution time.

The following abbreviations are used in this manuscript: