Design and simulation of adaptive PID controller based on fuzzy Q-learning algorithm for a BLDC motor

Reinforcement learning (RL) is an extensively applied method for designing intelligent control systems with high accuracy and good performance. In the present article, the PID controller is considered as the main control strategy for brushless DC (BLDC) motor speed control. To improve performance, the fuzzy Q-learning (FQL) method, a reinforcement learning approach, is proposed to adjust the PID coefficients. A comparison with an adaptive PID (APID) controller is also performed to demonstrate the superiority of the proposed method; the findings show that the proposed method reduces the error and eliminates the overshoot in controlling the motor speed. MATLAB/SIMULINK has been used for modeling, simulation, and control design of the BLDC motor.


Introduction
BLDC motors have many advantages, including high efficiency, high torque-to-weight ratio, fast dynamic response, and long operating life. Hence, BLDC motors are employed in many industrial applications, including electric vehicles and robotics [1]. The PID controller is popular in various industries because of its simple structure and because it requires no in-depth understanding of the system dynamics. On the other hand, it suffers from inadequate adjustment of its gains. When using a PID controller, determining appropriate PID coefficients can be challenging in the presence of nonlinearities and uncertainties such as hysteresis, friction, and payload variations. All these factors can affect the desired speed tracking and control system performance [2]. This issue can be addressed by applying intelligent control approaches, including fuzzy control, adaptive control, and neural networks. Many tuning approaches and discussions are presented in [3]. Moreover, countless research articles in the existing literature focus on auto-tuning PID control [4], APID control [2,5], auto-tuning PID-type fuzzy control [6], etc. In auto-tuning PID and APID control, the parameters of the controller are tuned automatically according to alterations of the process parameters [4,5].
Approaches based on neural networks (NNs) for motor control systems have a powerful ability to overcome system uncertainty and disturbance, but they require more computational capacity and data storage space. These methods usually involve transforming the continuous-time-domain plant into its neural network equivalent [2]. Fuzzy logic control is a powerful intelligent control approach based primarily on control rules, for which the knowledge and specialized skills of professional staff are the key source. Fuzzy control can therefore attain the effect of simulating an automatic control mechanism [7][8][9][10]. However, summarizing the rules of fuzzy control is challenging; once the control rules are set, they are hard to tune online, and there is no capacity for adding new rules [11]. In summary, it is difficult to capture the characteristics of motor control with one particular type of intelligent control, such as PID modification alone or fuzzy control alone; however, a combination of traditional and modern control can achieve a sound control effect, since the two complement each other [12].
A PID compensator combined with model reference adaptive control (MRAC) was proposed by El-Samahy et al. [1]. The findings indicate that the proposed method is robust and that the MRAC performance is much higher than that of a self-tuning fuzzy PID controller, particularly for systems exposed to abrupt disturbances. However, the choice of an appropriate adaptation-rate gain can affect the speed response. In [2], a self-tuning algorithm for PID control based on adaptive interaction theory is presented, which is properly applied in artificial neural networks. This technique does not need the continuous-time-domain plant to be transformed into its corresponding neural network, and it is implemented on the BLDC motor. In [13], the authors built and evaluated an APID controller based on an additional error of the inversed control signal for BLDC motor drives. It is shown that the proposed method provides stronger solutions in the case of discrepancies in fault tolerance, process parameters, and learning capabilities. The core of this approach, however, is a neural network that requires high computation and time usage, making real-time implementation impractical.
This article combines conventional PID control with fuzzy Q-learning to obtain a self-tuning PID-type fuzzy Q-learning control strategy, which is applied to the BLDC motor drive. The strategy retains the benefits of the ordinary PID structure, readily eliminates the system error and overshoot, and possesses excellent robustness to parameter variation and disturbance. This article has the following structure: the BLDC mathematical model is presented in Section 2. The BLDC speed control techniques, including conventional adaptive PID and self-tuning PID-type fuzzy Q-learning adaptive control, are built in Section 3. In Section 4, the numerical results of the simulation are presented and discussed, and the conclusion is given in Section 5.

Mathematical model of BLDC motor
BLDC motors are a class of synchronous motors with permanent magnets. They are operated by DC voltage but have no brushes, so electronic commutation is used to rotate the rotor. Proper electronic commutation requires information about the rotor position. Hall-effect sensors located 120 degrees apart are applied to detect the rotor position; sensorless techniques are an alternative. The BLDC motor commutation is accomplished in six steps at this spacing [13][14][15].
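As an illustration of six-step commutation, the sketch below maps Hall sensor readings to the energized phase pair. The specific Hall-state-to-phase assignment is a common convention, not taken from this paper; real mappings depend on sensor placement and winding order.

```python
# Illustrative six-step commutation lookup (one common convention,
# assumed here for illustration; not taken from the paper).
# Hall sensor state (H1, H2, H3) -> energized phases
# ("X+": high-side switch of phase X on, "Y-": low-side of phase Y on).
COMMUTATION_TABLE = {
    (1, 0, 1): ("A+", "B-"),
    (1, 0, 0): ("A+", "C-"),
    (1, 1, 0): ("B+", "C-"),
    (0, 1, 0): ("B+", "A-"),
    (0, 1, 1): ("C+", "A-"),
    (0, 0, 1): ("C+", "B-"),
}

def commutate(hall_state):
    """Return the phase pair to energize for a given Hall sensor reading."""
    return COMMUTATION_TABLE[hall_state]
```

Each of the six valid Hall states selects exactly one of the six conduction intervals, which is why only two of the three phases conduct at any time (two-phase conduction mode).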
The voltage equation of the BLDC motor is given in Ref. [16]. The full-bridge drive in the two-phase conduction mode is presented in Fig. 1 [17]. The BLDC motor parameters are shown in Table 1.

Reinforcement Learning
Reinforcement learning is one of the methods of machine learning, aimed at solving problems such as control systems [18] and energy management systems [19][20][21]. The essence of this approach is learning by interaction: the control policy can be produced only by communicating with the environment, without knowledge of the underlying system model. The agent-environment interaction shown in Fig. 2 can be described as the RL problem. In this interaction, the agent observes the state S_t ∈ S and takes an action A_t ∈ A. The environment receives the action, transitions to a new state S_{t+1} ∈ S, and produces a scalar reward R_{t+1} ∈ R ⊂ ℝ. The agent then receives the reward, finds itself in the new state S_{t+1} ∈ S, and selects a new action A_{t+1} ∈ A. This interaction persists until the terminal state S_T ∈ S is reached. The purpose of the agent is to find the optimal control policy maximizing the discounted accumulated reward, called the expected discounted return G_t, in the long term; its governing equation is given in Ref. [18].
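For completeness, the expected discounted return takes the standard form used in the RL literature (consistent with the discount factor γ = 0.9 adopted later in the simulation):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1 .
```

Here γ weights immediate rewards more heavily than distant ones, so maximizing G_t trades off short-term and long-term control performance.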

Q-Learning
The Q-learning algorithm [22] is a value-based, off-policy RL algorithm in which the learned action-value function Q directly approximates the optimal action-value function q*, rather than following the current policy. The update rule is given in Ref. [22].
In general, the Q-learning agent selects an action A_t for the current state S_t and observes the next state S_{t+1}. It earns a reward from this transition and updates the value of the pair (S_t, A_t) under the assumption that the optimal policy is followed from the state S_{t+1}. A Q-learning agent uses an exploration/exploitation scheme to learn a policy that maximizes the value of the future discounted rewards (Rew), defined as given in Ref. [23]. The general procedure of the Q-learning algorithm is given as Algorithm 1 [24].
Algorithm 1 Q-learning algorithm.
1: Initialize Q(s, a) arbitrarily for all s ∈ S, a ∈ A
2: repeat (for each episode)
3:   Initialize S_t
4:   repeat (for each step of the episode)
5:     Select A_t for S_t using the exploration/exploitation scheme
6:     Take action A_t, observe R_{t+1} and S_{t+1}
7:     Q(S_t, A_t) ← Q(S_t, A_t) + η [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]
8:     S_t ← S_{t+1}
9:   until S_t reaches terminal state S_T
10: until episode ends
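The tabular update at the heart of the algorithm can be sketched in a few lines. The helper name `q_learning_step` is ours; the default learning rate η and discount factor γ match the values used later in the simulation (η = 0.1, γ = 0.9):

```python
def q_learning_step(Q, s, a, r, s_next, actions, eta=0.1, gamma=0.9):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + eta * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Q is a dict keyed by (state, action) pairs."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += eta * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

# Example: a two-state, two-action table initialized to zero.
Q = {(s, a): 0.0 for s in range(2) for a in range(2)}
q_learning_step(Q, s=0, a=0, r=1.0, s_next=1, actions=[0, 1])
```

Because the target uses max over the next state's actions rather than the action actually taken next, the update is off-policy, as noted above.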

Fuzzy Q-Learning
In Q-learning, a q-value is stored and updated for every state-action pair. When the number of state-action pairs is large, this may not be feasible [25]. Fuzzy inference systems (FISs) can be used to approximate the Q-function [26] and, at the same time, extend the applicability of Q-learning to continuous state-space problems (fuzzy Q-learning) [27]. In fuzzy Q-learning, x denotes the crisp set of inputs describing the agent's state. It is fuzzified, and every fuzzy rule corresponds to a state: the firing strength of each rule defines the degree to which the agent is in that state [28,29]. In other words, there are no fixed (predefined) state-action pairs; instead, the action for each rule (state-action pair) is generated via the exploration/exploitation algorithm. The FIS therefore has competing actions for each rule; the rules take the form presented in Ref. [30].
Specifically, the fuzzy Q-learning algorithm and its equations can be outlined in 7 stages [23]:
1. Observe the state x.
2. For each rule i, select the action a_i based on the exploration/exploitation algorithm.
3. Compute the global output a(x) by its equation (see Ref. [23]).
4. Compute the corresponding value Q(S_t, A_t) by its equation (see Ref. [23]).
5. Obtain the new state information x′.
6. Compute the reward R_{t+1}.
7. Update the q-values based on the update rule (see Ref. [22]).
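The seven stages can be sketched as follows. All function names, the membership-function placement, and the strength-weighted update are our illustrative assumptions; only the general form of the equations follows the standard fuzzy Q-learning formulation cited above.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a, c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def strengths(x, centers, width):
    """Stage 1: rule firing strengths alpha_i(x) for the observed state x."""
    return [tri(x, c - width, c, c + width) for c in centers]

def global_action(alphas, chosen, actions):
    """Stage 3: defuzzified output a(x) = sum_i alpha_i a_i / sum_i alpha_i,
    where chosen[i] is the action index rule i selected in stage 2."""
    s = sum(alphas)
    return sum(al * actions[c] for al, c in zip(alphas, chosen)) / s

def q_of_state(alphas, chosen, q):
    """Stage 4: Q(x, a) = sum_i alpha_i q[i][a_i] / sum_i alpha_i."""
    s = sum(alphas)
    return sum(al * q[i][c] for i, (al, c) in enumerate(zip(alphas, chosen))) / s

def update(q, alphas, chosen, r, v_next, q_prev, eta=0.1, gamma=0.9):
    """Stage 7: distribute the TD error over the fired rules,
    weighted by their normalized strengths."""
    s = sum(alphas)
    delta = r + gamma * v_next - q_prev
    for i, (al, c) in enumerate(zip(alphas, chosen)):
        q[i][c] += eta * delta * al / s
```

Stages 2, 5, and 6 (per-rule action selection, observing x′, and computing the reward) are supplied by the exploration/exploitation algorithm, the plant, and the reward definition of Ref. [23], respectively.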

Control Strategy
Automatic self-tuning of the fuzzy inference system is provided by fuzzy Q-learning solely on the basis of reinforcement signals. Fuzzy reasoning handles continuous states and produces continuous actions. Prior knowledge can be integrated into the fuzzy rules, which can greatly reduce the learning time; in practice, no expert can realistically supply the wide spectrum of rules needed to navigate stochastic environments. Fuzzy Q-learning is therefore applied as an adaptive layer that adjusts both the reference knowledge and the rules. Fig. 3 shows the structure of the proposed method: a PID controller as the main control and the fuzzy Q-learning approach as an intelligent algorithm for tuning the controller parameters. The control signal of the conventional PID controller is defined as given in Ref. [31] (see Eq. (7) in that reference, with λ = µ = 1).
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 19 November 2020 doi:10.20944/preprints202011.0482.v1

Figure 3.
The structure of the self-tuning PID-type fuzzy Q-learning adaptive controller.
In Fig. 3, K_p, K_i, and K_d are the gains of the PID controller and e is the system error. The PID gains are treated as a multi-agent system, in which each gain, K_p, K_i, and K_d, operates as an agent. The set point (reference speed) is the desired rotor angular velocity. The calculated rotor velocity is subtracted from the set point via the feedback control loop, yielding the error value e. The error is the input of both the fuzzy Q-learning agent and the PID controller. The controlled process is the motor speed, with the rotor angular velocity as output and the PID control signal (voltage) as input.
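A minimal discrete-time sketch of the PID block in Fig. 3 follows. Class and variable names are ours; the paper's controller follows Ref. [31] with λ = µ = 1, which reduces to this classical form, and the FQL agents would overwrite `kp`, `ki`, `kd` between control rounds.

```python
class PID:
    """Discrete PID controller: u = Kp*e + Ki*integral(e) + Kd*de/dt."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0      # running integral of the error
        self.prev_error = 0.0    # for the backward-difference derivative

    def step(self, setpoint, measured):
        """One control step: returns the voltage command."""
        e = setpoint - measured
        self.integral += e * self.dt
        derivative = (e - self.prev_error) / self.dt
        self.prev_error = e
        return self.kp * e + self.ki * self.integral + self.kd * derivative
```

In the proposed structure, each gain is adjusted by its own FQL agent every agent round while this loop runs at the simulation step time.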
The input signal (error), normalized to the range [-1, 1], defines the value of the single state variable of the fuzzy Q-learning agent. Seven triangular membership functions (MFs) are employed for the fuzzy input, so the one input yields seven states represented by an equal number of fuzzy rules. Five triangular MFs are employed for the fuzzy output.
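One common way to realize seven triangular input MFs on [-1, 1] is an evenly spaced, 50%-overlapping partition, so that the rule strengths for any error value sum to one. The exact MF placement is not stated in the paper, so the spacing below is an assumption:

```python
def input_memberships(e, n=7, lo=-1.0, hi=1.0):
    """Strengths of n evenly spaced triangular MFs on [lo, hi] for error e.
    Adjacent triangles overlap 50%, forming a partition of unity."""
    step = (hi - lo) / (n - 1)
    centers = [lo + i * step for i in range(n)]
    return [max(0.0, 1.0 - abs(e - c) / step) for c in centers]
```

With this layout, any in-range error fires at most two adjacent rules, which keeps the per-step fuzzy Q-learning update cheap.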
As previously stated, interaction with the environment is how the agent learns. An exploration/exploitation algorithm determines whether the agent explores or exploits to gain knowledge through its own experience. The agent scans for 500 rounds per state to explore new states to a high degree. It first carries out any action not yet performed (if any remain), which guarantees that every possible action is carried out at least once in every state. Thereafter, for a given state, the agent conducts 1% exploration and 99% exploitation.
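The schedule described above can be sketched as follows; the function name and arguments are our assumptions based on the text (untried actions first, heavy exploration for the first 500 rounds in a state, then 1% exploration and 99% exploitation):

```python
import random

def select_action(q_row, tried, visits, n_explore=500, eps=0.01, rng=random):
    """Pick an action index for one state.
    q_row:   q-values of this state's actions
    tried:   set of action indices already performed in this state
    visits:  number of rounds spent in this state so far"""
    untried = [a for a in range(len(q_row)) if a not in tried]
    if untried:                        # every action at least once
        return rng.choice(untried)
    if visits < n_explore:             # heavy scan of a new state
        return rng.randrange(len(q_row))
    if rng.random() < eps:             # 1% exploration afterwards
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])  # 99% exploitation
```

After the 500-round scan, the schedule is effectively an ε-greedy policy with ε = 0.01, which matches the 1%/99% split quoted above.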
The agent reward (R) is defined as presented in Ref. [23]. Moreover, the amount of the future discounted rewards is defined as given in the same reference.

The numerical simulation of BLDC motor
The simulation is run in MATLAB on a PC with an Intel Core i7-6500U processor (6 MB cache, up to 3.18 GHz) and 8 GB RAM.
The purpose of this simulation is to design a controller that reaches a motor speed of 3000 RPM while eliminating the overshoot. Overshoot must be treated as a serious issue, since it can draw excessive current through the motor at the first moment and cause serious damage. In the first step, an APID controller is implemented, both for comparison with the proposed method and to use its PID coefficients as the initial values of the PID gains for the FQL method. Selecting the initial PID gains by trial and error can also achieve acceptable results, because the Q-learning method improves its coefficients over time through the actions it takes; nevertheless, the adaptive method has been used to enhance the controller performance. In the second step, the fuzzy Q-learning method is applied to BLDC motor speed control. The simulation time is set to 0.5 s (case 1) and 3 s (case 2) with a simulation step time of 0.001 s. The FQL agent round is equal to 0.1 s. The discount factor and the learning rate are γ = 0.9 and η = 0.1, respectively. At time 0.2 s, a sudden load with a torque of 10 N·m is applied to the motor to verify the robustness and highlight the performance. Fig. 4 and Fig. 8 show a comparison between the self-tuning fuzzy Q-learning PID (FQLPID) controller and the APID controller in case 1 and case 2, respectively. The comparison indicates the higher efficiency of the FQLPID. At the commencement of the simulation, where the FQLPID is primarily exploring, the adaptive PID has a huge overshoot before reaching the steady state, whereas the fuzzy Q-learning PID has no overshoot at the first moment (see Fig. 4 and Fig. 8). After the sudden load at 0.2 s, the fuzzy Q-learning PID is more resilient than the adaptive PID and returns to the normal state.
The adaptive PID, however, oscillates throughout the simulation time (see Fig. 5 and Fig. 8). Details of the performance response of both controllers at two time points, 3 s (exploitation phase) and 0.5 s (exploration phase), are given in Table 2 for the two cases. The values of the integral time absolute error (ITAE) and the integral absolute error (IAE) also show the better performance of the fuzzy Q-learning PID after the exploration phase. The IAE and the ITAE are defined as given in Ref. [32].
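The IAE and ITAE criteria reported in Table 2 can be approximated from sampled error data with a simple rectangle rule (IAE = ∫|e(t)| dt, ITAE = ∫ t·|e(t)| dt); `dt` is the simulation step and the function name is ours:

```python
def iae_itae(errors, dt):
    """Approximate IAE and ITAE from an error signal sampled every dt seconds.
    Sample i is taken at time t_i = i * dt."""
    iae = sum(abs(e) for e in errors) * dt
    itae = sum(i * dt * abs(e) for i, e in enumerate(errors)) * dt
    return iae, itae
```

Because ITAE weights errors by elapsed time, it penalizes the late-time oscillations of the adaptive PID more heavily than the early exploration transient of the FQLPID.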

Conclusions
The numerical simulation of the BLDC motor has been carried out for the desired reference speed. The system was simulated with both the adaptive PID (APID) and the self-tuning PID-type fuzzy Q-learning adaptive controller (FQLPID) in MATLAB/SIMULINK. The resulting figures illustrate the superior performance of the self-tuning PID-type fuzzy Q-learning adaptive controller over the conventional adaptive PID with regard to both the amplitude of the oscillations and the steady-state error. The proposed controller yields the desired response without overshoot. It can be inferred from the obtained results that the proposed controller is more effective than the others.