Exploration of genetic network programming with two-stage reinforcement learning for mobile robot

This paper investigates the exploration of Genetic Network Programming with Two-Stage Reinforcement Learning (GNP-TSRL) for mobile robot navigation. The proposed method aims to observe its exploration when inexperienced environments are used in the implementation. To deal with this situation, the individuals are first trained in the training phase, that is, they learn the environment with the ε-greedy policy and learning rate α parameters. Two cases are studied, i.e., case A for low exploration and case B for high exploration. In the implementation, the individuals are deployed to gain experience and learn a new environment on-line. Then, the performance of the learning processes is observed under the environmental changes.


Introduction
A mobile robot using reactive strategies determines its behavior based on sensory information, where the robot carries out a simple task such as wall following, obstacle avoidance or object following. Furthermore, the main problem in mobile robot navigation is to follow the wall when the wall and its environment change. In order to build navigation systems based on reactive strategies that are workable in unknown environments, robustness to the changes of the environments should be considered.
Reinforcement Learning (RL) [1] is an attractive method for providing adaptation mechanisms in dynamic environments through trial and error, where rewards are given by the environment depending on the actions taken by the agent. The objective of the agent is to maximize the rewards, and RL learns a policy that maximizes the accumulated rewards. Many studies [2][3][4][5] show that RL is well suited to learning control policies for mobile robot navigation.
The state of the art in this field is the integration of RL into Evolutionary Algorithms (EA), such as Genetic Algorithm (GA) [6][7][8], Genetic Programming (GP) [9][10][11] and Genetic Network Programming (GNP) [12], which has been studied in many works [13][14][15][16][17]; the integration can improve the performance, as shown by GNP with RL (GNP-RL), which was implemented to navigate a mobile robot [18]. EA has the evolving ability to capture the environment using selection, crossover and mutation, while the integration of RL into EA improves the adaptability to dynamic environments.
The aim of this research is to observe the robustness to the changes of the environments by using Genetic Network Programming. Several effective mechanisms have been studied, such as (1) adding noises during the training phase [14,19,20]; (2) introducing the two-stage reinforcement learning structure [21,22]; and (3) controlling the learning parameters [23][24][25]. The first method improves the exploration ability of the agent in the training phase, so that the agent becomes more robust when facing inexperienced situations in the implementation with noises. Here, the proposed method aims to obtain the effectiveness of the learning mechanisms of RL. The learning mechanism is applied to the second method, where a large search space is separated into two stages, so that the actions can be determined more appropriately. The third method introduces a mechanism to control the duality of exploitation and exploration, which has the ability of re-learning quickly and flexibly when sudden changes occur in the environments [26]. The proposed navigation system of the mobile robot in this paper is based on GNP, where GNP has advantages [12] such as (1) re-usability of the nodes, which makes the structures more compact, and (2) applicability to the Partially Observable Markov Decision Process (POMDP). Compared to other methods, such as Evolutionary Neural Network (ENN) and GP, GNP has better performance [12,18]. Here, GNP with Two-Stage Reinforcement Learning (GNP-TSRL) facing inexperienced changes of the environments is studied. TSRL has two kinds of RL represented by two Q-tables, that is, a Q-table for sub node selection (SS method) and a Q-table for branch connection selection (BS method). The action selections of the SS and BS methods are carried out based on the ε-greedy policy.

This paper is organized as follows. Section 2 describes the mechanism of GNP-TSRL with the ε-greedy policy and learning rate α. Section 3 shows the simulation conditions and results. Finally, conclusions and future work are given in Section 4.

Two Stage Reinforcement Learning (TSRL) with Changing Mechanism
This section describes the structure of GNP-TSRL and the mechanism of changing ε and α.

Structures of GNP-TSRL
The structure of GNP-TSRL consists of a start node and a fixed number of processing nodes and judgment nodes, which are connected to each other as a directed graph, as shown in Figure 1. The start node has no function and its only role is to determine the first node to be executed, while the judgment nodes have functions to judge the assigned inputs (sensor values), return the judgment results and determine the next node in the transitions. In a former paper [20], it was found that the integration of fuzzy logic into the judgment nodes (fuzzy judgment nodes) performs well in noisy environments by determining the node transitions probabilistically; therefore, the fuzzy judgment nodes are also used in the proposed method. On the other hand, the function of the processing nodes is for the agent to take actions, that is, to set the speed of the wheels of a Khepera robot. In order to carry out effective learning using GNP-TSRL, the node structures of the conventional GNP-RL are modified, i.e., while the conventional GNP-RL has only sub nodes for the alternative functions [15], GNP-TSRL has not only sub nodes for the alternative functions, but also several branches for the alternative connections. The structures of the judgment nodes and processing nodes of GNP-TSRL are shown in Figure 2. The gene structure of node i is shown in Figure 3, which is divided into the macro node part, sub node part and branch part. The macro node of node i is defined by NT_i and d_i. NT_i represents the node type, that is, NT_i = 0, 1, 2 encodes the start node, judgment node and processing node, respectively. d_i represents the time delay spent on executing node i; for example, in this paper, d_i = 0 on the start node, d_i = 1 on the judgment node and d_i = 5 on the processing node. When the sequence of executed nodes, called a node transition, uses at least 10 time units, it is defined as one time step of the GNP-based agent behavior. For example, after executing three judgment nodes and one processing node, if another processing node is executed, the total time delay is 13 time units, which means that one time step of GNP is executed.
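The time-step accounting described above can be sketched as follows. This is an illustrative snippet, not the authors' code; in particular, resetting the elapsed counter to zero after each completed step (rather than carrying over the overshoot) is an assumption.

```python
# Node time delays as given in the text: 0 for the start node,
# 1 for a judgment node, 5 for a processing node.
DELAYS = {"start": 0, "judgment": 1, "processing": 5}

def count_time_steps(transition):
    """Count completed GNP time steps for a sequence of node types.

    One time step of agent behavior is completed once the accumulated
    delay of the node transition reaches at least 10 time units.
    """
    steps, elapsed = 0, 0
    for node_type in transition:
        elapsed += DELAYS[node_type]
        if elapsed >= 10:
            steps += 1
            elapsed = 0  # assumption: counter restarts for the next step
    return steps
```

With the example from the text, three judgment nodes and two processing nodes accumulate 1 + 1 + 1 + 5 + 5 = 13 time units, i.e., one time step.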
Node i has m_i sub nodes, as shown in Figure 2, whose functions are described in the sub node part shown in Figure 3. The node function of sub node p ∈ {1, . . ., m_i} is defined by ID_i(p), id_i(p) and w_i(p). ID_i(p) is the code number of the judgment/processing node, which is represented by a unique number shown in the function library. When the node is a judgment node, id_i(p) represents the sensor number of a Khepera robot, e.g., id_i(p) = 0 means sensor number 0, etc. However, when the node is a processing node, id_i(p) = 0 means the speed of the right wheel of a Khepera robot, while id_i(p) = 1 means that of the left wheel. w_i(p) is a parameter of the judgment/processing nodes. Because the fuzzy judgment nodes are used in the proposed method, w_i(p) = {a_i(p), b_i(p)} represents the parameters of the fuzzy membership functions. On the other hand, when the node is a processing node, w_i(p) represents the speed of the wheel of a Khepera robot. Q_SS(i, p) means the Q value of the SS method, which is assigned to each state-action pair, i.e., the state is node i, and the action is the selection of sub node p. Here, the Q_SS value is updated using Sarsa learning in the first stage of RL.
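The gene structure of Figure 3 can be sketched with simple data classes. This is an illustrative layout only (field names such as `func_id`, `operand` and `params` are assumptions), showing the macro node part, the sub node part and the per-sub-node Q_SS value used by the first-stage RL.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubNode:
    func_id: int        # ID_i(p): code number from the function library
    operand: int        # id_i(p): sensor number (judgment) or wheel 0/1 (processing)
    params: List[float] # w_i(p): fuzzy membership params {a, b}, or wheel speed
    q_ss: float = 0.0   # Q_SS(i, p): first-stage value of selecting this sub node

@dataclass
class Node:
    node_type: int      # NT_i: 0 = start, 1 = judgment, 2 = processing
    delay: int          # d_i: time delay (0, 1 or 5 time units)
    sub_nodes: List[SubNode] = field(default_factory=list)
```

The branch part (alternative connections per sub node, with their Q_BS values) would be attached analogously for the second-stage RL.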

Reinforcement Learning (RL)
RL studies the interaction between an agent and its environment to adapt to dynamic environments based on trial and error. The goal of RL is to learn a policy π(s, a) that selects action a at state s to maximize the expected cumulative reward. In a POMDP, the agent observes the state using incomplete information on the state, where the actions are more appropriately determined by the ε-greedy policy to learn near-optimum behavior, and on-line learning by the Sarsa algorithm [1] estimates Q(s, a) as follows:

Q(s, a) ← Q(s, a) + α(r(t) + γQ(s′, a′) − Q(s, a)),

where α is the learning rate such that 0 < α ≤ 1, γ is the discount factor, r(t) is the reward, and s′ and a′ are the next state and action. In GNP-RL, the current state is the current node i and the action is the sub node selection p [15].
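The Sarsa update above can be sketched in a few lines; the tabular dictionary representation of Q and the default values of α and γ are illustrative choices, not the paper's settings.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a)).

    q is a dict mapping (state, action) pairs to values; unseen pairs
    default to 0, matching the zero-initialized Q-tables.
    """
    td_error = r + gamma * q.get((s_next, a_next), 0.0) - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * td_error
    return q
```

For example, starting from Q(s0, a0) = 0 with r = 1 and Q(s1, a1) = 1, the update with α = 0.5 and γ = 0.9 yields Q(s0, a0) = 0.5 × (1 + 0.9) = 0.95.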

Algorithm of GNP-TSRL
In order to improve the adaptability of agents in dynamic environments, effective learning is carried out using TSRL [21]. In GNP-TSRL, the following two RLs are combined, that is, (1) RL with Sub Node Selection (SS method) and (2) RL with Branch Connection Selection (BS method), as shown in Figure 4. Thus, GNP-TSRL has two Q-tables, that is, the Q_SS-table and the Q_BS-table.

SS Method. The SS method is carried out in the first stage of RL using GNP-RL, where the current state is the current node i and the action is the sub node selection p.
BS Method. After executing the sub node selected by the SS method, one of the several branches from the sub node is determined. The branch connection is then determined in the second stage of RL by the BS method. Here, the state is represented by a branch b(k) ∈ {b(1), . . ., b(n)}, while the action is represented by the branch connection selection c(l) ∈ {c(1), . . ., c(n)}. The procedure for updating GNP-TSRL using Sarsa learning is explained in Figure 5.
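The two-stage ε-greedy selection can be sketched as follows. This is a minimal illustration under assumed data layouts (Q-table rows as Python lists indexed by sub node and branch), not the authors' implementation.

```python
import random

def eps_greedy(q_row, epsilon):
    """Pick an action index from one Q-table row, epsilon-greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_row))                   # explore
    return max(range(len(q_row)), key=q_row.__getitem__)      # exploit

def two_stage_select(q_ss_row, q_bs, epsilon):
    """Stage 1 (SS): pick a sub node from Q_SS(i, .).
    Stage 2 (BS): pick a branch connection from Q_BS for that sub node."""
    sub = eps_greedy(q_ss_row, epsilon)
    branch = eps_greedy(q_bs[sub], epsilon)
    return sub, branch
```

With ε = 0, both stages are purely greedy: e.g., `two_stage_select([0.1, 0.9], [[0.0, 1.0], [1.0, 0.0]], 0.0)` selects sub node 1 and then its highest-valued connection, branch 0.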

Simulations Settings
The proposed method is used to navigate a Khepera robot in dynamic environments. This section describes the simulation settings and the results in the training and implementation phases.

Khepera Robot
The proposed method is simulated on a Khepera robot. It has eight infrared distance sensors, which are used to perceive objects in front of it, behind it, and to its right and left by reflection. Each sensor returns a value ranging between 0 and 1023: 0 means that no object is perceived, while 1023 means that an object is very close to the sensor (almost touching it). Intermediate values give an approximate idea of the distance between the sensor and the object. Two motors turn the right and left wheels of the robot, respectively. The range of v_R and v_L is between -10 and +10, where v_R is the speed of the right wheel and v_L is that of the left wheel. Negative values rotate the wheel backward, while positive values rotate it forward.

Reward and Fitness in Wall Following Behaviors
GNP-TSRL judges the values of the sensors and determines the speed of the wheels depending on the node function ID_i(p) and parameter w_i(p), while the robot moves in the environment and obtains rewards. A trial ends when the individual uses 1000 time steps, and then the fitness is calculated. In this simulation, GNP-TSRL learns the wall following behavior, i.e., the robot must move along the wall as fast and as straight as possible. The reward r(t) at time step t and the fitness are calculated by the equations of [15], where v_R(t) and v_L(t) are the speeds of the right and left wheels at time step t, respectively, each ranging between -10 and +10. If all the sensors have values less than 1000 and at least one of them is more than 100, then c(t) is equal to 1; otherwise c(t) is equal to 0.
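The gate c(t) is fully specified above and is sketched exactly below; the reward expression combining speed and straightness is only an illustrative assumption of a plausible shape, not the paper's (elided) equation, for which the reader should consult [15].

```python
def wall_gate(sensors):
    """c(t) = 1 iff all sensors < 1000 and at least one sensor > 100."""
    return 1 if all(s < 1000 for s in sensors) and any(s > 100 for s in sensors) else 0

def reward(sensors, v_right, v_left):
    """Illustrative reward: near the wall (c = 1), fast and straight.

    The speed/straightness combination here is an assumed form only.
    """
    speed = (v_right + v_left) / 20.0             # large when moving fast forward
    straight = 1.0 - abs(v_right - v_left) / 20.0 # large when moving straight
    return wall_gate(sensors) * speed * straight
```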

Simulation Conditions
The node functions of the judgment nodes and processing nodes are shown in Table 1. Each judgment function, numbered 0, . . ., 7, judges the sensor value and determines the next node in the node transition probabilistically [20]. Each processing node determines the speed of the left or right wheel. The simulation conditions of GNP-TSRL are shown in Table 2, where these values are selected appropriately through the simulations. In this paper, Gaussian noises (µ = 0, σ = 50) are added to the sensor values in the training phase to improve the generalization ability of GNP-TSRL in the noisy environments of the implementation phase [20].
In the training phase, 300 individuals are evolved, where at the end of each generation, 300 individuals are generated to form the new population for the next generation: 179 individuals are generated by mutation, 120 individuals are generated by crossover, and one individual is the elite. Each individual uses 61 nodes, including 40 fuzzy judgment nodes (5 for each kind), 20 processing nodes (10 for each kind) and one start node. Each of the fuzzy judgment nodes and processing nodes of GNP-TSRL has 2 sub nodes, and each branch of the sub nodes has 2 branch connections, which are determined by the evolution. The best individual in the last generation is selected for the implementation. Figure 6 shows the flowchart of the proposed method of GNP-TSRL. The performance of GNP-TSRL is studied in two aspects, that is, in the training phase and the implementation phase. Successful trajectories of the robot in the training and implementation environments are shown in Figure 7.
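The generation turnover described above (1 elite, 179 mutants, 120 crossover children from 60 pairs) can be sketched as follows. The tournament-free, top-half parent pool and the operator interfaces are assumptions for illustration; the paper's genetic operators are detailed in [20].

```python
import random

def next_generation(population, fitness, mutate, crossover, rng=random):
    """Build the next 300-individual population: 1 elite + 179 + 120."""
    ranked = sorted(population, key=fitness, reverse=True)
    new_pop = [ranked[0]]                       # the elite survives unchanged
    parents = ranked[:150]                      # assumed selection pool (top half)
    for _ in range(179):                        # offspring by mutation
        new_pop.append(mutate(rng.choice(parents)))
    for _ in range(60):                         # 60 pairs -> 120 crossover children
        a, b = rng.sample(parents, 2)
        new_pop.extend(crossover(a, b))
    return new_pop
```

With identity operators the size bookkeeping is easy to check: 1 + 179 + 120 = 300 individuals per generation.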
Evolution phase. The evolution of GNP-TSRL starts from the initialization of the individuals. Each individual has one start node and a fixed number of judgment nodes and processing nodes. The function of a node, ID_i(p), is assigned by a unique number which is shown in the function library. The parameter of a node, w_i(p), is set at a randomly selected integer. When the node is a judgment node, its parameter is w_i(p) = {a_i(p), b_i(p)}, where a_i(p) is larger than b_i(p); that is, a_i(p) is set between 0 and 1023, and b_i(p) is set between 0 and a_i(p), while when the node is a processing node, its parameter is set between -10 and 10. The initial connection of each branch is determined randomly. All Q values (Q_SS and Q_BS) are set at zero initially. The connections between nodes, the node functions and the parameters of the individuals are changed by crossover and mutation, whose rates are P_c and P_m, respectively. The reader can refer to [20] for the details of the genetic operators.

ISSN: 1693-6930 TELKOMNIKA Vol. 17, No. 3, June 2019: 1447-1454

The average fitness of Case A converges faster and higher than that of Case B, because the actions with higher Q-values can be selected more frequently in Case A, while Case B carries out random action selections more frequently than Case A. In other words, Case A and Case B carry out higher exploitation and higher exploration, respectively. When the random action selections are carried out with high probability, the actions cannot be reinforced well and the Q-values remain small, while when the exploitation of action selections is carried out with high probability, the good actions are reinforced, but the alternative actions cannot be reinforced well.

Implementation Results
In the implementation phase, the performance of the proposed method is studied when the individuals are implemented with the parameters ε and α shown in Table 3. The simulations are done 3000 times, that is, the 10 best individuals from 10 independent runs in the training phase are implemented 300 times each using 10 different start positions. The individuals trained by Case A and Case B are implemented using constant ε and α, i.e., ε = 0.01 and α = 0.10. When an inexperienced environment is used in the implementation, the actions are selected considering the situations learned in the training phase, while the Q-values of the current transition are used to select the actions under the changes of the environments. Here, Case A has the lower average reward, as shown in Table 3, because it was trained with higher exploitation and the Q-values of the alternative actions were small. Thus, under the changes of the environments, the actions of Case A cannot be selected appropriately. As Case B was trained with higher exploration, although the performance of Case B in the training phase is worse than that of Case A, the average reward of Case B is higher than that of Case A in the implementation phase, because the Q-values of the alternative actions were larger. Thus, in an inexperienced environment, the actions of Case B can be selected more appropriately. The proposed method, GNP-TSRL, obtains a better result than GNP-RL, which means that the two-stage reinforcement learning can reinforce both the good actions and the alternative actions. Here, GNP-TSRL is more efficient and effective compared to GNP-RL.

Conclusion
The two-stage reinforcement learning of Genetic Network Programming (GNP-TSRL) has been proposed to improve the performance of the conventional GNP-RL. In the training phase, the average fitness of Case A converges faster and higher than that of Case B, while Case B obtains a better result than Case A in the testing phase, which means that the two-stage reinforcement learning can reinforce both the good actions and the alternative actions. This shows that the exploration of the two-stage RL (GNP-TSRL) can improve the performance efficiently and effectively by providing alternative connections. In future work, we will study the adaptability of the proposed method when severe changes occur in the environments.

Figure 8. Average fitness in the training phase

Table 1. Node Functions Used in the Function Library

Table 2. Simulation Conditions