Let machines learn like humans?
Background
If an agent is to operate in a complex, ever-changing environment, it must be able to acquire new skills quickly. Humans are remarkably good at this: we can learn to recognize a brand-new object from a single example, adapt to a different car's handling within minutes, and add a new slang word to our vocabulary after hearing it once.
Meta-learning is therefore a natural candidate if we want agents with human-like learning ability. Under this paradigm, an agent leverages the rich experience accumulated while performing related tasks to adapt to a new task from only a small amount of data. For agents that must act and accumulate experience, meta-reinforcement learning (meta-RL) promises fast adaptation to new situations. The catch is that although the meta-trained policy adapts quickly, the meta-training process itself requires large amounts of data from a range of training tasks, which compounds the sample inefficiency that already plagues reinforcement learning algorithms. As a result, existing meta-RL algorithms can largely run only in simulation. This post briefly surveys the state of meta-RL research and then introduces a new algorithm, PEARL, that greatly improves sample efficiency.
Research progress in meta-reinforcement learning
Two years ago, the Berkeley AI Research blog published a post called Learning to Learn (https://bair.berkeley.edu/blog/2017/07/18/learning-to-learn/). Besides proposing new algorithms, it surveyed the surge of interest in meta-learning at the time. The key idea of meta-learning, then and now, is to reduce a hard problem to one we already know how to solve. In traditional machine learning, we are given a set of data points to fit a model to; in meta-learning, those data points are replaced by a set of datasets, each corresponding to a learning problem. As long as the procedure for learning these problems (so-called "adaptation") is differentiable, it can be optimized with gradient descent in an outer loop (meta-training) as usual. Once trained, the adaptation procedure can quickly solve a new related task from a small amount of data.
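The inner/outer loop structure described above can be sketched on a toy problem. This is a minimal, first-order approximation in the MAML style, on hypothetical 1-D regression tasks (each "dataset" is a line y = w_true * x with its own slope); it is an illustration of the idea, not any paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Each task is a small regression problem with its own slope."""
    w_true = rng.uniform(-2.0, 2.0)
    x = rng.uniform(-1.0, 1.0, size=10)
    return x, w_true * x

def loss_grad(w, x, y):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return np.mean(2 * (w * x - y) * x)

w_meta, inner_lr, outer_lr = 0.0, 0.5, 0.1
for _ in range(500):
    x, y = sample_task()
    # Inner loop ("adaptation"): one gradient step from the meta-parameters.
    w_adapted = w_meta - inner_lr * loss_grad(w_meta, x, y)
    # Outer loop (meta-training): first-order approximation that
    # applies the adapted parameters' gradient to the meta-parameters.
    w_meta -= outer_lr * loss_grad(w_adapted, x, y)
```

After meta-training, a single inner-loop step from `w_meta` on a freshly sampled task already reduces that task's loss, which is the "fast adaptation from a small amount of data" the text describes.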
Recent developments in meta-reinforcement learning (from left to right): one-shot imitation via meta-learning (https://arxiv.org/abs/1802.01557), a model-based meta-RL agent adapting to a broken leg (https://arxiv.org/abs/1803.11347), and evolved policy gradients extrapolating beyond the training task distribution (https://arxiv.org/abs/1802.04821).
Most meta-learning work at the time focused on few-shot image classification. In the years since, meta-learning has been applied to a broader range of problems, such as visual navigation, machine translation, and speech recognition. Applying meta-learning to reinforcement learning is therefore a challenging but exciting prospect, because the combination promises agents that can learn new tasks faster, which matters greatly for agents deployed in a complex and ever-changing world.
Because meta-learning directly targets sample complexity, and deep reinforcement learning is notoriously sample-hungry, combining the two is natural. Two years ago, a pair of papers (RL2, Wang et al., and MAML, https://arxiv.org/abs/1703.03400) reported initial results applying meta-learning to reinforcement learning in restricted settings with on-policy policy gradients and dense rewards. Since then, many researchers have taken up this approach, and subsequent papers have applied meta-learning concepts in broader settings: learning from human demonstrations (https://arxiv.org/abs/1802.01557), imitation learning (https://arxiv.org/abs/1810.03237), and model-based reinforcement learning. Beyond meta-learning model parameters, researchers have also considered meta-learning hyperparameters and loss functions. To address sparse-reward settings, there is also work on meta-learning exploration strategies.
Despite these advances, sample efficiency remains a challenge. Applying meta-RL to harder real-world tasks requires exploration strategies effective enough to adapt quickly, and the meta-training process itself must become far less sample-hungry. Berkeley AI Research has therefore studied both problems in depth and developed an algorithm that addresses them.
The advantages of off-policy meta-reinforcement learning
Although policy gradient RL algorithms achieve high performance on complex high-dimensional control tasks, such as controlling a humanoid robot, their sample efficiency is still poor. For example, the state-of-the-art policy gradient method PPO (https://arxiv.org/abs/1707.06347) needs 100 million samples to learn a good humanoid policy. If we ran this algorithm on a real robot with a 20 Hz controller running continuously, it would take nearly two months of experience to learn, not counting reset time. The root cause of this inefficiency is that policy gradient updates must be computed from samples drawn from the current policy; previously collected data cannot be reused during training. Recent off-policy algorithms (TD3, https://arxiv.org/abs/1802.09477; SAC, https://arxiv.org/abs/1801.01290) match the performance of policy gradient methods while requiring up to 100x fewer samples. If such an algorithm could be used for meta-reinforcement learning, weeks of data collection could be reduced to half a day, making meta-learning far more practical. Off-policy learning offers further benefits beyond the large gain in sample efficiency when training from scratch: it can also make use of previously collected static datasets, and of data gathered by other robots in other scenarios.
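The "nearly two months" figure can be checked with back-of-the-envelope arithmetic, assuming 100 million environment steps collected continuously at 20 Hz:

```python
# How long does it take to collect 100 million samples at 20 Hz?
samples = 100_000_000
hz = 20
seconds = samples / hz            # 5,000,000 seconds of robot time
days = seconds / (60 * 60 * 24)   # convert seconds to days
print(round(days, 1))             # → 57.9
```

About 58 days of nonstop operation, i.e. nearly two months, before accounting for any reset time.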
Off-policy reinforcement learning is far more sample-efficient than policy gradient learning.
The exploration problem
In supervised meta-learning, the data used to adapt to a new task is given. For example, in few-shot image classification, the meta-learner is handed images and labels for the new classes to be recognized. In reinforcement learning, however, the agent is responsible for exploring and collecting its own data, so the adaptation procedure must include an effective exploration strategy. Black-box meta-learners (RL2 and https://arxiv.org/abs/1707.03141) can learn such exploration strategies, because in the recurrent optimization the entire adaptation process is treated as one long sequence. Similarly, gradient-based meta-RL methods can in principle learn exploration strategies by assigning credit from the returns earned by the post-update policy back to the trajectories collected by the pre-update policy. Although feasible in theory, in practice these methods do not learn temporally extended exploration strategies.
To address this problem, MAESN (https://arxiv.org/abs/1802.07245) adds structured randomness: a probabilistic latent variable, adapted via gradient descent together with the policy, conditions the policy on the new task. After training, samples from the prior produce exploration trajectories, while samples from the adapted variable produce optimal post-adaptation trajectories. Broadly speaking, these schemes fit on-policy RL algorithms, because they depend on exploration and adaptation trajectories sampled from the same current policy, and hence require on-policy sampling. To construct an off-policy meta-RL algorithm, we will approach exploration differently.
Meta-learning exploration via posterior sampling
A very simple way to explore in a brand-new situation is to pretend it is something you have already seen. For example, the first time you see a pitaya (dragon fruit) and want to eat it, you might liken it to a mango and cut it with a knife as you would a mango. This turns out to be a reasonable exploration strategy: it gets you to the delicious flesh inside. Once you notice that the flesh is more like a kiwi's, you might switch to your kiwi strategy and scoop it out with a spoon.
In the reinforcement learning literature, this exploration method is called posterior sampling (or Thompson sampling). The agent maintains a distribution over MDPs, iteratively samples a new MDP from this distribution, acts optimally according to the sample, and updates the distribution with the data it collects. As more data comes in, the posterior narrows, yielding a smooth transition between exploration and exploitation. This strategy may seem limited, because it rules out deliberately information-seeking behavior; however, earlier work, "(More) Efficient Reinforcement Learning via Posterior Sampling", shows that the worst-case cumulative regret of posterior sampling is close to that of the current best exploration strategies.
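The sample-act-update loop above is easiest to see on a toy bandit rather than a full MDP. Below is a minimal Thompson-sampling sketch on a 2-armed Bernoulli bandit, where the "posterior over MDPs" reduces to a Beta distribution over each arm's reward probability (the arm probabilities and step count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = [0.3, 0.8]        # unknown reward probabilities of the two arms
alpha = np.ones(2)         # Beta(alpha, beta) posterior per arm,
beta = np.ones(2)          # initialized to the uniform prior Beta(1, 1)

pulls = np.zeros(2, dtype=int)
for _ in range(2000):
    # Sample one hypothesis from the posterior and act greedily on it.
    theta = rng.beta(alpha, beta)
    arm = int(np.argmax(theta))
    reward = rng.random() < true_p[arm]
    # Update the posterior with the observed outcome.
    alpha[arm] += reward
    beta[arm] += 1 - reward
    pulls[arm] += 1

print(pulls)  # play concentrates on the better arm as the posterior narrows
```

Early on, hypotheses from the wide posterior cause both arms to be tried; as evidence accumulates, the posterior shrinks and the agent exploits the better arm, which is exactly the smooth exploration-exploitation transition described above.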
Eating an unfamiliar new fruit via posterior sampling.
How do we represent this distribution over MDPs in practice? One possibility is to maintain a distribution over the transition and reward functions; to act according to a sampled model, we could then use any model-based RL algorithm. Bootstrapped DQN applies this idea to model-free deep RL by maintaining an approximate posterior over the Q-function. Our insight is that this idea can be extended to the multi-task setting by learning a distribution over Q-functions across different tasks, and that this distribution enables effective exploration in new related tasks.
To bring posterior sampling into meta-RL, we model the distribution over MDPs via the Q-function: we instantiate a latent variable Z, inferred from experience (the "context"), which the Q-function takes as input to condition its predictions. During meta-training, the prior over Z is learned to represent the distribution of meta-training tasks. Faced with a new test task, the agent samples a hypothesis from the prior, acts in the environment according to the chosen hypothesis, and then updates the posterior with the new evidence. As the agent collects trajectories, the posterior narrows and the agent makes increasingly accurate predictions about the current task.
Meta-reinforcement learning viewed as a POMDP.
This Bayesian posterior view of meta-RL reveals its connection to partially observed Markov decision processes (POMDPs). POMDPs model environments in which the current observation does not tell you everything about the current state (that is, the state is only partially observed). For example, suppose the lights suddenly go out while you are walking through a building. You cannot immediately observe where you are in the darkness, but you still have an estimate of your position, because you can recall what you saw before the lights went out. Solving a POMDP works on the same principle: integrate the history of observations to estimate the current state accurately.
Graphical model of a POMDP.
Meta-RL can be regarded as a POMDP with special structure: the task is the only unobserved part of the state. In our example, the task might be finding an office you have never visited. In a standard POMDP, the agent must re-estimate the state at every timestep, continually updating its estimate of its position in the building. In meta-RL, the task does not change within or across exploration trajectories: in the real world, the office's location does not move while you search for it. This means the agent can maintain an estimate of the office's location without worrying that hidden system dynamics will change its actual location at every step. Cast as a POMDP, a meta-RL algorithm should maintain a belief state over the task, updated as information is collected across multiple exploration trajectories.
PEARL in a nutshell
How can this belief state over tasks be combined with an existing off-policy RL algorithm? First, we infer a variational approximation to the posterior belief using an encoder network q(z|c) that takes context (experience) as input. For tractability, we represent the posterior as a Gaussian. For the RL agent, we build on soft actor-critic (SAC), which currently offers the best combination of performance and sample efficiency. Samples from the belief state are passed to the actor and critic so that their predictions are conditioned on the sampled task. Meta-training then consists of learning to infer the posterior q(z|c) from a given context, and training the actor and critic to act optimally given z. The encoder is optimized using gradients from the critic (so that q(z|c) represents a distribution over Q-functions) together with an information bottleneck. The bottleneck arises when deriving the variational lower bound, but it can also be interpreted intuitively as minimizing the information between the context and Z, so that Z contains the minimum information needed to predict state-action values.
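The information-bottleneck term mentioned above can be made concrete. For a Gaussian q(z|c) = N(mu, diag(sigma^2)) regularized toward a unit-Gaussian prior, the penalty is the standard closed-form KL divergence; the sketch below uses toy mu/sigma values standing in for encoder outputs, and is an illustration of the regularizer, not PEARL's actual code:

```python
import numpy as np

def kl_to_unit_gaussian(mu, sigma_sq):
    """KL( N(mu, diag(sigma_sq)) || N(0, I) ), summed over dimensions.

    Per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2).
    Penalizes Z for carrying more information about the context
    than needed, pulling the posterior toward the prior.
    """
    return 0.5 * np.sum(sigma_sq + mu**2 - 1.0 - np.log(sigma_sq))

mu = np.array([0.5, -0.2])
sigma_sq = np.array([0.8, 1.1])
print(kl_to_unit_gaussian(mu, sigma_sq))   # small positive penalty

# The penalty is zero exactly when q(z|c) already equals the prior:
assert kl_to_unit_gaussian(np.zeros(2), np.ones(2)) == 0.0
```

In training, a term like this would be added to the critic-driven encoder loss, trading off task-predictive information in Z against compression.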
One noteworthy aspect of this scheme is that the batch of data sampled for training the actor and critic is decoupled from the batch used as context. Intuitively, this makes sense: by explicitly representing a belief state over tasks, the agent disentangles task inference from control, and each can be learned from entirely different data sources. This stands in sharp contrast to methods such as MAML and RL2, which entangle task inference and control and must therefore use the same batch of data for both.
This decoupling turns out to be crucial for off-policy meta-training. The reason is that meta-learning rests on the assumption that the training and testing conditions match. For example, a meta-learner that must classify new animal species at test time should be meta-trained on a distribution of classification tasks that includes animals. Analogously in RL, if the agent will adapt at test time by collecting on-policy data, it should be trained with on-policy data as well. Using off-policy data during training therefore introduces a distribution shift that violates this basic assumption. In PEARL, we mitigate this shift, while still exploiting off-policy data at scale, by sampling the context from recently collected (near on-policy) data while using fully off-policy data for the actor-critic training.
So far the encoder architecture has been left abstract. The encoder operates on context (a set of transitions consisting of state, action, reward, and next state) and produces the parameters of a Gaussian posterior over the latent context variable. Although a recurrent neural network might seem the obvious choice here, we note that the Markov property means these transitions can be encoded without regard to their order in the trajectory. Based on this observation, we adopt a permutation-invariant encoder that independently predicts a Gaussian factor for each transition and multiplies these factors to form the posterior. Compared with an RNN, this architecture is faster and more stable to optimize, and it can accommodate much larger contexts.
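The "multiply the factors" step is a product of Gaussians, which has a simple closed form: the product's precision is the sum of the factor precisions, and its mean is the precision-weighted average of the factor means. A minimal sketch, with per-transition means and variances standing in for the outputs of a learned encoder network:

```python
import numpy as np

def product_of_gaussians(mus, sigmas_sq):
    """Combine independent 1-D Gaussian factors N(mu_i, sigma_i^2)
    into a single Gaussian (mu, sigma^2) by multiplying densities."""
    precisions = 1.0 / sigmas_sq
    sigma_sq = 1.0 / precisions.sum()          # precisions add up
    mu = sigma_sq * (precisions * mus).sum()   # precision-weighted mean
    return mu, sigma_sq

mus = np.array([0.0, 1.0, 2.0])
sig_sq = np.array([1.0, 4.0, 2.0])

# The order of the factors does not matter (permutation invariance):
assert product_of_gaussians(mus, sig_sq) == product_of_gaussians(mus[::-1], sig_sq[::-1])
```

Note also that every additional factor adds precision, so the posterior variance can only shrink as more context transitions arrive, mirroring how the agent's belief over the task narrows with experience.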
How well does PEARL work?
We tested PEARL on six benchmark continuous-control domains in the MuJoCo simulator, with reward or dynamics functions that vary across tasks. For example, for the ant agent, different tasks correspond to navigating to different goal positions on the 2D plane; for the walker agent, tasks correspond to different physical parameters of its joints and body.
We compared PEARL with three state-of-the-art meta-RL algorithms: ProMP, MAML, and RL2. The results are shown in the figure below, where the blue curve is PEARL. Note the log scale on the x-axis. By using off-policy data during meta-training, our method improves sample efficiency by 20-100x, and its final performance often exceeds the baselines as well.
Effective exploration is especially important in sparse-reward domains. Imagine a point robot that must navigate to different goal positions on a semicircle and is rewarded only within a small radius of the goal (the blue region in the figure). By sampling hypotheses about the goal location and updating its belief state, the agent can explore efficiently until it finds the goal. We compared PEARL with MAESN, the latent-variable meta-learned exploration strategy discussed earlier, and found that PEARL is not only more sample-efficient during meta-training but also explores more effectively.
A point robot uses posterior sampling to explore and find the goal in a sparse-reward setting.
Future directions
While meta-learning offers a possible solution for agents that must adapt quickly to new scenarios, it raises further questions. For example, where do the meta-training tasks come from? Must they be designed by hand, or can they be generated automatically? Meta-learning is episodic by nature, yet the real world is a continuous, never-ending process of change; how should agents handle tasks that keep evolving over time? Reward functions are notoriously difficult to design; could meta-RL algorithms instead make use of binary feedback, preferences, or demonstrations? We believe the Bayesian inference perspective developed in PEARL offers a fresh angle on these questions. We also believe PEARL's off-policy capability is a first step toward applying meta-RL at scale on real systems.