Fundamentals of Reinforcement Learning (FunRL)

Reinforcement learning algorithms have witnessed growing popularity and tremendous success in recent years. They are now used to solve optimization problems that were believed to be almost intractable only a few years ago. Yet, despite this empirical success, they are very power-hungry and very few of them come with performance guarantees. The performance of a reinforcement learning algorithm can be measured by various metrics: the {\it sample efficiency} measures how many samples (from a simulator or from a real environment) an algorithm needs to find a ``good'' policy, whereas the {\it computational efficiency} measures the amount of computation (or memory) needed. Finally, its {\it regret} measures the gap between the sequential rewards obtained by the learning algorithm and the rewards of the (unknown) optimal policy.
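For concreteness, one common formalization of the regret (used, for instance, in the average-reward setting; the notation below is illustrative, not fixed by the project) is:

```latex
\mathrm{Regret}(T) \;=\; T\,\rho^\star \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} r_t\Big],
```

where $\rho^\star$ is the optimal long-run average reward per step and $r_t$ is the reward collected by the learning algorithm at time $t$. A regret that grows sublinearly in $T$ means that the learner's average reward converges to the optimum.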

The goal of a reinforcement learning algorithm is to gather information about the unknown system explored by the learner, in order to better understand its dynamical properties and exploit them to optimize its behavior. Whenever the learner has {\it a priori} offline information about the system, it can leverage this knowledge to learn its optimal behavior more efficiently. This approach is captured by the general concept of {\it structured learning}. This leads us to the research question that we want to tackle with the FunRL project:

How can we design algorithms with optimal theoretical guarantees that exploit a (known or unknown) structure of the problem to be solved?

This question will be developed in three directions. First, we will tackle the online control of queueing networks, which raises the important issues of stability and of rarely visited states. Second, we will consider systems modeled as Markov decision processes (MDPs), which are stochastic dynamical systems that can be controlled. The main originality of this axis with respect to the others is that these dynamical systems are constrained by the structure of the problem, the challenge being to efficiently use our knowledge of such a structure. Third, we will study parametric learning, where a learner adapts its policy to a problem with a known structure but whose parameters are unknown. This has applications to auto-scaling problems in cloud computing, resource allocation, and sequential decision making.
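To illustrate the MDP setting, the sketch below solves a small admission-control problem (a recurring example in the control of queues) by value iteration. The model itself (queue capacity, arrival and service probabilities, reward values) is a hypothetical toy example chosen for this illustration, not one of the project's target systems.

```python
import numpy as np

# Toy admission-control MDP (hypothetical illustration): the state is the
# queue length (0..N); when a job arrives, the controller either accepts
# (a=1) or rejects (a=0) it. Accepting yields a reward; holding jobs costs.
N = 5            # queue capacity
p_arrival = 0.4  # probability that a job arrives in a time slot
p_service = 0.6  # probability that a job completes in a time slot
gamma = 0.95     # discount factor

def transitions(s, a):
    """Return a list of (next_state, probability, reward) triples."""
    reward = (3.0 * a if s < N else 0.0) - 0.1 * s  # admission gain minus holding cost
    probs = {}
    for arrive in (0, 1):
        for serve in (0, 1):
            p = (p_arrival if arrive else 1 - p_arrival) * \
                (p_service if serve else 1 - p_service)
            ns = s + (1 if (arrive and a and s < N) else 0) \
                   - (1 if (serve and s > 0) else 0)
            probs[ns] = probs.get(ns, 0.0) + p
    return [(ns, p, reward) for ns, p in probs.items()]

def q_value(s, a, V):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[ns]) for ns, p, r in transitions(s, a))

# Value iteration: iterate the Bellman optimality operator to a fixed point.
V = np.zeros(N + 1)
for _ in range(500):
    V = np.array([max(q_value(s, 0, V), q_value(s, 1, V)) for s in range(N + 1)])

# Greedy policy with respect to the (near-)optimal value function.
policy = [int(q_value(s, 1, V) > q_value(s, 0, V)) for s in range(N + 1)]
print(policy)
```

With these parameters the computed policy accepts jobs when the queue is empty; at full capacity both actions are equivalent, since the arriving job cannot enter anyway. Exploiting the structure of such problems (e.g., the monotone, threshold-like shape of optimal policies) is precisely what the project aims to formalize in the learning setting.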

Keywords: reinforcement learning, online learning, stochastic gradient descent, bandits.

## The FunRL consortium

The project members come from two sites (Grenoble and Clermont-Ferrand) and three laboratories (LIG and LJK in Grenoble, and LIMOS in Clermont-Ferrand).

  • Jean-Yves Daniel (ISIMA, Clermont) is a PRAG at ISIMA, Clermont. He is involved in the global overhaul of the AI curriculum in Clermont. He has already set up several software environments that give students a first hands-on experience of reinforcement learning.

  • Pierre Gaillard (LJK, Grenoble) (co-PI) is an Inria researcher in Grenoble. He is a member of the TOTH team and of the LJK laboratory. He has been active in statistical learning, with a particular focus on online learning.

  • Nicolas Gast (LIG, Grenoble) is an Inria researcher in Grenoble, in the Polaris team. He is also a member of the LIG laboratory. He works on stochastic approximation techniques and their applications to stochastic gradient algorithms.

  • Bruno Gaujal (LIG, Grenoble) (co-PI) is an Inria researcher in Grenoble, in the Polaris team, and a member of the LIG laboratory. He is currently working on reinforcement learning in Markov decision processes, with applications to admission control, resource allocation, and auto-scaling.

  • Jean-Philippe Gayon (LIMOS, Clermont) is a professor at ISIMA and a member of the LIMOS laboratory in Clermont. He is a specialist in Markov decision processes and their applications to queueing and operations research problems.