
Reinforcement Learning, Bit by Bit

 

Other learning paradigms are about minimization; 

Reinforcement learning is about maximization.


The statement quoted above has been attributed to Harry Klopf, though it might only be accurate in sentiment. The statement may sound vacuous, since minimization can be converted to maximization simply by negating the objective. However, further reflection reveals a deeper observation. Many learning algorithms aim to mimic observed patterns, minimizing differences between model and data. Reinforcement learning is distinguished by its open-ended view. A reinforcement learning agent learns to improve its behavior over time, without a prescription for eventual dynamics or the limits of performance. If the objective takes non-negative values, minimization suggests a well-defined desired outcome, while maximization conjures pursuit of the unknown.



Video courtesy: bdtechtalks.com

What happens when AI Plays Hide and Seek 500 Times

Paper by: Xiuyuan Lu, Benjamin Van Roy, Vikranth Dwaracherla, Morteza Ibrahimi, Ian Osband, Zheng Wen

Paper Link


Data Efficiency

In reinforcement learning, the nature of data depends on the agent's behavior. This bears important implications for the need for data efficiency. In supervised and unsupervised learning, data is typically viewed as static or evolving slowly. If data is abundant, as is the case in many modern application areas, the performance bottleneck often lies in model capacity and computational infrastructure. This also holds when reinforcement learning is applied to simulated environments; while data generated in the course of learning does evolve, a slow rate can be maintained, in which case model capacity and computation remain bottlenecks, though data efficiency can be helpful in reducing simulation time. In a real environment, on the other hand, data efficiency often becomes the gating factor.


The paper develops a framework for studying the costs and benefits associated with information. Below I highlight the major reinforcement learning concepts it builds on; the paper discusses each of them in detail.

  • Agents
    • coin tossing
    • dialogue
  • Agent-Environment Interface (see the interaction-loop sketch after this list)
  • Policies and Rewards
  • Agent State
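To make the agent-environment interface concrete, here is a minimal interaction-loop sketch in Python. It reuses the coin-tossing example from the list above, but every name in it (CoinEnvironment, update_agent_state, select_action) is an illustrative stand-in of mine, not a definition from the paper; the paper works abstractly with an observation probability function ρ(·|h, a) over histories h and actions a.

import random

# Minimal sketch of the agent-environment interaction loop.
# All names here are illustrative stand-ins, not taken from the paper.

class CoinEnvironment:
    """Toy environment: each action flips a coin with a different,
    unknown bias, and the observation (0 or 1) is also the reward."""
    def __init__(self):
        self.biases = {0: 0.4, 1: 0.7}  # hidden from the agent

    def step(self, action):
        # Draw an observation; this randomness is what rho(.|h, a) describes.
        return 1 if random.random() < self.biases[action] else 0

def update_agent_state(state, action, observation):
    # The agent state compresses the history: here, (heads, trials) per action.
    heads, trials = state[action]
    state[action] = (heads + observation, trials + 1)
    return state

def select_action(state):
    # Placeholder policy: act greedily on empirical means.
    def mean(a):
        heads, trials = state[a]
        return heads / trials if trials else 0.5
    return max(state, key=mean)

env = CoinEnvironment()
agent_state = {0: (0, 0), 1: (0, 0)}
total_reward = 0
for t in range(1000):
    action = select_action(agent_state)
    observation = env.step(action)   # reward equals the observation here
    total_reward += observation
    agent_state = update_agent_state(agent_state, action, observation)
print("average reward:", total_reward / 1000)

The greedy placeholder policy above never explores deliberately; the action-selection schemes reviewed later (ε-greedy, Thompson sampling, IDS) exist precisely to manage the trade-off between immediate reward and information acquisition.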

Sources of Uncertainty 

The agent should be designed to operate effectively in the face of uncertainty. It is useful to distinguish three potential sources of uncertainty (a small illustrative example follows the list):

  • Algorithmic uncertainty may be introduced through computations carried out by the agent. For example, the agent could apply a randomized algorithm to select actions in a manner that depends on internally generated random numbers. 
  • Aleatoric uncertainty is associated with the unpredictability of observations that persists even when the observation probability function ρ is known. In particular, given a history h and action a, while ρ(·|h, a) assigns probabilities to possible immediate observations, the realization is randomly drawn.


  • Epistemic uncertainty is due to not knowing the environment – this amounts to uncertainty about the observation probability function ρ, since the action and observation sets are inherent to the agent design.
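As a toy illustration of how these three sources differ (my own example, not one from the paper), consider a coin whose bias is unknown. Epistemic uncertainty is uncertainty about the bias, aleatoric uncertainty is the randomness of each flip even once the bias is fixed, and algorithmic uncertainty is any randomness the agent injects itself.

import random

# Illustrative only: a coin with unknown bias.
candidate_biases = [0.3, 0.5, 0.8]                                    # hypotheses about rho
posterior = {b: 1 / len(candidate_biases) for b in candidate_biases}  # epistemic uncertainty
true_bias = 0.8                                                       # unknown to the agent

def flip(bias):
    # Aleatoric: the outcome is random even with the bias fixed.
    return 1 if random.random() < bias else 0

def update(posterior, outcome):
    # Bayesian update: epistemic uncertainty shrinks as flips are observed.
    weights = {b: p * (b if outcome else 1 - b) for b, p in posterior.items()}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}

for _ in range(100):
    posterior = update(posterior, flip(true_bias))

# Algorithmic: the agent may still randomize its own choices, e.g. break ties
# or explore using internally generated random numbers.
print("posterior over the bias:", posterior)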


Environment Proxies



 Learning Targets



Cost-Benefit Analysis 

The paper highlights a number of design decisions. These determine the components of the agent state, the environment proxy, the learning target, and how actions are selected to balance between immediate reward and information acquisition. Choices are constrained by memory and per-timestep computation, and they influence expected return in complex ways. The authors formalize the design problem and establish a regret bound that can facilitate cost-benefit analysis; a generic form of the regret in question is written out below.
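For reference, regret measures the cumulative shortfall of the agent's rewards relative to a baseline policy. A standard way to write it over T timesteps (this is the generic form, not necessarily the paper's exact definition, which is stated relative to the target policy) is

\[
\mathrm{Regret}(T) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{T-1}\bigl(\bar{r}_{*} - R_{t+1}\bigr)\right],
\]

where \(\bar{r}_{*}\) denotes the expected per-timestep reward of the baseline (target) policy and \(R_{t+1}\) is the reward the agent actually receives at time t. Roughly speaking, the paper's bound relates this quantity to the amount of information, in bits, that the agent must acquire about its learning target.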


Sample-Based Action Selection
We specialize our discussion to feed-forward variants of DQN with aleatoric state St = Ot and epistemic state Pt = (θt, Bt). Here θt represents the parameters of an ENN f and Bt an experience replay buffer. The epistemic state is updated according to (14) for θt and a first-in-first-out (FIFO) rule for Bt. To complete the agent definition we need to specify how actions are selected from the agent state Xt = (Zt, St, Pt). With this notation we can concisely review three approaches to action selection (a rough code sketch follows the list):

• ε-greedy: algorithmic state Zt = ∅; select At ∈ arg maxa fθt(St, Zt)[a] with probability 1 − ε, and a uniform random action with probability ε (Mnih et al., 2013).

• Thompson sampling (TS): algorithmic state Zt = Zk, resampled uniformly at random at the start of each episode k; select At ∈ arg maxa fθt(St, Zt)[a] (Osband et al., 2016).

• Information-directed sampling (IDS): algorithmic state Zt = ∅; compute an action distribution νt that minimizes a sample-based estimate of the information ratio using nIDS samples; sample action At from νt.
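As a rough illustration of the first two selection rules (my own sketch, not the authors' code), suppose the epistemic network f is available as a function q_values(state, z) that returns one action-value estimate per action, where different indices z correspond to different plausible value functions. That interface is an assumption made for this example, not the paper's ENN API.

import random

NUM_ACTIONS = 3

def epsilon_greedy(q_values, state, epsilon=0.1):
    # Algorithmic state is empty; the only injected randomness is the coin flip.
    if random.random() < epsilon:
        return random.randrange(NUM_ACTIONS)
    values = q_values(state, z=None)
    return max(range(NUM_ACTIONS), key=lambda a: values[a])

def thompson_sampling_action(q_values, state, z_episode):
    # z_episode is drawn once at the start of the episode and held fixed,
    # so the agent acts greedily under a single sampled hypothesis.
    values = q_values(state, z=z_episode)
    return max(range(NUM_ACTIONS), key=lambda a: values[a])

# Purely illustrative stand-in for an ENN's action-value output.
def fake_q_values(state, z):
    rng = random.Random(hash((state, z)))
    return [rng.random() for _ in range(NUM_ACTIONS)]

z_k = random.randrange(10)  # index drawn at the start of episode k
print("epsilon-greedy action:", epsilon_greedy(fake_q_values, state=0))
print("Thompson-sampling action:", thompson_sampling_action(fake_q_values, state=0, z_episode=z_k))

IDS is omitted above: it would draw several indices z, use them to estimate the information ratio of candidate action distributions, and then sample the action from the minimizing distribution νt rather than acting greedily.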



Conclusion

The concepts and algorithms we have introduced are motivated by an objective to minimize regret. They serve to guide agent design. The resulting agents are unlikely to attain minimal regret, though these concepts may lead to lower regret than otherwise. 

We have taken the learning target to be fixed and treated the target policy as a baseline. An alternative could be to prescribe a class of learning targets, with varying target policy regret. The designer might then balance between the number of bits required, the cost of acquiring those bits, and regret of the resulting target policy. This balance could also be adapted over time to reduce regret further. While the work presents an initial investigation pertaining to very simple bandit environments, leveraging concepts from rate-distortion theory, much remains to be understood about this subject. More broadly, one could consider simultaneous optimization of proxies and learning targets. In particular, for any reward function and distribution over environments, the designer could execute an algorithm that automatically selects a learning target and proxy, possibly from sets she specifies. This topic could be thought of as automated architecture design.

