
Reinforcement Learning, Bit by Bit

 

Other learning paradigms are about minimization; 

Reinforcement learning is about maximization.


The statement quoted above has been attributed to Harry Klopf, though it may only be accurate in sentiment. The statement may sound vacuous, since minimization can be converted to maximization simply by negating the objective. However, further reflection reveals a deeper observation. Many learning algorithms aim to mimic observed patterns, minimizing differences between model and data. Reinforcement learning is distinguished by its open-ended view. A reinforcement learning agent learns to improve its behavior over time, without a prescription for eventual dynamics or the limits of performance. If the objective takes non-negative values, minimization suggests a well-defined desired outcome, while maximization conjures pursuit of the unknown.



Video courtesy: bdtechtalks.com

What happens when AI Plays Hide and Seek 500 Times

Paper by: Xiuyuan Lu, Benjamin Van Roy, Vikranth Dwaracherla, Morteza Ibrahimi, Ian Osband, Zheng Wen

Paper Link


Data Efficiency

In reinforcement learning, the nature of data depends on the agent's behavior. This bears important implications on the need for data efficiency. In supervised and unsupervised learning, data is typically viewed as static or evolving slowly. If data is abundant, as is the case in many modern application areas, the performance bottleneck often lies in model capacity and computational infrastructure. This holds also when reinforcement learning is applied to simulated environments; while data generated in the course of learning does evolve, a slow rate can be maintained, in which case model capacity and computation remain bottlenecks, though data efficiency can be helpful in reducing simulation time. On the other hand, in a real environment, data efficiency often becomes the gating factor.


The paper develops a framework for studying the costs and benefits associated with information. Below I highlight the major reinforcement learning concepts it covers; the paper itself discusses each of them in detail. A minimal sketch of the agent-environment interaction loop follows the list.

  • Agents
    • coin tossing
    • dialogue
  • Agent-Environment Interface
  • Policies and Rewards
  • Agent State
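To ground these concepts, here is a minimal sketch of the agent-environment interface, with a coin-tossing-style environment standing in for ρ. The function names, the Bernoulli environment, and the uniform-random policy are illustrative assumptions, not from the paper: the agent selects an action from its history, the environment draws the next observation from ρ(·|h, a), and the reward is a known function of the transition.

```python
import random
from typing import Callable, List, Tuple

History = List[Tuple[int, int]]  # alternating (action, observation) pairs

def interact(policy: Callable[[History], int],
             sample_observation: Callable[[History, int], int],
             reward: Callable[[int, int], float],
             horizon: int) -> float:
    """Run the agent-environment loop for a fixed horizon (illustrative sketch)."""
    history: History = []
    total_reward = 0.0
    for t in range(horizon):
        action = policy(history)                   # A_t chosen from the history
        obs = sample_observation(history, action)  # O_{t+1} ~ rho(.|h, A_t)
        total_reward += reward(action, obs)        # R_{t+1} = r(A_t, O_{t+1})
        history.append((action, obs))
    return total_reward

# Example: a biased coin-tossing environment and a uniform-random policy.
rng = random.Random(0)
total = interact(
    policy=lambda h: rng.choice([0, 1]),
    sample_observation=lambda h, a: int(rng.random() < (0.6 if a == 1 else 0.4)),
    reward=lambda a, o: float(o),
    horizon=100,
)
print(total)
```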

Sources of Uncertainty 

The agent should be designed to operate effectively in the face of uncertainty. It is useful to distinguish three potential sources of uncertainty (a minimal code sketch follows the list):

  • Algorithmic uncertainty may be introduced through computations carried out by the agent. For example, the agent could apply a randomized algorithm to select actions in a manner that depends on internally generated random numbers. 
  • Aleatoric uncertainty is associated with the unpredictability of observations that persists even when the observation probability function ρ is known. In particular, given a history h and action a, while ρ(·|h, a) assigns probabilities to possible immediate observations, the realization is randomly drawn.


  • Epistemic uncertainty is due to not knowing the environment – this amounts to uncertainty about the observation probability function ρ, since the action and observation sets are inherent to the agent design.
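To make the distinction concrete, here is a minimal sketch using a two-armed Bernoulli bandit; the bandit, the Beta posterior, and the Thompson-style sampling are illustrative assumptions, not the paper's agent. The agent's own random draws supply algorithmic uncertainty, the coin flips supply aleatoric uncertainty even when the success probabilities are known, and the posterior over the unknown success probabilities captures epistemic uncertainty.

```python
import random

# Illustrative sketch (assumptions, not the paper's agent): a two-armed
# Bernoulli bandit viewed through the three sources of uncertainty.
true_p = [0.4, 0.7]            # unknown to the agent: epistemic uncertainty about rho
posterior = [[1, 1], [1, 1]]   # Beta(alpha, beta) belief per arm

agent_rng = random.Random(0)   # algorithmic: the agent's internally generated randomness
env_rng = random.Random(1)     # aleatoric: randomness in observations given rho

for t in range(200):
    # Algorithmic uncertainty: the action depends on the agent's own random draws
    # (a Thompson-style posterior sample per arm).
    samples = [agent_rng.betavariate(a, b) for a, b in posterior]
    action = max(range(2), key=lambda i: samples[i])

    # Aleatoric uncertainty: even if true_p were known, the observation is random.
    obs = 1 if env_rng.random() < true_p[action] else 0

    # Epistemic uncertainty: the posterior over the unknown rho tightens with data.
    posterior[action][0] += obs
    posterior[action][1] += 1 - obs
```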


Environment Proxies



Learning Targets



Cost-Benefit Analysis 

The paper highlights a number of design decisions. These determine the components of agent state, the environment proxy, the learning target, and how actions are selected to balance between immediate reward and information acquisition. Choices are constrained by memory and per-timestep computation, and they influence expected return in complex ways. In this section, the authors formalize the design problem and establish a regret bound that can facilitate cost-benefit analysis.
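The precise bound and its conditions are stated in the paper. As a schematic reminder of the general flavor of information-ratio regret bounds that this analysis builds on (following Russo and Van Roy's information-ratio style of argument; the symbols below are assumptions for illustration, not the paper's exact statement), regret over T timesteps relative to the target policy is controlled by a bound Γ̄ on the per-timestep information ratio and the information the agent must acquire about its learning target χ:

```latex
% Schematic only; symbols assumed for illustration, not the paper's exact statement.
\mathbb{E}\!\left[\mathrm{Regret}(T)\right]
  \;\le\; \sqrt{\,\overline{\Gamma}\;\mathbb{I}(\chi;\,\mathcal{E})\;T\,}
```

Here 𝕀(χ; E) denotes the mutual information between the learning target and the environment; fewer bits to learn, or a smaller information ratio, translate into a smaller regret bound.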


Sample-Based Action Selection
The paper specializes the discussion to feed-forward variants of DQN with aleatoric state St = Ot and epistemic state Pt = (θt, Bt). Here θt represents the parameters of an ENN f and Bt an experience replay buffer. The epistemic state is updated according to (14) for θt and the first-in-first-out (FIFO) rule for Bt. To complete the agent definition, an action selection policy operating on the agent state Xt = (Zt, St, Pt) is needed. With this notation, three approaches to action selection can be concisely reviewed (a simplified code sketch follows the list):

• ε-greedy: algorithmic state Zt = ∅; select At ∈ arg maxa fθt(St, Zt)[a] with probability 1 − ε, and a uniform random action with probability ε (Mnih et al., 2013).

• Thompson sampling (TS): algorithmic state Zt is an epistemic index, resampled uniformly at random at the start of each episode k; select At ∈ arg maxa fθt(St, Zt)[a] (Osband et al., 2016).

• Information-directed sampling (IDS): algorithmic state Zt = ∅; compute an action distribution νt that minimizes a sample-based estimate of the information ratio using nIDS samples; sample action At from νt.
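The three rules can be contrasted side by side with a small sketch. Here an ensemble of Q-value samples stands in for the ENN fθ evaluated at a fixed state across different epistemic indices z, and the IDS rule uses a simplified variance-based information proxy that returns a single action rather than the paper's sample-based information-ratio estimate over action distributions; all names below are assumptions for illustration.

```python
import numpy as np

# Simplified sketch (assumptions for illustration): q_samples[i, a] is the
# Q-value of action a under the i-th sampled epistemic index z, e.g. produced
# by an ensemble standing in for the ENN f_theta(s, z).

def epsilon_greedy(q_samples, epsilon, rng):
    q_mean = q_samples.mean(axis=0)              # marginalize over epistemic index z
    if rng.random() < epsilon:
        return int(rng.integers(q_mean.size))    # uniform random action
    return int(np.argmax(q_mean))

def thompson_sampling(q_samples, rng):
    # One epistemic index is held fixed for a whole episode; here we draw one sample.
    z = rng.integers(q_samples.shape[0])
    return int(np.argmax(q_samples[z]))

def ids(q_samples):
    # Regret estimate: shortfall of each action's mean Q-value.
    # Information estimate: variance of each action's Q-value across z.
    q_mean = q_samples.mean(axis=0)
    regret = q_mean.max() - q_mean
    info = q_samples.var(axis=0) + 1e-9
    return int(np.argmin(regret ** 2 / info))

rng = np.random.default_rng(0)
q_samples = rng.normal(size=(32, 4))             # 32 epistemic samples, 4 actions
print(epsilon_greedy(q_samples, 0.05, rng),
      thompson_sampling(q_samples, rng),
      ids(q_samples))
```

Roughly speaking, ε-greedy ignores epistemic information when choosing actions, TS commits to a sampled epistemic index for an episode, and IDS explicitly trades immediate regret against information gained about the learning target.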



Conclusion

The concepts and algorithms introduced in the paper are motivated by an objective to minimize regret. They serve to guide agent design. The resulting agents are unlikely to attain minimal regret, though these concepts may lead to lower regret than otherwise.

The authors take the learning target to be fixed and treat the target policy as a baseline. An alternative could be to prescribe a class of learning targets, with varying target policy regret. The designer might then balance between the number of bits required, the cost of acquiring those bits, and the regret of the resulting target policy. This balance could also be adapted over time to reduce regret further. While the work presents an initial investigation pertaining to very simple bandit environments, leveraging concepts from rate-distortion theory, much remains to be understood about this subject. More broadly, one could consider simultaneous optimization of proxies and learning targets. In particular, for any reward function and distribution over environments, the designer could execute an algorithm that automatically selects a learning target and proxy, possibly from sets she specifies. This topic could be thought of as automated architecture design.

