Other learning paradigms are about minimization;
Reinforcement learning is about maximization.
The statement quoted above has been attributed to Harry Klopf, though it may only be accurate in sentiment. It may sound vacuous, since minimization can be converted to maximization simply by negating the objective. Further reflection, however, reveals a deeper observation. Many learning algorithms aim to mimic observed patterns, minimizing differences between model and data. Reinforcement learning is distinguished by its open-ended view: a reinforcement learning agent learns to improve its behavior over time, without a prescription for eventual dynamics or the limits of performance. If the objective takes non-negative values, minimization suggests a well-defined desired outcome, while maximization conjures pursuit of the unknown.
What Happens When AI Plays Hide and Seek 500 Times
Paper by: Xiuyuan Lu, Benjamin Van Roy, Vikranth Dwaracherla, Morteza Ibrahimi, Ian Osband, Zheng Wen
Data Efficiency
In reinforcement learning, the nature of data depends on the agent's behavior. This has important implications for the need for data efficiency. In supervised and unsupervised learning, data is typically viewed as static or evolving slowly. If data is abundant, as is the case in many modern application areas, the performance bottleneck often lies in model capacity and computational infrastructure. The same holds when reinforcement learning is applied to simulated environments: although the data generated in the course of learning does evolve, a slow rate can be maintained, in which case model capacity and computation remain the bottlenecks, though data efficiency can still help reduce simulation time. In a real environment, on the other hand, data efficiency often becomes the gating factor.
The paper develops a framework for studying the costs and benefits associated with information. Below, I highlight the major reinforcement learning concepts it builds on; the paper discusses each of them in detail:
- Agents
- Coin tossing
- Dialogue
- Agent-Environment Interface
- Policies and Rewards
- Agent State
Sources of Uncertainty
The agent should be designed to operate effectively in the face of uncertainty. It is useful to distinguish three potential sources of uncertainty:
- Algorithmic uncertainty may be introduced through computations carried out by the agent. For example, the agent could apply a randomized algorithm to select actions in a manner that depends on internally generated random numbers.
- Aleatoric uncertainty is associated with the unpredictability of observations that persists even when the observation probability function ρ is known. In particular, given a history h and action a, while ρ(·|h, a) assigns probabilities to possible immediate observations, the realization is randomly drawn.
- Epistemic uncertainty is due to not knowing the environment – this amounts to uncertainty about the observation probability function ρ, since the action and observation sets are inherent to the agent design.
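As a concrete illustration of the distinction, consider the coin-tossing example listed earlier. Below is a minimal sketch, assuming a coin with an unknown bias modeled by a Beta posterior; the coin, the prior, and all variable names here are illustrative choices of mine, not definitions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)      # algorithmic uncertainty lives in this internal RNG

true_bias = 0.7                     # hidden from the agent
alpha, beta = 1.0, 1.0              # Beta(1, 1) prior over the unknown bias (epistemic)

for t in range(100):
    heads = rng.random() < true_bias    # aleatoric: random even if the bias were known
    alpha += heads
    beta += not heads

# Epistemic uncertainty (posterior variance over the bias) shrinks as data accumulates;
# aleatoric uncertainty (outcome variance given the bias) does not.
epistemic_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
aleatoric_var = true_bias * (1 - true_bias)
print(f"epistemic (posterior) variance: {epistemic_var:.4f}")
print(f"aleatoric (outcome) variance:   {aleatoric_var:.4f}")
```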
Environment Proxies
We specialize our discussion to feed-forward variants of DQN with aleatoric state St = Ot and epistemic state Pt = (θt, Bt). Here θt represents parameters of an ENN f and Bt an experience replay buffer. The epistemic state is updated according to (14) for θt and the first-in-first-out (FIFO) rule for Bt. To complete our agent definition we need to define the action selection policy from agent state Xt = (Zt, St, Pt). With this notation we can concisely review three approaches to action selection:
• ε-greedy: algorithmic state Zt = ∅; select At ∈ arg maxa fθt(St, Zt)[a] with probability 1 − ε, and a uniform random action with probability ε (Mnih et al., 2013).
• Thompson sampling (TS): algorithmic state Zt = zk, an epistemic index resampled uniformly at random at the start of each episode k; select At ∈ arg maxa fθt(St, Zt)[a] (Osband et al., 2016).
• Information-directed sampling (IDS): algorithmic state Zt = ∅; compute the action distribution νt that minimizes a sample-based estimate of the information ratio, using nIDS samples; sample At from νt.
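To make the three rules concrete, here is a minimal sketch in Python. It assumes the ENN is a callable f(theta, s, z) that returns a vector of Q-value estimates indexed by action; the function names, the variance-based information proxy, and restricting IDS to a single deterministic action are simplifications of mine, not the paper's definitions (the paper's IDS optimizes over action distributions).

```python
import numpy as np

def epsilon_greedy_action(f, theta, s, epsilon, num_actions, rng):
    """epsilon-greedy: algorithmic state Z_t is empty; explore with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(np.argmax(f(theta, s, None)))

def thompson_sampling_action(f, theta, s, z_episode):
    """TS: act greedily w.r.t. the ENN evaluated at the index sampled for this episode."""
    return int(np.argmax(f(theta, s, z_episode)))

def ids_action(f, theta, s, epistemic_indices):
    """Simplified IDS: score each action by (estimated regret)^2 / information proxy,
    using n_IDS epistemic samples, and pick a single minimizing action."""
    q = np.stack([f(theta, s, z) for z in epistemic_indices])  # shape (n_IDS, num_actions)
    q_mean = q.mean(axis=0)
    regret = q_mean.max() - q_mean      # estimated shortfall of each action
    info = q.var(axis=0) + 1e-8         # variance across samples as an information proxy
    return int(np.argmin(regret ** 2 / info))
```

In all three cases the action depends only on the agent state Xt = (Zt, St, Pt): ε-greedy and IDS carry no algorithmic state, while TS carries the epistemic index sampled at the start of the episode.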