Vincec's Dimension

AI Review Note - Learning

Word count: 657 / Reading time: 4 min
2018/11/22 Share


Two types of Learning

  • Deductive: by already known rules and facts, in medical
  • Inductive: Learn new rules/facts from data sets, machine learning

Learning Agent

Having critic to give self feedback.

Type of feedback

  • Supervised learning: Correct answer given
  • Unsupervised learning: Correct answer not given
  • Reinforcement learning: Given Reward

Inductive Learning

Decision Trees


Entropy = -Σ P(vi) ln[P(vi)] = -P(YES) ln[P(YES)] - -P(NO) ln[P(NO)]


We prefer to find more compact decision trees

Keywords: Expressiveness / Hypothesis spaces / ID3 Algorithm

Information theory

Information gain

Choose Sunny first and then find the largest Entropy is Temp, then chagen Sunny to Overcast, and so on

IG(A) = before - after

Choose the attributes with the largest IG

Neural Networks and Deep Learning

Logical Functions


  • AND
  • OR
  • NOT

Single layer is limited: cannot compute XOR, can only represent Linearly separable functions

  • Feed-Forward
  • Perceptron Learning

Multilayer NN

Overcome limitations of single-layer

  • Output units
  • Hidden units: decided by hand or research: too many -> memorize input, overfitting / too lew -> difficult to represent function
  • Input units
Genetic Algorithm

For evaluating variations of the network, e.g. number of hidden units….








Network Structures

Feed-forward networks & Recurrent networks


Back-Propagation is to minimize the errors
Can not direct compute error on output of hidden units in Multilayer NN.

Hopfield Networks

Hopfield networks is minimizing the energy

Deep Learning

  • Train good feature automatically
  • Same method for different domain

Markov Decision Processes(MDP)

Model the environment

  • A - Action: {forward, backward, turn-around}
  • Z - Percepts (observations): {ros, colcano, nothing}
  • S - States: {north, south, east, west}
  • Appearance: state -> observations
  • Transition: (state, actions) -> states
  • Current State (s)
  • Reward: R(s) or R(s, a)
  • Value/Utility if States: U(s)


  • Probability (Transition) - Pa(s, s’) = Pr(st+1 = s’ | st = s, at = a)
  • Immediate Reward - Ra(s, s’) = R(st = s, st = s’)


MDP has no sensor model

Consists of

  • State S and action A
  • Initial State s0 Probability Distribution Pi
  • Transition mode T(s’ | s, a)
  • A reward function R(s)

Partial Observation MDP (POMDP)

Can only see the percepts, not the states

  • Model
  • State Value(Utitly)
  • Policy (Choose the best action)

Consists of

  • State S and action A
  • Initial State s0 Probability Distribution Pi
  • Transition mode T(s’ | s, a)
  • A reward function R(s)
  • A sensor model P(e|s)
  • A belif of what the current state is b(s)

Compute Utilities

  • Goal (Given)
  • Reward (Given)
  • Value/Utility (Computed), with future discount

Bellman equations

Bellman iteration

Value Iteration




A solution of (PO)MDP - Policy Pi(s) = a

Optimal Policy
  • A policy that yields the highest expected utility.
  • Compute when U*(s’) is known

Policy iteration

Impoving Pi(s) ebery step, start with a randomly chosen initial policy Pi0

Policy evalutation

By a given policy Pi each iteration

Policy Improvement

Update Pi each time by using one-step look-ahead

Reinforcement Learning

Two General Classes

  • Model-based RL
  • Model-free RL


  • Find optimal policy


  • States is not always known and computation time can be extreme large
  • Cannot handle raw data
  • Cannot do Model-free RL when goal is change

Model-based RL

Known the transition, first know the utilities, then learn the policy

  • Policy iteration

Model-free RL

Without know the transition, learning by receiving samples from environment

  • Monte Carlo
  • Temporal Difference
  • Q-Learning
Monte Carlo
Temporal Difference

Using Q(s, a) to represent the value of taking an action a in state s, rather than only

  • The value of state U(s):

  • Get optimal Pi(s)


  • Sample
  • Q(s, a) = (1-a)Q(s, a) + a*Sample



  • No need for a fixed policy, (could off-policy learning)
  • Optimal policy Pi(s)=a to select the action a that has the best Q(s, a)
  • Provably convergent to the optimal policy when t -> infinite


  • slides
  1. 1. Learning
    1. 1.1. Two types of Learning
    2. 1.2. Learning Agent
      1. 1.2.1. Type of feedback
    3. 1.3. Inductive Learning
      1. 1.3.1. Decision Trees
        1. Entropy
        2. Information theory
        3. Information gain
      2. 1.3.2. Neural Networks and Deep Learning
        1. Logical Functions
        2. Multilayer NN
          1. Genetic Algorithm
        3. Network Structures
          1. Back-Propagation
          2. Hopfield Networks
        4. Deep Learning
      3. 1.3.3. Markov Decision Processes(MDP)
        1. Model the environment
        2. MDP
        3. Partial Observation MDP (POMDP)
        4. Compute Utilities
        5. Bellman equations
        6. Bellman iteration
        7. Value Iteration
          1. MDP
          2. POMDP
        8. Policy
          1. Optimal Policy
        9. Policy iteration
          1. Policy evalutation
          2. Policy Improvement
      4. 1.3.4. Reinforcement Learning
        1. Pros
        2. Cons
        3. Model-based RL
        4. Model-free RL
          1. Monte Carlo
          2. Temporal Difference
          3. Q-Learning
  2. 2. Reference