Inverse Reinforcement Learning

In traditional reinforcement learning (RL), an agent learns a policy (a behavior strategy) by maximizing a known reward function. Inverse Reinforcement Learning (IRL) reverses this: instead of learning a policy from a reward function, it learns a reward function from observed behavior. The setup is therefore:

Input: Expert demonstrations (trajectories of states and actions).
Output: Learned reward function that explains the expert's behavior.
Goal: Discover the hidden objectives the expert was optimizing.
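
To make the input concrete, here is a minimal sketch of how demonstrations might be stored and summarized. The integer state and action ids and the helper name `state_visit_frequencies` are placeholders chosen for illustration, not part of any particular library.

```python
import numpy as np

# Hypothetical toy data: each trajectory is a list of (state, action) pairs.
# States 0..3 and actions 0..1 are placeholder integer ids.
demonstrations = [
    [(0, 1), (1, 1), (2, 1), (3, 0)],
    [(0, 1), (1, 0), (1, 1), (2, 1)],
]

def state_visit_frequencies(trajectories, n_states):
    """Average number of visits to each state per trajectory."""
    counts = np.zeros(n_states)
    for trajectory in trajectories:
        for state, _action in trajectory:
            counts[state] += 1
    return counts / len(trajectories)

visits = state_visit_frequencies(demonstrations, n_states=4)
print(visits)
```

Summary statistics of this kind are what many IRL methods try to reproduce when they fit a reward function.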

Why IRL > Traditional RL & Imitation Learning

Approach	How It Works	Limitations
Traditional RL	Learns policy from predefined rewards	Reward engineering is complex and error-prone
Imitation Learning	Copies actions directly from experts	Fails when deviating from expert states; no understanding of why
IRL	Infers rewards from expert behavior, which helps it generalize to new situations and capture intent	Reward is ambiguous; computationally expensive

Key Advantages of IRL

Data Efficiency: Can learn complex behaviors from limited demonstrations by inferring the underlying intent.
Robustness: Reduces the cascading errors seen in imitation learning by learning objectives instead of copying actions.
Interpretability: Reveals why experts make decisions (e.g., safety vs. speed trade-offs in driving).

How IRL Works: Step-by-Step

Expert Demonstrations: Collect trajectories (e.g. human driver recordings: states = road conditions, actions = steering/braking).
Reward Hypothesis: Assume expert behavior optimizes an unknown reward function R(s,a).
Optimization Loop:

graph LR
A[Current Reward Estimate] --> B[Compute Optimal Policy]
B --> C[Compare Policy vs. Expert]
C --> D[Update Reward Function]
D --> A
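
The loop above can be sketched on a toy problem. Everything here is an assumption made for illustration: a five-state chain with deterministic moves, a reward that depends only on the state, an expert who always walks right, and a soft (maximum-entropy) policy standing in for the strictly optimal policy in the "Compute Optimal Policy" step.

```python
import numpy as np

N_STATES, N_ACTIONS, HORIZON = 5, 2, 6   # chain 0..4; actions: 0 = left, 1 = right

def step(state, action):
    """Deterministic chain dynamics with walls at both ends."""
    return max(state - 1, 0) if action == 0 else min(state + 1, N_STATES - 1)

NEXT = np.array([[step(s, a) for a in range(N_ACTIONS)] for s in range(N_STATES)])

def soft_policy(reward):
    """Finite-horizon soft value iteration; returns one policy per time step."""
    value = np.zeros(N_STATES)                       # value after the final step
    policies = []
    for _ in range(HORIZON):
        q = reward[:, None] + value[NEXT]            # shape (states, actions)
        value = np.logaddexp(q[:, 0], q[:, 1])       # soft maximum over actions
        policies.append(np.exp(q - value[:, None]))  # action probabilities
    return policies[::-1]                            # policies[t] applies at step t

def expected_visits(policies, start_state=0):
    """Expected number of visits to each state under the time-varying policy."""
    dist = np.zeros(N_STATES)
    dist[start_state] = 1.0
    visits = np.zeros(N_STATES)
    for policy in policies:
        visits += dist
        next_dist = np.zeros(N_STATES)
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                next_dist[NEXT[s, a]] += dist[s] * policy[s, a]
        dist = next_dist
    return visits

expert_visits = np.array([1.0, 1.0, 1.0, 1.0, 2.0])  # expert path 0,1,2,3,4,4

reward = np.zeros(N_STATES)                          # current reward estimate
mismatch_history = []
for _ in range(200):
    policies = soft_policy(reward)                        # compute policy
    mismatch = expert_visits - expected_visits(policies)  # compare with expert
    mismatch_history.append(np.abs(mismatch).sum())
    reward += 0.1 * mismatch                              # update reward function

print(np.round(reward, 2))
```

The step size of 0.1 and the 200 iterations are arbitrary choices for this toy problem. In larger problems the policy step is itself a full RL problem, which is where most of the cost of IRL comes from.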

Inverse Reinforcement Learning Algorithms

1. Maximum Entropy IRL (MaxEnt)

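Principle: Assume the expert is only approximately optimal: higher-reward trajectories are exponentially more likely, but suboptimal ones still occur.
Math: P(\tau) \propto \exp\left( \sum_t R(s_t, a_t) \right)
Training: Maximize the likelihood of the expert trajectories under this distribution.
Why it works: Among all trajectory distributions consistent with the demonstrations, it picks the one with the highest entropy, so it resolves reward ambiguity without committing to anything the data does not support.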
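
As a worked step, the likelihood objective can be written out under two assumptions that are not stated above: the reward is linear in a feature vector \phi(s, a) with weights \theta, and the environment dynamics are deterministic. Under those assumptions the gradient takes a simple form.

```latex
% Assumptions: R_\theta(s, a) = \theta^\top \phi(s, a), deterministic dynamics,
% and N expert trajectories \tau_1, \dots, \tau_N.
P_\theta(\tau) = \frac{1}{Z(\theta)} \exp\left( \sum_t R_\theta(s_t, a_t) \right),
\qquad
Z(\theta) = \sum_{\tau'} \exp\left( \sum_t R_\theta(s'_t, a'_t) \right)

\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log P_\theta(\tau_i)

\nabla_\theta \mathcal{L}(\theta)
  = \frac{1}{N} \sum_{i=1}^{N} \sum_t \phi\left(s_t^{(i)}, a_t^{(i)}\right)
  - \mathbb{E}_{\tau \sim P_\theta} \left[ \sum_t \phi(s_t, a_t) \right]
```

The gradient is the gap between the expert's average feature counts and the feature counts expected under the current reward, so it vanishes when the two match.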

2. Adversarial IRL (AIRL)

Architecture:

Generator: Agent policy producing trajectories
Discriminator: Classifies "expert vs. generated" data → outputs reward

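Objective: \min_{\pi} \max_{D} \; \mathbb{E}_{\pi} [\log D(s, a)] + \mathbb{E}_{\text{expert}} [\log(1 - D(s, a))]

In this form, D(s, a) is the probability that the pair (s, a) came from the generator rather than the expert.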

Advantage: Designed to recover rewards that remain valid when the environment dynamics change (e.g., sim-to-real robotics).
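
A minimal sketch of how the discriminator can double as a reward, following the objective above, in which D scores generated samples highly. It assumes the structured discriminator commonly associated with AIRL, where a learned function f(s, a) is compared against the policy's own action probability; `f_value` and `log_pi` are hypothetical inputs standing in for the outputs of trained networks.

```python
import numpy as np

def generated_probability(f_value, log_pi):
    """D(s, a): probability that (s, a) came from the generator."""
    # Expert probability is exp(f) / (exp(f) + pi) = sigmoid(f - log_pi).
    p_expert = 1.0 / (1.0 + np.exp(-(f_value - log_pi)))
    return 1.0 - p_expert

def recovered_reward(f_value, log_pi):
    """Reward signal log(1 - D) - log(D), which simplifies to f - log_pi."""
    d = generated_probability(f_value, log_pi)
    return np.log(1.0 - d) - np.log(d)

# Hypothetical values: f(s, a) = 1.5 and the policy takes this action with probability 0.25.
example_reward = recovered_reward(1.5, np.log(0.25))
print(example_reward)
```

Subtracting the policy's log-probability is what separates the reward term f from the behavior of the current policy, which is the property the transferability claim rests on.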

Autonomous Driving Example

Expert Trajectory: Human driver slows down near schools, stops at yellow lights.
IRL Infers Reward: R(s, a) = w_1 \cdot \text{safety} + w_2 \cdot \text{speed} - w_3 \cdot \text{jerk}
Interpretation: The learned weights show that safety outweighs speed near schools.
Result: The self-driving car can generalize to unseen scenarios (e.g., construction zones) because it has learned objectives, not just actions.
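
The linear reward in this example can be evaluated directly. The weights and feature values below are made up for illustration; in practice the weights would come out of an IRL procedure.

```python
import numpy as np

# Hypothetical weights (w_1, w_2, w_3) for safety, speed, and jerk.
weights = np.array([2.0, 0.5, 1.0])
signs = np.array([1.0, 1.0, -1.0])  # jerk enters the reward with a minus sign

def reward(safety, speed, jerk):
    """R(s, a) = w_1 * safety + w_2 * speed - w_3 * jerk."""
    features = np.array([safety, speed, jerk])
    return float(np.dot(weights * signs, features))

# Made-up feature values for two candidate actions near a school.
slow_down = reward(safety=0.9, speed=0.3, jerk=0.1)
keep_speed = reward(safety=0.4, speed=0.8, jerk=0.0)
print(slow_down, keep_speed)
```

With these numbers, slowing down scores higher than keeping speed because the safety weight dominates, which is the trade-off the example describes.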

Why IRL Excels with Limited Data

Imitation Learning Failure: If the expert never encounters a rare scenario (e.g., a deer on the road), a policy that only copies actions has nothing to copy and may fail.
IRL Solution: Infers that avoiding collisions is a core reward, which applies to the deer scenario even without explicit data.
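
The contrast can be caricatured in a few lines. This is a deliberately simplified sketch: imitation is reduced to a lookup table, the reward weights are invented, and the outcome features for each action are assumed to be available from some model of the scene.

```python
# Behavior cloning reduced to its simplest form: a lookup of demonstrated states.
cloned_policy = {"clear_road": "keep_speed", "school_zone": "slow_down"}

# Hypothetical learned reward: a collision penalty dominates a small progress bonus.
def learned_reward(collision_risk, progress):
    return -10.0 * collision_risk + 1.0 * progress

# Hypothetical outcome features (collision_risk, progress) for an unseen state.
deer_on_road = {"keep_speed": (0.9, 1.0), "brake": (0.05, 0.2)}

cloned_action = cloned_policy.get("deer_on_road")  # None: state never demonstrated
reward_action = max(deer_on_road, key=lambda a: learned_reward(*deer_on_road[a]))
print(cloned_action, reward_action)
```

Real imitation learners generalize better than a lookup table, but the underlying point holds: a reward can rank actions in states that no demonstration covers.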

Key Challenges of Inverse Reinforcement Learning (IRL)

Reward Ambiguity: Multiple reward functions can explain the same expert behavior, making it hard to identify the true intent.
Data Limitations: Limited, noisy, or suboptimal expert demonstrations can mislead the learned reward and policy.
Computational Complexity: IRL often requires repeatedly solving RL problems, making it computationally expensive, especially for large or continuous state spaces.
Generalization Issues: Learned rewards may overfit to demonstration data and fail in unseen scenarios; generalizing from few demonstrations is challenging.
Sensitivity to Prior Knowledge: IRL results can depend heavily on the choice of features and prior assumptions.
Task Alignment: Focusing only on matching demonstration data (data alignment) can lead to rewards that do not capture the true task objectives, resulting in misaligned or exploitable policies.
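
Reward ambiguity is easy to reproduce. The sketch below uses a hypothetical five-state chain and shows that a reward and a rescaled, shifted copy of it lead to the same greedy policy, so demonstrations alone cannot tell the two apart.

```python
import numpy as np

GAMMA = 0.9
# Chain of 5 states; NEXT[s, a] is the successor for action a (0 = left, 1 = right).
NEXT = np.array([[max(s - 1, 0), min(s + 1, 4)] for s in range(5)])

def greedy_policy(state_reward, n_iterations=200):
    """Value iteration followed by a greedy action choice per state."""
    value = np.zeros(5)
    for _ in range(n_iterations):
        q = state_reward[NEXT] + GAMMA * value[NEXT]  # reward received on arrival
        value = q.max(axis=1)
    return q.argmax(axis=1)

base = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
policy_a = greedy_policy(base)
policy_b = greedy_policy(3.0 * base + 7.0)  # rescaled and shifted reward
print(policy_a, policy_b)
```

Positive scaling and constant shifts are only the simplest members of this family, which is why IRL methods need an extra principle, such as maximum entropy, to choose among candidate rewards.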