Stationary Stochastic Processes and Markov Chains

  1. EE 376A: Information Theory, Winter 2018, Lecture 4. Stanford University.
  2. EE 376A: Information Theory, Winter 2017, Lecture 4. Stanford University.
  3. Cover, T. M. and Thomas, J. A. Elements of Information Theory.

Stochastic Process

Definition: A stochastic process $\{X_i\}$ is an indexed sequence of random variables. In general, there can be an arbitrary dependence among the random variables.

The process is characterized by the joint probability mass functions $\Pr\{(X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n)\} = p(x_1, x_2, \ldots, x_n)$, with $(x_1, x_2, \ldots, x_n) \in \mathcal{X}^n$ for $n = 1, 2, \ldots$

Stationary Stochastic Process

Definition: A stochastic process is said to be stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index; that is, $\Pr\{X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n\} = \Pr\{X_{1+l} = x_1, X_{2+l} = x_2, \ldots, X_{n+l} = x_n\}$ for every $n$, every shift $l$, and all $x_1, x_2, \ldots, x_n \in \mathcal{X}$.

Markov Chain

A Markov process or a Markov chain is a stochastic process indexed by time, and with the property that the future is independent of the past, given the present.

Markov processes, named for Andrei Markov, are among the most important of all random processes. In a sense, they are the stochastic analogs of differential equations and recurrence relations, which are, of course, among the most important deterministic processes.

Source: Introduction to Markov Processes.

This article only discusses Markov chains whose time index and state space are both discrete.

Definition: A discrete stochastic process $X_1, X_2, \ldots$ is said to be a Markov chain or a Markov process if, for $n = 1, 2, \ldots$, $\Pr(X_{n+1} = x_{n+1} \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_1 = x_1) = \Pr(X_{n+1} = x_{n+1} \mid X_n = x_n)$ for all $x_1, x_2, \ldots, x_n, x_{n+1} \in \mathcal{X}$.

We can denote such a chain as $X_1 \to X_2 \to \cdots \to X_n$.

  • Note: If there are only two random variables, say $X_1$ and $X_2$, they must form a Markov chain $X_1 \to X_2$, or equivalently $X_2 \to X_1$, since a Markov chain is reversible (see the proof under Properties below). This is because the Markov condition $P(X_1 \mid X_2, \text{no other } X_i) = P(X_1 \mid X_2)$ holds trivially when there are no other $X_i$.

PMF of a Markov Chain

The joint probability mass function of the random variables can be written as $p(x_1, x_2, \ldots, x_n) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2) \cdots p(x_n \mid x_{n-1})$.

Proof

Let's prove this expression by mathematical induction:

  1. Base Case ($n = 2$): $p(x_1, x_2) = p(x_1)\,p(x_2 \mid x_1)$.

  2. Inductive Step: Assume the expression holds for $n = k$: $p(x_1, x_2, \ldots, x_k) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2) \cdots p(x_k \mid x_{k-1})$.

  3. Now, consider $n = k+1$: $$p(x_1, x_2, \ldots, x_{k+1}) = p(x_1, x_2, \ldots, x_k)\,p(x_{k+1} \mid x_1, x_2, \ldots, x_k) = p(x_1, x_2, \ldots, x_k)\,p(x_{k+1} \mid x_k) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2) \cdots p(x_k \mid x_{k-1})\,p(x_{k+1} \mid x_k)$$

    • In the second equality we apply $p(x_{k+1} \mid x_1, x_2, \ldots, x_k) = p(x_{k+1} \mid x_k)$, which is the defining property of a Markov chain.
    • In the third equality we apply our assumption from the inductive step.
  4. This matches the expression for $n = k+1$. Therefore, by mathematical induction, the expression holds for all $n$.

    Hence, the joint probability mass function of the Markov chain can be written as: $$p(x_1, x_2, \ldots, x_n) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2) \cdots p(x_n \mid x_{n-1}).$$
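To make the factorization concrete, here is a minimal Python sketch (the two-state chain, initial distribution, and transition probabilities below are made-up illustrative values, not taken from the sources above) that evaluates the joint PMF of a state sequence as a product of one-step factors:

```python
import numpy as np

p0 = np.array([0.6, 0.4])              # initial distribution p(x1)
P = np.array([[0.9, 0.1],              # P[i, j] = p(x_{k+1} = j | x_k = i)
              [0.3, 0.7]])

def joint_pmf(path, p0, P):
    """p(x1, ..., xn) = p(x1) * prod_k p(x_{k+1} | x_k) for a state sequence."""
    prob = p0[path[0]]
    for prev, nxt in zip(path, path[1:]):
        prob *= P[prev, nxt]           # multiply one factor p(x_{k+1} | x_k)
    return prob

print(joint_pmf([0, 0, 1, 1], p0, P))  # 0.6 * 0.9 * 0.1 * 0.7 = 0.0378
```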

Time Invariant

Definition: The Markov chain is said to be time invariant if the conditional probability $p(x_{n+1} \mid x_n)$ does not depend on $n$; that is, for $n = 1, 2, \ldots$, $\Pr\{X_{n+1} = b \mid X_n = a\} = \Pr\{X_2 = b \mid X_1 = a\}$ for all $a, b \in \mathcal{X}$.

We will assume that the Markov chain is time invariant unless otherwise stated.

If $\{X_i\}$ is a Markov chain, $X_n$ is called the state at time $n$.

A time invariant Markov chain is characterized by its initial state and a probability transition matrix $P = [P_{ij}]$, $i, j \in \{1, 2, \ldots, m\}$, where $P_{ij} = \Pr\{X_{n+1} = j \mid X_n = i\}$.
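As an illustration, here is a small simulation sketch (the initial distribution and matrix entries are made-up values) showing how a time invariant chain is generated from exactly these two ingredients:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(p0, P, n_steps):
    """Draw X_1 ~ p0, then repeatedly draw X_{n+1} ~ P[X_n, :]."""
    states = [rng.choice(len(p0), p=p0)]
    for _ in range(n_steps - 1):
        states.append(rng.choice(P.shape[1], p=P[states[-1]]))
    return states

p0 = np.array([0.5, 0.5])              # initial state distribution
P = np.array([[0.9, 0.1],              # P[i, j] = Pr{X_{n+1} = j | X_n = i}
              [0.3, 0.7]])
print(simulate(p0, P, 10))             # one length-10 sample path
```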

If it is possible to go with positive probability from any state of the Markov chain to any other state in a finite number of steps, the Markov chain is said to be irreducible. If the largest common factor of the lengths of different paths from a state to itself is 1, the Markov chain is said to be aperiodic.

Properties

If random variables $X, Y, Z$ form a Markov chain (denoted by $X \to Y \to Z$), their PMF satisfies: $$p(x, y, z) = p(x)\,p(y \mid x)\,p(z \mid y). \tag{1}$$

  • $X \to Y \to Z$ implies that $Z \to Y \to X$. Thus, the condition is sometimes written $X \leftrightarrow Y \leftrightarrow Z$.
  • If $Z = f(Y)$, then $X \to Y \to Z$.

Conditional Independence

$X \to Y \to Z \iff X$ and $Z$ are conditionally independent given $Y$: $$p(x, z \mid y) = p(x \mid y)\,p(z \mid y).$$

Proof

($\Rightarrow$)

Given $X \to Y \to Z$, we have $p(x, z \mid y) = p(x \mid y)\,p(z \mid y)$.

Proof: $$\begin{aligned} p(x, z \mid y) &= \frac{p(x, y, z)}{p(y)} && (2) \\ &= \frac{p(x)\,p(y \mid x)\,p(z \mid y)}{p(y)} && (3) \\ &= \frac{p(x, y)\,p(z \mid y)}{p(y)} && (4) \\ &= p(x \mid y)\,p(z \mid y). && (5) \end{aligned}$$

(2): the definition of conditional probability.

(3): from (1).

(4): the definition of conditional probability: $p(x)\,p(y \mid x) = p(x, y)$.

(5): the definition of conditional probability: $\frac{p(x, y)}{p(y)} = p(x \mid y)$.


($\Leftarrow$)

Given $p(x, z \mid y) = p(x \mid y)\,p(z \mid y)$, we have $X \to Y \to Z$, which means (1).

Proof: We just need to reverse the above proof: $$p(x \mid y)\,p(z \mid y) = \frac{p(x, y)\,p(z \mid y)}{p(y)} = \frac{p(x)\,p(y \mid x)\,p(z \mid y)}{p(y)}.$$ And we know that $p(x, z \mid y) = \frac{p(x, y, z)}{p(y)}$. Since $p(x, z \mid y) = p(x \mid y)\,p(z \mid y)$, it follows that $$\frac{p(x)\,p(y \mid x)\,p(z \mid y)}{p(y)} = \frac{p(x, y, z)}{p(y)},$$ which means (1).
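As a sanity check on this equivalence, the following sketch (with made-up two-state distributions) builds $p(x, y, z)$ from the factorization (1) and verifies $p(x, z \mid y) = p(x \mid y)\,p(z \mid y)$ numerically:

```python
import numpy as np

px = np.array([0.6, 0.4])                   # p(x)
Pyx = np.array([[0.8, 0.2],                 # Pyx[x, y] = p(y | x)
                [0.1, 0.9]])
Pzy = np.array([[0.7, 0.3],                 # Pzy[y, z] = p(z | y)
                [0.4, 0.6]])

# Joint p(x, y, z) = p(x) p(y|x) p(z|y), i.e. equation (1).
pxyz = px[:, None, None] * Pyx[:, :, None] * Pzy[None, :, :]

py = pxyz.sum(axis=(0, 2))                  # marginal p(y)
for y in range(2):
    p_xz_y = pxyz[:, y, :] / py[y]                # p(x, z | y)
    p_x_y = pxyz[:, y, :].sum(axis=1) / py[y]     # p(x | y)
    p_z_y = pxyz[:, y, :].sum(axis=0) / py[y]     # p(z | y)
    assert np.allclose(p_xz_y, np.outer(p_x_y, p_z_y))
print("p(x,z|y) = p(x|y) p(z|y) holds for every y")
```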

Data Processing Inequality

Theorem (Data-processing inequality): If $X \to Y \to Z$, then $I(X; Y) \ge I(X; Z)$.

Proof:

By the chain rule, we can expand mutual information in two different ways: $$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y).$$

Since $X$ and $Z$ are conditionally independent given $Y$, we have $I(X; Z \mid Y) = 0$.

Since $I(X; Y \mid Z) \ge 0$, we have $I(X; Y) \ge I(X; Z)$.

We have equality if and only if $I(X; Y \mid Z) = 0$ (i.e., $X \to Z \to Y$ forms a Markov chain). Similarly, one can prove that $I(Y; Z) \ge I(X; Z)$.
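The inequality is easy to check numerically. The sketch below (the same style of made-up two-state chain as earlier) computes $I(X; Y)$ and $I(X; Z)$ from a joint table built with (1):

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability array (zero entries are skipped)."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def I(pab):
    """I(A;B) = H(A) + H(B) - H(A,B) from a joint probability table."""
    return H(pab.sum(axis=1)) + H(pab.sum(axis=0)) - H(pab.ravel())

px = np.array([0.6, 0.4])                    # p(x)
Pyx = np.array([[0.8, 0.2], [0.1, 0.9]])     # p(y | x)
Pzy = np.array([[0.7, 0.3], [0.4, 0.6]])     # p(z | y)
pxyz = px[:, None, None] * Pyx[:, :, None] * Pzy[None, :, :]

pxy = pxyz.sum(axis=2)                       # joint p(x, y)
pxz = pxyz.sum(axis=1)                       # joint p(x, z)
print(I(pxy), I(pxz))                        # I(X;Y) >= I(X;Z), as the DPI says
```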

Corollary

Corollary: In particular, if $Z = g(Y)$, we have $I(X; Y) \ge I(X; g(Y))$. Proof: $X \to Y \to g(Y)$ forms a Markov chain. Thus functions of the data $Y$ cannot increase the information about $X$.

Corollary: If $X \to Y \to Z$, then $I(X; Y \mid Z) \le I(X; Y)$.

Proof

Proof: In the two chain-rule expansions of $I(X; Y, Z)$ above, we note that $I(X; Z \mid Y) = 0$ by Markovity, and that $I(X; Z) \ge 0$. Thus, $I(X; Y \mid Z) \le I(X; Y)$.

Thus, the dependence of $X$ and $Y$ is decreased (or remains unchanged) by the observation of a "downstream" random variable $Z$. Note that it is also possible that $I(X; Y \mid Z) > I(X; Y)$ when $X, Y$, and $Z$ do not form a Markov chain. For example, let $X$ and $Y$ be independent fair binary random variables, and let $Z = X + Y$. Then $I(X; Y) = 0$, but $$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) = H(X \mid Z) = \Pr(Z = 1)\,H(X \mid Z = 1) = \tfrac{1}{2} \text{ bit}.$$
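This counterexample can be verified by brute-force enumeration; in the sketch below, only the setup $Z = X + Y$ with independent fair bits comes from the text, and the helper names are my own:

```python
import itertools
import math
from collections import Counter

# Enumerate the joint distribution of (x, y, z) with z = x + y.
joint = Counter()
for x, y in itertools.product([0, 1], repeat=2):
    joint[(x, y, x + y)] += 0.25

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(idx):
    m = Counter()
    for k, p in joint.items():
        m[tuple(k[i] for i in idx)] += p
    return m

# I(X;Y) = H(X) + H(Y) - H(X,Y)
IXY = entropy(marginal([0])) + entropy(marginal([1])) - entropy(marginal([0, 1]))
# I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
IXY_given_Z = (entropy(marginal([0, 2])) + entropy(marginal([1, 2]))
               - entropy(joint) - entropy(marginal([2])))
print(IXY, IXY_given_Z)   # 0.0 and 0.5 bits
```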

Stationary Distribution

If the probability mass function of the random variable at time $n$ is $p(x_n)$, the probability mass function at time $n + 1$ is $$p(x_{n+1}) = \sum_{x_n} p(x_n) P_{x_n x_{n+1}}.$$

A distribution on the states such that the distribution at time n+1 is the same as the distribution at time n is called a stationary distribution.

The stationary distribution is so called because, if the initial state of a Markov chain is drawn according to a stationary distribution, the Markov chain forms a stationary process.

If the finite-state Markov chain is irreducible and aperiodic, the stationary distribution is unique, and from any starting distribution, the distribution of Xn tends to the stationary distribution as n. (//TO BE PROVED)
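The convergence itself is easy to observe numerically (the transition matrix below is a made-up example): starting from a point mass, repeated application of $p \mapsto pP$ settles on the stationary distribution.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
p = np.array([1.0, 0.0])    # an arbitrary starting distribution (a point mass)

for _ in range(50):
    p = p @ P               # p(x_{n+1}) = sum_{x_n} p(x_n) P_{x_n x_{n+1}}

print(p)                    # -> [0.75, 0.25], and p @ P reproduces p (numerically)
```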

Example: Mickey Mouse Markov Chain

We consider an example of a two-state Markov chain, which we call the Mickey Mouse Markov Chain (MMMC), in order to understand basic properties of Markov chains.

(Figure 4.1: state transition diagram of the Mickey Mouse Markov Chain)

The probability transition matrix of the MMMC is $$P = \begin{bmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{bmatrix}.$$ Let the stationary distribution be represented by a vector $\mu$ whose components are the stationary probabilities of states 1 and 2, respectively.

Then the stationary probability can be found by solving the equation $\mu P = \mu$ or, more simply, by balancing probabilities.

For the stationary distribution, the net probability flow across any cut set in the state transition graph is zero: in steady state, the probability of crossing a cut in one direction must equal the probability of crossing it in the other. Applying this to the cut between the two states in Figure 4.1, we obtain $$\mu_1 \alpha = \mu_2 \beta.$$

Since $\mu_1 + \mu_2 = 1$, the stationary distribution is $$\mu_1 = \frac{\beta}{\alpha + \beta}, \qquad \mu_2 = \frac{\alpha}{\alpha + \beta}.$$

If the Markov chain has an initial state drawn according to the stationary distribution, the resulting process will be stationary. The entropy of the state $X_n$ at time $n$ is $$H(X_n) = H\!\left(\frac{\beta}{\alpha + \beta}, \frac{\alpha}{\alpha + \beta}\right).$$

However, this is not the rate at which entropy grows for $H(X_1, X_2, \ldots, X_n)$. The dependence among the $X_i$'s will take a steady toll.
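As a quick numeric check of the MMMC formulas (with made-up values of $\alpha$ and $\beta$), the sketch below verifies $\mu P = \mu$ and evaluates the single-state entropy $H(X_n)$:

```python
import numpy as np

alpha, beta = 0.1, 0.3
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
mu = np.array([beta, alpha]) / (alpha + beta)   # (beta/(a+b), alpha/(a+b))

assert np.allclose(mu @ P, mu)                  # mu is stationary: mu P = mu
H = -(mu * np.log2(mu)).sum()                   # entropy of the state X_n
print(mu, H)                                    # [0.75 0.25], about 0.811 bits
```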