\section{Classification Model}
\label{markov}

\subsection{$k$-Markov chains model}
Consider the standard Markov model defined by
\begin{eqnarray*}
p(x_1, \ldots, x_m) = t(x_1) \prod_{i=2}^m t(x_i | x_{i-1})
\end{eqnarray*}
where $t(x_1)$ is the probability that $x_1$ is the initial state and $t(x_i \vert x_{i-1})$ are the transition probabilities.
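To make the factorization concrete, the following minimal sketch (in Python with NumPy rather than the MATLAB used for our implementation) evaluates this probability for a single sequence; the function name \texttt{chain\_log\_prob} and the array-based parameterization are illustrative choices, and we work in log space only to avoid numerical underflow on long sequences.
\begin{verbatim}
import numpy as np

def chain_log_prob(seq, log_init, log_trans):
    # seq: list of integer state indices x_1, ..., x_m (0-based states)
    # log_init: length-S array with log t(x_1)
    # log_trans: S x S array with log t(x_i | x_{i-1}), rows indexed by x_{i-1}
    lp = log_init[seq[0]]
    for prev, cur in zip(seq[:-1], seq[1:]):
        lp += log_trans[prev, cur]
    return lp

# Example with two states: p(0 -> 1 -> 1) = 0.5 * 0.1 * 0.6 = 0.03
log_t0 = np.log(np.array([0.5, 0.5]))
log_T = np.log(np.array([[0.9, 0.1], [0.4, 0.6]]))
print(np.exp(chain_log_prob([0, 1, 1], log_t0, log_T)))
\end{verbatim}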
Similar to Gaussian Mixture Models, we can take a mixture of $k$ Markov chains. A simple extension of the basic mixture model to allow variable-length sequences gives the final model we used:
\begin{eqnarray*}
p(x_1, \ldots, x_m) = \sum_{z=1}^k q(z) \left(t_z(x_1) \, t_z(END | x_m) \prod_{i=2}^m t_z(x_i | x_{i-1})\right)
\end{eqnarray*}
Here $END$ is a special end-of-sequence symbol, so $t_z(END \vert x_m)$ is the probability that chain $z$ terminates after emitting $x_m$.
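The same kind of sketch extends to the mixture: the hypothetical \texttt{mixture\_log\_prob} below scores one variable-length sequence under each of the $k$ chains, including the termination probability, and sums the components with a log-sum-exp. As before, this is an illustrative Python/NumPy rendering, not our MATLAB code.
\begin{verbatim}
import numpy as np

def mixture_log_prob(seq, log_q, log_init, log_trans, log_end):
    # log_q:     length-k array with log q(z)
    # log_init:  k x S array with log t_z(x_1)
    # log_trans: k x S x S array with log t_z(x_i | x_{i-1})
    # log_end:   k x S array with log t_z(END | x)
    k = log_q.shape[0]
    comp = np.empty(k)
    for z in range(k):
        lp = log_q[z] + log_init[z, seq[0]] + log_end[z, seq[-1]]
        for prev, cur in zip(seq[:-1], seq[1:]):
            lp += log_trans[z, prev, cur]
        comp[z] = lp
    # log-sum-exp over the k mixture components
    return np.logaddexp.reduce(comp)
\end{verbatim}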
\subsection{Model Choice}
This model fits our goal of clustering different patterns of sequential interaction. We chose $k$-Markov chains for the following reasons:
\begin{enumerate}
\item Mixture models are a natural way to uncover clusters in an unsupervised fashion by examining the component models of the mixture.
\item The temporal relationships between method calls are more interesting and informative than, say, the counts of certain method calls. Markov chains capture this in a simple way by assuming that the system only cares about the previous state. While program interaction is not solely determined by the previous function call (in fact, there may be add/remove sequences that have context-free structure), we reasoned that choosing a simpler model would make learning easier while still yielding insight into API interaction\footnote{We also considered other classification methods ($k$-means clustering, SVMs) and capturing the temporal nature of our data by selecting pairs of function calls as features, but Markov chains capture this in a more straightforward way.}.
\end{enumerate}
We considered using hidden Markov models (HMMs) as well, but it was not clear to us what the hidden states would be. Moreover, the learning problem would have been considerably more difficult with a mixture of HMMs, since EM would need to estimate both the cluster each sequence was drawn from and the hidden states.
\subsection{Maximum Likelihood Estimation}
Similar to our derivation of the EM algorithm for other models, we first look at maximum likelihood estimation in the case where we observe exactly which Markov chain each sequence was drawn from. In this case, similar to GMMs, the maximum likelihood estimates for the parameters in our model are simple. Suppose that we have $n$ training sequences and define $\delta(z, i)$ to be 1 iff sample $i$ was drawn from chain $z$. The maximum likelihood estimates for our parameters are then:
\begin{eqnarray*}
&&n(z) = \sum_{i=1}^n \delta(z, i), \ \ q^*(z) = \frac{n(z)}{n}\\
t_z^*(x) & = & \frac{\sum_{i=1}^n \delta(z, i) [[ x_{i1} == x]]}{\sum_{i=1}^n \delta(z, i)}, \ \ t_z^*(x_u | x_v) = \frac{\sum_{i=1}^n \delta(z, i) \sum_{j=1}^{|\underline{x}_i| - 1} [[ x_{ij} == x_v \wedge x_{i,j+1} == x_u]]}{\sum_{i=1}^n \delta(z, i) \sum_{j=1}^{|\underline{x}_i|}[[ x_{ij} == x_v]]}\\
&& t_{z}^*(END|x_v) = \frac{\sum_{i=1}^n \delta(z, i) [[x_{i|\underline{x}_i|} == x_v]]}{\sum_{i=1}^n \delta(z, i) \sum_{j=1}^{|\underline{x}_i|}[[ x_{ij} == x_v]]}
\end{eqnarray*}
Note that since we have variable-length sequences, we sum to $|\underline{x}_i|$ in the denominators rather than $|\underline{x}_i| - 1$; the extra term counts the final state of each sequence, whose only outgoing transition is to $END$.
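These counts translate directly into code. The sketch below (again illustrative Python/NumPy, with a hypothetical name \texttt{m\_step}) accumulates the weighted counts and normalizes them; passing one-hot columns of $\delta$ recovers the fully observed case above, while soft responsibilities give the M-step of EM described next.
\begin{verbatim}
import numpy as np

def m_step(seqs, delta, S):
    # seqs:  list of n sequences (lists of state indices in 0..S-1)
    # delta: k x n array of responsibilities delta(z, i)
    k, n = delta.shape
    q = delta.sum(axis=1) / n                 # q*(z) = n(z) / n
    init = np.zeros((k, S))                   # weighted counts of initial states
    trans = np.zeros((k, S, S))               # weighted counts of transitions v -> u
    end = np.zeros((k, S))                    # weighted counts of final states
    occ = np.zeros((k, S))                    # weighted counts of all visits to v
    for i, seq in enumerate(seqs):
        w = delta[:, i]
        init[:, seq[0]] += w
        end[:, seq[-1]] += w
        for prev, cur in zip(seq[:-1], seq[1:]):
            trans[:, prev, cur] += w
        for s in seq:
            occ[:, s] += w
    occ = np.where(occ > 0, occ, 1.0)         # guard against never-visited states
    init /= delta.sum(axis=1, keepdims=True)  # normalize t_z(x_1) by n(z)
    trans /= occ[:, :, None]                  # denominator counts every visit to v,
    end /= occ                                #   including the last one (see above)
    return q, init, trans, end
\end{verbatim}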
Now that we have the estimates for the case where we know which Markov chain each sequence was generated from, we can extend this to the case where we only have a probability distribution over the Markov chains that could have generated each sequence. We keep all of our estimates the same, except that we change the way the $\delta(z, i)$ are computed. Rather than being strictly 0 or 1, $\delta(z, i)$ now represents the normalized probability that example $i$ was generated by the $z^{th}$ Markov chain. Assuming that $\underline{x}_i$ has $m$ observations and that the probabilities below are computed from the parameter estimates of the previous iteration, we compute $\delta(z, i)$ as
\begin{eqnarray*}
\delta(z, i) = \frac{q(z) t_z(x_{i1}) t_z(END| x_{im}) \prod_{j=2}^{m} t_z(x_{ij} | x_{i,j-1})}{\sum_{z'} q(z') t_{z'}(x_{i1}) t_{z'}(END | x_{im}) \prod_{j=2}^{m} t_{z'}(x_{ij} | x_{i,j-1})}.
\end{eqnarray*}
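In code, this update is just the per-chain log score of each sequence normalized with a log-sum-exp; a sketch in the same hypothetical Python/NumPy setting:
\begin{verbatim}
import numpy as np

def e_step(seqs, q, init, trans, end):
    # Returns delta as a k x n array; parameters come from the previous M-step.
    k, n = q.shape[0], len(seqs)
    log_delta = np.empty((k, n))
    for i, seq in enumerate(seqs):
        for z in range(k):
            lp = np.log(q[z]) + np.log(init[z, seq[0]]) + np.log(end[z, seq[-1]])
            for prev, cur in zip(seq[:-1], seq[1:]):
                lp += np.log(trans[z, prev, cur])
            log_delta[z, i] = lp
    # normalize each column in log space so that sum_z delta(z, i) = 1
    log_delta -= np.logaddexp.reduce(log_delta, axis=0)
    return np.exp(log_delta)
\end{verbatim}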
We implemented EM for this model in MATLAB without using any existing libraries or code.
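For completeness, here is a minimal driver that alternates the \texttt{e\_step} and \texttt{m\_step} sketches above; the random initialization and fixed iteration count are illustrative choices and need not match our MATLAB implementation.
\begin{verbatim}
import numpy as np

def fit_mixture(seqs, k, S, iters=50, seed=0):
    # Assumes e_step and m_step from the sketches above are in scope.
    rng = np.random.default_rng(seed)
    # random soft assignments initialize the responsibilities (k x n, columns sum to 1)
    delta = rng.dirichlet(np.ones(k), size=len(seqs)).T
    for _ in range(iters):
        q, init, trans, end = m_step(seqs, delta, S)  # M-step
        delta = e_step(seqs, q, init, trans, end)     # E-step
    return q, init, trans, end, delta
\end{verbatim}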