\subsection{The Number of Underlying Markov Chains}
One topic of interest was the impact of the number of underlying Markov chains used in our model. With mixture models, it is often difficult to know how many mixture components to choose. To investigate this, we split our data into 90\% training and 10\% test ($n$-fold cross-validation would have been preferable had time permitted). We then ran our learning algorithm for several values of $k$, recording the training and test log likelihoods. Because the learning algorithm is sensitive to its initial starting point, for each $k$ we kept the model with the best training likelihood over five random restarts.
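The best-of-restarts selection described above can be sketched as follows. This is a minimal illustration, not our actual EM code; \texttt{toy\_fit} is a hypothetical stand-in for the real fitting routine, which would return a fitted model and its training log likelihood.

```python
import random

def best_of_restarts(fit, n_restarts=5, seed=0):
    """Run `fit(rng)` n_restarts times and keep the run with the
    highest training log likelihood. `fit` returns (model, loglik)."""
    rng = random.Random(seed)
    best_model, best_ll = None, float("-inf")
    for _ in range(n_restarts):
        model, ll = fit(rng)
        if ll > best_ll:
            best_model, best_ll = model, ll
    return best_model, best_ll

# Toy stand-in for an EM fit: the "model" records its random init,
# and the log likelihood is a simple deterministic function of it.
def toy_fit(rng):
    x = rng.random()
    return {"init": x}, -10000.0 * (1.0 + x)

model, ll = best_of_restarts(toy_fit, n_restarts=5, seed=42)
```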
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
$k$ & training log likelihood & test log likelihood & number of parameters & BIC score\\
\hline
1 & -15749 & -1578 & 32 & -16010 \\
2 & -12090 & -1295 & 64.5 & -12616 \\
3 & -10795 & -1182 & 97 & -11588 \\
4 & -10112 & -1078 & 129.5 & -11169 \\
5 & -9581 & -1026 & 162 & -10904 \\
6 & -9268 & -991 & 194.5 & -10857 \\
7 & -9228 & -990 & 227 & -11081 \\
\hline
\end{tabular}
\end{center}
As expected, the training log likelihood improves monotonically with $k$, while the gains in test log likelihood flatten out around $k=6$. The BIC score, which penalizes the training likelihood by model complexity, is maximized at $k=6$, suggesting that six underlying chains offer the best trade-off between fit and parsimony on this data.
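For reference, the BIC column in the table is consistent with the penalized training likelihood (assuming $n$ is the total number of observed events in the training set, which for our data gives $\ln n \approx 16.3$):

\[
\mathrm{BIC}(k) = \ell_{\mathrm{train}}(k) - \frac{p_k}{2} \ln n,
\]

where $\ell_{\mathrm{train}}(k)$ is the training log likelihood and $p_k$ is the number of free parameters. For example, at $k=1$: $-15749 - \frac{32}{2}(16.3) \approx -16010$.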
\subsection{Relationship between Programs and Clusters}
Another question of interest was whether the traces from a particular program correspond tightly to a given cluster. For the special case where $k$ equals the number of programs traced, this is equivalent to asking whether the unsupervised learner recovered the $k$ classes the traces originated from (a multi-class classification problem). To examine this, we determined the best cluster for each training sample (the cluster with the highest $p(z \mid x)$) and grouped these assignments by program. If programs tend to belong to specific clusters, then the fraction of a program's traces in its largest cluster should exceed the prior probability of that cluster. The reasoning is that, given no information about a trace, we would assign it to a cluster according to the prior, so some clusters are expected to hold a larger share of traces regardless of program. If the fraction in a program's largest cluster exceeds the prior, this suggests some underlying structure binding the program to that cluster.
\begin{table}
\centering
\begin{tabular}{|c|c|c|c|c|}
\hline
program & best cluster & number in best cluster & fraction in best cluster & prior of best cluster\\
\hline
mallet-import-data & 2 & 35 & 0.80 & 0.90 \\
mallet-train-profile & 2 & 28 & 0.97 & 0.90 \\
jweather & 2 & 2564 & 0.94 & 0.90 \\
mallet-evaluate-profile & 2 & 27 & 0.93 & 0.90 \\
profile-demo & 2 & 577 & 0.84 & 0.90 \\
\hline
\end{tabular}
\caption{Best clusters for $k=2$}
\end{table}
\begin{table}
\centering
\begin{tabular}{|c|c|c|c|c|}
\hline
program & best cluster & number in best cluster & fraction in best cluster & prior of best cluster\\
\hline
mallet-import-data & 2 & 29 & 0.66 & 0.43 \\
mallet-train-profile & 2 & 23 & 0.79 & 0.43 \\
jweather & 2 & 2203 & 0.80 & 0.43 \\
mallet-evaluate-profile & 2 & 23 & 0.79 & 0.43 \\
profile-demo & 2 & 232 & 0.34 & 0.43 \\
\hline
\end{tabular}
\caption{Best clusters for $k=4$}
\end{table}
\begin{table}
\centering
\begin{tabular}{|c|c|c|c|c|}
\hline
program & best cluster & number in best cluster & fraction in best cluster & prior of best cluster\\
\hline
mallet-import-data & 5 & 28 & 0.64 & 0.47 \\
mallet-train-profile & 5 & 23 & 0.79 & 0.47 \\
jweather & 5 & 2212 & 0.81 & 0.47 \\
mallet-evaluate-profile & 5 & 23 & 0.79 & 0.47 \\
profile-demo & 5 & 342 & 0.50 & 0.47 \\
\hline
\end{tabular}
\caption{Best clusters for $k=6$}
\end{table}
Based on these observations, a single dominant cluster accounted for most samples from every program. The identity of the program was thus not a strong determinant of the type of trace; rather, there was a single dominant way of interacting with the API that was consistent across programs. This held true across different choices of $k$, suggesting that additional mixture components mainly refine performance on rarer edge cases.
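The per-program tabulation described above can be sketched as follows. This is an illustrative toy, not our analysis code: the program names and posterior vectors are made up, and the cluster priors are estimated empirically from the assignments rather than read off the model's mixing weights.

```python
from collections import Counter, defaultdict

def best_cluster_stats(assignments):
    """assignments: list of (program, posterior-vector p(z|x)) pairs.
    Returns ({program: (best_cluster, count, fraction)}, {cluster: prior}),
    with priors estimated as the overall share of traces per cluster."""
    by_program = defaultdict(Counter)
    overall = Counter()
    for program, post in assignments:
        z = max(range(len(post)), key=lambda j: post[j])  # argmax_z p(z|x)
        by_program[program][z] += 1
        overall[z] += 1
    total = sum(overall.values())
    priors = {z: c / total for z, c in overall.items()}
    stats = {}
    for program, counts in by_program.items():
        z, n = counts.most_common(1)[0]
        stats[program] = (z, n, n / sum(counts.values()))
    return stats, priors

# Toy example: two hypothetical programs, two clusters.
data = [("progA", [0.9, 0.1]), ("progA", [0.8, 0.2]), ("progA", [0.3, 0.7]),
        ("progB", [0.2, 0.8]), ("progB", [0.4, 0.6])]
stats, priors = best_cluster_stats(data)
```

A program whose largest-cluster fraction exceeds that cluster's prior is the signal of interest here.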
\subsection{Analysis of the Learned Model}
% Examine the most likely transitions
% Generate 5 sample traces from each cluster (fix k = 5)
% Generate the 0 probability transitions
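The planned per-cluster sampling could be sketched as follows. The mixing weights and transition matrices here are illustrative placeholders, not the learned parameters: a trace is generated by drawing a cluster from the mixing weights and then walking that cluster's Markov chain.

```python
import random

def sample_trace(prior, trans, start=0, length=6, rng=None):
    """Sample one trace from a mixture of Markov chains: draw a cluster z
    from the mixing weights `prior`, then take length-1 steps using that
    cluster's transition matrix trans[z]."""
    rng = rng or random.Random()
    z = rng.choices(range(len(prior)), weights=prior)[0]
    state, trace = start, [start]
    for _ in range(length - 1):
        row = trans[z][state]
        state = rng.choices(range(len(row)), weights=row)[0]
        trace.append(state)
    return z, trace

# Illustrative 2-cluster, 2-state model (not the learned one).
prior = [0.7, 0.3]
trans = [
    [[0.9, 0.1], [0.5, 0.5]],   # cluster 0: sticky in state 0
    [[0.1, 0.9], [0.1, 0.9]],   # cluster 1: prefers state 1
]
z, trace = sample_trace(prior, trans, rng=random.Random(1))
```

Entries of 0 in a transition row correspond to the zero-probability transitions noted above: those transitions can never appear in a sampled trace from that cluster.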