From dbaf7a97c9bea1d4a7e02d661758921c02836ca3 Mon Sep 17 00:00:00 2001
From: Jean Yang
Date: Thu, 4 Dec 2008 16:35:47 -0500
Subject: [PATCH] update

---
 writeup/results_desc.tex | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/writeup/results_desc.tex b/writeup/results_desc.tex
index d2ca884..cdf50bc 100644
--- a/writeup/results_desc.tex
+++ b/writeup/results_desc.tex
@@ -16,7 +16,8 @@ From these clusters we can infer that \texttt{get}, \texttt{add}, and \texttt{si
 For $k=6$ we get similar results, with \texttt{iterator} and \texttt{toArray} continuing to be separated, one cluster of just \texttt{size} and \texttt{get}, and other clusters combining the usual functionalities. We may also be able to separate read-only cases from read-write cases. Though we performed this inference by examining our data, we could potentially automate this process by looking at calls not common across clusters and how they are used, perhaps with more processed data (e.g., with multiple equivalent calls compressed into one).
-Another set of observations we made had to do with examining the transition probabilities in the learned Markov chains. These chains consist of the initial state probabilities and transition probabilities, so they are a little unwieldy to analyze. However, we did make some anecdotal observations about the chains. One interesting point has to do with what kinds of sequences chains can capture. For instance, for one learned chain, we see that the initial probabilities are mostly spread between \texttt{get} (0.50 prior) and \texttt{add} (0.41 prior). Examining the transition probabilities, we see that (\texttt{get -> end}: 0.98), (\texttt{size -> toArray}: 1.0), (\texttt{toArray -> add}: 0.14, \texttt{toArray -> size}: 0.8524). What this suggests is that the generated sequence will look very different depending on the initial state chosen. If get is chosen, then we are very likely to terminate immediately. If size is chosen, then we will likely bounce between size and toArray. We bring this observation up because it reveals a possible deficiency in our model. Unlike GMMs which have a single mean, for Markov chains, a single chain can actually contain multiple clusters (in the descriptive sense) if the clusters have disjoint states, as seen in our example. Because the initial state probabilities are more or less a coin flip between get and size, this one cluster actually captures two distinct interactions. Because of this, it is possible that our chains over-cluster, and we need some sort of post-processing to really extract the exact modes of interaction. One possibility is to break the constituent chains further into separate chains with deterministic start states if we see that the initial probabilities are split equally.
+In observing the initial and transition probabilities, we noted that a mixture of Markov chains will not necessarily place sequences in separate clusters even when they use disjoint states. Unlike GMM components, each of which has a single mean, a single Markov chain can contain multiple (semantic) clusters if those clusters have disjoint states.
+ For instance, for one chain we learned as part of a mixture, the initial probabilities are mostly spread between \texttt{get} (0.50 prior) and \texttt{add} (0.41 prior). Examining the transition probabilities, we see that (\texttt{get -> end}: 0.98), (\texttt{size -> toArray}: 1.0), and (\texttt{toArray -> add}: 0.14, \texttt{toArray -> size}: 0.8524). This suggests that a generated sequence will look very different depending on its initial state: if it is \texttt{get}, we are very likely to terminate immediately; if it is \texttt{size}, we will likely bounce between \texttt{size} and \texttt{toArray}. Because the initial state probabilities are more or less a coin flip between \texttt{get} and \texttt{size}, this one cluster actually captures two distinct interactions. It is therefore possible that our chains cluster together behaviors that should be kept apart, and that we need some post-processing to extract the exact modes of interaction. One possibility, sketched below, is to break a constituent chain into separate chains with deterministic start states when its initial probabilities are split roughly equally.
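
To make the two modes concrete, here is a minimal Python sketch (not our learning code) that generates sequences from the chain described above. Only the quoted probabilities come from the learned chain; the residual probability masses and the \texttt{add} row are assumptions added so each distribution sums to one.

    import random

    # Probabilities quoted in the text; residual masses are assumptions.
    INITIAL = {"get": 0.50, "add": 0.41, "size": 0.09}   # 0.09 on size: assumed
    TRANSITIONS = {
        "get":     {"end": 0.98, "add": 0.02},           # 0.02 residual: assumed
        "add":     {"end": 1.0},                         # not reported: assumed
        "size":    {"toArray": 1.0},
        "toArray": {"add": 0.14, "size": 0.8524, "end": 0.0076},
    }

    def draw(dist):
        """Sample one state from a {state: probability} distribution."""
        states = list(dist)
        return random.choices(states, weights=[dist[s] for s in states])[0]

    def sample_sequence(max_len=20):
        """Generate one call sequence, stopping at the 'end' state."""
        seq = [draw(INITIAL)]
        while len(seq) < max_len:
            nxt = draw(TRANSITIONS[seq[-1]])
            if nxt == "end":
                break
            seq.append(nxt)
        return seq

    for _ in range(5):
        print(sample_sequence())
    # Sequences starting with get stop almost immediately, while sequences
    # starting with size alternate between size and toArray: one chain,
    # two disjoint modes of interaction.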
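
The proposed post-processing step might look something like the following sketch. The mass threshold and the representation of a chain as an (initial, transitions) pair are illustrative assumptions, not something we have implemented or tuned.

    def split_chain(initial, transitions, threshold=0.25):
        """If two or more start states carry substantial initial mass,
        replace the chain with one deterministic-start chain per such
        state. The transition matrix is shared; when the modes use
        disjoint states, the unreachable rows are simply never used."""
        heavy = [s for s, p in initial.items() if p >= threshold]
        if len(heavy) < 2:
            return [(initial, transitions)]   # nothing to split
        return [({s: 1.0}, transitions) for s in heavy]

    # The example chain splits in two, since both get and add carry
    # well over 25% of the initial mass.
    chains = split_chain({"get": 0.50, "add": 0.41, "size": 0.09},
                         {"get": {"end": 0.98, "add": 0.02}})
    print(len(chains))   # 2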
 Though we had hoped it would not matter, the length of a sequence seems to correlate with its classification. For $k = 2, 4, 5$, the largest cluster is associated with the shortest average sequence length and the smallest cluster with the longest average length. (We could have tried to factor this out in our learning algorithm, but we did not think it would have a large enough effect to be worth the added complexity.)
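
Checking this correlation is cheap. A diagnostic along the following lines, assuming sequences are available as lists of call names grouped by cluster label (a hypothetical data shape, not our actual pipeline), would have surfaced it automatically:

    from statistics import mean

    def length_by_cluster(clustered):
        """Given {cluster_label: [sequence, ...]}, report each cluster's
        size and mean sequence length, sorted by cluster size."""
        rows = [(label, len(seqs), mean(len(s) for s in seqs))
                for label, seqs in clustered.items()]
        return sorted(rows, key=lambda r: r[1], reverse=True)

    # Hypothetical data: the largest cluster has the shortest sequences.
    demo = {
        "A": [["get"]] * 50,
        "B": [["add", "add", "get"]] * 30,
        "C": [["size", "toArray"] * 4] * 5,
    }
    for label, n, avg in length_by_cluster(demo):
        print(f"cluster {label}: {n} sequences, mean length {avg:.1f}")

-- 
2.11.4.GIT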