\documentclass[11pt]{article}
\usepackage[margin=1.5cm]{geometry}
\usepackage{amsmath,amsthm,amssymb}
\usepackage{graphicx}
\usepackage{color}
\usepackage{url}
\usepackage{fancyhdr}

\pagestyle{fancy}

\newtheorem{lem}{Lemma}
\newtheorem*{claim}{Claim}
\newcommand{\ul}{\underline}
\newcommand{\dtds}{\frac{\partial \alpha_t}{\partial \alpha_s}}

\title{Learning Garden-Variety Library Interactions}
\author{\small John Lai and Jean Yang\\ \small 6.867 Final Project}
\date{{\small December 4, 2008}}

\begin{document}
\maketitle

\begin{quote}
\centering
\emph{If you have a garden and a library, you have everything you need.} --Cicero
\end{quote}
\section{Introduction}
Researchers have been trying to learn the ``correct'' way of using API libraries. Those working on program verification have tried to separate correct uses from incorrect uses in order to help programmers write more bug-free programs. Ammons et al.\ have shown that it is possible to infer specifications about interactions with APIs by examining program traces~\cite{Mining}. Those working on code synthesis have used the sequences of possible API calls as a search space in which to generate correct code interacting with APIs. Mandelin et al.\ have worked on automatically generating call sequences that interact with APIs to achieve desired goals~\cite{Jungloid}.

These previous approaches have not specifically addressed the fact that the correct way of interacting with a particular API may consist of several distinct modes of interaction. To consider a simple example, in C++ \texttt{malloc} is paired with \texttt{free}, \texttt{new} is paired with \texttt{delete}, and \texttt{new[]} is paired with \texttt{delete[]}. Trying to call \texttt{delete[]} on an object allocated with \texttt{new} is incorrect. The ideal program specification would take these different modes of interaction into account. Similarly, knowing the likelihood of different sequences of API calls would reduce a code synthesizer's search space and likely make code synthesis more effective.

The goal of our project was to determine whether we can apply machine learning techniques to learning modes of interaction with APIs and abstract data structures. We wanted to 1) see how we can map interactions with API libraries to separate clusters and 2) see whether we could learn something from what these clusters look like. The main decisions in our experimental setup concerned how to extract meaningful training data from API interactions, how to represent API information as useful feature vectors, and which classification techniques are most appropriate. For this project, we
\begin{enumerate}
\item generated data from program traces of interactions of open source Java programs with selected Java standard libraries and
\item investigated classification using a mixture of $k$ Markov chains.
\end{enumerate}
We discuss our experimental setup in Section~\ref{setup}, our implementation of $k$-Markov chains in Section~\ref{markov}, and our results in Section~\ref{results}. We describe how our clusters revealed general usage modes for the \texttt{ArrayList} library.
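For context, a standard way to formulate a mixture of $k$ Markov chains over method-call traces is sketched below; the notation here is generic and illustrative, and may differ from the formulation in Section~\ref{markov}.

```latex
% Illustrative formulation. A trace s_1, ..., s_T of method calls is generated
% by first choosing a chain (usage mode) z with probability \pi_z, then
% following that chain's initial-state and transition distributions:
\begin{equation*}
  P(s_1, \ldots, s_T) = \sum_{z=1}^{k} \pi_z \, P_z(s_1) \prod_{t=2}^{T} P_z(s_t \mid s_{t-1})
\end{equation*}
```

Here $\pi_z$ is the mixing weight of chain $z$, $P_z(s_1)$ its initial-state distribution over library methods, and $P_z(s_t \mid s_{t-1})$ its transition distribution. Parameters of such a mixture are typically fit with EM, and a trace is assigned to the chain with the highest posterior responsibility.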
\section{Experimental setup}
\label{setup}
Though Java and C++ both have sufficiently ``interesting'' libraries and large code bases, we favored Java over C++ because 1) Java code is more portable, since it compiles to bytecode rather than machine code and is dynamically linked, 2) C++ templates introduce code-copying issues, and 3) for historical reasons, there is more open source Java code that uses the Java standard libraries than there is open source C++ code that uses the STL\footnote{Many open source C++ projects use their own versions of abstract data structure libraries because 1) the libraries were not standardized across compilers and 2) developers had efficiency concerns about the STL.}. In this section we describe how we implemented our profiler and constructed traces.
\subsection{Instrumenting Java code to produce traces}
We constructed runtime profiles of the code we ran in order to build the method traces (rather than adding logging code to the standard libraries themselves) because 1) this approach is more general, giving us the freedom to construct learning data from any Java \texttt{jar} file, and 2) it let us choose which libraries to classify after seeing which ones were actually used in the code.
We did this by writing a Java profiler that inserts logging instructions into the Java bytecode wherever there is a call to a standard library method of interest. Because Java's standard libraries are loaded by the bootstrap classloader rather than the system classloader, however, directly instrumenting the standard library classes is problematic. Instead, we instrument the classes loaded by the system classloader and, from code in those classes, log calls to standard library objects\footnote{Because of these unforeseen issues involving Java's idiosyncrasies, this process required far more work than initially anticipated. We did, however, emerge victorious: we can now construct a profile of any executable Java JAR file.}.

Our method for obtaining traces uses the Java runtime's \texttt{javaagent} support for attaching a profiler to a piece of code. We constructed our traces by modifying the source code of the Java Interactive Profiler~\cite{JIP}, a profiler written in Java and built on top of the ASM bytecode engineering library~\cite{ASM}. Our program inserts bytecode instructions into the profiled code that invoke our profiler functions whenever a standard library function is called. We initially tracked sequences of calls to given libraries, but we realized that this information is not as useful as logging calls to specific instances of classes. To do this, we added instrumentation code that inspects the runtime stack to recover the object being profiled. We take the hash value of each object in order to distinguish calls to different instances of the same class.
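As a small, self-contained illustration of this last step (hypothetical code, not our actual instrumentation, which injects the equivalent logic as bytecode): keying the trace log on each object's identity hash separates calls made to different instances of the same class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a per-instance trace log keyed by identity hash,
// mimicking the records our injected bytecode emits at each library call.
public class TraceLog {
    private static final Map<Integer, List<String>> traces = new HashMap<>();

    // Conceptually invoked by instrumented code before each library call.
    public static void log(Object receiver, String method) {
        int id = System.identityHashCode(receiver);
        traces.computeIfAbsent(id, k -> new ArrayList<>()).add(method);
    }

    public static List<String> traceOf(Object receiver) {
        return traces.get(System.identityHashCode(receiver));
    }

    public static void main(String[] args) {
        List<String> a = new ArrayList<>();
        List<String> b = new ArrayList<>();
        log(a, "add"); log(a, "add"); log(a, "get");
        log(b, "add"); log(b, "remove");
        // Two instances of the same class yield two distinct traces.
        System.out.println(traceOf(a)); // [add, add, get]
        System.out.println(traceOf(b)); // [add, remove]
    }
}
```

(Identity hashes are not guaranteed unique, but collisions are rare enough not to matter for profiling purposes.)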
\subsection{Building traces}
We initially downloaded an assortment of programs, many with interactive interfaces, but we quickly realized that we would not be able to generate enough learning data by interacting with these programs by hand. We chose instead to focus on programs that 1) could take many inputs and therefore provide enough data, 2) shared enough common libraries with the other programs we found, and 3) did not load libraries that caused our profiler to crash. We were also limited by the fact that the profiler introduced significant overhead.

We generated traces on the following programs:
\begin{enumerate}
\item \textbf{Java Weather}, a library for parsing METAR-formatted weather data~\cite{Jweather}. Our test program polled the San Francisco weather satellite for data at half-second intervals and parsed it.
\item \textbf{Lucene}, a text indexing and search library. We generated data by downloading a webpage, pulling a link from it, and using each resulting page to pull the next link. We ran each page through the Lucene index builder.
\item \textbf{Mallet}, a machine learning toolkit for language processing. We generated training data in a similar way to the Lucene data, then ran code to import the data, train on it, and evaluate the results.
\end{enumerate}
We constructed about 100 profiles that contained object-level information for each of these programs. We focused on the \texttt{ArrayList} library because it was the most extensively used in these programs.
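For concreteness, per-object traces like those described above can be turned into the integer symbol sequences that a Markov-chain model consumes. The encoding below is a hypothetical sketch; the method names and trace format are illustrative, not our actual on-disk format.

```java
import java.util.*;

// Hypothetical preprocessing step: map each per-object method trace to a
// sequence of integer symbols, assigning fresh symbol ids to unseen names.
public class TraceEncoder {
    public static int[] encode(List<String> trace, Map<String, Integer> alphabet) {
        int[] symbols = new int[trace.size()];
        for (int i = 0; i < trace.size(); i++) {
            // Unseen method names get the next unused symbol id.
            symbols[i] = alphabet.computeIfAbsent(trace.get(i), m -> alphabet.size());
        }
        return symbols;
    }

    public static void main(String[] args) {
        Map<String, Integer> alphabet = new HashMap<>();
        int[] s = encode(List.of("add", "add", "get", "iterator"), alphabet);
        System.out.println(Arrays.toString(s)); // [0, 0, 1, 2]
    }
}
```

Sharing one alphabet across all traces for a library keeps the symbol space consistent, so transition matrices learned from different programs are comparable.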
\input{markov}

\section{Results}
\label{results}

\input{results}
\input{results_desc}
\input{conclusion}

\bibliographystyle{plain}
\bibliography{fpbib}
\end{document}