\documentclass[11pt]{article}
\usepackage[margin=2cm]{geometry}
\usepackage{amsmath,amsthm,amssymb}
\usepackage{graphicx}
\usepackage{color}
\usepackage{url}

\newtheorem{lem}{Lemma}
\newtheorem*{claim}{Claim}
\newcommand{\ul}{\underline}
\newcommand{\dtds}{\frac{\partial \alpha_t}{\partial \alpha_s}}

\title{6.867 Final Project\\
Learning Garden-Variety Library Interactions}
\author{John Lai and Jean Yang}
\date{December 4, 2008}

\begin{document}
\maketitle

\begin{abstract}
Knowing the appropriate ways of interacting with an API library is useful for program verification and code synthesis. In this paper we describe our experiences collecting data and learning with mixtures of $k$-Markov chains for the Java standard libraries. We describe our results in depth for the \texttt{ArrayList} API.
\end{abstract}

\begin{quote}
If you have a garden and a library, you have everything you need. --Cicero
\end{quote}

\section{Introduction}
Researchers have tried to learn the ``correct'' ways of using API libraries for different purposes. Those working on program verification have tried to separate correct uses from incorrect ones in order to help programmers write programs with fewer bugs. Ammons et al.\ have shown that it is possible to infer specifications of API interactions from program traces~\cite{Mining}. Those working on code synthesis have used the sequences of possible API calls as a search space in which to generate correct code that interacts with APIs. Mandelin et al.\ have previously addressed the problem of generating code that interacts with APIs~\cite{Jungloid}: they allow the programmer to specify the type of the desired API interaction and find an appropriate function-call sequence to achieve that goal.

These previous approaches have not specifically addressed the fact that the correct way of interacting with a particular API may consist of several distinct modes of interaction. To consider a simple example, in C++ \texttt{malloc} is paired with \texttt{free}, \texttt{new} is paired with \texttt{delete}, and \texttt{new[]} is paired with \texttt{delete[]}. Calling \texttt{delete[]} on an object allocated with \texttt{new} is incorrect.
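
In our Java setting, an analogous (hypothetical) illustration: two clients of \texttt{ArrayList} whose call sequences fall into distinct modes. The method names below are ours, for illustration only:
\begin{verbatim}
import java.util.ArrayList;
import java.util.Iterator;

public class Modes {
    // Mode 1: build, then random access -- add, size, get.
    static int sumByIndex(ArrayList<Integer> xs) {
        int total = 0;
        for (int i = 0; i < xs.size(); i++) total += xs.get(i);
        return total;
    }

    // Mode 2: iterate and filter -- iterator, hasNext, next, remove.
    static void dropNegatives(ArrayList<Integer> xs) {
        Iterator<Integer> it = xs.iterator();
        while (it.hasNext()) {
            if (it.next() < 0) it.remove();
        }
    }
}
\end{verbatim}
A single model that conflates these two modes would blur exactly the kind of pairing constraints that the \texttt{new}/\texttt{delete} example illustrates.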

The ideal program specification would take these different modes of interacting with the data into account. Similarly, knowing the likelihood of different sequences of API calls would shrink a code synthesizer's search space and likely make synthesis more effective. If we can separate the common use cases from the edge cases, we may also be able to build more effective program verifiers and code synthesizers.

\subsection{Project goals}
The goal of our project was to determine whether we can apply machine learning techniques to learn modes of interaction with APIs and abstract data structures. That is, given a data structure (e.g., a Java \texttt{Vector}), we wanted to apply machine learning techniques to 1) see how we can map interactions with API libraries to separate clusters and 2) see whether we could learn something from what these clusters look like.

The main decisions in our experimental setup concerned how to extract meaningful training data from API interactions, how to represent API information as useful feature vectors, and which classification techniques were most appropriate. For this project, we
\begin{enumerate}
\item generated data from program traces of interactions of open source Java programs with selected Java standard libraries and
\item investigated classification techniques using $k$-Markov chains, sketched briefly below.
\end{enumerate}
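
As a rough sketch of the model (it is defined precisely in Section~\ref{markov}): a $k$-Markov chain over a sequence of API calls $x_1, \ldots, x_n$ factors its probability as
\[
P(x_1, \ldots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_{t-k}, \ldots, x_{t-1}),
\]
so each call depends only on the $k$ calls before it (with the conditioning window truncated at the start of the sequence), and a mixture of $m$ such chains with weights $\pi_1, \ldots, \pi_m$ assigns $P(x) = \sum_{j=1}^{m} \pi_j P_j(x)$, with each mixture component intended to capture one mode of interaction.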

We discuss our experimental setup in Section~\ref{setup}, our implementation of $k$-Markov chains in Section~\ref{markov}, and our results in Section~\ref{results}.

\subsection{Applications}
Given that we know there are $n$ common use cases of an API, we could generate a very useful set of example use cases for the programmer. We could also use this information in code synthesis: given that the programmer wants to write a program using the API in a given circumstance, can we use a similar example to guide the synthesis? This information is also useful in test generation: if we are using data structure $D$ and know that code tends to use this data structure in $n$ common ways, we might be able to predict interactions with the code we write and make sure our code is robust to those cases. The question of separating the common cases from the edge cases is especially relevant to code synthesis: if the synthesizer has some mechanism for identifying the common cases, it can handle the edge cases differently (e.g., by isolating them and asking for more programmer input).

\section{Experimental setup}
\label{setup}

In this section we describe why and how we generated traces for code interacting with the Java standard libraries.

\subsection{On the choice of Java standard libraries}
We chose Java for both utility and convenience. Both the Java standard libraries and the C++ Standard Template Library provide abstract data structures. We favored Java over C++ because 1) Java code is more portable, since it compiles to bytecode rather than machine code and is dynamically linked, 2) there are code-copying issues involved with C++ templates, and 3) for historical reasons, there is more open source Java code that uses the Java standard libraries than there is open source C++ code that uses the STL\footnote{Much open source C++ software uses its own abstract data structure libraries because 1) the libraries were not standardized across compilers and 2) developers have had efficiency concerns about the STL.}.

\subsection{Instrumenting Java code to produce traces}
We constructed runtime profiles of the code we ran in order to produce the method traces (rather than adding logging code to the standard libraries themselves) because 1) this approach is more general, giving us the freedom to construct learning data from any Java \texttt{jar} file, and 2) it let us choose which libraries to classify after seeing which ones were actually used in the code.

We did this by writing a Java profiler that inserts logging instructions into the Java bytecode wherever there is a call to a standard library method of interest. Before we describe the method, we should note that its apparent hackishness is due to Java's idiosyncrasies. Java's standard libraries are loaded by the bootstrap classloader rather than the system classloader, which makes it difficult to directly instrument, or transform the code of, the standard library classes in order to profile them. Because of this, we instrument other classes loaded by the system classloader and, from code in those classes, log calls to standard library objects\footnote{Because of these unforeseen issues, this process required far more work than initially anticipated. We did, however, emerge victorious: we can now construct a profile of any Java JAR executable.}.
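
As a small illustration of the classloader issue (this check is not part of our profiler), standard library classes report a \texttt{null} classloader, i.e.\ the bootstrap loader; this is the distinction our transformer uses to decide which classes to leave alone:
\begin{verbatim}
public class LoaderCheck {
    public static void main(String[] args) {
        // Prints "null": ArrayList is loaded by the bootstrap loader.
        System.out.println(java.util.ArrayList.class.getClassLoader());
        // Prints the system (application) classloader for our own classes.
        System.out.println(LoaderCheck.class.getClassLoader());
    }
}
\end{verbatim}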

Our method for obtaining traces uses the JVM's \texttt{-javaagent} support for attaching a profiler to a piece of code. The agent provides an interceptor in front of the code's main method, executed in the same Java Virtual Machine and by the same classloader; this allows the programmer to instrument Java code easily. We constructed our traces by modifying the source code of the Java Interactive Profiler~\cite{JIP}, a Java profiler written in Java and built on top of the ASM bytecode engineering library~\cite{ASM}. Our program inserts bytecode instructions into the profiled code that call our profiler functions on calls to standard library functions. We initially tracked sequences of calls to given libraries, but we realized that this information is not as useful as logging calls to specific instances of classes. To do this, we added instrumentation code that inspects the runtime stack to get at the object we are profiling. We take the hash value of each object in order to distinguish calls to different instances of the same class.
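
A minimal sketch of the agent's shape (hypothetical names; the actual ASM rewriting pass is elided):
\begin{verbatim}
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class ProfilerAgent {
    // Entry point named by the -javaagent flag; runs before main().
    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            public byte[] transform(ClassLoader loader, String name,
                    Class<?> cls, ProtectionDomain pd, byte[] bytes) {
                // Leave bootstrap-loaded (standard library) classes
                // alone; we instrument their callers instead.
                if (loader == null) return null;
                // Real version: an ASM pass inserts a call to log()
                // before each standard library call of interest.
                return null; // null means "class unchanged"
            }
        });
    }

    // Called from injected bytecode; identityHashCode distinguishes
    // different instances of the same class in the trace.
    public static void log(Object receiver, String method) {
        System.out.println(System.identityHashCode(receiver) + "\t"
                + receiver.getClass().getName() + "." + method);
    }
}
\end{verbatim}
The agent is attached with \texttt{java -javaagent:profiler.jar -jar program.jar}, where the agent jar's manifest names \texttt{ProfilerAgent} in its \texttt{Premain-Class} entry.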

\subsection{Building traces}
We initially downloaded an assortment of programs, many of which had interactive interfaces, but we quickly realized that we would not be able to generate enough learning data by interacting with these programs by hand. We chose instead to focus on programs that 1) could take many inputs and therefore could provide enough data, 2) shared enough common libraries with the other programs we found, and 3) did not load libraries that caused our profiler to crash. We were also somewhat limited by the significant overhead our profiling of the standard library calls imposed.

We generated traces for the following programs:
\begin{enumerate}
\item \textbf{Java Weather}, a library for parsing METAR-formatted weather data~\cite{Jweather}. Our test program polled the San Francisco weather station for data at half-second intervals and parsed it.
\item \textbf{Lucene}, a text search and indexing library. We generated data by downloading the New York Times webpage, pulling a link, and using each page to pull the next link. We ran each resulting page through the Lucene index builder.
\item \textbf{Mallet}, a machine learning toolkit for natural language processing. We generated training data in a similar way to the Lucene data, then ran code to import the data, train on it, and evaluate the results.
\end{enumerate}
We constructed about 100 profiles that contained object-level information for each of these programs.

We focused on the \texttt{ArrayList} library because it was the most extensively used. Other common libraries we considered include \texttt{PriorityQueue}, \texttt{HashSet}, and \texttt{HashMap}.

\input{markov}

\section{Results}
\label{results}

\input{results_desc}

\input{results}
\bibliographystyle{plain}
\bibliography{fpbib}
\end{document}