<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage Express 2.0">
<title>Benchmarking Guide</title>

<body bgcolor="#FFFFFF">

<h2 align="left">Benchmarking Guide</h2>
<p>Micro-benchmarks are notoriously inaccurate, in any system.
Here are some guidelines you should read carefully before trying
to construct an accurate benchmark in the Strongtalk system. This
is very important because there is one big 'gotcha' associated
with running benchmarks from a "do it" in Strongtalk: </p>
<ul>
    <li><strong>Put your benchmark in a real method</strong>. As
        mentioned in the tour, to get compiled performance
        results in Strongtalk, the primary computation (the code
        where your benchmark is spending most of its time) needs
        to be in an actual method, not in a "do it"
        from a workspace. This is because the current version of
        the VM doesn't use the optimized method until the <em>next</em>
        time that it is called after compilation, and a "do it"
        method by definition is never called more than once. (In
        a real program or normal "do it", this effect is never an
        issue; only micro-benchmarks have loops that iterate
        zillions of times with the loop itself in the "do it".)
        This is not a fundamental limitation in the technology,
        but we hadn't implemented "on-stack replacement" in the
        Smalltalk system at the time of release (we did implement
        it for Java). <p>Note that this does <em>not</em> mean
        that the code that your "do it" invokes won't be
        optimized and used the first time around; it will. But
        the big performance gains for micro-benchmarks come from
        inlining <em>all</em> the called methods directly into
        the performance-critical benchmark loop, and if that loop
        is literally in the "do it", that isn't possible.</p>
        <p>A good way to run your benchmark is to create a method
        in the Test class (which is there for this kind of thing)
        that runs for at least 100 milliseconds, and then call
        that method a number of times until it becomes optimized.
        The Test>benchmark: method will do this for you, and
        report the fastest time. To tell if your code is running
        enough, a good rule of thumb is that if your method
        doesn't get faster and then stabilize at some speed, then
        it's not being run enough. (See the sketch after this
        list.)</p>
    </li>
    <li><strong>Know how to choose a benchmark.</strong> Micro-benchmarks
        are notorious for producing misleading results in all
        systems, which is why all real benchmarks are bigger
        programs that as much as possible use the same code on
        both systems. If you insist on writing a micro-benchmark,
        keep these issues in mind:
<ul>
            <li><strong>Your code should spend its time in
                Smalltalk</strong>, not down in rarely-used
                system primitives or C-callouts. For example,
                'factorial' spends almost all of its time in the
                LargeInteger multiplication primitive, not in
                Smalltalk code.</li>
            <li><strong>Use library methods that are commonly
                used in real performance-critical code.</strong> Take
                factorial as an example: when is the last time
                your program was performance bound on
                LargeInteger multiplication?</li>
            <li><strong>Use code that is like normal Smalltalk
                code (use of core data structures, allocation,
                message sending in a normal pattern, instance
                variable access, blocks).</strong> This is the
                biggest reason most micro-benchmarks aren't
                accurate. Real code is broken up into many
                methods, with lots of message sends, instance
                variable reads, boolean operations, SmallInteger
                operations, temporary allocations, and Array
                accesses, all mixed together. These are the
                things that Strongtalk is designed to optimize.</li>
            <li><strong>Use the same code and input data on both
                systems.</strong> Running a highly
                implementation-dependent operation like
                "compile all methods" is not a good
                benchmark because the set of methods is totally
                different, and the bytecode compilers are
                implemented completely differently. (Also, the
                bytecode compiler is not a performance-critical
                routine in applications, so it has not been tuned
                at all in Strongtalk. When was the last time your
                users were twiddling their thumbs waiting for the
                bytecode compiler?)</li>
        </ul>
    </li>
</ul>
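<p>To make the advice above concrete, here is a rough sketch of
what such a benchmark method and its invocation might look like.
The method name (benchmarkMix) and its body are purely
illustrative, and the exact interface of the Test>benchmark:
method (class versus instance side, block versus selector
argument) should be checked in the Test class in the image; a
block argument is assumed below.</p>

<pre>
"An illustrative micro-benchmark method, added to class Test with
 a browser. It mixes message sends, SmallInteger arithmetic, Array
 access and block evaluation, the kind of ordinary Smalltalk code
 Strongtalk is designed to optimize. Scale the outer loop count so
 that one call takes at least about 100 milliseconds."
benchmarkMix
    | a sum |
    a := Array new: 100.
    1 to: 100 do: [:i | a at: i put: i].
    sum := 0.
    1000 timesRepeat:
        [a do: [:each | sum := sum + (each \\ 7)]].
    ^sum

"Then evaluate something like this in a workspace. A block
 argument to benchmark: is an assumption; check the Test class for
 the real interface."
Test benchmark: [Test new benchmarkMix]
</pre>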
<h3>How we did Benchmarking</h3>
<p>When we benchmarked the system ourselves, we assembled a large
suite of accepted OO benchmarks, such as Richards, DeltaBlue (a
constraint solver), the Stanford benchmarks, Slopstones and
Smopstones. These benchmarks are already in the image, if you
want to run them. Try evaluating "VMSuite runBenchmarks" and look
at the code it runs. If you want a real performance comparison,
run these on other VMs.</p>
<p>As an example, I put a couple of very small micro-benchmarks
that are run the right way in the system tour (the code is in the
Test class). You can try running them on other Smalltalks as a
comparison.</p>
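<p>For a comparison run on another Smalltalk, something along the
following lines works in most dialects; millisecondsToRun: and
timesRepeat: are standard protocol, and benchmarkMix is the
illustrative method from the sketch above rather than one of the
methods actually shipped in the Test class.</p>

<pre>
"Portable timing sketch for comparison runs on another Smalltalk.
 Runs the benchmark several times and reports the fastest run,
 which is also what the benchmark: method described above does."
| best |
best := nil.
5 timesRepeat:
    [| ms |
     ms := Time millisecondsToRun: [Test new benchmarkMix].
     best := best isNil ifTrue: [ms] ifFalse: [best min: ms]].
Transcript show: 'fastest run: ', best printString, ' ms'; cr
</pre>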
<h3>Other benchmarking problems people have been having</h3>
<ul>
    <li>Several people have complained that their benchmark that
        runs "5000 factorial" in a loop crashes. If you
        read the troubleshooting section, you will see that the
        error message you are getting indicates that you are
        running out of virtual memory, which explains the crash.
        This is happening because the full garbage collector does
        not run automatically in Strongtalk right now (the
        generation scavenger of course runs fine). Obviously it
        would be nice if it ran automatically, but if you are
        allocating vast amounts of memory (which 5000 factorial
        does), please run "VM collectGarbage"
        occasionally; a sketch follows this list. And as we have
        already pointed out, factorial is a very bad
        (unrepresentative) benchmark on any system. <p>The moral
        of the story: if you have a crash, read the
        troubleshooting section.</p>
    </li>
132 <li>"Compile all methods
" crashes. Yes, it is a
133 known problem that is one method in the image that
134 crashes the bytecode compiler when it is run this way,
135 even in interpreted mode. Use some other benchmark (this
136 isn't a good benchmark anyway, as pointed out above).
</li>
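<p>For the factorial problem above, a minimal sketch of the
workaround looks like this; the loop bounds are arbitrary, and VM
collectGarbage is the expression mentioned in that item.</p>

<pre>
"Sketch of an allocation-heavy loop (5000 factorial builds huge
 LargeIntegers) with an explicit full collection requested every
 few iterations so the system does not run out of virtual memory.
 The counts are arbitrary illustrations."
1 to: 100 do: [:i |
    5000 factorial.
    i \\ 10 = 0 ifTrue: [VM collectGarbage]]
</pre>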