1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2 <html>
3 <head>
4 <!-- -->
5 <title>TAO Performance and Footprint Tuning</title>
6 <LINK href="tao.css" rel="stylesheet" type="text/css">
7 </head>
9 <body>
10 <hr><p>
11 <h3>TAO Compile-time and Run-time Performance and Footprint Tuning</h3>
13 <a name="overview"></a>
14 <h3>Overview</h3>
16 <p>
17 <!-- We talk of real-time here and throughout this document I dont -->
18 <!-- see where we have talked about lower latencies one of the -->
19 <!-- important aspects of RT systems. I understand the term -->
20 <!-- "throughput" is used for latencies. My understanding is that -->
21 <!-- improved latencies can give better throughtput, but better -->
22 <!-- throughput doesnt necessarily mean lower latencies. Please -->
23 <!-- correct me if I am wrong -->
24 TAO is increasingly being used to support high-performance
25 distributed real-time and embedded (DRE) applications. DRE
26 applications constitute an important class of distributed
27 systems where predictability and efficiency are essential for
28 success. This document describes how to configure <a href
29 ="index.html">TAO</a> to enhance its throughput, scalability,
30 <!-- Ossama, let me know if I am offtrack. Would it be better if -->
31 <!-- we mention this as "reduced latencies" instead of improved -->
32 <!-- latencies. I can make the change but thought would discuss -->
33 <!-- with you before jumping on to it. - Bala -->
34 <!-- Bala, aren't they the same? In any case Doug wrote -->
35 <!-- this. ;-) -->
36 <!-- Shouldnt be an issue though :-) -->
37 and latency for a variety of applications. We also explain
38 various ways to speed up the compilation of ACE+TAO and
39 applications that use ACE+TAO. </p>
41 <p>
42 As with most applications, including compilers, enabling
43 optimizations can often introduce side-effects that may not be
44 desirable for all use-cases. TAO's default configuration
45 therefore emphasizes programming simplicity rather than top
46 speed or scalability. Our goal is to ensure that CORBA
47 applications work correctly ``out-of-the-box,'' while also
48 enabling developers to further optimize their CORBA applications
49 to meet stringent performance requirements. </P>
51 <p>
52 TAO's performance tuning philosophy reflects the fact that there
53 are trade-offs between speed, size, scalability, and programming
54 simplicity. For example, certain ORB configurations work well
55 for a large number of clients, whereas others work better for a
56 small number. Likewise, certain configurations minimize
57 internal ORB synchronization and memory allocation overhead by
58 making assumptions about how applications are designed.
59 </p>
61 <p>
62 This document is organized as follows:
63 </p>
64 <ul>
65 <li>
66 <!-- Ossama, do we call it optimizing throughput? Shouldnt -->
67 <!-- we mention it as Improving throughput? Because the -->
68 <!-- suggestions that we give seems to show only that. -->
69 <!--
70 Bala, by optimizing throughput aren't we improving it? I
71 prefer "optimizing" but if the general consensus is that
72 "improving" is better than I won't debate the issue.
73 -->
74 <!-- Neither am I :-). I dont know why the term Optimizing -->
75 <!-- looks odd to me. I think this way -- the user can -->
76 <!-- apply different optimization strategies that we have -->
77 <!-- offered through different ORB options. Using the -->
78 <!-- strategies that TAO offers the user, can optimize -->
79 <!-- applications to get better throughput or reduced -->
80 <!-- latencies, as the case may be. For the application -->
81 <!-- developer this could involve rewriting portions of his -->
82 <!-- code. He actually optimizes his application -->
83 <!-- constrained by the strategies that TAO offers.
85 <!-- Honestly, I dont think its a matter worth loosing sleep -->
86 <!-- over. Why did I start that in the first place. Late -->
87 <!-- realisation :-)-->
89 <a href="#throughput">Optimizing Run-time Throughput</a>
90 <ul>
91 <li>
92 <a href="#client_throughput">Optimizing Client Throughput</a>
93 </li>
94 <li>
95 <a href="#server_throughput">Optimizing Server Throughput</a>
96 </li>
97 </ul>
98 </li>
99 <li>
100 <a href="#scalability">Optimizing Run-time Scalability</a>
101 <ul>
102 <li>
103 <a href="#client_scalability">Optimizing Client Scalability</a>
104 </li>
105 <li>
106 <a href="#server_scalability">Optimizing Server Scalability</a>
107 </li>
108 </ul>
109 </li>
110 <li>
111 <a href="#compile">Reducing Compilation Time</a>
112 <ul>
113 <li>
114 <a href="#compile_optimization">Compilation Optimization</a>
115 </li>
116 <li>
117 <a href="#compile_inlinling">Compilation Inlining</a>
118 </li>
119 </ul>
120 </li>
121 <li>
122 <a href="#footprint">Reducing Memory Footprint</a>
123 <ul>
124 <li>
125 <a href="#compile_footprint">Compile-time Footprint</a>
126 </li>
127 <li>
128 <a href="#runtime_footprint">Run-time Footprint</a>
129 </li>
130 </ul>
131 </li>
132 </ul>
134 <p><hr><p>
135 <a name="throughput"></a>
136 <h3>Optimizing Throughput</h3>
139 In this context, ``throughput'' refers to the number of events
140 occurring per unit time, where ``events'' can refer to
141 ORB-mediated operation invocations, for example. This section
142 describes how to optimize client and server throughput.
143 </p>
146 It is important to understand that enabling throughput
147 optimizations for the client may not affect the server
148 performance and vice versa. In particular, the client and
149 server ORBs may be designed by different ORB suppliers.
150 </p>
152 <a name="client_throughput"></a>
153 <h4>Optimizing Client Throughput</h4>
156 Client ORB throughput optimizations improve the rate at which
157 CORBA requests (operation invocations) are sent to the target
158 server. Depending on the application, various techniques can be
159 employed to improve the rate at which CORBA requests are sent
160 and/or the amount of work the client can perform as requests are
161 sent or replies received. These techniques consist of:
162 </p>
163 <ul>
164 <li>
165 <!-- Ossama, I have my jitters on putting this here for the -->
166 <!-- following reasons -->
167 <!-- 1. AMI doesnt have many optimizations built in. Most of -->
168 <!-- the configurations that we mention below wouldnt work -->
169 <!-- with AMI. Say for instance we dont have a RW handler -->
170 <!-- for AMI -->
172 <!--
173 Yes, I know that. No claim was made that the ORB
174 configurations mentioned below should be or could be used
175 with AMI. AMI was only given as an example of how to
176 potentially improve throughput using programmatical means,
177 as opposed to using static ORB configurations.
180 <!-- Agreed. With the little I know of users, they try to -->
181 <!-- mix and match. They tend to assume that programming -->
182 <!-- considerations can be mixed and matched with ORB -->
183 <!-- configurations. Hence my jitters. If we split things as -->
184 <!-- Dr.Schmidt suggests, I guess things could be better -->
187 <!-- 2.For long we have been interchanging the terms, -->
188 <!-- "Throughput" and "Latency". AMI is good for -->
189 <!-- "Throughput", you could keep the client thread busy by -->
190 <!-- making more invocations. I doubt whether that leads to -->
191 <!-- better latencies. I dont know. Further the ORB -->
193 <!--
194 No such claim was made, so what's the issue here? This
195 section is after all about improving throughput not
196 latency. :-)
198 <!-- Aahn!! See we interchange the usage of Latency and -->
199 <!-- Throughput which doesnt sound like a good idea. The ORB -->
200 <!-- configuration options that we suggest are mainly for -->
201 <!-- getting low latencies. Throughput is an after effect of -->
202 <!-- it. -->
204 <!-- configuration section talks about options that improve -->
205 <!-- latencies. IMHO, lower latencies can lead to improved -->
207 <!-- If the options I wrote about improve latency and not
208 throughput that should certainly be corrected. -->
210 <!-- I guess that is where we need to start working. The -->
211 <!-- strategies that we talk gives lower latencies and hence -->
212 <!-- better throughput. They have been -->
213 <!-- implemented/designed/thought about as options that will -->
214 <!-- give low latencies. Making that change should help a lot. -->
216 <!-- throughput, but vice-versa may not apply. -->
217 <!-- Please correct me if I am wrong. I am willing to stand -->
218 <!-- corrected. -->
219 <b>Run-time features</b> offered by the ORB, such as
220 Asynchronous Method Invocations (AMI)
221 <!-- Ossama, are there other examples you can list here? -->
222 <!-- ADD BUFFERED ONEWAYS -->
223 </li>
224 <li>
225 <b>ORB configurations</b>, such as disabling synchronization
226 of various parts of the ORB in a single-threaded application
227 </li>
228 </ul>
231 We explore these techniques below.
232 </p>
234 <h4>Run-time Client Optimizations</h4>
237 For two-way invocations, i.e., those that expect a reply
238 (including ``<CODE>void</CODE>'' replies), Asynchronous Method
239 Invocations (AMI) can be used to give the client the opportunity
240 to perform other work as a CORBA request is sent to the target,
241 handled by the target, and the reply is received.
242 </p>
244 <h4>Client Optimizations via ORB Configuration</h4>
247 A TAO client ORB can be optimized for various types of
248 applications:
249 </p>
251 <ul>
252 <li>
253 <b>Single-Threaded</b>
254 <ul>
255 <li>
257 One option is to disable synchronization in the
258 components of TAO responsible for constructing and sending
259 requests to the target and for receiving replies. These
260 components are called ``connection handlers.'' To disable
261 synchronization in the client connection handlers, simply
262 add:
263 </p>
264 <!-- Ossama, if we are going to ask people to use ST, -->
265 <!-- they could as well use ST reactor too. TAO uses a -->
266 <!-- reactor for ST and it would be better to use ST -->
268 <!-- Sure, but this particular section is about the
269 -ORBClientConnectionHandler section. We can certainly
270 mention that it is better to use the ST reactor. -->
272 <!-- reactor instead of TP. BTW, shouldnt we interchange -->
274 <!-- The TP reactor was never mentioned here, so what the
275 issue? -->
277 <!-- things here for example tell about RW and then go -->
278 <!-- to ST handlers? -->
280 <!-- Fine with me Bala. You know more about the this
281 option than I do. Go for it! :-) -->
283 <!-- No problem. I will start changing this once you -->
284 <!-- make your next pass -->
285 <blockquote>
286 <code>
287 <a href="Options.html#-ORBClientConnectionHandler">
288 -ORBClientConnectionHandler</a> ST
289 </code>
290 </blockquote>
293 to the list of <code>Client_Strategy_Factory</code>
294 options. Other values for this option, such as
295 <code>RW</code>, are more appropriate for "pure"
296 synchronous clients. See the <code>
297 <a href="Options.html#-ORBClientConnectionHandler">
298 -ORBClientConnectionHandler</a></code> option
299 documentation for details.
300 </p>
302 </li>
303 </ul>
304 </li>
306 <li>
307 <b>Low Client Scalability Requirements</b>
308 <ul>
309 <li>
311 Clients with lower scalability requirements can dedicate a
312 connection to one request at a time, which means that no
313 other requests or replies will be sent or received,
314 respectively, over that connection while a request is
315 pending. The connection is <i>exclusive</i> to a given
316 request, thus reducing contention on a connection.
317 However, that exclusivity
318 <!-- Ossama, I am not sure I understand that using -->
319 <!-- exclusive connections could lead to reduced -->
320 <!-- throughput. As a matter of fact we have a cache map -->
321 <!-- lookup on the client side for muxed and that would -->
322 <!-- increase the latencies a bit :-). Exclusive takes -->
323 <!-- more resources and that could leade reduced -->
324 <!-- scalability, right?-->
326 <!-- Bala, isn't that what I said? Paraphrasing what I
327 said, if the client has low scalability
328 requirements then exclusive connections can be used
329 to improve throughput. Isn't that incorrect? -->
331 comes at the cost of a smaller number of requests that
332 may be issued at a given point in time.
334 <!-- May be I am confused :-). The above statement that -->
335 <!-- says "smaller number of requests" tries to convey -->
336 <!-- that we will have reduced throughput. What am I -->
337 <!-- missing here? -->
338 To enable this
339 behavior, add the following option to the
340 <code>Client_Strategy_Factory</code> line of your
341 <code>svc.conf</code> file:
342 </p>
344 <blockquote>
345 <code>
346 <a href="Options.html#-ORBTransportMuxStrategy">
347 -ORBTransportMuxStrategy</a> EXCLUSIVE
348 </code>
349 </blockquote>
351 </li>
352 </ul>
353 </li>
354 </ul>
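<p>
As an illustration, the two client-side options above could appear in a
client's <code>svc.conf</code> file roughly as follows. This is a minimal
sketch that assumes the client strategy factory is configured with a
<code>static</code> directive. The two directives are alternatives, so
only one is left uncommented here.
</p>
<pre>
# Single-threaded client: no synchronization in the connection handlers.
static Client_Strategy_Factory "-ORBClientConnectionHandler ST"

# Alternative for clients with low scalability requirements:
# one pending request per connection.
# static Client_Strategy_Factory "-ORBTransportMuxStrategy EXCLUSIVE"
</pre>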
356 <a name="server_throughput"></a>
357 <h4>Optimizing Server Throughput</h4>
360 Throughput on the server side can be improved by configuring TAO
361 to use a <i>thread-per-connection</i> concurrency model. With
362 this concurrency model, a single thread is assigned to service
363 each connection. That same thread is used to dispatch the
364 request to the appropriate servant, meaning that thread context
365 switching is kept to a minimum. To enable this concurrency model
366 in TAO, add the following option to the
367 <code>
368 <a href="Options.html#DefaultServer">Server_Strategy_Factory</a>
369 </code>
370 entry in your <code>svc.conf</code> file:
371 </p>
373 <blockquote>
374 <code>
375 <a href="Options.html#orb_concurrency">
376 -ORBConcurrency</a> thread-per-connection
377 </code>
378 </blockquote>
381 While the <i>thread-per-connection</i> concurrency model may
382 improve throughput, it generally does not scale well due to
383 limitations of the platform the application is running on. In
384 particular, most operating systems cannot efficiently handle
385 more than <code>100</code> or <code>200</code> threads running
386 concurrently. Hence performance often degrades sharply as the
387 number of connections increases over those numbers.
388 </p>
391 Other concurrency models are further discussed in the
392 <i><a href="#server_scalability">Optimizing Server
393 Scalability</a></i> section below.
394 </p>
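<p>
For reference, the thread-per-connection option described in this section
might be written as a complete <code>svc.conf</code> directive like the
one below. This is a sketch that assumes the server strategy factory is
configured with a <code>static</code> directive.
</p>
<pre>
# One thread services each connection and dispatches its requests.
static Server_Strategy_Factory "-ORBConcurrency thread-per-connection"
</pre>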
396 <p><hr><p>
398 <a name="scalability"></a>
399 <h3>Optimizing Scalability</h3>
402 In this context, ``scalability'' refers to how well an ORB
403 performs as the number of CORBA requests increases. For
404 example, a non-scalable configuration will perform poorly as the
405 number of pending CORBA requests on the client increases from
406 <code>10</code> to <code>1,000</code>, and similarly on the
407 server. ORB scalability is particularly important on the server
408 since it must often handle many requests from multiple clients.
409 </p>
411 <a name="client_scalability"></a>
412 <h4>Optimizing Client Scalability</h4>
415 In order to optimize TAO for scalability on the client side,
416 connection multiplexing must be enabled. Specifically, multiple
417 requests may be issued and pending over the same connection.
418 Sharing a connection in this manner reduces the amount of
419 resources required by the ORB, which in turn makes more
420 resources available to the application. To enable this behavior
421 use the following <code>Client_Strategy_Factory</code> option:
422 </p>
424 <blockquote>
425 <code>
426 <a href="Options.html#-ORBTransportMuxStrategy">
427 -ORBTransportMuxStrategy</a> MUXED
428 </code>
429 </blockquote>
432 This is the default setting used by TAO.
433 </p>
435 <a name="server_scalability"></a>
436 <h4>Optimizing Server Scalability</h4>
439 Scalability on the server side depends greatly on the
440 <i>concurrency model</i> in use. TAO supports two concurrency
441 models:
442 </p>
444 <ol>
445 <li>Reactive, and</li>
446 <li>Thread-per-connection</li>
447 </ol>
450 The thread-per-connection concurrency model is described above
451 in the
452 <i><a href="#server_throughput">Optimizing Server
453 Throughput</a></i>
454 section.
455 </p>
458 A <i>reactive</i> concurrency model employs the Reactor design
459 pattern to demultiplex incoming CORBA requests. The underlying
460 event demultiplexing mechanism is typically one of the
461 mechanisms provided by the operating system, such as the
462 <code>select(2)</code> system call. To enable this concurrency
463 model, add the following option to the
464 <code>
465 <a href="Options.html#DefaultServer">Server_Strategy_Factory</a>
466 </code>
467 entry in your <code>svc.conf</code> file:
468 </p>
470 <blockquote>
471 <code>
472 <a href="Options.html#orb_concurrency">
473 -ORBConcurrency</a> reactive
474 </code>
475 </blockquote>
478 This is the default setting used by TAO.
479 </p>
482 The reactive concurrency model provides improved scalability on
483 the server side because fewer resources are used,
484 which in turn allows a very large number of requests to be
485 handled by the server side ORB. This concurrency model provides
486 much better scalability than the thread-per-connection model
487 described above.
488 </p>
491 Further scalability tuning can be achieved by choosing a Reactor
492 appropriate for your application. For example, if your
493 application is single-threaded then a reactor optimized for
494 single-threaded use may be appropriate. To select a
495 single-threaded <code>select(2)</code> based reactor, add the
496 following option to the
497 <code>
498 <a href="Options.html#AdvancedResourceFactory">Advanced_Resource_Factory</a>
499 </code>
500 entry in your <code>svc.conf</code> file:
501 </p>
503 <blockquote>
504 <code>
505 <a href="Options.html#-ORBReactorType">
506 -ORBReactorType</a> select_st
507 </code>
508 </blockquote>
511 If your application uses thread pools, then the thread pool
512 reactor may be a better choice. To use it, add the following
513 option instead:
514 </p>
516 <blockquote>
517 <code>
518 <a href="Options.html#-ORBReactorType">
519 -ORBReactorType</a> tp_reactor
520 </code>
521 </blockquote>
524 This is TAO's default reactor. See the
525 <code>
526 <a href="Options.html#-ORBReactorType">-ORBReactorType</a>
527 </code>
528 documentation for other reactor choices.
529 </p>
532 Note that you may have to link the <code>TAO_Strategies</code>
533 library into your application in order to take advantage of the
534 <code>
535 <a href="Options.html#AdvancedResourceFactory">Advanced_Resource_Factory</a>
536 </code>
537 features, such as alternate reactor choices.
538 </p>
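<p>
Putting the server-side settings together, a single-threaded server using
the reactive concurrency model and the single-threaded reactor might use
a <code>svc.conf</code> file along these lines. This is only a sketch: it
assumes that the <code>Advanced_Resource_Factory</code> has been made
available to the service configurator by linking
<code>TAO_Strategies</code>, and depending on how your application is
built a <code>dynamic</code> directive may be needed instead of a
<code>static</code> one.
</p>
<pre>
# Reactive concurrency (TAO's default) on the server side.
static Server_Strategy_Factory "-ORBConcurrency reactive"

# Single-threaded select()-based reactor; requires TAO_Strategies.
static Advanced_Resource_Factory "-ORBReactorType select_st"
</pre>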
541 A third concurrency model, <i>un</i>supported by TAO, is
542 <i>thread-per-request</i>. In this case, a single thread is
543 used to service each request as it arrives. This concurrency
544 model generally provides neither scalability nor speed, which is
545 the reason why it is often not used in practice.
546 </p>
548 <p><hr><p>
549 <a name="compile"></a>
550 <h3>Reducing Compilation Time</h3>
552 <a name="compile_optimization"></a>
553 <h4>Compilation Optimization</h4>
555 When developing software that uses ACE+TAO you can reduce the time it
556 takes to compile your software by not enabling your compiler's optimizer
557 flags. These often take the form -O&lt;n&gt;.<P>
559 Disabling optimization for your application will come at the cost of run
560 time performance, so you should normally only do this during
561 development, keeping your test and release builds optimized. <P>
563 <a name="compile_inlinling"></a>
564 <h4>Compilation Inlining</h4>
566 When compiler optimization is disabled, it is frequently the case that
567 no inlining will be performed. In this case the ACE inlining will be
568 adding to your compile time without any appreciable benefit. You can
569 therefore decrease compile times further by building your
570 application with the -DACE_NO_INLINE C++ flag. <P>
572 In order for code built with -DACE_NO_INLINE to link, you will need to
573 be using a version of ACE+TAO built with the "inline=0" make flag. <P>
575 To accommodate both inline and non-inline builds of your application
576 it will be necessary to build two copies of your ACE+TAO libraries,
577 one with inlining and one without. You can then use your ACE_ROOT and
578 TAO_ROOT variables to point at the appropriate installation.<P>
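<p>
As a rough sketch, the <code>platform_macros.GNU</code> file for the
non-inlined development copy of ACE+TAO could contain settings like the
following. The <code>inline=0</code> flag is the one mentioned above;
setting <code>optimize=0</code> to turn off the optimizer is an
assumption about your build, so adjust it to match your platform file.
</p>
<pre>
# Development build: faster compiles, slower code.
optimize=0
inline=0
</pre>
<p>
The optimized, inlined copy used for test and release builds would keep
its own <code>platform_macros.GNU</code> in a separate installation.
</p>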
580 <p><hr><p>
581 <a name="footprint"></a>
582 <h3>Reducing Memory Footprint</h3>
584 <a name="compile_footprint"></a>
585 <h4>Compile-time Footprint</h4>
587 Using -xO3 with -xspace on the Sun
588 CC 5.x compiler has been observed to reduce footprint by roughly 40%.</P>
589 <P>Footprint can also be reduced by specifying the following in your
590 platform_macros.GNU file: </P>
592 <code>
593 <pre>
594 optimize=1
595 debug=0
596 CPPFLAGS += -DACE_USE_RCSID=0 -DACE_NLOGGING=1
597 </pre>
598 </code>
601 If portable interceptors aren't needed, code around line 729 in
602 <code>$TAO_ROOT/tao/orbconf.h</code> can be modified to hard-code
603 <code>TAO_HAS_INTERCEPTORS</code> as <code>0</code>, and all interceptor
604 code will be skipped by the preprocessor.
607 <TABLE BORDER=2 CELLSPACING=2 CELLPADDING=2>
608 <caption><b>IDL compiler options to reduce compile-time footprint</b></caption>
609 <TH>Command-Line Option
610 <TH>Description and Usage
611 <TR>
612 <TD><code>-Sc</code>
613 <TD>Suppresses generation of the TIE classes (template classes used
614 to delegate request dispatching when IDL interface inheritance
615 would cause a 'ladder' of inheritance if the servant classes had
616 corresponding inheritance). This option can be used almost all the
617 time.
618 <tr>
619 <td><code>-Sa</code>
620 <td>Suppresses generation of Any insertion/extraction operators. If
621 the application IDL contains no Anys, and the application itself
622 doesn't use them, this can be a useful option.
623 <tr>
624 <td><code>-St</code>
625 <td>Suppresses type code generation. Since Anys depend on type codes,
626 this option will also suppress the generation of Any operators. Usage
627 requires the same conditions as for the suppression of Any operators,
628 plus no type codes in application IDL and no application usage of
629 generated type codes.
630 <tr>
631 <td><code>-GA</code>
632 <td>Generates type code and Any operator definitions into a separate
633 file with an 'A' suffix just before the <code>.cpp</code> extension.
634 This is a little more flexible and transparent than using <code>-Sa</code> or
635 <code>-St</code> if you are compiling to DLLs or shared objects,
636 since the code in this file won't get linked in unless it's used.
637 <tr>
638 <td><code>-Sp</code>
639 <td>Suppresses the generation of extra classes used for thru-POA
640 collocation optimization. If the application has no collocated
641 client/server pairs, or if the performance gain from collocation
642 optimization is not important, this option can be used.
643 <tr>
644 <td><code>-H dynamic_hash</code><br>
645 <code>-H binary_search</code><br>
646 <code>-H linear_search</code><br>
647 <td>Generates alternatives to the default code generated on
648 the skeleton side for operation dispatching (which uses perfect
649 hashing). These options each give a small amount of footprint
650 reduction, each amount slightly different, with a corresponding tradeoff
651 in speed of operation dispatch.
652 </TABLE>
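<p>
As an example of combining these options, an application with no
collocated client/server pairs that does not use TIE classes might invoke
the IDL compiler roughly as shown below. This is a sketch only:
<code>Messenger.idl</code> is a placeholder file name, and the
combination of flags is an illustrative choice, not a recommendation for
every application.
</p>
<pre>
# Messenger.idl is a placeholder for your own IDL file.
tao_idl -Sc -Sp -GA Messenger.idl
</pre>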
654 <a name="runtime_footprint"></a>
655 <h4>Run-time Footprint</h4>
657 <!-- Doug, put information about how to reduce the size of the -->
658 <!-- connection blocks, etc. -->
660 <table border="1" width="75%">
661 <caption><b>Control size of internal data structures<br></b></caption>
662 <thead>
663 <tr valign="top">
664 <th>Define</th>
665 <th>Default</th>
666 <th>Minimum</th>
667 <th>Maximum</th>
668 <th>Description</th>
669 </tr>
670 </thead><tbody>
671 <tr>
672 <td>TAO_DEFAULT_ORB_TABLE_SIZE</td>
673 <td>16</td>
674 <td>1</td>
675 <td>-</td>
676 <td>The size of the internal table that stores all ORB Cores.</td>
677 </tr>
679 <tr><td></td>
680 </tr>
681 </tbody></table></p><p>
683 More information on reducing the memory footprint of TAO is available
685 <A HREF="http://www.ociweb.com/cnb/CORBANewsBrief-200212.html">here</A>. <P>
687 <hr><P>
688 <address><a href="mailto:ossama@uci.edu">Ossama Othman</a></address>
689 <!-- Created: Mon Nov 26 13:22:00 PST 2001 -->
690 <!-- hhmts start -->
691 Last modified: Thu Jul 14 16:36:12 CDT 2005
692 <!-- hhmts end -->
693 </body>
694 </html>