--- !ditz.rubyforge.org,2008-03-06/issue
title: "Split both data and code into two sections (e.g., by file separation): common (or shared) and client (i.e., host)."
Rather than pursuing the current approach of having host code incrementally
feed source into an aggregate of compilable text, instead isolate the shared
part of the code from the part specific to the client, so that this shared
part can be processed (at compile time) into representations appropriate for
both the client (i.e., the OpenCL "host") and any compute servers (i.e.,
OpenCL "devices").

In this way, the question of exactly how inclusions and compilation (for
processing by OpenCL) should occur can be worked out independently of the
normal processes of declaring data structures and code.

A possible drawback to the above is that it might limit extensibility to
further clients of this library (either client libraries or end-user
applications), but it should still be possible to provide for this while
preserving the above separation internally (e.g., by offering routines for
this purpose).

Another way of viewing this is that the fundamental functionality available
within and outside of the library need not change, but that source might be
rearranged to allow for a more straightforward representation for the sake of
the library's internal implementation, possibly with external benefits as
well (e.g., if the API header were adjusted to expose a limited subset API
for use with compute servers (devices)).

reporter: David Hilvert <dhilvert@auricle.dyndns.org>
creation_time: 2009-10-10 23:26:08.168811 Z
id: d0797684fabf05af24e73639e0ce5e30a145a3c5
- - 2009-10-10 23:26:47.592939 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
- Add task for splitting between common (client and compute server) and client-specific code.
- - 2009-10-10 23:44:24.896284 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
Note that exposing a limited subset API for use with compute devices might
well be more properly viewed as a bug than as a feature (although it could be
the latter). In any case, it is likely that this will occur for the
convenience of implementation, since the types (or some subset) will be
needed on the side of the compute server, and adding #ifdefs (or #ifs) to the
header is likely the most convenient manner of accomplishing this.

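As a rough sketch (the guard macro ALE_OPENCL_DEVICE and the declarations
shown are hypothetical, not current Libale identifiers), such a header might
be partitioned as follows:

    /* Shared section: plain type definitions usable by both the host C
       compiler and the OpenCL device compiler. */
    typedef struct {
            unsigned int width;
            unsigned int height;
    } ale_extents_t;

    #ifndef ALE_OPENCL_DEVICE
    /* Host-only section: allocation and other API entry points, which
       have no meaning on the compute device. */
    ale_extents_t *ale_extents_new(unsigned int width, unsigned int height);
    #endif

The device-side compile would then define ALE_OPENCL_DEVICE so that only the
shared subset is visible.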
- - 2009-10-10 23:57:18.054878 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
Other resources to look at in this area (client/server separation), but
probably for implementations far more general than currently appearing within
Libale, would include CORBA at least, and possibly things such as dbus, to
the extent that they specify client/server relationships. (X might also be
relevant, especially as it predominantly involves such relationships within
the same machine.) One area to look at, for example, would be the IDL of
CORBA. It should be noted, however, that since everything within Libale is
currently implemented internally in C, all of the above would probably be
overkill at the moment. In the case of CORBA, there is a data parallel spec
that could be looked at (again, probably for something more general than
would be needed here).

All of the above might be useful were there to be interest in directly
accessing the Libale server side (OpenCL compute contexts) from external
libraries or applications.

Relevant links might include:

http://www.omg.org/technology/documents/formal/data_parallel.htm
http://dbus.freedesktop.org/doc/dbus-faq.html
http://developer.gnome.org/doc/guides/corba/html/book1.html
- - 2009-10-11 01:02:22.744954 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
For specifics of OpenCL capabilities for the purposes of splitting code, the
spec would be the thing to reference. E.g., something will have to be devised
for sorting and aggregating sources into a program, which could probably be
done explicitly via a separate include file (much as ALE's d2.cc and d3.cc
aggregate namespace files in a manner recognizing dependencies). Beyond this,
nothing new should be incurred by the proposed change. (E.g., recursion is
not supported, but this introduces no new obstacles under the proposed
approach.) The latest spec as of 16 May 2009 is here:

http://www.khronos.org/registry/cl/specs/opencl-1.0.43.pdf

For a slide presentation of why GPU and compiler technology don't always mesh
very well, the following might be of interest (although much of this will
ideally be addressed by OpenCL):

http://www.cs.unc.edu/Events/Conferences/GP2/slides/Cooper.pdf

The following is a much better overview of current technologies:

http://www.hpccommunity.org/blogs/bearcat/multi-core-gpu-background-77/

Unfortunately, none of the above address the issue of sharing of data
structures or objects in a constructive manner. Fortunately, all seem to
suggest OpenCL (or something like it) as the most feasible current approach.
- - 2009-10-11 04:23:45.681282 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
Consider that, in addition to client-specific and shared code, it would
probably be worthwhile to also provide for server-specific code (i.e., code
specific to the compute device), as various idioms appropriate for the
compute device (e.g., the 'kernel' qualifier and other attributes, OpenCL
built-in functions, etc.) may not be appropriate for compilation targeted to
the client (the OpenCL host).

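For example, a kernel along the following lines (illustrative only) uses
qualifiers and built-ins that are meaningful to the OpenCL compiler but would
be rejected by an ordinary host C compiler:

    /* Device-only code: __kernel, __global, and get_global_id() exist
       only in the OpenCL language. */
    __kernel void ale_scale(__global float *data, float factor)
    {
            size_t i = get_global_id(0);
            data[i] *= factor;
    }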
- - 2009-10-11 04:34:21.081366 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
Note that separation of client, server, and common code could be achieved by
directory (e.g., server/, client/, common/), filename component (e.g.,
-server, -client, -common), or by preprocessor conditionals (as will likely
be done for the API header file), among other methods.

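For illustration, a directory-based split (file names hypothetical) might
look like:

    common/image.c  -- shared data types and lightweight shared routines
    client/image.c  -- host-only code (allocation, OpenCL API calls)
    server/image.c  -- device-only code (kernels, built-in functions)
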
Other names for labeling server code might include 'kernels', 'device'
(which?), or 'cl'; other names for client code might include 'host' (after
OpenCL usage).

The client/server terminology seems to follow the X Window System model,
which might be appropriate, especially if computation is eventually
distributed over cluster or grid platforms.
- - 2009-10-11 12:45:25.752974 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
As a refinement to the above suggestion, consider the following:

If maintenance of a separate file for explicitly indicating dependencies
among sources is necessary (as it would be for the proposed shared-code
approach under a library, such as OpenCL, that does not currently provide for
separate translation units), then such dependencies could also be indicated
by calling a sequence of initializations, one per file, according to the
method of incremental aggregation currently conceived (via ale_eval or
equivalent). The most obvious difference in implementation between the
current approach (aggregation of code text at runtime) and the above-proposed
approach (aggregation of code text at compile time) is that in the current
approach, code sharing is limited mostly to data structures (via the current
macro definitions), as no provision is currently made for sharing other kinds
of code, whereas the above-proposed approach would immediately provide for
complete sharing (via common files).

Unfortunately, it's not immediately clear that the above-proposed approach is
better than an approach that does not share code. In particular, the current
choice of data-type sharing over code sharing seems, on the face of it, a
good one. Much of the code in the host (or client) and device (or server)
domains is not naturally shared, whereas data types can be. To establish a
large shared body of code would require that a fair amount of code be usable
in both domains. (E.g., simple routines for accessing data in a canonical
way, or preserving some invariant during data modification, might be
appropriate for sharing.) If, on the other hand, the degree of sharing is
small, then adding a new structure for managing sharing through files might
be more complex than necessary. Furthermore, if compilation of server code
occurs at run-time (as may be common under most OpenCL installations), then
excessive sharing of code might impose a run-time penalty.

Given this, the best argument for sharing of code via files might be a more
natural syntax for data structure definition. In particular, it would no
longer be necessary to define these via the macro processor. Furthermore, the
sorts of lightweight operations (simple access and modification) indicated
above would now be possible. Another advantage of having a separate file for
shared code would be avoiding the need for separate consideration of each
data structure (or each macro declaration).

Instead of something like this:

    ALE_...(... data_struct_1 ...)
    ALE_...(... data_struct_2 ...)
    ALE_...(... data_struct_3 ...)

or this:

    ale_eval( ... data_struct_1 ... data_struct_2 ... data_struct_3 ...)

we get the more natural representation:

    foo-shared.c:

        N1
        N2
        N3

    libale-server.c:

        #include "foo-shared.c"

where Nx may include a macro element (e.g., ALE_...(data_struct_1)), and
where libale-server may ultimately be processed, if desired, by a call to
ale_eval().

Hence there is a consolidation of data structure definitions on the server
side, in addition to provisions for code sharing. Maintaining a wrapped
runtime option for structure definition (via ale_eval or otherwise) would
probably be good, however, for the purposes of supporting client code or
client libraries.
- - 2009-10-11 13:10:51.957739 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
Note that the empty line following 'libale-server.c' in my most recent
comment should have read '#include "foo-shared.c"'.
- - 2009-10-11 14:28:17.840394 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
Consider that, if preprocessor macros were used for implementing the sorts of
methods for accessing and modifying data described above, then it should be
possible to implement these in a manner efficient for both client and server
implementations, hence strengthening the case for a body of shared code (in
addition to shared data types) between client and server.

(Note from examples in the OpenCL spec that preprocessor conditionals can be
used a fair amount for kernel definitions. This sort of code is probably the
sort that would not be good for sharing with the client. On the other hand, a
kernel could efficiently make use of macros defined in a separate, shared
file.)

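As a sketch (the macro name and structure field here are hypothetical), such
a shared macro might look like:

    /* Expands to plain arithmetic, hence efficient under both the host
       C compiler and the OpenCL device compiler. */
    #define ALE_PIXEL_INDEX(img, x, y)  ((y) * (img)->width + (x))

A kernel and host code could then both index image data through
ALE_PIXEL_INDEX, at no run-time cost on either side.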
- - 2009-10-11 15:27:41.066320 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
Consider that the most straightforward approach to partitioning and naming
files would probably be the following:

Create a new header file for each source file currently defining shared data
types, and move the structure part of the definition into the header file.
Header file naming follows the usual C convention, so that, e.g., image.c
will now have an associated file image.h. (Since certain data type details,
such as memory allocation, are currently executed only by the host, these
need not be moved.) One benefit of such division may be relieving
src/libale.h of its duty as a repository of miscellaneous definitions.

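For instance (structure contents hypothetical), the shared structure part of
image.c might move into an image.h along these lines, while allocation
routines remain in image.c:

    #ifndef ALE_IMAGE_H
    #define ALE_IMAGE_H

    /* Structure definition shared between host and device code. */
    typedef struct {
            unsigned int width;
            unsigned int height;
            unsigned int depth;
    } ale_image_t;

    #endif
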
Name files containing code specific to a compute device with the qualifier
-cl, so that, e.g., image.c gains a counterpart image-cl.c. (This bit is
fairly obvious, and not immediately relevant to the question of file
separation of shared code.) Handling of the transfer of .h and -cl.c files to
the OpenCL API will likely occur through a process including compile-time
translation into a form allowing use with either direct OpenCL API calls or
(more likely) a Libale intermediate call (e.g., ale_eval), as well as the
actual execution of such calls at runtime. Hence, a further translation step
beyond standard C preprocessing should probably be added at compile time,
perhaps a very simple one (e.g., converting a file to a quoted C string, and
either including this string within some new named function or assigning it
to a variable that can then be referenced by other library code).

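As a sketch of the simple case (file and variable names hypothetical), such a
step might turn image-cl.c into a generated C file quoting the source as a
string, which host code could then hand to the OpenCL program-creation calls
(or to ale_eval):

    /* image-cl.str.c: generated at compile time from image-cl.c. */
    const char *ale_image_cl_source =
            "__kernel void ale_scale(__global float *data, float factor)\n"
            "{\n"
            "        size_t i = get_global_id(0);\n"
            "        data[i] *= factor;\n"
            "}\n";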
- - 2009-10-11 15:55:54.298156 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
Consider that, since things like ale_eval definitions might occur within code
evaluated at runtime (e.g., defining MAP_PIXEL and such for image code), such
code could either be stored separately (e.g., image-ale.c separate from
image-cl.c), or the two could be stored together (e.g., in the case that the
file will be processed by ale_eval; an appropriate name might be image-rt.c,
for runtime).

In the best of scenarios, it will be possible to implement MAP_PIXEL directly
in the OpenCL language, so that the above is not necessary. If this is not
sufficient, however, some additional layer (e.g., m4 or a similar processor)
might be used by ale_eval.
- - 2009-10-17 01:31:26.090103 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
Consider the following fairly straightforward argument in favor of the
proposed .h (header) separation, in addition to .c source separation:

As previously conceived, the only separation in files for the CL program
would be as a .c file, with, e.g., structure definition occurring via
variables or functions generated by macros. The revised approach (using an
additional .h file) may be seen as more convenient, as macros are not
required for this, as outlined in earlier comments. This is, however, the
lesser concern at the moment.

More to the point, compilation is currently failing on the src/align.c file,
with the first error occurring at a structure making reference to an
elem_bounds_t type from the transformation class. In order to reference this
structure from align.c, an inclusion of some sort will be necessary.

Preserving the current file structure, the options would be either
include/ale.h or src/libale.h, but neither of these is ideal. Better would be
to associate the structure definition more tightly with the defining source;
otherwise, structure definitions would be distributed somewhat haphazardly
between these files and the .c files according to whether they were used
elsewhere.

(Note that there are currently efforts toward automatic parallelization
within GCC, under the Graphite framework, but that efforts toward integration
with an OpenCL backend are apparently not being made at the moment. One
possibility would be to use either something similar to OpenMP (as Graphite
may), an extension to this, or a separate syntax (as I had suggested on the
mailing list) to indicate areas that should be parallelized for an OpenCL
backend.)

(A further alternative would be to await a hardware (e.g., Larrabee) or
software (e.g., OpenCL alternative or successor) solution more suited to
automatic methods. Binary translation based on QEMU might be one very general
possibility, but probably more challenging than even a compiler [e.g., GCC]
approach, the advantage of these over the current approach being largely
cleanness and generality, which might lead to greater maintainability in the
long run.)
- - 2009-10-17 02:28:18.322606 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
The following might be worth looking at:

http://portal.acm.org/citation.cfm?id=1504194
- - 2009-10-18 01:36:20.217266 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
Rather lengthy discussion regarding OpenCL and OpenMP can be found here:

http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=119644

This discussion references the earlier-linked paper. There is also an
(apparently) proprietary compiler linked from this discussion having a CUDA
(also proprietary) back-end, but nothing obvious currently available with an
OpenCL back-end, or otherwise suitable as a standard replacement for direct
use of the OpenCL API.
- - 2009-10-20 14:41:03.658982 Z
- David Hilvert <dhilvert@auricle.dyndns.org>
Note that kernel and hardware mechanisms (esp. page faulting and caching) are
probably the correct approach to pointer sharing, and that it would probably
be best in the long run for either (a) GPUs to adopt a hardware memory
management approach, or (b) GPUs to be sufficiently integrated with CPUs such
that memory management can be shared between the two. (One might think that
Intel would capitalize on this idea sooner rather than later -- they seem to
be postponing this until Larrabee finally ships instead of planning something
sooner; despite their existing integrated graphics line, they have as yet
made no announcements on GPGPU advances taking advantage of more uniform
memory management between CPU and GPU.)

Given all of the above, while a compiler solution is likely possible to some
extent, and might work for ALE, the difficulty of working out special cases
such as void pointers and casting suggests that finding a long-term
maintainer for such a solution within the GCC maintainer community might be a
bit difficult. (Void pointers and casting are exactly the sorts of things
that hardware facilities could trivially cope with, but that compiler
features might struggle with.) For now, management of pointers within Libale
via the OpenCL API might continue to be acceptable, but a hardware solution
should probably be looked for in the long run.