bugs/issue-507685b9ac970ed0fdbc479e4d25fc07ee13efe5.yaml

   1 --- !ditz.rubyforge.org,2008-03-06/issue
   2 title: Refine approach for sharing pointer data structures with compute device
   3 desc: |-
   4   E.g., one approach might be to prepend a bitfield to structures to be passed to
   5   the compute device indicating the positions of pointers in the structure, so
   6   that any necessary dependencies could be included and passed as well.  (The
   7   resulting collection of structures could be packed into a pre-allocated memory
   8   area, a new area could be made, etc., but the primary point is that it would
   9   not be necessary to use a special pointer structure (or pointer macros) on the
  10   host.)
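
       A minimal sketch of the bitfield idea in C, under assumptions not found in
       current code: the names record, ptr_mask, nslots, and count_records are
       hypothetical, slots are word-sized, and structures are shallow and acyclic.
       Bit i of the mask marks slot i as a pointer, so that a packing routine could
       find dependencies without any special pointer type on the host.

       ```c
       #include <assert.h>
       #include <stdint.h>

       #define MAX_SLOTS 8

       /* Hypothetical shareable record: a bitfield followed by word-sized slots.
        * Bit i of ptr_mask is set when slots[i] holds a pointer to another record. */
       typedef struct record {
           uint32_t ptr_mask;
           uint32_t nslots;
           uintptr_t slots[MAX_SLOTS];
       } record;

       /* Count this record plus every record reachable through marked slots.
        * Assumes shallow, acyclic structures; a record referenced from two
        * places is counted once per reference. */
       static unsigned count_records(const record *r) {
           unsigned total = 1;
           for (uint32_t i = 0; i < r->nslots; i++)
               if (r->ptr_mask & (1u << i))
                   total += count_records((const record *)r->slots[i]);
           return total;
       }
       ```

       Counting is only a first step; a packer along these lines would presumably
       copy each reachable record into the transfer area in the same traversal.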
  11
  12   (The above approach would be most appropriate to areas where structural depth
  13   is shallow [i.e., nested dependency counts are small] and structures are small,
  14   reflecting the situation for most of 2D code.  For 3D code, a different
  15   approach might be required (e.g., the current approach of pointer encoding in a
  16   union or structure, a new approach involving a pre-allocated data structure, or
  17   multiple such structures as with a traditional memory page allocator, etc.))
  18 type: :task
  19 component: libale
  20 release: 0.0.0
  21 reporter: David Hilvert <dhilvert@auricle.dyndns.org>
  22 status: :unstarted
  23 disposition:
  24 creation_time: 2009-10-22 16:55:57.695059 Z
  25 references: []
  26
  27 id: 507685b9ac970ed0fdbc479e4d25fc07ee13efe5
  28 log_events:
  29 - - 2009-10-22 16:56:00.206484 Z
  30   - David Hilvert <dhilvert@auricle.dyndns.org>
  31   - created
  32   - ""
  33 - - 2009-10-22 17:07:07.287977 Z
  34   - David Hilvert <dhilvert@auricle.dyndns.org>
  35   - commented
  36   - |-
  37     It's possible that, if done properly, such a reformatting of pointers could
  38     allow processing to more easily occur in client code (e.g., for the case of
  39     parsing of filter and rendering descriptor strings appearing within
  40     command-line arguments, which should really be a client task).
  41
  42     Further, consider that a parameter for type could be used as an alternative to
  43     a bitfield structure member in at least some cases, as such a parameter could
  44     be used for either retrieving information about pointer dependencies from some
  45     other location or for directly representing such a bitfield.
  46 - - 2009-10-22 18:30:23.843635 Z
  47   - David Hilvert <dhilvert@auricle.dyndns.org>
  48   - commented
  49   - |-
  50     Consider that the above-noted bitfields or type signifiers could be specified
  51     explicitly by the user in addition to structure specification, or such
  52     bitfields or signifiers could be extracted automatically from headers (either
  53     .h or, if this is not sufficient, a different representation, such as IDL [see
  54     {issue d0797684fabf05af24e73639e0ce5e30a145a3c5} for links to CORBA pages that might be a relevant starting point for
  55     investigation of IDL as a possibility]; this could be seen as a special form of
  56     serialization or remote execution, as alluded to in the earlier bug comments).
  57 - - 2009-10-22 18:41:29.273835 Z
  58   - David Hilvert <dhilvert@auricle.dyndns.org>
  59   - commented
  60   - |-
  61     An alternative to dealing with OpenCL's concept of memory management might be to
  62     use an architecture that allows the compute device to interact with host memory.
  63     (Cell might be one such architecture; Gregory Maxwell had suggested a Cell port
  64     ca. 2007, which inspired these acceleration efforts.  Whether Cell is the most
  65     appropriate target at the moment is not clear, due to issues of availability among
  66     testers.)
  67 - - 2009-10-22 20:33:58.215645 Z
  68   - David Hilvert <dhilvert@auricle.dyndns.org>
  69   - commented
  70   - |-
  71     Note that, if a Cell approach is pursued, an acceptable alternative for testers
  72     not using Cell might be AltiVec or SSE (which had been suggested on the mailing list
  73     or in e-mail some time ago; check the reference).  A review of Wikipedia's AltiVec
  74     article indicates that both of these have instructions for controlling cache, which
  75     is observably a problem in multi-threaded operation (at least) in the multi-alignment
  76     case.
  77
  78     Wikipedia refs:
  79
  80     http://en.wikipedia.org/wiki/AltiVec
  81     http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
  82     http://en.wikipedia.org/wiki/Cell_software_development
  83     http://en.wikipedia.org/wiki/Cell_(microprocessor)
  84 - - 2009-10-23 11:29:37.802858 Z
  85   - David Hilvert <dhilvert@auricle.dyndns.org>
  86   - commented
  87   - "Consider that it would probably be worthwhile to compare speeds for GPU, Cell,\n\
  88     and so-called multimedia extensions (SSE, etc.), for tasks similar to those\n\
  89     found in ALE 2D and 3D code.\n\n\
  90     Relevant sources may include (via Google cell gpu sse benchmarks):\n\n\
  91     http://ps3hacking.googlecode.com/files/Benchmark-ITA.pdf\n\n\
  92     \twhich compares operations on images, giving \n\n\
  93     \t\t3.3ms CPU\n\
  94     \t\t2.4ms CPU (SIMD)\n\
  95     \t\t1.0ms GPU (8 pixel shaders)\n\
  96     \t\t0.87ms Cell\n\
  97     \t\t0.4ms GPU (32 pixel shaders)\n\n\
  98     \tand which identifies memory transfers as the main bottleneck (which\n\
  99     \tseems consistent with what is observed at least in the multi-alignment\n\
 100     \tcase of ALE, although this latter may be at least partly due to caching\n\
 101     \tinefficiencies).\n\n\
 102     \tThe paper gives a general overview of differences, identifying relevant\n\
 103     \tparameters of the tested configurations for Cell (204.8 GFLOPS, 3.2GHz,\n\
 104     \t25.6GB/s to memory), GeForce 8800 GTX (518.4, 1.35, 86.4), GeForce 7800\n\
 105     \tGTX (165, 0.43, 38.4), GeForce 6800 Ultra (54, 0.4, 35.2).\n\n\
 106     \tSuggested means of using SSE is via an include file apparently provided\n\
 107     \tby Intel (?).\n\n\
 108     \tThe method tested for benchmarking the various hardware is image\n\
 109     \tsegmentation by Laplacian kernel computation.\n\n\
 110     \tAs expected, transfer of images to/from GPU is identified as a\n\
 111     \tsignificant bottleneck in the GPU case, along with context creation\n\
 112     \t(for Cg).\n\n\
 113     \tThe paper explicitly notes its choice of commodity processors,\n\
 114     \tsomething relevant to availability of hardware to testing communities\n\
 115     \tfor FOSS projects.\n\n\
 116     \tCell is mentioned as being particularly dependent on choice of compiler\n\
 117     \toptions (not a good thing).\n\n\
 118     \tDetails:\n\n\
 119     \tPaper submitted for the First Workshop on General Purpose Processing on\n\
 120     \tGraphics Processing Units, October 4, 2007 Northeastern University,\n\
 121     \tBoston, MA\n\n\
 122     \t\"Benchmark of multi-core technologies through image segmentation\n\
 123     \talgorithms using Cell processor, GPGPU architecture and SIMD\n\
 124     \ttechniques\"\n\n\
 125     \tDaniel L\xC3\xA9lis Baggio (danielbaggio@gmail.com), Fernando Fernandes Neto\n\
 126     \t(alphakiller_@msn.com), Thom\xC3\xA1s Dias (tcdias@gmail.com), Einstein do\n\
 127     \tNascimento Jr. (einsteinnjr@gmail.com), Mauro Eidi Vilela\n\
 128     \tAssano (massano@br.ibm.com), Celso Hirata (hirata@ita.br) - Instituto\n\
 129     \tTecnol\xC3\xB3gico de Aeron\xC3\xA1utica - Brazil \n\n\
 130     A tentative conclusion might be that it would be unwise to ignore GPUs."
 131 - - 2009-10-23 12:46:31.241173 Z
 132   - David Hilvert <dhilvert@auricle.dyndns.org>
 133   - commented
 134   - "Relevant comparisons for computational hardware would probably include any based on \n\
 135     results from the Folding@home project (since this code would probably be optimized for\n\
 136     the various platforms).\n\n\
 137     Results for this project and others are found through the earlier-referenced Google\n\
 138     search.  Also found are:\n\n\
 139     \thttp://en.wikipedia.org/wiki/Smith-Waterman_algorithm\n\n\
 140     Which indicates that substantial speed-up is possible via SSE2 (Core 2 Duo, in\n\
 141     contrast with the Centrino used in the earlier-referenced paper).\n\n\
 142     It would probably be worthwhile to investigate further what cache management\n\
 143     facilities are available or necessary for efficient SSE programming, as well as\n\
 144     how effectively these are/can be used from within OpenCL. \n\n\
 145     And of possible interest might be: \n\n\
 146     \tSatoshi Matsuoka, et al.  GPU accelerated computing\xE2\x80\x94from hype to\n\
 147     \tmainstream, the rebirth of vector computing.  SciDAC 2009.  (Which\n\
 148     \twww.iop.org URL Google mangles in a manner inconvenient for\n\
 149     \treproduction here, but the paper is available on-line.)\n\n\
 150     Conclusions, however, are not immediately clear, and not much detail is given\n\
 151     on programming techniques used in the case of CPUs.  Performance increases for\n\
 152     GPUs are emphasized, however, as seems to be fairly typical of what is usually\n\
 153     said of GPUs when these are proposed as a solution for vector problems.\n\n\
 154     One conclusion that might be drawn from the above is that GPUs cannot easily be\n\
 155     ignored (stressing importance of OpenCL), and further that if it is possible to\n\
 156     program CPUs well for efficient caching and memory transfer, doing so is\n\
 157     sufficiently non-trivial that challenges encountered in GPU programming could\n\
 158     not easily justify a focus on CPU (e.g., x86 SSE) techniques (which is\n\
 159     effectively what would occur if a combined Cell and CPU development approach\n\
 160     were pursued, since most testers will not have a Cell in the near- (and likely\n\
 161     also long-) term).\n\n\
 162     A remaining question is whether CPUs could be efficiently used for transferring\n\
 163     data structures from a form encoded in pointer structures to a form encoded in\n\
 164     a manner more easily accessible to GPUs, so that the sorts of problems with\n\
 165     pointer encoding currently seen might be avoided.  For flattening the rather\n\
 166     shallow 2D structures, it seems clear that this is at least possible (with\n\
 167     techniques suggested in this [?] bug record), and it may also be possible in\n\
 168     the case of 3D structures (e.g., if flattening of the octree occurs relative to\n\
 169     a particular perspective).  Perhaps better for 3D would be to collect elements\n\
 170     of the octree into pages, according to memory page allocation techniques\n\
 171     suggested earlier, which would allow reference to locations relative to a page\n\
 172     start; the optimization problem would then be efficient (in speed and source\n\
 173     complexity) manipulation of pages."
 174 - - 2009-10-23 13:08:29.764834 Z
 175   - David Hilvert <dhilvert@auricle.dyndns.org>
 176   - commented
 177   - |-
 178     Given the fact that 3D will have to be addressed, given the fact that the
 179     current arrangement of pointer structures occurs via typedefs and macros, and
 180     given the earlier comments in this bug entry, and the current approach of
 181     separation of allocated device memory into pages, suitable to 3D, consider that
 182     the best approach might simply be to continue with the current pointer encoding
 183     approach, using macros (as is currently done), which should allow for an
 184     efficient transition in the case that hardware and libraries eventually support
 185     sharing of pointers (or at least sharing of pointer-based data structures) in
 186     the most general case (which seems most restricted by the GPU case at the
 187     moment).
 188
 189     How to allow for such data structures to be built on the client side when
 190     appropriate is probably something worth further investigation, but this should
 191     not present a significant challenge.  The more relevant question is probably
 192     whether (as might be argued) things like parsing of filter descriptor
 193     strings are of sufficiently general use that they should appear in a
 194     library (which was in fact the initial motivating factor for inclusion,
 195     prior to considerations of implementation details).
 196 - - 2009-10-23 16:36:42.198652 Z
 197   - David Hilvert <dhilvert@auricle.dyndns.org>
 198   - commented
 199   - "Note that anyone involved in vector processing over complex data structures\n\
 200     must wonder whether it is better to attempt to fit the structures to current\n\
 201     technology or rather work on technology better suited to the data structures.\n\
 202     E.g., a [capable] MMU for GPUs would be rather useful, and apparently\n\
 203     justified, given the resources being directed toward GPGPU and the high prices\n\
 204     associated with better graphics cards and with special-purpose vector\n\
 205     accelerators.\n\n\
 206     Given the extent of overlap that is growing between CPU and GPU manufacturers,\n\
 207     one might imagine that such integration of memory management would be not far\n\
 208     off, but I have seen no discussion of it yet.\n\n\
 209     Searching (Google gpu mmu) indicates that mobile GPUs (at least) have\n\
 210     integrated MMU for transferring data to/from main memory.\n\
 211     http://www.x.org/wiki/ttm indicates that GPU memory maps (as opposed to mapping\n\
 212     for CPU user space) are left to drivers, which raises the question of whether\n\
 213     limitations in OpenCL are strictly a software issue.  The section \"AGP TTM\n\
 214     backend\" particularly gives this impression w.r.t. the i965.\n\n\
 215     http://en.wikipedia.org/wiki/CUDA appears to suggest that interaction with host\n\
 216     memory may be straightforward in CUDA (\"Scattered reads \xE2\x80\x93 code can read from\n\
 217     arbitrary addresses in memory.\"; but we must wonder whether the memory in\n\
 218     question includes host memory, or whether explicit transfers between host and\n\
 219     device domains are necessary).  Examples on the same page suggest that this is\n\
 220     not the case, however (indeed, explicit transfers appear to be required).\n\n\
 221     The above seems to suggest that the technology in need of work might be a\n\
 222     software technology in the case of hardware with open specifications (such as\n\
 223     Intel's integrated chips), and that work is progressing in the general area of\n\
 224     drivers in the Linux and Xorg driver space, suggesting that providing a user\n\
 225     API might be what is necessary.  The next step would probably be to investigate\n\
 226     the hardware or driver facilities further (e.g., via Linux or Xorg\n\
 227     documentation and source, or via hardware specs)."
 228 - - 2009-10-23 17:21:15.507335 Z
 229   - David Hilvert <dhilvert@auricle.dyndns.org>
 230   - commented
 231   - |-
 232     Consider that a reasonable approach to the previously-mentioned use of hardware
 233     mapping via Linux (or Xorg) drivers might be to provide a map that can be used
 234     during execution of an OpenCL kernel.  Exactly how to negotiate the map between
 235     user space and the OpenCL framework might be tricky, however.  One might
 236     imagine use of OpenCL extensions (either in the usual sense or otherwise), but
 237     note that there's no obvious way to integrate an extension into an arbitrary
 238     given OpenCL implementation.  Still, this approach would probably be worth
 239     looking at.
 240
 241     To wit, there's the advantage of avoiding maintaining maps in user code (common
 242     to hardware map solutions), and hence avoiding the need to explicitly transfer
 243     maps from user code to the device; further, there's the advantage of using
 244     OpenCL, which, aside from the awkwardness of the call interface and absence of
 245     provisions for pointer sharing, appears to be better than most alternatives, in
 246     generality if nothing else.
 247 - - 2009-10-23 18:42:59.183941 Z
 248   - David Hilvert <dhilvert@auricle.dyndns.org>
 249   - commented
 250   - |-
 251     Consider that a reasonable sort of OpenCL extension (or a reasonable mode of
 252     operation, to frame the idea outside of the specific realm of OpenCL) would be
 253     to share an identical mapping (i.e., identical pointer values) of all data
 254     shared with the GPU between GPU and user-space CPU memory maps.
 255
 256     A reasonable restriction might be that GPU and CPU code must have the same
 257     pointer size, but this does not seem to be a serious restriction.  (In the case
 258     of a 64-bit GPU, it would not be difficult to find a 64-bit host, and in the
 259     case of a 32-bit GPU, 32-bit code could be executed on the host.  In the case
 260     of smaller pointer sizes, it should be possible to use appropriate masks, but
 261     this last case probably won't be very common.)
 262 - - 2009-10-23 19:15:43.451522 Z
 263   - David Hilvert <dhilvert@auricle.dyndns.org>
 264   - commented
 265   - |-
 266     Note that http://en.wikipedia.org/wiki/Translation_Table_Maps has been deleted
 267     (!) but that http://en.wikipedia.org/wiki/Graphics_Execution_Manager may have
 268     relevant links, leading to the following:
 269
 270     http://www.phoronix.com/scan.php?page=news_item&px=NzMxOA
 271     http://www.phoronix.com/scan.php?page=search&q=TTM
 272     http://www.phoronix.com/scan.php?page=search&q=Graphics+Execution+Manager
 273
 274     In particular, note that TTM appears to have been added to Linux 2.6.31.  E.g.,
 275
 276     http://www.phoronix.com/scan.php?page=news_item&px=NzMzMA
 277
 278     Note from Wikipedia (and direct observation from glxinfo) that GEM seems to be
 279     in earlier kernels.
 280 - - 2009-10-23 19:24:48.447254 Z
 281   - David Hilvert <dhilvert@auricle.dyndns.org>
 282   - commented
 283   - |-
 284     Note that the issue of memory mapping is addressed to some extent by Keith
 285     Packard here:
 286
 287     http://keithp.com/blogs/gem_update/
 288
 289     He also touches on bit swizzling in this page, which is an issue that might be
 290     important eventually for caching.  (I believe I've seen a different page by him
 291     on the topic as well, so that may be worth looking into further once code is
 292     refined enough that this point becomes relevant.)
 293 - - 2009-10-23 23:41:16.913143 Z
 294   - David Hilvert <dhilvert@auricle.dyndns.org>
 295   - commented
 296   - |-
 297     Slides covering issues relevant to this bug report are here (via Google gpu pointers):
 298
 299     http://www.cs.berkeley.edu/~kubitron/courses/cs252-S07/projects/reports/project3_talk_ver2.ppt
 300 - - 2009-10-23 23:50:50.826996 Z
 301   - David Hilvert <dhilvert@auricle.dyndns.org>
 302   - commented
 303   - |-
 304     A bit on the topic of linear algebra on GPU at a URL that Google mangles to
 305     www.cs.utk.edu/~dongarra/WEB-PAGES/...2009/Lect08_GPU.pdf.
 306 - - 2009-10-24 00:07:41.413418 Z
 307   - David Hilvert <dhilvert@auricle.dyndns.org>
 308   - commented
 309   - |-
 310     Note the Thrust library, a set of C++ headers for use with CUDA.
 311
 312     http://ldn.linuxfoundation.org/article/c-gpu-and-thrust-strings-gpu
 313     http://ldn.linuxfoundation.org/article/c-gpu-and-thrust-sorting-numbers-gpu
 314     http://ldn.linuxfoundation.org/article/general-programming-gpu-the-prefix-sum
 315
 316     http://code.google.com/p/thrust/
 317
 318     (But it's not clear that C++ and CUDA are what we want in the long run for a
 319     library.)
 320 - - 2009-10-24 23:14:56.981895 Z
 321   - David Hilvert <dhilvert@auricle.dyndns.org>
 322   - commented
 323   - |-
 324     Since the approach of pointer, data, and memory management is fairly integral
 325     to the programming of the library, this could be considered a bit of a blocking
 326     issue.  One possibility would be to first focus on client (ALE)/library
 327     (libale) separation, as this has benefits quite separate from OpenCL
 328     integration (e.g., better allowance for use with UIs other than the current
 329     CLI, separation from current CLI code and other legacy, difficult-to-maintain
 330     code, and greater opportunities for prototyping, as there would presumably be
 331     better independence of the various parts of the code -- alignment, rendering,
 332     etc.).
 333
 334     Later, any one of a number of acceleration techniques could be used, including
 335     those outlined in this and other bug entries (e.g., {issue d0797684fabf05af24e73639e0ce5e30a145a3c5}) -- SSE, etc.,
 336     CUDA, OpenCL, as well as derivatives and abstractions of these -- including
 337     Thrust, MAGMA, etc.  By postponing acceleration, the structure of the library
 338     would be general rather than specific (to a particular acceleration approach),
 339     and experimentation with different acceleration approaches could occur on top
 340     of a functional foundation, rather than as part of a process of initial
 341     refactoring of legacy code.
 342
 343     (Note that an advantage of performing such experimentation here rather than on
 344     current or old ALE code is that it avoids dealing with the above-noted negative
 345     aspects of ALE -- legacy development, tight integration with a particular UI,
 346     and poor independence of subsystems.)
 347 - - 2009-10-25 01:59:39.783212 Z
 348   - David Hilvert <dhilvert@auricle.dyndns.org>
 349   - commented
 350   - |-
 351     As an addendum to the previous comment, note that, if pointers are shared
 352     between accelerated and unaccelerated code (e.g., in the case of acceleration
 353     via CPU-based techniques such as SSE), then the portion of ALE code to be
 354     changed for acceleration (i.e., inner loops) *is* relatively independent
 355     between subsystems, so that this particular approach to acceleration could
 356     likely be tried with ALE relatively independently of any library
 357     development (i.e., libale), as changes would transfer fairly easily to library
 358     code in the future.  Furthermore, adding such acceleration within ALE now may
 359     not only allow for quicker progress on the overall project, but may also allow
 360     for testing of changes planned for libale, such as storage of inputs as integer
 361     type arrays (char, etc.) rather than as floating point arrays, as is done
 362     currently within ALE (since it's possible that the conversion could be done
 363     quickly enough within the CPU to exceed any benefit of precalculation).  Hence,
 364     such experimentation with acceleration within the ALE codebase might allow for
 365     better confidence in the advantages of the intended changes within libale.
 366 - - 2009-10-26 15:52:17.379069 Z
 367   - David Hilvert <dhilvert@auricle.dyndns.org>
 368   - commented
 369   - |-
 370     Note, given that ALE spends long amounts of time performing just one of
 371     alignment or rendering, that it would probably be fairly efficient in an
 372     initial implementation to integrate OpenCL operations into the alignment and
 373     rendering code separately (and directly within ALE, rather than in libale),
 374     hence improving the common case for most users while also allowing
 375     experimentation with acceleration using current code.  Migration to libale
 376     could then be postponed, or occur as necessary, e.g., for alternative UI
 377     design (for which a library has been requested in the past), cluster operation,
 378     or whatever else.
 379
 380     (This is roughly the opposite of the pernicious case described in the linear
 381     algebra slide deck, where transferring between main memory and graphics memory
 382     may present a significant bottleneck because the operations performed on the
 383     graphics device are small.  Here, it shouldn't, because alignment and rendering
 384     are each very large operations, so transfer of images at each stage should be
 385     cheap in comparison with the job being performed by the compute device.)
 386 - - 2009-10-27 20:56:31.550543 Z
 387   - David Hilvert <dhilvert@auricle.dyndns.org>
 388   - commented
 389   - |-
 390     Note that the previous comment may be correct, but ignores the question of how to
 391     integrate acceleration into the specified parts (alignment and rendering).
 392     Indeed, it is not sharing data between the two parts that is difficult, but rather the
 393     acceleration itself (since, e.g., filtering has its own set of things to
 394     accelerate, and is used by both alignment and rendering).  OpenCL, meanwhile,
 395     doesn't seem to handle separate linking and compilation, which, again, may not
 396     be a problem, but it would probably be fair to say that the previous comment
 397     misidentifies where the potential problem lies.  (Indeed, the rather involved
 398     structure of ALE, developed through incremental changes, has been a consistent
 399     motivation for development of libale.)
 400 - - 2009-10-30 02:54:37.938070 Z
 401   - David Hilvert <dhilvert@auricle.dyndns.org>
 402   - commented
 403   - |-
 404     Consider that the most natural approach might be to store ordinary pointer
 405     structures on the host side, with no special provision made for sharing, and to
 406     use wrappers for interfacing with OpenCL calls, which wrappers would convert
 407     host pointer-based structures (given some knowledge of the host structure) to a
 408     representation convenient for OpenCL, perhaps using indices within a common
 409     newly-allocated (or recycled) memory object, perhaps spanning several such
 410     memory objects.
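
         A minimal sketch of such a wrapper's conversion step in C, under stated
         assumptions: host_node, flat_node, NO_NODE, and flatten are hypothetical
         names, and the host structure is taken to be a small binary tree.  Host
         pointers are replaced by indices into one flat array, which is the kind
         of representation that could then be copied into a single memory object.

         ```c
         #include <assert.h>
         #include <stddef.h>
         #include <stdint.h>

         /* Host-side structure using ordinary pointers (hypothetical type). */
         typedef struct host_node {
             float value;
             struct host_node *left;
             struct host_node *right;
         } host_node;

         /* Flat form: children are indices into the same array. */
         #define NO_NODE UINT32_MAX
         typedef struct {
             float value;
             uint32_t left;
             uint32_t right;
         } flat_node;

         /* Copy the tree rooted at n into out[], returning the index given to n.
          * *used counts filled entries; cap is the capacity of out[].
          * Returns NO_NODE for a null pointer or when out[] is full. */
         static uint32_t flatten(const host_node *n, flat_node *out,
                                 uint32_t *used, uint32_t cap) {
             if (n == NULL || *used >= cap)
                 return NO_NODE;
             uint32_t idx = (*used)++;
             out[idx].value = n->value;
             out[idx].left = flatten(n->left, out, used, cap);
             out[idx].right = flatten(n->right, out, used, cap);
             return idx;
         }
         ```

         The filled portion of out[] contains no host addresses, so it could be
         handed to the compute device as-is (e.g., via clCreateBuffer with
         CL_MEM_COPY_HOST_PTR); kernels would follow indices rather than pointers.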
 411
 412     For cases where storage on the compute device is especially important (e.g.,
 413     images), a cl_mem object could be stored at the appropriate point in the host
 414     structure, and the wrapper function used could be constructed in such a way as
 415     to account for this.
 416
 417     Such an approach should allow coexistence of (i) pointer structures on the host
 418     sufficiently general not only for libale use but also -- since no special
 419     representation is required for pointer types -- for sharing with client code of
 420     libale (if this is necessary or desired) with (ii) an OpenCL implementation of
 421     kernels, while mostly restricting the effects of OpenCL's limitations w.r.t.
 422     pointers to the wrappers and the OpenCL code itself (hence largely abstracting
 423     this detail away from the larger part of libale code and client code).
 424 - - 2009-10-30 03:04:20.396328 Z
 425   - David Hilvert <dhilvert@auricle.dyndns.org>
 426   - commented
 427   - |-
 428     Note that the previously-described approach may not be sufficiently general for
 429     handling very large structures (such as those used in 3D), but that this could
 430     be handled by encoding these in one or more cl_mem objects, and using indices
 431     into these objects (in a manner roughly similar to that suggested by the current
 432     P() and Q() operators, where such operators or similar could be used in cases
 433     where access to such structures from libale code -- and client code? -- was
 434     necessary).
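
         A minimal sketch of index-based references spanning several memory
         objects, in C.  This is not the current P() and Q() implementation; the
         names ref_t, make_ref, ref_page, ref_offset, and resolve, and the page
         size, are assumptions made for illustration.  A reference packs a page
         number and an offset within that page into one 32-bit value.

         ```c
         #include <assert.h>
         #include <stdint.h>

         #define PAGE_BITS 12u                  /* assumed: 4096 elements per page */
         #define PAGE_SIZE (1u << PAGE_BITS)
         #define OFFSET_MASK (PAGE_SIZE - 1u)

         /* High bits hold the page number, low bits the offset within the page. */
         typedef uint32_t ref_t;

         static ref_t make_ref(uint32_t page, uint32_t offset) {
             return (page << PAGE_BITS) | (offset & OFFSET_MASK);
         }

         static uint32_t ref_page(ref_t r) { return r >> PAGE_BITS; }
         static uint32_t ref_offset(ref_t r) { return r & OFFSET_MASK; }

         /* Resolve a reference against a table of page base addresses.  On the
          * host the table would hold mapped pointers; in a kernel it would hold
          * the buffers passed as arguments. */
         static float *resolve(float *const pages[], ref_t r) {
             return pages[ref_page(r)] + ref_offset(r);
         }
         ```

         Since a reference of this kind carries no address, the same value would
         remain valid on host and device so long as both agree on the page table.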
 435 - - 2009-10-31 21:43:12.725970 Z
 436   - David Hilvert <dhilvert@auricle.dyndns.org>
 437   - commented
 438   - |-
 439     Note that the wrapper approach described in previous comments (or, more generally,
 440     an approach computing how to arrange data for passing to the compute device) could
 441     also be applied to the case of data already stored in cl_mem objects, where the
 442     task would be to determine which cl_mem objects to pass to the kernel to be run.
 443     This alternative approach might be a bit more tricky, however, as it would tend
 444     to entail more quickly the case of variable numbers of cl_mem objects, which
 445     could be less straightforward to handle than a wrapper working from host-stored
 446     objects, which could merely pack the data into some fixed number of cl_mem objects
 447     in the common case.  (Of course, since such an approach would be inadequate for
 448     the 3D case [it seems], it might be acceptable to take on the additional
 449     complexity entailed by storage in cl_mem.  The end result would be an implementation
 450     perhaps not too far removed from what is currently under development in present
 451     code.)
 452 - - 2009-11-04 12:57:51.651083 Z
 453   - David Hilvert <dhilvert@auricle.dyndns.org>
 454   - commented
 455   - |-
 456     Note that, rather than using wrappers in the common case for converting pointer
 457     structures to flat representations, these flat representations could instead be
 458     maintained as the representation used throughout, as has been suggested for the
 459     3D case, so that a conversion operation need not occur at each call to the
 460     compute device involving the data structure, and so that the wrappers earlier
 461     mentioned need not be maintained for each desired such kernel to be called.
 462
 463     The disadvantage to this would be somewhat lessened flexibility in handling the
 464     structures on the host side, but the same sort of techniques as planned for 3D
 465     in earlier comments could be used -- macros similar to the current P() and Q(),
 466     as an example, perhaps specialized to the particular case of the flat
 467     representation, within one or a small number of memory objects, rather than
 468     spanned over several.
 469
 470     (A bit of a hybrid approach, as perhaps suggested before, would be to continue
 471     to store the 'loose' structures in device memory, as reflected in current code,
 472     but to copy these to a flattened representation before invoking a kernel.  This
 473     approach might allow for faster copies than the approach copying from host
 474     memory.  For data structures small in size, however -- as most structures will
 475     probably be, outside of images and image-like arrays, which can likely be
 476     passed separately -- the advantage of such an approach to copying is unclear,
 477     while this approach would be inconvenient for both operations on the host side
 478     (which would be more easily done with ordinary host pointers) and operations on
 479     the device side (which would be more easily done with a flat representation).)
 480 - - 2009-11-04 13:11:06.989963 Z
 481   - David Hilvert <dhilvert@auricle.dyndns.org>
 482   - commented
 483   - |-
 484     Note that a further possible hybrid based on recent suggestions might be to
 485     store a flattened representation in host memory, and then copy this into device
 486     memory just prior to kernel execution.  In particular, this would allow for
 487     manipulation of structures on the host side to proceed without mapping device
 488     memory, and would allow copies into device memory to occur in a more
 489     straightforward manner, requiring simpler wrappers than those earlier conceived
 490     for the case of wrappers mapping from pointer-based host structures to
 491     flattened structures.
 492 - - 2009-11-05 09:51:51.365841 Z
 493   - David Hilvert <dhilvert@auricle.dyndns.org>
 494   - commented
 495   - |-
 496     For the covered cases involving passing a flattened representation of a data
 497     structure to a kernel (of which I think there were three), consider that a
 498     reasonable initial approach might be to allow client code to perform its own
 499     flattening into the final representation to be used by the kernel, so that
 500     client code can choose its own method of its internal encoding (which, in the
 501     case of ALE, could be use of data structures already in use within ALE).  In
 502     this way, effort required for adapting to OpenCL and libale could perhaps be
 503     roughly minimal, and the impact on existing code perhaps also be roughly
 504     minimal.  In particular, there would be no obvious advantage in beginning from
 505     older ALE code, so that current ALE code could be used as a starting point for
 506     acceleration, while still allowing migration to an approach providing an API
 507     for alternative UIs.
 508 - - 2009-11-25 01:23:35.614857 Z
 509   - David Hilvert <dhilvert@auricle.dyndns.org>
 510   - commented
 511   - |-
 512     A couple of notes on the previous -- (a) I think there were four (or five) methods proposed
 513     if all variants are considered; (b) better than having the client do the flattening would be to
 514     make use of the currently outlined API facilities for returning things such as filters, renderers, etc.
 515     to the client, so that the library could perform flattening, and the client could be passed a flattened
 516     version (pointer thereto, etc.).
 517
 518     In this way, the details of flattening into some sort of convenient representation would be wholly contained
 519     in the library, while the client could specify a more abstract name (e.g., the text names currently used for
 520     naming filters and renderers; which should follow notes hereto).
 521 git_branch: