1 <!--===- docs/InternalProcedureTrampolines.md
3 Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
4 See https://llvm.org/LICENSE.txt for license information.
5 SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
9 # Trampolines for pointers to internal procedures.
23 end subroutine internal
27 Procedure code generated for subprogram `internal()` must have access to the scope of
28 its host procedure, e.g. to access the `local` variable. Flang achieves this by passing
29 an extra argument to `internal()` that is a tuple of references to all variables
30 used via host association inside `internal()`. We will call this extra argument
33 The Fortran 2008 standard allowed using internal procedures as actual arguments or
34 procedure pointer targets:
36 > Fortran 2008 contains several extensions to Fortran 2003; some of these are listed below.
38 > * An internal procedure can be used as an actual argument or procedure pointer target.
42 > An internal procedure cannot be invoked using a procedure pointer from either Fortran or C after the host instance completes execution, because the pointer is then undefined. While the host instance is active, however, the internal procedure may be invoked from outside of the host procedure scoping unit if that internal procedure was passed as an actual argument or is the target of a procedure pointer.
44 Special handling is required for internal procedures that might be invoked
45 via argument association or via a procedure pointer.
46 This document describes how Flang implements this support.
48 > NOTE: in some languages/extensions the static chain may contain links
49 to more than one stack frame, while Fortran's static chain only ever
50 has a link to a single host procedure.
52 ## Flang current implementation
56 Internal procedure as procedure pointer target:
67 procedure(callback), pointer :: fptr
68 ! `fptr` is pointing to `callee`, which needs the static chain link.
73 subroutine host(local)
76 procedure(callback), pointer :: fptr
94 Internal procedure as actual argument (F90 style):
101 integer function fptr()
104 ! `fptr` is pointing to `callee`, which needs the static chain link.
109 subroutine host(local)
128 Internal procedure as actual argument (F77 style):
135 ! `fptr` is pointing to `callee`, which needs the static chain link.
140 subroutine host(local)
159 In all cases, the call sequence implementing the `fptr()` call site inside `foo()`
160 must pass the static chain link to the actual function `callee()`.
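
The problem can be pictured with a minimal C++ sketch. It is only an analogy, not Flang's generated code: `HostTuple`, `host_direct` and the function bodies are hypothetical stand-ins chosen for illustration. The internal procedure needs an extra argument that links it to the host's variables, while `foo()` only sees a plain procedure pointer with no room for that argument.

```cpp
#include <cassert>

// Hypothetical stand-in for the tuple of references to host-associated
// variables (here only `local`).
struct HostTuple {
  int *local;
};

// Stand-in for the internal procedure: the extra argument plays the role
// of the link to the host's variables.
int callee(const HostTuple *chain) { return *chain->local + 1; }

// Stand-in for an external procedure that receives a plain procedure
// pointer. It knows nothing about the extra argument.
int foo(int (*fptr)()) { return fptr(); }

// A direct call from the host is easy: the host owns the tuple and can
// pass it explicitly.
int host_direct(int local) {
  HostTuple chain{&local};
  return callee(&chain);
}

// An indirect call is the hard case: `foo(&callee)` does not even
// type-check here, because `callee` expects an argument that `foo`
// cannot supply. Something must bind `chain` to the pointer first.
```

In this sketch `host_direct(41)` evaluates to 42, whereas passing `callee` to `foo` is rejected by the compiler. Binding the extra argument to the address is the gap that the rest of this document fills.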
162 ### Usage of trampolines in Flang
164 `BoxedProcedure` pass recognizes `fir.emboxproc` operations that
165 embox a subroutine address together with the static chain link,
166 and transforms them into a sequence of operations that replace
167 the result of `fir.emboxproc` with an address of a trampoline.
168 Eventually, it is the address of the trampoline that is passed
169 as an actual argument to `foo()`.
171 The trampoline has the following structure:
175 MOV static-chain-address, R#
180 - `callee-address` is the address of function `callee()`.
181 - `static-chain-address` - the address of the static chain
182 object created inside `host()`.
183 - `R#` is a target specific register.
185 In MLIR LLVM dialect the replacement looks like this:
188 llvm.call @llvm.init.trampoline(%8, %9, %7) : (!llvm.ptr<i8>, !llvm.ptr<i8>, !llvm.ptr<i8>) -> ()
189 %10 = llvm.call @llvm.adjust.trampoline(%8) : (!llvm.ptr<i8>) -> !llvm.ptr<i8>
190 %11 = llvm.bitcast %10 : !llvm.ptr<i8> to !llvm.ptr<func<void ()>>
191 llvm.call @_QMotherPfoo(%11) {fastmathFlags = #llvm.fastmath<fast>} : (!llvm.ptr<func<void ()>>) -> ()
195 So any call of `fptr` inside `foo()` will result in invocation of the trampoline.
196 The trampoline will set up the `R#` register and jump to `callee()` directly.
198 The ABI of `callee()` is adjusted using `llvm.nest` call argument attribute,
199 so that the target code generator assumes the static chain argument is passed
200 to `callee()` in `R#`:
203 llvm.func @_QFhostPcallee(%arg0: !llvm.ptr<struct<(ptr<i32>)>> {fir.host_assoc, llvm.nest}) -> i32 attributes {fir.internal_proc} {
206 #### Trampoline handling
208 Currently used [llvm.init.trampoline intrinsic](https://llvm.org/docs/LangRef.html#trampoline-intrinsics)
209 expects that the memory for the trampoline content is passed to it as the first argument.
210 The memory has to be writeable at the point of the intrinsic call, and it has to be executable
211 at any point where `callee()` might be invoked via the trampoline.
213 The `@llvm.init.trampoline` intrinsic initializes the trampoline area in a target-specific manner
214 so that, when executed, the trampoline sets a target-specific register to the value of the third argument
215 (which is the static chain address), and then calls the function defined by the second argument.
217 Some targets may perform additional actions to guarantee the readiness of the trampoline for execution,
218 e.g. [call](https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/builtins/trampoline_setup.c)
219 `__clear_cache` or do something else.
221 For each internal procedure, a trampoline may be initialized once per host invocation.
223 The target-specific address of the new trampoline function must be taken via another intrinsic call:
226 %p = call i8* @llvm.adjust.trampoline(i8* %trampoline_address)
229 Note that the value of `%p` is equal to `%trampoline_address` in most cases, but this is not
230 a requirement - this is partly [why](https://lists.llvm.org/pipermail/llvm-dev/2011-August/042845.html)
231 the second intrinsic was introduced:
234 > By the way an example of adjust_trampoline is ARM, which or's a 1 into the address of the trampoline. When the pointer is called the processor sees the 1 and puts itself into thumb mode.
236 Currently, the trampolines are allocated on the stack of `host()` subroutine,
237 so that they are available throughout the life span of `host()` and are
238 automatically deallocated at the end of `host()` invocation.
239 Unfortunately, this requires the program stack to be writeable and executable
240 at the same time, which might be a security concern.
242 > NOTE: LLVM's AArch64 backend supports `nest` attribute, but it does not seem to support trampoline intrinsics.
244 ## Alternative implementation(s)
246 To address the security risk we may consider managing the trampoline memory
247 in a way that it is not writeable and executable at the same time.
248 One of the options is to use separate allocations for the trampoline code
249 and the trampoline "data".
251 The trampolines may be located in non-writeable executable memory:
254 MOV (TDATA[0].static_chain_address), R#
255 JMP (TDATA[0].callee_address)
257 MOV (TDATA[1].static_chain_address), R#
258 JMP (TDATA[1].callee_address)
262 The `TDATA` memory is writeable and contains *<static chain address, function address>*
263 for each of the trampolines.
265 A runtime support library may provide APIs for initializing/accessing/deallocating
266 the trampolines that can be used by `BoxedProcedure` pass.
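
The split between fixed trampoline code and writeable `TDATA` can be emulated in portable C++. This is a sketch under stated assumptions, not the proposed runtime: the `R#` register is modelled by a thread-local variable, the pool size `kNumTrampolines` is arbitrary, and `HostTuple`, `callee`, `foo` and `host` are hypothetical stand-ins. Each `Trampoline<I>` is compiled once and never modified. Only its `TDATA[I]` entry changes at run time.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layout of one TDATA entry:
// <static chain address, function address>.
struct TrampolineData {
  const void *static_chain_address;
  int (*callee_address)();
};

constexpr std::size_t kNumTrampolines = 2; // assumed fixed pool size
TrampolineData TDATA[kNumTrampolines];     // writeable, never executed

// Models the target-specific register R# that carries the static chain.
thread_local const void *static_chain_register = nullptr;

// The "code" of trampoline I is fixed at compile time and only reads
// TDATA[I], so it could live in non-writeable executable memory.
template <std::size_t I> int Trampoline() {
  // MOV (TDATA[I].static_chain_address), R#
  static_chain_register = TDATA[I].static_chain_address;
  // JMP (TDATA[I].callee_address)
  return TDATA[I].callee_address();
}

// Stand-in for the internal procedure: it finds its host's variables
// through the emulated register instead of an explicit argument.
struct HostTuple {
  int *local;
};
int callee() {
  const auto *chain = static_cast<const HostTuple *>(static_chain_register);
  return *chain->local;
}

// Stand-in for a procedure that only sees a plain procedure pointer.
int foo(int (*fptr)()) { return fptr(); }

int host(int local) {
  HostTuple chain{&local};
  TDATA[0] = {&chain, &callee}; // only data is written here
  return foo(&Trampoline<0>);
}
```

Calling `host(7)` returns 7 in this sketch: `foo` calls `Trampoline<0>`, which loads both the chain and the target from `TDATA[0]` before reaching `callee`.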
268 ### Implementation considerations
270 * The static chain address still has to be passed in a fixed target-specific register,
271 and the implementations that rely on LLVM back-ends can use the `nest` attribute for this.
273 * The trampoline area must be able to grow, because there can be a trampoline
274 for each internal procedure per host invocation, and an internal procedure can call
275 the host recursively. This means that the number of trampolines in one thread
276 may grow quickly.
279 recursive subroutine host(local)
289 if (local .le. CONST_N) then
296 * On the other hand, putting a hard limit on the number of trampolines live at the same time
297 allows putting the trampolines into the static code segment.
299 * Each thread may have its own dynamic trampoline area to reduce the number
302 * Some support is required for the offload devices.
304 * Each trampoline invocation implies two indirect accesses with this approach.
306 ### Fortran runtime support
308 The following APIs are suggested:
312 * \brief Initializes new trampoline and returns its internal handle.
314 * Initializes new trampoline with the given \p callee_address
315 * and \p static_chain_address, and returns the new trampoline's
316 * internal handle. The compiler calls this method once per host
317 * invocation for each internal procedure that will need its address
320 * The initialization reserves a new entry in TDATA and
321 * initializes the entry with the given \p callee_address and
322 * \p static_chain_address; it also reserves a new entry
323 * in the trampoline area that uses the corresponding TDATA entry.
326 * \p scratch may be used to switch between the trampoline pool
327 * and llvm.init.trampoline implementation, e.g. if compiler passes
328 * non-null \p scratch it will be used as a writeable/executable
329 * memory for the new trampoline.
331 const void *InitTrampoline([[maybe_unused]] void *scratch,
332 const void *callee_address,
333 const void *static_chain_address);
336 * \brief Returns the trampoline's address for the given handle.
338 * \p handle is a value returned by InitTrampoline().
339 * The result of AdjustTrampoline() is the actual callable
340 * trampoline's address.
342 * Optional: may be implemented via llvm.adjust.trampoline.
344 const void *AdjustTrampoline(const void *handle);
347 * \brief Frees internal resources occupied for the given trampoline.
349 * The compiler must call this API at every exit from the host function.
351 * Optional: may be no-op, if LLVM trampolines are used underneath.
353 void FreeTrampoline(void *handle);
356 `InitTrampoline` will do the initial allocation of the TDATA memory
357 and the trampoline area followed by the initialization of the trampoline
358 area with the binary code to "link" the trampolines with the corresponding
359 TDATA entries. After the initial allocation the trampoline area is made
360 executable and not writeable.
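
On POSIX systems the write-then-protect step could look roughly like the sketch below. It assumes `mmap` and `mprotect` are available, and the names `TrampolineArea`, `MakeExecutableArea` and `ReleaseArea` are hypothetical helpers rather than part of any proposed API. The bytes copied in are whatever the caller supplies, and the sketch never executes them. A real implementation would also have to emit target-specific trampoline code and, on some targets, flush the instruction cache (for example via `__clear_cache`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical descriptor of an allocated trampoline area.
struct TrampolineArea {
  void *base;
  std::size_t bytes;
};

// Copies `size` bytes of pre-generated trampoline code into fresh pages
// and then drops write permission, so the area is never writeable and
// executable at the same time. Returns {nullptr, 0} on failure.
TrampolineArea MakeExecutableArea(const unsigned char *code,
                                  std::size_t size) {
  const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  const std::size_t bytes = (size + page - 1) / page * page; // round up
  void *base = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED)
    return {nullptr, 0};
  std::memcpy(base, code, size); // the area is still writeable here
  if (mprotect(base, bytes, PROT_READ | PROT_EXEC) != 0) {
    munmap(base, bytes);
    return {nullptr, 0};
  }
  return {base, bytes}; // now readable and executable, not writeable
}

// Returns the pages to the system.
void ReleaseArea(const TrampolineArea &area) {
  if (area.base != nullptr)
    munmap(area.base, area.bytes);
}
```

With this arrangement only the separate `TDATA` memory stays writeable after start-up, which is what allows the trampoline code itself to remain read-only.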
362 If there is an available entry in the TDATA/trampoline area, then the function
363 will initialize the TDATA entry with the given arguments and return
364 a handle to the trampoline entry.
366 `FreeTrampoline` will free the reserved entry.
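
The reserve/free bookkeeping can be sketched in portable C++ (C++14 or later). This is an illustration under assumptions, not the runtime's implementation: the pool size `kPoolSize` is arbitrary, the `R#` register is modelled by a thread-local variable, the trampoline "code" is a table of template instantiations, and the signatures use a typed `Callee` pointer instead of the `const void *` of the suggested API to keep the example free of casts. `HostTuple`, `callee`, `foo` and `host` are hypothetical stand-ins.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

using Callee = int (*)();

// One TDATA entry plus a flag marking it as reserved.
struct Entry {
  const void *static_chain_address;
  Callee callee_address;
  bool in_use;
};

constexpr std::size_t kPoolSize = 4; // assumed pool size
Entry TDATA[kPoolSize];              // zero-initialized: all entries free

// Models the target-specific register R#.
thread_local const void *static_chain_register = nullptr;

// Fixed trampoline "code" number I, linked to TDATA[I].
template <std::size_t I> int Trampoline() {
  static_chain_register = TDATA[I].static_chain_address;
  return TDATA[I].callee_address();
}

template <std::size_t... Is>
std::array<Callee, sizeof...(Is)> MakeTable(std::index_sequence<Is...>) {
  return {{&Trampoline<Is>...}};
}
const std::array<Callee, kPoolSize> kTrampolines =
    MakeTable(std::make_index_sequence<kPoolSize>{});

// Reserves a free entry and fills it in. The handle is the entry's
// address. Returns nullptr if the pool is exhausted. `scratch` is unused
// in this sketch.
const void *InitTrampoline(void * /*scratch*/, Callee callee_address,
                           const void *static_chain_address) {
  for (Entry &entry : TDATA) {
    if (!entry.in_use) {
      entry = {static_chain_address, callee_address, true};
      return &entry;
    }
  }
  return nullptr;
}

// Maps a handle to the callable trampoline address.
Callee AdjustTrampoline(const void *handle) {
  return kTrampolines[static_cast<const Entry *>(handle) - TDATA];
}

// Marks the entry as available again.
void FreeTrampoline(const void *handle) {
  TDATA[static_cast<const Entry *>(handle) - TDATA].in_use = false;
}

// Hypothetical host, internal procedure, and external caller.
struct HostTuple {
  int *local;
};
int callee() {
  const auto *chain = static_cast<const HostTuple *>(static_chain_register);
  return 2 * (*chain->local);
}
int foo(Callee fptr) { return fptr(); }

int host(int local) {
  HostTuple chain{&local};
  const void *handle = InitTrampoline(nullptr, &callee, &chain);
  if (handle == nullptr)
    return -1; // pool exhausted
  const int result = foo(AdjustTrampoline(handle));
  FreeTrampoline(handle); // must happen on every exit from the host
  return result;
}
```

In this sketch `host(21)` returns 42 and leaves every entry free again. A linear scan is used only for brevity. A free list would avoid the search.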
368 > NOTE: `FreeTrampoline` may reset the `callee_address` in the trampoline
369 being freed to a runtime library function that complains about a dead
370 internal procedure being called. This provides some runtime diagnostics
371 of dangling procedure pointer usage. Such freed trampolines may still
372 have to be reclaimed, if new trampoline is requested and the trampoline
378 // Init the trampoline once per host procedure invocation
379 // (i.e. when the procedure address is emboxed).
380 %handle = llvm.call @_FortranAInitTrampoline(%nullptr, %9, %7) : (!llvm.ptr<i8>, !llvm.ptr<i8>, !llvm.ptr<i8>) -> !llvm.ptr<i8>
381 // Get the actual internal procedure address once per host procedure invocation.
382 %10 = llvm.call @_FortranAAdjustTrampoline(%handle) : (!llvm.ptr<i8>) -> !llvm.ptr<i8>
383 %11 = llvm.bitcast %10 : !llvm.ptr<i8> to !llvm.ptr<func<void ()>>
384 llvm.call @_QMotherPfoo(%11) {fastmathFlags = #llvm.fastmath<fast>} : (!llvm.ptr<func<void ()>>) -> ()
385 // The trampoline deallocation must be done only at the exits from the host procedure.
386 llvm.call @_FortranAFreeTrampoline(%handle) : (!llvm.ptr<i8>) -> ()
389 ### Implementation options
391 We may try to reuse [libffi](https://github.com/libffi/libffi) implementation for __static trampolines__:
392 * Initial implementation added support for x64, i386, aarch64 and arm on Linux: https://github.com/libffi/libffi/pull/624
394 * Added support for Cygwin: https://github.com/libffi/libffi/commit/a1130f37712c03957c9b0adf316cd006fa92a60b
395 * Added support for LoongArch: https://github.com/libffi/libffi/pull/723
396 * Page protection for iOS devices: https://github.com/libffi/libffi/pull/718
397 * Fix for trampoline code for x32: https://github.com/libffi/libffi/pull/657
398 * The author (@madvenka786) initially [proposed](https://sourceware.org/pipermail/libffi-discuss/2021/002587.html) to make the trampoline APIs public,
399 but this was not a requirement at the time and the APIs were made private.
400 If we want to rely on `libffi`, the APIs have to be made public.
401 * We may also try to extract the static trampolines implementation from `libffi`
402 into separate library (e.g. `libstatictramp` as mentioned [here](https://sourceware.org/pipermail/libffi-discuss/2021/002592.html)).
404 Flang's own implementation of trampolines has the advantage that,
405 having to support only the Fortran/C interoperable calling convention,
406 the implementation may reduce the trampoline overhead. For example,
407 it may avoid saving/restoring the scratch registers used by the trampoline code,
408 and just clobber some of them according to the particular ABI.
410 At this point, the recommended approach is to implement the trampoline
411 support in the Flang runtime.