<!--===- docs/Character.md

   Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
   See https://llvm.org/LICENSE.txt for license information.
   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

-->

# Implementation of `CHARACTER` types in f18

## Kinds and Character Sets

The f18 compiler and runtime support three kinds of the intrinsic
`CHARACTER` type of Fortran 2018.
The default (`CHARACTER(KIND=1)`) holds 8-bit character codes;
`CHARACTER(KIND=2)` holds 16-bit character codes;
and `CHARACTER(KIND=4)` holds 32-bit character codes.

We assume that code values 0 through 127 correspond to
the 7-bit ASCII character set (ISO-646) in every kind of `CHARACTER`.
This is a valid assumption for Unicode (UCS == ISO/IEC-10646),
ISO-8859, and many legacy character sets and interchange formats.

`CHARACTER` data in memory and unformatted files are not in an
interchange representation (like UTF-8, Shift-JIS, EUC-JP, or a JIS X encoding).
Each character's code in memory occupies a 1-, 2-, or 4-byte
word, and substrings can be indexed with simple arithmetic.
In formatted I/O, however, `CHARACTER` data may be assumed to use
the UTF-8 variable-length encoding when it is selected with
`OPEN(ENCODING='UTF-8')`.
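
To illustrate the difference between the fixed-width in-memory form and the
variable-length UTF-8 form, the following C++ sketch encodes a single
`CHARACTER(KIND=4)` code as UTF-8 bytes.
`EncodeUtf8Sketch` is a hypothetical name, not an f18 runtime routine; the
sketch assumes a code no greater than 0x10FFFF and does not reject surrogate
values.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of the f18 runtime): encodes one UCS-4
// code as a sequence of 1 to 4 UTF-8 bytes.
std::string EncodeUtf8Sketch(char32_t code) {
  assert(code <= 0x10FFFF); // assumption: restrict to the Unicode range
  std::string out;
  if (code < 0x80) {
    // 7-bit ASCII is represented by itself.
    out += static_cast<char>(code);
  } else if (code < 0x800) {
    out += static_cast<char>(0xC0 | (code >> 6));
    out += static_cast<char>(0x80 | (code & 0x3F));
  } else if (code < 0x10000) {
    out += static_cast<char>(0xE0 | (code >> 12));
    out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (code & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (code >> 18));
    out += static_cast<char>(0x80 | ((code >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (code & 0x3F));
  }
  return out;
}
```

For example, the code 0xE9 occupies one byte in `CHARACTER(KIND=1)` memory
but two bytes (0xC3 0xA9) in a UTF-8 formatted record.
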

`CHARACTER(KIND=1)` literal constants in Fortran source files,
Hollerith constants, and formatted I/O with `ENCODING='DEFAULT'`
are not translated.

For the purposes of non-default-kind `CHARACTER` constants in Fortran
source files, formatted I/O with `ENCODING='UTF-8'` or non-default-kind
`CHARACTER` values, and conversions between kinds of `CHARACTER`,
by default:
* `CHARACTER(KIND=1)` is assumed to be ISO-8859-1 (Latin-1),
* `CHARACTER(KIND=2)` is assumed to be UCS-2 (16-bit Unicode), and
* `CHARACTER(KIND=4)` is assumed to be UCS-4 (full Unicode in a 32-bit word).

In particular, conversions between kinds are assumed to be
simple zero-extensions or truncations, not table look-ups.
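
The conversion rule above can be sketched in C++ as follows.
`ConvertCodeSketch` is a hypothetical helper, not an f18 runtime entry point;
it assumes that a character code is passed as a 32-bit unsigned value and that
the target kind is given as its size in bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the f18 runtime): converts one character
// code to the kind `toKind` (1, 2, or 4 bytes per character).
// Widening is a zero-extension; narrowing keeps only the low-order bits.
std::uint32_t ConvertCodeSketch(std::uint32_t code, int toKind) {
  assert(toKind == 1 || toKind == 2 || toKind == 4);
  switch (toKind) {
  case 1:
    return code & 0xFFu; // truncate to 8 bits
  case 2:
    return code & 0xFFFFu; // truncate to 16 bits
  default:
    return code; // KIND=4 holds every code unchanged
  }
}
```

Under these assumptions, converting 0x20AC (the euro sign in UCS-2) to
`KIND=1` yields 0xAC, which is a different Latin-1 character: truncation is
not a transliteration.
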

We might want to support one or more environment variables to change these
assumptions, especially for `KIND=1` users of ISO-8859 character sets
other than Latin-1.

## Lengths

Allocatable `CHARACTER` objects in Fortran may defer the specification
of their lengths until the time of their allocation or whole (non-substring)
assignment.
Non-allocatable objects (and non-deferred-length allocatables) have
lengths that are fixed or assumed from an actual argument, or,
in the case of assumed-length `CHARACTER` functions, their local
declaration in the calling scope.

The elements of `CHARACTER` arrays have the same length.

Assignments to targets that are not deferred-length allocatables will
truncate or pad the assigned value to the length of the left-hand side
of the assignment.
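
The truncate-or-pad rule for fixed-length targets can be sketched in C++ for
`KIND=1` data.
`AssignFixedLengthSketch` and its parameters are hypothetical; this is not the
runtime's own assignment routine, and it assumes lengths are given in bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of the f18 runtime): assigns `rhs` to a
// fixed-length KIND=1 target, truncating or blank-padding as needed.
void AssignFixedLengthSketch(char *lhs, std::size_t lhsLength,
                             const char *rhs, std::size_t rhsLength) {
  assert(lhs != nullptr);
  assert(rhs != nullptr || rhsLength == 0);
  // Copy as much of the right-hand side as fits (truncation).
  std::size_t copied = std::min(lhsLength, rhsLength);
  if (copied > 0) {
    std::memmove(lhs, rhs, copied); // memmove tolerates overlapping operands
  }
  // Fill whatever remains of the left-hand side with blanks (padding).
  std::memset(lhs + copied, ' ', lhsLength - copied);
}
```

Assigning `"abc"` to a target of length 5 leaves `"abc  "`; assigning
`"abcdefgh"` to the same target leaves `"abcde"`.
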

Lengths and offsets that are used by or exposed to Fortran programs via
declarations, substring bounds, and the `LEN()` intrinsic function are always
represented in units of characters, not bytes.
In generated code, assumed-length arguments, the runtime support library,
and in the `elem_len` field of the interoperable descriptor `cdesc_t`,
lengths are always in units of bytes.
The distinction matters only for kinds other than the default.
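
The relationship between the two units is simple arithmetic on the kind, as
the following C++ sketch shows.
The helper names are hypothetical, and the sketch assumes that the kind value
equals the number of bytes per character, as it does for the three kinds
described above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers (not part of the f18 runtime).
// `kind` is the CHARACTER kind, assumed equal to bytes per character.

// Length in bytes, as used in generated code and in `elem_len`.
std::size_t CharsToBytes(std::size_t chars, int kind) {
  assert(kind == 1 || kind == 2 || kind == 4);
  return chars * static_cast<std::size_t>(kind);
}

// Length in characters, as seen by the Fortran program via LEN().
std::size_t BytesToChars(std::size_t bytes, int kind) {
  assert(kind == 1 || kind == 2 || kind == 4);
  assert(bytes % static_cast<std::size_t>(kind) == 0);
  return bytes / static_cast<std::size_t>(kind);
}

// Byte offset of the first character of a substring whose lower bound is
// `lower`; Fortran character positions are 1-based.
std::size_t SubstringByteOffset(std::size_t lower, int kind) {
  assert(lower >= 1);
  return CharsToBytes(lower - 1, kind);
}
```

A `CHARACTER(KIND=4, LEN=10)` value therefore has a length of 10 from the
program's point of view and 40 in its descriptor.
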

Fortran substrings are rather like subscript triplets into a hidden
"zero" dimension of a scalar `CHARACTER` value, but they cannot have
strides.

## Concatenation

Fortran has one `CHARACTER`-valued intrinsic operator, `//`, which
concatenates its operands (10.1.5.3).
The operands must have the same kind type parameter.
One or both of the operands may be arrays; if both are arrays, their
shapes must be identical.
The effective length of the result is the sum of the lengths of the
operands.
Parentheses may be ignored, so any `CHARACTER`-valued expression
may be "flattened" into a single sequence of concatenations.
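
Flattening can be sketched in C++ as a left-to-right walk over a
concatenation tree.
The `Expr` type and the helper functions below are hypothetical illustrations
for scalar `KIND=1` operands; they do not reflect the compiler's actual
expression representation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical expression node: a leaf operand when both children are null,
// otherwise the concatenation (//) of the two children.
struct Expr {
  std::string leaf;
  std::unique_ptr<Expr> left, right;
};

std::unique_ptr<Expr> Leaf(std::string value) {
  auto e = std::make_unique<Expr>();
  e->leaf = std::move(value);
  return e;
}

std::unique_ptr<Expr> Concat(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
  auto e = std::make_unique<Expr>();
  e->left = std::move(l);
  e->right = std::move(r);
  return e;
}

// Appends the leaf operands of `expr` to `operands` in left-to-right order,
// ignoring how the concatenations were parenthesized.
void Flatten(const Expr &expr, std::vector<const std::string *> &operands) {
  if (expr.left) {
    assert(expr.right); // a concatenation always has two operands
    Flatten(*expr.left, operands);
    Flatten(*expr.right, operands);
  } else {
    operands.push_back(&expr.leaf);
  }
}

// The effective length of the result is the sum of the operand lengths.
std::size_t ResultLength(const std::vector<const std::string *> &operands) {
  std::size_t total = 0;
  for (const std::string *operand : operands) {
    total += operand->size();
  }
  return total;
}
```

With this representation, `a // (b // c)` and `(a // b) // c` flatten to the
same operand sequence.
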

The result of `//` may be used

* as an operand to another concatenation,
* as an operand of a `CHARACTER` relation,
* as an actual argument,
* as the right-hand side of an assignment,
* as the `SOURCE=` or `MOLD=` of an `ALLOCATE` statement,
* as the selector or case-expr of an `ASSOCIATE` or `SELECT` construct,
* as a component of a structure or array constructor,
* as the value of a named constant or initializer,
* as the `NAME=` of a `BIND(C)` attribute,
* as the stop-code of a `STOP` statement,
* as the value of a specifier of an I/O statement,
* or as the value of a statement function.

The f18 compiler has a general (but slow) means of implementing concatenation
and a specialized (fast) option to optimize the most common case.

### General concatenation

In the most general case, the f18 compiler's generated code and
runtime support library represent the result as a deferred-length allocatable
`CHARACTER` temporary scalar or array variable that is initialized
as a zero-length array by `AllocatableInitCharacter()`
and then progressively augmented in place by the values of each of the
operands of the concatenation sequence in turn with calls to
`CharacterConcatenate()`.
Conformability errors are fatal; Fortran has no means by which a program
may recover from them.
The result is then used as any other deferred-length allocatable
array or scalar would be, and finally deallocated like any other
allocatable.
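
The initialize-then-augment scheme can be sketched in C++ as follows.
The text names `AllocatableInitCharacter()` and `CharacterConcatenate()` but
does not give their signatures, so the sketch uses hypothetical stand-ins
(`Value`, `InitTemporarySketch`, `ConcatenateSketch`) built on `std::string`,
handles only scalars and rank-1 `KIND=1` arrays, and does not enforce that all
elements of an array have the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for a CHARACTER value: exactly one element for a
// scalar, any number of elements for a rank-1 array.
struct Value {
  bool isArray;
  std::vector<std::string> elements;
};

// Hypothetical analogue of the initialization step: a zero-length scalar.
Value InitTemporarySketch() { return Value{false, {std::string{}}}; }

// Hypothetical analogue of one concatenation step: augments `temp` in place
// with `operand`, broadcasting a scalar against an array.
void ConcatenateSketch(Value &temp, const Value &operand) {
  assert(operand.isArray || operand.elements.size() == 1);
  if (temp.isArray && operand.isArray &&
      temp.elements.size() != operand.elements.size()) {
    // Conformability errors are fatal.
    std::cerr << "fatal: nonconforming operands of //\n";
    std::abort();
  }
  if (!temp.isArray && operand.isArray) {
    // The result so far is scalar: replicate it to the operand's shape.
    std::string scalar = temp.elements.front();
    temp.elements.assign(operand.elements.size(), scalar);
    temp.isArray = true;
  }
  for (std::size_t i = 0; i < temp.elements.size(); ++i) {
    temp.elements[i] +=
        operand.isArray ? operand.elements[i] : operand.elements.front();
  }
}
```

A flattened sequence of operands is then processed with one call to
`ConcatenateSketch` per operand, in order, starting from the temporary
returned by `InitTemporarySketch`.
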

The runtime routine `CharacterAssign()` takes care of
truncating, padding, or replicating the value(s) assigned to the left-hand
side, as well as reallocating a nonconforming or deferred-length allocatable
left-hand side. It takes the descriptors of the left- and right-hand sides of
a `CHARACTER` assignment as its arguments.

When the left-hand side of a `CHARACTER` assignment is a deferred-length
allocatable and the right-hand side is a temporary, use of the runtime's
`MoveAlloc()` subroutine instead can save an allocation and a copy.
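
The saving is analogous to a move assignment in C++, as in the sketch below.
Here `std::vector<char>` stands in for a deferred-length allocatable, and
`MoveAssignSketch` is a hypothetical name; the runtime's `MoveAlloc()` operates
on descriptors, and its interface is not shown here.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical illustration (not the runtime's MoveAlloc()): the left-hand
// side takes over the temporary's storage instead of allocating new storage
// and copying the data into it.
void MoveAssignSketch(std::vector<char> &lhs, std::vector<char> &&temporary) {
  assert(&lhs != &temporary); // the temporary must be a distinct object
  lhs = std::move(temporary); // releases lhs's old storage, adopts the new
}
```

After the call, the left-hand side refers to the buffer that the temporary
owned, and the temporary no longer needs to be deallocated separately.
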

### Optimized concatenation

Scalar `CHARACTER(KIND=1)` expressions evaluated as the right-hand sides of
assignments to independent substrings or whole variables that are not
deferred-length allocatables can be optimized into a sequence of
calls to the runtime support library that do not allocate temporary
memory.

The routine `CharacterAppend()` copies data from the right-hand side value
to the remaining space, if any, in the left-hand side object, and returns
the new offset of the reduced remaining space.
It is essentially `memcpy(lhs + offset, rhs, min(lhsLength - offset, rhsLength))`.
It does nothing when `offset > lhsLength`.

`CharacterPad()` adds any necessary trailing blank characters.
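
The append-then-pad sequence can be sketched in C++ as follows.
The signatures below are assumptions made for illustration (`AppendSketch` and
`PadSketch` are hypothetical names); the actual runtime entry points may take
different arguments.
The sketch assumes `KIND=1` data, lengths in bytes, and a right-hand side that
does not overlap the left-hand side.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical signature: copies as much of `rhs` as fits into the space
// remaining in `lhs` after `offset`, and returns the new offset.
std::size_t AppendSketch(char *lhs, std::size_t lhsLength, std::size_t offset,
                         const char *rhs, std::size_t rhsLength) {
  assert(lhs != nullptr);
  if (offset >= lhsLength) {
    return offset; // no space remains, so there is nothing to do
  }
  std::size_t count = std::min(lhsLength - offset, rhsLength);
  if (count > 0) {
    std::memcpy(lhs + offset, rhs, count);
  }
  return offset + count;
}

// Hypothetical signature: fills the space remaining after `offset`, if any,
// with trailing blanks.
void PadSketch(char *lhs, std::size_t lhsLength, std::size_t offset) {
  if (offset < lhsLength) {
    std::memset(lhs + offset, ' ', lhsLength - offset);
  }
}
```

An assignment such as `x = a // b // c` to a fixed-length `x` would then
become three append calls that thread the returned offset through, followed by
one pad call.
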