<!--===- docs/Character.md

   Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
   See https://llvm.org/LICENSE.txt for license information.
   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

-->

# Implementation of `CHARACTER` types in f18

## Kinds and Character Sets

The f18 compiler and runtime support three kinds of the intrinsic
`CHARACTER` type of Fortran 2018.
The default (`CHARACTER(KIND=1)`) holds 8-bit character codes;
`CHARACTER(KIND=2)` holds 16-bit character codes;
and `CHARACTER(KIND=4)` holds 32-bit character codes.

We assume that code values 0 through 127 correspond to
the 7-bit ASCII character set (ISO-646) in every kind of `CHARACTER`.
This is a valid assumption for Unicode (UCS == ISO/IEC-10646),
ISO-8859, and many legacy character sets and interchange formats.

`CHARACTER` data in memory and unformatted files are not in an
interchange representation (like UTF-8, Shift-JIS, EUC-JP, or a JIS X).
Each character's code in memory occupies a 1-, 2-, or 4-byte
word, and substrings can be indexed with simple arithmetic.
In formatted I/O, however, `CHARACTER` data may be assumed to use
the UTF-8 variable-length encoding when it is selected with
`OPEN(ENCODING='UTF-8')`.

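
The contrast between the fixed-width representation in memory and the variable-length UTF-8 form used by formatted I/O can be sketched in C++. This is only an illustration, not the runtime's code: the names `AppendUtf8` and `EncodeUtf8` are hypothetical, a `CHARACTER(KIND=4)` value is modeled as an array of `char32_t`, and no validation of surrogates or out-of-range code points is attempted.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one UCS-4 code point.
// Surrogates and values above 0x10FFFF are not rejected in this sketch.
inline void AppendUtf8(std::string &out, char32_t c) {
  if (c < 0x80) {
    out += static_cast<char>(c); // 7-bit ASCII is a single byte
  } else if (c < 0x800) {
    out += static_cast<char>(0xC0 | (c >> 6));
    out += static_cast<char>(0x80 | (c & 0x3F));
  } else if (c < 0x10000) {
    out += static_cast<char>(0xE0 | (c >> 12));
    out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (c & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (c >> 18));
    out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (c & 0x3F));
  }
}

// Encode a fixed-width value (one 32-bit word per character) as UTF-8.
inline std::string EncodeUtf8(const char32_t *data, std::size_t chars) {
  std::string out;
  for (std::size_t j = 0; j < chars; ++j) {
    AppendUtf8(out, data[j]);
  }
  return out;
}
```

In memory, character `j` of the value is simply `data[j]`; in the encoded output, its position depends on the widths of all preceding characters.
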
`CHARACTER(KIND=1)` literal constants in Fortran source files,
Hollerith constants, and formatted I/O with `ENCODING='DEFAULT'`
are not translated.

For the purposes of non-default-kind `CHARACTER` constants in Fortran
source files, formatted I/O with `ENCODING='UTF-8'` or non-default-kind
`CHARACTER` values, and conversions between kinds of `CHARACTER`,
by default:

* `CHARACTER(KIND=1)` is assumed to be ISO-8859-1 (Latin-1),
* `CHARACTER(KIND=2)` is assumed to be UCS-2 (16-bit Unicode), and
* `CHARACTER(KIND=4)` is assumed to be UCS-4 (full Unicode in a 32-bit word).

In particular, conversions between kinds are assumed to be
simple zero-extensions or truncation, not table look-ups.

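
As a rough illustration of what "zero-extension or truncation" means, the following C++ sketch converts between fixed-width code arrays. The helper name `ConvertKind` and the use of `std::vector` with unsigned integer element types are assumptions made for this example; they are not the compiler's or runtime's interfaces.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: convert character codes from one width to another.
// With unsigned element types, widening zero-extends each code and
// narrowing keeps only its low-order bits. No table look-up is involved.
template <typename TO, typename FROM>
std::vector<TO> ConvertKind(const std::vector<FROM> &from) {
  std::vector<TO> to;
  to.reserve(from.size());
  for (FROM c : from) {
    to.push_back(static_cast<TO>(c));
  }
  return to;
}
```

Because Latin-1 codes coincide with the first 256 Unicode code points, widening a `KIND=1` value this way yields the same characters in `KIND=2` or `KIND=4`. Narrowing loses information for any code that does not fit in the target width.
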
We might want to support one or more environment variables to change these
assumptions, especially for `KIND=1` users of ISO-8859 character sets
besides Latin-1.

## Lengths

Allocatable `CHARACTER` objects in Fortran may defer the specification
of their lengths until the time of their allocation or whole (non-substring)
assignment.

Non-allocatable objects (and non-deferred-length allocatables) have
lengths that are fixed or assumed from an actual argument, or,
in the case of assumed-length `CHARACTER` functions, their local
declaration in the calling scope.

The elements of `CHARACTER` arrays have the same length.

Assignments to targets that are not deferred-length allocatables will
truncate or pad the assigned value to the length of the left-hand side
of the assignment.

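
The truncate-or-pad rule can be sketched for `CHARACTER(KIND=1)` as follows. The function name `AssignFixedLength` is hypothetical, and the sketch assumes that the two buffers do not overlap; a real implementation would also have to handle overlapping substrings of the same variable.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Hypothetical helper: assign rhs to a fixed-length lhs.
// A longer rhs is truncated; a shorter rhs is padded with blanks.
// Assumes lhs and rhs do not overlap.
inline void AssignFixedLength(char *lhs, std::size_t lhsLength,
                              const char *rhs, std::size_t rhsLength) {
  std::size_t copied = std::min(lhsLength, rhsLength);
  std::memcpy(lhs, rhs, copied);                      // copy or truncate
  std::memset(lhs + copied, ' ', lhsLength - copied); // blank padding
}
```
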
Lengths and offsets that are used by or exposed to Fortran programs via
declarations, substring bounds, and the `LEN()` intrinsic function are always
represented in units of characters, not bytes.
In generated code, assumed-length arguments, the runtime support library,
and in the `elem_len` field of the interoperable descriptor `cdesc_t`,
lengths are always in units of bytes.
The distinction matters only for kinds other than the default.

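
The relationship between the two units is plain multiplication by the kind, since each kind value (1, 2, or 4) is also the number of bytes per character. A minimal C++ sketch, with hypothetical helper names that do not appear in the compiler or runtime:

```cpp
#include <cstddef>

// `kind` is 1, 2, or 4, which is also the width of one character in bytes.

// Length in bytes, as stored in generated code and descriptors.
constexpr std::size_t ByteLength(std::size_t chars, std::size_t kind) {
  return chars * kind;
}

// Length in characters, as seen by LEN() in a Fortran program.
constexpr std::size_t CharLength(std::size_t bytes, std::size_t kind) {
  return bytes / kind;
}

// Byte offset of the 1-based character position `pos` within a value.
constexpr std::size_t ByteOffset(std::size_t pos, std::size_t kind) {
  return (pos - 1) * kind;
}
```

For example, a `CHARACTER(KIND=4, LEN=10)` scalar has a length of 10 from the program's point of view and 40 in a byte-based field such as `elem_len`.
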
Fortran substrings are rather like subscript triplets into a hidden
"zero" dimension of a scalar `CHARACTER` value, but they cannot have
strides.

## Concatenation

Fortran has one `CHARACTER`-valued intrinsic operator, `//`, which
concatenates its operands (10.1.5.3).
The operands must have the same kind type parameter.
One or both of the operands may be arrays; if both are arrays, their
shapes must be identical.
The effective length of the result is the sum of the lengths of the
operands.
Parentheses may be ignored, so any `CHARACTER`-valued expression
may be "flattened" into a single sequence of concatenations.

The result of `//` may be used

* as an operand to another concatenation,
* as an operand of a `CHARACTER` relation,
* as an actual argument,
* as the right-hand side of an assignment,
* as the `SOURCE=` or `MOLD=` of an `ALLOCATE` statement,
* as the selector or case-expr of an `ASSOCIATE` or `SELECT` construct,
* as a component of a structure or array constructor,
* as the value of a named constant or initializer,
* as the `NAME=` of a `BIND(C)` attribute,
* as the stop-code of a `STOP` statement,
* as the value of a specifier of an I/O statement,
* or as the value of a statement function.

The f18 compiler has a general (but slow) means of implementing concatenation
and a specialized (fast) option to optimize the most common case.

### General concatenation

In the most general case, the f18 compiler's generated code and
runtime support library represent the result as a deferred-length allocatable
`CHARACTER` temporary scalar or array variable that is initialized
as a zero-length array by `AllocatableInitCharacter()`
and then progressively augmented in place by the values of each of the
operands of the concatenation sequence in turn with calls to
`CharacterConcatenate()`.
Conformability errors are fatal -- Fortran has no means by which a program
may recover from them.
The result is then used as any other deferred-length allocatable
array or scalar would be, and finally deallocated like any other
allocatable.

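
The overall shape of the general scheme, for the scalar `CHARACTER(KIND=1)` case only, might be sketched in C++ as below. Everything here is a stand-in: `CharTemp`, `InitTemp`, `ConcatenateInPlace`, and `EvaluateConcatenation` are hypothetical names, a `std::string` plays the role of the allocatable temporary, and the actual signatures of `AllocatableInitCharacter()` and `CharacterConcatenate()` are not shown in this document.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a deferred-length allocatable scalar temporary.
struct CharTemp {
  std::string data; // owns its storage and grows as operands are appended
};

// Start with a zero-length value.
inline CharTemp InitTemp() { return CharTemp{}; }

// Augment the temporary in place with one more operand.
inline void ConcatenateInPlace(CharTemp &temp, const char *rhs,
                               std::size_t rhsLength) {
  temp.data.append(rhs, rhsLength);
}

// Evaluate a flattened sequence a // b // c ... left to right.
inline std::string EvaluateConcatenation(
    const std::vector<std::string> &operands) {
  CharTemp temp = InitTemp();
  for (const std::string &operand : operands) {
    ConcatenateInPlace(temp, operand.data(), operand.size());
  }
  return temp.data; // storage is released when temp goes out of scope
}
```

Each append may reallocate the temporary, which is one reason this path is the slow one.
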
The runtime routine `CharacterAssign()` takes care of
truncating, padding, or replicating the value(s) assigned to the left-hand
side, as well as reallocating a nonconforming or deferred-length allocatable
left-hand side. It takes the descriptors of the left- and right-hand sides of
a `CHARACTER` assignment as its arguments.

When the left-hand side of a `CHARACTER` assignment is a deferred-length
allocatable and the right-hand side is a temporary, use of the runtime's
`MoveAlloc()` subroutine instead can save an allocation and a copy.

### Optimized concatenation

Scalar `CHARACTER(KIND=1)` expressions evaluated as the right-hand sides of
assignments to independent substrings or whole variables that are not
deferred-length allocatables can be optimized into a sequence of
calls to the runtime support library that do not allocate temporary
memory.

The routine `CharacterAppend()` copies data from the right-hand side value
to the remaining space, if any, in the left-hand side object, and returns
the new offset of the reduced remaining space.
It is essentially `memcpy(lhs + offset, rhs, min(lhsLength - offset, rhsLength))`.
It does nothing when `offset > lhsLength`.

`void CharacterPad()` adds any necessary trailing blank characters.
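
A minimal C++ sketch of how the two routines could cooperate is given here. The names `AppendSketch` and `PadSketch` and their parameter lists are assumptions for illustration; only the `memcpy` expression and the `offset > lhsLength` rule come from the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Assumed signature: copy as much of rhs as fits at `offset` in lhs and
// return the offset of the space that remains.
inline std::size_t AppendSketch(char *lhs, std::size_t lhsLength,
                                std::size_t offset, const char *rhs,
                                std::size_t rhsLength) {
  if (offset > lhsLength) {
    return offset; // nothing to do
  }
  std::size_t copied = std::min(lhsLength - offset, rhsLength);
  std::memcpy(lhs + offset, rhs, copied);
  return offset + copied;
}

// Assumed signature: fill whatever space remains with blanks.
inline void PadSketch(char *lhs, std::size_t lhsLength, std::size_t offset) {
  if (offset < lhsLength) {
    std::memset(lhs + offset, ' ', lhsLength - offset);
  }
}
```

Under these assumptions, an assignment such as `x = a // b` to a fixed-length `x` would become one append per operand, each starting at the offset returned by the previous call, followed by a single pad. No temporary is needed because each operand is copied straight into its final position.
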