<!-- doc/src/sgml/xtypes.sgml -->

 <sect1 id="xtypes">
  <title>User-Defined Types</title>

  <indexterm zone="xtypes">
   <primary>data type</primary>
   <secondary>user-defined</secondary>
  </indexterm>

  <para>
   As described in <xref linkend="extend-type-system"/>,
   <productname>PostgreSQL</productname> can be extended to support new
   data types.  This section describes how to define new base types,
   which are data types defined below the level of the <acronym>SQL</acronym>
   language.  Creating a new base type requires implementing functions
   to operate on the type in a low-level language, usually C.
  </para>

  <para>
   The examples in this section can be found in
   <filename>complex.sql</filename> and <filename>complex.c</filename>
   in the <filename>src/tutorial</filename> directory of the source distribution.
   See the <filename>README</filename> file in that directory for instructions
   about running the examples.
  </para>

 <para>
  <indexterm>
   <primary>input function</primary>
  </indexterm>
  <indexterm>
   <primary>output function</primary>
  </indexterm>
  A user-defined type must always have input and output functions.
  These functions determine how the type appears in strings (for input
  by the user and output to the user) and how the type is organized in
  memory.  The input function takes a null-terminated character string
  as its argument and returns the internal (in-memory) representation
  of the type.  The output function takes the internal representation
  of the type as its argument and returns a null-terminated character
  string.  If we want to do anything more with the type than merely
  store it, we must provide additional functions to implement whatever
  operations we'd like to have for the type.
 </para>

 <para>
  Suppose we want to define a type <type>complex</type> that represents
  complex numbers. A natural way to represent a complex number in
  memory would be the following C structure:

<programlisting>
typedef struct Complex {
    double      x;
    double      y;
} Complex;
</programlisting>

  We will need to make this a pass-by-reference type, since it's too
  large to fit into a single <type>Datum</type> value.
 </para>

 <para>
  As the external string representation of the type, we choose a
  string of the form <literal>(x,y)</literal>.
 </para>

 <para>
  The input and output functions are usually not hard to write,
  especially the output function.  But when defining the external
  string representation of the type, remember that you must eventually
  write a complete and robust parser for that representation as your
  input function.  For instance:

<programlisting><![CDATA[
PG_FUNCTION_INFO_V1(complex_in);

Datum
complex_in(PG_FUNCTION_ARGS)
{
    char       *str = PG_GETARG_CSTRING(0);
    double      x,
                y;
    Complex    *result;

    if (sscanf(str, " ( %lf , %lf )", &x, &y) != 2)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("invalid input syntax for type %s: \"%s\"",
                        "complex", str)));

    result = (Complex *) palloc(sizeof(Complex));
    result->x = x;
    result->y = y;
    PG_RETURN_POINTER(result);
}
]]>
</programlisting>

  The output function can simply be:

<programlisting><![CDATA[
PG_FUNCTION_INFO_V1(complex_out);

Datum
complex_out(PG_FUNCTION_ARGS)
{
    Complex    *complex = (Complex *) PG_GETARG_POINTER(0);
    char       *result;

    result = psprintf("(%g,%g)", complex->x, complex->y);
    PG_RETURN_CSTRING(result);
}
]]>
</programlisting>
 </para>
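
 <para>
  Note that <function>sscanf</function> reports only how many conversions
  succeeded, so the check in <function>complex_in</function> also accepts
  strings such as <literal>(1,2</literal> or <literal>(1,2)junk</literal>.
  The following standalone sketch shows one possible way to tighten the
  check with the <literal>%n</literal> conversion.  It deliberately avoids
  the <productname>PostgreSQL</productname> headers so that it can be
  compiled on its own; the helper name <function>parse_complex</function>
  is hypothetical and is not taken from <filename>complex.c</filename>:

<programlisting><![CDATA[
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct Complex {
    double      x;
    double      y;
} Complex;

/*
 * Hypothetical helper: parse "(x,y)" strictly.  Returns true and fills
 * *out only if the whole string matches; otherwise returns false and
 * leaves *out untouched.
 */
bool
parse_complex(const char *str, Complex *out)
{
    double      x,
                y;
    int         consumed = 0;

    assert(str != NULL && out != NULL);

    /*
     * %n stores the number of characters consumed so far, but only if
     * scanning gets that far, that is, only if the closing parenthesis
     * was matched.  The space before %n skips trailing whitespace.
     */
    if (sscanf(str, " ( %lf , %lf ) %n", &x, &y, &consumed) != 2)
        return false;
    if (consumed == 0 || str[consumed] != '\0')
        return false;           /* missing ")" or trailing garbage */

    out->x = x;
    out->y = y;
    return true;
}
]]>
</programlisting>

  In an actual input function, a <literal>false</literal> result would
  presumably lead to the same <function>ereport</function> call as in
  <function>complex_in</function>.
 </para>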

 <para>
  You should be careful to make the input and output functions inverses of
  each other.  If you do not, you will have severe problems when you
  need to dump your data into a file and then read it back in.  This
  is a particularly common problem when floating-point numbers are
  involved, because a textual form with too few digits does not
  reproduce the original value.
 </para>
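
 <para>
  To illustrate the floating-point pitfall: the <literal>%g</literal>
  format used in <function>complex_out</function> prints six significant
  digits by default, which is not enough to reproduce every
  <type>double</type>.  The standalone sketch below, which assumes IEEE 754
  doubles and uses a hypothetical helper name, checks whether a value
  survives a trip through text at a given precision.  Seventeen significant
  digits are sufficient for any IEEE 754 double:

<programlisting><![CDATA[
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: print v with the given number of significant
 * digits, parse the text back with strtod(), and report whether the
 * original value was recovered exactly.
 */
bool
float8_roundtrips(double v, int digits)
{
    char        buf[64];
    int         len;

    len = snprintf(buf, sizeof(buf), "%.*g", digits, v);
    assert(len > 0 && (size_t) len < sizeof(buf));

    return strtod(buf, NULL) == v;
}
]]>
</programlisting>

  With this helper, <literal>float8_roundtrips(1.0 / 3.0, 6)</literal> is
  false, while <literal>float8_roundtrips(1.0 / 3.0, 17)</literal> is true.
  An output function that must be an exact inverse of its input function
  therefore needs to print more digits than plain <literal>%g</literal>
  does.
 </para>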

 <para>
  Optionally, a user-defined type can provide binary input and output
  routines.  Binary I/O is normally faster but less portable than textual
  I/O.  As with textual I/O, it is up to you to define exactly what the
  external binary representation is.  Most of the built-in data types
  try to provide a machine-independent binary representation.  For
  <type>complex</type>, we will piggy-back on the binary I/O converters
  for type <type>float8</type>:

<programlisting><![CDATA[
PG_FUNCTION_INFO_V1(complex_recv);

Datum
complex_recv(PG_FUNCTION_ARGS)
{
    StringInfo  buf = (StringInfo) PG_GETARG_POINTER(0);
    Complex    *result;

    result = (Complex *) palloc(sizeof(Complex));
    result->x = pq_getmsgfloat8(buf);
    result->y = pq_getmsgfloat8(buf);
    PG_RETURN_POINTER(result);
}

PG_FUNCTION_INFO_V1(complex_send);

Datum
complex_send(PG_FUNCTION_ARGS)
{
    Complex    *complex = (Complex *) PG_GETARG_POINTER(0);
    StringInfoData buf;

    pq_begintypsend(&buf);
    pq_sendfloat8(&buf, complex->x);
    pq_sendfloat8(&buf, complex->y);
    PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
}
]]>
</programlisting>
 </para>
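
 <para>
  To give an idea of what a machine-independent binary representation
  involves, the following standalone sketch writes a <type>double</type>
  as eight bytes with the most significant byte first (network byte
  order) and reads it back.  It assumes IEEE 754 doubles, and the function
  names are hypothetical; it only illustrates the idea and is not the
  implementation of the <type>float8</type> converters:

<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store v in buf as 8 bytes, most significant byte first. */
void
put_float8_be(unsigned char *buf, double v)
{
    uint64_t    bits;
    int         i;

    assert(sizeof(bits) == sizeof(v));
    memcpy(&bits, &v, sizeof(bits));    /* copy the bit pattern */
    for (i = 0; i < 8; i++)
        buf[i] = (unsigned char) (bits >> (56 - 8 * i));
}

/* Inverse of put_float8_be(). */
double
get_float8_be(const unsigned char *buf)
{
    uint64_t    bits = 0;
    double      v;
    int         i;

    for (i = 0; i < 8; i++)
        bits = (bits << 8) | buf[i];
    memcpy(&v, &bits, sizeof(v));
    return v;
}
]]>
</programlisting>

  Because the byte order is fixed by the shifts rather than by the host's
  memory layout, the same eight bytes are produced on little-endian and
  big-endian machines.
 </para>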

 <para>
  Once we have written the I/O functions and compiled them into a shared
  library, we can define the <type>complex</type> type in SQL.
  First we declare it as a shell type:

<programlisting>
CREATE TYPE complex;
</programlisting>

  This serves as a placeholder that allows us to reference the type while
  defining its I/O functions.  Now we can define the I/O functions:

<programlisting>
CREATE FUNCTION complex_in(cstring)
    RETURNS complex
    AS '<replaceable>filename</replaceable>'
    LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION complex_out(complex)
    RETURNS cstring
    AS '<replaceable>filename</replaceable>'
    LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION complex_recv(internal)
    RETURNS complex
    AS '<replaceable>filename</replaceable>'
    LANGUAGE C IMMUTABLE STRICT;

CREATE FUNCTION complex_send(complex)
    RETURNS bytea
    AS '<replaceable>filename</replaceable>'
    LANGUAGE C IMMUTABLE STRICT;
</programlisting>
 </para>

 <para>
  Finally, we can provide the full definition of the data type:
<programlisting>
CREATE TYPE complex (
   internallength = 16,
   input = complex_in,
   output = complex_out,
   receive = complex_recv,
   send = complex_send,
   alignment = double
);
</programlisting>
 </para>
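
 <para>
  The <literal>internallength</literal> and <literal>alignment</literal>
  values have to agree with the C structure.  As a hedged illustration,
  the standalone sketch below lets the compiler confirm that
  <literal>sizeof(Complex)</literal> is 16 on the build platform, which
  holds wherever <type>double</type> occupies eight bytes.  The function
  <function>complex_internal_length</function> is hypothetical and exists
  only to make the value visible at run time:

<programlisting><![CDATA[
#include <assert.h>
#include <stddef.h>

typedef struct Complex {
    double      x;
    double      y;
} Complex;

/* Compilation fails if the struct does not match internallength = 16. */
_Static_assert(sizeof(Complex) == 16,
               "internallength must equal sizeof(Complex)");

/* Hypothetical helper: the length that CREATE TYPE should declare. */
size_t
complex_internal_length(void)
{
    return sizeof(Complex);
}
]]>
</programlisting>
 </para>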

 <para>
  <indexterm>
    <primary>array</primary>
    <secondary>of user-defined type</secondary>
  </indexterm>
  When you define a new base type,
  <productname>PostgreSQL</productname> automatically provides support
  for arrays of that type.  The array type typically
  has the same name as the base type with the underscore character
  (<literal>_</literal>) prepended.
 </para>

 <para>
  Once the data type exists, we can declare additional functions to
  provide useful operations on the data type.  Operators can then be
  defined atop the functions, and if needed, operator classes can be
  created to support indexing of the data type.  These additional
  layers are discussed in the following sections.
 </para>

 <para>
  If the internal representation of the data type is variable-length, the
  internal representation must follow the standard layout for variable-length
  data: the first four bytes must be a <type>char[4]</type> field which is
  never accessed directly (customarily named <structfield>vl_len_</structfield>). You
  must use the <function>SET_VARSIZE()</function> macro to store the total
  size of the datum (including the length field itself) in this field
  and <function>VARSIZE()</function> to retrieve it.  (These macros exist
  because the length field may be encoded depending on platform.)
 </para>
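
 <para>
  The layout rule can be pictured with the following standalone sketch.
  Everything in it is hypothetical: the <type>Blob</type> type, and the
  functions <function>sketch_set_varsize</function> and
  <function>sketch_varsize</function>, which merely stand in for
  <function>SET_VARSIZE()</function> and <function>VARSIZE()</function>
  and store the length as a plain native integer.  The real macros may
  encode the field differently, which is exactly why the field must not
  be accessed directly:

<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_HDRSZ 4          /* size of the length field */

typedef struct Blob {
    char        vl_len_[4];     /* total size; never accessed directly */
    char        data[];         /* payload follows the header */
} Blob;

/* Hypothetical stand-in for SET_VARSIZE(). */
static void
sketch_set_varsize(void *ptr, uint32_t total)
{
    memcpy(ptr, &total, sizeof(total));
}

/* Hypothetical stand-in for VARSIZE(). */
static uint32_t
sketch_varsize(const void *ptr)
{
    uint32_t    total;

    memcpy(&total, ptr, sizeof(total));
    return total;
}

/* Build a Blob holding len payload bytes; the caller frees it. */
Blob *
make_blob(const char *bytes, uint32_t len)
{
    Blob       *result = malloc(SKETCH_HDRSZ + len);

    assert(result != NULL);
    /* The stored size includes the length field itself. */
    sketch_set_varsize(result, SKETCH_HDRSZ + len);
    memcpy(result->data, bytes, len);
    return result;
}

/* The payload length is the total size minus the header. */
uint32_t
blob_payload_length(const Blob *blob)
{
    return sketch_varsize(blob) - SKETCH_HDRSZ;
}
]]>
</programlisting>

  For a three-byte payload the stored size is 7, not 3, since the four
  header bytes are counted as well.
 </para>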

 <para>
  For further details see the description of the
  <xref linkend="sql-createtype"/> command.
 </para>

 <sect2 id="xtypes-toast">
  <title>TOAST Considerations</title>
   <indexterm>
    <primary>TOAST</primary>
    <secondary>and user-defined types</secondary>
   </indexterm>

 <para>
  If the values of your data type vary in size (in internal form), it's
  usually desirable to make the data type <acronym>TOAST</acronym>-able (see <xref
  linkend="storage-toast"/>). You should do this even if the values are always
  too small to be compressed or stored externally, because
  <acronym>TOAST</acronym> can save space on small data too, by reducing header
  overhead.
 </para>

 <para>
  To support <acronym>TOAST</acronym> storage, the C functions operating on the data
  type must always be careful to unpack any toasted values they are handed
  by using <function>PG_DETOAST_DATUM</function>.  (This detail is customarily hidden
  by defining type-specific <function>GETARG_DATATYPE_P</function> macros.)
  Then, when running the <command>CREATE TYPE</command> command, specify the
  internal length as <literal>variable</literal> and select some appropriate storage
  option other than <literal>plain</literal>.
 </para>

 <para>
  If data alignment is unimportant (either just for a specific function or
  because the data type specifies byte alignment anyway) then it's possible
  to avoid some of the overhead of <function>PG_DETOAST_DATUM</function>. You can use
  <function>PG_DETOAST_DATUM_PACKED</function> instead (customarily hidden by
  defining a <function>GETARG_DATATYPE_PP</function> macro) and then use the macros
  <function>VARSIZE_ANY_EXHDR</function> and <function>VARDATA_ANY</function> to access
  a potentially-packed datum.
  Note that the data returned by these macros is not aligned even if the data
  type definition specifies an alignment. If the alignment is important you
  must go through the regular <function>PG_DETOAST_DATUM</function> interface.
 </para>
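
 <para>
  When a function does need to pull a multi-byte value out of data that
  may not be aligned, copying the bytes is a safe option.  The standalone
  sketch below shows the idea with a hypothetical helper; it is a generic
  C technique and not a <productname>PostgreSQL</productname> interface:

<programlisting><![CDATA[
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: read a 32-bit integer from an address that may
 * not be suitably aligned.  Copying with memcpy() is safe at any
 * address; dereferencing (const int32_t *) p is not.
 */
int32_t
read_int32_unaligned(const unsigned char *p)
{
    int32_t     value;

    assert(p != NULL);
    memcpy(&value, p, sizeof(value));
    return value;
}
]]>
</programlisting>

  Compilers typically turn such a fixed-size <function>memcpy</function>
  into a plain load on architectures that tolerate unaligned access, so
  the safe form usually costs little.
 </para>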

 <note>
  <para>
   Older code frequently declares <structfield>vl_len_</structfield> as an
   <type>int32</type> field instead of <type>char[4]</type>.  This is OK as long as
   the struct definition has other fields that have at least <type>int32</type>
   alignment.  But it is dangerous to use such a struct definition when
   working with a potentially unaligned datum; the compiler may take it as
   license to assume the datum actually is aligned, leading to core dumps on
   architectures that are strict about alignment.
  </para>
 </note>

 <para>
  Another feature that's enabled by <acronym>TOAST</acronym> support is the
  possibility of having an <firstterm>expanded</firstterm> in-memory data
  representation that is more convenient to work with than the format that
  is stored on disk.  The regular or <quote>flat</quote> varlena storage format
  is ultimately just a blob of bytes; it cannot for example contain
  pointers, since it may get copied to other locations in memory.
  For complex data types, the flat format may be quite expensive to work
  with, so <productname>PostgreSQL</productname> provides a way to <quote>expand</quote>
  the flat format into a representation that is more suited to computation,
  and then pass that format in-memory between functions of the data type.
 </para>

 <para>
  To use expanded storage, a data type must define an expanded format that
  follows the rules given in <filename>src/include/utils/expandeddatum.h</filename>,
  and provide functions to <quote>expand</quote> a flat varlena value into
  expanded format and <quote>flatten</quote> the expanded format back to the
  regular varlena representation.  Then ensure that all C functions for
  the data type can accept either representation, possibly by converting
  one into the other immediately upon receipt.  This does not require fixing
  all existing functions for the data type at once, because the standard
  <function>PG_DETOAST_DATUM</function> macro is defined to convert expanded inputs
  into regular flat format.  Therefore, existing functions that work with
  the flat varlena format will continue to work, though slightly
  inefficiently, with expanded inputs; they need not be converted until and
  unless better performance is important.
 </para>

 <para>
  C functions that know how to work with an expanded representation
  typically fall into two categories: those that can only handle expanded
  format, and those that can handle either expanded or flat varlena inputs.
  The former are easier to write but may be less efficient overall, because
  converting a flat input to expanded form for use by a single function may
  cost more than is saved by operating on the expanded format.
  When only expanded format need be handled, conversion of flat inputs to
  expanded form can be hidden inside an argument-fetching macro, so that
  the function appears no more complex than one working with traditional
  varlena input.
  To handle both types of input, write an argument-fetching function that
  will detoast external, short-header, and compressed varlena inputs, but
  not expanded inputs.  Such a function can be defined as returning a
  pointer to a union of the flat varlena format and the expanded format.
  Callers can use the <function>VARATT_IS_EXPANDED_HEADER()</function> macro to
  determine which format they received.
 </para>

 <para>
  The <acronym>TOAST</acronym> infrastructure not only allows regular varlena
  values to be distinguished from expanded values, but also
  distinguishes <quote>read-write</quote> and <quote>read-only</quote> pointers to
  expanded values.  C functions that only need to examine an expanded
  value, or will only change it in safe and non-semantically-visible ways,
  need not care which type of pointer they receive.  C functions that
  produce a modified version of an input value are allowed to modify an
  expanded input value in-place if they receive a read-write pointer, but
  must not modify the input if they receive a read-only pointer; in that
  case they have to copy the value first, producing a new value to modify.
  A C function that has constructed a new expanded value should always
  return a read-write pointer to it.  Also, a C function that is modifying
  a read-write expanded value in-place should take care to leave the value
  in a sane state if it fails partway through.
 </para>
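
 <para>
  The read-write and read-only rules amount to a copy-before-modify
  discipline.  The standalone sketch below models it with entirely
  hypothetical types (<type>IntVec</type> as the expanded value and
  <type>IntVecRef</type> as a pointer tagged read-write or read-only).
  It does not use the expanded-datum interface and only illustrates the
  control flow:

<programlisting><![CDATA[
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical expanded value: a growable array of ints. */
typedef struct IntVec {
    int        *items;
    size_t      nitems;
} IntVec;

/* Hypothetical stand-in for a read-write or read-only pointer. */
typedef struct IntVecRef {
    IntVec     *vec;
    bool        read_write;
} IntVecRef;

/* Make an independent copy of src; the caller owns the result. */
static IntVec *
intvec_copy(const IntVec *src)
{
    IntVec     *dst = malloc(sizeof(IntVec));

    assert(dst != NULL);
    dst->nitems = src->nitems;
    /* Allocate one extra slot so the request is never zero bytes. */
    dst->items = malloc((src->nitems + 1) * sizeof(int));
    assert(dst->items != NULL);
    if (src->nitems > 0)
        memcpy(dst->items, src->items, src->nitems * sizeof(int));
    return dst;
}

/* Return a value equal to the input with one element appended. */
IntVecRef
intvec_append(IntVecRef in, int value)
{
    /* Read-only input: copy first, then modify only the copy. */
    IntVec     *target = in.read_write ? in.vec : intvec_copy(in.vec);
    int        *grown;

    /* Grow before touching nitems, so a failure leaves a sane value. */
    grown = realloc(target->items, (target->nitems + 1) * sizeof(int));
    assert(grown != NULL);
    target->items = grown;
    target->items[target->nitems++] = value;

    /* The result is ours to modify, so hand it back as read-write. */
    return (IntVecRef) {target, true};
}
]]>
</programlisting>

  Appending through a read-only reference leaves the original untouched
  and yields a new value, whereas appending through a read-write
  reference reuses the same value in place.
 </para>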

 <para>
  For examples of working with expanded values, see the standard array
  infrastructure, particularly
  <filename>src/backend/utils/adt/array_expanded.c</filename>.
 </para>

 </sect2>

</sect1>