doc/combining.texi

   1 @c PSPP - a program for statistical analysis.
   2 @c Copyright (C) 2017, 2020 Free Software Foundation, Inc.
   3 @c Permission is granted to copy, distribute and/or modify this document
   4 @c under the terms of the GNU Free Documentation License, Version 1.3
   5 @c or any later version published by the Free Software Foundation;
   6 @c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
   7 @c A copy of the license is included in the section entitled "GNU
   8 @c Free Documentation License".
   9 @c
  10 @node Combining Data Files
  11 @chapter Combining Data Files
  12
  13 This chapter describes commands that allow data from system files,
  14 portable files, and open datasets to be combined to
  15 form a new active dataset.  These commands can combine data files in the
  16 following ways:
  17
  18 @itemize
  19 @item
  20 @cmd{ADD FILES} interleaves or appends the cases from each input file.
  21 It is used with input files that have variables in common, but
  22 distinct sets of cases.
  23
  24 @item
  25 @cmd{MATCH FILES} adds the data together in cases that match across
  26 multiple input files.  It is used with input files that have cases in
  27 common, but different information about each case.
  28
  29 @item
  30 @cmd{UPDATE} updates a master data file from data in a set of
  31 transaction files.  Each case in a transaction data file modifies a
  32 matching case in the primary data file, or it adds a new case if no
  33 matching case can be found.
  34 @end itemize
  35
  36 These commands share the majority of their syntax, which is described
  37 in the following section, followed by one section for each command
  38 that describes its specific syntax and semantics.
  39
  40 @menu
  41 * Combining Files Common Syntax::
  42 * ADD FILES::                   Interleave cases from multiple files.
  43 * MATCH FILES::                 Merge cases from multiple files.
  44 * UPDATE::                      Update cases using transactional data.
  45 @end menu
  46
  47 @node Combining Files Common Syntax
  48 @section Common Syntax
  49
  50 @display
  51 Per input file:
  52         /FILE=@{*,'@var{file_name}'@}
  53         [/RENAME=(@var{src_names}=@var{target_names})@dots{}]
  54         [/IN=@var{var_name}]
  55         [/SORT]
  56
  57 Once per command:
  58         /BY @var{var_list}[(@{D|A@})] [@var{var_list}[(@{D|A@}]]@dots{}
  59         [/DROP=@var{var_list}]
  60         [/KEEP=@var{var_list}]
  61         [/FIRST=@var{var_name}]
  62         [/LAST=@var{var_name}]
  63         [/MAP]
  64 @end display
  65
  66 This section describes the syntactical features in common among the
  67 @cmd{ADD FILES}, @cmd{MATCH FILES}, and @cmd{UPDATE} commands.  The
  68 following sections describe details specific to each command.
  69
  70 Each of these commands reads two or more input files and combines them.
  71 The command's output becomes the new active dataset.
  72 None of the commands actually change the input files.
  73 Therefore, if you want the changes to become permanent, you must explicitly
  74 save them using an appropriate procedure or transformation (@pxref{System and Portable File IO}).
  75
  76 The syntax of each command begins with a specification of the files to
  77 be read as input.  For each input file, specify FILE with a system
  78 file or portable file's name as a string, a dataset (@pxref{Datasets})
  79 or file handle name, (@pxref{File Handles}), or an asterisk (@samp{*})
  80 to use the active dataset as input.  Use of portable files on @subcmd{FILE} is a
  81 @pspp{} extension.
  82
  83 At least two @subcmd{FILE} subcommands must be specified.  If the active dataset
  84 is used as an input source, then @cmd{TEMPORARY} must not be in
  85 effect.
  86
  87 Each @subcmd{FILE} subcommand may be followed by any number of @subcmd{RENAME}
  88 subcommands that specify a parenthesized group or groups of variable
  89 names as they appear in the input file, followed by those variables'
  90 new names, separated by an equals sign (@subcmd{=}),
  91 @i{e.g.} @subcmd{/RENAME=(OLD1=NEW1)(OLD2=NEW2)}.  To rename a single
  92 variable, the parentheses may be omitted: @subcmd{/RENAME=@var{old}=@var{new}}.
  93 Within a parenthesized group, variables are renamed simultaneously, so
  94 that @subcmd{/RENAME=(@var{A} @var{B}=@var{B} @var{A})} exchanges the
  95 names of variables @var{A} and @var{B}.
  96 Otherwise, renaming occurs in left-to-right order.
  97
  98 Each @subcmd{FILE} subcommand may optionally be followed by a single @subcmd{IN}
  99 subcommand, which creates a numeric variable with the specified name
 100 and format F1.0.  The IN variable takes value 1 in an output case if
 101 the given input file contributed to that output case, and 0 otherwise.
 102 The @subcmd{DROP}, @subcmd{KEEP}, and @subcmd{RENAME} subcommands have no effect on IN variables.
 103
 104 If @subcmd{BY} is used (see below), the @subcmd{SORT} keyword must be specified after a
 105 @subcmd{FILE} if that input file is not already sorted on the @subcmd{BY} variables.
 106 When @subcmd{SORT} is specified, @pspp{} sorts the input file's data on the @subcmd{BY}
 107 variables before it applies it to the command.  When @subcmd{SORT} is used, @subcmd{BY}
 108 is required.  @subcmd{SORT} is a @pspp{} extension.
 109
 110 @pspp{} merges the dictionaries of all of the input files to form the
 111 dictionary of the new active dataset, like so:
 112
 113 @itemize @bullet
 114 @item
 115 The variables in the new active dataset are the union of all the input files'
 116 variables, matched based on their name.  When a single input file
 117 contains a variable with a given name, the output file will contain
 118 exactly that variable.  When more than one input file contains a
 119 variable with a given name, those variables must be all string or all numeric.
 120 If they are string variables, then the result will have the width of the longest
 121 variable with that name, with narrower values padded on the right with spaces
 122 to fill the width.
 123 Variables are matched after renaming with the @subcmd{RENAME} subcommand.
 124 Thus, @subcmd{RENAME} can be used to resolve conflicts.
 125 Only variables in the output file can conflict, so @subcmd{DROP} or
 126 @subcmd{KEEP}, as described below, can also resolve a conflict.
 127
 128 @item
 129 The variable label for each output variable is taken from the first
 130 specified input file that has a variable label for that variable, and
 131 similarly for value labels and missing values.
 132
 133 @item
 134 The file label of the new active dataset (@pxref{FILE LABEL}) is that of the
 135 first specified @subcmd{FILE} that has a file label.
 136
 137 @item
 138 The documents in the new active dataset (@pxref{DOCUMENT}) are the
 139 concatenation of all the input files' documents, in the order in which
 140 the @subcmd{FILE} subcommands are specified.
 141
 142 @item
 143 If all of the input files are weighted on the same variable, then the
 144 new active dataset is weighted on that variable.  Otherwise, the new
 145 active dataset is not weighted.
 146 @end itemize
 147
 148 The remaining subcommands apply to the output file as a whole, rather
 149 than to individual input files.  They must be specified at the end of
 150 the command specification, following all of the @subcmd{FILE} and related
 151 subcommands.  The most important of these subcommands is @subcmd{BY}, which
 152 specifies a set of one or more variables that may be used to find
 153 corresponding cases in each of the input files.  The variables
 154 specified on @subcmd{BY} must be present in all of the input files.
 155 Furthermore, if any of the input files are not sorted on the @subcmd{BY}
 156 variables, then @subcmd{SORT} must be specified for those input files.
 157
 158 The variables listed on @subcmd{BY} may include (A) or (D) annotations to
 159 specify ascending or descending sort order.  @xref{SORT CASES}, for
 160 more details on this notation.  Adding (A) or (D) to the @subcmd{BY} subcommand
 161 specification is a @pspp{} extension.
 162
 163 The @subcmd{DROP} subcommand can be used to specify a list of variables to
 164 exclude from the output.  By contrast, the @subcmd{KEEP} subcommand can be used
 165 to specify variables to include in the output; all variables not
 166 listed are dropped.  @subcmd{DROP} and @subcmd{KEEP} are executed in left-to-right order
 167 and may be repeated any number of times.  @subcmd{DROP} and @subcmd{KEEP} do not affect
 168 variables created by the @subcmd{IN}, @subcmd{FIRST}, and @subcmd{LAST} subcommands, which are
 169 always included in the new active dataset, but they can be used to drop
 170 @subcmd{BY} variables.
 171
 172 The @subcmd{FIRST} and @subcmd{LAST} subcommands are optional.  They may only be
 173 specified on @cmd{MATCH FILES} and @cmd{ADD FILES}, and only when @subcmd{BY}
 174 is used.  @subcmd{FIRST} and @subcmd{LIST} each adds a numeric variable to the new
 175 active dataset, with the name given as the subcommand's argument and F1.0
 176 print and write formats.  The value of the @subcmd{FIRST} variable is 1 in the
 177 first output case with a given set of values for the @subcmd{BY} variables, and
 178 0 in other cases.  Similarly, the @subcmd{LAST} variable is 1 in the last case
 179 with a given of @subcmd{BY} values, and 0 in other cases.
 180
 181 When any of these commands creates an output case, variables that are
 182 only in files that are not present for the current case are set to the
 183 system-missing value for numeric variables or spaces for string
 184 variables.
 185
 186 These commands may combine any number of files, limited only by the
 187 machine's memory.
 188
 189 @node ADD FILES
 190 @section ADD FILES
 191 @vindex ADD FILES
 192
 193 @display
 194 ADD FILES
 195
 196 Per input file:
 197         /FILE=@{*,'@var{file_name}'@}
 198         [/RENAME=(@var{src_names}=@var{target_names})@dots{}]
 199         [/IN=@var{var_name}]
 200         [/SORT]
 201
 202 Once per command:
 203         [/BY @var{var_list}[(@{D|A@})] [@var{var_list}[(@{D|A@})]@dots{}]]
 204         [/DROP=@var{var_list}]
 205         [/KEEP=@var{var_list}]
 206         [/FIRST=@var{var_name}]
 207         [/LAST=@var{var_name}]
 208         [/MAP]
 209 @end display
 210
 211 @cmd{ADD FILES} adds cases from multiple input files.  The output,
 212 which replaces the active dataset, consists all of the cases in all of
 213 the input files.
 214
 215 @subcmd{ADD FILES} shares the bulk of its syntax with other @pspp{} commands for
 216 combining multiple data files.  @xref{Combining Files Common Syntax},
 217 above, for an explanation of this common syntax.
 218
 219 When @subcmd{BY} is not used, the output of @subcmd{ADD FILES} consists of all the cases
 220 from the first input file specified, followed by all the cases from
 221 the second file specified, and so on.  When @subcmd{BY} is used, the output is
 222 additionally sorted on the @subcmd{BY} variables.
 223
 224 When @subcmd{ADD FILES} creates an output case, variables that are not part of
 225 the input file from which the case was drawn are set to the
 226 system-missing value for numeric variables or spaces for string
 227 variables.
 228
 229 @node MATCH FILES
 230 @section MATCH FILES
 231 @vindex MATCH FILES
 232
 233 @display
 234 MATCH FILES
 235
 236 Per input file:
 237         /@{FILE,TABLE@}=@{*,'@var{file_name}'@}
 238         [/RENAME=(@var{src_names}=@var{target_names})@dots{}]
 239         [/IN=@var{var_name}]
 240         [/SORT]
 241
 242 Once per command:
 243         /BY @var{var_list}[(@{D|A@}] [@var{var_list}[(@{D|A@})]@dots{}]
 244         [/DROP=@var{var_list}]
 245         [/KEEP=@var{var_list}]
 246         [/FIRST=@var{var_name}]
 247         [/LAST=@var{var_name}]
 248         [/MAP]
 249 @end display
 250
 251 @cmd{MATCH FILES} merges sets of corresponding cases in multiple
 252 input files into single cases in the output, combining their data.
 253
 254 @cmd{MATCH FILES} shares the bulk of its syntax with other @pspp{} commands for
 255 combining multiple data files.  @xref{Combining Files Common Syntax},
 256 above, for an explanation of this common syntax.
 257
 258 How @cmd{MATCH FILES} matches up cases from the input files depends on
 259 whether @subcmd{BY} is specified:
 260
 261 @itemize @bullet
 262 @item
 263 If @subcmd{BY} is not used, @cmd{MATCH FILES} combines the first case from each input
 264 file to produce the first output case, then the second case from each
 265 input file for the second output case, and so on.  If some input files
 266 have fewer cases than others, then the shorter files do not contribute
 267 to cases output after their input has been exhausted.
 268
 269 @item
 270 If @subcmd{BY} is used, @cmd{MATCH FILES} combines cases from each input file that
 271 have identical values for the @subcmd{BY} variables.
 272
 273 When @subcmd{BY} is used, @subcmd{TABLE} subcommands may be used to introduce @dfn{table
 274 lookup file}.  @subcmd{TABLE} has same syntax as @subcmd{FILE}, and the @subcmd{RENAME}, @subcmd{IN}, and
 275 @subcmd{SORT} subcommands may follow a @subcmd{TABLE} in the same way as @subcmd{FILE}.
 276 Regardless of the number of @subcmd{TABLE}s, at least one @subcmd{FILE} must specified.
 277 Table lookup files are treated in the same way as other input files
 278 for most purposes and, in particular, table lookup files must be
 279 sorted on the @subcmd{BY} variables or the @subcmd{SORT} subcommand must be specified
 280 for that @subcmd{TABLE}.
 281
 282 Cases in table lookup files are not consumed after they have been used
 283 once.  This means that data in table lookup files can correspond to
 284 any number of cases in @subcmd{FILE} input files.  Table lookup files are
 285 analogous to lookup tables in traditional relational database systems.
 286
 287 If a table lookup file contains more than one case with a given set of
 288 @subcmd{BY} variables, only the first case is used.
 289 @end itemize
 290
 291 When @cmd{MATCH FILES} creates an output case, variables that are only in
 292 files that are not present for the current case are set to the
 293 system-missing value for numeric variables or spaces for string
 294 variables.
 295
 296 @node UPDATE
 297 @section UPDATE
 298 @vindex UPDATE
 299
 300 @display
 301 UPDATE
 302
 303 Per input file:
 304         /FILE=@{*,'@var{file_name}'@}
 305         [/RENAME=(@var{src_names}=@var{target_names})@dots{}]
 306         [/IN=@var{var_name}]
 307         [/SORT]
 308
 309 Once per command:
 310         /BY @var{var_list}[(@{D|A@})] [@var{var_list}[(@{D|A@})]]@dots{}
 311         [/DROP=@var{var_list}]
 312         [/KEEP=@var{var_list}]
 313         [/MAP]
 314 @end display
 315
 316 @cmd{UPDATE} updates a @dfn{master file} by applying modifications
 317 from one or more @dfn{transaction files}.
 318
 319 @cmd{UPDATE} shares the bulk of its syntax with other @pspp{} commands for
 320 combining multiple data files.  @xref{Combining Files Common Syntax},
 321 above, for an explanation of this common syntax.
 322
 323 At least two @subcmd{FILE} subcommands must be specified.  The first @subcmd{FILE}
 324 subcommand names the master file, and the rest name transaction files.
 325 Every input file must either be sorted on the variables named on the
 326 @subcmd{BY} subcommand, or the @subcmd{SORT} subcommand must be used just after the @subcmd{FILE}
 327 subcommand for that input file.
 328
 329 @cmd{UPDATE} uses the variables specified on the @subcmd{BY} subcommand, which is
 330 required, to attempt to match each case in a transaction file with a
 331 case in the master file:
 332
 333 @itemize @bullet
 334 @item
 335 When a match is found, then the values of the variables present in the
 336 transaction file replace those variables' values in the new active
 337 file.  If there are matching cases in more than more transaction file,
 338 @pspp{} applies the replacements from the first transaction file, then
 339 from the second transaction file, and so on.  Similarly, if a single
 340 transaction file has cases with duplicate @subcmd{BY} values, then those are
 341 applied in order to the master file.
 342
 343 When a variable in a transaction file has a missing value or when a string
 344 variable's value is all blanks, that value is never used to update the
 345 master file.
 346
 347 @item
 348 If a case in the master file has no matching case in any transaction
 349 file, then it is copied unchanged to the output.
 350
 351 @item
 352 If a case in a transaction file has no matching case in the master
 353 file, then it causes a new case to be added to the output, initialized
 354 from the values in the transaction file.
 355 @end itemize