doc/transformation.texi

   1 @c PSPP - a program for statistical analysis.
   2 @c Copyright (C) 2017, 2020 Free Software Foundation, Inc.
   3 @c Permission is granted to copy, distribute and/or modify this document
   4 @c under the terms of the GNU Free Documentation License, Version 1.3
   5 @c or any later version published by the Free Software Foundation;
   6 @c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
   7 @c A copy of the license is included in the section entitled "GNU
   8 @c Free Documentation License".
   9 @c
  10 @node Data Manipulation
  11 @chapter Data transformations
  12 @cindex transformations
  13
  14 The @pspp{} procedures examined in this chapter manipulate data and
  15 prepare the active dataset for later analyses.  They do not produce output,
  16 as a rule.
  17
  18 @menu
  19 * AGGREGATE::                   Summarize multiple cases into a single case.
  20 * AUTORECODE::                  Automatic recoding of variables.
  21 * COMPUTE::                     Assigning a variable a calculated value.
  22 * COUNT::                       Counting variables with particular values.
  23 * FLIP::                        Exchange variables with cases.
  24 * IF::                          Conditionally assigning a calculated value.
  25 * RECODE::                      Mapping values from one set to another.
  26 * SORT CASES::                  Sort the active dataset.
  27 @end menu
  28
  29 @node AGGREGATE
  30 @section AGGREGATE
  31 @vindex AGGREGATE
  32
  33 @display
  34 AGGREGATE
  35         [OUTFILE=@{*,'@var{file_name}',@var{file_handle}@} [MODE=@{REPLACE,ADDVARIABLES@}]]
  36         [/MISSING=COLUMNWISE]
  37         [/PRESORTED]
  38         [/DOCUMENT]
  39         [/BREAK=@var{var_list}]
  40         /@var{dest_var}['@var{label}']@dots{}=@var{agr_func}(@var{src_vars}[, @var{args}]@dots{})@dots{}
  41 @end display
  42
  43 @cmd{AGGREGATE} summarizes groups of cases into single cases.
  44 It divides cases into groups that have the same values for one or more
  45 variables called @dfn{break variables}.  Several functions are available
  46 for summarizing case contents.
  47
  48 The @cmd{AGGREGATE} syntax consists of subcommands to control its
  49 behavior, all of which are optional, followed by one or more
  50 destination variable assigments, each of which uses an aggregation
  51 function to define how it is calculated.
  52
  53 The @subcmd{OUTFILE} subcommand, which must be first, names the
  54 destination for @cmd{AGGREGATE} output.  It may name a system file by
  55 file name or file handle (@pxref{File Handles}), a dataset by its name
  56 (@pxref{Datasets}), or @samp{*} to replace the active dataset.
  57 @cmd{AGGREGATE} writes its output to this file.
  58
  59 With @subcmd{OUTFILE=*} only, @code{MODE} may be specified immediately
  60 afterward with the value @code{ADDVARIABLES} or @code{REPLACE}:
  61
  62 @itemize
  63 @item
  64 With @code{REPLACE}, the default, the active dataset is replaced by a new dataset
  65 which contains just the break variables and the destination varibles.
  66 The new file contains as many cases as there are
  67 unique combinations of the break variables.
  68
  69 @item
  70 With @code{ADDVARIABLES}, the destination variables are added to those in
  71 the existing active dataset.
  72 Cases that have the same combination of values in their break
  73 variables receive identical values for the destination variables.
  74 The number of cases in the active dataset remains unchanged.
  75 The data must be
  76 sorted on the break variables, that is, @code{ADDVARIABLES} implies @code{PRESORTED}
  77 @end itemize
  78
  79 If @code{OUTFILE} is omitted, @cmd{AGGREGATE} acts as if
  80 @code{OUTFILE=* MODE=ADDVARIABLES} were specified.
  81
  82 By default, @cmd{AGGREGATE} first sorts the data on the break variables.
  83 If the active dataset is already sorted
  84 or grouped by the break variables, specify
  85 @subcmd{PRESORTED} to save time.
  86 With @subcmd{MODE=ADDVARIABLES}, the data must be pre-sorted.
  87
  88 Specify @subcmd{DOCUMENT} to copy the documents from the active dataset into the
  89 aggregate file (@pxref{DOCUMENT}).  Otherwise, the aggregate file does
  90 not contain any documents, even if the aggregate file replaces the
  91 active dataset.
  92
  93 Normally, @code{AGGREGATE} produces a non-missing value whenever there
  94 is enough non-missing data for the aggregation function in use, that
  95 is, just one non-missing value or, for the @code{SD} and @code{SD.}
  96 aggregation functions, two non-missing values.  Specify
  97 @code{/MISSING=COLUMNWISE} to make @code{AGGREGATE} output a missing
  98 value when one or more of the input values are missing.
  99
 100 The @subcmd{BREAK} subcommand is optionally but usually present.  On
 101 @subcmd{BREAK}, list the variables used to divide the active dataset
 102 into groups to be summarized.
 103
 104 @cmd{AGGREGATE} is particular about the order of subcommands.
 105 @subcmd{OUTFILE} must be first, followed by @subcmd{MISSING}.
 106 @subcmd{PRESORTED} and @subcmd{DOCUMENT} follow @subcmd{MISSING}, in
 107 either order, followed by @subcmd{BREAK}, then followed by aggregation
 108 variable specifications.
 109
 110 At least one set of aggregation variables is required.  Each set
 111 comprises a list of aggregation variables, an equals sign (@samp{=}),
 112 the name of an aggregation function (see the list below), and a list
 113 of source variables in parentheses.  A few aggregation functions do
 114 not accept source variables, and some aggregation functions expect
 115 additional arguments after the source variable names.
 116
 117 @cmd{AGGREGATE} typically creates aggregation variables with no
 118 variable label, value labels, or missing values.  Their default print
 119 and write formats depend on the aggregation function used, with
 120 details given in the table below.  A variable label for an aggregation
 121 variable may be specified just after the variable's name in the
 122 aggregation variable list.
 123
 124 Each set must have exactly as many source variables as aggregation
 125 variables.  Each aggregation variable receives the results of applying
 126 the specified aggregation function to the corresponding source
 127 variable.
 128
 129 The following aggregation functions may be applied only to numeric
 130 variables:
 131
 132 @table @asis
 133 @item @subcmd{MEAN(@var{var_name}@dots{})}
 134 Arithmetic mean.  Limited to numeric values.  The default format is
 135 F8.2.
 136
 137 @item @subcmd{MEDIAN(@var{var_name}@dots{})}
 138 The median value.  Limited to numeric values.  The default format is F8.2.
 139
 140 @item @subcmd{SD(@var{var_name}@dots{})}
 141 Standard deviation of the mean.  Limited to numeric values.  The
 142 default format is F8.2.
 143
 144 @item @subcmd{SUM(@var{var_name}@dots{})}
 145 Sum.  Limited to numeric values.  The default format is F8.2.
 146 @end table
 147
 148 These aggregation functions may be applied to numeric and string variables:
 149
 150 @table @asis
 151 @item @subcmd{CGT(@var{var_name}@dots{}, @var{value})}
 152 @itemx @subcmd{CLT(@var{var_name}@dots{}, @var{value})}
 153 @itemx @subcmd{CIN(@var{var_name}@dots{}, @var{low}, @var{high})}
 154 @itemx @subcmd{COUT(@var{var_name}@dots{}, @var{low}, @var{high})}
 155 Total weight of cases greater than or less than @var{value} or inside
 156 or outside the closed range [@var{low},@var{high}], respectively.  The
 157 default format is F5.3.
 158
 159 @item @subcmd{FGT(@var{var_name}@dots{}, @var{value})}
 160 @itemx @subcmd{FLT(@var{var_name}@dots{}, @var{value})}
 161 @itemx @subcmd{FIN(@var{var_name}@dots{}, @var{low}, @var{high})}
 162 @itemx @subcmd{FOUT(@var{var_name}@dots{}, @var{low}, @var{high})}
 163 Fraction of values greater than or less than @var{value} or inside or
 164 outside the closed range [@var{low},@var{high}], respectively.  The
 165 default format is F5.3.
 166
 167 @item @subcmd{FIRST(@var{var_name}@dots{})}
 168 @itemx @subcmd{LAST(@var{var_name}@dots{})}
 169 First or last non-missing value, respectively, in break group.  The
 170 aggregation variable
 171 receives the complete dictionary information from the source variable.
 172 The sort performed by @cmd{AGGREGATE} (and by @cmd{SORT CASES}) is stable.
 173 This means that
 174 the first (or last) case with particular values for the break variables before
 175 sorting is also the first (or last) case in that break group after sorting.
 176
 177 @item @subcmd{MIN(@var{var_name}@dots{})}
 178 @itemx @subcmd{MAX(@var{var_name}@dots{})}
 179 Minimum or maximum value, respectively.  The aggregation variable
 180 receives the complete dictionary information from the source variable.
 181
 182 @item @subcmd{N(@var{var_name}@dots{})}
 183 @itemx @subcmd{NMISS(@var{var_name}@dots{})}
 184 Total weight of non-missing or missing values, respectively.  The
 185 default format is F7.0 if weighting is not enabled, F8.2 if it is
 186 (@pxref{WEIGHT}).
 187
 188 @item @subcmd{NU(@var{var_name}@dots{})}
 189 @itemx @subcmd{NUMISS(@var{var_name}@dots{})}
 190 Count of non-missing or missing values, respectively, ignoring case
 191 weights.  The default format is F7.0.
 192
 193 @item @subcmd{PGT(@var{var_name}@dots{}, @var{value})}
 194 @itemx @subcmd{PLT(@var{var_name}@dots{}, @var{value})}
 195 @itemx @subcmd{PIN(@var{var_name}@dots{}, @var{low}, @var{high})}
 196 @itemx @subcmd{POUT(@var{var_name}@dots{}, @var{low}, @var{high})}
 197 Percentage between 0 and 100 of values greater than or less than
 198 @var{VALUE} or inside or outside the closed range
 199 [@var{low},@var{high}], respectively.  The default format is F5.1.
 200 @end table
 201
 202 These aggregation functions do not accept source variables:
 203
 204 @table @asis
 205 @item @subcmd{N}
 206 Total weight of cases aggregated to form this group.  The default
 207 format is F7.0 if weighting is not enabled, F8.2 if it is
 208 (@pxref{WEIGHT}).
 209
 210 @item @subcmd{NU}
 211 Count of cases aggregated to form this group, ignoring case weights.
 212 The default format is F7.0.
 213 @end table
 214
 215 Aggregation functions compare string values in terms of internal
 216 character codes.
 217 On most modern computers, this is  @acronym{ASCII} or a superset thereof.
 218
 219 The aggregation functions listed above exclude all user-missing values
 220 from calculations.  To include user-missing values, insert a period
 221 (@samp{.}) at the end of the function name.  (@i{e.g.}@: @samp{SUM.}).
 222 (Be aware that specifying such a function as the last token on a line
 223 causes the period to be interpreted as the end of the command.)
 224
 225 @cmd{AGGREGATE} both ignores and cancels the current @cmd{SPLIT FILE}
 226 settings (@pxref{SPLIT FILE}).
 227
 228 @subsection Aggregate Example
 229
 230 The @file{personnel.sav} dataset provides the occupations and salaries of
 231 many individuals.  For many purposes however such detailed information is
 232 not interesting, but often the aggregated statistics of each occupation are
 233 of interest.  In @ref{aggregate:ex} the @cmd{AGGREGATE} command is used
 234 to calculate the mean, the median and the standard deviation of each
 235 occupation.
 236
 237 @float Example, aggregate:ex
 238 @psppsyntax {aggregate.sps}
 239 @caption {Calculating aggregated statistics from the @file{personnel.sav} file.}
 240 @end float
 241
 242 Since we chose the @samp{MODE=REPLACE} option, in @ref{aggregate:res} cases
 243 for the individual persons are no longer present.  They have each been replaced
 244 by a single case per aggregated value.
 245
 246 @float Results, aggregate:res
 247 @psppoutput {aggregate}
 248 @caption {Aggregated mean, median and standard deviation per @exvar{occupation}.}
 249 @end float
 250
 251 Note that some values for the standard deviation are blank.
 252 This is because there is only one case with the respective
 253 occupation.
 254
 255 @node AUTORECODE
 256 @section AUTORECODE
 257 @vindex AUTORECODE
 258
 259 @display
 260 AUTORECODE VARIABLES=@var{src_vars} INTO @var{dest_vars}
 261         [ /DESCENDING ]
 262         [ /PRINT ]
 263         [ /GROUP ]
 264         [ /BLANK = @{VALID, MISSING@} ]
 265 @end display
 266
 267 The @cmd{AUTORECODE} procedure considers the @var{n} values that a variable
 268 takes on and maps them onto values 1@dots{}@var{n} on a new numeric
 269 variable.
 270
 271 Subcommand @subcmd{VARIABLES} is the only required subcommand and must come
 272 first.  Specify @subcmd{VARIABLES}, an equals sign (@samp{=}), a list of source
 273 variables, @subcmd{INTO}, and a list of target variables.  There must the same
 274 number of source and target variables.  The target variables must not
 275 already exist.
 276
 277 @cmd{AUTORECODE} ordinarily assigns each increasing non-missing value
 278 of a source variable (for a string, this is based on character code
 279 comparisons) to consecutive values of its target variable.  For
 280 example, the smallest non-missing value of the source variable is
 281 recoded to value 1, the next smallest to 2, and so on.  If the source
 282 variable has user-missing values, they are recoded to
 283 consecutive values just above the non-missing values.  For example, if
 284 a source variables has seven distinct non-missing values, then the
 285 smallest missing value would be recoded to 8, the next smallest to 9,
 286 and so on.
 287
 288 Use @subcmd{DESCENDING} to reverse the sort order for non-missing
 289 values, so that the largest non-missing value is recoded to 1, the
 290 second-largest to 2, and so on.  Even with @subcmd{DESCENDING},
 291 user-missing values are still recoded in ascending order just above
 292 the non-missing values.
 293
 294 The system-missing value is always recoded into the system-missing
 295 variable in target variables.
 296
 297 If a source value has a value label, then that value label is retained
 298 for the new value in the target variable.  Otherwise, the source value
 299 itself becomes each new value's label.
 300
 301 Variable labels are copied from the source to target variables.
 302
 303 @subcmd{PRINT} is currently ignored.
 304
 305 The @subcmd{GROUP} subcommand is relevant only if more than one variable is to be
 306 recoded.   It causes a single mapping between source and target values to
 307 be used, instead of one map per variable.  With @subcmd{GROUP},
 308 user-missing values are taken from the first source variable that has
 309 any user-missing values.
 310
 311 If @subcmd{/BLANK=MISSING} is given, then string variables which contain only
 312 whitespace are recoded as SYSMIS.  If @subcmd{/BLANK=VALID} is specified then they
 313 are allocated a value like any other.  @subcmd{/BLANK} is not relevant
 314 to numeric values. @subcmd{/BLANK=VALID} is the default.
 315
 316 @cmd{AUTORECODE} is a procedure.  It causes the data to be read.  It
 317 ignores @cmd{TEMPORARY} (@pxref{TEMPORARY}), so that ``temporary''
 318 transformations become permanent.
 319
 320
 321 @subsection Autorecode Example
 322
 323 In the file @file{personnel.sav}, the variable @exvar{occupation} is a string
 324 variable.   Except for data of a purely commentary nature, string variables
 325 are generally a bad idea.  One reason is that data entry errors are easily
 326 overlooked.  This has happened in @file{personnel.sav}; one entry which should
 327 read ``Scientist'' has been mistyped as ``Scrientist''.  In @ref{autorecode:ex}
 328 first, this error is corrected by the @cmd{DO IF} clause,
 329 @footnote{One must use care when correcting such data input errors rather than
 330 msimply marking them as missing.  For example, if an occupation has been entered
 331 ``Barister'', did the person mean ``Barrister'' or did she mean ``Barista''?}
 332 then we use @cmd{AUTORECODE} to
 333 create a new numeric variable which takes recoded values of @exvar{occupation}.
 334 Finally, we remove the old variable and rename the new variable to
 335 the name of the old variable.
 336
 337 @float Example, autorecode:ex
 338 @psppsyntax {autorecode.sps}
 339 @caption {Changing a string variable to a numeric variable using @cmd{AUTORECODE}
 340 after correcting a data entry error}
 341 @end float
 342
 343
 344 @float Screenshot, autorecode:scr
 345 @psppimage {autorecode}
 346 @caption {Autorecode dialog box set to recode @exvar{occupation} to @exvar{occ}}
 347 @end float
 348
 349 Notice in @ref{autorecode:res}, how  the new variable has been automatically
 350 allocated value labels which correspond to the strings of the old variable.
 351 This means that in future analyses the descriptive strings are reported instead
 352 of the numeric values.
 353
 354 @float Result, autorecode:res
 355 @psppoutput {autorecode}
 356 @caption {The properties of the @exvar{occupation} variable following @cmd{AUTORECODE}}
 357 @end float
 358
 359
 360 @node COMPUTE
 361 @section COMPUTE
 362 @vindex COMPUTE
 363
 364 @display
 365 COMPUTE @var{variable} = @var{expression}.
 366 @end display
 367   or
 368 @display
 369 COMPUTE vector(@var{index}) = @var{expression}.
 370 @end display
 371
 372 @cmd{COMPUTE} assigns the value of an expression to a target
 373 variable.  For each case, the expression is evaluated and its value
 374 assigned to the target variable.  Numeric and string
 375 variables may be assigned.  When a string expression's width differs
 376 from the target variable's width, the string result of the expression
 377 is truncated or padded with spaces on the right as necessary.  The
 378 expression and variable types must match.
 379
 380 For numeric variables only, the target variable need not already
 381 exist.  Numeric variables created by @cmd{COMPUTE} are assigned an
 382 @code{F8.2} output format.  String variables must be declared before
 383 they can be used as targets for @cmd{COMPUTE}.
 384
 385 The target variable may be specified as an element of a vector
 386 (@pxref{VECTOR}).  In this case, an expression @var{index} must be
 387 specified in parentheses following the vector name.  The expression @var{index}
 388 must evaluate to a numeric value that, after rounding down
 389 to the nearest integer, is a valid index for the named vector.
 390
 391 Using @cmd{COMPUTE} to assign to a variable specified on @cmd{LEAVE}
 392 (@pxref{LEAVE}) resets the variable's left state.  Therefore,
 393 @code{LEAVE} should be specified following @cmd{COMPUTE}, not before.
 394
 395 @cmd{COMPUTE} is a transformation.  It does not cause the active dataset to be
 396 read.
 397
 398 When @cmd{COMPUTE} is specified following @cmd{TEMPORARY}
 399 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
 400 (@pxref{LAG}).
 401
 402 @subsection Compute Examples
 403
 404 The dataset @file{physiology.sav} contains the height and weight of persons.
 405 For some purposes, neither height nor weight alone is of interest.
 406 Epidemiologists are often more interested in the @dfn{body mass index} which
 407 can sometimes be used as a predictor for clinical conditions.
 408 The body mass index is defined as the weight of the person in kilograms divided
 409 by the square of the person's height in metres.
 410 @footnote{Since BMI is a quantity with a ratio scale and has units, the term ``index''
 411 is a misnomer, but that is what it is called.}
 412
 413 @float Example, bmi:ex
 414 @psppsyntax {compute.sps}
 415 @caption {Computing the body mass index from @exvar{weight} and @exvar{height}}
 416 @end float
 417
 418 @ref{bmi:ex} shows how you can use @cmd{COMPUTE} to generate a new variable called
 419 @exvar{bmi} and have every case's value calculated from the existing values of
 420 @exvar{weight} and @exvar{height}.
 421 It also shows how you can add a label to this new variable (@pxref{VARIABLE LABELS}),
 422 so that a more descriptive label appears in subsequent analyses, and this can be seen
 423 in the ouput from the @cmd{DESCRIPTIVES} command in @ref{bmi:res}.
 424
 425 @float Screenshot, bmi:scr
 426 @psppimage {compute}
 427 @caption {Using the dialog box to generate a new variable and compute its values}
 428 @end float
 429
 430 The expression which follows the @samp{=} sign can be as complicated as necessary.
 431 @xref{Expressions} for a precise description of the language accepted.
 432 Normally it is easiest to enter the code directly, however there is a dialog box
 433 available if desired.  This is illustrated in @ref{bmi:scr}.
 434 One advantage is that it offers a list of mathematical
 435 functions which can be selected and pasted into the expression.
 436
 437 @float Results, bmi:res
 438 @psppoutput {compute}
 439 @caption {An analysis which includes @exvar{bmi} in its results}
 440 @end float
 441
 442
 443
 444 @node COUNT
 445 @section COUNT
 446 @vindex COUNT
 447
 448 @display
 449 COUNT @var{var_name} = @var{var}@dots{} (@var{value}@dots{})
 450     [/@var{var_name} = @var{var}@dots{} (@var{value}@dots{})]@dots{}
 451
 452 Each @var{value} takes one of the following forms:
 453         @var{number}
 454         @var{string}
 455         @var{num1} THRU @var{num2}
 456         MISSING
 457         SYSMIS
 458 where @var{num1} is a numeric expression or the words @subcmd{LO}  or @subcmd{LOWEST}
 459       and @var{num2} is a numeric expression  or @subcmd{HI} or @subcmd{HIGHEST}.
 460 @end display
 461
 462 @cmd{COUNT} creates or replaces a numeric @dfn{target} variable that
 463 counts the occurrence of a @dfn{criterion} value or set of values over
 464 one or more @dfn{test} variables for each case.
 465
 466 The target variable values are always nonnegative integers.  They are
 467 never missing.  The target variable is assigned an F8.2 output format.
 468 @xref{Input and Output Formats}.  Any variables, including
 469 string variables, may be test variables.
 470
 471 User-missing values of test variables are treated just like any other
 472 values.  They are @strong{not} treated as system-missing values.
 473 User-missing values that are criterion values or inside ranges of
 474 criterion values are counted as any other values.  However (for numeric
 475 variables), keyword @subcmd{MISSING} may be used to refer to all system-
 476 and user-missing values.
 477
 478 @cmd{COUNT} target variables are assigned values in the order
 479 specified.  In the command @subcmd{COUNT @var{A}=@var{A} @var{B}(1) /@var{B}=@var{A} @var{B}(2).}, the
 480 following actions occur:
 481
 482 @itemize @minus
 483 @item
 484 The number of occurrences of 1 between @var{A} and @var{B} is counted.
 485
 486 @item
 487 @var{A} is assigned this value.
 488
 489 @item
 490 The number of occurrences of 1 between @var{B} and the @strong{new}
 491 value of @var{A} is counted.
 492
 493 @item
 494 @var{B} is assigned this value.
 495 @end itemize
 496
 497 Despite this ordering, all @cmd{COUNT} criterion variables must exist
 498 before the procedure is executed---they may not be created as target
 499 variables earlier in the command!  Break such a command into two
 500 separate commands.
 501
 502 @subsection Count Examples
 503
 504 In the survey results in dataset @file{hotel.sav} a manager wishes
 505 to know how many respondents answered with low valued answers to questions
 506 @exvar{v1}, @exvar{v2} and @exvar{v3}.  This can be found using the code
 507 in @ref{count:ex}.  Specifically, this code creates a new variable, and
 508 populates it with the number of values in @exvar{v1}--@exvar{v2} which
 509 are 2 or lower.
 510
 511 @float Example, count:ex
 512 @psppsyntax {count.sps}
 513 @caption {Counting low values to responses @exvar{v1}, @exvar{v2} and @exvar{v3}}
 514 @end float
 515
 516 In @ref{count:ex} the @cmd{COUNT} transformation creates a new variable, @exvar{low_counts} and
 517 its values are shown using the @cmd{LIST} command.
 518
 519 If using the graphic user interface, a two step process must be used to set
 520 up the @cmd{COUNT} transformation.  The first dialog box (@ref{count:scr}) provides for the
 521 variables to be chosen.
 522 Then, one must click on the button marked ``Define Values...'' to reveal
 523 the dialog box for selecting the values to count.
 524
 525 @float Screenshot, count:scr
 526 @psppimage {count}
 527 @caption {The variables @exvar{v1}, @exvar{v2} and @exvar{v3} selected, ready
 528 to define values to count}
 529 @end float
 530
 531 In this dialog box, you must select the values you wish to count
 532 --- in this case all values up to and including 2 --- as shown in @ref{count-define:scr}
 533 and click ``Add''.  As many ranges or may be added as you desire.
 534 When all desired ranges have been added click ``Continue''.
 535
 536 @float Screenshot, count-define:scr
 537 @psppimage {count-define}
 538 @caption {Count ``Define Values'' dialog with @samp{lowest thru 2} selected}
 539 @end float
 540
 541 In @ref{count:res} we can see the values of @exvar{low_counts} after the @cmd{COUNT}
 542 transformation has completed.  The first value is 1, because there is only one
 543 variable amoung @exvar{v1}, @exvar{v2} and @exvar{3} which has a value of 2 or less.
 544 The second value is 2, because both @exvar{v1} and @exvar{v2} are 2 or less.
 545
 546 @float Result, count:res
 547 @psppoutput {count}
 548 @caption {The values of @exvar{v1}, @exvar{v2}, @exvar{v3} and @exvar{low_counts} after
 549 the @cmd{COUNT} transformation has run}
 550 @end float
 551
 552
 553 @node FLIP
 554 @section FLIP
 555 @vindex FLIP
 556
 557 @display
 558 FLIP /VARIABLES=@var{var_list} /NEWNAMES=@var{var_name}.
 559 @end display
 560
 561 @cmd{FLIP} transposes rows and columns in the active dataset.  It
 562 causes cases to be swapped with variables, and vice versa.
 563
 564 All variables in the transposed active dataset are numeric.  String
 565 variables take on the system-missing value in the transposed file.
 566
 567 @subcmd{N} subcommands are required.  If specified, the @subcmd{VARIABLES} subcommand
 568 selects variables to be transformed into cases, and variables not
 569 specified are discarded.  If the @subcmd{VARIABLES} subcommand is omitted, all
 570 variables are selected for transposition.
 571
 572 The variables specified by @subcmd{NEWNAMES}, which must be a
 573 string variable, is
 574 used to give names to the variables created by @cmd{FLIP}.  Only the
 575 first 8 characters of the variable are used.  If
 576 @subcmd{NEWNAMES} is not
 577 specified then the default is a variable named @exvar{CASE_LBL}, if it exists.
 578 If it does not then the variables created by @cmd{FLIP} are named VAR000
 579 through VAR999, then VAR1000, VAR1001, and so on.
 580
 581 When a @subcmd{NEWNAMES} variable is available, the names must be canonicalized
 582 before becoming variable names.  Invalid characters are replaced by
 583 letter @samp{V} in the first position, or by @samp{_} in subsequent
 584 positions.  If the name thus generated is not unique, then numeric
 585 extensions are added, starting with 1, until a unique name is found or
 586 there are no remaining possibilities.  If the latter occurs then the
 587 @cmd{FLIP} operation aborts.
 588
 589 The resultant dictionary contains a @exvar{CASE_LBL} variable, a string
 590 variable of width 8, which stores the names of the variables in the
 591 dictionary before the transposition.  Variables names longer than 8
 592 characters are truncated.  If @cmd{FLIP} is called again on
 593 this dataset, the @exvar{CASE_LBL} variable can be passed to the @subcmd{NEWNAMES}
 594 subcommand to recreate the original variable names.
 595
 596 @cmd{FLIP} honors @cmd{N OF CASES} (@pxref{N OF CASES}).  It ignores
 597 @cmd{TEMPORARY} (@pxref{TEMPORARY}), so that ``temporary''
 598 transformations become permanent.
 599
 600 @subsection Flip Examples
 601
 602
 603 In @ref{flip:ex}, data has been entered using @cmd{DATA LIST} (@pxref{DATA LIST})
 604 such that the first variable in the dataset is a string variable containing
 605 a description of the other data for the case.
 606 Clearly this is not a convenient arrangement for performing statistical analyses,
 607 so it would have been better to think a little more carefully about how the data
 608 should have been arranged.
 609 However often the data is provided by some third party source, and you have
 610 no control over the form.
 611 Fortunately, we can use @cmd{FLIP} to exchange the variables
 612 and cases in the active dataset.
 613
 614 @float Example, flip:ex
 615 @psppsyntax {flip.sps}
 616 @caption {Using @cmd{FLIP} to exchange variables and cases in a dataset}
 617 @end float
 618
 619 As you can see in @ref{flip:res} before the @cmd{FLIP} command has run there
 620 are seven variables (six containing data and one for the heading) and three cases.
 621 Afterwards there are four variables (one per case, plus the @exvar{CASE_LBL} variable)
 622 and six cases.
 623 You can delete the @exvar{CASE_LBL} variable (@pxref{DELETE VARIABLES}) if you don't need it.
 624
 625 @float Results, flip:res
 626 @psppoutput {flip}
 627 @caption {The results of using @cmd{FLIP} to exchange variables and cases in a dataset}
 628 @end float
 629
 630
 631 @node IF
 632 @section IF
 633 @vindex IF
 634
 635 @display
 636 IF @var{condition} @var{variable}=@var{expression}.
 637 @end display
 638   or
 639 @display
 640 IF @var{condition} vector(@var{index})=@var{expression}.
 641 @end display
 642
 643 The @cmd{IF} transformation conditionally assigns the value of a target
 644 expression to a target variable, based on the truth of a test
 645 expression.
 646
 647 Specify a boolean-valued expression (@pxref{Expressions}) to be tested
 648 following the @cmd{IF} keyword.  This expression is evaluated for each case.
 649 If the value is true, then the value of the expression is computed and
 650 assigned to the specified variable.  If the value is false or missing,
 651 nothing is done.  Numeric and string variables may be
 652 assigned.  When a string expression's width differs from the target
 653 variable's width, the string result of the expression is truncated or
 654 padded with spaces on the right as necessary.  The expression and
 655 variable types must match.
 656
 657 The target variable may be specified as an element of a vector
 658 (@pxref{VECTOR}).  In this case, a vector index expression must be
 659 specified in parentheses following the vector name.  The index
 660 expression must evaluate to a numeric value that, after rounding down
 661 to the nearest integer, is a valid index for the named vector.
 662
 663 Using @cmd{IF} to assign to a variable specified on @cmd{LEAVE}
 664 (@pxref{LEAVE}) resets the variable's left state.  Therefore,
 665 @code{LEAVE} should be specified following @cmd{IF}, not before.
 666
 667 When @cmd{IF} is specified following @cmd{TEMPORARY}
 668 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
 669 (@pxref{LAG}).
 670
 671 @node RECODE
 672 @section RECODE
 673 @vindex RECODE
 674
 675 The @cmd{RECODE} command is used to transform existing values into other,
 676 user specified values.
 677 The general form is:
 678
 679 @display
 680 RECODE @var{src_vars}
 681         (@var{src_value} @var{src_value} @dots{} = @var{dest_value})
 682         (@var{src_value} @var{src_value} @dots{} = @var{dest_value})
 683         (@var{src_value} @var{src_value} @dots{} = @var{dest_value}) @dots{}
 684          [INTO @var{dest_vars}].
 685 @end display
 686
 687 Following the @cmd{RECODE} keyword itself comes @var{src_vars} which is a list
 688 of variables whose values are to be transformed.
 689 These variables may be string variables or they may be numeric.
 690 However the list must be homogeneous; you may not mix string variables and
 691 numeric variables in the same recoding.
 692
 693 After the list of source variables, there should be one or more @dfn{mappings}.
 694 Each mapping is enclosed in parentheses, and contains the source values and
 695 a destination value separated by a single @samp{=}.
 696 The source values are used to specify the values in the dataset which
 697 need to change, and the destination value specifies the new value
 698 to which they should be changed.
 699 Each @var{src_value} may take one of the following forms:
 700 @table @asis
 701 @item @var{number}
 702 If the source variables are numeric then @var{src_value} may be a literal
 703 number.
 704 @item @var{string}
 705 If the source variables are string variables then @var{src_value} may be a
 706 literal string (like all strings, enclosed in single or double quotes).
 707 @item @var{num1} THRU @var{num2}
 708 This form is valid only when the source variables are numeric.
 709 It specifies all values in the range between @var{num1} and @var{num2},
 710 including both endpoints of the range.  By convention, @var{num1}
 711 should be less than @var{num2}.
 712 Open-ended ranges may be specified using @samp{LO} or @samp{LOWEST}
 713 for @var{num1}
 714 or @samp{HI} or @samp{HIGHEST} for @var{num2}.
 715 @item @samp{MISSING}
 716 The literal keyword @samp{MISSING} matches both system missing and user
 717 missing values.
 718 It is valid for both numeric and string variables.
 719 @item @samp{SYSMIS}
 720 The literal keyword @samp{SYSMIS} matches system missing
 721 values.
 722 It is valid for both numeric variables only.
 723 @item @samp{ELSE}
 724 The @samp{ELSE} keyword may be used to match any values which are
 725 not matched by any other @var{src_value} appearing in the command.
 726 If this keyword appears, it should be used in the last mapping of the
 727 command.
 728 @end table
 729
 730 After the source variables comes an @samp{=} and then the @var{dest_value}.
 731 The @var{dest_value} may take any of the following forms:
 732 @table @asis
 733 @item @var{number}
 734 A literal numeric value to which the source values should be changed.
 735 This implies the destination variable must be numeric.
 736 @item @var{string}
 737 A literal string value (enclosed in quotation marks) to which the source
 738 values should be changed.
 739 This implies the destination variable must be a string variable.
 740 @item @samp{SYSMIS}
 741 The keyword @samp{SYSMIS} changes the value to the system missing value.
 742 This implies the destination variable must be numeric.
 743 @item @samp{COPY}
 744 The special keyword @samp{COPY} means that the source value should not be
 745 modified, but
 746 copied directly to the destination value.
 747 This is meaningful only if @samp{INTO @var{dest_vars}} is specified.
 748 @end table
 749
 750 Mappings are considered from left to right.
 751 Therefore, if a value is matched by a @var{src_value} from more than
 752 one mapping, the first (leftmost) mapping which matches is considered.
 753 Any subsequent matches are ignored.
 754
 755 The clause @samp{INTO @var{dest_vars}} is optional.
 756 The behaviour of the command is slightly different depending on whether it
 757 appears or not.
 758
 759 If @samp{INTO @var{dest_vars}} does not appear, then values are recoded
 760 ``in place''.
 761 This means that the recoded values are written back to the
 762 source variables from whence the original values came.
 763 In this case, the @var{dest_value} for every mapping must imply a value which
 764 has the same type as the @var{src_value}.
 765 For example, if the source value is a string value, it is not permissible for
 766 @var{dest_value} to be @samp{SYSMIS} or another forms which implies a numeric
 767 result.
 768 It is also not permissible for @var{dest_value} to be  longer than the width
 769 of the source variable.
 770
 771 The following example two numeric variables @var{x} and @var{y} are recoded
 772 in place.
 773 Zero is recoded to 99, the values 1 to 10 inclusive are unchanged,
 774 values 1000 and higher are recoded to the system-missing value and all other
 775 values are changed to 999:
 776 @example
 777 recode @var{x} @var{y}
 778         (0 = 99)
 779         (1 THRU 10 = COPY)
 780         (1000 THRU HIGHEST = SYSMIS)
 781         (ELSE = 999).
 782 @end example
 783
 784 If @samp{INTO @var{dest_vars}} is given, then recoded values are written
 785 into the variables specified in @var{dest_vars}, which must therefore
 786  contain a list of valid variable names.
 787 The number of variables in @var{dest_vars} must be the same as the number
 788 of variables in @var{src_vars}
 789 and the respective order of the variables in @var{dest_vars} corresponds to
 790 the order of @var{src_vars}.
 791 That is to say, the recoded value whose
 792 original value came from the @var{n}th variable in @var{src_vars} is
 793 placed into the @var{n}th variable in @var{dest_vars}.
 794 The source variables are unchanged.
 795 If any mapping implies a string as its destination value, then the respective
 796 destination variable must already exist, or
 797 have been declared using @cmd{STRING} or another transformation.
 798 Numeric variables however are automatically created if they don't already
 799 exist.
 800 The following example deals with two source variables, @var{a} and @var{b}
 801 which contain string values.  Hence there are two destination variables
 802 @var{v1} and @var{v2}.
 803 Any cases where @var{a} or @var{b} contain the values @samp{apple},
 804 @samp{pear} or @samp{pomegranate} result in @var{v1} or @var{v2} being
 805 filled with the string @samp{fruit} whilst cases with
 806 @samp{tomato}, @samp{lettuce} or @samp{carrot} result in @samp{vegetable}.
 807 Any other values produce the result @samp{unknown}:
 808 @example
 809 string @var{v1} (a20).
 810 string @var{v2} (a20).
 811
 812 recode @var{a} @var{b}
 813         ("apple" "pear" "pomegranate" = "fruit")
 814         ("tomato" "lettuce" "carrot" = "vegetable")
 815         (ELSE = "unknown")
 816         into @var{v1} @var{v2}.
 817 @end example
 818
 819 There is one very special mapping, not mentioned above.
 820 If the source variable is a string variable
 821 then a mapping may be specified as @samp{(CONVERT)}.
 822 This mapping, if it appears must be the last mapping given and
 823 the @samp{INTO @var{dest_vars}} clause must also be given and
 824 must not refer to a string variable.
 825 @samp{CONVERT} causes a number specified as a string to
 826 be converted to a numeric value.
 827 For example it converts the string @samp{"3"} into the numeric
 828 value 3 (note that it does not convert @samp{three} into 3).
 829 If the string cannot be parsed as a number, then the system-missing value
 830 is assigned instead.
 831 In the following example, cases where the value of @var{x} (a string variable)
 832 is the empty string, are recoded to 999 and all others are converted to the
 833 numeric equivalent of the input value.  The results are placed into the
 834 numeric variable @var{y}:
 835 @example
 836 recode @var{x}
 837        ("" = 999)
 838         (convert)
 839         into @var{y}.
 840 @end example
 841
 842 It is possible to specify multiple recodings on a single command.
 843 Introduce additional recodings with a slash (@samp{/}) to
 844 separate them from the previous recodings:
 845 @example
 846 recode
 847         @var{a}  (2 = 22) (else = 99)
 848         /@var{b} (1 = 3) into @var{z}
 849         .
 850 @end example
 851 @noindent Here we have two recodings. The first affects the source variable
 852 @var{a} and recodes in-place the value 2 into 22 and all other values to 99.
 853 The second recoding copies the values of @var{b} into the variable @var{z},
 854 changing any instances of 1 into 3.
 855
 856 @node SORT CASES
 857 @section SORT CASES
 858 @vindex SORT CASES
 859
 860 @display
 861 SORT CASES BY @var{var_list}[(@{D|A@}] [ @var{var_list}[(@{D|A@}] ] ...
 862 @end display
 863
 864 @cmd{SORT CASES} sorts the active dataset by the values of one or more
 865 variables.
 866
 867 Specify @subcmd{BY} and a list of variables to sort by.  By default, variables
 868 are sorted in ascending order.  To override sort order, specify @subcmd{(D)} or
 869 @subcmd{(DOWN)} after a list of variables to get descending order, or @subcmd{(A)}
 870 or @subcmd{(UP)}
 871 for ascending order.  These apply to all the listed variables
 872 up until the preceding @subcmd{(A)}, @subcmd{(D)}, @subcmd{(UP)} or @subcmd{(DOWN)}.
 873
 874 The sort algorithms used by @cmd{SORT CASES} are stable.  This means
 875 records which have equal values of the sort variables have the
 876 same relative order before and after sorting.  Thus,
 877 re-sorting an already sorted file does not affect the ordering of
 878 cases.
 879
 880 @cmd{SORT CASES} is a procedure.  It causes the data to be read.
 881
 882 @cmd{SORT CASES} attempts to sort the entire active dataset in main memory.
 883 If workspace is exhausted, it falls back to a merge sort algorithm which
 884 creates numerous temporary files.
 885
 886 @cmd{SORT CASES} may not be specified following @cmd{TEMPORARY}.
 887
 888 @subsection Sorting Example
 889
 890 In @ref{sort-cases:ex} the data from the file @file {physiology.sav} is sorted
 891 by two variables, @i{viz@:} @exvar{sex} in descending order and @exvar{temperature} in
 892 ascending order.
 893
 894 @float Example, sort-cases:ex
 895 @psppsyntax {sort-cases.sps}
 896 @caption {Sorting cases  by two variables.}
 897 @end float
 898
 899 In @ref{sort-cases:res} you can see that all the cases with a @exvar{sex} of
 900 @samp{1} (female) appear before those with a sex of @samp{0} (male).
 901 This is because they have been sorted in descending order.
 902 Within each sex, the data is sorted on the @exvar{temperature} variable,
 903 this time in ascending order.
 904
 905 @float Results, sort-cases:res
 906 @psppoutput {sort-cases}
 907 @caption {The @file{physiology.sav} file after sorting.}
 908 @end float
 909
 910 Note that @cmd{SORT CASES}, like all other transformations, affects only the active file.
 911 It does not have any effect upon the @file{physiology.sav} file itself. For that, you
 912 would have to rewrite the file using the @cmd{SAVE} command (@pxref{SAVE}).
 913
 914 When using the graphic user interface, it is often simpler to perform a sort
 915 directly from the data view.
 916 To do this, switch to the data view.  Select the column corresponding to the
 917 variable by which you want to sort and click button 1 and then click button 3.
 918 A popup menu will appear like that shown in @ref{sort-simple:scr}.  Select
 919 either ``Sort Ascending'' or ``Sort Descending'' from this menu.
 920
 921 @float Screenshot, sort-simple:scr
 922 @psppimage {sort-simple}
 923 @caption {Sorting the data on a single variable @exvar{height}}
 924 @end float
 925
 926 However, sometimes you will want to sort on two or more variables, and that is
 927 not possible using this method.  In this case, you must either use some code or
 928 the ``Sort Cases'' dialog from the Data menu.  @ref{sort:scr} shows the dialog
 929 box set up to perform a sort on both @exvar{sex} and @exvar{height}.
 930 Note that the order in which you enter the variables is important.  In this case,
 931 the data will be first sorted on @exvar{sex}, and then all cases for which @exvar{sex}
 932 is the same will then be sorted by @exvar{height}.
 933
 934 @float Screenshot, sort:scr
 935 @psppimage {sort}
 936 @caption {Sorting the data on two variables @exvar{sex} and  @exvar{height}}
 937 @end float