1 @c PSPP - a program for statistical analysis.
2 @c Copyright (C) 2017, 2020 Free Software Foundation, Inc.
3 @c Permission is granted to copy, distribute and/or modify this document
4 @c under the terms of the GNU Free Documentation License, Version 1.3
5 @c or any later version published by the Free Software Foundation;
6 @c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
7 @c A copy of the license is included in the section entitled "GNU
8 @c Free Documentation License".
10 @node Data Manipulation
11 @chapter Data transformations
12 @cindex transformations
14 The @pspp{} procedures examined in this chapter manipulate data and
15 prepare the active dataset for later analyses. They do not produce output,
19 * AGGREGATE:: Summarize multiple cases into a single case.
20 * AUTORECODE:: Automatic recoding of variables.
21 * COMPUTE:: Assigning a variable a calculated value.
22 * COUNT:: Counting variables with particular values.
23 * FLIP:: Exchange variables with cases.
24 * IF:: Conditionally assigning a calculated value.
25 * RECODE:: Mapping values from one set to another.
26 * SORT CASES:: Sort the active dataset.
35 [OUTFILE=@{*,'@var{file_name}',@var{file_handle}@} [MODE=@{REPLACE,ADDVARIABLES@}]]
39 [/BREAK=@var{var_list}]
40 /@var{dest_var}['@var{label}']@dots{}=@var{agr_func}(@var{src_vars}[, @var{args}]@dots{})@dots{}
43 @cmd{AGGREGATE} summarizes groups of cases into single cases.
44 It divides cases into groups that have the same values for one or more
45 variables called @dfn{break variables}. Several functions are available
46 for summarizing case contents.
48 The @cmd{AGGREGATE} syntax consists of subcommands to control its
49 behavior, all of which are optional, followed by one or more
50 destination variable assigments, each of which uses an aggregation
51 function to define how it is calculated.
53 The @subcmd{OUTFILE} subcommand, which must be first, names the
54 destination for @cmd{AGGREGATE} output. It may name a system file by
55 file name or file handle (@pxref{File Handles}), a dataset by its name
56 (@pxref{Datasets}), or @samp{*} to replace the active dataset.
57 @cmd{AGGREGATE} writes its output to this file.
59 With @subcmd{OUTFILE=*} only, @code{MODE} may be specified immediately
60 afterward with the value @code{ADDVARIABLES} or @code{REPLACE}:
64 With @code{REPLACE}, the default, the active dataset is replaced by a new dataset
65 which contains just the break variables and the destination varibles.
66 The new file contains as many cases as there are
67 unique combinations of the break variables.
70 With @code{ADDVARIABLES}, the destination variables are added to those in
71 the existing active dataset.
72 Cases that have the same combination of values in their break
73 variables receive identical values for the destination variables.
74 The number of cases in the active dataset remains unchanged.
76 sorted on the break variables, that is, @code{ADDVARIABLES} implies @code{PRESORTED}
79 If @code{OUTFILE} is omitted, @cmd{AGGREGATE} acts as if
80 @code{OUTFILE=* MODE=ADDVARIABLES} were specified.
82 By default, @cmd{AGGREGATE} first sorts the data on the break variables.
83 If the active dataset is already sorted
84 or grouped by the break variables, specify
85 @subcmd{PRESORTED} to save time.
86 With @subcmd{MODE=ADDVARIABLES}, the data must be pre-sorted.
88 Specify @subcmd{DOCUMENT} to copy the documents from the active dataset into the
89 aggregate file (@pxref{DOCUMENT}). Otherwise, the aggregate file does
90 not contain any documents, even if the aggregate file replaces the
93 Normally, @code{AGGREGATE} produces a non-missing value whenever there
94 is enough non-missing data for the aggregation function in use, that
95 is, just one non-missing value or, for the @code{SD} and @code{SD.}
96 aggregation functions, two non-missing values. Specify
97 @code{/MISSING=COLUMNWISE} to make @code{AGGREGATE} output a missing
98 value when one or more of the input values are missing.
100 The @subcmd{BREAK} subcommand is optionally but usually present. On
101 @subcmd{BREAK}, list the variables used to divide the active dataset
102 into groups to be summarized.
104 @cmd{AGGREGATE} is particular about the order of subcommands.
105 @subcmd{OUTFILE} must be first, followed by @subcmd{MISSING}.
106 @subcmd{PRESORTED} and @subcmd{DOCUMENT} follow @subcmd{MISSING}, in
107 either order, followed by @subcmd{BREAK}, then followed by aggregation
108 variable specifications.
110 At least one set of aggregation variables is required. Each set
111 comprises a list of aggregation variables, an equals sign (@samp{=}),
112 the name of an aggregation function (see the list below), and a list
113 of source variables in parentheses. A few aggregation functions do
114 not accept source variables, and some aggregation functions expect
115 additional arguments after the source variable names.
117 @cmd{AGGREGATE} typically creates aggregation variables with no
118 variable label, value labels, or missing values. Their default print
119 and write formats depend on the aggregation function used, with
120 details given in the table below. A variable label for an aggregation
121 variable may be specified just after the variable's name in the
122 aggregation variable list.
124 Each set must have exactly as many source variables as aggregation
125 variables. Each aggregation variable receives the results of applying
126 the specified aggregation function to the corresponding source
129 The following aggregation functions may be applied only to numeric
133 @item @subcmd{MEAN(@var{var_name}@dots{})}
134 Arithmetic mean. Limited to numeric values. The default format is
137 @item @subcmd{MEDIAN(@var{var_name}@dots{})}
138 The median value. Limited to numeric values. The default format is F8.2.
140 @item @subcmd{SD(@var{var_name}@dots{})}
141 Standard deviation of the mean. Limited to numeric values. The
142 default format is F8.2.
144 @item @subcmd{SUM(@var{var_name}@dots{})}
145 Sum. Limited to numeric values. The default format is F8.2.
148 These aggregation functions may be applied to numeric and string variables:
151 @item @subcmd{CGT(@var{var_name}@dots{}, @var{value})}
152 @itemx @subcmd{CLT(@var{var_name}@dots{}, @var{value})}
153 @itemx @subcmd{CIN(@var{var_name}@dots{}, @var{low}, @var{high})}
154 @itemx @subcmd{COUT(@var{var_name}@dots{}, @var{low}, @var{high})}
155 Total weight of cases greater than or less than @var{value} or inside
156 or outside the closed range [@var{low},@var{high}], respectively. The
157 default format is F5.3.
159 @item @subcmd{FGT(@var{var_name}@dots{}, @var{value})}
160 @itemx @subcmd{FLT(@var{var_name}@dots{}, @var{value})}
161 @itemx @subcmd{FIN(@var{var_name}@dots{}, @var{low}, @var{high})}
162 @itemx @subcmd{FOUT(@var{var_name}@dots{}, @var{low}, @var{high})}
163 Fraction of values greater than or less than @var{value} or inside or
164 outside the closed range [@var{low},@var{high}], respectively. The
165 default format is F5.3.
167 @item @subcmd{FIRST(@var{var_name}@dots{})}
168 @itemx @subcmd{LAST(@var{var_name}@dots{})}
169 First or last non-missing value, respectively, in break group. The
171 receives the complete dictionary information from the source variable.
172 The sort performed by @cmd{AGGREGATE} (and by @cmd{SORT CASES}) is stable.
174 the first (or last) case with particular values for the break variables before
175 sorting is also the first (or last) case in that break group after sorting.
177 @item @subcmd{MIN(@var{var_name}@dots{})}
178 @itemx @subcmd{MAX(@var{var_name}@dots{})}
179 Minimum or maximum value, respectively. The aggregation variable
180 receives the complete dictionary information from the source variable.
182 @item @subcmd{N(@var{var_name}@dots{})}
183 @itemx @subcmd{NMISS(@var{var_name}@dots{})}
184 Total weight of non-missing or missing values, respectively. The
185 default format is F7.0 if weighting is not enabled, F8.2 if it is
188 @item @subcmd{NU(@var{var_name}@dots{})}
189 @itemx @subcmd{NUMISS(@var{var_name}@dots{})}
190 Count of non-missing or missing values, respectively, ignoring case
191 weights. The default format is F7.0.
193 @item @subcmd{PGT(@var{var_name}@dots{}, @var{value})}
194 @itemx @subcmd{PLT(@var{var_name}@dots{}, @var{value})}
195 @itemx @subcmd{PIN(@var{var_name}@dots{}, @var{low}, @var{high})}
196 @itemx @subcmd{POUT(@var{var_name}@dots{}, @var{low}, @var{high})}
197 Percentage between 0 and 100 of values greater than or less than
198 @var{VALUE} or inside or outside the closed range
199 [@var{low},@var{high}], respectively. The default format is F5.1.
202 These aggregation functions do not accept source variables:
206 Total weight of cases aggregated to form this group. The default
207 format is F7.0 if weighting is not enabled, F8.2 if it is
211 Count of cases aggregated to form this group, ignoring case weights.
212 The default format is F7.0.
215 Aggregation functions compare string values in terms of internal
217 On most modern computers, this is @acronym{ASCII} or a superset thereof.
219 The aggregation functions listed above exclude all user-missing values
220 from calculations. To include user-missing values, insert a period
221 (@samp{.}) at the end of the function name. (@i{e.g.}@: @samp{SUM.}).
222 (Be aware that specifying such a function as the last token on a line
223 causes the period to be interpreted as the end of the command.)
225 @cmd{AGGREGATE} both ignores and cancels the current @cmd{SPLIT FILE}
226 settings (@pxref{SPLIT FILE}).
228 @subsection Aggregate Example
230 The @file{personnel.sav} dataset provides the occupations and salaries of
231 many individuals. For many purposes however such detailed information is
232 not interesting, but often the aggregated statistics of each occupation are
233 of interest. In @ref{aggregate:ex} the @cmd{AGGREGATE} command is used
234 to calculate the mean, the median and the standard deviation of each
237 @float Example, aggregate:ex
238 @psppsyntax {aggregate.sps}
239 @caption {Calculating aggregated statistics from the @file{personnel.sav} file.}
242 Since we chose the @samp{MODE=REPLACE} option, in @ref{aggregate:res} cases
243 for the individual persons are no longer present. They have each been replaced
244 by a single case per aggregated value.
246 @float Results, aggregate:res
247 @psppoutput {aggregate}
248 @caption {Aggregated mean, median and standard deviation per @exvar{occupation}.}
251 Note that some values for the standard deviation are blank.
252 This is because there is only one case with the respective
260 AUTORECODE VARIABLES=@var{src_vars} INTO @var{dest_vars}
264 [ /BLANK = @{VALID, MISSING@} ]
267 The @cmd{AUTORECODE} procedure considers the @var{n} values that a variable
268 takes on and maps them onto values 1@dots{}@var{n} on a new numeric
271 Subcommand @subcmd{VARIABLES} is the only required subcommand and must come
272 first. Specify @subcmd{VARIABLES}, an equals sign (@samp{=}), a list of source
273 variables, @subcmd{INTO}, and a list of target variables. There must the same
274 number of source and target variables. The target variables must not
277 @cmd{AUTORECODE} ordinarily assigns each increasing non-missing value
278 of a source variable (for a string, this is based on character code
279 comparisons) to consecutive values of its target variable. For
280 example, the smallest non-missing value of the source variable is
281 recoded to value 1, the next smallest to 2, and so on. If the source
282 variable has user-missing values, they are recoded to
283 consecutive values just above the non-missing values. For example, if
284 a source variables has seven distinct non-missing values, then the
285 smallest missing value would be recoded to 8, the next smallest to 9,
288 Use @subcmd{DESCENDING} to reverse the sort order for non-missing
289 values, so that the largest non-missing value is recoded to 1, the
290 second-largest to 2, and so on. Even with @subcmd{DESCENDING},
291 user-missing values are still recoded in ascending order just above
292 the non-missing values.
294 The system-missing value is always recoded into the system-missing
295 variable in target variables.
297 If a source value has a value label, then that value label is retained
298 for the new value in the target variable. Otherwise, the source value
299 itself becomes each new value's label.
301 Variable labels are copied from the source to target variables.
303 @subcmd{PRINT} is currently ignored.
305 The @subcmd{GROUP} subcommand is relevant only if more than one variable is to be
306 recoded. It causes a single mapping between source and target values to
307 be used, instead of one map per variable. With @subcmd{GROUP},
308 user-missing values are taken from the first source variable that has
309 any user-missing values.
311 If @subcmd{/BLANK=MISSING} is given, then string variables which contain only
312 whitespace are recoded as SYSMIS. If @subcmd{/BLANK=VALID} is specified then they
313 are allocated a value like any other. @subcmd{/BLANK} is not relevant
314 to numeric values. @subcmd{/BLANK=VALID} is the default.
316 @cmd{AUTORECODE} is a procedure. It causes the data to be read. It
317 ignores @cmd{TEMPORARY} (@pxref{TEMPORARY}), so that ``temporary''
318 transformations become permanent.
321 @subsection Autorecode Example
323 In the file @file{personnel.sav}, the variable @exvar{occupation} is a string
324 variable. Except for data of a purely commentary nature, string variables
325 are generally a bad idea. One reason is that data entry errors are easily
326 overlooked. This has happened in @file{personnel.sav}; one entry which should
327 read ``Scientist'' has been mistyped as ``Scrientist''. In @ref{autorecode:ex}
328 first, this error is corrected by the @cmd{DO IF} clause,
329 @footnote{One must use care when correcting such data input errors rather than
330 msimply marking them as missing. For example, if an occupation has been entered
331 ``Barister'', did the person mean ``Barrister'' or did she mean ``Barista''?}
332 then we use @cmd{AUTORECODE} to
333 create a new numeric variable which takes recoded values of @exvar{occupation}.
334 Finally, we remove the old variable and rename the new variable to
335 the name of the old variable.
337 @float Example, autorecode:ex
338 @psppsyntax {autorecode.sps}
339 @caption {Changing a string variable to a numeric variable using @cmd{AUTORECODE}
340 after correcting a data entry error}
344 @float Screenshot, autorecode:scr
345 @psppimage {autorecode}
346 @caption {Autorecode dialog box set to recode @exvar{occupation} to @exvar{occ}}
349 Notice in @ref{autorecode:res}, how the new variable has been automatically
350 allocated value labels which correspond to the strings of the old variable.
351 This means that in future analyses the descriptive strings are reported instead
352 of the numeric values.
354 @float Result, autorecode:res
355 @psppoutput {autorecode}
356 @caption {The properties of the @exvar{occupation} variable following @cmd{AUTORECODE}}
365 COMPUTE @var{variable} = @var{expression}.
369 COMPUTE vector(@var{index}) = @var{expression}.
372 @cmd{COMPUTE} assigns the value of an expression to a target
373 variable. For each case, the expression is evaluated and its value
374 assigned to the target variable. Numeric and string
375 variables may be assigned. When a string expression's width differs
376 from the target variable's width, the string result of the expression
377 is truncated or padded with spaces on the right as necessary. The
378 expression and variable types must match.
380 For numeric variables only, the target variable need not already
381 exist. Numeric variables created by @cmd{COMPUTE} are assigned an
382 @code{F8.2} output format. String variables must be declared before
383 they can be used as targets for @cmd{COMPUTE}.
385 The target variable may be specified as an element of a vector
386 (@pxref{VECTOR}). In this case, an expression @var{index} must be
387 specified in parentheses following the vector name. The expression @var{index}
388 must evaluate to a numeric value that, after rounding down
389 to the nearest integer, is a valid index for the named vector.
391 Using @cmd{COMPUTE} to assign to a variable specified on @cmd{LEAVE}
392 (@pxref{LEAVE}) resets the variable's left state. Therefore,
393 @code{LEAVE} should be specified following @cmd{COMPUTE}, not before.
395 @cmd{COMPUTE} is a transformation. It does not cause the active dataset to be
398 When @cmd{COMPUTE} is specified following @cmd{TEMPORARY}
399 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
402 @subsection Compute Examples
404 The dataset @file{physiology.sav} contains the height and weight of persons.
405 For some purposes, neither height nor weight alone is of interest.
406 Epidemiologists are often more interested in the @dfn{body mass index} which
407 can sometimes be used as a predictor for clinical conditions.
408 The body mass index is defined as the weight of the person in kilograms divided
409 by the square of the person's height in metres.
410 @footnote{Since BMI is a quantity with a ratio scale and has units, the term ``index''
411 is a misnomer, but that is what it is called.}
413 @float Example, bmi:ex
414 @psppsyntax {compute.sps}
415 @caption {Computing the body mass index from @exvar{weight} and @exvar{height}}
418 @ref{bmi:ex} shows how you can use @cmd{COMPUTE} to generate a new variable called
419 @exvar{bmi} and have every case's value calculated from the existing values of
420 @exvar{weight} and @exvar{height}.
421 It also shows how you can add a label to this new variable (@pxref{VARIABLE LABELS}),
422 so that a more descriptive label appears in subsequent analyses, and this can be seen
423 in the ouput from the @cmd{DESCRIPTIVES} command in @ref{bmi:res}.
425 @float Screenshot, bmi:scr
427 @caption {Using the dialog box to generate a new variable and compute its values}
430 The expression which follows the @samp{=} sign can be as complicated as necessary.
431 @xref{Expressions} for a precise description of the language accepted.
432 Normally it is easiest to enter the code directly, however there is a dialog box
433 available if desired. This is illustrated in @ref{bmi:scr}.
434 One advantage is that it offers a list of mathematical
435 functions which can be selected and pasted into the expression.
437 @float Results, bmi:res
438 @psppoutput {compute}
439 @caption {An analysis which includes @exvar{bmi} in its results}
449 COUNT @var{var_name} = @var{var}@dots{} (@var{value}@dots{})
450 [/@var{var_name} = @var{var}@dots{} (@var{value}@dots{})]@dots{}
452 Each @var{value} takes one of the following forms:
455 @var{num1} THRU @var{num2}
458 where @var{num1} is a numeric expression or the words @subcmd{LO} or @subcmd{LOWEST}
459 and @var{num2} is a numeric expression or @subcmd{HI} or @subcmd{HIGHEST}.
462 @cmd{COUNT} creates or replaces a numeric @dfn{target} variable that
463 counts the occurrence of a @dfn{criterion} value or set of values over
464 one or more @dfn{test} variables for each case.
466 The target variable values are always nonnegative integers. They are
467 never missing. The target variable is assigned an F8.2 output format.
468 @xref{Input and Output Formats}. Any variables, including
469 string variables, may be test variables.
471 User-missing values of test variables are treated just like any other
472 values. They are @strong{not} treated as system-missing values.
473 User-missing values that are criterion values or inside ranges of
474 criterion values are counted as any other values. However (for numeric
475 variables), keyword @subcmd{MISSING} may be used to refer to all system-
476 and user-missing values.
478 @cmd{COUNT} target variables are assigned values in the order
479 specified. In the command @subcmd{COUNT @var{A}=@var{A} @var{B}(1) /@var{B}=@var{A} @var{B}(2).}, the
480 following actions occur:
484 The number of occurrences of 1 between @var{A} and @var{B} is counted.
487 @var{A} is assigned this value.
490 The number of occurrences of 1 between @var{B} and the @strong{new}
491 value of @var{A} is counted.
494 @var{B} is assigned this value.
497 Despite this ordering, all @cmd{COUNT} criterion variables must exist
498 before the procedure is executed---they may not be created as target
499 variables earlier in the command! Break such a command into two
502 @subsection Count Examples
504 In the survey results in dataset @file{hotel.sav} a manager wishes
505 to know how many respondents answered with low valued answers to questions
506 @exvar{v1}, @exvar{v2} and @exvar{v3}. This can be found using the code
507 in @ref{count:ex}. Specifically, this code creates a new variable, and
508 populates it with the number of values in @exvar{v1}--@exvar{v2} which
511 @float Example, count:ex
512 @psppsyntax {count.sps}
513 @caption {Counting low values to responses @exvar{v1}, @exvar{v2} and @exvar{v3}}
516 In @ref{count:ex} the @cmd{COUNT} transformation creates a new variable, @exvar{low_counts} and
517 its values are shown using the @cmd{LIST} command.
519 If using the graphic user interface, a two step process must be used to set
520 up the @cmd{COUNT} transformation. The first dialog box (@ref{count:scr}) provides for the
521 variables to be chosen.
522 Then, one must click on the button marked ``Define Values...'' to reveal
523 the dialog box for selecting the values to count.
525 @float Screenshot, count:scr
527 @caption {The variables @exvar{v1}, @exvar{v2} and @exvar{v3} selected, ready
528 to define values to count}
531 In this dialog box, you must select the values you wish to count
532 --- in this case all values up to and including 2 --- as shown in @ref{count-define:scr}
533 and click ``Add''. As many ranges or may be added as you desire.
534 When all desired ranges have been added click ``Continue''.
536 @float Screenshot, count-define:scr
537 @psppimage {count-define}
538 @caption {Count ``Define Values'' dialog with @samp{lowest thru 2} selected}
541 In @ref{count:res} we can see the values of @exvar{low_counts} after the @cmd{COUNT}
542 transformation has completed. The first value is 1, because there is only one
543 variable amoung @exvar{v1}, @exvar{v2} and @exvar{3} which has a value of 2 or less.
544 The second value is 2, because both @exvar{v1} and @exvar{v2} are 2 or less.
546 @float Result, count:res
548 @caption {The values of @exvar{v1}, @exvar{v2}, @exvar{v3} and @exvar{low_counts} after
549 the @cmd{COUNT} transformation has run}
558 FLIP /VARIABLES=@var{var_list} /NEWNAMES=@var{var_name}.
561 @cmd{FLIP} transposes rows and columns in the active dataset. It
562 causes cases to be swapped with variables, and vice versa.
564 All variables in the transposed active dataset are numeric. String
565 variables take on the system-missing value in the transposed file.
567 @subcmd{N} subcommands are required. If specified, the @subcmd{VARIABLES} subcommand
568 selects variables to be transformed into cases, and variables not
569 specified are discarded. If the @subcmd{VARIABLES} subcommand is omitted, all
570 variables are selected for transposition.
572 The variables specified by @subcmd{NEWNAMES}, which must be a
574 used to give names to the variables created by @cmd{FLIP}. Only the
575 first 8 characters of the variable are used. If
576 @subcmd{NEWNAMES} is not
577 specified then the default is a variable named @exvar{CASE_LBL}, if it exists.
578 If it does not then the variables created by @cmd{FLIP} are named VAR000
579 through VAR999, then VAR1000, VAR1001, and so on.
581 When a @subcmd{NEWNAMES} variable is available, the names must be canonicalized
582 before becoming variable names. Invalid characters are replaced by
583 letter @samp{V} in the first position, or by @samp{_} in subsequent
584 positions. If the name thus generated is not unique, then numeric
585 extensions are added, starting with 1, until a unique name is found or
586 there are no remaining possibilities. If the latter occurs then the
587 @cmd{FLIP} operation aborts.
589 The resultant dictionary contains a @exvar{CASE_LBL} variable, a string
590 variable of width 8, which stores the names of the variables in the
591 dictionary before the transposition. Variables names longer than 8
592 characters are truncated. If @cmd{FLIP} is called again on
593 this dataset, the @exvar{CASE_LBL} variable can be passed to the @subcmd{NEWNAMES}
594 subcommand to recreate the original variable names.
596 @cmd{FLIP} honors @cmd{N OF CASES} (@pxref{N OF CASES}). It ignores
597 @cmd{TEMPORARY} (@pxref{TEMPORARY}), so that ``temporary''
598 transformations become permanent.
600 @subsection Flip Examples
603 In @ref{flip:ex}, data has been entered using @cmd{DATA LIST} (@pxref{DATA LIST})
604 such that the first variable in the dataset is a string variable containing
605 a description of the other data for the case.
606 Clearly this is not a convenient arrangement for performing statistical analyses,
607 so it would have been better to think a little more carefully about how the data
608 should have been arranged.
609 However often the data is provided by some third party source, and you have
610 no control over the form.
611 Fortunately, we can use @cmd{FLIP} to exchange the variables
612 and cases in the active dataset.
614 @float Example, flip:ex
615 @psppsyntax {flip.sps}
616 @caption {Using @cmd{FLIP} to exchange variables and cases in a dataset}
619 As you can see in @ref{flip:res} before the @cmd{FLIP} command has run there
620 are seven variables (six containing data and one for the heading) and three cases.
621 Afterwards there are four variables (one per case, plus the @exvar{CASE_LBL} variable)
623 You can delete the @exvar{CASE_LBL} variable (@pxref{DELETE VARIABLES}) if you don't need it.
625 @float Results, flip:res
627 @caption {The results of using @cmd{FLIP} to exchange variables and cases in a dataset}
636 IF @var{condition} @var{variable}=@var{expression}.
640 IF @var{condition} vector(@var{index})=@var{expression}.
643 The @cmd{IF} transformation conditionally assigns the value of a target
644 expression to a target variable, based on the truth of a test
647 Specify a boolean-valued expression (@pxref{Expressions}) to be tested
648 following the @cmd{IF} keyword. This expression is evaluated for each case.
649 If the value is true, then the value of the expression is computed and
650 assigned to the specified variable. If the value is false or missing,
651 nothing is done. Numeric and string variables may be
652 assigned. When a string expression's width differs from the target
653 variable's width, the string result of the expression is truncated or
654 padded with spaces on the right as necessary. The expression and
655 variable types must match.
657 The target variable may be specified as an element of a vector
658 (@pxref{VECTOR}). In this case, a vector index expression must be
659 specified in parentheses following the vector name. The index
660 expression must evaluate to a numeric value that, after rounding down
661 to the nearest integer, is a valid index for the named vector.
663 Using @cmd{IF} to assign to a variable specified on @cmd{LEAVE}
664 (@pxref{LEAVE}) resets the variable's left state. Therefore,
665 @code{LEAVE} should be specified following @cmd{IF}, not before.
667 When @cmd{IF} is specified following @cmd{TEMPORARY}
668 (@pxref{TEMPORARY}), the @cmd{LAG} function may not be used
675 The @cmd{RECODE} command is used to transform existing values into other,
676 user specified values.
680 RECODE @var{src_vars}
681 (@var{src_value} @var{src_value} @dots{} = @var{dest_value})
682 (@var{src_value} @var{src_value} @dots{} = @var{dest_value})
683 (@var{src_value} @var{src_value} @dots{} = @var{dest_value}) @dots{}
684 [INTO @var{dest_vars}].
687 Following the @cmd{RECODE} keyword itself comes @var{src_vars} which is a list
688 of variables whose values are to be transformed.
689 These variables may be string variables or they may be numeric.
690 However the list must be homogeneous; you may not mix string variables and
691 numeric variables in the same recoding.
693 After the list of source variables, there should be one or more @dfn{mappings}.
694 Each mapping is enclosed in parentheses, and contains the source values and
695 a destination value separated by a single @samp{=}.
696 The source values are used to specify the values in the dataset which
697 need to change, and the destination value specifies the new value
698 to which they should be changed.
699 Each @var{src_value} may take one of the following forms:
702 If the source variables are numeric then @var{src_value} may be a literal
705 If the source variables are string variables then @var{src_value} may be a
706 literal string (like all strings, enclosed in single or double quotes).
707 @item @var{num1} THRU @var{num2}
708 This form is valid only when the source variables are numeric.
709 It specifies all values in the range between @var{num1} and @var{num2},
710 including both endpoints of the range. By convention, @var{num1}
711 should be less than @var{num2}.
712 Open-ended ranges may be specified using @samp{LO} or @samp{LOWEST}
714 or @samp{HI} or @samp{HIGHEST} for @var{num2}.
716 The literal keyword @samp{MISSING} matches both system missing and user
718 It is valid for both numeric and string variables.
720 The literal keyword @samp{SYSMIS} matches system missing
722 It is valid for both numeric variables only.
724 The @samp{ELSE} keyword may be used to match any values which are
725 not matched by any other @var{src_value} appearing in the command.
726 If this keyword appears, it should be used in the last mapping of the
730 After the source variables comes an @samp{=} and then the @var{dest_value}.
731 The @var{dest_value} may take any of the following forms:
734 A literal numeric value to which the source values should be changed.
735 This implies the destination variable must be numeric.
737 A literal string value (enclosed in quotation marks) to which the source
738 values should be changed.
739 This implies the destination variable must be a string variable.
741 The keyword @samp{SYSMIS} changes the value to the system missing value.
742 This implies the destination variable must be numeric.
744 The special keyword @samp{COPY} means that the source value should not be
746 copied directly to the destination value.
747 This is meaningful only if @samp{INTO @var{dest_vars}} is specified.
750 Mappings are considered from left to right.
751 Therefore, if a value is matched by a @var{src_value} from more than
752 one mapping, the first (leftmost) mapping which matches is considered.
753 Any subsequent matches are ignored.
755 The clause @samp{INTO @var{dest_vars}} is optional.
756 The behaviour of the command is slightly different depending on whether it
759 If @samp{INTO @var{dest_vars}} does not appear, then values are recoded
761 This means that the recoded values are written back to the
762 source variables from whence the original values came.
763 In this case, the @var{dest_value} for every mapping must imply a value which
764 has the same type as the @var{src_value}.
765 For example, if the source value is a string value, it is not permissible for
766 @var{dest_value} to be @samp{SYSMIS} or another forms which implies a numeric
768 It is also not permissible for @var{dest_value} to be longer than the width
769 of the source variable.
771 The following example two numeric variables @var{x} and @var{y} are recoded
773 Zero is recoded to 99, the values 1 to 10 inclusive are unchanged,
774 values 1000 and higher are recoded to the system-missing value and all other
775 values are changed to 999:
777 recode @var{x} @var{y}
780 (1000 THRU HIGHEST = SYSMIS)
784 If @samp{INTO @var{dest_vars}} is given, then recoded values are written
785 into the variables specified in @var{dest_vars}, which must therefore
786 contain a list of valid variable names.
787 The number of variables in @var{dest_vars} must be the same as the number
788 of variables in @var{src_vars}
789 and the respective order of the variables in @var{dest_vars} corresponds to
790 the order of @var{src_vars}.
791 That is to say, the recoded value whose
792 original value came from the @var{n}th variable in @var{src_vars} is
793 placed into the @var{n}th variable in @var{dest_vars}.
794 The source variables are unchanged.
795 If any mapping implies a string as its destination value, then the respective
796 destination variable must already exist, or
797 have been declared using @cmd{STRING} or another transformation.
798 Numeric variables however are automatically created if they don't already
800 The following example deals with two source variables, @var{a} and @var{b}
801 which contain string values. Hence there are two destination variables
802 @var{v1} and @var{v2}.
803 Any cases where @var{a} or @var{b} contain the values @samp{apple},
804 @samp{pear} or @samp{pomegranate} result in @var{v1} or @var{v2} being
805 filled with the string @samp{fruit} whilst cases with
806 @samp{tomato}, @samp{lettuce} or @samp{carrot} result in @samp{vegetable}.
807 Any other values produce the result @samp{unknown}:
809 string @var{v1} (a20).
810 string @var{v2} (a20).
812 recode @var{a} @var{b}
813 ("apple" "pear" "pomegranate" = "fruit")
814 ("tomato" "lettuce" "carrot" = "vegetable")
816 into @var{v1} @var{v2}.
819 There is one very special mapping, not mentioned above.
820 If the source variable is a string variable
821 then a mapping may be specified as @samp{(CONVERT)}.
822 This mapping, if it appears must be the last mapping given and
823 the @samp{INTO @var{dest_vars}} clause must also be given and
824 must not refer to a string variable.
825 @samp{CONVERT} causes a number specified as a string to
826 be converted to a numeric value.
827 For example it converts the string @samp{"3"} into the numeric
828 value 3 (note that it does not convert @samp{three} into 3).
829 If the string cannot be parsed as a number, then the system-missing value
831 In the following example, cases where the value of @var{x} (a string variable)
832 is the empty string, are recoded to 999 and all others are converted to the
833 numeric equivalent of the input value. The results are placed into the
834 numeric variable @var{y}:
842 It is possible to specify multiple recodings on a single command.
843 Introduce additional recodings with a slash (@samp{/}) to
844 separate them from the previous recodings:
847 @var{a} (2 = 22) (else = 99)
848 /@var{b} (1 = 3) into @var{z}
851 @noindent Here we have two recodings. The first affects the source variable
852 @var{a} and recodes in-place the value 2 into 22 and all other values to 99.
853 The second recoding copies the values of @var{b} into the variable @var{z},
854 changing any instances of 1 into 3.
861 SORT CASES BY @var{var_list}[(@{D|A@}] [ @var{var_list}[(@{D|A@}] ] ...
864 @cmd{SORT CASES} sorts the active dataset by the values of one or more
867 Specify @subcmd{BY} and a list of variables to sort by. By default, variables
868 are sorted in ascending order. To override sort order, specify @subcmd{(D)} or
869 @subcmd{(DOWN)} after a list of variables to get descending order, or @subcmd{(A)}
871 for ascending order. These apply to all the listed variables
872 up until the preceding @subcmd{(A)}, @subcmd{(D)}, @subcmd{(UP)} or @subcmd{(DOWN)}.
874 The sort algorithms used by @cmd{SORT CASES} are stable. This means
875 records which have equal values of the sort variables have the
876 same relative order before and after sorting. Thus,
877 re-sorting an already sorted file does not affect the ordering of
880 @cmd{SORT CASES} is a procedure. It causes the data to be read.
882 @cmd{SORT CASES} attempts to sort the entire active dataset in main memory.
883 If workspace is exhausted, it falls back to a merge sort algorithm which
884 creates numerous temporary files.
886 @cmd{SORT CASES} may not be specified following @cmd{TEMPORARY}.
888 @subsection Sorting Example
890 In @ref{sort-cases:ex} the data from the file @file {physiology.sav} is sorted
891 by two variables, @i{viz@:} @exvar{sex} in descending order and @exvar{temperature} in
894 @float Example, sort-cases:ex
895 @psppsyntax {sort-cases.sps}
896 @caption {Sorting cases by two variables.}
899 In @ref{sort-cases:res} you can see that all the cases with a @exvar{sex} of
900 @samp{1} (female) appear before those with a sex of @samp{0} (male).
901 This is because they have been sorted in descending order.
902 Within each sex, the data is sorted on the @exvar{temperature} variable,
903 this time in ascending order.
905 @float Results, sort-cases:res
906 @psppoutput {sort-cases}
907 @caption {The @file{physiology.sav} file after sorting.}
910 Note that @cmd{SORT CASES}, like all other transformations, affects only the active file.
911 It does not have any effect upon the @file{physiology.sav} file itself. For that, you
912 would have to rewrite the file using the @cmd{SAVE} command (@pxref{SAVE}).
914 When using the graphic user interface, it is often simpler to perform a sort
915 directly from the data view.
916 To do this, switch to the data view. Select the column corresponding to the
917 variable by which you want to sort and click button 1 and then click button 3.
918 A popup menu will appear like that shown in @ref{sort-simple:scr}. Select
919 either ``Sort Ascending'' or ``Sort Descending'' from this menu.
921 @float Screenshot, sort-simple:scr
922 @psppimage {sort-simple}
923 @caption {Sorting the data on a single variable @exvar{height}}
926 However, sometimes you will want to sort on two or more variables, and that is
927 not possible using this method. In this case, you must either use some code or
928 the ``Sort Cases'' dialog from the Data menu. @ref{sort:scr} shows the dialog
929 box set up to perform a sort on both @exvar{sex} and @exvar{height}.
930 Note that the order in which you enter the variables is important. In this case,
931 the data will be first sorted on @exvar{sex}, and then all cases for which @exvar{sex}
932 is the same will then be sorted by @exvar{height}.
934 @float Screenshot, sort:scr
936 @caption {Sorting the data on two variables @exvar{sex} and @exvar{height}}