1 @node SPSS/PC+ System File Format
2 @appendix SPSS/PC+ System File Format
4 SPSS/PC+, first released in 1984, was a simplified version of SPSS for
5 IBM PC and compatible computers. It used a data file format related
6 to the one described in the previous chapter, but simplified and
7 incompatible. The SPSS/PC+ software became obsolete in the 1990s, so
8 files in this format are rarely encountered today. Nevertheless, for
9 completeness, and because it is not very difficult, it seems
10 worthwhile to support at least reading these files. This chapter
11 documents this format, based on examination of a corpus of about 60
12 files from a variety of sources.
14 System files use four data types: 8-bit characters, 16-bit unsigned
15 integers, 32-bit unsigned integers, and 64-bit floating-point numbers,
16 called here @code{char}, @code{uint16}, @code{uint32}, and @code{flt64},
17 respectively. Data is not necessarily aligned on a word or
20 SPSS/PC+ ran only on IBM PC and compatible computers. Therefore,
21 values in these files are always in little-endian byte order.
22 Floating-point numbers are always in IEEE 754 format.
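As a minimal sketch of what the byte order and lack of alignment mean for a reader, the following Python helpers decode the three numeric types at arbitrary byte offsets. The helper names are hypothetical and chosen only for illustration; the @code{<} prefix in the @code{struct} format strings selects little-endian order with no padding.

```python
import struct

# Hypothetical helper names; "<" means little-endian, standard sizes,
# and no alignment padding, so any byte offset is acceptable.
def read_uint16(buf, ofs):
    return struct.unpack_from("<H", buf, ofs)[0]

def read_uint32(buf, ofs):
    return struct.unpack_from("<I", buf, ofs)[0]

def read_flt64(buf, ofs):
    return struct.unpack_from("<d", buf, ofs)[0]

# A uint16, then a uint32 at offset 2, then a flt64 at offset 6.
sample = bytes.fromhex("0201" "04030201" "000000000000f03f")
```

Here @code{read_uint32(sample, 2)} and @code{read_flt64(sample, 6)} both read from offsets that are not multiples of the value's size, which is the situation a reader of these files should expect.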
24 SPSS/PC+ system files represent the system-missing value as -1.66e308,
25 or @code{f5 1e 26 02 8a 8c ed ff} expressed as hexadecimal. (This is
26 an unusual choice: it is close to, but not equal to, the most
27 negative finite 64-bit IEEE 754 value, which is about -1.8e308.)
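The byte sequence above can be decoded to check the stated magnitude. This is only a sketch: the names @code{SYSMIS_BYTES}, @code{SYSMIS}, and @code{is_sysmis} are hypothetical, and comparing raw bytes rather than floating-point values is a design choice made here, not something the format requires.

```python
import struct
import sys

# The eight bytes quoted above, in file (little-endian) order.
SYSMIS_BYTES = bytes.fromhex("f51e26028a8cedff")
SYSMIS = struct.unpack("<d", SYSMIS_BYTES)[0]

# Most negative finite double, for comparison (about -1.8e308).
LOWEST = -sys.float_info.max

def is_sysmis(raw):
    """Return True if an 8-byte field holds the system-missing pattern."""
    return bytes(raw) == SYSMIS_BYTES
```

Comparing the raw bytes avoids any dependence on how a decoded value is later rounded or printed.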
29 Text in SPSS/PC+ system files is encoded in ASCII-based 8-bit MS-DOS
30 codepages. The corpus used for investigating the format were all
33 An SPSS/PC+ system file begins with the following 256-byte directory:
48 Always set to 2 and 0, respectively.
50 These fields could be used as a signature for the file format, but the
51 @code{product} field in record 0 seems more likely to be unique
52 (@pxref{Record 0 Main Header Record}).
54 @item struct @{ @dots{} @} records[15];
55 Each of the elements in this array identifies a record in the system
56 file. The @code{ofs} is a byte offset, from the beginning of the
57 file, that identifies the start of the record. @code{len} specifies
58 the length of the record, in bytes. Many records are optional or not
59 used. If a record is not present, @code{ofs} and @code{len} for that
60 record are both zero.
62 @item char filename[128];
63 In most files in the corpus, this field is entirely filled with
64 spaces. In one file, it contains a file name, followed by a null
65 byte, followed by spaces to fill the remainder of the field. The
69 The following sections describe the contents of each record,
70 identified by the index into the @code{records} array.
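The directory lookup can be sketched in Python as follows. The layout used here is an assumption derived from the sizes given above: two leading 32-bit fields, then 15 pairs of 32-bit @code{ofs} and @code{len} values, then the 128-byte @code{filename}, which adds up to 8 + 120 + 128 = 256 bytes. The function name and the returned tuple are hypothetical.

```python
import struct

DIRECTORY_SIZE = 256
N_RECORDS = 15

def parse_directory(data):
    """Split the 256-byte directory into its parts.

    Assumed layout: two uint32 fields, 15 (ofs, len) uint32 pairs,
    and a 128-byte filename field.
    """
    if len(data) < DIRECTORY_SIZE:
        raise ValueError("file too short for a directory")
    first, second = struct.unpack_from("<II", data, 0)
    records = []
    for i in range(N_RECORDS):
        ofs, length = struct.unpack_from("<II", data, 8 + 8 * i)
        # A record that is not present has both fields set to zero.
        if ofs == 0 and length == 0:
            records.append(None)
        else:
            records.append((ofs, length))
    filename = bytes(data[128:256])
    return first, second, records, filename
```

A reader would then locate, say, the main header record through @code{records[0]} rather than assuming a fixed offset.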
73 * Record 0 Main Header Record::
74 * Record 1 Variables Record::
75 * Record 2 Labels Record::
76 * Record 3 Data Record::
77 * Records 4 and 5 Data Entry::
80 @node Record 0 Main Header Record
81 @section Record 0: Main Header Record
83 All files in the corpus have this record at offset 0x100 with length
84 0xb0 (but readers should find this record, like the others, via the
85 @code{records} table in the directory). Its format is:
95 uint16 nominal_case_size;
101 char creation_date[8];
102 char creation_time[8];
117 It seems likely that one of these variables is set to 1 if weighting
118 is enabled, but none of the files in the corpus is weighted.
120 @item char product[62];
121 Name of the program that created the file. Only the following unique
122 values have been observed, in each case padded on the right with
126 DESPSS/PC+ System File Written by Data Entry II
127 PCSPSS SYSTEM FILE. IBM PC DOS, SPSS/PC+
128 PCSPSS SYSTEM FILE. IBM PC DOS, SPSS/PC+ V3.0
129 PCSPSS SYSTEM FILE. IBM PC DOS, SPSS for Windows
132 Thus, it is reasonable to use the presence of the string @samp{SPSS}
133 at offset 0x104 as a simple test for an SPSS/PC+ data file.
136 The system-missing value, as described previously (@pxref{SPSS/PC+
137 System File Format}).
139 @item uint16 compressed;
140 Set to 0 if the data in the file is not compressed, 1 if the data is
141 compressed with simple bytecode compression.
143 @item uint16 nominal_case_size;
144 Number of data elements per case. This is the number of variables,
145 except that long string variables add extra data elements (one for
146 every 8 bytes after the first 8). String variables in SPSS/PC+ system
147 files are limited to 255 bytes.
149 @item uint16 n_cases0;
150 @itemx uint16 n_cases1;
151 The number of cases in the data record. Both values are the same.
152 Some files in the corpus contain data for the number of cases noted
153 here, followed by garbage that somewhat resembles data.
155 @item uint16 weight_index;
156 0 if the file is unweighted; otherwise a 1-based index into the data
157 record of the weighting variable, e.g.@: 4 for the first variable
158 after the 3 system-defined variables.
160 @item char creation_date[8];
161 The date that the file was created, in @samp{mm/dd/yy} format.
162 Single-digit days and months are not prefixed by zeros. The string is
163 padded with spaces on the right or left or both, e.g.@: @samp{_2/4/93_},
164 @samp{10/5/87_}, and @samp{_1/11/88} (with @samp{_} standing in for a
165 space) are all actual examples from the corpus.
167 @item char creation_time[8];
168 The time that the file was created, in @samp{HH:MM:SS} format.
169 Single-digit hours are padded on the left with a space. Minutes and
170 seconds are always written as two digits.
172 @item char file_label[64];
173 File label declared by the user, if any (@pxref{FILE LABEL,,,pspp,
174 PSPP Users Guide}). Padded on the right with spaces.
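A few of the fields above lend themselves to a short Python sketch: the @samp{SPSS} signature test and the parsing of @code{creation_date} and @code{creation_time}. The function names are hypothetical. Mapping the two-digit year to 1900 + @var{yy} is an assumption made here, based only on the period in which SPSS/PC+ was in use.

```python
import datetime

def has_spss_signature(file_bytes):
    """Simple format test: the string SPSS at file offset 0x104."""
    return file_bytes[0x104:0x108] == b"SPSS"

def parse_creation_date(field):
    """Parse an 8-byte mm/dd/yy field padded with spaces on either side."""
    text = field.decode("ascii").strip()
    month, day, year = (int(part) for part in text.split("/"))
    # Assumption: two-digit years fall in the 1900s.
    return datetime.date(1900 + year, month, day)

def parse_creation_time(field):
    """Parse an 8-byte HH:MM:SS field; single-digit hours are space-padded."""
    text = field.decode("ascii").strip()
    hour, minute, second = (int(part) for part in text.split(":"))
    return datetime.time(hour, minute, second)
```

For example, @code{parse_creation_date(b" 2/4/93 ")} handles padding on both sides, as seen in the corpus.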
177 @node Record 1 Variables Record
178 @section Record 1: Variables Record
180 The variables record most commonly starts at offset 0x1b0, but it can
181 be placed elsewhere. The record contains instances of the following
185 uint32 value_label_start;
186 uint32 value_label_end;
187 uint32 var_label_ofs;
196 The number of instances is the @code{nominal_case_size} specified in
197 the main header record. There is one instance for each numeric
198 variable and each string variable with width 8 bytes or less. String
199 variables wider than 8 bytes have one instance for each 8 bytes,
200 rounding up. The first instance for a long string specifies the
201 variable's correct dictionary information. Subsequent instances for a
202 long string are generally filled with all-zero bytes, although the
203 @code{missing} field contains the numeric system-missing value, and
204 some writers also fill in @code{var_label_ofs}, @code{format}, and
205 @code{name}, sometimes filling the latter with the numeric
206 system-missing value rather than a text string. Regardless of the
207 values used, readers should ignore the contents of these additional
208 instances for long strings.
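The relationship between variable widths and the number of instances can be sketched as follows. The convention of representing a numeric variable by width 0 is an assumption made only for this example, and the function names are hypothetical.

```python
def instances_for_variable(width):
    """Number of 8-byte instances for one variable.

    width is 0 for a numeric variable (a convention of this sketch),
    or 1 through 255 for a string variable.
    """
    if width == 0:
        return 1
    if not 1 <= width <= 255:
        raise ValueError("string width out of range")
    # One instance per 8 bytes, rounding up.
    return (width + 7) // 8

def expected_case_size(widths):
    """Total instances per case for a list of variable widths."""
    return sum(instances_for_variable(w) for w in widths)

# Three system-defined variables (numeric, A8, numeric) followed by
# a numeric variable, an A20 string, and an A255 string.
example_widths = [0, 8, 0, 0, 20, 255]
```

When walking the variables record, a reader can use @code{instances_for_variable} to skip the additional instances that follow the first instance of each long string.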
211 @item uint32 value_label_start;
212 @itemx uint32 value_label_end;
213 For a variable with value labels, these specify offsets into the label
214 record of the start and end of this variable's value labels,
215 respectively. @xref{Record 2 Labels Record}, for more information.
217 For a variable without any value labels, these are both zero.
219 A long string variable may not have value labels.
221 @item uint32 var_label_ofs;
222 For a variable with a variable label, this specifies an offset into
223 the label record. @xref{Record 2 Labels Record}, for more
226 For a variable without a variable label, this is zero.
229 The variable's output format, in the same format used in system files.
230 @xref{System File Output Formats}, for details. SPSS/PC+ system files
231 only use format types 5 (F, for numeric variables) and 1 (A, for
235 The variable's name, padded on the right with spaces.
237 @item union @{ @dots{} @} missing;
238 A user-missing value. For numeric variables, @code{missing.f} is the
239 variable's user-missing value. For string variables, @code{missing.s}
240 is a string missing value. A variable without a user-missing value is
241 indicated with @code{missing.f} set to the system-missing value, even
242 for string variables (!). A long string variable may not have a
246 In addition to the user-defined variables, every SPSS/PC+ system file
247 contains, as its first three variables, the following system-defined
248 variables, in the following order. The system-defined variables have
249 no variable label, value labels, or missing values.
253 A numeric variable with format F8.0. Most of the time this is a
254 sequence number, starting with 1 for the first case and counting up
255 for each subsequent case. Some files skip over values, which probably
256 reflects cases that were deleted.
259 A string variable with format A8. Same format (including varying
260 padding) as the @code{creation_date} field in the main header record
261 (@pxref{Record 0 Main Header Record}). The actual date can differ
262 from @code{creation_date} and from record to record. This may reflect
263 when individual cases were added or updated.
266 A numeric variable with format F8.2. This represents the case's
267 weight; SPSS/PC+ files do not have a user-defined weighting variable.
268 If weighting has not been enabled, every case has value 1.0.
271 @node Record 2 Labels Record
272 @section Record 2: Labels Record
274 The labels record holds value labels and variable labels. Unlike the
275 other records, it is not meant to be read directly and sequentially.
276 Instead, this record must be interpreted one piece at a time, by
277 following pointers from the variables record.
279 The @code{value_label_start}, @code{value_label_end}, and
280 @code{var_label_ofs} fields in a variable record are all offsets
281 relative to the beginning of the labels record, with an additional
282 7-byte offset. That is, if the labels record starts at byte offset
283 @code{labels_ofs} and a variable has a given @code{var_label_ofs},
284 then the variable label begins at byte offset @math{@code{labels_ofs}
285 + @code{var_label_ofs} + 7} in the file.
287 A variable label, starting at the offset indicated by
288 @code{var_label_ofs}, consists of a one-byte length followed by the
289 specified number of bytes of the variable label string, like this:
296 A set of value labels, extending from @code{value_label_start} to
297 @code{value_label_end} (exclusive), consists of a numeric or string
298 value followed by a string in the format just described. String
299 values are padded on the right with spaces to fill the 8-byte field,
311 The labels record begins with a pair of uint32 values. The first of
312 these is always 3. The second is between 8 and 16 less than the
313 number of bytes in the record. Neither value is important for
314 interpreting the file.
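The pointer-following described in this section can be sketched in Python as below. The function names are hypothetical. Labels and values are returned as raw bytes: the caller is assumed to decode the text with a suitable codepage and to interpret each 8-byte value as @code{flt64} or as a string according to the variable's type.

```python
LABEL_BIAS = 7  # the additional 7-byte offset described above

def read_var_label(file_bytes, labels_ofs, var_label_ofs):
    """Return the variable label bytes for a nonzero var_label_ofs."""
    pos = labels_ofs + var_label_ofs + LABEL_BIAS
    length = file_bytes[pos]            # one-byte length
    return file_bytes[pos + 1:pos + 1 + length]

def read_value_labels(file_bytes, labels_ofs, start, end):
    """Return (raw 8-byte value, label bytes) pairs from start to end."""
    pos = labels_ofs + start + LABEL_BIAS
    stop = labels_ofs + end + LABEL_BIAS    # end is exclusive
    pairs = []
    while pos < stop:
        value = file_bytes[pos:pos + 8]
        length = file_bytes[pos + 8]
        label = file_bytes[pos + 9:pos + 9 + length]
        pairs.append((value, label))
        pos += 9 + length
    return pairs
```

Both functions take offsets exactly as stored in the variables record, so the bias is applied in one place only.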
316 @node Record 3 Data Record
317 @section Record 3: Data Record
319 The format of the data record varies depending on the value of
320 @code{compressed} in the main header record:
323 @item 0: no compression
324 Data is arranged as a series of 8-byte elements, one per variable
325 instance in the variables record (@pxref{Record 1 Variables
326 Record}). Numeric values are given in @code{flt64} format; string
327 values are literal character strings, padded on the right with spaces
328 when necessary to fill out 8-byte units.
330 @item 1: bytecode compression
331 The first 8 bytes of the data record are divided into a series of
332 1-byte command codes. These codes have meanings as described below:
336 The system-missing value.
339 A numeric or string value that is not
340 compressible. The value is stored in the 8 bytes following the
341 current block of command bytes. If this value appears twice in a block
342 of command bytes, then it indicates the second group of 8 bytes following the
343 command bytes, and so on.
346 A number with value @var{code} - 100, where @var{code} is the value of
347 the compression code. For example, code 105 indicates a numeric
351 The end of the 8-byte group of bytecodes is followed by any 8-byte
352 blocks of non-compressible values indicated by code 1. After that
353 follows another 8-byte group of bytecodes, then those bytecodes'
354 non-compressible values. The pattern repeats until the number of
355 cases specified by the main header record has been seen.
357 The corpus does not contain any files with command codes 2 through 95,
358 so it is possible that some of these codes are used for special
362 Cases of data often, but not always, fill the entire data record.
363 Readers should stop reading after the number of cases specified in the
364 main header record. Otherwise, readers may try to interpret garbage
365 following the data as additional cases.
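The bytecode scheme can be sketched in Python as follows. Several points are assumptions made for this example rather than facts established above: code 0 is taken to be the system-missing code, every code other than 0 and 1 is treated as @var{code} - 100 (including the unobserved codes 2 through 95), and elements are assumed to run continuously across case boundaries. The function names are hypothetical. Each element is expanded back to 8 raw bytes, so the output has the same shape as uncompressed data.

```python
import struct

SYSMIS_BYTES = bytes.fromhex("f51e26028a8cedff")

def decompress_elements(record, n_elements):
    """Expand bytecode-compressed data into a list of 8-byte elements."""
    elements = []
    pos = 0
    while len(elements) < n_elements:
        commands = record[pos:pos + 8]
        if len(commands) < 8:
            raise ValueError("data record ends inside a command block")
        pos += 8
        for code in commands:
            if len(elements) == n_elements:
                break                   # ignore anything past the last case
            if code == 0:               # assumed system-missing code
                elements.append(SYSMIS_BYTES)
            elif code == 1:             # literal value follows the block
                raw = record[pos:pos + 8]
                if len(raw) < 8:
                    raise ValueError("data record ends inside a value")
                elements.append(raw)
                pos += 8
            else:                       # small number stored as code - 100
                elements.append(struct.pack("<d", code - 100))
    return elements

def split_cases(elements, case_size):
    """Group a flat list of elements into cases of case_size elements."""
    return [elements[i:i + case_size]
            for i in range(0, len(elements), case_size)]
```

A caller would pass the product of the case count and @code{nominal_case_size} as @code{n_elements}, which also enforces the advice to stop before any trailing garbage.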
367 @node Records 4 and 5 Data Entry
368 @section Records 4 and 5: Data Entry
370 Records 4 and 5 appear to be related to SPSS/PC+ Data Entry.