1 @node SPSS Viewer File Format
2 @chapter SPSS Viewer File Format
4 SPSS Viewer or @file{.spv} files, here called SPV files, are written
5 by SPSS 16 and later to represent the contents of its output editor.
6 This chapter documents the format, based on examination of a corpus of
7 about 500 files from a variety of sources. This description is
8 detailed enough to read SPV files, but probably not enough to write
11 SPSS 15 and earlier versions use a completely different output format
12 based on the Microsoft Compound Document Format. This format is not
15 An SPV file is a Zip archive that can be read with @command{zipinfo}
16 and @command{unzip} and similar programs. The final member in the Zip
17 archive is a file named @file{META-INF/MANIFEST.MF}. This structure
18 makes SPV files resemble Java ``JAR'' files (and ODF files), but
19 whereas a JAR manifest contains a sequence of colon-delimited
20 key/value pairs, an SPV manifest contains the string
21 @samp{allowPivoting=true}, without a new-line. (This string may be
22 the best way to identify an SPV file; it is invariant across the
25 The rest of the members in an SPV file's Zip archive fall into two
26 categories: @dfn{structure} and @dfn{detail} members. Structure
27 member names begin with @file{outputViewer@var{nnnnnnnnnn}}, where
28 each @var{n} is a decimal digit, and end with @file{.xml}, and often
29 include the string @file{_heading} in between. Each of these members
30 represents some kind of output item (a table, a heading, a block of
31 text, etc.) or a group of them. The member whose output goes at the
32 beginning of the document is numbered 0, the next member in the output
33 is numbered 1, and so on.
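
As a rough illustration of the layout described so far, the following
Python sketch checks the manifest and lists structure members in
output order.  The names @code{is_spv} and @code{structure_members}
are invented for this example.  Stripping whitespace before comparing
the manifest, and accepting any text between the number and
@file{.xml}, are assumptions made for tolerance, not requirements of
the format.

```python
import re
import zipfile

# Structure members: "outputViewer", decimal digits, optional extra
# text such as "_heading", then ".xml".
STRUCTURE_RE = re.compile(r"^outputViewer(\d+).*\.xml$")


def is_spv(zf):
    """Return True if the open ZipFile has the manifest described above."""
    try:
        manifest = zf.read("META-INF/MANIFEST.MF")
    except KeyError:
        # No manifest member at all.
        return False
    return manifest.strip() == b"allowPivoting=true"


def structure_members(zf):
    """Return structure member names sorted by their embedded number."""
    found = []
    for name in zf.namelist():
        m = STRUCTURE_RE.match(name)
        if m:
            found.append((int(m.group(1)), name))
    return [name for _, name in sorted(found)]
```

A caller would open the file with @code{zipfile.ZipFile(path)} and pass
the resulting object to both functions.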
35 Structure members contain XML. This XML is sometimes self-contained,
but it often references detail members in the Zip archive, which are named
40 @item @file{@var{prefix}_table.xml} and @file{@var{prefix}_tableData.bin}
41 @itemx @file{@var{prefix}_lightTableData.bin}
42 The structure of a table plus its data. Older SPV files pair a
43 @file{@var{prefix}_table.xml} file that describes the table's
44 structure with a binary @file{@var{prefix}_tableData.bin} file that
45 gives its data. Newer SPV files (the majority of those in the corpus)
46 instead include a single @file{@var{prefix}_lightTableData.bin} file
47 that incorporates both into a single binary format.
49 @item @file{@var{prefix}_warning.xml} and @file{@var{prefix}_warningData.bin}
50 @itemx @file{@var{prefix}_lightWarningData.bin}
51 Same format used for tables, with a different name.
53 @item @file{@var{prefix}_notes.xml} and @file{@var{prefix}_notesData.bin}
54 @itemx @file{@var{prefix}_lightNotesData.bin}
55 Same format used for tables, with a different name.
57 @item @file{@var{prefix}_chartData.bin} and @file{@var{prefix}_chart.xml}
58 The structure of a chart plus its data. Charts do not have a
61 @item @file{@var{prefix}_pmml.scf}
62 @itemx @file{@var{prefix}_stats.scf}
63 @item @file{@var{prefix}_model.xml}
64 Not yet investigated. The corpus contains few examples.
67 The @file{@var{prefix}} in the names of the detail members is
68 typically an 11-digit decimal number that increases for each item,
69 tending to skip values. Older SPV files use different naming
conventions.  Structure members refer to detail members by name, and so
71 their exact names do not matter to readers as long as they are unique.
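
The naming scheme above can be turned into a small lookup, sketched
here in Python.  The suffixes come from the list of detail members;
the kind labels (@code{light}, @code{legacy-xml}, and so on) and the
function name @code{classify_detail_member} are invented for this
example.  Because older files use different naming conventions, a
reader should rely on the names given in structure members rather than
on a table like this one.

```python
# (suffix, kind) pairs; the kind strings are labels made up for this sketch.
DETAIL_SUFFIXES = [
    ("_lightTableData.bin", "light"),
    ("_lightWarningData.bin", "light"),
    ("_lightNotesData.bin", "light"),
    ("_table.xml", "legacy-xml"),
    ("_warning.xml", "legacy-xml"),
    ("_notes.xml", "legacy-xml"),
    ("_tableData.bin", "legacy-binary"),
    ("_warningData.bin", "legacy-binary"),
    ("_notesData.bin", "legacy-binary"),
    ("_chart.xml", "chart"),
    ("_chartData.bin", "chart"),
    ("_pmml.scf", "uninvestigated"),
    ("_stats.scf", "uninvestigated"),
    ("_model.xml", "uninvestigated"),
]


def classify_detail_member(name):
    """Return (prefix, kind) for a recognized detail member, else None."""
    for suffix, kind in DETAIL_SUFFIXES:
        if name.endswith(suffix):
            return name[:-len(suffix)], kind
    return None
```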
74 * SPV Structure Member Format::
75 * SPV Light Detail Member Format::
76 * SPV Legacy Detail Member Binary Format::
77 * SPV Legacy Detail Member XML Format::
80 @node SPV Structure Member Format
81 @section Structure Member Format
83 Structure members' XML files claim conformance with a collection of
84 XML Schemas. These schemas are distributed, under a nonfree license,
85 with SPSS binaries. Fortunately, the schemas are not necessary to
86 understand the structure members. To a degree, the schemas can even
87 be deceptive because they document elements and attributes that are
88 not in the corpus and do not document elements and attributes that are
91 Structure members use a different XML namespace for each schema, but
92 these namespaces are not entirely consistent. In some SPV files, for
93 example, the @code{viewer-tree} schema is associated with namespace
94 @indicateurl{http://xml.spss.com/spss/viewer-tree} and in others with
95 @indicateurl{http://xml.spss.com/spss/viewer/viewer-tree} (note the
96 additional @file{viewer/}). Under either name, the schema URIs are
97 not resolvable to obtain the schemas themselves.
99 One may ignore all of the above in interpreting a structure member.
100 The actual XML has a simple and straightforward form that does not
101 require a reader to take schemas or namespaces into account.
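
One way to act on this advice, sketched in Python with the standard
@code{xml.etree.ElementTree} module, is to discard the namespace part
of every element name after parsing.  The function names are
hypothetical, and the sketch leaves attribute names untouched.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the "{uri}" prefix that ElementTree puts on namespaced tags."""
    return tag.rpartition("}")[2]


def parse_ignoring_namespaces(xml_text):
    """Parse a structure member and rename every element to its local name."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = local_name(elem.tag)
    return root
```

After this step, lookups such as @code{root.find("label")} behave the
same way whichever namespace variant the file used.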
103 The elements found in structure members are documented below. For
104 each element, we note the possible parent elements and the element's
105 contents. The contents are specified as pseudo-regular expressions
106 with the following conventions:
119 Grouping multiple elements.
124 @item @var{a} @math{|} @var{b}
125 A choice between @var{a} and @var{b}.
128 Zero or more @var{x}.
132 For a diagram illustrating the hierarchy of elements within an SPV
133 structure member, please refer to a PDF version of the manual.
137 The following diagram shows the hierarchy of elements within an SPV
138 structure member. Edges point from parent to child elements.
139 Unlabeled edges indicate that the child appears exactly once; edges
140 labeled with *, zero or more times; edges labeled with ?, zero or one
142 @center @image{dev/spv-structure, 5in}
146 * SPV heading Element::
147 * SPV label Element::
148 * SPV container Element::
149 * SPV text Element (Inside @code{container})::
151 * SPV table Element::
152 * SPV tableStructure Element::
153 * SPV dataPath Element::
154 * SPV pageSetup Element::
155 * SPV pageHeader and pageFooter Elements::
156 * SPV pageParagraph Element::
157 * SPV @code{text} Element (Inside @code{pageParagraph})::
160 @node SPV heading Element
161 @subsection The @code{heading} Element
163 Parent: Document root or @code{heading} @*
164 Contents: [@code{pageSetup}] @code{label} (@code{container} @math{|} @code{heading})*
166 The root of a structure member is a @code{heading}, which represents a
167 section of output beginning with a title (the @code{label}) and
168 ordinarily followed by content containers or further nested
169 (sub)-sections of output.
171 The document root heading, only, may also contain a @code{pageSetup}
174 The following attributes have been observed on both document root and
175 nested @code{heading} elements.
177 @defvr {Optional} creator-version
178 The version of the software that created this SPV file. A string of
179 the form @code{xxyyzzww} represents software version xx.yy.zz.ww,
180 e.g.@: @code{21000001} is version 21.0.0.1. Trailing pairs of zeros
181 are sometimes omitted, so that @code{21}, @code{210000}, and
182 @code{21000000} are all version 21.0.0.0 (and the corpus contains all
183 three of those forms).
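
A minimal Python sketch of this decoding follows.  The function name
@code{parse_creator_version} is invented, and rejecting strings with an
odd number of digits is an assumption: the text above only mentions
that trailing @emph{pairs} of zeros may be omitted.

```python
def parse_creator_version(text):
    """Decode "xxyyzzww" (possibly shortened) into a 4-tuple of integers."""
    if not text.isdigit() or len(text) > 8 or len(text) % 2:
        raise ValueError("unexpected creator-version: %r" % text)
    # Restore any omitted trailing pairs of zeros.
    padded = text.ljust(8, "0")
    return tuple(int(padded[i:i + 2]) for i in range(0, 8, 2))
```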
187 The following attributes have been observed on document root
188 @code{heading} elements only:
190 @defvr {Optional} @code{creator}
191 The directory in the file system of the software that created this SPV
195 @defvr {Optional} @code{creation-date-time}
196 The date and time at which the SPV file was written, in a
locale-specific format, e.g.@: @code{Friday, May 16, 2014 6:47:37 PM
198 PDT} or @code{lunedì 17 marzo 2014 3.15.48 CET} or even @code{Friday,
199 December 5, 2014 5:00:19 o'clock PM EST}.
202 @defvr {Optional} @code{lockReader}
203 Whether a reader should be allowed to edit the output. The possible
204 values are @code{true} and @code{false}, but the corpus only contains
208 @defvr {Optional} @code{schemaLocation}
209 This is actually an XML Namespace attribute. A reader may ignore it.
213 The following attributes have been observed only on nested
214 @code{heading} elements:
216 @defvr {Required} @code{commandName}
217 The locale-invariant name of the command that produced the output,
218 e.g.@: @code{Frequencies}, @code{T-Test}, @code{Non Par Corr}.
221 @defvr {Optional} @code{visibility}
222 To what degree the output represented by the element is visible. The
223 only observed value is @code{collapsed}.
226 @defvr {Optional} @code{locale}
227 The locale used for output, in Windows format, which is similar to the
228 format used in Unix with the underscore replaced by a hyphen, e.g.@:
@code{en-US}, @code{en-GB}, @code{el-GR}, @code{sr-Cyrl-RS}.
232 @defvr {Optional} @code{olang}
233 The output language, e.g.@: @code{en}, @code{it}, @code{es},
234 @code{de}, @code{pt-BR}.
237 @node SPV label Element
238 @subsection The @code{label} Element
240 Parent: @code{heading} or @code{container} @*
243 Every @code{heading} and @code{container} holds a @code{label} as its
first child.  The @code{label} of the root @code{heading} in a
structure member always contains the string ``Output''.  Otherwise,
the text in @code{label}
246 describes what it labels, often by naming the statistical procedure
247 that was executed, e.g.@: ``Frequencies'' or ``T-Test''. Labels are
248 often very generic, especially within a @code{container}, e.g.@:
249 ``Title'' or ``Warnings'' or ``Notes''. Label text is localized
250 according to the output language, e.g.@: in Italian a frequency table
251 procedure is labeled ``Frequenze''.
253 The corpus contains one example of an empty label, one that contains
256 This element has no attributes.
258 @node SPV container Element
259 @subsection The @code{container} Element
261 Parent: @code{heading} @*
262 Contents: @code{label} [@code{table} @math{|} @code{text}]
264 A @code{container} serves to label a @code{table} or a @code{text}
267 This element has the following attributes.
269 @defvr {Required} @code{visibility}
270 Either @code{visible} or @code{hidden}, this indicates whether the
271 container's content is displayed.
274 @defvr {Optional} @code{text-align}
275 Presumably indicates the alignment of text within the container. The
276 only observed value is @code{left}. Observed with nested @code{table}
277 and @code{text} elements.
280 @defvr {Optional} @code{width}
281 The width of the container in the form @code{@var{n}px}, e.g.@:
285 @node SPV text Element (Inside @code{container})
286 @subsection The @code{text} Element (Inside @code{container})
288 Parent: @code{container} @*
289 Contents: @code{html}
291 This @code{text} element is nested inside a @code{container}. There
292 is a different @code{text} element that is nested inside a
293 @code{pageParagraph}.
295 This element has the following attributes.
297 @defvr {Required} @code{type}
298 One of @code{title}, @code{log}, or @code{text}.
301 @defvr {Optional} @code{commandName}
302 As on the @code{heading} element. For output not specific to a
command, this is simply @code{log}.  The corpus contains one example
in which @code{commandName} is present but set to the empty string.
307 @defvr {Optional} @code{creator-version}
308 As on the @code{heading} element.
311 @node SPV html Element
312 @subsection The @code{html} Element
314 Parent: @code{text} @*
317 The CDATA contains an HTML document. In some cases, the document
starts with @code{<html>} and ends with @code{</html>}; in others the
319 @code{html} element is implied. Generally the HTML includes a
320 @code{head} element with a CSS stylesheet. The HTML body often begins
321 with @code{<BR>}. The actual content ranges from trivial to simple:
322 just discarding the CSS and tags yields readable results.
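
The ``discard the CSS and tags'' approach might look like the
following Python sketch, built on the standard
@code{html.parser.HTMLParser} class.  The class and function names are
invented, and treating @code{<br>} and @code{<p>} as line breaks is a
presentation choice of this example rather than something the format
dictates.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data, skipping anything inside <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "style":
            self.skip_depth += 1
        elif tag in ("br", "p"):
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "style" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the readable text of the HTML held in an html element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```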
324 This element has the following attributes.
326 @defvr {Required} @code{lang}
327 This always contains @code{en} in the corpus.
330 @node SPV table Element
331 @subsection The @code{table} Element
333 Parent: @code{container} @*
334 Contents: @code{tableStructure}
336 This element has the following attributes.
338 @defvr {Required} @code{commandName}
339 As on the @code{heading} element.
342 @defvr {Required} @code{type}
343 One of @code{table}, @code{note}, or @code{warning}.
346 @defvr {Required} @code{subType}
347 The locale-invariant name for the particular kind of output that this
348 table represents in the procedure. This can be the same as
349 @code{commandName} e.g.@: @code{Frequencies}, or different, e.g.@:
350 @code{Case Processing Summary}. Generic subtypes @code{Notes} and
351 @code{Warnings} are often used.
354 @defvr {Required} @code{tableId}
355 A number that uniquely identifies the table within the SPV file,
356 typically a large negative number such as @code{-4147135649387905023}.
359 @defvr {Optional} @code{creator-version}
360 As on the @code{heading} element. In the corpus, this is only present
361 for version 21 and up and always includes all 8 digits.
364 @node SPV tableStructure Element
365 @subsection The @code{tableStructure} Element
367 Parent: @code{table} @*
368 Contents: @code{dataPath}
370 This element has no attributes.
372 @node SPV dataPath Element
373 @subsection The @code{dataPath} Element
375 Parent: @code{tableStructure} @*
378 Contains the name of the Zip member that holds the table details,
379 e.g.@: @code{0000000001437_lightTableData.bin}.
381 This element has no attributes.
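
Putting the @code{table}, @code{tableStructure}, and @code{dataPath}
elements together, a reader can collect the detail members that a
structure member refers to.  The Python sketch below does so with
@code{xml.etree.ElementTree}, ignoring namespaces.  The function names
are hypothetical, and the sketch assumes the @code{dataPath} text needs
only surrounding whitespace removed.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the "{uri}" prefix that ElementTree puts on namespaced tags."""
    return tag.rpartition("}")[2]


def table_references(xml_text):
    """Return (commandName, subType, dataPath) for each table element."""
    refs = []
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "table":
            continue
        for sub in elem.iter():
            if local_name(sub.tag) == "dataPath":
                path = (sub.text or "").strip()
                refs.append((elem.get("commandName"), elem.get("subType"), path))
    return refs
```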
383 @node SPV pageSetup Element
384 @subsection The @code{pageSetup} Element
386 Parent: @code{heading} @*
387 Contents: @code{pageHeader} @code{pageFooter}
389 This element has the following attributes.
391 @defvr {Required} @code{initial-page-number}
395 @defvr {Optional} @code{chart-size}
396 Always @code{as-is} or a localization (!) of it (e.g.@: @code{dimensione
397 attuale}, @code{Wie vorgegeben}).
400 @defvr {Optional} @code{margin-left}
401 @defvrx {Optional} @code{margin-right}
402 @defvrx {Optional} @code{margin-top}
403 @defvrx {Optional} @code{margin-bottom}
404 Margin sizes in the form @code{@var{size}in}, e.g.@: @code{0.25in}.
407 @defvr {Optional} @code{paper-height}
408 @defvrx {Optional} @code{paper-width}
409 Paper sizes in the form @code{@var{size}in}, e.g.@: @code{8.5in} by
410 @code{11in} for letter paper or @code{8.267in} by @code{11.692in} for
414 @defvr {Optional} @code{reference-orientation}
418 @defvr {Optional} @code{space-after}
422 @node SPV pageHeader and pageFooter Elements
423 @subsection The @code{pageHeader} and @code{pageFooter} Elements
425 Parent: @code{pageSetup} @*
426 Contents: @code{pageParagraph}*
428 This element has no attributes.
430 @node SPV pageParagraph Element
431 @subsection The @code{pageParagraph} Element
433 Parent: @code{pageHeader} or @code{pageFooter} @*
434 Contents: @code{text}
436 Text to go at the top or bottom of a page, respectively.
438 This element has no attributes.
440 @node SPV @code{text} Element (Inside @code{pageParagraph})
441 @subsection The @code{text} Element (Inside @code{pageParagraph})
443 Parent: @code{pageParagraph} @*
446 This @code{text} element is nested inside a @code{pageParagraph}. There
447 is a different @code{text} element that is nested inside a
450 The element is either empty, or contains CDATA that holds almost-XHTML
451 text: in the corpus, either an @code{html} or @code{p} element. It is
452 @emph{almost}-XHTML because the @code{html} element designates the
454 @code{http://xml.spss.com/spss/viewer/viewer-tree} instead of an XHTML
455 namespace, and because the CDATA can contain substitution variables:
456 @code{&[Page]} for the page number and @code{&[PageTitle]} for the
459 Typical contents (indented for clarity):
462 <html xmlns="http://xml.spss.com/spss/viewer/viewer-tree">
465 <p style="text-align:right; margin-top: 0">Page &[Page]</p>
470 This element has the following attributes.
472 @defvr {Required} @code{type}
476 @node SPV Light Detail Member Format
477 @section Light Detail Member Format
479 This section describes the format of ``light'' detail @file{.bin}
480 members. These members have a binary format which we describe here in
481 terms of a context-free grammar using the following conventions:
484 @item NonTerminal @result{} @dots{}
485 Nonterminals have CamelCaps names, and @result{} indicates a
486 production. The right-hand side of a production is often broken
487 across multiple lines. Break points are chosen for aesthetics only
488 and have no semantic significance.
490 @item 00, 01, @dots{}, ff.
491 Bytes with fixed values are written in hexadecimal:
493 @item i0, i1, @dots{}, i9, i10, i11, @dots{}
494 32-bit integers with fixed values are written in decimal, prefixed by
501 An arbitrary 32-bit integer.
504 An arbitrary 64-bit IEEE floating-point number.
507 A 32-bit integer followed by the specified number of bytes of
508 character data. (The encoding is indicated by the Formats
512 @var{x} is optional, e.g.@: 00? is an optional zero byte.
514 @item @var{x}*@var{n}
@var{x} is repeated @var{n} times, e.g.@: byte*10 for ten arbitrary bytes.
517 @item @var{x}[@var{name}]
518 Gives @var{x} the specified @var{name}. Names are used in textual
519 explanations. They are also used, also bracketed, to indicate counts,
520 e.g.@: int[@t{n}] byte*[@t{n}] for a 32-bit integer followed by the
521 specified number of arbitrary bytes.
523 @item @var{a} @math{|} @var{b}
524 Either @var{a} or @var{b}.
527 Parentheses are used for grouping to make precedence clear, especially
528 in the presence of @math{|}, e.g.@: in 00 (01 @math{|} 02 @math{|} 03)
532 A 32-bit integer that indicates the number of bytes in @var{x},
533 followed by @var{x} itself.
536 In a version 1 @file{.bin} member, @var{x}; in version 3, nothing.
537 (The @file{.bin} header indicates the version.)
540 In a version 3 @file{.bin} member, @var{x}; in version 1, nothing.
543 All integer and floating-point values in this format use little-endian
546 A ``light'' detail member @file{.bin} consists of a number of sections
547 concatenated together, terminated by a byte 01:
551 LightMember @result{} Header Title Caption Footnotes Fonts Formats Dimensions Data 01
555 The following sections go into more detail.
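
The primitive types in the grammar conventions map naturally onto a
small cursor object.  The Python sketch below is one possible shape,
using the standard @code{struct} module with little-endian formats.
The class name @code{Reader} and its method names are invented here.
Strings are returned as raw bytes because their encoding is only known
once the Formats section has been read, and treating @code{int} as
signed is an assumption.

```python
import struct


class Reader:
    """Sequential reader for the primitive types of a light member."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def take(self, n):
        """Return the next n bytes and advance past them."""
        if n < 0 or self.pos + n > len(self.data):
            raise ValueError("truncated input")
        chunk = self.data[self.pos:self.pos + n]
        self.pos += n
        return chunk

    def int32(self):
        # "int": a 32-bit little-endian integer, read as signed.
        return struct.unpack("<i", self.take(4))[0]

    def double(self):
        # "double": a 64-bit IEEE floating-point number.
        return struct.unpack("<d", self.take(8))[0]

    def string(self):
        # "string": a 32-bit length followed by that many bytes.
        return self.take(self.int32())

    def count(self):
        # "count(x)": a 32-bit byte count followed by x; return a
        # sub-reader limited to those bytes.
        return Reader(self.take(self.int32()))
```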
558 * SPV Light Member Header::
559 * SPV Light Member Title::
560 * PSV Light Member Caption::
561 * SPV Light Member Footnotes::
562 * SPV Light Member Fonts::
563 * SPV Light Member Formats::
564 * SPV Light Member Dimensions::
565 * SPV Light Member Categories::
566 * SPV Light Member Data::
567 * SPV Light Member Value::
568 * SPV Light Member ValueMod::
571 @node SPV Light Member Header
An SPV light member begins with a 39-byte header:
580 (i1 @math{|} i3)[@t{version}]
581 01 (00 @math{|} 01) byte*21 00 00
582 int[@t{table-id}] byte*4
586 @code{version} is a version number that affects the interpretation of
587 some of the other data in the member. We will refer to ``version 1''
588 and ``version 3'' later on and use v1(@dots{}) and v3(@dots{}) for
589 version-specific formatting (as described previously).
591 @code{table-id} is a binary version of the @code{tableId} attribute in
592 the structure member that refers to the detail member. For example,
593 if @code{tableId} is @code{-4154297861994971133}, then @code{table-id}
596 The meaning of the other variable parts of the header is not known.
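
A hedged Python sketch of reading this header follows.  The offsets
are derived from the 39-byte total and the fields listed above,
assuming that the two bytes preceding @code{version} are fixed bytes
that the sketch skips without checking.  The function name
@code{parse_light_header} is invented, and reading @code{table-id} as a
signed integer is an arbitrary choice.

```python
import struct

HEADER_SIZE = 39


def parse_light_header(data):
    """Return (version, table_id) from the start of a light member."""
    if len(data) < HEADER_SIZE:
        raise ValueError("light member too short for header")
    # Bytes 0-1 are skipped unchecked; version is at offset 2.
    (version,) = struct.unpack_from("<i", data, 2)
    if version not in (1, 3):
        raise ValueError("unexpected version %d" % version)
    # Fixed bytes: 01 at offset 6, 00 or 01 at offset 7, and 00 00
    # after the 21 bytes of unknown meaning.
    if data[6] != 1 or data[7] not in (0, 1) or data[29:31] != b"\x00\x00":
        raise ValueError("unexpected fixed bytes in header")
    (table_id,) = struct.unpack_from("<i", data, 31)
    return version, table_id
```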
598 @node SPV Light Member Title
604 Value[@t{title1}] 01?
606 Value[@t{title2}] 01? 00? 58
610 The Title, which follows the Header, specifies the pivot table's title
611 twice, as @code{title1} and @code{title2}. In the corpus, they are
614 Whereas the Value in @code{title1} and in @code{title2} are
615 appropriate for presentation, and localized to the user's language,
616 @code{c} is in English, sometimes less specific, and sometimes less
617 well formatted. For example, for a frequency table, @code{title1} and
618 @code{title2} name the variable and @code{c} is simply ``Frequencies''.
620 @node PSV Light Member Caption
625 Caption @result{} 58 @math{|} 31 Value[@t{caption}]
629 The @code{caption}, if presented, is shown below the table.
631 @node SPV Light Member Footnotes
632 @subsection Footnotes
636 Footnotes @result{} int[@t{n}] Footnote*[@t{n}]
637 Footnote @result{} Value[@t{text}] (58 @math{|} 31 Value[@t{marker}]) byte*4
Each footnote has @code{text} and an optional custom @code{marker}
644 @node SPV Light Member Fonts
649 Fonts @result{} 00 Font*8
651 byte[@t{index}] 31 string[@t{typeface}] 00 00
652 (10 @math{|} 20 @math{|} 40 @math{|} 50 @math{|} 70 @math{|} 80)[@t{f1}] 41
653 (i0 @math{|} i1 @math{|} i2)[@t{f2}] 00
654 (i0 @math{|} i2 @math{|} i64173)[@t{f3}]
655 (i0 @math{|} i1 @math{|} i2 @math{|} i3)[@t{f4}]
656 string[@t{fgcolor}] string[@t{bgcolor}] i0 i0 00
657 v3(int[@t{f5}] int[@t{f6}] int[@t{f7}] int[@t{f8}]))
661 Each Font represents the font style for a different element, in the
662 following order: title, caption, footnote, row labels, column labels,
663 corner labels, data, and layers.
@code{index} is the 1-based index of the Font, i.e.@: 1 for the first
666 Font, through 8 for the final Font.
668 @code{typeface} is the string name of the font. In the corpus, this
669 is @code{SansSerif} in over 99% of instances and @code{Times New
672 @code{fgcolor} and @code{bgcolor} are the foreground color and
673 background color, respectively. In the corpus, these are always
674 @code{#000000} and @code{#ffffff}, respectively.
676 The meaning of the remaining data is unknown. It seems likely to
677 include font sizes, horizontal and vertical alignment, attributes such
678 as bold or italic, and margins.
680 The table below lists the values observed in the corpus. When a cell
681 contains a single value, then 99@math{+}% of the corpus contains that value.
682 When a cell contains a pair of values, then the first value is seen in
683 about two-thirds of the corpus and the second value in about the
684 remaining one-third. In fonts that include multiple pairs, values are
685 correlated, that is, for font 3, f5 = 24, f6 = 24, f7 = 2 appears
686 about two-thirds of the time, as does the combination of f4 = 0, f6 =
689 @multitable {font} {40} {f2} {64173} {0/1} {24/11} {10/11} {2/3} {f8}
690 @headitem font @tab f1 @tab f2 @tab f3 @tab f4 @tab f5 @tab f6 @tab f7 @tab f8
691 @item 1 @tab 40 @tab 1 @tab 0 @tab 0 @tab 8 @tab 10/11 @tab 1 @tab 8
692 @item 2 @tab 40 @tab 0 @tab 2 @tab 1 @tab 8 @tab 10/11 @tab 1 @tab 1
@item 3 @tab 40 @tab 0 @tab 2 @tab 1 @tab 24/11 @tab 24/8 @tab 2/3 @tab 4
694 @item 4 @tab 40 @tab 0 @tab 2 @tab 3 @tab 8 @tab 10/11 @tab 1 @tab 1
695 @item 5 @tab 40 @tab 0 @tab 0 @tab 1 @tab 8 @tab 10/11 @tab 1 @tab 4
696 @item 6 @tab 40 @tab 0 @tab 2 @tab 1 @tab 8 @tab 10/11 @tab 1 @tab 4
697 @item 7 @tab 40 @tab 0 @tab 64173 @tab 0/1 @tab 8 @tab 10/11 @tab 1 @tab 1
698 @item 8 @tab 40 @tab 0 @tab 2 @tab 3 @tab 8 @tab 10/11 @tab 1 @tab 4
701 @node SPV Light Member Formats
707 int[@t{n1}] byte*[@t{n1}]
708 int[@t{n2}] byte*[@t{n2}]
709 int[@t{n3}] byte*[@t{n3}]
710 int[@t{n4}] int*[@t{n4}]
712 (i0 @math{|} i-1) (00 @math{|} 01) 00 (00 @math{|} 01)
714 byte[@t{decimal}] byte[@t{grouping}]
715 int[@t{n-ccs}] string*[@t{n-ccs}]
717 v3(count(count(X5) count(X6)))
719 X5 @result{} byte*33 int[@t{n}] int*[@t{n}]
721 01 00 (03 @math{|} 04) 00 00 00
722 string[@t{command}] string[@t{subcommand}]
723 string[@t{language}] string[@t{charset}] string[@t{locale}]
724 (00 @math{|} 01) 00 (00 @math{|} 01) (00 @math{|} 01)
726 byte[@t{decimal}] byte[@t{grouping}]
728 (string[@t{dataset}] string[@t{datafile}] i0 int i0)?
729 int[@t{n-ccs}] string*[@t{n-ccs}]
730 2e (00 @math{|} 01) (i2000000 i0)?
734 In every example in the corpus, @code{n1} is 240. The meaning of the
735 bytes that follow it is unknown.
737 In every example in the corpus, @code{n2} is 18 and the bytes that
738 follow it are @code{00 00 00 01 00 00 00 00 00 00 00 00 00 02 00 00 00
739 00}. The meaning of these bytes is unknown.
741 In every example in the corpus for version 1, @code{n3} is 16 and the
742 bytes that follow it are @code{00 00 00 01 00 00 00 01 00 00 00 00 01
743 01 01 01}. In version 3, observed @code{n3} varies from 117 to 150,
744 and its bytes include a 1-byte count at offset 0x34. When the count
745 is nonzero, a text string of that length at offset 0x35 is the name of
a ``TableLook'', e.g.@: ``Default'' or ``Academic''.
748 Observed values of @code{n4} vary from 0 to 17. Out of 7,060 examples
749 in the corpus, it is nonzero only 36 times.
751 @code{encoding} is a character encoding, usually a Windows code page
752 such as @code{en_US.windows-1252} or @code{it_IT.windows-1252}. The
753 rest of the character strings in the member use this encoding. The
754 encoding string is itself encoded in US-ASCII.
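
In Python, applying @code{encoding} to the member's other strings
could look like the sketch below.  It assumes that the part after the
last @samp{.} is a code page name that Python's codec registry
recognizes; an unknown name raises @code{LookupError}.  The function
name is invented for this example.

```python
def decode_member_string(raw, encoding):
    """Decode raw string bytes using an encoding such as "en_US.windows-1252"."""
    # Keep only the code page part; a bare name passes through unchanged.
    codec = encoding.rpartition(".")[2]
    return raw.decode(codec)
```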
756 @code{decimal} is the decimal point character. The observed values
757 are @samp{.} and @samp{,}.
759 @code{grouping} is the grouping character. Usually, it is @samp{,} if
760 @code{decimal} is @samp{.}, and vice versa. Other observed values are
761 @samp{'} (apostrophe), @samp{ } (space), and zero (presumably
762 indicating that digits should not be grouped).
764 @code{n-ccs} is observed as either 0 or 5. When it is 5, the
765 following strings are CCA through CCE format strings. @xref{Custom
766 Currency Formats,,, pspp, PSPP}. Most commonly these are all
767 @code{-,,,} but other strings occur.
769 @node SPV Light Member Dimensions
770 @subsection Dimensions
772 A pivot table presents multidimensional data. A Dimension identifies
773 the categories associated with each dimension.
777 Dimensions @result{} int[@t{n-dims}] Dimension*[@t{n-dims}]
778 Dimension @result{} Value[@t{name}] DimUnknown int[@t{n-categories}] Category*[@t{n-categories}]
781 (00 @math{|} 01 @math{|} 02)[@t{d2}]
782 (i0 @math{|} i2)[@t{d3}]
783 (00 @math{|} 01)[@t{d4}]
784 (00 @math{|} 01)[@t{d5}]
@code{name} is the name of the dimension, e.g.@: @code{Variables},
791 @code{Statistics}, or a variable name.
793 @code{d1} is usually 0 but many other values have been observed.
795 @code{d3} is 2 over 99% of the time.
797 @code{d5} is 0 over 99% of the time.
799 @code{d6} is either -1 or the 0-based index of the dimension, e.g.@: 0
800 for the first dimension, 1 for the second, and so on. The latter is
801 the case 98% of the time in the corpus.
803 @node SPV Light Member Categories
804 @subsection Categories
806 Categories are arranged in a tree. Only the leaf nodes in the tree
807 are really categories; the others just serve as grouping constructs.
811 Category @result{} Value[@t{name}] (Leaf @math{|} Group)
812 Leaf @result{} 00 00 00 i2 int[@t{index}] i0
814 (00 @math{|} 01)[@t{merge}] 00 01 (i0 @math{|} i2)[@t{data}]
815 i-1 int[@t{n-subcategories}] Category*[@t{n-subcategories}]
819 @code{name} is the name of the category (or group).
821 A Leaf represents a leaf category. The Leaf's @code{index} is a
822 nonnegative integer less than @code{n-categories} in the Dimension in
823 which the Category is nested (directly or indirectly).
A Group represents a group of nested categories.  Usually a Group
contains at least one Category, so that @code{n-subcategories} is
positive, but a few Groups with @code{n-subcategories} 0 have been
830 If a Group's @code{merge} is 00, the most common value, then the group
831 is really a distinct group that should be represented as such in the
832 visual representation and user interface. If @code{merge} is 01, the
833 categories in this group should be shown and treated as if they were
834 direct children of the group's containing group (or if it has no
835 parent group, then direct children of the dimension), and this group's
836 name is irrelevant and should not be displayed. (Merged groups can be
839 A Group's @code{data} appears to be i2 when all of the categories
840 within a group are leaf categories that directly represent data values
for a variable (e.g.@: in a frequency table or crosstabulation, a group
842 of values in a variable being tabulated) and i0 otherwise.
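
The effect of @code{merge} on presentation can be sketched in Python
as a tree rewrite that promotes the children of merged groups into
their parent.  The @code{Leaf} and @code{Group} tuples and the function
name @code{flatten_merged} are hypothetical in-memory representations,
not part of the file format.

```python
from collections import namedtuple

# Hypothetical in-memory forms of the Leaf and Group productions.
Leaf = namedtuple("Leaf", "name index")
Group = namedtuple("Group", "name merge children")


def flatten_merged(categories):
    """Return categories with the children of merged groups promoted."""
    out = []
    for cat in categories:
        if isinstance(cat, Group):
            kids = flatten_merged(cat.children)
            if cat.merge:
                # A merged group's name is dropped; its children join
                # the containing level.
                out.extend(kids)
            else:
                out.append(Group(cat.name, False, kids))
        else:
            out.append(cat)
    return out
```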
844 @node SPV Light Member Data
847 The final part of an SPV light member contains the actual data.
852 int[@t{layers}] int[@t{rows}] int[@t{columns}] int*[@t{n-dimensions}]
853 int[@t{n-data}] Datum*[@t{n-data}]
854 Datum @result{} int64[@t{index}] v3(00?) Value
The values of @code{layers}, @code{rows}, and @code{columns} specify
the number of dimensions displayed in layers, rows, and
860 columns, respectively. Any of them may be zero. Their values sum to
861 @code{n-dimensions} from Dimensions (@pxref{SPV Light Member
864 The @code{n-dimensions} integers are a permutation of the 0-based
865 dimension numbers. The first @code{layers} integers specify each of
866 the dimensions represented by layers, the next @code{rows} integers
867 specify the dimensions represented by rows, and the final
868 @code{columns} integers specify the dimensions represented by columns.
869 When there is more than one dimension of a given kind, the inner
870 dimensions are given first.
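
For instance, a reader might split the permutation into its three
groups as in this Python sketch.  The function name @code{split_axes}
is invented, and the validity checks are additions of the example.
Within each returned list, dimensions remain in file order, that is,
inner dimensions first.

```python
def split_axes(layers, rows, columns, permutation):
    """Split the dimension permutation into (layer, row, column) lists."""
    n = len(permutation)
    if layers + rows + columns != n:
        raise ValueError("axis counts do not sum to the number of dimensions")
    if sorted(permutation) != list(range(n)):
        raise ValueError("not a permutation of 0-based dimension numbers")
    row_end = layers + rows
    return (permutation[:layers],
            permutation[layers:row_end],
            permutation[row_end:])
```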
872 The format of a Datum varies slightly from version 1 to version 3: in
873 version 1 it allows for an extra optional 00 byte.
875 A Datum consists of an @code{index} and a Value. Suppose there are
876 @math{d} dimensions and dimension @math{i}, @math{0 \le i < d}, has
877 @math{n_i} categories. Consider the datum at coordinates @math{x_i},
878 @math{0 \le i < d}, and note that @math{0 \le x_i < n_i}. Then the
879 index is calculated by the following algorithm:
883 for each @math{i} from 0 to @math{d - 1}:
884 @i{index} = (@math{n_i \times} @i{index}) @math{+} @math{x_i}
887 For example, suppose there are 3 dimensions with 3, 4, and 5
888 categories, respectively. The datum at coordinates (1, 2, 3) has
889 index @math{5 \times (4 \times (3 \times 0 + 1) + 2) + 3 = 33}.
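
The same calculation, together with its inverse, can be written in
Python as follows.  The inverse is not described in the text above; it
is included only because a reader needs it to turn a stored
@code{index} back into coordinates.  Both function names are invented.

```python
def datum_index(sizes, coords):
    """Compute a Datum index from per-dimension sizes and coordinates."""
    index = 0
    for n, x in zip(sizes, coords):
        index = n * index + x
    return index


def datum_coords(sizes, index):
    """Recover the coordinates that datum_index would map to index."""
    coords = []
    # The last dimension varies fastest, so peel it off first.
    for n in reversed(sizes):
        index, x = divmod(index, n)
        coords.append(x)
    return coords[::-1]
```

With sizes 3, 4, and 5, @code{datum_index([3, 4, 5], [1, 2, 3])}
yields 33, matching the worked example.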
891 @node SPV Light Member Value
894 Value is used throughout the SPV light member format. It boils down
895 to a number or a string.
899 Value @result{} 00? 00? 00? 00? RawValue
901 01 ValueMod int[@t{format}] double[@t{x}]
902 @math{|} 02 ValueMod int[@t{format}] double[@t{x}]
903 string[@t{varname}] string[@t{vallab}] (01 @math{|} 02 @math{|} 03)
904 @math{|} 03 string[@t{local}] ValueMod string[@t{id}] string[@t{c}] (00 @math{|} 01)[@t{type}]
905 @math{|} 04 ValueMod int[@t{format}] string[@t{vallab}] string[@t{varname}]
906 (01 @math{|} 02 @math{|} 03) string[@t{s}]
907 @math{|} 05 ValueMod string[@t{varname}] string[@t{varlabel}] (01 @math{|} 02 @math{|} 03)
908 @math{|} ValueMod string[@t{format}] int[@t{n-args}] Argument*[@t{n-args}]
911 @math{|} int[@t{x}] i0 Value*[@t{x}@math{+}1] /* @t{x} @math{>} 0 */
915 There are several possible encodings, which one can distinguish by the
916 first nonzero byte in the encoding.
920 The numeric value @code{x}, intended to be presented to the user
921 formatted according to @code{format}, which is in the format described
922 for system files. @xref{System File Output Formats}, for details.
923 Most commonly, @code{format} has width 40 (the maximum).
925 An @code{x} with the maximum negative double value @code{-DBL_MAX}
926 represents the system-missing value SYSMIS. (HIGHEST and LOWEST have
927 not been observed.) @xref{System File Format}, for more about these
931 Similar to @code{01}, with the additional information that @code{x} is
932 a value of variable @code{varname} and has value label @code{vallab}.
933 Both @code{varname} and @code{vallab} can be the empty string, the
934 latter very commonly.
936 The meaning of the final byte is unknown. Possibly it is connected to
937 whether the value or the label should be displayed.
940 A text string, in two forms: @code{c} is in English, and sometimes
941 abbreviated or obscure, and @code{local} is localized to the user's
942 locale. In an English-language locale, the two strings are often the
943 same, and in the cases where they differ, @code{local} is more
944 appropriate for a user interface, e.g.@: @code{c} of ``Not a PxP table
945 for MCN...'' versus @code{local} of ``Computed only for a PxP table,
946 where P must be greater than 1.''
948 @code{c} and @code{local} are always either both empty or both
951 @code{id} is a brief identifying string whose form seems to resemble a
952 programming language identifier, e.g.@: @code{cumulative_percent} or
953 @code{factor_14}. It is not unique.
955 @code{type} is 00 for text taken from user input, such as syntax
956 fragments, expressions, file names, and data set names, and 01 for fixed
957 text strings such as names of procedures or statistics. In the former
958 case, @code{id} is always the empty string; in the latter case,
959 @code{id} is still sometimes empty.
962 The string value @code{s}, intended to be presented to the user
963 formatted according to @code{format}. The format for a string carries
964 little useful information, and the corpus contains many clearly invalid
965 formats such as A16.39, A255.127, or A134.1, so readers should probably ignore it.
968 @code{s} is a value of variable @code{varname} and has value label
969 @code{vallab}. @code{varname} is never empty but @code{vallab} may be empty.
972 The meaning of the final byte is unknown.
975 Variable @code{varname}, which is rarely observed as empty in the
976 corpus, with variable label @code{varlabel}, which is often empty.
978 The meaning of the final byte is unknown.
981 (These bytes begin a ValueMod.) A format string, analogous to
982 @code{printf}, followed by one or more Arguments, each of which has
983 one or more values. The format string uses the following syntax:
990 Each of these expands to the character following @samp{\\}, to escape
991 characters that have special meaning in format strings. These are
992 effective inside and outside the @code{[@dots{}]} syntax forms described below.
996 Expands to a new-line, inside or outside the @code{[@dots{}]} forms described below.
1000 Expands to a formatted version of argument @var{i}, which must have
1001 only a single value. For example, @code{^1} expands to the first
1002 argument's @code{value}.
1004 @item [:@var{a}:]@var{i}
1005 Expands @var{a} for each of the values in argument @var{i}. @var{a}
1006 should contain one or more @code{^@var{j}} conversions, which are
1007 drawn from the values for argument @var{i} in order. Some examples from the corpus:
1012 All of the values for the first argument, concatenated.
1015 Expands to the values for the first argument, each followed by
1019 Expands to @code{@var{x} = @var{y}} where @var{x} is the second
1020 argument's first value and @var{y} is its second value. (This would
1021 be used only if the argument has two values. If there were more
1022 values, the second and third values would be directly concatenated,
1023 which would look funny.)
1026 @item [@var{a}:@var{b}:]@var{i}
1027 This extends the previous form so that the first values are expanded
1028 using @var{a} and later values are expanded using @var{b}. For an
1029 unknown reason, within @var{a} the @code{^@var{j}} conversions are
1030 instead written as @code{%@var{j}}. Some examples from the corpus:
1034 Expands to all of the values for the first argument, separated by
1037 @item [%1 = %2:, ^1 = ^2:]1
1038 Given appropriate values for the first argument, expands to @code{X = 1, Y = 2, Z = 3}.
1042 Given appropriate values, expands to @code{1, 2, 3}.
1046 The format string is localized to the user's locale.
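
As a rough illustration of the syntax above, the following Python
sketch expands a format string given arguments whose values have
already been formatted as strings. The names @code{expand},
@code{expand_group}, and @code{substitute} are hypothetical and do not
come from any SPSS source. The sketch assumes that the
@code{[:@var{a}:]@var{i}} form can be treated as
@code{[@var{a}:@var{b}:]@var{i}} with an empty first part, and that
each expansion consumes as many values as the highest-numbered
conversion in the template it uses. Both are guesses that happen to
reproduce the examples above, not confirmed behavior.

```python
import re

# One token of a format string: an escape, a ^i conversion, or a
# bracketed [a:b:]i group (a is empty in the [:a:]i form).
PART = r"((?:\\.|[^\\:])*)"
TOKEN = re.compile(r"\\(.)|\^(\d+)|\[" + PART + ":" + PART + r":\](\d+)",
                   re.DOTALL)
# Inside a group, both ^j and %j count as conversions.
CONVERSION = re.compile(r"\\(.)|[\^%](\d+)", re.DOTALL)


def unescape(char):
    """Backslash-n is a new-line; any other escaped character is itself."""
    return "\n" if char == "n" else char


def substitute(template, values):
    """Expand escapes and ^j / %j conversions using one chunk of values."""
    def replace(match):
        if match.group(1) is not None:
            return unescape(match.group(1))
        index = int(match.group(2)) - 1
        # Assumption: a conversion with no matching value expands to nothing.
        return values[index] if index < len(values) else ""
    return CONVERSION.sub(replace, template)


def expand_group(first, rest, values):
    """Expand the first chunk with `first` (if nonempty), later ones with `rest`."""
    out = []
    pos = 0
    template = first if first else rest
    while pos < len(values):
        used = max((int(m.group(2)) for m in CONVERSION.finditer(template)
                    if m.group(2)), default=0)
        if used == 0:
            break  # no conversions, so nothing more can be consumed
        out.append(substitute(template, values[pos:pos + used]))
        pos += used
        template = rest
    return "".join(out)


def expand(template, args):
    """args[i - 1] is the list of formatted values for argument i."""
    def replace(match):
        escape, simple, first, rest, group = match.groups()
        if escape is not None:
            return unescape(escape)
        if simple is not None:
            return args[int(simple) - 1][0]
        return expand_group(first, rest, args[int(group) - 1])
    return TOKEN.sub(replace, template)
```

With this sketch, @code{expand("[%1:, ^1:]1", [["1", "2", "3"]])}
returns @code{1, 2, 3}, matching the example above.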
1049 @node SPV Light Member ValueMod
1050 @subsection ValueMod
1052 A ValueMod can specify special modifications to a Value.
1057 31 i0 (i0 @math{|} i1 string[@t{subscript}])
1058 v1(00 (i1 @math{|} i2) 00 00 int 00 00)
1059 v3(count(FormatString Style ValueModUnknown))
1060 @math{|} 31 i1 int[@t{footnote-number}] Format
1061 @math{|} 31 i2 (00 @math{|} 01 @math{|} 02) 00 (i1 @math{|} i2 @math{|} i3) Format
1062 @math{|} 31 i3 00 00 01 00 i2 Format
1064 Style @result{} 58 @math{|} 31 01? 00? 00? 00? 01 string[@t{fgcolor}] string[@t{bgcolor}] string[@t{typeface}] byte
1065 Format @result{} 00 00 count(FormatString Style 58)
1066 FormatString @result{} count((i0 (58 @math{|} 31 string))?)
1067 ValueModUnknown @result{} 58 @math{|} 31 i0 i0 i0 i0 01 00 (01 @math{|} 02 @math{|} 08) 00 08 00 0a 00
1071 The @code{footnote-number}, if present, specifies a footnote that the
1072 Value references. The footnote's marker is shown appended to the main
1073 text of the Value, as a superscript.
1075 The @code{subscript}, if present, specifies a string to append to the
1076 main text of the Value, as a subscript. The subscript text is a brief
1077 indicator, e.g.@: @samp{a} or @samp{a,b}, with its meaning indicated
1078 by the table caption. In this usage, subscripts are similar to
1079 footnotes; one apparent difference is that a Value can only reference
1080 one footnote but a subscript can list more than one letter.
1082 The Format, if present, is a format string for substitutions using the
1083 syntax explained previously. It appears to be an English-language
1084 version of the localized format string in the Value in which the Format is nested.
1087 The Style, if present, changes the style for this individual Value.
1089 @node SPV Legacy Detail Member Binary Format
1090 @section Legacy Detail Member Binary Format
1092 Whereas the light binary format represents everything about a given
1093 pivot table, the legacy binary format conceptually consists of a
1094 number of named sources, each of which consists of a number of named
1095 series, each of which is a 1-dimensional array of numbers or strings
1096 or a mix. Thus, the legacy binary member format is quite simple.
1098 This section uses the same context-free grammar notation as in the
1099 previous section, with the following additions:
1103 In a version 0xaf legacy member, @var{x}; in other versions, nothing.
1104 (The legacy member header indicates the version; see below.)
1107 In a version 0xb0 legacy member, @var{x}; in other versions, nothing.
1110 A legacy detail member @file{.bin} has the following overall format:
1114 LegacyBinary @result{}
1115 00 byte[@t{version}] int16[@t{n-sources}] int[@t{member-size}]
1116 Metadata*[@t{n-sources}] Data*[@t{n-sources}]
1120 @code{version} is a version number that affects the interpretation of
1121 some of the other data in the member. Versions 0xaf and 0xb0 are
1122 known. We will refer to ``version 0xaf'' and ``version 0xb0'' members below.
1125 A legacy member consists of @code{n-sources} data sources, each of
1126 which has Metadata and Data.
1128 @code{member-size} is the size of the legacy binary member, in bytes.
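
The header is small enough to decode in one step. The following Python
sketch is illustrative only: the function name
@code{parse_legacy_header} is hypothetical, and it assumes that
@code{int16} and @code{int} are unsigned little-endian integers, as in
the notation of the previous section.

```python
import struct

# 00 byte[version] int16[n-sources] int[member-size]
# Assumption: all integers are little-endian and unsigned.
HEADER = struct.Struct("<BBHI")


def parse_legacy_header(member):
    """Return (version, n_sources, member_size) from a legacy member."""
    zero, version, n_sources, member_size = HEADER.unpack_from(member, 0)
    if zero != 0:
        raise ValueError("legacy member does not start with a 00 byte")
    if version not in (0xAF, 0xB0):
        raise ValueError("unknown legacy member version 0x%02x" % version)
    return version, n_sources, member_size
```

The Metadata for the first source begins immediately after these 8
bytes, at offset @code{HEADER.size}.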
1130 The following sections go into more detail.
1133 * SPV Legacy Member Metadata::
1134 * SPV Legacy Member Data::
1137 @node SPV Legacy Member Metadata
1138 @subsection Metadata
1143 int[@t{per-series}] int[@t{n-series}] int[@t{offset}]
1144 vAF(byte*32[@t{source-name}])
1145 vB0(byte*64[@t{source-name}] int[@t{x}])
1149 A data source consists of @code{n-series} series of data, with
1150 @code{per-series} data values per series.
1152 @code{source-name} is a 32- or 64-byte string padded on the right with
1153 zero bytes. The names that appear in the corpus are very generic,
1154 usually @code{tableData} or @code{source0}.
1156 A given Metadata's @code{offset} is the offset, in bytes, from the
1157 beginning of the member to the start of the corresponding Data. This
1158 allows programs to skip to the beginning of the data for a particular
1159 source; it is also important to determine whether a source includes
1160 any string data (@pxref{SPV Legacy Member Data}).
1162 The meaning of @code{x} in version 0xb0 is unknown.
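
A reader might decode one Metadata record along the following lines.
This Python sketch uses a hypothetical function name,
@code{parse_metadata}, and assumes little-endian unsigned integers and
an ASCII-compatible @code{source-name}. Neither assumption is stated
in this section.

```python
import struct


def parse_metadata(member, pos, version):
    """Parse one Metadata record starting at byte offset pos.

    Returns (per_series, n_series, offset, source_name, next_pos).
    """
    per_series, n_series, offset = struct.unpack_from("<III", member, pos)
    pos += 12
    # The name is 32 bytes in version 0xaf and 64 bytes in version 0xb0.
    name_len = 32 if version == 0xAF else 64
    raw_name = member[pos:pos + name_len]
    pos += name_len
    if version == 0xB0:
        pos += 4  # skip the int whose meaning is unknown
    # Drop the zero padding on the right.
    source_name = raw_name.split(b"\0", 1)[0].decode("utf-8", "replace")
    return per_series, n_series, offset, source_name, pos
```

Calling this function @code{n-sources} times, feeding each returned
position into the next call, yields the Metadata for every source.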
1164 @node SPV Legacy Member Data
1169 Data @result{} NumericData StringData?
1170 NumericData @result{} NumericSeries*[@t{n-series}]
1171 NumericSeries @result{} byte*288[@t{series-name}] double*[@t{per-series}]
1175 Data follow the Metadata in the legacy binary format, with sources in
1176 the same order. Each NumericSeries begins with a @code{series-name}
1177 that generally indicates its role in the pivot table, e.g.@: ``cell'',
1178 ``cellFormat'', ``dimension0categories'', ``dimension0group0'',
1179 followed by the numeric data, one double per element in the series. A
1180 double equal to @code{-DBL_MAX}, the most negative finite double,
1181 represents the system-missing value SYSMIS.
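
The following Python sketch reads one NumericSeries and maps SYSMIS to
@code{None}. The name @code{parse_numeric_series} is hypothetical. The
sketch assumes that doubles are little-endian IEEE 754 values and that
@code{series-name} is padded on the right with zero bytes in the same
way as @code{source-name}.

```python
import struct
import sys

SYSMIS = -sys.float_info.max  # the same value as C's -DBL_MAX


def parse_numeric_series(member, pos, per_series):
    """Return (series_name, values, next_pos); SYSMIS becomes None."""
    raw_name = member[pos:pos + 288]
    pos += 288
    doubles = struct.unpack_from("<%dd" % per_series, member, pos)
    pos += 8 * per_series
    # Assumption: the name is zero-padded on the right.
    name = raw_name.split(b"\0", 1)[0].decode("utf-8", "replace")
    values = [None if d == SYSMIS else d for d in doubles]
    return name, values, pos
```

Comparing against @code{SYSMIS} with @code{==} is exact here because
the stored value and the constant are the same bit pattern.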
1185 StringData @result{} i1 string[@t{source-name}] Pairs Labels
1187 Pairs @result{} int[@t{n-string-series}] PairSeries*[@t{n-string-series}]
1188 PairSeries @result{} string[@t{pair-series-name}] int[@t{n-pairs}] Pair*[@t{n-pairs}]
1189 Pair @result{} int[@t{i}] int[@t{j}]
1191 Labels @result{} int[@t{n-labels}] Label*[@t{n-labels}]
1192 Label @result{} int[@t{frequency}] string[@t{s}]
1196 A source may include a mix of numeric and string data values. When a
1197 source includes any string data, the data values that are strings are
1198 set to SYSMIS in the NumericSeries, and StringData follows the
1199 NumericData. A source that contains no string data omits the
1200 StringData. To reliably determine whether a source includes
1201 StringData, the reader should check whether the offset following the
1202 NumericData is the offset of the next source, as indicated by its
1203 Metadata (or the end of the member, in the case of the last source).
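
The check amounts to simple arithmetic on the Metadata fields, since
each NumericSeries occupies 288 bytes of name plus 8 bytes per value.
A minimal Python sketch follows. The function names and the
representation of the Metadata as tuples are hypothetical.

```python
def numeric_data_end(offset, n_series, per_series):
    """Byte offset just past a source's NumericData."""
    return offset + n_series * (288 + 8 * per_series)


def has_string_data(sources, index, member_size):
    """Report whether source number index is followed by StringData.

    sources is a list of (per_series, n_series, offset) tuples taken
    from the Metadata, in file order.
    """
    per_series, n_series, offset = sources[index]
    if index + 1 < len(sources):
        limit = sources[index + 1][2]  # offset of the next source
    else:
        limit = member_size  # the last source runs to the end of the member
    return numeric_data_end(offset, n_series, per_series) != limit
```

For example, a source at offset 100 with 2 series of 3 values each has
NumericData ending at offset 100 + 2 * (288 + 24) = 724, so it includes
StringData exactly when the next source does not start at offset 724.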
1205 StringData repeats the name of the source (from Metadata).
1207 The string data overlays the numeric data. @code{n-string-series} is
1208 the number of series within the source that include string data. More
1209 precisely, it is the 1-based index of the last series in the source
1210 that includes any string data; thus, it would be 4 if there are 5
1211 series and only the fourth one includes string data.
1213 Each PairSeries consists of a sequence of 0 or more Pair nonterminals,
1214 each of which maps from a 0-based index @code{i} within the series to a
1215 0-based label index @code{j}, e.g.@: pair @code{i} = 2, @code{j} = 3,
1216 means that the third data value (with value SYSMIS) is to be replaced
1217 by the string of the fourth Label.
1219 The labels themselves follow the pairs. The valuable part of each
1220 label is the string @code{s}. Each label also includes a
1221 @code{frequency} that reports the number of pairs that reference it
1222 (although this is not useful).
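
Once the pairs and labels have been read, applying them is a plain
table lookup. The following Python sketch assumes that the data have
already been parsed into lists, and that the @var{k}th PairSeries
applies to the @var{k}th series of the source, which is how this
description reads but is otherwise an assumption. The name
@code{overlay_strings} is hypothetical.

```python
def overlay_strings(series, pair_series, labels):
    """Replace string-valued elements in a source's series.

    series: list of lists of values, one list per series in the source.
    pair_series: list of lists of (i, j) pairs; entry k applies to
        series[k] (an assumption about the overlay order).
    labels: list of label strings, indexed by j.
    """
    result = [list(values) for values in series]
    # zip stops at the shorter list, so series past n-string-series
    # are left untouched.
    for values, pairs in zip(result, pair_series):
        for i, j in pairs:
            values[i] = labels[j]
    return result
```

With the pair from the example above, @code{i} = 2 and @code{j} = 3,
the third value of the series is replaced by the fourth label.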
1224 @node SPV Legacy Detail Member XML Format
1225 @section Legacy Detail Member XML Format
1227 This format is still under investigation.