technical.md

   1 # Technical information about Deark #
   2
   3 This document is a supplement to the information in the [readme.md](readme.md)
   4 file.
   5
   6 ## Mission statement ##
   7
   8 Deark has several related purposes:
   9
  10 * To find interesting things that are stored in files, but usually ignored,
  11 such as thumbnail images and comments.
  12
  13 * To rescue data from uncommon file formats, and to be a convenient way to
  14 decode many different formats.
  15
* To help users learn about a file and its format, whether or not anything is
extracted from it. The "-d" option is a core feature for this purpose.

* To aid digital preservation of information about file formats. Deark's source
code encapsulates information about some formats that might otherwise be hard
to find.
  22
  23 There's not much rhyme or reason to the formats Deark supports, or to its
  24 features. It exists mainly because I've written too many one-off programs to
  25 decode file formats, and wanted to put everything in one place. Part of the
  26 goal is to support (mainly old) formats that are under-served by other
  27 open-source software. Many of the formats it currently supports are related to
  28 either graphics or compression/archiving, but it is not limited to such
  29 formats.
  30
  31 ## Security ##
  32
  33 Deark is intended to be safe to use with untrusted input files, but there are
  34 no promises. It is written in C, and vulnerabilities very likely exist.
  35
  36 A strategically-designed input file can definitely cause Deark to use a
  37 disproportionate amount of system resources, such as disk space or CPU time.
  38 Deark does enforce some resource limits, but not consistently. This is a
  39 difficult problem to solve.
  40
The best way to reduce the attack surface is to use the -onlymods option. Other
options that can improve security include -onlydetect, -m, -maxfiles,
-maxfilesize, -maxtotalsize, and -zip.
  44
  45 ## The filename problem ##
  46
  47 When Deark writes a file, it has to decide what to name it. This can be a very
  48 difficult problem. For one thing, what is and is not a valid filename depends
  49 on the user's platform, and the relevant filesystem type. For another thing,
  50 there are security hazards everywhere. Deark should not try to write a file
  51 named "/etc/passwd", for example.
  52
  53 Also, there are a near-limitless number of reasonable ways to construct an
  54 output filename, with an elaborate decision tree to select the best behavior
  55 in various circumstances.
  56
Deark essentially throws up its hands and gives up. By default, it gives every
output file a name that starts with "output.". It overwrites existing files
with no warning (unless you use -n). It bans all ASCII characters that could
conceivably be problematic, as well as any non-ASCII characters that don't
appear on its whitelist.
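
The approach can be illustrated with a minimal Python sketch. It is not Deark's actual code (Deark is written in C). The allow-list `SAFE_ASCII` and the function `make_output_name` are hypothetical, and for simplicity the sketch rejects all non-ASCII characters instead of consulting a whitelist.

```python
import string

# Assumption: an illustrative allow-list. Deark's real list is not given here.
SAFE_ASCII = set(string.ascii_letters + string.digits + "-_.")


def make_output_name(index: int, suggested: str) -> str:
    """Build a name of the form output.NNN.<sanitized suggestion>."""
    # Replace every character outside the allow-list, including path
    # separators and all non-ASCII characters, with an underscore.
    cleaned = "".join(ch if ch in SAFE_ASCII else "_" for ch in suggested)
    # Drop leading dots so a suggestion such as ".." cannot survive,
    # and fall back to a generic suffix if nothing is left.
    cleaned = cleaned.lstrip(".") or "bin"
    return f"output.{index:03d}.{cleaned}"
```

With this sketch, a suggested name of "/etc/passwd" becomes "output.000._etc_passwd", which can only land in the current directory.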
  62
  63 When Deark writes to a ZIP or tar file (the "-zip"/"-tar" option), it doesn't
  64 have to worry about what to name the internal files. It can palm that problem
  65 off onto your unzip/untar program. It is more tolerant in this case.
  66
  67 Directory paths are only maintained as such if you use -zip/-tar (and you don't
  68 use "-opt archive:subdirs=0"). Deark generally does not write a file anywhere
  69 other than the current directory, though you can tell it to do so with -od, or
  70 with other options such as -arcfn or -k3.
  71
  72 ## The "Is this one format or two?" problem ##
  73
  74 It's often hard to decide whether a format should get its own module, or be a
  75 part of some other module. Deark has some guidelines for this, but doesn't
  76 always follow them consistently.
  77
  78 Modules are not supposed to make use of the input filename, except during
  79 format detection. So if two formats can't be distinguished in any other way,
  80 they generally have to be placed in separate modules.
  81
  82 ## Format detection ##
  83
  84 If the user does not use the "-m" option, then Deark will try to guess the best
  85 module to use. It prefers to do this using only the contents of the file, but
  86 unfortunately, there are many file formats that cannot realistically be
  87 identified in such a way. So, in some cases, Deark also uses the filename,
  88 especially the filename extension.
  89
It does not use any other file attributes, such as the last-modified time or
the executable flag, though this could change in future versions.
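
As a rough illustration of "contents first, filename second", here is a small Python sketch. It is a simplification, not Deark's detection logic. The byte signatures are the well-known ones for PNG, GIF, and ZIP, but the module names, the `EXTENSIONS` table, and `guess_module` are hypothetical.

```python
import os

# Well-known file signatures, mapped to hypothetical module names.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),
]

# Hypothetical fallback for formats that have no reliable signature.
EXTENSIONS = {".pic": "hypothetical_pic"}


def guess_module(filename: str, head: bytes):
    """Return a module name, or None if nothing matches."""
    # Prefer evidence from the file contents.
    for magic, module in SIGNATURES:
        if head.startswith(magic):
            return module
    # Only then fall back to the filename extension.
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(ext)
```

In this sketch, a file named "photo.gif" that starts with a ZIP signature is treated as ZIP, because the contents take priority over the name.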
  92
  93 The filename is only used for format detection, and not for any other purpose.
  94 This helps make its behavior safe and predictable. The options -m, -start, and
  95 -fromstdin are among those that might need special cases added, if that were
  96 not the case.
  97
  98 This behavior *might* be changed in the future (as an option?), because some
  99 formats store important information in the filename, and having a separate
 100 module for each possibility isn't always feasible. For example, with Unix
 101 compress format, there is no other way to construct a good output filename, so
 102 Deark has to settle for a generic name like "output.000.bin".
 103
 104 ## Character encoding (console) ##
 105
 106 The "-d" option prints a lot of textual information to the console, some of
 107 which is not ASCII-compatible. Non-ASCII text can sometimes cause problems.
 108
 109 On Windows, Deark generally does the right thing automatically. However, if you
 110 are redirecting the output to a file or a pipe, there are cases where the
 111 "-enc" option can be helpful.
 112
 113 On Unix-like platforms, UTF-8 output will be written to the terminal,
 114 regardless of your LANG (etc.) environment variable. You can use "-enc ascii"
 115 to print only ASCII. (This is not ideal, but seriously, it's time to switch to
 116 UTF-8 if at all possible.)
 117
 118 On Unix-like platforms, command-line parameters are assumed to be in UTF-8.
 119
 120 ## Character encoding (output files) ##
 121
 122 When Deark generates a text file, its preferred encoding is UTF-8, with a BOM
 123 (unless you use "-nobom"). But there are many cases where it can't do that,
 124 because the original encoding is undefined, unsupported, or incompatible with
 125 Unicode. In such cases, it just writes out the original bytes as they are.
 126
If the text was already encoded in UTF-8, Deark does not behave perfectly
consistently. Some modules copy the bytes as they are, while others sanitize
them first.
 130
 131 Deark keeps the end-of-line characters as they are in the original file. If it
 132 has to generate end-of-line characters of its own, it uses Unix-style line-feed
 133 characters.
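
The policy described in this section might be sketched in Python as follows. This is an assumption-laden illustration, not Deark's code: `encode_text_output` and its parameters are hypothetical, and "unsupported" is approximated by "Python has no codec for it, or decoding fails".

```python
import codecs


def encode_text_output(raw: bytes, source_encoding, add_bom: bool = True) -> bytes:
    """Return the bytes to write for a text output file.

    source_encoding is a Python codec name, or None if the original
    encoding is undefined.
    """
    if source_encoding is None:
        return raw  # undefined encoding: write the original bytes as they are
    try:
        text = raw.decode(source_encoding)
    except (UnicodeDecodeError, LookupError):
        return raw  # unsupported or undecodable: same fallback
    # Decoding and re-encoding leave end-of-line characters untouched.
    body = text.encode("utf-8")
    return codecs.BOM_UTF8 + body if add_bom else body
```

Passing `add_bom=False` plays the role that "-nobom" plays on the command line.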
 134
 135 ## Executable output files ##
 136
 137 Most file attributes (such as file ownership) are ignored when extracting
 138 files, but Deark does try to maintain the "executable" status of output
 139 files, for formats which store this attribute. The Windows version of Deark
 140 does not use this information, except when writing to a ZIP/tar file.
 141
 142 This is a simple yes/no flag. It does not distinguish between
 143 owner-executable and world-executable, for example.
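
One way to picture such a flag is the Python sketch below. Both functions are hypothetical. Treating "any execute bit set" as executable, and the 0o755/0o644 output modes, are assumptions made for illustration, not documented Deark behavior.

```python
import stat


def is_executable(unix_mode: int) -> bool:
    """Collapse the owner, group, and world execute bits into one flag."""
    any_exec = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    return bool(unix_mode & any_exec)


def output_mode(executable: bool) -> int:
    # Assumption: illustrative permission bits for the written file.
    return 0o755 if executable else 0o644
```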
 144
 145 ## Directory "files" and empty directories ##
 146
Some archive formats contain independent representations of subdirectories,
 148 allowing empty directories, and directory attributes, to be stored. By default,
 149 Deark retains these entries when writing to a ZIP/tar file, and otherwise
 150 ignores them. This behavior can be changed with "-opt keepdirentries". Even so,
 151 Deark never creates directories directly. Instead, it may create marker files
 152 with a ".dir" extension.
 153
 154 Note that this means the -zip/-tar option can affect the numbering of output
 155 files used by, e.g., the -get option.
 156
 157 ## Modification times ##
 158
 159 In certain cases, Deark tries to maintain the modification time (and to a
 160 lesser degree, other timestamps) of the original file. This only happens with
 161 timestamps contained inside the input file.
 162
 163 If a timestamp does not include a time zone, the time will be assumed to be in
 164 Universal Time (UTC), unless the -intz option was used. Deark is expected to be
 165 used with files that were created long ago and far away, so it never assumes
 166 that the Deark user's time zone is relevant.
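
The rule can be sketched in Python as follows. The function `to_utc` and its `intz_hours` parameter are hypothetical stand-ins. In particular, the real syntax of the -intz option is not described here.

```python
from datetime import datetime, timedelta, timezone


def to_utc(naive: datetime, intz_hours=None) -> datetime:
    """Interpret a timestamp that carries no time zone.

    intz_hours is a hypothetical stand-in for -intz: the offset of the
    timestamp's zone from UTC, in hours. If it is None, UTC is assumed.
    The local time zone of the machine is never consulted.
    """
    offset = timedelta(hours=intz_hours) if intz_hours is not None else timedelta(0)
    return naive.replace(tzinfo=timezone(offset)).astimezone(timezone.utc)
```

For example, a stored time of 12:00 with an assumed zone of UTC-5 becomes 17:00 UTC, regardless of where the program is run.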
 167
 168 Note that if you are extracting to a system that does not store file times in
 169 UTC (often the case on Windows), the timestamps may not be very accurate.
 170
 171 ## Modification times and thumbnails ##
 172
 173 Some thumbnail image formats store the last-modified time of the original file.
 174 This raises the question of whether Deark should use this as the last-modified
 175 time of the extracted thumbnail file. Currently, Deark *does* do this, but it
 176 must be acknowledged that there's something not quite right about it, because
 177 the thumbnail may have been created much later than the original image.
 178
 179 ## The .iptctiff and .8bimtiff formats ##
 180
 181 In some cases, Deark saves IPTC-IIM metadata, or Photoshop Resources (also
 182 semi-incorrectly known as "8BIM"), to a file. These data formats don't have a
 183 good *file* format to use, so Deark wraps them in a minimal TIFF-based
 184 container. You can reprocess this container file with Deark, and it may decode
 185 the data (use -d), or extract the raw data to a file.
 186
 187 ## AppleDouble format ##
 188
In most cases, Deark writes Macintosh resource forks to AppleDouble format. It
considers this to be its preferred format for resource forks. You have to use
an option if you want it to write the fork in raw form.
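
For readers unfamiliar with the container, the following Python sketch wraps a raw resource fork in a minimal AppleDouble (version 2) file with a single entry. The function name is hypothetical, and this is not necessarily what Deark emits. Real output may carry additional entries, such as Finder info.

```python
import struct

APPLEDOUBLE_MAGIC = 0x00051607
APPLEDOUBLE_VERSION_2 = 0x00020000
ENTRY_ID_RESOURCE_FORK = 2


def wrap_resource_fork(rsrc: bytes) -> bytes:
    """Return a minimal AppleDouble file containing only a resource fork."""
    # Header: magic, version, 16 filler bytes, number of entries (1).
    header = struct.pack(">II16sH", APPLEDOUBLE_MAGIC, APPLEDOUBLE_VERSION_2,
                         b"\x00" * 16, 1)
    # One 12-byte entry descriptor: entry ID, data offset, data length.
    data_offset = len(header) + 12
    entry = struct.pack(">III", ENTRY_ID_RESOURCE_FORK, data_offset, len(rsrc))
    return header + entry + rsrc
```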
 192
 193 It gives AppleDouble output files an ".adf" file extension. Although this is
 194 one of the conventions suggested in the AppleDouble specification, it is not
 195 commonly used. The other naming conventions don't play well with Deark's naming
 196 conventions.
 197
 198 Another problem is that, because Deark gives each output file a unique prefix
 199 like "output.NNN", the AppleDouble file and its associated data fork will not
 200 have the same base filename, further reducing the chance that other systems
 201 will treat them as a unit. There's no good fix for this, though you may be able
 202 to avoid it by using the -zip option.
 203
 204 ## PNG htSP chunks ##
 205
 206 When decoding mouse cursor graphics, Deark sometimes records the cursor's
 207 "hotspot" in the resulting PNG image, in a custom "htSP" chunk. The htSP
 208 chunk's format is explained here.
 209
 210 The chunk type is "htSP": hex [68 74 53 50].
 211
 212 The chunk data field length is 24 or more bytes. Encoders must write exactly 24
 213 bytes. Decoders must ignore any bytes after the first 24.
 214
 215 The first 16 bytes of the data field are an arbitrary signature UUID: hex [b9
 216 fe 4f 3d 8f 32 45 6f aa 02 dc d7 9c ce 0e 24]. This represents the UUID
 217 b9fe4f3d-8f32-456f-aa02-dcd79cce0e24. If the first 16 bytes are not exactly
 218 this signature, the chunk does not conform to this specification.
 219
 220 At most one htSP chunk with this signature may appear in a PNG file. The chunk
 221 must appear before the IDAT chunks.
 222
 223 After the signature are two 4-byte fields: the X coordinate at offset 16, then
 224 the Y coordinate at offset 20. Each is stored as a "PNG four-byte signed
 225 integer" (big-endian, two's complement).
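
The layout described so far can be exercised with a short Python sketch. The function names are hypothetical. The chunk framing (length, type, data, CRC-32 over type and data) follows the general PNG chunk structure, and the sketch does not check the PNG rule that signed integers exclude the value -2^31.

```python
import struct
import uuid
import zlib

# The 16-byte signature, in the byte order given above.
HTSP_SIGNATURE = uuid.UUID("b9fe4f3d-8f32-456f-aa02-dcd79cce0e24").bytes


def build_htsp_chunk(x: int, y: int) -> bytes:
    """Return a complete htSP chunk: length, type, data, CRC."""
    # Encoders write exactly 24 data bytes: signature, then X, then Y.
    data = HTSP_SIGNATURE + struct.pack(">ii", x, y)
    body = b"htSP" + data
    crc = zlib.crc32(body)  # PNG CRCs cover the chunk type and the data
    return struct.pack(">I", len(data)) + body + struct.pack(">I", crc)


def parse_htsp_data(data: bytes):
    """Return (x, y) from a chunk data field, or None if it does not conform."""
    if len(data) < 24 or data[:16] != HTSP_SIGNATURE:
        return None
    # Decoders ignore any bytes after the first 24.
    return struct.unpack(">ii", data[16:24])
```

A writer would place the result of `build_htsp_chunk` somewhere before the first IDAT chunk, and would write at most one such chunk per file.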
 226
The X coordinate is the number of pixels the hotspot is to the right of the
image's leftmost column of pixels. (If the hotspot is in the leftmost column,
then its coordinate is 0.) The Y coordinate is the number of pixels the hotspot
is below the image's topmost row of pixels. It is legal for the hotspot to be
beyond the bounds of the image.
 232
 233 The hotspot is conceptually an entire pixel (or virtual pixel), not a specific
 234 point in some coordinate system. If more precision is needed, assume the
 235 hotspot is the center of that pixel. This means that if a 16x16-pixel image
 236 with hotspot (0,0) were to be mirrored left-right, the new hotspot would be
 237 (15,0), not (16,0) as it would be if the hotspot were the upper-left corner of
 238 the pixel.
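
The mirroring arithmetic follows directly from treating the hotspot as a whole pixel. A one-function Python sketch (the function name is hypothetical):

```python
def mirror_hotspot_x(x: int, width: int) -> int:
    """Return the hotspot's X coordinate after a left-right mirror.

    Pixel column x maps to column (width - 1 - x), because the hotspot
    names an entire pixel and not the pixel's upper-left corner.
    """
    return width - 1 - x
```

For a 16-pixel-wide image, column 0 maps to column 15, and applying the function twice returns the original coordinate. The same formula works for hotspots outside the image bounds.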
 239
 240 ## I've never heard of that format! ##
 241
 242 For the identities of the formats supported by Deark, see
 243
 244 - [File format wiki: Electronic File Formats](http://fileformats.archiveteam.org/wiki/Electronic_File_Formats)
 245 - [File format wiki: Graphics](http://fileformats.archiveteam.org/wiki/Graphics)
 246
 247 ## Other information ##
 248
 249 By design, Deark does not look at any files that don't explicitly appear on the
 250 command line. In the future, there might be an option to change this behavior,
 251 and automatically try to find related files.
 252
 253 Bitmap fonts are converted to images. Someday, there might be an option to
 254 convert them to some portable font format, but that is difficult to do well.
 255
 256 ## How to build ##
 257
 258 Deark is written in C. On a Unix-like system, typing "make" from a shell prompt
 259 will (hopefully) be sufficient:
 260
 261     $ make
 262
 263 This will build an executable file named "deark". Deark has no dependencies,
 264 other than the standard C libraries.
 265
 266 One way to configure the build is by setting certain environment variables. See
 267 the scripts at scripts/example-build-* for examples.
 268
 269 Another way is to create Makefile fragment files named local1.mk and/or
 270 local2.mk.
 271
 272 Some C-language-level configuration can be done by creating a file named
 273 src/deark-config2.h, and adding -DDE_USE_CONFIG2_H to the CFLAGS variable in
 274 the Makefile.
 275
 276 It is safe to build Deark using "parallel make", i.e. "make -j". This will
 277 speed up the build, in most cases.
 278
 279 If you want to install it in a convenient location, just copy the "deark" file.
 280 For example:
 281
 282     $ sudo cp deark /usr/local/bin/
 283
 284 or
 285
 286     $ sudo make install
 287
 288 For Microsoft Windows, the project files in proj/vs2022 should work
 289 for sufficiently new versions of Microsoft Visual Studio. Alternatively, you
 290 can use Cygwin.
 291
 292 When doing a Windows (Win32 API) build, the Makefile is not intended to be used
 293 directly (without configuration). For MinGW and similar compilers, it is
 294 recommended to use a script (e.g. scripts/example-build-mingw.sh) or local*.mk
 295 files.
 296
 297 Note that for Windows, Deark unconditionally uses Unicode API functions like
 298 fputws() and GetFileAttributesW(), so it's not possible to do a "non-Unicode"
 299 build.
 300
 301 ## Developer notes ##
 302
 303 ### Coding style and C language ###
 304
 305 The C language version that Deark uses is (semi-officially) C99, with a few
 306 common extensions needed for things like querying a file's size.
 307
 308 It should be mostly compatible with most modern compilers, without special
 309 options.
 310
 311 ### Hard-coded CRCs ###
 312
 313 Some source code files contain hard-coded CRC-32 hashes, to help identify
 314 certain things. The original data associated with the hashes may be found
 315 in the separate
 316 [Deark-extras project](https://github.com/jsummers/deark-extras).
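
The general technique looks something like the Python sketch below. It assumes the standard CRC-32 used by zlib and PNG. The table and function are hypothetical, and the single entry is the widely published CRC-32 check value for the ASCII string "123456789", not a value taken from Deark.

```python
import zlib

# Hypothetical lookup table: hard-coded CRC-32 value -> description.
KNOWN_CRCS = {
    0xCBF43926: "CRC-32 check string",
}


def identify(data: bytes):
    """Return a description if the data's CRC-32 is in the table, else None."""
    return KNOWN_CRCS.get(zlib.crc32(data))
```

The table stores only the hashes, so identifying an item does not require shipping the original data with the program.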
 317
 318 ### Other notes ###
 319
Maximizing execution speed is not a goal of Deark. This is a deliberate
concession. The focus is on making the code easy to write and easy to read, and
on minimizing the chance of bugs. (Serious performance issues will still be
addressed, though.)
 324
 325 The Deark executable size is a consideration. A suggested feature might be
 326 declined if it would involve too much bloat.
 327
 328 Consistent behavior is a goal. Ideally, the files produced by a given version
 329 of Deark should depend only on the contents of the input file. They should not
 330 change if you update your computer, or change your time zone settings, or
 331 whatever. There is an issue when using the -zip option, though. Consider using
 332 "-opt archive:timestamp" to get reproducible results.
 333
 334 Deark is intended to be compiled as a 64-bit application. It uses 64-bit
 335 integers pervasively. It can be compiled as a 32-bit app (provided the compiler
 336 has a 64-bit integer type), and it will operate correctly. But it will be very
 337 inefficient, and might run at less than half the 64-bit speed.
 338
 339 The Deark source code is structured like a library, but it's not intended to be
 340 used as such. The error handling methods, and error messages, are not really
 341 suitable for use in a library.
 342
 343 A regression test suite does exist for Deark, but is not available publicly at
 344 this time.