Lib/pickletools.py

   1 '''"Executable documentation" for the pickle module.
   2
   3 Extensive comments about the pickle protocols and pickle-machine opcodes
   4 can be found here.  Some functions meant for external use:
   5
   6 genops(pickle)
   7    Generate all the opcodes in a pickle, as (opcode, arg, position) triples.
   8
   9 dis(pickle, out=None, indentlevel=4)
  10    Print a symbolic disassembly of a pickle.
  11 '''
  12
  13 # Other ideas:
  14 #
  15 # - A pickle verifier:  read a pickle and check it exhaustively for
  16 #   well-formedness.  dis() does a lot of this already.
  17 #
  18 # - A protocol identifier:  examine a pickle and return its protocol number
  19 #   (== the highest .proto attr value among all the opcodes in the pickle).
  20 #   dis() already prints this info at the end.
  21 #
  22 # - A pickle optimizer:  for example, tuple-building code is sometimes more
  23 #   elaborate than necessary, catering for the possibility that the tuple
  24 #   is recursive.  Or lots of times a PUT is generated that's never accessed
  25 #   by a later GET.
  26
  27
  28 """
  29 "A pickle" is a program for a virtual pickle machine (PM, but more accurately
  30 called an unpickling machine).  It's a sequence of opcodes, interpreted by the
  31 PM, building an arbitrarily complex Python object.
  32
  33 For the most part, the PM is very simple:  there are no looping, testing, or
  34 conditional instructions, no arithmetic and no function calls.  Opcodes are
  35 executed once each, from first to last, until a STOP opcode is reached.
  36
  37 The PM has two data areas, "the stack" and "the memo".
  38
  39 Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
  40 integer object on the stack, whose value is gotten from a decimal string
  41 literal immediately following the INT opcode in the pickle bytestream.  Other
  42 opcodes take Python objects off the stack.  The result of unpickling is
  43 whatever object is left on the stack when the final STOP opcode is executed.
  44
  45 The memo is simply an array of objects, or it can be implemented as a dict
  46 mapping little integers to objects.  The memo serves as the PM's "long term
  47 memory", and the little integers indexing the memo are akin to variable
  48 names.  Some opcodes pop a stack object into the memo at a given index,
  49 and others push a memo object at a given index onto the stack again.
  50
  51 At heart, that's all the PM has.  Subtleties arise for these reasons:
  52
  53 + Object identity.  Objects can be arbitrarily complex, and subobjects
  54   may be shared (for example, the list [a, a] refers to the same object a
  55   twice).  It can be vital that unpickling recreate an isomorphic object
  56   graph, faithfully reproducing sharing.
  57
  58 + Recursive objects.  For example, after "L = []; L.append(L)", L is a
  59   list, and L[0] is the same list.  This is related to the object identity
  60   point, and some sequences of pickle opcodes are subtle in order to
  61   get the right result in all cases.
  62
  63 + Things pickle doesn't know everything about.  Examples of things pickle
  64   does know everything about are Python's builtin scalar and container
  65   types, like ints and tuples.  They generally have opcodes dedicated to
  66   them.  For things like module references and instances of user-defined
  67   classes, pickle's knowledge is limited.  Historically, many enhancements
  68   have been made to the pickle protocol in order to do a better (faster,
  69   and/or more compact) job on those.
  70
  71 + Backward compatibility and micro-optimization.  As explained below,
  72   pickle opcodes never go away, not even when better ways to do a thing
  73   get invented.  The repertoire of the PM just keeps growing over time.
  74   For example, protocol 0 had two opcodes for building Python integers (INT
  75   and LONG), protocol 1 added three more for more-efficient pickling of short
  76   integers, and protocol 2 added two more for more-efficient pickling of
  77   long integers (before protocol 2, the only ways to pickle a Python long
  78   took time quadratic in the number of digits, for both pickling and
  79   unpickling).  "Opcode bloat" isn't so much a subtlety as a source of
  80   wearying complication.
  81
  82
  83 Pickle protocols:
  84
  85 For compatibility, the meaning of a pickle opcode never changes.  Instead new
  86 pickle opcodes get added, and each version's unpickler can handle all the
  87 pickle opcodes in all protocol versions to date.  So old pickles continue to
  88 be readable forever.  The pickler can generally be told to restrict itself to
  89 the subset of opcodes available under previous protocol versions too, so that
  90 users can create pickles under the current version readable by older
  91 versions.  However, a pickle does not contain its version number embedded
  92 within it.  If an older unpickler tries to read a pickle using a later
  93 protocol, the result is most likely an exception due to seeing an unknown (in
  94 the older unpickler) opcode.
  95
  96 The original pickle used what's now called "protocol 0", and what was called
  97 "text mode" before Python 2.3.  The entire pickle bytestream is made up of
  98 printable 7-bit ASCII characters, plus the newline character, in protocol 0.
  99 That's why it was called text mode.  Protocol 0 is small and elegant, but
 100 sometimes painfully inefficient.
 101
 102 The second major set of additions is now called "protocol 1", and was called
 103 "binary mode" before Python 2.3.  This added many opcodes with arguments
 104 consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
 105 bytes.  Binary mode pickles can be substantially smaller than equivalent
 106 text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
 107 int as 4 bytes following the opcode, which is cheaper to unpickle than the
 108 (perhaps) 11-character decimal string attached to INT.  Protocol 1 also added
 109 a number of opcodes that operate on many stack elements at once (like APPENDS
 110 and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
 111
 112 The third major set of additions came in Python 2.3, and is called "protocol
 113 2".  This added:
 114
 115 - A better way to pickle instances of new-style classes (NEWOBJ).
 116
 117 - A way for a pickle to identify its protocol (PROTO).
 118
 119 - Time- and space- efficient pickling of long ints (LONG{1,4}).
 120
 121 - Shortcuts for small tuples (TUPLE{1,2,3}}.
 122
 123 - Dedicated opcodes for bools (NEWTRUE, NEWFALSE).
 124
 125 - The "extension registry", a vector of popular objects that can be pushed
 126   efficiently by index (EXT{1,2,4}).  This is akin to the memo and GET, but
 127   the registry contents are predefined (there's nothing akin to the memo's
 128   PUT).
 129
 130 Another independent change with Python 2.3 is the abandonment of any
 131 pretense that it might be safe to load pickles received from untrusted
 132 parties -- no sufficient security analysis has been done to guarantee
 133 this and there isn't a use case that warrants the expense of such an
 134 analysis.
 135
 136 To this end, all tests for __safe_for_unpickling__ or for
 137 copy_reg.safe_constructors are removed from the unpickling code.
 138 References to these variables in the descriptions below are to be seen
 139 as describing unpickling in Python 2.2 and before.
 140 """
 141
 142 # Meta-rule:  Descriptions are stored in instances of descriptor objects,
 143 # with plain constructors.  No meta-language is defined from which
 144 # descriptors could be constructed.  If you want, e.g., XML, write a little
 145 # program to generate XML from the objects.
 146
 147 ##############################################################################
 148 # Some pickle opcodes have an argument, following the opcode in the
 149 # bytestream.  An argument is of a specific type, described by an instance
 150 # of ArgumentDescriptor.  These are not to be confused with arguments taken
 151 # off the stack -- ArgumentDescriptor applies only to arguments embedded in
 152 # the opcode stream, immediately following an opcode.
 153
 154 # Represents the number of bytes consumed by an argument delimited by the
 155 # next newline character.
 156 UP_TO_NEWLINE = -1
 157
 158 # Represents the number of bytes consumed by a two-argument opcode where
 159 # the first argument gives the number of bytes in the second argument.
 160 TAKEN_FROM_ARGUMENT1 = -2   # num bytes is 1-byte unsigned int
 161 TAKEN_FROM_ARGUMENT4 = -3   # num bytes is 4-byte signed little-endian int
 162
 163 class ArgumentDescriptor(object):
 164     __slots__ = (
 165         # name of descriptor record, also a module global name; a string
 166         'name',
 167
 168         # length of argument, in bytes; an int; UP_TO_NEWLINE and
 169         # TAKEN_FROM_ARGUMENT{1,4} are negative values for variable-length
 170         # cases
 171         'n',
 172
 173         # a function taking a file-like object, reading this kind of argument
 174         # from the object at the current position, advancing the current
 175         # position by n bytes, and returning the value of the argument
 176         'reader',
 177
 178         # human-readable docs for this arg descriptor; a string
 179         'doc',
 180     )
 181
 182     def __init__(self, name, n, reader, doc):
 183         assert isinstance(name, str)
 184         self.name = name
 185
 186         assert isinstance(n, int) and (n >= 0 or
 187                                        n in (UP_TO_NEWLINE,
 188                                              TAKEN_FROM_ARGUMENT1,
 189                                              TAKEN_FROM_ARGUMENT4))
 190         self.n = n
 191
 192         self.reader = reader
 193
 194         assert isinstance(doc, str)
 195         self.doc = doc
 196
 197 from struct import unpack as _unpack
 198
 199 def read_uint1(f):
 200     r"""
 201     >>> import StringIO
 202     >>> read_uint1(StringIO.StringIO('\xff'))
 203     255
 204     """
 205
 206     data = f.read(1)
 207     if data:
 208         return ord(data)
 209     raise ValueError("not enough data in stream to read uint1")
 210
 211 uint1 = ArgumentDescriptor(
 212             name='uint1',
 213             n=1,
 214             reader=read_uint1,
 215             doc="One-byte unsigned integer.")
 216
 217
 218 def read_uint2(f):
 219     r"""
 220     >>> import StringIO
 221     >>> read_uint2(StringIO.StringIO('\xff\x00'))
 222     255
 223     >>> read_uint2(StringIO.StringIO('\xff\xff'))
 224     65535
 225     """
 226
 227     data = f.read(2)
 228     if len(data) == 2:
 229         return _unpack("<H", data)[0]
 230     raise ValueError("not enough data in stream to read uint2")
 231
 232 uint2 = ArgumentDescriptor(
 233             name='uint2',
 234             n=2,
 235             reader=read_uint2,
 236             doc="Two-byte unsigned integer, little-endian.")
 237
 238
 239 def read_int4(f):
 240     r"""
 241     >>> import StringIO
 242     >>> read_int4(StringIO.StringIO('\xff\x00\x00\x00'))
 243     255
 244     >>> read_int4(StringIO.StringIO('\x00\x00\x00\x80')) == -(2**31)
 245     True
 246     """
 247
 248     data = f.read(4)
 249     if len(data) == 4:
 250         return _unpack("<i", data)[0]
 251     raise ValueError("not enough data in stream to read int4")
 252
 253 int4 = ArgumentDescriptor(
 254            name='int4',
 255            n=4,
 256            reader=read_int4,
 257            doc="Four-byte signed integer, little-endian, 2's complement.")
 258
 259
 260 def read_stringnl(f, decode=True, stripquotes=True):
 261     r"""
 262     >>> import StringIO
 263     >>> read_stringnl(StringIO.StringIO("'abcd'\nefg\n"))
 264     'abcd'
 265
 266     >>> read_stringnl(StringIO.StringIO("\n"))
 267     Traceback (most recent call last):
 268     ...
 269     ValueError: no string quotes around ''
 270
 271     >>> read_stringnl(StringIO.StringIO("\n"), stripquotes=False)
 272     ''
 273
 274     >>> read_stringnl(StringIO.StringIO("''\n"))
 275     ''
 276
 277     >>> read_stringnl(StringIO.StringIO('"abcd"'))
 278     Traceback (most recent call last):
 279     ...
 280     ValueError: no newline found when trying to read stringnl
 281
 282     Embedded escapes are undone in the result.
 283     >>> read_stringnl(StringIO.StringIO(r"'a\n\\b\x00c\td'" + "\n'e'"))
 284     'a\n\\b\x00c\td'
 285     """
 286
 287     data = f.readline()
 288     if not data.endswith('\n'):
 289         raise ValueError("no newline found when trying to read stringnl")
 290     data = data[:-1]    # lose the newline
 291
 292     if stripquotes:
 293         for q in "'\"":
 294             if data.startswith(q):
 295                 if not data.endswith(q):
 296                     raise ValueError("strinq quote %r not found at both "
 297                                      "ends of %r" % (q, data))
 298                 data = data[1:-1]
 299                 break
 300         else:
 301             raise ValueError("no string quotes around %r" % data)
 302
 303     # I'm not sure when 'string_escape' was added to the std codecs; it's
 304     # crazy not to use it if it's there.
 305     if decode:
 306         data = data.decode('string_escape')
 307     return data
 308
 309 stringnl = ArgumentDescriptor(
 310                name='stringnl',
 311                n=UP_TO_NEWLINE,
 312                reader=read_stringnl,
 313                doc="""A newline-terminated string.
 314
 315                    This is a repr-style string, with embedded escapes, and
 316                    bracketing quotes.
 317                    """)
 318
 319 def read_stringnl_noescape(f):
 320     return read_stringnl(f, decode=False, stripquotes=False)
 321
 322 stringnl_noescape = ArgumentDescriptor(
 323                         name='stringnl_noescape',
 324                         n=UP_TO_NEWLINE,
 325                         reader=read_stringnl_noescape,
 326                         doc="""A newline-terminated string.
 327
 328                         This is a str-style string, without embedded escapes,
 329                         or bracketing quotes.  It should consist solely of
 330                         printable ASCII characters.
 331                         """)
 332
 333 def read_stringnl_noescape_pair(f):
 334     r"""
 335     >>> import StringIO
 336     >>> read_stringnl_noescape_pair(StringIO.StringIO("Queue\nEmpty\njunk"))
 337     'Queue Empty'
 338     """
 339
 340     return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))
 341
 342 stringnl_noescape_pair = ArgumentDescriptor(
 343                              name='stringnl_noescape_pair',
 344                              n=UP_TO_NEWLINE,
 345                              reader=read_stringnl_noescape_pair,
 346                              doc="""A pair of newline-terminated strings.
 347
 348                              These are str-style strings, without embedded
 349                              escapes, or bracketing quotes.  They should
 350                              consist solely of printable ASCII characters.
 351                              The pair is returned as a single string, with
 352                              a single blank separating the two strings.
 353                              """)
 354
 355 def read_string4(f):
 356     r"""
 357     >>> import StringIO
 358     >>> read_string4(StringIO.StringIO("\x00\x00\x00\x00abc"))
 359     ''
 360     >>> read_string4(StringIO.StringIO("\x03\x00\x00\x00abcdef"))
 361     'abc'
 362     >>> read_string4(StringIO.StringIO("\x00\x00\x00\x03abcdef"))
 363     Traceback (most recent call last):
 364     ...
 365     ValueError: expected 50331648 bytes in a string4, but only 6 remain
 366     """
 367
 368     n = read_int4(f)
 369     if n < 0:
 370         raise ValueError("string4 byte count < 0: %d" % n)
 371     data = f.read(n)
 372     if len(data) == n:
 373         return data
 374     raise ValueError("expected %d bytes in a string4, but only %d remain" %
 375                      (n, len(data)))
 376
 377 string4 = ArgumentDescriptor(
 378               name="string4",
 379               n=TAKEN_FROM_ARGUMENT4,
 380               reader=read_string4,
 381               doc="""A counted string.
 382
 383               The first argument is a 4-byte little-endian signed int giving
 384               the number of bytes in the string, and the second argument is
 385               that many bytes.
 386               """)
 387
 388
 389 def read_string1(f):
 390     r"""
 391     >>> import StringIO
 392     >>> read_string1(StringIO.StringIO("\x00"))
 393     ''
 394     >>> read_string1(StringIO.StringIO("\x03abcdef"))
 395     'abc'
 396     """
 397
 398     n = read_uint1(f)
 399     assert n >= 0
 400     data = f.read(n)
 401     if len(data) == n:
 402         return data
 403     raise ValueError("expected %d bytes in a string1, but only %d remain" %
 404                      (n, len(data)))
 405
 406 string1 = ArgumentDescriptor(
 407               name="string1",
 408               n=TAKEN_FROM_ARGUMENT1,
 409               reader=read_string1,
 410               doc="""A counted string.
 411
 412               The first argument is a 1-byte unsigned int giving the number
 413               of bytes in the string, and the second argument is that many
 414               bytes.
 415               """)
 416
 417
 418 def read_unicodestringnl(f):
 419     r"""
 420     >>> import StringIO
 421     >>> read_unicodestringnl(StringIO.StringIO("abc\uabcd\njunk"))
 422     u'abc\uabcd'
 423     """
 424
 425     data = f.readline()
 426     if not data.endswith('\n'):
 427         raise ValueError("no newline found when trying to read "
 428                          "unicodestringnl")
 429     data = data[:-1]    # lose the newline
 430     return unicode(data, 'raw-unicode-escape')
 431
 432 unicodestringnl = ArgumentDescriptor(
 433                       name='unicodestringnl',
 434                       n=UP_TO_NEWLINE,
 435                       reader=read_unicodestringnl,
 436                       doc="""A newline-terminated Unicode string.
 437
 438                       This is raw-unicode-escape encoded, so consists of
 439                       printable ASCII characters, and may contain embedded
 440                       escape sequences.
 441                       """)
 442
 443 def read_unicodestring4(f):
 444     r"""
 445     >>> import StringIO
 446     >>> s = u'abcd\uabcd'
 447     >>> enc = s.encode('utf-8')
 448     >>> enc
 449     'abcd\xea\xaf\x8d'
 450     >>> n = chr(len(enc)) + chr(0) * 3  # little-endian 4-byte length
 451     >>> t = read_unicodestring4(StringIO.StringIO(n + enc + 'junk'))
 452     >>> s == t
 453     True
 454
 455     >>> read_unicodestring4(StringIO.StringIO(n + enc[:-1]))
 456     Traceback (most recent call last):
 457     ...
 458     ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
 459     """
 460
 461     n = read_int4(f)
 462     if n < 0:
 463         raise ValueError("unicodestring4 byte count < 0: %d" % n)
 464     data = f.read(n)
 465     if len(data) == n:
 466         return unicode(data, 'utf-8')
 467     raise ValueError("expected %d bytes in a unicodestring4, but only %d "
 468                      "remain" % (n, len(data)))
 469
 470 unicodestring4 = ArgumentDescriptor(
 471                     name="unicodestring4",
 472                     n=TAKEN_FROM_ARGUMENT4,
 473                     reader=read_unicodestring4,
 474                     doc="""A counted Unicode string.
 475
 476                     The first argument is a 4-byte little-endian signed int
 477                     giving the number of bytes in the string, and the second
 478                     argument-- the UTF-8 encoding of the Unicode string --
 479                     contains that many bytes.
 480                     """)
 481
 482
 483 def read_decimalnl_short(f):
 484     r"""
 485     >>> import StringIO
 486     >>> read_decimalnl_short(StringIO.StringIO("1234\n56"))
 487     1234
 488
 489     >>> read_decimalnl_short(StringIO.StringIO("1234L\n56"))
 490     Traceback (most recent call last):
 491     ...
 492     ValueError: trailing 'L' not allowed in '1234L'
 493     """
 494
 495     s = read_stringnl(f, decode=False, stripquotes=False)
 496     if s.endswith("L"):
 497         raise ValueError("trailing 'L' not allowed in %r" % s)
 498
 499     # It's not necessarily true that the result fits in a Python short int:
 500     # the pickle may have been written on a 64-bit box.  There's also a hack
 501     # for True and False here.
 502     if s == "00":
 503         return False
 504     elif s == "01":
 505         return True
 506
 507     try:
 508         return int(s)
 509     except OverflowError:
 510         return long(s)
 511
 512 def read_decimalnl_long(f):
 513     r"""
 514     >>> import StringIO
 515
 516     >>> read_decimalnl_long(StringIO.StringIO("1234\n56"))
 517     Traceback (most recent call last):
 518     ...
 519     ValueError: trailing 'L' required in '1234'
 520
 521     Someday the trailing 'L' will probably go away from this output.
 522
 523     >>> read_decimalnl_long(StringIO.StringIO("1234L\n56"))
 524     1234L
 525
 526     >>> read_decimalnl_long(StringIO.StringIO("123456789012345678901234L\n6"))
 527     123456789012345678901234L
 528     """
 529
 530     s = read_stringnl(f, decode=False, stripquotes=False)
 531     if not s.endswith("L"):
 532         raise ValueError("trailing 'L' required in %r" % s)
 533     return long(s)
 534
 535
 536 decimalnl_short = ArgumentDescriptor(
 537                       name='decimalnl_short',
 538                       n=UP_TO_NEWLINE,
 539                       reader=read_decimalnl_short,
 540                       doc="""A newline-terminated decimal integer literal.
 541
 542                           This never has a trailing 'L', and the integer fit
 543                           in a short Python int on the box where the pickle
 544                           was written -- but there's no guarantee it will fit
 545                           in a short Python int on the box where the pickle
 546                           is read.
 547                           """)
 548
 549 decimalnl_long = ArgumentDescriptor(
 550                      name='decimalnl_long',
 551                      n=UP_TO_NEWLINE,
 552                      reader=read_decimalnl_long,
 553                      doc="""A newline-terminated decimal integer literal.
 554
 555                          This has a trailing 'L', and can represent integers
 556                          of any size.
 557                          """)
 558
 559
 560 def read_floatnl(f):
 561     r"""
 562     >>> import StringIO
 563     >>> read_floatnl(StringIO.StringIO("-1.25\n6"))
 564     -1.25
 565     """
 566     s = read_stringnl(f, decode=False, stripquotes=False)
 567     return float(s)
 568
 569 floatnl = ArgumentDescriptor(
 570               name='floatnl',
 571               n=UP_TO_NEWLINE,
 572               reader=read_floatnl,
 573               doc="""A newline-terminated decimal floating literal.
 574
 575               In general this requires 17 significant digits for roundtrip
 576               identity, and pickling then unpickling infinities, NaNs, and
 577               minus zero doesn't work across boxes, or on some boxes even
 578               on itself (e.g., Windows can't read the strings it produces
 579               for infinities or NaNs).
 580               """)
 581
 582 def read_float8(f):
 583     r"""
 584     >>> import StringIO, struct
 585     >>> raw = struct.pack(">d", -1.25)
 586     >>> raw
 587     '\xbf\xf4\x00\x00\x00\x00\x00\x00'
 588     >>> read_float8(StringIO.StringIO(raw + "\n"))
 589     -1.25
 590     """
 591
 592     data = f.read(8)
 593     if len(data) == 8:
 594         return _unpack(">d", data)[0]
 595     raise ValueError("not enough data in stream to read float8")
 596
 597
 598 float8 = ArgumentDescriptor(
 599              name='float8',
 600              n=8,
 601              reader=read_float8,
 602              doc="""An 8-byte binary representation of a float, big-endian.
 603
 604              The format is unique to Python, and shared with the struct
 605              module (format string '>d') "in theory" (the struct and cPickle
 606              implementations don't share the code -- they should).  It's
 607              strongly related to the IEEE-754 double format, and, in normal
 608              cases, is in fact identical to the big-endian 754 double format.
 609              On other boxes the dynamic range is limited to that of a 754
 610              double, and "add a half and chop" rounding is used to reduce
 611              the precision to 53 bits.  However, even on a 754 box,
 612              infinities, NaNs, and minus zero may not be handled correctly
 613              (may not survive roundtrip pickling intact).
 614              """)
 615
 616 # Protocol 2 formats
 617
 618 from pickle import decode_long
 619
 620 def read_long1(f):
 621     r"""
 622     >>> import StringIO
 623     >>> read_long1(StringIO.StringIO("\x00"))
 624     0L
 625     >>> read_long1(StringIO.StringIO("\x02\xff\x00"))
 626     255L
 627     >>> read_long1(StringIO.StringIO("\x02\xff\x7f"))
 628     32767L
 629     >>> read_long1(StringIO.StringIO("\x02\x00\xff"))
 630     -256L
 631     >>> read_long1(StringIO.StringIO("\x02\x00\x80"))
 632     -32768L
 633     """
 634
 635     n = read_uint1(f)
 636     data = f.read(n)
 637     if len(data) != n:
 638         raise ValueError("not enough data in stream to read long1")
 639     return decode_long(data)
 640
 641 long1 = ArgumentDescriptor(
 642     name="long1",
 643     n=TAKEN_FROM_ARGUMENT1,
 644     reader=read_long1,
 645     doc="""A binary long, little-endian, using 1-byte size.
 646
 647     This first reads one byte as an unsigned size, then reads that
 648     many bytes and interprets them as a little-endian 2's-complement long.
 649     If the size is 0, that's taken as a shortcut for the long 0L.
 650     """)
 651
 652 def read_long4(f):
 653     r"""
 654     >>> import StringIO
 655     >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x00"))
 656     255L
 657     >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\xff\x7f"))
 658     32767L
 659     >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\xff"))
 660     -256L
 661     >>> read_long4(StringIO.StringIO("\x02\x00\x00\x00\x00\x80"))
 662     -32768L
 663     >>> read_long1(StringIO.StringIO("\x00\x00\x00\x00"))
 664     0L
 665     """
 666
 667     n = read_int4(f)
 668     if n < 0:
 669         raise ValueError("long4 byte count < 0: %d" % n)
 670     data = f.read(n)
 671     if len(data) != n:
 672         raise ValueError("not enough data in stream to read long4")
 673     return decode_long(data)
 674
 675 long4 = ArgumentDescriptor(
 676     name="long4",
 677     n=TAKEN_FROM_ARGUMENT4,
 678     reader=read_long4,
 679     doc="""A binary representation of a long, little-endian.
 680
 681     This first reads four bytes as a signed size (but requires the
 682     size to be >= 0), then reads that many bytes and interprets them
 683     as a little-endian 2's-complement long.  If the size is 0, that's taken
 684     as a shortcut for the long 0L, although LONG1 should really be used
 685     then instead (and in any case where # of bytes < 256).
 686     """)
 687
 688
 689 ##############################################################################
 690 # Object descriptors.  The stack used by the pickle machine holds objects,
 691 # and in the stack_before and stack_after attributes of OpcodeInfo
 692 # descriptors we need names to describe the various types of objects that can
 693 # appear on the stack.
 694
 695 class StackObject(object):
 696     __slots__ = (
 697         # name of descriptor record, for info only
 698         'name',
 699
 700         # type of object, or tuple of type objects (meaning the object can
 701         # be of any type in the tuple)
 702         'obtype',
 703
 704         # human-readable docs for this kind of stack object; a string
 705         'doc',
 706     )
 707
 708     def __init__(self, name, obtype, doc):
 709         assert isinstance(name, str)
 710         self.name = name
 711
 712         assert isinstance(obtype, type) or isinstance(obtype, tuple)
 713         if isinstance(obtype, tuple):
 714             for contained in obtype:
 715                 assert isinstance(contained, type)
 716         self.obtype = obtype
 717
 718         assert isinstance(doc, str)
 719         self.doc = doc
 720
 721     def __repr__(self):
 722         return self.name
 723
 724
 725 pyint = StackObject(
 726             name='int',
 727             obtype=int,
 728             doc="A short (as opposed to long) Python integer object.")
 729
 730 pylong = StackObject(
 731              name='long',
 732              obtype=long,
 733              doc="A long (as opposed to short) Python integer object.")
 734
 735 pyinteger_or_bool = StackObject(
 736                         name='int_or_bool',
 737                         obtype=(int, long, bool),
 738                         doc="A Python integer object (short or long), or "
 739                             "a Python bool.")
 740
 741 pybool = StackObject(
 742              name='bool',
 743              obtype=(bool,),
 744              doc="A Python bool object.")
 745
 746 pyfloat = StackObject(
 747               name='float',
 748               obtype=float,
 749               doc="A Python float object.")
 750
 751 pystring = StackObject(
 752                name='str',
 753                obtype=str,
 754                doc="A Python string object.")
 755
 756 pyunicode = StackObject(
 757                 name='unicode',
 758                 obtype=unicode,
 759                 doc="A Python Unicode string object.")
 760
 761 pynone = StackObject(
 762              name="None",
 763              obtype=type(None),
 764              doc="The Python None object.")
 765
 766 pytuple = StackObject(
 767               name="tuple",
 768               obtype=tuple,
 769               doc="A Python tuple object.")
 770
 771 pylist = StackObject(
 772              name="list",
 773              obtype=list,
 774              doc="A Python list object.")
 775
 776 pydict = StackObject(
 777              name="dict",
 778              obtype=dict,
 779              doc="A Python dict object.")
 780
 781 anyobject = StackObject(
 782                 name='any',
 783                 obtype=object,
 784                 doc="Any kind of object whatsoever.")
 785
 786 markobject = StackObject(
 787                  name="mark",
 788                  obtype=StackObject,
 789                  doc="""'The mark' is a unique object.
 790
 791                  Opcodes that operate on a variable number of objects
 792                  generally don't embed the count of objects in the opcode,
 793                  or pull it off the stack.  Instead the MARK opcode is used
 794                  to push a special marker object on the stack, and then
 795                  some other opcodes grab all the objects from the top of
 796                  the stack down to (but not including) the topmost marker
 797                  object.
 798                  """)
 799
 800 stackslice = StackObject(
 801                  name="stackslice",
 802                  obtype=StackObject,
 803                  doc="""An object representing a contiguous slice of the stack.
 804
 805                  This is used in conjuction with markobject, to represent all
 806                  of the stack following the topmost markobject.  For example,
 807                  the POP_MARK opcode changes the stack from
 808
 809                      [..., markobject, stackslice]
 810                  to
 811                      [...]
 812
 813                  No matter how many object are on the stack after the topmost
 814                  markobject, POP_MARK gets rid of all of them (including the
 815                  topmost markobject too).
 816                  """)
 817
 818 ##############################################################################
 819 # Descriptors for pickle opcodes.
 820
 821 class OpcodeInfo(object):
 822
 823     __slots__ = (
 824         # symbolic name of opcode; a string
 825         'name',
 826
 827         # the code used in a bytestream to represent the opcode; a
 828         # one-character string
 829         'code',
 830
 831         # If the opcode has an argument embedded in the byte string, an
 832         # instance of ArgumentDescriptor specifying its type.  Note that
 833         # arg.reader(s) can be used to read and decode the argument from
 834         # the bytestream s, and arg.doc documents the format of the raw
 835         # argument bytes.  If the opcode doesn't have an argument embedded
 836         # in the bytestream, arg should be None.
 837         'arg',
 838
 839         # what the stack looks like before this opcode runs; a list
 840         'stack_before',
 841
 842         # what the stack looks like after this opcode runs; a list
 843         'stack_after',
 844
 845         # the protocol number in which this opcode was introduced; an int
 846         'proto',
 847
 848         # human-readable docs for this opcode; a string
 849         'doc',
 850     )
 851
 852     def __init__(self, name, code, arg,
 853                  stack_before, stack_after, proto, doc):
 854         assert isinstance(name, str)
 855         self.name = name
 856
 857         assert isinstance(code, str)
 858         assert len(code) == 1
 859         self.code = code
 860
 861         assert arg is None or isinstance(arg, ArgumentDescriptor)
 862         self.arg = arg
 863
 864         assert isinstance(stack_before, list)
 865         for x in stack_before:
 866             assert isinstance(x, StackObject)
 867         self.stack_before = stack_before
 868
 869         assert isinstance(stack_after, list)
 870         for x in stack_after:
 871             assert isinstance(x, StackObject)
 872         self.stack_after = stack_after
 873
 874         assert isinstance(proto, int) and 0 <= proto <= 2
 875         self.proto = proto
 876
 877         assert isinstance(doc, str)
 878         self.doc = doc
 879
 880 I = OpcodeInfo
 881 opcodes = [
 882
 883     # Ways to spell integers.
 884
 885     I(name='INT',
 886       code='I',
 887       arg=decimalnl_short,
 888       stack_before=[],
 889       stack_after=[pyinteger_or_bool],
 890       proto=0,
 891       doc="""Push an integer or bool.
 892
 893       The argument is a newline-terminated decimal literal string.
 894
 895       The intent may have been that this always fit in a short Python int,
 896       but INT can be generated in pickles written on a 64-bit box that
 897       require a Python long on a 32-bit box.  The difference between this
 898       and LONG then is that INT skips a trailing 'L', and produces a short
 899       int whenever possible.
 900
 901       Another difference is due to that, when bool was introduced as a
 902       distinct type in 2.3, builtin names True and False were also added to
 903       2.2.2, mapping to ints 1 and 0.  For compatibility in both directions,
 904       True gets pickled as INT + "I01\\n", and False as INT + "I00\\n".
 905       Leading zeroes are never produced for a genuine integer.  The 2.3
 906       (and later) unpicklers special-case these and return bool instead;
 907       earlier unpicklers ignore the leading "0" and return the int.
 908       """),
 909
 910     I(name='BININT',
 911       code='J',
 912       arg=int4,
 913       stack_before=[],
 914       stack_after=[pyint],
 915       proto=1,
 916       doc="""Push a four-byte signed integer.
 917
 918       This handles the full range of Python (short) integers on a 32-bit
 919       box, directly as binary bytes (1 for the opcode and 4 for the integer).
 920       If the integer is non-negative and fits in 1 or 2 bytes, pickling via
 921       BININT1 or BININT2 saves space.
 922       """),
 923
 924     I(name='BININT1',
 925       code='K',
 926       arg=uint1,
 927       stack_before=[],
 928       stack_after=[pyint],
 929       proto=1,
 930       doc="""Push a one-byte unsigned integer.
 931
 932       This is a space optimization for pickling very small non-negative ints,
 933       in range(256).
 934       """),
 935
 936     I(name='BININT2',
 937       code='M',
 938       arg=uint2,
 939       stack_before=[],
 940       stack_after=[pyint],
 941       proto=1,
 942       doc="""Push a two-byte unsigned integer.
 943
 944       This is a space optimization for pickling small positive ints, in
 945       range(256, 2**16).  Integers in range(256) can also be pickled via
 946       BININT2, but BININT1 instead saves a byte.
 947       """),
 948
 949     I(name='LONG',
 950       code='L',
 951       arg=decimalnl_long,
 952       stack_before=[],
 953       stack_after=[pylong],
 954       proto=0,
 955       doc="""Push a long integer.
 956
 957       The same as INT, except that the literal ends with 'L', and always
 958       unpickles to a Python long.  There doesn't seem a real purpose to the
 959       trailing 'L'.
 960
 961       Note that LONG takes time quadratic in the number of digits when
 962       unpickling (this is simply due to the nature of decimal->binary
 963       conversion).  Proto 2 added linear-time (in C; still quadratic-time
 964       in Python) LONG1 and LONG4 opcodes.
 965       """),
 966
 967     I(name="LONG1",
 968       code='\x8a',
 969       arg=long1,
 970       stack_before=[],
 971       stack_after=[pylong],
 972       proto=2,
 973       doc="""Long integer using one-byte length.
 974
 975       A more efficient encoding of a Python long; the long1 encoding
 976       says it all."""),
 977
 978     I(name="LONG4",
 979       code='\x8b',
 980       arg=long4,
 981       stack_before=[],
 982       stack_after=[pylong],
 983       proto=2,
 984       doc="""Long integer using found-byte length.
 985
 986       A more efficient encoding of a Python long; the long4 encoding
 987       says it all."""),
 988
 989     # Ways to spell strings (8-bit, not Unicode).
 990
 991     I(name='STRING',
 992       code='S',
 993       arg=stringnl,
 994       stack_before=[],
 995       stack_after=[pystring],
 996       proto=0,
 997       doc="""Push a Python string object.
 998
 999       The argument is a repr-style string, with bracketing quote characters,
1000       and perhaps embedded escapes.  The argument extends until the next
1001       newline character.
1002       """),
1003
1004     I(name='BINSTRING',
1005       code='T',
1006       arg=string4,
1007       stack_before=[],
1008       stack_after=[pystring],
1009       proto=1,
1010       doc="""Push a Python string object.
1011
1012       There are two arguments:  the first is a 4-byte little-endian signed int
1013       giving the number of bytes in the string, and the second is that many
1014       bytes, which are taken literally as the string content.
1015       """),
1016
1017     I(name='SHORT_BINSTRING',
1018       code='U',
1019       arg=string1,
1020       stack_before=[],
1021       stack_after=[pystring],
1022       proto=1,
1023       doc="""Push a Python string object.
1024
1025       There are two arguments:  the first is a 1-byte unsigned int giving
1026       the number of bytes in the string, and the second is that many bytes,
1027       which are taken literally as the string content.
1028       """),
1029
1030     # Ways to spell None.
1031
1032     I(name='NONE',
1033       code='N',
1034       arg=None,
1035       stack_before=[],
1036       stack_after=[pynone],
1037       proto=0,
1038       doc="Push None on the stack."),
1039
1040     # Ways to spell bools, starting with proto 2.  See INT for how this was
1041     # done before proto 2.
1042
1043     I(name='NEWTRUE',
1044       code='\x88',
1045       arg=None,
1046       stack_before=[],
1047       stack_after=[pybool],
1048       proto=2,
1049       doc="""True.
1050
1051       Push True onto the stack."""),
1052
1053     I(name='NEWFALSE',
1054       code='\x89',
1055       arg=None,
1056       stack_before=[],
1057       stack_after=[pybool],
1058       proto=2,
1059       doc="""True.
1060
1061       Push False onto the stack."""),
1062
1063     # Ways to spell Unicode strings.
1064
1065     I(name='UNICODE',
1066       code='V',
1067       arg=unicodestringnl,
1068       stack_before=[],
1069       stack_after=[pyunicode],
1070       proto=0,  # this may be pure-text, but it's a later addition
1071       doc="""Push a Python Unicode string object.
1072
1073       The argument is a raw-unicode-escape encoding of a Unicode string,
1074       and so may contain embedded escape sequences.  The argument extends
1075       until the next newline character.
1076       """),
1077
1078     I(name='BINUNICODE',
1079       code='X',
1080       arg=unicodestring4,
1081       stack_before=[],
1082       stack_after=[pyunicode],
1083       proto=1,
1084       doc="""Push a Python Unicode string object.
1085
1086       There are two arguments:  the first is a 4-byte little-endian signed int
1087       giving the number of bytes in the string.  The second is that many
1088       bytes, and is the UTF-8 encoding of the Unicode string.
1089       """),
1090
1091     # Ways to spell floats.
1092
1093     I(name='FLOAT',
1094       code='F',
1095       arg=floatnl,
1096       stack_before=[],
1097       stack_after=[pyfloat],
1098       proto=0,
1099       doc="""Newline-terminated decimal float literal.
1100
1101       The argument is repr(a_float), and in general requires 17 significant
1102       digits for roundtrip conversion to be an identity (this is so for
1103       IEEE-754 double precision values, which is what Python float maps to
1104       on most boxes).
1105
1106       In general, FLOAT cannot be used to transport infinities, NaNs, or
1107       minus zero across boxes (or even on a single box, if the platform C
1108       library can't read the strings it produces for such things -- Windows
1109       is like that), but may do less damage than BINFLOAT on boxes with
1110       greater precision or dynamic range than IEEE-754 double.
1111       """),
1112
1113     I(name='BINFLOAT',
1114       code='G',
1115       arg=float8,
1116       stack_before=[],
1117       stack_after=[pyfloat],
1118       proto=1,
1119       doc="""Float stored in binary form, with 8 bytes of data.
1120
1121       This generally requires less than half the space of FLOAT encoding.
1122       In general, BINFLOAT cannot be used to transport infinities, NaNs, or
1123       minus zero, raises an exception if the exponent exceeds the range of
1124       an IEEE-754 double, and retains no more than 53 bits of precision (if
1125       there are more than that, "add a half and chop" rounding is used to
1126       cut it back to 53 significant bits).
1127       """),
1128
1129     # Ways to build lists.
1130
1131     I(name='EMPTY_LIST',
1132       code=']',
1133       arg=None,
1134       stack_before=[],
1135       stack_after=[pylist],
1136       proto=1,
1137       doc="Push an empty list."),
1138
1139     I(name='APPEND',
1140       code='a',
1141       arg=None,
1142       stack_before=[pylist, anyobject],
1143       stack_after=[pylist],
1144       proto=0,
1145       doc="""Append an object to a list.
1146
1147       Stack before:  ... pylist anyobject
1148       Stack after:   ... pylist+[anyobject]
1149
1150       although pylist is really extended in-place.
1151       """),
1152
1153     I(name='APPENDS',
1154       code='e',
1155       arg=None,
1156       stack_before=[pylist, markobject, stackslice],
1157       stack_after=[pylist],
1158       proto=1,
1159       doc="""Extend a list by a slice of stack objects.
1160
1161       Stack before:  ... pylist markobject stackslice
1162       Stack after:   ... pylist+stackslice
1163
1164       although pylist is really extended in-place.
1165       """),
1166
1167     I(name='LIST',
1168       code='l',
1169       arg=None,
1170       stack_before=[markobject, stackslice],
1171       stack_after=[pylist],
1172       proto=0,
1173       doc="""Build a list out of the topmost stack slice, after markobject.
1174
1175       All the stack entries following the topmost markobject are placed into
1176       a single Python list, which single list object replaces all of the
1177       stack from the topmost markobject onward.  For example,
1178
1179       Stack before: ... markobject 1 2 3 'abc'
1180       Stack after:  ... [1, 2, 3, 'abc']
1181       """),
1182
1183     # Ways to build tuples.
1184
1185     I(name='EMPTY_TUPLE',
1186       code=')',
1187       arg=None,
1188       stack_before=[],
1189       stack_after=[pytuple],
1190       proto=1,
1191       doc="Push an empty tuple."),
1192
1193     I(name='TUPLE',
1194       code='t',
1195       arg=None,
1196       stack_before=[markobject, stackslice],
1197       stack_after=[pytuple],
1198       proto=0,
1199       doc="""Build a tuple out of the topmost stack slice, after markobject.
1200
1201       All the stack entries following the topmost markobject are placed into
1202       a single Python tuple, which single tuple object replaces all of the
1203       stack from the topmost markobject onward.  For example,
1204
1205       Stack before: ... markobject 1 2 3 'abc'
1206       Stack after:  ... (1, 2, 3, 'abc')
1207       """),
1208
1209     I(name='TUPLE1',
1210       code='\x85',
1211       arg=None,
1212       stack_before=[anyobject],
1213       stack_after=[pytuple],
1214       proto=2,
1215       doc="""One-tuple.
1216
1217       This code pops one value off the stack and pushes a tuple of
1218       length 1 whose one item is that value back onto it.  IOW:
1219
1220           stack[-1] = tuple(stack[-1:])
1221       """),
1222
1223     I(name='TUPLE2',
1224       code='\x86',
1225       arg=None,
1226       stack_before=[anyobject, anyobject],
1227       stack_after=[pytuple],
1228       proto=2,
1229       doc="""One-tuple.
1230
1231       This code pops two values off the stack and pushes a tuple
1232       of length 2 whose items are those values back onto it.  IOW:
1233
1234           stack[-2:] = [tuple(stack[-2:])]
1235       """),
1236
1237     I(name='TUPLE3',
1238       code='\x87',
1239       arg=None,
1240       stack_before=[anyobject, anyobject, anyobject],
1241       stack_after=[pytuple],
1242       proto=2,
1243       doc="""One-tuple.
1244
1245       This code pops three values off the stack and pushes a tuple
1246       of length 3 whose items are those values back onto it.  IOW:
1247
1248           stack[-3:] = [tuple(stack[-3:])]
1249       """),
1250
1251     # Ways to build dicts.
1252
1253     I(name='EMPTY_DICT',
1254       code='}',
1255       arg=None,
1256       stack_before=[],
1257       stack_after=[pydict],
1258       proto=1,
1259       doc="Push an empty dict."),
1260
1261     I(name='DICT',
1262       code='d',
1263       arg=None,
1264       stack_before=[markobject, stackslice],
1265       stack_after=[pydict],
1266       proto=0,
1267       doc="""Build a dict out of the topmost stack slice, after markobject.
1268
1269       All the stack entries following the topmost markobject are placed into
1270       a single Python dict, which single dict object replaces all of the
1271       stack from the topmost markobject onward.  The stack slice alternates
1272       key, value, key, value, ....  For example,
1273
1274       Stack before: ... markobject 1 2 3 'abc'
1275       Stack after:  ... {1: 2, 3: 'abc'}
1276       """),
1277
1278     I(name='SETITEM',
1279       code='s',
1280       arg=None,
1281       stack_before=[pydict, anyobject, anyobject],
1282       stack_after=[pydict],
1283       proto=0,
1284       doc="""Add a key+value pair to an existing dict.
1285
1286       Stack before:  ... pydict key value
1287       Stack after:   ... pydict
1288
1289       where pydict has been modified via pydict[key] = value.
1290       """),
1291
1292     I(name='SETITEMS',
1293       code='u',
1294       arg=None,
1295       stack_before=[pydict, markobject, stackslice],
1296       stack_after=[pydict],
1297       proto=1,
1298       doc="""Add an arbitrary number of key+value pairs to an existing dict.
1299
1300       The slice of the stack following the topmost markobject is taken as
1301       an alternating sequence of keys and values, added to the dict
1302       immediately under the topmost markobject.  Everything at and after the
1303       topmost markobject is popped, leaving the mutated dict at the top
1304       of the stack.
1305
1306       Stack before:  ... pydict markobject key_1 value_1 ... key_n value_n
1307       Stack after:   ... pydict
1308
1309       where pydict has been modified via pydict[key_i] = value_i for i in
1310       1, 2, ..., n, and in that order.
1311       """),
1312
1313     # Stack manipulation.
1314
1315     I(name='POP',
1316       code='0',
1317       arg=None,
1318       stack_before=[anyobject],
1319       stack_after=[],
1320       proto=0,
1321       doc="Discard the top stack item, shrinking the stack by one item."),
1322
1323     I(name='DUP',
1324       code='2',
1325       arg=None,
1326       stack_before=[anyobject],
1327       stack_after=[anyobject, anyobject],
1328       proto=0,
1329       doc="Push the top stack item onto the stack again, duplicating it."),
1330
1331     I(name='MARK',
1332       code='(',
1333       arg=None,
1334       stack_before=[],
1335       stack_after=[markobject],
1336       proto=0,
1337       doc="""Push markobject onto the stack.
1338
1339       markobject is a unique object, used by other opcodes to identify a
1340       region of the stack containing a variable number of objects for them
1341       to work on.  See markobject.doc for more detail.
1342       """),
1343
1344     I(name='POP_MARK',
1345       code='1',
1346       arg=None,
1347       stack_before=[markobject, stackslice],
1348       stack_after=[],
1349       proto=0,
1350       doc="""Pop all the stack objects at and above the topmost markobject.
1351
1352       When an opcode using a variable number of stack objects is done,
1353       POP_MARK is used to remove those objects, and to remove the markobject
1354       that delimited their starting position on the stack.
1355       """),
1356
1357     # Memo manipulation.  There are really only two operations (get and put),
1358     # each in all-text, "short binary", and "long binary" flavors.
1359
1360     I(name='GET',
1361       code='g',
1362       arg=decimalnl_short,
1363       stack_before=[],
1364       stack_after=[anyobject],
1365       proto=0,
1366       doc="""Read an object from the memo and push it on the stack.
1367
1368       The index of the memo object to push is given by the newline-teriminated
1369       decimal string following.  BINGET and LONG_BINGET are space-optimized
1370       versions.
1371       """),
1372
1373     I(name='BINGET',
1374       code='h',
1375       arg=uint1,
1376       stack_before=[],
1377       stack_after=[anyobject],
1378       proto=1,
1379       doc="""Read an object from the memo and push it on the stack.
1380
1381       The index of the memo object to push is given by the 1-byte unsigned
1382       integer following.
1383       """),
1384
1385     I(name='LONG_BINGET',
1386       code='j',
1387       arg=int4,
1388       stack_before=[],
1389       stack_after=[anyobject],
1390       proto=1,
1391       doc="""Read an object from the memo and push it on the stack.
1392
1393       The index of the memo object to push is given by the 4-byte signed
1394       little-endian integer following.
1395       """),
1396
1397     I(name='PUT',
1398       code='p',
1399       arg=decimalnl_short,
1400       stack_before=[],
1401       stack_after=[],
1402       proto=0,
1403       doc="""Store the stack top into the memo.  The stack is not popped.
1404
1405       The index of the memo location to write into is given by the newline-
1406       terminated decimal string following.  BINPUT and LONG_BINPUT are
1407       space-optimized versions.
1408       """),
1409
1410     I(name='BINPUT',
1411       code='q',
1412       arg=uint1,
1413       stack_before=[],
1414       stack_after=[],
1415       proto=1,
1416       doc="""Store the stack top into the memo.  The stack is not popped.
1417
1418       The index of the memo location to write into is given by the 1-byte
1419       unsigned integer following.
1420       """),
1421
1422     I(name='LONG_BINPUT',
1423       code='r',
1424       arg=int4,
1425       stack_before=[],
1426       stack_after=[],
1427       proto=1,
1428       doc="""Store the stack top into the memo.  The stack is not popped.
1429
1430       The index of the memo location to write into is given by the 4-byte
1431       signed little-endian integer following.
1432       """),
1433
1434     # Access the extension registry (predefined objects).  Akin to the GET
1435     # family.
1436
1437     I(name='EXT1',
1438       code='\x82',
1439       arg=uint1,
1440       stack_before=[],
1441       stack_after=[anyobject],
1442       proto=2,
1443       doc="""Extension code.
1444
1445       This code and the similar EXT2 and EXT4 allow using a registry
1446       of popular objects that are pickled by name, typically classes.
1447       It is envisioned that through a global negotiation and
1448       registration process, third parties can set up a mapping between
1449       ints and object names.
1450
1451       In order to guarantee pickle interchangeability, the extension
1452       code registry ought to be global, although a range of codes may
1453       be reserved for private use.
1454
1455       EXT1 has a 1-byte integer argument.  This is used to index into the
1456       extension registry, and the object at that index is pushed on the stack.
1457       """),
1458
1459     I(name='EXT2',
1460       code='\x83',
1461       arg=uint2,
1462       stack_before=[],
1463       stack_after=[anyobject],
1464       proto=2,
1465       doc="""Extension code.
1466
1467       See EXT1.  EXT2 has a two-byte integer argument.
1468       """),
1469
1470     I(name='EXT4',
1471       code='\x84',
1472       arg=int4,
1473       stack_before=[],
1474       stack_after=[anyobject],
1475       proto=2,
1476       doc="""Extension code.
1477
1478       See EXT1.  EXT4 has a four-byte integer argument.
1479       """),
1480
1481     # Push a class object, or module function, on the stack, via its module
1482     # and name.
1483
1484     I(name='GLOBAL',
1485       code='c',
1486       arg=stringnl_noescape_pair,
1487       stack_before=[],
1488       stack_after=[anyobject],
1489       proto=0,
1490       doc="""Push a global object (module.attr) on the stack.
1491
1492       Two newline-terminated strings follow the GLOBAL opcode.  The first is
1493       taken as a module name, and the second as a class name.  The class
1494       object module.class is pushed on the stack.  More accurately, the
1495       object returned by self.find_class(module, class) is pushed on the
1496       stack, so unpickling subclasses can override this form of lookup.
1497       """),
1498
1499     # Ways to build objects of classes pickle doesn't know about directly
1500     # (user-defined classes).  I despair of documenting this accurately
1501     # and comprehensibly -- you really have to read the pickle code to
1502     # find all the special cases.
1503
1504     I(name='REDUCE',
1505       code='R',
1506       arg=None,
1507       stack_before=[anyobject, anyobject],
1508       stack_after=[anyobject],
1509       proto=0,
1510       doc="""Push an object built from a callable and an argument tuple.
1511
1512       The opcode is named to remind of the __reduce__() method.
1513
1514       Stack before: ... callable pytuple
1515       Stack after:  ... callable(*pytuple)
1516
1517       The callable and the argument tuple are the first two items returned
1518       by a __reduce__ method.  Applying the callable to the argtuple is
1519       supposed to reproduce the original object, or at least get it started.
1520       If the __reduce__ method returns a 3-tuple, the last component is an
1521       argument to be passed to the object's __setstate__, and then the REDUCE
1522       opcode is followed by code to create setstate's argument, and then a
1523       BUILD opcode to apply  __setstate__ to that argument.
1524
1525       There are lots of special cases here.  The argtuple can be None, in
1526       which case callable.__basicnew__() is called instead to produce the
1527       object to be pushed on the stack.  This appears to be a trick unique
1528       to ExtensionClasses, and is deprecated regardless.
1529
1530       If type(callable) is not ClassType, REDUCE complains unless the
1531       callable has been registered with the copy_reg module's
1532       safe_constructors dict, or the callable has a magic
1533       '__safe_for_unpickling__' attribute with a true value.  I'm not sure
1534       why it does this, but I've sure seen this complaint often enough when
1535       I didn't want to <wink>.
1536       """),
1537
1538     I(name='BUILD',
1539       code='b',
1540       arg=None,
1541       stack_before=[anyobject, anyobject],
1542       stack_after=[anyobject],
1543       proto=0,
1544       doc="""Finish building an object, via __setstate__ or dict update.
1545
1546       Stack before: ... anyobject argument
1547       Stack after:  ... anyobject
1548
1549       where anyobject may have been mutated, as follows:
1550
1551       If the object has a __setstate__ method,
1552
1553           anyobject.__setstate__(argument)
1554
1555       is called.
1556
1557       Else the argument must be a dict, the object must have a __dict__, and
1558       the object is updated via
1559
1560           anyobject.__dict__.update(argument)
1561
1562       This may raise RuntimeError in restricted execution mode (which
1563       disallows access to __dict__ directly); in that case, the object
1564       is updated instead via
1565
1566           for k, v in argument.items():
1567               anyobject[k] = v
1568       """),
1569
1570     I(name='INST',
1571       code='i',
1572       arg=stringnl_noescape_pair,
1573       stack_before=[markobject, stackslice],
1574       stack_after=[anyobject],
1575       proto=0,
1576       doc="""Build a class instance.
1577
1578       This is the protocol 0 version of protocol 1's OBJ opcode.
1579       INST is followed by two newline-terminated strings, giving a
1580       module and class name, just as for the GLOBAL opcode (and see
1581       GLOBAL for more details about that).  self.find_class(module, name)
1582       is used to get a class object.
1583
1584       In addition, all the objects on the stack following the topmost
1585       markobject are gathered into a tuple and popped (along with the
1586       topmost markobject), just as for the TUPLE opcode.
1587
1588       Now it gets complicated.  If all of these are true:
1589
1590         + The argtuple is empty (markobject was at the top of the stack
1591           at the start).
1592
1593         + It's an old-style class object (the type of the class object is
1594           ClassType).
1595
1596         + The class object does not have a __getinitargs__ attribute.
1597
1598       then we want to create an old-style class instance without invoking
1599       its __init__() method (pickle has waffled on this over the years; not
1600       calling __init__() is current wisdom).  In this case, an instance of
1601       an old-style dummy class is created, and then we try to rebind its
1602       __class__ attribute to the desired class object.  If this succeeds,
1603       the new instance object is pushed on the stack, and we're done.  In
1604       restricted execution mode it can fail (assignment to __class__ is
1605       disallowed), and I'm not really sure what happens then -- it looks
1606       like the code ends up calling the class object's __init__ anyway,
1607       via falling into the next case.
1608
1609       Else (the argtuple is not empty, it's not an old-style class object,
1610       or the class object does have a __getinitargs__ attribute), the code
1611       first insists that the class object have a __safe_for_unpickling__
1612       attribute.  Unlike as for the __safe_for_unpickling__ check in REDUCE,
1613       it doesn't matter whether this attribute has a true or false value, it
1614       only matters whether it exists (XXX this is a bug; cPickle
1615       requires the attribute to be true).  If __safe_for_unpickling__
1616       doesn't exist, UnpicklingError is raised.
1617
1618       Else (the class object does have a __safe_for_unpickling__ attr),
1619       the class object obtained from INST's arguments is applied to the
1620       argtuple obtained from the stack, and the resulting instance object
1621       is pushed on the stack.
1622
1623       NOTE:  checks for __safe_for_unpickling__ went away in Python 2.3.
1624       """),
1625
1626     I(name='OBJ',
1627       code='o',
1628       arg=None,
1629       stack_before=[markobject, anyobject, stackslice],
1630       stack_after=[anyobject],
1631       proto=1,
1632       doc="""Build a class instance.
1633
1634       This is the protocol 1 version of protocol 0's INST opcode, and is
1635       very much like it.  The major difference is that the class object
1636       is taken off the stack, allowing it to be retrieved from the memo
1637       repeatedly if several instances of the same class are created.  This
1638       can be much more efficient (in both time and space) than repeatedly
1639       embedding the module and class names in INST opcodes.
1640
1641       Unlike INST, OBJ takes no arguments from the opcode stream.  Instead
1642       the class object is taken off the stack, immediately above the
1643       topmost markobject:
1644
1645       Stack before: ... markobject classobject stackslice
1646       Stack after:  ... new_instance_object
1647
1648       As for INST, the remainder of the stack above the markobject is
1649       gathered into an argument tuple, and then the logic seems identical,
1650       except that no __safe_for_unpickling__ check is done (XXX this is
1651       a bug; cPickle does test __safe_for_unpickling__).  See INST for
1652       the gory details.
1653
1654       NOTE:  In Python 2.3, INST and OBJ are identical except for how they
1655       get the class object.  That was always the intent; the implementations
1656       had diverged for accidental reasons.
1657       """),
1658
1659     I(name='NEWOBJ',
1660       code='\x81',
1661       arg=None,
1662       stack_before=[anyobject, anyobject],
1663       stack_after=[anyobject],
1664       proto=2,
1665       doc="""Build an object instance.
1666
1667       The stack before should be thought of as containing a class
1668       object followed by an argument tuple (the tuple being the stack
1669       top).  Call these cls and args.  They are popped off the stack,
1670       and the value returned by cls.__new__(cls, *args) is pushed back
1671       onto the stack.
1672       """),
1673
1674     # Machine control.
1675
1676     I(name='PROTO',
1677       code='\x80',
1678       arg=uint1,
1679       stack_before=[],
1680       stack_after=[],
1681       proto=2,
1682       doc="""Protocol version indicator.
1683
1684       For protocol 2 and above, a pickle must start with this opcode.
1685       The argument is the protocol version, an int in range(2, 256).
1686       """),
1687
1688     I(name='STOP',
1689       code='.',
1690       arg=None,
1691       stack_before=[anyobject],
1692       stack_after=[],
1693       proto=0,
1694       doc="""Stop the unpickling machine.
1695
1696       Every pickle ends with this opcode.  The object at the top of the stack
1697       is popped, and that's the result of unpickling.  The stack should be
1698       empty then.
1699       """),
1700
1701     # Ways to deal with persistent IDs.
1702
1703     I(name='PERSID',
1704       code='P',
1705       arg=stringnl_noescape,
1706       stack_before=[],
1707       stack_after=[anyobject],
1708       proto=0,
1709       doc="""Push an object identified by a persistent ID.
1710
1711       The pickle module doesn't define what a persistent ID means.  PERSID's
1712       argument is a newline-terminated str-style (no embedded escapes, no
1713       bracketing quote characters) string, which *is* "the persistent ID".
1714       The unpickler passes this string to self.persistent_load().  Whatever
1715       object that returns is pushed on the stack.  There is no implementation
1716       of persistent_load() in Python's unpickler:  it must be supplied by an
1717       unpickler subclass.
1718       """),
1719
1720     I(name='BINPERSID',
1721       code='Q',
1722       arg=None,
1723       stack_before=[anyobject],
1724       stack_after=[anyobject],
1725       proto=1,
1726       doc="""Push an object identified by a persistent ID.
1727
1728       Like PERSID, except the persistent ID is popped off the stack (instead
1729       of being a string embedded in the opcode bytestream).  The persistent
1730       ID is passed to self.persistent_load(), and whatever object that
1731       returns is pushed on the stack.  See PERSID for more detail.
1732       """),
1733 ]
1734 del I
1735
1736 # Verify uniqueness of .name and .code members.
1737 name2i = {}
1738 code2i = {}
1739
1740 for i, d in enumerate(opcodes):
1741     if d.name in name2i:
1742         raise ValueError("repeated name %r at indices %d and %d" %
1743                          (d.name, name2i[d.name], i))
1744     if d.code in code2i:
1745         raise ValueError("repeated code %r at indices %d and %d" %
1746                          (d.code, code2i[d.code], i))
1747
1748     name2i[d.name] = i
1749     code2i[d.code] = i
1750
1751 del name2i, code2i, i, d
1752
1753 ##############################################################################
1754 # Build a code2op dict, mapping opcode characters to OpcodeInfo records.
1755 # Also ensure we've got the same stuff as pickle.py, although the
1756 # introspection here is dicey.
1757
1758 code2op = {}
1759 for d in opcodes:
1760     code2op[d.code] = d
1761 del d
1762
1763 def assure_pickle_consistency(verbose=False):
1764     import pickle, re
1765
1766     copy = code2op.copy()
1767     for name in pickle.__all__:
1768         if not re.match("[A-Z][A-Z0-9_]+$", name):
1769             if verbose:
1770                 print "skipping %r: it doesn't look like an opcode name" % name
1771             continue
1772         picklecode = getattr(pickle, name)
1773         if not isinstance(picklecode, str) or len(picklecode) != 1:
1774             if verbose:
1775                 print ("skipping %r: value %r doesn't look like a pickle "
1776                        "code" % (name, picklecode))
1777             continue
1778         if picklecode in copy:
1779             if verbose:
1780                 print "checking name %r w/ code %r for consistency" % (
1781                       name, picklecode)
1782             d = copy[picklecode]
1783             if d.name != name:
1784                 raise ValueError("for pickle code %r, pickle.py uses name %r "
1785                                  "but we're using name %r" % (picklecode,
1786                                                               name,
1787                                                               d.name))
1788             # Forget this one.  Any left over in copy at the end are a problem
1789             # of a different kind.
1790             del copy[picklecode]
1791         else:
1792             raise ValueError("pickle.py appears to have a pickle opcode with "
1793                              "name %r and code %r, but we don't" %
1794                              (name, picklecode))
1795     if copy:
1796         msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
1797         for code, d in copy.items():
1798             msg.append("    name %r with code %r" % (d.name, code))
1799         raise ValueError("\n".join(msg))
1800
1801 assure_pickle_consistency()
1802 del assure_pickle_consistency
1803
1804 ##############################################################################
1805 # A pickle opcode generator.
1806
1807 def genops(pickle):
1808     """Generate all the opcodes in a pickle.
1809
1810     'pickle' is a file-like object, or string, containing the pickle.
1811
1812     Each opcode in the pickle is generated, from the current pickle position,
1813     stopping after a STOP opcode is delivered.  A triple is generated for
1814     each opcode:
1815
1816         opcode, arg, pos
1817
1818     opcode is an OpcodeInfo record, describing the current opcode.
1819
1820     If the opcode has an argument embedded in the pickle, arg is its decoded
1821     value, as a Python object.  If the opcode doesn't have an argument, arg
1822     is None.
1823
1824     If the pickle has a tell() method, pos was the value of pickle.tell()
1825     before reading the current opcode.  If the pickle is a string object,
1826     it's wrapped in a StringIO object, and the latter's tell() result is
1827     used.  Else (the pickle doesn't have a tell(), and it's not obvious how
1828     to query its current position) pos is None.
1829     """
1830
1831     import cStringIO as StringIO
1832
1833     if isinstance(pickle, str):
1834         pickle = StringIO.StringIO(pickle)
1835
1836     if hasattr(pickle, "tell"):
1837         getpos = pickle.tell
1838     else:
1839         getpos = lambda: None
1840
1841     while True:
1842         pos = getpos()
1843         code = pickle.read(1)
1844         opcode = code2op.get(code)
1845         if opcode is None:
1846             if code == "":
1847                 raise ValueError("pickle exhausted before seeing STOP")
1848             else:
1849                 raise ValueError("at position %s, opcode %r unknown" % (
1850                                  pos is None and "<unknown>" or pos,
1851                                  code))
1852         if opcode.arg is None:
1853             arg = None
1854         else:
1855             arg = opcode.arg.reader(pickle)
1856         yield opcode, arg, pos
1857         if code == '.':
1858             assert opcode.name == 'STOP'
1859             break
1860
1861 ##############################################################################
1862 # A symbolic pickle disassembler.
1863
1864 def dis(pickle, out=None, memo=None, indentlevel=4):
1865     """Produce a symbolic disassembly of a pickle.
1866
1867     'pickle' is a file-like object, or string, containing a (at least one)
1868     pickle.  The pickle is disassembled from the current position, through
1869     the first STOP opcode encountered.
1870
1871     Optional arg 'out' is a file-like object to which the disassembly is
1872     printed.  It defaults to sys.stdout.
1873
1874     Optional arg 'memo' is a Python dict, used as the pickle's memo.  It
1875     may be mutated by dis(), if the pickle contains PUT or BINPUT opcodes.
1876     Passing the same memo object to another dis() call then allows disassembly
1877     to proceed across multiple pickles that were all created by the same
1878     pickler with the same memo.  Ordinarily you don't need to worry about this.
1879
1880     Optional arg indentlevel is the number of blanks by which to indent
1881     a new MARK level.  It defaults to 4.
1882
1883     In addition to printing the disassembly, some sanity checks are made:
1884
1885     + All embedded opcode arguments "make sense".
1886
1887     + Explicit and implicit pop operations have enough items on the stack.
1888
1889     + When an opcode implicitly refers to a markobject, a markobject is
1890       actually on the stack.
1891
1892     + A memo entry isn't referenced before it's defined.
1893
1894     + The markobject isn't stored in the memo.
1895
1896     + A memo entry isn't redefined.
1897     """
1898
1899     # Most of the hair here is for sanity checks, but most of it is needed
1900     # anyway to detect when a protocol 0 POP takes a MARK off the stack
1901     # (which in turn is needed to indent MARK blocks correctly).
1902
1903     stack = []          # crude emulation of unpickler stack
1904     if memo is None:
1905         memo = {}       # crude emulation of unpicker memo
1906     maxproto = -1       # max protocol number seen
1907     markstack = []      # bytecode positions of MARK opcodes
1908     indentchunk = ' ' * indentlevel
1909     errormsg = None
1910     for opcode, arg, pos in genops(pickle):
1911         if pos is not None:
1912             print >> out, "%5d:" % pos,
1913
1914         line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
1915                               indentchunk * len(markstack),
1916                               opcode.name)
1917
1918         maxproto = max(maxproto, opcode.proto)
1919         before = opcode.stack_before    # don't mutate
1920         after = opcode.stack_after      # don't mutate
1921         numtopop = len(before)
1922
1923         # See whether a MARK should be popped.
1924         markmsg = None
1925         if markobject in before or (opcode.name == "POP" and
1926                                     stack and
1927                                     stack[-1] is markobject):
1928             assert markobject not in after
1929             if __debug__:
1930                 if markobject in before:
1931                     assert before[-1] is stackslice
1932             if markstack:
1933                 markpos = markstack.pop()
1934                 if markpos is None:
1935                     markmsg = "(MARK at unknown opcode offset)"
1936                 else:
1937                     markmsg = "(MARK at %d)" % markpos
1938                 # Pop everything at and after the topmost markobject.
1939                 while stack[-1] is not markobject:
1940                     stack.pop()
1941                 stack.pop()
1942                 # Stop later code from popping too much.
1943                 try:
1944                     numtopop = before.index(markobject)
1945                 except ValueError:
1946                     assert opcode.name == "POP"
1947                     numtopop = 0
1948             else:
1949                 errormsg = markmsg = "no MARK exists on stack"
1950
1951         # Check for correct memo usage.
1952         if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT"):
1953             assert arg is not None
1954             if arg in memo:
1955                 errormsg = "memo key %r already defined" % arg
1956             elif not stack:
1957                 errormsg = "stack is empty -- can't store into memo"
1958             elif stack[-1] is markobject:
1959                 errormsg = "can't store markobject in the memo"
1960             else:
1961                 memo[arg] = stack[-1]
1962
1963         elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
1964             if arg in memo:
1965                 assert len(after) == 1
1966                 after = [memo[arg]]     # for better stack emulation
1967             else:
1968                 errormsg = "memo key %r has never been stored into" % arg
1969
1970         if arg is not None or markmsg:
1971             # make a mild effort to align arguments
1972             line += ' ' * (10 - len(opcode.name))
1973             if arg is not None:
1974                 line += ' ' + repr(arg)
1975             if markmsg:
1976                 line += ' ' + markmsg
1977         print >> out, line
1978
1979         if errormsg:
1980             # Note that we delayed complaining until the offending opcode
1981             # was printed.
1982             raise ValueError(errormsg)
1983
1984         # Emulate the stack effects.
1985         if len(stack) < numtopop:
1986             raise ValueError("tries to pop %d items from stack with "
1987                              "only %d items" % (numtopop, len(stack)))
1988         if numtopop:
1989             del stack[-numtopop:]
1990         if markobject in after:
1991             assert markobject not in before
1992             markstack.append(pos)
1993
1994         stack.extend(after)
1995
1996     print >> out, "highest protocol among opcodes =", maxproto
1997     if stack:
1998         raise ValueError("stack not empty after STOP: %r" % stack)
1999
2000 _dis_test = r"""
2001 >>> import pickle
2002 >>> x = [1, 2, (3, 4), {'abc': u"def"}]
2003 >>> pkl = pickle.dumps(x, 0)
2004 >>> dis(pkl)
2005     0: (    MARK
2006     1: l        LIST       (MARK at 0)
2007     2: p    PUT        0
2008     5: I    INT        1
2009     8: a    APPEND
2010     9: I    INT        2
2011    12: a    APPEND
2012    13: (    MARK
2013    14: I        INT        3
2014    17: I        INT        4
2015    20: t        TUPLE      (MARK at 13)
2016    21: p    PUT        1
2017    24: a    APPEND
2018    25: (    MARK
2019    26: d        DICT       (MARK at 25)
2020    27: p    PUT        2
2021    30: S    STRING     'abc'
2022    37: p    PUT        3
2023    40: V    UNICODE    u'def'
2024    45: p    PUT        4
2025    48: s    SETITEM
2026    49: a    APPEND
2027    50: .    STOP
2028 highest protocol among opcodes = 0
2029
2030 Try again with a "binary" pickle.
2031
2032 >>> pkl = pickle.dumps(x, 1)
2033 >>> dis(pkl)
2034     0: ]    EMPTY_LIST
2035     1: q    BINPUT     0
2036     3: (    MARK
2037     4: K        BININT1    1
2038     6: K        BININT1    2
2039     8: (        MARK
2040     9: K            BININT1    3
2041    11: K            BININT1    4
2042    13: t            TUPLE      (MARK at 8)
2043    14: q        BINPUT     1
2044    16: }        EMPTY_DICT
2045    17: q        BINPUT     2
2046    19: U        SHORT_BINSTRING 'abc'
2047    24: q        BINPUT     3
2048    26: X        BINUNICODE u'def'
2049    34: q        BINPUT     4
2050    36: s        SETITEM
2051    37: e        APPENDS    (MARK at 3)
2052    38: .    STOP
2053 highest protocol among opcodes = 1
2054
2055 Exercise the INST/OBJ/BUILD family.
2056
2057 >>> import random
2058 >>> dis(pickle.dumps(random.random, 0))
2059     0: c    GLOBAL     'random random'
2060    15: p    PUT        0
2061    18: .    STOP
2062 highest protocol among opcodes = 0
2063
2064 >>> x = [pickle.PicklingError()] * 2
2065 >>> dis(pickle.dumps(x, 0))
2066     0: (    MARK
2067     1: l        LIST       (MARK at 0)
2068     2: p    PUT        0
2069     5: (    MARK
2070     6: i        INST       'pickle PicklingError' (MARK at 5)
2071    28: p    PUT        1
2072    31: (    MARK
2073    32: d        DICT       (MARK at 31)
2074    33: p    PUT        2
2075    36: S    STRING     'args'
2076    44: p    PUT        3
2077    47: (    MARK
2078    48: t        TUPLE      (MARK at 47)
2079    49: s    SETITEM
2080    50: b    BUILD
2081    51: a    APPEND
2082    52: g    GET        1
2083    55: a    APPEND
2084    56: .    STOP
2085 highest protocol among opcodes = 0
2086
2087 >>> dis(pickle.dumps(x, 1))
2088     0: ]    EMPTY_LIST
2089     1: q    BINPUT     0
2090     3: (    MARK
2091     4: (        MARK
2092     5: c            GLOBAL     'pickle PicklingError'
2093    27: q            BINPUT     1
2094    29: o            OBJ        (MARK at 4)
2095    30: q        BINPUT     2
2096    32: }        EMPTY_DICT
2097    33: q        BINPUT     3
2098    35: U        SHORT_BINSTRING 'args'
2099    41: q        BINPUT     4
2100    43: )        EMPTY_TUPLE
2101    44: s        SETITEM
2102    45: b        BUILD
2103    46: h        BINGET     2
2104    48: e        APPENDS    (MARK at 3)
2105    49: .    STOP
2106 highest protocol among opcodes = 1
2107
2108 Try "the canonical" recursive-object test.
2109
2110 >>> L = []
2111 >>> T = L,
2112 >>> L.append(T)
2113 >>> L[0] is T
2114 True
2115 >>> T[0] is L
2116 True
2117 >>> L[0][0] is L
2118 True
2119 >>> T[0][0] is T
2120 True
2121 >>> dis(pickle.dumps(L, 0))
2122     0: (    MARK
2123     1: l        LIST       (MARK at 0)
2124     2: p    PUT        0
2125     5: (    MARK
2126     6: g        GET        0
2127     9: t        TUPLE      (MARK at 5)
2128    10: p    PUT        1
2129    13: a    APPEND
2130    14: .    STOP
2131 highest protocol among opcodes = 0
2132
2133 >>> dis(pickle.dumps(L, 1))
2134     0: ]    EMPTY_LIST
2135     1: q    BINPUT     0
2136     3: (    MARK
2137     4: h        BINGET     0
2138     6: t        TUPLE      (MARK at 3)
2139     7: q    BINPUT     1
2140     9: a    APPEND
2141    10: .    STOP
2142 highest protocol among opcodes = 1
2143
2144 Note that, in the protocol 0 pickle of the recursive tuple, the disassembler
2145 has to emulate the stack in order to realize that the POP opcode at 16 gets
2146 rid of the MARK at 0.
2147
2148 >>> dis(pickle.dumps(T, 0))
2149     0: (    MARK
2150     1: (        MARK
2151     2: l            LIST       (MARK at 1)
2152     3: p        PUT        0
2153     6: (        MARK
2154     7: g            GET        0
2155    10: t            TUPLE      (MARK at 6)
2156    11: p        PUT        1
2157    14: a        APPEND
2158    15: 0        POP
2159    16: 0        POP        (MARK at 0)
2160    17: g    GET        1
2161    20: .    STOP
2162 highest protocol among opcodes = 0
2163
2164 >>> dis(pickle.dumps(T, 1))
2165     0: (    MARK
2166     1: ]        EMPTY_LIST
2167     2: q        BINPUT     0
2168     4: (        MARK
2169     5: h            BINGET     0
2170     7: t            TUPLE      (MARK at 4)
2171     8: q        BINPUT     1
2172    10: a        APPEND
2173    11: 1        POP_MARK   (MARK at 0)
2174    12: h    BINGET     1
2175    14: .    STOP
2176 highest protocol among opcodes = 1
2177
2178 Try protocol 2.
2179
2180 >>> dis(pickle.dumps(L, 2))
2181     0: \x80 PROTO      2
2182     2: ]    EMPTY_LIST
2183     3: q    BINPUT     0
2184     5: h    BINGET     0
2185     7: \x85 TUPLE1
2186     8: q    BINPUT     1
2187    10: a    APPEND
2188    11: .    STOP
2189 highest protocol among opcodes = 2
2190
2191 >>> dis(pickle.dumps(T, 2))
2192     0: \x80 PROTO      2
2193     2: ]    EMPTY_LIST
2194     3: q    BINPUT     0
2195     5: h    BINGET     0
2196     7: \x85 TUPLE1
2197     8: q    BINPUT     1
2198    10: a    APPEND
2199    11: 0    POP
2200    12: h    BINGET     1
2201    14: .    STOP
2202 highest protocol among opcodes = 2
2203 """
2204
2205 _memo_test = r"""
2206 >>> import pickle
2207 >>> from StringIO import StringIO
2208 >>> f = StringIO()
2209 >>> p = pickle.Pickler(f, 2)
2210 >>> x = [1, 2, 3]
2211 >>> p.dump(x)
2212 >>> p.dump(x)
2213 >>> f.seek(0)
2214 >>> memo = {}
2215 >>> dis(f, memo=memo)
2216     0: \x80 PROTO      2
2217     2: ]    EMPTY_LIST
2218     3: q    BINPUT     0
2219     5: (    MARK
2220     6: K        BININT1    1
2221     8: K        BININT1    2
2222    10: K        BININT1    3
2223    12: e        APPENDS    (MARK at 5)
2224    13: .    STOP
2225 highest protocol among opcodes = 2
2226 >>> dis(f, memo=memo)
2227    14: \x80 PROTO      2
2228    16: h    BINGET     0
2229    18: .    STOP
2230 highest protocol among opcodes = 2
2231 """
2232
2233 __test__ = {'disassembler_test': _dis_test,
2234             'disassembler_memo_test': _memo_test,
2235            }
2236
2237 def _test():
2238     import doctest
2239     return doctest.testmod()
2240
2241 if __name__ == "__main__":
2242     _test()