Return-Path: <Eric.Sosman@Sun.COM>
X-OfflineIMAP-163397380-4f7a6c616273-494e424f58: 1074122979-000541449629321
X-Original-To: mbp.sourcefrog.net
Delivered-To: mbp.ozlabs.org
Received: from nwkea-mail-2.sun.com (nwkea-mail-2.sun.com [192.18.42.14])
    by ozlabs.org (Postfix) with ESMTP id 453CB2BD32
    for <mbp.sourcefrog.net>; Wed, 24 Dec 2003 07:01:47 +1100 (EST)
Received: from phys-bur-1 ([129.148.9.72])
    by nwkea-mail-2.sun.com (8.12.10/8.12.9) with ESMTP id hBNK1i0H026394
    for <mbp.sourcefrog.net>; Tue, 23 Dec 2003 12:01:45 -0800 (PST)
Received: from sun.com (tardis.East.Sun.COM [129.148.168.113])
    by bur-mail1.east.sun.com
    (iPlanet Messaging Server 5.2 HotFix 1.16 (built May 14 2003))
    with ESMTPA id <0HQD00BUX6AWIO.bur-mail1.east.sun.com> for mbp.sourcefrog.net;
    Tue, 23 Dec 2003 15:01:44 -0500 (EST)
Date: Tue, 23 Dec 2003 15:01:44 -0500
From: Eric Sosman <Eric.Sosman.Sun.COM>
Subject: Lurking bug in strnatcmp()
Sender: Eric.Sosman.Sun.COM
To: mbp.sourcefrog.net
Reply-To: Eric.Sosman.Sun.COM
Message-id: <3FE89F28.56FC1586.sun.com>
Organization: Sun Microsystems
X-Mailer: Mozilla 4.79C-CCK-MCD [en] (X11; U; SunOS 5.8 sun4u)
Content-type: text/plain; charset=iso-8859-1
Content-transfer-encoding: quoted-printable
X-Spam-Checker-Version: SpamAssassin 2.60 (1.212-2003-09-23-exp) on
X-Spam-Status: No, hits=-4.9 required=3.5 tests=BAYES_00 autolearn=ham

A few days ago I came across a link to your "natural
string comparison" implementations at

    http://sourcefrog.net/projects/natsort/

The work strikes me as a splendid idea, and I'll probably
start using the code in my own private projects.

However, I noticed that the C implementation has several
times fallen victim to a rather nasty little linguistic trap:
it is *not* safe to pass an ordinary char value to the <ctype.h>
functions: isdigit(), isspace(), toupper(), and so forth. Yes,
it looks like this is the intended usage, and it happens that
you'll very often get away with the error -- but it's an error.

Here's why: The argument to a <ctype.h> function must be

    "[...] an int, the value of which shall be representable
    as an unsigned char or shall equal the value of the macro
    EOF. If the argument has any other value, the behavior
    is undefined."
    -- ISO/IEC 9899:1999 (aka "C99"), Section 7.4, paragraph 1

That is, ordinary character codes are to be represented as non-
negative values; the only negative argument for which the <ctype.h>
functions are defined is the special value EOF.
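
As a minimal sketch of that rule, the set of defined arguments could be written as a predicate. The name valid_ctype_argument is hypothetical; it is not part of <ctype.h> or of strnatcmp.

```c
#include <limits.h>  /* UCHAR_MAX */
#include <stdio.h>   /* EOF */

/* Hypothetical helper: nonzero if `value` is an argument for which the
   <ctype.h> functions are defined (C99 Section 7.4, paragraph 1), i.e.
   it is representable as an unsigned char or equals EOF. */
static int valid_ctype_argument(int value)
{
    return value == EOF || (value >= 0 && value <= UCHAR_MAX);
}
```

Any negative value other than EOF fails this check, which is exactly the case a sign-extended char can produce.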

EOF isn't a concern when plucking char values out of a string,
but you must contend with the possibility that a char value might
be negative. This won't happen if the implementation defines its
char as an unsigned type, or if the characters are taken from "the
basic execution character set," whose codes are required to be non-
negative (Section 6.2.5, paragraph 3). But on an implementation
where char is signed, characters from "the extended execution set"
can have negative codes -- and this will raise merry Hell. Try it
if you like: compile strnatcmp() on a signed-char implementation
(with gcc, you can use the -fsigned-char flag) and then feed it a
few strings like "Aïda" or "Così fan tutte" or "Götterdämmerung"
and watch for odd behavior ...
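
To see where the negative codes come from, here is a small sketch that models the promotion of a char to int on a two's-complement implementation where char is signed. The helper name promoted_if_char_signed is hypothetical, and the model uses plain arithmetic so that it gives the same answer however the compiler building it treats char.

```c
#include <limits.h>  /* SCHAR_MAX, UCHAR_MAX */

/* Hypothetical helper: models the int obtained by promoting a char that
   holds the byte `byte`, assuming a two's-complement implementation on
   which char is signed. Bytes above SCHAR_MAX wrap to negative values. */
static int promoted_if_char_signed(unsigned char byte)
{
    return byte > SCHAR_MAX ? (int)byte - (UCHAR_MAX + 1) : (int)byte;
}
```

Assuming 8-bit bytes, the byte 0xEF (the accented i in ISO 8859-1) comes out as -17, which lies outside the unsigned char range, while a plain ASCII code such as 65 passes through unchanged.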

The recommended cure is to force unsignedness explicitly:

    while (isspace( (unsigned char)ca ))

    if (isdigit( (unsigned char)ca ))

    ca = toupper( (unsigned char)ca );

... and so forth. This eliminates any chance that a char with a
high-order one-bit will be sign-extended upon promotion to int.
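
One way to apply the cast consistently is to keep it in one place. The following is only a sketch: the nat_* wrapper names are hypothetical and do not appear in strnatcmp.

```c
#include <ctype.h>

/* Hypothetical wrappers (not part of strnatcmp): each converts its char
   argument to unsigned char before calling the <ctype.h> function, so a
   negative char value is never passed through. */
static int nat_isdigit(char c)
{
    return isdigit((unsigned char)c);
}

static int nat_isspace(char c)
{
    return isspace((unsigned char)c);
}

static char nat_toupper(char c)
{
    /* toupper() returns an int in the unsigned char range; convert back. */
    return (char)toupper((unsigned char)c);
}
```

Callers can then write nat_isdigit(ca) with a plain char and never repeat the cast at each call site.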

By the way, there is no analogous problem in Java, where all
char values are non-negative. IMHO this is a defect in C, one of
the few things Writchie got Rong -- but in his defense, I must
admit that had I been he I'd certainly have done even worse.

Thanks again for the ideas behind strnatcmp(); I like it!