mikeash.com pyblog/friday-qa-2015-11-06-why-is-swifts-string-api-so-hard.html comments

Rennie - 2016-12-14 08:03:50

Wed, 14 Dec 2016 08:03:50 GMT

"(Incidentally, I think that representing all these different concepts as a single string type is a mistake. Human-readable text, file paths, SQL statements, and others are all conceptually different, and this should be represented as different types at the language level. I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.)"

Boy, do I disagree. I have painful memories of working with some C++ programs where there were about a half-dozen different representations for string. The basic 8-bit zero-terminated strings, a variation that had a 16-bit length in front, two different "wide character" types from two different development groups in Microsoft and two other text string "standards" created by independent organisations or library implementers. What a mess. Every time some text data had to be conveyed from one part of the program to another, or calling a function at a different level in the implementation, it almost always involved converting from one kind of string representation to another.

Please, never again!

Lisper - 2016-05-29 19:19:21

Sun, 29 May 2016 19:19:21 GMT

"(Incidentally, I think that representing all these different concepts as a single string type is a mistake. Human-readable text, file paths, SQL statements, and others are all conceptually different, and this should be represented as different types at the language level. I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.)"

Common Lisp essentially does this: #P"foo" denotes a pathname string, though many stdlib functions still also accept ordinary strings.

Python also recently added Path objects, though there's no syntax for them, and I'm not sure anyone uses them yet.

Database interfaces don't tend to be part of programming languages, but most database *libraries* do this. Again, though, since most languages don't offer extensible syntax, they have to use function calls or classes, like SQLAlchemy's text() wrapper.

Aaron - 2016-01-11 21:06:19

Mon, 11 Jan 2016 21:06:19 GMT

I'm sorry, I just re-read your post and saw that you covered this. Please ignore my last comment.

Aaron - 2016-01-11 21:04:16

Mon, 11 Jan 2016 21:04:16 GMT

I wondered about your thoughts on why integer subscripts are not provided for the various string views? For example, it would be handy to be able to write:

let char = "Hello!".characters[0]

It would be easy to implement:

subscript(index: Int) -> Character {
    let index = startIndex.advancedBy(index)
    return self[index]
}

I really like the String API, but I think what trips people up more than anything is not having simple integer subscript access to the elements, and having instead to go through the process of creating an Index and advancing it.

Given how easy it would be to implement, any ideas why it is not available? Is it a performance thing?
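For reference, here is a minimal sketch of such a subscript in current Swift (5.x) syntax, where advancedBy has become index(_:offsetBy:). The extension and its offset: label are illustrative assumptions, not part of the standard library. Each call walks the string from its start, so it is O(n) rather than the O(1) that subscript syntax usually suggests.

```swift
// Hypothetical convenience subscript; not part of the standard library.
extension String {
    /// Returns the Character at the given offset from the start.
    /// Walks from startIndex on every call, so this is O(n), not O(1).
    subscript(offset offset: Int) -> Character {
        return self[index(startIndex, offsetBy: offset)]
    }
}

let greeting = "Hello!"
let firstCharacter = greeting[offset: 0]
let lastCharacter = greeting[offset: greeting.count - 1]
print(firstCharacter, lastCharacter)
```

The explicit offset: label is a deliberate choice here: it keeps the call site from looking like a cheap array access.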

Keegan - 2015-12-18 10:53:20

Fri, 18 Dec 2015 10:53:20 GMT

What if you only need to deal with ASCII? Is there a Swift library that has less verbose string functionality?

Now that Swift is open source, many people are going to want to use it for things that don't require Unicode, and thus don't need the complexity of the built-in string type.

Chad - 2015-11-10 03:00:21

Tue, 10 Nov 2015 03:00:21 GMT

As a unicode nerd, I'm really happy that someone with deep domain knowledge finally developed a string API that matches reality. I like that Swift takes inspiration from functional programming languages where the type system exists to enforce correct usage of the API, not just to say what the bytes will be in memory.

On this aside:

(Incidentally, I think that representing all these different concepts as a single string type is a mistake. Human-readable text, file paths, SQL statements, and others are all conceptually different, and this should be represented as different types at the language level. I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.)

You should check out Haskell's newtype declarations and Scala's AnyVal value classes, which allow for this. They give you types that are basically zero-cost at runtime but that the compiler won't let you mix.
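The same idea can be sketched in Swift with thin wrapper structs. The type and function names below (FilePath, SQLStatement, describeExecution) are made up for illustration and written in current Swift (5.x) syntax; they are not existing library types.

```swift
// Hypothetical wrapper types: each holds a String but is a distinct type.
struct FilePath {
    let rawValue: String
}

struct SQLStatement {
    let rawValue: String
}

// Only accepts SQLStatement, so a FilePath or a plain String is rejected
// at compile time.
func describeExecution(of statement: SQLStatement) -> String {
    return "executing: \(statement.rawValue)"
}

let path = FilePath(rawValue: "/tmp/report.txt")
let query = SQLStatement(rawValue: "SELECT 1")

print(describeExecution(of: query))
// describeExecution(of: path)        // would not compile: wrong type
// describeExecution(of: "SELECT 1")  // would not compile: plain String
print(path.rawValue)
```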

Josh Ballanco - 2015-11-08 04:18:22

Sun, 08 Nov 2015 04:18:22 GMT

I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.

Another, more recent, language that does this is Julia with its "Non-Standard String Literals": http://docs.julialang.org/en/release-0.4/manual/metaprogramming/#man-non-standard-string-literals2 . It takes Python's r"foo.*bar" syntax for regexes and extends it in a user-definable way. For example, v"1.1.0" creates a VersionNumber.

mikeash - 2015-11-08 03:12:20

Sun, 08 Nov 2015 03:12:20 GMT

I don't know if Swift cares much about interning strings. It does care a lot about using them as dictionary keys, though, which has essentially the same performance requirements.

I think this approach would fit that well, simply because the implementation details are completely hidden, so String is free to use whatever internal representation makes the most sense for this. Contrast with NSString, where an internal representation using (say) UTF-32 would conflict badly with the API. Whatever internal representation makes the most sense for hashing, String could use it.
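A small sketch of what this means for dictionary keys, written in current Swift (5.x) syntax: because String equality and hashing are both based on canonical equivalence, two differently encoded spellings of the same text land on the same key. The variable names are illustrative.

```swift
let precomposed = "caf\u{E9}"     // é as a single code point, U+00E9
let decomposed = "cafe\u{301}"    // e followed by U+0301 COMBINING ACUTE ACCENT

var visits: [String: Int] = [:]
visits[precomposed] = 1
// Equal strings must hash equally, so this updates the existing entry.
visits[decomposed, default: 0] += 1

print(visits.count)
print(visits[precomposed] ?? 0)
```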

Anthony Bailey - 2015-11-07 15:21:23

Sat, 07 Nov 2015 15:21:23 GMT

This is one of those "nice clear expression of how I fuzzily thought things should be" blog posts that I really enjoy and appreciate. Thanks!

Does the approach fit nicely with another real-world concern - a time/space-efficient implementation of string interning? Or is that kind of optimization outside of Swift's intended domain?

Pierre Lebeaupin - 2015-11-07 10:43:39

Sat, 07 Nov 2015 10:43:39 GMT

re: "you can't always avoid it", I am trying to define a set of minimal primitives (with any other operation expressed as convenience methods over these primitives) for the string type to support all non-specialist text operations such that peeking at the components is never necessary:
* defining literal ASCII strings (typically for dictionary keys and debugging)
* reading and writing strings from byte arrays with a specified encoding
* printing the value of variables to a string, possibly under the control of a format and locale
* attempting to interpret the contents of a string as an integer or floating-point number, possibly under the control of a format and locale
* concatenating strings
* hashing strings (with an implementation of hashing that takes into account the fact strings that only vary in character composition are considered equal and so must have equal hashes)
* searching within a string with appropriate options (regular expression or not, case sensitive or not, anchored or not, etc.) and getting the first match (which may compare equal while not being the exact same Unicode sequence as the searched string), the part before that match, and the part after that match, or nothing if the search turned up empty
* comparing strings for equality and sorting with appropriate options (similar to that of searching, plus specific options such as numeric sort, that is "1" < "2" < "100")
* and for very specific purposes, a few text transformations: mostly convert to lowercase, convert to uppercase, and capitalize words.

Though I probably need to add string folding with the same options as comparing for equality (in order to use the folded strings as keys to a dictionary, for instance).

I have yet to hear of an ordinary text operation that cannot be factored as a combination of these primitives.
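As a rough illustration, several of these primitives map onto calls that exist today in Swift with Foundation. This is a sketch in current Swift (5.x) syntax, not a claim that these calls cover the whole list; the sample values are made up.

```swift
import Foundation

// Concatenating strings.
let joined = "foo" + "bar"

// Interpreting the contents of a string as a number (nil on failure).
let number = Int("42")
let notANumber = Int("forty-two")

// Comparing with options: numeric sort, so that "1" < "2" < "100".
let unsorted = ["100", "2", "1"]
let numericallySorted = unsorted.sorted {
    $0.compare($1, options: .numeric) == .orderedAscending
}

// Comparing for equality with options: case-insensitive.
let sameIgnoringCase =
    "HELLO".compare("hello", options: .caseInsensitive) == .orderedSame

print(joined, number as Any, notANumber as Any, numericallySorted, sameIgnoringCase)
```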

But what about text editing operations, typesetting, text rendering, full-text indexing, figuring word boundaries, word breaks, etc? Those are to be implemented by specialists who write libraries providing these services and ordinary programmers are to use the APIs of these libraries. Yes, even "translating" to pig latin is such a specialist operation, because you have to think first about what it means for Chinese text, for instance.

mikeash - 2015-11-07 02:03:32

Sat, 07 Nov 2015 02:03:32 GMT

araybould: Gotcha, and yes, I just meant it in the sense that allowing [integer] would be misleading, but not that leaving it out somehow explains the limitations.

Josh Bleecher Snyder: "It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes." And their very first example shows a string literal that does not contain valid UTF-8. Am I missing something here?

Josh Bleecher Snyder - 2015-11-07 01:56:12

Sat, 07 Nov 2015 01:56:12 GMT

In Go, strings are UTF8 sequences, not byte sequences, even though indexing into a string returns the nth byte. See https://blog.golang.org/strings.

Keith Thompson - 2015-11-07 00:48:09

Sat, 07 Nov 2015 00:48:09 GMT

In C, a string is a pointer to a sequence of non-zero bytes, terminated by a zero byte.

Not quite. In C, a string is by definition "a contiguous sequence of characters terminated by and including the first null character". Strings are manipulated using pointers, but the pointer itself is not the string; it's a pointer to a string.

Reference: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf 7.1.1p1

Eelco - 2015-11-07 00:38:12

Sat, 07 Nov 2015 00:38:12 GMT

Re: different string types, this is a great article (from almost a decade ago!) on how to achieve this in Haskell: http://blog.moertel.com/posts/2006-10-18-a-type-based-solution-to-the-strings-problem.html

ttilley - 2015-11-07 00:23:32

Sat, 07 Nov 2015 00:23:32 GMT

...and normal people can ignore the null check. I need it for my particular use case for unrelated reasons.

ttilley - 2015-11-07 00:05:21

Sat, 07 Nov 2015 00:05:21 GMT

You can use hidden functionality to convert UTF-16 arrays. Here is an example from my project (UChar is an alias pulled in from C; it's a single UTF-16 value):



internal func ucharCollectionToString<T:CollectionType where T.Generator.Element == UChar>(collection: T) -> String {
    let count = collection.underestimateCount()
    var sc = _StringCore.init()
    sc.reserveCapacity(count)
    for codeunit in collection {
        // terminate processing at NULL like C string behavior
        if codeunit != UChar(0) {
            sc.append(codeunit)
        } else { break }
    }
    return String(sc)
}

I'm pretty sure this is going to be much faster than transcoding before creating your string object, but it depends on an internal API.
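For comparison, later Swift releases offer a public route for the same job. The sketch below uses current Swift (5.x) syntax and the standard library's String(decoding:as:) initializer; the function name is made up, and UInt16 stands in for UChar as an assumption. No claim is made here about how its speed compares to the _StringCore approach.

```swift
// Hypothetical helper: builds a String from UTF-16 code units, stopping at
// the first NULL the way a C string would.
func string(fromNullTerminatedUTF16 codeUnits: [UInt16]) -> String {
    let beforeNull = codeUnits.prefix(while: { $0 != 0 })
    // Invalid UTF-16 (such as a lone surrogate) is repaired with U+FFFD
    // instead of causing a failure.
    return String(decoding: beforeNull, as: UTF16.self)
}

let units: [UInt16] = [0x0048, 0x0069, 0x0000, 0x0021]   // "Hi", NULL, "!"
let converted = string(fromNullTerminatedUTF16: units)
print(converted)
```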

araybould - 2015-11-06 23:18:41

Fri, 06 Nov 2015 23:18:41 GMT

@mikeash: My mistake - when I wrote the last reply I was still thinking that "why not make it easier, and allow indexing with an integer? It's essentially Swift's way of reinforcing the fact that this is an expensive operation" was intended to imply that programmers would deduce, from the absence of the operation, that it would be expensive. I see now that your point is that in the presence of the operation, many programmers would assume that it is efficient. I was wrongly thinking the contrapositive was implied, but programmers are not necessarily assuming anything from the operator's absence. I imagine some of them will go on to use advancedBy() inefficiently, but at least they are warned.

mikeash - 2015-11-06 22:14:31

Fri, 06 Nov 2015 22:14:31 GMT

Pierre Lebeaupin: Thanks for pointing out the mistake with the accent. I don't know what happened there. Maybe I had some temporary brain damage. (I can hear everybody now, "What do you mean, 'temporary'?")

And yes, you're right, avoid actually peeking at the components of the string whenever you can. Unfortunately you can't always avoid it, because the APIs aren't always there for you, but definitely look real hard first.

Note that with Twitter, you need to count code points but first you need to normalize the string, so that's a bit of an extra complication there. Unfortunately Swift doesn't provide anything for normalization at the moment, although you can use the NSString APIs for it.
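A minimal sketch of that normalization step, assuming Foundation is available and using current Swift (5.x) syntax. precomposedStringWithCanonicalMapping is the NSString property that returns the NFC form; whether NFC is the form a given service expects is something to check against that service's documentation.

```swift
import Foundation

let decomposed = "cafe\u{301}"    // e + U+0301 COMBINING ACUTE ACCENT

// NFC normalization through the bridged NSString API.
let normalized = decomposed.precomposedStringWithCanonicalMapping

// Count code points before and after normalizing.
let rawCount = decomposed.unicodeScalars.count
let normalizedCount = normalized.unicodeScalars.count
print(rawCount, normalizedCount)
```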

Matt - 2015-11-06 21:59:23

Fri, 06 Nov 2015 21:59:23 GMT

Thanks for this write up. I'm not a Swift programmer whatsoever and still found it useful and interesting.

Pierre Lebeaupin - 2015-11-06 21:35:45

Fri, 06 Nov 2015 21:35:45 GMT

First, there is a small mistake: the third Unicode code point of your example string in "problems" is not U+00B4 (ACUTE ACCENT), but U+0301 (COMBINING ACUTE ACCENT); among other things, we can see that it encodes to 0xCC 0x81 in UTF-8 (not to mention that this accent… combines, you know). Also, the last possible Unicode code point (so as to be encodable with two surrogates) is U+10FFFF, so you may need more than 20 bits (five nibbles). I therefore never represent "A" as 0x00041 or U+00041, but rather as 0x000041 (or 0x0041 if dealing with UTF-16, or 0x41 if dealing with UTF-8), because, granted, 0x00000041 or U+00000041 is a bit too long…

I argue (at http://wanderingcoder.net/2015/07/24/string-character-processing/ among others) that ordinary programmers need never care about the individual constituents of a string. It doesn't matter whether you consider a string to be a sequence of bytes, words, code points, or grapheme clusters: simply don't. Need to move the insertion point? Send the advanceInsertionPoint/moveBackInsertionPoint message to the text editing engine; it is going to worry about what that means, not you. A tweet? Serialize the string to a byte array with the UTF-32BE encoding (I haven't checked what the Twitter API takes, to be honest; adapt as appropriate) and divide the length of the byte array by 4; that will tell you whether you are under, over, or exactly at 140. Database column? Same, except the encoding is UTF-8, and again you check whether the byte array fits. Need to check whether the file name has extension "avi"? Do a case-insensitive, anchored, reverse, locale-independent search for ".avi" in the file name string. Etc.
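These three checks can be sketched in current Swift (5.x) with Foundation. The code point count is the same number as the UTF-32 byte length divided by 4, and the UTF-8 view gives the encoded byte length directly. The limits and sample strings are made up for illustration.

```swift
import Foundation

let message = "cafe\u{301}"
let fileName = "Holiday.AVI"

// Tweet-style limit: number of code points (UTF-32 byte length / 4).
let codePointCount = message.unicodeScalars.count
let fitsInTweet = codePointCount <= 140

// Database column limit: number of bytes once encoded as UTF-8.
let utf8ByteCount = message.utf8.count
let fitsInColumn = utf8ByteCount <= 255   // 255 is an assumed column size

// Extension check: case-insensitive search anchored at the end of the name.
let hasAVIExtension = fileName.range(
    of: ".avi",
    options: [.caseInsensitive, .anchored, .backwards]
) != nil

print(codePointCount, utf8ByteCount, fitsInTweet, fitsInColumn, hasAVIExtension)
```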

As a result, I completely agree with the Swift design of opaque indexes from string. Besides saving you from quadratic algorithms, it forces you to think about what it is you are actually doing to the string.

I am surprised that Swift still does not have a way to create a string by decoding a byte array in a specified encoding, and to create a byte array in a specified encoding from a string, as part of the type. In particular, these UTF-x views should only be thought of as ways to integrate with foreign APIs (and as a first step to decode from/encode to a byte array, since there is no direct way to do so), not means in and of themselves. Python 3 has the right idea (though I think it does not go far enough): there is no character type, merely very short strings when one does character-like processing, and the encodings (UTF-8, UTF-16, UTF-32) are only that: encodings that you can specify when decoding from/encoding to a byte array (for a file, network packet, etc.)
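Swift releases after this thread did add such an initializer to the standard library. A minimal sketch in current Swift (5.x) syntax, with made-up sample bytes:

```swift
// UTF-8 bytes for "café" with a precomposed é (0xC3 0xA9).
let bytes: [UInt8] = [0x63, 0x61, 0x66, 0xC3, 0xA9]

// Decode a byte array in a specified encoding.
let decoded = String(decoding: bytes, as: UTF8.self)

// Encode back to a byte array through the UTF-8 view.
let reencoded = Array(decoded.utf8)

// Invalid input does not fail; it is repaired with U+FFFD.
let repaired = String(decoding: [0xFF] as [UInt8], as: UTF8.self)

print(decoded, reencoded, repaired)
```

Note that this initializer never fails, so code that must reject malformed input needs a separate validation step.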

coldtea - 2015-11-06 20:42:22

Fri, 06 Nov 2015 20:42:22 GMT

> I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.)

Rebol does that.

mikeash - 2015-11-06 20:38:18

Fri, 06 Nov 2015 20:38:18 GMT

araybold: I'm not sure what you're referring to. Which implication is clear, and what "it" are people not getting?

araybold - 2015-11-06 20:25:09

Fri, 06 Nov 2015 20:25:09 GMT

@Jon: thanks for the links.

@mikeash: I am not sure if it is entirely consistent to say that the implication is clear, while also providing a detailed explanation here - does that not imply that you think some people are not, in fact, getting it?

Marc P. - 2015-11-06 18:57:01

Fri, 06 Nov 2015 18:57:01 GMT

It seems like the NSString bridging is helped along by a hidden String implementation _SwiftNativeNSStringBase, which bodes poorly for the imminent Linux port of Swift, since the absence of an accompanying Foundation port will leave the String implementation non-performant when it needs to work with real-world encodings like UTF-8 and UTF-16.

Thanks for the tip about testing with optimization; it does help out the string performance quite a bit. But I don't think that the calling of the user-supplied block accounts for most of the overhead. The work of the transcode() function can be performed manually, without any blocks, like so:



    func testUTF16StringConversion() {
        var str = ""

        measureBlock {
            var string = ""
            var utf16 = UTF16()
            var gen = utf16Array.generate()
            var done = false
            while !done {
                switch utf16.decode(&gen) {
                case .Result(let val): string.append(val)
                case .EmptyInput: done = true
                case .Error: fatalError("bad string")
                }
            }

            str = string
        }

        XCTAssertEqual(Array(str.utf16), utf16Array)
    }

This method gives about a 15% boost to the extension where transcode() is being invoked, but it is still a far cry from the internal optimization that the NSString conversion gives you.

OTOH, NSString's constructor almost certainly is not doing any validation of the byte array, so it's not an entirely fair comparison. But it would be nice to have a fast-track mechanism to create a String from a sequence of encoded bytes available in Swift.

mikeash - 2015-11-06 18:39:12

Fri, 06 Nov 2015 18:39:12 GMT

ARaybold: I think that if any programmer sees this, they will assume that it's O(1):

someVar1[someVar2]

On the other hand, they will not assume any particular performance for this:

someVar1.advancedBy(someVar2)

Yes, this doesn't tell you what the performance actually is, but it at least doesn't lead you to make assumptions. (Note that in Swift, the [] indexing operation is still fast. The cost is in getting the right index value to pass to it.)
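To make the cost concrete, here is a sketch in current Swift (5.x) syntax, where advancedBy is spelled index(_:offsetBy:). Both loops collect the same characters, but the first recomputes an index from the start on every pass while the second advances a single index step by step. The variable names are illustrative.

```swift
let text = "he\u{301}llo"   // 5 characters, 6 Unicode scalars

// Quadratic: every iteration walks from startIndex to the requested offset.
var viaOffsets: [Character] = []
for offset in 0..<text.count {
    let position = text.index(text.startIndex, offsetBy: offset)
    viaOffsets.append(text[position])
}

// Linear: one index is advanced a single step at a time.
var viaIndices: [Character] = []
var cursor = text.startIndex
while cursor != text.endIndex {
    viaIndices.append(text[cursor])
    cursor = text.index(after: cursor)
}

print(viaOffsets == viaIndices)
```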

As far as the documentation goes, command-click a Swift symbol in Xcode and then read through all the comment documentation in the Swift module. The same information is also available on http://swiftdoc.org, and from Apple at https://developer.apple.com/library/ios/documentation/Swift/Reference/Swift_ForwardIndexType_Protocol/index.html. The documentation for advancedBy explicitly states that the complexity is O(1) on a RandomAccessIndexType, and O(n) otherwise.

Jon - 2015-11-06 18:36:38

Fri, 06 Nov 2015 18:36:38 GMT

ARaybold: Most of the documentation for the Swift core language exists in the library "header" files. You can check them out, rendered in a nice, searchable website, at swiftdoc.org.

ARaybold - 2015-11-06 17:28:10

Fri, 06 Nov 2015 17:28:10 GMT

It may seem odd that I read this despite never having written a line of Swift code, and not likely to do so in the near future at least, but it is an interesting case of API design.

I think you make a good case, but this got my attention: "Why not make it easier, and allow indexing with an integer? It's essentially Swift's way of reinforcing the fact that this is an expensive operation."

This is a pretty indirect way of making the point. Perhaps the documentation for advancedBy() contains that warning? I went to the Apple Developer web site and searched around for a while, but not only did I not find a statement to this effect, I did not even find a concise reference document covering string functions, operators and methods. Maybe someone who has spent even a little more time with Swift than I have will have stumbled upon (and bookmarked) the sort of documentation that a programmer will need, but I am (for now) left with the impression that the underlying problem here is a failure to communicate.

Your article here also indirectly makes a good counter-example against the proposition that code can be adequately self-documenting, as it shows some real-world examples of where a function name cannot convey all the information you need to know to use it properly.

mikeash - 2015-11-06 16:43:12

Fri, 06 Nov 2015 16:43:12 GMT

Marc P: Good question. I think the transcode function is just too general to be very fast. Calling a user-supplied function for every code point is tough to optimize. Which is all the more reason the standard library ought to provide a direct initializer for String that takes UTF-8 and UTF-16.

Do make sure you have optimizations enabled when testing, though. Swift is still slower, but it speeds up by about a factor of 10. Also, it depends pretty heavily on the encoding and the data. UTF-16 is NSString's native encoding, so creating an NSString from a UTF-16 array is basically just a memcpy. Try with UTF-8 and a Unicode flag as the repeated Character and while Swift still loses the race, it "only" loses by a factor of 3-4 instead of a factor of a bazillion.

Anon - 2015-11-06 16:35:04

Fri, 06 Nov 2015 16:35:04 GMT

Marc P - 2015-11-06 16:04:18

Fri, 06 Nov 2015 16:04:18 GMT

Is there any way to approach the native NSString bridging performance in pure Swift? If I compare creating a string using your UTF-16 extensions with constructing an NSString from the bytes, the NSString construction is two orders of magnitude faster.

Here are the test cases I used:



// a very big UTF-16 array
let utf16Array = Array(String(count: 999999, repeatedValue: Character("X")).utf16)

class StringPerformanceTests: XCTestCase {

    func testMikeAshStringConversionPerformance() {
        var str = ""
        measureBlock {
            str = String(utf16: utf16Array)!
        }
        XCTAssertEqual(Array(str.utf16), utf16Array)
    }

    func testNSStringConversionPerformance() {
        var str = ""
        measureBlock {
            str = utf16Array.withUnsafeBufferPointer { ptr in
                NSString(characters: ptr.baseAddress, length: ptr.count) as String
            }
        }
        XCTAssertEqual(Array(str.utf16), utf16Array)
    }
}

mikeash - 2015-11-06 15:33:06

Fri, 06 Nov 2015 15:33:06 GMT

Gerard Guillemette: Yes, that is really nice. Even NSString doesn't do that by default.

stringA as NSString == stringB as NSString // false

You have to use one of the more specific comparison methods to make that say true.

calicoding: As I said, I think it's a mistake to have "path" be the same type as other kinds of text, so I think moving paths over to NSURL is a pretty good thing. Of course, we end up with a similar problem, where NSURL can contain arbitrary URLs with arbitrary schemes, but 99% of the framework methods that take an NSURL only accept file: URLs. Note that the path methods are still available if you use as NSString to get an explicit NSString first. They then return String, so it's a bit annoying, but the functionality is at least still there.

calicoding - 2015-11-06 15:15:30

Fri, 06 Nov 2015 15:15:30 GMT

I would also add that Swift's API did not bring over all the "path" methods of NSString, which seems to be a pain point for a lot of people. But here Apple is trying to reduce the scope of the problem and have people just use NSURL instead. This seems reasonable to me, and to be honest, more correct.

Ps. Great stuff as always

Gerard Guillemette - 2015-11-06 14:49:25

Fri, 06 Nov 2015 14:49:25 GMT

It was pointed out in "The Swift Apprentice" that Swift does the right thing with something like

let stringA = "café"
let stringB = "cafe\u{0301}"
let equal = stringA == stringB

In stringA the code points are 99-97-102-233, while in stringB they are 99-97-102-101-769. "equal" ends up being true despite the difference in representation of é.
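The same comparison can be checked directly. This sketch uses current Swift (5.x) syntax and writes é as an explicit escape so that the source file's own encoding cannot change the result.

```swift
let stringA = "caf\u{E9}"      // precomposed é, U+00E9
let stringB = "cafe\u{301}"    // e + U+0301 COMBINING ACUTE ACCENT

let equal = stringA == stringB

// The underlying code points differ even though the strings compare equal.
let scalarsA = stringA.unicodeScalars.map { $0.value }
let scalarsB = stringB.unicodeScalars.map { $0.value }

// Both strings contain the same number of Characters (grapheme clusters).
print(equal, scalarsA, scalarsB, stringA.count, stringB.count)
```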