Strings
String type
By default, Haskell strings are lists of characters:
code:string.hs
type String = [Char]
This definition is quite convenient for implementing basic text processing functions, as one can reuse the rich libraries for lists.
But the list representation is quite inefficient for dealing with large amounts of text.
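As a small illustration of reusing list functions on String, here is a minimal sketch. The helper names `shout`, `trim` and `wordCount` are hypothetical and only serve as examples; everything else comes from base.

```haskell
import Data.Char (isSpace, toUpper)
import Data.List (dropWhileEnd)

-- Hypothetical helpers, not part of any library.

-- Upper-case a string with the ordinary list 'map'.
shout :: String -> String
shout = map toUpper

-- Trim leading and trailing whitespace using list functions only.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Count words by composing 'words' and 'length' from the Prelude.
wordCount :: String -> Int
wordCount = length . words
```

Each of these walks a linked list of boxed characters, which is where the overhead for large inputs comes from.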
Text
The text package offers
code:text.hs
data Text -- abstract
a packed representation of (Unicode) text.
Much less memory overhead than a String
Versions before text-2.0 use UTF-16 internally, so 2 bytes per character (4 for characters outside the Basic Multilingual Plane); text-2.0 and later use UTF-8
Performance characteristics more like functional arrays
Commonly used operations
code:text-interface.hs
cons :: Char -> Text -> Text -- O(n)
(<>) :: Text -> Text -> Text -- O(n)
length :: Text -> Int -- O(n)
map :: (Char -> Char) -> Text -> Text -- O(n)
filter :: (Char -> Bool) -> Text -> Text -- O(n)
foldr :: (Char -> a -> a) -> a -> Text -> a -- O(n)
toUpper :: Text -> Text -- O(n)
strip :: Text -> Text -- O(n)
lines :: Text -> [Text] -- O(n)
head :: Text -> Char -- O(1)
last :: Text -> Char -- O(1)
index :: Text -> Int -> Char -- O(n)
Text makes use of stream fusion internally
Some consecutive traversals, such as multiple maps followed by a fold, will be fused together, so that only a single traversal is required.
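The operations above are normally used through a qualified import, since many names clash with the Prelude. The following is a minimal sketch; `normalize`, `letterCount` and `lineCount` are hypothetical helpers. Whether the filter, map and fold in `letterCount` are actually fused into one traversal depends on the library version and compiler optimisations; the code is written the same way in either case.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Char (isAlpha, toUpper)
import qualified Data.Text as T

-- Hypothetical helper: strip surrounding whitespace, then upper-case.
normalize :: T.Text -> T.Text
normalize = T.toUpper . T.strip

-- Hypothetical helper: a filter, a map and a fold written as one pipeline.
-- It counts the alphabetic characters in the input.
letterCount :: T.Text -> Int
letterCount = T.foldr (\_ n -> n + 1) 0 . T.map toUpper . T.filter isAlpha

-- Hypothetical helper: number of lines in a text.
lineCount :: T.Text -> Int
lineCount = length . T.lines
```

With OverloadedStrings enabled, string literals such as `"  hello \n"` can be passed directly where a Text is expected.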
Lazy text
The text package also offers a module Data.Text.Lazy, which again provides an abstract type
code:lazytext.hs
data Text -- abstract
The internal representation is a linked list of chunks, which are strict text values
This suits streaming, because the entire text does not have to be present in memory at once
code:lazytext.hs
toStrict :: Lazy.Text -> Text
fromStrict :: Text -> Lazy.Text
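To make the chunk structure concrete, here is a minimal sketch that converts between the strict and lazy variants. The helper names are hypothetical; `fromChunks` and `toChunks` are functions from Data.Text.Lazy that are not listed above.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Hypothetical helper: assemble a lazy text from strict chunks.
fromPieces :: [T.Text] -> TL.Text
fromPieces = TL.fromChunks

-- Hypothetical helper: strict -> lazy -> strict gives back the input.
roundTrip :: T.Text -> T.Text
roundTrip = TL.toStrict . TL.fromStrict

-- Hypothetical helper: number of strict chunks inside a lazy text.
chunkCount :: TL.Text -> Int
chunkCount = length . TL.toChunks
```

Note that `toStrict` has to copy all chunks into one contiguous value, so it forces the whole text into memory.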
Building Text
Appending text values is not very efficient, because each append copies both arguments
However, the building and inspecting phases are often separate
Then it's useful to build text using Builder, and convert it to text after building is complete
Intuitively, a builder simply allocates a buffer and fills it with incoming data.
Commonly used operations in Builder
code:interface.hs
data Builder -- abstract
toLazyText :: Builder -> Lazy.Text -- O(n)
fromText :: Text -> Builder -- O(1)
fromString :: String -> Builder -- O(1)
fromLazyText :: Lazy.Text -> Builder -- O(1)
(<>) :: Builder -> Builder -> Builder -- O(1)
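A typical use is to render many small pieces and convert once at the end. The sketch below assumes a hypothetical `renderPairs` helper that writes key/value pairs as `key=value` lines; `singleton` is a function from Data.Text.Lazy.Builder that is not listed above.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as B

-- Hypothetical helper: render key/value pairs as "key=value\n" lines.
-- All appends happen on Builder values; the text is materialised once.
renderPairs :: [(T.Text, T.Text)] -> TL.Text
renderPairs = B.toLazyText . foldMap renderPair
  where
    renderPair (k, v) =
      B.fromText k <> B.singleton '=' <> B.fromText v <> B.singleton '\n'
```

Because Builder is a Monoid, `foldMap` combines the per-pair builders without creating intermediate text values.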
ByteString
Available in the bytestring package.
code:bytestring.hs
data ByteString -- abstract
A bytestring is a packed representation of a sequence of bytes.
Much less memory overhead than a String
Even less overhead than Text
No interpretation of the characters (no encoding).
Only suitable for ASCII-formatted or (better) binary data.
Like Text, ByteString comes with a strict and lazy variant
Like Text, the ByteString types are instances of IsString
Like Text, ByteString has a Builder type.
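The three points above can be sketched as follows. The names `greeting`, `greetingBytes`, `asLazy` and `withBang` are hypothetical; the library functions used are from Data.ByteString, Data.ByteString.Lazy and Data.ByteString.Builder.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as BS
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- IsString instance: an ASCII literal becomes a strict ByteString.
greeting :: BS.ByteString
greeting = "hello"

-- The contents are plain bytes (Word8), not characters.
greetingBytes :: [Word8]
greetingBytes = BS.unpack greeting

-- Strict and lazy variants convert into each other.
asLazy :: BL.ByteString
asLazy = BL.fromStrict greeting

-- Builder: append a single byte (33 is ASCII '!') and materialise once.
withBang :: BL.ByteString
withBang = BB.toLazyByteString (BB.byteString greeting <> BB.word8 33)
```

The literal only behaves as expected because it is pure ASCII; non-ASCII characters in such a literal would be truncated to single bytes.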
Conversion between Text and ByteString
As ByteString consists of pure bytes, but Text is interpreted, we need an encoding in order to translate between the two
From Data.Text.Encoding:
code:encoding.hs
encodeUtf8 :: Text -> ByteString
decodeUtf8 :: ByteString -> Text -- partial!!
From Data.ByteString.Char8:
code:encoding.hs
pack :: String -> ByteString
unpack :: ByteString -> String
Warning: These only work correctly on the ASCII subset of characters (pack truncates every Char to 8 bits), so they should be used with extreme care.
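The two pitfalls mentioned here, the partial `decodeUtf8` and the truncating `pack`, can be illustrated with a minimal sketch. The helpers `safeDecode` and `truncated` are hypothetical; `decodeUtf8'` is a total variant from Data.Text.Encoding that returns an Either instead of throwing.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)
import Data.Word (Word8)

-- Hypothetical helper: decode UTF-8 without risking an exception.
safeDecode :: BS.ByteString -> Maybe T.Text
safeDecode bytes = case decodeUtf8' bytes of
  Left _  -> Nothing   -- invalid UTF-8 input
  Right t -> Just t

-- Hypothetical helper: the bytes that Char8.pack produces for a String.
-- Characters above code point 255 lose their upper bits.
truncated :: String -> [Word8]
truncated = BS.unpack . BC.pack
```

For example, the Greek letter lambda (code point 955) is packed by Char8 as the single byte 187, whereas `encodeUtf8` produces a proper multi-byte encoding.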
Useful links