Fancy types for CSV library

Posted on 2019-07-15 by Oleg Grenrus fancy-types

This blog post is a thought on a following question: Can we make cassava (= CSV) stuff a bit safer using fancy types? This is not a proposal to change cassava, only some ideas and a demonstration of "real world" fancy types (though STLC-implementation also count as real in my world ;). Also a bit of Generics.

#Prologue

This post is not only a Literate Haskell file , but also can be simply run directly with (given a very recent version of cabal-install),

cabal run --index-state=2019-07-15T07:00:36Z posts/2019-07-15-fancy-types-for-cassava.lhs

as we specify dependencies. We'll use fin and vec packages. I assume that data Nat = Z | S Nat and data Vec :: Nat -> Type -> Type are familiar to you.1

{- cabal:
build-depends:
  , base   ^>=4.10 || ^>=4.11
  , fin    ^>=0.1
  , vec    ^>=0.1.1.1
  , tagged ^>=0.8.6
  , text   ^>=1.2.3.0

ghc-options:        -Wall -pgmL markdown-unlit
build-tool-depends: markdown-unlit:markdown-unlit ^>=0.5.0
-}

Next a {-# LANGUAGE Dependent #-} collection of extensions...

{-# LANGUAGE EmptyCase #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE PartialTypeSignatures #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE StandaloneDeriving #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE UndecidableInstances #-}
{-# OPTIONS_GHC -Wall -Wno-partial-type-signatures -Wno-unused-imports #-}

... and imports

module Main where

import Control.Monad (forM)
import Data.Bifunctor (first, bimap)
import Data.Kind (Type)
import Data.Tagged (Tagged (..))
import Data.Text (Text)
import Data.Type.Equality ((:~:) (..))
import Data.Fin (Fin (..))
import Data.Type.Nat (Nat (..), SNatI)
import Data.Vec.Lazy (Vec (..))
import GHC.Generics
import Text.Read (readMaybe)

import qualified Data.Fin as F
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Type.Nat as N
import qualified Data.Vec.Lazy as V

#Encoding

Encoding is often easier than decoding, so let's start with it. Our running example will be programming languages:

data PL = PL
    { plName   :: Text
    , plYear   :: Int
    , plPerson :: Text
    }
  deriving (Eq, Ord, Show, Generic)

We can create a small database of programming languages we have heard of:

pls :: [PL]
pls =
    [ PL "Haskell" 1990 "Simon"
    , PL "Scala"   2004 "Martin"
    , PL "Idris"   2009 "Edwin"
    , PL "Perl"    1987 "Larry"
    ]

#ToField

For encoding, we'll need to be able to encode individual fields / cells. This is similar to what we have in cassava now. I'm for using type-classses for such cases, because I want to be able to leverage Generics, we'll see that soon.

Thanks to the fancier types, it would be possible to avoid type-classes, still getting something for free, but that's a topic for another post.

class    ToField a    where toField :: a -> Text
instance ToField Text where toField = id
instance ToField Int  where toField = T.pack . show

#ToRecord

As we can serialise individual fields, let us serialise records. We could have a method toRecord :: r -> [Text] as in cassava now, but it is potentially unsafe. Length of the list may vary depending on the record value. So we'd rather use fancy types!

-- field count in a record: its size.
type family Size (a :: Type) :: Nat

class ToRecord r where
    toRecord :: r -> Vec (Size r) Text

It's easy to imagine the ToRecord PL instance. But we rather write a generic implementation for it.

First a generic Size. Recall that GHC.Generics represents records as a binary tree of :*:. To avoid using Plus, i.e. adding up Nats and concatenating Vecs, we'll foldr like implementation. There is a nested application of GSizeF type family in the :*:-case. GHC wants UndecidableInstances.

type GSize a = GSizeF (Rep a) 'Z -- start from zero

type family GSizeF (f :: Type -> Type) (acc :: Nat) :: Nat where
    GSizeF U1         acc = acc
    GSizeF (K1 i a)   acc = 'S acc
    GSizeF (M1 i c f) acc = GSizeF f acc
    GSizeF (f :*: g)  acc = GSizeF f (GSizeF g acc)

Using that type family, we can say succintly define:

type instance Size PL = GSize PL

Also we can check this dependent-language style. If this test fails, it will be a compilation error. This definitely blurs the distiction between tests and types :)

check1 :: Size PL :~: N.Nat3
check1 = Refl

Using similar induction on the structure, we can write a generic implementation for ToRecord. Different clauses are handled by different instances of a workhorse class GToRecord.

genericToRecord
    :: forall r. (Generic r, GToRecord (Rep r))
    => r -> Vec (GSize r) Text
genericToRecord = gtoRecord VNil . from

class GToRecord rep where
    gtoRecord :: Vec acc Text -> rep () -> Vec (GSizeF rep acc) Text
instance GToRecord U1 where
    gtoRecord xs _ = xs
instance ToField c => GToRecord (K1 i c) where
    gtoRecord xs (K1 c) = toField c ::: xs
instance GToRecord f => GToRecord (M1 i c f) where
    gtoRecord xs (M1 f) = gtoRecord xs f
instance (GToRecord f, GToRecord g) => GToRecord (f :*: g) where
    gtoRecord xs (f :*: g) = gtoRecord (gtoRecord xs g) f

The ToRecord PL instance in the user code is a one-liner:

instance ToRecord PL where toRecord = genericToRecord

#Column names

One more thing: column names. Usually CSV files start with a header line. This is where using Size pays off again: Header have to be the same size as content rows. Here I use Tagged to avoid AllowAmbiguousTypes and Proxy r extra argument.

class Header r where
    header :: Tagged r (Vec (Size r) Text)

We could write generic implementation for it, but as dealing with metadata in GHC.Generics is not pretty, I'll implement PL instance manually:

instance Header PL where
    header = Tagged $ "name" ::: "year" ::: "person" ::: VNil

#Encoding function

The one piece left is an actual encode function. I cut corners by not dealing with escaping for the sake of brevity.

You should notice, that in the implementation of encode we don't care that much about the fact we get Vecs of the same length for each record. encode implementation would work, even with class ToRecord' r where toRecord' :: r -> [Text]. Fancy types are here to help users of a library write correct (by construction) instances.

encode :: forall r. (Header r, ToRecord r) => [r] -> Text
encode rs = T.unlines $ map (T.intercalate "," . V.toList)
    $ unTagged (header :: Tagged r _)
    : map toRecord rs

And it works:

*Main> T.putStr $ encode pls
name,year,person
Haskell,1990,Simon
Scala,2004,Martin
Idris,2009,Edwin
Perl,1987,Larry

Good. Next we'll write an inverse.

#Decoding

Other direction, decoding is trickier. Everything could fail. Fields can contain garbage, there might be not enough fields (too much is not such a problem), but most importantly the fields can come in wrong order. Luckily we have fancy types helping us.

#FromField

Like ToField, FromField is a copy of cassava class:

type Error = String

class    FromField a    where fromField :: Text -> Either String a
instance FromField Text where fromField = Right
instance FromField Int  where
    fromField t
        = maybe (Left $ "Invalid Int: " ++ show t) Right
        $ readMaybe $ T.unpack t

#FromRecord

Also like ToRecord, FromRecord is simple class as well. Note how library users need to deal only with vector of the right size (versus list of any length). We'll also assume that vector is sorted, to match header columns (which could be encoded with fancier types!)

class FromRecord r where
    fromRecord :: Vec (Size r) Text -> Either Error r

Generic implementation "peels off" provided vector:

genericFromRecord
    :: forall r. (Generic r, GFromRecord (Rep r))
    => Vec (GSize r) Text -> Either String r
genericFromRecord ts =
    let tmp :: Either Error (Rep r (), Vec 'Z Text)
        tmp = gfromRecord ts
    in to . fst <$> tmp

class GFromRecord rep where
    gfromRecord :: Vec (GSizeF rep acc) Text -> Either Error (rep (), Vec acc Text)
instance GFromRecord U1 where
    gfromRecord xs = return (U1, xs)
instance FromField c => GFromRecord (K1 i c) where
    gfromRecord (x ::: xs) = do
        y <- fromField x
        return (K1 y, xs)
instance GFromRecord f => GFromRecord (M1 i c f) where
    gfromRecord = fmap (first M1) . gfromRecord
instance (GFromRecord f, GFromRecord g) => GFromRecord (f :*: g) where
    gfromRecord xs = do
        (f, xs')  <- gfromRecord xs
        (g, xs'') <- gfromRecord xs'
        return (f :*: g, xs'')

instance FromRecord PL where
    fromRecord = genericFromRecord

And a small sanity check:

*Main> fromRecord ("Python" ::: "1990" ::: "Guido" ::: VNil) :: Either String PL
Right (PL {plName = "Python", plYear = 1990, plPerson = "Guido"})

*Main> fromRecord ("Lambda Calculus" ::: "in the 1930s" ::: "Alonzo" ::: VNil) :: Either String PL
Left "Invalid Int: \"in the 1930s\""

#The difficult part

We are now solved all the easy problems. We have set up the public api of library. To* and From* classes use fancy types, we have taken some burden from library users. However the difficult task is still undone: implementing decode.

The example we'll want to work will have extra fields, and the fields shuffled:

input :: Text
input = T.unlines
    [ "year,name,types,person,website"
    , "1987,Perl,no,Larry,https://www.perl.org/"
    , "1990,Haskell,nice,Simon,https://www.haskell.org/"
    , "2004,Scala,weird,Martin,https://www.scala-lang.org/"
    , "2009,Idris,fancy,Edwin,https://www.idris-lang.org/"
    ]

which is

*Main> T.putStr input
year,name,types,person,website
1987,Perl,no,Larry,https://www.perl.org/
1990,Haskell,nice,Simon,https://www.haskell.org/
2004,Scala,weird,Martin,https://www.scala-lang.org/
2009,Idris,fancy,Edwin,https://www.idris-lang.org/

There's still enough information, we should be able to successfully extract PL.

Zeroth step is to split input into lines, and lines into cells, and extract the header row. That's not the hard part:

prepare :: Text -> Either Error ([Text], [[Text]])
prepare i = case map (T.splitOn ",") (T.lines i) of
    []     -> Left "No header"
    (r:rs) -> Right (r, rs)

#Sorting columns

The hard part is to decode from [Text] into Vec (Size r) Text. And not only we need to decode, but also sort the columns. Our plan would be to

  1. sort the header row, returning the trace of a sort
  2. use that trace to sort content rows similarly.

We'll require that content rows contain at least as many columns as header row. It's reasonable requirement, and simplifies things a bit. More relaxed requirement would be to require only as much rows as needed, e.g. in our example we could require only four fields, as we aren't interested in the fifth website field.

What's the trace of a sort? Technically it's permutation. However in this case, it's not regular permutation, as we aren't interested in all fields. It's easier to think backwards, and think which kind of trace would determine the execution in the step 2. We'll be given a Vec n Text for some n, and we'll need to produce a Vec m Text for some other m (= Size r). Let's try to write that as a data type:

data Extract :: Nat -> Nat -> Type where
    Step :: Fin ('S n)             -- take a nth value, x
         -> Extract n m            -- recursively extract rest, xs
         -> Extract ('S n) ('S m)  -- cons x xs
    Done :: Extract n 'Z           -- or we are done.

deriving instance Show (Extract n m)

In retrospect, that type is a combination of less-than-or-equal-to and is a permutation (inductively defined) predicates.2

We can (should!) immediately try this type in action. For what it's worth, the implementations of following functions is quite restricted by their types. There are not much places where you can make a mistake. To be fair, extract and Extract were written simultaneously: extract is structurally recusive in the Extract argument, and Extract has just enough data for extract to make choices.

extract :: Extract n m -> Vec n a -> Vec m a
extract Done       _  = VNil
extract (Step n e) xs = case delete n xs of
    (x, xs') -> x ::: extract e xs'

-- this probably should be in the `vec` library
delete :: Fin ('S n) -> Vec ('S n) a -> (a, Vec n a)
delete FZ           (x ::: xs)       = (x, xs)
delete (FS FZ)      (x ::: y ::: xs) = (y, x ::: xs)
delete (FS n@FS {}) (x ::: xs)       = case delete n  xs of
    (y, ys) -> (y, x ::: ys)

For example, given a row and a trace, we can extract fields we want (writing correct trace by hand is tricky).

*Main> row = "1987" ::: "Perl" ::: "no" ::: "Larry" ::: "https://www.perl.org/" ::: VNil
*Main> trc = Step 1 $ Step F.fin0 $ Step F.fin1 Done :: Extract N.Nat5 N.Nat3
*Main> extract trc row
"Perl" ::: "1987" ::: "Larry" ::: VNil

That starts to feel like magic, doesn't it? To complete the whole spell, we need to complete part 1, i.e. construct Extract traces. Luckily, types are there to guide us:

columns
    :: (Eq a, Show a)
    => Vec m a         -- ^ wanted header values
    -> Vec n a         -- ^ given header values
    -> Either Error (Extract n m)
columns VNil       _            = Right Done
columns (_ ::: _)  VNil         = Left "not enought header values"
columns (h ::: hs) xs@(_ ::: _) = do
    (n, xs') <- find' h xs  -- find first value
    rest <- columns hs xs'  -- recurse
    return $ Step n rest    -- record the trace

where we use a helper function find', which finds a value in the Vec ('S n) and returns not only an index, but also a leftover vector. We could write a test:

  • if Right (n, ys) = find' x xs
  • then ys = delete n xs
find'
    :: (Eq a, Show a)
    => a
    -> Vec ('S n) a
    -> Either Error (Fin ('S n), Vec n a)
find' x (y ::: ys)
    | x == y    = Right (FZ, ys)
    | otherwise = case ys of
        VNil    -> Left $ "Cannot find header value " ++ show x
        _ ::: _ -> do
            (n, zs) <- find' x ys
            return (FS n, y ::: zs)

Let's try columns. It takes some time to understand to interpret Extract values. Luckily the machine is there to do that.

*Main> columns ("name" ::: "year" ::: VNil) ("name" ::: "year" ::: VNil)
Right (Step 0 (Step 0 Done))

*Main> columns ("name" ::: "year" ::: VNil) ("year" ::: "name" ::: VNil)
Right (Step 1 (Step 0 Done))

*Main> columns ("name" ::: "year" ::: VNil) ("year" ::: "extra" ::: "name" ::: VNil)
Right (Step 2 (Step 0 Done))

*Main> columns ("name" ::: "year" ::: VNil) ("year" ::: "extra" ::: "foo" ::: VNil)
Left "Cannot find header value \"name\""

*Main> columns ("name" ::: "year" ::: VNil) ("name" ::: VNil)
Left "not enought header values"

#Assembling all parts together

We have three steps

  1. prepare to split input data into header and content rows
  2. columns which checks whether there are all fields we want in the provided header, and returns an Extract value saying how to permute content rows.
  3. extract which uses an Extract to extract (and order) correct data columns.

We'll use few two functions from vec: reifyList and fromListPrefix.

-- Reify any list [a] to Vec n a.
reifyList :: [a] -> (forall n. SNat n => Vec n a -> r) -> r

-- Convert list [a] to Vec n a. Returns Nothing if input list is too short.
fromListPrefix :: SNatI n => [a] -> Maybe (Vec n a)

They both convert a list [a] into Vec n a, however they are different

  • reifyList works for any list. As we don't know the length of dynamic inputs, reifyList takes a continuation which accepts Vec of any length. That continuation however would know and be able to use the vector length.
  • fromListPrefix tries to convert a list to a vector of known length, and thus may fail.

To put it differently, using reifyList we learn the length of the header, and then we require that subsequent content rows are of atleast the same length. Lifting (or promoting) some information to the type level, reduces the amount of dynamic checks we'll need to consequtively e.g. extract doesn't perform any checks.

decode :: forall r. (Header r, FromRecord r) => Text -> Either String [r]
decode contents = do
    (hs,xss) <- prepare contents
    V.reifyList hs $ \hs' -> do
        trc <- columns (unTagged (header :: Tagged r _)) hs'
        forM xss $ \xs -> do
            xs' <- maybe (Left "not enough columns") Right
                $ V.fromListPrefix xs
            fromRecord (extract trc xs')

All done! To convince you that it works, let's run decode on an input we defined at the beginning of this section.

main :: IO ()
main = case decode input :: Either String [PL] of
    Left err -> putStrLn $ "ERROR: " ++ err
    Right xs -> mapM_ print xs
*Main> main
PL {plName = "Perl", plYear = 1987, plPerson = "Larry"}
PL {plName = "Haskell", plYear = 1990, plPerson = "Simon"}
PL {plName = "Scala", plYear = 2004, plPerson = "Martin"}
PL {plName = "Idris", plYear = 2009, plPerson = "Edwin"}

#An exercise

This is a challenging exercise. Improve decode to deal with incomplete data like:

input2 :: Text
input2 = T.unlines
    [ "year,name,types,person,website"
    , "1987,Perl,no,Larry"
    , "1990,Haskell,nice,Simon,https://www.haskell.org/"
    ]

Note, the first content row has only four fields so original decode errors with

*Main> decode input2 :: Either String [PL]
Left "not enough columns"

The goal is to make decode succeed:

*Main> mapM_ (mapM print) (decode input2 :: Either String [PL])
PL {plName = "Perl", plYear = 1987, plPerson = "Larry"}
PL {plName = "Haskell", plYear = 1990, plPerson = "Simon"}

There are at least two way to solve this. A trickier one, for which there are two hints in footnotes: first3 and second4. And a lot simpler way, which "cheats" a little.

#Even fancier types

There is still a lot places where we can make mistakes. We use Vec n a, so we have n elements to pick. If we instead use heterogenous lists, e.g. NP from sop-core The types would become more precise. We could change our public interface to:

type family Fields r :: [Type]
class ToRecord' r where
    toRecord' :: r -> NP I (Fields r)
class Header' r where
    header' :: Tagged r (NP (K Text) (Fields r))

then writing correct versions of delete, extract etc will be even more type-directed. That's is left as an exericise, I suspect that the code shape will be quite the same.

#Row types

One valid question to ask, is whether row-types would simplify something here. Not really.

For example vinyl's Rec type is essentially the same as NP. Even if there were anonymous records in Haskell, so toRecord could be implemented directly using a built-in function, it would remove only a single problem from many. At it's not much, as toRecord is generically derivable.

#Conclusion

In this post I described a complete fancy types usage example, helping us to deal with the untyped real world. Fancy types make library API more precise: we encode (pre/post)conditions like "lists are of the equal length" in the types.

Also we have seen a domain specific"inductive predicate: Extract. It's a library internal, implementation-detail type. Even in "normal" Haskell, not all types (need to) end up into the library's public interface.

The vector example is the hello world of dependent types, but here it prevents users from making silly errors, and also make the implementation of a library more robust.


  1. If they aren't, read through e.g. Stitch paper and / or watch a video of a talk Richard Eisenberg presented at ZuriHac '19 (which i saw live, hopefully it will be posted somewhere soon). or older version from NYC Haskell Group meetup (which I didn't watch, only googled for). The definitions in fin and vec are as follows:

    data Nat = Z | S Nat
    data Fin :: Nat -> Type where
        FZ :: Fin ('S n)
        FS :: Fin n -> Fin ('S n)
    data Vec :: Nat -> Type -> Type where
        VNil  :: Vec 'Z a
        (:::) :: a -> Vec n a -> Vec ('S n) a
  2. compare Extract with LEProof and Permutation

    -- | An evidence of \(n \le m\). /zero+succ/ definition.
    data LEProof :: Nat -> Nat -> Type where
        LEZero :: LEProof 'Z m
        LESucc :: LEProof n m -> LEProof ('S n) ('S m)
    
    -- | Permutation. 'PCons' can be interpretted in two ways:
    -- * uncons head part, insert in a given position in permutted tail
    -- * delete from given position, cons to permutted tail.
    data Permutation :: Nat -> Type where
        PNil  :: Permutation 'Z
        PCons :: Fin ('S n) -> Permutation n -> Permutation ('S n)
  3. First hint implement a function like:

    minimise
        :: SNatI n
        => Extract n m
        -> (forall p. SNatI p => Extract p m -> r)
        -> r
    -- conservative implementation, not minimising at all
    minimise e k = k e

    and use it in decode.

  4. Second hint My variant of minimise uses LEProof and few auxiliary functions:

    minimise
        :: Extract n m
        -> (forall p. N.SNatI p => LEProof p n -> Extract p m -> r)
        -> r
    
    minimiseFin
        :: Fin ('S n)
        -> (forall p. N.SNatI p => LEProof p n -> Fin ('S p) -> r)
        -> r
    
    maxLE
        :: LEProof n p -> LEProof m p
        -> Either (LEProof n m) (LEProof m n)
    
    weakenFin :: LEProof n m -> Fin ('S n) -> Fin ('S m)
    weakenExtract :: LEProof n m -> Extract n p -> Extract m p
Site proudly generated by Hakyll