\documentclass{beamer}
\usepackage{amsmath}
\usepackage{listings}
\usepackage{stmaryrd}

\title{Real World Haskell:\\
  Lecture 7}
\author{Bryan O'Sullivan}
\date{2009-12-09}

\begin{document}
\lstset{language=Haskell}

\frame{\titlepage}

\begin{frame}
  \frametitle{Getting things done}
  
  It's great to dwell so much on purity, but we'd like to maybe use
  Haskell for practical programming some time.

  \vskip8pt
  This leaves us concerned with talking to the outside world.
\end{frame}

\begin{frame}
  \frametitle{Word count}
  
\lstinputlisting{7/WC.hs}
\end{frame}

\begin{frame}[fragile]
  \frametitle{New notation!}
  
  There was a lot to digest there. Let's run through it all, from top
  to bottom.

  \vskip16pt
\begin{lstlisting}
import System.Environment (getArgs)
\end{lstlisting}

  \vskip8pt

  ``Import \emph{only} the thing named \lstinline{getArgs} from
  \lstinline{System.Environment}.''

  \vskip8pt

  Without an explicit (comma-separated) list of names to import,
  \emph{everything} that a module exports is imported into this one.

  \vskip8pt
\end{frame}
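
\begin{frame}[fragile]
  \frametitle{Import forms side by side}

  A small sketch of both import forms in one file.  The choice of
  \lstinline{getProgName} and \lstinline{Data.Char} is ours, purely
  for illustration.

  \vskip8pt
\begin{lstlisting}
-- Import only the two names we list.
import System.Environment (getArgs, getProgName)
-- No list: everything Data.Char exports is in scope.
import Data.Char

main :: IO ()
main = do
  prog <- getProgName
  args <- getArgs
  putStrLn (map toUpper prog ++ ": " ++ show (length args))
\end{lstlisting}

  \vskip8pt Here \lstinline{toUpper} comes along with the rest of
  \lstinline{Data.Char}, without being named in an import list.
\end{frame}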

\begin{frame}[fragile]
  \frametitle{The do block}
  
  Notice that this function's body starts with the keyword
  \lstinline{do}:

  \vskip8pt

\begin{lstlisting}
countWords path = do
  ...
\end{lstlisting}

  \vskip8pt

  That keyword introduces a series of \alert{actions}.  Each action is
  somewhat similar to a statement in C or Python.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Executing an action and using its result}

  The first line of our function's body:

  \vskip8pt

\begin{lstlisting}
countWords path = do
  content <- readFile path
\end{lstlisting}

  \vskip8pt

  This performs the action ``\lstinline{readFile path}'', and binds
  the result to the name ``content''.

  \vskip8pt

  The special notation ``\lstinline{<-}'' makes it clear that we are
  executing an action, i.e.\ \emph{not} applying a pure function.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Applying a pure function}
  
  We can use the \lstinline{let} keyword inside a \lstinline{do}
  block to name the result of a pure expression.  Unlike an ordinary
  \lstinline{let}, the code that follows does \emph{not} need to start
  with an \lstinline{in} keyword.

  \vskip8pt
\begin{lstlisting}
  let numWords = length (words content)
  putStrLn (show numWords ++ "  " ++ path)
\end{lstlisting}

  \vskip8pt
  With both \lstinline{let} and \lstinline{<-}, the result is
  immutable as usual, and stays in scope until the end of the
  \lstinline{do} block.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Executing an action}
  
  This line executes an action, and ignores its return value:

  \vskip8pt
\begin{lstlisting}
  putStrLn (show numWords ++ "  " ++ path)
\end{lstlisting}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Compare and contrast}
  
  Wonder how different imperative programming in Haskell is from other
  languages?

  \vskip16pt

\begin{lstlisting}[language=python]
def count_words(path):
    content = open(path).read()
    num_words = len(content.split())
    print repr(num_words) + "  " + path
\end{lstlisting}

  \vskip16pt
\begin{lstlisting}
countWords path = do
    content <- readFile path
    let numWords = length (words content)
    putStrLn (show numWords ++ "  " ++ path)
\end{lstlisting}
\end{frame}

\begin{frame}
  \frametitle{A few handy rules}
  
  When you want to introduce a new name inside a \lstinline{do} block:
  \begin{itemize}
  \item Use \lstinline{name <- action} to perform an action and keep
    its result.
  \item Use \lstinline{let name = expression} to evaluate a pure
    expression, and omit the \lstinline{in}.
  \end{itemize}
\end{frame}
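
\begin{frame}[fragile]
  \frametitle{The handy rules in action}

  Both rules can be seen in one tiny program.  This is only a sketch:
  the names \lstinline{line} and \lstinline{shouted} are made up for
  the example.

  \vskip8pt
\begin{lstlisting}
import Data.Char (toUpper)

main :: IO ()
main = do
  line <- getLine                 -- action: keep its result
  let shouted = map toUpper line  -- pure: no "in" needed
  putStrLn shouted                -- action: result ignored
\end{lstlisting}
\end{frame}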

\begin{frame}[fragile]
  \frametitle{More adventures with \texttt{ghci}}
  
  If we load our source file into \texttt{ghci}, we get an interesting
  type signature:

  \vskip8pt

\begin{verbatim}
*Main> :type countWords
countWords :: FilePath -> IO ()
\end{verbatim}

  \vskip8pt See the result type of \lstinline{IO ()}? That means
  ``this is an action that performs I/O, and which returns nothing
  useful when it's done.''
\end{frame}
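
\begin{frame}[fragile]
  \frametitle{Pure and impure types side by side}

  One way to see what \lstinline{IO} tells us is to split out the
  pure part of the function.  The helper name \lstinline{wordCount}
  is our own invention for this sketch.

  \vskip8pt
\begin{lstlisting}
-- Pure: there is no IO in the type.
wordCount :: String -> Int
wordCount = length . words

-- Impure: IO in the result type marks an action.
countWords :: FilePath -> IO ()
countWords path = do
  content <- readFile path
  putStrLn (show (wordCount content) ++ "  " ++ path)
\end{lstlisting}

  \vskip8pt Loading this into \texttt{ghci} and asking for the
  \texttt{:type} of each name should show the difference.
\end{frame}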

\begin{frame}[fragile]
  \frametitle{Main}
  
  In Haskell, the entry point to an executable is named
  \lstinline{main}. You are shocked by this, I am sure.

  \vskip8pt

\begin{lstlisting}
main = do
  args <- getArgs
  mapM_ countWords args
\end{lstlisting}

  \vskip8pt

  Instead of \lstinline{main} being passed its command line arguments
  as in C, it uses the \lstinline{getArgs} action to retrieve them.
\end{frame}

\begin{frame}
  \frametitle{What's this \lstinline{mapM_} business?}

  The \lstinline{map} function can only call pure functions, so it has
  an equivalent named \lstinline{mapM} that maps an \emph{impure}
  action over a list of arguments and returns the list of results.

  \vskip8pt

  The \lstinline{mapM} function has a cousin, \lstinline{mapM_}, that
  throws away the result of each action it performs.

  \vskip8pt

  In other words, this is one way to perform a loop over a list in
  Haskell.  

  \vskip8pt ``\lstinline{mapM_ countWords args}'' means ``apply
  \lstinline{countWords} to every element of \lstinline{args} in turn,
  and throw away each result.''
\end{frame}
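
\begin{frame}[fragile]
  \frametitle{\lstinline{mapM} versus \lstinline{mapM_}}

  A minimal sketch of the difference.  The action
  \lstinline{double} is hypothetical; it uses \lstinline{return},
  which turns a pure value into the result of an action.

  \vskip8pt
\begin{lstlisting}
double :: Int -> IO Int
double n = do
  print n
  return (n * 2)

main :: IO ()
main = do
  results <- mapM double [1, 2, 3]  -- keeps the results
  print results                     -- prints [2,4,6]
  mapM_ double [1, 2, 3]            -- discards them
\end{lstlisting}
\end{frame}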

\begin{frame}[fragile]
  \frametitle{Compare and contrast II, electric boogaloo}

  These don't look as similar as their predecessors:
  \vskip8pt
\begin{lstlisting}[language=python]
def main():
    for name in sys.argv[1:]:
        count_words(name)
\end{lstlisting}

  \vskip8pt

\begin{lstlisting}
main = do
    args <- getArgs
    mapM_ countWords args
\end{lstlisting}

  \vskip8pt I wonder if we could change that.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Idiomatic word count in Python}
  
  If we were writing ``real'' Python code, it would look more like
  this:

  \vskip8pt
\begin{lstlisting}[language=python]
def main():
    for path in sys.argv[1:]:
        c = open(path).read()
        print len(c.split()), path
\end{lstlisting}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Meet \lstinline{forM_}}
  
  In the \lstinline{Control.Monad} module, there are two functions
  named \lstinline{forM} and \lstinline{forM_}. They are nothing more
  than \lstinline{mapM} and \lstinline{mapM_} with their arguments
  flipped.

  \vskip12pt In other words, these are identical:

  \vskip8pt

\begin{lstlisting}
mapM_ countWords args
forM_ args countWords
\end{lstlisting}

  \vskip8pt That seems a bit gratuitous.  Why should we care?
\end{frame}
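
\begin{frame}[fragile]
  \frametitle{How \lstinline{forM_} could be written}

  To make the ``flipped arguments'' claim concrete, here is a sketch
  of our own version.  The primed name and the list-and-\lstinline{IO}
  type are simplifying assumptions; the library function is more
  general.

  \vskip8pt
\begin{lstlisting}
-- Primed so we don't clash with Control.Monad.
forM_' :: [a] -> (a -> IO b) -> IO ()
forM_' xs action = mapM_ action xs

main :: IO ()
main = do
  mapM_  print [1, 2, 3 :: Int]
  forM_' [1, 2, 3 :: Int] print
\end{lstlisting}

  \vskip8pt Both lines in \lstinline{main} print the same three
  numbers.
\end{frame}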

\begin{frame}[fragile]
  \frametitle{Function application as an operator}
  
  In our last lecture, we were introduced to function composition:

\begin{lstlisting}
f . g = \x -> f (g x)
\end{lstlisting}

  \vskip8pt
  We can also write an operator that applies a function to its argument:

\begin{lstlisting}
f $ x = f x
\end{lstlisting}

  \vskip8pt This operator has a very low precedence, so we can use it
  to get rid of parentheses. Sometimes this makes code easier to read:

\begin{lstlisting}
putStrLn  (show numWords ++ "  " ++ path)
putStrLn $ show numWords ++ "  " ++ path
\end{lstlisting}
\end{frame}
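
\begin{frame}[fragile]
  \frametitle{Why the low precedence matters}

  A sketch that restates the operator with its fixity.  We hide the
  Prelude's own definition so that ours can take its place; the
  \lstinline{infixr 0} line mirrors the standard declaration.

  \vskip8pt
\begin{lstlisting}
import Prelude hiding (($))

infixr 0 $   -- lowest precedence, groups to the right
($) :: (a -> b) -> a -> b
f $ x = f x

main :: IO ()
main = do
  print (length (words "to be or not to be"))
  print $ length $ words "to be or not to be"
\end{lstlisting}

  \vskip8pt Both lines print 6; the second simply trades nested
  parentheses for the operator.
\end{frame}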

\begin{frame}[fragile]
  \frametitle{Idiomatic word counting in Haskell}
  
  See what's different about this word counting?

  \vskip8pt

\begin{lstlisting}
main = do
  args <- getArgs
  forM_ args $ \arg -> do
    content <- readFile arg
    let len = length (words content)
    putStrLn (show len ++ "  " ++ arg)
\end{lstlisting}

  \vskip8pt Doesn't that use of \lstinline{forM_} look remarkably like
  a \lstinline[language=python]{for} loop in some other language?
  That's because it \emph{is} one.
\end{frame}

\begin{frame}
  \frametitle{The reason for the \lstinline{$}}
  
  Notice that the body of the \lstinline{forM_} loop is an anonymous
  function of one argument.

  \vskip8pt We put the \lstinline{$} in there so that we wouldn't have
  to either wrap the entire function body in parentheses, or split it
  out and give it a name.
\end{frame}

\begin{frame}[fragile]
  \frametitle{The good}
  Here's our original code, using the \lstinline{$} operator:
  \vskip8pt
\begin{lstlisting}
  forM_ args $ \arg -> do
    content <- readFile arg
    let len = length (words content)
    putStrLn (show len ++ "  " ++ arg)
\end{lstlisting}
\end{frame}

\begin{frame}[fragile]
  \frametitle{The bad}
  If we omit the \lstinline{$}, we could use parentheses:
  \vskip8pt
\begin{lstlisting}
  forM_ args (\arg -> do
    content <- readFile arg
    let len = length (words content)
    putStrLn (show len ++ "  " ++ arg))
\end{lstlisting}
\end{frame}

\begin{frame}[fragile]
  \frametitle{And the ugly}
  Or we could give our loop body a name:
  \vskip8pt
\begin{lstlisting}
  let body arg = do
        content <- readFile arg
        let len = length (words content)
        putStrLn (show len ++ "  " ++ arg)
  forM_ args body
\end{lstlisting}

  \vskip8pt Giving such a trivial single-use function a name seems
  gratuitous.

  \vskip8pt Nevertheless, it should be clear that all three pieces of
  code are identical in their operation.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Trying it out}
  
  Let's assume we've saved our source file as \texttt{WC.hs}, and give
  it a try:

  \vskip8pt
\begin{verbatim}
$ ghc --make WC
[1 of 1] Compiling Main ( WC.hs, WC.o )
Linking WC ...

$ du -h ascii.txt 
58M	ascii.txt

$ time ./WC ascii.txt 
9873630  ascii.txt

real	0m8.043s
\end{verbatim}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Comparison shopping}
  
  How does the performance of our \texttt{WC} program compare with the
  system's built-in \texttt{wc} command?

\begin{verbatim}
$ export LANG=C
$ time wc -w ascii.txt
9873630 ascii.txt

real	0m0.447s
\end{verbatim}

  Ouch!  The C version is almost 18 times faster.
\end{frame}

\begin{frame}[fragile]
  \frametitle{A second try}
  
  Does it help if we recompile with optimisation?

\begin{verbatim}
$ ghc -fforce-recomp -O --make WC
$ time ./WC ascii.txt
9873630  ascii.txt

real	0m7.696s
\end{verbatim}

  So that made our code only about 4\% faster. Ugh.
\end{frame}

\begin{frame}
  \frametitle{What's going on here?}
  
  Remember that in Haskell, a string is a list. And a list is
  represented as a linked list.

  \vskip8pt This means that every character gets its own list element,
  and list elements are not allocated contiguously. When each element
  is itself a large data structure, the per-element list overhead is
  negligible, but for single characters, it's a total killer.

  \vskip8pt So what's to be done?

  \vskip8pt Enter the bytestring.
\end{frame}
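
\begin{frame}[fragile]
  \frametitle{A string, cell by cell}

  To picture the overhead, here is a sketch of a string spelled out
  one list cell at a time.  The \lstinline{List} type is hand-rolled
  for illustration; it only mimics the shape of the built-in one.

  \vskip8pt
\begin{lstlisting}
-- A hand-rolled list type, for illustration only.
data List a = Nil | Cons a (List a)

-- "abc" as the built-in list type sees it:
builtin :: String
builtin = 'a' : 'b' : 'c' : []

-- The same shape, using our own type.
ours :: List Char
ours = Cons 'a' (Cons 'b' (Cons 'c' Nil))

main :: IO ()
main = print (builtin == "abc")   -- prints True
\end{lstlisting}
\end{frame}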

\begin{frame}[fragile]
  \frametitle{The original code}
  
\begin{lstlisting}
main = do
  args <- getArgs
  forM_ args $ \arg -> do
    content <- readFile arg
    let len = length (words content)
    putStrLn (show len ++ "  " ++ arg)
\end{lstlisting}
\end{frame}

\begin{frame}[fragile]
  \frametitle{The bytestring code}
  
  A bytestring is a contiguously allocated array of bytes.  Because
  there's no pointer-chasing overhead, this should be faster.

  \vskip8pt

\begin{lstlisting}
import Control.Monad (forM_)
import qualified Data.ByteString.Char8 as B
import System.Environment (getArgs)

main = do
  args <- getArgs
  forM_ args $ \arg -> do
    content <- B.readFile arg
    let len = length (B.words content)
    putStrLn (show len ++ "  " ++ arg)
\end{lstlisting}

  \vskip8pt Notice the \lstinline{import qualified}: names from that
  module must now be prefixed, and the \lstinline{as B} part lets us
  write \lstinline{B} instead of \lstinline{Data.ByteString.Char8} as
  the prefix.
\end{frame}
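
\begin{frame}[fragile]
  \frametitle{Why qualify?}

  The bytestring module exports names such as \lstinline{length} and
  \lstinline{words} that clash with the Prelude's.  A small sketch,
  using a made-up sample string, of how the prefix keeps them apart:

  \vskip8pt
\begin{lstlisting}
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
  let s = "to be or not to be"
  print (length (words s))             -- Prelude: 6
  print (B.length (B.pack s))          -- bytes: 18
  print (length (B.words (B.pack s)))  -- 6 again
\end{lstlisting}
\end{frame}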

\begin{frame}[fragile]
  \frametitle{So is it faster?}
  
  How does this code perform?

  \vskip8pt

\begin{verbatim}
$ time ./WC ascii.txt 
9873630  ascii.txt

real	0m8.043s

$ time ./WC-BS ascii.txt 
9873630  ascii.txt

real	0m1.434s
\end{verbatim}

  \vskip8pt Not bad! We're nearly 6x faster than the \lstinline{String}
  code, and now just over 3x slower than the C code.
\end{frame}

\begin{frame}
  \frametitle{Seriously? Bytes for text?}
  
  There is, of course, a snag to using bytestrings: they're strings of
  bytes, not characters.

  \vskip8pt This is the 21st century, and everyone should be using
  Unicode now, right?

  \vskip8pt Our answer to this problem in Haskell is to use the
  \texttt{text} package, whose main module is \lstinline{Data.Text}.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Unicode-aware word count}
  
\begin{lstlisting}
import Control.Monad (forM_)
import qualified Data.ByteString.Char8 as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8)
import System.Environment (getArgs)

main = do
  args <- getArgs
  forM_ args $ \arg -> do
    bytes <- B.readFile arg
    let content = decodeUtf8 bytes
        len = length (T.words content)
    putStrLn (show len ++ "  " ++ arg)
\end{lstlisting}
\end{frame}

\begin{frame}
  \frametitle{What happens here?}
  
  Notice that we still use bytestrings to read the initial
  data in.

  \vskip8pt Now, however, we use \lstinline{decodeUtf8} to turn the
  raw bytes from UTF-8 into the Unicode representation that
  \lstinline{Data.Text} uses internally.

  \vskip8pt We then use \lstinline{Data.Text}'s \lstinline{words}
  function to split the big string into a list of words.

\end{frame}
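
\begin{frame}[fragile]
  \frametitle{Bytes versus characters}

  A sketch of why decoding matters: the same text has different
  lengths as bytes and as characters.  The five-letter Greek sample
  is our own, written as numeric escapes to keep the source ASCII.

  \vskip8pt
\begin{lstlisting}
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8, encodeUtf8)

main :: IO ()
main = do
  -- Five Greek letters, written as escapes.
  let text  = T.pack "\955\972\947\959\962"
      bytes = encodeUtf8 text
  print (B.length bytes)               -- prints 10
  print (T.length (decodeUtf8 bytes))  -- prints 5
\end{lstlisting}

  \vskip8pt Each of these letters takes two bytes in UTF-8, so
  counting bytes would overstate the length.
\end{frame}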

\begin{frame}[fragile]
  \frametitle{Comparing Unicode performance}
  
  \vskip8pt For comparison, let's first try a Unicode-aware word count
  in C, on a file containing 112.6 million characters of UTF-8-encoded
  Greek:

\begin{verbatim}
$ du -h greek.txt
196M	greek.txt

$ export LANG=en_US.UTF-8
$ time wc -w greek.txt
16917959 greek.txt

real	0m8.306s

$ time ./WC-T greek.txt
16917959  greek.txt

real	0m7.350s
\end{verbatim}
\end{frame}

\begin{frame}
  \frametitle{What did we just see?}
  
  Wow! Our tiny Haskell program is actually 13\% \emph{faster} than
  the system's \texttt{wc} command!

  \vskip8pt This suggests that if we choose the right representation,
  we can write real-world code that is both brief and highly
  efficient.

  \vskip8pt This ought to be immensely cheering.
\end{frame}

\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% TeX-PDF-mode: t
%%% End: 

