Case Study: Building A URL Bouncer

tl;dr - Building a simple URL bouncer with Servant isn’t that hard, and the usual warm fuzzies you get from well-typed functions, interfaces, and code still apply

If you’re not familiar with Haskell or Servant, the former is a programming language that focuses on pure functional concepts and the latter is one of its most interesting/popular web frameworks, built around describing your API as a type. A brief taste of both of these things is below:

-- A simple API declared with Servant that exposes one endpoint, /api/v1/users, to GET and POST requests
type API = "api" :> "v1" :> UsersAPI
type UsersAPI = "users" :> Get '[JSON] (PaginatedList User)
    :<|> "users" :> ReqBody '[JSON] User :> Post '[JSON] (EnvelopedResponse (ModelWithID User))

This is basically a snippet of code from my own codebase, but it should show you just how expressive Haskell is and how the utilities of Servant fit together to extend that expressiveness to the web-server domain. This isn’t an introductory post on either of those technologies, so if you’re looking for that you might want to check out some excellent resources already out there.

A lot of my work these days is with/through GAISMA, a Japanese company I run with a buddy of mine, on a fairly recent project we’re hoping to (fully) launch this year – a job board aimed at the flourishing bilingual market in Japan called The Start (project now defunct). At the outset of the project, TheStart was loaded up with a list of killer features that could set the project apart, and as I start to iterate on them I realize just how much is involved in actually executing even the simplest of goals in a reasonable, well-engineered manner. Even the simple idea of “get a job board up” has taken me a lot more time than I thought. This blog post is an in-some-kind-of-depth look at just one of the features I thought was a footnote, but took some time to think through – a URL bouncer for the early version of the site (v1.1).

A critical feature we forgot

Of course, while dreaming up all the awesome features of this new job board (while job boards aren’t terribly exciting ideas, there is definitely some room left to innovate), the reasonable expectation that bounces occurring on the site should be tracked was forgotten. I’m not referring to bounces in the active user/traffic sense (which is usually the term for when a user comes to your site, hangs around for a bit then leaves to browse the rest of the internet); rather, I’m referring to what others might call referrals/conversions, which is when someone sees a job posting they’re interested in on the site, and clicks through to apply.

From the beginning I aimed to build a fairly transparent platform, where business users who made accounts as company reps were able to see just how their jobs were doing, how many clicks/conversions they were garnering, and see just how useful the platform was to them. I think anything less than that level of transparency is just hoping for under-informed customers and I don’t think that makes a good business strategy (I also haven’t studied business at length in any academic capacity so take that statement with a boulder of pure salt). The first obvious thing to do to increase this transparency is of course to count the basic value offer of our product (the job board) – getting people to check out the jobs themselves!

As with anything that seems simple at the start but gets more complicated as you unpack it, I realized we needed to use bounce URLs in very different ways:

  • Email campaign URLs
  • Viral campaign URLs
  • Regular job posting conversion URLs (automatically generated for every job posting)

Having these statistics would increase the level of analysis (from 0 to something, I guess) we could perform in any one of these areas, and possibly make a huge difference down the line when someone tried to pull insight from the madness. Questions like “what are the most popular job postings in a given industry?” or “what are the most popular job postings of all time, and which companies do they belong to?” are obviously good ones to ask/be able to answer – it’s a no-brainer.

Buy vs Build

I can’t say there was much critical thought put into this decision – entire companies have been built on the promise of “simple” link redirection with metrics. That’s a big indicator of the kind of dangers that could be lurking – those same companies have also made some somewhat subtle mistakes, which fooled me as well. However, the idea of a link redirection scheme seemed so simple that I chose to build it myself. I also (often to my own detriment) very highly value knowing as much of my “stack” as possible so I’m less surprised in the future, and enjoy building/managing my own infrastructure, so it was an easy decision (whether it was right is a whole ‘nother topic).

One piece of infrastructure that we already used that seemed like it could be up for the job was Piwik. I looked semi-desperately for a quick win there, wondering if Piwik somehow supported the creation of simple configurable/API-accessible URL redirects/redirection, but I could never find the feature. There was also the worry of coupling my pretty no-frills/simple application to Piwik (which is a much more complicated contraption). Spinning up the app in a completely separate environment normally requires just the database and the executable, but including this would require running Piwik (or some in-memory mock of it) as well – and that’s not such an easy task, even with tools like docker/docker compose around.

Quick planning

While this feature is pretty large – architectural considerations (basically just trying not to paint myself into a corner), frontend, and backend work – here’s the 1000ft view of what needed to be done:

  1. Add the types required to facilitate link bouncing
  2. Think about the data schema changes and write some SQL to support the tables
  3. Add the endpoints to support viewing, creating bounce URLs and the actual bouncing
    • Decide which kind of bouncing to use – whether to straight 302 or load a page that does some javascript magic (and possibly more logging)
    • Add the API (Haskell) code to support retrieving the data from the backend

While the list looks pretty short written down, there’s a lot packed into each item, and lots of complexity to be avoided.

As the post progresses, I’ll cover how each of these portions was done, in the order that they were done. This should do a lot to show what my development process was like and, in general, what working with a Haskell codebase CAN be like.

Step 0: Decide how to do the redirecting

After some light thought it was pretty obvious I’d need a type that holds IDs, URLs, and creation dates at the very least (immutability was also probably a plus here, wouldn’t want history/stats for a certain URL to just suddenly disappear). I’d also probably want a wider type with other kinds of information that I could gather about hosts that were bounced (what IP? what country? referrer? device? etc, whatever is on the User Agent that the browser shares with me).

Minutes/hours in and there’s already a relatively large structural dilemma to consider: How exactly should the bouncing be done? There are a bunch of different things you can collect depending on how you do it. Here are two basic approaches:

HTTP Status code 302 redirection

The crux of this approach is that instead of returning an HTTP response with status 200 (which means OK) and content that is (usually) a webpage to show or some sort of data, the server returns an HTTP response with status 302 (go to this place instead, for now). If it were a conversation it might go something like this:

Browser: Hey do you know where this /jobs/apply/1234 thing is? Do you have it?

Server: Hey yeah, we have that, but actually you should visit some-company.com/apply/artisanal-pickle-salesman-position for that information directly. Matter of fact, just leave here and go straight there

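NOTE - DO NOT use HTTP 301s for non-permanent redirections. Want to know why? Go read the wiki or the HTTP spec, it’s good for you. The short version: a 301 tells clients the move is permanent, so browsers are free to cache it and skip your server (and your tracking) on later visits. Misusing 301 is VERY painful in production.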

Simple 302s will yield you only the information kept in the actual web request, because the person basically never gets a chance to load a page on your server, but it’s the fastest and the least-scummy, and least offputting for users. 99% of users never stop to consider the fact that when they click a link in Facebook messenger or Google hangouts it’s not actually a link to the content, but actually Facebook/Google’s link to the link to the content. This enables them to build data on what people are talking about, what’s trending, and whatever else they may be using it for.
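To make the server’s half of that conversation concrete, here is a minimal sketch of what a bare 302 looks like on the wire. The `redirectResponse` helper is hypothetical (it is not part of the codebase discussed here) and only builds the response text; a real server would leave this to its HTTP library.

```haskell
-- A hypothetical helper, for illustration only: builds the raw text of an
-- HTTP/1.1 "302 Found" response that points the browser at another URL.
redirectResponse :: String -> String
redirectResponse target = concatMap (++ "\r\n")
    [ "HTTP/1.1 302 Found"
    , "Location: " ++ target   -- where the browser should go instead
    , "Content-Length: 0"      -- no body: the browser never renders a page
    , ""                       -- blank line ends the headers
    ]
```

Everything the server learns about the visitor has to come from the request that triggered this response, since no page (and no script) is ever delivered.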

JS-based redirecting with window.location

This approach is a little different from HTTP 302 redirection in that it actually loads a webpage, but then uses javascript to collect some more metrics (or maybe even ask if the person wants to be redirected, or whatever else), and do the eventual redirect, using the browser location API.

This basically amounts to adding a script tag like the following to the page you serve:

<script>
  // ... Some other logic ...
  window.location = "https://www.some-company.com/apply/artisanal-pickle-salesman-position";
</script>

There are a few more things you can collect/do when you go with the javascript approach. If it were a conversation, it would go like this:

Browser: Hey do you know where this /jobs/apply/1234 thing is? Do you have it?

Server: Hey yeah, we have that, check out this page

Browser: Cool, thanks

** seconds later browser is redirected to some-company.com/apply/artisanal-pickle-salesman-position… **

Decision

Since I really don’t need to collect that much information from browsers right now (and depending on what you do it can be pretty shady/undesirable), I figured I’d go with the basic 302 redirects for now. In the future if I want to do a different type, I can just do some parametrization/change-up the types and modify the code to show a page that will do the redirect. I think I successfully made this decision in a way that makes sense and doesn’t paint me into a corner, which feels like a win. Only time will tell!
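One way the “parametrize it later” idea could look is sketched below. Everything here is hypothetical: `BounceStrategy`, `BounceResponse` and `respondWith` are made-up names for illustration, not types from the actual codebase, and the script variant does no escaping of the target URL (which real code would need).

```haskell
-- Hypothetical types, for illustration only.
data BounceStrategy
    = Redirect302      -- answer with a bare 302 and a Location header
    | ScriptRedirect   -- serve a small page that sets window.location
    deriving (Eq, Show)

data BounceResponse = BounceResponse
    { statusCode :: Int
    , headers    :: [(String, String)]
    , body       :: String
    } deriving (Eq, Show)

-- Pick the response shape from the strategy; callers do not change when a
-- new strategy is added, only this function does.
respondWith :: BounceStrategy -> String -> BounceResponse
respondWith Redirect302 target =
    BounceResponse 302 [("Location", target)] ""
respondWith ScriptRedirect target =
    -- NOTE: the URL is spliced in unescaped; fine for a sketch, unsafe for real use.
    BounceResponse 200 [("Content-Type", "text/html")]
        ("<script>window.location = \"" ++ target ++ "\";</script>")
```

With a shape like this, starting on 302s and switching later comes down to changing which constructor gets passed in.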

Step 1: Build the types

Haskell places a focus on types. Types help you think more clearly, and thinking clearly about what’s happening in a program helps you write better code. With that massive over-simplification of the benefits of Haskell out of the way, here’s what the first draft (relatively close to the final draft as well) of the types looks like:

data URLBounceConfig = URLBounceConfig
    { targetUrl       :: String
    , name            :: Maybe String
    , bounceCreatedAt :: DateTime
    } deriving (Eq, Show)

data URLBounceInfo = URLBounceInfo
    { bouncedHost :: String
    , bouncedAt   :: DateTime
    , referer     :: Maybe String
    , userAgent   :: Maybe String
    } deriving (Eq, Show)

-- .... a few lines down ...
$(deriveJSON defaultOptions ''URLBounceConfig)
$(deriveJSON defaultOptions ''URLBounceInfo)

It’s all pretty self-explanatory, and I’m not even using the most specific types I could, for simplicity (I could replace String with Hostname to be a bit more forthright, or replace targetUrl’s type with a proper URI/URL type). Maybe in the future I’ll revisit and add stronger type restrictions. You can basically ignore deriveJSON for now, but just think of it as a super useful thing that writes a whole lot of boilerplate that makes it easy to read these Haskell types to/from JSON objects automagically.
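For a feel of what those stronger types might look like, here is a minimal sketch using newtypes and a smart constructor. `Hostname`, `TargetURL` and `mkTargetURL` are hypothetical names, and the prefix check is deliberately naive; a proper URI parsing library would be the more robust choice.

```haskell
import Data.List (isPrefixOf)

-- Hypothetical wrapper types; none of these exist in the code above.
newtype Hostname  = Hostname String  deriving (Eq, Show)
newtype TargetURL = TargetURL String deriving (Eq, Show)

-- Smart constructor: only accepts strings that start with http:// or https://.
-- A deliberately naive check; a real URI parser would be much stricter.
mkTargetURL :: String -> Maybe TargetURL
mkTargetURL s
    | any (`isPrefixOf` s) ["http://", "https://"] = Just (TargetURL s)
    | otherwise                                    = Nothing
```

The payoff is that a function taking a `TargetURL` can no longer be handed a hostname or an arbitrary string by accident, and bad input is rejected once, at the edge.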

Step 2: Add the DB schema that would support these types

-- Create the url bounce table
CREATE TABLE IF NOT EXISTS url_bounces (
       id INTEGER PRIMARY KEY,
       name TEXT UNIQUE,
       targetUrl TEXT,
       isActive INTEGER,
       createdAt TEXT);

CREATE TABLE IF NOT EXISTS url_bounce_info (
       id INTEGER PRIMARY KEY,
       bouncedHost TEXT,
       bouncedAt TEXT,
       referer TEXT,
       userAgent TEXT);

CREATE INDEX IF NOT EXISTS urlBounceName_idx ON url_bounces (name);

This SQL is actually written to be run on SQLite. There’s a lot I could say about the choice to use SQLite instead of spinning up Postgres, but I’m not going to say too much here (maybe I’ll lay it out in a future blog post). The simple gist is that when I think about projects I’ve done in the past, I can’t remember one that’s ever grown past a scale that SQLite could (probably) handle. Reading SQLite’s case on when to use it really illuminated the idea that I might not need more than SQLite (especially at the prototype/build phase), and I’m using this project as a chance to really test just how far one can get with SQLite.

That said, I’ve written my software in the usual componentized-interface + implementation pattern (that’s not a real term, I just made it up) – meaning that I have a Backend typeclass (interface) that has a SQLiteBackend implementation (so I could easily write a PostgresBackend implementation if the need to switch ever arises). Yes, this reeks of YAGNI, and in practice, this kind of pattern is rarely ever actually used (not many people seem to ACTUALLY switch their main database around, to almost no one’s surprise) – but in this case I think the abstraction is worth it.
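The interface + implementation pattern can be sketched in a few lines. This is a cut-down, hypothetical stand-in: `BounceStore` and `InMemoryStore` are invented for the example (the real class is `Backend`, and its real implementation talks to SQLite), but the shape is the same.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- A cut-down, hypothetical interface; the real Backend class has many more methods.
class BounceStore s where
    recordBounce :: s -> String -> IO ()
    countBounces :: s -> String -> IO Int

-- One implementation: keeps every bounced name in memory.
newtype InMemoryStore = InMemoryStore (IORef [String])

newInMemoryStore :: IO InMemoryStore
newInMemoryStore = InMemoryStore <$> newIORef []

instance BounceStore InMemoryStore where
    recordBounce (InMemoryStore ref) name = modifyIORef' ref (name :)
    countBounces (InMemoryStore ref) name =
        length . filter (== name) <$> readIORef ref

-- Code written against the interface works with any implementation.
bounceAndCount :: BounceStore s => s -> String -> IO Int
bounceAndCount store name = recordBounce store name >> countBounces store name
```

Swapping databases then means writing another instance; `bounceAndCount` and anything else written against the class stays untouched. An in-memory instance like this one can also double as a test fixture.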

Step 3: Add the API endpoints and Haskell code

The code I ended up writing for the backend was pretty well split into different concerns/important sections in the files, so I’ve split them up similarly in the blog post as well:

Routing stuff

The code to make sure API endpoints were actually reachable:

-- ... a bunch of other routes ...

-- | URL bouncing related endpoints
type URLBounceConfigsAPIV1 = "url-bounce-configs" :> CookieAuth :> Get '[JSON] (EnvelopedResponse (PaginatedList (ModelWithID URLBounceConfig)))
    :<|> "url-bounce-configs" :> CookieAuth :> ReqBody '[JSON] URLBounceConfig :> Post '[JSON] (EnvelopedResponse (ModelWithID URLBounceConfig))
    :<|> "url-bounce-configs" :> CookieAuth :> Capture "bounceConfigId" URLBounceConfigID :> Get '[JSON] (EnvelopedResponse (ModelWithID URLBounceConfig))
    :<|> "url-bounce-configs" :> CookieAuth :> Capture "bounceConfigId" URLBounceConfigID :> "activate" :> Post '[JSON] (EnvelopedResponse (ModelWithID URLBounceConfig))
    :<|> "url-bounce-configs" :> CookieAuth :> Capture "bounceConfigId" URLBounceConfigID :> "deactivate" :> Post '[JSON] (EnvelopedResponse (ModelWithID URLBounceConfig))
    :<|> "url-bounce-configs" :> CookieAuth :> Capture "bounceConfigId" URLBounceConfigID :> "info" :> QueryParam "pageSize" Limit :> QueryParam "offset" Offset :> Get '[JSON] (EnvelopedResponse (PaginatedList (ModelWithID URLBounceInfo)))
    :<|> "url-bounce-configs" :> CookieAuth :> Capture "bounceConfigId" URLBounceConfigID :> "bounce-count" :> Get '[JSON] (EnvelopedResponse Int)

There are a lot of endpoints here, and implementation details, but one of the best things about using Servant’s declarative API routing is that for the most part, it’s actually very readable. With that excuse, I’m going to hold off on explaining too much of what’s happening here, and let you just read for yourself!

Also, you might ask: “Why in the world would you allow so much duplication? Surely you could factor "url-bounce-configs" :> CookieAuth :> Capture ... out and make this cleaner?!” You’d be absolutely right – this code is kinda dirty.
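For the curious, here is a rough sketch of what the factored version could look like, using Servant’s ability to nest alternatives under a shared prefix. It assumes the servant package is available; `Placeholder` stands in for the real response types, and the `CookieAuth` combinator is left out because it is specific to this codebase.

```haskell
{-# LANGUAGE DataKinds     #-}
{-# LANGUAGE TypeOperators #-}

import Servant.API

-- Placeholder response type so the sketch stands alone; the real API returns
-- EnvelopedResponse (ModelWithID URLBounceConfig) and friends.
type Placeholder = String

-- The shared "url-bounce-configs" prefix is written once and distributed
-- over the alternatives; the Capture is likewise shared by the nested routes.
type FactoredAPI = "url-bounce-configs" :>
    (    Get '[JSON] [Placeholder]
    :<|> ReqBody '[JSON] Placeholder :> Post '[JSON] Placeholder
    :<|> Capture "bounceConfigId" Int :>
        (    Get '[JSON] Placeholder
        :<|> "activate"     :> Post '[JSON] Placeholder
        :<|> "deactivate"   :> Post '[JSON] Placeholder
        :<|> "bounce-count" :> Get '[JSON] Int
        )
    )
```

The trade-off is that the handler structure has to follow the nesting: the server for the `Capture` branch becomes a single function from the captured ID to a group of handlers, rather than a flat list of handlers that each take the ID.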

Backend CRUD code

The code to do CRUD (mostly creating, reading) operations:

addURLBounceConfig :: b -> URLBounceConfig -> IO (Maybe (ModelWithID URLBounceConfig))
addURLBounceConfig b bounceCfg = maybe (return Nothing) handle (backendConn b)
    where
      handle c = getCurrentTime >>= \now -> insertEntity_ DBQ.insertURLBounceConfig bounceCfg { bounceCreatedAt=now } c

setURLBounceConfigActivity :: b -> Bool -> URLBounceConfigID -> IO (Maybe (ModelWithID URLBounceConfig))
setURLBounceConfigActivity b bIsActive bid = maybe (return Nothing) (updateSimpleFieldAndReturnEntityOnSuccess_ DBQ.urlBounceConfigsTableName bid DBQ.genericIsActiveFieldName bIsActive) (backendConn b)

findURLBounceConfigByID :: b -> URLBounceConfigID -> IO (Maybe (ModelWithID URLBounceConfig))
findURLBounceConfigByID b cid = maybe (return Nothing) (getRowBySimpleField_ DBQ.urlBounceConfigsTableName DBQ.genericIDField cid) (backendConn b)

getAllURLBounceConfigs :: b -> IO (Maybe (PaginatedList (ModelWithID URLBounceConfig)))
getAllURLBounceConfigs = maybe (return Nothing) (getFullListOfEntity_ DBQ.urlBounceConfigsTableName) . backendConn

findURLBounceConfigByName :: b -> String -> IO (Maybe (ModelWithID URLBounceConfig))
findURLBounceConfigByName b = getRowBySimpleField DBQ.urlBounceConfigsTableName b DBQ.urlBounceConfigsNameFieldName

saveURLBounceInfo :: b -> URLBounceInfo -> IO (Maybe (ModelWithID URLBounceInfo))
saveURLBounceInfo b info = maybe (return Nothing) (insertEntity_ DBQ.insertURLBounceInfo info) (backendConn b)

findURLBounceInfoForConfig :: b -> URLBounceConfigID -> Maybe Limit -> Maybe Offset -> IO (Maybe (PaginatedList (ModelWithID URLBounceInfo)))
findURLBounceInfoForConfig b cid limit offset = maybe (return Nothing) (getRowsBySimpleField_ DBQ.urlBounceInfoTableName DBQ.urlBounceConfigFKName cid limit offset) (backendConn b)

getNumberHitsForBounceConfigByID :: b -> URLBounceConfigID -> IO (Maybe Int)
getNumberHitsForBounceConfigByID b cid = maybe (return Nothing) (getRowCountBySimpleField_ DBQ.urlBounceInfoTableName DBQ.urlBounceConfigFKName (Just cid)) (backendConn b)

This code lives in both Types.hs (where the typeclass and function signatures are) and SqliteBackend.hs (where the implementation is). There is lots I could go into about the code, but here are the most interesting (to me) higher level points that should help with understanding it.

You might ask “Why in the world would you still be writing CRUD code for your endpoints yourself?” – That’s a good question.

Understanding the Backend typeclass

Here are some light notes on the Backend typeclass and what it means:

  • The b you see passed in everywhere is the Backend object itself. It’s similar to, but not quite the same as, getting self as the first argument in methods in a language like Python. Somewhere in Types.hs there is a declaration that goes like this:
class Backend b where
    getBackendLogger :: b -> Maybe Logger

    connect :: b -> IO b
    disconnect :: b -> IO b
-- ... lots more stuff ...

Reading the signatures for methods in the type class

Super quick & dirty Haskell 102 (since typeclasses are a little further ahead than 101 but maybe not quite 201): getBackendLogger is a function that “takes” a b, where b conforms to the Backend typeclass, and produces a value with type Maybe Logger. Read that sentence over and over until it seems to make sense.

Basically, what it’s saying is that if you give getBackendLogger, the function, a Backend, it will give you a thing that is a Maybe Logger. If you don’t know what Maybe Logger is, don’t even worry about it, just think about it in the abstract sense, use a red bowling ball if you want. If you’re interested in what a Maybe is, check out this SO post that goes into it a bit. Don’t spend too much time on it though: there’s a thing called Monads that is often discussed at the same time as Maybe because they’re related concepts, and you might be tempted to try and figure out what they are, but that’s a bad move super early on if you’re just beginning. It’s one of those things you have to use a bit, then research/read up on, then use a bit more, then research/read-up/watch videos about/watch lectures about, then fully understand, and feel super comfortable with.

So, with all that in mind, the sentence that describes saveURLBounceInfo would be something like: saveURLBounceInfo is a function that takes some b (that is a Backend), a URLBounceInfo value, and produces an IO (Maybe (ModelWithID URLBounceInfo)) value. That’s certainly a mouthful, but once you spend more time with it and start to understand the meaning behind these signatures (and let them sink in even more), the clarity of mind you get is very very rewarding.

One thing that might be confusing to beginners is this IO SOMETHING thing… Again, this isn’t a monad tutorial, but you can just read that to mean that it represents “an action that when run will produce the SOMETHING”. So an IO (Maybe (ModelWithID URLBounceInfo)) is “an action that when run will produce a Maybe (ModelWithID URLBounceInfo) value”. In this context, it’s obvious what this action is doing from the types alone (even though it could technically do almost anything): it’s putting the URLBounceInfo you gave it in the database! That’s clearly why what you get out is a Maybe (ModelWithID ...): you basically got the ID when you put the row in the database.
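To see an IO (Maybe ...) value being produced and consumed, here is a small self-contained sketch. `Connection`, `lookupName`, `findName` and `describe` are all hypothetical stand-ins; only the `maybe (return Nothing) ...` shape is borrowed from the CRUD functions shown earlier.

```haskell
-- Hypothetical stand-ins for the real connection and query types.
newtype Connection = Connection String

-- Pretend query: only row 1 exists.
lookupName :: Int -> Connection -> IO (Maybe String)
lookupName 1 _ = return (Just "pickle-salesman")
lookupName _ _ = return Nothing

-- Same shape as the CRUD functions: with no connection, answer Nothing
-- without running a query; otherwise hand the connection to the query.
findName :: Maybe Connection -> Int -> IO (Maybe String)
findName conn rowId = maybe (return Nothing) (lookupName rowId) conn

-- Consuming the result means handling both cases of the Maybe.
describe :: Maybe String -> String
describe Nothing  = "not found"
describe (Just n) = "found " ++ n
```

Running `findName` gives you the action; only when that action is executed (say, with `findName (Just (Connection "db")) 1 >>= putStrLn . describe` in GHCi) does the lookup actually happen.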

The code is pretty easy to read and expressive, so even if you don’t understand these concepts thoroughly don’t despair – you don’t have to worry too much about this code, that’s my job :).

Maybe that doesn’t mean anything to you? If so I guess now would be a good time to talk about…

Adding the API endpoints themselves

Hooking up the code that was actually going to make the routes DO stuff looks something like this:

-- ... sometime after the API declaration ...
urlBounceConfigsAPIServer :: ServerT URLBounceConfigsAPIV1 (WithApplicationGlobals Handler)
urlBounceConfigsAPIServer = allURLBounceConfigs
    :<|> createURLBounceConfig
    :<|> getURLBounceConfigByID
    :<|> changeURLBounceConfigActivity True
    :<|> changeURLBounceConfigActivity False
    :<|> getInfoForURLBounceConfig
    :<|> getBounceCountForURLBounceConfig

-- ... lots of controller-type code in between ...

createURLBounceConfig :: Auth.WAISession -> URLBounceConfig -> WithApplicationGlobals Handler (EnvelopedResponse (ModelWithID URLBounceConfig))
createURLBounceConfig s bounceConfig = do
  _ <- requireRoleFromRawSession Administrator s
  backend <- getBackendOrFail

  added <- liftIO $ addURLBounceConfig backend bounceConfig
  case added of -- Oh look, a pattern match on a Maybe!!
    Nothing -> throwError Err.failedToAddResource
    Just r -> return $ EnvelopedResponse "success" "Successfully created url bounce config" r

getURLBounceConfigByID :: Auth.WAISession -> URLBounceConfigID -> WithApplicationGlobals Handler (EnvelopedResponse (ModelWithID URLBounceConfig))
getURLBounceConfigByID s cid = do
  _ <- requireRoleFromRawSession Administrator s
  backend <- getBackendOrFail

  bounceCfg <- liftIO $ findURLBounceConfigByID backend cid
  case bounceCfg of
    Nothing -> throwError Err.failedToRetrieveResource
    Just c -> return $ EnvelopedResponse "success" "Successfully retrieved url bounce config" c

-- ... more URL bouncing controller code ...

doBounce :: Request -> Maybe (ModelWithID URLBounceConfig) ->  WithApplicationGlobals Handler String
doBounce req = maybe (throwError Err.unknownBounce) bounce
    where
      bounce (ModelWithID cid c) = do
        backend <- getBackendOrFail
        now <- liftIO DT.getCurrentTime
        _ <- liftIO $ saveURLBounceInfo backend (makeURLBounceInfoFromRequest cid now req)
        throwError err302 { errHeaders = [("Location", B8.pack (bounceTargetUrl c))] }

These are only a few functions but it gives you the idea of the machinery that connects that user-facing route to the work the backend has to do. This is the “controller”-type code, if you’re familiar with the MVC design paradigm as commonly applied to web servers.
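It may look odd that doBounce finishes with throwError rather than returning something. In Servant, a handler’s return value becomes a normal success response, so other statuses are typically produced by throwing a ServerError. Here is a stripped-down sketch of just that part, assuming servant-server 0.16 or later (older releases call the type ServantErr); `redirectTo` is a hypothetical helper name.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad.Except (throwError)
import qualified Data.ByteString.Char8 as B8
-- runHandler is imported so the helper can be exercised from GHCi.
import Servant.Server (Handler, ServerError (..), err302, runHandler)

-- Hypothetical helper: end the handler early with a 302 and a Location header.
-- The result type is fully polymorphic because no value is ever returned.
redirectTo :: String -> Handler a
redirectTo url = throwError err302 { errHeaders = [("Location", B8.pack url)] }
```

In GHCi, `runHandler (redirectTo "https://example.com" :: Handler ())` should come back as a `Left` whose errHTTPCode is 302, which is what Servant turns into the redirect response.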

As you can see, this Haskell code reads very imperatively, and it’s very clear what it’s doing. Some of the conventions (like _ <- ....) might not be so crystal clear, but it should be clear what the line is at least trying to accomplish.
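For anyone puzzled by the _ <- lines, here is a tiny sketch of the convention. It uses Either instead of the real handler monad so it runs with nothing but base; `requireRole` and `adminOnly` are hypothetical stand-ins for the role check in the controller code.

```haskell
-- Hypothetical check, standing in for something like requireRoleFromRawSession:
-- it returns the role on success and fails the whole computation otherwise.
requireRole :: String -> String -> Either String String
requireRole needed actual
    | needed == actual = Right actual
    | otherwise        = Left ("need role " ++ needed)

-- `_ <- action` runs the action for its effect (here: possibly failing)
-- and throws away the value it produced.
adminOnly :: String -> Either String String
adminOnly role = do
    _ <- requireRole "Administrator" role
    return "secret admin data"
```

If the check fails, nothing after that line runs; if it succeeds, its result is simply ignored and the rest carries on.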

There’s a bit to sift through, but this is basically what it took to get started with building a semi-complete and semi-production-ready URL bouncer in Haskell with Servant. I hope you enjoyed reading this code (and didn’t get too lost in the application specifics everywhere).

These days a lot of my Haskell code looks different – less do and more >>= (basically this is like composing functions together instead of calling them one by one on separate lines and shuffling inputs around), but trying to get into explaining >>= and Monads and all of it would most certainly make this a monad tutorial and lots more, and lord knows there are enough of those on the internet as is.
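Without turning this into that tutorial, a side-by-side of the two styles might help. The function names `shoutDo` and `shoutBind` are made up for this comparison; both do exactly the same thing.

```haskell
import Data.Char (toUpper)

-- do-notation: one step per line, naming each intermediate value.
shoutDo :: IO String -> IO String
shoutDo getInput = do
    input <- getInput
    let loud = map toUpper input
    return (loud ++ "!")

-- The same action with >>=: the result of getInput is fed straight into
-- a pipeline of composed functions, with no intermediate names.
shoutBind :: IO String -> IO String
shoutBind getInput = getInput >>= return . (++ "!") . map toUpper
```

Which one reads better is largely a matter of taste and of how many intermediate values need a name.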

To sum up

It was super fun, and the solution I got at the end doesn’t seem terrible sooooo maybe everything’s good? If everything goes terribly wrong, I’ll make sure to update this blog post!
