Serving a single file over HTTP with Rust and Go

UPDATE (03/30/2021)

A bunch of readers have submitted suggestions and changes to both the Rust and the Go code, so I've updated them and released new versions as appropriate! I added a section to the bottom of the post, so check that out. The biggest changes were in the Go-related code.

UPDATE (03/26/2021)

A reader named Pavel (Pawel) helped out on the Go implementation by using []byte and io.Copy. Much thanks to Pawel, check out the Merge Request.

UPDATE (03/22/2021)

A testament to the warmth of the Rust community: u/cramert on r/rust saw fit to bless me with some optimization advice! Turns out hyper::body::Bytes was more-or-less built for the kind of copying and serving I was doing. I've added a section to this post with the results from the change, which is part of kcup-rust version v0.1.1.

tl;dr - I wrote the same program, kcup, in both Rust and Golang (kcup-rust and kcup-go respectively). Each serves a single file over HTTP, with the intent to use them as containers for an MTA-STS /.well-known/mta-sts.txt endpoint. I also compared the performance.

Context

I’ve taken a bit of a detour on a project I’m working on (and looking to release soon!) to set up maddy for handling emails (think admin@<product>.tld, hello@<product>.tld, support@<product>.tld). I probably shouldn’t even be using maddy to begin with – I should probably just pick up Amazon Workmail or use Gandi’s 2 free emails per domain or GMail or literally anything else outside of running my own SMTP and IMAP server, but I am very clearly addicted to yak shaving.

I’ve worked on projects like postmgr (which is now defunct) in the past to try and make it easier to deploy veteran email programs like postfix and dovecot, but there are a bunch of programs that have done it far better than I have, in a far better way (by rebuilding the basics rather than trying to orchestrate existing heavyweights):

For my personal email I actually stopped running my own email server (postfix and dovecot) and switched to ProtonMail and have been very happy there (at this point I’m subscribed for multiple years), so maybe I’m getting somewhat better on the yak shaving addiction front. Yak shaving rationalization aside, there are a bunch of ways to “improve your deliverability” (make sure your emails don’t go to spam on some large email provider your customer uses) when running your own mail servers these days:

Which of these should you do to make sure your emails don’t end up in spam? All of them, and check the IP you’re using and hope it’s not on a blacklist and hope that some word you say doesn’t get picked up by a spam filter in a walled garden somewhere. Don’t get me started on how the large walled gardens are absolutely killing normal email and marking themselves as “trusted” while they send mail from correctly set up servers with security in place to the spam box.

Getting back on track, the reason this post and the programs I’m about to walk through creating exist is that to properly complete your MTA-STS support, you need to serve a file – a single file at https://mta-sts.domain.tld/.well-known/mta-sts.txt. Not just a DNS TXT record – you need to return content at that “well known” URL.
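For anyone who hasn’t seen one, the file in question is tiny. Here’s a minimal Go sketch that renders a policy in the key/value format described by RFC 8461; the renderPolicy name, the placeholder MX hosts and the max_age value are assumptions for illustration, not something kcup does:

```go
package main

import (
	"fmt"
	"strings"
)

// renderPolicy builds the text of an MTA-STS policy file (RFC 8461).
// The function name and its parameters are hypothetical, for illustration only.
func renderPolicy(mode string, mxHosts []string, maxAgeSeconds int) string {
	var b strings.Builder
	b.WriteString("version: STSv1\n")
	fmt.Fprintf(&b, "mode: %s\n", mode)
	// One "mx" line per mail host that is allowed to receive mail for the domain.
	for _, mx := range mxHosts {
		fmt.Fprintf(&b, "mx: %s\n", mx)
	}
	fmt.Fprintf(&b, "max_age: %d\n", maxAgeSeconds)
	return b.String()
}

func main() {
	// Placeholder hosts; substitute the MX hosts of your own domain.
	fmt.Print(renderPolicy("testing", []string{"mx1.domain.tld", "mx2.domain.tld"}, 86400))
}
```

The output is a handful of lines of text, which is why pulling in a whole web server for it feels like overkill.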

Things I could have done instead

The first thing most people would do (and be right) is to wire up NGINX/Caddy/Apache/anything to a folder with only one file at the given address.

I did a bit of searching as well, and found some things around the internet on serving single static files:

Turns out you can use nginx to return text itself with the return directive. All of these options would have been very very reasonable, not what I chose to do though (see yak shaving talk earlier).

Since I use Traefik, another option would have been to use a plugin that could return static content (looks like Envoy had it but not anymore?), but alas, they don’t and I’m not sure they want that use case. I could write the plugin, but I’m not keen on signing up for Traefik Pilot, though there’s some awesome plugins already up there (btw, I’m off the Ambassadors thing so no need to make disclaimers!).

The idea

Then I had a (bad) thought – why aren’t there any programs that I can download that serve literally just one file with the corresponding small performance and security footprint(s)? You could argue that a program serving just the one file is much safer (and maybe even faster?) than bringing in NGINX or a bigger tool? At first I tried to look around at Cert-Manager since I know they have to do this with the ACME HTTP01 challenge solver pods, but the docker image seemed more complicated than I needed it to be. Maybe I should just… write a program that served a single file? And that’s how I got here.

I could write this program, put that in a container, point Traefik to it and live happily ever after (this is one of those projects that I can actually complete and never touch again). It even seems simple on its face:

  • Load a single file off disk into memory
  • Send that file content whenever a GET request comes in

Of course I thought about it some more and had some more interesting (read: complex) functionality:

  • Does it matter what the incoming request is? Should it be 403 for POST, DELETE, HEAD, etc?
  • Does it matter what the URL is? Should GET /the-right-file succeed but GET /other-file 404?
  • Could I do some file watching for dynamic changes to the file?
  • Read from STDIN?

I’m going to at least control myself here and keep it simple – return the file from memory for any GET request that comes in.
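For the curious, the stricter variant I talked myself out of might look something like the sketch below: a Go handler that only answers GET on one exact path. The strictHandler and statusFor names are hypothetical, and answering other methods with 405 (rather than the 403 floated above) is just one reasonable choice:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// strictHandler serves content only for GET requests on one exact path.
// The name and the status code choices are assumptions made for this sketch.
func strictHandler(path string, content []byte) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Any other URL is a 404
		if r.URL.Path != path {
			http.NotFound(w, r)
			return
		}
		// The right URL with the wrong method is a 405
		if r.Method != http.MethodGet {
			w.Header().Set("Allow", http.MethodGet)
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "text/plain; charset=utf-8")
		w.Write(content)
	}
}

// statusFor runs one in-memory request against the handler and returns the status code.
func statusFor(method, target string) int {
	h := strictHandler("/.well-known/mta-sts.txt", []byte("version: STSv1\n"))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, target, nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(http.MethodGet, "/.well-known/mta-sts.txt"))
	fmt.Println(statusFor(http.MethodPost, "/.well-known/mta-sts.txt"))
	fmt.Println(statusFor(http.MethodGet, "/other-file"))
}
```

None of this made it into kcup; the versions below answer every GET the same way regardless of path.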

Implementing my bad idea

The first question on my mind was what I should write this in. Two great recent candidates to write this absurdly simple program in – Rust and Go. There are a few reasons:

  • Produce static binaries
  • Produce cross-platform binaries
  • Fast
  • Parallelism support (unlikely to be needed but why not)

Go might be the better choice here, because it’s honestly got pretty much everything you need in the standard library, but I like Rust so I’ll do it in Rust as well, not too hard to make another likely one-file codebase.

The next question was then what should I call it? Since it’s serving a single file, I thought something based on wasteful and formerly patent-encumbered Keurig’s K-Cup might be a good name – I’m going with kcup. The runner up was sfs for “single file server”.

So again what I need to do is write a cross-platform static binary that:

  • Takes a path to a file
  • Loads the file into memory at startup
  • Serves the file to every GET request that comes in (do we want to restrict to a very specific endpoint? probably?)

Rust

Rust I have written recently, but not so recently. The last big rust project I did was redis-bootleg-backup. It was pretty easy though – the biggest issue was trying to pick the lowest level library with the best ergonomics to do what I want to do very simply – I picked hyper. Here’s src/main.rs:

use std::{fs, io, process};
use std::path::PathBuf;
use std::io::Read;
use std::net::SocketAddr;
use std::convert::Infallible;
use std::sync::Arc;

use structopt::StructOpt;
use hyper::{Server, Request, Response, Body, Method, StatusCode};
use hyper::service::{service_fn, make_service_fn};
use env_logger::Env;
use tokio::time::{Duration, timeout};

#[derive(StructOpt, Debug)]
#[structopt(name = "kcup")]
struct KCupOpts {
    /// Host
    #[structopt(short = "h", long ="host", default_value = "127.0.0.1", env = "HOST")]
    host: String,

    /// Port
    #[structopt(short = "p", long ="port", default_value = "5000", env = "PORT")]
    port: i32,

    /// Amount of seconds to wait for input on STDIN to serve
    #[structopt(long ="stdin-read-timeout-seconds", default_value = "60", env = "STDIN_READ_TIMEOUT_SECONDS")]
    stdin_read_timeout_seconds: u64,

    /// File to read
    #[structopt(name = "FILE", short = "f", long = "file", parse(from_os_str), env = "FILE")]
    file_path: Option<PathBuf>,
}

/// Utility function for serving static content
async fn serve_static_content(
    req: Request<Body>,
    content: Arc<String>,
) -> Result<Response<Body>, Infallible> {
    match req.method() {
        // Serve the content for every GET request
        &Method::GET => Ok(
            Response::new(Body::from(format!("{}",content)))
        ),

        // All other non-GET routes are 404s
        _ => Ok(
            Response::builder()
                .status(StatusCode::NOT_FOUND)
                .body("No such resource".into())
                .unwrap()
        ),
    }
}

#[tokio::main]
async fn main() -> Result<(), std::io::Error> {
    // Initialize logger at info
    env_logger::init_from_env(
        Env::default().filter_or("LOG_LEVEL", "info")
    );

    // Parse opts
    let KCupOpts{
        host,
        port,
        stdin_read_timeout_seconds,
        file_path
    } = KCupOpts::from_args();

    // Combine host and port into an address, and parse it
    let addr = String::from(format!("{}:{}", host, port)).parse::<SocketAddr>();

    // Stop if parsing failed
    if let Err(_) = addr {
        log::error!("Failed to parse host & port combination");
        process::exit(1);
    }
    let addr = addr.unwrap(); // worry-free unwrap
    log::info!("Server configured to run @ [{}]", addr);

    let mut file_contents = String::new();

    // Attempt to read content from somewhere
    if let Some(path) = file_path {
        // Read from file path
        log::info!("Reading file from path [{}]", path.to_string_lossy());
        file_contents = fs::read_to_string(path)?;
    } else {
        // Read from STDIN
        log::info!(
            "No file path provided, waiting for input on STDIN (max {} seconds)...",
            stdin_read_timeout_seconds,
        );

        let stdin_read_task = tokio::task::spawn_blocking(move || {
            let _ = io::stdin().read_to_string(&mut file_contents);
            return file_contents
        });

        // Attempt to read from stdin for a given timeout
        match timeout(Duration::from_secs(stdin_read_timeout_seconds), stdin_read_task).await {
            Ok(Ok(contents)) => {
                file_contents = contents;
                log::info!("Successfully read input from STDIN");
            }
            _ => {
                log::error!("Failed to read from STDIN after waiting {} seconds", stdin_read_timeout_seconds);
                process::exit(1);
            }
        }
    }

    // If contents are *still* empty (no file & STDIN is empty), throw error
    if file_contents.is_empty() {
        log::error!("No file contents -- please ensure you've specified a file or fed in data via STDIN");
        process::exit(1);
    }
    log::info!("Read [{}] characters", file_contents.len());

    // Capture the file contents in an Arc so we can use the reference repeatedly
    // across async tasks that the server will spawn
    let file_contents = Arc::new(file_contents);

    // Build server
    let svc_builder = make_service_fn(move |_conn| {
        // The move & async combinations that happen in here (including the move above)
        // are a bit complicated.
        // see: https://www.fpcomplete.com/blog/ownership-puzzle-rust-async-hyper/

        // Create a name-shadowed cloned reference to the content we want to serve
        let file_contents = Arc::clone(&file_contents);

        async {
            // Create service fn
            Ok::<_, Infallible>(
                service_fn(move |req: Request<Body>| {
                    // Create another name-shadowed, cloned reference
                    // since we have moved the original clone past the service_fn boundary
                    let file_contents = Arc::clone(&file_contents);

                    serve_static_content(req, file_contents)
                })
            )
        }
    });

    // Serve the file from a server for any GET request
    log::info!("Starting server...");
    let server = Server::bind(&addr).serve(svc_builder);
    if let Err(e) = server.await {
        log::error!("Server error: {}", &e);
        eprintln!("Server error: {}", e);
    }

    Ok(())
}

While trying to make this as simple as possible, I did end up running into an issue that made the experience less than pleasant – I couldn’t navigate Arc<T> easily enough to pass data to the function being called. Whether due to my own rustiness (pun intended) or just the difficulty of Rust’s memory management paradigm, I struggled with something you’d think of as a day-two hyper operation (assuming you set up the basic server on day one): passing data in to a service function.

After a while I got it, but I spent more time than I was comfortable with bumbling around. I like Rust a lot, and normally I just brush off complaints about the complexity of the borrow/ownership model (it’s novel, it’s strict and correct, it’s going to ruffle some feathers), but not having used Rust in a few months and having to stumble around this much was not a great experience. The rest was pretty easy to get back into though, and I copied some code and Makefile targets from the bootleg backup project as well.

Golang

So I haven’t written any Golang in a long while but jumping back into it is pretty easy when all you have to write is a single-file web server, here’s cmd/kcup.go:

package main

import (
    "errors"
    "flag"
    "fmt"
    log "github.com/sirupsen/logrus"
    "io/ioutil"
    "net/http"
    "os"
    "strconv"
    "time"
)

// generateStaticServerFn returns a handler that does nothing but serve the given piece of static content
func generateStaticServerFn(content string) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        // Return 404 for all non-GET methods
        if r.Method != http.MethodGet {
            http.NotFound(w, r)
            return
        }

        // Return the content (Fprint, not Fprintf: the content must not be treated as a format string)
        fmt.Fprint(w, content)
    }
}

const (
    DEFAULT_PORT                       = 5000
    DEFAULT_STDIN_READ_TIMEOUT_SECONDS = 60
)

func main() {
    // Retrieve settings from ENV if present
    host, ok := os.LookupEnv("HOST")

    var port int
    portRaw, ok := os.LookupEnv("PORT")
    if !ok {
        port = DEFAULT_PORT
    } else {
        parsedPort, err := strconv.Atoi(portRaw)
        if err != nil {
            log.Fatal("Failed to parse port")
        }
        port = parsedPort
    }

    var stdinReadTimeoutSeconds int
    stdinReadTimeoutSecondsRaw, ok := os.LookupEnv("STDIN_READ_TIMEOUT_SECONDS")
    if !ok {
        stdinReadTimeoutSeconds = DEFAULT_STDIN_READ_TIMEOUT_SECONDS
    } else {
        parsedStdinReadTimeoutSeconds, err := strconv.Atoi(stdinReadTimeoutSecondsRaw)
        if err != nil {
            log.Fatal("Failed to parse STDIN read timeout seconds")
        }
        stdinReadTimeoutSeconds = parsedStdinReadTimeoutSeconds
    }

    filePath, ok := os.LookupEnv("FILE")

    // Retrieve settings from flags
    hostPtr := flag.String("host", "", "Host")
    portPtr := flag.Int("port", -1, "Port")
    stdinReadTimeoutSecondsPtr := flag.Int(
        "stdin-read-timeout-seconds",
        -1,
        "Amount of seconds to wait for input on STDIN to serve",
    )
    filePathPtr := flag.String("file", "", "File to read")
    flag.Parse()

    // Override ENV with flags if necessary
    if *hostPtr != "" {
        host = *hostPtr
    }

    if *portPtr != -1 {
        port = *portPtr
    }

    if *filePathPtr != "" {
        filePath = *filePathPtr
    }

    if *stdinReadTimeoutSecondsPtr != -1 {
        stdinReadTimeoutSeconds = *stdinReadTimeoutSecondsPtr
    }

    // TODO: Validate settings (ex. port has to be non-negative)

    // Build address from host and port
    addr := fmt.Sprintf("%s:%d", host, port)
    log.Info(fmt.Sprintf("Server configured to run @ [%s]", addr))

    // Create content to be filled in later
    content := ""

    // Check if filepath was provided
    if filePath != "" {
        // Load file into content

        // Check if the file exists
        if _, err := os.Stat(filePath); err != nil {
            log.Fatal(fmt.Sprintf("Failed to find file [%s]", filePath))
        }

        // Read from file
        contentBytes, err := ioutil.ReadFile(filePath)
        if err != nil {
            log.Fatal(fmt.Sprintf("Failed to read file @ [%s]", filePath))
        }

        log.Info(fmt.Sprintf("Reading file from path [%s]", filePath))
        content = string(contentBytes)
    } else {
        // Attempt to read content from STDIN

        log.Info(fmt.Sprintf("No file path provided, waiting for input on STDIN (max %d seconds)...", stdinReadTimeoutSeconds))
        // No file path is present, attempt to read from STDIN

        stdinContent, err := readStdinWithTimeout(stdinReadTimeoutSeconds)
        if err != nil {
            log.Fatal(fmt.Sprintf("Failed to read from STDIN after waiting %d seconds", stdinReadTimeoutSeconds))
        }

        content = stdinContent
    }

    // Ensure we have *some* content at this point
    if content == "" {
        log.Fatal("No file contents -- please ensure you've specified a file or fed in data via STDIN")
    }

    // Set up router that takes anything
    http.HandleFunc("/", generateStaticServerFn(string(content)))

    // Start the server
    fmt.Println("Starting the server...")
    log.Fatal(http.ListenAndServe(addr, nil))
}

func readStdinWithTimeout(timeoutSeconds int) (string, error) {
    // Create channel for timeout
    ch := make(chan int)

    var result string

    // Spawn goroutine to attempt to read STDIN
    go func() {
        stdinBytes, err := ioutil.ReadAll(os.Stdin)
        if err != nil {
            log.Fatal(fmt.Sprintf("Failed to read from STDIN after waiting %d seconds", timeoutSeconds))
        }
        result = string(stdinBytes)
        ch <- 1
    }()

    // Wait for ReadAll or timeout
    select {
    // Read STDIN
    case <-ch:
        log.Info(fmt.Sprintf("Successfully read input from STDIN"))
        return string(result), nil

        // Timeout
    case <-time.After(time.Duration(timeoutSeconds) * time.Second):
        return "", errors.New("Failed to read from STDIN")
    }

}

Yeah… this is definitely not “good” Go code (can you spot the leaked goroutine?), or even well-structured Go code in general (so many things could be factored out for better testability, oh right, there are no tests either), and there are probably way better ways to do this, but I’m not planning on writing any other Go code for now so I’m leaving it here and not optimizing much more. There’s even a TODO in there for validating input that I’m going to get to round about never (most likely). As far as the dev experience went, it was pretty smooth getting back into writing a tiny bit of Go, though I’d forgotten most of the normal paradigms/patterns. One thing that annoyed me a little bit was how ad-hoc the Go flag and ENV handling seemed compared to being able to use tools like Rust’s structopt and the paradigms available in Rust. Then again, I didn’t spend any time worrying about borrowing issues so…
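As a sketch of how the goroutine handling could be tidied up (an assumption on my part, not what kcup-go does): give the channel a buffer of one and send the result itself over it, so the goroutine can always deliver its result and exit once the read returns, even if the timeout already fired. The readWithTimeout, readFast and readStalled names are made up for this example:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
	"time"
)

// readResult carries the outcome of the read back to the caller.
type readResult struct {
	data []byte
	err  error
}

// readWithTimeout reads everything from r, giving up after timeout.
// Hypothetical helper, not part of kcup-go.
func readWithTimeout(r io.Reader, timeout time.Duration) ([]byte, error) {
	// Capacity 1 lets the goroutine deliver its result and exit even if
	// nobody is receiving anymore (i.e. after the timeout fired).
	ch := make(chan readResult, 1)
	go func() {
		data, err := io.ReadAll(r)
		ch <- readResult{data, err}
	}()

	select {
	case res := <-ch:
		return res.data, res.err
	case <-time.After(timeout):
		return nil, errors.New("timed out waiting for input")
	}
}

// readFast reads from a source that has data available immediately.
func readFast() (string, error) {
	data, err := readWithTimeout(strings.NewReader("hello"), time.Second)
	return string(data), err
}

// readStalled reads from a source that never produces anything.
func readStalled() error {
	// The write end of this pipe is never used, so reads block forever.
	pr, _ := io.Pipe()
	_, err := readWithTimeout(pr, 50*time.Millisecond)
	return err
}

func main() {
	fmt.Println(readFast())
	fmt.Println(readStalled())
}
```

The goroutine still sits in the read until the reader returns or is closed, so this only removes the stuck send (and the shared result variable), not the blocked read itself.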

Benchmark

Of course, no relatively simple technical project is complete without a gratuitous benchmark, so I got out wrk to see how the servers performed serving files of various sizes. Another thing that would probably be good to curtail/limit would be the machine size – for this I’m going to use 2 cores and 50mb cgroup powered resource limiting via docker, which is way too much for these tiny utilities.

Thinking about it a little bit, Go actually starts goroutines with an initial stack size of 2KB these days, and since I don’t expect the stacks for the goroutines to grow, I guess my goroutine concurrency would be limited by this right up until network saturation. I think what I’ll do for this is to give both Go and Rust the same amount of memory and see how they do. So the specs will be:

First the generated data which I’ll be mounting into /data inside the container:

$ du -hs /tmp/testfiles/*
100K    /tmp/testfiles/file-100K
12K     /tmp/testfiles/file-10K
1.0M    /tmp/testfiles/file-1M
$ tree /tmp/testfiles/
/tmp/testfiles/
├── file-100K
├── file-10K
└── file-1M

0 directories, 3 files

Here’s the docker command (for rust as an example):

$ docker run --detach \
-p 5001:5000 \
-e HOST=0.0.0.0 \
-e FILE=/data/file-10K \
-v /tmp/testfiles:/data \
--cpus 2 \
--memory 100mb \
--name kcup-rust \
registry.gitlab.com/mrman/kcup-rust/cli:v0.1.0

And for go:

$ docker run --detach \
-p 5002:5000 \
-e HOST=0.0.0.0 \
-e FILE=/data/file-10K \
-v /tmp/testfiles:/data \
--cpus 2 \
--memory 100mb \
--name kcup-go \
registry.gitlab.com/mrman/kcup-go/cli:v1

The command I’ll be using is right out of the wrk README, with the --latency option (5001 for rust, 5002 for golang):

$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5001/any/path/will/work # rust
$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5002/any/path/will/work # go

File size 10KB

Raw wrk output for rust and go:

$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5001/any/path/will/work # Rust
Running 30s test @ http://127.0.0.1:5001/any/path/will/work
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.70ms   26.96ms   1.01s    99.79%
    Req/Sec     7.10k   649.83    13.47k    68.70%
  Latency Distribution
     50%    4.34ms
     75%    5.71ms
     90%    7.20ms
     99%   11.36ms
  2548559 requests in 30.10s, 24.49GB read
Requests/sec:  84674.28
Transfer/sec:    833.28MB

$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5002/any/path/will/work # Go
Running 30s test @ http://127.0.0.1:5002/any/path/will/work
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    45.13ms   75.88ms   1.04s    97.85%
    Req/Sec     0.93k   588.07     5.11k    88.11%
  Latency Distribution
     50%   28.75ms
     75%   63.39ms
     90%   83.00ms
     99%  351.37ms
  335015 requests in 30.05s, 3.24GB read
Requests/sec:  11147.25
Transfer/sec:    110.33MB

In a table:

| | Rust | Golang |
|---|---|---|
| Latency 50% (ms) | 4.34 | 28.75 |
| Latency 75% (ms) | 5.71 | 63.39 |
| Latency 90% (ms) | 7.20 | 83.00 |
| Latency 99% (ms) | 11.36 | 351.37 |
| (Thread stats) Latency avg (ms) | 5.70 | 45.13 |
| (Thread stats) Latency stddev (ms) | 26.96 | 75.88 |
| (Thread stats) Latency max (ms) | 1010 | 1040 |
| Requests/sec | 84,674 | 11,147 |
| Transfer/sec (MB) | 833.28 | 110.33 |

Well that’s quite the difference between these two! Rust is beating Golang pretty handily here, though it might just be the Go code I’ve written being really bad, or the standard library being unoptimized compared to hyper. While Golang has an excellent standard library there are definitely improved versions of various packages out there in the ecosystem (I’ve used some of the alternate web servers & routers in the past) – probably doesn’t make sense to not at least use the fastest I can find in Golang land…
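Before reaching for another library, one cheap thing to try with the standard library is keeping the content as a []byte and copying it straight to the response instead of going through fmt. This is only a sketch of that idea with hypothetical names (bytesHandler, fetchBody), and no benchmark numbers are claimed for it:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// bytesHandler serves a fixed byte slice for every GET request.
func bytesHandler(content []byte) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Return 404 for all non-GET methods, like the handler above
		if r.Method != http.MethodGet {
			http.NotFound(w, r)
			return
		}
		// A fresh reader per request; the underlying bytes are shared, not copied.
		// io.Copy writes the bytes verbatim, so characters like '%' need no escaping.
		if _, err := io.Copy(w, bytes.NewReader(content)); err != nil {
			// The client most likely went away; nothing useful left to do.
			return
		}
	}
}

// fetchBody runs one in-memory GET against the handler and returns the response body.
func fetchBody(content string) string {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/any/path/will/work", nil)
	bytesHandler([]byte(content)).ServeHTTP(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Println(fetchBody("100% static"))
}
```

Wiring it up would be the same http.HandleFunc("/", ...) call as before, just with the []byte read from disk passed in directly.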

Switching to valyala/fasthttp

After like 30 seconds of searching I found valyala/fasthttp which seems to do very well (top 3 even including prefork) in the Tech Empower Benchmarks Round 20. As expected it was pretty trivial to switch.

Here are the results with fasthttp (which is v2 of kcup-go):

$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5002/any/path/will/work # Go (with fasthttp)
Running 30s test @ http://127.0.0.1:5002/any/path/will/work
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    21.43ms   90.40ms   1.32s    98.35%
    Req/Sec     3.66k     1.49k   12.26k    61.44%
  Latency Distribution
     50%    6.70ms
     75%   14.38ms
     90%   28.56ms
     99%  523.14ms
  1308339 requests in 30.07s, 12.65GB read
Requests/sec:  43514.39
Transfer/sec:    430.67MB

In a table:

| | Rust | Golang (w/ fasthttp) |
|---|---|---|
| Latency 50% (ms) | 4.34 | 6.70 |
| Latency 75% (ms) | 5.71 | 14.38 |
| Latency 90% (ms) | 7.20 | 28.56 |
| Latency 99% (ms) | 11.36 | 523.14 |
| (Thread stats) Latency avg (ms) | 5.70 | 21.43 |
| (Thread stats) Latency stddev (ms) | 26.96 | 90.40 |
| (Thread stats) Latency max (ms) | 1010 | 1320 |
| Requests/sec | 84,674 | 43,514 |
| Transfer/sec (MB) | 833.28 | 430.67 |

Muuuch better results for Go, but it looks like rust is still ahead. Let’s move on to 100KB.

File size 100KB

Raw wrk output:

$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5001/any/path/will/work # Rust
Running 30s test @ http://127.0.0.1:5001/any/path/will/work
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    33.22ms   71.41ms   1.28s    98.26%
    Req/Sec     1.32k   485.19     3.18k    69.45%
  Latency Distribution
     50%   20.10ms
     75%   36.77ms
     90%   53.87ms
     99%  195.05ms
  466302 requests in 30.10s, 44.51GB read
  Socket errors: connect 0, read 0, write 0, timeout 94
Requests/sec:  15493.35
Transfer/sec:      1.48GB

$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5002/any/path/will/work # Go (with fasthttp)
Running 30s test @ http://127.0.0.1:5002/any/path/will/work
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   371.32ms  538.45ms   2.00s    81.78%
    Req/Sec   123.40     99.36     1.68k    88.19%
  Latency Distribution
     50%   82.34ms
     75%  590.48ms
     90%    1.34s
     99%    1.90s
  43736 requests in 30.09s, 4.18GB read
  Socket errors: connect 0, read 0, write 0, timeout 1654
Requests/sec:   1453.30
Transfer/sec:    142.12MB

In a table:

| | Rust | Golang (w/ fasthttp) |
|---|---|---|
| Latency 50% (ms) | 20.10 | 82.34 |
| Latency 75% (ms) | 36.77 | 590.48 |
| Latency 90% (ms) | 53.87 | 1340 |
| Latency 99% (ms) | 195.05 | 1900 |
| (Thread stats) Latency avg (ms) | 33.22 | 371.32 |
| (Thread stats) Latency stddev (ms) | 71.41 | 538.45 |
| (Thread stats) Latency max (ms) | 1280 | 2000 |
| Requests/sec | 15,493 | 1,453 |
| Transfer/sec (MB) | 1480 | 142.12 |

That is a huge divergence in performance with the 100K file. Rust is handily beating Go now, and also transferring more data across the wire. I’m going to resist the urge to try and dig into why Golang is going slower; I want to see how the languages handle this without a lot of tuning. If you’re a Go expert and are sitting in your chair right now fuming at how wrong I’m doing it, please feel free to send me an email and I’ll update this article with at least what I could have done to optimize. For now let’s move on to some even bigger files, which is probably going to be bad news bears for Go.

File size 1MB

Raw wrk output:

$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5001/any/path/will/work # Rust
Running 30s test @ http://127.0.0.1:5001/any/path/will/work
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   279.42ms  345.34ms   1.99s    91.04%
    Req/Sec    55.29     86.85   530.00     91.76%
  Latency Distribution
     50%  181.57ms
     75%  304.86ms
     90%  587.11ms
     99%    1.61s
  1046 requests in 30.09s, 1.10GB read
  Socket errors: connect 0, read 1432, write 14179153, timeout 42
Requests/sec:     34.76
Transfer/sec:     37.34MB

$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5002/any/path/will/work # Go (with fasthttp)
Running 30s test @ http://127.0.0.1:5002/any/path/will/work
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   359.92ms  420.75ms   2.00s    85.96%
    Req/Sec    18.15     54.69   510.00     96.94%
  Latency Distribution
     50%  220.98ms
     75%  528.65ms
     90%    1.03s
     99%    1.53s
  235 requests in 30.09s, 237.03MB read
  Socket errors: connect 0, read 1302, write 13538217, timeout 7
Requests/sec:      7.81
Transfer/sec:      7.88MB

In a table:

| | Rust | Golang (w/ fasthttp) |
|---|---|---|
| Latency 50% (ms) | 181.57 | 220.98 |
| Latency 75% (ms) | 304.86 | 528.65 |
| Latency 90% (ms) | 587.11 | 1030 |
| Latency 99% (ms) | 1610 | 1530 |
| (Thread stats) Latency avg (ms) | 279.42 | 359.92 |
| (Thread stats) Latency stddev (ms) | 345.34 | 420.75 |
| (Thread stats) Latency max (ms) | 1990 | 2000 |
| Requests/sec | 34.76 | 7.81 |
| Transfer/sec (MB) | 37.34 | 7.88 |

There it is – Rust has hit the large file wall (again, I’m not going to optimize this; I suspect raising the provided resources should be good enough to up the performance), and Go continues to pound against it. At the edges they perform similarly, but over the whole range it looks like Rust performs better – at the 90th percentile Rust responds in 587ms versus 1.03s for Go, which is much better for a larger segment of people receiving the files.

Making this bad idea available as a Docker Image

Since DockerHub stopped giving away quite so much bandwidth, it seems likely that the free-loading community (of which I am a happy member) will move elsewhere. For now, I’m going to upload the images for kcup to each project’s GitLab registry (and skip AWS’s ECR Public for now). GitLab has had free public-facing (and private) container registries bundled forever, so I’ll put them there and call it a day so this post doesn’t get any longer.

You can find the containers here (kcup-rust and kcup-go)

| Repository | Rust | Golang |
|---|---|---|
| GitLab | registry.gitlab.com/mrman/kcup-rust/cli:v0.1.0 | registry.gitlab.com/mrman/kcup-go/cli:v1 |

BONUS: Automatically pushing with CI

GitLab makes it pretty easy to push container images from CI to your project’s registry, but maybe some people don’t know that, so here’s a snippet of my .gitlab-ci.yml you can peruse:

# Publish pre-release images (image:vX.X.X-<hash>) on release-vX.X.X branches
publish_pre_release_image:
  stage: image+publish
  image: docker
  services:
    - docker:dind
  only:
    - /release-v[0-9|\.]+/
  script:
    - apk add make perl musl-dev openssl-dev
    - docker login -u gitlab-ci-token --password $CI_BUILD_TOKEN registry.gitlab.com
    - make image image-publish

# Only publish release images (image:vX.X.X) on release tags
publish_release_image:
  stage: image+publish
  image: docker
  services:
    - docker:dind
  only:
    - /v[0-9|\.]+/
  except:
    - branches
  script:
    - apk add make perl musl-dev openssl-dev
    - docker login -u gitlab-ci-token --password $CI_BUILD_TOKEN registry.gitlab.com
    - make image image-publish image-release

Of course you need the Makefile targets (you can find these in the GitLab repos) to make this easy (perl is in there as a requirement for one of the makefile targets) but all in all, pretty easy thanks to the Docker in docker service.

Also, don’t forget to set your GitLab CI/CD settings for image tag cleanup!

BONUS: Automatically tagging post-release versions with CI

There are some CI goodies in both the repositories – as I’ve done my usual thing and set up automatic new-version tagging. To get this working you need a few things:

  • Makefile (or other scripting tool) targets that:
    • print out the project’s current version (ex. make get-version)
    • build the project in “release” mode
    • prep for a release (ex. generate documentation, output swagger.json, update changelog, write the version’s git commit)
    • do the actual “release”, which is really just a git push for this new tagged commit to your repo
  • An SSH identity with write permissions for your CI “robot” (you can store it in-repo with tools like git-crypt or sops)
  • The public key of the CI robot uploaded as a GitLab Deploy Key
  • The private key of the CI robot uploaded as a “File”-type GitLab CI Variable (after you’re sure it works, make it a protected secret variable)

NOTE make is an awesome tool – having these things as make targets makes it really easy to run them manually or in CI, and will work generally cross-project as make is usually widely available/a light install.

After you have all these things, you should be able to do the following (you may or may not need to install more “missing packages”) in .gitlab-ci.yml:

# Tag new versions whenever a new version (bumped in a release-vX.X.X branch) is merged in
# The vast majority of the time this step will do nothing
tag_new_version:
  stage: extra
  only:
    - main
  script:
    # Testing for missing packages
    - apt-get install -y openssh-client git
    # Install & setup SSH
    - mkdir ~/.ssh && chmod 700 ~/.ssh && touch ~/.ssh/known_hosts
    - eval $(ssh-agent -s)
    - ssh-keyscan -t rsa gitlab.com >> ~/.ssh/known_hosts
    # Load CI SSH key
    - chmod 700 $CI_DEPLOY_PRIVATE_KEY
    - ssh-add $CI_DEPLOY_PRIVATE_KEY
    # Add gitlab remote
    - git remote add gitlab git@gitlab.com:<username or group>/<repository name>.git
    # Get version, exit early if tag already exists
    - export VERSION=v`make get-version`
    - export VERSION_TAG_EXISTS=$(git ls-remote --tags gitlab | grep "$VERSION" | wc -l)
    - test "$VERSION_TAG_EXISTS" -ge 1 && exit 0 # exit early if the version tag already exists
    # Set robot git identity
    - git config user.email "email+ci-robot@domain.tld"
    - git config user.name "CI"
    # Build, prep, and publish the release to the remote
    - make build-release release-prep
    - make release-publish GIT_REMOTE=gitlab

I’ve shared variations of this stuff before, but figured some peo

BONUS: Switching to hyper::body::Bytes

After I put the post up on Reddit, u/cramert offered up an optimization suggestion: hyper::body::Bytes! Turns out hyper::body::Bytes was built to do exactly what I was trying to do:

Bytes is an efficient container for storing and operating on contiguous slices of memory. It is intended for use primarily in networking code, but could have applications elsewhere as well.

Bytes values facilitate zero-copy network programming by allowing multiple Bytes objects to point to the same underlying memory.

As soon as I saw the comment I got typing and made a v0.1.1 release (I really should add some tests so I don’t have to test manually every time…), and here is what the results look like for the file sizes we tested earlier. The changes offered an improvement in raw performance and a huge improvement (reduction) in variance. The gains grew with file size, from roughly 5-10% at 10K to 2x+ at 1M! Huge thanks to u/cramert.

10K file

About 5% better on the latency percentiles and requests per second here (and over 20% better on average latency), which is pretty awesome…

| Metric | Rust (Arc<String>) | Rust (hyper::body::Bytes) |
|---|---|---|
| Latency 50% (ms) | 4.34 | 4.12 |
| Latency 75% (ms) | 5.71 | 5.42 |
| Latency 90% (ms) | 7.20 | 6.90 |
| Latency 99% (ms) | 11.36 | 10.71 |
| (Thread stats) Latency avg (ms) | 5.70 | 4.42 |
| (Thread stats) Latency stddev (ms) | 26.96 | 1.97 |
| (Thread stats) Latency max (ms) | 1010 | 24.26 |
| Requests/sec | 84,674 | 88,690 |
| Transfer/sec (MB) | 833.28 | 850 |

100K file

Improvements starting to snowball…

| Metric | Rust (Arc<String>) | Rust (hyper::body::Bytes) |
|---|---|---|
| Latency 50% (ms) | 20.10 | 14.24 |
| Latency 75% (ms) | 36.77 | 18.80 |
| Latency 90% (ms) | 53.87 | 23.92 |
| Latency 99% (ms) | 195.05 | 36.70 |
| (Thread stats) Latency avg (ms) | 33.22 | 15.23 |
| (Thread stats) Latency stddev (ms) | 71.41 | 6.81 |
| (Thread stats) Latency max (ms) | 1280 | 105.82 |
| Requests/sec | 15,493 | 25,842 |
| Transfer/sec (MB) | 1480 | 2470 |

1M file

2x+ improvement across the board here!

| Metric | Rust (Arc<String>) | Rust (hyper::body::Bytes) |
|---|---|---|
| Latency 50% (ms) | 181.57 | 97.53 |
| Latency 75% (ms) | 304.86 | 129.07 |
| Latency 90% (ms) | 587.11 | 165.85 |
| Latency 99% (ms) | 1610 | 251.38 |
| (Thread stats) Latency avg (ms) | 279.42 | 105.53 |
| (Thread stats) Latency stddev (ms) | 345.34 | 46.45 |
| (Thread stats) Latency max (ms) | 1990 | 511.01 |
| Requests/sec | 34.76 | 3,773 |
| Transfer/sec (MB) | 37.34 | 3690 |
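To make the "gets better as the file size increases" trend concrete, here is a small sketch that turns the median (50%) latency rows from the three tables above into speedup ratios. The speedup helper and the choice of the median row are illustrative only; other rows give different ratios.

```go
package main

import "fmt"

// speedup reports how many times faster the new measurement is than the old
// one for a "lower is better" metric such as latency.
func speedup(oldMs, newMs float64) float64 {
	return oldMs / newMs
}

func main() {
	// Median (50%) latencies in ms, copied from the 10K, 100K and 1M tables.
	rows := []struct {
		size      string
		arcString float64 // Rust (Arc<String>)
		bytes     float64 // Rust (hyper::body::Bytes)
	}{
		{"10K", 4.34, 4.12},
		{"100K", 20.10, 14.24},
		{"1M", 181.57, 97.53},
	}

	for _, r := range rows {
		fmt.Printf("%-4s median latency speedup: %.2fx\n", r.size, speedup(r.arcString, r.bytes))
	}
}
```

By this measure the ratios come out to roughly 1.05x, 1.41x and 1.86x for 10K, 100K and 1M respectively.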

UPDATE: lazy_static with kcup-rust, kcup-go memory issues and optimizations

Improving kcup-rust by trying lazy_static

kcup-rust is pretty nice already, but there was a good suggestion that I captured in a ticket and that’s worth looking into, so I took some time to try it out.

You can find [the changes on the issue’s MR](). These changes were released into [v0.1.2 of kcup-rust]() (also uploaded to cargo).

Locally, without running in a container, the 1M results pre-lazy_static look like this:

$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5000/any/path/will/work
Running 30s test @ http://127.0.0.1:5000/any/path/will/work
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    22.77ms   12.14ms 103.85ms   68.04%
    Req/Sec     1.03k   213.09     1.65k    67.47%
  Latency Distribution
     50%   20.81ms
     75%   30.29ms
     90%   39.64ms
     99%   56.57ms
  366497 requests in 30.09s, 357.94GB read
Requests/sec:  12180.93
Transfer/sec:     11.90GB

After the change, they look like this:

$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5000/any/path/will/work
Running 30s test @ http://127.0.0.1:5000/any/path/will/work
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    22.60ms   12.21ms 107.00ms   68.37%
    Req/Sec     1.03k   217.25     1.70k    68.60%
  Latency Distribution
     50%   20.54ms
     75%   30.13ms
     90%   39.51ms
     99%   57.06ms
  366589 requests in 30.10s, 358.02GB read
Requests/sec:  12177.67
Transfer/sec:     11.89GB

There isn’t enough of a change in these stats to justify moving to lazy_static so for now I’m going to keep the code unchanged. Thanks very much to /u/FormalFerret for the suggestion though!

Testing kcup-rust against miniserve

Reddit user /u/vlmutolo brought up miniserve which is a competitor in the space. I often profess that an indicator of library quality is comparison to competitors in the space on a README, and thanks to /u/vlmutolo, I have that! kcup is so small in scope that it performs about 2x as well:

| Metric | kcup | miniserve |
|---|---|---|
| Latency 50% (ms) | 21.09 | 83.94 |
| Latency 75% (ms) | 31.04 | 95.73 |
| Latency 90% (ms) | 40.81 | 107.23 |
| Latency 99% (ms) | 58.54 | 129.38 |
| (Thread stats) Latency avg (ms) | 23.29 | 84.94 |
| (Thread stats) Latency stddev (ms) | 12.57 | 17.11 |
| (Thread stats) Latency max (ms) | 120.18 | 182.87 |
| Requests/sec | 11,900 | 4,599 |
| Transfer/sec (MB) | 1162 | 450 |

I updated the README to point this out.

Improving kcup-go: one step forward and two steps back

Thanks to Pawel’s help with the []byte changes, I started taking another look at running the Golang executable when I noticed something funny… Golang seemingly has no way to limit its own memory usage, so the resource-constrained container would just… get OOM killed. From what I can find, Go leaves this up to the operating system via ulimit (reasonable choice, I guess?), and it looks like I can do this within docker with --ulimit memlock=<number>… Shout out to docker stats, though, for making it really easy to see all this; I was getting very confused as to why the kcup-go container was disappearing until it dawned on me that the container might be getting OOM killed. My next thought was “surely Golang doesn’t let you just… run out of memory???”, given how many people seem to be OK carrying it to production. No one liked messing with Java’s ~20 -XxXxVaRiABlEs (yes, this is hyperbole), but Java surely wouldn’t let you fall through a hole like this (except for that time when it wasn’t aware of containers, of course).

I didn’t notice this early on because kcup-go:v1 used net/http and was pretty good with its memory usage. v2 and v2.1 are actually dead within seconds, as memory usage basically goes from 0 to ~99MB and then the OOM killer steps in. I only have myself to blame (how did I not notice v2 just not existing anymore? I guess not writing E2E tests has come back to bite me even faster than I thought), but I’m still pretty flabbergasted at Golang here. I’ll try to share what I think without being too abrasive:

  • Am I terrible at Go? I must be because this feels like a glaring issue that would just bite people in production until they have very precisely provisioned (or sloppily over-provisioned) containers, and surely this isn’t how people are living? Is setting ulimit memlock like day 2 production Golang skills? Is everyone just watching restarts (let’s say docker/kubernetes is your platform) and tuning accordingly?
  • Did I make a mistake picking fasthttp? I didn’t change the code significantly (I thought) but have gotten myself some pretty unwanted memory characteristics. I wrote just about the simplest handler I can think of with fasthttp and it’s essentially a memory hog now?

Don’t want to be that guy (just kidding, I definitely do), but kcup-rust never had any of these issues: with the same workload and a memory budget of 100MB it went from around 9MB to 70MB and did ~2x+ better anyway. Still, to keep this at least something close to a fair comparison, I have to put more effort into the Go stuff and try to get it right.
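On the question above of whether Go can limit its own memory: Go 1.19, which was released after this post, added a soft memory limit that can be set with the GOMEMLIMIT environment variable or with runtime/debug.SetMemoryLimit. It is a soft limit: the garbage collector works harder to stay under it, but the runtime will go over it rather than fail an allocation, so a container can still be OOM killed if the live data simply does not fit. A minimal sketch, assuming Go 1.19+ and a hypothetical 90MB budget under the 100MB container limit (capMemory is an illustrative helper, not part of kcup-go):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// capMemory sets a soft memory limit (in bytes) for the Go runtime and
// returns the limit that is in effect afterwards.
func capMemory(limitBytes int64) int64 {
	debug.SetMemoryLimit(limitBytes)
	// A negative argument leaves the limit unchanged and only reports it.
	return debug.SetMemoryLimit(-1)
}

func main() {
	// Hypothetical budget: leave some headroom under a 100MB container limit.
	const budget = 90 << 20
	fmt.Printf("soft memory limit: %d bytes\n", capMemory(budget))
}
```

The same limit can be applied without touching the code by setting GOMEMLIMIT=90MiB in the container’s environment.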

Fixing fasthttp

Thanks again to Pawel for his help putting together a PR that fixes the fasthttp version used on 2.x. I merged that and released it as [version v2.2]().

Memory usage stayed under control for the first couple runs so I took those numbers (again, Go doesn’t really police itself on that front, and isn’t quite efficient enough to be able to repeat this without slight increases in memory usage), but Go did much better than it has in the past:

$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5001/any/path/will/work
Running 30s test @ http://127.0.0.1:5001/any/path/will/work
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   100.11ms   50.37ms   1.16s    79.57%
    Req/Sec   335.51     68.97     0.93k    73.37%
  Latency Distribution
     50%   91.22ms
     75%  121.58ms
     90%  157.83ms
     99%  245.12ms
  120494 requests in 30.09s, 117.72GB read
Requests/sec:   4004.16
Transfer/sec:      3.91GB

| Metric | Golang (naive fasthttp) | Golang (efficient fasthttp) |
|---|---|---|
| Latency 50% (ms) | 220.98 | 91.22 |
| Latency 75% (ms) | 528.65 | 121.58 |
| Latency 90% (ms) | 1030 | 157.83 |
| Latency 99% (ms) | 1530 | 245.12 |
| (Thread stats) Latency avg (ms) | 359.92 | 100.11 |
| (Thread stats) Latency stddev (ms) | 420.75 | 50.37 |
| (Thread stats) Latency max (ms) | 2000 | 1160 |
| Requests/sec | 7.81 | 4,004 |
| Transfer/sec (MB) | 7.88 | 3910 |

This looks a lot better – better memory efficiency and much better performance.

Using [fast]http.ServeFile

Another approach that was suggested was to use [fast]http.ServeFile instead of trying to serve the bytes ourselves, and that approach looked like it would actually be better (and definitely conceptually simpler). Unfortunately, this doesn’t support the STDIN approach, but I wanted to at least benchmark it, so it’s included as well. At first glance I don’t think I can use http.ServeFile for a production release because I’d essentially have to read STDIN and then drop it in a temporary file, but maybe the interfaces are loose enough that I can use a Buffer instead.

Well, it turns out I can’t use a Buffer (the path is a string or []byte, and has to be a file path), and the results actually got worse.

| Metric | Golang (efficient fasthttp) | Golang (fasthttp.ServeFile) |
|---|---|---|
| Latency 50% (ms) | 91.22 | 219.10 |
| Latency 75% (ms) | 121.58 | 354.15 |
| Latency 90% (ms) | 157.83 | 525.86 |
| Latency 99% (ms) | 245.12 | 1070 |
| (Thread stats) Latency avg (ms) | 100.11 | 275.46 |
| (Thread stats) Latency stddev (ms) | 50.37 | 209.19 |
| (Thread stats) Latency max (ms) | 1160 | 1900 |
| Requests/sec | 4,004 | 1,576 |
| Transfer/sec (MB) | 3910 | 1540 |

Well I don’t have anything good to say here so I won’t say anything at all. Just kidding – “Is this your king????” Well anyway, please someone reach out if the code that I tried to use for fasthttp.ServeFile is just super wrong in some non-obvious way that I can’t see. At this point I’m surprised it’s even possible to make this many subtle mistakes in Go, or maybe it’s more that Go really really needs lots of memory to at least be present to perform well.

Wrapup

Well, I’ve scratched my yak shaving itch for the week. Hopefully this was fun to read, and maybe this project might even be worth using for others out there. If you ever find yourself with the need to serve a single file, and want to do it without using feature-packed software like NGINX or others, maybe kcup is for you.

Now I can finally get back to doing real work!