A bunch of readers have submitted suggestions and changes to both the Rust and the Go code, so I've updated both and released new versions as appropriate! I added a section to the bottom of the post, so check that out. The biggest changes were in the Go-related code.
A reader named Pavel (Pawel) helped out on the Go implementation by using `[]byte` and `io.Copy`. Many thanks to Pawel; check out the Merge Request.
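For anyone curious what the `[]byte` plus `io.Copy` idea looks like, here is a minimal sketch. It is not Pawel's actual merge request; the names `staticBytesHandler` and `fetch` are made up for illustration, and it sticks to the standard library. The point is that the content stays a byte slice and each request just streams from a cheap reader over that same slice.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// staticBytesHandler (hypothetical name) serves the same in-memory bytes for
// every GET request. bytes.NewReader wraps the shared slice without copying
// it, and io.Copy streams it to the response writer.
func staticBytesHandler(content []byte) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Same behaviour as the post: anything that is not a GET is a 404
		if r.Method != http.MethodGet {
			http.NotFound(w, r)
			return
		}
		// A write error here usually means the client went away, so ignore it
		_, _ = io.Copy(w, bytes.NewReader(content))
	}
}

// fetch runs one request against a handler in-process (no real socket)
// and returns the status code and response body.
func fetch(h http.Handler, method, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	h := staticBytesHandler([]byte("hello from kcup\n"))
	code, body := fetch(h, http.MethodGet, "/any/path/will/work")
	fmt.Println(code, body)
}
```

The reader is created per request because a `bytes.Reader` keeps its own read position, while the underlying slice is shared by all of them.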
A testament to the warmth of the Rust community: u/cramert on r/rust saw fit to bless me with some optimization advice! Turns out `hyper::body::Bytes` was more-or-less built for the kind of copying and serving I was doing. I've added a section to this post with the results from the change, which is part of `kcup-rust` version `v0.1.1`.
tl;dr - I wrote a program called `kcup` in both Rust and Golang (`kcup-rust` and `kcup-go` respectively), which serves a single file over the internet, with the intent of using it as a container for an MTA-STS `/.well-known/mta-sts.txt` endpoint. I also compared the performance.
I’ve taken a bit of a detour on a project I’m working on (and looking to release soon!) to set up `maddy` for handling emails (think `admin@<product>.tld`, `hello@<product>.tld`, `support@<product>.tld`). I probably shouldn’t even be using `maddy` to begin with – I should probably just pick up Amazon Workmail or use Gandi’s 2 free emails per domain or GMail or literally anything else outside of running my own SMTP and IMAP server, but I am very clearly addicted to yak shaving.
I’ve worked on projects like postmgr (which is now defunct) in the past to try and make it easier to deploy veteran email programs like `postfix` and `dovecot`, but there are a bunch of programs that have done it far better than I have, in a far better way (by rebuilding the basics rather than trying to orchestrate existing heavyweights):
For my personal email I actually stopped running my own email server (`postfix` and `dovecot`) and switched to ProtonMail, and have been very happy there (at this point I’m subscribed for multiple years), so maybe I’m getting somewhat better on the yak shaving addiction front. Yak shaving rationalization aside, there are a bunch of ways to “improve your deliverability” (make sure your emails don’t go to spam on some large email provider your customer uses) when running your own mail servers these days:
Which of these should you do to make sure your emails don’t end up in spam? All of them, and check the IP you’re using and hope it’s not on a blacklist and hope that some word you say doesn’t get picked up by a spam filter in a walled garden somewhere. Don’t get me started on how the large walled gardens are absolutely killing normal email and marking themselves as “trusted” while they send mail from correctly set up servers with security in place to the spam box.
Getting back on track, the reason this post and the programs I’m about to walk through creating exist is that to properly complete your MTA-STS support, you need to serve a file – a single file at `https://mta-sts.domain.tld/.well-known/mta-sts.txt`. Not a DNS TXT record – you need to return content at that “well known” URL.
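To make that concrete, here is a rough sketch of serving a policy at exactly that path and nothing else, using only Go's standard library. The policy text follows the `key: value` layout from the MTA-STS spec (RFC 8461), but the `mx` host and `max_age` are invented for illustration, and `newPolicyMux` and `fetch` are hypothetical names. It is also stricter than what `kcup` ends up doing, and it assumes TLS is terminated by something in front of it.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Hypothetical policy for illustration only: the mx host and max_age are made up.
const examplePolicy = "version: STSv1\nmode: testing\nmx: mail.example.com\nmax_age: 86400\n"

// The well-known path that MTA-STS clients request.
const policyPath = "/.well-known/mta-sts.txt"

// newPolicyMux (hypothetical name) returns a mux that serves the policy at
// the well-known path and 404s for every other path.
func newPolicyMux(policy string) *http.ServeMux {
	mux := http.NewServeMux()
	// A pattern without a trailing slash only matches this exact path
	mux.HandleFunc(policyPath, func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "text/plain; charset=utf-8")
		fmt.Fprint(w, policy)
	})
	return mux
}

// fetch runs one in-process request and returns the status code and body.
func fetch(h http.Handler, method, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	mux := newPolicyMux(examplePolicy)
	code, body := fetch(mux, http.MethodGet, policyPath)
	fmt.Println(code)
	fmt.Print(body)
	code, _ = fetch(mux, http.MethodGet, "/other-file")
	fmt.Println(code)
}
```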
The first thing most people would do (and be right) is to wire up NGINX/Caddy/Apache/anything to a folder with only one file at the given address.
I did a bit of searching as well, and found some things around the internet on serving single static files:
Turns out you can use nginx to return text itself with the `return` directive. All of these options would have been very reasonable, just not what I chose to do (see yak shaving talk earlier).
Since I use Traefik, another option would have been to use a plugin that could return static content (looks like Envoy had it but not anymore?), but alas, they don’t and I’m not sure they want that use case. I could write the plugin, but I’m not keen on signing up for Traefik Pilot, though there’s some awesome plugins already up there (btw, I’m off the Ambassadors thing so no need to make disclaimers!).
Then I had a (bad) thought – why aren’t there any programs that I can download that serve literally just one file with the corresponding small performance and security footprint(s)? You could argue that a program serving just the one file is much safer (and maybe even faster?) than bringing in NGINX or a bigger tool? At first I tried to look around at Cert-Manager since I know they have to do this with the ACME HTTP01 challenge solver pods, but the docker image seemed more complicated than I needed it to be. Maybe I should just… write a program that served a single file? And that’s how I got here.
I could write this program, put that in a container, point Traefik to it and live happily ever after (this is one of those projects that I can actually complete and never touch again). It even seems simple on its face:
Of course I thought about it some more and had some more interesting (read: complex) functionality:
`POST`, `DELETE`, `HEAD`, etc?
`GET /the-right-file` succeed but `GET /other-file` 404s?
I’m going to at least control myself here and keep it simple – return the file from memory for any GET request that comes in.
The first question on my mind was what I should write this in. Two great recent candidates to write this absurdly simple program in – Rust and Go. There are a few reasons:
Go might be the better choice here, because it’s honestly got pretty much everything you need in the standard library, but I like Rust so I’ll do it in Rust as well, not too hard to make another likely one-file codebase.
The next question was then what should I call it? Since it’s serving a single file, I thought something based on wasteful and formerly patent-encumbered Keurig’s K-Cup might be a good name – I’m going with `kcup`. The runner up was `sfs` for “single file server”.
So again what I need to do is write a cross-platform static binary that:
I have written Rust fairly recently, but not that recently. The last big rust project I did was `redis-bootleg-backup`. It was pretty easy though – the biggest issue was trying to pick the lowest level library with the best ergonomics to do what I want to do very simply – I picked `hyper`. Here’s `src/main.rs`:
use std::{fs, io, process};
use std::path::PathBuf;
use std::io::Read;
use std::net::SocketAddr;
use std::convert::Infallible;
use std::sync::Arc;
use structopt::StructOpt;
use hyper::{Server, Request, Response, Body, Method, StatusCode};
use hyper::service::{service_fn, make_service_fn};
use env_logger::Env;
use tokio::time::{Duration, timeout};
#[derive(StructOpt, Debug)]
#[structopt(name = "kcup")]
struct KCupOpts {
/// Host
#[structopt(short = "h", long ="host", default_value = "127.0.0.1", env = "HOST")]
host: String,
/// Port
#[structopt(short = "p", long ="port", default_value = "5000", env = "PORT")]
port: i32,
/// Amount of seconds to wait for input on STDIN to serve
#[structopt(long ="stdin-read-timeout-seconds", default_value = "60", env = "STDIN_READ_TIMEOUT_SECONDS")]
stdin_read_timeout_seconds: u64,
/// File to read
#[structopt(name = "FILE", short = "f", long = "file", parse(from_os_str), env = "FILE")]
file_path: Option<PathBuf>,
}
/// Utility function for serving static content
async fn serve_static_content(
req: Request<Body>,
content: Arc<String>,
) -> Result<Response<Body>, Infallible> {
match req.method() {
// Serve the content for every GET request
&Method::GET => Ok(
Response::new(Body::from(format!("{}",content)))
),
// All other non-GET routes are 404s
_ => Ok(
Response::builder()
.status(StatusCode::NOT_FOUND)
.body("No such resource".into())
.unwrap()
),
}
}
#[tokio::main]
async fn main() -> Result<(), std::io::Error> {
// Initialize logger at info
env_logger::init_from_env(
Env::default().filter_or("LOG_LEVEL", "info")
);
// Parse opts
let KCupOpts{
host,
port,
stdin_read_timeout_seconds,
file_path
} = KCupOpts::from_args();
// Combine host and port into an address, and parse it
let addr = String::from(format!("{}:{}", host, port)).parse::<SocketAddr>();
// Stop if parsing failed
if let Err(_) = addr {
log::error!("Failed to parse host & port combination");
process::exit(1);
}
let addr = addr.unwrap(); // worry-free unwrap
log::info!("Server configured to run @ [{}]", addr);
let mut file_contents = String::new();
// Attempt to read content from somewhere
if let Some(path) = file_path {
// Read from file path
log::info!("Reading file from path [{}]", path.to_string_lossy());
file_contents = fs::read_to_string(path)?;
} else {
// Read from STDIN
log::info!(
"No file path provided, waiting for input on STDIN (max {} seconds)...",
stdin_read_timeout_seconds,
);
let stdin_read_task = tokio::task::spawn_blocking(move || {
let _ = io::stdin().read_to_string(&mut file_contents);
return file_contents
});
// Attempt to read from stdin for a given timeout
match timeout(Duration::from_secs(stdin_read_timeout_seconds), stdin_read_task).await {
Ok(Ok(contents)) => {
file_contents = contents;
log::info!("Successfully read input from STDIN");
}
_ => {
log::error!("Failed to read from STDIN after waiting {} seconds", stdin_read_timeout_seconds);
process::exit(1);
}
}
}
// If contents are *still* empty (no file & STDIN is empty), throw error
if file_contents.is_empty() {
log::error!("No file contents -- please ensure you've specified a file or fed in data via STDIN");
process::exit(1);
}
log::info!("Read [{}] bytes", file_contents.len());
// Capture the file contents in an Arc so we can use the reference repeatedly
// across async tasks that the server will spawn
let file_contents = Arc::new(file_contents);
// Build server
let svc_builder = make_service_fn(move |_conn| {
// The move & async combinations that happen in here (including the move above)
// are a bit complicated.
// see: https://www.fpcomplete.com/blog/ownership-puzzle-rust-async-hyper/
// Create a name-shadowed cloned reference to the content we want to serve
let file_contents = Arc::clone(&file_contents);
async {
// Create service fn
Ok::<_, Infallible>(
service_fn(move |req: Request<Body>| {
// Create another name-shadowed, cloned reference
// since we have moved the original clone past the service_fn boundary
let file_contents = Arc::clone(&file_contents);
serve_static_content(req, file_contents)
})
)
}
});
// Serve the file from a server for any GET request
log::info!("Starting server...");
let server = Server::bind(&addr).serve(svc_builder);
if let Err(e) = server.await {
log::error!("Server error: {}", &e);
eprintln!("Server error: {}", e);
}
Ok(())
}
While trying to make this as simple as possible, I did end up running into an issue that made the experience less than pleasant – I couldn’t navigate `Arc<T>` easily enough to pass data to the function being called. Whether due to my own rustiness (pun intended) or just the difficulty of rust’s memory management paradigm, I struggled with something you’d think of as one of `hyper`’s day-two operations (assuming you set up the basic server on day one), being able to pass in data to a service function:
After a while I got it, but I spent more time than I was comfortable with bumbling around. I like rust a lot, and normally I just brush off complaints about the complexity of the borrow/ownership model (it’s novel, it’s strict and correct, it’s going to ruffle some feathers), but not having used rust in a few months and having to stumble around this much was not a great experience. The rest was pretty easy to get back into though; I copied some code and `Makefile` targets from the bootleg backup project as well.
So I haven’t written any Golang in a long while, but jumping back into it is pretty easy when all you have to write is a single-file web server. Here’s `cmd/kcup.go`:
package main
import (
"errors"
"flag"
"fmt"
log "github.com/sirupsen/logrus"
"io/ioutil"
"net/http"
"os"
"strconv"
"time"
)
// Generate a function that does nothing but return the given piece of static content
func generateStaticServerFn(content string) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
// Return 404 for all non-GET methods
if r.Method != http.MethodGet {
http.NotFound(w, r)
return
}
// Return the content
fmt.Fprint(w, content)
}
}
const (
DEFAULT_PORT = 5000
DEFAULT_STDIN_READ_TIMEOUT_SECONDS = 60
)
func main() {
// Retrieve settings from ENV if present
host, ok := os.LookupEnv("HOST")
var port int
portRaw, ok := os.LookupEnv("PORT")
if !ok {
port = DEFAULT_PORT
} else {
parsedPort, err := strconv.Atoi(portRaw)
if err != nil {
log.Fatal("Failed to parse port")
}
port = parsedPort
}
var stdinReadTimeoutSeconds int
stdinReadTimeoutSecondsRaw, ok := os.LookupEnv("STDIN_READ_TIMEOUT_SECONDS")
if !ok {
stdinReadTimeoutSeconds = DEFAULT_STDIN_READ_TIMEOUT_SECONDS
} else {
parsedStdinReadTimeoutSeconds, err := strconv.Atoi(stdinReadTimeoutSecondsRaw)
if err != nil {
log.Fatal("Failed to parse STDIN read timeout seconds")
}
stdinReadTimeoutSeconds = parsedStdinReadTimeoutSeconds
}
filePath, ok := os.LookupEnv("FILE")
// Retrieve settings from flags
hostPtr := flag.String("host", "", "Host")
portPtr := flag.Int("port", -1, "Port")
stdinReadTimeoutSecondsPtr := flag.Int(
"stdin-read-timeout-seconds",
-1,
"Amount of seconds to wait for input on STDIN to serve",
)
filePathPtr := flag.String("file", "", "File to read")
flag.Parse()
// Override ENV with flags if necessary
if *hostPtr != "" {
host = *hostPtr
}
if *portPtr != -1 {
port = *portPtr
}
if *filePathPtr != "" {
filePath = *filePathPtr
}
if *stdinReadTimeoutSecondsPtr != -1 {
stdinReadTimeoutSeconds = *stdinReadTimeoutSecondsPtr
}
// TODO: Validate settings (ex. port has to be non-negative)
// Build address from host and port
addr := fmt.Sprintf("%s:%d", host, port)
log.Info(fmt.Sprintf("Server configured to run @ [%s]", addr))
// Create content to be filled in later
content := ""
// Check if filepath was provided
if filePath != "" {
// Load file into content
// Check if the file exists
if _, err := os.Stat(filePath); err != nil {
log.Fatal(fmt.Sprintf("Failed to find file [%s]", filePath))
}
// Read from file
contentBytes, err := ioutil.ReadFile(filePath)
if err != nil {
log.Fatal(fmt.Sprintf("Failed to read file @ [%s]", filePath))
}
log.Info(fmt.Sprintf("Reading file from path [%s]", filePath))
content = string(contentBytes)
} else {
// Attempt to read content from STDIN
log.Info(fmt.Sprintf("No file path provided, waiting for input on STDIN (max %d seconds)...", stdinReadTimeoutSeconds))
// No file path is present, attempt to read from STDIN
stdinContent, err := readStdinWithTimeout(stdinReadTimeoutSeconds)
if err != nil {
log.Fatal(fmt.Sprintf("Failed to read from STDIN after waiting %d seconds", stdinReadTimeoutSeconds))
}
content = stdinContent
}
// Ensure we have *some* content at this point
if content == "" {
log.Fatal("No file contents -- please ensure you've specified a file or fed in data via STDIN")
}
// Set up router that takes anything
http.HandleFunc("/", generateStaticServerFn(string(content)))
// Start the server
fmt.Println("Starting the server...")
log.Fatal(http.ListenAndServe(addr, nil))
}
func readStdinWithTimeout(timeoutSeconds int) (string, error) {
// Create channel for timeout
ch := make(chan int)
var result string
// Spawn goroutine to attempt to read STDIN
go func() {
stdinBytes, err := ioutil.ReadAll(os.Stdin)
if err != nil {
log.Fatal(fmt.Sprintf("Failed to read from STDIN after waiting %d seconds", timeoutSeconds))
}
result = string(stdinBytes)
ch <- 1
}()
// Wait for ReadAll or timeout
select {
// Read STDIN
case <-ch:
log.Info(fmt.Sprintf("Successfully read input from STDIN"))
return string(result), nil
// Timeout
case <-time.After(time.Duration(timeoutSeconds) * time.Second):
return "", errors.New("Failed to read from STDIN")
}
}
Yeah… this is definitely not “good” Go code (can you spot the leaked goroutine?) or even good code in general (so many things could be factored out for better testability – oh right, there are no tests either), and there are probably way better ways to do this, but I’m not planning on writing any other Go code for now so I’m leaving it here and not optimizing much more. There’s even a TODO in there for validating input that I’m going to get to round about never (most likely). As far as the dev experience went, it was pretty smooth getting back into writing a tiny bit of Go, though I’d forgotten most of the normal paradigms/patterns. One thing that annoyed me a little bit was how ad-hoc Go seemed compared to being able to use tools like Rust’s `structopt` and the paradigms available in Rust. Then again, I didn’t spend any time worrying about borrowing issues so…
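On the leaked goroutine: one possible way to tidy up the STDIN read (a sketch, not what `kcup-go` ships) is to send the result itself over a buffered channel. That removes the shared `result` variable, and the buffer of one means the reader goroutine can always deliver its result and exit, even if the caller has already timed out. The names here (`readAllWithTimeout`, `demoWithInput`, `demoIdle`) are hypothetical, it takes an `io.Reader` so it can be exercised without touching `os.Stdin`, and it assumes Go 1.16+ for `io.ReadAll`.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
	"time"
)

// readResult carries both outcomes of the read over a single channel.
type readResult struct {
	data []byte
	err  error
}

// readAllWithTimeout (hypothetical name) reads r until EOF, giving up after
// the timeout. In kcup's case r would be os.Stdin.
func readAllWithTimeout(r io.Reader, timeout time.Duration) (string, error) {
	// Buffered so the goroutine's send never blocks, even after a timeout
	ch := make(chan readResult, 1)
	go func() {
		data, err := io.ReadAll(r)
		ch <- readResult{data: data, err: err}
	}()

	select {
	case res := <-ch:
		if res.err != nil {
			return "", res.err
		}
		return string(res.data), nil
	case <-time.After(timeout):
		return "", errors.New("timed out waiting for input")
	}
}

// demoWithInput reads from a string, which reaches EOF immediately.
func demoWithInput(input string) (string, error) {
	return readAllWithTimeout(strings.NewReader(input), time.Second)
}

// demoIdle reads from a pipe nobody writes to, standing in for an idle STDIN.
func demoIdle() error {
	idle, _ := io.Pipe()
	_, err := readAllWithTimeout(idle, 50*time.Millisecond)
	return err
}

func main() {
	got, err := demoWithInput("version: STSv1\n")
	fmt.Printf("%q %v\n", got, err)
	fmt.Println(demoIdle())
}
```

A goroutine stuck inside the read itself still hangs around until its input is closed. For a tool that exits right after the timeout, that is probably fine.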
Of course, no relatively simple technical project is complete without a gratuitous benchmark, so I got out `wrk` to see how the servers performed serving files of various sizes. Another thing that would probably be good to limit is the machine size – for this I’m going to use 2 cores and 50mb, with `cgroup`-powered resource limiting via `docker`, which is way too much for these tiny utilities.
Thinking about it a little bit, Go actually uses an initial stack size of 2kb these days, and since I don’t expect the stacks for the goroutines to grow, I guess my goroutine concurrency would be limited by this right up until network saturation. I think what I’ll do for this is to give both Go and Rust the same amount of memory and see how they do. So the specs will be:
`base64 /dev/urandom | head -c 100K > file.txt`)
First the generated data, which I’ll be mounting into `/data` inside the container:
$ du -hs /tmp/testfiles/*
100K /tmp/testfiles/file-100K
12K /tmp/testfiles/file-10K
1.0M /tmp/testfiles/file-1M
$ tree /tmp/testfiles/
/tmp/testfiles/
├── file-100K
├── file-10K
└── file-1M
0 directories, 3 files
Here’s the `docker` command (for `rust` as an example):
$ docker run --detach \
-p 5001:5000 \
-e HOST=0.0.0.0 \
-e FILE=/data/file-10K \
-v /tmp/testfiles:/data \
--cpus 2 \
--memory 100mb \
--name kcup-rust \
registry.gitlab.com/mrman/kcup-rust/cli:v0.1.0
And for `go`:
$ docker run --detach \
-p 5002:5000 \
-e HOST=0.0.0.0 \
-e FILE=/data/file-10K \
-v /tmp/testfiles:/data \
--cpus 2 \
--memory 100mb \
--name kcup-go \
registry.gitlab.com/mrman/kcup-go/cli:v1
The command I’ll be using is right out of the `wrk` README, with the `--latency` option (`5001` for rust, `5002` for golang):
$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5001/any/path/will/work # rust
$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5002/any/path/will/work # go
Raw `wrk` output for `rust` and `go`:
$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5001/any/path/will/work # Rust
Running 30s test @ http://127.0.0.1:5001/any/path/will/work
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 5.70ms 26.96ms 1.01s 99.79%
Req/Sec 7.10k 649.83 13.47k 68.70%
Latency Distribution
50% 4.34ms
75% 5.71ms
90% 7.20ms
99% 11.36ms
2548559 requests in 30.10s, 24.49GB read
Requests/sec: 84674.28
Transfer/sec: 833.28MB
$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5002/any/path/will/work # Go
Running 30s test @ http://127.0.0.1:5002/any/path/will/work
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 45.13ms 75.88ms 1.04s 97.85%
Req/Sec 0.93k 588.07 5.11k 88.11%
Latency Distribution
50% 28.75ms
75% 63.39ms
90% 83.00ms
99% 351.37ms
335015 requests in 30.05s, 3.24GB read
Requests/sec: 11147.25
Transfer/sec: 110.33MB
In a table:
Metric | Rust | Golang |
---|---|---|
Latency 50% (ms) | 4.34 | 28.75 |
Latency 75% (ms) | 5.71 | 63.39 |
Latency 90% (ms) | 7.20 | 83.00 |
Latency 99% (ms) | 11.36 | 351.37 |
(Thread stats) Latency avg (ms) | 5.70 | 45.13 |
(Thread stats) Latency stddev (ms) | 26.96 | 75.88 |
(Thread stats) Latency max (ms) | 1010 | 1040 |
Requests/sec | 84,674 | 11,147 |
Transfer/sec (MB) | 833.28 | 110.33 |
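As a quick sanity check on these numbers, requests per second multiplied by the body size should land close to the transfer rate `wrk` reports. The sketch below does that arithmetic under two assumptions of mine: that `file-10K` holds exactly 10 × 1024 bytes, and that `wrk`'s "MB" means 1024 × 1024 bytes. The function name `expectedTransferMB` is made up.

```go
package main

import "fmt"

// expectedTransferMB (hypothetical name) estimates body-only throughput in
// MB/s, where 1 MB is taken to be 1024*1024 bytes (an assumption about wrk).
func expectedTransferMB(requestsPerSec float64, bodyBytes int) float64 {
	return requestsPerSec * float64(bodyBytes) / (1024 * 1024)
}

func main() {
	// Assumes file-10K holds exactly 10 KiB of content
	const bodyBytes = 10 * 1024

	// Requests/sec values taken from the raw wrk output above
	fmt.Printf("rust: %.1f MB/s\n", expectedTransferMB(84674.28, bodyBytes))
	fmt.Printf("go:   %.1f MB/s\n", expectedTransferMB(11147.25, bodyBytes))
}
```

Both estimates come out a little under the reported 833.28MB and 110.33MB, which would make sense if the remainder is response headers.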
Well that’s quite the difference between these two! Rust is beating Golang pretty handily here, though it might just be the Go code I’ve written being really bad, or the standard library being unoptimized compared to `hyper`. While Golang has an excellent standard library, there are definitely improved versions of various packages out there in the ecosystem (I’ve used some of the alternate web servers & routers in the past) – it probably makes sense to at least try the fastest one I can find in Golang land…
valyala/fasthttp
After like 30 seconds of searching I found `valyala/fasthttp`, which seems to do very well (top 3 even including prefork) in the Tech Empower Benchmarks Round 20. As expected it was pretty trivial to switch.
Here are the results with `fasthttp` (which is `v2` of `kcup-go`):
$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5002/any/path/will/work # Go (with fasthttp)
Running 30s test @ http://127.0.0.1:5002/any/path/will/work
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 21.43ms 90.40ms 1.32s 98.35%
Req/Sec 3.66k 1.49k 12.26k 61.44%
Latency Distribution
50% 6.70ms
75% 14.38ms
90% 28.56ms
99% 523.14ms
1308339 requests in 30.07s, 12.65GB read
Requests/sec: 43514.39
Transfer/sec: 430.67MB
In a table:
Metric | Rust | Golang (w/ `fasthttp`) |
---|---|---|
Latency 50% (ms) | 4.34 | 6.70 |
Latency 75% (ms) | 5.71 | 14.38 |
Latency 90% (ms) | 7.20 | 28.56 |
Latency 99% (ms) | 11.36 | 523.14 |
(Thread stats) Latency avg (ms) | 5.70 | 21.43 |
(Thread stats) Latency stddev (ms) | 26.96 | 90.40 |
(Thread stats) Latency max (ms) | 1010 | 1320 |
Requests/sec | 84,674 | 43,514 |
Transfer/sec (MB) | 833.28 | 430.67 |
Muuuch better results for Go, but it looks like rust is still ahead. Let’s move on to 100KB.
Raw `wrk` output:
$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5001/any/path/will/work # Rust
Running 30s test @ http://127.0.0.1:5001/any/path/will/work
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 33.22ms 71.41ms 1.28s 98.26%
Req/Sec 1.32k 485.19 3.18k 69.45%
Latency Distribution
50% 20.10ms
75% 36.77ms
90% 53.87ms
99% 195.05ms
466302 requests in 30.10s, 44.51GB read
Socket errors: connect 0, read 0, write 0, timeout 94
Requests/sec: 15493.35
Transfer/sec: 1.48GB
$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5002/any/path/will/work # Go (with fasthttp)
Running 30s test @ http://127.0.0.1:5002/any/path/will/work
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 371.32ms 538.45ms 2.00s 81.78%
Req/Sec 123.40 99.36 1.68k 88.19%
Latency Distribution
50% 82.34ms
75% 590.48ms
90% 1.34s
99% 1.90s
43736 requests in 30.09s, 4.18GB read
Socket errors: connect 0, read 0, write 0, timeout 1654
Requests/sec: 1453.30
Transfer/sec: 142.12MB
In a table:
Metric | Rust | Golang (w/ `fasthttp`) |
---|---|---|
Latency 50% (ms) | 20.10 | 82.34 |
Latency 75% (ms) | 36.77 | 590.48 |
Latency 90% (ms) | 53.87 | 1340 |
Latency 99% (ms) | 195.05 | 1900 |
(Thread stats) Latency avg (ms) | 33.22 | 371.32 |
(Thread stats) Latency stddev (ms) | 71.41 | 538.45 |
(Thread stats) Latency max (ms) | 1280 | 2000 |
Requests/sec | 15,493 | 1,453 |
Transfer/sec (MB) | 1480 | 142.12 |
That is a huge divergence in performance with the 100K file. Rust is handily beating Go now, and also transferring more data across the wire. I’m going to resist the urge to dig into why Golang is going slower; I want to see how the languages handle this without a lot of tuning. If you’re a Go expert and are sitting in your chair right now fuming at how wrong I’m doing it, please feel free to send me an email and I’ll update this article with at least what I could have done to optimize. For now let’s move on to some even bigger files, which is probably going to be bad news bears for Go.
Raw `wrk` output:
$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5001/any/path/will/work # Rust
Running 30s test @ http://127.0.0.1:5001/any/path/will/work
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 279.42ms 345.34ms 1.99s 91.04%
Req/Sec 55.29 86.85 530.00 91.76%
Latency Distribution
50% 181.57ms
75% 304.86ms
90% 587.11ms
99% 1.61s
1046 requests in 30.09s, 1.10GB read
Socket errors: connect 0, read 1432, write 14179153, timeout 42
Requests/sec: 34.76
Transfer/sec: 37.34MB
$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5002/any/path/will/work # Go (with fasthttp)
Running 30s test @ http://127.0.0.1:5002/any/path/will/work
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 359.92ms 420.75ms 2.00s 85.96%
Req/Sec 18.15 54.69 510.00 96.94%
Latency Distribution
50% 220.98ms
75% 528.65ms
90% 1.03s
99% 1.53s
235 requests in 30.09s, 237.03MB read
Socket errors: connect 0, read 1302, write 13538217, timeout 7
Requests/sec: 7.81
Transfer/sec: 7.88MB
In a table:
Metric | Rust | Golang (w/ `fasthttp`) |
---|---|---|
Latency 50% (ms) | 181.57 | 220.98 |
Latency 75% (ms) | 304.86 | 528.65 |
Latency 90% (ms) | 587.11 | 1030 |
Latency 99% (ms) | 1610 | 1530 |
(Thread stats) Latency avg (ms) | 279.42 | 359.92 |
(Thread stats) Latency stddev (ms) | 345.34 | 420.75 |
(Thread stats) Latency max (ms) | 1990 | 2000 |
Requests/sec | 34.76 | 7.81 |
Transfer/sec (MB) | 37.34 | 7.88 |
There it is – Rust has hit the large file wall (again, I’m not going to optimize this; I suspect raising the provided resources should be good enough to up the performance), and Go continues to pound against it. At the edges they perform similarly, but over the whole range it looks like rust performs a bit better – at the 90th percentile, requests are nearly twice as fast, which is much better for a larger segment of people receiving the files.
Since DockerHub stopped giving away quite so much bandwidth, it seems likely that the free-loading community (of which I am a happy member) will move elsewhere. For now, I’m going to upload the images for kcup to both projects’ GitLab registries (and skip AWS’s ECR public for now). GitLab has had free public-facing (and private) container registries bundled forever, so I’ll put them there and call it a day so this post doesn’t get any longer.
You can find the containers here (`kcup-rust` and `kcup-go`):
Repository | Rust | Golang |
---|---|---|
GitLab | registry.gitlab.com/mrman/kcup-rust/cli:v0.1.0 | registry.gitlab.com/mrman/kcup-go/cli:v1 |
GitLab makes it pretty easy to push container images from CI in your project, but maybe some people don’t know that, so here’s a snippet of my `.gitlab-ci.yml` you can peruse:
# Publish pre-release images (image:vX.X.X-<hash>) on release-vX.X.X branches
publish_pre_release_image:
  stage: image+publish
  image: docker
  services:
    - docker:dind
  only:
    - /release-v[0-9|\.]+/
  script:
    - apk add make perl musl-dev openssl-dev
    - docker login -u gitlab-ci-token --password $CI_BUILD_TOKEN registry.gitlab.com
    - make image image-publish
# Only publish release images (image:vX.X.X) on release tags
publish_release_image:
  stage: image+publish
  image: docker
  services:
    - docker:dind
  only:
    - /v[0-9|\.]+/
  except:
    - branches
  script:
    - apk add make perl musl-dev openssl-dev
    - docker login -u gitlab-ci-token --password $CI_BUILD_TOKEN registry.gitlab.com
    - make image image-publish image-release
Of course you need the `Makefile` targets (you can find these in the GitLab repos) to make this easy (`perl` is in there as a requirement for one of the makefile targets), but all in all, pretty easy thanks to the Docker-in-Docker service.
Also, don’t forget to set your GitLab CI/CD settings for image tag cleanup!
There are some CI goodies in both the repositories – as I’ve done my usual thing and set up automatic new-version tagging. To get this working you need a few things:
`Makefile` (or other scripting tool) targets that:
`make get-version`)
`git push` for this new tagged commit to your repo
`Makefile` (or other scripting tool) target that prints out the version (ex. `make get-version`)
`git-crypt` (or `sops`)
NOTE: `make` is an awesome tool – having these things as make targets makes it really easy to run them manually or in CI, and will work generally cross-project as `make` is usually widely available/a light install.
After you have all these things, you should be able to do the following (you may or may not need to install more “missing packages”) in `.gitlab-ci.yml`:
# Tag new versions whenever a new version (bumped in a release-vX.X.X branch) is merged in
# The vast majority of the time this step will do nothing
tag_new_version:
  stage: extra
  only:
    - main
  script:
    # Testing for missing packages
    - apt-get install -y openssh-client git
    # Install & setup SSH
    - mkdir ~/.ssh && chmod 700 ~/.ssh && touch ~/.ssh/known_hosts
    - eval $(ssh-agent -s)
    - ssh-keyscan -t rsa gitlab.com >> ~/.ssh/known_hosts
    # Load CI SSH key
    - chmod 700 $CI_DEPLOY_PRIVATE_KEY
    - ssh-add $CI_DEPLOY_PRIVATE_KEY
    # Add gitlab remote
    - git remote add gitlab git@gitlab.com:<username or group>/<repository name>.git
    # Get version, exit early if tag already exists
    - export VERSION=v`make get-version`
    - export VERSION_TAG_EXISTS=$(git ls-remote gitlab | grep $VERSION | wc -l)
    - test $VERSION_TAG_EXISTS -eq 1 && exit 0 # exit early if the version tag already exists
    # Set robot git identity
    - git config user.email "email+ci-robot@domain.tld"
    - git config user.name "CI"
    # Add the remote to do a release
    - make build-release release-prep
    - make release-publish GIT_REMOTE=gitlab
I’ve shared variations of this stuff before, but figured some peo
hyper::body::Bytes
After putting the post up on Reddit, u/cramert offered up an optimization suggestion – `hyper::body::Bytes`! Turns out `hyper::body::Bytes` was built to do exactly what I was trying to do:
Bytes is an efficient container for storing and operating on contiguous slices of memory. It is intended for use primarily in networking code, but could have applications elsewhere as well.
Bytes values facilitate zero-copy network programming by allowing multiple Bytes objects to point to the same underlying memory.
As soon as I saw the comment I got typing and made a v0.1.1 release (I really should add some tests so I don’t have to test manually every time…), and here is what the results look like for the file sizes we tested earlier. The changes offered an improvement in raw performance and a huge improvement (reduction) in variance, and the gains grew with file size: from ~10% at 10K to 2x+ at 1M! Huge thanks to u/cramert.
About 10% better here which is pretty awesome…
Metric | Rust (`Arc<String>`) | Rust (`hyper::body::Bytes`) |
---|---|---|
Latency 50% (ms) | 4.34 | 4.12 |
Latency 75% (ms) | 5.71 | 5.42 |
Latency 90% (ms) | 7.20 | 6.90 |
Latency 99% (ms) | 11.36 | 10.71 |
(Thread stats) Latency avg (ms) | 5.70 | 4.42 |
(Thread stats) Latency stddev (ms) | 26.96 | 1.97 |
(Thread stats) Latency max (ms) | 1010 | 24.26 |
Requests/sec | 84,674 | 88,690 |
Transfer/sec (MB) | 833.28 | 850 |
Improvements starting to snowball…
Metric | Rust (`Arc<String>`) | Rust (`hyper::body::Bytes`) |
---|---|---|
Latency 50% (ms) | 20.10 | 14.24 |
Latency 75% (ms) | 36.77 | 18.80 |
Latency 90% (ms) | 53.87 | 23.92 |
Latency 99% (ms) | 195.05 | 36.70 |
(Thread stats) Latency avg (ms) | 33.22 | 15.23 |
(Thread stats) Latency stddev (ms) | 71.41 | 6.81 |
(Thread stats) Latency max (ms) | 1280 | 105.82 |
Requests/sec | 15,493 | 25,842 |
Transfer/sec (MB) | 1480 | 2470 |
2x+ improvement across the board here!
Metric | Rust (`Arc<String>`) | Rust (`hyper::body::Bytes`) |
---|---|---|
Latency 50% (ms) | 181.57 | 97.53 |
Latency 75% (ms) | 304.86 | 129.07 |
Latency 90% (ms) | 587.11 | 165.85 |
Latency 99% (ms) | 1610 | 251.38 |
(Thread stats) Latency avg (ms) | 279.42 | 105.53 |
(Thread stats) Latency stddev (ms) | 345.34 | 46.45 |
(Thread stats) Latency max (ms) | 1990 | 511.01 |
Requests/sec | 34.76 | 3,773 |
Transfer/sec (MB) | 37.34 | 3690 |
`lazy_static` with `kcup-rust`, `kcup-go` memory issues and optimizations
`kcup-rust` by trying `lazy_static`
While `kcup-rust` is pretty nice already, there was a nice suggestion made that I captured in a ticket, which is worth looking into, so I’ll take some time to try it out.
You can find the changes on the issue’s MR. These changes were released into `v0.1.2` of `kcup-rust` (also uploaded to cargo).
Locally, without running in a container, the 1M results pre-`lazy_static` look like this:
$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5000/any/path/will/work
Running 30s test @ http://127.0.0.1:5000/any/path/will/work
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 22.77ms 12.14ms 103.85ms 68.04%
Req/Sec 1.03k 213.09 1.65k 67.47%
Latency Distribution
50% 20.81ms
75% 30.29ms
90% 39.64ms
99% 56.57ms
366497 requests in 30.09s, 357.94GB read
Requests/sec: 12180.93
Transfer/sec: 11.90GB
After they look like this:
$ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5000/any/path/will/work
Running 30s test @ http://127.0.0.1:5000/any/path/will/work
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 22.60ms 12.21ms 107.00ms 68.37%
Req/Sec 1.03k 217.25 1.70k 68.60%
Latency Distribution
50% 20.54ms
75% 30.13ms
90% 39.51ms
99% 57.06ms
366589 requests in 30.10s, 358.02GB read
Requests/sec: 12177.67
Transfer/sec: 11.89GB
There isn’t enough of a change in these stats to justify moving to `lazy_static`, so for now I’m going to keep the code unchanged. Thanks very much to /u/FormalFerret for the suggestion though!
`kcup-rust` against `miniserve`
Reddit user /u/vlmutolo brought up `miniserve`, which is a competitor in the space. I often profess that an indicator of library quality is a comparison to competitors in the space on the README, and thanks to /u/vlmutolo, I have that! `kcup` is so small in scope that it performs about 2x as well:
Metric | `kcup` | `miniserve` |
---|---|---|
Latency 50% (ms) | 21.09 | 83.94 |
Latency 75% (ms) | 31.04 | 95.73 |
Latency 90% (ms) | 40.81 | 107.23 |
Latency 99% (ms) | 58.54 | 129.38 |
(Thread stats) Latency avg (ms) | 23.29 | 84.94 |
(Thread stats) Latency stddev (ms) | 12.57 | 17.11 |
(Thread stats) Latency max (ms) | 120.18 | 182.87 |
Request/sec | 11,900 | 4,599 |
Transfer/sec (MB) | 1162 | 450 |
I updated the README to point this out.
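To put a number on “about 2x”, here is the arithmetic on the table above as a tiny sketch. The `speedup` helper is just a name made up for the example, and the inputs are copied straight from the table.

```go
package main

import "fmt"

// speedup reports how many times larger a is than b.
// (Hypothetical helper, only here to make the ratios explicit.)
func speedup(a, b float64) float64 {
	return a / b
}

func main() {
	// Throughput: higher is better, so divide kcup by miniserve.
	fmt.Printf("requests/sec:   %.2fx\n", speedup(11900, 4599))
	// Latency: lower is better, so divide miniserve by kcup.
	fmt.Printf("median latency: %.2fx\n", speedup(83.94, 21.09))
	fmt.Printf("p99 latency:    %.2fx\n", speedup(129.38, 58.54))
}
```

Each of the three ratios comes out above 2, so “about 2x” is, if anything, on the conservative side for this particular run.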
`kcup-go` one step forward and two steps back
Thanks to Pawel’s help with the `[]byte` changes, I started taking another look at running the golang executable when I noticed something funny… Golang has no way to limit memory usage (?), so the resource-constrained container would just… get OOM killed. From what I can find Go leaves this up to the operating system via `ulimit` (reasonable choice I guess?), and it looks like I can do this within docker with `--ulimit memlock=<number>`… Shout out to `docker stats` though for making it really easy to see all this; I was getting very confused as to why the `kcup-go` container was disappearing until it dawned on me that it was possible the container was being OOM killed. My next thought was “surely Golang doesn’t let you just… run out of memory???”, given how many people seem to be OK carrying it to production. No one liked messing with Java’s ~20 `-XxXxVaRiABlEs` (yes this is hyperbole), but Java surely wouldn’t let you fall through a hole like this (except for that time where it wasn’t aware of containers of course).
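For what it’s worth, Go releases from 1.19 onward do expose a soft memory limit, both as the `GOMEMLIMIT` environment variable and as `runtime/debug.SetMemoryLimit`. Whether it would have saved the container here is untested; the sketch below assumes a Go 1.19+ toolchain, and the 90MiB figure and the `applySoftLimit` / `currentLimit` helpers are made up for illustration.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applySoftLimit asks the Go runtime to try to keep its total memory use
// under limitBytes and returns the limit that was in effect before the call.
// (Hypothetical helper; requires Go 1.19 or newer.)
func applySoftLimit(limitBytes int64) int64 {
	return debug.SetMemoryLimit(limitBytes)
}

// currentLimit reads the limit without changing it: a negative input tells
// SetMemoryLimit to leave the setting alone and just report it.
func currentLimit() int64 {
	return debug.SetMemoryLimit(-1)
}

func main() {
	// Assumed budget: 90MiB, leaving some headroom under a 100MB container.
	const budget = 90 << 20
	previous := applySoftLimit(budget)
	fmt.Printf("previous limit: %d bytes, new limit: %d bytes\n", previous, currentLimit())
}
```

The limit is soft: the garbage collector works harder as the heap approaches it, but the kernel can still OOM-kill the process if live memory genuinely exceeds what the container allows.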
I didn’t notice this early on because `kcup-go:v1` used `net/http` and was pretty good with its memory usage. `v2` and `v2.1` are actually dead within seconds, as memory usage basically goes from 0 to ~99MB and then the OOM killer steps in. I only have myself to blame (how did I not notice `v2` just not existing anymore? I guess not writing E2E tests has come to bite me even faster than I thought), but I’m still pretty flabbergasted at Golang here. I’ll try to share what I think without being too abrasive:
- Is `ulimit memlock` like day 2 production Golang skills? Is everyone just watching restarts (let’s say docker/kubernetes is your platform) and tuning accordingly?
- `fasthttp`? I didn’t change the code significantly (I thought) but have gotten myself some pretty unwanted memory characteristics. I wrote just about the simplest handler I can think of with `fasthttp` and it’s essentially a memory hog now?
Don’t want to be that guy (just kidding, I definitely do) but `kcup-rust` never had any of these issues and went from around 9MB to 70MB with a memory budget of 100MB with the same workload and did ~2x+ better anyway. Well anyway, to keep this at least something close to a fair comparison I have to put more effort into the Go stuff and try to get it right.
`fasthttp`
Thanks again to Pawel for his help putting together a PR that fixes the `fasthttp` version used on 2.x. I merged that and released it as version `v2.2`.
Memory usage stayed under control for the first couple runs so I took those numbers (again, Go doesn’t really police itself on that front, and isn’t quite efficient enough to be able to repeat this without slight increases in memory usage), but Go did much better than it has in the past:
mrman@mroryxman $ wrk -t12 -c400 -d30s --latency http://127.0.0.1:5001/any/path/will/work
Running 30s test @ http://127.0.0.1:5001/any/path/will/work
12 threads and 400 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 100.11ms 50.37ms 1.16s 79.57%
Req/Sec 335.51 68.97 0.93k 73.37%
Latency Distribution
50% 91.22ms
75% 121.58ms
90% 157.83ms
99% 245.12ms
120494 requests in 30.09s, 117.72GB read
Requests/sec: 4004.16
Transfer/sec: 3.91GB
Metric | Golang (naive `fasthttp`) | Golang (efficient `fasthttp`) |
---|---|---|
Latency 50% (ms) | 220.98 | 91.22 |
Latency 75% (ms) | 528.65 | 121.58 |
Latency 90% (ms) | 1030 | 157.83 |
Latency 99% (ms) | 1530 | 245.12 |
(Thread stats) Latency avg (ms) | 359.92 | 100.11 |
(Thread stats) Latency stddev (ms) | 420.75 | 50.37 |
(Thread stats) Latency max (ms) | 2000 | 1160 |
Request/sec | 7.81 | 4004 |
Transfer/sec (MB) | 7.88 | 3910 |
This looks a lot better – better memory efficiency and much better performance.
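For reference, the general shape of the `[]byte` plus `io.Copy` approach looks something like the sketch below. It is written against the standard library’s `net/http` rather than `fasthttp` to stay dependency-free, so it is not the `kcup-go` code itself; `newStaticHandler`, `probe` and the sample policy body are names and data made up for the example. The file is read into memory once, and every request streams that same slice through a fresh `bytes.Reader` instead of copying it.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// newStaticHandler returns a handler that answers every path with the same
// preloaded content. The slice is shared between requests and never mutated.
func newStaticHandler(content []byte) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; charset=utf-8")
		w.Header().Set("Content-Length", strconv.Itoa(len(content)))
		// bytes.NewReader wraps the slice without copying it.
		if _, err := io.Copy(w, bytes.NewReader(content)); err != nil {
			// Headers are already sent; the client most likely went away.
			return
		}
	})
}

// probe sends one in-process GET request to the handler and returns the
// status code and body, so no real port needs to be opened.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	// Sample body only; a real deployment would load the actual file or STDIN.
	handler := newStaticHandler([]byte("version: STSv1\nmode: testing\n"))
	status, body := probe(handler, "/any/path/will/work")
	fmt.Println(status, len(body))
	// A real server would call http.ListenAndServe(":5001", handler) instead.
}
```

The point of the design is that the per-request cost is one small reader struct, not a copy of the whole payload, which is where I’d expect a naive handler to burn memory with large files.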
`[fast]http.ServeFile`
Another approach that was suggested was to use `[fast]http.ServeFile` instead of trying to serve the bytes ourselves to start with, and that approach looks to actually be better (and definitely conceptually simpler). Unfortunately, this doesn’t do the STDIN approach, but I wanted to at least benchmark it and see, so it’s included as well. At first glance I don’t think I can use `http.ServeFile` for a production release because I’d essentially have to read STDIN and then drop it in a temporary file, but maybe the interfaces are loose enough where I can use a `Buffer` instead.
Well it turns out I can’t use a `Buffer` (the `path` is a `string` or `[]byte`, and has to be the file path), and the results actually got worse.
Metric | Golang (efficient `fasthttp`) | Golang (`fasthttp.ServeFile`) |
---|---|---|
Latency 50% (ms) | 91.22 | 219.10 |
Latency 75% (ms) | 121.58 | 354.15 |
Latency 90% (ms) | 157.83 | 525.86 |
Latency 99% (ms) | 245.12 | 1070 |
(Thread stats) Latency avg (ms) | 100.11 | 275.46 |
(Thread stats) Latency stddev (ms) | 50.37 | 209.19 |
(Thread stats) Latency max (ms) | 1160 | 1900 |
Request/sec | 4004 | 1576 |
Transfer/sec (MB) | 3910 | 1540 |
Well I don’t have anything good to say here so I won’t say anything at all. Just kidding – “Is this your king????” Well anyway, please someone reach out if the code that I tried to use for `fasthttp.ServeFile` is just super wrong in some non-obvious way that I can’t see. At this point I’m surprised it’s even possible to make this many subtle mistakes in Go, or maybe it’s more that Go really really needs lots of memory to at least be present to perform well.
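One thing that may be worth noting on the standard library side: `net/http` has `http.ServeContent`, which takes an `io.ReadSeeker` rather than a path, so content read from STDIN could in principle be served from a `bytes.Reader` with no temporary file. This is a `net/http` sketch only, not benchmarked, and it says nothing about `fasthttp.ServeFile`; `newContentHandler`, `fetch` and the `mta-sts.txt` name are assumptions for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// newContentHandler serves in-memory content through http.ServeContent,
// which handles Content-Length, Range and conditional requests itself.
// The name is only used to guess the Content-Type from its extension.
func newContentHandler(name string, content []byte) http.Handler {
	loadedAt := time.Now() // used for the Last-Modified header
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Each request gets its own reader; the underlying slice is shared.
		http.ServeContent(w, r, name, loadedAt, bytes.NewReader(content))
	})
}

// fetch performs one in-process GET, optionally with a Range header,
// and returns the status code and body.
func fetch(h http.Handler, rangeHeader string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, "/.well-known/mta-sts.txt", nil)
	if rangeHeader != "" {
		req.Header.Set("Range", rangeHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	// Stand-in for STDIN: a real program would read os.Stdin here instead.
	content, err := io.ReadAll(strings.NewReader("version: STSv1\nmode: testing\n"))
	if err != nil {
		panic(err)
	}
	h := newContentHandler("mta-sts.txt", content)
	status, body := fetch(h, "")
	fmt.Println(status, len(body))
}
```

Compared to writing the bytes by hand, this trades a little per-request overhead for Range and `If-Modified-Since` support, which probably doesn’t matter for a tiny policy file but might for the larger payloads benchmarked above.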
Well I’ve scratched the yak shaving itch for the week; hopefully this was fun to read and maybe this project might even be worth using for others out there. If you ever find yourself with the need to serve a single file, and want to do it without using feature-packed software like NGINX or others, maybe `kcup` is for you.
Now I can finally get back to doing real work!