Hyperfine: A command-line benchmarking tool

235 points by hundredwatt 3 days ago

ratrocket 2 days ago

Perhaps interesting (for some) to note that hyperfine is from the same author as at least a few other "ne{w,xt} generation" command line tools (that could maybe be seen as part of "rewrite it in Rust", but I don't want to paint the author with a brush they disagree with!!): fd (find alternative; https://github.com/sharkdp/fd), bat ("supercharged version of the cat command"; https://github.com/sharkdp/bat), and hexyl (hex viewer; https://github.com/sharkdp/hexyl). (And certainly others I've missed!)

Pointing this out because I myself appreciate comments that do this.

For myself, `fd` is the one most incorporated into my own "toolbox" -- used it this morning prior to seeing this thread on hyperfine! So, thanks for all that, sharkdp if you're reading!

Ok, end OT-ness.

varenc 2 days ago

++ to `fd`
It’s absolutely my preferred `find` replacement. Its CLI interface just clicks for me and I can quickly express my desires. Quite unlike `find`. `fd` is one of the first packages I install on a new system.
- ratrocket 2 days ago
  
  The "funny" thing for me about `fd` is that the set of operations I use for `find` are very hard-wired into my muscle memory from using it for 20+ years, so when I reach for `fd` I often have to reference the man page! I'm getting a little better from more exposure, but it's just different enough from `find` to create a bit of an uncanny valley effect (I think that's the right use of the term...).
  Even with that I reach for `fd` for some of its quality-of-life features: respecting .gitignore, its speed, regex-ability. (Though not its choices with color; I am a pretty staunch "--color never" person, for better or worse!)
  Anyway, that actually points to another good thing about sharkdp's tools: they have good man pages!!
  - n8henrie 2 days ago
    
    Same here -- I like `fd` for quick stuff, but routinely have to fall back to `find`.

mmastrac 3 days ago

Hyperfine is a great tool but when I was using it at Deno to benchmark startup time there was a lot of weirdness around the operating system apparently caching inodes of executables.

If you are looking at shaving sub 20ms numbers, be aware you may need to pull tricks on macos especially to get real numbers.

sharkdp 3 days ago

Caching is something that you almost always have to be aware of when benchmarking command line applications, even if the application itself has no caching behavior. Please see https://github.com/sharkdp/hyperfine?tab=readme-ov-file#warm... on how to run either warm-cache benchmarks or cold-cache benchmarks.
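
As a rough illustration of the two modes mentioned above, this sketch builds hyperfine command lines for a warm-cache run (`--warmup`) and a cold-cache run (`--prepare`). The helper names are hypothetical, and the cache-dropping command is Linux-specific and needs root, so treat it as an assumption about your system rather than a recommendation.

```python
import shlex

# Linux-only way to drop the page cache; requires root (assumption about the host).
DROP_CACHES = "sync; echo 3 | sudo tee /proc/sys/vm/drop_caches"


def warm_cache_argv(command, warmup_runs=3):
    """Run the command a few times before measuring, so caches are warm."""
    return ["hyperfine", "--warmup", str(warmup_runs), command]


def cold_cache_argv(command, prepare=DROP_CACHES):
    """Run a preparation command before every timed run, so caches are cold."""
    return ["hyperfine", "--prepare", prepare, command]


if __name__ == "__main__":
    # Print copy-pasteable shell commands; pass the lists to subprocess.run to execute.
    print(shlex.join(warm_cache_argv("grep -R TODO .")))
    print(shlex.join(cold_cache_argv("grep -R TODO .")))
```

Neither mode addresses the "freshly compiled executable" effect described in the reply below; it only covers ordinary disk and page caching.
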
- mmastrac 2 days ago
  
  I'm fully aware but it's not a problem that warmup runs fix. An executable freshly compiled will always benchmark differently than one that has "cooled off" on macos, regardless of warmup runs.
  I've tried to understand what the issue is (played with resigning executables etc) but it's literally something about the inode of the executable itself. Most likely part of the OSX security system.
  - renewiltord 2 days ago
    
    Interesting. I've encountered this obviously on first run (because of the security checking it does on novel executables) but didn't realize this expired. Probably because I usually attribute it to a recompilation. Thanks.
JackYoustra 3 days ago

I've found pretty good results with the System Trace template in xcode instruments. You can also stack instruments, for example combining the file inspector with a virtual memory inspector.
I've run into some memory corruption with it sometimes, though, so be wary of that. Emerge tools has an alternative for iOS at least, maybe one day they'll port it to mac.
- art049 3 days ago
  
  I never tried xcode instruments. Is the UX good for this kind of tool?
  - JackYoustra 2 days ago
    
    System Trace is pretty comprehensive. As long as what you need is in there, you don't need a crazy strong profiler, and you're okay with occasional hangs in the profiler UI, it's pretty good.
maccard 2 days ago

Not being able to rely on numbers to 20ms is pretty poor. That’s longer than a frame in a video game.
Windows has microsecond precision counters (see QueryPerformanceCounter and friends)

usrme 3 days ago

I've also had a good experience using the 'perf'[^1] tools for when I don't want to install 'hyperfine'. Shameless plug for a small blog post about it as I don't think it is that well known: https://usrme.xyz/tils/perf-is-more-robust-for-repeated-timi....

---

[^1]: https://www.mankier.com/1/perf

vdm 3 days ago

I too have scripted time(1) in a loop badly. perf stat is more likely to be already installed than hyperfine. Thank you for sharing!
CathalMullan 2 days ago

There's also 'poop', which is a nice middle-ground between 'hyperfine' and 'perf'. https://github.com/andrewrk/poop
- llimllib 2 days ago
  
  worth mentioning that it's linux-only

mosselman 3 days ago

Hyperfine is great. I use it sometimes for some quick web page benchmarks:

https://abuisman.com/posts/developer-tools/quick-page-benchm...

As mentioned here in the thread, when you want to go into the single ms optimisations it is not the best approach since there is a lot of overhead especially the way I demonstrate here, but it works very well for some sanity checks.

Sesse__ 2 days ago
> Hyperfine is great.
Is it, though?
What I would expect a system like this to have, at a minimum:
```
  * Robust statistics with p-values (not just min/max, compensation for multiple hypotheses, no Gaussian assumptions)
  * Multiple stopping points depending on said statistics.
  * Automatic isolation to the greatest extent possible (given appropriate permissions)
  * Interleaved execution, in case something external changes mid-way.
```
I don't see any of this in hyperfine. It just… runs things N times and then does a naïve average/min/max? At that rate, one could just as well use a shell script and eyeball the results.
- sharkdp 2 days ago
  
  > Robust statistics with p-values (not just min/max, compensation for multiple hypotheses, no Gaussian assumptions)
  This is not included in the core of hyperfine, but we do have scripts to compute "advanced" statistics, and to perform t-tests here: https://github.com/sharkdp/hyperfine/tree/master/scripts
  Please feel free to comment here if you think it should be included in hyperfine itself: https://github.com/sharkdp/hyperfine/issues/523
  > Automatic isolation to the greatest extent possible (given appropriate permissions)
  This sounds interesting. Please feel free to open a ticket if you have any ideas.
  > Interleaved execution, in case something external changes mid-way.
  Please see the discussion here: https://github.com/sharkdp/hyperfine/issues/21
  > It just… runs things N times and then does a naïve average/min/max?
  While there is nothing wrong with computing average/min/max, this is not all hyperfine does. We also compute modified Z-scores to detect outliers. We use that to issue warnings, if we think the mean value is influenced by them. We also warn if the first run of a command took significantly longer than the rest of the runs and suggest counter-measures.
  Depending on the benchmark I do, I tend to look at either the `min` or the `mean`. If I need something more fine-grained, I export the results and use the scripts referenced above.
  > At that rate, one could just as well use a shell script and eyeball the results.
  Statistical analysis (which you can consider to be basic) is just one reason why I wrote hyperfine. The other reason is that I wanted to make benchmarking easy to use. I use warmup runs, preparation commands and parametrized benchmarks all the time. I also frequently use the Markdown export or the JSON export to generate graphs or histograms. This is my personal experience. If you are not interested in all of these features, you can obviously "just as well use a shell script".
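
For readers unfamiliar with the modified Z-scores mentioned above, here is a minimal sketch of the general technique: it scores each run by its distance from the median, scaled by the median absolute deviation (MAD). The 3.5 cutoff is a commonly cited rule of thumb and an assumption here, not necessarily the threshold hyperfine uses, and the sample times are made up.

```python
import statistics


def modified_z_scores(times):
    """Modified Z-score of each value: 0.6745 * (x - median) / MAD."""
    med = statistics.median(times)
    mad = statistics.median(abs(t - med) for t in times)
    if mad == 0:
        # At least half of the values equal the median, so the score is
        # undefined. Returning zeros is a simplification for this sketch.
        return [0.0 for _ in times]
    return [0.6745 * (t - med) / mad for t in times]


def find_outliers(times, threshold=3.5):
    """Return (index, time) pairs whose absolute score exceeds the threshold."""
    scores = modified_z_scores(times)
    return [(i, t) for i, (t, s) in enumerate(zip(times, scores))
            if abs(s) > threshold]


if __name__ == "__main__":
    # Made-up wall-clock times in seconds; one run is clearly slower.
    runs = [1.02, 1.00, 0.99, 1.01, 1.00, 1.65, 1.01, 0.98]
    print(find_outliers(runs))
```

Because the score is built from medians, a single slow run does not inflate the scale the way it would inflate a standard deviation.
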
  - Sesse__ 2 days ago
    
    > This is not included in the core of hyperfine, but we do have scripts to compute "advanced" statistics, and to perform t-tests here: https://github.com/sharkdp/hyperfine/tree/master/scripts
    t-tests run afoul of the “no Gaussian assumptions” requirement, though. Distributions arising from benchmarking frequently have various forms of skew, which mess up t-tests and give artificially narrow confidence intervals.
    (I'll gladly give you credit for your outlier detection, though!)
    >> Automatic isolation to the greatest extent possible (given appropriate permissions)
    > This sounds interesting. Please feel free to open a ticket if you have any ideas.
    Off the top of my head, some option that would:
    * Binds to isolated CPUs, if booted with them (isolcpus=)
    * Binds to a consistent set of cores/hyperthreads (the scheduler frequently sabotages benchmarking, especially if your cores have very different maximum frequencies)
    * Warns if thermal throttling is detected during the run
    * Warns if an inappropriate CPU governor is enabled
    * Locks the program into RAM (probably hard to do without some sort of help from the program)
    * Enables realtime priority if available (e.g., if isolcpus= is not enabled, or you're not on Linux)
    Of course, sometimes you would _want_ to benchmark some of these effects, and that's fine. But most people probably won't, and won't know that they exist. I may easily have forgotten some.
    On the flip side (making things more random as opposed to less), something that randomizes the initial stack pointer would be nice, as I've sometimes seen this go really, really wrong (renaming a binary from foo to foo_new made it run >1% slower!).
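
Two of the isolation ideas listed above (pinning to a fixed set of cores and warning about the CPU governor) can be sketched in a few lines. This is a Linux-oriented sketch under stated assumptions: the sysfs path is the usual cpufreq location but may be absent (for example in containers or VMs), `os.sched_setaffinity` does not exist on macOS, and the function names are hypothetical. It is not something hyperfine does.

```python
import os
from pathlib import Path

# Usual cpufreq sysfs location on Linux; may not exist on every machine.
GOVERNOR_PATH = "/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_governor"


def read_governor(cpu=0):
    """Return the cpufreq governor for a CPU, or None if it cannot be read."""
    try:
        return Path(GOVERNOR_PATH.format(cpu=cpu)).read_text().strip()
    except OSError:
        return None


def pin_to_cpus(cpus):
    """Pin this process to the given CPUs; children spawned later inherit it.

    Returns True on success, False where the call is unavailable or refused.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, cpus)
        return True
    except OSError:
        return False


def preflight(cpus):
    """Collect human-readable warnings before starting a benchmark."""
    warnings = []
    governor = read_governor(min(cpus))
    if governor is not None and governor != "performance":
        warnings.append(f"CPU governor is '{governor}', not 'performance'")
    if not pin_to_cpus(cpus):
        warnings.append(f"could not pin to CPUs {sorted(cpus)}")
    return warnings


if __name__ == "__main__":
    for message in preflight({0}):
        print("warning:", message)
```

Thermal throttling, memory locking and realtime priority are left out because they need platform-specific interfaces and, usually, elevated privileges.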
    
    sharkdp 2 days ago
    
    > On the flip side (making things more random as opposed to less), something that randomizes the initial stack pointer would be nice, as I've sometimes seen this go really, really wrong (renaming a binary from foo to foo_new made it run >1% slower!).
    This is something we do already. We set a `HYPERFINE_RANDOMIZED_ENVIRONMENT_OFFSET` environment variable with a random-length value: https://github.com/sharkdp/hyperfine/blob/87d77c861f1b6c761a...
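
The general idea of a random-length environment variable can be sketched as follows. On typical Unix systems the environment strings are placed near the top of a new process's stack, so changing their total size shifts the initial stack address. The variable name `BENCH_RANDOM_PADDING` and the helper names are hypothetical; this is not hyperfine's code.

```python
import os
import random
import subprocess
import sys

PADDING_VAR = "BENCH_RANDOM_PADDING"  # hypothetical name, not hyperfine's variable


def padded_env(max_padding=4096, rng=random):
    """Copy the current environment and add a variable of random length."""
    env = dict(os.environ)
    env[PADDING_VAR] = "X" * rng.randint(0, max_padding)
    return env


def run_once(argv, **kwargs):
    """Run argv with a freshly randomized environment size."""
    return subprocess.run(argv, env=padded_env(), **kwargs)


if __name__ == "__main__":
    child = [sys.executable, "-c",
             f"import os; print(len(os.environ['{PADDING_VAR}']))"]
    for _ in range(3):
        run_once(child, check=True)  # prints the padding length the child saw
```

Averaging over many such runs spreads out layout-dependent effects instead of baking one accidental layout into the result.
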
- renewiltord 2 days ago
  
  Personally, I'm all about the UNIX philosophy of doing one thing and doing it well. All I want is the process to be invoked k times to do a thing with warmup etc. etc. If I want additional stats, it's easy to calculate. I just `--export-json` and then once it's in a dataframe I can do what I want with it.
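
A minimal sketch of that export-then-analyze workflow, using only the standard library instead of a dataframe. It assumes the `--export-json` file has a top-level `results` list whose entries carry `command` and a `times` array of per-run seconds; the embedded sample stands in for a real export and its numbers are made up.

```python
import json
import statistics

# Stand-in for the contents of a file written by --export-json (made-up numbers).
SAMPLE_EXPORT = """
{"results": [
  {"command": "cmd_b", "times": [0.050, 0.052, 0.051, 0.049, 0.070]},
  {"command": "cmd_a", "times": [0.020, 0.021, 0.019, 0.022, 0.024]}
]}
"""


def summarize(export_text):
    """Compute per-command statistics from the raw run times, fastest first."""
    rows = []
    for result in json.loads(export_text)["results"]:
        times = result["times"]
        rows.append({
            "command": result["command"],
            "runs": len(times),
            "min": min(times),
            "median": statistics.median(times),
            "mean": statistics.fmean(times),
        })
    return sorted(rows, key=lambda row: row["median"])


if __name__ == "__main__":
    for row in summarize(SAMPLE_EXPORT):
        print(row)
```

The list of dicts can be handed to a dataframe library unchanged if you prefer to continue there.
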
- bee_rider 2 days ago
  
  What do you suggest? Those sound like great features.
  - Sesse__ 2 days ago
    
    I've only seen such things in internal tools so far, unfortunately, so if you see anything in public, please tell me :-) I'm just confused why everyone thinks hyperfine is so awesome, when it does not meet what I'd consider a fairly low bar for benchmarking tools? (“Best publicly available” != “great”, in my book.)
    
    sharkdp 2 days ago
    
    > “Best publicly available” != “great”
    Of course. But it is free and open source. And everyone is invited to make it better.
llimllib 2 days ago

I find k6 a lot nicer for HTTP benching, and no slower to set up than hyperfine (which I love for CLI benching): https://k6.io/
- jiehong 2 days ago
  
  Could hyperfine running curl be an alternative?
  - mosselman 2 days ago
    
    That is what I do in my blog post.

jamietheriveter a day ago

The comment about statistics that I wanted to reply to has disappeared. That commenter said:

> I stand firm in my belief that unless you can prove how CLT applies to your input distributions, you should not assume normality. And if you don't know what you are doing, stop reporting means.

I agree. My research group stopped using Hyperfine because it ranks benchmarked commands by mean, and provides standard deviation as a substitute for a confidence measure. These are not appropriate for heavy-tailed, skewed, and otherwise non-normal distributions.

It's easy to demonstrate that most empirical runtime distributions are not normal. I wrote BestGuess [0] because we needed a better benchmarking tool. Its analysis provides measures of skew, kurtosis, and Anderson-Darling distance from normal, so that you can see how normal or not is your distribution. It ranks benchmark results using non-parametric methods. And, unlike many tools, it saves all of the raw data, making it easy to re-analyze later.
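
To make "non-parametric" concrete, here is a minimal sketch of one such method, a percentile bootstrap confidence interval for the difference in medians between two commands. It is only an illustration of the general idea, not BestGuess's analysis, and the sample times are made up.

```python
import random
import statistics


def bootstrap_median_diff(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for median(b) - median(a).

    No normality assumption is made: both samples are resampled with
    replacement and the spread of the recomputed statistic is used directly.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(statistics.median(resample_b) - statistics.median(resample_a))
    diffs.sort()
    lower = diffs[int(n_resamples * alpha / 2)]
    upper = diffs[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


if __name__ == "__main__":
    # Made-up run times in ms, each with a slow tail.
    cmd_a = [10.1, 10.3, 9.9, 10.0, 10.2, 11.5, 10.1, 9.8]
    cmd_b = [12.0, 12.2, 11.9, 12.1, 14.0, 12.3, 12.0, 11.8]
    low, high = bootstrap_median_diff(cmd_a, cmd_b)
    print(f"median(b) - median(a) is in [{low:.2f}, {high:.2f}] ms (95% bootstrap)")
```

If the interval excludes zero, the two commands differ in median run time at the chosen level, with no appeal to the central limit theorem.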

My team also discovered that Hyperfine's measurements are a bit off. It reports longer run times than other tools, including BestGuess. I believe this is due to the approach, which is to call getrusage(), then fork/exec the program to be measured, then call getrusage() again. The difference in user and system times is reported as the time used by the benchmarked command, but unfortunately this time also includes cycles spent in the Rust code for managing processes (after the fork but before the exec).

BestGuess avoids external libraries (we can see all the relevant code), does almost nothing after the fork, and uses wait4() to get measurements. The one call to wait4() gives us what the OS measured by its own accounting for the benchmarked command.
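
The wait4() approach can be sketched in Python, whose `os.wait4` wraps the same system call. This is a Unix-only illustration of the idea, not BestGuess's implementation; `measure` is a hypothetical helper, and `os.posix_spawnp` is used so that very little runs in the child before the target program starts.

```python
import os
import sys


def measure(argv):
    """Spawn argv and return (exit_code, user_seconds, system_seconds).

    os.wait4() reaps one specific child and returns the resource usage the
    kernel accounted to that child.
    """
    pid = os.posix_spawnp(argv[0], argv, os.environ)
    _, status, rusage = os.wait4(pid, 0)
    if os.WIFEXITED(status):
        exit_code = os.WEXITSTATUS(status)
    else:
        exit_code = -os.WTERMSIG(status)  # killed by a signal
    return exit_code, rusage.ru_utime, rusage.ru_stime


if __name__ == "__main__":
    code, user, system = measure([sys.executable, "-c", "sum(range(10**6))"])
    print(f"exit={code} user={user:.4f}s system={system:.4f}s")
```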

While BestGuess is still a work in progress (not yet at version 1.0), my team has started using it regularly. I plan to continue its development, and I'll write it up soon at [1].

[0] https://gitlab.com/JamieTheRiveter/bestguess [1] https://jamiejennings.com

smartmic 2 days ago

A capable alternative based on "boring, old" technology is multitime [1]

Back when I needed it, it reported peak memory usage, which hyperfine was not able to show. Maybe this has changed by now.

[1] https://tratt.net/laurie/src/multitime/

shawndavidson7 2 days ago

Hyperfine seems like an incredibly useful tool for anyone working with command-line utilities. The ability to benchmark processes straightforwardly is vital for optimizing performance. I’m particularly impressed with how simple it is to use compared to other benchmarking tools. I’d love to see more examples of how Hyperfine can be integrated into different workflows, especially for large-scale applications.

https://www.osplabs.com/

edwardzcn 2 days ago

Hyperfine is great! I remember I learned about it when comparing functions with/without tail recursion (not sure if it was from the Go reference or the Rust reference). It offers simple configuration for unit tests. But I have not tried it on a DBMS (e.g., the way sysbench is used). Has anyone tried that?

7e 3 days ago

What database product does the community commonly send benchmark results to? This tool is great, but I'd love to analyze results relationally.

Zambyte a day ago

You could export data to csv and do relational analytics with DuckDB quite easily.
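
A minimal sketch of that relational workflow. To stay dependency-free it uses Python's built-in `sqlite3` as a stand-in for DuckDB, so the SQL is generic. It assumes the CSV export has a header with `command`, `mean`, `median`, `min` and `max` columns; the sample rows and the `label` column are made up for illustration.

```python
import csv
import io
import sqlite3

# Stand-in for a file written by --export-csv (made-up numbers, in seconds).
SAMPLE_CSV = """command,mean,stddev,median,user,system,min,max
cmd_a,0.0212,0.0019,0.0210,0.0150,0.0050,0.0190,0.0240
cmd_b,0.0544,0.0088,0.0510,0.0400,0.0120,0.0490,0.0700
"""


def load_results(conn, csv_text, label):
    """Append one export to a results table, tagged with a run label."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(label TEXT, command TEXT, mean REAL, median REAL, min REAL, max REAL)"
    )
    rows = [
        (label, r["command"], float(r["mean"]), float(r["median"]),
         float(r["min"]), float(r["max"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)", rows)


def fastest_by_median(conn):
    """Return (label, command, median) for the row with the lowest median."""
    return conn.execute(
        "SELECT label, command, median FROM results ORDER BY median LIMIT 1"
    ).fetchone()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_results(connection, SAMPLE_CSV, "run-1")
    print(fastest_by_median(connection))
```

Loading several exports with different labels makes it possible to compare the same command across commits or machines with ordinary SQL.
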
rmorey 2 days ago

Something like Geekbench for CLI tools would be awesome

accelbred 2 days ago

Hyperfine is a really useful tool.

Weirdest thing I've used it for is comparing io throughput on various disks.

forrestthewoods 3 days ago

Hyperfine is hyper frustrating because it only works with really really fine microsecond level benchmarks. Once you get into the millisecond range it’s worthless.

sharkdp 3 days ago

That doesn't make a lot of sense. It's more like the opposite of what you are saying. The precision of hyperfine is typically in the single-digit millisecond range. Maybe just below 1 ms if you take special care to run the benchmark on a quiet system. Everything below that (microsecond or nanosecond range) is something that you need to address with other forms of benchmarking.
But for everything in the right range (milliseconds, seconds, minutes or above), hyperfine is well suited.
- forrestthewoods 2 days ago
  
  No it’s not.
  Back in the day my goal for Advent of Code was to run all solutions in under 1 second total. Hyperfine would take like 30 minutes to benchmark a 1 second runtime.
  It was hyper frustrating. I could not find a good way to get Hyperfine to do what I wanted.
  - sharkdp 2 days ago
    
    If that's the case, I would consider it a bug. Please feel free to report it. In general, hyperfine should not take longer than ~3 seconds, unless the command itself takes > 300 ms to run. In the latter case, we do a minimum of 10 runs by default. So if your program takes 3 min for a single iteration, it would take 30 min by default, yes. But this can be controlled using the `-m`/`--min-runs` option. You can also specify the exact number of runs using `-r`/`--runs`, if you prefer that.
    > I could not find a good way to get Hyperfine to do what I wanted
    This is all documented here: https://github.com/sharkdp/hyperfine/tree/master?tab=readme-... under "Basic benchmarks". The options to control the amount of runs are also listed in `hyperfine --help` and in the man page. Please let us know if you think we can improve the documentation / discovery of those options.
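
The arithmetic behind those defaults can be written down as a small model: run at least 10 times, and keep going until about 3 seconds have been measured. This is a simplification based only on the numbers quoted in this thread (it ignores warmup runs and any calibration overhead), and the function names are hypothetical.

```python
import math


def estimated_runs(seconds_per_run, min_runs=10, min_total_seconds=3.0):
    """Number of runs under the 'at least 10 runs, at least 3 seconds' rule."""
    return max(min_runs, math.ceil(min_total_seconds / seconds_per_run))


def estimated_total_seconds(seconds_per_run, **kwargs):
    """Approximate wall-clock time spent on the timed runs."""
    return estimated_runs(seconds_per_run, **kwargs) * seconds_per_run


if __name__ == "__main__":
    for per_run in (0.25, 1.0, 180.0):
        print(per_run, estimated_runs(per_run), estimated_total_seconds(per_run))
```

Under this model a 1-second program takes about 10 seconds to benchmark, and only a 3-minute program reaches the 30 minutes mentioned above; lowering `min_runs` (the `-m`/`--min-runs` option) shrinks that proportionally.
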
  - fwip 2 days ago
    
    I've been using it for about four or five years, and never experienced this behavior.
    Current defaults: "By default, it will perform at least 10 benchmarking runs and measure for at least 3 seconds." If your program takes 1s to run, it should take 10 seconds to benchmark.
    Is it possible that your program was waiting for input that never came? One "gotcha" is that it expects each argument to be a full program, so if you ran `hyperfine ./a.out input.txt`, it will first bench a.out with no args, then try to bench input.txt (which will fail). If a.out reads from stdin when no argument is given, then it would hang forever, and I can see why you'd give up after a half hour.
    
    sharkdp 2 days ago
    
    > Is it possible that your program was waiting for input that never came?
    We do close stdin to prevent this. So you can benchmark `cat`, for example, and it works just fine.
    
    fwip 2 days ago
    
    Oh, my bad! Thank you for the correction, and for all your work making hyperfine.
anotherhue 3 days ago

It spawns a new process each time, right? I would think that would put a cap on how accurate it can get.
For my purposes I use it all the time though, quick and easy sanity-check.
- forrestthewoods 3 days ago
  
  The issue is it runs a kajillion tests to try and be “statistical”. But there’s no good way to say “just run it for 5 seconds and give me the best answer you can”. It’s very much designed for nanosecond to low microsecond benchmarks. Trying to fight this is trying to smash a square peg through a round hole.
  - sharkdp 3 days ago
    
    > The issue is it runs a kajillion tests to try and be “statistical”.
    If you see any reason for putting “statistical” in quotes, please let us know. hyperfine does not run a lot of tests, but it does try to find outliers in your measurements. This is really valuable in some cases. For example: we can detect when the first run of your program takes much longer than the rest of the runs. We can then show you a warning to let you know that you probably want to either use some warmup runs, or a "--prepare" command to clean (OS) caches if you want a cold-cache benchmark.
    > But there’s no good way to say “just run it for 5 seconds and give me the best answer you can”.
    What is the "best answer you can"?
    > It’s very much designed for nanosecond to low microsecond benchmarks.
    Absolutely not. With hyperfine, you can not measure execution times in the "low microsecond" range, let alone nanosecond range. See also my other comment.
  - PhilipRoman 3 days ago
    
    I disagree that it is designed for nano/micro benchmarks. If you want that level of detail, you need to stay within a single process, pinned to a core which is isolated from the scheduler. At least I found it almost impossible to benchmark assembly routines with it.
  - gforce_de 3 days ago
    
    At least it gives some numbers and points in a direction:
    $ hyperfine --warmup 3 './hello-world-bin-sh.sh' './hello-world-env-python3.py'
    Benchmark 1: ./hello-world-bin-sh.sh
      Time (mean ± σ): 1.3 ms ± 0.4 ms [User: 1.0 ms, System: 0.5 ms]
      ...
    Benchmark 2: ./hello-world-env-python3.py
      Time (mean ± σ): 43.1 ms ± 1.4 ms [User: 33.6 ms, System: 8.4 ms]
      ...
- oguz-ismail 3 days ago
  
  It spawns a new shell for each run and subtracts the average shell startup time from final results. Too much noise
  - PhilipRoman 3 days ago
    
    The shell can be disabled, leaving just fork+exec
    
    sharkdp 3 days ago
    
    Yes. If you don't make use of shell builtins/syntax, you can use hyperfine's `--shell=none`/`-N` option to disable the intermediate shell.
    
    oguz-ismail 3 days ago
    
    You still need to quote the command though. `hyperfine -N ls "$dir"` won't work, you need `hyperfine -N "ls ${dir@Q}"` or something. It'd be better if you could specify commands like with `find -exec`.
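
When driving hyperfine from a script, the quoting step described above can be delegated to a library. This sketch uses Python's `shlex.join` to turn an argument vector into one POSIX-quoted string; it assumes that hyperfine's `-N` mode splits its command argument by the same POSIX quoting rules, which is not verified here for every edge case. The helper names are hypothetical.

```python
import shlex


def as_single_argument(argv):
    """Join an argument vector into one shell-quoted command string."""
    return shlex.join(argv)


def hyperfine_argv(*commands):
    """Build a hyperfine invocation with one quoted string per benchmarked command."""
    return ["hyperfine", "-N", *(as_single_argument(c) for c in commands)]


if __name__ == "__main__":
    directory = "my dir/it's here"  # a path with a space and a quote
    invocation = hyperfine_argv(["ls", directory], ["my_ls", directory])
    print(invocation)
    # Splitting the quoted string again recovers the original vector.
    assert shlex.split(invocation[2]) == ["ls", directory]
```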
    
    PhilipRoman 3 days ago
    
    Oh that sucks, I really hate when programs impose useless shell parsing instead of letting the user give an argument vector natively.
    
    sharkdp 2 days ago
    
    I don't think it's useless. You can use hyperfine to run multiple benchmarks at the same time, to get a comparison between multiple tools. So if you want it to work without quotes, you need to (1) come up with a way to separate commands and (2) come up with a way to distinguish hyperfine arguments from command arguments. It's doable, but it's also not a great UX if you have to write something like
    hyperfine -N -- ls "$dir" \; my_ls "$dir"
    
    oguz-ismail 2 days ago
    
    > not a great UX
    Looks fine to me. Obviously it's too late to undo that mistake, but a new flag to enable new behavior wouldn't hurt anyone.