Compare images only at quality settings you'd actually use. Codecs are optimized for real-world use cases and may perform very poorly outside a sensible quality range.
Choosing the lowest quality may seem like a clever way to make differences obvious, but it actually makes the benchmark irrelevant. It's like running a Formula 1 race in a muddy field: it proves that tractors are faster than race cars.
The easy case: exactly the same file size
Adjust quality until the compared images have exactly the same file size. Pick the image that looks closer to the original.
Potential pitfalls:
- It's tempting to pick an image which “looks nicer”, but that's not the game codecs are playing. If the original is noisy, then the codec that preserves the noise better should be judged as better.
If a smoother version of a photo looks nicer to you, then make the benchmark fair by smoothing the photo in a photo processing tool first, and then test how that compresses. Image codecs are not Instagrams or Photoshops. They're supposed to save images with minimum distortion, not add distortions that look pretty.
- It may not be possible to achieve exact file sizes. To remove any doubt, ensure that the winner also has the smallest file size. Otherwise it could be better only because it's slightly larger (lossy codecs can use as little as one bit per pixel, so even a few bytes may make a difference).
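Since encoders don't let you request an output size directly, matching sizes usually means searching over the quality setting. Here's a minimal sketch of that search; `fake_encode` is a made-up stand-in for a real encoder (assuming, roughly, that output size grows with the quality setting):

```python
def match_file_size(encode, target_bytes, lo=1, hi=100):
    """Binary-search for the quality setting whose output size is
    closest to target_bytes. Assumes size grows (roughly)
    monotonically with the quality setting."""
    best_q, best_diff = lo, float("inf")
    while lo <= hi:
        mid = (lo + hi) // 2
        size = encode(mid)
        if abs(size - target_bytes) < best_diff:
            best_q, best_diff = mid, abs(size - target_bytes)
        if size < target_bytes:
            lo = mid + 1
        elif size > target_bytes:
            hi = mid - 1
        else:
            break
    return best_q

# Stand-in for a real encoder: an invented monotone size curve.
# In practice, encode() would compress the image at the given
# quality and return the length of the output in bytes.
def fake_encode(quality):
    return 500 + 90 * quality

q = match_file_size(fake_encode, target_bytes=5000)  # lands on quality 50
```

With a real codec the size curve isn't perfectly monotone, so in practice you may have to probe a few neighboring quality settings around the result.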
Harder case: exactly the same quality
Compress to exactly the same quality, measured very precisely with an objective tool, and compare which file is smaller.
Unfortunately, ensuring same quality is much harder than it seems:
Quality is not what you think it is
Quality “75” in one tool doesn't have to look like “75” in another tool. It's never comparable between image formats.
There's no objective mathematical definition of “quality”. It's an arbitrary made-up scale (theoretically it makes sense to measure difference on a scale from 0 to infinity, but for “quality %” you need to define what “0% quality” means). The quality setting often isn't even consistent between images, or proportional to quality perceived by humans.
Your eyes are actually terrible at judging quality
It's funny, because our eyes are supposed to be the ultimate judge, but:
- Quality is subjective. Is one big distortion worse than two smaller ones? Is a too-blurry image better than a too-noisy one? People routinely disagree on this and change their minds depending on the images tested.
- Quality is hard to quantify. You can probably judge quality on a scale of 1 to 5, but if asked to judge precisely on a scale of 1 to 100, you'd be making things up as if it were a wine tasting (“No, these pixels are too desaturated for 74/100. Definitely looks south of 73/100.”)
Subjective judgement is too “noisy” for the opinion of one person to matter. It's necessary to combine scores from hundreds of people to get statistically meaningful results.
If you don't have hundreds of people to test in a controlled environment, then you have to resort to an objective quality measurement tool. You won't be able to see the difference between JPEGs at quality 98 and 99, but a tool easily can.
Choose a good tool for objective measurement
There isn't an ideal measurement (they're all only approximations of human perception), but some are much better than others.
Don't use dumb pixel-by-pixel measurements like PSNR (peak signal-to-noise ratio) or MSE (mean squared error): it's been shown over and over again that they're easily fooled.
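A toy illustration of how MSE/PSNR can be fooled: they are blind to *where* an error lands. Below, a flat patch (a plain Python list standing in for real pixels) is distorted in two ways that score identically, even though one is essentially invisible and the other is a conspicuous speck:

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, peak=255):
    """Peak signal-to-noise ratio in dB, derived from MSE."""
    m = mse(a, b)
    return float("inf") if m == 0 else 10 * math.log10(peak ** 2 / m)

original = [128] * 100  # a flat gray 10x10 patch, flattened

# Distortion A: every pixel brightened by 2 -- essentially invisible.
uniform_shift = [p + 2 for p in original]

# Distortion B: a single pixel off by 20 -- a visible speck.
single_speck = original[:]
single_speck[0] += 20

# Both score MSE = 4.0, so their PSNR is identical too.
```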
Better tools are based on the SSIM algorithm (and its extended versions such as MS-SSIM/IW-SSIM).
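To give a feel for what SSIM measures, here's a simplified sketch of the formula computed over a single window covering the whole image; real implementations slide a small (often Gaussian-weighted) window over the image and average the results, and MS-SSIM repeats this at multiple scales:

```python
def ssim_global(a, b, peak=255):
    """SSIM over one window spanning the whole image: compares the
    means (luminance), variances (contrast), and covariance
    (structure) of the two pixel lists. Simplified sketch only."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2  # stabilizers
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )

ramp = list(range(100))  # a simple gradient "image"
noisy = [x + (5 if i % 2 else -5) for i, x in enumerate(ramp)]

# Identical images score exactly 1.0; the noisy copy scores below it.
```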
Beware of tools that don't apply gamma correction as they will be biased towards dumb encoders that don't do correction either.
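For reference, this is the standard sRGB decoding curve. It shows why skipping gamma correction skews a metric: equal steps in stored pixel values are not equal steps in actual light, so a gamma-blind tool misweights errors in shadows versus highlights:

```python
def srgb_to_linear(v):
    """Standard sRGB decoding: stored value in 0..1 -> linear light
    (the IEC 61966-2-1 piecewise curve)."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

# A 50% pixel value is only about 21.5% of the light:
mid_gray = srgb_to_linear(128 / 255)

# The same 10-unit pixel error carries far more light in bright
# areas than in shadows:
dark_step = srgb_to_linear(60 / 255) - srgb_to_linear(50 / 255)
bright_step = srgb_to_linear(210 / 255) - srgb_to_linear(200 / 255)
```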
Most tools deal with color badly. They either analyze only the grayscale version of the image (unfairly penalizing codecs that encode color very well) or measure distortions in the RGB color space, which isn't a good approximation of human perception.
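One reason raw RGB distance is a poor proxy: the eye weights the channels very unevenly. Using the Rec. 709 luma weights (inputs assumed already linear here, for simplicity), two errors that are equidistant in RGB space differ by roughly 10x in luminance impact:

```python
def rel_luminance(r, g, b):
    """Relative luminance with Rec. 709 weights; inputs assumed to be
    linear RGB in 0..1 (a real metric would undo gamma first)."""
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

gray      = (0.5, 0.5, 0.5)
green_err = (0.5, 0.6, 0.5)   # +0.1 in green
blue_err  = (0.5, 0.5, 0.6)   # +0.1 in blue: same RGB-space distance

dy_green = rel_luminance(*green_err) - rel_luminance(*gray)
dy_blue = rel_luminance(*blue_err) - rel_luminance(*gray)
# Equal errors in RGB space, but the green one shifts luminance
# roughly ten times more than the blue one.
```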
Test on many images
Benchmarks are like a game of Top Trumps: the features of the images you choose for the test will decide which codec wins.
All codecs have to make trade-offs. They are tuned for particular use-cases and have their weak spots. Some codecs are great at preserving sharp lines, but inefficient at storing noise. Some codecs compress noise well, but lose all fine details at the same time. It's possible to make a codec that handles all these cases well, but at the cost of hideous complexity and high CPU requirements.
If you test on only one or a few images, your results may be skewed by luck and outliers. When you test on many images, be careful with the statistics: similarity scores are usually on a non-linear scale and file sizes vary by a few orders of magnitude, so naive sums or averages can give misleading results. For example, you can run into Simpson's paradox, where one codec may be the best in almost all cases but still get the worst score overall (due to scoring very badly on one image).
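A small numeric illustration of that pitfall, with entirely hypothetical file sizes: summing total bytes lets one huge image decide the verdict, while the geometric mean of per-image ratios weights every image equally:

```python
import math

# Hypothetical file sizes (bytes) for ten test images under two codecs.
sizes_a = [10_000] * 9 + [900_000]   # codec A wins the first nine images
sizes_b = [12_000] * 9 + [500_000]   # codec B wins only the one huge image

wins_a = sum(a < b for a, b in zip(sizes_a, sizes_b))   # 9 out of 10
naive_verdict = sum(sizes_a) < sum(sizes_b)             # False: A "loses"!

# Geometric mean of per-image size ratios treats every image equally:
ratios = [a / b for a, b in zip(sizes_a, sizes_b)]
geo_mean = math.exp(sum(map(math.log, ratios)) / len(ratios))  # < 1: A wins
```

The naive byte total declares B the winner purely because of the single outlier, while the geometric mean (about 0.90 here, meaning A's files are ~10% smaller on a typical image) agrees with the 9-out-of-10 per-image result.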
To sum it up
Be very careful before you make sweeping judgements. It's too easy to make an unrealistic test that inaccurately measures the wrong thing.