Compare images only at quality settings you'd actually use. Codecs are optimized for real-world use cases and may perform very poorly outside a sensible quality range.
Choosing the lowest quality may seem like a clever way to make differences obvious, but it actually makes the benchmark irrelevant. It's like running a Formula 1 race in a muddy field: it proves that tractors are faster than race cars.
The easy case: exactly the same file size
Adjust quality until the compared images have exactly the same file size. Pick the image that looks closer to the original.
Potential pitfalls:
- It's tempting to pick an image which “looks nicer”, but that's not the game codecs are playing. If the original is noisy, then the codec that preserves the noise better should be judged as better.
If a smoother version of a photo looks nicer to you, then make the benchmark fair by smoothing the photo in a photo processing tool first, and then test how that compresses. Image codecs are not Instagrams or Photoshops. They're supposed to save images with minimum distortion, not add distortions that look pretty.
- It may not be possible to achieve exact file sizes. To remove any doubt, ensure that the winner also has the smallest file size. Otherwise it could be better only because it's slightly larger (lossy codecs can use as little as one bit per pixel, so even a few bytes may make a difference).
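Since encoders don't let you request an output size directly, matching sizes usually means searching over the quality setting. Here's a minimal sketch of that search; `fake_encode` is a made-up stand-in for a real encoder (assuming, roughly, that output size grows with the quality setting):

```python
def match_file_size(encode, target_bytes, lo=1, hi=100):
    """Binary-search for the quality setting whose output size is
    closest to target_bytes. Assumes size grows (roughly)
    monotonically with the quality setting."""
    best_q, best_diff = lo, float("inf")
    while lo <= hi:
        mid = (lo + hi) // 2
        size = encode(mid)
        if abs(size - target_bytes) < best_diff:
            best_q, best_diff = mid, abs(size - target_bytes)
        if size < target_bytes:
            lo = mid + 1
        elif size > target_bytes:
            hi = mid - 1
        else:
            break
    return best_q

# Stand-in for a real encoder: an invented monotone size curve.
# In practice, encode() would compress the image at the given
# quality and return the length of the output in bytes.
def fake_encode(quality):
    return 500 + 90 * quality

q = match_file_size(fake_encode, target_bytes=5000)  # lands on quality 50
```

With a real codec the size curve isn't perfectly monotone, so in practice you may have to probe a few neighboring quality settings around the result.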
Harder case: exactly the same quality
Compress to exactly the same quality, measured very precisely with an objective tool, and compare which file is smaller.
Unfortunately, ensuring same quality is much harder than it seems:
Quality is not what you think it is
Quality “75” in one tool doesn't have to look like “75” in another tool. It's never comparable between image formats.
There's no objective mathematical definition of “quality”. It's an arbitrary made-up scale (theoretically it makes sense to measure difference on a scale from 0 to infinity, but for “quality %” you need to define what “0% quality” means). The quality setting often isn't even consistent between images, or proportional to quality perceived by humans.
Your eyes are actually terrible at judging quality
It's funny, because our eyes are supposed to be the ultimate judge, but:
- Quality is subjective. Is one big distortion worse than two smaller ones? Is a too-blurry image better than a too-noisy one? People routinely disagree on this and change their minds depending on the images tested.
- Quality is hard to quantify. You can probably judge quality on a scale of 1 to 5, but if asked to judge precisely on a scale of 1 to 100, you'd be making things up as if it were a wine tasting (“No, these pixels are too desaturated for 74/100. Definitely looks south of 73/100.”)
Subjective judgement is too “noisy” for the opinion of one person to matter. It's necessary to combine scores from hundreds of people to get statistically meaningful results.
If you don't have hundreds of people to test in a controlled environment, then you have to resort to an objective quality measurement tool. You won't be able to see the difference between JPEGs at quality 98 and 99, but a tool easily can.
Choose a good tool for objective measurement
There isn't an ideal measurement (they're all only approximations of human perception), but some are much better than others.
Don't use dumb pixel-by-pixel measurements like PSNR (peak signal-to-noise ratio) or MSE (mean squared error): it's been shown over and over again that they're easily fooled.
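A toy illustration of how MSE/PSNR can be fooled: they are blind to *where* an error lands. Below, a flat patch (a plain Python list standing in for real pixels) is distorted in two ways that score identically, even though one is essentially invisible and the other is a conspicuous speck:

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, peak=255):
    """Peak signal-to-noise ratio in dB, derived from MSE."""
    m = mse(a, b)
    return float("inf") if m == 0 else 10 * math.log10(peak ** 2 / m)

original = [128] * 100  # a flat gray 10x10 patch, flattened

# Distortion A: every pixel brightened by 2 -- essentially invisible.
uniform_shift = [p + 2 for p in original]

# Distortion B: a single pixel off by 20 -- a visible speck.
single_speck = original[:]
single_speck[0] += 20

# Both score MSE = 4.0, so their PSNR is identical too.
```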
Better tools are based on the SSIM algorithm (and its extended versions such as MS-SSIM/IW-SSIM).
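To give a feel for what SSIM measures, here's a simplified sketch of the formula computed over a single window covering the whole image; real implementations slide a small (often Gaussian-weighted) window over the image and average the results, and MS-SSIM repeats this at multiple scales:

```python
def ssim_global(a, b, peak=255):
    """SSIM over one window spanning the whole image: compares the
    means (luminance), variances (contrast), and covariance
    (structure) of the two pixel lists. Simplified sketch only."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2  # stabilizers
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )

ramp = list(range(100))  # a simple gradient "image"
noisy = [x + (5 if i % 2 else -5) for i, x in enumerate(ramp)]

# Identical images score exactly 1.0; the noisy copy scores below it.
```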
Beware of tools that don't apply gamma correction as they will be biased towards dumb encoders that don't do correction either.
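For reference, this is the standard sRGB decoding curve. It shows why skipping gamma correction skews a metric: equal steps in stored pixel values are not equal steps in actual light, so a gamma-blind tool misweights errors in shadows versus highlights:

```python
def srgb_to_linear(v):
    """Standard sRGB decoding: stored value in 0..1 -> linear light
    (the IEC 61966-2-1 piecewise curve)."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

# A 50% pixel value is only about 21.5% of the light:
mid_gray = srgb_to_linear(128 / 255)

# The same 10-unit pixel error carries far more light in bright
# areas than in shadows:
dark_step = srgb_to_linear(60 / 255) - srgb_to_linear(50 / 255)
bright_step = srgb_to_linear(210 / 255) - srgb_to_linear(200 / 255)
```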
Most tools deal with color badly. They either analyze only the grayscale version of the image (unfairly penalizing codecs that encode color very well) or measure distortions in the RGB color space, which isn't a good approximation of human perception.
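One reason raw RGB distance is a poor proxy: the eye weights the channels very unevenly. Using the Rec. 709 luma weights (inputs assumed already linear here, for simplicity), two errors that are equidistant in RGB space differ by roughly 10x in luminance impact:

```python
def rel_luminance(r, g, b):
    """Relative luminance with Rec. 709 weights; inputs assumed to be
    linear RGB in 0..1 (a real metric would undo gamma first)."""
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

gray      = (0.5, 0.5, 0.5)
green_err = (0.5, 0.6, 0.5)   # +0.1 in green
blue_err  = (0.5, 0.5, 0.6)   # +0.1 in blue: same RGB-space distance

dy_green = rel_luminance(*green_err) - rel_luminance(*gray)
dy_blue = rel_luminance(*blue_err) - rel_luminance(*gray)
# Equal errors in RGB space, but the green one shifts luminance
# roughly ten times more than the blue one.
```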
Test on many images
Benchmarks are like a game of Top Trumps: the features of the images you choose for the test will decide which codec wins.
All codecs have to make trade-offs. They are tuned for particular use-cases and have their weak spots. Some codecs are great at preserving sharp lines, but inefficient at storing noise. Some codecs compress noise well, but lose all fine details at the same time. It's possible to make a codec that handles all these cases well, but at the cost of hideous complexity and high CPU requirements.
If you test on only one or a few images, your results may be skewed by luck and outliers. When you test on many images, be careful with the statistics: similarity scores are usually on a non-linear scale and file sizes vary by a few orders of magnitude, so naive sums or averages can give misleading results. For example, you can run into Simpson's paradox, where one codec may be the best in almost all cases but still get the worst score overall (due to scoring very badly on one image).
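A small numeric illustration of that pitfall, with entirely hypothetical file sizes: summing total bytes lets one huge image decide the verdict, while the geometric mean of per-image ratios weights every image equally:

```python
import math

# Hypothetical file sizes (bytes) for ten test images under two codecs.
sizes_a = [10_000] * 9 + [900_000]   # codec A wins the first nine images
sizes_b = [12_000] * 9 + [500_000]   # codec B wins only the one huge image

wins_a = sum(a < b for a, b in zip(sizes_a, sizes_b))   # 9 out of 10
naive_verdict = sum(sizes_a) < sum(sizes_b)             # False: A "loses"!

# Geometric mean of per-image size ratios treats every image equally:
ratios = [a / b for a, b in zip(sizes_a, sizes_b)]
geo_mean = math.exp(sum(map(math.log, ratios)) / len(ratios))  # < 1: A wins
```

The naive byte total declares B the winner purely because of the single outlier, while the geometric mean (about 0.90 here, meaning A's files are ~10% smaller on a typical image) agrees with the 9-out-of-10 per-image result.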
To sum it up
Be very careful before you make sweeping judgements. It's too easy to make an unrealistic test that inaccurately measures the wrong thing.