Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks

1School of Artificial Intelligence, Xidian University   2State Key Laboratory of EMIM, Xidian University
Equal contribution   *Corresponding author
Teaser figure.

Abstract

Image aesthetic assessment (IAA) has extensive applications in content creation, album management, and recommendation systems. In such applications, one often needs to pick out the most aesthetically pleasing image from a series of images with subtle aesthetic variations, a task we refer to as fine-grained IAA (FG-IAA). Unfortunately, state-of-the-art IAA models are typically designed for coarse-grained evaluation, where images with notable aesthetic differences are scored independently on an absolute scale. Such models are inherently limited in discriminating fine-grained aesthetic differences.

To address the dilemma, we contribute FGAesthetics, a fine-grained IAA database with 32,217 images organized into 10,028 series, which are sourced from diverse categories including Natural, AIGC, and Cropping. Annotations are collected via pairwise comparisons within each series. We also devise Series Refinement and Rank Calibration to ensure the reliability of data and labels.

Based on FGAesthetics, we further propose FGAesQ, a novel IAA framework that learns discriminative aesthetic scores from relative ranks through Difference-preserved Tokenization (DiffToken), Comparative Text-assisted Alignment (CTAlign), and Rank-aware Regression (RankReg). FGAesQ enables accurate aesthetic assessment in fine-grained scenarios while still maintaining competitive performance in coarse-grained evaluation.

FGAesthetics Dataset

We contribute FGAesthetics, a benchmark specifically designed for fine-grained aesthetic assessment, comprising 32,217 images organized into 10,028 series with aesthetic ranking labels. To ensure diversity, we collect image series from three distinct sources: Natural, AIGC, and Aesthetic cropping. These series undergo rigorous filtering via a Metrics-MLLMs-Human refinement protocol, discarding series whose images are either markedly dissimilar in appearance or aesthetically indistinguishable. Human annotators then perform pairwise comparisons within each series; pairs with ambiguous relative ordering are filtered out to calibrate the global rankings and produce the final dataset.


The pipeline consists of three stages:
(a) Data Collection: Visually similar photo series are collected from three distinct sources: Natural, AIGC, and Cropping.
(b) Series Refinement: Noisy series data undergo rigorous filtering using a Metric-MLLMs-Human refinement protocol.
(c) Rank Calibration: Pairwise comparisons are annotated within each series, excluding data that cannot be aesthetically distinguished to obtain calibrated aesthetic rankings.
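The Rank Calibration stage (c) can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual protocol: it assumes simple vote counting, where each ordered pair maps to the number of annotators preferring the first image, and pairs whose vote margin falls below a hypothetical `min_margin` threshold are treated as aesthetically indistinguishable and excluded.

```python
from itertools import combinations

def calibrate_ranks(items, comparisons, min_margin=1):
    """Derive a within-series aesthetic ranking from pairwise votes.

    `comparisons` maps an ordered pair (a, b) to the number of
    annotators who preferred a over b.  Pairs whose vote margin is
    below `min_margin` are excluded as ambiguous, mirroring the
    filtering described in the Rank Calibration stage.
    """
    wins = {item: 0 for item in items}
    for a, b in combinations(items, 2):
        votes_a = comparisons.get((a, b), 0)
        votes_b = comparisons.get((b, a), 0)
        if abs(votes_a - votes_b) < min_margin:
            continue  # ambiguous pair: filtered out
        winner = a if votes_a > votes_b else b
        wins[winner] += 1
    # Higher win count -> higher aesthetic rank within the series.
    return sorted(items, key=lambda it: wins[it], reverse=True)

series = ["img_a", "img_b", "img_c"]
votes = {("img_a", "img_b"): 5, ("img_b", "img_a"): 0,
         ("img_c", "img_a"): 4, ("img_a", "img_c"): 1,
         ("img_c", "img_b"): 3, ("img_b", "img_c"): 2}
print(calibrate_ranks(series, votes))  # -> ['img_c', 'img_a', 'img_b']
```

In practice a probabilistic aggregation (e.g. Bradley-Terry) would be more robust to noisy votes; win counting is used here only to keep the sketch self-contained.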

Comparison between the proposed FGAesthetics and representative IAA datasets.

FGAesQ Model

Building on FGAesthetics, we further propose FGAesQ, a new IAA model that enables accurate aesthetic assessment in fine-grained scenarios while maintaining robust performance in coarse-grained evaluation. This is achieved by learning discriminative aesthetic scores from relative ranks, where coarse-grained data establishes foundational aesthetic perception and fine-grained data refines the regression space for subtle aesthetic distinctions. Specifically, we introduce Difference-preserved Tokenization (DiffToken) that maintains distinctive details between similar images while scaling homogeneous regions. Comparative Text-assisted Alignment (CTAlign) is designed to further enhance the discrimination of visual representations. Finally, we develop Rank-aware Regression (RankReg), which leverages aesthetic rankings to calibrate score predictions, ensuring consistency between absolute assessments and relative aesthetic ordering.
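To make the RankReg idea concrete, the sketch below shows one common way to calibrate absolute scores with relative ranks: a pairwise hinge penalty that fires whenever a higher-ranked image does not out-score a lower-ranked one by at least a margin. The function name, `margin` value, and rank convention (rank 1 = most aesthetic) are illustrative assumptions, not the paper's exact loss.

```python
def rank_aware_loss(scores, ranks, margin=0.1):
    """Hinge-style penalty enforcing that predicted scores respect the
    annotated ranking (rank 1 = most aesthetic).  A hypothetical
    stand-in for RankReg's ordering constraint, averaged over all
    ordered pairs within a series.
    """
    loss, pairs = 0.0, 0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if ranks[i] < ranks[j]:  # image i is ranked above image j
                # Penalize unless score_i exceeds score_j by `margin`.
                loss += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return loss / max(pairs, 1)

# A correctly ordered, well-separated series incurs no penalty:
print(rank_aware_loss([1.0, 0.5, 0.0], ranks=[1, 2, 3]))  # -> 0.0
# A violated ordering (rank-2 image out-scored by rank-3) is penalized:
print(rank_aware_loss([0.8, 0.5, 0.6], ranks=[1, 2, 3]))  # > 0.0
```

In training, such a term would be combined with a standard score-regression loss on coarse-grained data, so absolute assessments and relative aesthetic ordering stay consistent.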


Overall pipeline of the proposed FGAesQ.

Quantitative Results

Benchmark Results on FGAesthetics. First, existing IAA models show significant performance degradation in fine-grained evaluation scenarios, particularly for series-level testing. Second, MLLM-based IAA approaches outperform traditional deep learning-based methods, with Q-Align achieving encouraging results even against fine-tuned models; this stems from the enhanced fine-grained perception afforded by larger model capacity. Third, input scale-focused methods excel on cropping data, with fine-tuned MUSIQ achieving second-best results. Finally, FGAesQ achieves the best performance across all evaluation protocols, demonstrating superior aesthetic discrimination capabilities in FG-IAA. Additionally, we evaluate FGAesQ with DiffToken excluded during inference, which operates without reference image comparison yet still maintains competitive performance.


Performance comparison between the proposed FGAesQ and state-of-the-art IAA methods on FGAesthetics.


Balance between Coarse- and Fine-grained IAA. A robust IAA model is expected to handle both coarse- and fine-grained evaluation scenarios. Table-Left compares performance on both tasks, where FGAesQ achieves the highest average results. Existing SOTA models fine-tuned with ranking loss show enhanced fine-grained capabilities but suffer severe coarse-grained degradation.
Cross-dataset Evaluation. We further conduct generalization evaluations on three IAA benchmarks, including ICAA17K (color aesthetics), AADB (aesthetic attributes), and TAD66K (theme-specific aesthetics), as summarized in Table-Right. FGAesQ demonstrates superior generalization across all datasets, particularly exhibiting significant advantages on AADB. This suggests that training with fine-grained aesthetic comparisons enhances the perception of aesthetic attributes, enabling more accurate evaluation.


Left: Balance between coarse-grained (AVA) and fine-grained (FGAesthetics) evaluations. Right: Cross-dataset validation on three IAA benchmarks: ICAA17K, AADB, and TAD66K.

Qualitative Results

BibTeX


@inproceedings{yang2026fine,
  title={Fine-grained Image Aesthetic Assessment: Learning Discriminative Scores from Relative Ranks},
  author={Yang, Zhichao and Wang, Jianjie and Zhang, Zhixianhe and Xie, Pangu and Sheng, Xiangfei and Chen, Pengfei and Li, Leida},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}