GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining

Deen Dayal Mohan^*, Hossein Souri^*, Vitali Petsiuk^*,
Juhong Min, Gopal Sharma, Luowei Zhou, Suren Kumar^†

AI Center — Mountain View, Samsung Electronics

^* Equal contribution, order chosen randomly.

^† Work done while at Samsung Electronics.

GoldiCLIP Architecture Overview

Abstract

Until recently, the success of large-scale vision-language models (VLMs) has relied primarily on billion-sample datasets, posing a significant barrier to progress. Recent work has begun to close this gap by improving supervision quality, but each approach addresses only a subset of the weaknesses in contrastive pretraining. We present GoldiCLIP, a framework built on a Goldilocks principle of finding the right balance of supervision signals. Our training framework combines three key innovations: (1) a text-conditioned self-distillation method that aligns both text-agnostic and text-conditioned features; (2) an encoder-integrated decoder with a Visual Question Answering (VQA) objective that enables the encoder to generalize beyond caption-like queries; and (3) an uncertainty-based weighting mechanism that automatically balances the heterogeneous losses. Trained on just 30 million images, roughly 300x less data than leading methods, GoldiCLIP achieves state-of-the-art performance among data-efficient approaches, improving over the best comparable baseline by 2.2 points on MSCOCO retrieval, 2.0 on fine-grained retrieval, and 5.9 on question-based retrieval, while remaining competitive with billion-scale models.

Method

GoldiCLIP is a unified training framework that systematically integrates diverse supervision signals to dramatically improve data efficiency in vision-language pre-training. Our framework has three key components:

Text-Conditioned Self-Distillation: A novel self-distillation method that extends to both text-conditioned and text-agnostic representations, improving local-to-global consistency to enhance vision encoder performance.
Decoder Objectives with VQA: Incorporates a transformer-based decoder that autoregressively generates text sequences conditioned on visual features. By integrating a Visual Question Answering (VQA) objective, the model learns a more robust feature space encoding object identities, attributes, and spatial relationships.
Task Balancing: Optimizes the model using a diverse set of six supervisory signals via uncertainty-based multi-task learning, eliminating the need for expensive hyperparameter tuning while avoiding training collapse.
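
The task-balancing idea can be illustrated with a minimal sketch. The exact formulation used by GoldiCLIP is not given here, so this assumes the common homoscedastic-uncertainty form, in which each task i has a learnable log-variance s_i and the combined loss is sum_i exp(-s_i) * L_i + s_i. The function names and the six loss values are hypothetical, and the losses are held fixed only to show where the weights settle.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i, where s_i = log(sigma_i^2).

    Assumed formulation (homoscedastic-uncertainty weighting), not taken from the source.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need one log-variance per loss")
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

def log_var_gradients(losses, log_vars):
    """Analytic gradient of the combined loss w.r.t. each s_i: 1 - exp(-s_i) * L_i."""
    return [1.0 - math.exp(-s) * loss for loss, s in zip(losses, log_vars)]

# Six hypothetical loss values standing in for the six supervisory signals.
losses = [2.0, 0.5, 1.0, 4.0, 0.25, 1.5]
log_vars = [0.0] * 6  # s_i = 0 means every task starts with weight exp(0) = 1

# Plain gradient descent on s_i only; in real training the losses change
# every step and s_i is learned jointly with the network parameters.
for _ in range(200):
    grads = log_var_gradients(losses, log_vars)
    log_vars = [s - 0.1 * g for s, g in zip(log_vars, grads)]

# Effective per-task weights after the s_i have settled.
weights = [math.exp(-s) for s in log_vars]
```

With the losses frozen, each weight converges to 1 / L_i, so larger losses are down-weighted and smaller ones up-weighted without any hand-tuned coefficients. The additive s_i term is what keeps the weights from collapsing to zero.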

Results

Trained on just 30 million images, GoldiCLIP achieves state-of-the-art performance among data-efficient approaches. Our evaluation demonstrates significant improvements over existing models (such as FLAIR) across diverse benchmarks including:

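Zero-Shot Retrieval: Highest R@1 among the compared models on MSCOCO and Flickr30k, in both text-to-image and image-to-text directions.
Fine-Grained Retrieval: Strong fine-grained understanding on challenging datasets like DOCCI-FG and IIW-FG, for example 27.0 text-to-image R@1 on DOCCI-FG versus 25.0 for FLAIR.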
Zero-Shot Semantic Segmentation: Outperforms all models trained on comparable datasets by capturing nuanced spatial representations.

Method	Data Size	MSCOCO T2I (R@1)	MSCOCO I2T (R@1)	Flickr30k T2I (R@1)	Flickr30k I2T (R@1)	DOCCI-FG T2I (R@1)
DreamLIP	30M	44.8	62.3	73.3	89.9	21.6
COSMOS	30M	52.5	68.0	80.3	92.9	23.1
FLAIR	30M	53.3	68.0	81.1	94.7	25.0
SigLIP 2	10B	52.5	70.0	80.0	91.8	23.2
GoldiCLIP (Ours)	30M	55.5	70.3	83.0	94.8	27.0

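Zero-shot image-text retrieval on standard (MSCOCO, Flickr30k) and fine-grained (DOCCI-FG) benchmarks, comparing GoldiCLIP to other models at similar and larger scales. T2I denotes text-to-image retrieval and I2T denotes image-to-text retrieval; all values are Recall@1 (%).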
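
For readers unfamiliar with the R@1 metric reported in the table, the following is a minimal sketch of how Recall@K can be computed from a similarity matrix. It assumes a simplified one-to-one pairing in which query i matches candidate i; MSCOCO and Flickr30k pair each image with several captions, so the actual evaluation protocol is more involved. The function name and the toy scores are illustrative only.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose ground-truth candidate is among the top-k scores.

    similarity[i][j] is the score between query i and candidate j;
    the ground-truth match for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return 100.0 * hits / len(similarity)

# Toy scores: rows are text queries, columns are images.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

t2i_r1 = recall_at_k(sim, k=1)                              # text-to-image
i2t_r1 = recall_at_k([list(col) for col in zip(*sim)], k=1)  # image-to-text (transpose)
```

In this toy example two of the three queries rank their match first in each direction, so both values come out to about 66.7. Transposing the matrix swaps the roles of query and candidate, which is why the T2I and I2T columns can differ for the same model.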

BibTeX

@article{mohan2026goldiclip,
  title={GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining},
  author={Mohan, Deen Dayal and Souri, Hossein and Petsiuk, Vitali and Min, Juhong and Sharma, Gopal and Zhou, Luowei and Kumar, Suren},
  journal={arXiv preprint},
  year={2026}
}