Wiki · Concept · Last reviewed May 19, 2026

CLIP

CLIP, short for Contrastive Language-Image Pretraining, is a model family that learns a shared embedding space for images and text by matching captions to images at scale.

Definition

CLIP is a contrastive multimodal training approach introduced by OpenAI. It trains an image encoder and a text encoder so that matching image-text pairs are close in embedding space and nonmatching pairs are farther apart.

Because both encoders map into the same space, images can be queried with language. Instead of training a classifier for every label, a system can compare an image embedding against the embeddings of text prompts such as "a diagram of a neural network" or "a photo of a dog" and pick the closest match.
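
The prompt comparison described above can be sketched in a few lines. This is a minimal illustration, not CLIP's actual code: the three-dimensional vectors are hand-written stand-ins for real encoder outputs, and the function name zero_shot_classify and the logit_scale value are assumptions chosen for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, prompt_embs, logit_scale=100.0):
    """Pick the prompt whose embedding is closest to the image embedding.

    logit_scale is an assumed value; real models learn their own scale.
    """
    labels = list(prompt_embs)
    logits = [logit_scale * cosine(image_emb, prompt_embs[label]) for label in labels]
    # Numerically stable softmax over the candidate prompts.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs

# Hypothetical embeddings; a real system would get these from the two encoders.
prompt_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a diagram of a neural network": [0.0, 0.2, 0.9],
}
image_emb = [0.8, 0.3, 0.1]

best, probs = zero_shot_classify(image_emb, prompt_embs)
print(best)
```

Adding a new label only requires embedding one more prompt; no classifier is retrained.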

Mechanism

CLIP-style training uses large batches of image-caption pairs. Within a batch, every image is scored against every caption, and the loss raises the similarity of the true pairs while lowering it for all other combinations. The result is not just a classifier; it is a shared language-image coordinate system.
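
The batch-level comparison can be made concrete with a small sketch of a symmetric contrastive loss. It assumes the embeddings have already been produced by the two encoders, uses plain Python lists instead of tensors, and treats the temperature of 0.07 as an illustrative constant (in practice the temperature is typically a learned parameter). The helper names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index (stable form)."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(value - top) for value in logits))
    return log_z - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of images matches pair k of texts."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] compares image i with caption j.
    logits = [[dot(img, txt) / temperature for txt in texts] for img in images]
    # Image-to-text: each row should select its own caption.
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image: each column should select its own image.
    columns = [[logits[row][k] for row in range(n)] for k in range(n)]
    loss_t2i = sum(cross_entropy(columns[k], k) for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

The loss is small when each image is most similar to its own caption and grows when captions are mismatched, which is the signal that pulls matching pairs together during training.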

That shared space later became important for image search, zero-shot classification, content filtering, dataset analysis, generative-image guidance, and multimodal assistant systems.

Uses

Zero-shot classification. A model can classify images using natural-language label prompts without task-specific training.

Image retrieval. Users can search visual material with text queries.

Dataset curation. Image collections can be filtered, clustered, deduplicated, or audited through text-image similarity.

Generative media. CLIP-like scoring influenced early text-to-image systems and broader multimodal generation pipelines.
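
The retrieval and curation uses above reduce to similarity lookups over stored embeddings. The sketch below assumes image embeddings have been precomputed and kept in a dictionary; the file names, vectors, and the 0.98 duplicate threshold are hypothetical values chosen for illustration, and a production system would likely use an approximate nearest-neighbor index rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_emb, image_index, top_k=2):
    """Return the top_k image ids most similar to a text-query embedding."""
    scored = [(cosine(query_emb, emb), image_id) for image_id, emb in image_index.items()]
    scored.sort(reverse=True)
    return [image_id for _, image_id in scored[:top_k]]

def near_duplicates(image_index, threshold=0.98):
    """List pairs of image ids whose embeddings are nearly identical."""
    ids = sorted(image_index)
    pairs = []
    for position, first in enumerate(ids):
        for second in ids[position + 1:]:
            if cosine(image_index[first], image_index[second]) >= threshold:
                pairs.append((first, second))
    return pairs

# Hypothetical precomputed image embeddings.
image_index = {
    "dog_1.jpg": [0.9, 0.1, 0.0],
    "dog_1_copy.jpg": [0.89, 0.11, 0.01],
    "diagram.png": [0.0, 0.2, 0.9],
}
# Stand-in for the text embedding of "a photo of a dog".
query_emb = [0.8, 0.2, 0.1]

print(search(query_emb, image_index))
print(near_duplicates(image_index))
```

The same index serves both tasks: text queries rank images for search, while image-to-image similarity flags candidates for deduplication or audit.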

Risk Pattern

CLIP inherits the biases, categories, captions, and cultural assumptions of its training data. It can attach confident language to ambiguous images, make harmful associations, or turn visual interpretation into an apparently neutral score.
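
One way an ambiguous image ends up with a confident-looking label is the scoring step itself. The sketch below uses made-up similarity values and an assumed logit scale of 100 to show how a small gap in cosine similarity can turn into a large gap in reported probability once the scores are passed through a softmax over a closed set of prompts.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities for one ambiguous image.
similarities = {"a photo of a dog": 0.21, "a photo of a wolf": 0.19}
logit_scale = 100.0  # assumed value for illustration

labels = list(similarities)
probs = softmax([logit_scale * similarities[label] for label in labels])
for label, p in zip(labels, probs):
    print(f"{label}: {p:.2f}")
```

Both prompts match the image weakly and almost equally, yet the first label receives the large majority of the probability mass. The softmax also forces the scores to sum to one, so the output cannot express "none of these prompts fit."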

Governance questions include dataset provenance, consent, cultural labeling, biometric misuse, surveillance, safety-filter overreach, and the use of language prompts to steer visual judgment.
