Consider downloading this Jupyter Notebook and running it locally, or test it with Colab.
Overview
This notebook presents a step-by-step walkthrough of an interactive demo that explores how Large Language Models (LLMs) can perform image classification purely through in-context learning. Inspired by the idea that LLMs can function as general pattern recognizers, this experiment uses small image snippets of weld defects, each labeled with its defect type, to examine how effectively a multimodal LLM (here, qwen-vl-plus) can internalize visual defect patterns and reproduce consistent judgments on new samples.
The goal is to illustrate how an LLM—without explicit training, fine-tuning, or feature engineering—can infer quality cues (defects) from a few examples and generalize to other images.
Background
This demo focuses on real weld defect images categorized into 6 classes:

- Weld Cracks
- Burn Through
- Lack of Fusion
- Slag Inclusion
- Weld Splatter
- Surface Porosity
Each example image forms a labeled pairing:

- A compact visual representation
- A category label from one of the 6 classes
What makes this setup particularly compelling:

- The model was checked beforehand to confirm it could not classify these industrial defects correctly without ICL examples
- The model receives visual examples only through prompt context
- No training or gradient updates occur—classification arises from pattern matching
- Prompts can include descriptions, annotations, or multi-step chains of examples
- The test images require generalization, not memorization
- Predictions are generated sample-by-sample, mimicking standard evaluation flows
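The pre-check in the first bullet (verifying the model cannot classify these defects without examples) can be sketched as a zero-shot prompt. Everything here is illustrative: `zero_shot_prompt` and the placeholder image part are assumptions, not code from the notebook.

```python
# Hypothetical zero-shot sanity check: ask the model to classify a weld image
# with NO in-context examples, to confirm it cannot already solve the task.
DEFECT_LABELS = ["Weld Cracks", "Burn Through", "Lack of Fusion",
                 "Slag Inclusion", "Weld Splatter", "Surface Porosity"]

def zero_shot_prompt(test_part, labels=DEFECT_LABELS):
    """Build an OpenAI-style message list containing only the test image."""
    question = "Classify this weld defect. Choose one of: " + ", ".join(labels)
    return [{"role": "user", "content": [question, test_part]}]

# A placeholder image part stands in for the real base64-encoded image.
messages = zero_shot_prompt({"type": "image_url", "image_url": {"url": "..."}})
```

If the model's zero-shot answers are no better than chance, any later accuracy gain can be attributed to the in-context examples rather than prior knowledge.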
This setting provides a clear benchmark for understanding how well LLMs can perform visual classification tasks when guided only through carefully constructed prompts.
Let’s Take a Look at an Example
The illustration below shows an example of weld defects. Given a set of labeled samples in context, the LLM must detect the defect for new, unseen defect images during evaluation.
LLM as the Classifier
In this demo, we use qwen-vl-plus in non-reasoning mode, prompting the model to rely on direct pattern recognition rather than symbolic explanation or analytic reasoning.
The workflow proceeds as follows:

- Provide 9 total images of weld defects (1-2 for each defect) as in-context examples
- Send a new unlabeled image through the prompt to classify
- Ask the model to output the category of the new unlabeled image
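The steps above can be sketched as a single prompt builder. This is a minimal illustration: `build_icl_messages` and the placeholder image parts are hypothetical names, and the real base64-encoding helper appears later in the notebook.

```python
# Minimal sketch: interleave each labeled example image with its text label,
# then append the question and the unlabeled test image.
def build_icl_messages(examples, test_part):
    """examples: list of (label, image_part) pairs; returns OpenAI-style messages."""
    contents = []
    for i, (label, part) in enumerate(examples, start=1):
        contents.append(f"Example {i}: {label}")  # text label for this example
        contents.append(part)                     # the example image itself
    contents.append("What is the defect in the test image? Only return the label.")
    contents.append(test_part)
    return [
        {"role": "system",
         "content": "You are an expert in detecting industrial defects. "
                    "By only using the provided examples, classify the defect."},
        {"role": "user", "content": contents},
    ]
```

Interleaving label text with each image keeps the pairing unambiguous for the model, which matters more as the number of in-context images grows.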
The LLM functions as a lightweight, prompt-driven classifier—absorbing visual differences, structural patterns, and defect signatures from the in-context examples.
Evaluation
Finally, we compare the model’s predicted labels against ground-truth labels and compute accuracy, which provides insight into how effectively an LLM can approximate visual quality-control decisions through in-context learning alone—without any dedicated training pipeline.
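As a toy illustration of this comparison (the labels here are made up, not taken from the real evaluation), accuracy is simply the fraction of case-insensitive exact matches:

```python
# Hypothetical predictions vs. ground truth; accuracy is the fraction of
# case-insensitive exact matches, mirroring the evaluation loop used later.
preds = ["Cracks", "Splatter", "cracks"]
truth = ["Cracks", "Splatter", "Burn Through"]
accuracy = sum(p.lower() == t.lower() for p, t in zip(preds, truth)) / len(truth)
print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.67
```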
Code Overview
The implementation is structured modularly, with each component handling a distinct stage of the ICL classification pipeline. This separation makes the system easy to modify, extend, and reuse:

- Data loading & preprocessing: Read images, convert to model-compatible format
- Visualization: Display sets of good/bad examples
- Prompt construction: Insert labeled samples into few-shot prompts
- LLM inference: Retrieve predictions one image at a time
Extracted text content from the model's response:
The image shows four different types of welding defects, each accompanied by a visual example and a label. The first three defects are labeled as "Weld Splatter," "Burn Through," and "Surface Porosity." The fourth defect is not labeled, and the question mark suggests that you need to identify it.
Let's analyze the fourth image:
1. **Visual Characteristics**: The weld in the fourth image appears to have a series of small, irregular holes or cavities along the length of the weld bead. These cavities are distributed unevenly and seem to be embedded within the weld metal.
2. **Comparison with Other Defects**:
- **Weld Splatter**: This defect involves molten metal being ejected from the weld pool and solidifying on the surrounding surface. It does not match the pattern seen in the fourth image.
- **Burn Through**: This defect occurs when the heat input is too high, causing the base metal to melt through, resulting in a hole or gap in the weld. The fourth image does not show a hole but rather internal cavities.
- **Surface Porosity**: This defect involves gas pockets forming on the surface of the weld. While the fourth image does have some surface imperfections, the primary issue appears to be internal cavities.
3. **Identification**: The internal cavities in the weld bead are characteristic of **Internal Porosity**. Internal porosity occurs when gases become trapped within the weld metal during solidification, forming voids or cavities.
Therefore, the defect in the fourth image is **Internal Porosity**.
The model already exhibits a degree of in-context learning here, albeit in a highly flexible and unstructured manner. Further analysis will explore this in greater detail.
```python
#@title **Download Data from GitHub**
import os

if not os.path.exists("intro_to_icl_data"):
    !git clone https://github.com/hsiang-fu/intro_to_icl_data.git
```
The cell ICL Image Classification below is responsible for running ICL weld defect image classification. It loads labeled training examples, constructs the few-shot prompt, sends the prompt to the LLM for each test image, parses the prediction, evaluates correctness, and finally reports overall accuracy.
Each test image is displayed, and the model’s prediction and the ground truth label are printed. Running each cell will perform inference across all test images.
The cell performs several steps. First, it builds the ICL training examples by loading labeled training images. Each example is converted into an OpenAI-style image part so it can be embedded directly into the prompt. These form the annotated few-shot demonstrations the model uses to learn the classification pattern. Next, it constructs the full ICL prompt for each test item by including the instruction, all labeled example images, the unlabeled test image, and a rule specifying that the model should respond only with the label of the weld defect. Then it loads the unseen test images and their ground-truth labels. For each test image, the cell displays the image, sends the entire ICL prompt to Qwen, reads the model’s label prediction, compares it to the ground truth, and stores the results. After all images are processed, the cell computes summary metrics such as accuracy, total number of correct predictions, and incorrect predictions.
Each evaluation cycle outputs the test weld image, the model’s predicted label, the true label, and whether the prediction was correct. At the end, the code prints a performance summary showing the model’s accuracy across all eleven test images.
```python
#@title **ICL Image Classification**
import base64
from IPython.display import Image, display

test_labels = [
    "Burn Through", "Cracks", "Lack of Fusion", "Slag Inclusion", "Splatter",
    "Surface Porosity", "Cracks", "Cracks", "Lack of Fusion", "Splatter",
    "Surface Porosity",
]

def load_part_as_dict(path):
    # Read the image file and encode it as base64
    with open(path, "rb") as img_file:
        base64_image = base64.b64encode(img_file.read()).decode("utf-8")
    # Build a base64 data URL (assuming JPG format)
    base64_url = f"data:image/jpg;base64,{base64_image}"
    return {"type": "image_url", "image_url": {"url": base64_url}}

train_paths = [f"intro_to_icl_data/industrial_defects/train{i}.jpg"
               for i in range(1, 10)]
# One label per training image (train1.jpg ... train9.jpg)
train_labels = [
    "Cracks", "Surface Porosity", "Slag Inclusion", "Splatter", "Burn Through",
    "Lack of Fusion", "Splatter", "Slag Inclusion", "Surface Porosity",
]
train_parts = [load_part_as_dict(p) for p in train_paths]

def classify_image(label, index):
    test_path = f"intro_to_icl_data/industrial_defects/test{index}.jpg"
    test_part = load_part_as_dict(test_path)
    contents = []
    for i, (tlabel, tpart) in enumerate(zip(train_labels, train_parts)):
        contents.append(f"Example {i+1}: {tlabel}")
        contents.append(tpart)
    contents.extend([
        "What is the defect in the test image? Only return the label.",
        test_part,
    ])
    # `client` is an OpenAI-compatible client configured for Qwen,
    # assumed to be created earlier in the notebook.
    response = client.chat.completions.create(
        model="qwen-vl-plus",
        messages=[
            {"role": "system",
             "content": "You are an expert in detecting industrial defects. "
                        "By only using the provided examples, classify the defect.\n"},
            {"role": "user", "content": contents},
        ],
    )
    return response, test_path

correct = 0
results = []
print("Starting Image Classification for Welding Defects\n")
for i, label in enumerate(test_labels, start=1):
    response, path = classify_image(label, i)
    pred = response.choices[0].message.content
    display(Image(filename=path, height=300))
    print(f"\nGround Truth: {label}")
    print(f"Model Output: {pred}\n")
    is_correct = (pred.lower() == label.lower())
    results.append((label, pred, is_correct))
    correct += int(is_correct)

accuracy = correct / len(test_labels)
print(f"Overall Accuracy: {accuracy:.2f}\n")
```
Starting Image Classification for Welding Defects
Ground Truth: Burn Through
Model Output: Burn Through
Ground Truth: Cracks
Model Output: Cracks
Ground Truth: Lack of Fusion
Model Output: Cracks
Ground Truth: Slag Inclusion
Model Output: Surface Porosity
Ground Truth: Splatter
Model Output: Splatter
Ground Truth: Surface Porosity
Model Output: Surface Porosity
Ground Truth: Cracks
Model Output: Surface Porosity
Ground Truth: Cracks
Model Output: Cracks
Ground Truth: Lack of Fusion
Model Output: Cracks
Ground Truth: Splatter
Model Output: Splatter
Ground Truth: Surface Porosity
Model Output: Surface Porosity
Overall Accuracy: 0.64
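The 0.64 figure matches a hand tally of the printed pairs above:

```python
# (ground truth, model output) pairs transcribed from the run above.
pairs = [
    ("Burn Through", "Burn Through"), ("Cracks", "Cracks"),
    ("Lack of Fusion", "Cracks"), ("Slag Inclusion", "Surface Porosity"),
    ("Splatter", "Splatter"), ("Surface Porosity", "Surface Porosity"),
    ("Cracks", "Surface Porosity"), ("Cracks", "Cracks"),
    ("Lack of Fusion", "Cracks"), ("Splatter", "Splatter"),
    ("Surface Porosity", "Surface Porosity"),
]
correct = sum(truth == pred for truth, pred in pairs)
print(f"{correct}/{len(pairs)} = {correct / len(pairs):.2f}")  # 7/11 = 0.64
```

Both "Lack of Fusion" test images were misread as "Cracks", suggesting those two classes look most similar at this image resolution.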
The cell Baseline SVM Model implements the traditional machine-learning baseline used for comparison against the in-context learning approach. Rather than learning patterns directly from image pixels, this baseline relies on hand-crafted visual descriptors—specifically Histogram of Oriented Gradients (HOG)—which are then classified using a linear Support Vector Machine (SVM). The code loads the training images, converts them to grayscale, resizes each to 256×256, and extracts their HOG feature vectors. The same preprocessing steps are applied to the test set, ensuring a consistent feature representation. Once the features are assembled, a LinearSVC classifier is trained and evaluated on the same set of test images used by the ICL method. The resulting accuracy provides a structured, feature-engineered benchmark to compare against the LLM’s prompt-based classification performance.
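Since the baseline cell itself is not reproduced here, the following is a minimal sketch of the described pipeline, assuming scikit-image and scikit-learn are available. It runs on synthetic stand-in images rather than the real weld photos; the image-loading code and file layout are assumptions, not the notebook's originals.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.transform import resize
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_hog(image):
    """Grayscale, resize to 256x256, and compute a HOG descriptor."""
    gray = rgb2gray(image) if image.ndim == 3 else image
    gray = resize(gray, (256, 256), anti_aliasing=True)
    return hog(gray, orientations=9, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2))

# Synthetic stand-ins for the real train/test weld images, which the
# notebook would load from disk (e.g. with skimage.io.imread).
rng = np.random.default_rng(0)
train_images = [rng.random((128, 128, 3)) for _ in range(9)]
test_images = [rng.random((128, 128, 3)) for _ in range(11)]
train_labels = ["Cracks", "Surface Porosity", "Slag Inclusion", "Splatter",
                "Burn Through", "Lack of Fusion", "Splatter", "Slag Inclusion",
                "Surface Porosity"]
test_labels = ["Burn Through", "Cracks", "Lack of Fusion", "Slag Inclusion",
               "Splatter", "Surface Porosity", "Cracks", "Cracks",
               "Lack of Fusion", "Splatter", "Surface Porosity"]

# Extract one HOG feature vector per image for train and test splits.
X_train = np.stack([extract_hog(img) for img in train_images])
X_test = np.stack([extract_hog(img) for img in test_images])

# Train the linear SVM and score it on the held-out test features.
clf = LinearSVC().fit(X_train, train_labels)
accuracy = float((clf.predict(X_test) == np.array(test_labels)).mean())
print(f"Baseline SVM Accuracy: {accuracy:.2f}")
```

With nine training images and six classes, some classes have only one example, which is precisely the low-data regime where an engineered-feature baseline struggles.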
This demo illustrates how LLMs can perform in-context learning (ICL) for multi-class image classification, using real weld defect imagery as the target domain. By supplying the model with a small set of annotated image–label pairs—covering six defect categories (weld cracks, burn through, slag inclusion, weld splatter, lack of fusion, and surface porosity)—we show that the model can infer subtle structural cues that distinguish one defect type from another. These cues include local texture disruptions, cavity patterns, shape irregularities, and characteristic weld-surface anomalies. Crucially, the model learns entirely from the examples embedded in the prompt: no fine-tuning, no gradient updates, and no specialized vision training occurs.
To contextualize performance, the ICL approach is compared against a traditional machine-learning baseline that must learn directly from pixel-level information. While the baseline relies on supervised training and engineered features, the LLM derives its classification behavior purely from pattern recognition over the images in the prompt. When evaluated on unseen weld defect images, ICL achieves higher overall accuracy than the baseline, demonstrating stronger generalization from just a few examples. These results highlight the efficiency and adaptability of ICL for inspection-style tasks, especially when training data is scarce or rapid deployment is required.
Conclusion
This demonstration shows that LLMs can successfully classify complex weld defects using only in-context visual examples, effectively acting as prompt-driven inspectors capable of recognizing defect signatures from minimal supervision. The ICL paradigm proves especially powerful in this setting: with only nine example images, the model generalizes to new, unseen weld defects more reliably than the supervised baseline, reflecting the model’s flexibility and its ability to internalize visual patterns without any training pipeline.
Compared to traditional approaches—which typically require substantial datasets, model tuning, and iterative optimization—the ICL method provides a fast, low-overhead alternative that can be adapted to new defect categories simply by revising the prompt. Together, the comparison between ICL and the baseline model demonstrates why LLMs are well-suited for rapid, lightweight visual classification tasks such as weld inspection, quality assurance, and defect triage, offering accurate and consistent performance with minimal setup and significantly reduced computational cost.