
2025-04-16
Adversarial examples derived from MNIST. Time series plotting model confidence deterioration over the iterative tuning of the examples. Heatmaps showing how choices of ε and α affect the resulting example's potency. Inspired by an Adversarial AI course at UVM.

Adversarial examples

Imagine an image classifier built from a convolutional neural network. The classifier knows about a handful of labels, or things that an image could possibly be. It accepts an image as input, and outputs a probability for each label that the image is an example of that label. The resulting list of probabilities is called a confidence vector, because it reflects the classifier's confidence about what the image is. The label with the highest probability is understood to be the classifier's best guess.
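
To make that concrete, here is a toy example. The numbers are made up for illustration, not produced by the model in this article; the prediction is simply the index of the largest entry.

import numpy as np

# A made-up confidence vector over ten labels; the entries sum to 1.
confidence = np.array(
    [0.01, 0.02, 0.05, 0.70, 0.03, 0.04, 0.05, 0.05, 0.03, 0.02])

prediction = np.argmax(confidence)  # 3, the classifier's best guess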

An adversarial example is an image perturbed in a careful way to give the classifier high confidence that the image is something it is not. However, from a human's perspective, the adversarial example looks almost exactly the same as the original image. It may appear slightly blurrier or have a small, meaningless-looking artifact, depending on the technique used to create it.

GitHub has no shortage of repositories about creating adversarial examples, and I borrow some preexisting code as a starting point. I look at a technique that minimizes the classifier's confidence in the true label and offer some interesting data visualizations I have not seen elsewhere. I may later describe additional techniques that maximize the classifier's confidence in a particular chosen false label, or that spread the classifier's confidence evenly across all the labels.

To maintain focus, I only include parts of the source code that help explain and clarify the experiments. For complete source code, see the Jupyter notebook on which this article is based.

Sample images and model

The MNIST dataset is a public dataset, formidable to classifiers in its day, but now commonly used for simple examples like mine. It contains many images of handwritten digits. The ubiquitous PyTorch library for Python even has built-in functions for loading it.

I randomly select six images and their corresponding labels from MNIST; however, the selection is not quite random, because I choose the random seed deliberately (see the sketch after the loading code below) so as to achieve a certain diversity of effects that I describe later.

mnist = MNIST(
    root=DATA_PATH,
    train=True,
    download=True,
    transform=ToTensor())

loader = DataLoader(
    dataset=mnist,
    batch_size=N_SAMPLES,
    shuffle=True)

images, labels = next(iter(loader))
labels = [label.item() for label in labels]
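
The deliberately chosen seed lives in the notebook; here is roughly what the seeding looks like. In the notebook it appears before the loading code above, so that PyTorch's global generator, and therefore the shuffle, is pinned down before the loader is iterated. The value shown here is a placeholder, not the one I actually use.

import torch

# SEED is a stand-in; the deliberately chosen value lives in the notebook.
SEED = 0
torch.manual_seed(SEED)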

Surprisingly, and somewhat annoyingly for the purposes of a simple example, the extensive repository of pre-trained PyTorch models at PyTorch Hub does not include an MNIST classifier. I borrow a model from Sarath Chandra Kothapalli and reduce it to a single sequential module, just for clarity of presentation. The model is more complicated than necessary, but standard and effective.

Incidentally, Sarath explores adversarial examples on MNIST too. Their work is worth checking out.

model = Sequential(
    Conv2d(1, 32, kernel_size=3, padding=1),
    ReLU(),
    Conv2d(32, 32, kernel_size=3, padding=1),
    ReLU(),
    MaxPool2d(kernel_size=2),
    Conv2d(32, 64, kernel_size=3, padding=1),
    ReLU(),
    Conv2d(64, 64, kernel_size=3, padding=1),
    ReLU(),
    MaxPool2d(kernel_size=2),
    Flatten(),
    Linear(7 * 7 * 64, 200),
    ReLU(),
    Linear(200, 10),
    Softmax(dim=1))

The experiments revolve around the classifier's confidence, so I define a function to calculate the confidence vector for an image. The function takes an image as input, and returns the classifier's raw output tensor along with the confidence vector as a plain numpy array.

def calculate_confidence(image):
    # Add a batch dimension, run the classifier, and pull the confidence
    # vector out of the output tensor as a plain numpy array.
    unsqueezed_image = image.unsqueeze(0)
    output = model(unsqueezed_image)
    arrays = output.detach().numpy()
    return output, arrays[0]

Just to concretize my progress and establish a baseline for comparison, I exercise the classifier on the six images I selected. I ask the classifier to label each image. I display the image, the true label, and the predicted label all together.

for i in range(N_SAMPLES):
    output, confidence = calculate_confidence(images[i])
    prediction = np.argmax(confidence)
    plt.figure(figsize=THUMBNAIL_SIZE)
    plt.imshow(images[i].squeeze(), cmap='gray')
    plt.axis('off')

    plt.title(
        f'True:{labels[i]} Pred:{prediction}',
        fontsize=8)

    plt.savefig(
        f'{OUT_PATH}/baseline_{i}.png',
        transparent=True,
        bbox_inches='tight',
        pad_inches=0.1)

    plt.show()

The classifier's predicted labels match the true labels in all but one case. MNIST does not contain many images where the classifier predicts incorrectly. I deliberately include such an image to show what happens differently in the visualizations that follow.

Untargeted iterative FGSM

Of the many techniques for creating adversarial examples, I choose the untargeted iterative fast gradient sign method (FGSM). It iteratively perturbs the image in the direction of the sign of the gradient of the loss function with respect to the image, which increases the loss and erodes the classifier's confidence in the true label.

A parameter called α prescribes the distance to move in that direction in a single iteration. Put another way, α controls how much each pixel is perturbed in a single iteration. Another parameter called ε bounds the maximum distance a pixel may be perturbed over all the iterations combined. I have seen different approaches to choosing the number of iterations, but they usually depend on ε and α. For now I choose a simple ratio.
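
For reference, the update each iteration performs can be written as the standard iterative FGSM step (my notation, not copied from the borrowed code):

x ← clip_ε( x + α · sign( ∇ₓ L(x, y) ) )

where L is the cross-entropy loss for the true label y, and clip_ε keeps every pixel of the perturbed image within ε of the original.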

EPSILON = 0.07
ALPHA = 0.006

iterations = int(EPSILON / ALPHA)

To implement FGSM, I borrow and adapt some code from an interactive demonstration by Sarath Chandra Kothapalli. For each of the iterations, the attack gets the confidence vector from the classifier, uses the loss to calculate the perturbation, clamps the accumulated perturbation so the image stays within ε of the original, and then applies it to the image.

def attack(image, label, epsilon, alpha, iterations):
    criterion = CrossEntropyLoss()
    orig = image
    image = image.clone().detach().requires_grad_(True)
    label = torch.tensor([label])
    confidences = []
    for _ in range(iterations):
        output, confidence = calculate_confidence(image)
        loss = criterion(output, label)
        loss.backward()

        # Step each pixel by alpha in the direction of the gradient sign.
        perturbation = alpha * torch.sign(image.grad.data)

        # Clamp the accumulated perturbation so the image never strays
        # more than epsilon from the original.
        perturbation = torch.clamp(
            image.data + perturbation - orig,
            min=-epsilon,
            max=epsilon)

        image.data = orig + perturbation
        image.grad.data.zero_()
        confidences.append(confidence)

    image = image.detach().squeeze()
    return image, confidences, np.argmax(confidence)

I demonstrate the attack on the six images, showing the perturbed images, their true labels, and the classifier's predictions all together. I tuned ε and α quite deliberately to get a small but noticeable effect on the images. Notice the blotchiness in the background, as opposed to the pure black background of the unperturbed images. The classifier now predicts incorrectly most of the time. I notice the classifier seems drawn toward 3 and 6, but I do not know if there is anything to be said about that.

perturbed = []
confidences = []
predictions = []

for i in range(N_SAMPLES):
    output, confidence, prediction = attack(
        images[i], labels[i], EPSILON, ALPHA, iterations)

    perturbed.append(output)
    confidences.append(confidence)
    predictions.append(prediction)

    plt.figure(figsize=THUMBNAIL_SIZE)
    plt.axis('off')
    plt.imshow(perturbed[i], cmap='gray')

    plt.title(
        f'True:{labels[i]} Pred:{predictions[i]}',
        fontsize=8)

    plt.savefig(
        f'{OUT_PATH}/untargeted_{i}.png',
        transparent=True,
        bbox_inches='tight',
        pad_inches=0.1)

    plt.show()

I may return to explore other techniques in the future. A targeted attack that makes the classifier predict a chosen label is possible, as are attacks that try to distribute the classifier's confidence evenly across all the labels.
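
As a taste, here is a minimal sketch of the targeted variant. It is my own sketch, not code from the notebook, and it reuses the calculate_confidence helper and model defined earlier. The only real difference from the attack above is that the loss is computed against a chosen target label and the step goes against the gradient, making that label more likely instead of making the true label less likely.

import torch
from torch.nn import CrossEntropyLoss

def targeted_attack(image, target, epsilon, alpha, iterations):
    # Hypothetical sketch of a targeted iterative FGSM attack.
    criterion = CrossEntropyLoss()
    orig = image
    image = image.clone().detach().requires_grad_(True)
    target = torch.tensor([target])
    for _ in range(iterations):
        output, _ = calculate_confidence(image)  # defined earlier
        loss = criterion(output, target)
        loss.backward()

        # Step against the gradient to make the target label more likely,
        # then clamp the accumulated perturbation to stay within epsilon.
        perturbation = torch.clamp(
            image.data - alpha * torch.sign(image.grad.data) - orig,
            min=-epsilon,
            max=epsilon)

        image.data = orig + perturbation
        image.grad.data.zero_()
    return image.detach().squeeze()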

Visualizing the attack

I devote the rest of the article to making some interesting visualizations I have not seen elsewhere. In the previous section, I captured the confidence vector at each iteration of the attack. Now, I plot the confidence vectors for each label over the iterations.

for i in range(N_SAMPLES):
    plt.figure(figsize=SERIES_SIZE)

    plt.title(
        f'True: {labels[i]}   Pred:{predictions[i]}',
        fontsize=10)

    plt.xlabel('Iteration')
    plt.ylabel('Confidence')
    plt.xticks(list(range(iterations)))

    for j in range(10):
        cfs = [confidences[i][k][j] for k in range(iterations)]
        plt.plot(list(range(len(cfs))), cfs, label=f'Label {j}')

    plt.legend(loc='upper left', bbox_to_anchor=(1, 1.15))

    plt.savefig(
        f'{OUT_PATH}/series_{i}.png',
        bbox_inches='tight',
        pad_inches=0.1,
        transparent=True)
    
    plt.show()

In the first four examples, the classifier's confidence in the true label decreases over the iterations, and is eventually eclipsed by its confidence in some other label. I think it is interesting that the trend in confidence looks somewhat sigmoid-like, instead of linear. I also think it is interesting that the classifier consistently trends towards high confidence in a single incorrect label, instead of towards greater uncertainty.

In the fifth example, the classifier predicts an incorrect label from the beginning, and its confidence in that label only increases across iterations. The sixth example shows an attack that was not strong enough to dissuade the classifier from its correct prediction.

Parameter choices

The last experiment I try is a grid search over ε and α to see what happens to the classifier's confidence in the true label for different values of these parameters.

epsilons = np.arange(0.0001, 0.1501, 0.0025)
alphas = np.arange(0.0001, 0.1501, 0.0025)

output = np.full(
    (N_SAMPLES, len(epsilons), len(alphas)),
    np.nan)

grid = product(
    enumerate(zip(images, labels)),
    enumerate(epsilons),
    enumerate(alphas))

for (k, (image, label)), (i, e), (j, a) in grid:
    # Skip combinations where a single step would exceed the total budget.
    if a > e: continue
    its = int(e / a) + 1
    _, cv, _ = attack(image, label, e, a, its)
    # Record the final confidence in the true label.
    output[k][i][j] = cv[-1][label]

Setting up the visuals is a little tricky. The idea is to have ε and α along the axes, with the heat representing the classifier's confidence in the true label for an adversarial example created with the same technique as before, using those parameters.

INTERVAL = 10

def ticks(lst):
    return np.arange(0, len(lst), INTERVAL)

def tick_labels(lst):
    return [f'{lst[x]:.2f}' for x in range(0, len(lst), INTERVAL)]

for i in range(N_SAMPLES):
    plt.figure(figsize=HEATMAP_SIZE)
    plt.imshow(output[i], cmap='hot', interpolation='nearest')
    plt.colorbar().set_label('Confidence')
    plt.clim(0.0, 1.0)
    plt.xlabel('Alpha')
    plt.ylabel('Epsilon')

    plt.xticks(
        ticks=ticks(alphas),
        labels=tick_labels(alphas),
        rotation=90)

    plt.yticks(
        ticks=ticks(epsilons),
        labels=tick_labels(epsilons))

    inset_ax = inset_axes(
        parent_axes=plt.gca(),
        width='35%',
        height='35%',
        loc='upper right')

    inset_ax.imshow(images[i].squeeze(), cmap='gray')
    inset_ax.axis('off')

    plt.savefig(
        f'{OUT_PATH}/heatmap_{i}.png',
        bbox_inches='tight',
        pad_inches=0.1,
        transparent=True)

    plt.show()

The pure white triangle in the upper right represents the region where ε < α. No adversarial examples can be created under that restriction, because the total perturbation is bounded by an amount less than the perturbation applied at each iteration.

The fifth plot is different from the others, being almost all black. In this example, the classifier never predicted the correct label in the first place.

All the other plots look similar. Much of the space looks like a sequence of triangles with narrower and narrower bases. This surprised me at first, but it makes sense. It boils down to the number of iterations that are allowed, which in these experiments is determined by the ratio ε / α. More iterations generally produce more effective adversarial examples. The jumps in classifier confidence occur where the ratio permits an additional iteration.
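
The staircase is easy to see directly. Here is an illustrative snippet, using the same iteration rule as the grid search: hold ε fixed and watch the permitted iteration count step down as α grows.

import numpy as np

EPSILON = 0.07  # held fixed for illustration

# The iteration count steps down only when EPSILON / alpha crosses an
# integer, which is where the jumps appear in the heatmaps.
for alpha in np.arange(0.005, 0.036, 0.005):
    print(f'alpha={alpha:.3f}  iterations={int(EPSILON / alpha) + 1}')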

Mere curiosities

Some hacks succeed less frequently than in the past because effective defenses against them become common practice. In the 1990s when many web applications stored user-submitted form fields in databases, people invented SQL injection. They entered specially crafted text into form fields that tricked the application into running arbitrary queries and commands against the database. SQL injection depends on easy-to-make programming mistakes, and most development teams today use web frameworks or libraries that help them get it right.

Other hacks become outright obsolete because the technology they depend on changes in a fundamental way. In the 1960s and beyond, people invented phreaking. They exploited the telephone system by playing specially crafted audible tones that tricked the system into making free long distance calls. By the 1990s, though, the telephone system had changed to superior digital signaling. Phreaking is no longer possible, but it remains an interesting historical curiosity.

Classifier training techniques that imperfectly resist adversarial examples already exist, as a paper from the Adversarial AI course discusses. Adversarial example resistant classifiers may not see widespread deployment though. Decision makers may be unaware of the risks. Resistant classifiers may take more time and resources to train. They may perform slightly worse on actual images than their counterparts. Also, adversarial examples may not be of concern in the environment in which a classifier is deployed.

I believe that adversarial examples face more existential problems. Consider that humans in 2025 can classify images far better than any state-of-the-art classifier. Furthermore, humans need far fewer reference examples, and can learn new labels ad hoc. Their superior image classification abilities serve as a witness that better image classification techniques are possible. Likely those more human-like solutions will still be representable as artificial neural networks, but ones with novel architectures, and training processes very different from back-propagation. They will summarily replace the solutions of today. And to the point, adversarial examples will not fool them because adversarial examples do not fool humans.

