A Pure-NumPy CNN, Accelerated with Numba, Reaches 98% Accuracy

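After so many fully connected networks, it is finally time for a CNN.

Hand-writing a complete CNN in pure NumPy, including the convolution layers, the pooling layer, and backpropagation, is a fairly grueling job. Worse, Python's nested loops are painfully slow: one training epoch takes about 40 minutes.

With Numba JIT compilation, training becomes more than an order of magnitude faster: the whole 3-epoch run finishes in about 3 minutes, compared with roughly 40 minutes per epoch before. Test accuracy also rises from 96.24% (fully connected) to 98.06%.

Exhausting as it was, seeing the test accuracy break 98% felt like a real accomplishment.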

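Why We Need a CNN

A fully connected network can reach 96% on MNIST, but it has a fatal flaw: it completely ignores the spatial structure of the image.

Flattening a 28×28 image into a 784-dimensional vector means:

The relationship between neighboring pixels is lost
The model is highly sensitive to transformations such as rotation and translation
The parameter count is large (784×128 = 100,352 weights in the first layer alone)

A CNN addresses these problems with convolution and pooling:

Convolution: extracts local features (edges, textures, shapes)
Pooling: lowers the resolution and improves tolerance to small translations
Weight sharing: the same kernel slides across the whole image, which greatly reduces the parameter count

NOTE
LeNet (1998) is one of the earliest CNNs and already reached over 99% accuracy on MNIST. What we implement today is a simplified LeNet-style network.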

Network Architecture

The CNN we are going to implement:

Input (28×28×1)
    ↓
Conv2D (8 filters, 3×3 kernel)  → (26×26×8)
    ↓
ReLU
    ↓
Conv2D (16 filters, 3×3 kernel) → (24×24×16)
    ↓
ReLU
    ↓
MaxPool (2×2)                    → (12×12×16)
    ↓
Flatten                          → (2304,)
    ↓
Dense (10 neurons)               → (10,)
    ↓
Softmax

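Layer notes:

Conv1: 1 → 8 channels, extracts 8 low-level features
Conv2: 8 → 16 channels, extracts 16 higher-level features
MaxPool: 2×2 downsampling, reduces computation
Dense: fully connected layer, produces the scores for the 10 classes (Softmax turns them into probabilities)

Compared with the fully connected network (784 → 128 → 10), this CNN has more layers but actually fewer parameters.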
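The shapes in the diagram follow from two rules: a stride-1 convolution without padding shrinks each side to `size - k + 1`, and 2×2 pooling halves each side. A minimal sketch that traces them; the helper names `conv_out` and `pool_out` are illustrative and not part of the code in this post:

```python
def conv_out(size, k):
    """Spatial size after a stride-1 convolution with no padding."""
    return size - k + 1

def pool_out(size, window=2):
    """Spatial size after non-overlapping pooling."""
    return size // window

h = w = 28
h, w = conv_out(h, 3), conv_out(w, 3)   # Conv1: 28 -> 26
conv1_shape = (h, w, 8)
h, w = conv_out(h, 3), conv_out(w, 3)   # Conv2: 26 -> 24
conv2_shape = (h, w, 16)
h, w = pool_out(h), pool_out(w)         # MaxPool: 24 -> 12
pool_shape = (h, w, 16)
flat_dim = h * w * 16                   # input width of the Dense layer

print(conv1_shape, conv2_shape, pool_shape, flat_dim)
# (26, 26, 8) (24, 24, 16) (12, 12, 16) 2304
```

The final `flat_dim` is the 2304 that appears as the Dense layer's input size.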

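The Performance Disaster of Python Loops

Before implementing the CNN, let's look at why optimization is needed.

A convolution is essentially a deeply nested loop: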

import numpy as np

def conv2d_naive(X, W, b):
    # X: (B, H, W_in, C_in), W: (out_c, C_in, k, k), b: (out_c, 1)
    B, H, W_in, C_in = X.shape
    out_c, _, k, _ = W.shape
    out_h = H - k + 1
    out_w = W_in - k + 1
    out = np.zeros((B, out_h, out_w, out_c))

    for b_idx in range(B):       # batch (named b_idx so it does not shadow the bias b)
        for oc in range(out_c):  # output channel
            for i in range(out_h):  # height
                for j in range(out_w):  # width
                    for ic in range(C_in):  # input channel
                        for ki in range(k):  # kernel height
                            for kj in range(k):  # kernel width
                                out[b_idx,i,j,oc] += X[b_idx,i+ki,j+kj,ic] * W[oc,ic,ki,kj]
                    out[b_idx,i,j,oc] += b[oc,0]  # add the bias once per output pixel
    return out

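That is 7 levels of nested loops. For one MNIST batch (64 images) passing through the first conv layer:

B = 64
out_c = 8
out_h = 26
out_w = 26
C_in = 1
k = 3

Total iterations: 64 × 8 × 26 × 26 × 1 × 3 × 3 ≈ 3.1 million.

Each epoch has 938 batches, and every batch needs both a forward and a backward pass through two conv layers, so the total iteration count is astronomical.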
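To put a number on "astronomical", the arithmetic above can be repeated for both conv layers. This is a rough sketch: it counts only the innermost multiply-accumulate steps of the forward pass, assumes the standard 60,000 MNIST training images, and treats the last (smaller) batch as full. The function name `inner_iterations` is illustrative.

```python
B = 64
# Loop bounds of the two conv layers in this post
conv1 = dict(out_c=8, out_h=26, out_w=26, c_in=1, k=3)
conv2 = dict(out_c=16, out_h=24, out_w=24, c_in=8, k=3)

def inner_iterations(batch, out_c, out_h, out_w, c_in, k):
    """Number of innermost loop iterations in one forward convolution."""
    return batch * out_c * out_h * out_w * c_in * k * k

n1 = inner_iterations(B, **conv1)
n2 = inner_iterations(B, **conv2)
batches_per_epoch = -(-60000 // B)  # ceil(60000 / 64)

print(n1, n2, batches_per_epoch)
# 3115008 42467328 938
print(f"{(n1 + n2) * batches_per_epoch:.2e} forward iterations per epoch")
```

The second conv layer dominates: it has 8 input channels and 16 output channels, so it needs over ten times as many iterations as the first. The backward pass adds a comparable amount of work again.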

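Measured results:

Pure Python loops: about 40 minutes per epoch
With Numba JIT: under 1 minute per epoch (about 3 minutes for the 3-epoch run logged in the training results)

That is a gap of roughly 40×.

WARNING
Python loops are slow because the code is interpreted: every iteration pays for dynamic type checks, reference counting, and similar bookkeeping. In deeply nested loops this overhead accumulates to an unacceptable level.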

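What Is Numba

Numba is a JIT (Just-In-Time) compiler that turns Python functions into machine code.

The core idea:

On the first call, the Python function is compiled to machine code
Later calls run the machine code directly and bypass the Python interpreter
For numeric code and loops, speedups of tens of times are possible

Usage: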

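from numba import njit

@njit(cache=True, fastmath=True)
def my_function(x):
    # Your numeric code goes here; as a placeholder,
    # this sums the squares of a 1-D float array.
    result = 0.0
    for i in range(x.shape[0]):
        result += x[i] * x[i]
    return result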

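The @njit decorator compiles the function to machine code. cache=True stores the compiled result on disk, so the next run does not need to recompile. fastmath=True permits faster floating-point operations that are not strictly IEEE 754 compliant.

TIP
Numba is a good fit for nested loops, numeric computation, and NumPy array operations. It is a poor fit for code built around general Python objects such as dicts, lists, and strings.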

Implementing the Convolution Layer (with Numba)

Forward Pass

import numpy as np
from numba import njit

@njit(cache=True, fastmath=True)
def conv_forward_jit(X, W, b, k):
    """
    Convolution forward pass (Numba-accelerated).
    Args:
        X: input, shape (B, H, W, C_in)
        W: kernels, shape (out_c, C_in, k, k)
        b: bias, shape (out_c, 1)
        k: kernel size
    Returns:
        out: output, shape (B, out_h, out_w, out_c)
    """
    B, H, W_in, C = X.shape
    out_c = W.shape[0]
    out_h = H - k + 1
    out_w = W_in - k + 1
    out = np.zeros((B, out_h, out_w, out_c), dtype=X.dtype)

    for b_idx in range(B):
        for oc in range(out_c):
            for i in range(out_h):
                for j in range(out_w):
                    # Accumulate in a local scalar, then write once per output pixel
                    acc = 0.0
                    for ic in range(C):
                        for ki in range(k):
                            for kj in range(k):
                                acc += X[b_idx, i + ki, j + kj, ic] * W[oc, ic, ki, kj]
                    out[b_idx, i, j, oc] = acc + b[oc, 0]
    return out

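This code is almost identical to the naive version; the main difference is the @njit decorator.

Yet the performance gap is enormous:

Naive Python: about 40 minutes per epoch
Numba JIT: under 1 minute per epoch (about 3 minutes for 3 epochs)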
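Before trusting a hand-written loop nest, it helps to compare it against an independent implementation on a small input. The sketch below is one way to do that, not something from the original post: it copies the loop nest of `conv_forward_jit` without the decorator (so it runs without Numba) and checks it against a vectorized reference built from NumPy's `sliding_window_view` and `einsum`. The names `conv_forward_loops` and `conv_forward_reference` are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_forward_loops(X, W, b, k):
    """Same loop nest as conv_forward_jit, minus the @njit decorator."""
    B, H, W_in, C = X.shape
    out_c = W.shape[0]
    out_h, out_w = H - k + 1, W_in - k + 1
    out = np.zeros((B, out_h, out_w, out_c), dtype=X.dtype)
    for b_idx in range(B):
        for oc in range(out_c):
            for i in range(out_h):
                for j in range(out_w):
                    acc = 0.0
                    for ic in range(C):
                        for ki in range(k):
                            for kj in range(k):
                                acc += X[b_idx, i + ki, j + kj, ic] * W[oc, ic, ki, kj]
                    out[b_idx, i, j, oc] = acc + b[oc, 0]
    return out

def conv_forward_reference(X, W, b, k):
    """Vectorized reference: gather every k×k window, then contract with einsum."""
    # windows has shape (B, out_h, out_w, C, k, k)
    windows = sliding_window_view(X, (k, k), axis=(1, 2))
    return np.einsum("bijckl,ockl->bijo", windows, W) + b[:, 0]

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 6, 6, 3))
W = rng.standard_normal((4, 3, 3, 3))
b = rng.standard_normal((4, 1))

out_loops = conv_forward_loops(X, W, b, 3)
out_ref = conv_forward_reference(X, W, b, 3)
print(out_loops.shape, np.allclose(out_loops, out_ref))
# (2, 4, 4, 4) True
```

The same comparison can be run against the jitted function itself; small floating-point differences are expected there because fastmath=True allows reordered arithmetic.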

Backward Pass

@njit(cache=True, fastmath=True)
def conv_backward_jit(X, grad, W, k):
    """
    Convolution backward pass (Numba-accelerated).
    Args:
        X: input, shape (B, H, W, C_in)
        grad: gradient w.r.t. the output, shape (B, out_h, out_w, out_c)
        W: kernels, shape (out_c, C_in, k, k)
        k: kernel size
    Returns:
        dW: kernel gradient
        db: bias gradient
        dX: input gradient
    """
    B, H, W_out, out_c = grad.shape  # H and W_out are the output height and width
    _, XH, XW, in_c = X.shape

    dW = np.zeros_like(W)
    db = np.zeros((out_c, 1), dtype=grad.dtype)
    dX = np.zeros_like(X)

    for b_idx in range(B):
        for oc in range(out_c):
            for ic in range(in_c):
                for i in range(H):
                    for j in range(W_out):
                        g = grad[b_idx, i, j, oc]
                        if g == 0.0:
                            continue  # skip positions zeroed out by ReLU or pooling
                        for ki in range(k):
                            for kj in range(k):
                                dW[oc, ic, ki, kj] += g * X[b_idx, i + ki, j + kj, ic]
                                dX[b_idx, i + ki, j + kj, ic] += g * W[oc, ic, ki, kj]
            # Bias gradient: sum of the output gradient over all positions of this channel
            db[oc, 0] += np.sum(grad[b_idx, :, :, oc])
    return dW, db, dX

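The backward loop nest is as deep as the forward one, but each innermost iteration performs two accumulations (into dW and dX), so it does roughly twice the work. JIT compilation matters here at least as much as in the forward pass.

Wrapping It in a Class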
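A finite-difference gradient check is a common way to validate a hand-derived backward pass. The sketch below is an illustration under stated assumptions rather than part of the original code: it copies the accumulation logic of `conv_backward_jit` without the decorator, uses a vectorized forward pass as the reference, and defines a scalar test loss `sum(out * G)` whose gradient with respect to the conv output is exactly `G`. The names `conv_backward_loops`, `loss`, and `numeric_grad` are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_forward(X, W, b, k):
    windows = sliding_window_view(X, (k, k), axis=(1, 2))
    return np.einsum("bijckl,ockl->bijo", windows, W) + b[:, 0]

def conv_backward_loops(X, grad, W, k):
    """Same accumulation as conv_backward_jit, written without @njit."""
    B, H, W_out, out_c = grad.shape
    in_c = X.shape[3]
    dW = np.zeros_like(W)
    db = np.zeros((out_c, 1), dtype=grad.dtype)
    dX = np.zeros_like(X)
    for b_idx in range(B):
        for oc in range(out_c):
            for ic in range(in_c):
                for i in range(H):
                    for j in range(W_out):
                        g = grad[b_idx, i, j, oc]
                        for ki in range(k):
                            for kj in range(k):
                                dW[oc, ic, ki, kj] += g * X[b_idx, i + ki, j + kj, ic]
                                dX[b_idx, i + ki, j + kj, ic] += g * W[oc, ic, ki, kj]
            db[oc, 0] += np.sum(grad[b_idx, :, :, oc])
    return dW, db, dX

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 5, 5, 2))
W = rng.standard_normal((3, 2, 3, 3))
b = rng.standard_normal((3, 1))
G = rng.standard_normal((2, 3, 3, 3))  # upstream gradient dL/d(out)

def loss():
    # Scalar test loss whose gradient w.r.t. the conv output is exactly G
    return np.sum(conv_forward(X, W, b, 3) * G)

def numeric_grad(arr, eps=1e-6):
    """Central finite differences, perturbing one entry of arr at a time."""
    num = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        plus = loss()
        arr[idx] = old - eps
        minus = loss()
        arr[idx] = old  # restore the original value
        num[idx] = (plus - minus) / (2 * eps)
    return num

dW, db, dX = conv_backward_loops(X, G, W, 3)
ok_dW = np.allclose(dW, numeric_grad(W), atol=1e-5)
ok_db = np.allclose(db, numeric_grad(b), atol=1e-5)
ok_dX = np.allclose(dX, numeric_grad(X), atol=1e-5)
print(ok_dW, ok_db, ok_dX)
# True True True
```

The check is slow because it runs the forward pass twice per perturbed entry, so it is meant for tiny inputs only.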

class Conv2D:
    def __init__(self, in_channels, out_channels, kernel_size):
        self.in_c = in_channels
        self.out_c = out_channels
        self.k = kernel_size
        self.W = np.random.randn(out_channels, in_channels, kernel_size, kernel_size) * 0.1
        self.b = np.zeros((out_channels, 1))

    def forward(self, X):
        self.X = X
        self.out = conv_forward_jit(X, self.W, self.b, self.k)
        return self.out

    def backward(self, grad):
        self.dW, self.db, self.dX = conv_backward_jit(self.X, grad, self.W, self.k)
        return self.dX

    def step(self, lr):
        self.W -= lr * self.dW
        self.b -= lr * self.db

Usage example:

conv = Conv2D(in_channels=1, out_channels=8, kernel_size=3)
out = conv.forward(X)  # forward pass; X has shape (B, 28, 28, 1)
conv.backward(grad)    # backward pass; grad has the same shape as out
conv.step(lr=0.01)     # parameter update

Implementing the Pooling Layer (with Numba)

Forward Pass

@njit(cache=True, fastmath=True)
def maxpool_forward_jit(X):
    """
    2×2 max pooling forward pass (Numba-accelerated).
    Assumes H and W are even (24×24 in this network).
    Args:
        X: input, shape (B, H, W, C)
    Returns:
        out: output, shape (B, H//2, W//2, C)
        argmax: position of the maximum inside each window (0..3)
    """
    B, H, W, C = X.shape
    out = np.zeros((B, H // 2, W // 2, C), dtype=X.dtype)
    argmax = np.zeros((B, H // 2, W // 2, C), dtype=np.int64)

    for b_idx in range(B):
        for c in range(C):
            for i in range(0, H, 2):
                for j in range(0, W, 2):
                    # Find the maximum inside the 2×2 window
                    max_val = X[b_idx, i, j, c]
                    max_idx = 0

                    idx = 1
                    v = X[b_idx, i, j + 1, c]
                    if v > max_val:
                        max_val = v
                        max_idx = idx

                    idx += 1
                    v = X[b_idx, i + 1, j, c]
                    if v > max_val:
                        max_val = v
                        max_idx = idx

                    idx += 1
                    v = X[b_idx, i + 1, j + 1, c]
                    if v > max_val:
                        max_val = v
                        max_idx = idx

                    out[b_idx, i // 2, j // 2, c] = max_val
                    argmax[b_idx, i // 2, j // 2, c] = max_idx
    return out, argmax

Pooling has to record the position of each maximum (argmax), because the backward pass needs it.

Backward Pass

@njit(cache=True, fastmath=True)
def maxpool_backward_jit(grad, argmax, H_in, W_in):
    """
    2×2 max pooling backward pass (Numba-accelerated).
    Args:
        grad: gradient w.r.t. the output, shape (B, H//2, W//2, C)
        argmax: positions of the maxima recorded in the forward pass
        H_in, W_in: height and width of the input
    Returns:
        dX: input gradient, shape (B, H_in, W_in, C)
    """
    B, H2, W2, C = grad.shape
    dX = np.zeros((B, H_in, W_in, C), dtype=grad.dtype)

    for b_idx in range(B):
        for c in range(C):
            for i in range(H2):
                for j in range(W2):
                    idx = argmax[b_idx, i, j, c]
                    bi = idx // 2        # row offset inside the window
                    bj = idx - bi * 2    # column offset inside the window
                    dX[b_idx, i * 2 + bi, j * 2 + bj, c] = grad[b_idx, i, j, c]
    return dX

The gradient flows back only to the position of the maximum; every other position gets zero.

Wrapping It in a Class
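A small worked example makes the argmax encoding concrete. The sketch below is an illustration, not the post's Numba code: it pools one hand-written 4×4 image with a NumPy reshape instead of explicit loops, using the same window index convention (`idx = 2 * row + col`), and then routes a made-up upstream gradient back to the winning pixels.

```python
import numpy as np

# One 4×4 single-channel image, shape (B=1, H=4, W=4, C=1)
X = np.array([[1, 5, 2, 0],
              [3, 4, 8, 1],
              [0, 2, 6, 7],
              [9, 1, 3, 2]], dtype=float).reshape(1, 4, 4, 1)

B, H, W, C = X.shape
# Group pixels into 2×2 windows: axes become (B, H/2, 2, W/2, 2, C)
windows = X.reshape(B, H // 2, 2, W // 2, 2, C)
# Bring the two in-window axes to the end and flatten them to length 4,
# so the flat index is 2*row + col, the same 0..3 code used by argmax above.
flat = windows.transpose(0, 1, 3, 5, 2, 4).reshape(B, H // 2, W // 2, C, 4)

out = flat.max(axis=-1)
argmax = flat.argmax(axis=-1)

# Backward: each upstream gradient lands on the winning pixel of its window
grad = np.array([[10.0, 20.0], [30.0, 40.0]]).reshape(1, 2, 2, 1)
dX = np.zeros_like(X)
for i in range(H // 2):
    for j in range(W // 2):
        idx = argmax[0, i, j, 0]
        dX[0, 2 * i + idx // 2, 2 * j + idx % 2, 0] = grad[0, i, j, 0]

print(out[0, :, :, 0])     # maxima of the four windows: 5, 8, 9, 7
print(argmax[0, :, :, 0])  # their in-window positions: 1, 2, 2, 1
print(dX[0, :, :, 0])      # 10, 20, 30, 40 placed at the winners, zeros elsewhere
```

For instance, the top-right window is [[2, 0], [8, 1]]: its maximum 8 sits at row 1, column 0 of the window, so argmax is 2 and the gradient 20 lands at pixel (1, 2) of the input.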

class MaxPool2x2:
    def forward(self, X):
        self.X = X
        out, self.argmax = maxpool_forward_jit(X)
        return out

    def backward(self, grad):
        B, H, W, C = self.X.shape
        return maxpool_backward_jit(grad, self.argmax, H, W)

The Complete CNN Model

Putting all the layers together:

# Dense, TinyAdam, relu, relu_grad, softmax, flatten and
# softmax_cross_entropy_grad are defined in the full listing at the end.
class MinimalCNNClassifier:
    def __init__(self, lr=0.01, use_adam=False):
        self.lr = lr
        self.use_adam = use_adam
        self.adam = TinyAdam(lr=lr) if use_adam else None

        # Layers
        self.conv1 = Conv2D(1, 8, 3)      # 1 → 8 channels
        self.conv2 = Conv2D(8, 16, 3)     # 8 → 16 channels
        self.pool = MaxPool2x2()          # 2×2 pooling
        self.fc = Dense(12 * 12 * 16, 10) # fully connected layer

    def forward(self, X):
        # Conv1 + ReLU
        out = self.conv1.forward(X)
        out = relu(out)
        self.after_relu1 = out

        # Conv2 + ReLU
        out = self.conv2.forward(out)
        out = relu(out)
        self.after_relu2 = out

        # MaxPool
        out = self.pool.forward(out)

        # Flatten + Dense + Softmax
        out = flatten(out)
        out = self.fc.forward(out)
        self.y_pred = softmax(out)
        return self.y_pred

    def backward(self, y_true):
        # Gradient at the output layer
        grad = softmax_cross_entropy_grad(self.y_pred, y_true)

        # Dense backward
        grad = self.fc.backward(grad)

        # Unflatten
        grad = grad.reshape(-1, 12, 12, 16)

        # MaxPool backward
        grad = self.pool.backward(grad)

        # Conv2 backward (apply the ReLU mask first)
        grad = grad * relu_grad(self.after_relu2)
        grad = self.conv2.backward(grad)

        # Conv1 backward (apply the ReLU mask first)
        grad = grad * relu_grad(self.after_relu1)
        self.conv1.backward(grad)

    def step(self):
        if self.adam:
            self.adam.step([
                (self.fc.W, self.fc.dW),
                (self.fc.b, self.fc.db),
                (self.conv2.W, self.conv2.dW),
                (self.conv2.b, self.conv2.db),
                (self.conv1.W, self.conv1.dW),
                (self.conv1.b, self.conv1.db),
            ])
        else:
            self.fc.step(self.lr)
            self.conv2.step(self.lr)
            self.conv1.step(self.lr)

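IMPORTANT
The backward pass runs in the reverse order of the forward pass. The ReLU gradient must be applied (as an element-wise multiplication) before each convolution's backward pass.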
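The backward pass starts from `softmax_cross_entropy_grad`, which in the full listing returns `(y_pred - y_true) / batch_size`: the gradient of the mean cross-entropy with respect to the logits, with softmax and the loss folded into one step. A minimal sketch that confirms this formula numerically with central finite differences; the random logits and labels are made up for the check, and `mean_cross_entropy` is an illustrative helper:

```python
import numpy as np

def softmax(x):
    x_max = np.max(x, axis=1, keepdims=True)
    ex = np.exp(x - x_max)
    return ex / np.sum(ex, axis=1, keepdims=True)

def mean_cross_entropy(logits, y_true):
    """Mean over the batch of -sum(y * log(softmax(logits)))."""
    return -np.mean(np.sum(y_true * np.log(softmax(logits)), axis=1))

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 10))
y_true = np.eye(10)[rng.integers(0, 10, size=4)]  # one-hot labels

# Closed form used by the model
analytic = (softmax(logits) - y_true) / logits.shape[0]

# Numerical gradient, one logit at a time
eps = 1e-6
numeric = np.zeros_like(logits)
for idx in np.ndindex(logits.shape):
    bumped = logits.copy()
    bumped[idx] += eps
    plus = mean_cross_entropy(bumped, y_true)
    bumped[idx] -= 2 * eps
    minus = mean_cross_entropy(bumped, y_true)
    numeric[idx] = (plus - minus) / (2 * eps)

grads_match = np.allclose(analytic, numeric, atol=1e-6)
print(grads_match)
# True
```

Because the division by the batch size already happens here, the layer gradients that follow are gradients of the mean loss and need no further normalization.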

Training and Testing

import numpy as np
import tqdm

np.random.seed(0)

# Load the data (load_mnist_from_local is defined in the full listing at the end)
data_dir = "./mnist"
X_train, y_train, X_test, y_test = load_mnist_from_local(data_dir)

# Create the model
model = MinimalCNNClassifier(lr=0.001, use_adam=True)

# Train
def train(model, X_train, y_train, epochs, batch_size):
    loss_list = []
    for epoch in range(epochs):
        pbar = tqdm.tqdm(
            range(0, len(X_train), batch_size),
            desc=f"Epoch {epoch+1}/{epochs}"
        )
        for i in pbar:
            X_batch = X_train[i:i+batch_size]
            y_batch = y_train[i:i+batch_size]

            y_pred = model.forward(X_batch)
            loss = cross_entropy(y_pred, y_batch)
            loss_list.append(loss)

            model.backward(y_batch)
            model.step()

            pbar.set_postfix({"loss": f"{loss:.4f}"})
    return loss_list

loss_list = train(model, X_train, y_train, epochs=3, batch_size=64)

# Test
def test(model, X_test, y_test):
    y_pred_test = model.forward(X_test)
    pred = np.argmax(y_pred_test, axis=1)
    true = np.argmax(y_test, axis=1)
    acc = (pred == true).mean()
    print(f"\nTest accuracy: {acc:.4f}")
    return acc

acc = test(model, X_test, y_test)

Training Results

Actual output:

Epoch 1/3: 100%| 938/938 [00:54<00:00, 17.24it/s, loss=0.0286]
Epoch 2/3: 100%| 938/938 [00:54<00:00, 17.19it/s, loss=0.0267]
Epoch 3/3: 100%| 938/938 [00:49<00:00, 18.90it/s, loss=0.0153]

Test accuracy: 0.9806

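Final results:

Training time: about 3 minutes (3 epochs)
Test accuracy: 98.06%
Final loss: 0.0153 (last batch)

Compared with the earlier results:

Model	Accuracy	Training time	Parameters
Fully connected (SGD)	90.79%	~8 s (5 epochs)	101K
Fully connected (Adam)	96.24%	~10 s (5 epochs)	101K
CNN (Adam + Numba)	98.06%	~3 min (3 epochs)	24K
CNN with a 44-unit hidden dense layer (Adam + Numba)	98.30%	~3 min (3 epochs)	103K

The CNN takes longer to train (convolutions are more expensive), but the accuracy gain is clear, and the base model uses fewer parameters (24K vs 101K).

NOTE
Without Numba, training would take 40 minutes × 3 = 120 minutes (2 hours). Numba is what makes training practical.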

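Numba Tips

1. When to Use Numba

Good fit:

Deeply nested loops
Numeric computation (arithmetic, math functions)
NumPy array operations

Poor fit:

General Python objects such as dicts and lists
String manipulation
File I/O
Object-oriented code (class methods)

2. Common Options

from numba import njit

@njit(cache=True, fastmath=True, parallel=False)
def my_function(x):
    pass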

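cache=True: caches the compiled result on disk, so later runs skip compilation
fastmath=True: permits faster floating-point math that is not strictly IEEE 754 compliant
parallel=True: enables automatic parallelization; loops have to be written with numba.prange, and their iterations must be independent of each other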
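The post's own kernels leave `parallel` off, so the following is only a sketch of how a parallel loop looks, under the assumption that the loop is written with `numba.prange`. The function `per_image_sum_of_squares` is a made-up stand-in for a per-image computation, and the `try/except` fallback exists only so that the sketch also runs (slowly, as plain Python) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback so the sketch still runs without Numba: plain Python, no speedup
    prange = range
    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func

@njit(parallel=True)
def per_image_sum_of_squares(X):
    B = X.shape[0]
    out = np.zeros(B, dtype=X.dtype)
    # prange marks the batch loop as parallel; iteration b writes only out[b],
    # so the iterations do not depend on each other.
    for b in prange(B):
        acc = 0.0
        for i in range(X.shape[1]):
            for j in range(X.shape[2]):
                acc += X[b, i, j] * X[b, i, j]
        out[b] = acc
    return out

X = np.random.default_rng(0).standard_normal((8, 28, 28))
result = per_image_sum_of_squares(X)
matches = np.allclose(result, (X ** 2).sum(axis=(1, 2)))
print(matches)
```

By the same reasoning, the batch loop of `conv_forward_jit` looks like a candidate, since each `b_idx` writes its own slice of `out`. The backward pass is not a drop-in candidate: `dW` and `db` are accumulated across the whole batch, so parallel iterations would write to the same elements.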

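3. Debugging

JIT-compiled functions are hard to debug. If something goes wrong, temporarily remove the @njit decorator (or set the environment variable NUMBA_DISABLE_JIT=1) and run the function as plain Python:

# Remove the decorator while debugging
# @njit(cache=True, fastmath=True)
def conv_forward_jit(X, W, b, k):
    ...  # function body unchanged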

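Once the logic is confirmed correct, put the decorator back.

4. Type Inference

Numba has to infer a single type for every variable, so keep each variable's type consistent:

# Good
acc = 0.0  # float from the start
acc += X[i] * W[j]  # float + float

# Avoid
acc = 0  # starts as an int
acc += X[i] * W[j]  # int, then float: Numba must unify the two types, and types it cannot unify are a compile error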

Saving and Loading the Model

Add save and load methods to the model:

import os

class MinimalCNNClassifier:
    # ... earlier code ...

    def save(self, path):
        # np.savez does not create missing directories
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        np.savez(
            path,
            conv1_W=self.conv1.W,
            conv1_b=self.conv1.b,
            conv2_W=self.conv2.W,
            conv2_b=self.conv2.b,
            fc_W=self.fc.W,
            fc_b=self.fc.b,
        )
        print(f"Model saved to {path}")

    def load(self, path):
        data = np.load(path)
        self.conv1.W = data["conv1_W"]
        self.conv1.b = data["conv1_b"]
        self.conv2.W = data["conv2_W"]
        self.conv2.b = data["conv2_b"]
        self.fc.W = data["fc_W"]
        self.fc.b = data["fc_b"]
        print(f"Model loaded from {path}")

Usage example:

# Save
model.save("models/cnn_model.npz")

# Load
model2 = MinimalCNNClassifier(lr=0.001, use_adam=True)
model2.load("models/cnn_model.npz")

# Test
acc = test(model2, X_test, y_test)  # 0.9806

Output (as printed by the full script at the end of this post):

Model saved to models/cnn_model.npz!
Reloading model...
Model loaded from models/cnn_model.npz
Reload complete.
Reloaded model accuracy: 0.9806
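The save/load mechanism can be tried in isolation, without training a model first. The sketch below is an illustration with made-up arrays standing in for the weights, and a temporary directory instead of the real `models/` folder: it writes a few named arrays with `np.savez` and reads them back.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for trained weights, with the shapes used in this post
params = {
    "conv1_W": rng.standard_normal((8, 1, 3, 3)),
    "conv1_b": np.zeros((8, 1)),
    "fc_W": rng.standard_normal((2304, 10)),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "models", "cnn_model.npz")
    os.makedirs(os.path.dirname(path), exist_ok=True)  # np.savez does not create directories
    np.savez(path, **params)

    # np.load returns a lazy NpzFile; the with-block closes the file afterwards
    with np.load(path) as data:
        restored = {name: data[name] for name in data.files}

same = all(np.array_equal(params[k], restored[k]) for k in params)
print(sorted(restored), same)
# ['conv1_W', 'conv1_b', 'fc_W'] True
```

Arrays come back with the same dtype, shape, and values, which is why the reloaded model reproduces the same accuracy.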

Single-Sample Inference

Add single-sample inference to the model:

class MinimalCNNClassifier:
    # ... earlier code ...

    def predict_batch(self, X):
        """Predict classes for a batch."""
        y_pred = self.forward(X)
        return np.argmax(y_pred, axis=1)

    def __call__(self, x):
        """
        Single-sample inference.
        Accepts inputs shaped (28,28), (28,28,1) or (1,28,28,1).
        """
        if x.ndim == 2:  # (H, W)
            x = x[None, :, :, None]
        elif x.ndim == 3:  # (H, W, C)
            x = x[None, :, :, :]
        return self.predict_batch(x)[0]

Usage example:

# Single-sample inference
sample = X_test[0].squeeze()  # (28, 28)
pred = model(sample)
true = np.argmax(y_test[0])
print(f"Single sample predicted class: {pred}, true: {true}")

Output:

Single sample predicted class: 7, true: 7
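The shape handling in `__call__` is easy to check on its own. In this sketch the helper `to_batch` is a hypothetical extraction of that logic (it is not a function in the post's code), applied to zero arrays of the three accepted shapes:

```python
import numpy as np

def to_batch(x):
    """Bring a single image to (1, H, W, C); 4-D input passes through unchanged."""
    if x.ndim == 2:       # (H, W): add batch and channel axes
        x = x[None, :, :, None]
    elif x.ndim == 3:     # (H, W, C): add the batch axis
        x = x[None, :, :, :]
    return x

shapes = [to_batch(np.zeros(s)).shape
          for s in [(28, 28), (28, 28, 1), (1, 28, 28, 1)]]
print(shapes)
# [(1, 28, 28, 1), (1, 28, 28, 1), (1, 28, 28, 1)]
```

Indexing with `None` adds an axis of length 1 without copying the data, so the wrapper costs essentially nothing.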

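Why the CNN Beats the Fully Connected Network

Let's compare the numbers:

Parameter Count

Fully connected network (weights only, biases omitted):

Layer 1: 784 × 128 = 100,352
Layer 2: 128 × 10 = 1,280
Total: 101,632 parameters

CNN (weights only, biases omitted):

Conv1: 8 × 1 × 3 × 3 = 72
Conv2: 16 × 8 × 3 × 3 = 1,152
Dense: 2304 × 10 = 23,040
Total: 24,264 parameters

The CNN has only about 24% as many parameters as the fully connected network.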
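The totals above count weights only. A short sketch that recomputes them and also shows the totals with biases included; the helper names `conv_params` and `dense_params` are illustrative:

```python
def conv_params(in_c, out_c, k, bias=True):
    """Weights (plus optional biases) of a k×k convolution layer."""
    return out_c * in_c * k * k + (out_c if bias else 0)

def dense_params(in_dim, out_dim, bias=True):
    """Weights (plus optional biases) of a fully connected layer."""
    return in_dim * out_dim + (out_dim if bias else 0)

counts = {}
for bias in (False, True):
    fc_net = dense_params(784, 128, bias) + dense_params(128, 10, bias)
    cnn = (conv_params(1, 8, 3, bias)
           + conv_params(8, 16, 3, bias)
           + dense_params(12 * 12 * 16, 10, bias))
    counts[bias] = (fc_net, cnn)
    print(f"bias={bias}: fully connected={fc_net}, CNN={cnn}, ratio={cnn / fc_net:.1%}")
# bias=False: fully connected=101632, CNN=24264, ratio=23.9%
# bias=True: fully connected=101770, CNN=24298, ratio=23.9%
```

Either way the ratio is about 24%, and almost all of the CNN's parameters sit in the final Dense layer rather than in the convolutions.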

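Accuracy

Fully connected (SGD): 90.79%
Fully connected (Adam): 96.24%
CNN (Adam): 98.06%

The CNN improves on the fully connected network (Adam) by 1.82 percentage points.

Why the CNN Is Better

Uses spatial structure: kernels extract local features and preserve the spatial relationship between pixels.
Translation tolerance: the same feature is detected wherever it appears in the image.
Parameter sharing: one kernel slides over the whole image, which greatly reduces the parameter count.
Hierarchical features: shallow layers pick up edges, deeper layers pick up shapes and textures.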
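The translation property can be seen directly on a toy input. The sketch below is an illustration with a random kernel and a random 3×3 pattern, not the trained model: it applies the same "valid" convolution to an image and to a shifted copy, and checks that the response map simply moves with the pattern. The helper `conv2d_valid` is illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(img, kernel):
    """Single-channel convolution without padding, as used throughout this post."""
    windows = sliding_window_view(img, kernel.shape)  # (out_h, out_w, k, k)
    return np.einsum("ijkl,kl->ij", windows, kernel)

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))

img = np.zeros((10, 10))
img[2:5, 2:5] = rng.standard_normal((3, 3))          # a small pattern
shifted = np.roll(img, shift=(3, 2), axis=(0, 1))    # same pattern, 3 rows down, 2 columns right

resp = conv2d_valid(img, kernel)
resp_shifted = conv2d_valid(shifted, kernel)

# Same response values, displaced by the same (3, 2) offset
moves_with_pattern = np.allclose(resp_shifted[3:, 2:], resp[:-3, :-2])
print(moves_with_pattern)
# True
```

Strictly speaking, this shows that convolution is translation-equivariant: the output shifts along with the input. Pooling on top of it is what adds tolerance to small shifts. A fully connected layer has neither property, because every pixel position has its own weights.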

Visualizing the Kernels

We can look at what the first conv layer has learned:

import matplotlib.pyplot as plt

# First-layer kernels (8, 1, 3, 3)
kernels = model.conv1.W.squeeze()  # (8, 3, 3)

plt.figure(figsize=(12, 2))
for i in range(8):
    plt.subplot(1, 8, i+1)
    plt.imshow(kernels[i], cmap='gray')
    plt.title(f"Filter {i+1}")
    plt.axis('off')
plt.tight_layout()
plt.show()

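First-layer kernels typically learn edge detectors: horizontal edges, vertical edges, diagonal edges, and so on.

(I forgot to include the image here, and have since lost it.)

Previous post: ML6 - The Adam Optimizer: The Secret to Training Three Times Faster

Full Code

The complete CNN implementation (with Numba acceleration). Note that this listing adds a 44-unit hidden dense layer (hidden_dim=44, about 103K parameters in total) to the single-dense-layer model walked through above: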

import numpy as np
import tqdm
import matplotlib.pyplot as plt
import os
import gzip
from numba import njit

# ============================================================
#         Load MNIST from Kaggle idx files (local)
# ============================================================
def _open_maybe_gz(path):
    if os.path.exists(path):
        return open(path, "rb")
    if os.path.exists(path + ".gz"):
        return gzip.open(path + ".gz", "rb")
    raise FileNotFoundError(f"Cannot find {path} or {path+'.gz'}")

def load_mnist_from_local(data_dir):
    train_images_path = os.path.join(data_dir, "train-images-idx3-ubyte")
    train_labels_path = os.path.join(data_dir, "train-labels-idx1-ubyte")
    test_images_path  = os.path.join(data_dir, "t10k-images-idx3-ubyte")
    test_labels_path  = os.path.join(data_dir, "t10k-labels-idx1-ubyte")

    # images: 16-byte header, then uint8 pixels
    with _open_maybe_gz(train_images_path) as f:
        data = np.frombuffer(f.read(), dtype=np.uint8, offset=16)
    X_train = data.reshape(-1, 28, 28, 1) / 255.0

    with _open_maybe_gz(test_images_path) as f:
        data = np.frombuffer(f.read(), dtype=np.uint8, offset=16)
    X_test = data.reshape(-1, 28, 28, 1) / 255.0

    # labels: 8-byte header, then uint8 labels
    with _open_maybe_gz(train_labels_path) as f:
        labels_train = np.frombuffer(f.read(), dtype=np.uint8, offset=8)
    with _open_maybe_gz(test_labels_path) as f:
        labels_test = np.frombuffer(f.read(), dtype=np.uint8, offset=8)

    y_train = np.zeros((labels_train.size, 10))
    y_train[np.arange(labels_train.size), labels_train] = 1

    y_test = np.zeros((labels_test.size, 10))
    y_test[np.arange(labels_test.size), labels_test] = 1

    return X_train, y_train, X_test, y_test

# ============================================================
#               Basic ops
# ============================================================
def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)

def softmax(x):
    x_max = np.max(x, axis=1, keepdims=True)
    ex = np.exp(x - x_max)
    return ex / np.sum(ex, axis=1, keepdims=True)

def cross_entropy(y_pred, y_true):
    eps = 1e-15
    y_pred = np.clip(y_pred, eps, 1 - eps)
    ce = -np.sum(y_true * np.log(y_pred), axis=1)
    return np.mean(ce)

def softmax_cross_entropy_grad(y_pred, y_true):
    return (y_pred - y_true) / y_true.shape[0]

def flatten(x):
    return x.reshape(x.shape[0], -1)

# ============================================================
#                Numba-accelerated convolution
# ============================================================
@njit(cache=True, fastmath=True)
def conv_forward_jit(X, W, b, k):
    B, H, W_in, C = X.shape
    out_c = W.shape[0]
    out_h = H - k + 1
    out_w = W_in - k + 1
    out = np.zeros((B, out_h, out_w, out_c), dtype=X.dtype)

    for b_idx in range(B):
        for oc in range(out_c):
            for i in range(out_h):
                for j in range(out_w):
                    acc = 0.0
                    for ic in range(C):
                        for ki in range(k):
                            for kj in range(k):
                                acc += X[b_idx, i + ki, j + kj, ic] * W[oc, ic, ki, kj]
                    out[b_idx, i, j, oc] = acc + b[oc, 0]
    return out

@njit(cache=True, fastmath=True)
def conv_backward_jit(X, grad, W, k):
    B, H, W_out, out_c = grad.shape
    _, XH, XW, in_c = X.shape

    dW = np.zeros_like(W)
    db = np.zeros((out_c, 1), dtype=grad.dtype)
    dX = np.zeros_like(X)

    for b_idx in range(B):
        for oc in range(out_c):
            for ic in range(in_c):
                for i in range(H):
                    for j in range(W_out):
                        g = grad[b_idx, i, j, oc]
                        if g == 0.0:
                            continue
                        for ki in range(k):
                            for kj in range(k):
                                dW[oc, ic, ki, kj] += g * X[b_idx, i + ki, j + kj, ic]
                                dX[b_idx, i + ki, j + kj, ic] += g * W[oc, ic, ki, kj]
            db[oc, 0] += np.sum(grad[b_idx, :, :, oc])
    return dW, db, dX

# ============================================================
#                Numba-accelerated max pooling
# ============================================================
@njit(cache=True, fastmath=True)
def maxpool_forward_jit(X):
    B, H, W, C = X.shape
    out = np.zeros((B, H // 2, W // 2, C), dtype=X.dtype)
    argmax = np.zeros((B, H // 2, W // 2, C), dtype=np.int64)

    for b_idx in range(B):
        for c in range(C):
            for i in range(0, H, 2):
                for j in range(0, W, 2):
                    max_val = X[b_idx, i, j, c]
                    max_idx = 0
                    idx = 1
                    v = X[b_idx, i, j + 1, c]
                    if v > max_val:
                        max_val = v
                        max_idx = idx
                    idx += 1
                    v = X[b_idx, i + 1, j, c]
                    if v > max_val:
                        max_val = v
                        max_idx = idx
                    idx += 1
                    v = X[b_idx, i + 1, j + 1, c]
                    if v > max_val:
                        max_val = v
                        max_idx = idx
                    out[b_idx, i // 2, j // 2, c] = max_val
                    argmax[b_idx, i // 2, j // 2, c] = max_idx
    return out, argmax

@njit(cache=True, fastmath=True)
def maxpool_backward_jit(grad, argmax, H_in, W_in):
    B, H2, W2, C = grad.shape
    dX = np.zeros((B, H_in, W_in, C), dtype=grad.dtype)

    for b_idx in range(B):
        for c in range(C):
            for i in range(H2):
                for j in range(W2):
                    idx = argmax[b_idx, i, j, c]
                    bi = idx // 2
                    bj = idx - bi * 2
                    dX[b_idx, i * 2 + bi, j * 2 + bj, c] = grad[b_idx, i, j, c]
    return dX

# ============================================================
#                     Convolution Layer
# ============================================================
class Conv2D:
    def __init__(self, in_channels, out_channels, kernel_size):
        self.in_c = in_channels
        self.out_c = out_channels
        self.k = kernel_size
        self.W = np.random.randn(out_channels, in_channels, kernel_size, kernel_size) * 0.1
        self.b = np.zeros((out_channels, 1))

    def forward(self, X):
        self.X = X
        self.out = conv_forward_jit(X, self.W, self.b, self.k)
        return self.out

    def backward(self, grad):
        self.dW, self.db, self.dX = conv_backward_jit(self.X, grad, self.W, self.k)
        return self.dX

    def step(self, lr):
        self.W -= lr * self.dW
        self.b -= lr * self.db

# ============================================================
#                     Max Pooling Layer
# ============================================================
class MaxPool2x2:
    def forward(self, X):
        self.X = X
        out, self.argmax = maxpool_forward_jit(X)
        return out

    def backward(self, grad):
        B, H, W, C = self.X.shape
        return maxpool_backward_jit(grad, self.argmax, H, W)

# ============================================================
#                     Dense Layer
# ============================================================
class Dense:
    def __init__(self, in_dim, out_dim):
        self.W = np.random.randn(in_dim, out_dim) * 0.01
        self.b = np.zeros((1, out_dim))

    def forward(self, X):
        self.X = X
        self.out = X @ self.W + self.b
        return self.out

    def backward(self, grad):
        self.dW = self.X.T @ grad
        self.db = np.sum(grad, axis=0, keepdims=True)
        return grad @ self.W.T

    def step(self, lr):
        self.W -= lr * self.dW
        self.b -= lr * self.db

# ============================================================
#                 CNN Classifier
# ============================================================
class MinimalCNNClassifier:
    def __init__(self, lr=0.01, use_adam=False, hidden_dim=44):
        self.lr = lr
        self.use_adam = use_adam
        self.adam = TinyAdam(lr=lr) if use_adam else None
        self.conv1 = Conv2D(1, 8, 3)
        self.conv2 = Conv2D(8, 16, 3)
        self.pool = MaxPool2x2()
        self.hidden_dim = hidden_dim  # choose 44 -> ~103K params total
        self.fc1 = Dense(12 * 12 * 16, hidden_dim)
        self.fc2 = Dense(hidden_dim, 10)

    def forward(self, X):
        out = self.conv1.forward(X)
        out = relu(out)
        self.after_relu1 = out

        out = self.conv2.forward(out)
        out = relu(out)
        self.after_relu2 = out

        out = self.pool.forward(out)
        out = flatten(out)

        out = self.fc1.forward(out)
        self.fc1_pre_relu = out
        out = relu(out)
        self.fc1_post_relu = out

        out = self.fc2.forward(out)
        self.y_pred = softmax(out)
        return self.y_pred

    def backward(self, y_true):
        grad = softmax_cross_entropy_grad(self.y_pred, y_true)

        grad = self.fc2.backward(grad)

        grad = grad * relu_grad(self.fc1_pre_relu)
        grad = self.fc1.backward(grad)

        grad = grad.reshape(-1, 12, 12, 16)
        grad = self.pool.backward(grad)

        grad = grad * relu_grad(self.after_relu2)
        grad = self.conv2.backward(grad)

        grad = grad * relu_grad(self.after_relu1)
        self.conv1.backward(grad)

    def step(self):
        if self.adam:
            self.adam.step([
                (self.fc2.W, self.fc2.dW),
                (self.fc2.b, self.fc2.db),
                (self.fc1.W, self.fc1.dW),
                (self.fc1.b, self.fc1.db),
                (self.conv2.W, self.conv2.dW),
                (self.conv2.b, self.conv2.db),
                (self.conv1.W, self.conv1.dW),
                (self.conv1.b, self.conv1.db),
            ])
        else:
            self.fc2.step(self.lr)
            self.fc1.step(self.lr)
            self.conv2.step(self.lr)
            self.conv1.step(self.lr)

    def predict_batch(self, X):
        """Return predicted classes for a batch."""
        y_pred = self.forward(X)
        return np.argmax(y_pred, axis=1)

    def __call__(self, x):
        """
        Single-sample predict wrapper.
        Accepts (28,28), (28,28,1) or (1,28,28,1) style inputs.
        """
        if x.ndim == 2:  # H, W
            x = x[None, :, :, None]
        elif x.ndim == 3:  # H, W, C
            x = x[None, :, :, :]
        return self.predict_batch(x)[0]

    def save(self, path):
        # "or '.'" covers a bare filename, where dirname() returns ""
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        np.savez(
            path,
            conv1_W=self.conv1.W,
            conv1_b=self.conv1.b,
            conv2_W=self.conv2.W,
            conv2_b=self.conv2.b,
            fc1_W=self.fc1.W,
            fc1_b=self.fc1.b,
            fc2_W=self.fc2.W,
            fc2_b=self.fc2.b,
        )

    def load(self, path):
        data = np.load(path)
        self.conv1.W = data["conv1_W"]
        self.conv1.b = data["conv1_b"]
        self.conv2.W = data["conv2_W"]
        self.conv2.b = data["conv2_b"]
        self.fc1.W = data["fc1_W"]
        self.fc1.b = data["fc1_b"]
        self.fc2.W = data["fc2_W"]
        self.fc2.b = data["fc2_b"]

# ============================================================
#                     Tiny Adam Optimizer
# ============================================================
class TinyAdam:
    """
    A very compact Adam: takes a list of (param, grad) pairs and updates the parameters in place.
    """
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.t = 0
        self.m = {}
        self.v = {}

    def step(self, params_and_grads):
        self.t += 1
        for param, grad in params_and_grads:
            # State is keyed by the array object, so params must be updated in place
            key = id(param)
            m = self.m.get(key, np.zeros_like(param))
            v = self.v.get(key, np.zeros_like(param))

            m = self.beta1 * m + (1 - self.beta1) * grad
            v = self.beta2 * v + (1 - self.beta2) * (grad ** 2)

            m_hat = m / (1 - self.beta1 ** self.t)
            v_hat = v / (1 - self.beta2 ** self.t)

            param -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

            self.m[key] = m
            self.v[key] = v

# ============================================================
#                         Train Helpers
# ============================================================
def train(model, X_train, y_train, epochs, batch_size):
    loss_list = []
    for epoch in range(epochs):
        pbar = tqdm.tqdm(
            range(0, len(X_train), batch_size),
            desc=f"Epoch {epoch+1}/{epochs}"
        )

        for i in pbar:
            X_batch = X_train[i:i+batch_size]
            y_batch = y_train[i:i+batch_size]

            y_pred = model.forward(X_batch)
            loss = cross_entropy(y_pred, y_batch)
            loss_list.append(loss)

            model.backward(y_batch)
            model.step()

            pbar.set_postfix({"loss": f"{loss:.4f}"})
    return loss_list

def test(model, X_test, y_test):
    y_pred_test = model.forward(X_test)
    pred = np.argmax(y_pred_test, axis=1)
    true = np.argmax(y_test, axis=1)
    acc = (pred == true).mean()
    print(f"\nTest accuracy: {acc:.4f}")
    return acc

def plot_loss(loss_list):
    plt.figure(figsize=(6,4))
    plt.plot(loss_list)
    plt.xlabel("Iteration")
    plt.ylabel("Loss (CE)")
    plt.yscale("log")
    plt.title("Minimal CNN Training Loss")
    plt.grid(True)
    plt.tight_layout()
    plt.show()
# ============================================================
#                         Train + Test
# ============================================================
if __name__ == "__main__":
    np.random.seed(0)

    # change to your dataset directory
    data_dir = "./mnist"

    # 1. Load MNIST (from your Kaggle files)
    X_train, y_train, X_test, y_test = load_mnist_from_local(data_dir)

    # 2. Create CNN model
    model = MinimalCNNClassifier(lr=0.001, use_adam=True)

    # 3. Train
    loss_list = train(model, X_train, y_train, epochs=3, batch_size=64)  # the CNN converges quickly; 3 epochs is enough for 90%+

    # 4. Test accuracy
    acc = test(model, X_test, y_test)

    # 5. Plot loss
    plot_loss(loss_list)

    # 6. Save model
    save_path = os.path.join("models", "cnn_model.npz")
    model.save(save_path)
    print(f"Model saved to {save_path}!")

    # 7. Load model (optional)
    print("Reloading model...")
    model2 = MinimalCNNClassifier(lr=0.001, use_adam=True)
    model2.load(save_path)
    print("Reload complete.")

    # 8. Test reloaded model
    acc2 = test(model2, X_test, y_test)
    print(f"Reloaded model accuracy: {acc2:.4f}")

    # 9. Single-sample inference demo
    sample_pred = model2(X_test[0].squeeze())
    print(f"Single sample predicted class: {sample_pred}, true: {np.argmax(y_test[0])}")