CNN Regression on Rendered Meshes: What Improved It

I’ve been training a CNN to detect a specific feature in multi-angle renders of 3D objects. The task: given a single rendered image, predict the 2D pixel position of a known feature point. Sounds straightforward. It took a while to get right.

The first model was off by about 17% of image width on average. The task isn’t hard, so a 17% error pointed at the evaluation setup rather than capacity.

Validation loss was stable throughout training. That made it look like a signal problem or a capacity problem. It was neither: the train/val split was leaking same-object views into both sides.

Data leakage from multi-angle renders

The dataset was multi-angle renders of 3D objects. Eighteen renders per object from different camera positions. Each render labelled with the 2D projected position of a specific feature point.

Train/val split was random by sample. That put the same object into both sets, just at different angles. The model was scoring against objects it had already seen from adjacent viewpoints, which is closer to lookup than generalisation.

Fix: split by object ID. Every render of a given object goes entirely to train or entirely to val. After this, validation error shot up and I had an honest baseline.
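
A minimal sketch of that split, assuming each sample record carries an object_id field (the field name is illustrative):

import random

def split_by_object(samples, val_fraction=0.2, seed=42):
    # Shuffle object IDs, not individual samples
    object_ids = sorted({s["object_id"] for s in samples})
    random.Random(seed).shuffle(object_ids)
    val_ids = set(object_ids[:int(len(object_ids) * val_fraction)])
    # Every render of an object lands entirely in train or entirely in val
    train = [s for s in samples if s["object_id"] not in val_ids]
    val = [s for s in samples if s["object_id"] in val_ids]
    return train, val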

Perspective vs orthographic projection

The renders used a perspective camera. Apparent feature size varied with camera distance - larger close up, smaller further away - even though the feature itself was unchanged. That's extra variance the model has to learn through.

Switching to orthographic removes it. Parallel projection means the feature appears at consistent scale regardless of camera position. The model can focus on location rather than accounting for projection geometry.

Both pipelines run in parallel now. Perspective is more realistic. Orthographic is a cleaner training signal.
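
A toy illustration of the difference, assuming a pinhole perspective model and a unit-scale orthographic projection (both simplifications of the real render cameras):

import numpy as np

def perspective_project(point, focal=1.0):
    # Apparent position scales with 1/depth: the same offset looks larger close up
    x, y, z = point
    return np.array([focal * x / z, focal * y / z])

def orthographic_project(point):
    # Parallel rays: depth is dropped, so scale is independent of camera distance
    x, y, z = point
    return np.array([x, y])

near = np.array([0.1, 0.05, 1.0])  # feature offset, camera 1 unit away
far = np.array([0.1, 0.05, 3.0])   # same offset, camera 3 units away
print(perspective_project(near), perspective_project(far))    # differ by 3x
print(orthographic_project(near), orthographic_project(far))  # identical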

Doubling data with horizontal mirroring

The target feature appears twice in each object, symmetrically. In side views, the near instance is visible and the far one is hidden. At the opposite angle, it flips.

Flipping every training image horizontally and inverting the x coordinate turns each sample into two. No re-rendering required. This also handled front-facing views, which were initially skipped because both instances are equidistant. Assign one instance’s label, flip to get the other.
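
A sketch of the doubling - coordinates are assumed normalised to [0, 1], so the mirrored x is just 1 - x, and the flipped copy reuses the original file via a mirror flag:

def mirror_samples(samples):
    # Each (path, (x, y)) record becomes two; no new renders are written
    doubled = []
    for path, (x, y) in samples:
        doubled.append((path, (x, y), False))       # original orientation
        doubled.append((path, (1.0 - x, y), True))  # flipped at load time
    return doubled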

Occluded instances were also excluded initially. The 2D projected coordinate is still geometrically correct even when the feature is behind other geometry, so they’re included too. More data, no meaningful downside.

Memory management for a large tf.data pipeline

Adding mirroring and front-facing views roughly doubled the dataset. The training script had been loading all images into RAM at startup. That worked until the dataset grew large enough that the OS killed the process mid-epoch.

Loading tens of thousands of 512x512 float32 RGB images in one shot is a lot of memory - each one is 512 × 512 × 3 × 4 bytes, about 3 MB, so 30,000 of them is on the order of 90 GB. The fix was a tf.data pipeline that loads from disk per batch:

def _load_image(path, coord, mirror):
    # Read and decode the render from disk only when the batch is built
    img = tf.io.read_file(path)
    img = tf.io.decode_png(img, channels=3)
    img = tf.image.resize(img, [512, 512])
    img = tf.cast(img, tf.float32) / 255.0
    # Mirrored samples share the original file; the flip happens here
    img = tf.cond(mirror, lambda: tf.image.flip_left_right(img), lambda: img)
    return img, coord

The dataset stores paths and a mirror flag instead of image arrays. Memory use stays flat regardless of dataset size.
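
Roughly how that dataset is assembled from the (path, coord, mirror) records before the pipeline below - the variable names are illustrative, not from the original script:

import tensorflow as tf

paths, coords, mirrors = zip(*records)  # the doubled sample list from above

dataset = tf.data.Dataset.from_tensor_slices((
    list(paths),
    tf.constant(coords, dtype=tf.float32),
    list(mirrors),
))
# Only paths and labels sit in memory; pixels are read inside _load_image
# one batch at a time.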

AUTOTUNE parallelism was too aggressive on this machine - enough I/O threads to starve the system. Fixed with explicit settings:

dataset = (
    dataset
    .shuffle(buffer_size=2000)
    .map(_load_image, num_parallel_calls=4)
    .batch(batch_size)
    .prefetch(2)
)

Also added set_memory_growth(True) for the GPU. TensorFlow pre-allocates nearly all GPU memory by default, which competed with the data pipeline.
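
The memory growth setting is the standard tf.config call, applied before any model or dataset is created:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all at startup
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)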

CNN architecture and training schedule

MobileNetV2 pretrained on ImageNet, fine-tuned for regression. Output head:

GlobalAveragePooling2D
Dense(256, L2)
Dropout(0.4)
Dense(128, L2)
Dropout(0.3)
Dense(2, sigmoid)

Two outputs: normalised x and y of the nearest feature instance. Sigmoid constrains output to [0, 1], MSE loss.
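
A sketch of that model in the Keras functional API - the hidden-layer activations, L2 strength and input size are assumptions, not values from the original setup:

import tensorflow as tf
from tensorflow.keras import layers, regularizers

base = tf.keras.applications.MobileNetV2(
    input_shape=(512, 512, 3), include_top=False, weights="imagenet")
base.trainable = False  # phase one: backbone frozen

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4))(x)
x = layers.Dropout(0.4)(x)
x = layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4))(x)
x = layers.Dropout(0.3)(x)
out = layers.Dense(2, activation="sigmoid")(x)  # normalised (x, y)

model = tf.keras.Model(base.input, out)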

Training runs in two phases. First 15 epochs: backbone frozen, only the head trains at a low learning rate. Then the backbone unfreezes and the whole network fine-tunes at a lower rate. Early stopping on validation loss with patience 15, restoring the best checkpoint.
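
The schedule, sketched with assumed learning rates and dataset names (train_ds, val_ds), since the originals aren't quoted here:

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=15, restore_best_weights=True)

# Phase 1: frozen backbone, only the head learns
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
model.fit(train_ds, validation_data=val_ds, epochs=15)

# Phase 2: unfreeze everything and fine-tune at a lower rate
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss="mse")
model.fit(train_ds, validation_data=val_ds, epochs=200,
          callbacks=[early_stop])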

After fixing the split and adding the data improvements, error dropped to under 2% of image width - roughly a 9x reduction.

Second-layer MLP for 3D position

The CNN predicts 2D position in a single rendered image. The actual target is a 3D point in object space. To get there: run the CNN on all 9 camera angles, then combine the results.

For the orthographic pipeline you can do this analytically. Orthographic cameras project with parallel rays, so a 2D prediction back-projects to a ray in 3D. Rays from multiple viewpoints intersect at the 3D point.

Ray intersection is brittle when predictions are noisy, though. A small MLP works better in practice.
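
For reference, the geometric fallback has a closed form: each orthographic view contributes a ray with origin o_i and direction d_i, and the least-squares intersection point minimises the summed squared distance to all rays. A sketch, assuming the rays are already expressed in a common object frame:

import numpy as np

def intersect_rays(origins, directions):
    # Solve (sum of P_i) p = sum of P_i o_i, where P_i projects onto the
    # plane perpendicular to ray direction d_i
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)
        A += P
        b += P @ o
    return np.linalg.solve(A, b)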

Input: normalised 2D predictions from all 9 angles - 18 values.

Dense(128, relu)
Dropout(0.2)
Dense(64, relu)
Dropout(0.2)
Dense(6, linear)

Output: 3D coordinates for both feature instances in millimetres, in aligned object space.
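
The MLP is small enough to write out directly - a Keras sketch with the relu and linear activations as listed; the optimiser and loss are assumptions:

from tensorflow.keras import Sequential, layers

mlp = Sequential([
    layers.Input(shape=(18,)),   # nine angles x (x, y)
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(6),             # (x, y, z) for both instances
])
mlp.compile(optimizer="adam", loss="mse")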

Targets are normalised using training-set mean and standard deviation. The mean and std are saved alongside the model for inference.

During training, Gaussian noise is added to the 2D inputs to simulate real CNN prediction error. Without it the MLP trains on clean 2D coordinates from the ground truth, then gets surprised at inference time when CNN predictions are off.
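
A sketch of that preparation, with x_train holding the 18-value 2D inputs, y_train the 6-value 3D targets, and a placeholder noise scale to tune against actual CNN error:

import numpy as np

# Normalise targets with training-set statistics and save them for inference
target_mean = y_train.mean(axis=0)
target_std = y_train.std(axis=0)
y_train_norm = (y_train - target_mean) / target_std
np.savez("target_norm.npz", mean=target_mean, std=target_std)

# Jitter the clean ground-truth 2D inputs to mimic CNN prediction error
noise_sigma = 0.01  # in normalised image coordinates
x_train_noisy = x_train + np.random.normal(0.0, noise_sigma, x_train.shape)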

Validating 3D predictions

Numbers alone aren’t enough to trust a 3D prediction. I built a viewer: pre-rendered 2D images with a dot overlay on the left, a Three.js 3D model on the right.

The 3D panel loads the original object, applies the same alignment transform as the render pipeline, and places four dots - predicted and actual for both feature instances. An X-Ray toggle makes the mesh translucent so you can see whether a dot is sitting on the surface or sunk inside it.

The alignment uses the same quaternion rotation sequence as the render pipeline. Getting that wrong puts dots in the wrong coordinate frame. I validated it against known reference points before trusting any MLP output.

The viewer calls /infer for the 2D result and /infer/3d for the MLP. The 3D endpoint runs the CNN across all nine pre-rendered angles, assembles the 18-value input, runs the MLP, and returns coordinates in a few hundred milliseconds. The two-tier structure means the 2D overlay responds fast and the 3D panel follows.
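
The server framework isn't described above, but the 3D endpoint reduces to a few lines - a Flask-flavoured sketch where render_paths_for and predict_2d are hypothetical helpers standing in for the pre-rendered angle lookup and the CNN call:

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/infer/3d", methods=["POST"])
def infer_3d():
    object_id = request.json["object_id"]
    # Run the CNN on all nine pre-rendered angles and flatten to 18 values
    preds = [predict_2d(p) for p in render_paths_for(object_id)]
    features = np.array(preds, dtype=np.float32).reshape(1, 18)
    # Undo the target normalisation saved at MLP training time
    coords = mlp.predict(features)[0] * target_std + target_mean
    return jsonify({"points_mm": coords.tolist()})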

Where this goes depends on the 3D error. The 2D layer is usable now. If the MLP can hold error to a few millimetres, the 3D layer will be too. If not, geometric triangulation is the fallback - no training required, just noisier.