The classical CV pipeline from the first post worked in controlled conditions but broke in real photos. Too many parameters, too sensitive to lighting and angle. I needed a model that could generalise.
Roboflow
Roboflow is a platform for training and deploying computer vision models. You upload labelled images, train a model, and get an API endpoint or downloadable weights.
I annotated about 200 Go board images, drawing bounding boxes around black and white stones. Roboflow’s annotation tool made this faster than I expected. The trained YOLOv8 model detected stones with decent accuracy on clean photos.
from roboflow import Roboflow
rf = Roboflow(api_key="your_key")
project = rf.workspace().project("go-stones")
model = project.version(1).model
result = model.predict("board.jpg", confidence=40)  # confidence is a 0-100 percentage threshold
predictions = result.json()["predictions"]
Each prediction includes a bounding box, class (black or white), and confidence score.
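Unpacking that JSON takes a few lines. A minimal sketch, assuming the standard Roboflow response fields (x and y are box centres in pixels; confidences in the JSON are 0-1 floats):

stones = []
for p in predictions:
    if p["confidence"] < 0.5:            # skip low-confidence detections
        continue
    # (x, y) is the box centre in pixel coordinates; class is "black" or "white"
    stones.append((p["x"], p["y"], p["class"]))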
The problem was coverage. 200 training images wasn't enough for the variety of boards, lighting, and angles I wanted to handle. I could annotate more, but labelling Go stones is tedious. A 19×19 board can have over 300 stones. I wanted a segmentation approach that didn't require per-stone annotation.
Meta SAM 3
Meta’s Segment Anything Model 3 (SAM 3) is designed for zero-shot segmentation. You give it an image, optionally with point or box prompts, and it segments objects without training on your specific domain.
The appeal for Go board reading: SAM can segment stones without any Go-specific training data. It understands object boundaries from its massive pre-training.
import numpy as np
from sam3 import SAM3Predictor

predictor = SAM3Predictor.from_pretrained("meta-sam3-large")
predictor.set_image(image)

# prompt with a point on a stone (label 1 marks it as foreground)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[250, 300]]),
    point_labels=np.array([1]),
)
SAM 3 segmented individual stones cleanly, even on cluttered boards with shadows. The segmentation masks were far more precise than bounding boxes, giving me exact stone boundaries.
But SAM 3 needs prompts. It won’t automatically find every stone. I needed either a grid of prompt points (one per intersection) or another model to generate prompts. This created a two-stage pipeline: detect candidate locations, then segment each one with SAM.
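The grid-of-prompts route is straightforward once you have the board corners. A sketch using OpenCV's perspective transform; the corner coordinates here are placeholders you'd get from a separate corner-detection step:

import cv2
import numpy as np

# hypothetical corner coordinates (top-left, top-right, bottom-right,
# bottom-left) from a corner detector
corners = np.array([[120, 80], [860, 95], [875, 830], [105, 815]],
                   dtype=np.float32)

# map a canonical 19x19 lattice (indices 0..18) onto the photographed board
canonical = np.array([[0, 0], [18, 0], [18, 18], [0, 18]], dtype=np.float32)
H = cv2.getPerspectiveTransform(canonical, corners)

grid = np.array([[x, y] for y in range(19) for x in range(19)],
                dtype=np.float32).reshape(-1, 1, 2)
prompt_points = cv2.perspectiveTransform(grid, H).reshape(-1, 2)
# 361 (x, y) pixel locations, one per intersection, ready to use
# as positive point prompts for SAM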
Running SAM 3 through Ultralytics
Ultralytics wraps SAM into its YOLO library. This simplified the pipeline by combining detection and segmentation.
from ultralytics import SAM

model = SAM("sam3_l.pt")
results = model("board.jpg")

for result in results:
    masks = result.masks   # per-object segmentation masks
    boxes = result.boxes   # matching bounding boxes
The Ultralytics wrapper handles prompt generation automatically. It runs everything in a single pass, finding and segmenting objects without manual point prompts.
The catch: SAM models are large. The sam3_l checkpoint is several gigabytes. Running inference on a CPU takes minutes per image. I needed a GPU.
Jupyter on Google Colab’s GPU
SAM 3 on a CPU is unusable for iteration, so I moved the whole pipeline into a Jupyter notebook on Google Colab, which gives you a free T4 GPU (16GB VRAM).
The Jupyter environment was the right fit for this stage. Each cell runs independently, so I could load the model once and re-run inference on different images without restarting. Visualising segmentation masks inline with matplotlib made it easy to see what SAM was detecting.
# Cell 1: setup
!pip install ultralytics matplotlib
import torch
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_mem / 1e9:.1f} GB")
# Cell 2: load model (runs once, stays in GPU memory)
from ultralytics import SAM
model = SAM("sam3_l.pt")
# Cell 3: run inference and visualise
import matplotlib.pyplot as plt
from PIL import Image
results = model("board.jpg")
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
axes[0].imshow(Image.open("board.jpg"))
axes[0].set_title("Original")
# overlay masks on the image (.plot() returns a BGR array, so flip to RGB for matplotlib)
annotated = results[0].plot()
axes[1].imshow(annotated[..., ::-1])
axes[1].set_title("Segmentation")
plt.show()
Inference dropped from minutes to about 2 seconds per image. The notebook workflow meant I could tweak parameters, re-run a cell, and immediately see the visual output. No reloading the model, no restarting scripts.
I uploaded test images directly through Colab’s file browser or mounted Google Drive for larger batches.
from google.colab import drive
drive.mount("/content/drive")
import glob
board_images = glob.glob("/content/drive/MyDrive/go-boards/*.jpg")
print(f"Found {len(board_images)} board images")
What SAM segments (and doesn’t)
The automatic segmentation mode segments everything in the image. Not just stones. It finds the board, the table, shadows, fingers if they’re in frame. Filtering stone segments from the noise required post-processing based on mask size, circularity, and position relative to the detected grid.
import cv2
import numpy as np

stone_masks = []
for mask in all_masks:
    area = mask.sum()
    if min_stone_area < area < max_stone_area:
        contours, _ = cv2.findContours(
            mask.astype(np.uint8),
            cv2.RETR_EXTERNAL,
            cv2.CHAIN_APPROX_SIMPLE,
        )
        if contours:
            perimeter = cv2.arcLength(contours[0], True)
            if perimeter > 0:
                # 1.0 for a perfect circle, lower for irregular shapes
                circularity = 4 * np.pi * area / (perimeter ** 2)
                if circularity > 0.7:
                    stone_masks.append(mask)
Circularity filtering worked well. Stones are round, most other segments aren’t. But I still got false positives from round shadows and the circular decorations some boards have at the star points.
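The grid-position check is the last filter: snap each surviving mask's centroid to the nearest intersection and drop anything too far from the lattice. A sketch, assuming the 361 prompt_points from the perspective-transform step earlier and a max_snap_dist of roughly half the grid spacing (both my own choices):

# snap each mask centroid to the nearest grid intersection; reject masks
# that land off the lattice (shadows, clutter, star-point decorations
# survive the circularity test but rarely sit exactly on an intersection
# the way a stone does)
placed = {}
for mask in stone_masks:
    ys, xs = np.nonzero(mask)
    centroid = np.array([xs.mean(), ys.mean()])
    dists = np.linalg.norm(prompt_points - centroid, axis=1)
    idx = int(dists.argmin())
    if dists[idx] < max_snap_dist:
        row, col = divmod(idx, 19)     # the grid was built row-major
        placed[(row, col)] = mask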
Why Colab wasn’t enough
Colab worked for experimentation but had real limits. Free tier sessions disconnect after 30-90 minutes of inactivity. The GPU allocation isn’t guaranteed. Sometimes you get a T4, sometimes you get nothing and have to wait. For batch processing a full game (50+ photos), a session disconnect halfway through means starting over.
Colab Pro helps, but at that price point I started looking at alternatives where I controlled the session lifecycle.
Vast.ai for on-demand GPUs
Vast.ai is a marketplace for renting GPUs. You browse available machines, pick one, and SSH in or run Docker containers. Pricing varies by GPU type and demand, but it’s consistently cheaper than cloud providers.
I set up a Docker image with the pipeline pre-installed.
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
RUN pip install ultralytics opencv-python-headless
COPY pipeline/ /app/pipeline/
WORKDIR /app
CMD ["python", "pipeline/detect.py"]
On Vast.ai, I rent an RTX 4090 for about $0.30/hour. The pipeline runs, processes a batch of board images, and I terminate the instance. No idle costs.
vastai search offers "gpu_name=RTX_4090 num_gpus=1"
vastai create instance <offer_id> \
--image my-go-pipeline:latest \
--disk 20
The workflow: upload board photos to the instance, run the pipeline, download the results. For batch processing (analysing a full game from a series of photos), this is more practical than Colab. Sessions don’t disconnect, I control the hardware, and the cost is predictable.
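In practice that's a few shell commands per batch; <host> and <port> come from the instance's SSH details, and the --input flag is whatever your detect.py accepts:

# upload boards, run the pipeline, pull results back down
scp -P <port> boards/*.jpg root@<host>:/app/input/
ssh -p <port> root@<host> "python /app/pipeline/detect.py --input /app/input"
scp -P <port> "root@<host>:/app/output/*" results/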
Where the project stands
The current pipeline combines Ultralytics SAM 3 for segmentation with post-processing to filter stone masks and map them to grid positions. It runs on rented GPUs via Vast.ai.
Accuracy on varied real-world photos is around 92-95% for stone detection, higher for colour classification. The remaining errors are mostly edge cases: stones on the board perimeter where perspective distortion is worst, and boards where stones are tightly clustered.
Next steps include fine-tuning on Go-specific images, adding temporal tracking for game recording (matching board states across consecutive photos), and building a simple API that accepts a photo and returns an SGF file.
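The SGF end of that is simple enough to sketch now. A minimal version that turns (row, col, colour) tuples into a single-node SGF position, using SGF's AB/AW add-black/add-white setup properties (the tuple format is my own):

def to_sgf(stones):
    """stones: iterable of (row, col, colour), colour in {"black", "white"}."""
    letters = "abcdefghijklmnopqrs"          # SGF coordinates for 19x19
    black, white = [], []
    for row, col, colour in stones:
        coord = f"[{letters[col]}{letters[row]}]"   # column letter first
        (black if colour == "black" else white).append(coord)
    sgf = "(;GM[1]FF[4]SZ[19]"               # GM[1] = Go, SZ = board size
    if black:
        sgf += "AB" + "".join(black)
    if white:
        sgf += "AW" + "".join(white)
    return sgf + ")"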