GatedFusion-Net: Per-pixel modality weighting in a five-cue transformer for RGB-D-I-T-UV fusion

Martin Brenner*, Napoleon H. Reyes, Teo Susnjak, Andre Luis Chautard Barczak

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

We introduce GatedFusion-Net (GF-Net), built on the SegFormer Transformer backbone, as the first architecture to unify RGB, depth (D), infrared intensity (I), thermal (T), and ultraviolet (UV) imagery for dense semantic segmentation on the MM5 dataset. GF-Net departs from the CMX baseline via: (1) stage-wise RGB-intensity-depth enhancement that injects geometrically aligned D and I cues at each encoder stage, together with surface normals (N), improving illumination invariance without adding parameters; (2) per-pixel sigmoid gating, where independent Sigmoid Gate blocks learn spatial confidence masks for T and UV and add their contributions to the RGB+DIN base, trimming computational cost while preserving accuracy; and (3) modality-wise normalisation using per-stream statistics computed on MM5 to stabilise training and balance cross-cue influence. An ablation study shows that the five-modality configuration (RGB+DIN+T+UV) achieves a peak mean IoU of 88.3 %, with the UV channel contributing a 1.7-percentage-point gain under optimal lighting (RGB3); under challenging illumination, the configuration maintains comparable performance, indicating complementary but situational value. Modality-ablation experiments reveal strong sensitivity: removing RGB, T, DIN, or UV yields relative mean IoU reductions of 83.4 %, 63.3 %, 56.5 %, and 30.1 %, respectively. Sigmoid Gate fusion behaves primarily as static, lighting-dependent weighting rather than adapting to sensor loss. Throughput on an RTX 3090 with a MiT-B0 backbone is real-time: 640 × 480 at 74 fps for RGB+DIN+T, 55 fps for RGB+DIN+T+UV, and 41 fps with five gated streams. These results establish the first RGB-D-I-T-UV segmentation baselines on MM5 and show that per-pixel sigmoid gating is a lightweight, effective alternative to heavier attention-based fusion.
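To illustrate the per-pixel sigmoid gating the abstract describes, the following PyTorch sketch shows one plausible form of a Sigmoid Gate block: a 1 × 1 convolution predicts a spatial confidence mask for an auxiliary modality (T or UV), and the gated features are added to the RGB+DIN base. The module name, mask predictor, and channel counts are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SigmoidGate(nn.Module):
    """Hypothetical sketch of per-pixel sigmoid gating: predicts a
    spatial confidence mask for one auxiliary modality and adds its
    gated contribution to the base feature map."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv + sigmoid -> one confidence value per pixel (assumed design)
        self.mask = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, base: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # base: RGB+DIN stage features; aux: T or UV stage features (same shape)
        g = self.mask(aux)        # per-pixel confidence in [0, 1]
        return base + g * aux     # additive, spatially weighted contribution

# Usage sketch: independent gates for the thermal (T) and ultraviolet (UV)
# streams, applied on top of the RGB+DIN base at one encoder stage.
base = torch.randn(1, 64, 120, 160)     # hypothetical stage feature map
feat_t = torch.randn(1, 64, 120, 160)
feat_uv = torch.randn(1, 64, 120, 160)

gate_t, gate_uv = SigmoidGate(64), SigmoidGate(64)
fused = gate_uv(gate_t(base, feat_t), feat_uv)
print(fused.shape)  # torch.Size([1, 64, 120, 160])
```

Because each gate is a single 1 × 1 convolution rather than a cross-attention block, this kind of fusion adds little compute per stream, which is consistent with the reported accuracy-throughput trade-off across three, four, and five modalities.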
Original language: English
Article number: 103986
Pages (from-to): 1-25
Number of pages: 25
Journal: Information Fusion
Volume: 129
Early online date: 26 Nov 2025
Publication status: E-pub ahead of print - 26 Nov 2025
