Who this is for: MVI administrators staring at a cryptic error log, developers whose training jobs have been pending for hours, and anyone who has searched IBM documentation for an answer and found architecture diagrams instead. If something is broken and you need it fixed, start here.
Read Time: 30-35 minutes
The Error Message That Says Nothing
Your training job failed. The log says: Error: Training could not be completed.
That is the entire error. No stack trace. No error code. No hint about whether the GPU is missing, the dataset is corrupt, or the storage ran out of space. You check the pod logs. You check the events. You check the documentation. The documentation shows you a screenshot of a successful training run.
"We spent three days troubleshooting what turned out to be a PVC access mode set to ReadWriteOnce instead of ReadWriteMany. Three days. One wrong dropdown selection."
This blog exists because MVI error messages are often unhelpful and the official troubleshooting documentation assumes you already know what is wrong. We have compiled every failure mode we have encountered across dozens of MVI deployments into a single reference. Symptoms. Diagnosis. Fixes. No ambiguity.
If you came here from a search engine at 2 AM with a broken training pipeline, we respect that. Let us fix it.
GPU Troubleshooting
GPU issues are the number one cause of MVI deployment failures. Not because GPUs are hard -- because the stack between "GPU exists" and "MVI can use GPU" has six layers, and any one of them can silently fail.
Problem: GPU Not Detected by MVI
Symptom: Training jobs start but run on CPU. Training that should take 20 minutes takes 6+ hours. The MVI UI shows no GPU resources available.
Diagnosis flowchart:
GPU NOT DETECTED - DIAGNOSIS
════════════════════════════
Step 1: Is a physical GPU present?
┌─────────────────────────────────┐
│ oc describe node <gpu-node> │
│ Look for: nvidia.com/gpu │
└────────────┬────────────────────┘
│
┌────┴────┐
│ Found? │
└────┬────┘
Yes │ No
│ │ │
v │ v
Step 2 │ PROBLEM: GPU Operator not installed
│ or node not labeled. Go to Fix A.
│
Step 2: Is NVIDIA GPU Operator healthy?
┌─────────────────────────────────┐
│ oc get pods -n gpu-operator- │
│ resources │
│ All pods should be Running │
└────────────┬────────────────────┘
│
┌────┴────┐
│ All │
│ Running?│
└────┬────┘
Yes │ No
│ │ │
v │ v
Step 3 │ PROBLEM: GPU Operator pods crashed.
│ Go to Fix B.
│
Step 3: Is CUDA version correct?
┌─────────────────────────────────┐
│ oc exec <gpu-pod> -- nvidia-smi │
│ Check CUDA Version line │
└────────────┬────────────────────┘
│
┌────┴────┐
│ CUDA │
│ 11.8+? │
└────┬────┘
Yes │ No
│ │ │
v │ v
Step 4 │ PROBLEM: CUDA version too old.
│ Go to Fix C.
│
Step 4: Does GPU have 16+ GB VRAM?
┌─────────────────────────────────┐
│ nvidia-smi output shows │
│ total memory per GPU │
└────────────┬────────────────────┘
│
┌────┴────┐
│ 16 GB+? │
└────┬────┘
Yes │ No
│ │ │
v │ v
GPU is │ PROBLEM: Insufficient VRAM.
healthy │ Go to Fix D.
Fix A -- GPU Operator not installed:
# Verify GPU Operator is installed
oc get csv -n gpu-operator-resources | grep gpu
# If missing, install via OperatorHub
# OpenShift Console > Operators > OperatorHub > NVIDIA GPU Operator
# Minimum OpenShift version: 4.8.22
Fix B -- GPU Operator pods unhealthy:
# Check operator pod logs
oc logs -n gpu-operator-resources <crashed-pod-name>
# Common cause: driver version mismatch
# Reinstall with correct driver version for your GPU architecture
Fix C -- CUDA version too old:
MAS 9.0 requires CUDA 11.8 or later. If your nvidia-smi output shows an older version, update the NVIDIA GPU Operator to a version that bundles CUDA 11.8+.
Fix D -- Insufficient VRAM:
MVI requires a minimum of 16 GB GPU memory per GPU. If your GPU has less, training will either fail or fall back to CPU. Supported GPUs with sufficient VRAM include A100 (40/80 GB), A40 (48 GB), A30 (24 GB), V100 (16/32 GB), and T4 (16 GB).
Source: MVI Supported GPU Devices
Problem: Kepler GPU (K80) No Longer Works After MAS 9.0 Upgrade
Symptom: Training jobs that worked on MAS 8.x fail after upgrading to 9.0. GPU appears present but MVI cannot use it.
Diagnosis: Check your GPU model. If it is a Tesla K80 or any Kepler-architecture GPU, that is the problem.
Fix: Kepler GPUs are not supported from MAS 9.0 onward. You must replace the hardware. The minimum supported architecture is Pascal (P4, P40, P100).
SUPPORTED GPU ARCHITECTURES
═══════════════════════════
Architecture Example GPUs MAS Support
──────────── ──────────── ───────────
Kepler K80 8.x only (REMOVED in 9.0)
Pascal P4, P40, P100 8.8+
Volta V100 8.8+
Turing T4 8.8+
Ampere A10, A16, A40, 8.8+
A30, A100
Ada Lovelace RTX 4000, L40 9.0+
Hopper H100 9.0+
If you are running K80s in a cloud environment, switch to T4 or A10 instances. If on-premises, budget for GPU replacement before upgrading to MAS 9.0.
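To spot Kepler cards across a whole cluster quickly, you can query the driver pods directly. A minimal sketch -- the namespace follows the examples earlier in this guide, and the `app=nvidia-driver-daemonset` label is an assumption about your GPU Operator install, so verify both before relying on it:

```shell
# List the GPU model on every node running the NVIDIA driver daemonset
# (namespace and label are assumptions -- check with `oc get pods -n gpu-operator-resources`)
for pod in $(oc get pods -n gpu-operator-resources \
    -l app=nvidia-driver-daemonset -o name); do
  echo "== $pod =="
  oc exec -n gpu-operator-resources "$pod" -- \
    nvidia-smi --query-gpu=name --format=csv,noheader
done
# Any output containing "K80" is a Kepler card that must be replaced
# before the MAS 9.0 upgrade
```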
Problem: GPU Out of Memory (OOM) During Training
Symptom: Training job starts, runs for a few minutes, then crashes with an out-of-memory error. Pod logs show CUDA out of memory or RuntimeError: CUDA error: out of memory.
Diagnosis:
# Check current GPU memory usage
oc exec <training-pod> -- nvidia-smi
# Look for memory allocation vs. total
# If used memory is near total, OOM is expected
Fix:
- Reduce batch size. This is the most effective single change. Halving the batch size roughly halves GPU memory usage.
- Use a smaller model architecture. Tiny YOLO v3 uses significantly less memory than Faster R-CNN or Detectron2.
- Reduce input image resolution. Smaller images consume less GPU memory during training.
- Use MAS 9.0 GPU workload optimization. MAS 9.0 added the ability to assign GPUs specifically to training versus inference workloads, preventing inference traffic from consuming memory during training.
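Before changing batch size or architecture, it helps to confirm the OOM by watching memory live during a run. A sketch using nvidia-smi's query mode (the pod name is a placeholder):

```shell
# Poll GPU memory every 5 seconds while the training job runs.
# If memory.used climbs to memory.total right before the crash, OOM is confirmed.
oc exec <training-pod> -- nvidia-smi \
  --query-gpu=timestamp,memory.used,memory.total --format=csv -l 5
```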
Problem: Multiple GPUs Present but Only One Used
Symptom: You have a multi-GPU node but training only uses one GPU.
Diagnosis: Check the training job's resource requests. If it requests nvidia.com/gpu: 1, it will only get one GPU regardless of how many are available.
Fix: MVI training jobs use the GPU count specified in the configuration. For multi-GPU training, ensure your deployment configuration requests the correct number of GPUs. Note that not all model architectures benefit equally from multi-GPU training. Object detection models (Faster R-CNN, YOLO v3) typically scale better across GPUs than classification models (GoogLeNet).
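One way to see what the job actually requested, rather than what you intended (pod name is a placeholder):

```shell
# Print the nvidia.com/gpu request for each container in the training pod.
# Note the escaped dot: jsonpath treats "." as a path separator.
oc get pod <training-pod> -o \
  jsonpath='{range .spec.containers[*]}{.name}: {.resources.requests.nvidia\.com/gpu}{"\n"}{end}'
```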
Training Troubleshooting
Problem: Training Is Extremely Slow
Symptom: A training job that should complete in 20-40 minutes has been running for hours.
Diagnosis:
SLOW TRAINING - DIAGNOSIS
═════════════════════════
Is it running on GPU or CPU?
┌────────────────────────────────┐
│ Check training pod resources: │
│ oc describe pod <training-pod> │
│ Look for nvidia.com/gpu in │
│ the resource requests │
└──────────────┬─────────────────┘
│
┌────┴────┐
│ GPU │
│ request │
│ present?│
└────┬────┘
Yes │ No
│ │ │
v │ v
Check │ CAUSE: Training on CPU.
storage │ CPU is 10-50x slower.
IOPS │ Fix GPU detection first.
│
┌─────────────────────────────────┐
│ Is storage IOPS adequate? │
│ Training needs 3000+ IOPS │
│ for loading images to GPU │
└──────────────┬──────────────────┘
│
┌────┴────┐
│ 3000+ │
│ IOPS? │
└────┬────┘
Yes │ No
│ │ │
v │ v
Check │ CAUSE: I/O bottleneck.
dataset │ GPU idles waiting for
size │ images. Upgrade storage.
│
┌─────────────────────────────────┐
│ How many images? │
│ < 100: Fast (minutes) │
│ 100-1000: Moderate (10-60 min) │
│ 1000-10000: Longer (1-4 hrs) │
│ > 10000: Hours to days │
└─────────────────────────────────┘
Fixes:
- Verify GPU is being used (see GPU Troubleshooting above)
- Ensure storage provides 3000+ IOPS. NFS shares commonly bottleneck here. Use block storage or high-performance NFS.
- Reduce max_iter for initial experimentation. Default is 1500. Drop to 500 for quick validation runs, then increase for final training.
- Use auto-labeling: train a preliminary model on 5-10 manually labeled images, then use that model to auto-label the rest of your dataset before full training.
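To check whether the volume actually delivers 3000+ IOPS, a quick fio benchmark against the mounted PVC works. A sketch, assuming the PVC is mounted at /mnt/mvi inside a debug pod (the path is an example) and fio is available there:

```shell
# 4k random-read benchmark against the MVI data volume (path is a placeholder)
fio --name=mvi-iops --directory=/mnt/mvi --rw=randread --bs=4k \
    --size=1G --ioengine=libaio --iodepth=32 --direct=1 \
    --runtime=30 --time_based --group_reporting
# Check the "read: IOPS=" line in the output; sustained values below
# ~3000 mean the GPU sits idle waiting for images
```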
Problem: Model Will Not Converge (Accuracy Stays Low)
Symptom: Training completes but accuracy, precision, or recall remain below acceptable thresholds. The loss curve plateaus early or oscillates without decreasing.
Diagnosis:
CONVERGENCE FAILURE - DIAGNOSIS
════════════════════════════════
┌───────────────────┐ ┌───────────────────┐
│ Loss curve flat │ │ Loss curve │
│ from the start │ │ oscillates wildly │
│ │ │ │
│ CAUSE: Learning │ │ CAUSE: Learning │
│ rate too low or │ │ rate too high │
│ dataset quality │ │ │
│ issue │ │ FIX: Reduce │
│ │ │ learning_rate by │
│ FIX: Check labels │ │ factor of 10 │
│ first, then │ │ (0.001 -> 0.0001) │
│ increase LR │ │ │
└───────────────────┘ └───────────────────┘
┌───────────────────┐ ┌───────────────────┐
│ Loss decreases │ │ Training accuracy │
│ then plateaus at │ │ high, validation │
│ high value │ │ accuracy low │
│ │ │ │
│ CAUSE: Model │ │ CAUSE: Overfitting │
│ capacity too low │ │ │
│ or not enough │ │ FIX: More data, │
│ data │ │ augmentation, or │
│ │ │ reduce max_iter │
│ FIX: More data, │ │ │
│ switch to larger │ │ │
│ architecture │ │ │
└───────────────────┘ └───────────────────┘
Fixes:
- Check your labels first. Bad labels are the most common cause. Open your dataset and visually verify 20+ random labels. Mislabeled images inject noise that prevents convergence.
- Add augmentation. MVI offers 8 augmentation options: blur, sharpen, crop, rotate, vertical flip, horizontal flip, color, and noise. Enable at least 3-4 to increase effective dataset diversity.
- Adjust hyperparameters. Default values (max_iter=1500, test_iter=100, test_interval=20, learning_rate=0.001) work for most cases. If loss oscillates, reduce learning_rate. If training is too slow to converge, increase max_iter.
- Switch architecture. GoogLeNet is the default for classification. Faster R-CNN is the default for object detection. If accuracy is insufficient, these defaults are usually correct -- the problem is data, not architecture.
Problem: SSD Model Will Not Train After MAS 9.1 Upgrade
Symptom: You have SSD (Single Shot Detector) models that trained successfully on MAS 9.0 or earlier. After upgrading to MAS 9.1, attempting to train or retrain SSD models fails. The training option may not appear in the UI.
Diagnosis: This is not a bug. SSD training was deprecated in MAS 9.1.
Fix:
SSD DEPRECATION MIGRATION PATH
═══════════════════════════════
Your SSD Model
│
├── Inference only?
│ └── YES: Existing SSD models continue
│ to run inference. No action needed
│ until you need to retrain.
│
└── Need to retrain?
└── Migrate to:
├── YOLO v3 (real-time detection,
│ similar speed to SSD)
└── Faster R-CNN (higher accuracy,
slower inference)
YOLO v3 is the recommended replacement for most
SSD use cases. It matches or exceeds SSD speed
with better accuracy.
Your existing SSD models continue to run inference after the upgrade. You just cannot train new SSD models or retrain existing ones. Migrate to YOLO v3 for similar real-time performance or Faster R-CNN when accuracy is the priority.
Source: What's New in MVI 9.0
Problem: Overfitting Detected (Training Accuracy High, Validation Low)
Symptom: MVI's real-time training visualization shows training accuracy climbing above 95% while validation accuracy stagnates below 80%. The gap widens as training continues.
Diagnosis: Classic overfitting. Your model is memorizing the training images instead of learning generalizable patterns.
Fix:
- More data. The simplest fix. Add more diverse images. Different angles, lighting, times of day, camera positions.
- Enable augmentation. All 8 options if possible: blur, sharpen, crop, rotate, vertical flip, horizontal flip, color, noise. Augmentation artificially increases dataset diversity.
- Reduce max_iter. If the gap between training and validation accuracy starts widening at iteration 800, set max_iter to 800. Longer training makes overfitting worse, not better.
- Increase test_interval. Default is 20. Check validation more frequently to catch the divergence point earlier.
Problem: Choosing the Wrong Model Architecture
Symptom: You are not sure which architecture to use, or you picked one and results are poor.
Reference:
MODEL ARCHITECTURE SELECTION
════════════════════════════
Task: Image Classification (Is this a defect? Yes/No)
┌─────────────────────────────────────────────────┐
│ Default: GoogLeNet │
│ Use for: Pass/fail grading, condition ranking │
│ Input: Entire image → single label │
│ Export: Core ML (mobile), TensorRT, Edge │
└─────────────────────────────────────────────────┘
Task: Object Detection (Where are the defects?)
┌─────────────────────────────────────────────────┐
│ Speed priority: YOLO v3 or Tiny YOLO v3 │
│ → Real-time use, edge, mobile │
│ → Export: Core ML, TensorRT, Edge │
│ │
│ Accuracy priority: Faster R-CNN │
│ → Highest accuracy, slower inference │
│ → Export: TensorRT, Edge (NO Core ML) │
│ │
│ DEPRECATED: SSD (removed in 9.1) │
└─────────────────────────────────────────────────┘
Task: Instance Segmentation (Pixel-level defect maps)
┌─────────────────────────────────────────────────┐
│ Use: Detectron2 │
│ → Precise defect boundaries │
│ → Export: TensorRT, Edge (NO Core ML) │
└─────────────────────────────────────────────────┘
MOBILE COMPATIBILITY (Core ML export):
═══════════════════════════════════════
YES: GoogLeNet, YOLO v3, Tiny YOLO v3
NO: Faster R-CNN, Detectron2, High Resolution,
SSD, Anomaly, SSNSource: MVI Models and Supported Functions
Deployment Troubleshooting
Problem: Model Deployed but Inference Returns Errors
Symptom: You deployed a trained model. The API endpoint exists. But inference calls return errors or empty results.
Diagnosis:
# Check the deployed model pod status
oc get pods | grep infer
# Check pod logs for errors
oc logs <inference-pod-name>
# Test inference endpoint directly
curl -X POST https://<mvi-host>/api/v1/infer \
-H "X-Auth-Token: <your-api-key>" \
-F "files=@test_image.jpg" \
-F "model_id=<your-model-id>"
Common causes and fixes:
- Model not fully deployed. The model status shows "deployed" in the UI but the inference pod is still initializing. Check pod status with oc get pods. Wait for Running state.
- Incorrect API authentication. MVI uses the X-Auth-Token header for REST API authentication. Not Authorization: Bearer. Not api-key. The header is X-Auth-Token.
- Model format mismatch. If you deployed to TensorRT and the inference runtime does not have the matching GPU, inference fails silently. Verify GPU availability on the inference node.
- Image format issues. MVI expects standard image formats (JPEG, PNG). Verify your test image is not corrupt and the file size is reasonable.
Problem: Inference Pod Crashes or Restarts
Symptom: The inference pod starts, runs for a short time, then enters CrashLoopBackOff.
Diagnosis:
# Get pod events
oc describe pod <inference-pod-name>
# Check for OOMKilled
# If the pod was killed by Kubernetes for exceeding memory limits,
# the "Last State" will show "OOMKilled"
# Check resource limits
oc get pod <inference-pod-name> -o yaml | grep -A 5 resources
Fixes:
- OOMKilled: Increase the pod memory limits. Large models (Faster R-CNN, Detectron2) require more memory than lightweight models (Tiny YOLO v3).
- GPU not available on inference node: If the model was trained with GPU and deployed for GPU inference, but the inference node has no GPU, the pod crashes. Verify GPU availability with oc describe node.
- Persistent volume not accessible: If the model artifact storage is unreachable, the pod cannot load the model and crashes. Verify PVC status.
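To distinguish an OOM kill from an application crash without scrolling through describe output, you can pull the terminated reason directly (pod name is a placeholder):

```shell
# Prints "OOMKilled" if Kubernetes killed the container for exceeding memory limits
oc get pod <inference-pod-name> -o \
  jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```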
Problem: PVC Storage Failures
Symptom: Training or inference pods fail to start. Events show FailedMount or FailedAttachVolume. Or pods start but cannot write training artifacts.
Diagnosis:
# Check PVC status
oc get pvc -n <mvi-namespace>
# Check PVC access mode
oc get pvc <pvc-name> -o yaml | grep accessModes
Fix:
This is one of the most common and most frustrating MVI issues. The fix depends on the root cause:
- Access mode is ReadWriteOnce (RWO) instead of ReadWriteMany (RWX). MVI requires ReadWriteMany because multiple pods (training, inference, API) need simultaneous access to the same storage. ReadWriteOnce only allows one pod at a time. This is the silent killer -- the PVC binds successfully, but multi-pod workloads fail intermittently.
PVC ACCESS MODE FIX
════════════════════
WRONG: ReadWriteOnce (RWO)
- One pod can mount at a time
- Training works, then inference fails
- Or inference works, then training fails
- No clear error message
RIGHT: ReadWriteMany (RWX)
- Multiple pods mount simultaneously
- Training and inference coexist
- Required for MVI
- Storage capacity insufficient. MVI requires minimum 40 GB PVC storage. Training datasets, model artifacts, and intermediate files consume space quickly. Monitor usage with oc exec <pod> -- df -h.
- Docker image storage full. Minimum 75 GB in /var for Docker images on each node. If the node runs out of Docker image space, new pods cannot start.
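Recreating the claim with the correct access mode might look like the following sketch. The claim name and storage class are placeholders -- pick an RWX-capable class from oc get sc on your cluster:

```shell
# Recreate the claim with ReadWriteMany and the 40 GB minimum
oc apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mvi-data                  # placeholder name
  namespace: <mvi-namespace>
spec:
  accessModes:
    - ReadWriteMany               # RWX: required for MVI multi-pod access
  resources:
    requests:
      storage: 40Gi               # MVI minimum
  storageClassName: <rwx-capable-class>
EOF
```

Note that access modes cannot be changed on a bound PVC: you have to create a new claim and migrate the data onto it.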
Problem: Model Export Fails
Symptom: You trained a model successfully but the export to TensorRT, Core ML, or Edge format fails.
Diagnosis: Check the model architecture against the supported export formats.
Fix:
MODEL EXPORT COMPATIBILITY
══════════════════════════
Architecture TensorRT Core ML Edge
──────────── ──────── ─────── ────
GoogLeNet YES YES YES
YOLO v3 YES YES YES
Tiny YOLO v3 YES YES YES
Faster R-CNN YES NO YES
Detectron2 YES NO YES
High Resolution YES NO YES
Anomaly YES NO YES
SSN YES NO YES
If you need Core ML (mobile), you MUST use
GoogLeNet, YOLO v3, or Tiny YOLO v3.
No exceptions. No workarounds.
If you trained a Faster R-CNN model and need mobile deployment, you cannot export it to Core ML. You must retrain using YOLO v3 or Tiny YOLO v3. Plan your model architecture around your deployment target before training.
Mobile Troubleshooting
Problem: MVI Mobile Not Available for Android
Symptom: You searched the Google Play Store for MVI Mobile or IBM Maximo Visual Inspection. Nothing found.
Diagnosis: This is not a search issue. MVI Mobile does not exist for Android.
Fix: MVI Mobile is exclusively for iOS and iPadOS, available on the Apple App Store. IBM has not announced Android support. If your field teams use Android devices, your options are:
- Procure iOS/iPadOS devices for inspectors who need mobile visual inspection.
- Use MVI Edge with network cameras instead of mobile device inspection. Edge provides real-time inference without requiring specific mobile hardware.
- Use the MVI web interface on Android devices for manual image upload and analysis (not real-time inference).
Source: MVI Mobile
Problem: Model Cannot Export to Core ML for Mobile
Symptom: You trained a model (Faster R-CNN, Detectron2, or another architecture) and want to deploy it to MVI Mobile. The Core ML export option is not available.
Diagnosis: Only 3 of the available model architectures support Core ML export.
Fix: Only GoogLeNet (classification), YOLO v3 (object detection), and Tiny YOLO v3 (lightweight object detection) export to Core ML. All other architectures -- Faster R-CNN, Detectron2, High Resolution, SSD, Anomaly, SSN -- do not support Core ML export.
If you need mobile deployment, choose your model architecture accordingly before training. Retraining with a compatible architecture is the only path.
Problem: MVI Mobile Model Sync Fails
Symptom: MVI Mobile is installed and connected to your MVI Server, but models do not sync to the device.
Diagnosis:
- Verify MVI Server version is 1.3 or later. MVI Mobile requires v1.3+ for model sync.
- Check network connectivity between the iOS device and the MVI Server endpoint.
- Verify the model was exported to Core ML format on the server before attempting sync.
- Check that the API key configured in MVI Mobile is valid and not revoked.
Fix:
# Verify server API is accessible from a network the device can reach
curl -k -H "X-Auth-Token: <api-key>" \
https://<mvi-server>/api/v1/trained_models
# Confirm response includes your model with status "deployed"
Ensure the server endpoint is reachable from the device's network. Corporate firewalls and VPN configurations frequently block the connection. The device does not need to be on the same network as the OpenShift cluster -- it needs HTTPS access to the MVI API endpoint.
Problem: MVI Mobile Offline Inference Not Working
Symptom: MVI Mobile does not run inference when the device has no network connection.
Diagnosis: Models must be synced to the device before going offline. If the model was never successfully downloaded, offline inference has nothing to run.
Fix:
- Connect the device to a network with access to the MVI Server.
- Open MVI Mobile and verify the model appears in the model list.
- Trigger a sync and confirm the model downloads completely (check model file size).
- Disconnect from the network and test inference.
Core ML models run entirely on-device using Apple's Neural Engine. Once downloaded, no network connection is required for inference. The network dependency is only for the initial model download and subsequent updates.
Edge Troubleshooting
Problem: MVI Edge Cannot Connect to MQTT Broker
Symptom: MVI Edge is deployed and running inference, but alerts are not reaching Maximo Monitor. The MQTT connection shows disconnected or failed.
Diagnosis:
# On the edge device, check MQTT connectivity
docker logs <mvi-edge-container> | grep -i mqtt
# Test MQTT broker connectivity
mosquitto_pub -h <broker-host> -p <broker-port> \
-t "test/topic" -m "test message" \
-u <username> -P <password>
Common causes and fixes:
- Incorrect broker credentials. MQTT broker authentication is case-sensitive. Verify username, password, and client ID exactly match the broker configuration.
- Firewall blocking MQTT port. Default MQTT port is 1883 (unencrypted) or 8883 (TLS). Verify the edge device can reach the broker on the correct port.
- TLS certificate mismatch. If the broker uses TLS, the edge device must trust the broker's certificate. Self-signed certificates need to be explicitly added to the trust store.
- v2 API format mismatch. MAS 9.0 introduced v2 APIs for MQTT. If your Monitor instance expects v2 format but Edge is sending v1, messages are silently dropped. Verify both sides use the same API version.
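If the plain-port test above succeeds but the broker runs TLS on 8883, test that listener explicitly; mosquitto_pub accepts a CA certificate file (the path here is a placeholder). Failure on this command with known-good credentials points at the certificate, not the credentials:

```shell
# Publish a test message over the TLS listener
mosquitto_pub -h <broker-host> -p 8883 \
  --cafile /path/to/broker-ca.crt \
  -t "test/topic" -m "tls test" \
  -u <username> -P <password>
```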
Problem: GigE Vision Camera Not Detected
Symptom: You connected a Basler or other GigE Vision camera to MVI Edge (MAS 9.0+) but it does not appear in the camera list.
Diagnosis:
- GigE Vision camera support was added in MAS 9.0. Verify your Edge version.
- Check physical network connectivity. GigE Vision uses Ethernet, not USB.
- Verify the camera and edge device are on the same subnet.
Fix:
GigE VISION TROUBLESHOOTING
════════════════════════════
Step 1: Network configuration
- Camera and edge device on same subnet
- Jumbo frames enabled (MTU 9000) for
high-resolution image transfer
- No firewall between camera and edge
Step 2: Camera discovery
- GigE Vision uses UDP broadcast for discovery
- Verify UDP broadcast not blocked
- Camera must have valid IP in same range
Step 3: Driver compatibility
- Basler cameras officially supported
- Other GigE Vision cameras may work
- Check IBM compatibility matrix
Source: What's New in MVI 9.0
Problem: NVIDIA Jetson Xavier NX Edge Device Issues
Symptom: MVI Edge deployed on an NVIDIA Jetson Xavier NX is not running or inference is failing.
Diagnosis: Verify the Jetson software version. MVI Edge requires nvidia-jetpack 4.5.1-b17.
Fix:
- Flash the Jetson with JetPack 4.5.1-b17 if running a different version. Newer JetPack versions are not guaranteed to be compatible.
- Verify Docker is running on the Jetson: systemctl status docker.
- Check available memory. The Jetson Xavier NX has shared CPU/GPU memory. If system processes consume too much, inference will fail.
- Verify the MVI Edge container has access to the NVIDIA runtime: docker run --runtime nvidia --rm nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi.
Source: MVI Edge Planning
Problem: Edge Storage Running Out
Symptom: MVI Edge device runs out of disk space over time. Inference slows or stops.
Diagnosis: MVI Edge stores captured images, inference results, and log files locally. Without lifecycle management, storage fills up.
Fix: MAS 9.0 introduced the Data Lifecycle Manager for Edge devices. Configure it to automatically purge old images and results based on retention policies.
DATA LIFECYCLE MANAGER CONFIGURATION
═════════════════════════════════════
Set retention policies:
- Image retention: 7-30 days (depending on
regulatory requirements)
- Inference results: 30-90 days
- Log files: 7-14 days
Enable automatic purge on disk threshold:
- Trigger purge at 80% disk usage
- Delete oldest data first
- Always retain flagged/alerted items
If you are running a pre-9.0 Edge version, implement manual cleanup with a cron job or upgrade to 9.0 for the Data Lifecycle Manager.
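For pre-9.0 devices, the manual cleanup can be as simple as a find-based sweep run from cron. A minimal sketch -- the directories are examples, not the actual Edge paths; point them at wherever your install keeps images and logs:

```shell
#!/bin/sh
# Delete regular files under a directory older than N days.
purge_older_than() {
  dir=$1; days=$2
  if [ -d "$dir" ]; then
    find "$dir" -type f -mtime +"$days" -delete
  fi
}

# Example retention sweep (paths are placeholders for your Edge install):
purge_older_than /var/lib/mvi-edge/images 30   # image retention
purge_older_than /var/lib/mvi-edge/logs   14   # log retention
```

Schedule it daily from the device's crontab (for example `0 2 * * * /usr/local/bin/mvi-edge-clean.sh`) and keep the retention windows aligned with your regulatory requirements. This sketch deletes by age only; it does not implement the 9.0 Data Lifecycle Manager's "retain flagged items" behavior, so exclude alert directories from the sweep.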
Problem: Edge Alert Messages Not Formatted Correctly
Symptom: MVI Edge sends alerts via MQTT or Twilio SMS, but the messages contain raw data instead of human-readable content.
Diagnosis: MAS 9.0 introduced alert message templates. If templates are not configured, Edge sends raw JSON payloads.
Fix: Configure alert message templates in the Edge administration interface. Templates support variable substitution for detection class, confidence score, timestamp, and camera ID. For Twilio SMS integration, keep messages under 160 characters for single-segment delivery.
Integration Troubleshooting
Problem: MVI to Monitor MQTT Alert Pipeline Fails
Symptom: MVI Edge detects defects and sends MQTT messages, but Monitor does not receive them or does not create alerts.
Diagnosis:
MVI-TO-MONITOR MQTT DIAGNOSIS
══════════════════════════════
MVI Edge ──MQTT──> Broker ──MQTT──> Monitor
│ │ │
│ │ │
Sending? Receiving? Processing?
Check Edge Check broker Check Monitor
logs for subscription device type
MQTT publish topics match configuration
confirmations publish topics
Fixes:
- Topic mismatch. The MQTT topic MVI Edge publishes to must exactly match the topic Monitor subscribes to. A single character difference causes silent failure. Verify both configurations.
- Device type not configured. MVI Edge auto-configures device type and gateway in Monitor. If auto-configuration failed, manually create the device type in Monitor matching the Edge device identifier.
- v2 API version mismatch. MAS 9.0 introduced v2 MQTT APIs. If Edge sends v2 format but Monitor expects v1 (or vice versa), messages are received but not parsed. Align API versions.
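To see the exact topic string Edge actually publishes (rather than the one you think it publishes), subscribing to the broker's wildcard topic is a quick check:

```shell
# -t '#' subscribes to every topic; -v prints "topic payload" per message
mosquitto_sub -h <broker-host> -p <broker-port> \
  -u <username> -P <password> -t '#' -v
# Compare the printed topic character-for-character with Monitor's subscription
```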
Problem: REST API Authentication Failures
Symptom: API calls return 401 Unauthorized or 403 Forbidden errors.
Diagnosis:
# Test API connectivity and authentication
curl -k -H "X-Auth-Token: <your-api-key>" \
https://<mvi-host>/api/datasets
# If 401: Token is invalid or expired
# If 403: Token is valid but lacks permissions
# If connection refused: Wrong host/port
Fix:
MVI REST API authentication uses the X-Auth-Token header. Common mistakes:
- Wrong header name. It is X-Auth-Token, not Authorization, not Bearer, not api-key. This is the most common API integration error.
- Token revoked. MVI API keys do not expire, but they can be revoked by an administrator. Generate a new key if yours was revoked.
- Incorrect host URL. The API endpoint is the MVI route on OpenShift, not the OpenShift console URL. Verify with oc get route -n <mvi-namespace>.
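When scripting against the API, printing only the HTTP status makes the 401-versus-403 distinction explicit:

```shell
# -w '%{http_code}' prints just the status; -o /dev/null discards the body
curl -k -s -o /dev/null -w '%{http_code}\n' \
  -H "X-Auth-Token: <your-api-key>" \
  https://<mvi-host>/api/datasets
# 200 = OK, 401 = invalid/revoked token, 403 = valid token, missing permission
```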
# Key API endpoints for reference
GET /api/datasets # List all datasets
POST /api/datasets # Create a dataset
POST /api/datasets/<id>/files # Upload images
GET /api/dnn-script # List training scripts
POST /api/v1/infer # Run inference
Source: MVI REST APIs
Problem: Work Order Creation From MVI Detections Not Working
Symptom: MVI detects defects and shows them in the UI, but work orders are not created in Maximo Manage.
Diagnosis: MVI-to-Manage integration typically flows through Monitor as an intermediary or through direct API calls. Identify which pattern your implementation uses.
Fix:
- If using Monitor as intermediary: Fix the MVI-to-Monitor MQTT pipeline first (see above). Then verify Monitor-to-Manage work order rules are configured. Monitor creates work orders based on alert rules -- if no rule exists for MVI alert types, no work order is created.
- If using direct API integration: Verify the Manage API endpoint is accessible from the MVI/integration layer. Check that the integration service has valid Manage API credentials. Verify work order template mappings exist for each detection class.
- ITSM workflow support. MAS 9.1 added ITSM workflow support. If you upgraded to 9.1, verify your existing integration is compatible with the new workflow engine.
Problem: Vision-Tools CLI Not Found or Not Working
Symptom: You are trying to use the vision-tools CLI for batch operations but the command is not found or returns errors.
Diagnosis: The vision-tools CLI is a companion utility for batch operations against the MVI API. It is not installed by default with MVI.
Fix: Install vision-tools according to the IBM documentation for your version. Verify Python dependencies are met. The CLI uses the same X-Auth-Token authentication as the REST API -- configure it with your API key and MVI endpoint URL.
MAS 9.0/9.1 Migration Issues
Upgrading to MAS 9.0 or 9.1 introduces several breaking changes specific to MVI. If you upgraded and things stopped working, this section is your checklist.
Kepler GPU Removal (MAS 9.0)
Impact: Tesla K80 and all Kepler-architecture GPUs are no longer supported.
Action: Replace K80s before upgrading. If you already upgraded and training fails, check nvidia-smi output for your GPU model. Any Kepler GPU must be replaced with Pascal or newer.
SSD Training Deprecation (MAS 9.1)
Impact: SSD models cannot be trained or retrained. Existing SSD models still run inference.
Action: Inventory all SSD models. Plan migration to YOLO v3 (speed priority) or Faster R-CNN (accuracy priority). Retrain using the same training datasets. Compare inference performance before switching production workloads.
v2 MQTT API (MAS 9.0)
Impact: MQTT message format changed for Edge-to-Monitor communication.
Action: Update all Monitor subscriptions and parsing rules to handle v2 format. Update any custom MQTT consumers. Test the full alert pipeline after migration.
GPU Workload Optimization (MAS 9.0)
Impact: New capability to assign GPUs specifically to training or inference workloads.
Action: This is not a breaking change but an optimization opportunity. Review GPU allocation and assign dedicated GPUs to training during business hours and inference during off-hours, or vice versa based on your workload pattern.
Edge Diagnostics Dashboard (MAS 9.0)
Impact: New centralized monitoring for edge devices.
Action: Configure the diagnostics dashboard after upgrade. Use it to monitor edge device health, model version consistency across fleet, and connectivity status.
Facial Redaction (MAS 9.0)
Impact: New capability to automatically redact faces in captured images for privacy compliance.
Action: Enable facial redaction on Edge deployments where cameras may capture worker or bystander faces. This addresses GDPR and privacy requirements in regions where visual inspection cameras are deployed in areas with human traffic.
Java 17 Migration
Impact: MAS 9.x components may require Java 17 runtime. Custom integrations using older Java versions may break.
Action: Audit all custom integration code. Update Java runtime dependencies. Test custom REST API clients and MQTT consumers against the upgraded environment.
MAS 9.0/9.1 MIGRATION CHECKLIST
═════════════════════════════════
PRE-UPGRADE:
[ ] Inventory all GPU hardware (remove Kepler)
[ ] Inventory all SSD models (plan YOLO v3 migration)
[ ] Document current MQTT topic/format configuration
[ ] Backup all trained models and datasets
[ ] Test custom integrations against 9.x APIs
POST-UPGRADE:
[ ] Verify GPU detection (nvidia-smi on all nodes)
[ ] Verify CUDA 11.8+ on all GPU nodes
[ ] Test training job on each GPU node
[ ] Verify MQTT v2 message flow (Edge to Monitor)
[ ] Configure GPU workload optimization
[ ] Enable Edge diagnostics dashboard
[ ] Enable facial redaction where required
[ ] Retrain SSD models as YOLO v3
[ ] Validate all API integrations (X-Auth-Token)
[ ] Run end-to-end test: capture -> train -> deploy
-> infer -> alert -> work order
The Top 20 FAQs
1. Why is my MVI training running on CPU instead of GPU?
Check three things: (1) NVIDIA GPU Operator is installed and healthy on OpenShift, (2) your GPU node has the nvidia.com/gpu label applied, (3) the training pod has GPU resource requests in its spec. Also verify CUDA version is 11.8+ for MAS 9.0 and your GPU has at least 16 GB VRAM. CPU-only training is 10-50x slower.
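Check (1) and part of (2) can be scripted. A minimal sketch: a helper that pulls the nvidia.com/gpu count out of `oc describe node` output — the live invocation in the comment is the only cluster-dependent part.

```shell
# Minimal sketch: report how many GPUs a node advertises to the scheduler.
# Reads `oc describe node` output on stdin; prints the nvidia.com/gpu
# count, or 0 if the resource is absent (GPU Operator not advertising it).
gpu_capacity() {
  grep -m1 'nvidia.com/gpu:' | awk '{print $2}' | grep . || echo 0
}

# Live usage (requires cluster access):
#   oc describe node <gpu-node> | gpu_capacity
```

A count of 0 here means the GPU Operator layer is broken, and no amount of pod-spec tuning will help until it is fixed.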
2. Is MVI available for Android?
No. MVI Mobile is iOS and iPadOS only. There is no Android version and IBM has not announced plans for one. For Android-equipped teams, use MVI Edge with network cameras or the MVI web interface.
3. Why did my SSD model stop training after upgrading?
SSD training was deprecated in MAS 9.1. Existing SSD models continue to run inference, but you cannot train new SSD models or retrain existing ones. Migrate to YOLO v3 for comparable real-time detection performance.
4. What GPUs does MVI support?
NVIDIA GPUs only. CUDA required. Supported architectures from MAS 9.0: Hopper (H100), Ada Lovelace (RTX 4000, L40), Ampere (A10, A16, A40, A30, A100), Turing (T4), Volta (V100), and Pascal (P4, P40, P100). Kepler (K80) was removed in MAS 9.0. Minimum 16 GB VRAM per GPU.
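To confirm the 16 GB minimum across a node's cards, you can parse nvidia-smi's CSV query output. Reported totals run slightly under the nominal size (a 16 GB V100 reports roughly 16160 MiB, a T4 roughly 15360 MiB), so the sketch below uses a 15000 MiB cutoff — that threshold is our assumption, not an IBM number.

```shell
# Sketch: flag any GPU under the VRAM minimum. Input format matches
# `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader`,
# i.e. "Tesla T4, 15360 MiB" per line. The 15000 MiB cutoff is an
# assumption that allows for reported totals below the nominal size.
check_vram() {
  awk -F', ' '{ split($2, m, " ");
                if (m[1] + 0 < 15000) { print $1 ": below 16 GB minimum"; bad = 1 } }
              END { exit bad }'
}

# Live usage (on a GPU node or inside a GPU pod):
#   nvidia-smi --query-gpu=name,memory.total --format=csv,noheader | check_vram
```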
5. Why does my PVC keep failing?
MVI requires ReadWriteMany (RWX) access mode. ReadWriteOnce (RWO) silently fails when multiple pods try to mount the volume simultaneously. Recreate the PVC with RWX access mode. Also ensure minimum 40 GB storage and 3000+ IOPS.
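A known-good manifest is the fastest fix, and access modes cannot be changed on an existing PVC — it must be recreated. This sketch writes one to disk; the PVC name, namespace, and storage class are placeholders, so substitute an RWX-capable class from your cluster (`oc get storageclass`).

```shell
# Sketch: a PVC manifest with the settings MVI needs. Name, namespace,
# and storageClassName are placeholders for your environment.
cat > mvi-data-pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mvi-data              # placeholder
  namespace: mvi              # placeholder
spec:
  accessModes:
    - ReadWriteMany           # RWX is required; RWO fails on multi-pod mounts
  resources:
    requests:
      storage: 40Gi           # MVI minimum
  storageClassName: ocs-storagecluster-cephfs   # example RWX-capable class
EOF

# Then: oc apply -f mvi-data-pvc.yaml
```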
6. Which models can I deploy to MVI Mobile?
Only three: GoogLeNet (classification), YOLO v3 (object detection), and Tiny YOLO v3 (lightweight object detection). These are the only architectures that export to Core ML. Faster R-CNN, Detectron2, High Resolution, SSD, Anomaly, and SSN do not export to Core ML.
7. How do I authenticate to the MVI REST API?
Use the X-Auth-Token HTTP header with your API key. Not Authorization: Bearer. Not api-key. The header is specifically X-Auth-Token. API keys do not expire but can be revoked by an administrator.
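A thin wrapper keeps the header in one place. Everything in this sketch is our own convention (the env var names, the `--dry-run` switch); only the X-Auth-Token header and the /api/datasets path come from MVI.

```shell
# Sketch: one place to get the auth header right. MVI_HOST and
# MVI_API_KEY are our own variable names, not MVI-defined.
mvi_api() {
  local path="$1"
  local cmd=(curl -k -s -H "X-Auth-Token: ${MVI_API_KEY}" "https://${MVI_HOST}${path}")
  if [ "${2:-}" = "--dry-run" ]; then
    echo "${cmd[@]}"          # print the command instead of running it
  else
    "${cmd[@]}"
  fi
}

# Usage: MVI_HOST=mvi.example.com MVI_API_KEY=abc123 mvi_api /api/datasets
```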
8. What is the minimum storage requirement?
40 GB PVC storage minimum with ReadWriteMany access mode. 75 GB minimum in /var for Docker images on each node. 3000+ IOPS recommended for training workloads.
9. How many images do I need to train a model?
Start with 50-100 high-quality, well-labeled images per class for a baseline model. Use auto-labeling to accelerate: train a preliminary model on 5-10 manually labeled images, then use auto-label to annotate the rest. More diverse images (different angles, lighting, conditions) matter more than sheer volume.
10. What hyperparameters should I use?
Start with the defaults: max_iter=1500, test_iter=100, test_interval=20, learning_rate=0.001. Reduce learning_rate by 10x if loss oscillates. Increase max_iter if the model has not converged by the end of training. Reduce max_iter if overfitting is detected.
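The adjustment logic reads naturally as a lookup table. A sketch, with symptom names that are our own shorthand rather than MVI terminology:

```shell
# Sketch: the tuning heuristics above as a decision table.
# Symptom labels are our own shorthand, not MVI terminology.
suggest_tuning() {
  case "$1" in
    oscillating_loss) echo "reduce learning_rate 10x: 0.001 -> 0.0001" ;;
    not_converged)    echo "increase max_iter beyond 1500" ;;
    overfitting)      echo "reduce max_iter; add more diverse images" ;;
    *)                echo "keep defaults: max_iter=1500 learning_rate=0.001" ;;
  esac
}
```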
11. Can I use custom model architectures?
Custom models were discontinued after MVI v8.7. Use the built-in architectures: GoogLeNet, YOLO v3, Tiny YOLO v3, Faster R-CNN, Detectron2, or the remaining supported options. Custom TensorFlow or PyTorch models are no longer supported.
12. What OpenShift version do I need?
OpenShift 4.8.22 or later for the NVIDIA GPU Operator. Check the MAS compatibility matrix for your specific MAS version, as newer MAS releases may require newer OpenShift versions.
13. How does MVI Edge communicate with Monitor?
Through MQTT. MVI Edge publishes detection alerts to an MQTT broker. Monitor subscribes to the same topics. MAS 9.0 introduced v2 MQTT APIs. Edge auto-configures the device type and gateway in Monitor during initial setup.
14. Can MVI Edge send SMS alerts?
Yes. MVI Edge supports Twilio SMS integration for alert notifications. Configure Twilio credentials in the Edge administration interface. MAS 9.0 added alert message templates so you can customize the SMS content.
15. What is the difference between TensorRT and Core ML export?
TensorRT is NVIDIA's inference optimization format -- it runs on NVIDIA GPUs (server, edge, cloud). Core ML is Apple's format -- it runs on iOS/iPadOS devices using the Neural Engine. TensorRT is available for most model architectures. Core ML is only available for GoogLeNet, YOLO v3, and Tiny YOLO v3.
16. Why are my augmented training results worse?
Not all augmentations help every dataset. Vertical flip on text or directional defects makes images unnatural. Aggressive blur can destroy the features the model needs to detect. Start with rotate and color augmentation, then add others incrementally. Test each addition against a baseline.
17. How do I handle model drift?
Monitor prediction distribution and confidence score trends over time. Track human override rate (how often inspectors disagree with the model). If override rate exceeds 15% or accuracy drops more than 5% from baseline, retrain with recent production data. See Part 10 for the complete governance framework.
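Those two thresholds are easy to wire into a nightly check. Integer percentages keep the arithmetic POSIX-shell-safe; the function interface itself is our own sketch.

```shell
# Sketch: the retrain triggers from the text -- override rate above 15%,
# or accuracy more than 5 points below baseline. Inputs are integers
# (override count, total inspections, accuracy percentages).
needs_retrain() {
  local overrides="$1" total="$2" baseline_acc="$3" current_acc="$4"
  local rate=$(( overrides * 100 / total ))
  if [ "$rate" -gt 15 ] || [ $(( baseline_acc - current_acc )) -gt 5 ]; then
    echo "retrain"
  else
    echo "ok"
  fi
}

# Example: 20 overrides in 100 inspections trips the 15% threshold.
```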
18. Can I run MVI without GPUs?
Training requires NVIDIA GPUs. There is no way around this requirement. Inference can technically run on CPU, but it is 10-50x slower and not suitable for real-time use cases. For production deployments, GPU is effectively required for both training and inference.
19. What deployment options are available?
Five: SaaS on AWS (via AWS Marketplace), SaaS on IBM Cloud (via IBM Cloud Satellite or Terraform), on-premises on Red Hat OpenShift, Azure via Azure Red Hat OpenShift (ARO), or client-managed on RHOCP across any cloud or on-premises environment.
20. How do I back up my trained models?
Export models from the MVI UI or via the REST API (GET /api/dnn-script). Store exported model artifacts in version-controlled, backed-up storage. Include the training dataset, model configuration, and hyperparameters alongside the model artifact. Before any MAS upgrade, export all production models as a rollback safety net.
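A pre-upgrade export loop might look like the sketch below. The list endpoint comes from the quick reference in this post; the per-model path and the `.id` field are assumptions about the response shape — verify against your API before relying on it.

```shell
# Sketch: export all models before an upgrade. model_export_url is a pure
# helper; the /api/dnn-script/<id> path and the jq '.[].id' field are
# assumptions about the API response shape -- verify before use.
model_export_url() {
  echo "https://$1/api/dnn-script/$2"
}

backup_models() {
  local host="$1" key="$2" outdir="${3:-mvi-model-backup}"
  mkdir -p "$outdir"
  curl -k -s -H "X-Auth-Token: $key" "https://$host/api/dnn-script" |
    jq -r '.[].id' |
    while read -r id; do
      curl -k -s -H "X-Auth-Token: $key" \
        "$(model_export_url "$host" "$id")" > "$outdir/$id.json"
    done
}
```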
Diagnostic Quick Reference
When something breaks and you need to diagnose fast, use this flowchart:
MVI MASTER DIAGNOSTIC FLOWCHART
═════════════════════════════════
SYMPTOM: What is happening?
┌─────────────────────────────────────────────┐
│ │
├─► Training won't start │
│ └─► GPU detected? ──No──► GPU section │
│ └─Yes─► PVC mounted? ──No──► Storage │
│ └─Yes─► Check pod events │
│ │
├─► Training is slow │
│ └─► GPU or CPU? ──CPU──► GPU section │
│ └─GPU─► IOPS? ──Low──► Upgrade storage│
│ └─OK─► Normal for dataset size │
│ │
├─► Model won't converge │
│ └─► Check labels first (90% of the time) │
│ └─Labels OK─► Adjust hyperparameters │
│ │
├─► Inference returns errors │
│ └─► Pod running? ──No──► Check pod events │
│ └─Yes─► Auth correct? ──No──► Fix │
│ │ X-Auth-Token │
│ └─Yes─► Check model format │
│ │
├─► Mobile sync fails │
│ └─► iOS device? ──No──► Not supported │
│ └─Yes─► Core ML model? ──No──► Only 3 │
│           │        architectures          │
│ └─Yes─► Check network/API key │
│ │
├─► Edge alerts not reaching Monitor │
│ └─► MQTT connected? ──No──► Check broker │
│ └─Yes─► Topics match? ──No──► Align │
│ └─Yes─► v2 API version match? │
│ │
├─► Post-upgrade failures │
│ └─► 9.0: Check GPU arch (Kepler removed) │
│ 9.1: Check SSD models (deprecated) │
│ Both: Check MQTT v2 API format │
│ │
└─► None of the above │
└─► Collect: pod logs, events, nvidia-smi │
oc describe pod, PVC status │
Open IBM support case with all output │
└─────────────────────────────────────────────┘
Quick Command Reference
# GPU health check
oc describe node <gpu-node> | grep nvidia
oc exec <gpu-pod> -- nvidia-smi
# Pod diagnostics
oc get pods -n <mvi-namespace>
oc describe pod <pod-name>
oc logs <pod-name>
oc logs <pod-name> --previous # logs from crashed pod
# Storage diagnostics
oc get pvc -n <mvi-namespace>
oc get pvc <pvc-name> -o yaml | grep accessModes
# API health check
curl -k -H "X-Auth-Token: <key>" https://<mvi-host>/api/datasets
# MQTT diagnostics (from edge device)
mosquitto_sub -h <broker> -p <port> -t "#" -v # subscribe to all topics
# Training status
curl -k -H "X-Auth-Token: <key>" https://<mvi-host>/api/dnn-script
Key Takeaways
- GPU issues are the number one cause of MVI setup failures. Before troubleshooting anything else, verify NVIDIA hardware, CUDA version, GPU Operator health, and 16 GB VRAM minimum. If the GPU is not working, nothing downstream works.
- ReadWriteMany PVC access mode is required and ReadWriteOnce silently fails. This single configuration error causes intermittent failures that are maddening to diagnose because they only appear when multiple pods try to mount the volume simultaneously.
- MVI Mobile is iOS and iPadOS only. Only 3 model architectures export to Core ML. Plan your model architecture around your deployment target before training. Training a Faster R-CNN model and then discovering it cannot export to mobile wastes weeks of effort.
- SSD training is deprecated in MAS 9.1. If you relied on SSD models, migrate to YOLO v3. Existing SSD models still run inference, giving you time to migrate without production downtime.
- Most API integration issues are the wrong authentication header. MVI uses X-Auth-Token, not Authorization: Bearer. This single header name causes more integration failures than any other configuration issue.
- The MAS 9.0 and 9.1 upgrades introduce breaking changes. Kepler GPU removal, SSD deprecation, v2 MQTT APIs, and Java 17 migration all require pre-upgrade planning. Do not upgrade production without running through the migration checklist.
- When in doubt, check labels. Model convergence failures, poor accuracy, unexpected predictions -- the root cause is bad labels more often than bad architecture, bad hyperparameters, or bad infrastructure.
Conclusion: The Troubleshooting Mindset
MVI troubleshooting follows a pattern: infrastructure first, then data, then model, then integration.
Most teams troubleshoot in the wrong order. They adjust hyperparameters when the GPU is not detected. They retrain models when the labels are wrong. They debug API code when the authentication header is misspelled.
Start at the bottom of the stack. Verify the GPU. Verify the storage. Verify the pod is running. Then move up to data quality, model configuration, and integration. This order solves 90% of issues faster than any other approach.
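That order is worth encoding. A sketch of the bottom-up runner: every `check_*` body here is a placeholder to swap for the real commands from the quick reference above.

```shell
# Sketch: run checks bottom-up and stop at the first failing layer.
# Every check_* body is a placeholder -- substitute the real commands
# (oc describe node, oc get pvc, oc get pods, ...) on a live cluster.
check_gpu()         { true; }   # nvidia-smi, GPU Operator health
check_storage()     { true; }   # PVC bound, RWX, IOPS
check_pod()         { true; }   # pods Running, no CrashLoopBackOff
check_data()        { true; }   # label quality spot-check
check_model()       { true; }   # hyperparameters, convergence
check_integration() { true; }   # X-Auth-Token header, MQTT topics

run_diagnostics() {
  for layer in gpu storage pod data model integration; do
    if ! "check_$layer"; then
      echo "FAILED at: $layer -- fix this layer before moving up"
      return 1
    fi
  done
  echo "all layers pass"
}
```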
And if you have read this entire blog and your problem still is not solved, you now have the vocabulary and diagnostics to open an IBM support case that gets results. "My training is slow" gets triaged to the back of the queue. "Training pod has GPU resource requests but nvidia-smi shows 0% utilization, GPU Operator pods are in CrashLoopBackOff, and the operator logs show a driver version mismatch on a Tesla V100 node running CUDA 11.7" gets escalated immediately.
Be specific. Show your work. Fix it faster.
Previous: Part 11 - REST API Reference
Series Index: MAS Visual Inspection Series
Series: MAS VISUAL INSPECTION | Part 12 of 12
TheMaximoGuys | Enterprise Maximo. No fluff. Just results.