Proxmox Container Troubleshooting Guide
Common Container Start Issues
After three years of running Proxmox, I've run into a number of issues with containers. This guide provides a systematic approach to diagnosing and resolving container start problems, with special attention to High Availability (HA) managed containers.
Basic Container Start Troubleshooting
Check Container Status
Begin by verifying the current status of the container:
pct status <container_id>
If the container shows as "stopped", proceed with further troubleshooting.
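When several containers are in play, it can help to sweep for any that aren't running. A minimal sketch, using pasted sample text in place of real `pct list` output (on an actual node, pipe `pct list` in instead):

```shell
# Sketch: filter pct-list-style output for containers that are not running.
# The here-string below is sample output standing in for a real `pct list`.
list=$'VMID Status Lock Name\n100 running  web01\n101 stopped  db01'
# Skip the header row and print any VMID whose status is not "running"
echo "$list" | awk 'NR>1 && $2 != "running" {print $1, $2}'
```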
Examine Container Configuration
Review the container's configuration for potential issues:
pct config <container_id>
Look for potentially problematic settings such as:
- Network configuration issues
- Storage allocation errors
- Resource constraints
- Privileged/unprivileged settings
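For reference, a healthy container config typically looks something like the following (all values here are illustrative, not taken from a real node):

```
arch: amd64
cores: 2
hostname: web01
memory: 512
swap: 512
net0: name=eth0,bridge=vmbr0,firewall=1,ip=dhcp
ostype: debian
rootfs: local-zfs:subvol-101-disk-0,size=8G
unprivileged: 1
```

Compare each of the `net0`, `rootfs`, and resource lines in your container's config against what you expect; a missing bridge or a storage volume that no longer exists is a common culprit.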
Check System Logs
Examine system logs for errors related to the container:
# Check LXC-specific logs (if they exist)
tail -n 100 /var/log/pve/lxc/<container_id>.log
# On recent Proxmox versions, starting in debug mode is often more informative
pct start <container_id> --debug
# Check system journal for container service events
journalctl -e -u pve-container@<container_id>
# Check general system logs
grep -i "<container_id>" /var/log/syslog
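When the logs are long, filtering for common failure keywords narrows things down quickly. A sketch over a made-up log excerpt (substitute real journalctl or syslog output on an actual node):

```shell
# Sketch: filter a log excerpt for the lines that usually matter.
# The sample text below stands in for real journalctl/syslog output.
log=$'lxc-start: 101: start.c: container init started\nlxc-start: 101: conf.c: failed to mount rootfs\nlxc-start: 101: utils.c: permission denied'
# Case-insensitive match on the usual failure keywords
echo "$log" | grep -Ei 'error|fail|denied'
```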
Verify Storage Accessibility
Ensure the container's storage is accessible and healthy:
# For ZFS storage
zfs list | grep <container_id>
zpool status
# For LVM storage
lvs
vgs
Check for Lock Files
Verify if there are any lock files preventing container operations:
# Check for LXC lock files
ls -la /var/lock/lxc/
# Remove a specific lock file if needed
rm -f /var/lock/lxc/<container_id>.lock
# Use the unlock command
pct unlock <container_id>
Mount and Inspect Container Filesystem
Mount the container filesystem to check if it's accessible:
pct mount <container_id>
ls -la /var/lib/lxc/<container_id>/rootfs/
# Unmount again when you're done inspecting
pct unmount <container_id>
A successful mount indicates that the container's filesystem is intact, which is good news for data preservation.
Troubleshooting High Availability (HA) Managed Containers
Identify if Container is HA-Managed
If a container doesn't start and the task log shows messages like "Requesting HA start for CT <container_id>", the container is managed by HA.
Check HA status with:
# List all HA resources and their states
ha-manager status
# Check a specific container (HA resources use a ct: or vm: prefix)
ha-manager status | grep <container_id>
Also examine HA related tasks:
# Look for hastart tasks
pvesh get /nodes/$(hostname)/tasks --limit 10 | grep <container_id>
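To turn that check into a yes/no answer, grep the status output for the container's resource id. A sketch against sample `ha-manager status` output (the exact output format can vary between Proxmox versions; run the real command on your node):

```shell
# Sketch: decide whether CT 101 is HA-managed from ha-manager status output.
# The here-string is sample output standing in for `ha-manager status`.
status=$'quorum OK\nmaster node1 (active, Mon Jan  1 00:00:00 2024)\nservice ct:101 (node1, started)\nservice vm:200 (node2, started)'
# HA resources appear as "service ct:<id>" or "service vm:<id>" lines
if echo "$status" | grep -qE 'service (ct|vm):101 '; then
  echo "CT 101 is HA-managed"
fi
```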
Check HA Logs
Examine the HA manager logs for errors (the HA services log to the system journal):
journalctl -u pve-ha-crm -u pve-ha-lrm -n 100
Remove Container from HA Management
If HA is preventing the container from starting properly, remove it from HA management:
# Try with vm: prefix
ha-manager remove vm:<container_id>
# If that fails, try with ct: prefix
ha-manager remove ct:<container_id>
Temporarily Disable HA Services
In extreme cases, temporarily stopping HA services can help troubleshoot:
# Stop HA cluster resource manager
systemctl stop pve-ha-crm
# Stop HA local resource manager
systemctl stop pve-ha-lrm
# Try starting the container directly
pct start <container_id>
# Re-enable HA services after successful start
systemctl start pve-ha-crm
systemctl start pve-ha-lrm
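The stop/start/re-enable sequence above is easy to get wrong if the container start fails partway through. A hedged sketch that always brings the HA managers back, shown as a dry run (by default it only echoes the commands; the function name and dry-run mechanism are my own, not a Proxmox tool):

```shell
# Sketch: bypass HA for a single start attempt while guaranteeing the HA
# managers come back afterwards. Defaults to a dry run that echoes commands;
# pass an explicitly empty second argument on a real node to execute them.
ha_bypass_start() {
  local ctid="$1" run="${2-echo}"
  $run systemctl stop pve-ha-crm pve-ha-lrm
  $run pct start "$ctid"
  local rc=$?
  # Always restart the HA managers, even if the container start failed
  $run systemctl start pve-ha-crm pve-ha-lrm
  return $rc
}
ha_bypass_start 101   # dry run: prints the commands it would run
```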
Advanced Troubleshooting
Force Container Stop
If a container is in a "stuck" state, force stopping it may help:
# pct stop performs an immediate, hard stop (no graceful shutdown)
pct stop <container_id>
# If a stale lock is blocking the operation, skip the lock check
pct stop <container_id> --skiplock
Restart Container-Related Services
Restarting relevant Proxmox services can resolve certain issues:
# Restart the Proxmox daemons that handle API and container tasks
systemctl restart pvedaemon pvestatd
# Restart the LXC service (if applicable)
systemctl restart lxc
Check for Ongoing Tasks
Identify any tasks that might be conflicting with container operations:
pvesh get /nodes/$(hostname)/tasks --limit 10
Examine Process Activity
Check for processes related to the container:
ps aux | grep <container_id>
Recovery Options
Backup Container Data
Before attempting more invasive recovery measures, backup important data:
# If container is mounted
rsync -av /var/lib/lxc/<container_id>/rootfs/ /path/to/backup/
# Or use Proxmox backup features
vzdump <container_id> --mode snapshot
Clone Container
Create a clone to test if the issue is with the container configuration:
pct clone <container_id> <new_id>
Recreate Container Config
If configuration is corrupt but data is intact:
# Back up the original config
cp /etc/pve/lxc/<container_id>.conf /etc/pve/lxc/<container_id>.conf.bak
# Use the GUI to recreate with correct settings, pointing to the existing storage
Special Considerations
Nested Virtualisation
If container uses nested virtualisation, ensure the feature is properly enabled:
# Check if enabled in config
grep -i "nesting" /etc/pve/lxc/<container_id>.conf
# Enable if needed
pct set <container_id> -features nesting=1
Container Upgrade Issues
After OS upgrades, init systems may change, causing start failures:
# Check container OS type
pct config <container_id> | grep ostype
# Update if needed
pct set <container_id> -ostype debian
Network Issues
Networking problems can prevent containers from starting properly:
# Verify bridge interface exists
ip link show
# Check bridge in container config
pct config <container_id> | grep net
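A quick way to cross-check the two commands above: pull the bridge name out of the `net0` line and confirm that interface actually exists. A sketch with a made-up `net0` value (on a real node, read the value from `pct config` instead):

```shell
# Sketch: extract the bridge name from a net0 setting so it can be
# checked against `ip link show`. The net0 value below is illustrative.
net0='name=eth0,bridge=vmbr0,firewall=1,ip=dhcp'
# Split the comma-separated key=value pairs and pick out the bridge
bridge=$(echo "$net0" | tr ',' '\n' | awk -F= '$1 == "bridge" {print $2}')
echo "$bridge"   # prints: vmbr0
# then on a real node: ip link show "$bridge"
```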
Preventative Measures
To avoid future container start issues:
- Regularly back up containers using vzdump
- Document container configurations
- Test HA failover scenarios in a controlled environment
- Keep Proxmox updated to the latest stable version
- Monitor system resources to prevent overcommitment
- Use resource limits appropriate for your hardware
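As one example of the first point, recurring vzdump jobs can be scheduled from the GUI (Datacenter → Backup) or, on a single node, with a cron entry along these lines (the container id, schedule, and storage name here are illustrative):

```
# Hypothetical /etc/cron.d/vzdump-ct101 entry: snapshot backup nightly at 03:00
0 3 * * * root vzdump 101 --mode snapshot --storage local --compress zstd
```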