Over-the-air (OTA) updates in multi-tenant systems often face task conflicts, cache overlap, and weak fault recovery during parallel updates. This study designed a layered fault-tolerance and isolation method that combines task redundancy, cache separation, and snapshot rollback. Tests were carried out on 120 devices across six tenants with a fault rate of up to 95%. The system kept stable operation, extended the mean time between failures (MTBF) to 182 hours, and raised total availability from 98.2% to 99.7%. The average update delay per tenant stayed below 1.1 seconds, showing that higher reliability did not slow the process. The method effectively avoided tenant interference, reduced recovery time, and improved update stability. It provides a simple and practical solution for dependable OTA updates in industrial, automotive, and IoT systems.