Characterizing Soft-Error Resilience of Arm's Ethos-U55 Embedded Machine Learning Accelerator


Core Concepts
The Ethos-U55 embedded machine learning accelerator from Arm does not meet the most stringent Automotive Safety Integrity Level D (ASIL-D) resiliency standard against soft errors, even with commonly used soft-error mitigation techniques. Selective protection of the most vulnerable functional blocks can achieve the ASIL-D standard with lower area overhead compared to full duplication.
Abstract
The paper presents a thorough characterization of the soft-error resilience of Arm's Ethos-U55 embedded machine learning accelerator. Key highlights:
Ethos-U55 does not meet the ASIL-D resiliency standard against soft errors across a range of neural network workloads and hardware configurations.
The resilience of different functional blocks within Ethos-U55 (e.g., MAC array, DMA, control logic) varies significantly, with some blocks being more vulnerable to soft errors than others.
The resilience of the functional blocks also depends on the application running on the accelerator and on the underlying technology node.
Conventional soft-error mitigation techniques such as Dual Modular Redundancy (DMR) and flip-flop hardening improve resilience, but require significant area overhead to meet the ASIL-D standard.
The authors propose a selective protection strategy in which only the most vulnerable functional blocks are protected. It achieves the ASIL-D standard with 38% area overhead, compared to 100% for full duplication.
The authors develop a statistical analysis tool to quickly navigate the large design space and identify the optimal protection strategy under area constraints.
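The idea of searching the design space for the cheapest protection plan that still meets a resilience target can be illustrated with a small sketch. This is not the authors' statistical tool; it is a brute-force search under stated assumptions. The block names reuse the examples above, but every number (area shares, SDC shares, cost and benefit of each protection option) and the function name `cheapest_plan` are hypothetical, chosen only to make the example run. It does not reproduce the 38% figure.

```python
from itertools import product

# Hypothetical per-block data: (share of total area, share of unprotected SDC rate).
BLOCKS = {
    "mac_array": (0.40, 0.30),
    "dma": (0.15, 0.35),
    "control": (0.10, 0.25),
    "other": (0.35, 0.10),
}

# Hypothetical protection options:
# (extra area as a multiple of the block's area, fraction of the block's SDC rate left).
OPTIONS = {
    "none": (0.0, 1.0),
    "harden": (0.3, 0.1),
    "duplicate": (1.0, 0.0),
}


def cheapest_plan(blocks, options, max_residual_sdc):
    """Try every per-block assignment and return (area_overhead, plan) for the
    lowest-area plan whose residual SDC share is within the target, or None."""
    names = list(blocks)
    best = None
    for choice in product(options, repeat=len(names)):
        area = sum(blocks[n][0] * options[c][0] for n, c in zip(names, choice))
        sdc = sum(blocks[n][1] * options[c][1] for n, c in zip(names, choice))
        if sdc <= max_residual_sdc and (best is None or area < best[0]):
            best = (area, dict(zip(names, choice)))
    return best


if __name__ == "__main__":
    # Ask for at most 5% of the unprotected SDC rate to remain.
    print(cheapest_plan(BLOCKS, OPTIONS, 0.05))
```

With these made-up inputs the search duplicates the two blocks that dominate the SDC rate and only hardens the rest, which is the general shape of a selective protection plan. Exhaustive enumeration grows exponentially with the number of blocks, so a real tool would need a smarter search.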
Stats
The Silent Data Corruption (SDC) rate of Ethos-U55 is not lower than 0.1 × 10⁻¹⁵ per inference, which violates the ASIL-D standard.
The SDC rate increases with the scale of the NPU (e.g., MAC array and on-chip SRAM sizes).
Different functional blocks in Ethos-U55 have inherently different sensitivity to soft errors.
The sensitivity pattern of the functional blocks changes significantly depending on whether faults in logic elements are considered.
Quotes
"To the best of our knowledge, this is the first large-scale resiliency characterization of a commercial NPU based on RTL fault injections."
"We show that it is possible to meet the ASIL-D level resiliency without resorting to conventional strategies like Dual Core Lock Step (DCLS) that has an area overhead of 100%."
"We show that by carefully duplicating a small fraction of the functional blocks and hardening the Flops in other blocks meets the ASIL-D safety standard while introducing an area overhead of only 38%."

Deeper Inquiries

How can the proposed selective protection strategy be extended to other types of safety-critical accelerators beyond neural network inference?

The proposed selective protection strategy can be extended to other safety-critical accelerators by identifying the sensitive functional blocks within the specific accelerator architecture. By analyzing the hardware structures and their sensitivity to soft errors, as was done for the Ethos-U55 NPU, designers can determine which blocks require protection. This analysis involves evaluating the impact of soft errors on each functional block and choosing a protection strategy for it based on its sensitivity and its importance to overall system reliability.

Once the critical blocks are identified, designers can apply redundancy or hardening only to those blocks rather than uniformly across the entire accelerator. This targeted approach uses resources efficiently and reduces the area overhead associated with traditional fault tolerance methods, so resilience improves at a lower hardware cost.
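The block-ranking step described above can be sketched as follows. This is a minimal illustration, assuming each block's contribution to the overall SDC rate is proportional to its flip-flop count times the probability that a fault in that block ends in an SDC. The function names, block names, and all numbers are hypothetical; in practice the probabilities would come from a fault-injection campaign on the target accelerator.

```python
def rank_blocks(stats):
    """Rank blocks by their estimated share of the overall SDC rate.

    stats maps block name -> (flip_flop_count, sdc_probability_per_fault).
    Returns a list of (name, share) pairs, largest share first.
    """
    weights = {name: count * prob for name, (count, prob) in stats.items()}
    total = sum(weights.values())
    shares = [(name, weight / total) for name, weight in weights.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


def blocks_to_protect(stats, coverage):
    """Pick the most vulnerable blocks until their combined share reaches coverage."""
    chosen, covered = [], 0.0
    for name, share in rank_blocks(stats):
        if covered >= coverage:
            break
        chosen.append(name)
        covered += share
    return chosen


if __name__ == "__main__":
    # Hypothetical measurements for an arbitrary accelerator.
    example = {
        "mac_array": (20000, 0.005),
        "dma": (5000, 0.08),
        "control": (2000, 0.10),
        "other": (3000, 0.02),
    }
    print(rank_blocks(example))
    print(blocks_to_protect(example, coverage=0.75))
```

In this made-up example the largest block by flip-flop count is not the largest contributor, which is why ranking by contribution rather than by size matters when choosing what to protect.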

What are the potential trade-offs between resilience, performance, and power consumption when applying the selective protection approach?

When applying the selective protection approach, there are several potential trade-offs to consider between resilience, performance, and power consumption.

Resilience: By selectively protecting only the most critical functional blocks, the overall resilience of the accelerator can be improved without incurring a significant area overhead. This targeted approach allocates resources where they are most needed, enhancing the system's ability to withstand soft errors and maintain reliable operation.

Performance: Depending on the protection strategy implemented in each functional block, there may be a trade-off between performance and resilience. For example, adding redundancy or hardening mechanisms can introduce additional latency or overhead, potentially impacting the accelerator's performance. Designers must balance the level of protection against the desired performance metrics.

Power Consumption: Selective protection measures can also affect power consumption. The additional hardware used for redundancy or hardening may increase power requirements, leading to higher energy consumption. Designers need to weigh improved resilience against increased power consumption, aiming for a balance that meets the system's power constraints while maintaining adequate protection against soft errors.

Overall, designers must evaluate and optimize the trade-offs between resilience, performance, and power consumption when implementing selective protection strategies in safety-critical accelerators.

How can the resilience characterization methodology be adapted to account for the impact of software-level fault tolerance techniques on the overall system reliability?

Adapting the resilience characterization methodology to account for software-level fault tolerance techniques involves integrating those mechanisms into the fault injection and analysis process.

Identification of Software Fault Tolerance Techniques: The first step is to identify the specific software-level fault tolerance techniques implemented in the system, such as error detection and recovery algorithms, redundancy in software components, or error handling mechanisms. Understanding how these techniques operate and interact with the hardware can provide insights into their effectiveness in mitigating soft errors.

Incorporating Software Fault Injection: Software fault injection techniques can be used to simulate the effects of soft errors on the software components of the system. By injecting faults into the software at different points and observing the system's response, designers can assess the software's resilience to errors and its ability to maintain system functionality in the presence of faults.

Combined Hardware-Software Resilience Analysis: Integrating hardware and software fault injection experiments allows for a comprehensive analysis of the system's overall resilience. By evaluating the impact of soft errors on both hardware and software components, designers can assess the effectiveness of the combined fault tolerance mechanisms in ensuring system reliability.

Quantifying System Reliability: The methodology should include metrics for quantifying the system's reliability in the presence of soft errors, considering both hardware and software resilience factors. This may involve calculating the overall system SDC rate, assessing the effectiveness of fault tolerance mechanisms, and identifying areas for improvement in system reliability.
By adapting the resilience characterization methodology to incorporate software-level fault tolerance techniques, designers can gain a holistic understanding of the system's resilience to soft errors and optimize the overall reliability of safety-critical accelerators.
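The quantification step can be sketched with a small example. It assumes a fault-injection campaign in which each trial records whether the final output was corrupted and whether a software-level check flagged the error; a corrupted output that is flagged counts as detected rather than silent. The record layout, function names, and trial counts are hypothetical. The Wilson score interval is a standard way to put error bars on a proportion estimated from a limited number of trials; it is not claimed to be the method used in the paper.

```python
import math


def sdc_fraction(trials, use_software_checks):
    """Fraction of trials that end in silent data corruption.

    Each trial is a pair (output_corrupted, software_detected).
    When use_software_checks is False, every corrupted output counts as silent.
    """
    silent = 0
    for corrupted, detected in trials:
        if corrupted and not (use_software_checks and detected):
            silent += 1
    return silent / len(trials)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% confidence)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical campaign: 90 masked faults, 6 corruptions caught by software,
    # 4 corruptions that escaped every check.
    trials = [(False, False)] * 90 + [(True, True)] * 6 + [(True, False)] * 4
    print(sdc_fraction(trials, use_software_checks=False))
    print(sdc_fraction(trials, use_software_checks=True))
    print(wilson_interval(4, len(trials)))
```

Comparing the two fractions shows how much of the hardware-level SDC rate the software checks remove, and the interval indicates how many injections are needed before that difference is statistically meaningful.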