This paper proposes a novel compute-in-memory (rCiM) architecture based on resonant SRAM and develops an automated tool for exploring application-specific rCiM designs, aiming to minimize energy consumption and latency.
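The summary names the tool's goal but not its mechanics, so the sketch below is a hypothetical illustration of what an automated rCiM design-space exploration loop could look like: enumerate candidate configurations under placeholder cost models and keep the one minimizing the energy-delay product. Every knob and coefficient here is invented for illustration, not taken from the paper.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RCiMConfig:
    # Hypothetical design knobs; the real tool's parameters may differ.
    banks: int
    word_bits: int
    resonant_stages: int

def estimate_energy_nj(cfg: RCiMConfig, workload_ops: int) -> float:
    # Placeholder analytical model: more resonant stages recover more
    # energy per access, while extra banks add leakage overhead.
    per_op = 1.0 / cfg.resonant_stages + 0.01 * cfg.banks
    return per_op * workload_ops * cfg.word_bits / 8

def estimate_latency_us(cfg: RCiMConfig, workload_ops: int) -> float:
    # Placeholder: banks provide parallelism, resonant clocking
    # lengthens the cycle slightly.
    return workload_ops / (cfg.banks * 1e3) * (1 + 0.05 * cfg.resonant_stages)

def explore(workload_ops: int) -> RCiMConfig:
    """Exhaustive search minimizing the energy-delay product."""
    space = product([2, 4, 8], [8, 16], [1, 2, 4])
    return min(
        (RCiMConfig(b, w, s) for b, w, s in space),
        key=lambda c: estimate_energy_nj(c, workload_ops)
                      * estimate_latency_us(c, workload_ops),
    )

print(explore(workload_ops=1_000_000))
```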
A distributed Union-Find decoder that can exploit parallel computing resources to achieve sublinear average time complexity with respect to the surface code distance, enabling faster error correction for large surface codes.
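For reference, the sequential primitive at the heart of any Union-Find decoder is the classic disjoint-set structure, sketched below with path compression and union by rank; the paper's actual contribution, distributing this work across parallel computing resources, is not attempted in this minimal single-threaded sketch.

```python
class DisjointSet:
    """Union-Find with path compression and union by rank, the
    sequential primitive underlying Union-Find decoders."""
    def __init__(self, n: int):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x: int) -> int:
        # Path halving: point visited nodes at their grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same cluster
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True

# Merging syndrome clusters as defects are linked:
ds = DisjointSet(8)
ds.union(0, 1); ds.union(1, 2)
assert ds.find(0) == ds.find(2)
```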
This work presents NMO, a multi-level memory-centric profiling tool that leverages ARM's Statistical Profiling Extension (SPE) to enable precise memory access tracing on ARM processors. It provides the first quantitative assessment of ARM SPE's time overhead and sampling accuracy for memory-centric profiling across different sampling periods and aux buffer sizes.
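A hedged sketch of the post-processing step such a profiler performs: aggregating sampled memory-access records into a per-page heat map. The record layout below is a hypothetical simplification; real ARM SPE records carry many more fields (event type, latency, data source) and are decoded from the perf aux buffer.

```python
from collections import Counter
from typing import Iterable, NamedTuple

PAGE_SHIFT = 12  # 4 KiB pages

class Sample(NamedTuple):
    # Hypothetical simplified record; real SPE samples also include
    # latency, data source, and event flags.
    virt_addr: int
    is_load: bool

def page_heatmap(samples: Iterable[Sample]) -> Counter:
    """Count sampled accesses per virtual page.

    With a sampling period of N, each sample statistically stands in
    for ~N real accesses, so relative (not absolute) counts are what
    a sampling profiler can report.
    """
    heat = Counter()
    for s in samples:
        heat[s.virt_addr >> PAGE_SHIFT] += 1
    return heat

samples = [Sample(0x1000 + i * 64, True) for i in range(10)] + \
          [Sample(0x20000, False) for _ in range(3)]
for page, n in page_heatmap(samples).most_common():
    print(f"page 0x{page:x}: {n} sampled accesses")
```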
A cycle-accurate performance model was developed to guide the implementation of a superscalar version of the open-source CVA6 RISC-V processor, resulting in a 40% performance improvement on the CoreMark benchmark.
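A hedged sketch of the simplest kind of issue-width model one could use before committing to RTL: replay an instruction trace and count cycles under single- versus dual-issue assumptions. The dependency rule here (no same-cycle issue when a later slot reads an earlier slot's destination) is a deliberate simplification; CVA6's real microarchitectural constraints are much richer.

```python
from typing import List, NamedTuple, Optional

class Instr(NamedTuple):
    dest: Optional[str]
    srcs: tuple

def cycles(trace: List[Instr], width: int) -> int:
    """Count cycles to issue a trace at a given issue width.

    Simplification: up to `width` instructions issue per cycle unless a
    later slot reads the destination of an earlier same-cycle slot.
    """
    total, i = 0, 0
    while i < len(trace):
        issued, written = 0, set()
        while (i < len(trace) and issued < width
               and not (set(trace[i].srcs) & written)):
            if trace[i].dest:
                written.add(trace[i].dest)
            issued += 1
            i += 1
        total += 1
    return total

trace = [
    Instr("x1", ("x2",)), Instr("x3", ("x4",)),   # independent pair
    Instr("x5", ("x1",)), Instr("x6", ("x5",)),   # dependent chain
]
base, dual = cycles(trace, 1), cycles(trace, 2)
print(f"single-issue: {base} cycles, dual-issue: {dual} cycles, "
      f"speedup {base / dual:.2f}x")
```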
Relic, a specialized software-only framework, enables significant performance improvements over state-of-the-art parallel programming frameworks for fine-grained tasks on simultaneous multithreading (SMT) CPU cores.
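This is not Relic's API, which the summary does not describe; it is a conceptual illustration, in Python for consistency with the other sketches, of the handoff pattern fine-grained SMT frameworks rely on: a worker thread spinning on a shared mailbox so a sibling hardware thread can hand it a task without involving the OS scheduler. A real framework would do this in native code with threads pinned to sibling hyperthreads.

```python
import threading

class Mailbox:
    """Single-slot handoff: one producer posts closures, one worker spins.

    Conceptual only: Python's GIL prevents true SMT parallelism; a real
    fine-grained framework would pin native threads to sibling
    hyperthreads and use a lock-free slot.
    """
    def __init__(self):
        self._task = None
        self._done = threading.Event()
        self._stop = False

    def post(self, fn):
        self._done.clear()
        self._task = fn          # visible to the spinning worker
        self._done.wait()        # block until the sibling finishes

    def serve(self):
        while not self._stop:
            task = self._task
            if task is not None:
                self._task = None
                task()
                self._done.set()

    def shutdown(self):
        self._stop = True

mb = Mailbox()
worker = threading.Thread(target=mb.serve, daemon=True)
worker.start()
results = []
mb.post(lambda: results.append(sum(range(100))))
mb.shutdown()
print(results)  # [4950]
```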
A scalable and programmable Look-Up Table (LUT)-based Neural Accelerator (LUT-NA) framework that employs a divide-and-conquer approach to overcome the scalability limitations of traditional LUT-based techniques, and utilizes mixed-precision analysis to further reduce energy and area consumption without significant accuracy loss.
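As an illustration of the divide-and-conquer idea (under assumed chunk sizes and quantization levels, since the summary gives neither), the sketch below replaces one infeasibly large dot-product LUT with several small per-chunk LUTs whose outputs are summed. Mixed precision would correspond to choosing different `levels` per layer or chunk.

```python
import itertools

def build_sub_luts(weights, chunk, levels):
    """Precompute, for each weight chunk, the dot product against every
    possible quantized input chunk: k small LUTs of size levels**chunk
    instead of one LUT of size levels**len(weights)."""
    luts = []
    for start in range(0, len(weights), chunk):
        w = weights[start:start + chunk]
        lut = {xs: sum(x * wi for x, wi in zip(xs, w))
               for xs in itertools.product(range(levels), repeat=len(w))}
        luts.append(lut)
    return luts

def lut_dot(luts, x, chunk):
    """Dot product via table lookups: one lookup per chunk, then add."""
    return sum(lut[tuple(x[i * chunk:(i + 1) * chunk])]
               for i, lut in enumerate(luts))

weights = [0.5, -1.0, 2.0, 0.25]   # hypothetical pre-trained weights
chunk, levels = 2, 4               # 2-element chunks, 2-bit activations
luts = build_sub_luts(weights, chunk, levels)
x = [3, 1, 0, 2]                   # quantized activations in [0, levels)
assert lut_dot(luts, x, chunk) == sum(a * w for a, w in zip(x, weights))
```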
The proposed platform enables the integration of approximate circuits at the core level with diverse structures, accuracies, and timings without requiring modifications to the core, particularly in the control logic. It introduces novel control features, allowing configurable trade-offs between accuracy and energy consumption based on specific application requirements.
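A hedged illustration of the kind of accuracy/energy knob such a platform exposes, using a lower-part-OR adder as a stand-in approximate circuit; the paper's actual circuits and control interface are not specified in this summary. The `approx_bits` parameter plays the role of the configurable trade-off: more approximated bits means less carry logic (and, in hardware, less energy) at the cost of accuracy.

```python
def approx_add(a: int, b: int, approx_bits: int) -> int:
    """LOA-style approximate addition: the low `approx_bits` are ORed
    instead of added, so no carry propagates out of the approximate
    region. `approx_bits` is the runtime accuracy/energy knob."""
    if approx_bits == 0:
        return a + b                          # exact mode
    mask = (1 << approx_bits) - 1
    low = (a | b) & mask                      # cheap, carry-free low part
    high = ((a >> approx_bits) + (b >> approx_bits)) << approx_bits
    return high | low

for k in (0, 2, 4):
    s = approx_add(0b101101, 0b011011, k)
    print(f"approx_bits={k}: {s} (exact {0b101101 + 0b011011})")
```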
A novel streaming architecture with hybrid computing engines and a balanced dataflow strategy is proposed to efficiently accelerate lightweight convolutional neural networks by minimizing on-chip memory overhead and off-chip memory access while enhancing computational efficiency.
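The summary names the dataflow goals but not their details; as a minimal stand-in, the sketch below shows the window-buffer idea streaming CNN accelerators use to bound on-chip memory, reduced to 1-D for brevity: each input element is read from "off-chip" exactly once, and on-chip state is only a kernel-sized window.

```python
from collections import deque

def stream_conv1d(pixels, kernel):
    """Streaming 1-D sliding-window dot product: one read per input
    element, with on-chip state limited to len(kernel) samples, the
    kind of buffering a streaming accelerator uses to cut off-chip
    memory traffic."""
    window = deque(maxlen=len(kernel))
    for p in pixels:                # single pass over the input stream
        window.append(p)
        if len(window) == len(kernel):
            yield sum(w * k for w, k in zip(window, kernel))

print(list(stream_conv1d([1, 2, 3, 4, 5], [1, 0, -1])))  # [-2, -2, -2]
```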