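Type: Applied research   Pre-defense date: 2017-09-15
Start (proposal) date: 2014-05-21   Thesis completion date: 2017-05-31
Venue: Room 213, Computer Building   Topic source: 973 and 863 Program projects   Thesis length: 8.1 (in units of 10,000 characters)
Title: Research on Software-Based Detection Techniques for Soft Errors
Keywords: soft error, software fault tolerance, silent data corruption, error propagation analysis, invariant techniques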
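Abstract: In radiation environments, high-energy charged particles striking the logic cells of a device cause single event effects (SEEs), and the transient faults that SEEs produce are called soft errors. Soft errors are a major factor affecting the reliability of aerospace computing. As the number of transistors integrated on a chip grows exponentially, the soft error rate rises rapidly in step with Moore's Law, making the reliability of aerospace computing an increasingly serious problem. Soft error detection is a key step in addressing this problem. Hardware-level detection techniques cannot yet meet protection requirements, whereas software-level detection techniques offer low manufacturing cost, independence from the underlying hardware design, and flexible configuration. Software-level soft error detection has therefore become one of the research priorities in aerospace computing reliability.

This dissertation studies soft error detection at the software level. To address the shortcomings of existing detection methods in detection rate and detection cost, it aims to develop methods with a high detection rate and a low detection cost, thereby improving the reliability of aerospace computing on existing hardware and ensuring that satellites provide stable service while in orbit. Of the four types of outcome that a soft error can produce, silent data corruption (SDC) is the hardest to detect. When an SDC occurs, execution is indistinguishable from a normal run except that the program's output is wrong. This dissertation focuses on the error propagation mechanisms of SDC and on methods for detecting it, and evaluates those methods through fault injection experiments. The main work and contributions are as follows:

(1) The effect of soft errors on stack behavior is analyzed through fault injection experiments, and the conditions and error propagation paths under which the pointers commonly used in stack operations lead to SDC are summarized. Program execution consists of a series of function calls, which are generally implemented with a stack. The pointers commonly used in stack operations are the stack pointer, the stack-frame base pointer, and the return address. The stack pointer and the stack-frame base pointer are held in the ESP and EBP registers, respectively. ESP, EBP, and the return address play an important role in program execution: ESP and EBP are used frequently for addressing, and the return address changes the program's control flow. However, no prior work had analyzed how an erroneous ESP, EBP, or return address affects program execution. A series of fault injection experiments targeting the return address, ESP, and EBP was carried out to analyze the conditions and propagation paths by which erroneous values lead to SDC. The experiments show that RET-control ESP and EBP cause SDC only under specific conditions; in particular, a RET-control ESP causes SDC only when ESP points to a return address before the RET instruction executes. The influence of injection timing and of the flipped bit on the injection outcome is identified. Hangs caused by RET-control EBP are due to return cycles; the formation of a return cycle is analyzed and its necessary conditions are given.

(2) A method for identifying SDC-causing instructions is designed on the basis of error propagation analysis, and the distribution and propagation characteristics of these instructions are studied. An SDC-causing instruction is one for which a soft error in an operand leads to SDC. Analyzing SDC-causing instructions is an important reference for designing and optimizing soft error detection. Existing methods for finding SDC-causing instructions require a huge number of fault injections and therefore a huge amount of time. Data dependencies among instructions are established from a data dependence graph, and inter-function and intra-function error propagation is studied; from this, a sufficient condition for an instruction to be SDC-causing is derived and an identification method is proposed. The method dynamically infers potential SDC-causing instructions from the fault injection experiments already performed, which reduces the number of injections still to be run. Time cost falls significantly while accuracy and coverage remain high. Analysis of the identified SDC-causing instructions shows that the static instructions and source code that lead to SDC are concentrated in a small area, and that the key positions on propagation paths are interface instructions and comparison instructions. These findings provide a basis for deciding where to place detectors.

(3) To address the difficulty of detecting SDC, a source-level assertion-based detection method is designed by introducing invariants from software testing. An invariant is a program property that remains unchanged at run time. After a soft error occurs, the program is affected and the invariant generally no longer holds. On this principle, assertions whose content is an invariant are inserted into the source code, and an assertion failure after a soft error is used to detect the error. Detection locations are determined from the analysis of SDC-causing instructions, and invariants are extracted at those locations. A permeability metric that characterizes an invariant's detection capability is defined, and at each detection location invariants are converted into assertions according to their permeability. Fault injection experiments verify the effectiveness of the method. Compared with FaultScreening, a representative source-level assertion-based detection method, the detection cost is essentially the same while the SDC detection rate is 21% higher. Radish, version 1.0 of a program-hardening system for C programs, is implemented on the basis of this method. Radish hardens C source files by adding invariant-based assertions. The invariant-based assertion method innovates in the form of the detector: the assertions capture richer relationships and achieve a higher detection rate, offering a new approach to detecting SDC.

(4) Even with invariant-based assertions deployed, errors may still propagate through unprotected regions. To address this, the invariant-based assertion method is improved with instruction duplication. Instruction duplication detects errors at a finer granularity than assertions; adding it protects the regions that the invariant-based assertions leave uncovered, so that the two methods work in concert. Fault injection results show that adding instruction duplication raises the SDC detection rate by 15.5% compared with the original assertion-based method. Radish_D, version 2.0 of the program-hardening system, is implemented on the basis of instruction duplication. Radish_D comprises the original Radish system and an instruction duplication module; after hardening, it outputs an executable in which invariant-based assertions and instruction duplication are deployed.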
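The invariant-based assertion idea in contribution (3) can be illustrated with a minimal C sketch. Everything here is an assumption made for illustration: the range invariant, the bounds, and the names `flip_bit`, `reading_invariant_holds`, and `CHECK_READING` are hypothetical, and the sketch does not reproduce the assertions that Radish generates or its permeability metric. A soft error is modeled as a single-bit flip in a 32-bit value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical invariant for an illustrative variable:
 * READING_MIN <= reading <= READING_MAX always holds in fault-free runs. */
#define READING_MIN 100u
#define READING_MAX 1000u

/* Model a soft error as a single-bit flip at position `bit` (0..31). */
uint32_t flip_bit(uint32_t value, unsigned bit)
{
    return value ^ (UINT32_C(1) << (bit & 31u));
}

/* Returns true when the range invariant holds for `reading`. */
bool reading_invariant_holds(uint32_t reading)
{
    return reading >= READING_MIN && reading <= READING_MAX;
}

/* Assertion form that would sit at a protected program point:
 * a violated invariant aborts the run instead of letting a
 * corrupted value reach the output silently. */
#define CHECK_READING(r) assert(reading_invariant_holds(r))
```

Under these assumptions, flipping bit 20 of the value 500 yields 1049076, which violates the invariant and would be caught, whereas flipping bit 0 yields 501, which still satisfies it and escapes detection. A range invariant alone therefore leaves some corrupted values unprotected.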
English title: Software-based techniques for soft error detection
English keywords: Soft Error, Software Fault Tolerance, Silent Data Corruption, Fault Propagation Analysis, Invariant Technology
English abstract: Single event effects (SEEs) are caused by energetic particles in radiation environments. The transient errors caused by SEEs are called soft errors. Soft errors have a great influence on the computing reliability of space devices. As the number of transistors on a chip increases, the soft error rate grows with Moore's Law, exacerbating the challenge of computing reliability. Error detection is a crucial step toward soft error mitigation. Hardware-based detection methods cannot yet meet the requirements of soft error mitigation. Owing to their advantages in cost, hardware independence, and configurability, software-based detection methods are receiving growing attention in soft error mitigation research.

This dissertation focuses on software-based detection methods against soft errors. Its purpose is to provide software-based detection methods that achieve high detection rates at low cost. Once these methods are applied, the reliability of aerospace computing can be improved without any changes to the hardware. Among the outcomes of soft errors, silent data corruption (SDC) is the hardest to detect. When an SDC occurs, the program generates erroneous output without any indication. This dissertation studies the propagation of faults that lead to SDC, proposes detection methods against SDC, and evaluates those methods by fault injection. The main contributions are summarized as follows:

(1) The effects of soft errors on stack behavior are analyzed, and several observations about SDC are drawn. The execution of a program is composed of a series of procedure calls, and calls are usually implemented with a stack. The processor provides three pointers for stack operations: the stack pointer, the stack-frame base pointer, and the return address. The stack pointer is held in the ESP register and the stack-frame base pointer in the EBP register. ESP and EBP are often used for addressing, and the return address determines the control flow after a RET instruction, so all three are important to the correctness of the program. To the best of our knowledge, stack behavior under soft errors has not been characterized in prior work. A series of fault injection experiments is conducted to characterize ESP, EBP, and the return address. Experimental results show that injections on ESP lead to SDC only if the flipped ESP points to another return address when the RET instruction executes. The flipped bits in these SDC cases are concentrated in particular bit positions, and the timing of an injection affects its outcome. Hangs resulting from injections on RET-control EBP are caused by return cycles, and the necessary conditions for a return cycle to occur are obtained.

(2) A novel method of identifying SDC-causing instructions by fault propagation analysis is proposed, and the distribution and propagation characteristics of SDC-causing instructions are obtained. An instruction is SDC-causing if an error in one of its operands can cause SDC. The design and improvement of SDC detectors often need a profile of SDC-causing instructions. With state-of-the-art methods, a huge number of faults must be injected to locate SDC-causing instructions, which incurs a prohibitive time cost. A data dependence graph is built to capture the dependencies among the values of instructions. The inter-function and intra-function propagation that leads to SDC is analyzed, and a sufficient condition for an instruction to be SDC-causing is derived. On this basis, a novel identification method is proposed. Using the trace files of injections already performed, the method infers potential SDC-causing instructions without expensive computation. Validation shows that the method yields high accuracy and coverage with a large reduction in injection cost. Analysis of the SDC-causing instructions shows that only a small fraction of static instructions or sections of source code causes most SDC cases. Moreover, the critical program points of fault propagation are connector instructions and branch instructions. These conclusions guide the strategic placement of detectors.

(3) An approach for detecting SDC is proposed that uses program invariants, a concept that originates in software testing. A program invariant is a property of the program that normally holds at run time. When a soft error occurs, the invariant is often violated. Based on this principle, invariant-based assertions are inserted into the source code. When an assertion fails, it indicates that a soft error has been detected. By analyzing the propagation of faults that lead to SDC, the locations where assertions are embedded are selected and invariants are then extracted. Some of the invariants are converted to assertions based on their permeability, which indicates their capability of detecting soft errors. The proposed approach is evaluated by fault injection experiments, which show that it achieves high coverage with low overhead. Its SDC detection rate is 21% higher than that of FaultScreening at nearly the same cost. By applying this approach, version 1.0 of a program-hardening system called Radish is implemented. Radish enhances the resilience of a program to soft errors by inserting invariant-based assertions into the source code. The proposed approach provides novel detectors that contain more types of relationships and achieve a higher detection rate, broadening the ways of detecting SDC.

(4) The assertions generated by Radish cannot monitor all variables and program points, so certain faults might propagate through unprotected code sections. To address this problem, a software-based instruction duplication mechanism is introduced. Compared with assertions, instruction duplication detects soft errors at a finer granularity and can therefore protect code sections that are not covered by assertions. Experiments show that adding instruction duplication increases the SDC detection rate by 15.5% compared with assertions alone. By applying the instruction duplication mechanism, version 2.0 of the program-hardening system, called Radish_D, is implemented. Radish_D extends Radish with a module that implements instruction duplication, and it produces executable files protected by both invariant-based assertions and instruction duplication.
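The duplicate-and-compare principle behind contribution (4) can be sketched at source level in C. This is only an analogy under stated assumptions: real instruction duplication is applied to individual machine instructions, the abstract does not describe how Radish_D implements it, and the names `dup_result`, `mul_add_duplicated`, and the `fault_mask` parameter are hypothetical. The mask is XORed into the primary copy of one operand to emulate a bit flip; a mask of 0 means a fault-free run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Outcome of a duplicated computation. */
typedef struct {
    uint32_t value;    /* result of the primary computation */
    bool     mismatch; /* true if primary and shadow results differ */
} dup_result;

/* Compute a * b + c twice, on a primary and a shadow copy of `a`,
 * and compare the two results. Unsigned arithmetic is used so that
 * overflow wraps instead of being undefined. */
dup_result mul_add_duplicated(uint32_t a, uint32_t b, uint32_t c,
                              uint32_t fault_mask)
{
    /* volatile discourages the compiler from merging the two copies. */
    volatile uint32_t a_primary = a ^ fault_mask; /* possibly corrupted */
    volatile uint32_t a_shadow  = a;              /* redundant copy */

    uint32_t r_primary = a_primary * b + c;
    uint32_t r_shadow  = a_shadow * b + c;

    dup_result out = { r_primary, r_primary != r_shadow };
    return out;
}
```

With inputs 3, 4, 5 and a zero mask, both copies produce 17 and no mismatch is reported. With a mask of 1 the primary copy computes 13 and the comparison flags the error. With `b` equal to 0 the flipped operand is masked by the multiplication, so both copies agree and nothing is reported, which is harmless because the output is unaffected.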
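Academic discussions
Organizer | Date | Venue | Speaker | Topic
GDOC Laboratory | 2013.9.13 | Third-floor meeting room | Ma Junchi (马骏驰) | SDCInfer: Silent Data Corruption Causing Instruction Inference
GDOC Laboratory | 2013.5.10 | Third-floor meeting room | Ma Junchi (马骏驰) | Soft error injection experiments and preliminary analysis of error characteristics
GDOC Laboratory | 2013.11.22 | Third-floor meeting room | Ma Junchi (马骏驰) | Discussion of SDC error characteristics
GDOC Laboratory | 2014.4.4 | Third-floor meeting room | Ma Junchi (马骏驰) | Research on detector placement strategies
GDOC Laboratory | 2014.5.15 | Third-floor meeting room | Ma Junchi (马骏驰) | Understanding Software Application Behavior in Presence of Permanent and Intermittent Hardware Faults
GDOC Laboratory | 2017.1.5 | Computer Building 317 | Ma Junchi (马骏驰) | Analysis of stack behavior under soft errors
GDOC Laboratory | 2017.4.10 | Computer Building 317 | Ma Junchi (马骏驰) | Design, Automation and Test in Europe 2017: conference attendance report
GDOC Laboratory | 2013.6.6 | Computer Building 317 | Shi Peizhong (史培中) | Research on WSN data transmission protocols under sleep scheduling
GDOC Laboratory | 2017.3.12 | Computer Building 317 | Zhang Yujian (张玉健) | GreenMatch: Renewable-Aware Workload Scheduling for Massive Storage Systems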
     
Academic conferences
Conference | Date | Location | Presented by candidate | Presentation title
6th IEEE International Conference on Software Engineering and | 2015.9.22 | Beijing | | SDCInfer: Inference of Silent Data Corruption Causing Instructions
2nd International Conference On Human-Centered Computing | 2016.1.8 | Colombo, Sri Lanka | | Identification of Critical Variables for Soft Error Detection
20th Design, Automation and Test in Europe | 2017.3.30 | Lausanne, Switzerland | | Characterization of Stack Behavior under Soft Errors
     
Representative works
Paper title
Detecting Silent Data Corruptions in Aerospace-based Computing Using Program Invariants
Identification of Critical Variables for Soft Error Detection
SDCInfer: Inference of Silent Data Corruption Causing Instructions
 
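Defense committee
Name | Title | Supervisor category | Affiliation | Chair | Remarks
Lu Sanglu (陆桑璐) | Senior rank, Professor | Doctoral supervisor | Nanjing University | |
Qin Xiaolin (秦小麟) | Senior rank, Professor | Doctoral supervisor | Nanjing University of Aeronautics and Astronautics | |
Feng Jing (冯经) | Senior rank, Professor | Doctoral supervisor | National University of Defense Technology | |
Cao Jiuxin (曹玖新) | Senior rank, Professor | Doctoral supervisor | Southeast University | |
Song Aibo (宋爱波) | Senior rank, Professor | Doctoral supervisor | Southeast University | |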
      
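Defense secretary
Name | Title | Affiliation | Remarks
Zhou Ling (周玲) | Associate senior rank, Associate Researcher | Southeast University |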