Analysis of crash dump mechanism based on Linux operating system kernel

With the widespread use of embedded Linux systems, higher demands are placed on system reliability, especially in critical areas where life and property are at stake. Such systems are required to reach safety integrity level 3 or higher, with a probability of dangerous failure per hour of 10^-7 or less, which is equivalent to a mean time between failures (MTBF) of at least 1,141 years. Improving system reliability has therefore become an arduous task. A field survey of one company's 14,878 controller systems in industrial use showed that, from the beginning of 2004 to the end of September 2007, continuous hardware and software improvements reduced the failure rate based on error reports to less than one-fifth of its 2004 level, while the time needed to find an error grew to more than three times what it had been.
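The MTBF figure can be checked with a short calculation, assuming a constant failure rate $\lambda$ and 8,760 hours in a year:

```latex
\mathrm{MTBF} = \frac{1}{\lambda} = \frac{1}{10^{-7}\,\mathrm{h}^{-1}} = 10^{7}\,\mathrm{h},
\qquad
\frac{10^{7}\,\mathrm{h}}{8760\,\mathrm{h/year}} \approx 1141.6\ \text{years}
```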

This growth in the time required to solve problems is partly due to the software itself, but the main reason is the lack of tools to assist in solving them. Statistical tracking of the faults shows that the software errors that are hard to solve, and those that take a long time from discovery to resolution, are concentrated in the core of the operating system, with a large proportion in the drivers. Error tracking technology is therefore regarded as an important measure for raising the safety integrity level of the system. Most modern operating systems provide developers with a kernel "crash dump" mechanism: when the system goes down, the memory contents are saved to disk or sent over the network to a fault server, or the kernel debugger is started directly, for later analysis and improvement.

The crash dump mechanisms that have been available for the Linux kernel in recent years are the following:

(1) LKCD (Linux Kernel Crash Dump) mechanism;

(2) KDUMP (Linux Kernel Dump) mechanism;

(3) KDB mechanism;

(4) KGDB mechanism.

Comparing these four mechanisms shows that they have three points in common:

(1) Suitable for applications with abundant computing resources and sufficient storage space;

(2) There is no strict requirement for the recovery time after a system crash;

(3) Mainly aimed at more general hardware platforms, such as X86 platforms.

Using one of the above mechanisms directly in an embedded application runs into the following three difficulties:

(1) Insufficient storage space

Embedded systems generally use Flash as storage, but Flash capacity is limited and may be much smaller than the RAM capacity of the embedded system. Saving the entire memory contents to Flash is therefore not feasible.

(2) The recording time must be as short as possible

Embedded systems generally require the reset response time to be as short as possible; some embedded operating systems reset and restart in no more than 2 s. The kernel crash dump mechanisms available for Linux listed above, however, need at least 30 s. Writing to Flash is also time-consuming: experiments show that writing 2 MB of data to Flash takes as much as 400 ms.

(3) Support for a specific hardware platform is required

Embedded hardware is diverse. The four mechanisms mentioned above all provide good support for the X86 platform, while their support for other hardware platforms is not mature.

Because of these difficulties, porting one of the above four kernel crash dump mechanisms to a specific embedded platform is very hard. In view of these three characteristics of embedded systems, this article therefore introduces LCRT (Linux Crash Record and Trace), a platform-specific mechanism for recording embedded Linux kernel crash information, which provides an auxiliary means of locating and solving software faults in embedded Linux systems.

1 Analysis of Linux kernel crash

An analysis of how the Linux kernel handles the various "traps" that occur during operation shows that the kernel can monitor errors caused by application programs. When an application triggers an error such as division by zero, a memory access violation, or a buffer overflow, the kernel's exception handling routines handle the exception. When an application generates an unrecoverable error, the kernel simply terminates that application, and other applications continue to run normally.
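This isolation can be observed from user space. The following is a minimal sketch, assuming a POSIX system; the names run_child_and_wait and write_through_null are illustrative and not part of the LCRT mechanism. A child process writes through a null pointer, the kernel terminates only that child, and the parent carries on and inspects how the child ended.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Faulty "application": writes through a null pointer. The pointer is read
 * from a volatile variable so the compiler cannot optimize the store away. */
void write_through_null(void)
{
    int *volatile bad = NULL;
    *bad = 1;
}

/* Runs child_body in a child process and reports how the child ended:
 * the terminating signal number, 0 for a normal exit, -1 if fork or
 * waitpid failed. */
int run_child_and_wait(void (*child_body)(void))
{
    assert(child_body != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        child_body();   /* a fault here is delivered to the child only */
        _exit(0);
    }

    int status = 0;     /* parent: unaffected, it just collects the result */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On Linux, calling run_child_and_wait(write_through_null) is expected to report SIGSEGV for the child while the calling process keeps running.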

If there is a bug in the Linux kernel itself or in a newly developed kernel module, errors such as division by zero, out-of-bounds memory access, or buffer overflow are likewise handled by the kernel's exception handling routines. If the exception handler finds a "serious unrecoverable" kernel exception, it triggers a kernel panic, that is, a Linux kernel crash. Figure 1 shows the Linux kernel's processing flow for abnormal situations.

2 Design and implementation of LCRT mechanism

Analysis of the Linux kernel code shows that the kernel itself provides a "kernel notification mechanism" and predefines "kernel event notification chains", so that developers of kernel extensions can use these chains to execute additional processing when a specific kernel event occurs. The kernel source also shows that a notification chain and notification point are predefined for the "serious unrecoverable kernel exception" mentioned above. After a Linux kernel crash, the "kernel crash notification chain" predefined in the kernel's panic function can therefore be used to hook in the LCRT mechanism, which captures information about the crash scene and records it in non-volatile memory so that the cause of the crash can be analyzed.

2.1 Design points

The design and implementation of the LCRT mechanism is based on the following specific mechanisms:

(1) Compiler options and kernel dependencies

The Linux kernel and its drivers are compiled with GCC, GNU's open source compiler. So that the LCRT mechanism can easily extract and record information, the kernel and the related drivers and applications must be compiled with a specific GCC option: -mpoke-function-name. A binary compiled with this option contains the names of its C functions, which makes the recorded information readable when the function call chain is traced back.
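As described in the GCC documentation for ARM, -mpoke-function-name places each function's NUL-terminated name, padded to a word boundary, directly before the function, followed by a marker word equal to 0xff000000 plus the padded name length. The following is a minimal user-space sketch of how a backtrace routine could recover the name from that layout. It works on a simulated little-endian memory image rather than real kernel text, the function name poked_function_name is hypothetical, and deriving the entry address from a saved pc value is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POKE_MARKER_MASK 0xff000000u

/* Read a 32-bit little-endian word from the simulated image. */
static uint32_t read_word(const uint8_t *image, size_t offset)
{
    return (uint32_t)image[offset]
         | ((uint32_t)image[offset + 1] << 8)
         | ((uint32_t)image[offset + 2] << 16)
         | ((uint32_t)image[offset + 3] << 24);
}

/* Returns the embedded name of the function whose first instruction is at
 * image[entry], or NULL if no valid marker word precedes it. */
const char *poked_function_name(const uint8_t *image, size_t image_len,
                                size_t entry)
{
    assert(image != NULL);
    if (entry < 4 || entry > image_len)
        return NULL;

    uint32_t marker = read_word(image, entry - 4);
    if ((marker & POKE_MARKER_MASK) != POKE_MARKER_MASK)
        return NULL;                  /* top 8 bits not all set: no name */

    size_t name_len = marker & ~POKE_MARKER_MASK;  /* padded length, bytes */
    if (name_len == 0 || name_len > entry - 4)
        return NULL;

    const uint8_t *name = image + (entry - 4 - name_len);
    if (memchr(name, '\0', name_len) == NULL)
        return NULL;                  /* the name must be NUL-terminated */
    return (const char *)name;
}
```

For an image holding the bytes 'f','o','o',0 followed by the marker word 0xff000004 and then the function body at offset 8, poked_function_name(image, len, 8) yields "foo".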

(2) Linux kernel notify_chain mechanism

The Linux kernel provides a "notification chain" facility and predefines a kernel crash notification chain. When the kernel's exception handling routine determines that the system has entered the "unrecoverable" state, it calls the notification functions registered on that predefined chain, in chain order.
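The idea of a notification chain can be modelled in user space as a priority-ordered linked list of callbacks. The sketch below is only an illustration of the principle: the sim_ names are hypothetical and this is not the kernel's own notifier API, although the kernel's notifier blocks carry similar fields (a callback, a next pointer and a priority).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_NOTIFY_OK   0
#define SIM_NOTIFY_STOP 1

/* Hypothetical user-space model of one entry on a notification chain. */
struct sim_notifier {
    int (*call)(struct sim_notifier *self, unsigned long event, void *data);
    struct sim_notifier *next;
    int priority;                     /* higher priority runs first */
};

/* Insert nb into the chain, keeping it sorted by descending priority. */
void sim_chain_register(struct sim_notifier **head, struct sim_notifier *nb)
{
    assert(head != NULL && nb != NULL);
    while (*head != NULL && (*head)->priority >= nb->priority)
        head = &(*head)->next;
    nb->next = *head;
    *head = nb;
}

/* Call every registered handler in chain order; a handler may stop the
 * walk. Returns the number of handlers that were invoked. */
int sim_chain_call(struct sim_notifier *head, unsigned long event, void *data)
{
    int called = 0;
    for (struct sim_notifier *nb = head; nb != NULL; nb = nb->next) {
        called++;
        if (nb->call(nb, event, data) == SIM_NOTIFY_STOP)
            break;
    }
    return called;
}

/* Example handler standing in for a crash recorder: it appends its
 * priority (assumed 0..9) as a digit to the string passed in data, which
 * must have room for it. */
int sim_record_handler(struct sim_notifier *self, unsigned long event,
                       void *data)
{
    char *log = data;
    size_t len = strlen(log);

    (void)event;
    log[len] = (char)('0' + self->priority);
    log[len + 1] = '\0';
    return SIM_NOTIFY_OK;
}
```

Registering three handlers with priorities 1, 3 and 2 and then calling the chain invokes them in the order 3, 2, 1, which is how a crash recorder with a suitable priority can be made to run at a predictable point in the chain.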

(3) Stack layout of function calls

Most of the Linux kernel is implemented in C, and C is the main language for kernel development. The kernel, and the code that joins the kernel execution environment as a loadable kernel module (LKM), follows regular calling conventions, so the stack layout produced while this code runs is regular too. For example, before a function's body executes, the return address for the call, the parameters passed to the function, and the bottom of the caller's stack frame are saved.
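Because each frame records where its caller's frame lives, the call chain can be recovered by following these saved values. The sketch below walks a simulated stack held in an array. The layout is a deliberate simplification chosen for illustration (it is not the exact ARM frame layout), and the name sim_backtrace is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_MAX_DEPTH 16

/* Walk a simulated frame-pointer chain.
 * Assumed layout: stack[fp] holds the return address into the caller and
 * stack[fp - 1] holds the caller's frame pointer (an index into stack),
 * with 0 marking the end of the chain.
 * Stores up to max_out return addresses in out and returns the count. */
size_t sim_backtrace(const uint32_t *stack, size_t stack_words, size_t fp,
                     uint32_t *out, size_t max_out)
{
    size_t depth = 0;

    assert(stack != NULL && out != NULL);
    while (fp != 0 && depth < max_out && depth < SIM_MAX_DEPTH) {
        if (fp >= stack_words)
            break;                    /* corrupt frame pointer: stop */
        out[depth++] = stack[fp];     /* return address into the caller */

        size_t caller_fp = stack[fp - 1];
        if (caller_fp != 0 && caller_fp <= fp)
            break;                    /* callers must sit higher on the stack */
        fp = caller_fp;
    }
    return depth;
}
```

The depth limit and the check that each caller frame lies above the current one keep the walk bounded even if the stack was damaged by the fault being recorded.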

2.2 Design idea of the LCRT mechanism

The LCRT mechanism is divided into a Linux kernel module part and a Linux user program part. The kernel part is built as a loadable kernel module rather than as a direct modification of the Linux kernel. This reduces the coupling between the kernel and the LCRT mechanism and allows each to be upgraded independently. The user program part reads and clears the information that the LCRT mechanism has stored in non-volatile memory.

Given the characteristics of embedded systems, the design decisions of the LCRT mechanism are as follows:

(1) Record the function call relationship chain that is most helpful to solve and locate the problem.

(2) In order not to occupy too much storage space, selectively save the stack contents used by the functions on the function call sequence instead of saving the entire contents.

(3) Save the recorded information to the non-volatile memory, which not only achieves the purpose of power-down preservation, but also shortens the writing time.

The design of the LCRT mechanism includes the following five aspects.

(1) Design the Linux kernel module, dynamically load the LCRT mechanism, and modify the Linux kernel code as little as possible.

(2) Connect the notification function of LCRT to the corresponding, predefined Linux kernel notification chain.

(3) Perform stack traceback in the notification processing function of the LCRT mechanism to obtain function call information.

(4) Record the function call information traced back and the contents of the stack space to the non-volatile memory.

(5) Develop user space tools that can read stored information from non-volatile memory.
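Aspects (4) and (5) can be sketched in user space as a compact record that is copied into, and later read back from, a small non-volatile window. Everything in this sketch is an assumption made for illustration: the record layout, the sizes, the magic value and the sim_ names are not taken from the LCRT implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_NVRAM_SIZE   1024u        /* assumed size of the NVRAM window */
#define SIM_RECORD_MAGIC 0x4c435254u  /* ASCII "LCRT", illustrative only */
#define SIM_MAX_FRAMES   8u
#define SIM_STACK_BYTES  32u          /* stack bytes kept per frame */

struct sim_frame_record {
    uint32_t return_addr;
    uint8_t  stack[SIM_STACK_BYTES];  /* selected slice, not the whole frame */
};

struct sim_crash_record {
    uint32_t magic;
    uint32_t frame_count;
    struct sim_frame_record frames[SIM_MAX_FRAMES];
};

/* Copy the used part of rec into the NVRAM window. Returns the number of
 * bytes written, or 0 if the record is invalid or does not fit. */
size_t sim_store_record(uint8_t *nvram, size_t nvram_size,
                        const struct sim_crash_record *rec)
{
    assert(nvram != NULL && rec != NULL);
    if (rec->frame_count > SIM_MAX_FRAMES)
        return 0;

    size_t used = offsetof(struct sim_crash_record, frames)
                + rec->frame_count * sizeof(struct sim_frame_record);
    if (used > nvram_size)
        return 0;
    memcpy(nvram, rec, used);
    return used;
}

/* Read a record back, as a user-space tool would after a reset.
 * Returns 1 on success, 0 if the window holds no valid record. */
int sim_load_record(const uint8_t *nvram, size_t nvram_size,
                    struct sim_crash_record *out)
{
    size_t header = offsetof(struct sim_crash_record, frames);

    assert(nvram != NULL && out != NULL);
    if (nvram_size < header)
        return 0;
    memset(out, 0, sizeof *out);
    memcpy(out, nvram, header);
    if (out->magic != SIM_RECORD_MAGIC || out->frame_count > SIM_MAX_FRAMES)
        return 0;

    size_t used = header + out->frame_count * sizeof(struct sim_frame_record);
    if (used > nvram_size)
        return 0;
    memcpy(out, nvram, used);
    return 1;
}
```

Writing only the frames actually captured, each with a fixed slice of stack, keeps the record far below the size of a full memory dump, which is the point of design decisions (2) and (3).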

2.3 Implementation of LCRT mechanism

The LCRT mechanism can be implemented step by step following the design idea in section 2.2. Due to space limitations, this article does not cover the principles and implementation details of Linux kernel modules and gives only pseudo-code for the kernel module of the LCRT mechanism. The loading function of the LCRT mechanism is described by the following pseudo-code:

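int lcrt_init(void)
{
    printk(KERN_INFO "Registering my_panic notifier.\n");

    /* Map the non-volatile RAM region into the kernel address space. */
    bt_nvram_ptr = (volatile unsigned char *)ioremap_nocache(BT_NVRAM_BASE, BT_NVRAM_LENGTH);
    if (!bt_nvram_ptr)
        return -ENOMEM;

    /* Advance the write index past one struct bt_info. */
    bt_nvram_index += sizeof(struct bt_info);

    /* ... *) bt_nvram_ptr, BT_NVRAM_LENGTH);   (the start of this call is elided) */

    /* Hook the LCRT handler onto the predefined panic notifier chain. */
    notifier_chain_register(&panic_notifier_list, &my_panic_block);

    return 0;
}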

The notification handler of the LCRT mechanism traces back the function call chain and obtains each function's name and stack contents. Due to space limitations, it is illustrated by the following pseudo-code:

void ll_bt_information(struct pt_regs *pr)
{
    // Initialization work such as variable definitions

    do {
        // Get the register information saved when the function starts to
        // execute from the top of the function's stack frame
        reglist = *(unsigned long *)(*myfp - 8);

        // Get the name of the function from the function's code area

        // Retrieve from the function's stack frame the parameter information
        // saved before the function body was executed

        // Get, from this function's stack frame, the location of the code that
        // called this function and the bottom of the caller's stack frame
    } while (/* the head of the function call chain has not been reached */);

    // Get the contents of the function call stack frames

    // Fill in the record header of the information record

    // Save the information obtained in the above loop to non-volatile memory
    write_to_nvram((void *)bt_nvram_ptr, &bt_record_header, sizeof(bt_info_t));
}

3 Verification and evaluation of the LCRT mechanism

3.1 Deployment of the LCRT mechanism

Before the LCRT mechanism can take effect, the following deployment steps are required:

(1) Compile the Linux kernel module part of the LCRT mechanism for the target Linux kernel;

(2) Load the kernel module part of the LCRT mechanism into the Linux kernel.

3.2 Experimental results

To test the effect of the LCRT mechanism, a device driver module that crashes the Linux kernel was constructed. This kernel driver module is referred to as bugguy.ko, and the code in bugguy.ko that causes the crash is listed below:

irqreturn_t my_timer_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    /* Confirm hardware status and clear interrupt status */

    if (ujiffies > 5000) {
        void *ill_pointer = NULL;

        *(unsigned long *)ill_pointer = 0;   /* write through a null pointer */
    } else {
        ujiffies++;
    }

    return IRQ_HANDLED;
}

Note: the assignment through the null pointer ill_pointer is the code that generates the bug.

As the code above shows, the error is caused by dereferencing a null pointer. A null pointer dereference inside an interrupt handler crashes the Linux kernel. bugguy.ko is loaded into the Linux kernel of an embedded Linux system on which the LCRT mechanism is deployed, so that the crashing interrupt handler runs. The LCRT mechanism then saves the relevant information to non-volatile memory. After the system is reset, the saved information can be read out with the user space tool of the LCRT mechanism. The experiment yields the function call chain information shown in Figure 2.

The function marked in Figure 2 is the interrupt handler containing the faulty code, the real "culprit" behind the system crash. All of the recorded information occupies less than 1 KB of storage, and writing it to non-volatile memory takes no more than 50 ms. For this small cost in space and time, the recorded information is of great help in finding and solving problems.

The experimental results show that the LCRT mechanism makes it possible to quickly locate hidden software defects in an embedded Linux system that may cause system downtime. This provides key auxiliary information for subsequent troubleshooting and software improvement, and helps improve the stability and reliability of the embedded Linux kernel.

The LCRT mechanism was developed for ARM-based embedded Linux applications to record, in non-volatile memory, the function call chain and stack information at the moment the kernel crashes. At present, the LCRT mechanism can record the function call chain of an ARM-based embedded Linux kernel crash, from which the function names, the parameters passed to each function in the chain, and the stack frame contents of each function can be read directly. The recorded information is an important aid to improving and developing ARM-based embedded Linux applications.
