Demystifying Linux Kernel Initialization

The kernel initialization sequence is complex. Let’s examine the boot process.

Feb 14th, 2023 9:16am by Shuah Khan

Featued image for: Demystifying Linux Kernel Initialization

If you ever wondered what happens between the time you power up your laptop and when you get back with that steaming cup of your favorite beverage? This article will take the lid off and demystify the Linux kernel boot process. Let’s first talk about how much visibility a user has into what’s happening on their system and the state of the kernel itself. The Linux kernel is constantly printing diagnostic messages to the kernel ring buffer. As the name indicates, this ring buffer is a circular buffer in which older messages get overwritten by newer ones. It prints important messages about the initialization process during initialization and boot up. Several priority levels, such as emergency, critical, error, warning and informational are supported and logged to the ring buffer. If you are wondering how to view these messages, a command line tool dmesg exists just for that reason. Users can run the `dmesg` command line tool to examine or control the kernel ring buffer. `dmesg` control options allow users to clear, read first, then clear, enable and disable printing of messages to the console. We are going to focus on the examine options primarily. The `dmesg` command is a versatile tool with several options to zero in on specific messages for supported log facility and priority levels. Supported log facilities:

kern – kernel messages
user – random user-level messages
mail – mail system
daemon – system daemons
auth – security/authorization messages
syslog – messages generated internally by syslogd
lpr – line printer subsystem
news – network news subsystem

Supported log levels (priorities):

emerg — system is unusable
alert — action must be taken immediately
crit — critical conditions
err — error conditions
warn — warning conditions
notice — normal but significant condition
info — informational
debug — debug-level messages

As an example, the following commands can be used to view specific priority messages:

dmesg -l emerg# shows emergency messages
dmesg -l alert# shows alerts
dmesg -l crit# shows critical messages
dmesg -l err# shows error messages below emergency and critical levels
dmesg -l err+# shows all error levels emergency, critical and error

As an example, the following commands can be used to view log facility messages:

dmesg -f kern
dmesg -f daemon

These messages can be hard to parse and understand unless you look at them day in and day out. Let’s look at a couple of useful options that improve readability.

-t (–notime) option shows messages without timestamps;
-x (–decode) option decodes the log facility and priority numbers into user-friendly human-readable information.

Combining these two options yields easily readable and understandable messages. In addition, this format without timestamps lends itself well to comparing and analyzing messages from different kernel boots for regressions in errors.

dmesg -t -x

Here is an excerpt from dmesg that includes info, debug, warn-level messages:

kern  :warn  : [drm] Fence fallback timer expired on ring gfx
kern  :info  : fbcon: amdgpudrmfb (fb0) is primary device
kern  :info  : [drm] DSC precompute is not needed.
kern  :debug : calling  nf_defrag_init+0x0/0xff0 [nf_defrag_ipv4] @ 1037
kern  :debug : initcall nf_defrag_init+0x0/0xff0 [nf_defrag_ipv4] returned 0 aft
er 1 usecs
kern  :debug : calling  nf_defrag_init+0x0/0xff0 [nf_defrag_ipv6] @ 1037

With this overview of the kernel ring buffer and how to view kernel messages, we can dive into the kernel initialization sequence. The kernel initialization sequence is complex and carried out in several stages using initcalls. The initcalls contribute a significant amount of time to the Linux boot process. initcalls are grouped by functionality, and early initcalls [early_initcall(fn)] run before initializing SMP (symmetric multiprocessing) and are for built-in (statically linked) code, not for dynamically loadable modules. As of this writing, there are more than 110 early_initcall() defined in the kernel. A pure initcall is much more restricted, has no dependencies on anything else and purely initializes variables that couldn’t be statically initialized. Like early_initcall(), pure_initcall() also exists for built-in code, not for modules. There are just about 10 pure_initcall() definitions in the kernel as one would expect based on the restrictive uses of this type of initcall. One such use of a pure initcall can be found in the pci subsystem to initialize parameters when early parameters are parsed in pci_realloc_setup_params(). There are eight initcall call levels as defined in include/linux/init.h:

#define pure_initcall(fn)               __define_initcall(fn, 0)

#define core_initcall(fn)               __define_initcall(fn, 1)
#define core_initcall_sync(fn)          __define_initcall(fn, 1s)
#define postcore_initcall(fn)           __define_initcall(fn, 2)
#define postcore_initcall_sync(fn)      __define_initcall(fn, 2s)
#define arch_initcall(fn)               __define_initcall(fn, 3)
#define arch_initcall_sync(fn)          __define_initcall(fn, 3s)
#define subsys_initcall(fn)             __define_initcall(fn, 4)
#define subsys_initcall_sync(fn)        __define_initcall(fn, 4s)
#define fs_initcall(fn)                 __define_initcall(fn, 5)
#define fs_initcall_sync(fn)            __define_initcall(fn, 5s)
#define rootfs_initcall(fn)             __define_initcall(fn, rootfs)
#define device_initcall(fn)             __define_initcall(fn, 6)
#define device_initcall_sync(fn)        __define_initcall(fn, 6s)
#define late_initcall(fn)               __define_initcall(fn, 7)
#define late_initcall_sync(fn)          __define_initcall(fn, 7s)

And kept in sync with initcall_level_names[] in init/main.c:

static const char *initcall_level_names[] __initdata = {
        "pure",
        "core",
        "postcore",
        "arch",
        "subsys",
        "fs",
        "device",
        "late",
};

Enabling KNL (kernel start-up parameter) initcall_debug allows us to trace initcalls as they are executed. This is useful for understanding the kernel boot sequence, debugging early kernel panics and measuring the amount of time each initcall spends initializing. This is a good way to track regressions in kernel boot times from release to release. If one or more initcalls start taking longer to run, it could be an indication that something might be wrong. Another useful KNL initcall_blacklist parameter can be used to specify a list of initcall functions to skip. This is useful for debugging built-in modules and initcalls. Please refer to The kernel’s command-line parameters. Let’s now look at how to enable `initcall_debug` and parse `dmesg` for initcalls. The first step is to make sure `CONFIG_PRINTK_TIME` and `CONFIG_KALLSYMS` kernel configuration options are enabled in your kernel. If not, you will have to build, install and boot a new kernel with these options enabled. Please refer to “A Beginner’s Guide to Linux Kernel Development” (LFD103) or other resources on how to build and boot a custom kernel. Once you have the right options enables, you can edit `/etc/default/grub` to append `printk.time=1 initcall_debug` to the `GRUB_CMDLINE_LINUX` variable, as shown below: GRUB_CMDLINE_LINUX="earlyprintk=vga printk.time=1 initcall_debug" Now let’s reboot and run `dmesg -t -x` command once the system is ready. I usually save the `dmesg` to a file to parse and analyze.

kern  :debug : initcall con_init+0x0/0x378 returned 0 after 0 usecs
kern  :debug : initcall hvc_console_init+0x0/0x1f returned 0 after 0 usecs
kern  :debug : initcall xen_cons_init+0x0/0x80 returned 0 after 0 usecs
kern  :debug : initcall univ8250_console_init+0x0/0x39 returned 0 after 0 usecs
kern  :debug : initcall kgdboc_earlycon_late_init+0x0/0x31 returned 0 after 0 usecs
kern  :debug : initcall init_hw_perf_events+0x0/0x6d8 returned 0 after 0 usecs
kern  :debug : initcall do_init_real_mode+0x0/0x1a returned 0 after 0 usecs
kern  :debug : initcall trace_init_perf_perm_irq_work_exit+0x0/0x1c returned 0 after 0 usecs

The following (original from https://elinux.org/Initcall_Debug) short script shows you the initcalls that are taking longer to complete: cat dmesg.out | grep "initcall" | sed "s/\(.*\)after\(.*\)/\2 \1/g" | sort -n Example from my system:

64564 usecs kern  :debug : initcall serial8250_init+0x0/0x245 returned 0 
87029 usecs kern  :debug : initcall btusb_driver_init+0x0/0xff0 [btusb] returned 0 
110960 usecs kern  :debug : initcall ahci_pci_driver_init+0x0/0xff0 [ahci] returned 0 
120384 usecs kern  :debug : initcall init_kprobe_trace+0x0/0x21c returned 0 
227677 usecs kern  :debug : initcall br_init+0x0/0x12a [bridge] returned 0 
281999 usecs kern  :debug : initcall init_module+0x0/0xff0 [raid6_pq] returned 0 
1104000 usecs kern  :debug : initcall acpi_init+0x0/0x647 returned 0 
2218805 usecs kern  :debug : initcall inet6_init+0x0/0x413 returned 0

Enabling `initcall_debug` increases the number of messages. Please ensure the `CONFIG_LOG_BUF_SHIFT` kernel config option is 18 to set the log buffer size to 256K. The kernel boot process can be sped up by enabling driver_async_probe KNL to run the driver probe in asynchronous mode. Using this option, a list of driver names can be specified to be probed asynchronously. Please note that not all drivers support asynchronous probing. If you would like to learn more about initcalls, please refer to “An introduction to Linux kernel initcalls.” `bootgraph.pl` script turns a `dmesg` output into a SVG graphic that shows which functions take how much time. You can view SVG graphics with various programs, including Inkscape, The Gimp and Firefox. I will leave this as a resource for you to explore as you learn more about initcalls. Here is a shell script I use to keep track of initcall times and errors.

====================================================================
#!/bin/bash
#
# SPDX-License-Identifier: GPL-2.0
#
# Copyright(c) Shuah Khan <skhan@linuxfoundation.org>
# License: GPLv2
#
# Generates wiki format table of initcall times list from
# dmesg | grep initcall output
# Boot kernel with printk.time=1 initcall_debug command line options 
# Usage: boot_initcall_tbl.sh <dmesg.out>
# See async_run_entry_fn() and Documentation/core-api/printk-formats.rst


# Print Table header
echo "Init Call times"
uname -a
cat /proc/cmdline
grep "Command line" $1 | awk -F ']' '{print $2}'
echo "===================================================================="
echo "Init calls that took longer to complete > 0 usecs"
echo "===================================================================="
echo "| **Init Call** | **Module** | **Retun Value** | **Time (uses)** |"
echo "===================================================================="
grep initcall $1 | awk -F ' ' '{ if (NF == 9 && $8 > 0) print "| " $4 " | --- | " $6 " | " $8 " | "}'
grep initcall $1 | awk -F ' ' '{ if (NF == 10 && $9 > 0) print "| " $4 " | " $5 " | " $7 " | " $9 " | "}'
echo "===================================================================="
echo "Failed Init calls (return value < 0)"
echo "===================================================================="
echo "| **Init Call** | **Module** | **Retun Value** | **Time (uses)** |"
echo "===================================================================="
grep initcall $1 | awk -F ' ' '{ if (NF == 9 && $6 < 0) print "| " $4 " | --- | " $6 " | " $8 " | "}'
grep initcall $1 | awk -F ' ' '{ if (NF == 10 && $7 < 0) print "| " $4 " | " $5 " | " $7 " | " $9 " | "}'
echo "===================================================================="
====================================================================

As you can see, once you have the initcall information, the data can be used to understand the boot sequence, debug problems and keep track of kernel boot times from one kernel release to another.

Shuah Khan is a Linux Foundation fellow and an upstream kernel maintainer and developer.