Intro to systemtap
Posted by Ian S. Nelson Wed, 30 Aug 2006 23:13:00 GMT
What is systemtap?
Systemtap is a tool built upon the kprobes framework for probing, debugging and analyzing the linux kernel at runtime. IBM contributed kprobes to linux and had it included during the 2.5 development tree in 2002. It's a supported and standard portion of the Linux kernel. Systemtap builds upon kprobes and creates a fairly easy to use tool that let's you probe just about anything and everything in the linux kernel with fairly simple commands, you can almost effortlessly modify any variable you wish on a running kernel. All with minimal performance impact. Essentially it's the linux version of dtrace the differences being that systemtap is probably more flexible in kernel space and dtrace currently does a lot more for probing userspace applications.
So how do I get this working?
It's fairly easy and straight forward to install. There are RPMs for Fedora, Redhat Enterprise and Suse based linux distributions. There is a slight trick, at least with Fedora Core 5, you need to have the kernel debug RPM also installed. Basically, it's just a set of object code files with debug symbols. I couldn't find it in the latest Fedora Core 5 and upon closer look the kernel RPM didn't have the debug package enabled by default. I placed the following in my kernel-2.6.spec from Fedora Core 5 and was able to build the kernel debug symbol package.
%define _enable_debug_packages 1
With that added to your kernel spec file you can use rpmbuild and rebuild it and it will emit the debug package.
What can you do now?
Now you can start probing stuff. The systemtap site has some fairly simple demos, one is to print the top 20 most common syscalls every 5 seconds.
#!/usr/bin/env stap
#
# This script continuously lists the top 20 systemcalls on the system
#
global syscalls
function print_top () {
cnt=0
log ("SYSCALL\t\t\t\tCOUNT")
foreach ([name] in syscalls-) {
printf("%-20s\t\t%5d\n",name, syscalls[name])
if (cnt++ == 20)
break
}
printf("--------------------------------------\n")
delete syscalls
}
probe kernel.function("sys_*") {
syscalls[probefunc()]++
}
# print top syscalls every 5 seconds
probe timer.ms(5000) {
print_top ()
}
This little probe will sleep for 5 seconds counting each system call as it happens and then dumps the list out and starts counting again. Here is a sample output:
SYSCALL COUNT sys_gettimeofday 771 sys_futex 637 sys_clock_gettime 214 sys_rt_sigprocmask 141 sys_select 136 sys_poll 106 sys_newstat 84 sys_ioctl 62 sys_read 48 sys_write 47 sys_fcntl 47 sys_rt_sigaction 38 sys_time 30 sys_getppid 26 sys_setitimer 16 sys_recvfrom 13 sys_rt_sigreturn 13 sys_nanosleep 12 sys_close 11 sys_open 10 sys_newfstat 6
Try altering the load on your system and seeing how that affects things. That's kind of interesting but not terribly useful. Try this version:
#! /usr/bin/env stap
#
# This script continuously lists the top 20 systemcalls on the system
#
global syscalls
function print_top () {
cnt=0
log ("SYSCALL\t\t\t\tCOUNT")
foreach ([name] in syscalls-) {
printf("%-20s\t\t%5d\n",name, syscalls[name])
if (cnt++ == 20)
break
}
printf("--------------------------------------\n")
delete syscalls
}
probe kernel.function("sys_*") {
if (target() == tid())
syscalls[probefunc()]++
}
# print top syscalls every 5 seconds
probe timer.ms(5000) {
print_top ()
}
I added a line to the probe which checks to see if the target is a value that system let's you pass in. Now you can add a -x argument and the pid of a process and it will perform the same probes only against that particular process. I have a postgresql database running and ps tells me that pid 16679 is the writer thread.
... postgres 16679 0.0 4.4 199916 91000 ? S Jul21 0:00 postgres: writer process ...
Let's look at it.
SYSCALL COUNT sys_getppid 25 sys_time 25 sys_select 25
over and over and over, since the database isn't doing anything this makes sense, it's just waiting for something to tell it what to do. You can attach strace to that process (strace -p 16679) and see that it is correct. Let's distburb the database and make it do some I/O and see how it behaves. I'll insert a record into a table.
SYSCALL COUNT sys_getppid 25 sys_time 25 sys_select 25 sys_lseek 4 sys_write 4
That makes some sense, one record insereted resulted in 4 seeks and 4 writes. Maybe that's a write to an index, a couple writes to b+tree nodes and then an actual record being written, or maybe it is writing to a journal file. I don't know, but it sounds kind of reasonable and I haven't looked at the postgresql code yet. In fact we haven't looked at any source code yet or really compiled anything. This is kind of a cheap example since strace and the ptrace API can provide all of this information (although there is considerable impact on system performance with them) so here is something a bit harder to do.
#! /usr/bin/env stap
global comps
global compsizes
probe kernel.function("memcmp")
{
comps++
compsizes = compsizes + $count
}
probe timer.ms(5000)
{
printf("memcmp call count bytes compared\n")
printf("%d %d\n", comps, compsizes)
comps = 0
compsizes = 0
}
This probe monitors the kernel's memcmp function which compares two buffers of memory. $count is an argument to that function which contains the number of bytes to compare.
memcmp call count bytes compared 17151 114214 memcmp call count bytes compared 5455 34990 memcmp call count bytes compared 17165 114269 memcmp call count bytes compared 5398 34817
As you can see memcmp is called a lot and on average to compare about 6 bytes worth of data. I wonder what it sort of oscillates? I'll have to look in to that.
Here is another simple one, __kmalloc is used to allocate memory in the kernel.
#! /usr/bin/env stap
global allocs
global allocsizes
probe kernel.function("__kmalloc")
{
allocs ++
allocsizes = allocsizes + $size
}
probe timer.ms(5000)
{
printf("malloc count bytes malloced \n")
printf("%d %d \n", allocs, allocsizes)
allocs = 0
allocsizes = 0
}
which produces something looking like this:
malloc count bytes malloced 16 30128 malloc count bytes malloced 36 69312 malloc count bytes malloced 76 64469 malloc count bytes malloced 28 56651So every 5 seconds this particular system is doing about 40 mallocs and it's averaging to about 5K per malloc if my napkin math is correct.
That's just a taste, next week I'll do a more interesting example.

Pardon the look of the data, I’ll get tate to fix the CSS to have fixed width fonts for the fixed width data…
Thanks for the nice intro!
As one of the SystemTap developers, I have a few comments that might give you more success.
First, for the debuginfo, Fedora does provide the necessary files in a separate package called
kernel-debuginfo. You can useyumto install this, though you may need to enable extra repositories (e.g.,core-debuginfoand/orupdates-debuginfo). For example:Make sure that the version you grab matches your running kernel!
I would also recommend that you look into the stats capability. Not only does it give you access to functions like min, max, avg, sum, etc., but it also uses percpu data structures for better scalability. Consider the following “port” of your kmalloc example:
#! /usr/bin/env stap global allocs probe kernel.function("__kmalloc") { allocs <<< $size } probe timer.ms(5000) { count = @count(allocs) sum = count ? @sum(allocs) : 0 printf("malloc count bytes malloced\n") printf("%d %d\n", count, sum) delete allocs }Please continue to blog your SystemTap experiences, and feel free to join us on the mailing list as well…
Very cool, thanks for the pointers. Thanks for commenting too. I can’t tell you how many times I’ve baked a kernel just to measure something or tweak it. It’s a great tool, thanks for working on it.
I started looking at some of the taps in /usr/share/systemtap but I wanted to get an article out and couldn’t figure out a couple of problems I kept running in to. In the process I joined the mailing list because I think I may have found a slight bug that I’ll be submitting a patch to fix as soon as I verify it.
I’ll definitely blog some more and try to generate some more interest, like I said, it’s an incredibly cool tool.
It is fixed.
So I was having the hardest time probing my e1000 driver. Turns out there is a bug in the Linux kernel. /sys/modules/foo/sections was limited to 32 characters and many of the actual object sections in modules are quite a bit longer than that. Andrew and Linus have accepted my patch and it should be in the next mm tree and Linus tree.