Intro to systemtap
Posted by Ian S. Nelson Wed, 30 Aug 2006 23:13:00 GMT
What is systemtap?
Systemtap is a tool built upon the kprobes framework for probing, debugging and analyzing the linux kernel at runtime. IBM contributed kprobes to linux and had it included during the 2.5 development tree in 2002. It's a supported and standard portion of the Linux kernel. Systemtap builds upon kprobes and creates a fairly easy to use tool that let's you probe just about anything and everything in the linux kernel with fairly simple commands, you can almost effortlessly modify any variable you wish on a running kernel. All with minimal performance impact. Essentially it's the linux version of dtrace the differences being that systemtap is probably more flexible in kernel space and dtrace currently does a lot more for probing userspace applications.
So how do I get this working?
It's fairly easy and straight forward to install. There are RPMs for Fedora, Redhat Enterprise and Suse based linux distributions. There is a slight trick, at least with Fedora Core 5, you need to have the kernel debug RPM also installed. Basically, it's just a set of object code files with debug symbols. I couldn't find it in the latest Fedora Core 5 and upon closer look the kernel RPM didn't have the debug package enabled by default. I placed the following in my kernel-2.6.spec from Fedora Core 5 and was able to build the kernel debug symbol package.
%define _enable_debug_packages 1
With that added to your kernel spec file you can use rpmbuild and rebuild it and it will emit the debug package.
What can you do now?
Now you can start probing stuff. The systemtap site has some fairly simple demos, one is to print the top 20 most common syscalls every 5 seconds.
#!/usr/bin/env stap
#
# This script continuously lists the top 20 systemcalls on the system
#
global syscalls
function print_top () {
cnt=0
log ("SYSCALL\t\t\t\tCOUNT")
foreach ([name] in syscalls-) {
printf("%-20s\t\t%5d\n",name, syscalls[name])
if (cnt++ == 20)
break
}
printf("--------------------------------------\n")
delete syscalls
}
probe kernel.function("sys_*") {
syscalls[probefunc()]++
}
# print top syscalls every 5 seconds
probe timer.ms(5000) {
print_top ()
}
This little probe will sleep for 5 seconds counting each system call as it happens and then dumps the list out and starts counting again. Here is a sample output:
SYSCALL COUNT sys_gettimeofday 771 sys_futex 637 sys_clock_gettime 214 sys_rt_sigprocmask 141 sys_select 136 sys_poll 106 sys_newstat 84 sys_ioctl 62 sys_read 48 sys_write 47 sys_fcntl 47 sys_rt_sigaction 38 sys_time 30 sys_getppid 26 sys_setitimer 16 sys_recvfrom 13 sys_rt_sigreturn 13 sys_nanosleep 12 sys_close 11 sys_open 10 sys_newfstat 6
Try altering the load on your system and seeing how that affects things. That's kind of interesting but not terribly useful. Try this version:
#! /usr/bin/env stap
#
# This script continuously lists the top 20 systemcalls on the system
#
global syscalls
function print_top () {
cnt=0
log ("SYSCALL\t\t\t\tCOUNT")
foreach ([name] in syscalls-) {
printf("%-20s\t\t%5d\n",name, syscalls[name])
if (cnt++ == 20)
break
}
printf("--------------------------------------\n")
delete syscalls
}
probe kernel.function("sys_*") {
if (target() == tid())
syscalls[probefunc()]++
}
# print top syscalls every 5 seconds
probe timer.ms(5000) {
print_top ()
}
I added a line to the probe which checks to see if the target is a value that system let's you pass in. Now you can add a -x argument and the pid of a process and it will perform the same probes only against that particular process. I have a postgresql database running and ps tells me that pid 16679 is the writer thread.
... postgres 16679 0.0 4.4 199916 91000 ? S Jul21 0:00 postgres: writer process ...
Let's look at it.
SYSCALL COUNT sys_getppid 25 sys_time 25 sys_select 25
over and over and over, since the database isn't doing anything this makes sense, it's just waiting for something to tell it what to do. You can attach strace to that process (strace -p 16679) and see that it is correct. Let's distburb the database and make it do some I/O and see how it behaves. I'll insert a record into a table.
SYSCALL COUNT sys_getppid 25 sys_time 25 sys_select 25 sys_lseek 4 sys_write 4
That makes some sense, one record insereted resulted in 4 seeks and 4 writes. Maybe that's a write to an index, a couple writes to b+tree nodes and then an actual record being written, or maybe it is writing to a journal file. I don't know, but it sounds kind of reasonable and I haven't looked at the postgresql code yet. In fact we haven't looked at any source code yet or really compiled anything. This is kind of a cheap example since strace and the ptrace API can provide all of this information (although there is considerable impact on system performance with them) so here is something a bit harder to do.
#! /usr/bin/env stap
global comps
global compsizes
probe kernel.function("memcmp")
{
comps++
compsizes = compsizes + $count
}
probe timer.ms(5000)
{
printf("memcmp call count bytes compared\n")
printf("%d %d\n", comps, compsizes)
comps = 0
compsizes = 0
}
This probe monitors the kernel's memcmp function which compares two buffers of memory. $count is an argument to that function which contains the number of bytes to compare.
memcmp call count bytes compared 17151 114214 memcmp call count bytes compared 5455 34990 memcmp call count bytes compared 17165 114269 memcmp call count bytes compared 5398 34817
As you can see memcmp is called a lot and on average to compare about 6 bytes worth of data. I wonder what it sort of oscillates? I'll have to look in to that.
Here is another simple one, __kmalloc is used to allocate memory in the kernel.
#! /usr/bin/env stap
global allocs
global allocsizes
probe kernel.function("__kmalloc")
{
allocs ++
allocsizes = allocsizes + $size
}
probe timer.ms(5000)
{
printf("malloc count bytes malloced \n")
printf("%d %d \n", allocs, allocsizes)
allocs = 0
allocsizes = 0
}
which produces something looking like this:
malloc count bytes malloced 16 30128 malloc count bytes malloced 36 69312 malloc count bytes malloced 76 64469 malloc count bytes malloced 28 56651So every 5 seconds this particular system is doing about 40 mallocs and it's averaging to about 5K per malloc if my napkin math is correct.
That's just a taste, next week I'll do a more interesting example.
