<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="/stylesheets/rss.css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>ClearNet Security: Intro to systemtap</title>
    <link>http://blog.clearnetsec.com/articles/2006/08/30/intro-to-systemtap</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description></description>
    <item>
      <title>Intro to systemtap</title>
      <description>&lt;h2&gt;What is systemtap?&lt;/h2&gt;
&lt;p&gt;
&lt;a href=http://sources.redhat.com/systemtap/&gt;Systemtap&lt;/a&gt; is a tool built upon the kprobes framework for probing, debugging and analyzing the &lt;a href=http://www.kernel.org/&gt;linux kernel&lt;/a&gt; at runtime.   &lt;a href=http://www.ibm.com&gt;IBM&lt;/a&gt; contributed &lt;a href=http://sources.redhat.com/systemtap/kprobes/&gt;kprobes&lt;/a&gt; to linux and had it included during the 2.5 development tree in 2002.   It's a supported and standard portion of the Linux kernel.  Systemtap builds upon kprobes and creates a fairly easy to use tool that let's you probe just about anything and everything in the linux kernel with fairly simple commands, you can almost effortlessly modify any variable you wish on a running kernel.  All with minimal performance impact.  Essentially it's the linux version of dtrace the differences being that systemtap is probably more flexible in kernel space and dtrace currently does a lot more for probing userspace applications.
&lt;p&gt;
&lt;h2&gt;So how do I get this working?&lt;/h2&gt;
&lt;p&gt;
It's fairly easy and straight forward to install.  There are RPMs for Fedora, Redhat Enterprise and Suse based linux distributions.   There is a slight trick, at least with Fedora Core 5,  you need to have the kernel debug RPM also installed.   Basically, it's just a set of object code files with debug symbols.   I couldn't find it in the latest Fedora Core 5 and upon closer look the kernel RPM didn't have the debug package enabled by default.   I placed the following in my kernel-2.6.spec from Fedora Core 5 and was able to build the kernel debug symbol package.
&lt;p&gt;
&lt;pre&gt;
%define _enable_debug_packages 1
&lt;/pre&gt;
&lt;p&gt;
With that added to your kernel spec file you can use rpmbuild and rebuild it and it will emit the debug package.
&lt;p&gt;
&lt;h2&gt;
What can you do now?
&lt;/h2&gt;
&lt;p&gt;
Now you can start probing stuff.   The systemtap site has some fairly simple demos,  one is to print the top 20 most common syscalls every 5 seconds.
&lt;p&gt;
&lt;pre&gt;
#!/usr/bin/env stap
#
# This script continuously lists the top 20 systemcalls on the system
#

global syscalls

function print_top () {
        cnt=0
        log ("SYSCALL\t\t\t\tCOUNT")
        foreach ([name] in syscalls-) {
                printf("%-20s\t\t%5d\n",name, syscalls[name])
                if (cnt++ == 20)
                        break
        }
        printf("--------------------------------------\n")
        delete syscalls
}

probe kernel.function("sys_*") {
        syscalls[probefunc()]++
}

# print top syscalls every 5 seconds
probe timer.ms(5000) {
        print_top ()
}
&lt;/pre&gt;

&lt;p&gt;
This little probe will sleep for 5 seconds counting each system call as it happens and then dumps the list out and  starts counting again.  Here is a sample output:
&lt;p&gt;
&lt;pre&gt;
SYSCALL                         COUNT
sys_gettimeofday                  771
sys_futex                         637
sys_clock_gettime                 214
sys_rt_sigprocmask                141
sys_select                        136
sys_poll                          106
sys_newstat                        84
sys_ioctl                          62
sys_read                           48
sys_write                          47
sys_fcntl                          47
sys_rt_sigaction                   38
sys_time                           30
sys_getppid                        26
sys_setitimer                      16
sys_recvfrom                       13
sys_rt_sigreturn                   13
sys_nanosleep                      12
sys_close                          11
sys_open                           10
sys_newfstat                        6
&lt;/pre&gt;

&lt;p&gt;
Try altering the load on your system and seeing how that affects things.   That's kind of interesting but not terribly useful.  Try this version:
&lt;p&gt;
&lt;pre&gt;
#! /usr/bin/env stap
#
# This script continuously lists the top 20 systemcalls on the system
#

global syscalls

function print_top () {
        cnt=0
        log ("SYSCALL\t\t\t\tCOUNT")
        foreach ([name] in syscalls-) {
                printf("%-20s\t\t%5d\n",name, syscalls[name])
                if (cnt++ == 20)
                        break
        }
        printf("--------------------------------------\n")
        delete syscalls
}

probe kernel.function("sys_*") {
        if (target() == tid())
                syscalls[probefunc()]++
}

# print top syscalls every 5 seconds
probe timer.ms(5000) {
        print_top ()


}
&lt;/pre&gt;

&lt;p&gt;
I added a line to the probe which checks to see if the target is a value that system let's you pass in.    Now you can add a -x argument and the pid of a process and it will perform the same probes only against that particular process.  I have a postgresql database running and ps tells me that pid 16679 is the writer thread.
&lt;p&gt;
&lt;pre&gt;
...
postgres 16679  0.0  4.4 199916 91000 ?        S    Jul21   0:00 postgres: writer process
...
&lt;/pre&gt;
&lt;p&gt;

Let's look at it.
&lt;pre&gt;
SYSCALL                         COUNT
sys_getppid                        25
sys_time                           25
sys_select                         25
&lt;/pre&gt;
&lt;p&gt;
over and over and over, since the database isn't doing anything this makes sense, it's just waiting for something to tell it what to do.  You can attach strace to that process (strace -p 16679) and see that it is correct.  Let's distburb the database and make it do some I/O and see how it behaves.  I'll insert a record into a table.
&lt;p&gt;
&lt;pre&gt;
SYSCALL                         COUNT
sys_getppid                        25
sys_time                           25
sys_select                         25
sys_lseek                           4
sys_write                           4
&lt;/pre&gt;

&lt;p&gt;
That makes some sense,  one record insereted resulted in 4 seeks and 4 writes.  Maybe that's a write to an index, a couple writes to b+tree nodes and then an actual record being written, or maybe it is writing to a journal file.  I don't know, but it sounds kind of reasonable and I haven't looked at the postgresql code yet.  In fact we haven't looked at any source code yet or really compiled anything.   This is kind of a cheap example since strace and the ptrace API can provide all of this information (although there is considerable impact on system performance with them)  so here is something a bit harder to do.  
&lt;p&gt;
&lt;pre&gt;
#! /usr/bin/env stap
global comps
global compsizes
probe kernel.function("memcmp")
{
                comps++
                compsizes = compsizes + $count
}

probe timer.ms(5000)
{
        printf("memcmp call count      bytes compared\n")
                    printf("%d                    %d\n", comps, compsizes)
        comps = 0
        compsizes = 0
}
&lt;/pre&gt;
&lt;p&gt;
This probe monitors the kernel's memcmp function which compares two buffers of memory.  $count is an argument to that function which contains the number of bytes to compare.  

&lt;pre&gt;
memcmp call count      bytes compared
17151                    114214
memcmp call count      bytes compared
5455                    34990
memcmp call count      bytes compared
17165                    114269
memcmp call count      bytes compared
5398                    34817
&lt;/pre&gt;

&lt;p&gt;
As you can see memcmp is called a lot and on average to compare about 6 bytes worth of data.    I wonder what it sort of oscillates?  I'll have to look in to that.
&lt;p&gt;
Here is another simple one, __kmalloc is used to allocate memory in the kernel.
&lt;pre&gt;
#! /usr/bin/env stap
global allocs
global allocsizes


probe kernel.function("__kmalloc")
{
        allocs ++
        allocsizes = allocsizes + $size
}


probe timer.ms(5000)
{
        printf("malloc count      bytes malloced   \n")
        printf("%d                     %d              \n", allocs, allocsizes)
        allocs = 0
        allocsizes = 0
}

&lt;/pre&gt;

which produces something looking like this:
&lt;pre&gt;
malloc count      bytes malloced 
16                     30128              
malloc count      bytes malloced
36                     69312              
malloc count      bytes malloced    
76                     64469             
malloc count      bytes malloced    
28                     56651              
&lt;/pre&gt;
So every 5 seconds this particular system is doing about 40 mallocs and it's averaging to about 5K per malloc if my napkin math is correct. 
&lt;p&gt;
That's just a taste, next week I'll do a more interesting example.


</description>
      <pubDate>Wed, 30 Aug 2006 17:13:00 -0600</pubDate>
      <guid isPermaLink="false">urn:uuid:faa4a008-f42a-40e1-88d9-6827e25aceca</guid>
      <author>ian@ClearNetSec.com (Ian S. Nelson)</author>
      <link>http://blog.clearnetsec.com/articles/2006/08/30/intro-to-systemtap</link>
      <category>linux</category>
      <category>systemtap</category>
      <category>kernel</category>
      <category>profiling</category>
    </item>
    <item>
      <title>"Intro to systemtap" by Ian</title>
      <description>&lt;p&gt;It is fixed.   &lt;/p&gt;

&lt;p&gt;So I was having the hardest time probing my e1000 driver.  Turns out there is a bug in the Linux kernel.  /sys/modules/foo/sections was limited to 32 characters and many of the actual object sections in modules are quite a bit longer than that.   Andrew and Linus have accepted my patch and it should be in the next mm tree and Linus tree.&lt;/p&gt;</description>
      <pubDate>Fri, 01 Sep 2006 19:34:14 -0600</pubDate>
      <guid isPermaLink="false">urn:uuid:c1dcd0a4-de82-4e7e-acbc-d29a6340bf08</guid>
      <link>http://blog.clearnetsec.com/articles/2006/08/30/intro-to-systemtap#comment-16</link>
    </item>
    <item>
      <title>"Intro to systemtap" by Ian</title>
      <description>&lt;p&gt;Very cool, thanks for the pointers.   Thanks for commenting too.  I can&amp;#8217;t tell you how many times I&amp;#8217;ve baked a kernel just to measure something or tweak it.  It&amp;#8217;s a great tool, thanks for working on it.&lt;/p&gt;

&lt;p&gt;I started looking at some of the taps in /usr/share/systemtap but I wanted to get an article out and couldn&amp;#8217;t figure out a couple of problems I kept running in to.   In the process I joined the mailing list because I think I may have found a slight bug that I&amp;#8217;ll be submitting a patch to fix as soon as I verify it.    &lt;/p&gt;

&lt;p&gt;I&amp;#8217;ll definitely blog some more and try to generate some more interest,  like I said, it&amp;#8217;s an incredibly cool tool.&lt;/p&gt;</description>
      <pubDate>Thu, 31 Aug 2006 01:29:11 -0600</pubDate>
      <guid isPermaLink="false">urn:uuid:5e347615-7bf0-492a-9e24-a40459144cb1</guid>
      <link>http://blog.clearnetsec.com/articles/2006/08/30/intro-to-systemtap#comment-15</link>
    </item>
    <item>
      <title>"Intro to systemtap" by Josh Stone</title>
      <description>&lt;p&gt;Thanks for the nice intro!&lt;/p&gt;

&lt;p&gt;As one of the SystemTap developers, I have a few comments that might give you more success.&lt;/p&gt;

&lt;p&gt;First, for the debuginfo, Fedora does provide the necessary files in a separate package called &lt;code&gt;kernel-debuginfo&lt;/code&gt;.  You can use &lt;code&gt;yum&lt;/code&gt; to install this, though you may need to enable extra repositories (e.g., &lt;code&gt;core-debuginfo&lt;/code&gt; and/or &lt;code&gt;updates-debuginfo&lt;/code&gt;).  For example:&lt;/p&gt;

&lt;pre&gt;yum --enablerepo=updates-debuginfo install kernel-debuginfo&lt;/pre&gt;

&lt;p&gt;Make sure that the version you grab matches your running kernel!&lt;/p&gt;

&lt;p&gt;I would also recommend that you look into the stats capability.  Not only does it give you access to functions like min, max, avg, sum, etc., but it also uses percpu data structures for better scalability.  Consider the following &amp;#8220;port&amp;#8221; of your kmalloc example:&lt;/p&gt;

&lt;pre&gt;#! /usr/bin/env stap
global allocs

probe kernel.function("__kmalloc")
{
    allocs &amp;lt;&amp;lt;&amp;lt; $size
}

probe timer.ms(5000)
{
    count = @count(allocs)
    sum = count ? @sum(allocs) : 0
    printf("malloc count      bytes malloced\n")
    printf("%d                     %d\n", count, sum)
    delete allocs
}&lt;/pre&gt;

&lt;p&gt;Please continue to blog your SystemTap experiences, and feel free to join us on the mailing list as well&amp;#8230;&lt;/p&gt;</description>
      <pubDate>Wed, 30 Aug 2006 21:56:07 -0600</pubDate>
      <guid isPermaLink="false">urn:uuid:524cc8fa-4a27-4f31-a8bd-07d5f5f9485e</guid>
      <link>http://blog.clearnetsec.com/articles/2006/08/30/intro-to-systemtap#comment-12</link>
    </item>
    <item>
      <title>"Intro to systemtap" by Ian</title>
      <description>&lt;p&gt;Pardon the look of the data,  I&amp;#8217;ll get tate to fix the CSS to have fixed width fonts for the fixed width data&amp;#8230;&lt;/p&gt;</description>
      <pubDate>Wed, 30 Aug 2006 14:15:47 -0600</pubDate>
      <guid isPermaLink="false">urn:uuid:f283a194-a22b-4cc7-b490-2694fa4c8adb</guid>
      <link>http://blog.clearnetsec.com/articles/2006/08/30/intro-to-systemtap#comment-1</link>
    </item>
  </channel>
</rss>
