Month: November 2004

So in the past two days, my posts have contained a lot of code/text samples that have to be formatted in a fixed-width font and have their spacing preserved. Previously, I’ve just been using <pre></pre> tags around these samples. This works on blogs.sun.com, but I’ve found that it wreaks havoc with some RSS readers because the whitespace is not preserved. The newlines go missing, or whitespace disappears from the beginning of lines.

In an effort to be more RSS-friendly, I wrote a script that goes through and replaces spaces with &nbsp;, puts <br/> at the end of each line, and encloses the whole thing in <tt></tt> tags. The result is extremely ugly, but it seems to get the job done. I’m wondering, is this the best way to accomplish this? I’m not too familiar with RSS, so if anyone out there knows a better way that works for all varieties of RSS readers, please let me know. My googling abilities have yet to turn up anything…
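A minimal sketch of the kind of script described above, written in Python. The function name rss_friendly is hypothetical, and the HTML-escaping step is an assumption that the post does not mention; it is included so that characters such as < in a sample are not read as markup. Tabs are left untouched.

```python
import html


def rss_friendly(sample):
    """Convert a preformatted sample into markup that keeps its layout
    without relying on <pre> (hypothetical helper, not the original script)."""
    lines = []
    for line in sample.splitlines():
        # Assumption: escape &, < and > first so the sample's own text
        # is not interpreted as markup by the reader.
        escaped = html.escape(line, quote=False)
        # Replace every space so leading indentation survives,
        # and end each line with an explicit break.
        lines.append(escaped.replace(" ", "&nbsp;") + "<br/>")
    # Enclose the whole sample in <tt> to request a fixed-width font.
    return "<tt>" + "\n".join(lines) + "</tt>"
```

Escaping has to happen before the space replacement; otherwise the ampersand in each inserted &nbsp; would itself be escaped.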

Last post I talked about one of the annoying features of the amd64 ABI – the optional frame pointer. Today, I’ll examine the much more painful problem of argument passing on amd64. For the sake of discussion, I’ll avoid structure passing and floating point – nasty little kinks in the problem.

Argument Passing on i386

On i386, all arguments are passed on the stack. Before establishing a frame, the caller pushes each argument to the function in reverse order. This gives you this stack layout:

        ...
        arg1
        arg0
        return PC
        previous frame   <- %ebp
        current frame
                         <- %esp

If you want to access the third argument, you simply reference 16(%ebp) (8 for the frame + 8 to skip the first two args). This makes debugging a breeze. For any given frame pointer (easy to find thanks to the i386 ABI), we can always find the initial arguments to the function. Another trick we use is that nearly every function call is followed by an addl x, %esp instruction. Using this information, we can figure out how many arguments were passed to the function, without relying on CTF or STABS data. Putting this all together, it’s easy to get a meaningful stack trace:


        > a76de800::findstack -v
        stack pointer for thread a76de800: a77c5dd4
        [ a77c5dd4 0xfe81994d() ]
          a77c5dec swtch+0x1cb()
          a77c5e10 cv_wait_sig+0x12c(a78a79b0, a6c57028)
          a77c5e70 cte_get_event+0x4d()
          a77c5ea4 ctfs_endpoint_ioctl+0xc2()
          a77c5ec4 ctfs_bu_ioctl+0x2f()
          a77c5ee4 fop_ioctl+0x1e(a79a7980, 63746502, 80d3f48, 102001, a69daf08, a77c5f74)
          a77c5f80 ioctl+0x19b()
          a77c5fac sys_call+0x16e()
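The i386 arithmetic described above can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, the memory contents are made up, and it assumes every argument occupies exactly one 4-byte stack slot (no structures, floating point, or 64-bit values).

```python
WORD = 4  # bytes per stack slot on i386


def arg_offset(n):
    """Offset of argument n (0-based) from %ebp.

    The first two slots hold the saved %ebp and the return PC."""
    return 2 * WORD + n * WORD


def arg_count(addl_immediate):
    """Argument count implied by the `addl $imm, %esp` that follows a call,
    assuming one word per argument."""
    return addl_immediate // WORD


def read_args(memory, ebp, count):
    """Fetch `count` arguments for the frame at `ebp` from a
    {address: word} dictionary standing in for target memory."""
    return [memory[ebp + arg_offset(i)] for i in range(count)]


# Made-up frame: saved %ebp, return PC, then three arguments.
memory = {
    0x1000: 0x2000,  # previous frame pointer
    0x1004: 0xDEAD,  # return PC
    0x1008: 0xA,     # arg0
    0x100C: 0xB,     # arg1
    0x1010: 0xC,     # arg2
}
args = read_args(memory, 0x1000, arg_count(12))
```

Here arg_offset(2) evaluates to 16, matching the 16(%ebp) reference for the third argument.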

Argument Passing on AMD64

Enter amd64. As previously mentioned, the amd64 ABI was designed primarily for performance, not debugging. The architects decided that pushing arguments on the stack was expensive, and that with 16 general purpose registers, we might as well use some of them to pass arguments. Specifically, we have:

arg0 %rdi
arg1 %rsi
arg2 %rdx
arg3 %rcx
arg4 %r8
arg5 %r9
argN 8*(N-4)(%rbp), for N >= 6
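The table can be read as a small lookup function. The sketch below uses a hypothetical helper name, covers only integer and pointer arguments, and writes the stack case relative to the 64-bit frame pointer %rbp, which is only meaningful inside the callee once it has established a frame.

```python
# Registers used for the first six integer/pointer arguments, in order.
INT_ARG_REGS = ["%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"]


def arg_location(n):
    """Return where argument n (0-based) lives on entry to a function."""
    if n < len(INT_ARG_REGS):
        return INT_ARG_REGS[n]
    # Saved %rbp sits at 0(%rbp) and the return PC at 8(%rbp),
    # so the seventh argument (n == 6) starts at 16(%rbp).
    return "%d(%%rbp)" % (8 * (n - 4))
```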

This is a disaster for debugging. Debugging tools that operate in-place (DTrace and truss) can get meaningful arguments, but cannot know how many there are. Tools which examine a stack trace (pstack, mdb) cannot get arguments for any frame. The arguments may or may not be pushed on the stack, or they could be lost completely. If we try to get a stack with arguments, we find:


        > ffffffff8af1c720::findstack -v
        stack pointer for thread ffffffff8af1c720: ffffffffb2a51af0
          ffffffffb2a51d00 vpanic()
          ffffffffb2a51d30 0xfffffffffe972ae3()
          ffffffffb2a51d60 exitlwps+0x1f1()
          ffffffffb2a51dd0 proc_exit+0x40()
          ffffffffb2a51de0 exit+9()
          ffffffffb2a51e40 psig+0x2bc()
          ffffffffb2a51ee0 post_syscall+0x7d5()
          ffffffffb2a51f00 syscall_exit+0x5d()
          ffffffffb2a51f10 sys_syscall32+0x1d8()

The solution

The solution, as envisioned by the amd64 ABI designers, is to rely on DWARF to get the necessary information. If you have ever read the DWARF spec, you know that it is a gigantic, ugly beast – an interpreted language that can be used to mine virtually any debugging data in an abstract manner. The problem here is that it requires significantly more work than on i386, and it requires debugging information to be present in the target object.

Implementing a DWARF interpreter is technically quite doable. We even had one brave soul go so far as to implement a limited DWARF disassembler capable of grabbing arguments for functions. But it turns out that the sheer amount of data we would have to add to the kernel to enable this was prohibitive. The bloat would have pushed us past the limit of the miniroot, not to mention the increased memory footprint and necessary changes to krtld and KMDB. That’s not to say we won’t support it in userland some day.

The lack of an argument count is less serious. DTrace doesn’t need to know how many arguments there are. For the moment, truss simply shows the first 6 arguments always. But truss could be enhanced to use CTF and/or DWARF data to determine the number of arguments to a given function. But it probably won’t happen any time soon.

Workaround

Given that there will be no solution to this problem any time soon, you may ask how one can do any kind of debugging at all. The answer is “painfully”. I’ll walk through an example of finding the arguments to a function, using the following stack:


        > ffffffff8356c100::findstack -v
        stack pointer for thread ffffffff8356c100: ffffffffb2bbdb10
        [ ffffffffb2bbdb10 _resume_from_idle+0xe4() ]
          ffffffffb2bbdb40 swtch+0xc9()
          ffffffffb2bbdb90 cv_wait_sig+0x170()
          ffffffffb2bbdc50 cte_get_event+0xb0()
          ffffffffb2bbdc70 ctfs_endpoint_ioctl+0x7e()
          ffffffffb2bbdc80 ctfs_bu_ioctl+0x32()
          ffffffffb2bbdc90 fop_ioctl+0xb()
          ffffffffb2bbdd70 ioctl+0xac()
          ffffffffb2bbde00 dosyscall+0x12b()
          ffffffffb2bbdf00 trap+0x1308()
        >

Let’s say that we want to know the first argument to fop_ioctl(), which is a vnode. The first step is to look at the caller and see where the argument came from:


        > ioctl+0xac::dis -n 6
------> ioctl+0x8e:                     movq   0x10(%r12),%rdi
        ioctl+0x93:                     movq   0x1a0(%rax),%r8
        ioctl+0x9a:                     leaq   -0xcc(%rbp),%r9
        ioctl+0xa1:                     movq   %r15,%rdx
        ioctl+0xa4:                     movl   %r13d,%esi
------> ioctl+0xa7:                     call   +0xeed99 <fop_ioctl>
        ioctl+0xac:                     testl  %eax,%eax
        ioctl+0xae:                     movl   %eax,%ebx
        ioctl+0xb0:                     jne    +0x74    <ioctl+0x124>
        ioctl+0xb2:                     cmpl   $0x8004667e,%r13d
        ioctl+0xb9:                     je     +0x27    <ioctl+0xe0>
        ioctl+0xbb:                     movl   %r14d,%edi
        ioctl+0xbe:                     call   -0x1408e <releasef>

We can see that %rdi (the first argument) came from %r12. Looks like we lucked out – %r12 must be preserved by the function being called. So we look at fop_ioctl():


        > fop_ioctl::dis
        fop_ioctl:                      movq   0x40(%rdi),%rax
        fop_ioctl+4:                    pushq  %rbp
        fop_ioctl+5:                    movq   %rsp,%rbp
        fop_ioctl+8:                    call   *0x28(%rax)
        fop_ioctl+0xb:                  leave
        fop_ioctl+0xc:                  ret

No dice. We can see that %r12 (as well as %rdi) is still active at this point. Let’s keep looking:


        > ctfs_bu_ioctl::dis ! grep r12
        > ctfs_endpoint_ioctl::dis ! grep r12
        > cte_get_event::dis ! grep r12
        cte_get_event+0x13:             pushq  %r12
        cte_get_event+0x32:             movq   0x20(%rdi),%r12
        ...

Finally, we found a function that preserves %r12. Taking a closer look at cte_get_event():


        > cte_get_event::dis -n 8
        cte_get_event:                  pushq  %rbp
        cte_get_event+1:                movq   %rsp,%rbp
        cte_get_event+4:                pushq  %r15
        cte_get_event+6:                movl   %esi,%r15d
        cte_get_event+9:                pushq  %r14
        cte_get_event+0xb:              movq   %rcx,%r14
        cte_get_event+0xe:              pushq  %r13
        cte_get_event+0x10:             movl   %r9d,%r13d
        cte_get_event+0x13:             pushq  %r12

We can see that %r12 was pushed fourth after establishing the frame pointer. This would put it 32 bytes (0x20) below %rbp for this frame. Remembering that what was really passed was 0x10(%r12), we can finally find our original argument:


        > ffffffffb2bbdc50-20/K
        0xffffffffb2bbdc30:             ffffffff8330ec88
        > ffffffff8330ec88+10/K
        0xffffffff8330ec98:             ffffffff83a5f600
        > ffffffff83a5f600::print vnode_t v_path
        v_path = 0xffffffff83978c40 "/system/contract/process/pbundle"

Whew. We can see that we have the proper vnode, since the path references a /system/contract file. And all it took was about 12 steps! You can see how this has become such a pain for us kernel developers. From the above example, you can see the approximate method is:

  1. Determine where the argument came from in the caller. Hopefully, you will find something that came from the stack, or one of the callee-saved registers (%r12-%r15). If not, look at the function and see if the argument was pushed on the stack or moved somewhere more permanent. This doesn’t happen often, so it may be that your argument is lost forever.

  2. If the argument came from a callee-saved register, examine every function in the stack until you find one that saves the value.

  3. By this point, you’ve hopefully found a place where the value is stored relative to %rbp. Using the frame pointers displayed in the stack trace, fetch the value from the stack.
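The offset calculation in step 3 can be partly mechanized. The Python sketch below is a hypothetical helper, not an mdb feature. It counts pushq instructions after the frame is established and assumes the prologue begins with pushq %rbp and that nothing else adjusts %rsp between the pushes. The input is the cte_get_event prologue from the walkthrough.

```python
import re

PROLOGUE = [
    "cte_get_event:                  pushq  %rbp",
    "cte_get_event+1:                movq   %rsp,%rbp",
    "cte_get_event+4:                pushq  %r15",
    "cte_get_event+6:                movl   %esi,%r15d",
    "cte_get_event+9:                pushq  %r14",
    "cte_get_event+0xb:              movq   %rcx,%r14",
    "cte_get_event+0xe:              pushq  %r13",
    "cte_get_event+0x10:             movl   %r9d,%r13d",
    "cte_get_event+0x13:             pushq  %r12",
]


def saved_register_offset(prologue_lines, reg):
    """Return how many bytes below %rbp `reg` was pushed, or None."""
    pushes = 0
    seen_frame = False
    for line in prologue_lines:
        match = re.search(r"pushq\s+(%\w+)", line)
        if not match:
            continue
        if match.group(1) == "%rbp" and not seen_frame:
            # The push of the old frame pointer marks the frame base.
            seen_frame = True
            continue
        if seen_frame:
            pushes += 1  # each push moves 8 bytes further below %rbp
            if match.group(1) == reg:
                return 8 * pushes
    return None


frame_pointer = 0xFFFFFFFFB2BBDC50  # cte_get_event frame from the stack trace
offset = saved_register_offset(PROLOGUE, "%r12")
slot = frame_pointer - offset       # address to read the saved %r12 from
```

For this prologue the offset is 32 and the slot is ffffffffb2bbdc30, the address read with /K in the walkthrough.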

This is not always guaranteed to work, and is obviously a royal pain. In my next post, I’ll go into some future ideas we have to make this (and other debugging) better.

The amd64 port of Solaris has been available (internally) for about a month and a half, and the rest of the group is starting to realize what those of us on the project team have known for a while: debugging on amd64 is a royal pain. The difficulty comes not from processor changes, but from design choices made in the AMD64 ABI. The ABI was designed primarily with performance in mind – debuggability and observability were largely an afterthought. There are two features of the ABI that really hurt debuggability. In this post I’ll cover the less annoying of the two – look for another followup soon.

Frame Pointers

In the i386 ABI, you almost always have to establish a frame pointer for the current function (leaf routines being the exception). This gives you the familiar opening function sequence:


        pushl   %ebp
        movl    %esp, %ebp

And your frame ends up looking like this:

        ...
        arg1
        arg0
        return PC
        previous frame   <- %ebp
        current frame
                         <- %esp

This is a restriction of the ABI, not the processor. You can cheat by using the -fomit-frame-pointer flag to gcc, but this is not ABI compliant (although some people still think it’s a great idea).

The problem

With amd64, you would think that they would just keep this convention. At first glance it seems that way, until you find this little footnote in section 3.3.2:

The conventional use of %rbp as a frame pointer for the stack frame may be avoided by using %rsp (the stack pointer) to index into the stack frame. This technique saves two instructions in the prologue and epilogue and makes one additional general-purpose register (%rbp) available.

On amd64, the frame pointer is explicitly optional. To make debugging somewhat easier, they provide a .eh_frame ELF section that gives enough information (in the form of a binary search table) to traverse a stack from any point. This is slightly better than DWARF, but still requires a lot of processing. The problem with this is that it unnecessarily restricts the context from which you can gather a backtrace. On i386, your stack walking function is something like:


      frame = %ebp
      while (not at top of stack)
             process frame
             frame = *frame

Simple and straightforward. This omits a few nasty details like signal frames and #gp faults, but it’s largely correct. On amd64, you now have to load the .eh_frame section, process it, and keep it someplace where you have easy access to it. While this doesn’t sound so bad for gdb, it becomes a huge nightmare for something like DTrace. If you read a little bit of the technical details behind DTrace, you’ll understand that probes execute in arbitrary context. You may be in the middle of handling an interrupt, in dispatcher or VM code, or processing a trap (although on SPARC, DTracing code that executes at TL > 0 is strictly verboten). This means that the set of possible actions is severely limited, not to mention performance-critical. In order to process a stack() directive on amd64, we would now have to do something like:


        frame = %rbp
        while (not at top of stack)
                process frame
                for (each module in the system)
                        next = binary search in .eh_frame
                        if (next)
                                frame = next
                if (frame not found)
                        frame = *frame

Of course, you could maintain a merged lookup table for all modules on the system, but this is considerably more difficult and a maintenance nightmare. The real show stopper comes with the ustack() action. It is impossible, from arbitrary context within the kernel, to process the objects in userland and find the necessary debugging information. And unless we’re using only the pid provider, there’s no way to know a priori what processes we will need to examine via ustack(), so we can’t even cache the information ahead of time.
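For comparison, the frame-pointer walk that a mandatory frame pointer allows needs nothing beyond the stack itself. The Python sketch below simulates it over a made-up {address: word} dictionary. The function name is hypothetical, and it assumes the chain ends with a saved frame pointer of 0; as noted above, real walkers also have to handle signal frames and faults.

```python
def walk_frames(memory, frame_pointer, word=4):
    """Follow the saved-frame-pointer chain and collect
    (frame_pointer, return_pc) pairs, outermost frame last."""
    frames = []
    seen = set()
    while (frame_pointer != 0
           and frame_pointer in memory
           and frame_pointer not in seen):  # guard against corrupt loops
        seen.add(frame_pointer)
        # The return PC sits one word above the saved frame pointer.
        return_pc = memory.get(frame_pointer + word)
        frames.append((frame_pointer, return_pc))
        # frame = *frame
        frame_pointer = memory[frame_pointer]
    return frames


# Made-up three-frame stack; each frame stores [saved fp, return PC].
memory = {
    0x100: 0x200, 0x104: 0xAAAA,
    0x200: 0x300, 0x204: 0xBBBB,
    0x300: 0x0,   0x304: 0xCCCC,
}
trace = walk_frames(memory, 0x100)
```

Each iteration is one dereference with no table lookups, which is what makes this approach usable from arbitrary context.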

The solution

What do we do in Solaris? We punt. Our linkers will happily process .eh_frame sections correctly, but our debugging tools (DTrace, mdb, pstack, etc) will only understand executables that use a frame pointer. All of our code (kernel, libraries, binaries) is compiled with frame pointers, and hopefully our users will do so as well.

The amd64 ABI is still a work in progress, and the Solaris supplement is not yet finished. More language may be added to clarify the Solaris position on this “feature”. It will probably be a non-issue as long as GCC defaults to having frame pointers on amd64 Solaris. I’m not completely sure how the latest GCC behaves – I believe that it defaults to using frame pointers, which is good. I just hope -fomit-frame-pointer never becomes common practice as we move to OpenSolaris and a larger development community.

Motivation

Why was this written into the amd64 ABI? It’s a dubious optimization that severely hinders debuggability. Some research claims a substantial improvement, though their own data shows questionable gains. On i386, you at least had the advantage of increasing the number of usable registers by 20%. On amd64, adding a 17th general purpose register isn’t going to open up a whole new world of compiler optimizations. You’re just saving a pushl, movl, a series of operations that (for obvious reasons) is highly optimized on x86. And for leaf routines (which never establish a frame), this is a non-issue. Only in extreme circumstances does the cost (in processor time and I-cache footprint) translate to a tangible benefit – circumstances which usually resort to hand-coded assembly anyway. Given the benefit and the relative cost of losing debuggability, this hardly seems worth it.

It may seem a moot point, since you’ve been able to use -fomit-frame-pointer on i386 for years. The difference here is that on i386, you were knowingly breaking ABI compatibility by using that option. Your application was no longer guaranteed to work properly, especially when it came to debugging. On amd64, this behavior has received official blessing, so that your application can be ABI compliant but completely opaque to DTrace and mdb. I’m not looking forward to “DTrace can’t ustack() my gcc-compiled app” bugs (DTrace already has enough trouble dealing with curious gcc-isms as it is).

It’s conceivable that we could add support for this functionality in our userland tools, but don’t expect it any time soon. And it will never happen for DTrace. If you think saving a pushl, movl here or there is worth it, then you’re obviously so performance-oriented that debuggability is the last thing on your mind. I can understand some of our HPC customers needing this; it’s when people start compiling /usr/bin/* without frame pointers that it gets out of control. Just don’t be surprised when you try to DTrace your highly tuned app and find out you can’t get a proper stack trace…

Next post, I’ll discuss register passing conventions, which is a much more visible (and annoying) problem.

Despite what it may seem, I have not fallen off the face of the earth. Those of us in the Solaris kernel group have been a little busy lately, as we’re in the final stretch of Solaris 10. Hopefully, this post will be the start of a return to my old blogging form…

A few things I’ve been up to recently:

  • Fixing bugs. Lots of bugs. SMF, procfs, amd64, you name it.

    The only user-visible change is one of the SMF features: abbreviated FMRIs. Those of you out there never had to endure the dark ages where svcadm disable sendmail was not a valid command – you had to make sure to remember the entire FMRI (network/smtp:sendmail). This has since propagated to all the SMF commands, including svccfg(1M) and svcprop(1).

    On the not-so-visible side, one of the more interesting bugs I tracked down was a nasty hang in the kernel when running the GDB test suite. If a process with watchpoints enabled received a SIGKILL at an inopportune moment, it would descend into the dark pits of the kernel, never to return. Once OpenSolaris goes live, I’d love to blog about some of the crazy dances procfs has to do, but it’s incomprehensible without some source code to go along with it.

  • Tinkering with ZFS. I’m working part-time on a cool ZFS enhancement that (we hope) will do wonders for remote administration and zones virtualization. I’ll post some more as the details get fleshed out.

  • Solaris 10 launch. I was at the launch, talking about DTrace and sitting in on the Experts Exchange. It was a great event – full of great announcements and lots of customer enthusiasm. Once you see S10 in action, and experience it for yourself, it’s hard not to be enthusiastic.

It’s hard to surf tech news these days without hitting a Solaris 10 article, but one particularly interesting one is this Forrester analysis. You can also catch the HP backlash, with a mother-approved response, as well as an I-don’t-have-to-answer-to-my-boss response.

More tech-heavy posts are in the works…
