DTrace and customer service

February 1, 2005

Today, I thought I’d share a real-world experience that might portray DTrace in a slightly different light than you’re used to. The other week, I was helping a customer with the following question:

Why is automountd constantly taking up 1.2% of CPU time?

The first thought that came to mind was a broken automountd. But if that were the case, you’d be more likely to see it spinning and stealing 100% of the CPU. Just to be safe, I asked the customer to send truss -u a.out:: output for the automountd process. As expected, I saw automountd chugging away, happily servicing each request as it came in. Automountd was doing nothing wrong – some process was indirectly sending millions of requests a day to the automounter. Taking a brief look at the kernel code, I responded with the following D script:

#!/usr/sbin/dtrace -s

auto_lookup_request:entry
{
        @lookups[execname, stringof(args[0]->fi_path)] = count();
}

The customer gave it a shot, and found a misbehaving program that was continuously restarting and causing loads of automount activity. Without any further help from me, the customer could easily see exactly which application was the source of the problem, and quickly fixed the misconfiguration.
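Since the culprit here was a program that kept restarting, a variant of the script that also records the parent process ID can help point at whatever keeps launching it. The following is only a sketch under some assumptions: it reuses the same auto_lookup_request probe and fi_path field as above, while the ppid key, the ten-second interval, and the column widths are my own additions for illustration and not what was sent to the customer.

```d
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Count automount lookups per program, parent PID, and requested path. */
auto_lookup_request:entry
{
        @lookups[execname, ppid, stringof(args[0]->fi_path)] = count();
}

/* Every 10 seconds (an arbitrary interval), print the counts and start over. */
tick-10s
{
        printa("%-16s %-8d %-32s %@d\n", @lookups);
        trunc(@lookups);
}
```

Printing at intervals rather than only on exit makes it easier to see whether the request rate is steady, and the parent PID gives a starting point even when the requesting process itself is gone by the time you look.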

Afterwards, I reflected on how simple this exchange was, and how difficult it would have been in the pre-Solaris 10 days. Now, I don’t expect customers to be able to come up with the above D script on their own (though industrious admins will soon be able to wade through OpenSolaris code). But I was able to resolve their problem in just two emails. I was reminded of the infamous gtik2_applet2 fiasco described in the DTrace USENIX paper – automountd was just a symptom of an underlying problem, part of an interaction that was prohibitively difficult to trace to its source. One could turn on automountd debug output, but you’d still only see the request itself, not where it came from. To top it off, the offending processes were so short-lived that they never showed up in prstat(1) output, hiding from traditional system-wide tools.

After a little thought, I imagined a few Solaris 9 scenarios where I’d either set a kernel breakpoint via kadb, or set a user breakpoint in automountd and use mdb -k to see which threads were waiting for a response. But these (and all other solutions I came up with) were:

Disruptive to the running system
Not guaranteed to isolate the particular problem
Difficult for the customer to understand and execute

It really makes me feel the pain our customer support staff must go through now to support Solaris 8 and Solaris 9. DTrace is such a fundamental change in the debugging and observability paradigm that it changes not only the way we kernel engineers work, but also the way people develop applications, administer machines, and support customers. Too bad we can’t EOL Solaris 8 and Solaris 9 next week for the benefit of Sun support…

6 Responses

Rayson Ho says:

February 2, 2005 at 6:09 am

Dead link:
http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf”
Shouldn’t it be?
http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf
Eric Schrock says:

February 2, 2005 at 9:28 am

Yep, fixed now. Thanks for catching that.
Alan Hargreaves says:

February 3, 2005 at 8:33 pm

In fact you could have done this purely from the command line …

$ dtrace -n 'auto_lookup_request:entry{@lookups[execname, stringof(args[0]->fi_path)] = count();}'
dtrace: description 'auto_lookup_request:entry' matched 1 probe
^C
ls /clones 1
csh /clones 5
dtwm /home 5

I edited the above for clarity (removing some spaces).
Alan.
Eric Schrock says:

February 3, 2005 at 10:52 pm

You got me – that’s what I actually sent to the customer. But the script looks nicer for illustrative purposes 😉
iwan rahabok says:

March 22, 2005 at 6:04 am

You might want to consider giving more examples. I’m doing a Solaris 10 training for 40+ IT folks of my key customers in April, and example like this is certainly useful to drive the point. Thank you. Iwan.
