Smalltalk/X and STX:LIBJAVA Debugging Session
STX:LIBJAVA is gradualy becoming more and more stable, but there are still few bugs, which can cause the virtual machine (VM, runtime) to crash.
Some of the crashes are caused by the actual STX:LIBJAVA code. Some are caused by bugs in the VM, since STX:LIBJAVA stresses the runtime in a different way than the standard Smalltalk code. And also, some bugs can be brought in by used libraries and compilers. In past, STX:LIBJAVA has revealed bugs in the VM that were older than 10 years. And one particularly peculiar bug was found in gcc. Ask Jan Vraný how much fun he had for those 2 sleepless weeks he’s spent debugging.
Maybe some of you would be interested in how we (well, Jan. I, for most of the time, am involved with introducing the bug) debug these crashes. We will show you how to find the cause and interpret the problem. In this article you’ll be honored to watch Jan Vraný debugging :)
Actual Session
Imagine we were playing with STX:LIBJAVA and the VM crashed. This is what we might see in the terminal:
Program received signal SIGSEGV, Segmentation fault.
0x5107b104 in ?? ()
Because this is not the first time it crashed, we made sure to start it in the gdb:
$> projects/smalltalk/smalltalk --gdb
(gdb) run -I --quick
Now we can examine the state of the system at the moment of the crash. Let’s look around to see what’s the state of the program:
(gdb) info registers
eax 0x5107b104 1359458564
ecx 0xf1600b83 -245363837
edx 0xf3a23210 -207474160
ebx 0xf3a1fff4 -207486988
esp 0xf068d50c 0xf068d50c
ebp 0x3b 0x3b
esi 0x3e 62
edi 0xf38836e0 -209176864
eip 0x5107b104 0x5107b104
eflags 0x210206 [ PF IF RF ID ]
cs 0x23 35
ss 0x2b 43
ds 0x2b 43
es 0x2b 43
Ok so this, at least for me, is not showing anything suspecting. We will look at the code are we currently executing:
(gdb) disas $pc - 10, $pc + 10
Dump of assembler code from 0x5107b0fa to 0x5107b10e:
0x5107b0fa: add %al,(%eax)
0x5107b0fc: add %al,(%eax)
0x5107b0fe: add %al,(%eax)
0x5107b100: add %al,(%eax)
0x5107b102: add %al,(%eax)
=> 0x5107b104: adc $0xc7,%al
0x5107b106: pop %es
0x5107b107: push %ecx
0x5107b108: mov %al,(%eax)
0x5107b10a: add %al,(%eax)
0x5107b10c: add %eax,(%eax)
End of assembler dump.
This was more helpful. We see, that we must have jumped to the data part of the memory, because code like that is too funny to be generated by any reasonable compiler. We can verify that by checking memory layout:
(gdb) call dumpSpaceInfo()
MEM: new: f1600000 .. f16efd34 .. f1800000 surv: f1200000 .. f1400000
MEM: old: 50000000 .. 546007b8 .. 5986a000 cold: 0 .. 0
MEM: rem: f07a0000 .. f07a04c8 .. f07a8000
MEM: nrem: f0700000 .. f0700000 .. f0708000
MEM: lrem: f2a08000 .. f2a083cc .. f2a0c000
So we see we jumped into the oldspace. Let’s check stack pointer:
(gdb) p $sp
$23 = (void *) 0xf068d50c
Good, stack pointer is ok, it points somewhere into stack segment of
Smalltalk/X VM. Usually, anything that start with 0xfxxx
is stack or
newspace. Stack usually takes 0xf0xx
, and 0xf100
and above is the
newspace. This is of course just a guess, you are never 100% sure when
dealing with complex C programs. Now we can look at what stack pointer
points to.
(gdb) p /x *((int*)$sp)
$26 = 0xf388373e
If we could guess, the instruction before the current one was call
.
This instruction, on x86 architecture, stores current program counter to
++sp
and jumps. Therefore, stack pointer gives us address from where
we jumped. Let’s examine the code so we can be a little bit more sure it
was call
instruction:
less /proc/$(pgrep stx)/maps
...
f375e000-f3761000 r-xp 00000000 fd:02 11014091 /lib32/libdl-2.13.so
f3761000-f3762000 r--p 00002000 fd:02 11014091 /lib32/libdl-2.13.so
f3762000-f3763000 rw-p 00003000 fd:02 11014091 /lib32/libdl-2.13.so
f3763000-f378b000 r-xp 00000000 fd:02 11014092 /lib32/libm-2.13.so
f378b000-f378c000 r--p 00028000 fd:02 11014092 /lib32/libm-2.13.so
f378c000-f378d000 rw-p 00029000 fd:02 11014092 /lib32/libm-2.13.so
f378d000-f37b3000 rw-p 00000000 00:00 0
f37b3000-f39e3000 r-xp 00000000 fd:03 5643709 .../stx/librun/librun.so
f39e3000-f3a20000 r--p 0022f000 fd:03 5643709 .../stx/librun/librun.so
...
So the address from where we jumped is somewhere in the text segment of
librun.so
. librun
is one of the libraries of Smalltalk/X.
(gdb) info symbol 0xf388373e
__jInterpret + 28274 in section .text of .../smalltalk/librun.so
Good, now which line is it:
(gdb) info line *0xf388373e
Line 2940 of "jinterpret.c" starts at address 0xf388370f <__jInterpret+28227>
2937│ switch (nA) {
2938│ case 0:
2939│ SET_LINENO_FOR_SEND(dummy0, bp-3-bp0);
2940├> rslt = (*code)(rec, sel,
2941│ __JavaMethodInstPtr(resolvedMethod)->jm_javaClass,
2942│ &dummy0);
2943│ break;
And here we go. This is the actual code:
OBJFUNC code = __ExecutableCodeInstPtr(resolvedMethod)->ex_code;
/* INVSTATIC or INVNONVIRT */
if (code) {
switch (nA) {
case 0:
SET_LINENO_FOR_SEND(dummy0, bp-3-bp0);
rslt = (*code)(rec, sel,
__JavaMethodInstPtr(resolvedMethod)->jm_javaClass,
&dummy0);
break;
Somehow, code contains normal data instead of being a code pointer for resolvedMethod - which is normal Java method object used internally in the VM. Code pointer should point at compiled machine code (compiled by Just-in-Time compiler). It surely cannot point at some data.
The rest of the debugging session is not that interesting. We had to find who set the incorrect value there. Usually where many bugs go hide is a C code sequence of variable assignment, message send, and variable access. Garbage collector can be executed on message sends, and we use Baker algorithm, which moves objects within heap. When you assign an object into a variable, and GC moves the object, you will get some garbage in there instead. The most fun bugs show themselves after many GC cycles, and in a completely different place in the execution, when you cannot possibly know where the bug was introduced.
For this the VM contains PROTECT
and UNPROTECT
macros, which will
make sure your variable is updated when GC runs. You just have to
remember to use them. And I always forget.
Conclusion
This was quite meaty post. What you should take from it is that you can get a ton of information out of crashed C program, if you know where to look. So next time the STX:LIBJAVA, or Ruby or whichever virtual machine you use crashes, try to investigate a bit. You will learn something, for sure!