So, we are bringing up Plan 9 on the Blue Gene supercomputer and stuff looks great: we boot it to a prompt and we can interact with the machine. It's just that every so often we see some corruption in the output, and then a little later on the thing dies with some form of panic or critical exception. Shit... We figure we have some weird memory corruption problem and go on a witch hunt through the code. Two bits of additional technical information: our output is obtained by dumping the console prints into an SRAM and then sniffing it over JTAG. The SRAM is shared by two non-coherent processor cores. We had assumed that one of the cores was held in reset during boot.
We were wrong.
The freaky shit is, the damn OS booted to a prompt and I could type, list directories, screw around, and we were getting a single output stream that was more or less coherent. The cores were executing in such precise synchronization (in the same memory space, no less) that they were even writing the same characters to the console buffers and incrementing the console pointer to the same value. Complete lock step.
We wouldn't even have noticed, except that we started working on the Ethernet, and touching it skewed the two processors. We would get ddoouubbllee ccoonnssoollee oouuttppuutt for a while, and then they would go back into sync.
It's amazing how really, terminally, completely broken shit can run for a damn long time...
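For anyone who wants to see why lock step hides the second core, here is a toy model in Python. It is only a sketch under loud assumptions: the names (`run`, `buf`, `ptr`) are made up for illustration, the real console code is not shown here, and real non-coherent cores are messier than a loop. The only idea it captures is the one above: if both cores read the console pointer before either one writes it back, they store the same character in the same slot and bump the pointer to the same value, so you see one clean stream. Skew them by a whole write and every character lands twice.

```python
def run(message, skewed):
    """Toy model: two cores print `message` through one shared console buffer."""
    buf = {}   # stands in for the shared SRAM console buffer (index -> char)
    ptr = 0    # stands in for the shared console write pointer
    for ch in message:
        if skewed:
            # Skewed: core 0 finishes its whole write before core 1 starts,
            # so core 1 sees the already-incremented pointer.
            for _core in range(2):
                p = ptr
                buf[p] = ch
                ptr = p + 1
        else:
            # Lock step: both cores load the pointer before either stores it.
            p0 = ptr
            p1 = ptr
            buf[p0] = ch   # both cores write the same char to the same slot
            buf[p1] = ch
            ptr = p0 + 1   # both cores store the same new pointer value
            ptr = p1 + 1
    return "".join(buf[i] for i in range(ptr))


print(run("console", skewed=False))
print(run("console", skewed=True))
```

With `skewed=False` the two cores are indistinguishable from one; with `skewed=True` every character shows up twice.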
Tuesday, April 17, 2007

4 comments:
Oh boy.. it's amazing!
Freaky side effect of having the whole system share a common clock. Sweet.
It's great news for all Plan9 users!
Cool :)