WebLog Pro Olivier Berger

27/06/2006

Presentation of the CALIBRE project at the ObjectWeb+OrientWare workshop at CISIS in Dalian (China)

Filed under: CALIBRE project, Uncategorized — Olivier Berger @ 14:07

I’m back from Dalian (China), where I presented the Calibre project at a joint one-day workshop organised by colleagues in France and China, mainly people from ObjectWeb and OrientWare (two consortia developing middleware), during the CISIS IT fair. Here’s a link to the program.

Here is a link to my slides: “Some results on the Calibre project”

13/06/2006

Fixing SCSI aacraid driver related FS crashes on Dell PowerEdge 2650 with Perc 3/Di running kernel 2.6.8 in Debian stable

Filed under: Uncategorized — Olivier Berger @ 14:01

I’ve experienced random file-system crashes on a Dell PowerEdge 2650 server with a Perc 3/Di SCSI controller, running a Debian testing system with the standard 2.6.8 Debian kernel (i686+smp). They happen mainly during disk-intensive operations (for instance, I suspect one such crash happened when an amanda backup task was launched on the machine).

There have been numerous discussions on the linux-poweredge mailing list, and many proposals for fixing this issue (see details on Google).

The symptoms look like this:

Jun 9 20:52:58 myhost kernel: aacraid: Host adapter reset request. SCSI hang ?
Jun 9 20:52:58 myhost kernel: aacraid: Host adapter reset request. SCSI hang ?
Jun 9 20:52:58 myhost kernel: aacraid: SCSI bus appears hung
Jun 9 20:52:58 myhost kernel: aacraid: SCSI bus appears hung
Jun 9 20:52:58 myhost syslogd: /var/log/messages: Read-only file system
Jun 9 20:52:58 myhost kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
Jun 9 20:52:58 myhost kernel: SCSI error : <0 0 0 0> return code = 0x6000000
Jun 9 20:52:58 myhost kernel: end_request: I/O error, dev sda, sector 401836233
Jun 9 20:52:58 myhost kernel: scsi0 (0:0): rejecting I/O to offline device
Jun 9 20:52:58 myhost kernel: scsi0 (0:0): rejecting I/O to offline device
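To check whether a machine has already hit these symptoms, one could count the aacraid hang messages in a log file. This is only a minimal sketch: the function name count_aacraid_hangs is made up for illustration, and which log file actually receives the kernel messages depends on the local syslog configuration.

```shell
#!/bin/sh
# Count the aacraid hang symptoms in a syslog-style file.
# count_aacraid_hangs is a hypothetical helper, not part of any package.
# Example call (the log path is an assumption):
#   count_aacraid_hangs /var/log/messages
count_aacraid_hangs() {
    logfile="$1"
    # grep -c prints the number of matching lines. It exits non-zero when
    # nothing matches, so "|| true" keeps the function usable under "set -e".
    grep -c -E 'aacraid: (SCSI bus appears hung|Host adapter reset request)' "$logfile" || true
}
```

A non-zero count only says that the messages were logged. As the excerpt above shows, the file system may already be read-only by the time the hang happens, so later messages can be missing from the file.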

I think I have come closer than ever to a solution by applying the following steps:

  1. upgrading the firmware of the Perc 3/Di controller: look at the Dell site for the right version…
  2. disabling the cache with afacli:

    # afacli

    open AFA0

    AFA0 container set cache /read_cache_enable=FALSE /write_cache_enable=FALSE 0

    AFA0 container show cache 0
    Executing: container show cache 0

    Global Container Read Cache Size : 0
    Global Container Write Cache Size : 118259712

    Read Cache Setting : DISABLE
    Write Cache Setting : DISABLE
    Write Cache Status : Inactive, cache disabled

  3. patching the 2.6.8 aacraid driver’s code with the following patch: aac-remove-handle-aif.patch, to avoid taking the controller offline in some circumstances (see the explanation in this post: http://marc.theaimsgroup.com/?l=linux-scsi&m=110252243627410&w=2).
    1. get the kernel-source-2.6.8 package from stable
    2. unpack it and apply patch
    3. get the running (uname -r) kernel’s .config from /boot and copy it to /usr/src/kernel-source-2.6.8/
    4. make-kpkg clean
    5. make oldconfig
    6. make-kpkg --append_to_version=patchaacremovehandleaif --initrd kernel_image
    7. install resulting kernel, and reboot
  4. pray ;)
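The kernel rebuild in step 3 can be sketched as a small script. This is only an illustration under assumptions: the tarball path, the source directory, the patch location and the -p1 patch level are all guesses, and the run and rebuild_patched_kernel helpers are made up here. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch of the rebuild steps listed above. Every path below is an
# assumption; adjust them to wherever the source and the patch really are.
SRC_TARBALL="${SRC_TARBALL:-/usr/src/kernel-source-2.6.8.tar.bz2}"
SRC_DIR="${SRC_DIR:-/usr/src/kernel-source-2.6.8}"
PATCH_FILE="${PATCH_FILE:-/usr/src/aac-remove-handle-aif.patch}"

# Print each command; execute it only when DRY_RUN=0 (hypothetical helper).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Hypothetical wrapper around steps 3.2 to 3.6; stops at the first failure.
rebuild_patched_kernel() {
    run tar -xjf "$SRC_TARBALL" -C /usr/src || return 1
    run cd "$SRC_DIR" || return 1
    # The -p1 strip level is an assumption about how the patch was generated.
    run patch -p1 -i "$PATCH_FILE" || return 1
    # Reuse the configuration of the currently running kernel.
    run cp "/boot/config-$(uname -r)" .config || return 1
    run make-kpkg clean || return 1
    run make oldconfig || return 1
    run make-kpkg --append_to_version=patchaacremovehandleaif --initrd kernel_image || return 1
    # Installing the resulting package and rebooting are left as manual steps.
}
```

Calling rebuild_patched_kernel with no further setup prints the command sequence, which makes it easy to review before running it for real as root.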

The machine had been working almost fine on Debian’s 2.6.8 kernel since the cache was disabled and the firmware upgraded, but it eventually crashed again…

I hope that the patch against the aacraid driver will solve the issue.
