linux model crash...

Postby fridgemagnet » Sun Oct 19, 2008 10:52 am

Hi, I originally posted this on the climateprediction.net message board but didn't get many helpful responses, and this board seems to have a lot more useful info and knowledge.

I've just started trying to run the model on a Linux box (command line only, no graphics required). It's based on a fairly old distro (Suse 8.2). I can get BOINC running okay with the 'Older Linux' versions (i.e. I don't get the GLIBC issues) and can attach to the project okay, but the model crashes quickly after starting. On the command line you get:

19-Oct-2008 10:49:25 [climateprediction.net] Starting hadsm3fub_k2lm_005968157_6
19-Oct-2008 10:49:25 [climateprediction.net] Starting task hadsm3fub_k2lm_005968157_6 using hadsm3 version 608
19-Oct-2008 10:49:44 [climateprediction.net] Computation for task hadsm3fub_k2lm_005968157_6 finished
19-Oct-2008 10:49:44 [climateprediction.net] Output file hadsm3fub_k2lm_005968157_6_1.zip for task hadsm3fub_k2lm_005968157_6 absent
19-Oct-2008 10:49:44 [climateprediction.net] Output file hadsm3fub_k2lm_005968157_6_2.zip for task hadsm3fub_k2lm_005968157_6 absent
19-Oct-2008 10:49:44 [climateprediction.net] Output file hadsm3fub_k2lm_005968157_6_3.zip for task hadsm3fub_k2lm_005968157_6 absent
19-Oct-2008 10:50:46 [climateprediction.net] Sending scheduler request: To fetch work. Requesting 30240 seconds of work, reporting 1 completed tasks

Which is reported at the server end:

http://climateapps2.oucs.ox.ac.uk/cpdnb ... id=8140535

The machine now has 1GB of RAM fitted (as suggested on the other board), but this made no difference.

The thing that confuses me, though, is that the explanation for error code -234 on the BOINC side reads as follows:

"When running a 64-bit Linux on a project that sends 32-bit applications only, you can run into results erroring out with process exited with code 22.

The explanation for this is that 32-bit binaries don't just work on every 64-bit Linux. If for example you install a fresh Ubuntu 6.10 or 7.04, 32-bit binaries won't work. They are not even recognized as valid executables. You first have to install the ia32 package and dependent packages. Further, for programs that link with the graphic library, you will manually have to copy a 32-bit libglut library to the usr/lib32 directory.

If after this you still get client errors, post on the forums of the project that you have this problem and ran ldd on the executable in the projects directory to see what libraries are missing. Post which libraries these are and ask for instructions on how to get them."
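The ldd check that the FAQ describes can be sketched roughly as below. This is only an illustration: the function name `missing_libs` is made up here, and the path in the usage comment is a placeholder, not the real name of the model binary.

```shell
#!/bin/sh
# List shared libraries that the dynamic loader cannot resolve for a binary.
# ldd prints "libfoo.so.1 => not found" for each unresolved dependency,
# so no output means every library was located.
missing_libs() {
    ldd "$1" 2>/dev/null | grep 'not found'
}

# Usage (placeholder path; substitute the model executable found in the
# BOINC projects directory):
#   missing_libs projects/climateprediction.net/<model_executable>
```

If the function prints nothing, missing libraries are probably not the cause and the problem lies elsewhere, for example in the version of a library that is present.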

However this isn't a 64-bit version of Linux so it doesn't seem to apply. Unless these are 64-bit versions of the model...

I've had a bash at rebuilding BOINC from source, but I'm not convinced it's a problem with the client, and there are a fair few dependency libraries popping up, so I'm wondering if it's worth pursuing this aspect.

Jon.
fridgemagnet
 
Posts: 6
Joined: Sun Oct 19, 2008 10:34 am

Postby Les Bayliss » Sun Oct 19, 2008 1:58 pm

There aren't any 64-bit climate models, just some code that sends one type of 32-bit model when the server is asked for work by a 64-bit version of BOINC.
The 'error 22' shown seems to be a general 'catchall' code for several different problems encountered by model failures, and isn't very helpful. It's been around for a while.
The 'volc' type of models are new, and may require something in newer versions of Linux to work.

edit
As an afterthought, try setting your prefs on the server to only get slab models. These are fairly reliable on most systems.
The opinions here are my own, and the help ideas either mine, or that gleaned from other posts.
Les Bayliss
Forum Admin
 
Posts: 4773
Joined: Sun Sep 05, 2004 10:56 am
Location: Sydney, Australia

Postby mo.v » Sun Oct 19, 2008 5:31 pm

Jon's thread on the CPDN-BOINC forum is here.

Les, I don't think selecting a different model type will solve the problem. Jon has already tried slabs, mid-Holocenes and HADCMs, i.e. every type available for this computer. They are all crashing almost immediately. There's something fundamentally wrong and code 22 provides no clue to the cause.

It may be better to crunch CPDN on the other computer(s) and see whether this one with this Linux distro can succeed with another project.
Spinnaker Tower & Tyne Bridges I'm a volunteer participant
mo.v
Forum Admin
 
Posts: 5992
Joined: Sun Oct 10, 2004 5:25 am
Location: Portsmouth UK

Postby fridgemagnet » Sun Oct 19, 2008 6:48 pm

So I assume the -234 is also a red herring then...? I don't suppose anyone could shed any light on what sort of libraries could cause this effect - I'm okay with updating various bits of the OS from scratch but would need some clue as to which bits need the upgrade.

I'll put it back to doing the folding@home work for now then since that does appear to run okay.

It's a bit annoying because that's the machine that's left on most of the time, so it would get through the work. The others are normally only on for an hour or two a day, tops.

Postby Ananas » Sun Oct 19, 2008 10:23 pm

fridgemagnet wrote: So I assume the -234 is also a red herring then...? ...

It is (bit-wise) the same as 22 and 0x16, expressed as an inverted byte value (256 - 22 = 234, 2's complement).
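For anyone puzzled by the arithmetic: an exit status is a single byte, so -234 and 22 share the same low 8 bits. A minimal shell sketch of that reduction (the function name `to_exit_byte` is made up for illustration):

```shell
#!/bin/sh
# Reduce a signed status value to the 8-bit exit code a process can return,
# by keeping only the low byte.
to_exit_byte() {
    echo $(( $1 & 0xFF ))
}

to_exit_byte -234                          # prints 22
to_exit_byte 22                            # prints 22
printf '0x%02X\n' "$(to_exit_byte -234)"   # prints 0x16
```

So the three numbers in the error report are one value written three ways, not three separate clues.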
Ananas
Forum moderator
 
Posts: 498
Joined: Wed Apr 20, 2005 1:39 pm
Location: Nordlichter Cologne

Postby mo.v » Sun Oct 19, 2008 10:57 pm

When code 22 occurs on a Windows computer, BOINC thinks it's a Windows error code (which it isn't) and generates the Windows error message that corresponds to it, which is therefore a red herring.

How BOINC treats code 22 on Linux I don't know. I don't think I've seen code 22 on Linux before. If this code on Linux is in fact generated by the model, Thyme Lawn may know what it means or be able to find out from Tolu (CPDN's chief programmer).

Postby fridgemagnet » Wed Oct 22, 2008 6:59 pm

OK, well my "last ditch effort", which I really didn't expect to make any difference, was to rebuild the BOINC client from source and give that a whirl, and sure enough it hasn't.

As I said, if anyone's got any pointers I'm happy to try a few things out, but right now I'm out of ideas.

Thanks all anyway,

jon.

Postby Jayargh » Sat Oct 25, 2008 6:58 pm

I have the same issue on a new install using Ubuntu 8.04. As I suspected, and as in other projects, you must obtain the ia32 package, found in the Synaptic package manager in my distro. These libraries are required when running 32-bit applications in a 64-bit environment.

Even though these slab models are saying they are 64-bit, they still use the 32-bit libraries. Hope this helps get you going, fridgemagnet, as it did me on this install :wink:
Jayargh
 
Posts: 1
Joined: Sat Oct 25, 2008 6:52 pm

Postby fridgemagnet » Tue Oct 28, 2008 9:59 am

Unfortunately it's not a 64-bit issue. I thought the error code indicated it was, but that was a red herring, particularly as my system is totally 32-bit. No, I think it's more likely an incompatibility with one of the older glibc shared libraries that the models depend on. I've managed to build a new (separate) version of glibc 2.4 and can successfully run the new boinc client against it, but the current stumbling block is that when it forks the model tasks they aren't using it. I just need to spend a bit more time & research on it.

Cheers anyway.

Postby fridgemagnet » Mon Nov 03, 2008 7:14 pm

OK, well I think I've got something working now (fingers crossed, it's been running for 10 minutes without incident). It's been a bit painful and I doubt that many out there will want to go through the pain but hey, I'm a s/w engineer, we tinker with stuff. So for anyone else, this is what I ended up doing. You may get away with less than this, but now it's working I'm leaving well alone.

First, figuring the model crash was due to an old glibc build, download and build glibc 2.4 but install it somewhere else on your system, i.e. don't overwrite the existing version!! This also necessitated an update of gcc to build the thing, but that's neither here nor there.

The tricky bit was getting boinc & the model to use it - I could start boinc up using the library by doing:

<glibc2>/lib/ld-linux.so.2 --library-path <glibc2>/lib ./boinc

but when the model gets forked it reverts to using the default loader in /lib. There doesn't seem to be a way to override the default loader, so in the end I opted to set up a chroot environment for it. This proved the painful bit. In the BOINC install folder:

1. create ./etc and copy in nsswitch.conf, resolv.conf, fstab, hosts, localtime, networks, passwd
2. create ./lib and copy in the contents of the new glibc's lib directory (or do a mount --bind):

cp -a <glibc2>/lib/. ./lib/

3. I also needed to copy *libz*, *libacl* and *libattr* from my existing /lib folder along with libstdc++ from my gcc install folder.

4. Create folders ./dev, ./proc, ./sys. Another thread suggests doing a mount --bind on all these folders, but this seemed to cause boinc to hang. I managed to get away with just:

/dev
/dev/shm

although boinc then guesses the memory and swap space sizes, which may not be ideal.

5. Finally, create ./bin and copy in '/bin/cp'. I also copied in a statically linked 'ash' shell (do an rpm search).

6. Then it was 'chroot <boinc> /bin/ash.static'.

In the new shell:
export LD_LIBRARY_PATH=/lib
./boinc

and off it trundles.
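The unprivileged part of the steps above (1, 2 and 5, plus the empty mount points from step 4) might be scripted roughly like this. It is a sketch under assumptions, not the exact commands used: the function name `build_chroot_tree` and both arguments are placeholders for your own BOINC folder and the prefix where the new glibc was installed (the <glibc2> above).

```shell
#!/bin/sh
# Sketch: populate a chroot tree for BOINC.
#   $1 = BOINC install folder (placeholder)
#   $2 = install prefix of the separately built glibc (placeholder)
build_chroot_tree() {
    boinc_dir=$1
    glibc_dir=$2

    mkdir -p "$boinc_dir/etc" "$boinc_dir/lib" "$boinc_dir/bin" \
             "$boinc_dir/dev" "$boinc_dir/proc" "$boinc_dir/sys" || return 1

    # Step 1: config files wanted inside the chroot (skip any that are absent).
    for f in nsswitch.conf resolv.conf fstab hosts localtime networks passwd; do
        if [ -e "/etc/$f" ]; then
            cp -p "/etc/$f" "$boinc_dir/etc/" || return 1
        fi
    done

    # Step 2: copy the contents of the new glibc's lib directory; the
    # trailing "/." avoids ending up with a nested lib/lib.
    cp -a "$glibc_dir/lib/." "$boinc_dir/lib/" || return 1

    # Step 5: a copy of cp inside the chroot.
    cp -p /bin/cp "$boinc_dir/bin/" || return 1
}

# Steps 3, 4 and 6 are left manual (the mounts and chroot need root):
#   copy libz, libacl, libattr and libstdc++ into "$boinc_dir/lib"
#   mount --bind /dev "$boinc_dir/dev"
#   mount --bind /dev/shm "$boinc_dir/dev/shm"
#   chroot "$boinc_dir" /bin/ash.static
```

Calling it as `build_chroot_tree /path/to/boinc /path/to/glibc2` builds the tree. The extra libraries and the static shell still have to be copied in by hand, since where they live varies by distro.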

hope this is of *some* help to others.


jon.

Postby Thyme Lawn » Mon Nov 03, 2008 7:58 pm

Wow :shock:

You deserve a prize for your perseverance and ingenuity, Jon!
Thyme Lawn
Forum Admin
 
Posts: 1905
Joined: Tue Sep 16, 2003 10:39 am
Location: Poole, Dorset, UK

Postby MikeMarsUK » Tue Nov 04, 2008 9:04 am

I've added a link to your post from the 'readme's... many thanks for investigating :-)
I'm a volunteer and my views are my own.
MikeMarsUK
Forum Admin
 
Posts: 3928
Joined: Mon Jan 30, 2006 11:22 pm
Location: UK

Postby fridgemagnet » Sat Nov 15, 2008 6:58 pm

As hopefully a final summary on this: after a few days running, BOINC started keeling over and/or consuming 100% CPU (I think it was attempting to perform a trickle). After a degree of further messing around I opted to delete everything BOINC-related and reinstall the application, this time using the last stable 5.x release. This time, the 'mount -o bind' for /proc, /sys and /dev hasn't caused BOINC any issues and it's been running (& trickling) fine for the past few days.

I can't say whether it was the change of BOINC version (dropping from 6.x to 5.x), all the messing around trying to get it to work, or even the work unit it was processing which upset things, but it does appear to be a lot happier (and more stable) now.
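For reference, the three bind mounts could be generated like this. The helper only prints the commands so they can be checked before being run as root. The name `print_bind_mounts` and the chroot path are placeholders.

```shell
#!/bin/sh
# Print (do not run) the bind-mount commands for a chroot rooted at $1.
print_bind_mounts() {
    for d in proc sys dev; do
        printf 'mount -o bind /%s %s/%s\n' "$d" "$1" "$d"
    done
}

print_bind_mounts /path/to/boinc
```

Once the printed lines look right they can be pasted into a root shell. Nothing is mounted by the sketch itself.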

cheers.