Elapsed time exceeded
Message boards : Number crunching : Elapsed time exceeded
| Author | Message |
|---|---|
|
I have had 23 wu's error out for exceeding the elapsed time limit. These began occurring on 20110809. Others are likely to follow as I have a considerable number of wu's in progress. Any educated guesses? | |
| ID: 75 | Rating: 0 | rate: | |
|
I've looked at the workunits with the elapsed time exceeded error on one of your hosts. The computational size parameters (rsc_fpops_est, rsc_fpops_bound) are expressed in terms of number of floating-point operations. For example, suppose a job takes 1 hour to complete on a machine with a Whetstone benchmark of 1 GFLOPS; then the "size" of the job is 3.6e12 FLOPs (3600 s x 1e9 FLOPS). To get an initial estimate of job size, run several typical jobs on your own computer, see how long they take, and multiply by the Whetstone score of the computer (to find this, run BOINC on the computer and look at the event log). We get a computational size estimate of ~1.05e12 FLOPs (3.835e11 x 2.74). All jobs are sent with the following computational size estimates: <rsc_fpops_est>1e13</rsc_fpops_est> <rsc_fpops_bound>2e13</rsc_fpops_bound> This should have been enough to calculate the jobs. To find out why it wasn't, I searched the web and came across the following post (it dates from 2009, so I don't know whether it still applies): "The rsc_fpops_bound value is divided by the host's fpops/second value in order to set the actual max_elapsed_time value. The host's fpops/second is either the Whetstone benchmark or a <flops> value from an app_info.xml file. The Whetstone benchmark shown as "Measured floating point speed" on the host page looks reasonable (though that doesn't guarantee it was reasonable on the host at the times the errors occurred). Is it possible your app_info.xml has a <flops> value for the 6.03 app which has a couple extra zeroes in the value? That's my best guess why BOINC is thinking that less than an hour is too much time." The cause of these errors could be this <flops> value, or the "Measured floating point speed" may have been different during the run. Comparing the CPU times with the elapsed times for the failed jobs suggests the latter. Anis Abuseiris Erasmus Grid Office | |
| ID: 76 | Rating: 0 | rate: | |
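
To make the arithmetic in the post above concrete, here is a rough Python sketch of the two calculations it describes: job size as runtime times benchmark speed, and the elapsed time limit as rsc_fpops_bound divided by the host's FLOPS, as stated in the quoted 2009 post. The function names are made up for illustration, the 2.74 GFLOPS host speed is an assumption read off the "x 2.74" factor, and the formula may not match what current BOINC clients do.

```python
def job_size_flops(runtime_seconds: float, host_flops: float) -> float:
    """Estimated job size: runtime multiplied by the host's benchmark speed."""
    return runtime_seconds * host_flops


def max_elapsed_seconds(rsc_fpops_bound: float, host_flops: float) -> float:
    """Elapsed time limit as described in the quoted 2009 post (assumed rule)."""
    return rsc_fpops_bound / host_flops


RSC_FPOPS_BOUND = 2e13   # value sent with the jobs, per the post
ASSUMED_HOST_FLOPS = 2.74e9  # assumption: 2.74 GFLOPS Whetstone

# One hour on a 1 GFLOPS machine.
one_hour_job = job_size_flops(3600, 1e9)

# Limit with a sane benchmark value, in seconds.
limit_normal = max_elapsed_seconds(RSC_FPOPS_BOUND, ASSUMED_HOST_FLOPS)

# Limit if a <flops> value had two extra zeroes (hypothetical scenario).
limit_inflated = max_elapsed_seconds(RSC_FPOPS_BOUND, ASSUMED_HOST_FLOPS * 100)

print(f"1 hour at 1 GFLOPS: {one_hour_job:.3g} FLOPs")
print(f"limit, normal flops:   {limit_normal / 3600:.2f} hours")
print(f"limit, inflated flops: {limit_inflated / 60:.2f} minutes")
```

Under these assumptions the normal limit comes out at roughly two hours, while a speed value inflated by a factor of 100 shrinks it to a little over a minute, which is the kind of mismatch the quoted post is guessing at.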
|
I'm not quite clear on this, so I looked at one of my own systems. | |
| ID: 79 | Rating: 0 | rate: | |
|
I followed the lines of thinking suggested by both Honto ni and Krunchin Keith, but the error problem was not resolved and its source was not confirmed. All machines generating the errors were running XP 64-bit and BOINC 6.10.58. Looking at zombie67's similar active computers, which were generating next to no errors, the only difference seemed to be that I am heavy on AMD CPUs. As a result, I directed all the affected machines to other projects, testing a few wu's now and then. | |
| ID: 222 | Rating: 0 | rate: | |
|
Yes, but the problem you refer to is an app-specific setting. This could happen on any project to any app, so what goes on at another project is irrelevant here. | |
| ID: 224 | Rating: 0 | rate: | |