Elapsed time exceeded

Message boards : Number crunching : Elapsed time exceeded

bill brandt-gasuen
Joined: Aug 8 11
Posts: 6
Credit: 2,596,098
RAC: 0
Message 75 - Posted 12 Aug 2011 20:42:11 UTC

    I have had 23 wu's error out for exceeding the elapsed time limit. These began occurring on 20110809. Others are likely to follow as I have a considerable number of wu's in progress. Any educated guesses?

    Profile Honto ni
    Project administrator
    Project developer
    Project tester
    Project scientist
    Joined: Jul 30 11
    Posts: 95
    Credit: 18,139
    RAC: 0
    Message 76 - Posted 13 Aug 2011 0:03:47 UTC - in response to Message.

      Last modified: 13 Aug 2011 5:03:28 UTC

      I've looked at the workunits with the elapsed time exceeded error on one of your hosts,
      and the elapsed time was ~3835 seconds for those jobs.

      If we take into account the following information:

      The computational size parameters (rsc_fpops_est, rsc_fpops_bound) are
      expressed in terms of number of floating-point operations. For example, suppose
      a job takes 1 hour to complete on a machine with a Whetstone benchmark of 1
      GFLOPS; then the "size" of J is 3.6e12 FLOPs. To get an initial estimate of job
      size, run several typical jobs on your own computer, see how long they take,
      and multiply by the Whetstone score of the computer (to find this, run BOINC on
      the computer and look at the event log).


      We get a computational size estimate of ~1.05e13 FLOPs (3.835e12 x 2.74).
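
      Spelled out with explicit units, the size estimate is just elapsed seconds
      multiplied by the host's benchmark speed in FLOPS. The sketch below is only an
      illustration: the function name is made up, and it assumes the 2.74 figure is
      the host's Whetstone score in GFLOPS.

```python
def job_size_flops(elapsed_seconds: float, whetstone_flops: float) -> float:
    """Estimate job size as elapsed time multiplied by benchmark speed."""
    return elapsed_seconds * whetstone_flops


# Documentation example: 1 hour on a host benchmarked at 1 GFLOPS.
doc_example = job_size_flops(3600, 1e9)

# Failed jobs: ~3835 s elapsed; 2.74 GFLOPS is an assumed Whetstone score.
failed_job = job_size_flops(3835, 2.74e9)

print(f"documentation example: {doc_example:.2e} FLOPs")
print(f"failed job estimate:   {failed_job:.2e} FLOPs")
```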

      All jobs are sent with the following computational size estimates:
      <rsc_fpops_est>1e13</rsc_fpops_est>
      <rsc_fpops_bound>2e13</rsc_fpops_bound>

      This should have been enough to calculate the jobs. To find out why it wasn't, I
      searched the web and came across the following post (it dates from 2009,
      so I don't know whether it still applies):

      The rsc_fpops_bound value is divided by the host's fpops/second value in
      order to set the actual max_elapsed_time value. The host's fpops/second is
      either the Whetstone benchmark or a <flops> value from an app_info.xml file.
      The Whetstone benchmark shown as "Measured floating point speed" on the host
      page looks reasonable (though that doesn't guarantee it was reasonable on the
      host at the times the errors occurred). Is it possible your app_info.xml has a
      <flops> value for the 6.03 app which has a couple extra zeroes in the value?
      That's my best guess why BOINC is thinking that less than an hour is too much
      time.
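
      As a rough check of the rule quoted above (a sketch only: it takes the 2009
      description at face value, the function names are hypothetical, and 2.74
      GFLOPS is an assumed benchmark for the host), the limit and the speed implied
      by the observed ~3835 s cutoff work out as follows:

```python
def max_elapsed_seconds(rsc_fpops_bound: float, host_flops: float) -> float:
    """Time limit as described in the quoted post: bound / host speed."""
    return rsc_fpops_bound / host_flops


def implied_host_flops(rsc_fpops_bound: float, cutoff_seconds: float) -> float:
    """Host speed that would produce a given cutoff for a given bound."""
    return rsc_fpops_bound / cutoff_seconds


BOUND = 2e13  # <rsc_fpops_bound> sent with the jobs

# Assumed benchmark of 2.74 GFLOPS: limit is roughly 7300 s.
normal_limit = max_elapsed_seconds(BOUND, 2.74e9)

# Same value with two extra zeroes: limit drops to roughly 73 s.
inflated_limit = max_elapsed_seconds(BOUND, 2.74e11)

# Speed the client would need to assume for a ~3835 s cutoff.
implied = implied_host_flops(BOUND, 3835)

print(f"limit at 2.74 GFLOPS:       {normal_limit:.0f} s")
print(f"limit with two extra zeroes: {inflated_limit:.0f} s")
print(f"speed implied by 3835 s:     {implied:.3e} FLOPS")
```

      Under these assumptions, a cutoff near 3835 s would correspond to a speed of
      about 5.2 GFLOPS, roughly twice the assumed benchmark, rather than a value
      that is off by two zeroes.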


      These errors could be caused either by such a <flops> value or by the
      "Measured floating point speed" being different during the run. Comparing the CPU
      times with the elapsed times for the failed jobs suggests the latter.

      Anis Abuseiris
      Erasmus Grid Office

      Profile Krunchin-Keith [USA]
      Volunteer moderator
      Volunteer tester
      Joined: Aug 3 11
      Posts: 93
      Credit: 1,277,096
      RAC: 419
      Message 79 - Posted 13 Aug 2011 10:50:02 UTC

        Last modified: 13 Aug 2011 10:55:42 UTC

        I'm not quite clear on this, so I looked at one of my own systems.

        Figures pulled from the init_data files in the slots directories:

        Simap
        <rsc_fpops_est>18000000000000.000000</rsc_fpops_est> = 18e12
        <rsc_fpops_bound>1000000000000000.000000</rsc_fpops_bound> = 1e15
        Run time 53 minutes, Task duration correction factor = 0.787821

        DistDataMine
        <rsc_fpops_est>2656104109030.000000</rsc_fpops_est> = 2.66e12
        <rsc_fpops_bound>100000000000000000000.000000</rsc_fpops_bound> = 1e20
        Estimate 55 minutes, Task duration correction factor = 2.928691

        Correlizer
        <rsc_fpops_est>10000000000000.000000</rsc_fpops_est> = 1e13
        <rsc_fpops_bound>20000000000000.000000</rsc_fpops_bound> = 2e13
        Runtime about 35 minutes, Task duration correction factor of 0.93915

        For the other projects, the bound is much larger than the estimate (x55 and x37e6); for Correlizer it is only x2 (???).

        Now for the estimate: the measured flops shown on the project page for this system is 2650.74 million ops/sec.

        So one hour at that speed is 3600 x 2.65e9 = 9540000000000 = 9.54e12 = 0.95e13 FLOPs. So it appears that the estimate is good, at least for me, as I am also showing a "Task duration correction factor of 0.93915"; I believe that if it is 1.000000 then the estimate is dead on.
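
        The headroom between estimate and bound for the three tasks listed above can be compared with a few lines. This is only a sketch; the dictionary and function names are made up for illustration, and the values are the ones pulled from the init_data files.

```python
# (rsc_fpops_est, rsc_fpops_bound) as listed for each project's task.
TASKS = {
    "Simap": (18e12, 1e15),
    "DistDataMine": (2656104109030.0, 1e20),
    "Correlizer": (1e13, 2e13),
}


def bound_to_estimate_ratios(tasks: dict) -> dict:
    """Return bound / estimate for each task, i.e. the allowed overrun factor."""
    return {name: bound / est for name, (est, bound) in tasks.items()}


ratios = bound_to_estimate_ratios(TASKS)
for name, ratio in ratios.items():
    print(f"{name}: bound is x{ratio:.3g} the estimate")
```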


        ---

        @Bill
        Check and see what your TDCF is showing ?

        You might have some numbers that are off for some reason

        Things to try, one at a time, to see if this clears up the problem:
        Reboot the computer.
        Manually run the benchmarks.
        Update the project to clear any pending unreported work, then reset the project. Note that this will kill any work in progress, but you should then get fresh work with numbers relating to the last benchmark.

        bill brandt-gasuen
        Joined: Aug 8 11
        Posts: 6
        Credit: 2,596,098
        RAC: 0
        Message 222 - Posted 31 Aug 2011 23:47:42 UTC

          I followed the lines of thinking suggested by both Honto ni and Krunchin Keith, but the error problem was not resolved, nor was its source confirmed. All machines generating the errors were running XP 64-bit and BOINC 6.10.58. Compared with zombie67's similar active computers, which were generating next to no errors, the only difference seemed to be that I am heavy on AMD CPUs. As a result, I directed all the affected machines to other projects, testing a few wu's now and then.
          Have you recently made some subtle changes to the BioMedical Genome Correlations v1.00 application? Now they seem to be running fine, so I am slowly pulling all my computers back onto the project. My best guess at this point is that my computers run in a non-air-conditioned environment, and when the ambient temperature is in excess of 80F (as it has been for most of August) the heat buildup may have been a factor. However, no wu's from other projects running simultaneously seemed to exhibit similar behavior.
          But all's well that ends well. It's a pleasure to participate again at full strength.

          Profile Krunchin-Keith [USA]
          Volunteer moderator
          Volunteer tester
          Joined: Aug 3 11
          Posts: 93
          Credit: 1,277,096
          RAC: 419
          Message 224 - Posted 1 Sep 2011 14:18:10 UTC

            Yes, but the problem you refer to is an app-specific setting. It could happen on any project to any app, so what goes on at another project is irrelevant here.
            -
            We are in the process of beta testing a new version to work out the bugs. I believe this error has been removed, but not sure since I don't see it on my computers at all.

            If you feel adventurous, could you change your settings for this project under your account: deselect "BioMedical Genome Correlations:", select "run test apps" and select the "Correlizer Beta Applications". Save the settings, then do an update on this project on your host to get the new settings. Let a few run and report whether the error is still there or not.
