WCG problems

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,364
15,501
136
I have hundreds of tasks stuck "downloading" forever, so I aborted them all. So new tasks came up, and sure enough, I had to abort them. For the most part, WCG is dead at the moment. Anybody know what's going on?
 

StefanR5R

Elite Member
Dec 10, 2016
6,018
9,044
136
They restarted the African Rainfall Project (arp1) last week. Each task has more than ten input files, and several of them are tens of megabytes in size. The result files of arp1 (seven files per result) are even larger, IIRC. This is probably just too much for Krembil's internet connection. [...] I read elsewhere that WCG, after the move to Krembil and for as long as the arp1 project was active, had about the same issues as now whenever they submitted a new arp1 work batch.
I tried one ARP1 task. Downloading the input files took two hours with countless retries, but I did not rely on BOINC's built-in retry periods... Executing the task took 21 h on my Haswell Xeon E3 (with all other CPU threads busy with concurrent work, mostly MilkyWay nbody). The seven result files were 94 MB in total, but they uploaded within a quarter hour without a single interruption. (I had BOINC configured to allow only 1 transfer per project and a 100 kB/s cap on upload speed, which this ARP1 upload fully used.) Either I was lucky and met an unusually quiet point in time at WCG for the upload, or WCG's HTTP troubles only affect downloads, not uploads. Or maybe all work of the current ARP1 batch had already been distributed to hosts and WCG's HTTP errors are over... until the next ARP1 batch.
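As a rough sanity check of the upload figures above, here is a small sketch. It assumes decimal units (1 MB = 1000 kB) and that the 100 kB/s cap was saturated for the whole transfer; neither is stated explicitly in the post.

```python
# Back-of-the-envelope check: how long does 94 MB take at a 100 kB/s cap?
# Assumptions: decimal units (1 MB = 1000 kB) and a fully saturated cap.
total_mb = 94        # total size of the seven result files, from the post
cap_kb_per_s = 100   # configured upload cap, from the post

seconds = total_mb * 1000 / cap_kb_per_s
minutes = seconds / 60
print(f"{seconds:.0f} s = {minutes:.1f} min")
```

Under those assumptions the transfer needs about 940 s, or roughly 15.7 minutes, which matches "within a quarter hour" and suggests the upload ran at the cap with essentially no stalls.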
OK, that is, the download troubles are still not over. That must mean that Krembil's infrastructure can still somehow deal with the uploads which they are receiving, and only the downloads have the high error rate.

Markfw said: So new tasks came up, and sure enough, I had to abort them.
Sounds like it's better either to _not_ abort them (instead, let BOINC continue to retry until the tasks time out), because the BOINC client does not request more new work from a project at which it has stalled downloads, or to set WCG to no new work and check back again in a week or so.
 

Markfw

StefanR5R said: OK, that is, the download troubles are still not over. That must mean that Krembil's infrastructure can still somehow deal with the uploads which they are receiving, and only the downloads have the high error rate.

StefanR5R said: Sounds like it's better either to _not_ abort them (instead, let BOINC continue to retry until the tasks time out), because the BOINC client does not request more new work from a project at which it has stalled downloads, or to set WCG to no new work and check back again in a week or so.
So NO work for a week? WOW. So 10 minutes ago, I aborted 200 or so tasks on every box I have, and 200 more just showed up! I've been doing this for days now. I wait 8 hours or so, then abort another 200. You would think they'd get a clue. And my internet speed is 300 million/300 million! Fiber optic all the way to the house.
 

StefanR5R

What works:
  • scheduler requests to receive new work
  • uploading result files
  • scheduler requests to report results
What somewhat works:
  • visiting the WCG web site to change settings etc.
What almost doesn't work at all:
  • downloading input files of new work
Hence, it is better not to request new work for as long as the download troubles persist.

You can however periodically tell BOINC to retry stuck transfers, in addition to its own built-in retry policy. Of course more retries from more users won't fix whatever problems Krembil has. (Rather, more retries from more users will only aggravate these problems.)

Another thing you could do while this is going on is to limit the number of tasks which you want to receive. There should be a setting for this somewhere deep within your account pages on the WCG web server.

You don't need to abort tasks which are stuck downloading: Just let BOINC keep retrying the downloads. If it takes longer than the reporting deadline of the tasks permits, then the WCG server will tell the client to abort them.
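The two client-side actions mentioned in this post (stop requesting new work, and nudge BOINC to retry stuck transfers) can also be done from the command line with boinccmd. The following is only a sketch: the project URL is an assumption and should be checked against what `boinccmd --get_project_status` reports on your host, and the helper names are made up for illustration.

```python
import shutil
import subprocess

# Assumption: the WCG project URL as your client knows it.
# Verify with `boinccmd --get_project_status` before using this.
WCG_URL = "https://www.worldcommunitygrid.org/"


def no_new_work_cmd(project_url):
    """Build the boinccmd call that sets a project to 'no new tasks'."""
    return ["boinccmd", "--project", project_url, "nomorework"]


def retry_transfers_cmd():
    """Build the boinccmd call that retries deferred network communication."""
    return ["boinccmd", "--network_available"]


def run(cmd):
    """Run the command if boinccmd is installed; otherwise only print it."""
    if shutil.which(cmd[0]) is None:
        print("boinccmd not found, would run:", " ".join(cmd))
        return None
    return subprocess.run(cmd, check=False)


if __name__ == "__main__":
    run(no_new_work_cmd(WCG_URL))
    run(retry_transfers_cmd())
```

As noted above, retrying more often does not fix anything on the server side, so running the retry step by hand now and then is probably kinder to the project than putting it in a tight loop.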
 

Markfw

The problem continues, now also with Rosetta. Did something change in BOINC recently? I did not change anything, but now 2 systems are doing the same thing. For now they are set to "no new work" until I figure this out.
 

mmonnin03

Senior member
Nov 7, 2006
246
232
116
I would suspend the tasks stuck downloading so that they don't take up your queue. As long as they occupy all of the concurrent download slots, they keep other projects from downloading.

I had some tasks stuck downloading. Of course Grant at Rosetta thinks it's everyone else and not the server, as they are fine, when this is a server issue. I hit retry and after a couple of attempts I was able to download tasks.
 

Markfw


Is there something here that would make it try to load more than 60-70 tasks? It's a 64-core machine.
 

StefanR5R

@Markfw, besides temporarily suspending the WCG tasks which are stuck in downloading state like @mmonnin03 suggested, check whether an increased "Minimum work buffer" and a decreased "Additional work buffer" keep your work queue better filled.

Besides, make sure that cc_config.xml's <max_file_xfers> is always larger than <max_file_xfers_per_project>. Also, I wouldn't set <max_file_xfers_per_project> too high if dealing with weak project servers. BOINC defaults are <max_file_xfers>8</max_file_xfers> and <max_file_xfers_per_project>2</max_file_xfers_per_project> which should be good in most situations.
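To check the rule above on an existing file, something like the following sketch could be used. It assumes the two tags sit under <options> in cc_config.xml and falls back to the defaults quoted in the post when a tag is absent; the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# BOINC defaults as stated in the post above.
DEFAULTS = {"max_file_xfers": 8, "max_file_xfers_per_project": 2}


def transfer_limits(cc_config_xml):
    """Return (max_file_xfers, max_file_xfers_per_project) from cc_config.xml text."""
    root = ET.fromstring(cc_config_xml.strip())
    limits = {}
    for name, default in DEFAULTS.items():
        node = root.find(f"options/{name}")
        has_value = node is not None and node.text and node.text.strip()
        limits[name] = int(node.text) if has_value else default
    return limits["max_file_xfers"], limits["max_file_xfers_per_project"]


def limits_ok(cc_config_xml):
    """True if the total transfer limit is larger than the per-project limit."""
    total, per_project = transfer_limits(cc_config_xml)
    return total > per_project


EXAMPLE = """
<cc_config>
  <options>
    <max_file_xfers>8</max_file_xfers>
    <max_file_xfers_per_project>2</max_file_xfers_per_project>
  </options>
</cc_config>
"""

print(transfer_limits(EXAMPLE), limits_ok(EXAMPLE))  # (8, 2) True
```

To check a real file, read its contents and pass the text to limits_ok(); a file without an <options> section simply yields the defaults.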
 

Markfw

@StefanR5R, below is the culprit, I think. This affects Windows and Linux, I don't use a config, and all apps on all machines started doing this at once, so something is up with BOINC.

7742 dual Titan V

74418 World Community Grid 11/9/2024 9:14:29 AM Temporarily failed download of MCM1_0227680_8800_MCM1_0227680_8800.txt: transient HTTP error

7742 dual Titan V

74919 World Community Grid 11/9/2024 4:04:38 PM Temporarily failed upload of MCM1_0227680_8811_1_r1650405902_0: transient HTTP error
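To see whether failures like these cluster on downloads or uploads, the event log lines can be tallied. This is a minimal sketch that assumes every relevant line has the same "Temporarily failed download/upload of FILE: REASON" shape as the two excerpts above; other message formats are simply ignored.

```python
import re
from collections import Counter

# Assumption: failure lines look like the two excerpts quoted above.
PATTERN = re.compile(
    r"Temporarily failed (?P<direction>download|upload) of "
    r"(?P<file>\S+): (?P<reason>.+)$"
)


def count_failures(lines):
    """Count transfer failures per (direction, reason) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            key = (match.group("direction"), match.group("reason").strip())
            counts[key] += 1
    return counts


log = [
    "74418 World Community Grid 11/9/2024 9:14:29 AM Temporarily failed "
    "download of MCM1_0227680_8800_MCM1_0227680_8800.txt: transient HTTP error",
    "74919 World Community Grid 11/9/2024 4:04:38 PM Temporarily failed "
    "upload of MCM1_0227680_8811_1_r1650405902_0: transient HTTP error",
]

for key, n in sorted(count_failures(log).items()):
    print(key, n)
```

For the two lines above this reports one download failure and one upload failure, both with "transient HTTP error". Fed a longer copy of the event log, the same tally would show whether one direction dominates.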


I will try to get a Rosetta one.
 