They restarted the African Rainfall Project (arp1) last week. Each task has more than ten input files, several of them tens of megabytes in size. The result files of arp1 (seven files per result) are even larger, IIRC. This is probably just too much for Krembil's internet connection. [...] I read elsewhere that WCG, after the move to Krembil and for as long as the arp1 project was active, had roughly the same issues as now whenever they released a new arp1 work batch.
OK, so the download troubles are still not over. That must mean that Krembil's infrastructure can still somehow cope with the uploads it receives, and only the downloads have the high error rate.

I tried one ARP1 task. Downloading the input files took two hours with countless retries, but I did not rely on BOINC's built-in retry periods... Executing the task took 21 h on my Haswell Xeon E3 (with all other CPU threads busy with concurrent work, mostly MilkyWay nbody). The seven result files totaled 94 MB, but they uploaded within a quarter hour without a single interruption. (I had BOINC configured for only 1 transfer per project and a 100 kB/s upload cap, which this ARP1 upload fully used.) Either I was lucky and hit an unusually quiet moment at WCG for the upload, or WCG's HTTP troubles affect only downloads, not uploads. Or maybe all work from the current ARP1 batch has already been distributed to hosts and WCG's HTTP errors are over... until the next ARP1 batch.
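A quick sanity check of the upload figures, assuming the 100 kB/s cap was the only bottleneck and decimal units (1 MB = 10^6 bytes, 1 kB = 10^3 bytes):

```latex
t = \frac{94\,\text{MB}}{100\,\text{kB/s}}
  = \frac{94 \times 10^{6}\,\text{B}}{1 \times 10^{5}\,\text{B/s}}
  = 940\,\text{s} \approx 15.7\,\text{min}
```

With binary units (1 MiB = 1024 KiB) the result is about 963 s. Either way it comes to roughly 16 minutes, which fits "within a quarter hour" and a cap that was fully used the whole time.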
Sounds like it's better either to _not_ abort them (instead, let BOINC keep retrying until the tasks time out), because the BOINC client does not request new work from a project at which it has stalled downloads, or to set WCG to no new work and check back in a week or so.

So new tasks came up, and sure enough, I had to abort them.
So NO work for a week? WOW. Ten minutes ago I aborted 200 or so tasks on every box I have, and 200 more just showed up! I've been doing this for days now: I wait 8 hours or so, then abort another 200. You would think they'd get a clue. And my internet is 300/300 Mbit/s, fiber optic all the way to the house.
<max_file_xfers> should always be larger than <max_file_xfers_per_project>. Also, I wouldn't set <max_file_xfers_per_project> too high when dealing with weak project servers. The BOINC defaults are <max_file_xfers>8</max_file_xfers> and <max_file_xfers_per_project>2</max_file_xfers_per_project>, which should be good in most situations.
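For reference, here is a minimal cc_config.xml sketch that sets both options explicitly to the defaults quoted above. It assumes the standard cc_config.xml layout (client options go inside <options>) and that the file sits in the BOINC data directory; the client has to re-read its config files (for example via the Manager's "Read config files" option or `boinccmd --read_cc_config`) or be restarted before the change takes effect.

```xml
<cc_config>
  <options>
    <!-- total concurrent file transfers across all projects (BOINC default: 8) -->
    <max_file_xfers>8</max_file_xfers>
    <!-- concurrent transfers per project; keep this below max_file_xfers (default: 2) -->
    <max_file_xfers_per_project>2</max_file_xfers_per_project>
  </options>
</cc_config>
```

Dropping <max_file_xfers_per_project> to 1, as in the single-transfer setup mentioned earlier in the thread, trades some throughput for fewer simultaneous connections to a struggling project server.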