WCG problems

Markfw · Nov 7, 2024

I have hundreds of tasks stuck "downloading" forever, so I aborted them all. Then new tasks came up, and sure enough, I had to abort them too. For the most part, WCG is dead at the moment. Anybody know what's going on?

StefanR5R · Nov 8, 2024

StefanR5R said:
They restarted the African Rainfall Project (arp1) last week. Each task has more than ten input files, and several of them are tens of megabytes in size. The result files of arp1 (seven files per result) are even larger IIRC. This is probably just too much for Krembil's internet connection. [...] I read elsewhere that WCG, after the move to Krembil and for as long as the arp1 project was active, had about the same issues as now whenever they submitted a new arp1 work batch.

StefanR5R said:
I tried one ARP1 task. Downloading the input files took two hours with countless retries, but I did not rely on BOINC's built-in retry periods... Executing the task took 21 h on my Haswell Xeon E3 (with all other CPU threads busy with concurrent work, mostly MilkyWay nbody). The seven result files were 94 MB big in total, but they uploaded within a quarter hour without a single interruption. (I had BOINC configured to only 1 transfer/project and 100 kB/s cap on upload speed, which this ARP1 upload fully used.) Either I was lucky and met an unusually quiet point in time at WCG for the upload, or WCG's HTTP troubles only affect downloads, not uploads. Or maybe all work of the current ARP1 batch had already been distributed to hosts and WCG's HTTP errors are over... until the next ARP1 batch.

OK, that is, the download troubles are still not over. That must mean that Krembil's infrastructure can still somehow deal with the uploads which they are receiving, and only the downloads have the high error rate.

Markfw said:
So new tasks came up, and sure enough, I had to abort them.

Sounds like it's better either to _not_ abort them (instead, let BOINC continue to retry until the tasks time out), because the BOINC client does not request more new work from a project at which it has stalled downloads, or to set WCG to no new work and check back again in a week or so.

Markfw · Nov 8, 2024

StefanR5R said:
OK, that is, the download troubles are still not over. That must mean that Krembil's infrastructure can still somehow deal with the uploads which they are receiving, and only the downloads have the high error rate.

Sounds like it's better either to _not_ abort them (instead, let BOINC continue to retry until the tasks time out), because the BOINC client does not request more new work from a project at which it has stalled downloads, or to set WCG to no new work and check back again in a week or so.

So NO work for a week? WOW. So 10 minutes ago, I aborted 200 or so tasks on every box I have, and 200 more just showed up! I've been doing this for days now. I wait 8 hours or so, then abort another 200. You would think they'd get a clue. And my internet is 300 million/300 million speed! Fiber optic all the way to the house.

StefanR5R · Nov 8, 2024

What works:

scheduler requests to receive new work
uploading result files
scheduler requests to report results

What somewhat works:

visiting the WCG web site to change settings etc.

What almost doesn't work at all:

downloading input files of new work

Hence, it is better not to request new work for as long as the download troubles persist.

You can however periodically tell BOINC to retry stuck transfers, in addition to its own built-in retry policy. Of course more retries from more users won't fix whatever problems Krembil has. (Rather, more retries from more users will only aggravate these problems.)

Another thing you could do while this is going on is to limit the number of tasks which you want to receive. There should be a setting for this somewhere deep within your account pages on the WCG web server.

You don't need to abort tasks which are stuck downloading: Just let BOINC keep retrying the downloads. If it takes longer than the reporting deadline of the tasks permits, then the WCG server will tell the client to abort them.
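For anyone who prefers to script the two suggestions above (set the project to "no new work" and occasionally nudge stuck transfers), here is a minimal sketch in Python that builds the matching `boinccmd` command lines. It assumes `boinccmd` is on the PATH and can reach the local client, and that the project URL below matches the one your client shows for WCG; both are assumptions, not something stated in this thread. By default it only prints the commands (dry run). As noted above, more retries won't fix the server side, so use the retry sparingly.

```python
import subprocess

# Assumption: the project URL as listed by your own BOINC client; adjust if it differs.
WCG_URL = "https://www.worldcommunitygrid.org/"


def no_new_work_cmd(project_url, boinccmd="boinccmd"):
    """Command line that sets a project to 'no new tasks'."""
    return [boinccmd, "--project", project_url, "nomorework"]


def allow_new_work_cmd(project_url, boinccmd="boinccmd"):
    """Command line that allows the project to fetch work again."""
    return [boinccmd, "--project", project_url, "allowmorework"]


def retry_transfers_cmd(boinccmd="boinccmd"):
    """Command line that asks the client to retry deferred network communication."""
    return [boinccmd, "--network_available"]


def run(cmd, dry_run=True):
    """Print the command (dry run) or execute it; returns the exit code."""
    if dry_run:
        print(" ".join(cmd))
        return 0
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    run(no_new_work_cmd(WCG_URL))   # stop requesting new WCG work
    run(retry_transfers_cmd())      # retry the transfers that are backed off
```

Call `run(..., dry_run=False)` to actually execute a command, and use `allow_new_work_cmd` once the downloads behave again.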

Markfw · Nov 8, 2024

The problem continues, now also with Rosetta. Did something change in BOINC recently? I did not change anything, but now 2 systems are doing the same thing. For now they are set to "no new work" until I figure this out.

mmonnin03 · Nov 8, 2024

I would suspend the tasks stuck downloading so that they don't take up your queue. Otherwise, as long as all of the concurrent download slots are taken up, they prevent other projects from downloading.

I had some tasks stuck downloading. Of course Grant at Rosetta thinks it's everyone else and not the server, as they are fine, when this is a server issue. I hit retry and after a couple of attempts I was able to download tasks.

Markfw · Nov 8, 2024

Is there something here that would make it try to load more than 60-70 tasks? It's a 64 core machine.

StefanR5R · Nov 9, 2024

@Markfw, besides temporarily suspending the WCG tasks which are stuck in downloading state like @mmonnin03 suggested, try whether an increased "Minimum work buffer" and a decreased "Additional work buffer" keep your work queue better filled.

Besides, make sure that cc_config.xml's <max_file_xfers> is always larger than <max_file_xfers_per_project>. Also, I wouldn't set <max_file_xfers_per_project> too high if dealing with weak project servers. BOINC defaults are <max_file_xfers>8</max_file_xfers> and <max_file_xfers_per_project>2</max_file_xfers_per_project> which should be good in most situations.
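To make the rule about the two transfer limits concrete, here is a small sketch in Python that reads a cc_config.xml text and checks that `<max_file_xfers>` is larger than `<max_file_xfers_per_project>`. It assumes the usual layout with both settings inside `<options>` under `<cc_config>`, and it falls back to the defaults quoted above (8 and 2) when a setting is absent. The function names and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Defaults as quoted in the post above.
DEFAULTS = {"max_file_xfers": 8, "max_file_xfers_per_project": 2}

# Illustrative sample only; replace with the text of your own cc_config.xml.
EXAMPLE = """<cc_config>
  <options>
    <max_file_xfers>8</max_file_xfers>
    <max_file_xfers_per_project>2</max_file_xfers_per_project>
  </options>
</cc_config>"""


def transfer_limits(xml_text):
    """Return both transfer limits, using the defaults for missing entries."""
    root = ET.fromstring(xml_text)
    limits = dict(DEFAULTS)
    for name in DEFAULTS:
        node = root.find(f"./options/{name}")
        if node is not None and node.text and node.text.strip():
            limits[name] = int(node.text.strip())
    return limits


def limits_ok(limits):
    """True if the total limit is strictly larger than the per-project limit."""
    return limits["max_file_xfers"] > limits["max_file_xfers_per_project"]


if __name__ == "__main__":
    limits = transfer_limits(EXAMPLE)
    verdict = "OK" if limits_ok(limits) else "total limit should exceed per-project limit"
    print(limits, verdict)
```

With no cc_config.xml at all (as on a client that doesn't use a config), the defaults already satisfy the rule.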

Markfw · Nov 9, 2024

@StefanR5R, below is the culprit, I think. This affects Windows and Linux, I don't use a config, and all apps and all machines started this at once, so something is up with BOINC.

7742 dual Titan V

74418 World Community Grid 11/9/2024 9:14:29 AM Temporarily failed download of MCM1_0227680_8800_MCM1_0227680_8800.txt: transient HTTP error

7742 dual Titan V

74919 World Community Grid 11/9/2024 4:04:38 PM Temporarily failed upload of MCM1_0227680_8811_1_r1650405902_0: transient HTTP error

I will try to get a Rosetta one.
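If you want to see whether downloads, uploads, or both are failing across a longer event log, a rough way is to tally the "Temporarily failed ..." lines per project and direction. The sketch below is in Python and assumes the copied log lines look exactly like the two above (sequence number, project name, US-style date and time, message); other locales or log formats would need a different pattern. The function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed line layout: "<seq> <project> <M/D/YYYY h:mm:ss AM|PM> <message>"
LINE_RE = re.compile(
    r"^(?P<seq>\d+)\s+(?P<project>.+?)\s+"
    r"(?P<when>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\s+"
    r"(?P<message>.*)$"
)
FAIL_RE = re.compile(
    r"^Temporarily failed (?P<direction>download|upload) of (?P<file>\S+): (?P<reason>.+)$"
)

# The two lines quoted in the post above.
SAMPLE = [
    "74418 World Community Grid 11/9/2024 9:14:29 AM Temporarily failed download of "
    "MCM1_0227680_8800_MCM1_0227680_8800.txt: transient HTTP error",
    "74919 World Community Grid 11/9/2024 4:04:38 PM Temporarily failed upload of "
    "MCM1_0227680_8811_1_r1650405902_0: transient HTTP error",
]


def count_transfer_failures(lines):
    """Count temporarily failed transfers per (project, direction)."""
    counts = Counter()
    for line in lines:
        entry = LINE_RE.match(line.strip())
        if entry is None:
            continue  # not an event-log line in the assumed format
        failure = FAIL_RE.match(entry.group("message"))
        if failure is not None:
            counts[(entry.group("project"), failure.group("direction"))] += 1
    return counts


if __name__ == "__main__":
    for (project, direction), n in sorted(count_transfer_failures(SAMPLE).items()):
        print(f"{project}: {n} failed {direction}(s)")
```

Feed it the lines copied from the event log of each machine to compare failure counts between projects.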

Assimilator1 · Dec 8, 2024

World Community Grid: WCG downtime in December
There will be an extended WCG downtime from December 7th, 2024 to January 3rd, 2025.
05/12/2024 22:41:39 · more...

I see WCG sent the above message the other day saying they're shutting down everything for a month; at the link, it went on to say it was for the site move.
They also said they'd extend the deadlines for any WUs out there atm, although I'm surprised they didn't turn off the WU taps before shutting the servers down.
I've got dozens waiting for upload.

Markfw · Jan 3, 2025

Assimilator1 said:
World Community Grid: WCG downtime in December
There will be an extended WCG downtime from December 7th, 2024 to January 3rd, 2025.
05/12/2024 22:41:39 · more...

I see WCG sent the above message the other day saying they're shutting down everything for a month; at the link, it went on to say it was for the site move.
They also said they'd extend the deadlines for any WUs out there atm, although I'm surprised they didn't turn off the WU taps before shutting the servers down.
I've got dozens waiting for upload.

Well, it's Jan 3 and it's not up. Does that mean that tomorrow it will be back up?

Assimilator1 · Jan 4, 2025

Doesn't look like it. I've still got my WUs waiting to upload, and the website is still down.

I found the following here :-
January 3, 2025

We have been notified that the core system at SHARCNET is coming online now (5pm). They are planning to complete it tonight (January 3). We are waiting for the access to our systems, and will start turning everything back on as soon as we gain login.

I can't see it coming online before Monday.

Markfw · Jan 4, 2025

Assimilator1 said:
Doesn't look like it. I've still got my WUs waiting to upload, and the website is still down.

I found the following here :-
January 3, 2025

We have been notified that the core system at SHARCNET is coming online now (5pm). They are planning to complete it tonight (January 3). We are waiting for the access to our systems, and will start turning everything back on as soon as we gain login.

I can't see it coming online before Monday.

thanks

Markfw · Jan 6, 2025

Still down. Any ideas? I found this on that site:

January 4, 2025
- Update from the data centre: "Having issues with the physical network , likely can't get it diagnosed and fixed until Monday, January 6.". As a result - we still do not have a connection to our servers.

Markfw · Jan 7, 2025

Well, this site is up: https://www.worldcommunitygrid.org/about_us/article.s?articleId=818

But I have over 5500 tasks trying to upload. So it's only sort of up...

Message from WCG

7950x main

303624 World Community Grid 1/7/2025 3:42:56 PM update requested by user
303625 PrimeGrid 1/7/2025 3:42:58 PM Finished upload of llrPPSE_605434265_0_r1342012107_0 (257840 bytes)
303626 World Community Grid 1/7/2025 3:43:02 PM Sending scheduler request: Requested by user.
303627 World Community Grid 1/7/2025 3:43:02 PM Not requesting tasks: too many uploads in progress
303628 World Community Grid 1/7/2025 3:43:03 PM Scheduler request to https://scheduler.worldcommunitygrid.org/boinc/wcg_cgi/fcgi failed: HTTP service unavailable

cellarnoise · Jan 7, 2025

They should just say that they "are working on it". Hopefully it is back up soon.

I have a lot of finished work ready to upload also.

Their servers are going to get hammered once they are up again, and that alone will likely take a day or more to overcome.

StefanR5R · Jan 9, 2025

I read elsewhere that clients could clear their pending uploads.

From https://www.cs.toronto.edu/~juris/jlab/wcg.html:
January 8, 2025

Most of our infrastructure is back online. Unfortunately, some issues with the network and specific virtual machines remain. Thus, the BOINC database node remains unavailable, and the website and forums also do not function properly.
Sharcnet data center team is working on to restore access to these instances in priority order.
Once this is resolved, we will have a smooth restart of the workunit management and BOINC components on the backend, and be able to isolate and diagnose any remaining issues as we restart.

Markfw · Jan 9, 2025

My uploads went through, but I have thousands of unreported tasks!

StefanR5R · Jan 9, 2025

Yes, the final step of reporting the results requires the database to function (and more), which has yet to happen.

The transfer of result files, which is the step before this, only requires server-side disk space to be online and the HTTP upload handler to be up.

StefanR5R · Jan 11, 2025

From https://www.cs.toronto.edu/~juris/jlab/wcg.html, Operational Status:
January 9, 2025

BOINC database is up and in a good state. We are waiting on two more servers to regain access to the network, at which point we will be restarting the scheduler, transitioner, assimilators and validators.
All deadlines for outstanding MCM1 work units have been extended to just after 6:00 p.m. Eastern Standard Time on January 15th, 2025.
Web site is up; stats will be updated soon.
Forums are up.

The scheduler was restarted several hours ago.

Markfw · Jan 11, 2025

WCG is running like gangbusters !
