Odd Solaris/Napp-it issue

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
I believe this is either a Solaris or a hardware issue, but I'm including napp-it just in case, and to help summon gea.

I've got two Solaris 11.3 boxes running napp-it Pro (January build). One is a physical box, the other a VM. The physical box replicates to the VM. Outside of that, configuration- and software-wise they are identical; the only service they run is COMSTAR iSCSI, which feeds LUNs to my ESXi host. They've both been running smoothly for ages. I generally log into the Web-GUI every week or two just to check system health. When I did this Monday, the login was super slow. Once I got in, I saw a red light on the real-time disk monitor. One of the spindles in the Z2 array was pegged at 100% busy, so I swapped the drive. Everything seemed to go back to normal at that point.

Tuesday, I tried to sign in again. The sign-in page loaded normally. When I signed in, it authenticated the credentials (if I put in bad creds, it errors out accordingly), then hung. The storage itself seemed to be working fine, as the box is the underlying storage for my ESXi host. So I consoled into the box and restarted the napp-it Web-GUI. No change. Solaris System Monitor looked normal. The pool looked normal. I tried to reboot the box via the menu. Nothing happened. I tried to open a terminal to reboot via CLI. The terminal wouldn't open.

At this point, I stopped the replication job, took the replication pool out of read-only, and sent all my iSCSI traffic to the VM, which is running strong (yay for home DR, lol). I'll be digging into the issue more on the weekend, but I'm not quite sure where to start, as there haven't been any configuration changes and everything looks normal except that the box is unresponsive to most actions. I'll be pulling the OS/rpool and ZIL SSDs out to check SMART health, but beyond that I'm not sure where to go.
 

gea

Senior member
Aug 3, 2014
221
12
81
When napp-it hangs on login, you most probably have a disk/HBA problem, as the first napp-it actions are to detect all disks and pools. For unclear problems you can

check disk and pool status at the console with these commands:
# format
- lists all disks; cancel with Ctrl-C after the listing. If format hangs, remove all data disks, then insert them one at a time and rerun format to find the disk that is blocking the system. If the listing stalls on a disk for a long time, replace that disk.

If ZFS waits too long for a semi-dead disk, you can reduce the disk timeout from the default 60s to a lower value like 15s. TLER disks use 8s, but this is too low for large arrays. A lower timeout takes a troubled disk offline sooner (in napp-it pro: menu system > appliance tuning).
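
A minimal sketch of setting that timeout by hand, assuming it corresponds to the Solaris sd driver tunable sd_io_time (default 0x3c = 60s), which is set in /etc/system and read at boot. The napp-it tuning menu may do this differently; the helper name sd_timeout_line is made up for illustration.

```shell
# Print the /etc/system line for a given disk I/O timeout in seconds.
# Assumption: the relevant tunable is sd:sd_io_time (value in seconds, hex).
sd_timeout_line() {
  printf 'set sd:sd_io_time = 0x%x\n' "$1"
}

# Review the line first, then append it to /etc/system as root and reboot:
#   sd_timeout_line 15 >> /etc/system
```

Keeping the output on stdout makes it easy to check the line before it goes anywhere near /etc/system.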

You can also use iostat to check whether the load on all disks of a pool is similar. If one disk shows a weakness, such as much lower I/O or higher wait/busy values, replace it.
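
One way to spot such a disk without eyeballing the whole table is a small filter over `iostat -xn` output. This is a sketch only: it assumes the Solaris extended column layout (r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device), and the function name and the 80% default are arbitrary choices, not anything napp-it provides.

```shell
# Flag disks whose busy percentage (%b) is at or above a limit.
# Reads `iostat -xn` style output on stdin; $1 = limit in percent (default 80).
# On Solaris, set AWK=nawk, since the stock /usr/bin/awk may not accept -v.
flag_busy_disks() {
  "${AWK:-awk}" -v limit="${1:-80}" '
    # Data rows have 11 fields and a numeric first column (r/s);
    # the header rows fail this check and are skipped.
    NF == 11 && $1 ~ /^[0-9.]+$/ && $10 + 0 >= limit + 0 {
      printf "%s busy=%s%% asvc_t=%sms\n", $11, $10, $8
    }
  '
}

# Example: two 5-second samples (the first report is the average since boot):
#   iostat -xn 5 2 | AWK=nawk flag_busy_disks 80
```

A single disk that keeps showing up here while its pool neighbours stay quiet is the one to pull first.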

# zpool status
- lists pools and pool status
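
If the status output is long, a filter like the following can pull out just the devices that are not healthy. It is a rough sketch that assumes the usual NAME/STATE/READ/WRITE/CKSUM config table; the function name is invented, and `zpool status -x` is the built-in shortcut when you only want to know whether any pool has a problem.

```shell
# Print "device state" for every vdev or disk that is not healthy.
# Reads `zpool status` output on stdin. On Solaris, set AWK=nawk.
unhealthy_devices() {
  "${AWK:-awk}" '
    $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }  # start of config table
    /^errors:/                    { in_cfg = 0 }        # end of config table
    in_cfg && $2 ~ /^(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ {
      print $1, $2
    }
  '
}

# Example:
#   zpool status | AWK=nawk unhealthy_devices
```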

- check system and fault logs (napp-it menu system)
- check services (in napp-it 17.04dev there is a new menu Services > Status)

At the console you can use svcs to list all services and svcs -x servicename to show details; the service logs are in /var/svc/log.
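
To go from a service name to its log file, something like this may help. It assumes the usual SMF log naming (strip "svc:/", replace "/" with "-", append ".log"); the helper name is made up, and svc:/network/iscsi/target:default is the stock COMSTAR iSCSI target service, used here only as an example.

```shell
# Map an SMF service FMRI to its log file under /var/svc/log.
# Assumes the usual naming: drop "svc:/", turn "/" into "-", append ".log".
svc_log_path() {
  fmri=${1#svc:/}
  printf '/var/svc/log/%s.log\n' "$(printf '%s' "$fmri" | tr '/' '-')"
}

# Example on the box itself:
#   svcs -x     # explain services that are enabled but not running
#   tail -50 "$(svc_log_path svc:/network/iscsi/target:default)"
```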
 

XavierMace

Diamond Member
Apr 20, 2013
4,307
450
126
My thinking was that only the boot/rpool would affect the login, no? I've had a degraded zpool before and that didn't affect the login. If it was the SLOG that was dying (but not outright dead), would that explain it? I'm using the F20 as the SLOG currently. The console wouldn't load, which was limiting my options, but a semi-dead disk is when all this started. One of the drives in the zpool was pegged at 100% busy, so it was replaced. SMART still insists the drive is fine, but I'm getting a blistering 7MB/s sequential transfer on it. LOL. Things seemed fine for a while after it was replaced, and the webgui was loading fine at that point. It wasn't until the following night that it completely blew up.

Since the backup box is running fine, I'm going to blow away the whole pool on the malfunctioning box, pull all the drives, and test them one at a time. The backplane is split between two HBAs, so if it's an HBA issue, I'll have to isolate which one is causing the problems, which will be a little problematic if I can't get the webgui or console to load. LOL. But at least I don't have to worry about the data. Replication for the win. I've got two more spare drives on hand, so if another one took a dump I'll be fine, but I'd be concerned about two failing at the same time.

I'll definitely reduce the timeout like you suggested though, and if I can get into the menu or console again, I'll check the logs.
 