What's the difference between controller -> SSD TRIM and OS TRIM?
Hasn't this command been around since 2009 or earlier? Why is it only recently starting to receive limited controller support?
OS TRIM is what you usually see when someone talks about a device supporting TRIM commands. Remember that before SSDs, HDDs never cared whether or not sectors held data. When the OS deleted files, it just marked them as deleted, and later writes simply wrote over whatever was there. SSDs, on the other hand, work in pages, and an SSD has to erase a page before it can write to it (so if the page contained data before, the drive must first erase it, then write the new data). TRIM and UNMAP (the SAS equivalent) were a way for the OS to tell the SSD what it was tracking as deleted. It essentially tells the SSD "hey, we don't need these pages anymore," and the device goes "great" and files them away to be erased and made ready to receive writes directly, instead of needing an erase-program cycle at write time.
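If it helps to see it, here's a toy sketch of why that matters (the page states and relative costs here are made up for illustration, nothing like real firmware):

```python
# Toy model of why TRIM helps. Pages are DIRTY (hold stale data),
# ERASED (pre-cleaned, ready), or LIVE. Writing to a DIRTY page costs
# an erase plus a program; writing to an ERASED page costs only a
# program. Costs are illustrative relative numbers, not real latencies.

ERASE_COST, PROGRAM_COST = 10, 1

class ToyFTL:
    def __init__(self, n_pages):
        self.state = ["DIRTY"] * n_pages  # drive full of stale data
        self.work = 0

    def trim(self, page):
        # OS says "this page's data is no longer needed": the drive
        # can erase it in the background, off the write path.
        self.state[page] = "ERASED"

    def write(self, page):
        if self.state[page] == "DIRTY":
            self.work += ERASE_COST       # erase-program cycle
        self.work += PROGRAM_COST
        self.state[page] = "LIVE"

no_trim = ToyFTL(4)
for p in range(4):
    no_trim.write(p)          # every write pays an erase first

with_trim = ToyFTL(4)
for p in range(4):
    with_trim.trim(p)         # deletes were reported ahead of time
for p in range(4):
    with_trim.write(p)        # pages are pre-erased: program only

print(no_trim.work, with_trim.work)   # 44 4
```

Same four writes, roughly an order of magnitude less work on the write path, because the erases happened ahead of time.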
Controller TRIM essentially takes over, at the device level, what the OS was doing. Remember from the above example: if the OS can't see the SSD, it can't tell the SSD what to do with deleted pages, and it shouldn't, because the RAID controller is obfuscating that behind its own data structures. The OS only sees a virtual device. Instead, the RAID controller itself tracks deletes from the OS via its driver. As these build up, the RAID controller might go "hey, a stripe here can be cleaned." It then sends TRIM commands to the downstream devices indicating that the pages making up that cleanable stripe can be cleaned up, and the SSDs file those away.
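A rough sketch of that stripe accounting (the layout, chunk sizes, and names here are all invented for illustration; real controllers are far more involved):

```python
# Hedged sketch of controller-side TRIM for a RAID 0-style layout:
# the controller collects deletes against the virtual device, and only
# when every chunk of a stripe is deletable does it issue per-drive
# TRIMs. Assumes, for brevity, a delete stays within one stripe.

CHUNK = 128          # virtual-device blocks per chunk (made up)
N_DRIVES = 4         # chunks per stripe

def chunks_of_stripe(stripe):
    """Virtual block ranges making up one stripe, one chunk per drive."""
    base = stripe * N_DRIVES * CHUNK
    return [(d, base + d * CHUNK, base + (d + 1) * CHUNK)
            for d in range(N_DRIVES)]

class ToyController:
    def __init__(self):
        self.deleted = set()   # virtual blocks the OS has deleted
        self.sent = []         # (drive, lba_start, length) TRIMs issued

    def note_delete(self, start, length):
        self.deleted.update(range(start, start + length))
        self._maybe_trim(start // (N_DRIVES * CHUNK))

    def _maybe_trim(self, stripe):
        ranges = chunks_of_stripe(stripe)
        if all(b in self.deleted
               for _, lo, hi in ranges for b in range(lo, hi)):
            # whole stripe is dead: tell each drive its chunk is free
            for drive, lo, hi in ranges:
                self.sent.append((drive, stripe * CHUNK, CHUNK))

ctl = ToyController()
ctl.note_delete(0, 256)     # half of stripe 0: nothing trimmed yet
print(ctl.sent)             # []
ctl.note_delete(256, 256)   # rest of stripe 0: now it can be cleaned
print(ctl.sent)             # one TRIM per drive for that stripe's chunk
```

The point of the sketch: no single OS delete triggers a downstream TRIM; the controller waits until its own unit of bookkeeping (the stripe) is fully dead.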
If you check for TRIM support at the OS level in the above controller design, the OS won't see TRIM support, because the OS doesn't get to tell the device what can be trimmed. Instead, the controller intercepts those delete notifications and, over time, tells the SSDs which pages can truly be trimmed.
As for why it's been a long time coming: like a lot of things, it isn't down to any one thing, but a multitude of them.
SSDs ideal for RAID have gotten really good in a short period of time. Garbage collection is many times better than it used to be.
SSDs not ideal for RAID have also gotten better at garbage collection, but the things that make them unsuitable for RAID are even worse than before (lack of power loss protection being a big one).
RAID works on stripes; TRIM works on blocks (which contain pages). You get a lot of disagreeing chefs in the kitchen when the two come together.
TRIM in and of itself is per-device work: TRIM only knows about a device it can see. That's in direct conflict with RAID, which hides devices behind a virtual device.
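Some made-up numbers to show the granularity mismatch (none of these sizes come from a real product):

```python
# Back-of-the-envelope on why stripes and erase blocks disagree.
# All sizes are illustrative assumptions.

KiB = 1024
CHUNK = 256 * KiB             # per-drive chunk of a stripe
STRIPE = 4 * CHUNK            # 4 data drives -> 1 MiB full stripe
ERASE_BLOCK = 2 * 1024 * KiB  # per-drive NAND erase block

# A 3 MiB file deleted starting mid-stripe:
deleted_start = 300 * KiB
deleted_end = deleted_start + 3 * 1024 * KiB

# A stripe-tracking controller can only trim whole stripes that fall
# entirely inside the deleted extent:
first_full = -(-deleted_start // STRIPE) * STRIPE   # round up
last_full = deleted_end // STRIPE * STRIPE          # round down
trimmable = max(0, last_full - first_full)
print(trimmable // KiB, "of", (deleted_end - deleted_start) // KiB,
      "KiB trimmable")   # 2048 of 3072 KiB trimmable

# And even a trimmed 256 KiB chunk is only a fraction of the drive's
# erase block, so the NAND still can't erase until its neighbors are
# trimmed too:
chunks_per_erase = ERASE_BLOCK // CHUNK
print(chunks_per_erase, "trimmed chunks per erase block")   # 8
```

A third of the delete can't be passed down at all, and what can be passed down still leaves the drive waiting on its neighbors. That's the chefs disagreeing.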
The long and short of it is that for SAS and SATA, enterprise drives are designed to be highly aggressive with garbage collection. They have a large amount of spare area, NAND, and plain horsepower to keep themselves cleaned up even under highly aggressive workloads, depending on the model. Hyperconverged infrastructure is an example: it relies heavily on SSDs as both a cache and a landing pad for incoming data being distributed to hard drives. Such SSDs are pounded with data 24/7 with no TRIM support, yet the proper SSD has no issue with it. It's a testament to the performance and capability of modern enterprise SSDs.
One of the things that makes modern enterprise SSDs good is Power Loss Protection (PLP). Having PLP means an SSD can consider writes committed as soon as they enter the DRAM buffer. In the event of a power outage, the SSD can commit those writes to NAND using its capacitor reserve. The same can't be said of most consumer SSDs, which lack PLP: writes are held in volatile memory until flushed, and a power disruption in most cases loses those changes. If the SSD was also updating pages or performing garbage collection at the time, the result can be widespread data corruption.
Imagine someone passing plates to a second person, who has a side table next to them to set the plates on before they're rolled out to another table. The person handing over the plates keeps asking "are they safe?" and the receiver keeps saying "yep" right away, because all they're doing is setting them on the side table. Then the receiver slips and falls. Everything is fine, because the plates are safe on the table. A backup person comes in and moves them all over to the main table.
Now imagine there's no side table. The person handing over the plates keeps asking "are they safe?" and the receiver keeps saying "not yet," because all they can do is stack them up in their arms. Every few seconds, the person handing over the plates gets nervous and says "no more plates until you put those on the table and tell me they're safe." The receiver does that, then reports the plates safe, and the process begins again. If the receiver falls, every plate they were holding before reaching the table is lost.
That's the difference between an SSD with PLP and one without, and that's why RAID should be done with proper, safe SSDs. When those things align, performance stays high, and TRIM is not the major issue it once was.
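The plate analogy maps pretty directly onto a toy model (again, invented ack semantics for illustration, not real firmware behavior):

```python
# Hedged sketch of the acknowledgment semantics PLP changes: with PLP
# the drive can acknowledge a write the moment it lands in DRAM (the
# side table); without it, anything still in DRAM at power loss is gone.

class ToySSD:
    def __init__(self, plp):
        self.plp = plp
        self.dram, self.nand = [], []

    def write(self, data):
        self.dram.append(data)
        # With PLP, capacitors guarantee DRAM contents reach NAND,
        # so it's safe to report "committed" immediately.
        return "committed" if self.plp else "buffered"

    def flush(self):
        self.nand += self.dram
        self.dram = []

    def power_loss(self):
        if self.plp:
            self.flush()    # capacitor reserve drains the buffer
        else:
            self.dram = []  # volatile buffer is simply lost

safe, unsafe = ToySSD(plp=True), ToySSD(plp=False)
for d in ("a", "b", "c"):
    safe.write(d)
    unsafe.write(d)
safe.power_loss()
unsafe.power_loss()
print(safe.nand, unsafe.nand)   # ['a', 'b', 'c'] []
```

The consumer drive's only defense is the "no more plates until they're on the table" dance: frequent flushes, which is exactly what kills its write performance under sustained load.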
TRIM on NVMe is not a given. It's mainly a more open possibility on NVMe because most controllers operating in NVMe mode are little more than PCIe switches with a software controller; in the right mode, the system can see all devices on the other side. So with the software configured properly, the OS can pass TRIM commands down to those devices as interpreted by the controller. When you have only a virtual representation of a bunch of devices, as with an entire SAS or SATA stack on a traditional RAID controller, that obfuscation is what makes TRIM harder to implement.