SSD defragmentation?

ashetos

Senior member
Jul 23, 2013
254
14
76
So I've been thinking. We all know that defragmenting an SSD is so write-heavy that it should be avoided at all costs, due to the reduction in SSD lifetime;

but...

I keep wondering what the impact of a fragmented NTFS file-system is on:
1) the persistent re-mapping data structures
2) the clean-up algorithms

My guess is that, for all controllers out there, it is always better to have large, sequentially allocated files (in terms of logical LBAs, i.e. file-system placement, not physical NAND placement). This should help because the re-mapping meta-data could be easily coalesced by the garbage collection routines, which would otherwise be impossible if the file were fragmented.

So, if a perfectly de-fragmented file-system indeed reduces the size of the re-mapping meta-data, this would potentially have at least 3 other benefits:
1) Faster look-ups for accessing data blocks, easier read-ahead for reads and buffering for writes
2) More efficient TRIM handling as most unallocated areas should also be de-fragmented
3) Faster wear-leveling/garbage collection thanks to less meta-data

To conclude, I'm arguing that a perfectly defragmented file-system on an SSD would be consistently faster for all operations and possibly provide better NAND longevity in the long run.

So, maybe it is worth it to erase the SSD once in a while, and copy back all the files from a back-up to the SSD for perfect file placement?
 

Hellhammer

AnandTech Emeritus
Apr 25, 2011
701
4
81
The beauty of an SSD is that it doesn't care about file placement. When you write to an SSD, the controller will fragment the data in order to write it as fast as possible by utilizing multiple NAND dies. For example, if you write a 100MB file, the controller won't write 100MB to the first die; instead it will break the file into pieces and write to multiple dies simultaneously to take advantage of parallelism.
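A rough way to picture that splitting, purely as an illustration (the page size, die count and round-robin placement below are assumptions, not any specific controller's firmware):

```python
# Toy illustration of splitting a host write across NAND dies so that
# several dies can be programmed in parallel. All constants are assumptions.
PAGE_SIZE = 16 * 1024        # assumed NAND page size
NUM_DIES = 8                 # assumed number of dies addressable in parallel

def stripe_write(data: bytes):
    """Split a write into page-sized chunks and hand them out round-robin,
    so up to NUM_DIES pages can be programmed at the same time."""
    chunks = [data[i:i + PAGE_SIZE] for i in range(0, len(data), PAGE_SIZE)]
    per_die = [[] for _ in range(NUM_DIES)]
    for idx, chunk in enumerate(chunks):
        per_die[idx % NUM_DIES].append(chunk)
    return per_die

# A 100MB file ends up spread across every die rather than filling one die:
layout = stripe_write(bytes(100 * 1024 * 1024))
print([len(pages) for pages in layout])   # roughly equal page counts per die
```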
 

Insert_Nickname

Diamond Member
May 6, 2012
4,971
1,693
136
To conclude, I'm arguing that a perfectly defragmented file-system on an SSD would be consistently faster for all operations and possibly provide better NAND longevity in the long run.

Why? Besides every block in the NAND having the same access time, the LBA the OS sees has absolutely nothing to do with what's happening inside the SSD...

I think this is already covered by the internal garbage collection and controller management of writes.

But please correct me if I'm wrong...
 

ashetos

Senior member
Jul 23, 2013
254
14
76
I believe we cannot look past logical LBAs. For example, let's say we have a huge file, 40GiB in an 80GiB SSD.

With TRIM support this means half the SSD space is allocated and half of it is free.

Now, depending on file-system placement, this could end up being LBA 0-40GiB with perfect placement, or LBA 0, 2, 4, 6, 8...80GiB, which is the worst possible placement.

In the first case, the SSD firmware can coalesce and optimize the look-up meta-data to be as coarse-grained as makes sense, for instance one entry per 2MiB block.

In the second case, the SSD firmware needs meta-data for every 4KiB of data, and it cannot coalesce, ever.
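To put rough numbers on those two cases (a back-of-the-envelope sketch; the 4KiB and 2MiB granularities are just the figures used above, and it assumes a controller that can actually coalesce contiguous runs):

```python
# Mapping entries needed for the 40GiB file in the two placements above,
# assuming 4KiB per-page entries vs. coalesced 2MiB extents.
GiB = 1024**3
file_size = 40 * GiB

page_entries = file_size // (4 * 1024)            # every 4KiB tracked separately
extent_entries = file_size // (2 * 1024 * 1024)   # one entry per contiguous 2MiB run

print(page_entries)     # 10,485,760 entries for the fragmented placement
print(extent_entries)   # 20,480 entries for the perfectly sequential placement
```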

Since the amount of meta-data is typically larger than the SSD DRAM, accesses are impacted by the additional flash look-ups. Also, since meta-data need to be written to flash for recovery purposes, more meta-data means more synchronous flash writes. Both these overheads should already be significant.

And if we add workload patterns into the mix, applications are bound to issue frequent sequential accesses or large I/O requests (which are equivalent to sequential accesses made with smaller requests). The mismatch between the OS access pattern and the flash meta-data layout will then have an additional performance impact.
 

Hellhammer

AnandTech Emeritus
Apr 25, 2011
701
4
81
Logical LBAs have nothing to do with the physical location of the data. If the OS requests a write to logical LBA 1, that doesn't mean the write will go to physical (i.e. NAND) LBA 1. SSDs have a NAND mapping table (also called indirection table) and its purpose is to keep logical and physical LBAs in sync. For example, the write to logical LBA 1 may be mapped to physical LBA 2.

Most SSDs do 1:1 mapping, which means every single page is tracked, even if it's empty. It requires more DRAM than more efficient designs but it's fast and simple (and yes, the whole table is usually cached to DRAM, not just parts of it).

There's absolutely no use in defragmenting an SSD, because only the logical LBAs will be defragmented; there may not be any change in the physical LBAs, as the data is fragmented anyway (and has to be, for performance purposes).
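As a minimal sketch of what such a flat page-level table looks like (the 4KiB granularity and the naive allocation below are assumptions for illustration, not any vendor's actual firmware):

```python
# Minimal sketch of a flat, page-level (1:1) indirection table. There is no
# garbage collection or wear leveling here; this only shows the lookup side.
PAGE = 4 * 1024

class FlatFTL:
    def __init__(self, capacity_bytes: int):
        self.l2p = [-1] * (capacity_bytes // PAGE)   # -1 = unmapped (e.g. TRIMmed)
        self.next_phys = 0

    def write(self, lba: int):
        # Every write lands on a fresh physical page; only the table entry changes.
        self.l2p[lba] = self.next_phys
        self.next_phys += 1

    def read(self, lba: int) -> int:
        # O(1) array lookup, no matter how scattered the logical LBAs are.
        return self.l2p[lba]

ftl = FlatFTL(1024**3)                     # a 1GiB toy drive
for lba in (0, 7, 200_000):                # scattered logical pages...
    ftl.write(lba)
print([ftl.read(x) for x in (0, 7, 200_000)])   # ...all constant-time lookups
```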
 
Last edited:

ashetos

Senior member
Jul 23, 2013
254
14
76
Logical LBAs have nothing to do with the physical location of the data. If the OS requests a write to logical LBA 1, that doesn't mean the write will go to physical (i.e. NAND) LBA 1. SSDs have a NAND mapping table (also called indirection table) and its purpose is to keep logical and physical LBAs in sync. For example, the write to logical LBA 1 may be mapped to physical LBA 2.

Most SSDs do 1:1 mapping, which means every single page is tracked, even if it's empty. It requires more DRAM than more efficient designs but it's fast and simple (and yes, the whole table is usually cached to DRAM, not just parts of it).

There's absolutely no use in defragmenting an SSD, because only the logical LBAs will be defragmented; there may not be any change in the physical LBAs, as the data is fragmented anyway (and has to be, for performance purposes).

I understand that logical LBAs have nothing to do with the physical location of the data; that was not my point, though. My point was that a sequentially allocated logical address range can be mapped to a sequentially allocated physical address range with a minimal amount of mapping meta-data.

As far as 1:1 mapping goes, that would indeed make things simpler, but you need a huge amount of SSD DRAM, and I really can't take your word that most SSDs use 1:1 mapping. The only one I'm aware of is the Intel enterprise model, which is very expensive.
 

Hellhammer

AnandTech Emeritus
Apr 25, 2011
701
4
81
As far as 1:1 mapping goes, that would indeed make things simpler, but you need a huge amount of SSD DRAM, and I really can't take your word that most SSDs use 1:1 mapping. The only one I'm aware of is the Intel enterprise model, which is very expensive.

Intel is the only one that has publicly said they do 1:1 mapping. However, it's not too hard to recognize SSDs that use 1:1 mapping because it needs a ton of DRAM and the amount needs to scale up with the NAND capacity. Usually it's 1MB of DRAM per 1GB of NAND or more.

With a different mapping scheme you can get by with 1MB of DRAM per 10GB of NAND (e.g. Intel X-25M) or even less (SandForce stores the table in the controller's SRAM).
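As a rough sanity check of those figures, assuming a flat table with one 4-byte entry per 4KiB page (the entry width is an assumption):

```python
# DRAM needed for a flat page-level table on a 256GB-class drive, assuming
# one 4-byte entry per 4KiB page (entry width is an assumption).
GiB = 1024**3
nand = 256 * GiB

table = (nand // 4096) * 4
print(table / 1024**2, "MiB")                         # 256.0 MiB, ~1MB per 1GB of NAND

# A coarser mapping, e.g. one entry per 2MiB extent, is orders of magnitude smaller:
print((nand // (2 * 1024**2)) * 4 / 1024**2, "MiB")   # 0.5 MiB
```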

Even with other mapping schemes you don't need defragmenting because the controller will do it on its own.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
FS fragmentation can still slow things down, so don't fill it up to 90% and expect great results over time. But the NAND isn't the issue there. Handling many fragments takes more CPU/RAM time, and more requests over the SATA interface; fewer requests generally means faster access. As far as the LBAs go, you should leave free space (get a bigger SSD than is required to store your data). Then larger writes will be simpler, mostly-sequential writes.
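A rough illustration of the request-count part of that (the extent sizes and the 512KiB per-request cap below are made-up numbers):

```python
# The same 64MiB of data costs far more host requests when the filesystem
# stores it in many small fragments. All sizes here are made-up assumptions.
MiB = 1024 * 1024
MAX_IO = 512 * 1024                        # assumed max size of one read request

def requests_needed(extents):
    """One request per extent, split further if an extent exceeds MAX_IO."""
    return sum(-(-length // MAX_IO) for length in extents)   # ceiling division

contiguous = [64 * MiB]                    # a single 64MiB extent
fragmented = [64 * 1024] * 1024            # the same data in 1024 x 64KiB pieces

print(requests_needed(contiguous))         # 128 requests
print(requests_needed(fragmented))         # 1024 requests
```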

The SSD's NAND is going to get fragmented anyway, and its degree of fragmentation is handled by the drive itself.

Unfortunately, as far as the FS goes, there aren't any minimal-write defragmentation options right now, to my knowledge. For instance, if badly-fragmented files could be copied but others left alone, and the drive were not used as its own temp space, defragging wouldn't be so bad (with multi-GB in-RAM buffers, it ought to be doable). As it is, you're gaining very little by doing it, yet it could wear your SSD out by an amount that might very well take you years of regular use. On the bright side, random access is typically pretty good, and most real-world access is either fairly random or fairly low-bandwidth.
 

postmortemIA

Diamond Member
Jul 11, 2006
7,721
40
91
I think the controller's top priority is wear leveling in order to improve longevity: ensuring that the same blocks are not written over and over during write operations, so you'd get a very fragmented drive as a result.
 

ashetos

Senior member
Jul 23, 2013
254
14
76
I think the performance difference between sequential writes and random writes at the same request size shows that FS fragmentation has a performance impact.

Now, it's true that internal SSD fragmentation will occur anyway due to wear-leveling. But there is potential for cells in the same wear-leveling group to be mapped to the same LBA group and thus achieve sequential-class performance instead of random-class performance.
 

hot120

Member
Sep 11, 2009
43
1
71
You can't look at fragmentation of an SSD the same way you look at fragmentation of a HDD. I think that is where you are getting lost.
 

DrPizza

Administrator Elite Member Goat Whisperer
Mar 5, 2001
49,601
166
111
www.slatebrookfarm.com
I'm far from an SSD expert, but let me take a stab at this:
An analogy:
Suppose you have a warehouse with 16 rows. You're arguing that if a particular product consisted of 32 boxes, it would be more efficient to put all 32 boxes in the same row.
But, in the SSD, it is more efficient to split up those boxes, because there are 16 guys operating forklifts, and each forklift goes down one of the rows. Thus, to retrieve your entire product, you would have one forklift go back and forth 32 times. But, splitting those packages up, the other 15 forklifts could be retrieving boxes simultaneously - so each forklift makes two trips at the same time every other forklift is making its two trips. - I hope this analogy is correct (SSD experts, feel free to let me know if I'm way off with this analogy), and if it's correct, it should help you understand why what you're proposing isn't more efficient.


Oh, fragmentation on an HDD : there's one forklift.
 

ashetos

Senior member
Jul 23, 2013
254
14
76
I'm far from an SSD expert, but let me take a stab at this:
An analogy:
Suppose you have a warehouse with 16 rows. You're arguing that if a particular product consisted of 32 boxes, it would be more efficient to put all 32 boxes in the same row.
But, in the SSD, it is more efficient to split up those boxes, because there are 16 guys operating forklifts, and each forklift goes down one of the rows. Thus, to retrieve your entire product, you would have one forklift go back and forth 32 times. But, splitting those packages up, the other 15 forklifts could be retrieving boxes simultaneously - so each forklift makes two trips at the same time every other forklift is making its two trips. - I hope this analogy is correct (SSD experts, feel free to let me know if I'm way off with this analogy), and if it's correct, it should help you understand why what you're proposing isn't more efficient.


Oh, fragmentation on an HDD : there's one forklift.

No, I'm not suggesting that. I'm aware of NAND parallelism, blocks, pages, planes and what not. So, in effect, I'm talking about, say, 512K boxes, and asking whether different 16-box combinations perform differently.
 

ashetos

Senior member
Jul 23, 2013
254
14
76
For additional clarification: data interleaving across NAND channels can be algorithmic and thus stateless, which makes it orthogonal to this discussion.
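A minimal sketch of what I mean by stateless interleaving (the stripe size and channel count are arbitrary):

```python
# Stateless ("algorithmic") channel interleaving: the target channel is a pure
# function of the address, so no per-block mapping state is needed for it.
NUM_CHANNELS = 8              # arbitrary
STRIPE = 16 * 1024            # bytes sent to one channel before moving to the next

def channel_for(offset_bytes: int) -> int:
    return (offset_bytes // STRIPE) % NUM_CHANNELS

print([channel_for(i * STRIPE) for i in range(10)])   # 0..7, then wraps to 0, 1
```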
 

Emulex

Diamond Member
Jan 28, 2001
9,759
1
71
It is not just unused disk space that contributes to wear leveling, if a block gets worn quickly and it can swap in a block that doesn't appear to be changing, that is a logical "preservation" move.

the entire ssd is used for wear leveling, blocks that do not change are just as good as overprovision blocks (maybe better) since a huge portion of your drive is o/s static content.

that make sense? defragmenting always has advantages (recovery of 1 contiguous block is far easier than 6 million blocks), but i'd suggest you ask the folks that make your controller what their design thoughts were.

trim is not the #1 thought when they designed the ssd. it is unreliable and until sata 3.1 can't even be tagged - so i'm guessing controller is cool with hourly/daily swipes of trim when not busy more so than firing them off without NCQ during a busy activity time.
 

ashetos

Senior member
Jul 23, 2013
254
14
76
It is not just unused disk space that contributes to wear leveling, if a block gets worn quickly and it can swap in a block that doesn't appear to be changing, that is a logical "preservation" move.

the entire ssd is used for wear leveling, blocks that do not change are just as good as overprovision blocks (maybe better) since a huge portion of your drive is o/s static content.

that make sense? defragmenting always has advantages (recovery of 1 contiguous block is far easier than 6 million blocks), but i'd suggest you ask the folks that make your controller what their design thoughts were.

trim is not the #1 thought when they designed the ssd. it is unreliable and until sata 3.1 can't even be tagged - so i'm guessing controller is cool with hourly/daily swipes of trim when not busy more so than firing them off without NCQ during a busy activity time.

That makes perfect sense. You are also probably right that TRIM is not the priority of the firmware implementation.

I am very interested in finding out how much degradation FS fragmentation causes. My guesses for the root cause are harder garbage collection, and bigger look-up data structures.

I have 2 reasons to believe wear leveling across static and dynamic data does not make fragmentation irrelevant:
1) Random write performance would be identical to sequential write performance which is not the case
2) Wear leveling algorithms group NAND cells together, in classes, because finer granularity would be too much overhead. Thus, large ranges of LBAs and NAND cells can be associated with minimal meta-data.
 

Emulex

Diamond Member
Jan 28, 2001
9,759
1
71
the LBA to flash location is what intel and samsung use the ram for primarily. It is why an ssd without a tantalum/supercap can really get screwed up since you have a table with LBA to flash in ram that has to be written to ssd.

so assuming it is a 1:1 map (why?), fragmentation would not matter; if it is a b-tree then the more active mapping could require more work and ram (older ssd)
 

hot120

Member
Sep 11, 2009
43
1
71
Come on, I'm not looking at it the same way!

Yes, you are. Access times for an SSD are the same (all electronic), regardless of the location of the data (physical NAND). For a HDD, that is not the case. The read/write head would have to access different PHYSICAL locations on the platter (mechanical arm, spinning platter). That is why defragmentation on an SSD is pointless. Can you understand that?
 

ashetos

Senior member
Jul 23, 2013
254
14
76
the LBA to flash location is what intel and samsung use the ram for primarily. It is why an ssd without a tantalum/supercap can really get screwed up since you have a table with LBA to flash in ram that has to be written to ssd.

so assuming it is a 1:1 map (why?), fragmentation would not matter; if it is a b-tree then the more active mapping could require more work and ram (older ssd)

Yes, if the manufacturer does use a super capacitor, they get the luxury of keeping the SSD RAM contents without updating the flash until the very last moment (power failure).

With a 1:1 mapping it is almost mandatory to have something like a super capacitor, because otherwise you would need to flush the metadata after each individual re-map (every 4k, ouch).

The b-tree is actually more difficult to implement, especially if you put some effort into storing extents instead of pages. That means a range of sequential pages is treated as a single tree node instead of one tree node per page. Data structures with extents could be responsible for the disparity between sequential and random write performance.

I also wonder if vendors use something other than a b-tree - who knows, hash tables or something more complicated.
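Something along these lines is what I have in mind for the extent idea (a sorted list stands in for a real b-tree here; purely a sketch, no vendor's actual data structure implied):

```python
# Extent-based mapping sketch: one node covers a whole sequential run of pages.
# A sorted list + bisect stands in for a real b-tree; illustration only.
import bisect

class ExtentMap:
    def __init__(self):
        self.starts = []      # sorted logical start LBAs
        self.extents = []     # parallel list of (length_in_pages, physical_start)

    def insert(self, logical_start, length, physical_start):
        i = bisect.bisect_left(self.starts, logical_start)
        self.starts.insert(i, logical_start)
        self.extents.insert(i, (length, physical_start))

    def lookup(self, lba):
        # Find the extent covering `lba`, if any, and translate within it.
        i = bisect.bisect_right(self.starts, lba) - 1
        if i >= 0:
            length, phys = self.extents[i]
            if lba < self.starts[i] + length:
                return phys + (lba - self.starts[i])
        return None           # unmapped

m = ExtentMap()
m.insert(0, 10_000, 50_000)   # a single node for 10,000 sequential pages
print(m.lookup(1234))         # 51234
print(len(m.starts))          # 1 node instead of 10,000 per-page entries
```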
 

ashetos

Senior member
Jul 23, 2013
254
14
76
Yes, you are. Access times for an SSD are the same (all electronic), regardless of the location of the data (physical NAND). For a HDD, that is not the case. The read/write head would have to access different PHYSICAL locations on the platter (mechanical arm, spinning platter). That is why defragmentation on an SSD is pointless. Can you understand that?

I don't like your tone. Can I understand that? You can't even follow the discussion.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Yes, if the manufacturer does use a super capacitor, they get the luxury of keeping the SSD RAM contents without updating the flash until the very last moment (power failure).

With a 1:1 mapping it is almost mandatory to have something like a super capacitor, because otherwise you would need to flush the metadata after each individual re-map (every 4k, ouch).
You need something like that anyway. The SSD needs some way to be sure that when the power goes out, any currently-pending writes to NAND can complete successfully (if they haven't started hitting the NAND, they can be ignored). Just making the mapping data smaller doesn't remove the risk. It needs to be able to detect the falling voltage, act so as not to leave its state corrupted, then go down with the system.
 

ashetos

Senior member
Jul 23, 2013
254
14
76
You need something like that anyway. The SSD needs some way to be sure that when the power goes out, any currently-pending writes to NAND can complete successfully (if they haven't started hitting the NAND, they can be ignored). Just making the mapping data smaller doesn't remove the risk. It needs to be able to detect the falling voltage, act so as not to leave its state corrupted, then go down with the system.

Of course, you are right. All SSDs should have something like a super capacitor. I am pretty sure though that most don't, and this explains the discussions about data corruption after a power outage with certain SSDs.

You can keep the SSD state consistent with synchronous flash writes, for the cheap models, but at the expense of performance. It is possible though, and you can achieve relatively high performance if you tolerate torn writes, which don't break block device semantics, and don't corrupt file-systems.
 