SSD defragmentation?


corkyg

Elite Member | Peripherals
Super Moderator
Mar 4, 2000
27,370
239
106
In this context, it should mean Bulk Only Transport. In tweet land it means Back On Topic.
 

ashetos

Senior member
Jul 23, 2013
254
14
76
So, to prove I'm not talking out of my ass, I did some experiments.

Firstly, I created an 8GiB file on an ext4 filesystem on my OCZ Vertex 4 128GB (30% free space):
dd if=/dev/zero of=ssdfile.bin oflag=direct bs=1M count=8192

Then I used the fio benchmark to compare sequential and random writes. The configuration of fio is:

[global]
filesize=8192MB
norandommap
bs=4096b
ioengine=sync
thread
direct=1
group_reporting
invalidate=0

[benchmark]
numjobs=1
rw=randwrite
filename=ssdfile.bin
runtime=30
time_based


By alternating rw between randwrite and write I conducted sequential and random write experiments.
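(Something along these lines would reproduce the alternation; benchmark.fio is just a placeholder name for the job file above, and the sed line simply flips rw= before each run.)

for mode in write randwrite write randwrite write randwrite; do
    sed -i "s/^rw=.*/rw=${mode}/" benchmark.fio   # flip the access pattern in the job file
    fio benchmark.fio | grep 'WRITE:'             # keep just the summary bandwidth line
done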

Let me describe some key observations about the experiment:
1) it uses direct I/O, so it bypasses all file-system buffers and goes directly into the block layer of Linux
2) I use the noop I/O scheduler so that no funny optimizations happen at the I/O scheduler level (see the sketch right after this list)
3) it uses 1 thread, meaning there is no opportunity for the system software to merge writes in any layer (file-system, block, SATA)
4) all I/O requests are 4K, which guarantees that both the sequential and random experiments generate an equal number of SATA commands per GB of I/O traffic
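Setting the scheduler is a sysfs one-liner; a sketch, assuming the drive is sda as in the iostat output further down:

cat /sys/block/sda/queue/scheduler           # shows the available schedulers, current one in brackets
echo noop > /sys/block/sda/queue/scheduler   # select noop (run as root)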

I did six 30-second runs, alternating sequential and random writes in this order:
write, randwrite, write, randwrite, write, randwrite

The results are:
===============================================================================
write
WRITE: io=4529.2MB, aggrb=154591KB/s, minb=154591KB/s, maxb=154591KB/s, mint=30001msec, maxt=30001msec

===============================================================================
randwrite
WRITE: io=3890.9MB, aggrb=132801KB/s, minb=132801KB/s, maxb=132801KB/s, mint=30001msec, maxt=30001msec

===============================================================================
write
WRITE: io=4476.8MB, aggrb=152800KB/s, minb=152800KB/s, maxb=152800KB/s, mint=30001msec, maxt=30001msec

===============================================================================
randwrite
WRITE: io=4217.7MB, aggrb=143958KB/s, minb=143958KB/s, maxb=143958KB/s, mint=30001msec, maxt=30001msec

===============================================================================
write
WRITE: io=4532.9MB, aggrb=154714KB/s, minb=154714KB/s, maxb=154714KB/s, mint=30001msec, maxt=30001msec

===============================================================================
randwrite
WRITE: io=4229.3MB, aggrb=144351KB/s, minb=144351KB/s, maxb=144351KB/s, mint=30001msec, maxt=30001msec


The bandwidth that matters is aggrb. To summarize, sequential writes are consistently faster, by roughly 10-20MB/s, even though the sequential and random workloads are alternated in a round-robin manner.

Via iostat I can add the observation that for sequential workloads the throughput of the device was rock steady at 150MB/s, something like this:

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 2.00 0.00 39032.00 0.00 152.48 8.00 0.69 0.02 0.00 0.02 0.02 69.00

For the random workloads the performance would fluctuate between 150MB/s and 120MB/s.
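A snapshot like the one above comes from iostat's extended statistics; something along these lines (from the sysstat package, one-second interval, megabytes) reproduces it:

iostat -x -m 1    # extended per-device stats every second; look at the sda row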


Now to conclude: there is absolutely no reason for the random workload to be slower than the sequential one. This is a sandboxed experiment, where all parameters are kept in check. The only actual difference between the two workloads is the file-system LBAs. In one case they are sequential, in the other they are random.
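A sanity check I did not paste here, but which anyone can run: filefrag from e2fsprogs prints the extent layout of the test file, so you can confirm that the preallocated file sits in a few large contiguous extents and that file offsets map (almost) linearly onto device LBAs.

filefrag -v ssdfile.bin    # lists the file's extents with their physical (device) block ranges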

Now, I'm not in the mood to kill my SSD (OCZ SSDs are not famous for their lifetime), so I will not do any more write-intensive experiments. Ideally I could do raw device experiments (via a partition) and add other parameters to the mix, but that's it for now.

This experiment DOES NOT use TRIM, which in my opinion would increase the difference in performance between a fragmented file-system and a defragmented one.

This experiment DOES NOT use large I/O requests, which will be much more frequent in a defragmented file-system than in a fragmented one, and would again increase the difference in performance.
 

Puffnstuff

Lifer
Mar 9, 2005
16,149
4,848
136
Back on topic, because this thread has strayed away from the OP's question.
 
Last edited:

Hero1711

Senior member
Aug 25, 2013
225
0
0
Defragmentation was meant to be used on HDDs. Why? Because when data is fragmented, it takes longer for the read head to collect all the data (longer seek time), which brings down the overall performance. Data on an SSD can be accessed directly without seeking (because there are no mechanical parts). So there is no problem when data is fragmented on an SSD.
 

Lorne

Senior member
Feb 5, 2001
873
1
76
Defrag is also used on HDDs in a lot of cases to bring the data from the inner sectors to the outer sectors, not just for access speed but for throughput (the outer tracks move faster past the head at the same spindle rpm).

For the SSD question:
Am I getting this right: he is thinking that an 8GB file should take the same time to read/write as eight 1GB files?
Not accounting for the extra time needed for FS/BAM updating or access?
 

corkyg

Elite Member | Peripherals
Super Moderator
Mar 4, 2000
27,370
239
106
Defrag is also used on HDDs in a lot of cases to bring the data from the inner sectors to the outer sectors, not just for access speed but for throughput (the outer tracks move faster past the head at the same spindle rpm).

Technically, that is not defrag; it is optimization. That is another function of defrag programs, usually commanded separately.
 

ashetos

Senior member
Jul 23, 2013
254
14
76
For the SSD question:
Am I getting this right: he is thinking that an 8GB file should take the same time to read/write as eight 1GB files?
Not accounting for the extra time needed for FS/BAM updating or access?

No; overwriting an 8GiB file in place should have the same throughput whether you write sequentially or randomly.

The file-system is irrelevant for two reasons in these experiments:
1) I use direct I/O so no FS buffering
2) the file is preallocated so no new FS allocator meta-data
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
20,894
3,247
126
OP, there is no disk parking... head seeking... spinning platters in an SSD.

So an SSD doesn't care if it's fragmented, as others have said.
The LBA address table is not laid out in the conventional sense the way it is on a magnetic HDD.

You're trying to say washing clothes in a washing machine is mandatory...
when we're trying to tell you no... SSDs are dry cleaning.
 

ashetos

Senior member
Jul 23, 2013
254
14
76
You guys keep telling me the same thing. I've got numbers saying the opposite and some theories behind them. Please don't keep repeating that disks have mechanical parts and SSDs don't; I'm not dense. If you don't feel like exploring this, let the thread die already...
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
20,894
3,247
126
We are trying to tell you what is factual.
You're telling us you have one subset which falls outside the norm.
Now if you had told us "I have something weird which goes outside the norm," we would probably be more diagnostic than educational.

However, you're going on about how you found that one subset outside the norm, and passing it off as wrong information.
This is why everyone jumping into this thread is like that...

Anyhow... on to the diagnostics, and an addition to my post... typically when that applies you look at the unknowns.

Did you check the alignment condition of your SSDs when you ran your I/O tests?

Misaligned SSDs have shown performance that is not standard.

Maybe it's an alignment issue, with a bad alignment instead of 1024K, where your diagnostics should lie.
http://www.overclock.net/t/1226963/lightbox/post/16672746/id/775374


If you were already aligned at 1024K, then what SATA ports are you using? SATA2 or SATA3?
Firmware version on the Vertex, and BIOS version for the ICH10 or ICH11 you're using?

If you want to get diagnostic, we can go diagnostic...
 
Last edited:

Lorne

Senior member
Feb 5, 2001
873
1
76
Sorry for the tangent question.
"1) I use direct I/O so no FS buffering
2) the file is preallocated so no new FS allocator meta-data"
Ashetos, I have not looked into how this works as of yet, so bear with me... Where is the directory kept then?
 

ashetos

Senior member
Jul 23, 2013
254
14
76
We are trying to tell you what is factual.
You're telling us you have one subset which falls outside the norm.
Now if you had told us "I have something weird which goes outside the norm," we would probably be more diagnostic than educational.

However, you're going on about how you found that one subset outside the norm, and passing it off as wrong information.
This is why everyone jumping into this thread is like that...

Anyhow... on to the diagnostics, and an addition to my post... typically when that applies you look at the unknowns.

Did you check the alignment condition of your SSDs when you ran your I/O tests?

Misaligned SSDs have shown performance that is not standard.

Maybe it's an alignment issue, with a bad alignment instead of 1024K, where your diagnostics should lie.
http://www.overclock.net/t/1226963/lightbox/post/16672746/id/775374


If you were already aligned at 1024K, then what SATA ports are you using? SATA2 or SATA3?
Firmware version on the Vertex, and BIOS version for the ICH10 or ICH11 you're using?

If you want to get diagnostic, we can go diagnostic...

Thanks for the questions.
1. My partitions are indeed aligned to 2048 Linux sectors, which means 1024K (see the sketch after this list for a quick way to verify). However, it is not relevant for my benchmark, because the benchmark takes place inside a file on an ext4 file-system. The alignment of the actual I/O requests that are sent is identical for both the sequential and the random workload and should produce identical results. Moreover, ext4 is famous for its delayed allocation optimizations, which perfectly align newly created files.

2. My port is Intel SATA3, on an updated Intel DH67CL board.

3. The Vertex 4 firmware is the latest, 1.5. The drive is 70% populated and has 30% trimmed space.
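A quick way to verify the alignment claim (hypothetical commands, not output from my run): the partition start sectors should be multiples of 2048, since 2048 * 512B = 1MiB, and the kernel also exposes any reported misalignment.

fdisk -l -u /dev/sda | grep '^/dev/sda'    # start sectors of the partitions; multiples of 2048 = 1MiB aligned
cat /sys/block/sda/alignment_offset        # 0 means the kernel sees no alignment offset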

Sorry for the tangent question.
"1) I use direct I/O so no FS buffering
2) the file is preallocated so no new FS allocator meta-data"
Ashetos, I have not looked into how this works as of yet, so bear with me... Where is the directory kept then?

This is a great question; it shows depth of understanding.

Let me analyze: direct I/O, for all intents and purposes, only applies to read and write system calls. That is, data requests only.

Now, there are tons of file-system meta-data that can be involved in accessing a file, such as the inode that points to the file, the dentry that describes the directory entries, access control and timestamps, space allocation meta-data, and stuff that generally ends up in a file-system journal for recovery purposes.

All the look-ups, the directories of the path, the extent allocation meta-data, and FS meta-data in general are buffered by the Linux VFS layer caches and by ext4 itself.

That is, for the whole duration of these benchmarks, any and all file-system meta-data have been in memory/CPU cache since file creation, and we are talking about nanosecond latencies here.

The proof of this is in the iostat snapshot I already gave: notice that the average request size (avgrq-sz) for the raw device /dev/sda (the Vertex 4 SSD) is 8.

That means 8 Linux sectors, which is 4096 bytes. In other words, all the raw device I/O that ended up being sent to the actual SSD is exactly the same size as the benchmark requests: not 8.01, not 7.99, but 8.
 

Lorne

Senior member
Feb 5, 2001
873
1
76
I see now; I was getting confused because your first post mentioned NTFS, which is what I went with, and all the posts in the middle looped me.

On your test above with the random/sequential writes, what generates the numbers for the random write locations? Is this a number generator you implement, or the drive itself?
Is this random write fully random per LBA, or just a random start point for allocation?

Assuming the first, the possible time build-up = read cache, randomly find start point, write, update cache (the write and the cache update could be reversed, as I don't know the usual order), generate next random LBA location, write, repeat until all 8GB are saved. (The drive would also have to update per random write to know what not to re-use in its next random location choice, instead of choosing one whole free span as for a sequential write.)

All this could add up to a fraction of a second = RTA (Random-gen Time Accumulation); RTA / MBps + random-write aggrb should = sequential-write aggrb.
With a big file the RTA would be exponential, and the theory could be tested.

Am I totally out of the park on this one?
 

hot120

Member
Sep 11, 2009
43
1
71
So, the OP is saying a sequential read/write should have the same time as a random read/write? I think the OP is smoking weed!

Please don't thread crap
-ViRGE
 
Last edited by a moderator:

ashetos

Senior member
Jul 23, 2013
254
14
76
So, the OP is saying a sequential read/write should have the same time as a random read/write? I think the OP is smoking weed!

Yes, you are. Access times for an SSD are the same (all electronic), regardless of the location of the data (physical NAND). For an HDD, that is not the case. The read/write head would have to access different PHYSICAL locations on the platter (mechanical arm, spinning platter). That is why defragmentation on an SSD is pointless. Can you understand that?

You do understand you're contradicting yourself, right?
 

ashetos

Senior member
Jul 23, 2013
254
14
76
I see now; I was getting confused because your first post mentioned NTFS, which is what I went with, and all the posts in the middle looped me.

On your test above with the random/sequential writes, what generates the numbers for the random write locations? Is this a number generator you implement, or the drive itself?
Is this random write fully random per LBA, or just a random start point for allocation?

Assuming the first, the possible time build-up = read cache, randomly find start point, write, update cache (the write and the cache update could be reversed, as I don't know the usual order), generate next random LBA location, write, repeat until all 8GB are saved. (The drive would also have to update per random write to know what not to re-use in its next random location choice, instead of choosing one whole free span as for a sequential write.)

All this could add up to a fraction of a second = RTA (Random-gen Time Accumulation); RTA / MBps + random-write aggrb should = sequential-write aggrb.
With a big file the RTA would be exponential, and the theory could be tested.

Am I totally out of the park on this one?

I will not run any additional experiments on my Vertex 4 since it's my system disk, dual-boot with limited free space, and I'm sensitive about it!

I will get my hands on a different SSD and a different system to do raw device experiments: forget files and file-systems, a simpler set-up. The range will be much larger than an 8GiB file, and I will do some really interesting read and write configurations.
 

pandemonium

Golden Member
Mar 17, 2011
1,777
76
91
Interesting hypothesis, ashetos. You're basically getting at a total refresh of the drive's data to keep TRIM from having to overwork areas that are more used than others, correct? If you wipe an SSD clean, would it actually assist TRIM in data placement, or just make more work for the drive (and you) in the long run? I'm guessing the latter, but then again, I don't know the particulars of how TRIM works.

Anyways, definitely in for the results.
 

ashetos

Senior member
Jul 23, 2013
254
14
76
I see now; I was getting confused because your first post mentioned NTFS, which is what I went with, and all the posts in the middle looped me.

On your test above with the random/sequential writes, what generates the numbers for the random write locations? Is this a number generator you implement, or the drive itself?
Is this random write fully random per LBA, or just a random start point for allocation?

Assuming the first, the possible time build-up = read cache, randomly find start point, write, update cache (the write and the cache update could be reversed, as I don't know the usual order), generate next random LBA location, write, repeat until all 8GB are saved. (The drive would also have to update per random write to know what not to re-use in its next random location choice, instead of choosing one whole free span as for a sequential write.)

All this could add up to a fraction of a second = RTA (Random-gen Time Accumulation); RTA / MBps + random-write aggrb should = sequential-write aggrb.
With a big file the RTA would be exponential, and the theory could be tested.

Am I totally out of the park on this one?

Some further clarifications to your questions:

Since the aforementioned experiments were time-based, they did not populate the entire file, meaning less than 8GiB was written.

The random LBAs were truly random, meaning there was a probability that a block might be written twice. If this happens often, it can be considered cheating, since a sequential pattern only writes each block once. As we know, SSDs do not handle overwrites very well, so to be fair, in my next experiments I will utilize a full map of blocks so that blocks are chosen randomly but each is written only once.
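A minimal sketch of that follow-up, under the assumption that dropping norandommap and the time limit is enough to make fio track covered offsets and touch every 4KiB block exactly once (randmap.fio is a hypothetical file name):

cat > randmap.fio <<'EOF'
[global]
filesize=8192MB
bs=4096b
ioengine=sync
thread
direct=1
group_reporting
invalidate=0

[benchmark]
numjobs=1
rw=randwrite
filename=ssdfile.bin
EOF
fio randmap.fio    # no norandommap and no time_based: fio's random map avoids hitting any block twice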

The CPU overheads of both true randomness and map-assisted randomness are negligible when I look at CPU utilization, and I do not believe they can affect the numbers, especially at throughput that is well below memory speed.
 

ashetos

Senior member
Jul 23, 2013
254
14
76
So, I gained access to a Xeon server with one SSD, the 32GB SLC Intel X25-E drive. Unfortunately, it doesn't support TRIM, so experiments with TRIM are out of the question. This is an old and tortured device, so expect internal fragmentation due to wear leveling to be significant.

If anyone is interested in running experiments with TRIM without a file-system, they can use the hdparm utility, which can send TRIM commands to specific LBAs of the raw device on demand.
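For reference, and with a made-up LBA: hdparm's --trim-sector-ranges option issues a TRIM for explicit sector ranges on the raw device. It destroys whatever is stored in those sectors, hence the extra safety flag it requires.

hdparm --please-destroy-my-drive --trim-sector-ranges 40960:8 /dev/sdb    # TRIM 8 sectors (one 4KiB block) starting at LBA 40960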

File-systems are too complicated, and it is difficult to create an access pattern with perfect precision. This is the reason the following experiments target the block device directly: no file-systems and no partitions. The lack of partitions guarantees perfect alignment starting from LBA 0.

I use the fio benchmark for the first series of experiments.
I use 4KiB writes, queue depth 1. No system software buffering, no file-systems, no partitions, direct device access. I use the whole 32GB address space. All the device blocks are written exactly once. I compare sequential with pseudo-random access throughput. The fio configuration is:

[global]
bs=4096b
thread
direct=1
ioengine=sync

[benchmark]
numjobs=1
rw=randwrite
filename=/dev/sdb

I alternate rw between write and randwrite four times in a round-robin fashion. The results are:
[sequential]
MB written:30518
throughput:38975KB/s
[random]
MB written:30518
throughput:12658KB/s
[sequential]
MB written:30518
throughput:37958KB/s
[random]
MB written:30518
throughput:12678KB/s

We can conclude that, even though all blocks are written exactly once, it is much harder for the SSD to perform when the pattern is not sequential. The only difference between these two workloads is the LBA order of the requests.

==============================================

The second series of experiments is simple sequential reads. The command is:
dd if=/dev/sdb iflag=direct bs=1M of=/dev/null

This is a sequential scan of the device with huge 1MiB requests. This is the best possible pattern for any block device.

First result, a sequential scan half an hour after a sequential write workload: 244 MB/s

Second result, a sequential scan half an hour after a random write workload: 209 MB/s

This means that sequential read performance is affected by LBA fragmentation due to past writes. This performance degradation is permanent, no matter how many times you repeat the sequential reads; read requests are not expected to change device state.
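The "permanent" part is easy to check: repeat the same scan a few times and watch the number stay put. A trivial sketch:

for i in 1 2 3; do
    dd if=/dev/sdb iflag=direct bs=1M of=/dev/null 2>&1 | tail -n 1    # last line is dd's throughput summary
done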


So, let's see the big picture now. It is pretty clear that internal SSD fragmentation due to wear-leveling does not negate the impact of external LBA fragmentation.

LBA fragmentation impacts not only dynamic data generation (write) performance, but also static data retrieval (read) performance.

This means that a fragmented file-system will be measurably slower than a defragmented one.

Factors not taken into account:
- TRIM: TRIM divides SSD space into allocated and de-allocated. File-system fragmentation leads to fragmentation of the free SSD space, which is expected to make the performance difference between a fragmented and a defragmented file-system even larger.
- I/O request properties: a fragmented file-system is expected to produce more, and smaller, I/O requests. This again is expected to make the performance difference between a fragmented and a defragmented file-system even larger.

Cool huh?
 

Lorne

Senior member
Feb 5, 2001
873
1
76
Hot120:
No.
He is wondering why there is a difference (using direct I/O) in write and read speeds between sequential and random locations within one file.
I.e., all tests use one 8GB file: writing it sequentially and then writing it randomly shows a speed/latency difference which seems a little excessive for direct I/O.

The real problem he is pointing out is that the read-back speeds after sequential and after fragmented (random) writes differ by up to 30MBps, when there shouldn't be any difference at all for an SSD. This is something that websites like Anandtech have not tested for (well, that I have ever seen).

He has tested this theory with two different drives now; both show the same pattern.

Ashetos:
Larger difference on the older, smaller drive: firmware and a lower NAND bank count?

Hmmmm, I would assume this is happening at the firmware level of the drive. I can't see this happening at the mechanical level of the drive, but there could be some latency in NAND bank selection that the firmware cannot handle well under random access, and anything affecting latency in reads can or should only be firmware.

Could you do this on a RAM drive, assuming you could make one big enough to make a difference in the results?

And yes, cool.
 

ashetos

Senior member
Jul 23, 2013
254
14
76
Hot120:
No.
He is wondering why there is a difference (using direct I/O) in write and read speeds between sequential and random locations within one file.
I.e., all tests use one 8GB file: writing it sequentially and then writing it randomly shows a speed/latency difference which seems a little excessive for direct I/O.

The real problem he is pointing out is that the read-back speeds after sequential and after fragmented (random) writes differ by up to 30MBps, when there shouldn't be any difference at all for an SSD. This is something that websites like Anandtech have not tested for (well, that I have ever seen).

He has tested this theory with two different drives now; both show the same pattern.

Ashetos:
Larger difference on the older, smaller drive: firmware and a lower NAND bank count?

Hmmmm, I would assume this is happening at the firmware level of the drive. I can't see this happening at the mechanical level of the drive, but there could be some latency in NAND bank selection that the firmware cannot handle well under random access, and anything affecting latency in reads can or should only be firmware.

Could you do this on a RAM drive, assuming you could make one big enough to make a difference in the results?

And yes, cool.

I don't have access to the machine anymore, so I can't give you the exact specs of the model and its firmware.

I have done this on a ramdisk in the past; it is completely different. For instance, sequential reads always perform the same. With RAM, identical patterns always perform identically.

In RAM, what matters is the memory controller; random patterns are slower than sequential ones because the memory prefetchers make a difference. I remember numbers on the order of 20GB/s.
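If someone wants to try the RAM-drive comparison themselves, a rough sketch (assuming the brd module is available; rd_size is in KiB, so 8388608 KiB = 8GiB):

modprobe brd rd_nr=1 rd_size=8388608      # creates /dev/ram0, an 8GiB RAM-backed block device
fio --name=ramtest --filename=/dev/ram0 --rw=randwrite --bs=4096b \
    --ioengine=sync --direct=1 --thread --numjobs=1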

Edit:
Also, the big difference between the two experiments is that with the OCZ SSD only 8 of 128GB were written, whereas with the Intel SSD the whole device was written. Typically, SSD reviews on websites don't write the whole device and don't stress the SSD too much.
 
Last edited:

Lorne

Senior member
Feb 5, 2001
873
1
76
Good find.
Still, in both cases the problem is visible and has an effect on drive performance.
Maybe someone here at Anand will take up the quest to pinpoint this issue and bring it up with the SSD manufacturers.
 

hot120

Member
Sep 11, 2009
43
1
71
You do understand you're contradicting yourself, right?

I am talking about access time on an SSD being the same. Transfer time is different. Your post is all over the place. Once again, regardless of the location of the data on the NAND, access time is the same. Transfer time may be different due to TRIM and the work taking place within the SSD controller. Regardless, fragmentation on an SSD is not comparable to a HDD. Access time is directly affected by fragmentation. That is what you should be chasing.
 

ashetos

Senior member
Jul 23, 2013
254
14
76
look up trim

Thanks.

I am talking about access time on an SSD being the same. Transfer time is different. Your post is all over the place. Once again, regardless of the location of the data on the NAND, access time is the same. Transfer time may be different due to TRIM and the work taking place within the SSD controller. Regardless, fragmentation on an SSD is not comparable to a HDD. Access time is directly affected by fragmentation. That is what you should be chasing.

All my tests use a queue depth of 1, so everything is access time already. There is no I/O concurrency; all the numbers directly relate to latency or access time, depending on your preferred term. If my post is all over the place, your posts are plain wrong, all of them.
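For what it's worth, fio also reports per-request completion latency, and with ioengine=sync only one request is ever in flight, so that latency is the access time. A hypothetical one-off check (job name and grep pattern are only illustrative):

fio --name=latcheck --filename=ssdfile.bin --rw=randwrite --bs=4096b \
    --ioengine=sync --direct=1 --runtime=10 --time_based | grep -E 'clat|lat'    # per-request latency lines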
 