From: Tejun Heo <tj <at> kernel.org>
Subject: ATA 4 KiB sector issues.
Newsgroups: gmane.linux.ide
Date: Monday 8th March 2010 03:48:35 UTC
Hello, guys.

It looks like the transition to ATA 4k drives will be quite painful and
we aren't really ready, although these drives are already selling
widely.  I've written up a summary document on the issue to clarify
things, as it's getting more and more confusing, and to develop some
consensus.  It's also on the linux ata wiki.


I've cc'd people whom I can think of off the top of my head but I
surely have missed some people who would have been interested.  Please
feel free to add cc's or forward the message to other MLs.
Especially, I don't know much about partitioners so the details there
are pretty shallow and could be plain wrong.  It would be great if
someone who knows more about this stuff can chime in.


=== Document follows ===

ATA 4 KiB sector issues


Up until recently, all ATA hard drives have been organized in 512 byte
sectors.  For example, my 500 GB or 466 GiB hard drive is organized as
976773168 512 byte sectors numbered from 0 to 976773167.  This is how
a drive communicates with the driver.  When the operating system wants
to read 32 KiB of data at 1 MiB position, the driver asks the drive to
read 64 sectors from LBA (Logical block address, sector number) 2048.
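
As a rough illustration of the arithmetic above, the following Python
sketch converts a byte offset and length into a starting LBA and a
sector count.  The function name is made up for this example and 512
byte sectors are assumed.

```python
SECTOR_SIZE = 512  # logical sector size in bytes, as assumed in the text


def bytes_to_lba_range(offset, length, sector_size=SECTOR_SIZE):
    """Return (start_lba, nr_sectors) for a sector-aligned byte range."""
    if offset % sector_size or length % sector_size:
        raise ValueError("offset and length must be multiples of the sector size")
    return offset // sector_size, length // sector_size


# 32 KiB of data at the 1 MiB position: 64 sectors starting at LBA 2048
print(bytes_to_lba_range(1024 * 1024, 32 * 1024))  # (2048, 64)
```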

Because each sector should be addressable, readable and writable
individually, the physical medium also is organized in the same sized
sectors.  In addition to the area to store the actual data, each
sector requires extra space for book keeping - inter-sector space to
enable locating and addressing each sector and ECC data to detect and
correct inevitable raw data errors.

As the densities and capacities of hard drives keep growing, stronger
ECC becomes necessary to guarantee an acceptable level of data
integrity, increasing the space overhead.  In addition, in most
applications, hard drives are now accessed in units of at least 8
sectors or 4096 bytes and maintaining 512 byte granularity has become
somewhat meaningless.

This reached a point where enlarging the sector size to 4096 bytes
would yield measurably more usable space given the same raw data
storage size, and hard drive manufacturers are transitioning to 4 KiB
sectors.

Anandtech has a good article which illustrates the background and
issues with pretty diagrams[1].

Physical vs. Logical

Because the 512 byte sector size has been around for a very long time
and up to ATA/ATAPI-7 the sector size was fixed at 512 bytes, the
sector size assumption is scattered across all the layers -
controllers or bridge chips snooping commands, BIOSs, boot codes,
drivers, partitioners and system utilities, which makes it very
difficult to change the sector size from 512 bytes without breaking
backward compatibility massively.

As a workaround, the concept of logical sector size was introduced.
The physical medium is organized in 4 KiB sectors but the firmware on
the drive will present it as if the drive is composed of 512 byte
sectors thus making the drive behave as before, so if the driver asks
the hard drive to read 64 sectors from LBA 2048, the firmware will
translate it and read 8 4 KiB sectors from hardware sector 256.  As a
result, the hard drive now has two sector sizes - the physical one
which the physical media is actually organized in, and the logical one
which the firmware presents to the outside world.

A straightforward example mapping between physical sector and LBA
would be

  LBA = 8 * phys_sect
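
To make the translation concrete, here is a minimal Python sketch of
what the firmware conceptually does under this straight mapping.  The
function name is hypothetical; real firmware is of course not written
this way.

```python
LOGICAL_PER_PHYSICAL = 4096 // 512  # 8 logical sectors per physical sector


def phys_range(lba, nr_sectors, ratio=LOGICAL_PER_PHYSICAL):
    """Return (first_phys_sect, nr_phys_sects) touched by a logical request,
    assuming the straight mapping LBA = 8 * phys_sect."""
    first = lba // ratio
    last = (lba + nr_sectors - 1) // ratio
    return first, last - first + 1


# 64 logical sectors from LBA 2048: 8 physical sectors from sector 256
print(phys_range(2048, 64))  # (256, 8)
```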

Alignment problem on 4 KiB physical / 512 logical drives

This workaround keeps older hardware and software working while
allowing the drive to use larger sector size internally.  However, the
discrepancy between physical and logical sector sizes creates an
alignment issue.  For example, if the driver wants to read 7 sectors
from LBA 2047, the firmware has to read hardware sectors 255 and 256
and trim the leading 7*512 bytes and the trailing 2*512 bytes.

For reads, this isn't an issue as drives read in larger chunks anyway,
but for writes, the drive has to do read-modify-write to achieve the
requested action.  It has to first read hardware sectors 255 and 256,
update the requested parts and then write back those sectors, which
can cause significant performance degradation[2].
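
The following Python sketch works out which physical sectors a request
touches and how many bytes at each edge fall outside it.  A request of
7 sectors from LBA 2047 covers LBA 2047 to 2053, so 7*512 bytes at the
start of hardware sector 255 and 2*512 bytes at the end of hardware
sector 256 are not part of the request.  The function names are
invented for illustration and the straight LBA = 8 * phys_sect mapping
is assumed.

```python
def touched_phys_sectors(lba, nr_sectors, lss=512, pss=4096):
    """Describe which physical sectors a logical request touches.

    Assumes the straight mapping LBA = (pss / lss) * phys_sect.
    Returns (first_phys, last_phys, lead_bytes, trail_bytes), where
    lead_bytes and trail_bytes are the parts of the first and last
    physical sectors that lie outside the request.
    """
    ratio = pss // lss
    end = lba + nr_sectors              # first LBA past the request
    first = lba // ratio
    last = (end - 1) // ratio
    lead = (lba - first * ratio) * lss
    trail = ((last + 1) * ratio - end) * lss
    return first, last, lead, trail


def needs_read_modify_write(lba, nr_sectors, lss=512, pss=4096):
    """A write needs read-modify-write if either edge is partial."""
    _, _, lead, trail = touched_phys_sectors(lba, nr_sectors, lss, pss)
    return lead != 0 or trail != 0


print(touched_phys_sectors(2047, 7))      # (255, 256, 3584, 1024)
print(needs_read_modify_write(2047, 7))   # True
print(needs_read_modify_write(2048, 64))  # False
```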

The problem is aggravated by the way DOS partitions[3] have been laid
out traditionally.  For reasons dating back more than two decades,
they are laid out according to something called disk geometry, which
nowadays consists of arbitrary values with a number of restrictions
accumulated over the years for backward compatibility.  The end result
is that until recently (most Linux variants and up to Windows XP) the
first partition ends up on sector 63 and later ones on cylinder
boundaries, where each cylinder usually is composed of 255 * 63
sectors.

Most modern filesystems generate accesses which are 4 KiB aligned
relative to the start of the partition they are in.  If a drive maps
4 KiB physical sectors to 512 byte logical sectors from LBA 0, the
filesystem in the first partition will always be misaligned and
filesystems in later partitions are likely to be misaligned too.
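
A quick Python check of the traditional layout against the straight
mapping shows why.  This is only a sketch; the helper name is made up
and the usual 255 * 63 geometry is assumed.

```python
RATIO = 8                    # logical sectors per 4 KiB physical sector
SECTORS_PER_CYL = 255 * 63   # 16065 sectors in a traditional cylinder


def is_aligned(lba, ratio=RATIO):
    """True if lba falls on a physical sector boundary (LBA = 8 * phys_sect)."""
    return lba % ratio == 0


# The traditional start of the first partition is never aligned.
print(is_aligned(63))  # False

# 16065 is odd, so only every 8th cylinder boundary is aligned.
aligned = [c for c in range(1, 17) if is_aligned(c * SECTORS_PER_CYL)]
print(aligned)  # [8, 16]
```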

Solving the alignment problem on 4 KiB physical / 512 logical drives

There are multiple ways which attempt to solve the problem.

S-1. Yet another workaround from the firmware - offset-by-one.

  Yet another workaround which can be done by the firmware is to
  offset physical to logical mapping by one logical sector such that
  LBA 63 ends up on physical sector boundary, which aligns the first
  partition to physical sectors without requiring any software update.
  The example mapping between phys_sector and LBA becomes

    LBA = 8 * phys_sect - 1

  The leading 512 bytes of phys_sect 0 are not used and LBA 0 starts
  right after that point.  phys_sect 1 maps to LBA 7 and phys_sect 8
  to LBA 63, making LBA 63 aligned on a hardware sector boundary.

  Although this aligns only the first partition, for many use cases,
  especially the ones involving older software, this workaround was
  deemed useful and some recent drives with 4 KiB physical sectors are
  equipped with a dip switch to turn on or off offset-by-one mapping.
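
A small Python sketch of the offset-by-one mapping, with hypothetical
helper names, shows both the effect on LBA 63 and why later partitions
on cylinder boundaries are not helped by it.

```python
def phys_to_lba(phys_sect):
    """First LBA of a physical sector under offset-by-one mapping."""
    return 8 * phys_sect - 1


def lba_is_aligned(lba):
    """True if lba is the first LBA of some physical sector."""
    return (lba + 1) % 8 == 0


print(phys_to_lba(1), phys_to_lba(8))  # 7 63
print(lba_is_aligned(63))              # True, first partition is aligned
print(lba_is_aligned(255 * 63))        # False, first cylinder boundary is not
```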

S-2. The proper solution.

  Correct alignments for all partitions can't be achieved by the
  firmware alone.  The system utilities should be informed about the
  alignment requirements and align partitions accordingly.

  The above firmware workaround complicates the situation because the
  two different configurations require different offsets to achieve
  the correct alignments.  ATA/ATAPI-8 specifies a way for a drive to
  export the physical and logical sector sizes and the LBA offset
  which is aligned to the physical sectors.

  In Linux, these parameters are exported via the following sysfs
  files.

    physical sector size	: /sys/block/sdX/queue/physical_block_size
    logical sector size		: /sys/block/sdX/queue/logical_block_size
    alignment offset		: /sys/block/sdX/alignment_offset
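
As a sketch of how a userland tool might pick these values up, the
Python snippet below reads the three files listed above.  The function
name, the returned dictionary keys and the sysfs_root parameter (handy
for testing against a fake directory tree) are assumptions of this
example, not an existing API.

```python
from pathlib import Path


def read_topology(disk, sysfs_root="/sys/block"):
    """Read sector sizes and alignment offset (all in bytes) for a disk.

    disk is a block device name such as "sda".
    """
    base = Path(sysfs_root) / disk

    def read_int(rel):
        # Each sysfs file holds a single decimal number and a newline.
        return int((base / rel).read_text().strip())

    return {
        "pss": read_int("queue/physical_block_size"),
        "lss": read_int("queue/logical_block_size"),
        "aoff": read_int("alignment_offset"),
    }
```

On a live system this would be called as read_topology("sda").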

  Let the physical sector size be PSS, logical sector size LSS and
  alignment offset AOFF.  The system software should place partitions
  such that the starting LBAs of all partitions are aligned on

    (n * PSS + AOFF) / LSS

  For 4 KiB physical sector offset-by-one drives, PSS is 4096, LSS 512
  and AOFF 3584 and with n of 7 the above becomes,

    (7 * 4096 + 3584) / 512 == 63

  making sector 63 an aligned LBA where the first partition can be
  put, but without the offset-by-one mapping, AOFF is zero and LBA 63
  is not aligned.
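
The alignment rule can be sketched in Python as follows.  The helper
names are invented for this example; PSS, LSS and AOFF are in bytes as
reported by sysfs.

```python
def is_aligned(lba, pss, lss, aoff):
    """True if lba equals (n * PSS + AOFF) / LSS for some integer n."""
    return (lba * lss - aoff) % pss == 0


def align_up(lba, pss, lss, aoff):
    """Smallest aligned LBA which is greater than or equal to lba."""
    n = -((aoff - lba * lss) // pss)  # ceil((lba * lss - aoff) / pss)
    return (n * pss + aoff) // lss


# Offset-by-one drive: PSS 4096, LSS 512, AOFF 3584
print(is_aligned(63, 4096, 512, 3584))  # True
# Same drive without offset-by-one: AOFF 0
print(is_aligned(63, 4096, 512, 0))     # False
print(align_up(63, 4096, 512, 0))       # 64
```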

  With the above new alignment requirement in place, it becomes
  difficult to honor the legacy one - first partition on sector 63 and
  all other partitions on cylinder boundary (255 * 63 sectors) - as
  the two alignment requirements contradict each other.  This might be
  worked around by adjusting how LBA and CHS addresses are mapped but
  the disk geometry parameters are hard coded everywhere and there is
  no reliable way to communicate custom geometry parameters.


Unfortunately, there are complications.

C-1. The standard is not and won't be followed as-is.

  Some of the existing BIOSs and/or drivers can't cope with drives
  which report 4 KiB physical sector size.  To work around this, some
  drive models falsely report a physical sector size of 512 bytes when
  the actual configuration is 4 KiB without offsetting.

  This nullifies the provisions for alignment in the ATA standard but
  results in the correct alignment for Windows Vista and 7.  OS
  behaviors will be described further later.

  For these drives, which are likely to continue to be shipped for the
  foreseeable future, traditional LBA 63 and cylinder based aligning
  results in misalignment.

C-2. Windows XP depends on the traditional partition layout.

  Windows XP makes use of the CHS start/end addresses in the partition
  table and gets confused if partitions are not laid out
  traditionally.  This means that XP can't be installed into a
  partition prepared by later versions of Windows[4].  This isn't a
  big problem for Windows because in most cases the later version is
  replacing the older one, not the other way around.

  Unfortunately, the situation is more complex for Linux because Linux
  is often co-installed with various versions of Windows and XP is
  still quite popular.  This means that when a Linux partitioner is
  used to prepare a partition which may be used by Windows, the
  partitioner might have to consider which version of Windows is going
  to be used and whether to align the partitions for the correct
  alignment or compatibility with older versions of Windows.

C-3. The 2 TiB barrier and the possibility for 4 KiB logical sector size.

  The DOS partition format uses 32 bits for the starting LBA and the
  number of sectors and, reportedly, 32 bit Windows XP shares the
  limitation.  With 32 bit addressing and 512 byte logical sector
  size, the maximum addressable sector + 1 is at

    2^32 * 2^9 == 2^41 == 2 TiB

  The DOS partition format allows a partition to reach beyond 2 TiB as
  long as the starting LBA is under 2 TiB; however, both Windows XP
  and the Linux kernel (at least up to v2.6.33) refuse such partition
  configurations.

  With the right combination of host controller, BIOS and driver, this
  barrier can be overcome by enlarging the logical sector size to 4
  KiB, which will push the barrier out to 16 TiB.  On the right
  configuration, Windows XP is reportedly able to address beyond the 2
  TiB barrier with a DOS partition and 4 KiB logical sector size.
  The Linux kernel up to v2.6.33 doesn't work under such configurations
  but a patch to make it work is pending[5].
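
The two limits follow directly from 32 bit sector addressing, as this
short Python sketch shows.  The function name is made up for the
example.

```python
TIB = 2 ** 40


def dos_limit_bytes(logical_sector_size):
    """Bytes addressable with a 32 bit sector number."""
    return 2 ** 32 * logical_sector_size


print(dos_limit_bytes(512) // TIB)   # 2
print(dos_limit_bytes(4096) // TIB)  # 16
```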

  This might also be beneficial for operating systems which don't
  suffer from this limitation.  A different partition format - GPT[6]
  - should be used beyond 2^32 sectors, which could harm compatibility
  with older BIOSs or other operating systems which don't recognize
  the new format.

  As mentioned previously, 512 byte sector assumption has been there
  for a very long time and changing it is likely to cause various
  compatibility problems at many different layers from hardware up to
  the system utilities.


As hard drive vendors aim for performance and compatibility in modern
Windows environments, it is worthwhile to investigate how Windows
partitions drives with different alignment requirements.  Up until
Windows XP, it followed the traditional layout - the first partition
on LBA 63 and the others on cylinder boundaries, where a cylinder is
defined as 255 tracks with 63 sectors each.
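
For reference, the usual CHS to LBA conversion under this geometry can
be sketched in Python as below.  The function name is hypothetical;
sector numbers within a track start at 1.

```python
HEADS_PER_CYL = 255
SECTORS_PER_TRACK = 63


def chs_to_lba(cyl, head, sect):
    """Convert a CHS address to an LBA using 255 heads and 63 sectors/track."""
    return (cyl * HEADS_PER_CYL + head) * SECTORS_PER_TRACK + (sect - 1)


print(chs_to_lba(0, 1, 1))     # 63, traditional start of the first partition
print(chs_to_lba(1, 0, 1))     # 16065, first cylinder boundary
print(chs_to_lba(0, 32, 33))   # 2048, the 1 MiB position
```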

Windows Vista and 7 align partitions differently.  As the two behave
similarly, only 7's behavior is shown here.  These partition tables
are created by Windows 7 RC installer on blank disks.

W-1. 512 byte physical and logical sector drive.

  80 202100 07 df130c 00080000 00200300
  00 df140c 07 feffff 00280300 00689e12
  00 000000 00 000000 00000000 00000000
  00 000000 00 000000 00000000 00000000

  Part0:	FIRST	C    0	H   32	S   33	: 2048		(63 sec/trk)
		LAST	C   12	H  223	S   19	: 206847	(255 heads/cyl)
		LBA	2048 + 204800 = 206848

  Part1:	FIRST	C   12	H  223	S   20	: 206848
		LAST	C 1023	H  254	S   63	: E
		LBA	206848 + 312371200 = 312578048

  Both aligned at (2048 * n).  Part 1 not aligned to cylinder.

W-2. 4 KiB physical and 512 byte logical sector drive without offset-by-one.

  80 202100 07 df130c 00080000 00200300
  00 df140c 07 feffff 00280300 00b83f25
  00 000000 00 000000 00000000 00000000
  00 000000 00 000000 00000000 00000000

  Part0:	FIRST	C    0	H   32	S   33	: 2048		(63 sec/trk)
		LAST	C   12	H  223	S   19	: 206847	(255 heads/cyl)
		LBA	2048 + 204800 = 206848

  Part1:	FIRST	C   12	H  223	S   20	: 206848
		LAST	C 1023	H  254	S   63	: E
		LBA	206848 + 624932864 = 625139712

  Both aligned at (2048 * n).  Part 1 not aligned to cylinder.

W-3. 4 KiB physical and 512 byte logical sector drive with offset-by-one.

  80 202800 07 df130c 07080000 f91f0300
  00 df1b0c 07 feffff 07280300 f9376d74
  00 000000 00 000000 00000000 00000000
  00 000000 00 000000 00000000 00000000

  Part0:	FIRST	C    0	H   32	S   40	: 2055		(63 sec/trk)
		LAST	C   12	H  223	S   19	: 206847	(255 heads/cyl)
		LBA	2055 + 204793 = 206848

  Part1:	FIRST	C   12	H  223	S   27	: 206855
		LAST	C 1023	H  254	S   63	: E
		LBA	206855 + 1953314809 = 1953521664

  Both aligned at (2048 * n + 7).  Part 1 not aligned to cylinder.
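
The hex dumps above are raw 16 byte DOS partition table entries.  A
minimal Python sketch for decoding one of them follows; the function
name and the returned dictionary layout are choices made for this
example only.

```python
import struct


def decode_entry(hex_entry):
    """Decode one 16 byte DOS partition entry given as a hex string."""
    raw = bytes.fromhex(hex_entry.replace(" ", ""))
    if len(raw) != 16:
        raise ValueError("a partition entry is exactly 16 bytes")
    # status, start H/S/C, type, end H/S/C, then two little endian u32s
    (status, sh, ss, sc, ptype, eh, es, ec,
     start_lba, nr_sectors) = struct.unpack("<8BII", raw)

    def chs(h, s, c):
        # The top two bits of the sector byte are cylinder bits 8 and 9.
        return (((s & 0xC0) << 2) | c, h, s & 0x3F)

    return {
        "bootable": status == 0x80,
        "type": ptype,
        "first_chs": chs(sh, ss, sc),
        "last_chs": chs(eh, es, ec),
        "start_lba": start_lba,
        "nr_sectors": nr_sectors,
    }


w1 = decode_entry("80 202100 07 df130c 00080000 00200300")
w3 = decode_entry("80 202800 07 df130c 07080000 f91f0300")
print(w1["first_chs"], w1["start_lba"], w1["nr_sectors"])  # (0, 32, 33) 2048 204800
print(w3["start_lba"] % 2048)                              # 7
```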

The partitioner seems to be using 1 MiB as the basic alignment unit
and offsetting from there if explicitly requested by the drive.  There
is no difference between the handling of 512 byte and 4 KiB drives,
which explains why C-1 works for hard drive vendors.

In all cases, the partitioner ignores both the first partition on LBA
63 and the others on cylinder boundary requirements while still using
the same 255*63 cylinder size.  Also, note that in W-3, both part 0
and 1 end up with odd number of sectors.  It seems that they simply
decided to completely break away from the traditional layout, which is
understandable given that there really isn't one good solution which
can cover all the cases and that the default larger alignment benefits
earlier SSDs.

Windows Vista basically shows the same behavior.  Vista was tested by
creating two partitions using the management tool.  Test data is
available at [7].

  *-alignment_offset	: alignment_offset reported by Linux kernel
  *-fdisk		: fdisk -l output
  *-fdisk-u		: fdisk -lu output
  *-hdparm		: hdparm -I output
  *-mbr			: dump of mbr
  *-part		: decoded partition table from mbr

Please note that hdparm is misreporting the alignment offset.  It
should be reporting 512 instead of 256 for offset-by-one drives.

So, what now for Linux?

The situation is not easy.  Considering all the factors, the only
workable solution looks like doing what Windows is doing.  Hard drive
and SSD vendors are focusing on compatibility and performance on
recent Windows releases and are happy to do things which break the
standard defined mechanism as shown by C-1, so departing from what
Windows does would be unnecessarily painful.

Unfortunately, while Windows can assume that newer releases won't
share the hard drive with older releases including Windows XP, Linux
distros can't do that.  There will be many installations where modern
Linux distros share a hard drive with older releases of Windows.  At
this point, I can't see a silver bullet solution.

Perhaps partitioners should align only the partitions which will be
used by Linux and default to the traditional layout for others while
allowing explicit override.  I think Windows XP wouldn't have a
problem with differently aligned partitions as long as it doesn't
actually use them, but I haven't tested it.

Reportedly, commonly used partitioners aren't ready to handle drives
larger than 2 TiB in any configuration and alignment isn't done
properly for drives with 4 KiB physical sectors.  4 KiB logical sector
support is broken in both the kernel and partitioners.  (need more
details and probably a whole section on partitioner behaviors)

Unfortunately, the transition to 4 KiB sector size, physical only or
logical too, is looking fairly ugly.  Hopefully, a reasonable solution
can be reached in the not too distant future but even with all the
software side updated, it looks like it's gonna cause a significant
amount of confusion and frustration.

[1] http://www.anandtech.com/storage/showdoc.aspx?i=3691
[2] http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
[3] http://en.wikipedia.org/wiki/Master_boot_record
[4] http://support.microsoft.com/kb/931760
[5] http://thread.gmane.org/gmane.linux.kernel/953981
[6] http://en.wikipedia.org/wiki/GUID_Partition_Table
[7] http://userweb.kernel.org/~tj/partalign/

* Mar 04 2009
	Initial draft, Tejun Heo 
* Mar 08 2009
	Updated according to comments from Daniel Taylor.  Other minor
	updates.