Personal tools
You are here: Home Blog Oracle Disk-IO Assessment

Oracle Disk-IO Assessment

When analyzing the performance of an Oracle RDBMS in many cases one may want to take a look at on how the IO subsystems perform. Actually there has not been one situation in my carreer as an DBA when I was confronted with a performance issue that was not related to either disk or network IO. Therefore it is vital for a database sysstem in order to stay healthy to regularly assess the IO-performance.

In many cases nowadays hardware is virtualized and one has many options on where and how to measure the IO-througput of the sysstem.

  1. On the level of the IO-system.
    In the case the database files reside on a SAN or a dedicated and intelligent disk-array it is usualy possible to collect IO-Statistic right here. e.g. if you happen to have a netapp-filer just login via ssh and issue the following command:
    sysstat -x 5
    This will give you current statistics of the ongoings on your SAN box.
  2. On the OS level
    If you are lucky and your DB is running on some sort of UNIX then you can use the iostat tool in order to get elaborate IO-statistics. The differnt flavours of iostat on the differnt UNIX derivates offerr different functionality and outputs. However the following command should be a good starting point on any SystemV UNIX:
    iostat -x 5
    If you happen to run Solaris you may want to try the following:
    iostat -xcnz 5
  3. On the database level
    When you are done with the above and have collected enough data in order to convince the "mere mortals" that there is an issue with the IO-performance, you may want to dig deeper and actually find out, how bad it really is and how it affects you database operations.
    Let's go ...

 Analyzing IO-access from within the Oracle Database

The database quite elaborately tracks it's access to the data files and stores this information until the next shutdown. Therefore it ist possible to analyze IO-performance exactly where it matters. I therefore recommend to measure the IO-performace form within the database, because here we get the real numbers, here we can really see what happens and how fast or slow the database is able to retrieve data from or write data to the storage.

Still it may be neccessary to back up ones findings with the above methods, in order to discuss the issue with the rest of the world.

The first thing that may be interesting is how long it takes the database to perform an IO-operation. Try the following statement:

clear breaks
set pages 100 lin 200
col a heading "Single Block Reads in ms"
col b heading "Count"
select 
  singleblkrdtim_milli a,
  sum(singleblkrds) b
from 
  v$file_histogram v,
  dba_data_files d
where v.file# = d.file_id
group by singleblkrdtim_milli
order by 2
;

 This gives you a listing that shows you the accounted time of every singel block read since DB start. Not the exact time of the block reads is recorded but every block read has been assignet to a category 1ms, 2ms, 4ms, 8ms, ...

It is quite normal that the time to read a singel block differs a lot depending on how buisy the system is and on how much concurrent operations take place. On a healthy system most operations should be fast and less operations should take longer.

e.g. the following listing is from a DB-system that does not too good but reasonably well:

Single Block Reads in ms      Count
------------------------ ----------
                    2048          3
                    1024          6
                     512         35
                     256        189
                     128       1899
                      64      20117
                      32      20400
                      16     107419
                       2     174019
                       4     345235
                       8     521724
                       1     904356

Probably there should be less requests that take 8ms but still, this is somehow OK. If we were to descide that we generally expect a sigle request to be 5ms or less and we start to worry about everything that takes longer then 10ms, the above still is somehow accepable.

5ms or less for a single block read is as per rule of thumb what could be considered fast. Everything up to 10ms may still be acceptable. When it takes longer then 10ms to read a single block, then this is usually considered slow and everyting above 20ms is starting to generate pain ...

So let's look at the listing of a system that suffers from poor IO:

Single Block Reads in ms      Count
------------------------ ----------
                   16384         17
                    8192         29
                    4096        153
                    2048        785
                    1024      10579
                     512     106621
                     256     652412
                       2    1703739
                       4    2313493
                     128    2743911
                      64    3623268
                      32    9595279
                       8   19620750
                      16   29736386
                       1   70583824

This is from a production system that really has an IO issue. The DB is not that buisy but has its datafiles on a SAN storage with also facilitates a number of other systems with the same physical discs. On top of that it has been chosen that the disk layout will be a parity RAID configuration which usually turns out to be very unfavourybly for any DB sooner or later (see http://www.baarf.com). 

On the OS level the DB machine shows significant amounts of waitIO:

$ sar 2 33

12:40:51          CPU     %user     %nice   %system   %iowait    %steal     %idle
12:40:53          all      2,21      0,00      0,74     13,24      0,00     83,82
12:40:55          all      0,00      0,00      0,73     12,41      0,00     86,86
12:40:57          all      1,45      0,00      0,72     12,32      0,00     85,51
12:40:59          all      1,47      0,00      0,00     11,76      0,00     86,76

 The filer shows that the physical disks are mostly busy:

> sysstat -x 3
 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out
  6%   942     0     0    1061  1371  8039   4497     21     0     0    18   94%   0%  -   46%    119     0   302  1076
  5%   724     0     0     915  1496  7589   4261     16     0     0    18   93%   0%  -   45%    180     0   255  2164
  7%   611     0     0     630   400   319   6130  14161     0     0    18   99%   0%  -   31%     19     0   141    68
  6%   834     0     0     998   454   299   2896   4853     0     0    18   99%   0%  -   42%    163     0    39  2281
  3%   780     0     0     810   483   280   1472     21     0     0    18   99%   0%  -   24%     30     0   109    63
So now we have analyzed the situation in a very detailed way. Not only have we found that our system does have an IO issue, but we also where able to find out how slow it is.
Still this may not be enough. We may have to determine what part of the application the DB facilitates is affected and in what way.
 
To be continued ...
Document Actions