Oracle Disk-IO Assessment
When analyzing the performance of an Oracle RDBMS in many cases one may want to take a look at on how the IO subsystems perform. Actually there has not been one situation in my carreer as an DBA when I was confronted with a performance issue that was not related to either disk or network IO. Therefore it is vital for a database sysstem in order to stay healthy to regularly assess the IO-performance.
In many cases nowadays hardware is virtualized and one has many options on where and how to measure the IO-througput of the sysstem.
- On the level of the IO-system.
In the case the database files reside on a SAN or a dedicated and intelligent disk-array it is usualy possible to collect IO-Statistic right here. e.g. if you happen to have a netapp-filer just login via ssh and issue the following command:
sysstat -x 5
This will give you current statistics of the ongoings on your SAN box. - On the OS level
If you are lucky and your DB is running on some sort of UNIX then you can use the iostat tool in order to get elaborate IO-statistics. The differnt flavours of iostat on the differnt UNIX derivates offerr different functionality and outputs. However the following command should be a good starting point on any SystemV UNIX:
iostat -x 5
If you happen to run Solaris you may want to try the following:
iostat -xcnz 5
- On the database level
When you are done with the above and have collected enough data in order to convince the "mere mortals" that there is an issue with the IO-performance, you may want to dig deeper and actually find out, how bad it really is and how it affects you database operations.
Let's go ...
Analyzing IO-access from within the Oracle Database
The database quite elaborately tracks it's access to the data files and stores this information until the next shutdown. Therefore it ist possible to analyze IO-performance exactly where it matters. I therefore recommend to measure the IO-performace form within the database, because here we get the real numbers, here we can really see what happens and how fast or slow the database is able to retrieve data from or write data to the storage.
Still it may be neccessary to back up ones findings with the above methods, in order to discuss the issue with the rest of the world.
The first thing that may be interesting is how long it takes the database to perform an IO-operation. Try the following statement:
clear breaks set pages 100 lin 200 col a heading "Single Block Reads in ms" col b heading "Count" select singleblkrdtim_milli a, sum(singleblkrds) b from v$file_histogram v, dba_data_files d where v.file# = d.file_id group by singleblkrdtim_milli order by 2 ;
This gives you a listing that shows you the accounted time of every singel block read since DB start. Not the exact time of the block reads is recorded but every block read has been assignet to a category 1ms, 2ms, 4ms, 8ms, ...
It is quite normal that the time to read a singel block differs a lot depending on how buisy the system is and on how much concurrent operations take place. On a healthy system most operations should be fast and less operations should take longer.
e.g. the following listing is from a DB-system that does not too good but reasonably well:
Single Block Reads in ms Count
------------------------ ----------
2048 3
1024 6
512 35
256 189
128 1899
64 20117
32 20400
16 107419
2 174019
4 345235
8 521724
1 904356
Probably there should be less requests that take 8ms but still, this is somehow OK. If we were to descide that we generally expect a sigle request to be 5ms or less and we start to worry about everything that takes longer then 10ms, the above still is somehow accepable.
5ms or less for a single block read is as per rule of thumb what could be considered fast. Everything up to 10ms may still be acceptable. When it takes longer then 10ms to read a single block, then this is usually considered slow and everyting above 20ms is starting to generate pain ...
So let's look at the listing of a system that suffers from poor IO:
Single Block Reads in ms Count
------------------------ ----------
16384 17
8192 29
4096 153
2048 785
1024 10579
512 106621
256 652412
2 1703739
4 2313493
128 2743911
64 3623268
32 9595279
8 19620750
16 29736386
1 70583824
This is from a production system that really has an IO issue. The DB is not that buisy but has its datafiles on a SAN storage with also facilitates a number of other systems with the same physical discs. On top of that it has been chosen that the disk layout will be a parity RAID configuration which usually turns out to be very unfavourybly for any DB sooner or later (see http://www.baarf.com).
On the OS level the DB machine shows significant amounts of waitIO:
$ sar 2 33 12:40:51 CPU %user %nice %system %iowait %steal %idle 12:40:53 all 2,21 0,00 0,74 13,24 0,00 83,82 12:40:55 all 0,00 0,00 0,73 12,41 0,00 86,86 12:40:57 all 1,45 0,00 0,72 12,32 0,00 85,51 12:40:59 all 1,47 0,00 0,00 11,76 0,00 86,76
The filer shows that the physical disks are mostly busy:
> sysstat -x 3
CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk FCP iSCSI FCP kB/s
in out read write read write age hit time ty util in out
6% 942 0 0 1061 1371 8039 4497 21 0 0 18 94% 0% - 46% 119 0 302 1076
5% 724 0 0 915 1496 7589 4261 16 0 0 18 93% 0% - 45% 180 0 255 2164
7% 611 0 0 630 400 319 6130 14161 0 0 18 99% 0% - 31% 19 0 141 68
6% 834 0 0 998 454 299 2896 4853 0 0 18 99% 0% - 42% 163 0 39 2281
3% 780 0 0 810 483 280 1472 21 0 0 18 99% 0% - 24% 30 0 109 63

