Oracle RAC Moved to Mainstream Use
 
6 February 2009

Donna Scott, Donald Feinberg

Gartner RAS Core Research Note G00164939
 

After more than eight years of nurturing and improving the technology, manageability and implementation, Oracle Real Application Clusters has been moved to the mainstream and is providing significant advantages to customers. We provide guidance on when and where RAC provides the most benefit.





Overview



Oracle has made significant improvements to its Real Application Clusters (RAC) technology during the past eight years, making it easier to manage. Thus, Oracle has moved RAC to the mainstream, with more than 15,000 customers in production.

Key Findings
  • Oracle has greatly improved RAC's manageability with Oracle Database 10g and Oracle Database 11g.
  • More than 15,000 companies have implemented RAC.
  • Key benefits of RAC include availability, horizontal scalability on lower-cost hardware and sharing capacity across database management system (DBMS) instances in a grid.
  • Oracle has been successful at driving Linux as a key platform for its DBMS; at least 30% of the RAC installed base is on Linux.
Recommendations
  • Organizations desiring shared capacity for databases and horizontal scaling of the database to align with business demand should evaluate RAC.
  • Organizations should obtain RAC training and education, because clustering is more complex than single-instance DBMSs.
  • Customers desiring transparent failover of transactions during server failures should use Oracle application programming interfaces (APIs) and write application logic to enable automatic processing of the transaction on alternative available RAC nodes.
  • Do not use RAC where distributed, shared-nothing environments are required.
  • Use caution when considering RAC for small businesses and when there is a lack of clustering skills.



Analysis



Oracle RAC is a shared database environment where multiple server nodes share DBMS instances, with shared concurrent access to disk. In November 2003, Gartner analyzed usage and implementation of early adopters of Oracle Database 9i (9i) RAC. At that time, RAC was reliable and provided increased scalability and availability; however, it was complex and required highly experienced database administrators (DBAs) to generate value; the complexity was a RAC inhibitor.

Fast-forward five years and multiple releases: Oracle has significantly improved RAC's operational manageability and reduced the skill level required for implementation and management. We interviewed more than a dozen RAC customers (most of whom implemented Oracle Database 10gR2 RAC) to gain an understanding of the business value of implementing RAC, and its pros and cons. Overall, we found significant value that justified the additional software costs.

However, RAC is not for every application. Here, we provide an analysis of benefits, strengths and challenges, along with guidance on when to use or not to use RAC.

By year-end 2003, Oracle had approximately 1,000 RAC production customers; today, it has more than 15,000, bringing RAC to the mainstream. Oracle has penetrated the midmarket as well, primarily through its Dell reseller agreement. Thus, Oracle's investment in Oracle Database 10g (10g) manageability has reduced the complexity inhibitor. RAC runs on all Oracle-supported platforms, with Linux being its platform of choice. Approximately 30% of RAC implementations have deployed on Linux (and the percentage is rising).

Oracle's main RAC sales theme is selling a fully integrated software stack, including the operating system (OS): Oracle Enterprise Linux, which is Oracle's distribution of Red Hat Linux. While RAC licenses incur a 50% increment over a single-instance DBMS (see www.oracle.com/corporate/pricing/technology-price-list.pdf), Oracle justifies the pricing on the basis of reduced hardware costs (in addition to RAC benefits).




RAC Improvements Since 2003

Evolution of the Grid

While 9i enabled "islands" of RAC instances, 10g pulls those islands together into an interconnected grid. Through a concept called "services," Oracle enables any node to be allocated to any service and enables the databases to share capacity, including spare capacity. Although Oracle defines a service as a RAC database, its vision is that the service will encompass all tiers of the application architecture, not just the database tier.

Clusterware on All Platforms

In 9i, Oracle Clusterware was only available on Windows and Linux. With 10g, it extended the technology to its other platforms. In 2003, Oracle acquired TruCluster software assets, which are a base for Oracle's Clusterware implementation. Therefore, customers that want a single-instance database with failover, rather than RAC, do not need to use third-party cluster software. However, for Unix platforms, customers can still implement their third-party clustering software if they choose to. For Windows and Linux, however, Oracle only supports its own clusterware with RAC.




Automatic Storage Management

For 10g, Oracle added Automatic Storage Management (ASM), a no-charge option that provides a grid volume manager for single-instance and RAC-clustered databases. (RAC uses ASM underneath its architecture for its cluster volume management.) Logical unit numbers (LUNs), which are logical storage allocations, are assigned by a storage administrator to ASM, which then forms a shared storage pool for database storage. ASM provides striping and mirroring (two- and three-way), and enables DBAs to add storage from the pool to a database without downtime.

ASM reduces DBA labor time in managing storage because storage space allocation and performance tuning (where to place the data for optimum performance) are done automatically by the software and not the DBA. Moreover, customers that implement ASM reduce their need for third-party file systems and volume managers, thus reducing their overall software and maintenance costs. ASM is supported with storage area network (SAN) or network-attached storage. Similar to how it has used reduced server hardware investment to justify RAC, Oracle is now advocating using less-expensive shared storage as a justification for RAC and ASM.

Manageability

Oracle enhanced 10g's manageability to reduce DBA labor time. The 10g product achieves a 25% to 50% reduction in DBA time over 9i RAC, but managing a cluster is still not the same as managing a single system. To that end, Oracle added the capability for 10g to manage multiple databases from a single view. Commands can be run against an entire cluster, a database or a specific instance from one console.

Oracle implemented task automation in 10g, enabling a workflow engine for task execution and rollback. Oracle calls this feature "Grid Control," and it is enabled through Oracle Enterprise Manager. Oracle also reduced installation time, implemented more self-tuning capabilities, such as for storage performance and automated shared memory tuning, and added more proactive diagnostics to aid root-cause analysis. With Oracle Database 11g (11g), Oracle's Automatic Database Diagnostic Monitor now runs on a RAC cluster, further reducing the complexity and resources required to manage RAC.

Planned Downtime

The 10g RAC supports rolling patches, OS upgrades and hardware upgrades for the grid without application downtime, but does not support rolling-version upgrades. In 11g, Oracle added support for rolling ASM upgrades and patches. Although Oracle supports rolling DBMS patching, there is always the possibility of a patch requiring downtime due to the nature of the patch.

RAC does not provide transparent failover of update transactions when a cluster node fails. Applications that are not written to survive a database crash may require a restart or re-login. To make applications aware of a database server failure, the developer must write the logic to reconnect with a surviving node and resubmit lost transactions. Because of this, not all independent software vendor (ISV) applications support RAC.
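
The reconnect-and-resubmit logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Oracle's API: `NodeFailure`, `connect`, `transaction` and `run_with_failover` are hypothetical names, and a real application would use the vendor's client and cluster APIs to detect the failure. Resubmitting is only safe when the application can confirm that the original transaction did not commit.

```python
import time


class NodeFailure(Exception):
    """Hypothetical error raised when the connected cluster node is unavailable."""


def run_with_failover(connect, transaction, nodes, max_attempts=3, delay_s=0.0):
    """Run transaction(conn); on NodeFailure, reconnect to the next node and resubmit.

    connect and transaction are caller-supplied callables (assumptions for this
    sketch); nodes is a list of node identifiers tried in round-robin order.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    last_error = None
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]
        try:
            conn = connect(node)
            return transaction(conn)
        except NodeFailure as exc:
            # The node failed before or during the transaction: remember the
            # error, pause briefly, then retry on the next node in the list.
            last_error = exc
            time.sleep(delay_s)
    raise RuntimeError("transaction failed on all attempts") from last_error
```

In this sketch the retry loop lives in the application tier, which is why applications not written this way may require a restart or re-login after a node failure.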




RAC Adoption Motivators and Benefits

Early adopters in 2003 primarily deployed RAC for improved availability, as it provides failover in less than 60 seconds when a server or database instance fails in a cluster. Although RAC provides increased availability, most clients now implement RAC to increase scalability and flexibility. Scalability is improved through scale-out architectures, enabling horizontal scaling on lower-cost hardware (versus vertical scaling with symmetric multiprocessing [SMP] systems) and the ability to better match investments with increased business demand (versus having to buy hardware in anticipation of demand).

Because many cluster nodes can participate in a 10g grid, any database instance can run on any node, thus improving flexibility. For example, if a node needs to come down for planned downtime, then the database demand can be spread across other nodes in the cluster. One node can even be repurposed from running one instance to another instance. Moreover, if demand is rising from one application and more capacity is needed, then the DBA can add more capacity by starting the database on another node in the grid.

Specific benefits that Oracle customers cited include:

  • Shared infrastructure and storage: The increased flexibility and manageability enabled by ASM enable customers to manage many databases in a single cluster of server nodes. Nodes can run any database instance. Because RAC can scale up or down horizontally, DBAs can optimize the capacity of each individual database by scheduling or manually initiating capacity increases or decreases, thus reducing the overprovisioning that occurs with infrastructure islands of databases. In addition, clients that had tried Oracle's Cluster File System (CFS) in 9i indicated that ASM is easier to manage and reduces complexity associated with allocating storage to instances.
  • Active/active processing: RAC enables full use of acquired capacity, rather than operating in active/passive mode.
  • Horizontal scale and lower-cost x86 servers: Many RAC customers buy into the grid concept by justifying using lower-cost x86 servers, rather than using more-expensive Unix SMP systems. Many Oracle customers cited the ability to scale horizontally with business growth as the primary driver for using RAC. Clients stated they found 80% or more scaling of new nodes added to the grid (10 to 18 nodes total).
  • Planned downtime: Customers migrate instances between nodes to enable hardware and OS maintenance without affecting users.
  • Unplanned downtime: Failover time is typically less than one minute. However, depending on the type of application, some organizations measured failover time in tens of seconds. Query transactions are preserved across the cluster without user re-entry (update transactions must be re-entered).
  • DBA efficiencies of up to 50% savings over managing 9i RAC: DBAs can provision nodes faster, as well as manage storage and tune the environment with fewer resources.
  • Single integrated stack of software (RAC DBMS, ASM and Grid Control) with one-stop support: All the customers with whom we spoke reported excellent support by Oracle for their RAC environments. While all these customers continued to use the OS supplier for OS support, those using Linux noted Oracle's considerable knowledge on the OS side, which aided support calls. For that reason, some were evaluating Oracle's Linux distribution. The integrated stack also reduces third-party software required for tools such as clustering, file systems and diagnostics.
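
To make the "80% or more scaling" figure cited above concrete, the sketch below estimates aggregate capacity as nodes are added. It assumes one possible reading of that figure, namely that each node added after the first contributes 80% of a single node's throughput; the function name and parameters are illustrative and not taken from the interviews.

```python
def estimated_throughput(nodes, single_node_throughput=1.0, efficiency=0.8):
    """Estimate aggregate throughput of a grid of `nodes` nodes.

    Assumption for this sketch: the first node delivers its full throughput
    and every additional node contributes `efficiency` times that amount.
    """
    if nodes < 1:
        raise ValueError("a grid needs at least one node")
    return single_node_throughput * (1 + efficiency * (nodes - 1))


if __name__ == "__main__":
    for n in (1, 2, 10, 18):
        print(n, "nodes ->", round(estimated_throughput(n), 1), "x one node")
```

Under this reading, a 10-node grid delivers roughly 8.2 times the throughput of a single node, and an 18-node grid roughly 14.6 times.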



Challenges
  • Although Oracle has lowered the complexity with RAC, clustering is still complex and requires specialized DBA and system administration skills and training.
  • Although customer feedback on RAC was largely positive, the one persistent complaint was the lack of a complete graphical user interface (GUI) for monitoring and control; many reported that the existing GUI was inconsistent across platforms and did not provide enough granular information about tasks or optimization possibilities. Oracle has made significant changes in its GUI for 11g.
  • RAC uses a single shared database and, therefore, has a single point of failure. Recovery strategies are vital to deal with risk scenarios such as data corruption, cluster failure and site failure, as well as version upgrades without downtime. While Oracle provides various solutions (in 10g and 11g) to resolve these issues, we found some customers, especially those from small and midsize businesses and new to RAC, to be less familiar with these options.
  • The scale of the grid is unknown at this time. The largest grid we saw in our interviews was 18 nodes in the primary site (running eight databases) and 18 nodes in the disaster recovery site. Oracle indicates that the largest number of nodes in a single cluster today is 32 nodes. It is unclear, as grids grow, what the management implications will be.
  • ASM is a culture change for storage administrators, who lose visibility of the storage when they hand it over to DBAs and ASM to manage. While ASM's first version (10gR1) was buggy, reports are much better for 10gR2. Training is available only from Oracle, but it is vital for DBAs to attend to become self-sufficient.
  • Confusion exists about ASM versus CFS. In 9i RAC, Oracle offered raw partitions or CFS (for Windows or Linux only). Both had drawbacks from a manageability perspective. In 10g, CFS continues to be supported, but ASM is Oracle's strategic direction for Oracle databases. However, customers are confused about which to use when, and when to use both. For example, ASM does not support shared binaries, but CFS does; ASM provides raw disk access and is better performing, but CFS is easier to manage and performs more granular actions.
  • RAC does not provide transparent failover of update transactions when a cluster node fails. A restart of the middle-tier applications is often required to redirect clients to another node. To do so, application developers must write to cluster and client APIs. Consequently, not all ISV applications support RAC.
  • Oracle has added better configuration management — provisioning, cloning and patching. These require the purchase of two Oracle Management Packs at an additional cost of $3,500 per processor license each, which clients cited as too expensive for the benefits they bring. Patches still require node downtime, although RAC can keep the DBMS up on other nodes.
  • RAC does not dynamically increase or decrease capacity, but capacity changes can be scheduled by DBAs or initiated manually.
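
The Management Pack pricing cited above (two packs at $3,500 per processor license each) adds up quickly on a multinode cluster. The sketch below works through the arithmetic; the processor count in the example is hypothetical, and discounts or support fees are not modeled.

```python
PACK_PRICE_PER_PROCESSOR_USD = 3_500  # per-pack price cited in this note
PACKS_REQUIRED = 2                    # two Management Packs, per this note


def management_pack_cost(processor_licenses):
    """Return the added license cost in USD for the two Management Packs."""
    if processor_licenses < 0:
        raise ValueError("processor_licenses must be non-negative")
    return PACKS_REQUIRED * PACK_PRICE_PER_PROCESSOR_USD * processor_licenses


if __name__ == "__main__":
    # Hypothetical cluster with 16 processor licenses.
    print(management_pack_cost(16))  # 2 * 3,500 * 16 = 112,000
```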



Lessons Learned
  • Putting binaries in CFS enables any database to run on any node in the cluster, but also risks downtime for the entire cluster if the SAN fails.
  • Effective disk load balancing with ASM requires that LUN sizes be standardized.
  • The certification environment (post-quality assurance) is critical for testing cluster functions, failover and patches, for example.
  • Use Oracle APIs to notify the middle tier about node failures, and make the failure transparent to users. This will avoid having to restart the middle tier, which can take many minutes.
  • Test multipathing with ASM, and ensure that the storage supplier supports ASM.
  • Verify third-party software certification with RAC.
  • For latency-sensitive applications, put Oracle logs on high-performing disks.
  • Design database redundancy into your architecture for a complete, high-availability, continuous operations and disaster recovery solution with RAC.
  • RAC is not for every environment; smaller organizations without clustering skills should be cautious about using RAC.
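
The lesson about standardizing LUN sizes can be illustrated with a simple model. The sketch assumes that ASM places data on each LUN in proportion to its capacity, so a larger LUN holds more data and is expected to receive a larger share of I/O; the function name and the sizes used are illustrative only.

```python
def expected_io_share(lun_sizes_gb):
    """Return each LUN's expected share of I/O.

    Assumption for this sketch: data, and therefore I/O, is spread across
    LUNs in proportion to their capacity.
    """
    total = sum(lun_sizes_gb)
    if total <= 0:
        raise ValueError("total LUN capacity must be positive")
    return [size / total for size in lun_sizes_gb]


if __name__ == "__main__":
    print(expected_io_share([200, 200, 200, 200]))  # standardized sizes
    print(expected_io_share([100, 100, 200]))       # mixed sizes
```

With equal-size LUNs every device carries the same share of the load; with mixed sizes, the largest LUN carries half of it in this example and becomes the likely bottleneck.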



Bottom Line

Oracle has greatly improved 10g RAC's manageability and, as a result, moved it to the mainstream. IT organizations should evaluate RAC when they are seeking horizontal database scaling and want to:

  • Achieve the flexibility of running many databases in one cluster with the ability to share the infrastructure
  • Reduce idle/passive resources
  • Reduce failover times for node failures

IT organizations should realize, however, that clustering adds complexity to the environment and should not be implemented lightly. It requires users to have significant training and education to be successful, even with the significant manageability improvements Oracle has made with 10g and 11g.


© 2009 Gartner, Inc. and/or its Affiliates. All Rights Reserved. Reproduction and distribution of this publication in any form without prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information. Although Gartner's research may discuss legal issues related to the information technology business, Gartner does not provide legal advice or services and its research should not be construed or used as such. Gartner shall have no liability for errors, omissions or inadequacies in the information contained herein or for interpretations thereof. The opinions expressed herein are subject to change without notice.