By Kevin Medlin
August 13, 2008
Introduction
The most important information in most businesses can be found in the database. A lot of time and attention goes into planning for any new database application. Storage, servers, high availability, capacity, and clustering are just some of the considerations.
The same planning process must take place for disaster recovery and business continuity planning of databases. All actions taken to make business critical applications available must be methodical and deliberate. Disruptions are serious events and should not be taken lightly. "It's not about seeing the [recovered] data on your screen, but conducting business." (Day, Jo, Day, Kevin, 2006) Databases that are at the heart of the business today fall squarely on the critical path of the disaster recovery actions taken when a disruption strikes.
Partial or complete disruptions of a business can be devastating. Business continuity planning can ensure that capacity is available for critical business operations in the time of need. Practiced professionals in the area of business continuity understand that life and opportunities can continue after a disaster. Understanding the steps involved with keeping a business viable is where some planning is needed.
Destruction of assets can be devastating. Insurance may cover the expense to replace those assets but it will not put a business back in place overnight. This takes a huge mental and physical toll on workers. These conditions create burdens and stress on employees and their customers. Without a disaster recovery plan in place, there is little hope of ever getting a business back on its feet.
Requirements
One of the first things needed is the set of requirements for each supported database. Recovery times are probably the most important of these requirements. The difference between a few seconds of downtime and a few minutes of downtime can be quite substantial. Some business units may have a tolerance for a few hours. This must be known for each database for your plan to be effective. "...you have to prioritize what you need in order to function... you have to figure out what is actually mission-critical." (O'Hanlon, 2007)
Another important answer needed is in reference to data loss. If little to no data loss is acceptable, then a disaster recovery solution can become a budgetary concern. If the backup from last night will suffice, then this can lead to major cost savings.
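The answers to these first two questions can be captured in a small record per database. The following Python sketch is only an illustration: the class name, field names, sample databases, and the 24-hour backup interval are assumptions made for the example, not values from any particular product or site.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryRequirements:
    """Hypothetical record of what the business has agreed to for one database."""
    database: str
    max_downtime: timedelta   # how long the business unit can be without the database
    max_data_loss: timedelta  # how much recent work the business unit can afford to lose


def nightly_backup_suffices(req: RecoveryRequirements,
                            backup_interval: timedelta = timedelta(hours=24)) -> bool:
    """True when losing everything since the last scheduled backup is acceptable.

    Assumes backups run once per backup_interval, so the worst-case data loss
    is one full interval.
    """
    return req.max_data_loss >= backup_interval


# Sample entries; the names and tolerances are invented for illustration.
reporting = RecoveryRequirements("reporting", timedelta(hours=8), timedelta(hours=24))
orders = RecoveryRequirements("orders", timedelta(minutes=5), timedelta(0))
```

With these sample values, `nightly_backup_suffices(reporting)` is true, while `orders` falls into the category where the solution becomes a budgetary concern.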
Capacity can be a concern at the disaster recovery site. Customers should be asked about performance degradation and what is acceptable. This can be a tricky question to answer, and customers will usually need assistance to figure it out. If left to themselves, they will almost always answer that no degradation is acceptable.
Another question that should accompany performance degradation is finding out about the number of users that will be accessing the system during the disruption. These two answers will help to identify a more accurate capacity. What should be explained is that during the disruption, the entire corporate population may not need access to the enterprise application. Possibly only power users may need the system to run business critical functions for the enterprise.
One example is Human Resources applications. An HR application may be available to the corporate population during normal operations for viewing pay stubs, updating W-2s, etc. During a disruptive event, these rights could be suspended but power users could continue to run payrolls, enter benefits, hire and fire employees, etc. It is possible far less capacity is needed than originally thought necessary, which can mean more databases on the same servers, as long as the databases will not interfere with one another's processing. Virtual servers can be used as well, "... you would re-instantiate the virtual machines at a higher ratio (density) of virtual-to-physical. Consequently, organizations that can tolerate a slight drop in performance can build a much cheaper secondary data center to handle temporary disruptions." (Antonopoulos, 2006)
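A rough way to turn the two answers (acceptable degradation and number of users) into a number is sketched below in Python. The linear model is a deliberate simplification and an assumption of this example: it treats load as proportional to the number of users and lets an accepted slowdown reduce the required capacity proportionally. The function name, the slowdown factor, and the sample figures are hypothetical, and any real sizing should be checked against measured workloads.

```python
def dr_capacity_fraction(normal_users: int, disruption_users: int,
                         acceptable_slowdown: float = 1.0) -> float:
    """Estimate disaster recovery capacity as a fraction of production capacity.

    acceptable_slowdown is a hypothetical factor agreed with the customer:
    1.0 means no degradation, 2.0 means work may take twice as long.
    """
    if normal_users <= 0 or disruption_users < 0:
        raise ValueError("normal_users must be positive and disruption_users non-negative")
    if acceptable_slowdown < 1.0:
        raise ValueError("acceptable_slowdown must be at least 1.0")
    return (disruption_users / normal_users) / acceptable_slowdown


# Invented figures: 5,000 employees normally, 250 power users during a
# disruption, and the customer accepts responses that take twice as long.
fraction = dr_capacity_fraction(5000, 250, acceptable_slowdown=2.0)
print(f"Provision about {fraction:.1%} of production capacity")
```

Even a crude estimate like this gives customers something concrete to react to, which is usually easier than asking them how much degradation they can accept.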
Accessing the databases and applications is another important matter. If the primary place of employment is no longer habitable, employees will need a place to go for office space and workstations. Workstations will need to be equipped with necessary software for database connections. This important point must not be overlooked.
Testing is very important. Determine how often you will need to test your disaster recovery plans. Only through testing of the plan can issues and problems be discovered and corrected. Testing can also bring opportunities to make improvements to the disaster recovery plan. "Disaster recovery (DR) testing isn't about pass and fail. It's about exercising and rehearsing the DR plan to reveal shortcomings and weaknesses." (Gsoedl, 2006)
Nothing in business stays the same for long, and disaster recovery plans are no exception. To keep them relevant and up-to-date, testing must become a regular occurrence. Testing may occur yearly, twice per year, or quarterly. The more practical experience individuals can get with the disaster recovery plan and the disaster recovery site, the better off everyone will be during a crisis situation. Familiarity will build confidence in individuals and the equipment and systems they are working on.
Usually, disaster recovery setup is not an emergency. The emergency only comes during execution of the plan. Still, a timeline should be put in place when planning disaster recovery for databases. It is unfortunate that many times, other projects push disaster recovery to the back burner. Make disaster recovery part of all projects so that it can be completed in a timely manner.
Moving back to the primary site will be a joyful time. It can also be quite hectic. "... you should plan to get back into your own premises as fast as possible" (Dawson, 2007). No one wants to stay at the disaster recovery site any longer than they have to. Plan the return much as would be done with the go-live of a new application. Plan the downtime, migrations, testing, go/no-go decision, and fallback procedures. Everything should be scheduled and users made fully aware of the outages and changeover schedules.
Someone in the organization, or some group of people, will make the decision that a disaster has struck and that failover should now take place. Determine who that person is and how the information will be communicated. Ideally, the information will be distributed in multiple forms. Rarely in a disaster will all the normal lines of communication be available to the organization.
Key Roles
Obviously, database administrators are critical to the success of any disaster recovery scenario. There are many key roles that are critical to the success of the database administrator. A server administrator will have to install and set up the server. A system administrator will be needed to install and set up the operating system. A storage administrator will be necessary to duplicate the disks accordingly. Application developers will need to assist with troubleshooting errors detected by the user community. These are some of the people that a database administrator will rely on.
Many, if not all, of these steps can be accomplished prior to any disaster and tested. There can also be problems at the time of failover where some of these areas may need to be revisited. The database administrator may know who to call and work with during normal times, but what happens when a disaster strikes and some primary support personnel are not available? They could be taking care of injured family members or injured themselves. What if your database administrator is not available? Contingencies for these scenarios should be put in place.
It is imperative for employees to know who to call when they have an issue.
One of the best ways to avoid an availability problem is to cross-train employees. An employee who knows more than one job function can play a key role during a disruption.
Some people may not be able to make it to the recovery site, leaving some areas not covered (Maiwald & Sieglein, 2002, p. 193). The cross-training should not be a complete shift from their normal profession, unless requested by the employee. What is usually better is to have an employee learn a skill that is new, but in the same profession in which they are currently engaged.
For instance, Oracle database administrators can cross-train as SQL Server database administrators. They are already familiar with the concepts, SQL, structures, etc. of database administration. It should mostly be a matter of learning the different toolsets for the new database software. This can be a win-win for the employee and the organization.
The employee learns a valuable new skill that can enhance their career. The organization gains an employee that has multiple skill sets that can be called upon in times of normalcy and times of crisis.
Backups
Requirements for a database will drive the type of backups you make for it. If a database can have several hours of downtime and last night's backup will suffice, then a full backup will be fine. If little to no downtime and/or little to no data loss is acceptable, then full backups will not do the job.
Technologies such as remote mirroring will have to be investigated. In remote mirroring, all changes made to the production system are copied to the disaster recovery site. This is normally considered in an asynchronous context, since most disaster recovery sites are at some distance away from the primary site. "Asynchronous remote mirroring is most often utilized when the remote site is a long distance from the local site." (Staimer, 2005) When a fail over is called for, databases can be recovered with the mirrored data for business continuance.
Data replication is another technology that can keep disaster recovery databases updated. By default, the software replicates changes from production databases to databases at the disaster recovery site as they occur. This can be altered so that changes are applied on a schedule, e.g., every four hours. Delaying the changes in this way supports a data recovery scenario in which a user has made an error: because the erroneous change has not yet been applied, the database administrator can use the data from the disaster recovery database to correct the error in production.
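The window that a scheduled apply leaves open can be worked out with simple date arithmetic. The Python sketch below is an illustration only; the function name, the midnight starting point, and the sample times are assumptions, and it does not model any specific replication product.

```python
from datetime import datetime, timedelta


def next_apply_after(change_time: datetime, first_run: datetime,
                     interval: timedelta) -> datetime:
    """Return when a production change will reach the disaster recovery copy.

    Assumes apply runs happen at first_run, first_run + interval, and so on,
    and that a change made exactly at a run time waits for the following run.
    """
    runs_completed = (change_time - first_run) // interval  # whole intervals elapsed
    return first_run + (runs_completed + 1) * interval


# Invented example: changes are applied every four hours starting at midnight,
# and a user makes an erroneous update at 09:30.
error_time = datetime(2008, 8, 13, 9, 30)
deadline = next_apply_after(error_time, datetime(2008, 8, 13, 0, 0), timedelta(hours=4))
print(f"Untouched data is available at the disaster recovery site until {deadline:%H:%M}")
```

In this example the error would reach the disaster recovery copy at the 12:00 run, so the administrator has until then to retrieve the original data.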
Installations
Installation of database software should be a fairly routine task for a database administrator. It should also be the same across servers with the same database versions. Installation and setup should be well documented. There is always the possibility that a database administrator will not be available when a fail over is called for. Clear and concise, step by step directions will allow technical professionals from another area the ability to stand in for a missing database administrator and set up the database software.
This being said, each production server is different. Certain things may need to be done to prepare the database. Special scripts will sometimes need to run, or jobs to load or unload data. These steps for individual databases and the order in which they should execute also need to be well documented.
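One way to document both the steps and their order is to record, for each step, which other steps must finish first. The Python sketch below is a minimal illustration under that assumption; the step names are invented examples rather than a recommended procedure, and it relies on the standard library `graphlib` module, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps that must finish before it starts.
# The step names are invented examples, not a recommended procedure.
FAILOVER_STEPS = {
    "install database software": set(),
    "restore database from backup": {"install database software"},
    "run post-restore scripts": {"restore database from backup"},
    "start data load jobs": {"run post-restore scripts"},
}


def execution_order(steps: dict) -> list:
    """Return the steps in an order that respects every prerequisite."""
    return list(TopologicalSorter(steps).static_order())


# Print a numbered checklist that a stand-in administrator could follow.
for number, step in enumerate(execution_order(FAILOVER_STEPS), start=1):
    print(f"{number}. {step}")
```

Keeping the prerequisites next to the steps means a circular or missing dependency shows up when the checklist is generated, not in the middle of a failover.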
Good Use
The best way to set up disaster recovery is by having a dedicated site with servers available and application software running so that an immediate fail over can be done when called for. This approach is also very expensive and not always popular. There are ways to implement disaster recovery sites, save money and be practical, all at the same time.
An excellent approach for the dual use of just such a facility is testing of upgrades. All operating systems, applications, and databases require regular maintenance patches, fixes, and upgrades. With environments available as exact duplicates of production systems, these are prime locations to test the maintenance releases.
Patches and fixes can be applied to a disaster recovery system on a regular schedule. An approved test plan can be administered against the environment to check for issues with the maintenance release. If no issues are found, the patches can be left in place and migrated to the test environment on a regular schedule as well. If no problems are found, the patches can then be migrated into production on a regular schedule.
If any issues are found at the disaster recovery site or in the test system, then the patch can be rolled back or tickets can be opened with the vendors if problems are minor. This eliminates the need for a separate laboratory environment, which can also be very costly. No additional hardware, software, licenses, maintenance, administration, or space would be needed for a lab to test maintenance releases.
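The promotion path described above (disaster recovery site, then test, then production) can be sketched as a simple gate. This Python example is an illustration under stated assumptions: `test_plan_passes` is a hypothetical callback standing in for applying the patch and running the approved test plan in the named environment, and the environment names are labels chosen for the example.

```python
from typing import Callable, Optional, Tuple

# Environments that run the approved test plan before the patch moves on.
VERIFIED_ENVIRONMENTS = ("disaster recovery", "test")


def promote_patch(test_plan_passes: Callable[[str], bool]) -> Tuple[bool, Optional[str]]:
    """Walk a patch through the verified environments toward production.

    test_plan_passes is a hypothetical callback that applies the patch in the
    named environment, runs the approved test plan, and reports the result.
    Returns (True, None) when the patch may go to production, or
    (False, environment) naming where an issue was found.
    """
    for environment in VERIFIED_ENVIRONMENTS:
        if not test_plan_passes(environment):
            # Stop here: roll back or open a vendor ticket before going further.
            return False, environment
    return True, None


# Invented example: the patch is clean at the disaster recovery site but
# fails the test plan in the test environment.
ready, failed_in = promote_patch(lambda environment: environment != "test")
print(ready, failed_in)
```

The point of the gate is that production is never reached unless both earlier environments have passed the same test plan.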
If you do not currently have a lab for testing patches and fixes for software, then this can be of substantial benefit in three areas. First, the money has already been spent on the disaster recovery site, which was a necessity in itself. Second, a duplicate environment of your production systems now exists to test software patching, negating the need for a laboratory. Third, less administrative maintenance is spent on systems once they are patched. Keeping software patched and fixed to current levels reduces downtime and the amount of time administrators spend on system repairs.
This approach can be especially helpful for database administrators. Many times a server may be available for database installations, patching, and upgrades, but rarely are there complete environments for these tasks. Application developers and users are needed to test the application against the database after the patches have been installed. The database administrator can perform some limited testing, but the true tests come when users put the system through the motions.
Stocking the disaster recovery site with test servers is another great way to get the disaster recovery site up and running quickly and maximize the value of those servers. In most, if not all, cases these servers are purchased for every new project that will be migrated into production. Test servers should be purchased with the same specifications as production, or better. Most test servers will need higher capacity than the production hardware because more databases, application servers, web servers, etc. will be running on them. With test servers in the disaster recovery facility, much of the work of software installation is already done. Disaster recovery instances can be created on test servers and left idle. Application servers, web servers, and databases just wait for the day that a fail over is declared.
Using virtualized servers can help lower costs for a disaster recovery site. Server virtualization has become less expensive and, at the same time, less complex: "... the cost of these technologies continues to fall, allowing small firms to implement solutions once reserved for large companies." (McCarthy, 2007)
It is now much easier to implement virtual servers than it has been in the past. Today, many applications, operating systems, and databases support server virtualization software. This has changed because many of the virtualization vendors have tried to work closely and cooperate fully with the other software vendors.
Pressures from customers have also driven software companies to work with virtualization companies to certify and support their products. Through virtualization, a physical server can be imaged and reproduced in a virtual environment. A production system consisting of a web server, application server, and a database server can all be imaged and virtualized on a single physical server. This effectively consolidates three physical servers down to one without losing any functionality. Capacity may not be equal, but it may suffice perfectly in a disaster recovery scenario. This does not mean that all applications will work together on virtual servers. "For example, one would not configure a SQL Server, an Oracle server, and a Lotus server to fail over to a common target. As a basic rule of thumb, if the applications would not peacefully coexist on a production server, then they will not peacefully coexist on the target." (Buffington, 2005)
Mentoring
A step beyond cross training is mentoring. A mentoring program allows subject matter experts to work directly with management-identified employees who are interested in becoming experts in a different field than the one they are currently in. This can become a large financial gain for employers while increasing employee morale as well.
"On average, companies with mentoring programs have a 19 percent lower turnover rate than those without such a program. That retention boost can translate into a substantial cost benefit. A mentoring program could save a 1,000-person company nearly $9.5 million a year, based on a $50,000 average turnover cost, according to Interim's 1999 Emerging Workforce Study." (Southgate, 2002) Mentoring can also work well for employees who wish to cross train to qualify for positions on other technology teams that have unfilled vacancies.
By identifying and opening career opportunities across teams, individuals feel a sense of empowerment and are not stuck in their current roles. For instance, a database administrator position may be difficult to fill externally. A current developer with talent, ability, and desire to become a database administrator could miss an opportunity to make a lateral move due to lack of experience. Through mentoring, the developer could continue in her current role while cross training in a potentially new career path. In this way, mentoring programs can help manage expected retirements and workflow fluctuations while providing alternative career paths for qualified candidates.
When an employee and mentor begin the process, they should meet with a manager. During this initial interview, they will identify the goals and objectives of the process and develop work plans. The primary focus of the mentor and employee should be to capture institutional knowledge. The employee should document the mentor's position and job in the form of process diagrams and standardized procedures.
As part of the mentoring process, learning employees will identify, learn, and record undocumented processes and procedures. This assists in preventing the loss of institutional knowledge that occurs when a subject matter expert leaves a position that has not been well documented. It also ensures that the employee understands the mentor's job functions.
A review of the documentation by the mentor will give an excellent indication of the understanding and progress of the employee. This provides opportunities for standardization and improvements through process engineering. The employee and mentor should also look for training opportunities to supplement the learning process. Future mentoring times, communication, and work product delivery can be managed by the employee and mentor in alignment with approved work plans. The work plans can become subject to review in the annual review of the participants.
By establishing a mentoring program, senior technical staff are recognized for their accomplishments and junior staff are given the opportunity to learn from them and develop into the next generation of subject matter experts. Senior technical staff are the primary source of institutional knowledge. By spreading this knowledge within and across teams, the ability to provide support when subject matter experts are inaccessible or incapacitated is greatly improved.
This is a critical consideration with respect to disaster recovery. By documenting processes and procedures through a mentoring program, the ability to respond quickly to outages or disasters is dramatically enhanced.
About the Author
Kevin Medlin has been administering, supporting, and developing in a variety of industries including energy, retail, insurance and government since 1997. He is currently a DBA supporting Oracle and SQL Server, and is Oracle certified in versions 8 through 10g. He received his graduate certificate in Storage Area Networks from Regis University and he will be completing his MS in Technology Systems from East Carolina University in 2008. When he's not trying to make the world a better place through IT, he enjoys spending time with his family, traveling, hanging out by the pool, riding horses, hiking, and camping.