Streamlining SQL Statements for Improved Performance
Streamlining SQL statements is as much a part of application performance as database design and tuning. No matter how fine-tuned the database or how sound the database structure, you will not receive query results in a time that is acceptable to you or, even worse, to the customer if you don't follow some basic guidelines. Trust us, if the customer is not satisfied, then you can bet your boss won't be satisfied either.
By the end of today, you should:
• Understand the concept of streamlining your SQL code
• Understand the differences between batch loads and transactional processing and their effects on database performance
• Be able to manipulate the conditions in your query to expedite data retrieval
• Be familiar with some underlying elements that affect the tuning of the entire database
Here's an analogy to help you understand the phrase streamline an SQL statement: The objective of competitive swimmers is to complete an event in as little time as possible without being disqualified. The swimmers must have an acceptable technique, be able to torpedo themselves through the water, and use all their physical resources as effectively as possible. With each stroke and breath they take, competitive swimmers remain streamlined and move through the water with very little resistance.
Look at your SQL query the same way. You should always know exactly what you want to accomplish and then strive to follow the path of least resistance. The more time you spend planning, the less time you'll have to spend revising later. Your goal should always be to retrieve accurate data and to do so in as little time as possible. An end user waiting on a slow query is like a hungry diner impatiently awaiting a tardy meal. Although you can write most queries in several ways, the arrangement of the components within the query is the factor that makes the difference of seconds, minutes, and sometimes hours when you execute the query. Streamlining SQL is the process of finding the optimal arrangement of the elements within your query.
In addition to streamlining your SQL statement, you should also consider several other factors when trying to improve general database performance, for example, concurrent user transactions that occur within a database, indexing of tables, and deep-down database tuning.
Make Your SQL Statements Readable
Even though readability doesn't affect the actual performance of SQL statements, good programming practice calls for readable code. Readability is especially important if you have multiple conditions in the WHERE clause. Anyone reading the clause should be able to determine whether the tables are being joined properly and should be able to understand the order of the conditions.
The Full-Table Scan
A full-table scan occurs when the database server reads every record in a table in order to execute an SQL statement. Full-table scans are normally an issue when dealing with queries or the SELECT statement. However, a full-table scan can also come into play when dealing with updates and deletes. A full-table scan occurs when the columns in the WHERE clause do not have an index associated with them. A full-table scan is like reading a book from cover to cover, trying to find a keyword. Most often, you will opt to use the index.
You can avoid a full-table scan by creating an index on columns that are used as conditions in the WHERE clause of an SQL statement. Indexes provide a direct path to the data the same way an index in a book refers the reader to a page number. Adding an index speeds up data access.
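The effect of adding an index can be observed directly by asking the database for its access path. The following is a minimal sketch, assuming SQLite (through Python's standard sqlite3 module) as a stand-in for whatever database server you use; the orders table, its columns, and the index name are hypothetical, and the wording of the plan output differs between products and versions.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the readable plan steps SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description

conn = sqlite3.connect(":memory:")
# Hypothetical table used only for illustration.
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, i % 1000, i * 1.5) for i in range(10000)],
)

sql = "SELECT * FROM orders WHERE customer_id = ?"

# No index on customer_id yet: every row must be read.
plan_before = query_plan(conn, sql, (42,))

# Index the column used in the WHERE clause, then ask again.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))

print("without index:", plan_before)
print("with index:   ", plan_after)
```

In SQLite the first plan reports a scan of the table and the second reports a search using the new index. Other implementations expose the same information through their own tools.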
Although programmers usually frown upon full-table scans, such scans are sometimes appropriate, for example when:
• You are selecting most of the rows from a table.
• You are updating every row in a table.
• The tables are small.
In the first two cases an index would be inefficient because the database server would have to refer to the index, read the table, refer to the index again, read the table again, and so on. On the other hand, indexes are most efficient when the data you are accessing is a small percentage, usually no more than 10 to 15 percent, of the total data contained within the table.
In addition, indexes are best used on large tables. You should always consider table size when you are designing tables and indexes. Properly indexing tables involves familiarity with the data, knowing which columns will be referenced most, and may require experimentation to see which indexes work best.
Adding a New Index
You will often find that one SQL statement runs for an unreasonable amount of time even though the performance of other statements seems acceptable. This can happen, for example, when the conditions for data retrieval change or when table structures change.
We have also seen this type of slowdown when a new screen or window has been added to a front-end application. One of the first things to do when you begin to troubleshoot is to find out whether the target table has an index. In most of the cases we have seen, the target table has an index, but one of the new conditions in the WHERE clause may lack an index. Looking at the WHERE clause of the SQL statement, we have asked, "Should we add another index?" The answer may be yes if:
• The most restrictive condition(s) returns less than 10 percent of the rows in a table.
• The most restrictive condition(s) will be used often in an SQL statement.
• Condition(s) on columns with an index will return unique values.
• Columns are often referenced in the ORDER BY and GROUP BY clauses.
Composite indexes may also be used. A composite index is an index on two or more columns in a table. These indexes can be more efficient than single-column indexes if the indexed columns are often used together as conditions in the WHERE clause of an SQL statement. If the indexed columns are used separately as well as together, especially in other queries, single-column indexes may be more appropriate. Use your judgment and run tests on your data to see which type of index best suits your database.
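As a rough illustration of a composite index, the sketch below again assumes SQLite through Python's sqlite3 module. The employees table and the index name idx_emp_name are hypothetical. It compares the access path for a query that uses both indexed columns with one that uses only the second column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table used only for illustration.
conn.execute("CREATE TABLE employees (emp_id INTEGER, last_name TEXT, first_name TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(i, f"LAST{i % 100}", f"FIRST{i % 37}") for i in range(5000)],
)

# One index covering two columns that are often used together.
conn.execute("CREATE INDEX idx_emp_name ON employees (last_name, first_name)")

def plan(sql):
    """Join the plan steps SQLite reports into one readable string."""
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Both indexed columns appear as conditions.
both = plan(
    "SELECT * FROM employees "
    "WHERE last_name = 'LAST7' AND first_name = 'FIRST7'"
)

# Only the second column appears; the leading column of the index is missing,
# so a direct lookup through the composite index is typically not possible.
second_only = plan("SELECT * FROM employees WHERE first_name = 'FIRST7'")

print("both columns:", both)
print("second only: ", second_only)
```

If first_name is frequently queried on its own, this is the situation in which a separate single-column index may be worth testing.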
Arrangement of Elements in a Query
The best arrangement of elements within your query, particularly in the WHERE clause, really depends on the order of the processing steps in a specific implementation. The arrangement of conditions depends on the columns that are indexed, as well as on which condition will retrieve the fewest records.
You do not have to use a column that is indexed in the WHERE clause, but it is obviously more beneficial to do so. Try to narrow down the results of the SQL statement by using an index that returns the fewest number of rows. The condition that returns the fewest records in a table is said to be the most restrictive condition. As a general statement, you should place the most restrictive conditions last in the WHERE clause. (Oracle's query optimizer reads a WHERE clause from the bottom up, so in a sense, you would be placing the most restrictive condition first.)
When the optimizer reads the most restrictive condition first, it is able to narrow down the first set of results before proceeding to the next condition. The next condition, instead of looking at the whole table, should look at the subset that was selected by the most selective condition. Ultimately, data is retrieved faster. The most selective condition may be unclear in complex queries with multiple conditions, subqueries, calculations, and several combinations of the AND, OR, and LIKE operators.
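When it is unclear which condition is the most restrictive, you can simply count the rows each one returns. The following is a minimal sketch, assuming SQLite through Python's sqlite3 module; the employees table, its data, and the two conditions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: half the rows are in state 'IN', one in ten is in 'SALES'.
conn.execute("CREATE TABLE employees (emp_id INTEGER, state TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (i, "IN" if i % 2 == 0 else "OH", "SALES" if i % 10 == 0 else "OTHER")
        for i in range(1000)
    ],
)

conditions = ["state = 'IN'", "dept = 'SALES'"]
total = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]

# Count how many rows each candidate condition returns on its own.
counts = {}
for cond in conditions:
    counts[cond] = conn.execute(
        f"SELECT COUNT(*) FROM employees WHERE {cond}"
    ).fetchone()[0]

# The condition returning the fewest rows is the most restrictive one.
most_restrictive = min(counts, key=counts.get)

for cond, n in counts.items():
    print(f"{cond}: {n} of {total} rows ({100 * n / total:.0f}%)")
print("most restrictive:", most_restrictive)
```

Where that condition should be placed in the WHERE clause still depends on how your implementation processes it, as described above.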
For queries that are executed on a regular basis, try to use procedures. A procedure is a potentially large group of SQL statements. Procedures are compiled by the database engine and then executed. Unlike an individual SQL statement, a procedure need not be optimized by the database engine before each execution. Procedures, as opposed to numerous individual queries, may be easier for the user to maintain and more efficient for the database.
Avoid using the logical operator OR in a query if possible. In our experience, OR slows down nearly any query against a table of substantial size. We find that IN is generally much quicker than OR. This advice certainly doesn't agree with documentation stating that optimizers convert IN arguments to OR conditions, so test both forms on your own implementation.
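Rewriting a chain of OR conditions as an IN list does not change the result, which makes the two forms easy to compare on your own data. The following is a minimal sketch, assuming SQLite through Python's sqlite3 module, with a hypothetical customers table. It checks only that the two forms are equivalent and makes no claim about which is faster on a given server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table used only for illustration.
conn.execute("CREATE TABLE customers (cust_id INTEGER, state TEXT)")
states = ["IN", "OH", "KY", "IL", "MI"]
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, states[i % 5]) for i in range(100)],
)

# The same selection written with OR ...
with_or = conn.execute(
    "SELECT cust_id FROM customers "
    "WHERE state = 'IN' OR state = 'OH' OR state = 'KY' "
    "ORDER BY cust_id"
).fetchall()

# ... and with IN.
with_in = conn.execute(
    "SELECT cust_id FROM customers "
    "WHERE state IN ('IN', 'OH', 'KY') "
    "ORDER BY cust_id"
).fetchall()

print(len(with_or), len(with_in), with_or == with_in)
```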
OLAP versus OLTP
When tuning a database, you must first determine what the database is being used for. An online analytical processing (OLAP) database is a system whose function is to provide query capabilities to the end user for statistical and general informational purposes. The data retrieved in this type of environment is often used for statistical reports that aid in the corporate decision-making process. These types of systems are also referred to as decision support systems (DSS). An online transactional processing (OLTP) database is a system whose main function is to provide an environment for end-user input and may also involve queries against day-to-day information. OLTP systems are used to manipulate information within the database on a daily basis. Data warehouses and DSSs get their data from online transactional databases and sometimes from other OLAP systems.
A transactional database is a delicate system that is heavily accessed in the form of transactions and queries against day-to-day information. An OLTP system, however, does not usually require a vast sort area, at least not to the extent to which it is required in an OLAP environment. Most OLTP transactions are quick and do not involve much sorting.
One of the biggest issues in a transactional database is rollback segments. The number and size of rollback segments depend heavily on how many users are concurrently accessing the database, as well as on the amount of work in each transaction. The best approach is to have several rollback segments in a transactional environment.
Another concern in a transactional environment is the integrity of the transaction logs, which are written to after each transaction. These logs exist for the sole purpose of recovery. Therefore, each SQL implementation needs a way to back up the logs for use in a "point in time recovery." SQL Server uses dump devices; Oracle uses a database mode known as ARCHIVELOG mode. Transaction logs also involve a performance consideration because backing up logs requires additional overhead.
Tuning OLAP systems, such as a data warehouse or decision support system, is much different from tuning a transaction database. Normally, more space is needed for sorting.
Because the purpose of this type of system is to retrieve useful decision-making data, you can expect many complex queries, which normally involve grouping and sorting of data. Compared to a transactional database, OLAP systems typically take more space for the sort area but less space for the rollback area.
Most transactions in an OLAP system take place as part of a batch process. Instead of having several rollback areas for user input, you may resort to one large rollback area for the loads, which can be taken offline during daily activity to reduce overhead.
Batch Loads Versus Transactional Processing
A major factor in the performance of a database and SQL statements is the type of processing that takes place within a database. One type of processing is OLTP, discussed earlier today. When we talk about transactional processing, we are going to refer to two types: user input and batch loads.
Regular user input usually consists of SQL statements such as INSERT, UPDATE, and DELETE. These types of transactions are often performed by the end user, or the customer. End users are normally using a front-end application such as PowerBuilder to interface with the database, and therefore they seldom issue visible SQL statements. Nevertheless, the SQL code has already been generated for the user by the front-end application.
Your main focus when optimizing the performance of a database should be the end-user transactions. After all, "no customer" equates to "no database," which in turn means that you are out of a job. Always try to keep your customers happy, even though their expectations of system/database performance may sometimes be unreasonable. One consideration with end-user input is the number of concurrent users. The more concurrent database users you have, the greater the possibilities of performance degradation.
What is a batch load? A batch load performs heaps of transactions against the database at once. For example, suppose you are archiving last year's data into a massive history table. You may need to insert thousands, or even millions, of rows of data into your history table. You probably wouldn't want to do this task manually, so you are likely to create a batch job or script to automate the process. (Numerous techniques are available for loading data in a batch.) Batch loads are notorious for taxing system and database resources. These database resources may include table access, system catalog access, the database rollback segment, and sort area space; system resources may include available CPU and shared memory. Many other factors are involved, depending on your operating system and database server.
Both end-user transactions and batch loads are necessary for most databases to be successful, but your system could experience serious performance problems if these two types of processing lock horns. Therefore, you should know the difference between them and keep them segregated as much as possible. For example, you would not want to load massive amounts of data into the database when user activity is high. The database response may already be slow because of the number of concurrent users. Always try to run batch loads when user activity is at a minimum. Many shops reserve times in the evenings or early morning to load data in batch to avoid interfering with daily processing.
You should always plan the timing for massive batch loads, being careful to avoid scheduling them when the database is expected to be available for normal use. Another problem with batch processes is that the process may hold locks on a table that a user is trying to access. If there is a lock on a table, the user will be refused access until the lock is freed by the batch process, which could be hours. Batch processes should take place when system resources are at their best if possible. Don't make the users' transactions compete with batch. Nobody wins that game.
Optimizing Data Loads by Dropping Indexes
One way to expedite batch updates is by dropping indexes. Imagine the history table with many thousands of rows. That history table is also likely to have one or more indexes. When you think of an index, you normally think of faster table access, but in the case of batch loads, you can benefit by dropping the index(es).
When you load data into a table with an index, you can usually expect a great deal of index use, especially if you are updating a high percentage of rows in the table. Look at it this way. If you are studying a book and highlighting key points for future reference, you may find it quicker to browse through the book from beginning to end rather than using the index to locate your key points. (Using the index would be efficient if you were highlighting only a small portion of the book.)
To maximize the efficiency of batch loads/updates that affect a high percentage of rows in a table, you can take these three basic steps to disable an index:
1. Drop the appropriate index(es).
2. Load/update the table's data.
3. Rebuild the table's index.
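The three steps above can be sketched as follows. This is an illustration only, assuming SQLite through Python's sqlite3 module; the history table and the index name idx_history_yr are hypothetical, and a production database would normally use its own bulk-load utility for step 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical history table with one index.
conn.execute("CREATE TABLE history (id INTEGER, yr INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_history_yr ON history (yr)")

# Rows to be loaded in one batch.
new_rows = [(i, 1990 + i % 5, i * 0.5) for i in range(50000)]

# 1. Drop the index so it does not have to be maintained row by row.
conn.execute("DROP INDEX idx_history_yr")

# 2. Load the table's data.
conn.executemany("INSERT INTO history VALUES (?, ?, ?)", new_rows)

# 3. Rebuild the index once, after all rows are in place.
conn.execute("CREATE INDEX idx_history_yr ON history (yr)")
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM history").fetchone()[0]
index_names = [
    r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
print(row_count, index_names)
```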
A Frequent COMMIT Keeps the DBA Away
When performing batch transactions, you must know how often to perform a COMMIT. A COMMIT saves a transaction, making any changes to the applicable table(s) permanent. Behind the scenes, however, much more is going on. Some areas in the database are reserved to hold the information needed to undo a transaction's changes until the transaction is finished. Oracle calls these areas rollback segments. When you issue a COMMIT statement, the changes made by your SQL session become permanent, and the rollback segment space held by the transaction is released for reuse. A ROLLBACK command, on the other hand, uses the contents of the rollback segment to restore the target table to its state before the transaction began.
As you can guess, if you never issue a COMMIT or ROLLBACK command, transactions keep building within the rollback segments. Subsequently, if the data you are loading is greater in size than the available space in the rollback segments, the database will essentially come to a halt and ban further transactional activity. Not issuing COMMIT commands is a common programming pitfall; regular COMMITs help to ensure stable performance of the entire database system.
The management of rollback segments is a complex and vital database administrator (DBA) responsibility because transactions dynamically affect the rollback segments, and in turn, affect the overall performance of the database as well as individual SQL statements. So when you are loading large amounts of data, be sure to issue the COMMIT command on a regular basis. Check with your DBA for advice on how often to commit during batch transactions.
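One simple way to commit on a regular basis is to count rows as they are loaded and commit each time a threshold is reached. The following is a minimal sketch, assuming SQLite through Python's sqlite3 module. The batch_load function, the history table, and the commit_every value of 1000 are all hypothetical; the right interval for your system is something to work out with your DBA.

```python
import sqlite3

def batch_load(conn, rows, commit_every=1000):
    """Insert rows, committing after every `commit_every` rows.

    Returns the number of COMMITs issued.
    """
    commits = 0
    pending = 0
    for row in rows:
        conn.execute("INSERT INTO history VALUES (?, ?)", row)
        pending += 1
        if pending == commit_every:
            conn.commit()  # make this chunk permanent and free undo space
            commits += 1
            pending = 0
    if pending:
        conn.commit()  # commit the final, partial chunk
        commits += 1
    return commits

conn = sqlite3.connect(":memory:")
# Hypothetical target table for the load.
conn.execute("CREATE TABLE history (id INTEGER, amount REAL)")

commits = batch_load(conn, ((i, i * 0.5) for i in range(2500)), commit_every=1000)
loaded = conn.execute("SELECT COUNT(*) FROM history").fetchone()[0]
print(commits, loaded)
```

With 2,500 rows and a threshold of 1,000, the load issues three COMMITs: two for the full chunks and one for the remaining 500 rows.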
Rebuilding Tables and Indexes in a Dynamic Environment
The term dynamic database environment refers to a large database that is in a constant state of change. The changes that we are referring to are frequent batch updates and continual daily transactional processing. Dynamic databases usually entail heavy OLTP systems, but can also refer to DSSs or data warehouses, depending upon the volume and frequency of data loads.
The result of constant high-volume changes to a database is growth, which in turn yields fragmentation. Fragmentation can easily get out of hand if growth is not managed properly. Oracle allocates an initial extent to tables when they are created. When data is loaded and fills the table's initial extent, a next extent, whose size is also specified when the table is created, is allocated.
Sizing tables and indexes is essentially a DBA function and can drastically affect SQL statement performance. The first step in growth management is to be proactive. Allow room for tables to grow from day one, within reason. Also plan to defragment the database on a regular basis, even if doing so means developing a weekly routine. Here are the basic conceptual steps involved in defragmenting tables and indexes in a relational database management system:
1. Get a good backup of the table(s) and/or index(es).
2. Drop the table(s) and/or index(es).
3. Rebuild the table(s) and/or index(es) with new space allocation.
4. Restore the data into the newly built table(s).
5. Re-create the index(es) if necessary.
6. Reestablish user/role permissions on the table if necessary.
7. Save the backup of your table until you are absolutely sure that the new table was built successfully. If you choose to discard the backup of the original table, you should first make a backup of the new table after the data has been fully restored.
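The conceptual steps above can be sketched in miniature as follows, assuming SQLite through Python's sqlite3 module with a hypothetical orders table. SQLite has no storage clauses or user permissions, so the new space allocation in step 3 and the permissions in step 6 appear only as comments. A real rebuild would use your implementation's own export, storage, and GRANT facilities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and index that have become fragmented.
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(i, i % 50) for i in range(1000)])
conn.commit()

# 1. Get a backup of the table (here, a copy inside the same database).
conn.execute("CREATE TABLE orders_backup AS SELECT * FROM orders")

# 2. Drop the table; in SQLite its indexes are dropped along with it.
conn.execute("DROP TABLE orders")

# 3. Rebuild the table. A server such as Oracle would take new storage
#    parameters here; SQLite has none.
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")

# 4. Restore the data into the newly built table.
conn.execute("INSERT INTO orders SELECT * FROM orders_backup")

# 5. Re-create the index.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
conn.commit()

# 6. Reestablish permissions: not applicable in SQLite.
# 7. Keep orders_backup until the new table has been verified.
restored = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
backed_up = conn.execute("SELECT COUNT(*) FROM orders_backup").fetchone()[0]
print(restored, backed_up)
```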
Tuning the Database
Tuning a database is the process of fine-tuning the database server's performance. As a newcomer to SQL, you probably will not be exposed to database tuning unless you are a new DBA or a DBA moving into a relational database environment. Whether you will be managing a database or using SQL in applications or programming, you will benefit by knowing something about the database-tuning process. The key to the success of any database is for all parties to work together. Some general tips for tuning a database follow.
• Minimize the overall size required for the database.
It's good to allow room for growth when designing a database, but don't go overboard. Don't tie up resources that you may need to accommodate database growth.
• Experiment with the user process's time-slice variable.
This variable controls the amount of time the database server's scheduler allocates to each user's process.
• Optimize the network packet size used by applications.
The larger the amount of data sent over the network, the larger the network packet size should be. Consult your database and network documentation for more details.
• Store transaction logs on separate hard disks.
For each transaction that takes place, the server must write the changes to the transaction logs. If you store these log files on the same disk as you store data, you could create a performance bottleneck.
• Stripe extremely large tables across multiple disks.
If concurrent users are accessing a large table that is spread over multiple disks, there is much less chance of having to wait for system resources.
• Store database sort area, system catalog area, and rollback areas on separate hard disks.
These are all areas in the database that most users access frequently. By spreading these areas over multiple disk drives, you are maximizing the use of system resources.
• Add CPUs.
This system administrator function can drastically improve database performance. Adding CPUs can speed up data processing for obvious reasons. If you have multiple CPUs on a machine, then you may be able to implement parallel processing strategies. See your database documentation for more information on parallel processing, if it is available with your implementation.
• Add memory.
Generally, the more the better.
• Store tables and indexes on separate hard disks.
You should store indexes and their related tables on separate disk drives whenever possible. This arrangement enables the table to be read at the same time the index is being referenced on another disk. The capability to store objects on multiple disks may depend on how many disks are connected to a controller.
The objective when spreading your heavy database areas and objects is to keep areas of high use away from each other. For example:
• Disk01--The system catalog stores information about tables, indexes, users, statistics, database files, sizing, growth information, and other pertinent data that is often accessed by a high percentage of transactions.
• Disk02--Transaction logs are updated every time a change is made to a table (insert, update, or delete). Transaction logs are a major factor in an online transactional database. They are not of great concern in a read-only environment, such as a data warehouse or DSS.
• Disk03--Rollback segments are also significant in a transactional environment. However, if there is little transactional activity (insert, update, delete), rollback segments will not be heavily used.
• Disk04--The database's sort area, on the other hand, is used as a temporary area for SQL statement processing when sorting data, as in a GROUP BY or ORDER BY clause. Sort areas are typically an issue in a data warehouse or DSS. However, the use of sort areas should also be considered in a transactional environment.
Tuning a database very much depends on the specific database system you are using. Obviously, tuning a database entails much more than just preparing queries and letting them fly. On the other hand, you won't get much reward for tuning a database when the application SQL is not fine-tuned itself. Professionals who tune databases for a living often specialize in one database product and learn as much as they possibly can about its features and idiosyncrasies. Although database tuning is often looked upon as a painful task, it can provide very lucrative employment for the people who truly understand it.
Performance Obstacles
We have already mentioned some of the countless possible pitfalls that can hinder the general performance of a database. These are typically general bottlenecks that involve system-level maintenance, database maintenance, and management of SQL statement processing.
This section summarizes the most common obstacles in system performance and database response time.
• Not making use of available devices on the server--A company purchases multiple disk drives for a reason. If you do not use them accordingly by spreading apart the vital database components, you are limiting the performance capabilities. Maximizing the use of system resources is just as important as maximizing the use of the database server capabilities.
• Not performing frequent COMMITs--Failing to use periodic COMMITs or ROLLBACKs during heavy batch loads will ultimately result in database bottlenecks.
• Allowing batch loads to interfere with daily processing--Running batch loads during times when the database is expected to be available will cause problems for everybody. The batch process will be in a perpetual battle with end users for system resources.
• Being careless when creating SQL statements--Carelessly creating complex SQL statements will more than likely contribute to substandard response time.
• Running batch loads with table indexes--You could end up with a batch load that runs all day and all night, as opposed to a batch load that finishes within a few hours. Indexes slow down batch loads that are accessing a high percentage of the rows in a table.
• Having too many concurrent users for allocated memory--As the number of concurrent database and system users grows, you may need to allocate more memory for the shared process. See your system administrator.
• Creating indexes on columns with few unique values--Indexing on a column such as GENDER, which has only two unique values, is not very efficient. Instead, try to index columns that will return a low percentage of rows in a query.
• Creating indexes on small tables--By the time the index is referenced and the data read, a full-table scan could have been accomplished.
• Not managing system resources efficiently--Poor management of system resources can result in wasted space during database initialization and table creation, uncontrolled fragmentation, and irregular system/database maintenance.
• Not sizing tables and indexes properly--Poor estimates for tables and indexes that grow tremendously in a large database environment can lead to serious fragmentation problems, which, if not tended to, will snowball into more serious problems.
Built-In Tuning Tools
Check with your DBA or database vendor to determine what tools are available to you for performance measuring and tuning. You can use performance-tuning tools to identify deficiencies in the data access path; in addition, these tools can sometimes suggest changes to improve the performance of a particular SQL statement.
Oracle has two popular tools for managing SQL statement performance. These tools are explain plan and tkprof. The explain plan tool identifies the access path that will be taken when the SQL statement is executed. tkprof measures the performance by time elapsed during each phase of SQL statement processing. Oracle Corporation also provides other tools that help with SQL statement and database analysis, but the two mentioned here are the most popular. If you want to simply measure the elapsed time of a query in Oracle, you can use the SQL*Plus command SET TIMING ON.
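If your environment offers no timing command, elapsed time can also be measured from the client side. The following is a rough sketch in the spirit of SET TIMING ON, assuming SQLite through Python's sqlite3 module. The timed_query helper and the orders table are hypothetical, and the figure it reports includes the time spent fetching the rows into the client.

```python
import sqlite3
import time

def timed_query(conn, sql, params=()):
    """Run a query and return (rows, elapsed seconds) measured on the client."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    return rows, elapsed

conn = sqlite3.connect(":memory:")
# Hypothetical table used only for illustration.
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, i % 100) for i in range(20000)],
)

rows, elapsed = timed_query(
    conn, "SELECT * FROM orders WHERE customer_id = ?", (7,)
)
print(f"{len(rows)} rows in {elapsed:.6f} seconds")
```

Run the same statement several times and before and after a change, such as adding an index, because a single measurement can be distorted by caching and other activity on the machine.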
Sybase's SQL Server has diagnostic tools for SQL statements. These options are in the form of SET commands that you can add to your SQL statements. (These commands are similar to Oracle's SET commands.) Some common commands are SET SHOWPLAN ON, SET STATISTICS IO ON, and SET STATISTICS TIME ON. These SET commands display output concerning the steps performed in a query, the number of reads and writes required to perform the query, and general statement-parsing information.
Summary
Two major elements of streamlining, or tuning, directly affect the performance of SQL statements: application tuning and database tuning. Each has its own role, but one cannot be optimally tuned without the other. The first step toward success is for the technical team and system engineers to work together to balance resources and take full advantage of the database features that aid in improving performance. Many of these features are built into the database software provided by the vendor.
Application developers must know the data. The key to an optimal database design is thorough knowledge of the application's data. Developers and production programmers must know when to use indexes, when to add another index, and when to allow batch jobs to run. Always plan batch loads and keep batch processing separate from daily transactional processing.
Databases can be tuned to improve the performance of individual applications that access them. Database administrators must be concerned with the daily operation and performance of the database. In addition to the meticulous tuning that occurs behind the scenes, the DBA can usually offer creative suggestions for accessing data more efficiently, such as manipulating indexes or reconstructing an SQL statement. The DBA should also be familiar with the tools that are readily available with the database software to measure performance and provide suggestions for statement tweaking.