Duties of the DBA This Chapter looks at aspects of performance tuning
Chapter 9. Query Optimization
We have means of storing data on the disk so that it can be accessed efficiently. A DBMS exists to maintain data so that it can be queried. A significant portion of a DBMS is devoted to performing queries efficiently.
Query optimizer creates a procedural access plan for a query. After submit a Select statement, three steps.
select * from customers where city = 'Boston'
and discnt between 10 and 12;
(1) Parse (as in BNF), look up tables, columns, views (query modification), check privileges, verify integrity constraints.
(2) Query Optimization. Look up statistics in System tables. Figure out what to do. Basically consider competing plans, choose the cheapest.
(What do we mean by cheapest? What are we minimizing? See shortly.)
(3) Now Create the "program" (code generation).
Then Query Execution.
We will be talking about WHY a particular plan is chosen, not HOW all the plan search space is considered. Question of how compiler chooses best plan is beyond scope of course. Assume QOPT follows human reasoning.
For now, considering only queries, not updates.
Conceptually there might be different ways to access the data specified in a query. The Query Optimizer has to generate those different approaches and evaluate each .
Section 9.1. Now what do we mean by cheapest alternative plan? What resources are we trying to minimize? CPU and I/O. Given a PLAN, talk about COSTCPU(PLAN) and COSTI/O(PLAN).
But two plans PLAN1 and PLAN2 can have incomparable resource use with this pair of measures. See Figure 9.1. ( pg. 535)
|
|
COSTCPU(PLAN) |
COSTI/O(PLAN) |
|
PLAN1 |
9.2 CPU seconds |
103 reads |
|
PLAN2 |
1.7 CPU seconds |
890 reads |
*** Which is cheaper? Depends on which resources have a bottleneck.
DB2 takes a weighted sum of these for the total cost, COST(PLAN):
COST(PLAN) = W1. COSTCPU(PLAN) + W2. COSTI/O(PLAN)
A DBA should try to avoid bottlenecks by understanding the WORKLOAD for a system. Estimated total CPU and I/O for all users at peak work time. Figure out various queries and rate of submission per second. Qk, RATE(Qk) as in Figure 9.2. (pg. 537.)
|
Query Type |
RATE(Query) in Submissions/second |
|
Q1 |
40.0 |
|
Q2 |
20.0 |
Figure 9.2 Simple Workload with two Queries
Then work out COSTCPU(Qk) and COSTCPU(Qk) for plan that database system will choose. Now can figure out total average CPU or I/O needs per second:
RATE(CPU) =
RATE(Qk).COSTCPU(Qk)
Similarly for I/O. Tells us how many disks to buy, how powerful a CPU.
(Costs go up approximately linearly with CPU power within small ranges).
Ideally, the weights W1 and W2 used to calculate COST(PLAN) from I/O and CPU costs will reflect actual costs of equipment
Good DBA tries to estimate workload in advance to make purchases, have equipment ready for a new application.
*** Note that one other major expense is response time. Could have poor response time (1-2 minutes) even on lightly loaded inexpensive system.
This is because many queries need to perform a LOT of I/O, and some commercial database systems have no parallelism: all I/Os in sequence.
But long response time also has a cost, in that it wastes user time. Pay workers for wasted time. Employees quit from frustration and others must be trained. Vendors are trying hard to reduce response times.
Usually there exist parallel versions of these systems as well; worth it if extremely heavy I/O and response time is a problem, while number of queries running at once is small compared to number of disks.
Note that in plans that follow, we will not try to estimate CPU. Too hard. Assume choose plan with best I/O and try to estimate that. Often total cost is proportional to I/O since each I/O entails extra CPU cost as well.
Statistics. Need to gather, put in System tables. System does not automatically gather them with table load, index create, updates to table.
ORACLE uses analyze command. Fig. 9.4, pg. 538.
ANALYZE {INDEX | TABLE | CLUSTER}
[schema.] {indexname | tablename | clustername}
{COMPUTE STATISTICS | other alternatives not covered}
{FOR TABLE | FOR ALL [INDEXED] COLUMNS [SIZE n]
| other alternatives not covered}
System learns how many rows, how many data pages, stuff about indexes, etc., placed in catalog tables.
Retrieving the Query Plans. In DB2 and ORACLE, perform SQL statement.
Here is the ORACLE Explain Plan syntax.
EXPLAIN PLAN [SET STATEMENT_ID = 'text-identifier'] [INTO
[schema.]tablename]
FOR explainable-sql-statement;
This inserts a sequence of statements into a user created DB2/ORACLE table known as PLAN_TABLE one row for each access step. To learn more about this, see ORACLE8 documentation named in text.
In DB2:
EXPLAIN PLAN [SET QUERYNO = n] [SET QUERYTAG = 'string'] FOR
explainable-sql-statement;
For example:
explain plan set queryno = 1000 for
select * from customers
where city = 'Boston' and discnt between 12 and 14;
The Explain Plan statement puts rows in a "plan_table" to represent individual procedural steps in a query plan. Can get rows back by:
select * from plan_table where queryno = 1000;
Recall that a Query Plan is a sequence of procedural access steps that carry out a program to answer the query. Steps are peculiar to the DBMS. From one DBMS to another, difference in steps used is like difference between programming languages. Can't learn two languages at once.
We will stick to a specific DBMS in what follows, MVS DB2, so we can end up with an informative benchmark we had in the first edition.
Need to understand what basic procedural access steps ARE in the particular product you're working with.
The set of steps allowed is the "bag of tricks" the query optimizer can use. Think of these procedural steps as the "instructions" a compiler can use to create "object code" in compiling a higher-level request. A system that has a smaller bag of tricks is likely to have less efficient access plans for some queries. Process to optimize query
MVS DB2 (and the architecturally allied DB2 UDB) have a wide range of tricks, but not bitmap indexing or hashing capability. Still, very nice capabilities for range search queries, and probably the most sophisticated query optimizer.
Basic procedrual steps covered in the next few Sections (thumbnail):
Table Scan Look through all rows of table
Unique Index Scan Retrieve row through unique index
Unclustered Matching Index Scan Retrieve multiple rows through a
non-unique index, rows not same order
Clustered Matching Index Scan Retrieve multiple rows through a
non-unique clustered index
Index-Only Scan Query answered in index, not rows
Note that the steps we have listed access all the rows restricted by the WHERE clause in some single table query.
Need two tables in the FROM clause to require two steps of this kind and Joins come later. A multi-step plan for a single table query will also be covered later. Such a multi-step plan on a single table is one that combines multiple indexes to retrieve data. Up to then, only one index per table can be used.
9.2 Table Space Scans and I/O
Single step. The plan table (plan_table) will have a column ACCESSTYPE with value R (ACCESSTYPE = R for short).
Example 9.2.1. Table Space Scan Step. Look through all rows in table to answer query, maybe because there is no index that will help.
Assume in DB2 an employees table with 200,000 rows, each row of 200 bytes, each 4 KByte page just 70% full. Thus 2800 usable pages, 14 rows/pg. Need CEIL(200,000/14) = 14,286 pages.
Consider the query:
select eid, ename from employees where socsecno = 113353179;
If there is no index on socsecno, only way to answer query is by reading in all rows of table. (Stupid not to have an index if this query occurs with any frequency at all!)
In Table Space Scan, might not stop when find proper row, since statistics for a non-indexed column might not know socsecno is unique.
Therefore have to read all pages in table in from disk, and COSTI/O(PLAN)
= 14286 I/Os. Does this mean 14286 random I/Os? Maybe not.
But if we assume random I/O, then at 80 I/Os per second, need about 14286/80 = 178.6 seconds, a bit under three minutes. This dominates CPU by a large factor and would predict elapsed time quite well.
Homework 4, non-dotted Exercises through Exercise 9.9. This is due when we finish Section 9.6..
Assumptions about I/O
We are now going to talk about I/O assumptions a bit. First, there might be PARALLELISM in performing random I/Os from disk. The pages of a table might be striped across several different disks, with the database system making requests in parallel for a single query to keep all the disk arms busy. See Figure 9.5, pg. 542
DB2 has a special form of sequential prefetch now where it stripes 32 pages at a time on multiple disks, requests them all at once. While parallelism speeds up TOTAL I/O per second (expecially if there's ony one user process running), it doesn't really save any RESOURCE COST.
v If it takes 12.5 ms (0.0125 seconds) to do a random I/O, doesn't save resources to do 10 random I/Os at once on 10 different disks. Still have to make all the same disk arm movements, cost to rent the disk arms is the same if there is parallelism: just spend more per second.
v Will speed things up if there are few queries running, fewer than the number of disks, and there is extra CPU not utilized. Then can use more of the disk arms and CPUs with this sort of parallelism.
v Parallelism shows up best when there is only one query running! But if there are lots of queries compared to number of disks and accessed pages are randomly placed on disks, probably keep all disk arms busy already.
But there's another factor operating. Two disk pages that are close to each other on one disk can be read faster because there's a shorter seek time. Recall that the system tries to make extents contiguous on disk, so I/Os in sequence are faster. Thus, a table that is made up of a sequence of (mainly) contiguous pages, one after another within a track, will take much less time to read in. In fact it seems we should be able to read in successive pages at full transfer speed would take about .00125 secs per page.
v Used to be that by the time the disk controller has read in the page to a memory buffer and looked to see what the next page request is, the page immediately following has already passed by under the head. But now with multiple requests to the disk outstanding, we really COULD get the disk arm to read in the next disk page in sequence without a miss.
v Another factor supports this speedup: the typical disk controller buffers an entire track in it's memory whenever a disk page is requested. Reads in whole track containing the disk page, returns the page requested, then if later request is for page in track doesn't have to access disk again. So when we're reading in pages one after another on disk, it's like we're reading from the disk an entire track at a time.
v I/O is about TEN TIMES faster for disk pages in sequence compared to randomly place I/O. (Accurate enough for rule of thumb: We can do 800 I/Os per second when pages in sequence (S) instead of 80 for randomly placed pages (R). Sequential I/O takes 0.00125 secs instead of 0.0125 secs for random I/O.)
DB2 Sequential Prefetch makes this possible even if turn off buffering on disk (which actually hurts performance of random I/O, since reads whole track it doesn't need: adds 0.008 sec to random I/O of 0.0125 sec) IBM puts a lot of effort into making I/O requests sequentially in a query plan to gain this I/O advantage!
Example 9.2.2. Table Space Scan with Sequential Advantage. The 14286R of Example 9.2.1 becomes 14286S (S for Sequential Prefetch I/O instead of Random I/O). And 14286S requires 14286/800 = 17.86 seconds instead of the 178.6 seconds of 142286R. Note that this is a REAL COST SAVINGS, that we are actually using the disk arm for a smaller period. Striping reduces elapsed time but not COST.
We use the rule of thumb that List Prefetch reads in 200 pages per second.
See Figure 9.10, page 546, for table.
Plan table row for an access step will have PREFETCH = S for sequential prefetch, PREFETCH = L for list prefetch, PREFETCH = blank if random I/O. See Figure 9.10. And of course ACCESSTYPE = R when really Random I/O.
Back to Oracle and illustrating reading plans
------------------------------------------------------------March 17-----------------------------------------------------------------
This section focuses on the way the Query Optimizer can build a plan when indexes are available. The I/O cost of the the plan will be the cost of accessing the index PLUS the cost of accessing the data. Either access may be TYPE=R (random, 80 records/sec), TYPE=S (sequential, 800 records/sec) or TYPE=L (list prefetch, 200 records/sec). We will be concerned with how to do index matching, range queries, concatenated indexes, and multiple indexes. These for sections contain 17 worked examples.
9.3 Simple Indexed Access in DB2.
Index helps efficiency of query plan. There is a gread deal of complexity here. Remember, we are not yet covering queries with joins: only one table in from clause and NO subquery in where clause.
Examples with tables: T1, T2, . . , columns C1, C2, C3, . .
Example 9.3.1. Assume index C1X exists on column C1 of table T1 (always a B-tree secondary index in DB2). Consider:
select * from T1 where C1 = 10;
This is a Matching Index Scan. In plan table: ACCESSTYPE = I, ACCESSNAME = C1X, MATCHCOLS = 1. (MATCHCOLS might be >1 in multiple column index.)
Perform matching index scan by walking down B-tree to LEFTMOST entry of C1X with C1 = 10. Retrieve row pointed to. Loop through entries at leaf level from left to right until run out of entries with C1 = 10. For each such entry, retrieve row pointed to. No assumption about clustering or non-clustering of rows here.
In Example 9.3.2, assume other restrictions in WHERE clause, but matching index scan used on C1X. Then other restrictions are validated as rows are accessed (row is qualified: look at row, check if matches restrictions).
Not all predicates are indexable. In DB2, indexable predicate is one that can be used in a matching index scan, i.e. a lookup that uses a contiguous section of an index. For example, looking up words in the dictionary that start with the letters 'pre' is a matching index scan. Looking up words ending with 'tion' is not.
DB2 considers the predicate C1 <> 10 to be non-indexable. It is not impossible that an index will be used in a query with this predicate:
select * from T1 where C1 <> 10;
But the statistics usually weigh against it's use and so the query will be performed by a table space scan.
what about query:
select * from T1 where C1 = 10 and C2 between 100 and 200
and C3 like 'A%';
These three predicates are all indexable. If have only C1X, will be like previous example with retrieved rows restricted by tests on other two predicates.
If have index combinx, created by:
create index combinx on T1 (C1, C2, C3) . . .
Will be able to limit (filter) RIDs of rows to retrieve much more completely before going to data. Like books in a card catalog, looking up
authorlname = 'James' (c1 = 10) and authorfname between 'H' and 'K'
and title begins with letter 'A'
Finally, we will cover the question of how to filter the RIDs of rows to retrieve if we have three indexes, C1X, C2X, and C3X.
Example 9.3.3. Index Scan Step, Unique Match. Continuing with Example 9.2.1, employees table with 200,000 rows of 200 bytes and pctfree = 30, so 14 rows/pg and CEIL(200,000/14) =14,286 data pages. Assume in index on eid, also have pctfree = 30, and eid||RID takes up 10 bytes, so 280 entries per pg, and CEIL(200,000/280) = 715 leaf level pages. Next level up CEIL(715/280) = 3. Root next level up.
employees table: 14,286 data pages
index on eid, eidx: 715 leaf nodes, 3 level 2 nodes, 1 root node.
Now query: select ename from employees where eid = '12901A';
Root, on level 2 node, 1 leaf node, 1 data page. Seems like 4R. But what about buffered pages?
Five minute rule says should purchase enough memory so pages referenced more frequently than about once every 120 seconds (popular pages) should stay in memory. Assume we have done this. If workload assumes 1 query per second of this on ename with eid = predicate (no others on this table), then leaf nodes and data pages not buffered, but upper nodes of eidx are. So really 2R is cost of query.
v This Query Plan is a single step, with ACCESSTYPE = I, ACCESSNAME = eidx, MATCHCOLS = 1.
OK now we introduce a new table called prospects. Based on direct mail applications (junk mail). People fill out warrenty cards, name hobbies, salary range, address, etc.
50M rows of 400 bytes each. FULL data pages (pctfree = 0) and on all indexes: 10 rows on 4 KByte page, so 5M data pages.
prospects table: 5M data pages
Now: create index addrx on prospects (zipcode, city, straddr) cluster . . .;
v zipcode is integer or 4 bytes, city requires 12 bytes, straddr 20 bytes, RID 4 bytes, and assume NO duplicate values so no compression.
v Thus each entry requires 40 bytes, and we can fit 100 on a 4 KByte page. With 50M total entries, that means 500,000 leaf pages. 5000 directory nodes at level 2. 50 level 3 node pages. Then root page. Four levels.
v Also assume a nonclustering hobbyx index on hobbies, 100 distinct hobbies (. . . cards, chess, coin collecting, . . .). We say CARD(hobby) = 100.
v (Like we say CARD(zipcode) = 100,000. Not all possible integer zipcodes can be used, but for simplicity say they are.)
v Duplicate compression on hobbyx, each key (8 bytes?) amortized over 255 RIDS (more?), so can fit 984 RIDs (or more) per 4 KByte page, call it 1000.
Thus 1000 entries per leaf page. With 50M entries, have 50,000 leaf pages. Then 50 nodes at level 2. Then root.
index on eid, eidx: 715 leaf nodes, 3 level 2 nodes, 1 root node.
|
|
prospects table |
addrx index |
hobbyx index |
|
|
50,000,000 rows |
500,000 leaf pages |
50,000 leaf pages |
|
|
5,000,000 data pages |
5,000 level 3 nodes |
151 level 2 nodes |
|
|
(10 rows per page) |
50 level 2 nodes |
1 root node |
|
|
|
1 root node |
(1000 entries/leaf) |
|
|
|
CARD(zipcode)= 100,000 |
CARD(hobby)=100 |
Figure 9.12. Some statistics for the prospects table, page 552
Example 9.3.4. Matching Index Scan Step, Unclustered Match. Consider the following query:
select name, straddr from prospects where hobby = 'chess';
Query optimizer assumes each of 100 hobbies equally likely (knows there are 100 from RUNSTATS), so restriction cuts 50M rows down to 500,000. Walk down hobby index (2R for directory nodes) and across 500,000 entries (1000 per page so 500 leaf pages, sequential prefetch so 500S).
For every entry, read in row -- non clustered so all random choices out of 5M data pages, 500,000 distinct I/Os (not in order, so R), 500,000R.
Total I/O is 500S + 500,002R. Time is 500/800 + 500,002/80, about 500,000/80 = 6250 seconds. Or about 1.75 hours (2 hrs = 7200 secs).
Can generally assume that upper level index pages are buffer resident (skip 2R) but leaf level pages and maybe one level up are not. Should calculate index time and can then ignore it if insignificant.
If we used a table space scan for Example 9.3.4, qualifying rows to ensure hobby = 'chess, how would time compare to what we just calculated?
v Simple: 5M pages using sequential prefetch, 5,000,000/800 = 625 seconds. (Yes, CPU is still ignored — in fact is relatively insignificant.)
v But this is the same elapsed time as for indexed access of 1/100 of rows!!
Yes, surprising. But 10 rows per page so about 1/10 as many pages hit, and S is 10 times as fast as R. Query optimizer compares these two approaches and chooses the faster one. Would probably select Table Space Scan here But minor variation in CARD(hobby) could make either plan a better choice.
Example 9.3.5. Matching Index Scan Step, Clustered Match. Consider the following query:
select name, straddr from prospects
where zipcode between 02159 and 03158;
Recall CARD(zipcode) = 100,000. Range of zipcodes is 1000. Therefore, cut number of rows down by a factor of 1/100. SAME AS 9.3.4. Bigger index entries. Walk down to leaf level and walk across 1/100 of leaf level: 500,000 leaf pages, so 5000 pages traversed. I/O of 5000S.
And data is clustered by index, so walk across 1/100 of 5M data pages, 50,000 data pages, and they're in sequence on disk, so 50,000S. Compared to Nonmatching index scan of Example 9.3.4, walk across 1/10 as many pages and do it with S I/O instead of R. Ignore directory walk. Then I/O cost is 55,000S, with elapsed time 55,000/500 = 137.5 seconds, a bit over 2 minutes, compared with 1.75 hrs for unclustered index scan.
The difference between Examples 9.3.4 and 9.3.5 doesn't show up in the PLAN table. Have to look at ACCESSNAME = addrx and note that this index is clustered, (clusterratio) whereas ACCESSNAME = hobbyx is not.
(1) Clusterratio determines if index still clustered in case rows exist that don't follow clustering rule. (Inserted when no space left on page.)
(2) Note that entries in addrx are 40 bytes, rows of prospects are 400 bytes. Seems natural that 5000S for index, 50,000S for rows.
Properties of index:
1. Index has directory structure, can retrieve range of values
2. Index entries are ALWAYS clustered by values
3. Index entries are smaller than the rows.
Example 9.3.6. Concatenated Index, Index-Only Scan. Assume (just for this example) a new index, naddrx:
create index naddrx on prospects (zipcode, city, straddr, name)
. . . cluster . . .;
Now same query as before:
select name, straddr from prospects where zipcode
between 02159 and 03158;
Can be answered in INDEX ONLY (because find range of zipcodes and read name and straddr off components of index: Show components:
naddrx keyvalue: zipcodeval.cityval.straddrval.nameval
This is called an Index Only scan, and with EXPLAIN plan table gets new column: INDEXONLY = Y (ACCESSTYPE = I, ACCESSNAME = naddrx). Previous plans had INDEXONLY = N.
Time? Assume naddrx takes 60 bytes instead of 40 bytes, then amount read in index, instead of 5000S is 7500S, elapsed time 7500/800 = 9.4 seconds. Compare to 62.5 seconds with Example 9.3.5.
Valuable idea, Index Only. Select count(*) from . . . is always index only if index can do in a single step at all, since count entries.
But can't build index on the spur of the moment. If don't have needed one already, out of luck. E.g., consider query:
select name, straddr, age from prospects where zipcode
between 02159 and 02258;
Now naddrx doesn't have all needed components. Out of luck.
Indexes cost something. Disk media cost (not commonly crucial). With inserts or updates of indexed rows, lot of extra I/O (not common). With read-only, like prospects table, load time increases. Still often have every col of a read-only table indexed.
Chapter 9.4. Filter Factors and Statistics
Recall, estimated probability that a random row made some predicate true. By statistics, determine the fraction (FF(pred)) of rows retrieved.
E.g., hobby column has 100 values. Generally assume uniform distribution, and get: FF(hobby = const) = 1/100 = .01.
And zipcode column has 100,000 values, FF(zipcode = const) = 1/100,000. FF(zipcode between 02159 and 03158) = 1000.(1/100,000) = 1/100.
How does the DB2 query optimizer make these estimates?
DB2 statistics.
See Figure 9.13, pg. 558. After use RUNSTATS, these statistics are up to date. (Next pg. of these notes) Other statistics as well, not covered.
DON'T WRITE THIS ON BOARD -- SEE IN BOOK
|
Catalog Name |
Statistic Name |
Default Value |
Description |
|
SYSTABLES |
CARD NPAGES |
10,000 CEIL(1+CARD/20) |
Number of rows in the table Number of data pages containing rows |
|
SYSCOLUMNS |
COLCARD HIGH2KEY LOW2KEY |
25 n/a n/a |
Number of distinct values in this column Second highest value in this column Second lowest value in this column |
|
SYSINDEXES |
NLEVELS NLEAF FIRSTKEY- CARD FULLKEY- CARD CLUSTER- RATIO |
0 CARD/300 25
25
0% if CLUSTERED = 'N' 95% if CLUSTERED = 'Y' |
Number of Levels of the Index B-tree Number of leaf pages in the Index B-tree Number of distinct values in the first column, C1, of this key Number of distinct values in the full key, all components: e.g. C1.C2.C3 Percentage of rows of the table that are clustered by these index values |
Figure 9.13. Some Statistics gathered by RUNSTATS used for access plan determination
Statistics gathered into DB2 Catalog Tables named. Assume that index might be composite, (C1, C2, C3)
CARD, NPAGES for table. For column, COLCARD, HIGH2KEY, LOW2KEY. For Indexes, NLEVELS, NLEAF, FIRSTKEYCARD, FULLKEYCARD, CLUSTERRATIO. E.g., from Figure 9.12, statistics for prospects table (given on pp. 552-3). Write these on Board.
SYSTABLES
|
NAME |
CARD |
NPAGES |
|
. . . |
. . . |
. . . |
|
prospects |
50,000,000 |
5,000,000 |
|
. . . |
. . . |
. . . |
SYSCOLUMNS
|
NAME |
TBNAME |
COLCARD |
HIGH2KEY |
LOW2KEY |
|
. . . |
. . . |
. . . |
. . . |
. . . |
|
hobby |
prospects |
100 |
Wines |
Bicycling |
|
zipcode |
prospects |
100000 |
99998 |
00001 |
|
. . . |
. . . |
. . . |
. . . |
. . . |
SYSINDEXES
|
NAME |
TBNAME |
NLEVELS |
NLEAF |
FIRSTKEY CARD |
FULLKEY CARD |
CLUSTER RATIO |
|
. . . |
. . . |
. . . |
. . . |
. . . |
. . . |
. . . |
|
addrx |
prospects |
4 |
500,000 |
100,000 |
50,000,000 |
100 |
|
hobbyx |
prospects |
3 |
50,000 |
100 |
100 |
0 |
|
. . . |
. . . |
. . . |
. . . |
. . . |
. . . |
. . . |
CLUSTERRATIO is a measure of how well the clustering property holds for an index. With 80 or more, will use Sequential Prefetch in retrieving rows.
Indexable Predicates in DB2 and their Filter Factors
Look at Figure 9.14, pg. 560. QOPT guesses at Filter Factor. Product rule assumes independent distributions of columns. Still no subquery predicate.
|
Predicate Type |
Filter Factor |
Notes |
|
Col = const |
1/COLCARD |
"Col <> const" same as "not (Col = const)" |
|
Col µ const |
Interpolation formula |
"µ" is any comparison predicate other than equality; an example follows |
|
Col < const or Col <= const |
|
LOW2KEY and HIGH2KEY are estimates for extreme points of the range of Col values |
|
Col between const1 and const2 |
|
"Col not between const1 and const2" same as "not (Col between const1 and const2)" |
|
Col in list |
(list size)/COLCARD |
"Col not in list" same as "not (Col in list)" |
|
Col is null |
1/COLCARD |
"Col is not null" same as "not(Col is null)" |
|
Col like 'pattern' |
Interpolation Formula |
Based on the alphabet |
|
Pred1 and Pred2 |
FF(Pred1).FF(Pred2) |
As in probability |
|
Pred1 or Pred2 |
FF(Pred1)+FF(Pred2) -FF(Pred1).FF(Pred2) |
As in probability |
|
not Pred1 |
1 - FF(Pred1) |
As in probability |
Figure 9.20. Filter Factor formulas for various predicate types
Matching Index Scans with Composite Indexes
Assume new index mailx:
create index mailx on prospects (zipcode, hobby, incomeclass, age);
NOT clustered. column incomeclass has 10 distinct values, age has 50.
FULLKEYCARD(mailx) could be as much as
CARD(zipcode).CARD(hobby).CARD(incomeclass).CARD(age) =
100,000.100.10.50 = 1,000,000,000.
Can't be that much, only 50,000,000 rows, so assume FULLKEYCARD is 50,000,000, with no duplicate rows. (Actually, 50M darts in 5G slots. About 1/100 of slots hit, so only about 1% duplicate keyvalues.)
Entries for mailx have length: 4 (integer zipcode) + 8 (hobby) + 2 (incomeclass) + 2 (age) + 4 (RID) = 20 bytes. So 200 entries per page.
NLEAF = 50,000,000/100 = 500,000 pages. Next level up has 5,000 nodes, next level 50, next is root, so NLEVELS = 4.
SYSINDEXES
|
NAME |
TBNAME |
NLEVELS |
NLEAF |
FIRSTKEY CARD |
FULLKEY CARD |
CLUSTER RATIO |
|
. . . |
. . . |
. . . |
. . . |
. . . |
. . . |
. . . |
|
mailx |
prospects |
4 |
250,000 |
100,000 |
50,000,000 |
0 |
|
. . . |
. . . |
. . . |
. . . |
. . . |
. . . |
. . . |
Example 9.5.1. Concatenated Index, Matching Index Scan.
select name straddr from prospects where
zipcode = 02159 and hobby = 'chess' and incomeclass = 10;
Matching Index Scan here means that the three predicates in the where clause match the INITIAL column in concatenated mailx index.
Argue that matching means all entries to be retrieved are contiguous in the index.
Full filter factor for three predicates given is 1/100,000 .1/100.1/10 = 1/100M, with 50M rows, so only 0.5 rows selected. 0.5R
Interpret this probabilistically, and expected time for retrieving rows is only 1/80 second. Have to add index I/O of course. 2R, .05 sec.
Example 9.5.2. Concatenated Index, Matching index scan.
select name straddr from prospects
where zipcode between 02159 and 04158
and hobby = 'chess' and incomeclass = 10;
Now, important. Not one contiguous interval in index. There is one interval for: z = 02159 and h = 'c' and inc = 10, and then another for z = 02160 and h = 'c' and inc = 10, and . . . But there is stuff between them. Analogy in telephone directory: last name between 'Sma' and 'Smz' and first name 'John'. Lot of directory to look through, not all matches.
Query optimizer here traverses from leftmost z = 02159 to rightmost z = 04158 and uses h = 'c' and inc = 10 as screening predicates.
We say the first predicate is a MATCHING predicate (used for cutting down interval of index considered) and other two are SCREENING predicates.
(This MATCHING predicate is what we mean by Matching Index Scan.)
So index traversed is: (2000/100,000) (filter factor) of 500,000 leaf pages, = 10,000 leaf pages. Query optimizer actually calculates FF as
(04158-02159)(HIGH2KEY-LOW2KEY) = 2000/(99998-00001)
= 200/99997 or approximately 2000/100,000 = 1/50
Have to look through 1/50.NLEAF = 5000 pages, I/O cost is 5,000S with elapsed time: 5,000/400 = 12.5 seconds.
How many rows retrieved? (1/50)(1/100)(1/10) = (1/50,000) with 50M rows, so 1000 rows retrieved. 1000R, with elapsed time 1000/40 = 25 seconds. Total elapsed time is 37.5 secs.
Example 9.5.3. Concatenated Index, Non-Matching Index Scan.
select name straddr from prospects where
hobby = 'chess' and incomeclass = 10 and age = 40;
Like saying First name = 'John' and City = 'Waltham' and street = 'Main'.
Have to look through whole index, no matching column, only screening predicates.
Still get small number of rows back, but have to look through whole index. 250,000S. Elapsed time 250,000/400 = 625 seconds, about 10.5 minutes.
Number of rows retrieved: (1/100)(1/10)(1/50)(50,000,000) = 1000. 1000R = 25 seconds.
In PLAN TABLE, for Example 9.5.2, have ACCESSTYPE = I, ACCESSNAME = mailx, MATCHCOLS = 1; In Example 9.5.3, have MATCHCOLS = 0.
Definition 9.5.1. Matching Index Scan. A plan to execute a query where at least one indexable predicate must match the first column of an index (known as matching predicate, matching index). May be more.
What is an indexable predicate? Equal match predicate is one: Col = const See Definition 9.5.3. Pg. 565
Say have index C1234X on table T, composite index on columns (C1, C2, C3, C4). Consider following compound predicates.
C1 = 10 and C2 = 5 and C3 = 20 and C4 = 25 (matches all columns)
C2 = 5 and C3 = 20 and C1 = 10 (matches first three: needn't be in order)
C2 = 5 and C4 = 22 and C1 = 10 and C6 = 35 (matches first two)
C2 = 5 and C3 = 20 and C4 = 25 (NOT a matching index scan)
Screening predicates are ones that match non-leading columns in index. E.g., in first example all are matching, in second all are matching, in third, two are matching, one is screening, and one is not in index, in fourth all three are screening.
Finish through Section 9.6 by next class. Homework due next class (Wednesday after Patriot's day). NEXT homework is rest of Chapter 9 non-dotted exercises if you want to work ahead.
Definition 9.5.2. Basic Rules of Matching Predicates
(1) A matching predicate must be an indexable predicate. See pg. 560, Table 9.14 for a list of indexable predicates.
(2) Matching predicates must match successive columns, C1, C2, . . . of an index. Procedure: Look at index columns from left-to right. If find a matching predicate for this column, then this is a matching column. As soon as column fails to be matching terminate the search.
Idea is that sequence of matching predicates cuts down index search to smaller contiguous range. (One exception: In-list predicate, covered shortly).
(3) A non-matching predicate in an index scan can still be a screening predicate.
Look at rule (1) again. This is actually a kind of circular definition. For a predicate to be matching it must be indexable and:
Definition 9.5.3: An indexable predicate is one that can be used to match a column in a matching index scan.
Calling such a predicate indexable is confusing. Even if a predicate is not indexable, the predicate can use the index for screening.
When K leading columns of index C1234X are matching for a query, EXPLAIN into plan table, get ACCESSTYPE = I, ACCESSNAME = C1234X, MATCHCOLS = K. When non-matching index scan, MATCHCOLS = 0.
Recall Indexable Predicates in Figure 9.14, pg. 560, and relate to telephone directory. Does the predicate give you a contiguous range?
Col > const? between? In-list is special. like 'pattern' with no leading wild card? Col1 like Col2? (same middle name as street name) Predicate and? Predicate or? Predicate not?
OK, a few more rules on how to determine matching predicates. Page 566, Def 9.5.4. Match cols. in index left to right until run out of predicates. But
(3) Stop at first range predicate (between, <, >, <=, >=, like).
(4) At most one In-list predicate.
In-list is special because it is considered a sequence of equal matching predicates that the query optimizer agrees to bridge in the access plan
C1 in (6, 8, 10) and C2 = 5 and C3 = 20 is like C1 = 6 and . . .;
Plan for C1 = 8 and . . .; then C1 = 10 and . . .; etc.
But the following has only two matching columns since only one in-list can be used.
C1 in (6, 8, 10) and C2 = 5 and C3 in (20, 30, 40)
When In-list is used, say ACCESSTYPE = 'N'.
Example 9.5.4. In following examples, have indexes C1234X, C56X, Unique index C7X.
(1) select C1, C5, C8 from T where C1 = 5 and C2 = 7 and C3 <> 9;
ACCESSTYPE = I, ACCESSNAME = C1234X, MATCHCOLS = 2
(2) select C1, C5, C8 from T where C1 = 5 and C2 >= 7 and C3 = 9;
C3 predicate is indexable but stop at range predicate.
ACCESSTYPE = I, ACCESSNAME = C1234X, MATCHCOLS = 2
(3) select C1, C5, C8 from T
where C1 = 5 and C2 = 7 and C5 = 8 and C6 = 13;
We ignore for now the possibility of combining multiple indexes
Note, we don't know what QOPT will choose until we see plan table row
ACCESSTYPE = I, ACCESSNAME = C56X, MATCHCOLS = 2
(4) select C1, C4 from T
where C1 = 10 and C2 in (5, 6) and (C3 = 10 or C4 = 11);
C1 and C2 predicates are matching. The "or" operator doesn't give
indexable predicate, but would be used as screening predicate (not
mentioned in plan table, but all predicates used to filter, ones that
exist in index certainly would be used when possible). ACCESSTYPE = 'N',
ACCESSNAME = C1234X, MATCHCOLS = 2 Also: INDEXONLY = 'Y'
(5) select C1, C5, C8 from T where C1 = 5 and C2 = 7 and C7 = 101;
ACCESSTYPE = I, ACCESSNAME = C7X, MATCHCOLS = 1
(Because unique match, but nothing said about this in plan table)
(6) select C1, C5, C8 from T
where C2 = 7 and C3 = 10 and C4 = 12 and C5 = 16;
Will see can't be multiple index. Either non-matching in C1234X or
matching on C56X. ACCESSTYPE = I, ACCESSNAME = C1234X,
MATCHCOLS = 2
Some Special Predicates
Pattern Match Search. Leading wildcards not indexable.
"C1 like 'pattern'" with a leading '%' in pattern (or leading '_'?) like looking in dictionary for all word ending in 'tion'. Non-matching scan (non-indexable predicate).
Exist dictionaries that index by backward spelling, and DBA can use this trick: look for word with match: Backwards = 'noit%'
Expressions. Use in predicate makes not indexable.
select * from T where 2*C1 <= 56;
DB2 doesn't do algebra. You can re-write: where C1 <= 28;
Never indexable if two different columns are used in predicate: C1 = C2.
One-Fetch Access. Select min/max . . .
select min(C1) from T;
Look at index C1234X, leftmost value, read off value of C1. Say have index C12D3X on T (C1, C2 DESC, C3). Each of following qs has one-fetch access.
select min(C1) from T where C1 > 5; (NOT obviously 5)
select min(C1) from T where C1 between 5 and 6;
select max(C2) from T where C1 = 5;
select max(C2) from T where C1 = 5 and C2 < 30;
select min(C3) from T where C1 = 6 and C2 = 20 and C3 between 6 and 9;
9.6 Multiple Index Access
Assume index C1X on (C1), C2X on (C2), C345X on (C3, C4, C5), query:
(9.6.1)select * from T where C1 = 20 and C2 = 5 and C3 = 11;
By what we've seen up to now, would have to choose one of these indexes. Then only one of these predicates could be matched. Other two predicates are not even screening predicates. Don't appear in index, so must retrieve rows and validate predicates from that. BUT if only use FF of one predicate, terrible inefficiency may occur. Say T has 100,000,000 rows, FF for each of these predicates is 1/100. Then after applying one predicate, get 1,000,000 rows retrieved. If none of indexes are clustered, will take elapsed time of: 1,000,000/40 = 25,000 seconds, or nearly seven hours.
If somehow we could combine the filter factors, get composite filter factor of (1/100)(1/100)(1/100) = 1/1,000,000, only retrieve 100 rows.
Trick. Multiple index access. For each predicate, matching on different index, extract RID list. Sort it by RID value.
Intersect (AND) all RID lists (Picture) (easy in sorted order). Result is list of RIDs for answer rows. Use List prefetch to read in pages. (Picture)
This is our first multiple step access plan.
See Figure 9.15. Put on board.
V Note
|
TNAME |
ACCESSTYPE |
MATCHCOLS |
ACCESSNAME |
PREFETCH |
MIXOPSEQ |
|
T |
M |
0 |
|
L |
0 |
|
T |
MX |
1 |
C1X |
S |
1 |
|
T |
MX |
1 |
C2X |
S |
2 |
|
T |
MX |
1 |
C345X |
S |
3 |
|
T |
MI |
0 |
|
|
4 |
|
T |
MI |
0 |
|
|
5 |
Figure 9.15 Plan table rows of a Multiple Index Access plan for Query (9.6.1)
Row M, start Multiple index access. Happens at the end, after Intersect Diagram steps in picture. RID lists placed in RID Pool in memory.
Note Plan acts like reverse polish calculation: push RID lists as created by MX step; with MI, pop two and intersect , push result back on stack.
MX steps require reads from index. MI steps require no I/O: already in memory, memory area known as RID Pool.
Final row access uses List prefetch, program disk arms for most efficient path to bring in 32 page blocks that are not contiguous, but sequentially listed. Most efficient disk arm movements.
(Remember that an RID consists of (page_number, slot_number), so if RIDs are placed in ascending order, can just read off successive page numbers).
Speed of I/O depends on distance apart. Rule of thumb for Random I/O is 40/sec, Seq. prefetch 400/sec, List prefetch is 100/sec.
Have mentioned ACCESSTYPEs = M, MX, MI. One other type for multiple index access, known as MU (to take the Union (OR) of two RID lists). E.g.
(9.6.2) select * from T where C1 = 20 and (C2 = 5 or C3 = 11);
Here is Query Plan (Figure 9.16):
|
TNAME |
ACCESSTYPE |
MATCHCOLS |
ACCESSNAME |
PREFETCH |
MIXOPSEQ |
|
T |
M |
0 |
|
L |
0 |
|
T |
MX |
1 |
C1X |
S |
1 |
|
T |
MX |
1 |
C2X |
S |
2 |
|
T |
MX |
1 |
C345X |
S |
3 |
|
T |
MU |
0 |
|
|
4 |
|
T |
MI |
0 |
|
|
5 |
Figure 9.16 Plan table rows of a Multiple Index Access plan for Query (9.6.2)
Actually, query optimizer wouldn't generate three lists in a row if avoidable. Tries not to have > 2 in existence (not always possible).
Figure 9.17. 1. MX C2X, 2. MX C345X, 3. MU, 4. MX C1X, 5. MI
Example 9.6.1. Multiple index access. prospects table, addrx index, hobbyx index (see Figure 9.11, p 544), index on age (agex) and incomeclass (incomex). What is NLEAF for these two, class? (Like hobbyx: 50,000.)
select name, straddr from prospects
where zipcode = 02159 and hobby = 'chess' and incomeclass = 10;
Same query as Example 9.5.1 when had zhobincage index, tiny cost, 2R index, only .5R for row. But now have three different indexes: PLAN.
|
TNAME |
ACCESSTYPE |
MATCHCOLS |
ACCESSNAME |
PREFETCH |
MIXOPSEQ |
|
T |
M |
0 |
|
L |
0 |
|
T |
MX |
1 |
hobbyx |
S |
1 |
|
T |
MX |
1 |
addrx |
S |
2 |
|
T |
MI |
0 |
|
|
3 |
|
T |
MX |
1 |
incomex |
S |
4 |
|
T |
MI |
0 |
|
|
5 |
|
|
|
|
|
|
|
Figure 9.18 Plan table rows of Multiple Index Access plan for Example 9.6.1
Calculate I/O cost. FF(hobby = 'chess') = 1/100, on hobbyx (NLEAF = 50,000), 500S (and ignore directory walk).
For MIXOPSEQ = 2, FF(zipcode = 02159) = 1/100,000 on addrx (NLEAF = 500,000 ), so 5S.
Intersect steps MI requires no I/O.
FF(incomeclass = 10) = 1/10, on incomex (NLEAF = 50,000), so 5,000S.
Still end up with 0.5 rows at the end by taking product of FFs for three predicates: (1/100,000)(1/100)(1/10) = 1/100,000,000
Only 50,000,0000 rows; ignore list prefetch of .5L.
So 5,505, elapsed time 5,505/400 = 13.8 seconds.
List Prefetch and RID Pool
All the things we've been mentioning up to now have been relatively universal, but RID list rules are quite product specific. Probably won't see all this duplicated in ORACLE.
-> We need RID extraction to access the rows with List Prefetch, but since this is faster than Random I/O couldn't we always use RID lists, even in single index scan to get at rows with List Prefetch in the end?
YES, BUT: There are restrictive rules that make it difficult. Most of these rules arise because RID list space is a scarce resource in memory.
-> The RID Pool is a separate memory storage area that is 50% the size of all four Disk Buffer Pools, except cannot exceed 200 MBytes.
Thus if DBA sets aside 2000 Disk Buffer Pages, will have 1000 pages (about 4 MBytes) of RID Pool. (Different memory area, though.)
Def. 9.6.1. Rules for RID List Use.
(1) QOPT, when it constructs query plan, predicts size of RIDs active at any time. Cannot use > 50% of total capacity. If guess wrong, abort during runtime.
(2) No Screening predicates can be used in an Index scan that extracts a RID List.
(3) An In-list predicate cannot be used when extract a RID list.
Example 9.6.3. RID List Size Limit. addrx, hobbyx, incomex, add sexx on column sex. Two values, 'M' and 'F', uniformly distributed.
select name, straddr from prospects where zipcode between
02159 and 02358 and incomeclass = 10 and sex = 'F';
Try multiple index access. Extract RID lists for each predicate and intersect the lists.
But consider, sexx has 50,000 leaf pages. (Same as hobbyx, from 50,000,000 RID size entries.) Therefore sex = 'F' extracts 25,000 page RID list. 4 Kbytes each page, nearly 100 MBytes.
Can only use half of RID Pool, so need 200 MByte RID Pool.
Not unreasonable. A proof that we should use large buffers.
Example 9.6.5. No Index Screening. Recall zhobincage index in Example 9.3.8.
select name, straddr from prospects where zipcode between 02159
and 04158 and hobby = 'chess' and incomeclass = 10;
Two latter predicates were used as screening predicates. But screening predicates can't be used in List Prefetch. To do List Prefetch, must resolve two latter predicates after bring in rows.
The 'zipcode between' predicate has a filter factor of 2000/100,000 = 1/50, so with 50 M rows, get 1M rows, RID list size of 4 MBytes, 1000 buffer pages, at least 2000 in pool. Assume RID pool is large enough.
So can do List Prefetch, 1,000,000/100 = 10,000 seconds, nearly 3 hours.
If use screening predicates, compound FF is (1/50)(1/100)(1/10) = 1/50,000. End with 1000 rows. Can't use List Prefetch, so 1000R but
1000/40 is only 25 seconds. Same index I/O cost in both cases. Clearly better not to use RID list.
What is reason not to allow List Prefetch with Screening Predicates? Just because resource is scarce, don't want to tie up RID list space for a long time while add new RIDs slowly. If refuse screening, QOPT might use other plans without RIDs.
Point of Diminishing Returns in MX Access.
Want to use RID lists in MIX. >> Talk it
Def. 9.6.2. (1) In what follows, assume have multiple indexes, each with a disjoint set of matching predicates from a query. Want to use RID lists.
(2) Way QOPT sees it: List RID lists from smallest size up (by increasing Filter Factor). Thus start with smaller index I/O costs (matching only) and larger effect in saving data page I/O.
(3) Generate successive RID lists and calculate costs; stop when cost of generating new RID list doesn't save enough in eventual data page reads by reducing RID set.
Example 9.6.6.
select name straddr from prospects
where prospects between o2159 and 02659
and age = 40 and hobby = 'chess' and incomeclass = 10;
(1) FF(zipcode between 02159 and 02658) = 500/100,000 = 1/200.
(2) FF(hobby = 'chess') = 1/100
(3) FF(age = 40) = 1/50 (Typo in text)
(4) FF(incomeclass = 10) = 1/10
Apply predicate 1. (1/200) (50M rows) retrieved, 250,000 rows, all on separate pages from 5M, and 250,000L is 2500 seconds. Ignore the index cost.
Apply predicate 2 after 1. Leaf pages of hobbyx scanned are (1/100) (50,000) = 500S, taking 500/400 = 1.25 seconds. Reduce number of rows retrieved from 250,000 to about 2500, all on separate pages, and 2500L takes about 25 seconds. So at cost of 1.25 seconds of index I/O, reduce table page I/O from 2500 seconds to 25 seconds. Clearly worth it.
Apply predicate 3 after 1 and 2. Leaf pages of agex scanned are (1/50) (50,000) = 1000S, taking 1000/400 = 2.5 seconds. Rows retrieved down to (1/50) (2500) = 50, 50L is .5 seconds. Thus at cost of 2.5 seconds of index I/O reduce table page I/O from 25 seconds to 0.5 seconds. Worth it.
With predicate 4, need index I/O at leaf level of (1/10) (50,000) = 5000S or 12.5 seconds. Not enough table page I/O left (0.5 seconds) to pay for it. Don't use predicate 4. We have reached point of diminishing returns.
------------------------------------------------------------March 19--------------------------------------------------------------
Thus far we have explored how the Query Optimizer is able to develop a plan for executing a query. We have limited ourselves in the FROM clause to a single table, but we have examined a variety of WHERE clause phrases to see how indexes could be used to more quickly filter the records required for the response. However, in Chapter 8 we understood that clustering could be used to speed the joining of two tables, so now we look at how the Query Optimizer handles Joins.
9.7 Methods for Joining Tables
DB2 uses three different algorithms for computing joins: nested loop, merge scan, and hybrid, and a fourth algorithm, hash join has been implemented by other vendors. Since a join is a binary operation, one table (called the OUTER TABLE) is joined with another and the resulting table can play the role of outer table if more than two joins are specified in the query.
v Joins are expensive in time and space, so the algorithms do not materialize entire tables but only the minimum amount of data needed by the query.
v Often a join
will only be partially implemented with the remainder computed on the fly as
the additional rows are needed.
Nested-Loop Join
[9.7.1] select T1.C1, T1.C2, T2.C3, T2.C4 from T1,T2
where T1.C1=5 and T1.C2=T2.C3;
R1: FIND ALL ROWS T1.* IN THE OUTER TABLE T1 WHERE C1=5;
FOR EACH ROW T1.* FOUND IN THE OUTER TABLE:
R2: FIND ALL ROWS T2.* IN THE INNER TABLE WHERE T1.C2=T2.C3;
FOR EACH ROW T2.* FOUND IN THE INNER TABLE
RETURN ANSWER: T1.C1, T1.C2, T2.C3, T2.C4;
END FOR;
END FOR;
Step R1: generates a row in the plan table for indexed access via C1X with List Prefetch.
As each qualifying row from T1 is retrieved, the value in column C1 is used, along with C3X on table T2, to obtain all of the rows of T2 that participate in the join with this row from T1. A second entry is made in the plan table, this one for PLAN NO = 2, for indexed access with METHOD=1 to signify a nested-loop join. Steps labeled R1 and R2 indicate that a list of I/Os must be accomplished (via index or scan), where R2 occurs repeatedly in a loop.
COSTI/O(N-L join) = COSTI/O(R1) +(#rows qualified from Outer Table)* COSTI/O(R2)
Cost is low if (1) small number of outer table rows qualify; or (2) inner table can be brought into memory.
EXAMPLE 9.7.1 Suppose T1 and T2 are each 200 bytes wide and 1M records long, having indexes C1X, C3X and C4X. Further suppose FF(C1=const)=FF(C4=const)= 1/100, FF(C2 = const)= 1/250,000, and FF(C3=const)=1/500,000. What is the best plan for
select T1.C1, T1.C2, T2.C3, T2.C4 from T1,T2 where T1.C1=5 and T1.C2=T2.C3 and T2.C4=6;
The plan for this query with T1 the outer table is identical to the one above (only the returned rows will be restricted by C4=6). 10,000 rows will be retrieved via C1X=5 with cost 10,000L=50 seconds (ignoring indexing cost). Because only two rows will be retrieved for each C3X=const [1M/500,000] after retrieving the index leaf node, the cost of the inner loop is 10,000*(3R) = 30,000/80 = 375 seconds. Of the 20,000 rows retrieved from T2, only 200 will have C4=6, so the join table will have only 200 rows but will take just over 7 minutes to compute.
Merge Join
Repeat the query of Example 9.7.1 but this time the idea of this algorithm is to access T1 on C1X=5, and write to disk an intermediate table IT1 of the qualifying records (all of them) in order by C2 (this might take some disk sorting to accomplish). Then do the same for T2 using C4X=6 and ordering by C3. Now the job is to step through both files and construct the rows of the join table.
SET P1 TO ROW1 OF IT1;
SET P2 TO ROW1 OF IT2;
MJ: WHILE (TRUE) { /* find a match*/
WHILE (P1.C2 <> P2.C3) { /*move lower pointer up*/
IF (P1.C2>P2.C3) {ADVANCE P2; IF (eof ) EXIT MJ;}
ELSE {ADVANCE P1; IF (eof ) EXIT MJ;}
}/*match found*/
MEMP = P2; /*inner loop for join always begins here*/
WHILE (P1.C2==P2.C3) {return IT1.C1, IT1.C2,IT2.C3, IT2.C4;
ADVANCE P2;} /*end match*/
ADVANCE P1; IF (eof) EXIT MJ;
IF(P1.C2==MEMP.C3) P2=MEMP; } /*end MJ*/
REPEAT EXAMPLE 9.7.1 Suppose T1 and T2 are each 200 bytes wide and 1M records long, having indexes C1X, C3X and C4X. Further suppose FF(C1=const)=FF(C4=const)= 1/100, FF(C2 = const)= 1/250,000, and FF(C3=const)=1/500,000. What is the best plan for
select T1.C1, T1.C2, T2.C3, T2.C4 from T1,T2 where T1.C1=5 and T1.C2=T2.C3 and T2.C4=6;
Using a merge join, IT1 is obtained by extracting 10,000 records from T1 and IT2 is formed by extracting 10,000 records from T2. These two steps appear in the plan table as TYPE=I, ACCESS NAME=C1X and C4X, PREFETCH=L, but this time SORTN_JOIN=Y. Since the extracted records are only two values each (about 20 bytes), IT1 and IT2 are each 200,000 bytes long so they should fit into memory (50 blocks each). The merge to find the resulting 200 records of the join (8,000 bytes total) will also be done in memory, so the total cost (ignoring accessing the index leaf nodes) is 20,000L = 100 seconds, four times faster.
Note that if there were a combined indexes: C12X and C43X, then the join could be performed as INDEX_ONLY, by reading the leaf nodes of the two indexes and obtaining the records in join order (IT1 and IT2 would not have to be materialized). The cost would that of finding 10,000 leaf entries in each index, where leaf nodes now hold half as many pointers but are accessed at 800 blocks per second (16,000 entries), hence at total of 1.25 seconds.
EXAMPLE 9.7.3 changes the query to
select T1.C5, T1.C2, T2.* from T1,T2 where T1.C5=5 and T1.C2=T2.C3;
and suppose the FF(T1.C5)=1/1000. The nested plan now obtains 1000 rows for outer table T1 with a cost of 1 R + 1000 L and the same 3 R to retrieve each inner table row (3000 R) for a total cost of 3001/80 + 1000/200 or 42.5 seconds.
The merge join plan extracts IT1 as 1 R + 1000 L and holds the 20,000 bytes in memory. We can now (1) access T2 via C3X and building IT2 in sort order (all of the leaf nodes will be scanned in order, so 1M entries will be processed and 1,000,000 R data accesses made as a result), or (2) we can just sort T2 on C3 to produce IT2 and then do the merge (reading IT2 into memory once). The first alternative is a bad way to sort IT2 but the second, even assuming a rapid sort would require 3 or 4 passes through the file at 800 blocks per second and 4,000,000/800 = 5000 seconds.
Hybrid Join
In the previous example, it was a waste of time to produce IT2 when so few of its records participated in the merge. When IT1 is produced, each of its rows could be extended with the RID found by using its C2 value in C3X. If more than one RID is found in C3X, the row in IT1 is duplicated. IT1 will then mirror the join result table. IT1 is then sorted in the RID column order and T2 can be accessed with LIST PREFETCH.
EXAMPLE 9.7.4 repeats the above query with the hybrid plan. The 1000 records of IT1 now become 2000 records each 24 bytes long found first by using C1X=5 to get 1000 entries of C1 and C2 (1 R + 1000 L), then to sort on C2, and then obtain an average of 2 RIDs for each C3X=IT1.C2 at a cost of 1000 R After sorting this table by RID, we can access T2 to get the data for each row: 2000 L. The cost is 1001/80 + 3000/200 = 27.5 seconds.
In the DB2 plan table, METHOD 0 indicates no join method used for this step, METHOD 1 means nested loop join, METHOD 2 means merge join, and METHOD 4 means hybrid join.
Multiple Table Joins
Accomplished by pipelining: rows of results tables are fed as outer table rows to the following join method. In this way some of these rows will be disqualified and the amount of storage needed to hold intermediate tables will be reduced. The Query Optimizer must examine the constraints specified on the rows of each table before deciding in what order to perform the joins. Using its statistics, it will compute the join in the order that maximizes filtering, and use the join method that facilitates pipelining.
Transforming Nested Queries to Joins
THEOREM 9.7.1 The following two queries give equivalent results:
select T1.C1 from T1 where [predicates on T1 and ] T1.C2 is in (select T2.C3 from T2
[where predicates on T2, T1]);
and
select distinct T1.C1 from T1, T2 where T1.C2=T2.C3 [and predicates on T1 and ]
[and predicates on T2, T1]);
EXAMPLE 9.7.6
Select * from T1 where C1=5 and C2 in (select C4 from T2 where C5=6 and C6> T1.C3);
It would
seem that the Query Optimizer would have to loop through T1 to qualify the
records from which C3 can be used in the subquery via C4X access to T2 to find
those records where C4=C2. For each
record in T1 such that C1=5, use C4X=T1.C2 and see if C6>T1.C3 and C5=6,
keeping the those that do. Now apply
the theorem:
Select distinct T1.* from T1, T2 where C1=5 and
T1.C2 =T2.C4 and T2. C5=6 and T2.C6> T1.C3;
METHOD 2 could be used where IT1 was obtained via condition C1=5 and sorted on C2, and IT2 was obtained by extracting rows where C5=6 and sorting on C4. The merge result has to satisfy the final condition. Table sizes and fllter factors would be used to make the choice.
9.8 Disk Sorts
N-Way merge sort:
1. Memory permits M+1 pages in memory where the file contains D>M pages where all rows lie within a page. Read in M blocks and sort, write one sorted block to scratch area.
2. Repeat until file in Bn sorted blocks in scratch area (last one may be small)
3. Read first page (or more) from each of the first N blocks and merge them, writing pages to second scratch area in blocks of size N*M.
4. Repeat step 3 until only one block is written (sorted file). In fact, whenever the written block count falls below M+1, we know that there is only one pass left, so in the last past we overwrite the original file. However, is the result is the final result of the query, it is printed as it is generated rather than stored to disk.
If the scratch area consists of space on 2N disk units, the reads and writes can be done in parallel (writing cyclically). This would permit sequential access to the block (if there was no contention for the arm). Otherwise, if the scratch areas are on the same drive, every page access for read or write is a random access to that page (after the first N pages).
The cost of disk sorting, assuming each record is read and written each pass, is 2*number of passes, which is CEIL(logM (D))*D when each page is a random read. EXAMPLE 9.8.3 demonstrates sorting (ORDERED BY) a 10,000 page table (1M rows) in a 10-way merge which requires 4 passes, producing blocks of 10 pages, 100 pages, 1000 pages, and finally one block of 10,000. It costs 20000 R page I/Os to convert 10,000 unsorted pages into 1000 sorted blocks. It costs 20,000 R page I/Os to covert the 1,000 sorted blocks into 100 sorted blocks. It costs 20,000 R page I/Os to convert the 100 sorted blocks into 10 sorted blocks. However, it costs only 10,000 page I/Os to convert the 10 blocks into results output. The cost of the sort is 70,000 R.
-----------------------------------------------March 31----------------------------------------------------
The final two sections of this chapter on query optimization focus on the role of benchmarks in helping the DBA ascertain the performance of the database software on simulated workloads. O’Neill is an authority on benchmarking and the discussion he gives on the Query Set benchmark using the BENCH table is an update of published work that is about 10 years old (he has a warning to that effect in the text). I have gathered here some material from the Web where the second edition (1993) of the Benchmark Handbook is available [http://www.benchmarkresources.com/handbook/contents.asp ] with specific reference to chapter 6 written by O’Neill. I have also included on the Chapter 10 web page a survey article from the 1997 edition that discusses the history of benchmarks relevant to assessment of a variety of database functions.