If you have not studied the chapter and reviewed the dotted parts of the question before reading the solutions below, you will not gain much understanding here. The exercises are to provide a reality check about the concepts presented in the chapter. If you have not grasped the concepts, the details of their application will not be evident. When solving each problem it is wise to

 

[9.1] (b) (i) As in Example 9.3.4, we assume buffering of directory level B-tree pages, so that disk access is needed only for leaf-level index pages and data pages. Thus elapsed time is 2R, but instead of random reads R, we are required to perform P32 type reads requiring .028 seconds calculated in part (a). We need 2P32, and thus the time is .056 seconds. Note that it doesn’t help to read in 32 pages in a row when we only NEED one page in sequence as is the case with the leaf page and data page for a unique row. In fact it hurts us by taking extra time for the disk arms. (In addition, we would have to buffer 32-page blocks, which would waste buffer space, so we would probably have more trouble buffering upper levels of the index.)

(ii) Here we have 2R, normal random I/Os taking .0125 seconds (Figure 9.6), so 2R takes .025 seconds.

(d) (i) The time for the workload when random I/Os are used is: W1*0.025 + W2*687.5, and time when P8 I/Os are used (at 0.016 seconds each) is: W1*2*0.016 + W2*110. Equating the two times gives us an equation in two unknowns, where we can solve for W1, W1 = ((687.5 - 110)/(.032-.025))*W2 = 82500*W2. Thus, for every large query of type (c), we need to have 82,500 exact match queries in the workload to have the situation where it doesn't matter which access method is used.

(ii) Clearly, in the case of query type (b) we would choose 2R I/O, or 0.025 seconds. In the case of type (c) we would choose P8, so the total elapsed time for the query is 110 seconds.

[9.4](a) (i) CARD = 100,000,000, just the number of rows. Given that the pages are 90% full, we can fit 18 pages of 200 bytes on a 4 KByte page. CEIL(100,000,000/18) = 5,555,556, and this is NPAGES.

(ii) For each column Ck, CARD(Ck) = COLCARD, and we have assumed that the values populating the column go from 1 to CARD(Ck).

For C2, COLCARD = 200, LOW2KEY = 2, HIGH2KEY = 199.
For C3, COLCARD = 50, LOW2KEY = 2, HIGH2KEY = 49.
For C6, COLCARD = 1000, LOW2KEY = 2, HIGH2KEY = 999.
For C7, COLCARD = 20, LOW2KEY = 2, HIGH2KEY = 19.

(iii) For C1234X, FIRSTKEYCARD = CARD(C1) = 100. To calculate FULLKEYCARD, multiply the COLCARDS to get: 100 * 200 * 50 * 10 = 10,000,000. This means that on the average there are 10 rows for each of these values, and we assume the FULLKEYCARD is 10,000,000. For C5678X, FIRSTKEYCARD = CARD(C5) = 10. To calculate FULLKEYCARD, multiply to get: 10 * 1000 * 20 * 5000 = 1000 million. This is 10 times the number of rows, and so the proper estimate for the number of values used is about 100,000,000, the number of rows. Essentially all key values are unique.

(iv) Recall that all columns are 4 bytes in length. For C1234X, the keyvalue length is 16 bytes. There is a certain amount of compression at the leaf level, with each block of duplicates containing 1 keyvalue (16 bytes), 4 bytes block overhead, and 10 RIDs of 4 bytes each; this means a total of 60 bytes for 10 entries or 6 bytes per entry. Can fit FLOOR(4000/6) = 666 entries per page. With 100,000,000 entries at the leaf level, this gives 100,000,000/666 = 150,151 = NLEAF. Entries on higher directory levels are 20 bytes in length (no compression, so 16 bytes for sepkey and 4 bytes for nodeptr). This means 4000/20 = 200 entries per page, and the next level of nodes will contain:

CEIL(150,151/200) = 751 nodes.

The next level will have 4 nodes and the level above that will be the root, so NLEVEL = 4. In the case of C5678X, there is no compression since we had assumed all key values are unique. Therefore at leaf level and all levels above, an index entry is 20 bytes, 200 entries per page. NLEAF = 100,000,000/200 = 500,000. Next level up, number of nodes is 2,500. Next level up, 13. Next is root and NLEVELS is 4.

(c) One possibility, is the plan: ACCESSTYPE = I, ACCESSNAME = C1234X, MATCHCOLS = 2, (INDEXONLY = N). We have matching columns C1 and C2, with filter factor given by: (1/100)(1/200) = 1/20,000. We thus have to retrieve at the index leaf level: (1/20,000)(150,151) = about 8S, or 0.01 seconds. We also have to retrieve data pages for each of (1/20,000)(100,000,000) = 5,000 rows. (List prefetch could be used here, but we are defaulting to Random I/O), so elapsed time is:

5000/80 = 62.5 seconds.

Another possiblility is the plan: ACCESSTYPE = I, ACCESSNAME = C5678X, MATCHCOLS = 0, INDEXONLY = Y. A non-matching scan, but one that is INDEXONLY. We will have to read the whole of the C5678X leaf level, 500,000 pages, using sequential prefetch, 500,000S. The elapsed time is therefore 500,000/800 = 625 seconds. This is an inferior plan to the first.

9.6 (b) Yes. FF(AGE = 40) = 1/50, use agex index to extract 1 Million RIDs.

(d) Yes, under the same memory requirements as in part a. FF(AGE between 40 and 44) = 1/10, and use agex index as in (b). There will be 5 million RIDs in the list.

(f) Yes, use the predicate "age = 40 or age = 43 or age = 45". Then have sequence: MX, MX, MU, MX, MU, ending up with the desired RID list.

[9.8] (b)(ii) FF(hobby = ‘chess’) = .01, FF(age = 20) = .02, FF(zipcode between 02139 and 07138) = .05.

ACCESS
TYPE

MATCH
COLS

ACCESS
NAME

PRE
FETCH

MIXOP
SEQ

M

0

 

L

1

MX

1

hobbyx

S

2

MX

1

agex

S

3

MI

0

 

 

4

MX

1

mailx

S

5

MI

0

 

 

6

Note that addrx could be used in step 5, but it has more leaf pages, so mailx is faster to use.

[9.10] (b) We start by noting that the indexes C6X and C9X have only a small number of values (FF is small relative to 1,000,000) and therefore there will be compression on the leaf level of these indexes so that each entry requires only 4 bytes. Thus there are 1000 entries per page of 4 KBytes, and 1,000,000 entries requires only 1000 NLEAF pages. The table itself will fit 20 rows of 200 bytes on each page, so we will have NPAGES = 1,000,000/20 = 50,000.

(i)With nested loop join where T1 is the outer table, we start by retrieving rows from T1 where T1.C6 = 5. The index scan cost is (1/20)(1000) page, 50 pages with sequential prefetch, 50S, and elapsed time 50/800 = .0625 seconds. There will be (1/20)(1,000,000) rows retrieved, or 50,000 rows. We assume these all lie on distinct pages (of the 50,000 data pages Q it will be somewhat less, but we allow this approximation), and we access these 50,000 page with list prefetch, 50,000L, with elapsed time 50,000/200 = 250 seconds. Now for each of these rows, we set T1.C7 = K and retrieve all rows in T2 such that T2.C8 = K. This requires one access to the leaf page of the C8X index (1R), and then a retrieval of 5 pages from T2, 5L. We do this 50,000 times, so we end up with 50,000R + 250,000L and use the rule of thumb that the elapsed time is 50,000/80 + 250,000/200 = 625 + 1250 seconds. The total time therefore is given by:

0.0625 seconds (C6X index) + 250 seconds (50,000 rows in T1) + 625 seconds + 1250 seconds (50,000 retrievals in T2) = 2125 seconds. (We ignore the 0.0625 as insignificant.)

(ii) With nested loop join where T2 is the outer table, we start by retrieving rows from T2 where T2.C9 = 6. The index scan cost is (1/800)(1000) page, 1.25 pages and this will be insignificant. There will be (1/400)(1,000,000) rows retrieved, or 2,500 rows. We assume these all lie on distinct data pages, and we access these 2,500 page with list prefetch, 2,500L, with elapsed time 2,500/200 = 12.5 seconds. Now for each of these rows, we set T2.C8 = K and retrieve all rows in T1 such that T1.C7 = K. This requires one access to the leaf page of the C7X index (1R), and then a retrieval of 5 pages from T1, 5L. We do this 2,500 times, so we end up with 2,500R + 12,500L and use the rule of thumb, that the elapsed time is 2,500/80 + 12,500/200 = 31.25 + 62.5 seconds. The total time therefore is given by:

(insignificant C9X index) + 12.5 seconds (2,500 rows in T2) + 31.25 seconds + 62.5 seconds (2,500 retrievals in T2) = 106.25 seconds.

This is clearly the winner so far.

(iii) In merge join, we retrieve all rows from T1 where T1.C6 = 5, and then sort them by T1.C7. The cost of the index scan cost, as calculated in part (i), is 0.0625 seconds for the index (insignificant) and then 250 seconds to extract 50,000 rows into IT1. We then retrieve all rows from T2 where T2.C9 = 6 and sort them by T2.C8. The cost for the extract, as calculated in part (ii) is (insignificant for the index) and 12.5 seconds for the 2,500 rows extracted into IT2. The entire row in both cases must be extracted and sorted (since the Select statement retrieves all columns), and this means 20 rows per page. Thus IT1 requires 50,000/20 = 2,500 pages, and the sort requires 2,500S + 2,500R, with elapsed time 2,500/800 + 2,500/80 = 3.125 + 31.25 = 34.375 seconds. Also IT2 requires 2,500/20 = 125 pages, and the sort requires 125S + 125R, with elapsed time of 125/800 + 125/80 = .15625 + 1.5625 seconds. All other steps have no I/O cost (the sorts of IT1 and IT2 feed pages in order to the merge, and very few, only about 625, are ultimately selected). Thus the total elapsed time for I/O in this case is:

250 seconds (to extract IT1) + 12.5 seconds (IT2) + 34.375 seconds (sort IT1) + 1.7375 seconds.

The 250 seconds at the beginning

[9.11] (b) After pass 1:

12 29 45 58 67 84 | 7 22 39 76 81 91 | 4 28 33 65 77 96 | 1 13 32 41 54 59

After pass 2:

4 7 12 22 28 29 33 39 45 58 65 67 76 77 81 84 91 96 | 1 13 32 41 54 59

After pass 3:

1 4 7 12 13 22 28 28 32 33 39 41 45 54 58 59 65 67 76 77 81 84 91 96

(c) After pass 1:

12 45 67 84 | 7 29 58 76 | 22 39 81 91 | 28 33 65 96 | 4 13 54 77 | 1 32 41 59

After pass 2:

7 12 22 28 29 33 39 45 58 65 67 76 81 84 91 96 | 1 4 13 32 41 54 59 77

After pass 3:

1 4 7 12 13 22 28 28 32 33 39 41 45 54 58 59 65 67 76 77 81 84 91 96