Section 6.5. Preliminaries for Normalization.
Idea in normalization, start with data item names (to be columns in some tables) together with a list of rules of relatedness. Then start with all data items in one table (universal table).
Rules of relatedness and a desire to avoid certain types of bad behavior (anomalies) causes us to factor this big table into smaller tables, achieving more and more restrictive forms (Normal Forms). 1NF, 2NF, 3NF, BCNF.
The idea is that from the factored tables can always get back all the original data by joins; this is called a "lossless decomposition".
A second desire we have is that the database system will be able to check the rules of relatedness as simply as possible (generally by simply enforcing uniqueness of a column in a table).
Section 6.6. Functional Dependencies.
Def 6.6.1. Given a table T with at least two attributes A and B, we say that A -> B (A functionally determines B, or B is functionally dependent on A) iff it is the intent of the designer that for any set of rows that might exist in the table, two rows in T cannot agree on A and disagree on B.
More formally, given two rows r1 and r2 in T, if r1(A) = r2(A) then r1(B) = r2(B).
Def. 6.6.2. Now sets of attributes: X = A1 A2 . . . Ak, and Y = B1 B2 . . . Bm. Say X -> Y iff it is the intention of the designer that two rows cannot simultaneously agree on X and disagree on Y.
Same wording as Def. 6.6.1, but note that for a set X, agrees if and only if agrees on ALL column value, disagrees if disagrees on ANY column value.
Ex. 6.6.3. We claim what follows is all FDs of emp_info.
(1) emp_id -> emp_name emp_phone dept_name
(2) dept_name -> dept_phone dept_mgrname
(3) skill_id -> skill_name
(4) emp_id skill_id -> skill_date skill_lvl
Note that we can conclude from above that designer does not intend that skill_name should be unique for a particular skill. skill_name -> skill_id is not there, nor is it implied by this set (no skill_name on left).
Note that a set of FDs has logical implications to derive OTHER FDs.
Theorem 6.6.3. Inclusion rule. Given T with Head(T). If X and Y are sets in Head(T) and Y Í X, then X -> Y (for ANY content for T). (Venn Diagram.)
Proof. By def, need only demonstrate that if two rows u and v agree on X they must agree on Y. But Y is a subset of X, so seems obvious.
Def. 6.6.4. A trivial dependency is an FD of the form X -> Y that holds for any table T where X UNION Y Í Head(T). (Assume ANY content for T.)
Theorem 6.6.5. Given a trivial dependency X -> Y in T, it must be the case that Y Í X.
Def. 6.6.6. Armstrong's Axioms. From the following small set of basic rules of implication among FDs, we can derive all others that are true.
[1] Inclusion rule: if Y Í X, then X -> Y
[2] Transitivity rule: if X -> Y and Y -> Z, then X -> Z
[3] Augmentation rule: if X -> Y, then X Z -> Y Z
Note X Z for sets is the same as X UNION Z.
Everything we can say about FDs is implied by these axioms.
Let's prove one of these axioms. We will be appealing to definitions of Functional Dependency here to prove an Armstrong Axiom.
Prove Augmentation Rule. If X -> Y then X Z -> Y Z. Consider two rows u and v that agree on X Z, that is they agree on all attributes in X UNION Z. Then since u and v agree on all attributes of X they agree on all attributes of Y (since X -> Y). Similarly, since it is given that they agree on all attributes of Z and we have just shown they agree on all attributes of Y, they agree on all attributes of Y UNION Z. Thus, from the fact that u and v agree on all attributes of X UNION Z we can conclude that they also agree on all attributes of Y UNION Z, and by definition we have X Z -> Y Z.
Some implications of Armstrong's Axioms.
(1) Union Rule: If X -> Y and X -> Z then X -> Y Z
(2) Decomposition Rule: If X -> Y Z then X -> Y and X -> Z
(3) Pseudotransitivity Rule: If X -> Y and W Y -> Z then X W -> Z
(4) Accumulation Rule: If X -> Y Z and Z -> B W then X -> Y Z B
One Example:
[1] Union Rule: If X -> Y and X -> Z, then X -> Y Z.
Proof: We are given (a) X -> Y and [3]
Augmentation rule: if X -> Y, then X Z -> Y Z, hence we have (c) X
X -> X Y. But X X is X UNION X = X,
so (c) can be rewritten
(d) X -> X Y.
Similarly we are given (b) X
-> Z so by Armstrong's Augmentation rule we have
(e) X Y -> Y Z.
Finally, by (d) and (e) and
transitivity, we have X -> Y Z, the desired result.
The idea is that we can use these new Rules as if they were axioms, as facts to prove yet other rules or special cases. Like proving theorem in Geometry, can then use to prove other theorems.
In what follows, when we list a set of FDs, we normally try to list a MINIMAL set, so that a smaller set doesn't exist that will imply these. It will turn out that finding a minimal set of FDs is very important in finding the right relational design by Normalization.
Example 6.6.4. Pg. 338. Consider the table T below fixed in content for all time so the intended FDs can be read off (VERY UNUSUAL). Let's try to list a minimal set of FDs.
Table T
|
row # |
A |
B |
C |
D |
|
1 |
a1 |
b1 |
c1 |
d1 |
|
2 |
a1 |
b1 |
c2 |
d2 |
|
3 |
a2 |
b1 |
c1 |
d3 |
|
4 |
a2 |
b1 |
c3 |
d4 |
Analysis. Start by considering FDs with a single attribute on the left.
Always have the trivial FDs, A -> A, B -> B, C -> C, and D -> D, but don't list trivial FDs in a minimal set.
(a) All values of the B attribute are the same, so it can never happen for any other attribute P (i.e., where P represents A, C, or D) that r1(P) = r2(P) while r1(B) ≠ r2(B); thus we see that A -> B, C -> B, and D -> B. At the same time no other attribute P is functionally dependent on B since they all have at least two distinct values, and so there are always two rows r1 and r2 such that r1(P) ≠ r2(P) while r1(B) = r2(B); thus: B -/-> A, B -/-> C, and B -/-> D.
(b) Because the D values are all different, in addition to D -> B of part (a), we also have D -> A and D -> C. We state: a KEY (D) functionally determines everything else, which will turn out to be the point. At the same time D is not functionally dependent on anything else since all other attributes have at least two duplicate values, so in addition to B -/-> D of part (a), we have A -/-> D, and C -/-> D.
(c) We have A -/-> C (because of rows 1 and 2) and C -/-> A (because of rows 1 and 3). Therefore, we can list all FDs (and failed FDs) with a single attribute on the left (we provide a letter in parentheses keyed to the paragraph above that give us each fact).
(a) A -> B (a) B -/-> A (c) C -/-> A (b) D -> A
(c) A -/-> C (a) B -/-> C (a) C -> B (a) D -> B
(b) A -/-> D (a) B -/-> D (b) C -/-> D (b) D -> C
By the union rule, whenever a single attribute on the left functionally determines several other attributes, as with D above, we can combine the attributes on the right: D -> A B C. From the analysis so far, we have the following set of FDs (which we believe to be minimal):
(1) A -> B, (2) C -> B, (3) D -> A B C
(What
is the key for this table? Note, really
D -> A B C D and
D->A B, D->B C, D->A C)
Now consider FDs with pairs of attributes on the left.
(d) Any pair containing D determines all other attributes, by FD (3) above and the augmentation rule, so there is no new FD with D on the left that is not already implied.
(e) The attribute B combined with any other attribute P on the left, still functionally determines only those attributes already determined by P, as we see by the following argument. If P -/-> Q this means there are rows r1 and r2 such that r1(Q) ≠ r2(Q) while r1(P) = r2(P). But because B has equal values on all rows, we know that r1(B P) = r2(B P) as well, so B P -/-> Q. Thus we get no new FDs with B on the left.
(f) Now the only pair of attributes that does not contain B or D is A C, and since A C has distinct values on each row (examine table T again!), we know that A C -> A B C D. This is new, but is it minimal?
It is trivial that A C -> A, and A C -> C, and we already knew that A -> B, so it is easy to show that A C -> B. (A C -> B C by augmentation, B C -> B by inclusion, so A C -> B by transitivity.) Thus the only new fact we get from seeing that A C -> A B C D is that A C -> D.
Now consider looking for FDs with triples of attributes on the left.
Any triple that does not contain D (which would assure that it functionally determine all other attributes) must contain A C (and therefore functionally determines all other attributes). So everything with three attributes on the left is already handled. Clearly the same holds for any set of four attributes on the left.
The complete set of FDs implicit in the table T is therefore the following:
(1) A -> B, (2) C -> B, (3) D -> A B C, (4) A C -> D.
------------------------------------------begin 2/3/03---------------------------------------------------
NOTE: We have examined above the concept of a FD, and derived some rules that apply to a SET of FDs. We have used as an example a single table: Table T
|
row # |
A |
B |
C |
D |
|
1 |
a1 |
b1 |
c1 |
d1 |
|
2 |
a1 |
b1 |
c2 |
d2 |
|
3 |
a2 |
b1 |
c1 |
d3 |
|
4 |
a2 |
b1 |
c3 |
d4 |
That we assume will never change, and reasoned about all the possible FDs that could exist between the columns of this table (one column DETERMINES (is the domain of a function) the value of another if whenever two rows in the domain column happen to have the same value, then the values in the determined column will also be the same.
The first three FDs come from the earlier list of FDs with single attributes on the left, while the last FD, A C -> D, is the new one generated with two attributes listed on the left. It will turn out that this set of FDs is not quite minimal, despite all our efforts to derive a minimal set. We will see this after we have had a chance to better define what we mean by a minimal set of FDs. This is why we need rigor.
(1) A -> B, (2) C -> B, (3) D -> A C, (4) A C -> D is minimal!
The original set can be derived as follows: (3) D -> A C is equivalent (by decomposition and union) to (a) D -> A and (b) D -> C. Furthermore, since (a) D -> A and (1) A -> B, by transitivity (c) D -> B. Now by (3) and union, D -> A B C.
Def 6.6.9. Closure of a set of FDs. Given a set F of FDs on attributes of a table T, we define the CLOSURE of F, symbolized by F+, to be the set of all FDs implied by F.
E.g., in above, D -> A B C was in the closure of the reduced set of FDs. The MINIMAL set of FDs is what we want, so take out implied ones!
There can be a LOT of FDs in a closure. Ex. 6.6.5, From F = {A -> B, B -> C, C -> D, D -> E, E -> F, F -> G, G -> H}, we can add trivial dependencies A -> A, B -> B, etc., and by transitivity and union get A -> A B, A -> A B C, . . . In fact any letter -> any subset of letters equal and to it's right in the alphabet. C -> C D E F , C -> D F, B -> C D F G. In fact, if we name any two sets of letters, one of them has an earliest letter in the alphabet and -> the other (maybe they both have that letter and -> each other). So the number of implied FDs is the number of subsets of letters squared, about (2n)2 = 22n. E.g., from A -> B, B -> C, . . . , J -> K, 10 FDs, get 220 FDs, about a million. If go up to . . . , X -> Y, Y -> Z, get 252 FDs, about 4,000,000,000,000,000 FDs (four quadrillion).
Really only want to arrive at a MINIMAL set of FDs, so we try to avoid this kind of explosion. Still we need all these concepts.
Def. 6.6.10. FD Set Cover. A set F of FDs on a table T is said to COVER another set G of FDs on T if the set G can be derived by implication rules from the set F, i.e., if G Í F+. If F covers G and G covers F, we say the two sets of FDs are equivalent, F ~ G.
Ex. 6.6.6. F = {B -> C D, A D -> E, B -> A} and G = {B -> C D E, B -> A B C, A D -> E}.
Demostrated in book how F covers G. But also G covers
Def. 6.6.11. Closure of a set of attributes. Given a set X of attributes in a table T and a set F of FDs on T, we define the CLOSURE of the set X (under F), denoted by X+, as the largest set of attributes Y such that X -> Y is in F+.
We will study closure of set of ATTRIBUTES, not closure of set of FDs.
Pg. 368. Algorithm 6.6.12. Set Closure. Algorithm to determine X+, the closure of a given set of attributes X, under a given set F of FDs.
I = 0; X[0] = X; /* integer I, attr. set X[0] */
REPEAT /* loop to find larger X[I] */
I = I + 1; /* new I */
X[I] = X[I-1]; /* initialize new X[I] */
FOR ALL Z -> W in F /* loop on all FDs Z -> W in F */
IF Z Í X[I] /* if Z contained in X[I] */
THEN X[I]= X[I]UNION W;/* add attributes in W to X[I] */
END FOR /* end loop on FDs */
UNTIL X[I] = X[I-1]; /* loop till no new attributes */
RETURN X+ = X[I]; /* return closure of X */
Note that the step in Algorithm 6.6.12 that adds attributes to X[I] is based on a simple inference rule that we call the Set Accumulation Rule, stated thus: If X -> Y Z and Z -> W then X -> Y Z W.
In our algorithm we are saying that since X -> X[I] (our inductive assumption) and X[I] can be represented as Y Z (because Z ÍX[I]), we can write X -> X[I] as X -> Y Z, and since F contains the FD Z -> W, we conclude by the set accumulation rule that X -> Y Z W or in other words X -> X[I] UNION W.
Example 6.6.7. In Example 6.6.6, we were given the set F of FDs:
F = {B -> C D, A D -> E, B -> A}
Given X = B, we determine that X+ = A B C D E. In terms of Algorithm 6.6.12, we start with X[0] = B. Then X[1] = B, and we begin to loop through the FDs. Because of B -> C D, X[1] = B C D. The next FD, A D -> E, does not apply at this time, since A D is not a subset of X[1]. Next because of B -> A, we get X[1] -> A B C D.
Now X[0] is strictly contained in X[1] (i.e., X[I-1] Ì X[I]) so X[I-1] ≠ X[I]. Thus we have made progress in this last pass of the loop and go on to a new pass, setting X[2] = X[1] = A B C D. Looping through the FDs again, we see that all of them can be applied (we could skip the ones that have been applied before since they will have no new effect), with the only new FD, A D -> E, giving us X[2] = A B C D E.
At the end of this loop, the algorithm notes that X[1] Ì X[2], progress has been made, so we go on to create X[3] and loop though the FDs again, ending up this pass with X[3] = X[2].
Finding a Minimal Cover of a set of FDs.
Algorithm 6.6.13, pg. 369. BASIC TO NORMALIZATION! Given a set F of FDs, construct a set M of FDs that is minimal and covers F.
Let's apply this to the (non-reduced) set of FDs above.
F: (1) A -> B, (2) C -> B, (3) D -> A B C, (4) A C -> D
^ Remember, this can come out!!
Step 1, From the set F of FDs, create an equivalent set H with only single FDs on the right. Use decomposition rule.
H: (1) A -> B, (2) C -> B, (3) D -> A, (4) D -> B, (5) D -> C, (6) A C -> D
Step 2. Remove inessential FDs from the set H to get the set J. Determine inessential X -> A if A is in X+ under FDs without X -> A.
Try removing (1) A -> B, leaving only (2) C -> B, (3) D -> A, (4) D -> B, (5) D -> C, (6) A C -> D. Take X = A in closure algorithm, clearly get only X+ = A, because no other FD has its left side contained in X. So need (1).
Try removing others. (2) stays, since no other C on left. Therefore if set X = C couldn't get X+ to contain more than C. (That reasoning is OK.)
How about (3)? Would be left with only: (1) A -> B, (2) C -> B, (4) D -> B, (5) D -> C, (6) A C -> D. Consider X = D. Get X+ = D B (4) C (5). Then stop. In fact A not on right of any FDs if take out (3), so (3) needed.
Now try removing (4). Keep: (1) A -> B, (2) C -> B, (3) D -> A, (5) D -> C, (6) A C -> D. Can we derive D -> B? X = D. Get: D A (3) C (5) B (2). Therefore, got D -> B. So can leave (4) out.
Can't leave out (5) because C not on right of any other FD. Can't leave out (6) because D not on right of any other. Therefore can only reduce set to:
J = (1) A -> B, (2) C -> B, (3) D -> A, (4) D -> C, (5) A C -> D (Renumber)
Step 3. Successively replace FDs in J with FDs that have a smaller number of FDs on the left-hand side so long as J+ remains the same.
Test this by successively removing single attributes from multi-attribute left hand sides of FDs, changing X -> A to Y -> A, then checking if Y+ under new FD set is unchanged. (Clearly if we assume Y -> A, and Y Ã X, can derive everything used to be able to: still true that X -> A. ONLY RISK is that Y -> A might imply TOO MUCH. I.e., might have Y+ is LARGER than before!)
Only one to try is (5). If we change this to A -> D, does D change? used to be A+ = A B, now, A+ = A B D C. No good. How about changing A C -> D to C -> D? Does C+ change? Used to be C+ = C B. Now C+ = C B D A. So no good, can't reduce. (NO NEED TO TRY STEP 2 AGAIN.)
IF WE DID REDUCE and created a new FD, let us say A -> D to replace A C -> D, we would need to apply Step 2 again to test if A -> D could be removed!!!
Step 4. Apply Union rules to bring things back together on the right for common sets of attributes on the left of FDs, renamed M.
J: (1) A -> B, (2) C -> B, (3) D -> A, (4) D -> C, (5) A C -> D
M: (1) A -> B, (2) C -> B, (3) D -> A C, (4) A C -> D
This is the reduced set, above.
Section 6.7. Lossy and Lossless decomposition. We're going to be factoring tables into smaller tables (projecting onto two subsets of columns that cover all columns and have some columns in common), but it doesn't always work when join back that keep all information of original table.
Always get ALL rows back, but might get MORE. Lose Information. See Example 6.7.1 in text, Pg. 374, a Lossy decomposition.
Ex 6.7.1. A Lossy Decomposition. Consider table, ABC:
Table ABC
|
A |
B |
C |
|
a1 |
100 |
c1 |
|
a2 |
200 |
c2 |
|
a3 |
300 |
c3 |
|
a4 |
200 |
c4 |
If we factor this table into two parts, AB and BC, we get the following table contents:
Table AB Table BC
|
A |
B |
|
B |
C |
|
a1 |
100 |
|
100 |
c1 |
|
a2 |
200 |
|
200 |
c2 |
|
a3 |
300 |
|
300 |
c3 |
|
a4 |
200 |
|
200 |
c4 |
However, the result of joining these two tables is
AB JOIN BC
|
A |
B |
C |
|
a1 |
100 |
c1 |
|
a2 |
200 |
c2 |
|
a2 |
200 |
c4 |
|
a3 |
300 |
c3 |
|
a4 |
200 |
c2 |
|
a4 |
200 |
c4 |
This is NOT the original table content for ABC! Note that the same decomposed tables AB and BC would have arisen if the table we had started with was ABCX, with content equal to AB JOIN BC above, or either of two other tables, ABCY or ABCZ:
ABCY ABCZ
|
A |
B |
C |
|
A |
B |
C |
|
a1 |
100 |
c1 |
|
a1 |
100 |
c1 |
|
a2 |
200 |
c2 |
|
a2 |
200 |
c2 |
|
a2 |
200 |
c4 |
|
a3 |
300 |
c3 |
|
a3 |
300 |
c3 |
|
a4 |
200 |
c2 |
|
a4 |
200 |
c4 |
|
a4 |
200 |
c4 |
Since we can't tell what table content we started from, information has been lost by this decomposition and the subsequent join.
This is known as a Lossy Decomposition, or sometimes a Lossy-Join Decomposition.
Reason we lose information is many to many matching on join columns. Lose which one on the left was with which one on the right. E.g. a2, 200 on left and a4, 200 on left EACH match with 200, c2 and 200, c4 on right in a join, but all for combinations were not part of the original table. Would be OK if always had N-1 relationships on join columns (or 1-1). E.g., CAP, orders can have multiple rows with same pid, but pid is unique for products. Must ALWAYS be unique on one side, not an accident, so need join columns to be SUPERKEY on at least ONE SIDE of join.
Theorem 6.7.3. Given a table T with a set of FDs F and a set of attributes X in Head(T) then X is a superkey of T iff X functionally determines all attributes in T, i.e., X -> Head(T) is in F+.
Theorem 6.4.7. Given a table T with a set F of FDs valid on T, then a decomposition of T into two tables {T1, T2} is a lossless decomposition if one of the following functional dependencies is implied by F:
(1) Head(T1) Ç Head(T2) -> Head(T1), or
(2) Head(T1) Ç Head(T2) -> Head(T2)
Note that in the first example Head(T1) Ç Head(T2) = B, but B-/-> Head(T1), and B-/-> Head(T2),
Section 6.8. Normal Forms
We start with a Universal table T and a set of FDs that hold on T. We then create a lossless decompostion:
T = T1 JOIN T2 JOIN . . . JOIN Tn
so that in each of the tables Ti the anomalies we studied earlier not present. Consider the following Universal table called emp_info:
emp_id, emp_name, emp_phone, dept_name, dept_phone, dept_mgrname, skill_id, skill_name, skill_date, skill_lvl
1234 abcdefgh 12345 pqrstu 45678 tuvwxyz jkl246 mnopqur 012302 5
(1) emp_id -> emp_name emp_phone dept_name [in every row with emp_id, name, phone, and dept will have the same value]
(2) dept_name -> dept_phone dept_mgrname [in every row with dept_name, phone and dept_mgrname will have the same value]
(3) skill_id -> skill_name
(4) emp_id skill_id -> skill_date skill_lvl
In what follows, we will factor emp_info into smaller tables which form a lossless join. A set of table headings (decomposition tables) together with a set of FDs on those heading attributes is called a Database Schema.
In emp_info, if delete record of only employee with given skill_id, lose information about that skill (in this case, the name of the skill). So have to have separate table for skills.
What we are really saying here is that emp_id skill_id is the key for emp_info, but a SUBSET of this key functionally determines some skills attributes. This is to be avoided because of the delete anomaly. Similarly, bunch of attributes are functionally determined by emp_id (including a bunch of dept information by transitivity). These attributes also don't belong in emp_info with key of two attributes.
There's a cute way of saying what FDs should appear in a normalized table: every attribute must be determined by "the key, the whole key, and nothing but the key." Need separate table for skills because of "the whole key".
Now have (pg. 385): emps table (emp_id, emp_name, emp_phone, dept_name, dept_phone, dept_mgrname),
skills table (skill_id, skill_name), and
emp_skills table (emp_id, skill_id, skill_date, skill_lvl).
But, if delete only employee in a department (during reorganization), lose info about the department, such as departmental phone number.
Problem here is that dept_phone, for example, is dependent on dept_name. So fail on "nothing but the key".
This is what is known as a "transitive dependency" or "transitive FD", and cannot exist in a well-behave table with no delete anomaly.
OK, now Figure 6.26, pg. 386: emps table, depts table, emp_skills table, skills table. Three entities and a relationship. Note that there is one table for each FD. Not an accident.
BCNF, 3NF, 2NF
2NF is too weak, no DBA would stop there. Essentially, means stopping with depts in emps table. Define near end of section as note.
BCNF definition is more intuitive than 3NF, but 3NF and BCNF are equivalent for the schema we have just been dealing with. Need special situation for difference to arise.
Def. 6.8.4. Boyce-Codd Normal Form, or BCNF. A table T in a database schema with FD set F is in BCNF iff, for any FD X -> A in F+ that lies in T (all attributes of X and A in T), A is a single attribute not in X, then X must be a superkey for T.
OR: For a non-trivial FD X -> Y where X is minimal, X must be the key for T.
The minimal cover for the FDs whose attributes lie in F are all of the form X -> A, where X is a key for T. Note NOT saying it is a PRIMARY key — might be more than one key. Every attribute in T is determined by a key (definition), the whole key (no subset of it) and nothing but the key (any FD determining A turns out to contain a key).
OK, now assume add attributes to emps table attributes to keep track of the street address (emp_straddr), the city and state (emp_cityst) and ZIPCODE (emp_zip). These are all determined by emp_id as now:
(1) emp_id -> emp_name emp_phone dept_name emp_straddr emp_cityst emp_zip (every employee has only one mailing address).
Now the way the PO has assigned ZIPCODES, turn out have new FDs:
(5) emp_cityst emp_straddr -> emp_zip
(6) emp_zip -> emp_cityst
Consider the extended emps table with these added attributes:
emps table: emp_id emp_name emp_phone dept_name emp_straddr emp_cityst emp_zip
This still has key emp_id, but now there are FDs that are NOT determined solely by the key so emps must be factored while all other tables remain the same:
emps table: emp_id emp_name emp_phone dept_name emp_straddr emp_cityst
empadds table: emp_straddr emp_cityst emp_zip
Note emp_cityst, emp_straddr is key for empadds and in intersection of factored tables (it is a foreign key in emps that determines the address held in empadds), so lossless.
But FD (6) still isn't derived from key in empadds. So for BCNF, empadds must factor further (emps same)
empadds table: emp_zip emp_straddr
zip table: emp_zip emp_cityst
Now empadds has a key consisting of both columns (there is no FD whose left-hand side is in empadds). zip table has emp_zip as key (FD (6)).
PROBLEM: FD 5 doesn't lie in any table. But we always try to "preserve" each FD X -> Y, meaning that X union Y lies in the head of some Ti. This allows us (for example) to update the street address, city, & state for an employee and test in one table that ZIP was entered correctly.
Thus BCNF has a bad property, and we want to fall back to design of previous database schema. Need a new normal form.
Def. 6.8.5. A PRIME ATTRIBUTE of a table T is any attribute that is part of a key for that table (not necessarily a primary key).
Def. 6.8.6. Third Normal Form (3NF). A table T in a database schema with FD set F is in 3NF iff, for any FD X -> A implied by F that lies in T, if A is a single non-prime attribute not in X, then X must be a superkey for T (BCNF with an escape clause, that can allow the implied condition, X must be a superkey of T, to fail if A is a prime attribute.)
OK, now with weakened 3NF and two FDs:
(5) emp_cityst emp_straddr -> emp_zip
(6) emp_zip -> emp_cityst
Can have all three attributes in a single empadds table, since emp_cityst emp_straddr is the key, and FD (6) doesn't break the rule for X -> A because A = emp_cityst is a prime attribute.
Now have: every non-prime attribute is determined by the key, the whole key, and nothing but the key.
OK, Algorithm 6.8.8 to achieved well behaved 3NF decomposition, pg 393
Given a universal table T and a set of FDs F, generate a set S of headings for a database schema in 3NF that is a lossless decomposition and preserves all FDs in F.
Start by finding minimal cover (Algorithm 6.6.13). Then loop on all FDs, and for every FD, if there is not already a table heading that contains the FD, create a new heading (new table). Finally, if there is a key K for T that is not in any table, create a new table that contains it.
Def. 6.8.8. Second Normal Form (2NF). A table T with FD set F is in 2NF iff: for any X -> A implied by F that lies in T, where A is a single non-prime attribute not in X, X is not properly contained in any key of T.
A database schema is 2NF iff all of its tables are 2NF.
EXAMPLE Assume we wish to construct a database from a set of data items, {A, B, C, D, E, F, G, H} (which will become attributes in tables), and a set F of FDs given by:
F: (1) A -> B C , (2) A B E -> C D G H, (3) C -> G D, (4) D -> G, (5) E -> F
(a) Find the minimal cover for this set of FDs, and name this set G.
Step 1. Decompose rhs
H = (1) A->B, (2) A->C, (3) ABE->C, (4) ABE->D, (5) ABE->G, (6) ABE->H, (7) C->G, (8) C->D, (9) D->G, (10) E->F
Step 2. Test extraneous dependences
(1) A->B: J = rest, X+ = AC(2)G(5)D(8), not containing B, so keep.
(2) A->C, J = rest, X+ = A, keep.
(3) ABE->C, X+ = ABEC(2)D(4)G(5)F(6), containing C, drop.
J: (1) A->B, (2) A->C, (4) ABE->D, (5) ABE->G, (6) ABE->H, (7) C->G, (8) C->D, (9) D->G, (10) E->F
(4) ABE->D: X+ = ABEC(2)G(5)D(8)F(10), containing D, drop.
J: (1) A->B, (2) A->C, (5) ABE->G, (6) ABE->H, (7) C->G, (8) C->D, (9) D->G, (10) E->F
(5) ABE->G: X+ = ABEC(2)G(7)D(8)F(10)G, containing G, drop.
J: (1) A->B, (2) A->C, (6) ABE->H, (7) C->G, (8) C->D, (9) D->G, (10) E->F
(6) ABE->H: X+ = ABEC(2)G(7)D(8)F(10), not containing H, keep.
(7) C->G: X+ = CD(8)G(9) containing G, drop.
J: (1) A->B, (2) A->C, (6) ABE->H, (8) C->D, (9) D->G, (10) E->F
(8) C->D: X+ = C, keep.
(9) D->G: X+ = D, keep.
(10) E->F: x+ = E, keep.
Result J: (1) A->B, (2) A->C, (3) ABE->H, (4) C->D, (5) D->G, (6) E->F
Step 3. Try to drop something from lhs of (3) ABE->H:
AB-> H instead in J': AB+ under J' = ABC(2)H(3)D(4)G(5)
AB+ under J = ABC(2)D(4)G(5) Don't get H. Not the same.
AE->H instead in J': AE+ under J' = AEB(1)C(2)H(3)D(4)G(5)F(6),
AE+ under J = AEB(1)C(2)H(3)D(4)G(5)F(10). Same.
Rename J new H: (1) A->B, (2) A->C, (3) AE->H, (4) C->D, (5) D->G, (6) E->F.
Must repeat Step 2.
(1) A->B. Without (1) no B on rhs.
(2) A->C. Without (2) no C on rhs.
(3) AE->H. Without (3) no H on rhs.
(4) C->D. Without (4) no D on rhs.
(5) D->G. Without (5) no G on rhs.
(6) E->F. Without (6) no F on rhs.
Step 3. cannot replace AE->H with A->H or E->H
Step 4. Union rhs: M: (1) A->BC, (2) AE->H, (3) C->D, (4) D->G, (6) E->F
(b) Start with the table T containing all these attributes, and perform a lossless decomposition into a 2NF but not 3NF schema.
M: (1) A->BC, (2) AE->H, (3) C->D, (4) D->G, (5) E->F
T = (A, B, C, D, E, F, G, H). Since A and E only occur on lhs of FDs in M, AE must be in any key, and AE+ = AEBC(1)H(2)D(3)G(4)F(5) = ABCDEFGH, so AE is the key for T.
FD (5), E->F, constitutes a key-proper-subset dependency, and thus needs decomposition for 2NF. E+ = EF(5), so let T1 = (E,F) and T2 = (A,B,C,D,E,G,H) and Then E is the key for T1 and AE is the key for T2. (Note we leave Key in old table for intersection, but take out dependent attribute that would cause bad FD to exist, E->F.)
Similarly (1) A->BC is also a key-proper-subset dependency, and A+ = ABC(1)D(3)G(4), so we further decompose: T1= (E,F), T2 = (A,E,H), T3 = (A,B,C,D,G).
Now FD E->F lies in T1, FD AE->H lies in T2, and FDs A->BC, C->D and D->G lie in T3. C->D and D->G are transitive dependencies, allowed in 2NF, so we go no further here. These are lossless decompositions by Theorem 6.7.4. In the first, we note that head(T1) INTERSECT head(T2) = E, the key for T1. In the second, head(T2) INTERSECT head(T3) = A, the key for T3. The idea here is that A and E are "dimensions" on which other attributes are dependent: need AE, A, and E all separate if there are dependencies in all cases.
(c) Continue decomposition to bring this database to 3NF. Is this table also BCNF?
To complete the decomposition to 3NF, we need to factor out the transitive dependencies. Have T3 = (A,B,C,D,G), with FDs: (3) C->D and (4) D->G. Change T3 to (A,B,C) and create new tables T4 = (C,D) and T5=(D,G). Now we have one FD contained in each table. In each case the lhs is a key for its table, so the schema is in BCNF. (Don't have to worry about FDs with non-prime lhs.)
(d) Use Algorithm 6.8.8 and the set G of FDs to achieve a lossless 3NF decomposition that preserves FDs of G. I
S = nullset. Loop through all the FDs of M:
M = { A->BC, AE->H, C->D, D->G, E->F}
(Loop through FDs, and if not already contained in table of S, create new)
A->BC: S = {ABC}.
AE->H: AEH not contained in ABC, so add it: S={ABC,AEH}
C->D: CD not contained in either yet, so add it: S= {ABC,AEH,CD}
D->G: DG not contained in any yet, so add it: S = {ABC,AEH,CD,DG}
E->F: EF not contained in any yet, so add it: S = {ABC,AEH,CD,DG,EF}
AE is the only candidate key, and it is contained in AEH, so done.
Is this is the same decomposition as in c?