Chapter 10
Def. 10.1 Transaction. A transaction is a means to package together a number of database operations performed by a process, so the database system can provide several guarantees, called the ACID properties.
Think of writing:
BEGIN TRANSACTION op1 op2 . . .
opN END TRANSACTION
Then all ops within the transaction are packaged together. There is no actual BEGIN TRANSACTION
statement in SQL. A transaction is begun by a system when there is none in
progress and the application first performs an operation that accesses
data: Select, Insert, Update, etc.
When a program makes a call to ORACLE using Embedded SQL, it receives a code back that indicates to the program the outcome of the call. If it was performed as expected, the application program can end a transaction successfully by executing:
exec sql commit work; /* called simply a Commit */
Then any updates performed by operations in the transaction are successfully completed and made permanent and all locks on data items are released. Alternatively, if an error code is returned after the ORACLE call the application program executes:
exec sql rollback work; /* called an Abort */
and all updates are reversed, and locks on data items are released. Thus, an application program implements the concept of a transaction by what unit of work it commits or aborts.
The ACID guarantees are extremely important -- This and SQL is what differentiates a database from a file system.
Imagine that you are trying to do banking applications on the UNIX file system (which has it's own buffers, but no transactions). There will be a number of problems, the kind that faced database practitioners in the 50s.
1. Inconsistent result. Our application is transferring money from one account to another (different pages). One account balance gets out to disk (run out of buffer space) and then the computer crashes. When bring computer up again, have no idea what used to be in memory buffer, and on disk we have destroyed money.
2. Errors of concurrent execution. (One kind: Inconsistent Analysis.) Teller 1 transfers money from Acct A to Acct B of the same customer, while Teller 2 is performing a credit check by adding balances of A and B. Teller 2 can see A after transfer subtracted, B before transfer added.
3. Uncertainty as to when changes become permanent. At the very least, we want to know when it is safe to hand out money: don't want to forget we did it if system crashes, then only data on disk is safe. We want this to happen at transaction commit. And don't want to have to write out all rows involved in transaction (teller cash balance -- very popular, we buffer it to save reads and want to save writes as well).
To solve these problems, systems analysts came up with idea of transaction (formalized in 1970s). Here are ACID guarantees:
Atomicity. The set of record updates that are part of a transaction are indivisible (either they all happen or none happen). This is true even in presence of a crash (see Durability, below).
Consistency. If all the individual processes follow certain rules (money is neither created nor destroyed) and use transactions right, then the rules won't be broken by any set of transactions acting together. Implied by Isolation, below.
Isolation. Means that operations of different transactions seem not to be interleaved in time -- as if ALL operations of one Tx before or after all operations of any other Ty.
Durability. When the system returns to the logic after a Commit Work statement, it guarantees that all Tx Updates are on disk. Now ATM machine can hand out money.
The system is kind of clever about Durability. It doesn't want to force all updated disk pages out of buffer onto disk with each Tx Commit. So it writes a set of notes to itself on disk (called logs). After crash run Recovery (also called Restart) and makes sure notes translate into appropriate updates.
What about Read-Only Tx? (No data updates, only Selects.) Atomicity and Durability have no effect, but Isolation does.
Money spent on Transactional systems today is about SIX BILLION DOLLARS A YEAR. We're being rigorous about some of this for a BUSINESS reason.
10.1 Transactional Histories.
Reads and Writes of data items. Let’s examine the operations that might be grouped into a transaction Ti. A data item might be a row of a table or it might be an index entry or set of entries. For now talking about rows. Read a data item when access it without changing it. Often a select.
select val into :pgmval1 from T1 where uniqueid = A;
We will write this as Ri(A): transaction with identification number i reads data item A.
update T1 set val = pgmval2 where uniqueid = B;
we will write this as Wj(B); Tj writes B; say Update results in Write.
To get the row to update it ORACLE may read an index entry as well to write B. In the previous case to get to row B we have to read an index entry, Rj (predicate: uniqueid = B), then a pair of row operations: Rj(B) (have to read it first, then update it) Wj(B). Have to read it in this case before can write it.
update T set val = val + 2 where uniqueid between :low and :high;
Will result in a lot of operations: Rj(predicate: uniqueid between :low and :high), then Rj(B1) Wj(B1) Rj(B2) Wj(B2) . . . Rj(BN) Wj(BN).
>>The reason for this notation is that often have to consider complex interleaved histories of concurrent transactions; Example history:
(10.1.2) . . . R2(A) W2(A) R1(A) R1(B) R2(B) W2(B) C1 C2 . . .
Note Ci means commit by Ti. A sequence of operations like this is known as a History or sometimes a Schedule. A history results from a series of operations submitted by users, translated into R & W operations at the level of the Scheduler. See Fig. 10.1.
It is the job of the scheduler to look at the history of operations as it comes in and provide the Isolation guarantee, by sometimes delaying some operations, and occasionally insisting that some transactions be aborted. By this means it assures that the sequence of operations is equivalent in effect to some serial schedule (all ops of a Tx are performed in sequence with no interleaving with other transactions). See Figure 10.1, pg. 640.
In fact, (10.1.2) above is an ILLEGAL schedule. Because we can THINK of a situation where this sequence of operations gives an inconsistent result.
Example 10.1.1. Say that the two elements A and B in (10.1.2) are Acct records with each having balance 50 to begin with and suppose two transactions are executing simultaneously: T1 is adding up balances of two accounts, T2 is transferring 30 units from A to B. The first transaction does two reads to get its summands, while the second transaction reads and writes both the A row and the B row to record that $30 has moved between them. Suppose that the actual sequence of these reads and writes occurs in the following schedule (value read or written indicated)
. . . R2(A, 50) W2(A, 20) R1(A, 20) R1(B, 50) R2(B, 50) W2(B, 80) C1 C2 . . .
And T determines that the customer fails the credit check (because under balance total of 80, say). But this could never have happened in a serial schedule, where all operation of T2 occurred before or after all operations of T2.
. . . R2(A, 50) W2(A, 20) R2(B, 50) W2(B, 80) C2 R1(A, 20) R1(B, 80) C2 . . .
or
. . . R1(A, 50) R1(B, 50) C1 R2(A, 50) W2(A, 20) R2(B, 50) W2(B, 80) C2 . . .
And in both cases, T1 sees total of 100, a Consistent View.
Notice we INTERPRETED the Reads and Writes of (10.1.2) to create a model of what was being read and written to show there could be a schedule that produced an inconsistency. How did we know that such an inconsistent ordering existed? We have to give the Scheduler program a list of rules by which it can decide if it is possible to overlap two transactions without risk.
10.2. Interleaved Read/Write Operations
If a serial history is always consistent, why don't we just enforce serial histories.
The scheduler could take the first operation that it encounters of a given transaction (T2 in the above example) and delay all ops of other Txs (the Scheduler is allowed to do this) until all operations of T2 are completed and the transaction commits (C2).
Reason we don't do this? Performance. It turns out that an average Tx has a relatively small CPU bursts and then I/O during which CPU has nothing to do. See Fig 10.3, pg. 644. When I/O is complete, CPU can start up again. What do we want to do? Let another Tx run (call this another thread) during slack CPU time. (Interleave). Doesn't help much if have only one disk (disk is bottleneck). See Fig 10.4, pg. 644.
But if we have two disks in use all the time we get about twice the throughput. Fig 10.5, pg. 645. And if we have many disks in use, we can keep the CPU 100% occupied. Fig 10.6, pg 646. In actuality, everything doesn't work out perfectly evenly as in Fig 10.6. Having multiple threads and multiple disks is still like throwing darts at slots, sometimes you still have contention.
The goal of the DBA, however is to try to have enough threads running to keep lots of disk occupied so CPU is 90% occupied. When one thread does an I/O, want to find another thread with completed I/O ready to run again. This requires have the right number of disks with the data in the right places and the right speed CPU to match the workload.
10.3 Serializability and the Precedence Graph.
We want to come up with a set of rules for the Scheduler to allow operations by interleaved transactions and guarantee Serializability. Serializability means the series of operations is EQUIVALENT to a Serial schedule, where operations of Tx are not interleaved. How can we guarantee this?
First notice that all the Scheduler can do is commute the order of two operations (delay one that arrived first). We can commute ops in the history of requests until all ops of one Tx are together (serial history), but we may not commute operations within a single Tx (imagine a transaction is a string of beads of one color with several inches of string between each bead. We have several different colored strings which we intermingle to from a rope so that all the beads from all the strings are in a line. Which of all the different possible interminglings produce execution results equivalent to processing each string one at a time?). Note that if two transactions never access the same data items, it doesn't matter that they're interleaved. The operations don't affect each other, and order of pair of operations from each Tx doesn't matter.
Therefore the Scheduler only has to be concerned about the operations by two different transactions that do affect the same data item. There are only four possibilities:
1. R or W by T1 followed by R or W by T2.
Consider history: If the Scheduler receives the operation sequence
. . . R1(A) . . . W2(A) . . .
Would it matter if the order were reversed? YES. Can easily imagine an interpretation where T2 changes data T1 reads: if T1 reads it first, sees old version, if reads it after T2 changes it, sees later version.
We use the notation: R1(A) <<H W2(A)
to mean that R1(A) comes before W2(A) in H, and what we have just noticed is that whenever we have the ordering in H we must also ensure that the Scheduler maintains that order when it performs (serializes) the operations:
R1(A) <<S(H) W2(A)
Thus we have established the rule If R1(A) <<H W2(A) then R1(A) << S(H) W2(A).
2. Now these transaction numbers are just arbitrarily assigned labels, so it is clear we could have written the above as follows:
If R2(A) <<H W1(A) then R2(A) << S(H) W1(A).
3. Now what can we say about the following?
R1(A) <<H R2(A)
This can be commuted -- reads can come in any order since they don't affect each other. Note that if there is a third transaction, T3, where:
R1(A) <<H W3(A) <<H R2(A)
Then the reads cannot be commuted (because we cannot commute either one with W3(A)), but this is because of application of the earlier rules, not depending on the reads as they affect each other.
4. Finally, we consider:
W1(A) <<H W2(A)
And it should be clear that these two operations cannot commute. The ultimate outcome of the value of A would change. That is:
If W1(A) <<H W2(A) then W1(A) <<S(H) W2(A)
Def. 10.3.1. (pg. 650) Two operations Xi(A) and Yj(B) in a history are said to conflict (i.e., the order matters) if and only if the following three conditions hold:
(1) A ∫ B. Operations on distinct data items never conflict.
(2) i ≠ j. Operations conflict only if they are performed by different Txs.
(3) One of the two operations X or Y is a write, W. (Other can be R or W.)
Note in connection with (2) that two operations of the SAME transaction also cannot be commuted in a history, but not because they conflict. If the scheduler delays the first, the second one will not be advanced (operations within a transaction never change order).
We shall now show how some histories can be shown not to be serializable. Then we show that such histories can be characterized by an easily identified characteristic in terms of conflicting operations. The Transaction Manger module is interacting with multiple application programs that are generating Read and Write requests to the database. The Transaction Manager would like to overlap the operations of different transactions to maintain CPU utilization, so it generates an interleaved stream of transactions for the scheduler to execute in some order. We show that sometimes a stream of operations will arrive at the scheduler that it CANNOT interpret unambiguously, ie., it cannot reconstruct the original transactions. The whole idea of a transaction is to perform changes to that database in a way that insures data consistency. IF THE TRANSACTION MANAGER INTERLEAVES THE OPERATIONS OF TRANSACTIONS SUCH THAT THE DATABASE IS CHANGED TO AN INCONSISTENT STATE, THEN THE SCHEDULER CANNOT DETERMINE WHAT SEQUENCE OF THE OPERATIONS OF THE ORIGINAL TRANSACTIONS IS EQUIVALENT, SINCE NO TRANSACTION SEQUENCE WOULD HAVE DONE THAT!
To show that a history is not serializable (SR), we use an interpretation of the history.
Def. 10.3.2. An interpretation of an arbitrary history H consists of 3 parts. (1) A description of the purpose of the logic being performed. (2) Specification of precise values for data items being read and written in the history. (3) A consistency rule, a property that is obviously preserved by isolated transactions of the logic defined in (1).
Example 10.3.1. Here is a history, H1, we claim is not SR.
H1 = R2(A) W2(A) R1(A) R1(B) R2(B) W2(B) C1 C2
Here is an interpretation. T1 is doing a credit check, adding up the balances of A and B. T is transferring money from A to B. Here is the consistency rule: Neither transaction creates or destroys money. Values for H1 are:
H1' = R2(A,50) W2(A,20) R1(A,20) R1(B,50) R2(B,50) W2(B,70) C1 C2
The schedule H1 is not SR because H1' shows an inconsistent result: sum of 70 for balances A and B, though no money was destroyed by T2 in the transfer from A to B. This could not have occurred in a serial execution.
The concept of conflicting operations gives us a direct way to confirm that the history H1 is not SR. Note the second and third operations of H1, W2(A) and R1(A). Since W(A) comes before R(A) in H1, written: W2(A) <<H1 R1(A) we know since these operations conflict that they must occur in the same order in any equivalent serial history, S(H1), i.e.: W2(A) <<S(H1) R1(A) One possible serialization of H1 would be to collect all of the operations of T1, and T2 together, so in this serialization we could only do this if T2 occurs before T1
But now consider the fourth and sixth operations of H1. We have: R1(B) <<H1 W2(B) and since these operations conflict, we also have R1(B) <<S(H1) W2(B). But this implies that T1 comes before T2 , and this is at odds with our previous conclusion. In any serial history S(H1), either T1 <<S(H1) T2 or T2 <<S(H1) T1, not both. Since we conclude from examining H1 that both occur, S(H1) must not really exist. Therefore, H1 is not SR.
We illustrate this technique a few more times, and then prove a general characterization of SR histories in terms of conflicting operations.
Ex. 10.3.2. Consider the history: H2 = R1(A) R2(A) W1(A) W2(A) C1 C2
We give a interpretation of this history as a paradigm called a lost update.
1. Assume that A is a bank balance starting with the value 100 and T1 tries to add 40 to the balance at the same time that T2 tries to add 50.
2. Amplifying H2: H2' = R1(A, 100) R2(A, 100) W1(A, 140) W2(A, 150) C1 C2
3. Clearly the final result is 150, and we have lost the update where we added 40. This couldn't happen in a serial schedule, so H2 is non-SR.
In terms of conflicting operations, note that operations 1 and 4 imply that T1 <<S(H2) T2. But operations 2 and 3 imply that T2 <<S(H2) T1. No SR schedule could have both these properties, therefore H2 is non-SR.
Ex 10.3.3. Consider the history: H3 = W1(A) W2(A) W2(B) W1(B) C1 C2
1. Understand that these are what are known as "Blind Writes": there are no Reads at all involved in the transactions. Assume the logic of the program is that T1 and T2 are both meant to "top up" the two accounts A and B, setting the sum of the balances to 100.
2. T1 does this by setting A and B both to 50, T2 does it by setting A to 80 and B to 20. Here is the result for the interleaved history H3.
H3' = W1(A, 50) W2(A, 80) W2(B, 80) W1(B, 50) C1 C2
3. Clearly in any serial execution, the result would have A + B = 100. But with H3' the end value for A is 80 and for B is 50.
To show that H3 is non-SR by using conflicting operations, note that operations 1 and 2 imply T1 << T2, and operations 3 and 4 that T2 << T1.
Seemingly, the argument that an interleaved history H is non-SR seems to reduce to looking at conflicting pairs of operations and keeping track of the order in which the transactions will occur in an equivalent S(H).
v When there are two transactions, T1 and T2, we expect to find in a non-SR schedule that T1 <<S(H) T2 and T2 <<S(H) T1, an impossibility.
o If we don't have such an impossibility arise from conflicting operations in a history H, does that mean that H is SR?
o what about histories with 3 or more transactions involved? Will we ever see something impossible other than T1 <<S(H) T2 and T2 <<S(H) T1?
We start by defining a Precedence Graph [by the time you are a CS senior you should have seen a precedence graph before]. Def. 10.3.3. A precedence graph for a history H is a directed graph denoted by PG(H). [A directed graph has edges (arcs) with arrows indicating the only way that the arc can be traversed.]
v The vertices of PG(H) correspond to the transactions that have COMMITTED in H: that is, transaction Ti where C exists as an operation in H.
v An edge Ti -> Tj exists in PG(H) whenever two conflicting operations Xi and Yj occur in that order in H. Thus, Ti -> Tj should be interpreted to mean that Ti must precede Tj in any equivalent serial history S(H).
v Whenever a pair of operations conflict in H for COMMITTED transactions, we can draw the corresponding direct arc in the Precedence Graph, PG(H).
o The reason uncommitted transactions don't count is that we're trying to figure out what the scheduler can allow. Uncommitted transactions can always be aborted, and then it will be as if they didn't happen.
o It would be unfair to hold uncommitted transactions against the scheduler by saying the history is non-SR because of them.
v The examples we have given above of impossible conditions arising from conflicting operations look like this:
T1 ----------------------> T2
^<------------------------/
o Of course this is what is called a circuit in a directed graph (a digraph). This suggests other problems that could arise with 3 or more Txs.
o It should be clear that if PG(H) has a circuit, there is no way to put the transactions in serial order so Ti always comes before Tj for all edges Ti -> Tj in the circuit.
o How do we make this intuition rigorous? And if PG(H) doesn't have a circuit, does that mean the history is SR?
ONCE WE HAVE INTERPRETED OUR INTERMINGLED TRANSACTIONS AS ARCS IN A DIGRAPH WHOSE ORDERING IS DETERMINED BY CONFLICTING OPERATIONS WE CAN USE THE RESULTS OF DIGRAPH THEOREMS. ONE SUCH IS Lemma 10.3.5. In any finite directed acyclic graph G there is always a vertex with no edges entering it.
Proof. Choose any vertex v1 from G. Either this has the desired property, or there is an edge entering it from another vertex v2. (There might be several edges entering v1, but choose one.) Now v2 either has the desired property or there is an edge entering it from vertex v3. We continue in this way, and either the sequence stops at some vertex vm, or the sequence continues forever.
v If the sequence stops at a vertex vm, that's because there is no edge entering vm, and we have found the desired vertex.
v But if the sequence continues forever, since this is a finite graph, sooner or later in the sequence we will have to have a repeated vertex.
o Say that when we add vertex vn, it is the same vertex as some previously mentioned vertex in the sequence, vi.
o Then there is a path from vn -> v(n-1) -> . . . v(i+1) ->vi, where vi = vn. But this is the definition of a circuit, which we said was impossible.
Therefore the sequence had to terminate with vm and that vertex was the one desired with no edges entering.
WE THINK OF A TREE AS AN ACYCLIC DIGRAPH BUT SETS OF TREES ALSO HAVE THIS PROPERTY. NOTE THAT A TREE IS A TWO-DIMENSIONAL STRUCTURE AND WHAT WE ARE DOING IS LISTING THE NODES OF THE TREE IN ONE DIMENSION SUCH THAT WE DO NOT LOSE THE STRUCTURE!
v NOTE THAT EVERY TREE CAN BE CODED AS A BINARY TREE:
o Let the root of the tree be the root of the binary tree.
o Traverse the tree level-by-level counting the root level 1, moving left to right across the level and call all the nodes on a level “siblings” and consider the left-most sibling the oldest. Let the root of the left subtree of the binary tree be the oldest sibling on level 2 of the tree (the left node on level two of the binary tree is the left-most node on level 2 of the tree).
o Let the left node on level n of the binary tree be the oldest sibling on level n of the tree. Therefore the leftmost branch of the binary tree duplicates the leftmost branch of the tree.
o Let the right node on level 3 of the binary tree be the second oldest sibling on level 2 of the tree and let the right node of that node, the right-most level 4 node of the binary tree be the third oldest sibling on the trees level 2. Continue so that the rightmost branch of the left subtree of the binary tree duplicates level 2 of the tree.
o In like manner, for every node now on the binary tree (the eldest sibling of every level of the tree and all nodes on level 2 of the tree), if the node has children in the tree and the oldest (leftmost) of those children is not already in the binary tree, let that node’s left child in the binary tree be the oldest of those siblings. If the node has siblings in the tree and the next oldest sibling (the next to the right in the tree) is not already on the binary tree, let that node’s right child in the binary tree be the next oldest sibling.
Every node on the original tree is now placed onto the binary tree and it can be found in the binary tree by knowing its path from the root via a level-by-level search of the tree for an ancestor: If not the root node, go left to the next level (down), then right (across) until the ancestor is found, then left (down), then right (across) until the ancestor is found (eventually the parent) then left (down), then right (across) to the node.
v BUT EVERY BINARY TREE CAN BE SERIALIZED: in the array T, let T[1] = root, T[2]=left child of root, T[3]=right child of root, T[2n]=left child of T[n], T[2n+1]=right child of T[n]. Therefore every tree can be serialized (its nodes can be traversed from the root without missing any and without visiting any twice.
o You should be familiar with standard recursive binary tree traversal schemes: LNR (left subtree, root, right subtree), NLR (root, left subtree, right subtree), and LRN (left subtree, right subtree, root).
v ACYCLIC DIGRAPHS INCLUDE SETS OF TREES: for example consider a glossary in which terms are defined in terms of other terms. To avoid circular definitions there must be some undefined terms (several, not just one), and then we can connect those terms to the terms that use only them in their definition. The next level of terms use only terms from the first two levels, etc. With terms as nodes and the edges the connection to the earlier terms used in their definition, we have a precedence graph but not a tree since edges connect from multiple levels and there are multiple parents.
v
Hereditary trees are
really digraphs since none actually recede to a common ancestor. However, it is possible to list all the
recorded members of the family tree by generation (levels) so that with respect
to any reference member, all parents, aunts and uncles are listed earlier and
all grandparents, great aunts and great uncles are listed before them,
etc. However, since it doesn’t matter
in which order the members of a given generation are listed, there are still
lots of different orderings of the family tree. Moreover, if the precedence rule is only that for every node all
its ancestors must be listed before it, we don’t have to group nodes strictly
by generation (my uncle can be listed after me but as long as he appears before
his child, my cousin).
INTERMINGLED THREADS ARE PRECEDENCE GRAPHS DETERMINED BY CONFLICTING OPERATIONS ON COMMON DATA. RELATIVELY FEW OF THE TRANSACTIONS DEAL WITH COMMON DATA, SO MOST TRANSACTIONS (HENCE ALL THEIR OPERATIONS) ARE UNRELATED BY PRECEDENCE AND THUS CAN BE SERIALZED IN ANY CONVENIENT ORDER.
Thm. 10.3.4. The Serializability Theorem. A history H has an equivalent serial execution S(H) iff the precedence graph PG(H) contains no circuit.
1. if there is no circuit, we can serialize: Assume there are m transactions involved, and label them T1, T2, . . ., Tm. We show how to reorder the integers 1 to m, i(1), i(2), . . ., i(m), so that Ti(1), Ti(2), . . ., Ti(m) is the desired serial schedule.
v Since PG(H) has no circuit, from the Lemma there is a vertex, or transaction, Tk, with no edge entering it. We choose Tk to be Ti(1). Now remove this vertex, Ti(1), from PG(H) and all edges leaving it. Call the resulting graph PG1(H).
v By the Lemma, there is now a vertex in PG1(H) with no edge entering it. Call that vertex Ti(2).
v Continue in this fashion, removing Ti(2) and all it's edges to form PG2(H), and so on, choosing Ti(3) from PG2(H), . . ., Ti(m) from PGm-1(H).
v By construction, no edge of PG(H) will ever point backward in the sequence S(H), from Ti(m) to Ti(n), m > n.
The proof is complete, and we now know how to create an equivalent SR schedule from a history whose precedence graph has no circuit.
2. Given a serial traversal of a PG(H), there cannot be any cycles. If T(1), T(2), …T(n) is a listing of all the vertexes of PG(H) numbered such that if i<j then T(i) is traversed before T(j) in the history (T(1) has no edge pointing to it, T(n) has no edge leaving it, every other vertex is pointed to only by edges from a lower numbered vertex).
v Now suppose one or more cycles exist in PG(H). Let T(k) be the highest number vertex that participates in a cycle.
v Then the successor of T(k) is T(m) where m<k because T(m) is part of the cycle and T(k) had the largest subscript of any vertex participating in any cycle.
v But this contradicts the assumption that in this numbering no vertex is pointed to with an edge from a higher numbered vertex (when m<k T(k) is a successor of T(m)).
Therefore, the edges producing a serial listing of the vertices cannot include a cycle.
THUS WE HAVE CONCLUSIVE PROOF THAT WHEN THE SCHEDULER FIND CONFLICTING PAIRS OF OPERATIONS IT MUST ABORT ONE OF THE TRANSACTIONS IN ORDER TO GUARANTEE SERIALIZABILITY.
10.4 Locking Ensures Serializabilty
See Fig. 10.8, pg. 609. TM passes on calls such as fetch, select, insert, delete, abort; Scheduler interprets them as: Ri(A), Wj(B). It is the job of the scheduler to make sure that no non-SR schedules get past. This is normally done with Two-Phase Locking, or 2PL.
Def. 10.4.1. 2PL. Locks taken AND released following three rules.
(1) Before Ti can read a data item, Ri(A), scheduler attempts to Read Lock the item on it's behalf, RLi(A); before Wi(A), try Write Lock, WLi(A).
(2) If conflicting lock on item exists, requesting Tx must WAIT. (Conflicting locks corresponding to conflicting ops: two locks on a data item conflict if they are attempted by different Txs and at least one of them is a WL).
(3) There are two phases to locking, the growing phase (when locks are acquired) and the shrinking phase (when locks are released: RUi(A); The scheduler must ensure that it can't shrink (drop a lock) and then grow again (take a new lock).
Rule (3) implies can release locks before Commit; More usual to release all locks at once on Commit, and we shall assume this in what follows.
Note that a transaction can never conflict with its own locks! If Ti holds RL on A, can get WL so long as no other Tx holds a lock on A. A Tx with a WL doesn't need a RL (WL more powerful than RL).
Clearly locking is defined to guarantee that a circuit in the Precedence Graph can never occur. The first Tx to lock an item forces any other Tx that gets to it second to "come later" in any SH.
But what if other Tx already holds a lock the first one now needs? This would mean a circuit, but in the WAIT rules of locking it means NEITHER Tx CAN EVER GO FORWARD AGAIN. This is a DEADLOCK. When a deadlock occurs, scheduler will recognize it and force one of the Txs involved to Abort. (Note, there might be more than 2 Txs involved in a Deadlock.)
Ex. Here is history not SR (Error in text: this is a variant of Ex. 10.4.1).
H4 = R1(A) R2(A) W2(A) R2(B) W2(B) R1(B) C1 C2
Same idea as 10.3.1 why it is non-SR: T1 reads two balances that start out A=50 and B=50, T2 moves 30 from A to B. Non-SR history because T1 sees A=50 and B=80. Now try locking and releasing locks at commit.
RL1(A) R1(A) RL2(A) R2(A) WL2(A) (conflicting lock held by T1 so T2 must WAIT—NO T2 OPERATIONS CAN BE PREFORMED UNTIL WL2(A) IS FULFILLED) RL1(B) R1(B) C1 (now T2 can get WL2(A)) W2(A) RL2(B) R2(B) WL2(B) W2(B) C2
Works fine: T1 now sees A=50, B=50. Serial schedule, T1 then T2.
But what if allowed to Unlock and then acquire more locks later (necessity of 2PL Rule (3))?
RL1(A) R1(A) RU1(A) RL2(A) R2(A) WL2(A) W2(A) WU2(A) RL2(B) R2(B) WL2(B) (T2 already has RL) W2(B) WU2(B)(RU2 doesn’t matter) RL1(B) R1(B) C1 C2
So H4 above is possible, giving inconsistent results. But the only 2PL rule broken is that T1 and T2 unlock rows, then lock other rows later.
The Waits-For Graph. How scheduler checks if deadlock occurs. Vertices are currently active Txs, Directed edges Ti -> Tj iff Ti is waiting for a lock held by Tj.
(Note, might be waiting for lock held by several other Transactions, for example a write lock cannot be obtained while some other Transaction holds a read lock on that row.)
The scheduler attempts to perform a required lock operation and if Tx required to wait, draws new directed edges resulting, then checks for circuit (is this transaction in a queue .
Ex 10.4.2. Here is schedule like H4 above, where T2 reverses order it touches A and B (now touches B first), but same example shows non-SR.
H5 = R1(A) R2(B) W2(B) R2(A) W2(A) R1(B) C1 C2
Locking result:
RL1(A) R1(A) RL2(B) R2(B) WL2(B) W2(B) RL2(A) R2(A) WL2(A) (Fails: RL1(A) held, T2 must WAIT for T1 to complete and release locks) RL1(B) (Fails: WL2(B) held, T1 must wait for T2 to complete: But this is a deadlock! Choose T2 as victim (T1 chosen in text)) A[bort]2 (now RL1(B) will succeed) R1(B) C1 (start T2 over, retry, it gets Tx number 3) RL3(B) R3(B) WL3(B) W3(B) RL3(A) R3(A) WL3(A) W3(A) C3.
Locking serialized T1, then T2 (retried as T3).
Thm. 10.4.2. Locking Theorem. A history of transactional operations that follows the 2PL discipline is SR.
First, Lemma 10.4.3. If H is a Locking Extended History that is 2PL and the edge Ti -> Tj is in PG(H)[Ti needs a lock held by Tj], then there must exist a data item D and two conflicting operations Xi(D) and Yj(D) such that XUi(D) <<H YLj(D).
Proof. Since Ti -> Tj in PG(H), there must be two conflicting ops Xi(D) and Yj(D) such that Xi(D) <<H Yj(D) [otherwise Ti would not be blocked from locking all the data items that it wanted to process].
· By the definition of 2PL, there must be locking and unlocking ops on either side of both ops, e.g.: XLi(D) <<H Xi(D) <<H XUi(D).
· Now between the lock and unlock for Xi(D), the X lock is held by Ti and similarly for Yj(D) and Tj. Since X and Y conflict, the locks conflict and the intervals cannot overlap. Thus, since Xi(D) <<H Yj(D), we must have:
XLi(D) <<H Xi(D) <<H XUi(D) <<H YLj(D) <<H Yj(D) <<H YUj(D)
And in particular XUi(D) <<H YLj(D). (since X and Y are conflicting operations on D and X precedes Y in the history, the XU operation must precede the YL operation in the history.)
Proof of Thm. 10.4.2. We want to show that every 2PL history H is SR.
Assume in contradiction that there is a cycle T1 -> T2 -> . . . -> Tn -> T1 in PG(H). By the Lemma, for each pair Tk -> T(k+1), there is a data item Dk where XUk(Dk) <<H YL(k+1)(Dk). We write this out as follows:
1. XU1(D1) <<H YL2(D1)
2. XU2(D2) <<H YL3(D2)
. . .
n-1. XU(n-1)(D(n-1)) <<H YLn(D(n-1))
n. XUn(Dn) <<H YL1(Dn) (Note, T1 is T(n+1) too.)
But now have (in 1.) an unlock of a data item by T1 before (in n.) a lock of a data item. So not 2PL after all. Contradiction.
Therefore H is 2PL implies no circuit in the PG(H), and thus H is SR.
Note that not all SR schedules would obey 2PL. E.g., the following is SR:
H7 = R1(A) W1(A) R2(A) R1(B) W1(B) R2(B) C1 C2
But it is not 2PL (T2 breaks through locks held by T1). We can optimistically allow a Tx to break through locks in the hopes that a circuit won't occur in PG(H). But most databases don't do that.
Here, then, is how the Scheduler works:
It maintains two tables: a table of transactions and a table of objects being accessed (and potentially locked):
Transaction table entry:
N| blocked count | pending or current op | list of operations | Wait List
Access Object table entry:
Obj-ID | disk address | 0 or T# if locked | operation queue
Activate when i/o completed:
a. If the completed operation was the last in the transaction, do commit process which unlocks the queues of the operations in the transaction and deletes the transaction entry in the transaction table (first decrementing the block count of each transaction in the entry’s WAIT list).
b. Order pending operations on object table entries whose queues are not locked by disk access algorithm and examine operation at head of the first entry’s queue.
a. If this operation is not the pending operation for the transaction that it is part of, move to the next operation in that queue. Repeat if necessary until a pending operation is found for an object in the access list.
b. If this transaction is not blocked, removed the operation from the queue and execute it, lock the queue for this transaction, make the operation current, and enter the transaction of each remaining queue member into the Wait List of the transaction of the selected operation (incrementing each’s blocked count--number of locks that must be released before this transaction can finish).
c. If the operation’s transaction is blocked, continue through the queue and down the access list.
c. If no executable operation can be found, check for deadlock.
a. If no transaction table entry has blocked count = 0 then find the transaction entry with the lowest block count and scan the Wait lists of the other transactions until this transaction number is found and ABORT the blocking transaction, decrementing the count for each transaction in its wait list.
b. Repeat until a transaction is no longer blocked.
Activate when TM adds transaction
1. Translates SQL request into a transaction with appropriate Reads and Write on rows and columns of a table.
2. Enters transaction in transaction table with chain of each of its operations and each accessed object in the transaction into object table and enqueue operation on that entry.
a. If any queue is locked, enter this transaction to Wait List of the locking transaction.
3. If disk of initial operation of new transaction available, initiate I/O.
10.5 Levels of Isolation
The idea of Isolation Levels, defined in ANSI SQL-92, is that people might want to gain more concurrency, even at the expense of imperfect isolation.
A paper by Tay showed that when there is serious loss of throughput due to locking, it is generally not because of deadlock aborts (having to retry) but simply because of transactions being blocked and having to wait.
· Recall that the reason for interleaving transaction operations, rather than just insisting on serial schedules, was so we could keep the CPU busy.
· We want there to always be a new transaction to run when the running transaction did an I/O wait.
· But if we assume that a lot of transactions are waiting for locks, we lose this. There might be only one transaction running even if we have 20 trying to run. All but one of the transactions are in a wait chain!
So the idea is to be less strict about locking and let more transactions run. The problem is that dropping proper 2PL might cause SERIOUS errors in applications. But people STILL do it.
The idea behind ANSI SQL-99 Isolation Levels is to weaken how locks are held. Locks aren't always taken, and even when they are, many locks are released before EOT.
And more locks are taken after some locks are released in these schemes. Not Two-Phase, so not perfect Isolation. (Note in passing, that ANSI SQL-92 was originally intended to define isolation levels that did not require locking, but it has been shown that the definitions failed to do this. Thus the locking interpretation is right.)
Define short-term locks to mean a lock is taken prior to the operation (R or W) and released IMMEDIATELY AFTERWARD. This is the only alternative to long-term locks, which are held until EOT.
Then ANSI SQL-92 Isolation levels are defined as follows (Fig. 10.9 -- some difference from the text):
|
|
Write locks on rows of a table are long term |
Read Locks on rows of a table are long term |
Read locks on predicates are long term |
|
Read Uncommitted (Dirty Reads) |
NA (Read Only) |
No Read Locks taken at all |
No Read Locks taken at all |
|
Read Committed |
Yes |
No |
No |
|
Repeatable Read |
Yes |
Yes |
No |
|
Serializable |
Yes |
Yes |
Yes |
Note that Write Predicate Locks are taken and held long-term in all isolation levels listed. What this means is explained later.
In Read Uncommitted (RU), no Read locks are taken, thus can read data on which Write lock exists (nothing to stop you if don't have to WAIT for RL). Thus can read uncommitted data; it will be wrong if Tx that changed it later aborts. But RU is just to get a STATISTICAL idea of sales during the day (say). CEO wants to know ballpark figure -- OK if not exact. Scheduler NEVER locks an object queue when a Read is being performed.
In Read Committed (RC), we take Write locks and hold them to EOT, and Read Locks on rows read and predicates and release immediately. (Cover predicates below.)
· Problem that can arise is serious one, Lost Update (Example 10.3.2):
... R1(A,100) R2(A,100) W1(A,140) W2(A,150) C1 C2...
Since R locks are released immediately, nothing stops the later Writes, and the increment of 40 is overwritten by an increment of 50, instead of the two increments adding to give 90. Call this the Scholar's Lost Update Anomaly (since many people say Lost Update only happens at Read Uncommitted).
· This is EXTREMELY serious, obviously, and an example of lost update in SQL is given in Figure 10.12 (pg. 666) for a slightly more restrictive level: Cursor Stability. Applications that use RC must avoid this kind of update.
In Figure 10.11, we see how to avoid this by doing the Update indivisibly in a single operation. Not all Updates can be done this way, however, because of complex cases where the rows to be updated cannot be determined by a Boolean search condition, or where the amount to update is not a simple function.
It turns out the IBM's Cursor Stability guarantees a special lock will be held on current row under cursor, and at first it was thought that ANSI Read Committed guaranteed that, but it does not. Probably most products implement a lock on current of cursor row, but there is no guarantee. NEED TO TEST if going to depend on this.
Scheduler unlocks object queue after Read I/O completed,
releasing the transactions in the corresponding Wait list.
In Repeatable Read Isolation, this is the isolation level that most people think is all that is meand by 2PL. All data items read and written have RLs and WLs taken and held long-term, until EOT.
· Example 10.5.3, pg. 666, Phantom Update Anomaly.
R1(predicate: branch_id = 'SFBay') R1(A1,100.00) R1(A2, 100.00)
R1(A3,100.00) I2(A4, branch_id = 'SFBay', balance =100.00)
R2(branch_totals, branch_id = SFBay, 300.00)
W2(branch_totals, branch_id = SFBay ,400.00) C2
R1(branch_totals, branch_id = SFBay, 400) (Prints out error message) C1
T1 is reading all the accounts with branch_id = SFBay and testing that the sum of balances equals the branch_total for that branch (accounts and branch_totals are in different tables). After T1 has gone through the rows of accounts, T2 inserts another row into accounts with branch_id = SFBay (T1 will not see this as it's already scanned past the point where it is inserted), T2 then updates the branch_total for SFBay, and commits. Now T1, having missed the new account, looks at the branch_total and sees an error. There is no error really, just a new account row that T1 didn't see.
· Note that nobody is breaking any rules about data item locks. The insert by T2 holds a write lock on a new account row that T1 never read. T2 locks the branch_total row, but then commits before T1 tries to read it. No data item lock COULD help with this problem. But we have non-SR behavior nonetheless.
· The solution is this: When T1 reads the predicate branch_id = SFBay on accounts, it takes a Read lock ON THAT PREDICATE, that is to say a Read lock on the SET of rows to be returned from that Select statement. Now when T2 tries to Insert a new row in accounts that will change the set of rows to be returned for SFBay, it must take a Write lock on that predicate. Clearly this Write lock and Read lock will conflict. Therefore T2 will have to wait until T1 reaches EOT and releases all locks.
ANSI Repeatable Read Isolation doesn't provide Predicate Locks, but ANSI Serializable does. Note that Oracle doesn't ever perform predicate locks. Oracle's SERIALIZABLE isolation level uses a different approach, based on snapshot reads, that is beyond what we can explain in this course.
Scheduler is focused only on data reads. SR requires a lock on ALL parameters of the transaction.
10.6 Transactional Recovery.
The idea of transactional recovery is this.
· Memory is "Volatile", meaning that at unscheduled times we will lose memory contents (or become unsure of the validity). This applies to data in registers (in the CPU) and in memory buffers that have not yet been flushed to disk.
· a transaction is "atomic", meaning that all update operationa a transaction performs must either ALL succeed or ALL fail.
o If we read two pages into memory during a transaction and update them both, we might (because of buffering) have one of the pages go back out to disk before we commit. What are we to do about this after a crash? An update has occurred to disk where the transaction did not commit. How do we put the old page back in place? How do we even know what happened? That we didn't commit?
o A similar problem arises if we have two pages in memory and after commit we manage to write one of the pages back to disk, but not the other. (In fact, we always attempt to minimize disk writes from popular buffers, just as we minimize disk reads.) How do we fix it so that the page that didn't get written out to disk gets out during recovery?
The answer is that as a transaction progresses we write notes to ourselves about what changes have been made to disk pages. We ensure that these notes get out to disk to allow us to correct any errors after a crash.
· These notes are called "logs", or "log entries". The log entries contain "Before Images" and "After Images" of every update made by a Transaction.
· In recovery, we can back up an update that shouldn't have gotten to disk (the transaction didn't commit) by applying a Before Image.
· Similarly, we can apply After Images to correct for disk pages that should have gotten to disk (the transaction did commit) but never made it.
There is a "log buffer" in memory (quite long), and we write the log buffer out to the "log on disk" every time one of following events occur.
a. The log buffer fills up. We write it to disk and meanwhile continue filling another log buffer with logs. This is known as "double buffering" and saves us from having to wait until the disk write completes.
b. Some transaction commits. We write the log buffer, including all logs up to the present time, before we return from commit to the application (and the application hands out the money at the ATM). This way we're sure we won't forget what happened.
Everything else in the next few sections is details: what do the logs look like, how does recovery take place, how can we speed up recovery, etc.
10.7 Recovery in Detail: Log Formats.
Consider the following history H5 of operations as seen by the scheduler:
(10.7.1) H5 = R1(A,50) W1(A,20) R2(C,100) W2(C,50) C2 R1(B,50) W1(B,80) C1
Because of buffering, some of the updates shown here might not get out to disk as of the second commit, C1. Assume the system crashes immediately after. How do we recover all these lost updates?
While the transaction was occurring, we wrote out the following logs as each operation occurred (Figure 10.13, pg. 673).
OPERA- LOG ENTRY
TION
R1(A,50) (S, 1) — Start transaction T1 - no log entry is written
for a Read operation, but this operation is the start of T1
W1(A,20) (W, 1, A, 50, 20) — T1 Write log for update of A.balance.
The value 50 is the Before Image (BI) for A.balance
column in row A, 20 is the After Image (AI) for A.balance
R2(C,100) (S, 2), another start transaction log entry.
W2(C,50) (W, 2, C, 100, 50), another Write log entry.
C2 (C, 2) — Commit T2 log entry. (Write Log Buffer to Log
File.)
R1(B,50) No log entry.
W1(B,80) (W, 1, B, 50, 80)
C1 (C, 1) Commit T1 (Write Log Buffer to Log File.)
Assume that a System Crash occurred immediately after the W1(B,80) operation.
This means that the log entry (W, 1, B, 50, 80) has been placed in the log buffer, but the last point at which the log buffer was written out to disk was with the log entry (C, 2)
This is the final log entry we will find when we begin to recover from the crash. Assume that the values out on disk are A = 20 (the update to 20 drifted out to disk because the data buffer space was needed), B = 50 (update was in memory buffer and didn't get to disk), and C = 100 (same). If you look carefully at the sequence, where T2 committed and T1 didn't, you will see that the values should be: A = 50, B = 50, C = 50.
After the crash, a command is given by the system operator that initiates recovery. This is usually called the RESTART command. The process of recovery takes place in two phases, Roll Back and Roll Forward.. The Roll Back phase backs out updates by uncommitted transactions and Roll Forward reapplies updates of committed transactions.
In Roll Back, the entries in the disk log are read backward to the beginning, System Startup, when A = 50, B = 50, and C = 100.
In Roll Back, the system makes a list of all transactions that did and did not commit. This is used to decide what gets backed out and reapplied.
LOG ENTRY ROLL BACK/ROLL FORWARD ACTION PERFORMED
1.(C, 2) Put T2 into "Committed List"
2.(W, 2, C,100,50) Since T2 is on "Committed List", we do nothing.
3.(S, 2) Make a note that T2 is no longer "Active"
4.(W, 1, A, 50, 20) Transaction T1 has never committed (it's last
operation was a Write). Therefore, the system
performs UNDO of this update by Writing the
Before Image value (50) into data item B.
Put T1 into "Uncommitted List"
5.(S, 1) Make a note that T1 is no longer "Active". Now
that no transactions were active, we can end the
ROLL BACK phase.
ROLL FORWARD
6. (S, 1) No action required
7.(W, 1, A, 50, 20) T1 is Uncommitted — No action required
8.(S, 2) No action required
9.(W, 2, C,100,50) Since T2 is on Committed List, we REDO this
update by writing After Image value (50) into
data item C
10 (C, 2) No action required
11 We note that we have rolled forward through all
log entries and terminate Recovery.
Note that at this point, A = 50, B = 50, and C = 50.
Guarantees That Needed Log Entries are on Disk
How could a problem occur with our method of writing logs and recovering? Look at the history earlier again and think what would happen if B = 80 had been written to disk before the crash but that the log entry (W, 1, B, 50, 80) did NOT get out to disk. We wouldn't be able to UNDO this update (which should not occur, since T1 did not commit).
This is a problem that we solve with a policy that ties the database buffer writes to the Log. The policy is called Write-Ahead Log (WAL) and It guarantees that no buffer dirty page gets written back to disk before the Log that would be able to UNDO it gets written to the disk Log file. This is accomplished by counting every write to the database. The last write count made to a buffer page is kept with that page and the first write count that is present in the log buffer is kept with the log buffer page. Every time a data page is written back to disk its write count is compared to the log buffer’s write count, and if it is larger, then the log buffer is written to disk first.
OK, this would solve the problem of UNDOs. Do we ever have a problem with REDOs? No, because we always write the Log buffer to the Log file as part of the Commit. So we're safe in doing REDO for committed Txs.
10.8 Checkpoints
In the recovery process we just covered, we performed ROLLBACK to System Startup Time, when we assume that all data is valid. We assume that System Startup occurs in the morning, and database processing continues during the day and everything completes at the end of day.
(This is a questionable assumption nowadays, with many companies needing to perform 24X7 processing, 24 hours a day, 7 days a week, with no time when transactions are guaranteed to not be active.) Even if we have an 8-hour day of processing, however, we can run into real problems recovering a busy transactional system (heavy update throughput) with the approach we've outlined so far.
The problem is that it takes nearly as much processing time to RECOVER a transaction as it did to run it in the first place. (We could backup the state of the database at the end of the day and write a copy of every transaction to tape as it entered the system, and simply reset the DB and re-run the transactions when a crash occurs and not have to worry about logging!)
A simpler type of Checkpointing is a "Commit Consistent Checkpoint," that merely duplicates the process of shutting down for the night, but then transactions start right up again. The problem is that it might take minutes to take a Commit Consistent Checkpoint, and during that time NO NEW TRANSACTIONS CAN START UP.
For this reason, database systems programmers have devised two other major checkpointing schemes that reduce the "hiccup" in transaction processing that occurs while a checkpoint is being performed. The Commit Consistent Checkpoint is improved on by using something called a "Cache Consistent Checkpoint". Then an even more complicated checkpoint called a "Fuzzy Checkpoint" improves the situation further. From time to time, a Checkpoint is triggered, probably by a time since last checkpoint system clock event.
Def 10.8.1. Commit Consistent Checkpoint steps. After the "performing checkpoint state" is entered, we have the following rules.
(1) No new transactions can start until the checkpoint is complete.
(2) Database operation processing continues until all existing transactions Commit, and all their log entries are written to disk. (Thus we are Commit Consistent ).
(3) Then the current log buffer is written out to the log file, and after this the system ensures that all dirty pages in buffers have been written out to disk.
(4) When steps (1)-(3) have been performed, the system writes a special log entry, (CKPT), to disk, and the Checkpoint is complete.
It should be clear that these steps are basically the same ones that would be performed to BRING THE SYSTEM DOWN for the evening. We allow transactions in progress to finish, but don't allow new ones, and everything in volatile memory that reflects a disk state is put out to disk. As a matter of fact, the Disk Log File can now be emptied. We needed it while we were performing the checkpoint in case we crashed in the middle, but now we don't need it any longer.
From this, it should be clear that we can modify the Recovery approach we have been talking about so that instead of a ROLLBACK to the Beginning of the Log File at System Startup, we ROLLBACK to the LAST CHECKPOINT!!!
If we take a Checkpoint every five minutes, we will never have to recover more than five minutes of logged updates, so recovery will be fast.
The problem is that the Checkpoint Process itself might not be very fast.
Note that we have to allow all transactions in progress to complete before we can perform successive steps. If all applications we have use very short transactions, there should be no problem.
Cache Consistent Checkpoint
But what if some transactions take more than five minutes to execute? Then clearly we can't guarantee a Checkpoint every five minutes!!! Worse, while the checkpoint is going on (and the last few transactions are winding up) nobody else can start any SHORT transactions to read an account balance or make a deposit!
We address this problem with something called the "Cache Consistent Checkpoint". With this scheme, transactions can continue active through the checkpoint. We don't have to wait for them all to finish and commit.
Definition 10.8.2. Cache Consistent Checkpoint procedure steps.
(1) No new transactions are permitted to start.
(2) Existing transactions are not permitted to start any new operations.
(3) The current log buffer is written out to disk, and after this the system ensures that all dirty pages in cache buffers have been written out to disk. (Thus, we are "Cache" (i.e., Buffer) Consistent on disk.)
(4) Finally, a special log entry, (CKPT, List) is written out to disk, and the Checkpoint is complete. NOTE: this (CKPT) log entry contains a list of active transactions at the time the Checkpoint occurs.
The recovery procedure using Cache Consistent Checkpoints differs from Commit Consistent Checkpoint recovery in a number of ways.
Ex 10.8.1 Cache Consistent Checkpoint Recovery.
Consider the history H5:
H5: R1(A, 10) W1(A, 1) C1 R2(A, 1) R3(B, 2) W2(A, 3) R4(C, 5) CKPT
W3(B, 4) C3 R4(B, 4) W4(C, 6) C4 CRASH
Here is the series of log entry events resulting from this history. The last one that gets out to disk is the (C, 3) log entry.
(S, 1) (W, 1, A, 10, 1) (C, 1) (S 2) (S, 3) (W, 2, A, 1, 3) (S, 4)
(CKPT, (LIST = T2, T3 , T4)) (W, 3, B, 2, 4) (C, 3) (W, 4, C, 5, 6) (C, 4)
At the time we take the Cache Consistent Checkpoint, we will have values out on disk: A = 3, B = 2 , C = 5. (The dirty page in cache containing A at checkpoint time is written to disk.) Assume that no other updates make it out to disk before the crash, and so the data item values remain the same.
Here is a diagram of the time scale of the various events. Transaction Tk begins with the (S, k) log, and ends with (C, k). **LEAVE ON BOARD**
Checkpoint Crash
T1 |-------|
T2 |--------------------------------------------
T3 |------------------------------|
T4 |--------------------------------------|
Next, we outline the actions taken in recovery, starting with ROLL BACK.
ROLL BACK
1.(C, 3) Note T3 is a committed Tx in active list.
2.(W, 3, B, 2, 4) Committed transaction, wait for ROLL FORWARD.
3.(CKPT, (LIST = T2, T3 ,T4)) Note active transactions T2,T3 and T4;
THESE HAVE NOT COMMITTED (no (C, 2) or
(C, 4) logs have been encountered)
4.(S, 4) List of Active transactions now shorter: {T2, T3}
5.(W, 2, A, 1, 3) Not Committed. UNDO: A = 1
6.(S, 3) List of Active Transactions shorter.
7.(S, 2) List of Active Transactions empty. STOP ROLLBACK.
With a Cache Consistent Checkpoint, when ROLL BACK encounters the CKPT log entry the system takes note of the transactions that were active, even though we have never seen any operations in the log file.
We now take our list of active transactions, remove those that we have seen committed, and have a list of transactions whose updates we need to UNDO. Since Transactions can live through Checkpoints we have to go back PRIOR to the Checkpoint, while UNDO steps might remain.
We continue in the ROLL BACK phase until we complete all such UNDO actions. We can be sure when this happens because as we encounter (S, k) logs, rolling backward.
When all Active Uncommitted Tk have been removed, the ROLL BACK is complete, even though there may be more entries occurring earlier in the log file.
ROLL FORWARD
8.(CKPT, (LIST = T2, T3 , T4)) Skip forward in log file to this entry, start
after this.
9.(W, 3, B, 2, 4) Roll Forward: C = 6.
18. (C, 3) No Action. Last entry: ROLL FORWARD is complete.
In starting the Roll Forward Phase, we merely need to REDO all updates by committed transactions that might not have gone out to disk.
We can jump forward to the first operation after the Checkpoint, since we know that all earlier updates were flushed from buffers.
Roll Forward continues to the end of the Buffer File. Recall that the values on disk at the time of the crash were: A = 3, B = 2 , C = 5. At the end of Recovery, we have set A = 1 (Step 5), and B = 4 (Step 9).
We still have C = 5. A glance at the time scale figure shows us that we want updates performed by T3 to be applied, and those by T2 and T4 to be backed out. There were no writes performed by T4 that got out to disk, so we have achieved what is necessary for recovery: A = 1, B = 4, C = 5.
Fuzzy Checkpoint.
A problem can still arise that makes the Cache Consistent Checkpoint a major hiccup in Transaction Processing. Note in the procedure that we can't let any Active Transactions continue, or start any new ones, until all buffers are written to disk. What if there are a LOT of Buffers? Some new machines have several GBytes of memory! That's probably Minutes of I/O, even if we have a lot of disks. DISK I/O is SLOW!!!
OK, with a Fuzzy Checkpoint, each checkpoint, when it completes, makes the PREVIOUS checkpoint a valid place to stop ROLLBACK.
Definition 10.8.3. Fuzzy Checkpoint procedure steps.
(1) Prior to Checkpoint start, the remaining pages that were dirty at the prior checkpoint will be forced out to disk (but the rate of writes should leave I/O capacity to support current transactions in progress; there is no critical hurry in doing this).
(2) No new transactions are permitted to start. Existing transactions are not permitted to start any new operations.
(3) The current log buffer is written out to disk with an appended log entry, (CKPTN, List), as in the Cache Consistent Checkpoint procedure.
(4) The set of pages in buffer that have become dirty since the last checkpoint log, CKPTN-1, is noted.
This will probably be accomplished by special flags on the Buffer directory. There is no need for this information to be made disk resident, since it will be used only to perform the next checkpoint, not in case of recovery. At this point the Checkpoint is complete.
As explained above, the recovery procedure with Fuzzy Checkpoints differs from the procedure with Commit Consistent Checkpoints only that ROLL FORWARD must start with the first log entry following the SECOND to last checkpoint log.
What really happens with commercial databases. Used to be all Commit Consistent, now often Fuzzy.
Also used to be VERY physical. Before and After images meant physical copies of the entire PAGE. Still need to do this sometimes, but for long-term log can be more "logical".
· Instead of "Here is the way the page looked after this update/before this update", have: this update was ADD 10 to Column A of row with RID 12345 with version number 1121.
· Note that recovery is intertwined with the type of recovery. It doesn't do any good to have row-level locking if have page level recovery.
o T1 changes Row 12345 column A from 123 to 124, and log gives PAGE BI with A = 123, T1 hasn't committed yet.
o T2 changes Row 12347 (same page) column B from 321 to 333 and Commits, log gives Row 12345 with 124, Row 12347 column B with 333 on page AI.
o Transaction T2 commits, T1 doesn't, then have crash. What do we do? Put AI of T2 in place? Gives wrong value to A! Put BI of T1 in place? Wrong value of B!
o Page level logging implies page level locking is all we can do.
Sybase SQL Server STILL doesn't have row-level locking.
10.9 Media Recovery
Problem is that Disk can fail. (Not just stop running: head can score disk surface.) How do we recover?
· First, we write our Log to TWO disk backups. Try to make sure two disks have Independent Failure Modes (not same controller, same power supply). We say that storage that has two duplicates is called stable storage, as compared to nonvolatile storage for a normal disk copy.
· Before System Startup run BACKUP (bulk copy of disks/databases).
· Then, when perform Recovery from Media failure, put backup disk in place and run ROLL FORWARD from that disk, as if from startup in the morning.
RAID Disks
RAID stands for Redundant Arrays of Inexpensive Disks. Invented at Berkeley. The Inexpensive part rather got lost when this idea went commercial. See slides on Raid and IBM explanation. Also see IEEE COMPUTER handout by Robinson.
10.10 TPC-A Benchmark
The TPC-A Benchmark is now out of date. Newer TPC-C Benchmark: more complex.
See Fig 10.16, pg. 686. Size of tables determined by TPS.
See Fig 10.17. pg. 687. All threads do the same thing. Run into each other in concurrency control because of Branch table and History table.
Benchmark specifies how many threads there are, how often each thread runs a Tx, costs of terminals, etc.
On a good system, just add disks until use 95% of CPU. On a bad system, run into bottlenecks.
Ultimate measure is TPS and $/TPS
Read O’Neil’s Article on Benchmarking and the Transaction Processing Council.
March 13,
2003 | Brian Moran | SQL Server Perspectives | InstantDoc
#38348
The Truth About the TPC
In last week's SQL Server Magazine UPDATE commentary, I lauded Microsoft's new TPC-C benchmark world record. But I received several letters that reminded me that many readers don't know what the Transaction Processing Performance Council (TPC) is or what its benchmarks mean.
For example, one reader asked, "Why does the TPC organization only test commercially licensed operating systems and databases? My presumptions would lead me to think that a non-profit based organization would be benchmarking anything they could get their hands on. An example being, why don't they test postreqsql or mysql on a Linux platform?"
There's a simple answer to that question: fear of getting sued. I'll get to the more complex answer to the question in a minute. First, here's some background about the TPC to help you keep things in perspective.
The TPC's mission statement says, "The TPC is a non-profit corporation founded to define transaction processing and database benchmarks and to disseminate objective, verifiable TPC performance data to the industry." The tag line on the TPC Web site says "The Transaction Processing Performance Council defines transaction processing and database benchmarks and delivers trusted results to the industry." You get the picture: They perform database benchmark tests and publish the results. The TPC currently supports four benchmark suites: TPC-C, TPC-H, TPC-R, and TPC-W. TPC-C focuses on OLTP systems, TPC-H and TPC-R focus on decision support and data warehousing loads, and TPC-W is a transactional Web-commerce benchmark designed to test end-to-end system performance. The TPC Web site provides a wealth of information about the organization and the benchmarks and includes a surprisingly interesting history page (yes, I'm enough of a geek that I honestly found it interesting) at http://www.tpc.org/information/about/history.asp .
TPC benchmark scores usually have two components: tpmC and Price/tpmC. TpmC is the number of transactions per minute for the TPC-C test, and Price/tpmC measures how much each of those transactions cost. The cost is amortized over the life of the system. You'll find more detailed information about TPC pricing scores at http://www.tpc.org/information/pricing.asp .
So let's return to the reader question above, which is essentially, "Why didn't I see a benchmark for XYZ product?" The TPC is an independent, non-profit organization. However, the TPC doesn't have the power to run benchmark tests on a database platform without the approval of the database vendor. In fact, with the exception of IBM, most major database vendors include in their license agreements a clause that forbids the publication of benchmark information without explicit permission. Here's the clause from the SQL Server End User License Agreement (EULA):
e. Benchmark Testing. You may not disclose the results of any
benchmark test of either the Server Software or Client Software to
any third party without Microsoft's prior written approval.
Oracle, Sybase, and Informix each have a similar clause. These clauses are generically referred to as "DeWitt clauses." David DeWitt was one of the founders of the Wisconsin Benchmarks, which were first published in the mid-1980s. At that time, the Wisconsin Benchmarks published less-than-favorable scores for an Oracle database, and Oracle wasn't happy with the negative publicity. Oracle added a clause to its license agreement forbidding unauthorized benchmarking, and most other vendors followed suit. So the answer to the first part of the reader's question above is that many benchmarks are never performed because the database vendor might not allow the results to be published. You might see unauthorized database benchmarks that other independent organizations have published. Vendors are hesitant to sue people over this clause because they know the publicity would be horrible. But technically, publishing an unauthorized benchmark could open the organization to a lawsuit from a vendor. Regardless, you won't see an unauthorized benchmark from the TPC.
Vendors use their own resources to run TPC benchmark tests, and they hire independent third parties to audit the numbers and ensure they've followed TPC rules. Only then will you see a TPC-C score on the TPC site. But anyone can submit a TPC benchmark score, as long as the vendor authorizes it. Submitting a benchmark costs only $1,250. You can read about how to submit a score at http://www.tpc.org/information/other/submit_results.asp . Of course, it could easily cost you millions of dollars to build the test environment, run the test, and have the test audited. Because of this practical constraint, I'm not aware of any TPC numbers that have ever been released unless a database vendor and hardware vendor teamed together to publish the result.
The answer to the second part of the reader's question becomes obvious when you consider the fact that vendors are the ones who decide which benchmarks they'll publish. Naturally, vendors publish only the numbers that further their strategic interests. I won't define what those strategic interests might be, but you probably won't see a TPC-C score published unless the vendor thinks it's in the company's best interest. No sane vendor would ever publish a head-to-head, apples-to-apples comparison with an existing number unless the results make the company look good. Unfortunately, most real users would prefer those head-to-head, apples-to-apples comparisons.
Is the new world record that Microsoft set accurate and true? Yes. Oracle would publish a new benchmark score in a heartbeat if they had a number that could beat it. TPC numbers are verifiably audited and the scores can be challenged, so you can trust the numbers. Microsoft is on top for the time being. TPC benchmark scores do serve a valid purpose, and you can glean a lot of useful information from them. However, interpreting a TPC score requires an understanding of how and why that score was published. Companies publish scores for marketing reasons, not to support a noble goal of providing the most comprehensive set of benchmark information available to end users.
March 14,
2002 | Brian Moran | SQL Server Perspectives | InstantDoc
#24467
How Valuable Are Benchmark Scores?
You can't judge a book by its cover, and you can't judge a benchmark by its raw score. On behalf of PC Magazine and eWeek, eTesting Labs recently conducted database- and application-server benchmark tests.
PC Magazine conducted a Nile-based benchmark test to compare the performance of SQL Server, Oracle, IBM DB2, Sybase, and MySQL. (Read the white paper for details about the Nile benchmark.) SQL Server finished dead last in the JavaServer Pages (JSP)-based performance test. I heard about this result from some colleagues who were upset that the test was based on a beta version of Microsoft's new Java Database Connectivity (JDBC) driver. They didn't think it was fair to judge database-server performance limited by the artificial constraints of a middleware driver. And needless to say, any test that says SQL Server can't scale receives my immediate attention. I decided to investigate the matter. Below are my findings, as well as some thoughts about the value of benchmarking in general.
I think that measuring server performance based on a middleware driver is foolhardy, so I was prepared for a belly full of righteous indignation by the time I finished the PC Magazine review (which I encourage you to read). However, I read the entire article with an open mind and found that the reviewers did a reasonably good job of making it clear that Microsoft's JDBC driver played a significant role in SQL Server's poor performance. The reviewers underscored the JDBC driver's effect by running the same implementation of the Nile benchmark under ASP.NET to see how SQL Server performed on that platform. SQL Server running on ASP.NET was markedly faster than any of the JSP competitors in this environment and displayed much better throughput and scalability numbers. (Note: PC Magazine didn't report benchmark numbers for the competing databases running under a Microsoft.NET application architecture.) Those results appear to be good news for Microsoft and suggest that ASP.NET is markedly faster than JSP and that SQL Server is markedly faster than its competitors.
So what value does PC Magazine's SQL Server benchmark have for your decision-making purposes? Like any benchmark, the PC Magazine test simply measures the performance of an application under load by using a precise set of circumstances. Mapping the numbers to your environment's performance is difficult or impossible. Extrapolating results when the performance test includes multiple layers of middleware and client software in an end-to-end test like this is even more difficult. Where does the bottleneck live--front end, middleware, or back end?
A Microsoft spokesperson says that PC Magazine didn't adhere to publicly available best practices for implementation of a scalable .NET application. For example, PC Magazine used the OLE DB .NET Data Provider rather than the Microsoft-recommended SQL Server .NET Data Provider, even though all the JSP solutions were built using vendor-recommended native drivers. Microsoft says that using the native SQL Server .NET Data Provider would have had a substantial positive impact on the benchmark result, which was already faster than any of the JSP solutions.
How fast would SQL Server 2000 have been if PC Magazine had implemented the Nile benchmark by using a full set of .NET best practices? Microsoft has such a benchmark, which you can read about on its GotDotNet Web site.
The Microsoft and PC Magazine tests are similar, but the code isn't 100 percent identical. Comparing the Microsoft Nile benchmark to the PC Magazine Nile benchmark isn't comparing apples to apples. However, this issue demonstrates the inherent problems of analyzing benchmark results.
For example, Microsoft's best throughput numbers are more than three times faster than the PC Magazine results, with both test suites running SQL Server on a four-CPU system. The Microsoft and PC Magazine Nile numbers were both based on the same design guidelines set forth in the Nile specification, and both ran the same type of queries in their workloads. Where does the performance difference come from?
Implementation decisions were responsible for Microsoft's numbers being more than three times faster, and implementation decisions always affect performance in the real world. Let's relate this fact to the question of how relevant PC Magazine's SQL Server benchmark is. PC Magazine's test of SQL Server on .NET was faster than any of the competing JSP solutions the magazine tested. However, PC Magazine's best SQL Server results were roughly three times slower than the Nile numbers Microsoft achieved on similar hardware. In essence, PC Magazine tested the performance of the JDBC driver and an application that the magazine wrote. However, the magazine didn't fully test SQL Server's performance characteristics because the benchmark relies so much on application-implementation decisions and middleware choices.
Incidentally, SQL Server recently posted a new world record on the SAP R/3 Sales & Distribution benchmark. SQL Server now holds the world record for database performance for 10 significant industry-standard benchmarks, including TPC-C. Check out Microsoft's Web site for the latest information about SQL Server benchmarks.