Starter
In recent years, the availability of databases and of computer networks has given rise to a new field: distributed databases. A distributed database is an integrated database that is built on top of a computer network rather than on a single computer. The data that constitute the database are stored at the different sites of the network, and the application programs run by the computers access data at different sites. A distributed database may involve different database management systems, running on different architectures, across which the execution of transactions is distributed.
A Distributed Database Management System (DDBMS) is the software that manages the distributed database (DDB) and makes its operation appear to the user as if it were a centralized database. A distributed database is a collection of data that belongs logically to the same system but is spread over the sites of a computer network. Thus:
(1) Data is stored at several sites, each managed by a DBMS that can run independently.
(2) The data from different sites are tied together by a computer network.
(3) Each site holds two types of data: (a) data required by the local site, and (b) data required by other sites.
Ideally, a distributed database satisfies the following two properties:
• Distributed Data Independence: Users should not have to know where data is located (this extends the principles of physical and logical data independence).
• Distributed Transaction Atomicity: Users should be able to write transactions (Xacts) that access multiple sites just as they write local transactions.
In practice, many systems do not support these properties, and users then have to be aware of where data is located. The properties are hard to support efficiently. For globally distributed sites they may not even be desirable, because of the administrative overhead of making the location of data transparent.
Advantages and Disadvantages
The following are a few advantages of a DDBMS: data is located near the site of greatest demand, access is faster, processing is faster because several sites share the workload, new sites can be added quickly and easily, communication is
Sushil Kulkarni
improved, operating costs are reduced, the system is user friendly, there is less danger of a single point of failure, and there is process independence.
Businesses and organizations move to distributed databases for organizational and economic reasons, for reliable and flexible interconnection of existing databases, and to support future incremental growth. Companies believe that a decentralized, distributed database approach adapts more naturally to the structure of the organization. A distributed database is the more suitable solution when several databases already exist in an organization, and it makes global applications over those databases easier to build. If an organization grows by adding new, relatively independent organizational units, the distributed approach supports smooth incremental growth. Data can physically reside nearest to where it is most often accessed, which gives users local control of the data they interact with. This results in local autonomy: users can enforce locally the policies regarding access to their data. One reason to consider a parallel architecture is to improve the reliability and availability of the data in a scalable system. With careful design and sufficient data replication, some or possibly all of the data remains accessible when a failure occurs.
A DDBMS has the following disadvantages. Management and control are complex, and security is weaker because the data is held at many different sites. A distributed database provides more flexible access, and this increases the chance of security violations, since the database can be accessed from every site in the network. For many applications it is important to provide secure access to data. Present distributed database systems do not provide adequate mechanisms to meet this objective.
Hence the solution requires a DDBMS capable of handling multilevel data. Such a system is called a multilevel security distributed database management system (MLS-DDBMS). An MLS-DDBMS provides a verification service for users who wish to share data in the database at different security levels. In an MLS-DDBMS, every data item in the database is associated with one of several classifications or sensitivities.
The ability to ensure the integrity of the database in the presence of unpredictable failures of both hardware and software components is also an important feature of any distributed database management system. The integrity of a database is concerned with its consistency, correctness, validity, and accuracy. Integrity controls must be built into the structure of the software, the databases, and the procedures followed by the personnel involved. If there are multiple copies of the same data, the duplication introduces additional complexity, because every update must be applied to all copies. Concurrency control and recoverability consume much of the research effort in distributed database theory. Increased reliability and performance is the goal, not the status quo.
Example
The following example illustrates a distributed database (DDB).
Consider a bank with three branches connected by a computer network. Each branch has its teller terminals and its own account database, and all branches can process the database. Each branch with its local database constitutes one site of the distributed database. The applications issued from the terminals of a branch usually need to access only the database of that branch. Such local applications are processed by the computer of the branch where they are issued. A global application generates a number of sub-transactions, each of which may be executed at a different site. For instance, suppose a customer wants to transfer money from an account at one branch to an account at another branch. This application requires updating the database at two different branches, that is, performing two local updates. Similarly, if a data item A has two copies A1 and A2 at sites S1 and S2, then updating the value of A generates two sub-transactions TS1 and TS2, which perform the required operation at sites S1 and S2 respectively.
If the branches are located in different geographical areas, the banking system is called a distributed database on a geographically dispersed network. [See the figure]
Now suppose the branches are situated in one building and are connected by a high-bandwidth local network. The teller terminals of the branches are connected to their respective computers by telephone lines. Each processor and its database constitute a site of the local computer network. Here the distributed database is implemented on a local network instead of a geographical network and is called a distributed database on a local network. [See the figure, replacing the communication network with a local network]
Let us now consider the same example again, but with the data of the different branches distributed over three backend computers, which perform the database management functions.
[Figure: Sites 1, 2 and 3, each with teller terminals (T) and a computer (COM), connected by a communication network]
The application
programs are executed by a different computer, which requests database access services from the backends when necessary. This type of system is not a distributed database: although the data is physically distributed over different processors, the distribution is not relevant from the application viewpoint. What is missing is the existence of local applications: no computer can execute an application by itself.
From the above examples we obtain the following working definition: a distributed database is a collection of data distributed over different computers of a computer network. Each site of the network has processing capacity and can perform local applications. Each site also participates in the execution of at least one global application. In short, a distributed database:
• Can have many sites, each with its own DBMS.
• Stores local data at each site.
• Connects the sites by a communication network.
• Appears to a user (application program) as one big database.
Functions of DDBMS
A DDBMS provides the functions of a centralized DBMS plus:
(1) extended communication services, to allow the transfer of queries and data among sites;
(2) an extended system catalog, to store data distribution details;
(3) distributed query processing, including query optimization;
(4) extended concurrency control, to maintain the consistency of replicated data;
(5) extended recovery services, to take account of failures of individual sites and of communication links.
Two main issues
In a DDBMS, there are two main issues:
• A query can be made from one site to the same site or to a remote site.
• The logical database is partitioned into different data streams, which are located at different sites.
A user at one site makes a request as if it were made to a local database. Users see only one logical database and do not need to know the names of the different data streams. In fact, the user need not know that the database is partitioned, nor where the data streams are located.
Component Architecture
The components necessary for building distributed databases are as follows:
The User Request Interface is typically a client program that acts as an interface to the distributed transaction manager and accepts the user's requests for transactions.
The distributed transaction manager, or transaction processor (TP), is a program that translates requests from the user into actionable requests for the database managers, which are typically distributed.
A database manager, or database processor (DP), is the software responsible for processing a segment of the distributed database. Together, the distributed transaction managers and the database managers make up the DDBMS.
A node is a piece of hardware that can host a transaction manager, a database manager, or both.
Distributed database systems can be built on a wide variety of architectures, which is key to keeping a system scalable. For instance, a DDBMS can run on a collection of mainframes, minicomputers, or PCs, or any combination of them, so that a scalable system can integrate all three types of computers.
Types of distributed databases
There are two types of distributed database systems: homogeneous and heterogeneous.
Homogeneous: If data is distributed but every site runs the same type of DBMS, the distributed database is called homogeneous.
Heterogeneous: If data is distributed and different sites run different DBMSs (possibly even non-relational ones), the distributed database is called heterogeneous. Such a system is built using gateway protocols, which are APIs that expose DBMS functionality to external applications. Examples are ODBC and JDBC. The following figure illustrates this.
[Figure: DBMS1, DBMS2 and DBMS3 connected through a gateway]
Distributed database architecture
There are two alternative approaches to separating functionality across the processes of a distributed database: client-server and collaborating server.
Client-server architecture
This architecture resembles a restaurant in which many customers give their orders to a waiter. It has three components: clients, servers, and communication middleware.
• The client ships each request to a single site, which processes the query.
• The server processes all queries sent by the clients.
• The communication middleware is the communication network through which client and server communicate.
The following figure illustrates this architecture:
[Figure: several clients sending queries to a server]
Collaborating server architecture
The client-server architecture does not allow a single query to span multiple servers, because the client process would have to break such a query into appropriate subqueries to be executed at different sites and then piece together the answers to the subqueries. This difficulty is eliminated by the collaborating server architecture. In this architecture we have a collection of database servers, each capable of running transactions against local data, which cooperatively execute transactions spanning multiple servers.
[Figure: a client server sends a query to a server, which cooperates with other servers]
A ‘client’ server sends a request for data to a server in the form of a query. The server generates appropriate subqueries to be executed by ‘other’ servers and puts the results together to compute the answer to the original query.
Data allocation
How should a database be distributed among the sites of a network? We discuss this important issue using the following strategies:
• Centralized
• Replication
• Fragmentation
We will explain and illustrate each of these approaches using relational databases. Suppose that a bank has numerous branches located throughout a state. One of the base relations in the bank's database is the customer relation given below. For simplicity, the sample data in the relation apply to only two of the branches (Mumbai and Pune). The primary key of the relation is the account number (Ac_no). B_name is the name of the branch where the customer has opened the account.

Ac_no  Customer_name  B_name  Balance
200    Smita          Mumbai  1000
324    Hemant         Pune    250
153    Smith          Pune    38
426    Aishwarya      Mumbai  1500
500    Tina           Pune    168
683    Samira         Mumbai  1398
252    Ajay           Mumbai  303

Centralized
All data is stored at one site, and processing can be done only at this site. Other sites have to approach this site for data. If a problem occurs at this site, the whole system fails. This method of storing data also has high communication costs.
Fragmentation
Fragmentation consists of breaking a relation into smaller relations, or fragments, and storing the fragments at different sites. A global relation R is thus partitioned into fragments R1, R2, . . . , Rn that contain sufficient information to reconstruct the original relation. The advantages of fragmentation are:
• It allows parallel processing of data.
• Tuples are located where they are most frequently accessed.
As a prerequisite for fragmenting a relation, the access patterns (typical queries) and the application behavior must be known, that is, which applications access which portions of the data. There are two types of fragmentation: horizontal and vertical.
Horizontal fragmentation
With this fragmentation, some of the rows of a table are put into a base relation at one site, and other rows are put into a base relation at another site. More generally, the rows of a relation are distributed to many sites. Fragments are required to be disjoint. When each fragment is assigned to exactly one site, locality of reference can be high and storage costs are low (there is no replication), but reliability and availability are also low.
For a given relation R(A1, . . . , An), a set PR = {p1, . . . , pn} of simple predicates is determined. Each pi has the pattern Ai θ <value>, with θ ∈ {=, ≠, <, >}.
Assume the two relations:
EMP(EmpNo, EName, Job, Sal, Deptno → DEPT)
DEPT(Deptno, DName, Location)
Furthermore, assume that there are two applications. One typically retrieves information about employees who earn more than Rs. 5000; the other typically manages information about clerks. The simple predicates are PEMP = {Job = 'Clerk', Sal > 5000}, and the set of minterm predicates is MR = {mi | mi = [not] p1 ∧ . . . ∧ [not] pn}:
m1 = (Job = 'Clerk' ∧ Sal > 5000)
m2 = (Job = 'Clerk' ∧ Sal ≤ 5000)
m3 = (Job ≠ 'Clerk' ∧ Sal > 5000)
m4 = (Job ≠ 'Clerk' ∧ Sal ≤ 5000)
Thus the fragments are EMPi = σ_mi(EMP), i = 1, . . . , 4.
Vertical fragmentation
Here the idea is to split the schema R into smaller schemas Ri. The split is based on which attributes are typically accessed together by the different applications. Each fragment Ri must contain the primary key so that the original relation R can be reconstructed. With this fragmentation, some of the columns of a relation are projected into a base relation at one site, and other columns are projected into a base relation at another site. More generally, the columns may be projected to several sites. For example, two fragments of EMP can be written as
EMP1(EmpNo, EName, Sal)
EMP2(EmpNo, Job, Deptno)
Vertical fragmentation is more difficult than horizontal fragmentation because more alternatives exist. It can be achieved by two approaches: (1) grouping attributes into fragments, or (2) splitting the relation into fragments. The collection of vertical fragments should be a lossless-join decomposition. The following figure illustrates these fragments.
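The minterm-based horizontal fragmentation of EMP described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows and the helper name `minterm_fragments` are hypothetical and do not come from the notes. Each minterm is one combination of the simple predicates taken plain or negated, and each tuple falls into exactly one minterm fragment.

```python
from itertools import product

# Hypothetical sample rows of EMP(EmpNo, EName, Job, Sal, Deptno); not from the notes.
EMP = [
    {"EmpNo": 1, "EName": "Asha", "Job": "Clerk", "Sal": 4000, "Deptno": 10},
    {"EmpNo": 2, "EName": "Ravi", "Job": "Clerk", "Sal": 6000, "Deptno": 20},
    {"EmpNo": 3, "EName": "Meera", "Job": "Manager", "Sal": 9000, "Deptno": 10},
    {"EmpNo": 4, "EName": "John", "Job": "Analyst", "Sal": 3000, "Deptno": 30},
]

# The two simple predicates of P_EMP: Job = 'Clerk' and Sal > 5000.
simple_predicates = [
    lambda t: t["Job"] == "Clerk",
    lambda t: t["Sal"] > 5000,
]


def minterm_fragments(relation, predicates):
    """Map each sign pattern (True = plain, False = negated) to its fragment."""
    fragments = {}
    for signs in product([True, False], repeat=len(predicates)):
        fragments[signs] = [
            t for t in relation
            if all(p(t) == s for p, s in zip(predicates, signs))
        ]
    return fragments


# (True, True) is m1, (True, False) is m2, (False, True) is m3, (False, False) is m4.
fragments = minterm_fragments(EMP, simple_predicates)
for signs, rows in fragments.items():
    print(signs, [t["EmpNo"] for t in rows])
```

Because the minterms are mutually exclusive and cover every case, the fragments produced this way are disjoint and together contain every tuple of EMP.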
The given relation can be fragmented as follows:
[A] Horizontal fragmentation according to branch

Mumbai branch
Ac_no  Customer_name  B_name  Balance
200    Smita          Mumbai  1000
426    Aishwarya      Mumbai  1500
683    Samira         Mumbai  1398
252    Ajay           Mumbai  303

Pune branch
Ac_no  Customer_name  B_name  Balance
324    Hemant         Pune    250
153    Smith          Pune    38
500    Tina           Pune    168

[B] Vertical fragmentation

Ac_no  Customer_name
200    Smita
324    Hemant
153    Smith
426    Aishwarya
500    Tina
683    Samira
252    Ajay

Ac_no  Balance  B_name
200    1000     Mumbai
324    250      Pune
153    38       Pune
426    1500     Mumbai
500    168      Pune
683    1398     Mumbai
252    303      Mumbai

When a relation is fragmented, we must be able to recover the original relation from the fragments.
Correctness of Fragmentation
• Completeness: A decomposition of relation R into fragments R1, R2, . . . , Rn is complete if and only if each tuple in R can also be found in some Ri.
• Reconstruction: If relation R is decomposed into fragments R1, R2, . . . , Rn, then there should be some relational operator ⊗ such that R = ⊗ Ri over 1 ≤ i ≤ n (union for horizontal fragments, join for vertical fragments).
• Disjointness: If a relation R is partitioned and a tuple t is in fragment Ri, then t should be in no other fragment Rj, j ≠ i.
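The three correctness rules can be checked mechanically on the customer relation used in this section. The following Python sketch is an illustration only; the function names `check_horizontal` and `join_on_ac_no` are hypothetical. It rebuilds the horizontal fragments by selection on B_name, the vertical fragments by projection, and then tests completeness, disjointness and reconstruction.

```python
# Customer relation from the example: (Ac_no, Customer_name, B_name, Balance).
CUSTOMER = [
    (200, "Smita", "Mumbai", 1000),
    (324, "Hemant", "Pune", 250),
    (153, "Smith", "Pune", 38),
    (426, "Aishwarya", "Mumbai", 1500),
    (500, "Tina", "Pune", 168),
    (683, "Samira", "Mumbai", 1398),
    (252, "Ajay", "Mumbai", 303),
]

# Horizontal fragments: selection on B_name.
mumbai = [t for t in CUSTOMER if t[2] == "Mumbai"]
pune = [t for t in CUSTOMER if t[2] == "Pune"]

# Vertical fragments: projections that both keep the primary key Ac_no.
names = [(ac, name) for ac, name, _, _ in CUSTOMER]
accounts = [(ac, bal, branch) for ac, _, branch, bal in CUSTOMER]


def check_horizontal(relation, fragments):
    """Return (complete, disjoint, reconstructs) for horizontal fragments."""
    all_rows = [t for fragment in fragments for t in fragment]
    complete = set(relation) <= set(all_rows)
    # A relation has no duplicate tuples, so a repeated row means two fragments overlap.
    disjoint = len(all_rows) == len(set(all_rows))
    reconstructs = set(all_rows) == set(relation)  # reconstruction by union
    return complete, disjoint, reconstructs


def join_on_ac_no(name_fragment, account_fragment):
    """Reconstruct the relation by joining the vertical fragments on Ac_no."""
    by_key = {ac: (bal, branch) for ac, bal, branch in account_fragment}
    joined = []
    for ac, name in name_fragment:
        if ac in by_key:
            bal, branch = by_key[ac]
            joined.append((ac, name, branch, bal))
    return joined


print(check_horizontal(CUSTOMER, [mumbai, pune]))
print(join_on_ac_no(names, accounts) == CUSTOMER)
```

Vertical fragments are not disjoint in the key attribute by design: Ac_no is repeated in both so that the join can reassemble the rows.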
Replication
The system maintains multiple copies of data (fragments) stored at different sites, for faster retrieval and fault tolerance.
• A relation or fragment of a relation is replicated if it is stored redundantly at two or more sites.
• Full replication of a relation is the case where the relation is stored at all sites.
• Fully redundant databases are those in which every site contains a copy of the entire database.
• Rule:
Advantages
• Availability: Failure of a site containing fragment/relation R does not result in unavailability of R if replicas of R exist.
• Parallelism: Queries on R may be processed by several nodes in parallel.
• Reduced data transfer: Relation R is available locally at each site containing a replica of R.
Disadvantages
• Increased cost of updates: Each replica of R must be updated.
• Increased complexity of concurrency control: Concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented.
Strategies
• Synchronous replication: Before an update transaction commits, it synchronizes all copies of the modified data (→ Read-One-Write-All (ROWA) technique).
• Asynchronous replication: Changes to the primary copy are propagated (periodically) to the secondary copies (→ snapshots).
The allocation schema describes the mapping of relations or fragments to the servers that store them. This mapping can be:
– Non-redundant, when each fragment or relation is allocated to a single server
– Redundant, when at least one fragment or relation is allocated to more than one server
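The difference between the two replication strategies can be illustrated with a small in-memory model. This Python sketch is a simplification under stated assumptions: the class `ReplicatedItem` and the site names S1, S2, S3 are hypothetical, and there is no failure handling, locking or real network. It only shows when the secondary copies see an update.

```python
class ReplicatedItem:
    """One data item with a primary copy and secondary copies (site names are hypothetical)."""

    def __init__(self, sites, value):
        self.primary = sites[0]
        self.copies = {site: value for site in sites}
        self.pending = []  # updates not yet propagated (asynchronous mode only)

    def write_sync(self, value):
        # Synchronous (ROWA): every copy is written before the update counts as committed.
        for site in self.copies:
            self.copies[site] = value

    def write_async(self, value):
        # Asynchronous: only the primary copy is written now.
        self.copies[self.primary] = value
        self.pending.append(value)

    def propagate(self):
        # Periodic propagation brings the secondary copies up to the latest primary value.
        if self.pending:
            latest = self.pending[-1]
            for site in self.copies:
                self.copies[site] = latest
            self.pending.clear()

    def read(self, site):
        # Read-one: any single copy may be read.
        return self.copies[site]


item = ReplicatedItem(["S1", "S2", "S3"], 100)
item.write_sync(150)
sync_read = item.read("S3")    # the secondary already holds the new value
item.write_async(200)
stale_read = item.read("S3")   # the secondary still holds the old value
item.propagate()
fresh_read = item.read("S3")   # the secondary has caught up
print(sync_read, stale_read, fresh_read)
```

The stale read in the middle is the price of asynchronous replication: updates are cheaper, but a reader at a secondary site may see an old snapshot until the next propagation.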
Transparencies in DDBMS
A distributed database system has certain functional characteristics. These characteristics are grouped together and called transparencies. The DDBMS transparency features are described below.
Distributed Transparency
This allows the user to see the database as a single, logical entity. If this transparency is exhibited, the user does not need to know
• that the data are partitioned,
• that data can be replicated at several sites, or
• where the data is located.
The level of transparency supported by a DDBMS varies from system to system. Three levels of distributed transparency are recognized: fragmentation transparency, allocation transparency and language transparency. In the absence of transparency, each DBMS accepts its own SQL ‘dialect’: the system is heterogeneous and the DBMSs do not support a common interoperability standard.
Let us assume the relation SUPPLIER (SNum, Name, City) with two horizontal fragments
SUPPLIER1 = σ City='Mumbai' (SUPPLIER)
SUPPLIER2 = σ City='Hydrabad' (SUPPLIER)
and the allocation schema:
[email protected]
[email protected]
SUPPLIER2@company.Hydrabad2.in
Fragmentation transparency
At this level, the programmer does not have to worry about whether or not the database is distributed or fragmented. For example, consider the following query:
procedure Query1(:snum, :name);
  SELECT Name INTO :name
  FROM Supplier
  WHERE SNum = :snum;
end procedure;
Here the name for the given snum is found in fragment SUPPLIER1 or SUPPLIER2, but the user does not know that the data is stored on different nodes.
Allocation transparency
At this level, the programmer must know the structure of the fragments but does not have to indicate their allocation. With replication, the programmer also does not have to indicate which copy is chosen for access (replication transparency). For example, consider the following query:
procedure Query2(:snum, :name);
  SELECT Name INTO :name
  FROM Supplier1
  WHERE SNum = :snum;
  IF :empty then
    SELECT Name INTO :name
    FROM Supplier2
    WHERE SNum = :snum;
end procedure;
Language transparency
At this level the programmer must indicate in the query both the structure of the fragments and their allocation. Queries expressed at a higher level of transparency are transformed to this level by the distributed query optimizer, which is aware of data fragmentation and allocation. For example, consider the following query:
procedure Query3(:snum, :name);
  SELECT Name INTO :name
  FROM [email protected]
  WHERE SNum = :snum;
  IF :empty then
    SELECT Name INTO :name
    FROM [email protected]
    WHERE SNum = :snum;
end procedure;
Optimizations
This application can be made more efficient in two ways:
• By using parallelism: instead of submitting the two requests in sequence, they can be processed in parallel, thus saving on the global response time.
• By using knowledge of the logical properties of the fragments (but then the programs are not flexible).
For example, consider the following query:
procedure Query4(:snum, :name, :city);
  CASE :city OF
    "Mumbai":   SELECT Name INTO :name FROM Supplier1 WHERE SNum = :snum;
    "Hydrabad": SELECT Name INTO :name FROM Supplier2 WHERE SNum = :snum;
end procedure;
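The queries at the different transparency levels can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions, not a distributed system: the two fragment tables live in one in-memory database that stands in for the separate sites, the sample suppliers are hypothetical, and a view named supplier plays the role of the global relation.

```python
import sqlite3

# One in-memory database stands in for the sites; the supplier rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier1 (snum INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE supplier2 (snum INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO supplier1 VALUES (1, 'Anil', 'Mumbai');
    INSERT INTO supplier2 VALUES (2, 'Bina', 'Hydrabad');
    CREATE VIEW supplier AS
        SELECT * FROM supplier1 UNION ALL SELECT * FROM supplier2;
""")


def _lookup(table, snum):
    # Table names come only from the fixed strings below, never from user input.
    row = conn.execute(
        f"SELECT name FROM {table} WHERE snum = ?", (snum,)
    ).fetchone()
    return row[0] if row else None


def query1(snum):
    """Fragmentation transparency: the program names only the global relation."""
    return _lookup("supplier", snum)


def query2(snum):
    """Allocation transparency: the program names the fragments, one after the other."""
    name = _lookup("supplier1", snum)
    if name is None:
        name = _lookup("supplier2", snum)
    return name


def query4(snum, city):
    """Optimization: the city decides which single fragment is queried."""
    fragment = {"Mumbai": "supplier1", "Hydrabad": "supplier2"}[city]
    return _lookup(fragment, snum)


print(query1(2), query2(2), query4(1, "Mumbai"))
```

Query 4 touches one fragment instead of two, but it hard-codes the fragmentation predicate, which is the loss of flexibility mentioned above.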
Classification of Transactions
Transaction transparency allows a transaction to update data at several sites. It also ensures that the transaction is either completed or aborted, so that database integrity is maintained. To understand how transactions are managed, one should know how they are classified: remote requests, remote transactions, distributed transactions and distributed requests.
Remote request
These are read-only transactions made up of an arbitrary number of SQL queries addressed to a single remote DBMS. The remote DBMS can only be queried. The following figure illustrates this.
[Figure: the TP at Site 1 sends the following request over the network to the DP at Site 2, which holds the Employee table]
SELECT * FROM Employee WHERE eno = 123
Comment: The request is made directly to the Employee table at site 2.
Remote transactions
A remote transaction is made up of any number of SQL commands (select, insert, delete, update) directed to a single remote DBMS; each transaction writes to only one DP. Consider the Customer and Employee tables situated at site 2. The transaction should be able to update both tables, but it can reference only one DP at a time.
[Figure: the TP at Site 1 sends the following transaction over the network to the DP at Site 2, which holds the Employee and Customer tables]
begin:
  UPDATE Employee SET salary = salary + 200 WHERE eno > 123;
  INSERT INTO Customer (cno, cname, age) VALUES (123, 'Seema', 29);
  COMMIT;
Distributed transactions
A distributed transaction is made up of any number of SQL commands (select, insert, delete, update) directed to an arbitrary number of remote DP sites, but each SQL command refers to a single DBMS. Consider a transaction that addresses two remote sites, say 2 and 3. The first request (the SELECT statement) is processed by the DP at remote site 2, and the next requests (the UPDATE and INSERT statements) are processed by the DP at remote site 3. Each request can access only one DP at a time.
[Figure: the TP at Site 1 sends the following transaction over the network; the DP at Site 2 holds the Employee table, and the DP at Site 3 holds the Customer and Product tables]
begin:
  SELECT * FROM Employee WHERE eno > 104;
  UPDATE Customer SET cbalance = cbalance - 100 WHERE cno = 124;
  INSERT INTO Product (pno, price) VALUES (321, 50);
  COMMIT;
Distributed requests
A distributed request is made up of arbitrary transactions, in which each SQL command can refer to any DP that may contain a fragment of the data. Such a request requires a distributed optimizer. Consider the following examples:
Example 1: Let shop (sno, sname) be at site 2, and let customer (cno, cname, bill, sno) and employee (eno, ename, sno) be at site 3. Fetch the tuples to find sname and cname where sno = 123. Figure 1 illustrates this.
Example 2: The distributed request feature allows a single request to reference a physically partitioned table. Divide the employee table into two fragments, say E1 and E2, located at sites 2 and 3. Suppose the end user wants to obtain all tuples whose salary exceeds Rs. 15,000. The request is illustrated in Figure 2.
[Figure 1: the TP at Site 1 sends the following transaction over the network; the DP at Site 2 holds the Shop table, and the DP at Site 3 holds the Customer and Employee tables]
begin:
  SELECT sname, cname FROM shop, customer WHERE shop.sno = 123 AND customer.sno = shop.sno;
  UPDATE Customer SET bill = bill + 5100 WHERE cno = 124;
  INSERT INTO Product (pno, price) VALUES (321, 50);
  COMMIT;
[Figure 2: the TP at Site 1 sends the following request over the network; the DP at Site 2 holds fragment E1 and the DP at Site 3 holds fragment E2]
begin:
  SELECT * FROM employee WHERE salary > 15000;
  COMMIT;
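The distributed request of Example 2 can be imitated with two independent sqlite3 databases, one per site. This Python sketch rests on stated assumptions: the in-memory connections stand in for the DPs at sites 2 and 3, the employee rows and the function names are hypothetical, and the transaction processor is reduced to a loop that sends the same subquery to each fragment and takes the union of the answers.

```python
import sqlite3


def make_site(rows):
    """Create one site: an independent in-memory database holding one fragment."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee (eno INTEGER PRIMARY KEY, ename TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)
    return conn


# Hypothetical sample data for the two fragments.
site2 = make_site([(101, "Asha", 12000), (102, "Ravi", 18000)])   # fragment E1
site3 = make_site([(201, "Meera", 25000), (202, "John", 9000)])   # fragment E2


def distributed_request(sites, min_salary):
    """Send the same subquery to every fragment and merge (union) the results."""
    sql = "SELECT eno, ename, salary FROM employee WHERE salary > ?"
    result = []
    for dp in sites:
        result.extend(dp.execute(sql, (min_salary,)).fetchall())
    return sorted(result)


high_paid = distributed_request([site2, site3], 15000)
print(high_paid)
```

The user of `distributed_request` writes one request against the logical employee table; which fragments exist and where they live is hidden inside the function, which is the point of a distributed request.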
Distributed Catalog manager
The catalog manager keeps track of all data distributed across the different sites by recording the name of each replica of each fragment. To preserve local autonomy, a name is stored in the form <local_name, birth_site>. Every site also has a catalog that describes all objects (fragments and replicas) at that site, and in addition keeps track of all replicas of the relations created at that site. To find a relation, it is best to consult the catalog at its birth site, which never changes even if the relation is moved to another site.
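The naming scheme can be sketched as follows. This is a hypothetical illustration, not the catalog of any particular system: the class `SiteCatalog`, the helper functions and the site names are assumptions made for the example. A relation keeps the global name (local_name, birth_site) for its whole life, and the birth site's catalog records where its copies currently are.

```python
class SiteCatalog:
    """Catalog kept at one site (a simplified, hypothetical structure)."""

    def __init__(self, site):
        self.site = site
        self.local_objects = set()  # global names of fragments/replicas stored here
        self.born_here = {}         # local_name -> set of sites currently holding a copy


catalogs = {site: SiteCatalog(site) for site in ("S1", "S2", "S3")}


def create_relation(local_name, birth_site):
    """Register a new relation and return its global name <local_name, birth_site>."""
    catalog = catalogs[birth_site]
    global_name = (local_name, birth_site)
    catalog.local_objects.add(global_name)
    catalog.born_here[local_name] = {birth_site}
    return global_name


def move_relation(global_name, src, dst):
    """Move a copy from src to dst; the global name itself never changes."""
    local_name, birth_site = global_name
    catalogs[src].local_objects.discard(global_name)
    catalogs[dst].local_objects.add(global_name)
    holders = catalogs[birth_site].born_here[local_name]
    holders.discard(src)
    holders.add(dst)


def locate(global_name):
    """Ask the birth site's catalog where the relation is stored now."""
    local_name, birth_site = global_name
    return sorted(catalogs[birth_site].born_here[local_name])


customer = create_relation("Customer", "S1")
move_relation(customer, "S1", "S3")
print(customer, locate(customer))
```

After the move the relation is still called ("Customer", "S1"), and a lookup that starts at S1 still finds it, now at S3.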
Atomicity problem: Fund transfer transaction
Assume the relation ACCOUNT (AccNum, Name, Total), with accounts numbered below 10000 allocated to fragment Account1 and accounts numbered above 10000 allocated to fragment Account2. The code can be developed as follows:
begin transaction
  UPDATE Account1 SET Total = Total - 100000 WHERE AccNum = 3154;
  UPDATE Account2 SET Total = Total + 100000 WHERE AccNum = 14878;
  commit;
end transaction
It is not acceptable for this code to violate atomicity, that is, for one of the modifications to be executed while the other is not.
Technology of distributed databases
Data distribution does not influence consistency and durability:
• Consistency of transactions does not depend on data distribution, because integrity constraints describe only local properties (a limit of the actual DBMS technology).
• Durability is not a problem that depends on data distribution, because each system guarantees local durability by using local recovery mechanisms (logs, checkpoints, and dumps).
A distributed system requires major enhancements in the following areas:
• Query optimization
• Concurrency control
• Reliability control
We will see these concepts in detail.
Distributed query optimization
Besides the factors used in centralized databases, additional factors are required for optimization. Query optimization is required when a DP receives a distributed request from the site where the TP is located; the DP that is queried is responsible for the ‘global optimization’. This DP decides how to break the query down into sub-queries, each addressed to a specific DP, and builds a strategy (plan) of distributed execution, which consists of coordinating the execution programs of the various DPs and the exchange of data among them. Two major additional cost factors are involved:
• the quantity of data transmitted on the network
• the unit cost of transmission
The total cost is given by the following formula:
C_total = C_I/O x n_I/O + C_cpu x n_cpu + C_tr x n_tr
where n_tr is the quantity of data transmitted on the network and C_tr is the unit cost of transmission.
The following examples give an idea of how to determine the “best” execution plan. Consider the following relations:
Student (rno, name, address, cno) on site 1
Course (cno, ctitle, duration) on site 2
These relations are not fragmented and have the following sizes:
• Student contains 10,000 records, each 100 bytes long.
• Course contains 100 records, each 35 bytes long.
Thus the size of the Student relation is 100 * 10,000 = 10^6 bytes, and the size of the Course relation is 100 * 35 = 3500 bytes.
Query: Find the names of the students and the titles of the courses they take. In relational algebra the query can be stated as follows:
π name, ctitle (Student ⋈ cno Course)
The result of this query contains 10,000 records, assuming that every student has opted for a course. Assume further that each record in the query result is 40 bytes long, so the result is 40 * 10,000 = 400,000 bytes.
[A] Suppose the query is submitted at site 2, which is then the result site. We have the following strategies:
(1) Transfer the Student relation to site 2, execute the query there and display the result at site 2. Then 10^6 bytes are transferred from site 1 to site 2.
(2) Transfer the Course relation to site 1, execute the query at site 1 and send the result to site 2. Then 400,000 + 3500 = 403,500 bytes are transferred.
If minimizing the amount of data transferred is our optimization criterion, we choose strategy 2.
[B] Suppose the query is submitted at site 3, which does not contain either of the relations. Site 3 is the result site, where the result of the query is stored. Student is at site 1, Course is at site 2 and the result site is site 3. We have the following strategies:
(1) Transfer the Student and Course relations to site 3 and perform the join at site 3. A total of 10^6 + 3500 = 1,003,500 bytes must be transferred.
(2) Transfer the Student relation to site 2, execute the join at site 2 and send the result to site 3. The size of the query result is 40 * 10,000 = 400,000 bytes, so 400,000 + 10^6 = 1,400,000 bytes are transferred.
(3) Transfer the Course relation to site 1, execute the join at site 1 and send the result to site 3. The size of the query result is 400,000 bytes, so 400,000 + 3500 = 403,500 bytes are transferred.
If minimizing the amount of data transferred is our optimization criterion, we choose strategy 3.
As a second example, the following SQL query finds ENAME from the relations EMP (ENO, ENAME, AGE) and ASG (ENO, DUR):
SELECT ENAME
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO AND DUR > 37
Following are the two possible strategies:
Strategy 2 is better because it avoids the Cartesian product. So what is the problem? Suppose ASG has two fragments, ASG1 and ASG2, located at sites 1 and 2. Suppose EMP has two fragments, EMP1 and EMP2, located at sites 3 and 4. The result is stored at site 5 [see the figure]. Let us assume that size (EMP) = 400, size (ASG) = 1000, tuple access cost = 1 unit and tuple transfer cost = 10 units. The (10+10) terms below correspond to 10 qualifying tuples coming from each of the two fragments.

For strategy 1 (select and join at the sites that hold the fragments):
§ Produce ASG': (10+10) * tuple access cost = 20
§ Transfer ASG' to the sites of EMP: (10+10) * tuple transfer cost = 200
§ Produce EMP': (10+10) * tuple access cost * 2 = 40
§ Transfer EMP' to result site: (10+10) * tuple transfer cost = 200
Total cost = 460

For strategy 2 (transfer both relations to the result site):
§ Transfer EMP to site 5: 400 * tuple transfer cost = 4,000
§ Transfer ASG to site 5: 1000 * tuple transfer cost = 10,000
§ Produce ASG': 1000 * tuple access cost = 1,000
§ Join EMP and ASG': 400 * 20 * tuple access cost = 8,000
Total cost = 23,000
Thus strategy 1 is better than 2.
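The arithmetic behind the two totals can be checked with a minimal Python sketch. The constant and function names are illustrative and not part of the original notes; the value of 10 qualifying tuples per fragment is read off the (10+10) terms.

```python
# Cost parameters taken from the example above.
TUPLE_ACCESS_COST = 1      # units per tuple accessed
TUPLE_TRANSFER_COST = 10   # units per tuple transferred
SIZE_EMP = 400             # tuples in EMP
SIZE_ASG = 1000            # tuples in ASG
QUALIFYING_PER_FRAGMENT = 10   # assumption read from the (10+10) terms
NUM_FRAGMENTS = 2

SELECTED = QUALIFYING_PER_FRAGMENT * NUM_FRAGMENTS  # 20 tuples in ASG'


def strategy_1_cost():
    """Select at the ASG sites, join at the EMP sites, ship the result."""
    produce_asg = SELECTED * TUPLE_ACCESS_COST
    transfer_asg = SELECTED * TUPLE_TRANSFER_COST
    produce_emp = SELECTED * TUPLE_ACCESS_COST * 2
    transfer_emp = SELECTED * TUPLE_TRANSFER_COST
    return produce_asg + transfer_asg + produce_emp + transfer_emp


def strategy_2_cost():
    """Ship both relations to the result site and do all the work there."""
    transfer_emp = SIZE_EMP * TUPLE_TRANSFER_COST
    transfer_asg = SIZE_ASG * TUPLE_TRANSFER_COST
    produce_asg = SIZE_ASG * TUPLE_ACCESS_COST
    join = SIZE_EMP * SELECTED * TUPLE_ACCESS_COST
    return transfer_emp + transfer_asg + produce_asg + join
```

Calling strategy_1_cost() and strategy_2_cost() gives 460 and 23,000 units respectively, matching the totals above.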
Concurrency Control

There are various techniques for determining which objects are to be locked in a distributed database, and for obtaining and releasing the locks. The locks are determined by the concurrency control protocol. Lock management can be distributed across sites in the following ways:

Centralized: A single site is in charge of handling lock and unlock requests for all objects.
Primary copy: One copy of each object is called the primary copy. All requests to lock or unlock a copy of this object are handled by the lock manager at the site where the primary copy is stored.
Fully distributed: A request to lock or unlock a copy of an object stored at a site is handled by the lock manager of the site where that copy is stored.

The centralized scheme is vulnerable and therefore not popular, because the failure of the single site that manages the locks disrupts the functioning of all sites.
The primary copy scheme avoids this problem, but in general, reading an object requires communication with two sites: the site where the primary copy resides and the site where the copy to be read resides. This problem is avoided in the fully distributed scheme, because locking is done at the site where the copy to be read resides. However, when writing, locks must be set at all sites where copies are modified in the fully distributed scheme, whereas locks need to be set at only one site in the other two schemes.

Serializability in distributed databases

In a distributed system, a transaction ti can carry out various sub-transactions tij, where the second subscript denotes the node of the system on which the sub-transaction works. For example, let there be two nodes, node 1 and node 2; then the transactions on them can be performed as follows:

t1                          t2
Node 1: Read(x) Write(x)    Node 2: Read(y) Write(y)
Node 2: Read(y) Write(y)    Node 1: Read(x) Write(x)
This can be written as follows:

t1 : r11(x) w11(x) r12(y) w12(y)
t2 : r22(y) w22(y) r21(x) w21(x)

Local serializability within each scheduler is not a sufficient guarantee of global serializability. Consider the two schedules at nodes 1 and 2 [see the table above]:

S1: r11(x) w11(x) r21(x) w21(x)
S2: r22(y) w22(y) r12(y) w12(y)

These are locally serializable, but their global conflict graph has a cycle: at node 1, t1 precedes t2 and is in conflict with t2, while at node 2, t2 precedes t1 and is in conflict with t1. Thus we define:

Global serializability of distributed transactions over the nodes of a distributed database requires the existence of a single serial schedule S equivalent to all the local schedules Si produced at each node.

The following properties are valid.
(1) If each scheduler of a distributed database uses the two-phase locking method on each node and carries out the commit action only when all the sub-transactions have acquired all their resources, then the resulting schedules are globally conflict-serializable. This is imposed by the two-phase commit protocol.
(2) If each distributed transaction acquires a single timestamp and uses it in all requests to all the schedulers that use concurrency control based on timestamps, the resulting schedules are globally serial, based on the order imposed by the timestamps.
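The cycle in the global conflict graph of S1 and S2 can be checked mechanically. The following is a minimal Python sketch, not a production scheduler: the function names are illustrative, and the parser assumes operations written as in the text (for example r11(x)) with single-digit transaction and node numbers.

```python
from itertools import combinations


def parse(op):
    """Split an operation such as 'r11(x)' into (kind, transaction, item)."""
    kind = op[0]   # 'r' for read, 'w' for write
    txn = op[1]    # first subscript: the transaction number
    item = op[op.index("(") + 1:op.index(")")]
    return kind, txn, item


def conflict_edges(schedule):
    """Return edges (ti, tj): an operation of ti precedes a conflicting one of tj."""
    ops = [parse(op) for op in schedule.split()]
    edges = set()
    # combinations() keeps schedule order, so the first element comes earlier.
    for (k1, t1, x1), (k2, t2, x2) in combinations(ops, 2):
        if t1 != t2 and x1 == x2 and "w" in (k1, k2):
            edges.add((t1, t2))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed conflict graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True   # back edge: a cycle exists
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph[node])
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in graph)


S1 = "r11(x) w11(x) r21(x) w21(x)"
S2 = "r22(y) w22(y) r12(y) w12(y)"
global_edges = conflict_edges(S1) | conflict_edges(S2)
```

Each local graph is acyclic on its own, while has_cycle(global_edges) returns True because S1 contributes the edge t1 to t2 and S2 contributes the edge t2 to t1.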
Lamport Method for assigning timestamps

The Lamport method for assigning timestamps reflects the precedence among events in a distributed system. The method is as follows:
§ A timestamp is a number characterized by two groups of digits:
(1) The least significant digits identify the node at which the event occurs.
(2) The most significant digits identify the events that happen at that node. They can be obtained from a local counter, which is incremented at each event.
§ Each time two nodes exchange a message, the timestamps become synchronized. The receiving event must have a timestamp greater than the timestamp of the sending event, which may require increasing the local counter on the receiving node.

The following example shows how timestamps can be assigned using the Lamport method.
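A minimal Python sketch of these rules is given here under stated assumptions: the class name and method names are illustrative, and a timestamp is represented as the pair (counter, node id), which compares in the same order as a number whose most significant digits are the counter and whose least significant digits are the node identifier.

```python
class Node:
    """A node that stamps its events with (local counter, node id)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        # The local counter is incremented at each event.
        self.counter += 1
        return (self.counter, self.node_id)

    def send(self):
        # Sending a message is itself an event; its timestamp travels with it.
        return self.local_event()

    def receive(self, sender_timestamp):
        # The receiving event must be later than the sending event, so the
        # local counter jumps ahead of the sender's counter when necessary.
        sender_counter, _ = sender_timestamp
        self.counter = max(self.counter, sender_counter) + 1
        return (self.counter, self.node_id)


n1, n2 = Node(1), Node(2)
e1 = n1.local_event()    # (1, 1)
e2 = n1.local_event()    # (2, 1)
msg = n1.send()          # (3, 1)
e3 = n2.local_event()    # (1, 2)
e4 = n2.receive(msg)     # (4, 2): counter raised from 1 to 4
```

Node 2 had only reached counter 1, so receiving the message stamped (3, 1) forces its counter up to 4, which keeps the receive event after the send event.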
Distributed deadlocks

Distributed deadlocks can be due to circular waiting situations between two or more nodes. Each node is given a fixed time in which to reply; if one of the nodes does not respond within that time, a deadlock may have occurred. Deadlock resolution can be done with an asynchronous and distributed protocol (implemented in a distributed version of DB2 by IBM).

Distributed deadlock resolution

Assume that sub-transactions are activated using a remote procedure call, that is, a synchronous call to a procedure that is executed remotely. This model allows for two distinct types of waiting:
§ Two sub-transactions of the same transaction can be waiting on distinct DPs, as one waits for the termination of the other. For instance, if t11 activates t12, it waits for the termination of t12.
§ Two different sub-transactions on the same DP can wait because one has locked a data item to which the other requires access. For instance, if t11 locks an object requested by t21, then t21 waits for the termination of t11.

The following figure illustrates this:
Representation of waiting conditions

The waiting conditions at each DP can be characterized using precedence conditions. Let ti and tj be two transactions such that ti has to wait until tj is completed. The general format of a waiting condition is given by a wait sequence:

EXT < ti < tj < EXT

where EXT represents execution at a remote DP. For example:
§ On DP1 we have: EXT < t21 < t11 < EXT
§ On DP2 we have: EXT < t12 < t22 < EXT

On DP1, for instance, t21 waits for t11, which holds the lock, and t11 in turn waits for its remote sub-transaction t12.
The distributed deadlock detection algorithm is periodically activated on the various DPs of the system. When it is activated, it:
§ integrates new wait sequences with the local wait conditions as described by the lock manager;
§ analyzes the wait conditions on its DP and detects local deadlocks. It communicates the wait sequences to other instances of the same algorithm;
§ to avoid the situation in which the same deadlock is discovered more than once, sends wait sequences:
– only ‘ahead’, towards the DP which has received the remote procedure call;
– only if, for example, i > j, where i and j are the identifiers of the transactions.
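One activation of this algorithm on the two-DP example can be simulated with a short Python sketch. It is a simplification under stated assumptions: all names are illustrative, each wait sequence EXT < ti < tj < EXT is reduced to a single edge (transaction i waits for transaction j) plus the DP that received the remote procedure call from tj, and only one round of forwarding is modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WaitSequence:
    waiter: int      # i in EXT < ti < tj < EXT
    holder: int      # j
    called_dp: int   # DP that received the remote procedure call from tj


def should_forward(seq):
    """Forwarding rule from the text: send only if i > j."""
    return seq.waiter > seq.holder


def find_deadlock(sequences):
    """Return True if the waiter -> holder edges contain a cycle."""
    waits_for = {}
    for seq in sequences:
        waits_for.setdefault(seq.waiter, set()).add(seq.holder)

    def reachable(start, target, seen):
        for nxt in waits_for.get(start, ()):
            if nxt == target:
                return True
            if nxt not in seen and reachable(nxt, target, seen | {nxt}):
                return True
        return False

    return any(reachable(t, t, {t}) for t in waits_for)


def run_round(local):
    """local maps a DP id to its wait sequences; returns DP id -> deadlock found."""
    inbox = {dp: [] for dp in local}
    for seqs in local.values():
        for seq in seqs:
            if should_forward(seq):
                inbox[seq.called_dp].append(seq)   # sent 'ahead'
    return {dp: find_deadlock(local[dp] + inbox[dp]) for dp in local}


# DP1: EXT < t21 < t11 < EXT, where t11 has called t12 on DP2.
# DP2: EXT < t12 < t22 < EXT, where t22 has called t21 on DP1.
local_waits = {
    1: [WaitSequence(waiter=2, holder=1, called_dp=2)],
    2: [WaitSequence(waiter=1, holder=2, called_dp=1)],
}
result = run_round(local_waits)
```

Only DP1 forwards its sequence (2 > 1), so the cycle between t1 and t2 is discovered once, at DP2, and result is {1: False, 2: True}.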
Failures in Distributed Systems

A distributed system is subject to node failures, message losses, or network partitioning.
• Node failures may occur on any node of the system and be soft or hard, as discussed before.
• Message losses leave the execution of a protocol in an uncertain situation.
– Each protocol message (msg) is followed by an acknowledgement message (ack).
– The loss of either one leaves the sender uncertain about whether the message has been received.
• Network partitioning is a failure of the communication links of the computer network, which divides it into two sub-networks that have no communication between each other.
– A transaction can be simultaneously active in more than one sub-network.

Two-phase commit protocol

Commit protocols allow a transaction to reach the correct commit or abort decision at all the nodes that participate in a transaction. The two-phase commit protocol is similar in essence to a marriage, in that the decision of two parties is received and registered by a third party, who ratifies the marriage.
The servers, which represent the participants in the marriage, are called resource managers (RM), and the role of celebrant (or coordinator) is allocated to a process called the transaction manager (TM). The protocol takes place by means of a rapid exchange of messages between the TM and the RMs and the writing of records into their logs. The TM can use broadcast mechanisms (transmission of the same message to many nodes, collecting responses arriving from various nodes) and serial communication with each of the RMs in turn. Let us see which records are stored by the TM and the RMs.

Records of TM
§ The prepare record contains the identity of all the RM processes (that is, their identifiers of nodes and processes).
§ The global commit or global abort record describes the global decision. When the TM writes the global commit or global abort record in its log, it reaches the final decision.
§ The complete record is written at the end of the two-phase commit protocol.

Records of RM
§ The ready record indicates the irrevocable availability to participate in the two-phase commit protocol, thereby contributing to a decision to commit. It can be written only when the RM is “recoverable”, i.e., possesses locks on all resources that need to be written. The identifier (process identifier and node identifier) of the TM is also written in this record.
§ In addition, begin, insert, delete, and update records are written as in centralized servers.
§ At any time an RM can autonomously abort a sub-transaction, by undoing its effects, without participating in the two-phase commit protocol.
First phase of the basic protocol
§ The TM writes the prepare record in its log and sends a prepare message to all the RMs. It sets a timeout indicating the maximum time allocated to the completion of the first phase.
§ The recoverable RMs write the ready record in their own logs and transmit to the TM a ready message, which indicates the positive choice of commit participation.
§ The non-recoverable RMs send a not-ready message and end the protocol.
§ The TM collects the reply messages from the RMs.
– If it receives a positive message from all the RMs, it writes a global commit record in its log.
– If one or more negative messages are received, or the timeout expires without the TM receiving all the messages, it writes a global abort record in its log.

Second phase of the basic protocol
§ The TM transmits its global decision to the RMs. It then sets a second timeout.
§ The RMs that are ready receive the decision message, write the commit or abort record in their own logs, and send an acknowledgement to the TM. Then they implement the commit or abort by writing the pages to the database as discussed before.
§ The TM collects all the ack messages from the RMs involved in the second phase. If the timeout expires, it sets another timeout and repeats the transmission to all the RMs from which it has not received an ack.
§ When all the acks have arrived, the TM writes the complete record in its log.
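The sequence of log records produced by the two phases can be traced with a minimal Python sketch. It is a simplification under stated assumptions: the function name and the vote encoding are illustrative, messages are modelled as plain values rather than network traffic, a vote of None stands for a reply that did not arrive before the first timeout, and acknowledgements are assumed to arrive without retransmission.

```python
def two_phase_commit(rm_votes):
    """Trace the basic protocol for one transaction.

    rm_votes maps an RM name to 'ready', 'not-ready', or None (no reply
    before the timeout). Returns (decision, tm_log, rm_logs).
    """
    # Phase 1: the TM logs prepare and asks every RM to vote.
    tm_log = ["prepare"]
    rm_logs = {rm: [] for rm in rm_votes}
    for rm, vote in rm_votes.items():
        if vote == "ready":
            rm_logs[rm].append("ready")   # recoverable RM logs ready

    # The TM commits only if every RM answered ready in time.
    all_ready = all(vote == "ready" for vote in rm_votes.values())
    decision = "global commit" if all_ready else "global abort"
    tm_log.append(decision)

    # Phase 2: ready RMs log the decision and acknowledge it.
    local_record = "commit" if all_ready else "abort"
    for rm, vote in rm_votes.items():
        if vote == "ready":
            rm_logs[rm].append(local_record)

    # All acks are assumed to have arrived, so the TM logs complete.
    tm_log.append("complete")
    return decision, tm_log, rm_logs
```

For example, two_phase_commit({"rm1": "ready", "rm2": "not-ready"}) yields a global abort: rm1 logs ready followed by abort, while rm2, which left the protocol after voting not-ready, logs nothing further.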
The following figure illustrates this protocol.
The two-phase commit protocol in the context of a transaction is shown in the following figure.
Blocking, uncertainty, recovery protocols
• An RM in a ready state loses its autonomy and awaits the decision of the TM. A failure of the TM leaves the RM in an uncertain state. The resources acquired by using locks are blocked.
• The interval between the writing on the RM’s log of the ready record and the writing of the commit or abort record is called the window of uncertainty. The protocol is designed to keep this interval to a minimum.
• Recovery protocols are performed by the TM or RM after failures; they recover a final state which depends on the global decision of the TM.

Recovery of participants
• Recovery is performed by the warm restart protocol and depends on the last record written in the log:
– When it is an action or abort record, the actions are undone; when it is a commit, the actions are redone. In both cases, the failure has occurred before starting the commit protocol.
– When the last record written in the log is a ready, the failure has occurred during the two-phase commit. The participant is in doubt about the result of the transaction.
§ During the warm restart protocol, the identifiers of the transactions in doubt are collected in the ready set. For each of them the final transaction outcome must be requested from the TM.
§ This can happen as a result of a direct (remote recovery) request from the RM or as a repetition of the second phase of the protocol.
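The classification made by a participant during warm restart can be sketched in Python as follows. The function name and the log representation (one list of record names per transaction, in write order) are assumptions made for illustration.

```python
def warm_restart(logs):
    """Classify transactions by the last record found in the RM log.

    logs maps a transaction id to its non-empty list of log records in
    write order. Returns (undo, redo, ready_set).
    """
    undo, redo, ready_set = [], [], []
    for txn, records in logs.items():
        last = records[-1]
        if last == "commit":
            redo.append(txn)        # decision known: redo the actions
        elif last == "ready":
            ready_set.append(txn)   # in doubt: outcome must come from the TM
        else:
            undo.append(txn)        # action or abort record: undo
    return undo, redo, ready_set


rm_log = {
    "t1": ["begin", "update"],
    "t2": ["begin", "update", "ready", "commit"],
    "t3": ["begin", "insert", "ready"],
    "t4": ["begin", "delete", "abort"],
}
undo, redo, ready_set = warm_restart(rm_log)
```

Here t1 and t4 are undone, t2 is redone, and t3 ends up in the ready set, so its outcome has to be obtained from the TM.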
Recovery of the coordinator
§ When the last record in the log is a prepare, the failure of the TM might have placed some RMs in a blocked situation. There are two recovery options:
– Write global abort in the log, and then carry out the second phase of the protocol.
– Repeat the first phase, trying to arrive at a global commit.
§ When the last record in the log is a global decision, some RMs may have been correctly informed of the decision and others may have been left in a blocked state. The TM must repeat the second phase.

Message loss and network partitioning
§ The loss of a prepare message and the loss of a ready message are not distinguishable by the TM. In both cases, the timeout of the first phase expires and a global abort decision is made.
§ The loss of a decision message and the loss of an ack message are also indistinguishable. In both cases, the timeout of the second phase expires and the second phase is repeated.
§ A network partitioning does not cause further problems, in that the transaction will be successful only if the TM and all the RMs belong to the same partition.
Presumed abort protocol
§ The presumed abort protocol is used by most DBMSs.
§ It is based on the following rule:
– when a TM receives a remote recovery request from an in-doubt RM and it does not know the outcome of that transaction, the TM returns a global abort decision as default.
§ As a consequence, the forced writing of prepare and global abort records can be avoided, because in the case of loss of these records the default behavior gives an identical recovery.
§ Furthermore, the complete record is not critical for the algorithm, so it need not be forced; in some systems, it is omitted. In conclusion, the records to be forced are ready, global commit and commit.
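The default rule can be expressed in a few lines of Python. This is only a sketch: the function name and the dictionary of known outcomes are hypothetical stand-ins for whatever the TM can reconstruct from its log.

```python
def answer_remote_recovery(known_outcomes, txn):
    """Reply of the TM to a remote recovery request from an in-doubt RM.

    known_outcomes maps a transaction id to the global decision the TM can
    still find in its log. Anything it cannot find is presumed aborted.
    """
    return known_outcomes.get(txn, "global abort")


tm_outcomes = {"t7": "global commit"}   # only forced records are guaranteed to survive
reply_known = answer_remote_recovery(tm_outcomes, "t7")
reply_unknown = answer_remote_recovery(tm_outcomes, "t9")
```

Because an unknown transaction is answered with global abort, losing an unforced prepare or global abort record leads to the same reply as finding it, which is why those records need not be forced.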
Read-only optimization
§ When a participant is found to have carried out only read operations (no write operations), it responds read-only to the prepare message and suspends the execution of the protocol.
§ The coordinator ignores read-only participants in the second phase of the protocol.

Four-phase commit protocol
§ Created by Tandem, a provider of hardware-software solutions for data management based on the use of replicated resources to obtain reliability.
§ The TM process is replicated by a backup process, located on a different node. At each phase of the protocol, the TM first informs the backup of its decisions and then communicates with the RMs.
§ The backup can replace the TM in case of failure.
§ When a backup becomes TM, it first activates another backup, to which it communicates the information about its state, and then continues the execution of the transaction.
Three-phase commit protocol
§ The basic idea is to introduce a third, pre-commit phase into the standard protocol. If the TM fails, a participant can be elected as the new TM and decide the result of the transaction by looking at its log.
– If the new TM finds ready as the last record, no other participant in the protocol can have gone beyond the pre-commit condition, and thus it can make the decision to abort.
– If the new TM finds pre-commit as the last record, it knows that the other participants are at least in the ready state, and thus it can make the decision to commit.
§ The three-phase commit protocol has serious drawbacks and has not been successfully implemented:
– It lengthens the window of uncertainty.
– It is not resilient to network partitioning, unless additional quorum mechanisms are used.
Interoperability
§ Interoperability is the main problem in the development of heterogeneous applications for distributed databases.
§ It requires the availability of functions of adaptability and conversion, which make it possible to exchange information between systems, networks and applications, even when heterogeneous.
§ Interoperability is made possible by means of standard protocols such as FTP, SMTP/MIME, and so on.
§ With reference to databases, interoperability is guaranteed by the adoption of suitable standards.