翻译

Abstract A database is C-Armstrong for a given set of constraints in a class C if it satisfies every constraint of the set and violates every constraint in C not implied by the set.Therefore, Armstrong databases are test data that perfectly illustrate the current perceptions about the semantics of a schema. We extend the existing theory of Armstrong relations to a toolbox of Armstrong tables. That is,we investigate structural and computational properties of Armstrong tables for the class of functional dependencies (FDs) over SQL tables. Relations are special instances of SQL tables with no duplicate rows and no null value occurrences.While FDs do not enjoy Armstrong tables, the combined class of standard FDs and NOT NULL constraints does enjoy Armstrong tables.The problem of finding an Armstrong table is shown to be precisely exponential for this combined class. However, we establish an algorithm that computes Armstrong tables with a size at most quadratic in that of a minimum-sized Armstrong table. Our resulting toolbox of Armstrong tables can be applied by data engineers to concisely visualize constraints on SQL data. Such support can lead to designs that guarantee efficient data management in practice.

Keywords Armstrong table Functional dependency Schema design SQL Test data Uniqueness

1 Introduction

A database system is a software package that manages a collection of persistent information in a shared, reliable, effective, and efficient way. The core of most database systems is still founded on the relational model of data . In this model, data are stored in a collection of relations that may vary over time. A relation is a set of tuples over a given time-invariant relation schema.

Relations permit the storage of inconsistent data, i.e., data that violate conditions which every meaningful relation ought to satisfy. Consequently, data engineers must identify the conditions, called data dependencies, that restrict the relations to those that are meaningful for the application domain at hand. Functional dependencies (FDs) constitute the probably most important class of data dependencies. According to , they make up around two-thirds of all uni-relational data dependencies (those defined over a single relation schema) in practice.FDs form one of the most prolific database concepts. They are essential for the fundamental tasks of database modeling and design , normalization on the conceptual and logical level , and maintenance. They have proven applications in data cleaning , entry , exchange , integration, sampling , warehousing , indexing , information retrieval , uncertain data management , repairing and consistent query answering , query optimization , access control , and privacy-preserving data publishing among others.A basic prerequisite for addressing these fundamental tasks and taking effective advantage of these applications is that data engineers are able to identify the FDs that are meaningful for their application domain.

Example 1 Suppose that in designing an information system for a company, the team of data engineers has identified the relationship between employees and managers in departments as an important information unit. Therefore, the team has decided to utilize the relation schema Employment with attributes Emp, Dept, and Mgr. The schema stores the name of an employee in the Emp column, the name of the department in which the employee works in the Dept column, and the name of the department’s manager in the Mgr column. The team has identified additional constraints they consider important for the schema Employment. Firstly, the information stored in the columns Emp and Mgr must be total, i.e., no occurrences of null values are permitted in these columns. Consequently, an SQL definition of the table may look as follows:

CREATE TABLE Employment (

Emp VARCHAR NOT NULL,

Dept VARCHAR,

Mgr VARCHAR NOT NULL);

Moreover, the team of data engineers believes that every employee can work for only one department, and that every department has only one manager. These constraints can be expressed as the FDs EmpDept and DeptMgr, respectively. The team would like to consult the experts of the application domain to find out whether their current design choice captures all the requirements necessary. In order to validate their own understanding of the application domain and to facilitate the knowledge acquisition from the domain experts, the team plans to create test data that represents concisely their current understanding of the database.

Example 1 illustrates the findings of previous research : The creation of good test data can help data engineers with the challenge to consolidate the set of meaningful FDs. In fact, the findings suggest to use good test data to identify actually meaningful FDs that are incorrectly perceived as meaningless . It is thus an important task to create good test data effectively and efficiently. Informally, good means here that the test data satisfies the FDs that the team of data engineers considers meaningful and violates the FDs the team perceives as meaningless.

More formally, given a set ∑U{φ} of FDs, we say that ∑ implies φ if every relation that satisfies all FDs in ∑ also satisfies φ. We use ∑* to denote the semantic closure of ∑, i.e., the set of all FDs implied by ∑. Armstrong relations have been identified as an effective means to represent concisely abstract sets of FDs in the form of a single relation. Indeed, Armstrong relations are widely regarded as good test data and a helpful tool for data engineers to judge, justify,convey, or test their understanding of the relation schema . More precisely, an Armstrong relation for a set of FDs is a single relation that satisfies every FD in ∑* and violates every FD not in∑*. Hence, an FD is implicit in the explicit specification of an FD set if and only if it is satisfied by an Armstrong relation for _. For this reason, Armstrong relations represent one of the few instances where example-based reasoning is effective. That is, if constitutes the design choice of the set of FDs currently perceived as meaningful to the underlying application domain, then an Armstrong relation r for constitutes an example on which data engineers can test the meaningfulness of an arbitrary FD φ.Namely, φ is meaningful under the current design choice if and only if the example relation r satisfies φ. Consequently, this approach to design is called design-by-example. Empirical evidence has been presented that Armstrong relations help data engineers to identify meaningful FDs they perceived incorrectly as meaningless before inspecting an Armstrong relation . Industry-leading data modeling tools, such as the ERwin data modeler, emphasize the need for good test data to validate the semantics of the models produced . We exemplify these observations by the following case, expanding on Example 1.

Example 2 Consider the relation schema Employment with attribute set Emp, Dept, and Mgr, and FD set that consists of EmpDept and DeptMgr. As in Example 1, assume that only Emp and Mgr have been declared NOT NULL. In this case, the relation

is an Armstrong relation for and the NOT NULL constraints. Note that, according to the semantics under the no information interpretation of the null value ni , the relation satisfies the FD DeptMgr since the FD can only be violated if there are distinct tuples that are total and agree in the Dept column, and differ in the Mgr column. When the domain experts inspect this Armstrong relation, they are alerted to the fact that Turing has two managers von Neumann and Gödel. More specifically, the domain experts communicate to the team of data engineers that every employee should only occur once in the Emp column. Consequently, the inspection of the Armstrong relation has revealed that under the current design choice neither the meaningful FD EmpMgr nor the meaningful key Emp have been successfully captured. As a consequence, the team has now a choice of either specifying the FD Emp Mgr additionally to the FDs in , or declaring Dept as NOT NULL.

Example 2 illustrates the assistance that Armstrong relations can provide to data engineers in their task to capture the semantics of the underlying application domain. Unfortunately, the current toolbox of Armstrong relations does not apply to SQL tables where arbitrary attributes can be declared NOT NULL. That means that currently, there is no effective support in identifying the set of meaningful FDs over SQL tables. While FDs have been studied in this context , the concept of Armstrong relations has not been investigated yet. To address this research gap, we adopt the formal framework of Lien , Atzeni, and Morfuni who have studied FDs under Zaniolo’s no information interpretation of nulls. It is intuitive and well known that FDs interact quite differently in the presence of NOT NULL constraints than they do over total relations . Therefore, it should come as no surprise that Armstrong relations for FD sets in the presence of NOT NULL constraints can be quite different from Armstrong relations for FDs over total relations. The following example illustrates that currently existing techniques for creating total Armstrong relations do not apply to real SQL table definitions.

Example 3 Consider the relation schema Employment with attributes Emp, Dept and Mgr, and FD set that consists of Emp Dept and DeptMgr. Suppose we would apply the existing techniques to create the total Armstrong relation for . Then, this is only an Armstrong relation for in the special case where all attributes are NOT NULL, but not an Armstrong relation for the SQL table definition of Example 1. If we incorrectly used this total relation as a concise representation for our SQL table definition of Example 1, then, it would give us the false impression that the current design choice successfully captures both the meaningful key Emp and the meaningful FD EmpMgr. Consequently, there is a real need to study Armstrong relations for FDs in the presence of arbitrary NOT NULL constraints.

Contributions and organization. In this article, we investigate the concept of Armstrong tables for FDs in the presence of NOT NULL constraints. Following previous research , we adopt Zaniolo, Lien, Atzeni, and Morfuni’s no information interpretation of null values. This is the most primitive interpretation which allows data engineers to model non-existent information as well as existent information that is currently unknown. In this context, Atzeni and Morfuni have studied FDs in the presence of a null-free subschema (NFS). Essentially, an NFS is the subset of the underlying relation schema whose attributes have been declared NOT NULL. More specifically, Atzeni and Morfuni have presented an axiomatization for the implication of FDs in the presence of an NFS, and an algorithm that decides the corresponding implication problem in time linear in the input size . Note that the combination of FDs and an NFS subsumes the uniqueness and (primary) key constraints over SQL relations.

The objective of this article is to extend the existing theory of Armstrong relations to a toolbox that is applicable to SQL table definitions with functional dependencies. We review previous work in Sect. 2. Subsequently, we define the underlying concepts and present related previous results in Sect. 3.

As a first contribution, we show in Sect. 4 that FDs do not enjoy Armstrong relations when null values are permitted to occur. That is, we identify relation schemata and sets of FDs for which no Armstrong relations exist. This is in contrast to the case of total relations. Fortunately, the source of this negative result are so-called non-standard FDs. These are FDs of the form Ø → A with an empty attribute set on the LHS. One may argue that these FDs do not occur in practice since they would enforce all entries in the A-column to be the same. We will show that the class of standard FDs does enjoy Armstrong relations, even in the presence of an arbitrary NFS. The situation is reminiscent of the situation for functional and inclusion dependencies over total relations where only standard FDs and inclusion dependencies enjoy Armstrong databases . For these reasons, we will focus on the class of standard FDs in the presence of an NFS.

As a second contribution, we characterize Armstrong tables in Sect. 5. That is, we give sufficient and necessary conditions for a given relation to be an Armstrong table for a given set of standard FDs and a given NFS. Our characterization is based on extensions of the notions of agree sets , maximal sets and “closed” sets from the special case of total relations. However, we demonstrate that in the presence of null values, the closure operation is not idempotent. Therefore, the maximal sets are no longer the intersection generators of closed sets, a property which was fundamental in the development of the results for the special case of total relations . This observation constitutes a significant challenge on the development of our results.

As a third contribution, we establish in Sect. 6 an algorithm that given a relation schema R, an NFS Rs and a set of standard FDs over R, computes an Armstrong table for and Rs . The algorithm is a non-trivial extension of the algorithm by Mannila and Räihä for FDs over total relations . It is based on the computation of attribute subsets that are maximal for a given attribute, i.e., maximal with the property that the attribute is not functionally dependent on the subset. The computation of the families of maximal attribute subsets is incremental in the given set of standard FDs. The families evolve conceptually differently than they do in the special case of total relations.

As a fourth contribution, we investigate the complexity of certain problems associated with Armstrong tables in Sect. 7. While the problem of finding an Armstrong table remains precisely exponential in general, the size of the Armstrong tables that our algorithm produces is shown to be at most quadratic in the size of a minimum-sized Armstrong table. We show that the size of Armstrong tables for a given set of FDs can be exponentially smaller than an optimal cover of . Therefore, just for the reason of the size of a representation, data engineers should always consider both representations of business rules: as abstract FD sets and as Armstrong tables for these. Finally, we also show that the problem of deciding whether a given attribute set is a Codd key (i.e., it enforces totality and uniqueness) with respect to a given set of FDs and an NFS is NP-complete. Therefore, our results extend the existing toolbox of Armstrong relations to SQL table definitions with no loss in time efficiency and almost no loss in space efficiency over total relations.

As a fifth contribution, we demonstrate in Sect. 8 how our results carry over to i) schemata on which no Codd key has been specified and ii) Armstrong tables for standard FDs and an NFS in the world of all FDs. Concerning i) we need to study the combined class of uniqueness constraints and FDs in the presence of an NFS. Concerning ii) Armstrong tables for the world of all FDs must additionally violate all non-trivial non-standard FDs, in contrast to the world of standard FDs.

As a final contribution, we illustrate the impact of our results on database practice in Sect. 9. We show how Armstrong tables can assist data engineers in recognizing meaningful constraints that were previously incorrectly perceived as meaningless. The recognition of these constraints can result in advanced designs that guarantee the absence of data redundancy and update anomalies, ensure new opportunities for the efficient processing of queries, and allow security officers to gain an advanced understanding about possible inference attacks.

In summary, our contributions extend well-known results from total relations to SQL tables. Hence, the resulting toolbox can be applied to instances that occur in real database systems. Moreover, almost all of the nice properties previously established for the special case of total relations are retained in the more general case. We conclude in Sect. 10 where we also discuss multiple directions of future work.

Proofs. Due to space restrictions, we have moved some of the proofs to the electronic appendix.

2 Related work

Data dependencies have been studied thoroughly in various data models, and we refer the reader to more detailed surveys .

2.1 Total relations

We begin with a summary of the work over total relations. Keys and FDs are concepts almost as old as the relational model of data itself . A total relation over a relation schema R satisfies a key X if and only if it satisfies the functional dependency X R. Therefore, keys are subsumed by FDs. Armstrong established the first axiomatization of FDs over total relations , now known as the Armstrong axioms. In fact, Armstrong showed that the Armstrong axioms are even strongly complete for the implication of FDs, i.e., for an arbitrary relation schema and an arbitrary set of FDs on that schema, he constructed a single finite set of tuples which satisfies precisely all implied FDs. That is, the reason why such specific relations became known as Armstrong relations. In general, axiomatizations can be applied by data engineers to enumerate all implied data dependencies.In practice, such an enumeration is often desirable, e.g., to validate the correct specification of explicit knowledge, to design and fine-tune databases or to optimize queries. In particular, the completeness of the inference rules ensures that all opportunities of utilizing implicit knowledge for these purposes have been exploited. Furthermore, an analysis of the completeness argument can provide invaluable hints for finding algorithms that efficiently decide the associated implication problem, i.e., to decide for an arbitrarily given FD set ∪ {φ} whether implies φ. For FDs over total relations, implication can be decided in time linear in the total number of attributes that occur in the input . Such decision algorithms complement the enumeration algorithm by a further reasoning capability that can make efficient, but only partial decisions about implicit knowledge. These decisions are only partial in the sense that the input to this algorithm must also contain a candidate for an implied functional dependency.In contrast, the enumeration algorithm simply lists all implied dependencies. The reason for the prominence of the implication problem is manifold.An algorithm for testing the implication of dependencies enables us to test whether two given sets of dependencies are equivalent or whether a given set of dependencies is redundant. A solution to these problems is a big step toward automated database schema design which some researchers see as the ultimate goal in dependency theory . Moreover, such an algorithm can be used in relational normalization theory and practice involving many normal form proposals , requirements engineering and schema validation , data mining , in database security, view maintenance and in query optimization . More recently, the implication problem has received a lot of attention in other data models as well . New application areas involve data cleaning , data transformations , consistent query answering , and data exchange and data integration .

Armstrong relations constitute an invaluable tool for the validation of semantic knowledge, and a user-friendly representation of integrity constraints. Armstrong relations have been deeply studied for keys and FDs . In particular, the existence of Armstrong relations for FDs was shown by Armstrong , and Fagin has shown the existence of Armstrong relations for a large class of data dependencies; however, classes of data dependencies do not necessarily enjoy Armstrong relations by any means. The structure of Armstrong relations for the class of FDs over total relations has mainly been investigated by Beeri, Fagin, Statman and Howard and Mannila and Räihä . In the current article, we will extend several of their results from the special case of total to partial relations. The properties of Armstrong relations have also been analyzed for various other classes of data dependencies . An excellent survey on Armstrong databases is .Over the last decades, the concept of Armstrong relations has been found useful in many database applications. For example, informative Armstrong databases are used for the semantic sampling of existing databases . They constitute small subsets of an existing database that satisfy the same data dependencies. In particular, De Marchi and Petit applied the concept of informative Armstrong relations to generate a sample that only used 0.6% of the tuples in an existing real-world database and satisfied the exact same set of data dependencies. The small size of this sample enabled data engineers to identify the FDs meaningful for the application domain. Design aids have been developed that utilize Armstrong databases to help judge, justify, convey or test the data engineers’ understanding of a given relation schema. Recently, empirical evidence has been established that Armstrong relations help data engineers to recognize meaningful FDs that they were not able to recognize without the help of Armstrong relations . Failure to identify some meaningful functional dependency also means that the output of a requirements analysis is afflicted with errors.Empirical studies show that more than half the errors which occur during systems development are requirements errors . Requirements errors are also the most common cause of failure in systems development projects . The cost of errors increases exponentially over the development life cycle: it is more than 100 times more costly to correct a defect post-implementation than it is to correct it during requirements analysis . This suggests that it would be more effective to concentrate quality assurance efforts in the requirements analysis stage, in order to catch requirements errors as soon as they occur or to prevent them from occurring altogether . Hence, there are also strong economic incentives to utilize Armstrong relations for the acquisition of meaningful FDs. Finally, Kolaitiset al. have established first correspondences between unique characterizations of schema mappings and the existence of Armstrong bases . An Armstrong basis refers to a finite set of pairs consisting of a source instance and a target instance that is a universal solution for the source instance. Our results may open the way to establish further correspondences.

2.2 Partial relations

We will now comment on some of the work concerning data dependencies in the presence of nulls. One of the most important extensions of Codd’s basic relational model is incomplete information. This is mainly due to the high demand for the correct handling of such information in real-world applications. Approaches to deal with incomplete information comprise incomplete relations, orrelations or fuzzy relations . Here,we focus on incomplete relations.

《翻译.doc》
将本文的Word文档下载,方便收藏和打印
推荐:
下载文档
热门推荐
相关推荐