18 Nov 2015

Data Integration: A Theoretical Perspective

http://dl.acm.org/citation.cfm?id=543644 (PODS '02, cited 2400+ times)

ABSTRACT. Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data… This document presents an overview of the material to be presented in a tutorial on data integration… Special attention will be devoted to the following aspects: modeling a data integration application, processing queries in data integration, dealing with inconsistent data sources, and reasoning on queries.

1. INTRODUCTION

The data integration systems of interest in this work are characterized by an architecture based on a global schema and a set of sources. The sources contain the real data, while the global schema provides a reconciled, integrated, and virtual view of the underlying sources.

so-called global schema (or, mediated schema).

global-as-view (GAV)
requires that the global schema be expressed in terms of the data sources
local-as-view (LAV)
requires the global schema to be specified independently from the sources; the relationships between the global schema and the sources are established by defining every source as a view over the global schema.

2. DATA INTEGRATION FRAMEWORK

M is the mapping between G and S,
constituted by a set of assertions of the forms qS ↝ qG and qG ↝ qS, where
qS and qG are two queries of the same arity, respectively over the source schema S and over the global schema G… over the alphabets AS and AG.
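
To make the shape of a mapping assertion concrete, here is a minimal Python sketch. The class names `Query` and `Assertion`, the relation names `s1` and `movie`, and the encoding of a query as head variables plus body atoms are all assumptions made for illustration; the paper does not prescribe any representation. The only property taken from the text is that the two queries in an assertion must have the same arity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """A query as head variables plus body atoms (hypothetical encoding)."""
    head: tuple  # e.g. ("t", "y")
    body: tuple  # atoms of the form (relation_name, (arg, ...))

    @property
    def arity(self) -> int:
        return len(self.head)


@dataclass(frozen=True)
class Assertion:
    """One mapping assertion lhs ~> rhs between a source and a global query."""
    lhs: Query
    rhs: Query

    def __post_init__(self):
        # The framework requires both queries to have the same arity.
        if self.lhs.arity != self.rhs.arity:
            raise ValueError("both queries in an assertion must have the same arity")


# qS over the source alphabet, qG over the global alphabet (made-up relations).
q_s = Query(head=("t", "y"), body=(("s1", ("t", "y")),))
q_g = Query(head=("t", "y"), body=(("movie", ("t", "y", "d")),))
assertion = Assertion(lhs=q_s, rhs=q_g)
```

Building an `Assertion` from queries of different arity raises `ValueError`, which is the one structural check the definition above implies.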

3. MODELING

the specification of the correspondence between the data at the sources and those in the global schema

3.1 Local as view

In a data integration system I = ⟨G, S, M⟩ based on the LAV approach, the mapping M associates to each element s of the source schema S a query qG over G… Therefore, a LAV mapping is a set of assertions, one for each element s of S, of the form: s ↝ qG

the LAV approach favors the extensibility of the system: adding a new source simply means enriching the mapping with a new assertion, without other changes … in particular with the goal of establishing the assumption holding for the various source extensions [1, 53, 65, 24].
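
The extensibility claim can be illustrated with a small Python sketch. Everything here is hypothetical: the global relations `movie(title, year, director)` and `european(director)`, the sources `s1` and `s2`, the sample tuples, and the choice to encode each view qG as a Python function over a global database. The point is only that adding a source adds one assertion and leaves the existing ones untouched.

```python
# Hypothetical global database. In a real system the global schema is virtual;
# it is materialized here only so the views can be evaluated.
global_db = {
    "movie": {("Persona", 1966, "Bergman"), ("Jaws", 1975, "Spielberg")},
    "european": {("Bergman",)},
}

# LAV mapping: one assertion s ~> qG per source, with qG a query over G.
lav_mapping = {
    # s1(title, year): movies directed by European directors
    "s1": lambda g: {(t, y) for (t, y, d) in g["movie"] if (d,) in g["european"]},
}


def with_new_source(mapping, name, view):
    """Return a mapping extended with one assertion; existing ones are unchanged."""
    extended = dict(mapping)
    extended[name] = view
    return extended


# Adding source s2(title): titles of movies released from 1960 on.
extended = with_new_source(
    lav_mapping,
    "s2",
    lambda g: {(t,) for (t, y, d) in g["movie"] if y >= 1960},
)

s1_view = extended["s1"](global_db)
s2_view = extended["s2"](global_db)
```

The global schema and the definition of `s1` are not edited when `s2` arrives, which is the property the text attributes to LAV.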

3.2 Global as view

the mapping explicitly tells the system how to retrieve the data when one wants to evaluate the various elements of the global schema. This idea is effective whenever the data integration system is based on a set of sources that is stable. The GAV approach favors the system in carrying out query processing, because it tells the system how to use the sources to retrieve data. However, extending the system with a new source is now a problem: the new source may indeed have an impact on the definition of various elements of the global schema, whose associated views need to be redefined.

the GAV approach provides a specification mechanism that has a more procedural flavor with respect to the LAV approach.
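
As a rough illustration of the GAV idea, the Python sketch below defines each global relation as a function over the sources, so a query over the global schema is answered by evaluating those definitions. The names `movie`, `s1`, `s2`, the helper functions, and the data are made up for illustration, and plain union is just one possible view definition.

```python
# Hypothetical source database: two sources holding (title, year) pairs.
source_db = {
    "s1": {("Vertigo", 1958), ("Psycho", 1960)},
    "s2": {("Psycho", 1960), ("Alien", 1979)},
}

# GAV mapping: each global relation is a view (here a function) over the sources.
gav_mapping = {
    "movie": lambda s: s["s1"] | s["s2"],
}


def retrieve_global(sources, mapping):
    """Materialize every global relation by evaluating its view over the sources."""
    return {name: view(sources) for name, view in mapping.items()}


def titles_since(global_db, year):
    """A query posed over the global schema only."""
    return {t for (t, y) in global_db["movie"] if y >= year}


global_db = retrieve_global(source_db, gav_mapping)
result = titles_since(global_db, 1960)
```

Bringing in a third source would mean rewriting the `movie` entry itself, which is the redefinition cost the text describes.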

4. QUERY PROCESSING IN LAV

As we already observed, in general, there are several possible global databases that are legal for the data integration system with respect to a given source database.
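
A small Python sketch of why several global databases can be legal for one source database. It assumes a single hypothetical LAV assertion, sound sources (every source tuple must appear in the view result), and no integrity constraints on G; the relation names and tuples are invented.

```python
# Hypothetical LAV assertion: s1(title) ~> { title | movie(title, director) }.
def s1_view(global_db):
    return {(t,) for (t, d) in global_db["movie"]}


# The source only knows titles; directors are not recorded anywhere.
source_db = {"s1": {("Persona",)}}


def is_legal(global_db):
    """Sound-source check: every source tuple must be produced by the view."""
    return source_db["s1"] <= s1_view(global_db)


# Candidate global databases for the same source data.
g_a = {"movie": {("Persona", "Bergman")}}
g_b = {"movie": {("Persona", "someone else"), ("Jaws", "Spielberg")}}
g_c = {"movie": {("Jaws", "Spielberg")}}  # misses the source tuple


def titles(g):
    return {t for (t, d) in g["movie"]}


def directors(g):
    return {d for (t, d) in g["movie"]}


legal = [g for g in (g_a, g_b, g_c) if is_legal(g)]
common_titles = set.intersection(*(titles(g) for g in legal))
common_directors = set.intersection(*(directors(g) for g in legal))
```

Here `g_a` and `g_b` are both legal but disagree on directors, so only the title survives the intersection. Intersecting over two sample databases merely illustrates the idea; it is not a full certain-answer computation, which would range over all legal databases.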

two approaches to view-based query processing, called view-based query rewriting and view-based query answering.

so-called maximally contained rewriting, i.e., an expression that captures the original query in the best way.

6. INCONSISTENCIES BETWEEN SOURCES

If in a data integration system I = ⟨G, S, M⟩, the data retrieved from the sources do not satisfy the integrity constraints of G, then no global database exists for I, and query answering becomes meaningless … occurring when data in the sources are mutually inconsistent… generally dealt with by means of suitable transformation and cleaning procedures to be applied to data retrieved from the sources (see [12, 50]).

A possible solution is to characterize the data integration system I = ⟨G, S, M⟩ (with M = {r1 ↝ V1, ..., rn ↝ Vn}) in terms of those global databases that

  1. satisfy the integrity constraints of G, and
  2. approximate at best the satisfaction of the assertions in the mapping M, i.e., that are as sound as possible.

7. REASONING ON QUERIES

The basic form of reasoning on queries is checking containment, i.e., verifying whether one query returns a subset of the result computed by the other query in all databases.
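
For conjunctive queries, containment can be decided by searching for a homomorphism between the two queries (the classical result of Chandra and Merlin). The Python sketch below is a brute-force version of that test under simplifying assumptions of mine: queries contain variables only (no constants), and a query is encoded as a pair of head variables and body atoms. The function name `contained_in` and the `edge` relation are illustrative.

```python
def contained_in(q1, q2):
    """Return True if conjunctive query q1 is contained in q2.

    A query is (head, body): head is a tuple of variables, body is a list of
    atoms (relation_name, (var, ...)). q1 is contained in q2 exactly when some
    mapping of q2's variables to q1's variables sends q2's head to q1's head
    and every atom of q2 to an atom of q1.
    """
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False

    # The head of q2 must map position by position onto the head of q1.
    start = {}
    for v2, v1 in zip(head2, head1):
        if start.setdefault(v2, v1) != v1:
            return False

    atoms1 = set(body1)

    def extend(i, h):
        # Try to map the i-th atom of q2 onto some atom of q1, then recurse.
        if i == len(body2):
            return True
        rel, args = body2[i]
        for rel1, args1 in atoms1:
            if rel1 != rel or len(args1) != len(args):
                continue
            h2 = dict(h)  # work on a copy so a failed attempt is discarded
            if all(h2.setdefault(a, b) == b for a, b in zip(args, args1)):
                if extend(i + 1, h2):
                    return True
        return False

    return extend(0, start)


# ans(x) :- edge(x, y), edge(y, z)   nodes that start a path of length 2
two_steps = (("x",), [("edge", ("x", "y")), ("edge", ("y", "z"))])
# ans(x) :- edge(x, y)               nodes with at least one outgoing edge
one_step = (("x",), [("edge", ("x", "y"))])

forward = contained_in(two_steps, one_step)
backward = contained_in(one_step, two_steps)
```

Every node that starts a two-edge path also has an outgoing edge, so `two_steps` is contained in `one_step`, while the converse fails on a database with a single edge.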

The notion of “information content” of materialized views is studied in [57] for a restricted class of aggregate queries, with the goal of devising techniques for checking whether a set of views is sufficient for completely answering a given query based on the views.
