http://dl.acm.org/citation.cfm?id=543644
PODS '02, cited 2400+ times
ABSTRACT. Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data … This document presents an overview of the material to be presented in a tutorial on data integration … Special attention will be devoted to the following aspects: modeling a data integration application, processing queries in data integration, dealing with inconsistent data sources, and reasoning on queries.
The data integration systems of interest in this work are characterized by an architecture based on a global schema and a set of sources.
so-called global schema (or, mediated schema).
the specification of the correspondence between the data at the sources and those in the global schema
In a data integration system I = ⟨G, S, M⟩ based on the LAV approach, the mapping M associates to each element s of the source schema S a query qG over G … Therefore, a LAV mapping is a set of assertions, one for each element s of S, of the form: s ↝ qG
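A LAV mapping of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the global relation `movie(title, year, director)`, the sources `s1` and `s2`, and the helper `satisfies_sound_mapping` are all hypothetical names, and the check assumes the views are sound (every tuple stored in a source must be returned by its query over the global database), which these notes do not state explicitly.

```python
from typing import Callable, Dict, Set, Tuple

Row = Tuple
Database = Dict[str, Set[Row]]
Query = Callable[[Database], Set[Row]]

# Hypothetical global schema G: movie(title, year, director).
# Hypothetical LAV mapping M: each source element s is paired with a
# query q_G over G that describes what the source contains.
lav_mapping: Dict[str, Query] = {
    # s1(title, director): movies released in 1960 or later
    "s1": lambda g: {(t, d) for (t, y, d) in g["movie"] if y >= 1960},
    # s2(title): titles of all movies
    "s2": lambda g: {(t,) for (t, y, d) in g["movie"]},
}


def satisfies_sound_mapping(source_db: Database, global_db: Database,
                            mapping: Dict[str, Query]) -> bool:
    """Assumed semantics (sound views): each source extension must be
    contained in the answer of its query q_G over the global database."""
    return all(source_db.get(s, set()) <= q(global_db)
               for s, q in mapping.items())


source_db: Database = {
    "s1": {("Psycho", "Hitchcock")},
    "s2": {("Psycho",), ("Vertigo",)},
}
candidate: Database = {
    "movie": {("Psycho", 1960, "Hitchcock"), ("Vertigo", 1958, "Hitchcock")},
}
empty: Database = {"movie": set()}

ok_candidate = satisfies_sound_mapping(source_db, candidate, lav_mapping)
ok_empty = satisfies_sound_mapping(source_db, empty, lav_mapping)
```

Note that the mapping never says how to compute `movie` from the sources; it only constrains which global databases are compatible with the source data.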
the LAV approach
the mapping explicitly tells the system how to retrieve the data when one wants to evaluate the various elements of the global schema. This idea is effective whenever the data integration system is based on a set of sources that is stable. The GAV approach favors the system in carrying out query processing, because it tells the system how to use the sources to retrieve data. However, extending the system with a new source is now a problem: the new source may indeed have an impact on the definition of various elements of the global schema, whose associated views need to be redefined.
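The trade-off described here can be illustrated with a minimal Python sketch. All names are hypothetical (the sources `s1`, `s2`, `s3`, the global relation `movie`, and the helper functions), and the sketch assumes a global schema without integrity constraints, so that a query can be answered simply by evaluating the views over the sources.

```python
from typing import Callable, Dict, Set, Tuple

Row = Tuple
Database = Dict[str, Set[Row]]

# Hypothetical sources: s1(title, year) and s2(title, director).
source_db: Database = {
    "s1": {("Psycho", 1960), ("Vertigo", 1958)},
    "s2": {("Psycho", "Hitchcock"), ("Vertigo", "Hitchcock")},
}


def movie_view(s: Database) -> Set[Row]:
    """View over the sources defining movie(title, year, director)."""
    return {(t, y, d)
            for (t, y) in s["s1"]
            for (t2, d) in s["s2"] if t == t2}


# GAV mapping: each element of the global schema is paired with a view.
gav_mapping: Dict[str, Callable[[Database], Set[Row]]] = {"movie": movie_view}


def retrieve_global(s: Database, mapping) -> Database:
    """Evaluate every view, i.e. follow the mapping's retrieval recipe."""
    return {g: view(s) for g, view in mapping.items()}


def titles_from(global_db: Database, year: int) -> Set[Row]:
    """A user query posed over the global schema."""
    return {(t,) for (t, y, d) in global_db["movie"] if y >= year}


recent = titles_from(retrieve_global(source_db, gav_mapping), 1960)


# Adding a hypothetical source s3(title, year, director) forces the
# view associated with `movie` to be rewritten.
def movie_view_v2(s: Database) -> Set[Row]:
    return movie_view(s) | s["s3"]
```

Query processing is a direct evaluation of the views, but the last function shows the cost: the new source changes the definition of an existing global element.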
the GAV approach provides a specification mechanism that has a more procedural flavor with respect to the LAV approach.
As we already observed, in general, there are several possible global databases that are legal for the data integration system with respect to a given source database.
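One way to see why several legal global databases matter is to compute the answers that hold in all of them (the standard notion of certain answers, which is not quoted in these notes). The Python sketch below is a toy illustration only: it assumes sound LAV views and a small, hand-picked finite list of candidate global databases, whereas in general the set of legal databases is infinite. The relation `movie(title, year)`, the source `s1`, and both helper functions are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Row = Tuple
Database = Dict[str, Set[Row]]
Query = Callable[[Database], Set[Row]]


def legal_databases(candidates: List[Database], source_db: Database,
                    mapping: Dict[str, Query]) -> List[Database]:
    """Keep the candidates in which every source extension is contained
    in the answer of its view (assumed sound-view semantics)."""
    return [g for g in candidates
            if all(source_db[s] <= view(g) for s, view in mapping.items())]


def certain_answers(query: Query, legal: List[Database]) -> Set[Row]:
    """Tuples returned by the query in every legal database."""
    if not legal:
        # With no legal database every tuple is vacuously certain;
        # this sketch refuses to answer instead.
        raise ValueError("no legal global database")
    return set.intersection(*(query(g) for g in legal))


# Hypothetical source s1(title), described as the titles of movie(title, year).
mapping: Dict[str, Query] = {"s1": lambda g: {(t,) for (t, y) in g["movie"]}}
source_db: Database = {"s1": {("Psycho",)}}

# The source does not fix the year, so several global databases are legal.
candidates: List[Database] = [
    {"movie": {("Psycho", 1960)}},
    {"movie": {("Psycho", 1961)}},
    {"movie": set()},  # not legal: the source tuple is missing
]

legal = legal_databases(candidates, source_db, mapping)
certain_titles = certain_answers(lambda g: {(t,) for (t, y) in g["movie"]}, legal)
certain_rows = certain_answers(lambda g: set(g["movie"]), legal)
```

The title is an answer in every legal database, while no (title, year) pair is, because the legal databases disagree on the year.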
two approaches to view-based query processing, called view-based query rewriting and view-based query answering.
so-called maximally contained rewriting, i.e., an expression that captures the original query in the best way.
If in a data integration system I = ⟨G, S, M⟩, the data retrieved from the sources do not satisfy the integrity constraints of G, then no global database exists for I, and query answering becomes meaningless … occurring when data in the sources are mutually inconsistent … generally dealt with by means of suitable transformation and cleaning procedures to be applied to data retrieved from the sources (see [12, 50]).
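As a concrete illustration of mutually inconsistent sources, the Python sketch below checks a key constraint on data retrieved from two sources. It is a sketch under assumptions: the notes do not say which class of integrity constraints is meant, so a key constraint is used as an example, and the relation `person(ssn, name)`, the two source extensions, and the helper `key_violations` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple

Row = Tuple


def key_violations(rows: Set[Row],
                   key_positions: Sequence[int]) -> Dict[Tuple, Set[Row]]:
    """Group rows by key value and return the groups with more than one
    distinct row, i.e. the violations of the key constraint."""
    groups: Dict[Tuple, Set[Row]] = defaultdict(set)
    for row in rows:
        groups[tuple(row[i] for i in key_positions)].add(row)
    return {k: v for k, v in groups.items() if len(v) > 1}


# Hypothetical global relation person(ssn, name) with key ssn,
# populated from two sources that disagree on one person's name.
from_source_1 = {("123", "Ann"), ("456", "Bob")}
from_source_2 = {("123", "Anna")}
retrieved = from_source_1 | from_source_2

violations = key_violations(retrieved, key_positions=[0])
```

A non-empty result means the retrieved data cannot be stored in any database satisfying the key, which is the situation where cleaning procedures or a relaxed semantics are needed.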
A possible solution is to characterize the data integration system I = ⟨G, S, M⟩ (with M = {r1 ↝ V1, ..., rn ↝ Vn}), in terms of those global databases that … G, and … M, i.e., that are as sound as possible.
The basic form of reasoning on queries is checking …
The notion of “information content” of materialized views is studied in [57] for a restricted class of aggregate queries, with the goal of devising techniques for checking whether a set of views is sufficient for completely answering a given query based on the views.