Track I-D-5:
Data Science
Chair:
Jacques-Emile Dubois, ITODYS, Université de Paris VII, France, and Past-President, CODATA
1.
Quality Control of Data in Data-Sharing Practices and Regulations
Paul Wouters and Anne Beaulieu, Networked Research and Digital
Information (Nerdi), NIWI-KNAW, The Royal Netherlands Academy
of Arts and Sciences, The Netherlands
Scientific research is generating increasing amounts of data. In each
year, more data has been generated than in all previous years combined.
At the same time, knowledge production is becoming more dependent on
data sets. This puts the question of data quality control at center
stage. How is the scientific system coping with the formidable task of
controlling the quality of this flood of data? One area in which this
question has not yet been fully explored is the domain of data-sharing
practices and regulations. The need to share data among researchers
and between researchers and the public has been put on the agenda
at the level of science policy (Franken 2000), partly out of
fear that the system might not be able to cope with the abundance
of data. Data sharing is not only a technical issue, but a complex
social process in which researchers have to balance different
pressures and tensions.
Basically, two different modes of data sharing can be distinguished:
peer-to-peer forms of data sharing and repository-based data
sharing. In the first mode, researchers communicate directly
with each other. In the second mode, there is a distance between
the supplier of data and the user in which the rules of the
specific data repository determine the conditions of data sharing.
In both modes, the existence or lack of trust between the data
supplier and the data user is crucial, though in different configurations.
If data sharing becomes increasingly mediated by information
and communication technologies, and hence less dependent on
face-to-face communication, the generation of trust will have
to be organised differently (Wouters and Beaulieu 2001). The
same holds for forms of quality control of the data. How do
researchers check quality in peer-to-peer data sharing?
And how have data repositories and archives taken care of the
need for quality control of the data supplied? Which dimensions
of social relationships seem to be crucial in data quality control?
Which technical solutions have been embedded in this social
process, and what role has been played by information and communication
technologies?
This paper addresses these questions in a number of different scientific
fields (among others, functional brain imaging, high-energy physics,
astronomy, and molecular biology), because different fields tend
to display different configurations of these social processes.
References:
H. Franken (2000), Conference Conclusions in: Access
to Publicly Financed Research, The Global Research Village III
Conference, Conference Report (P. Schröder, ed.), NIWI-KNAW,
Amsterdam.
Paul Wouters and Anne Beaulieu (2001), Trust Building and Data
Sharing - an exploration of research practices, technologies
and policies. Research Project Proposal, OECD/CSTP Working Group
on Datasharing.
2.
Distributed Oriented Massive Data Management: Progressive Algorithms
and Data Structures
Rita Borgo, Visual Computing Group, Consiglio Nazionale delle
Ricerche (C.N.R.), Italy
Valerio Pascucci, Lawrence Livermore National Laboratory (LLNL),
USA
Projects dealing with massive amounts of data need to carefully
consider all aspects of data acquisition, storage, retrieval
and navigation. The recent growth in size of large simulation
datasets still surpasses the combined advances in hardware infrastructure
and processing algorithms for scientific visualization. The
cost of storing and visualizing such datasets is prohibitive,
so that only one out of every hundred time-steps can actually be
stored and visualized.
As a consequence, interactive visualization of results is becoming
increasingly difficult, especially as a daily routine from a desktop.
The high frequency of I/O operations starts to dominate the overall
running time. The visualization stage of the modeling-simulation-analysis
activity, still the most effective way for scientists to gain a
qualitative understanding of simulation results, then becomes
the bottleneck of the entire process. In this setting the efficiency
of a visualization algorithm must be evaluated in the context
of end-to-end systems instead of being optimized individually.
At the system level, the visualization process needs to be designed
as a pipeline of modules that process data in stages, creating a
flow of data that must itself be optimized globally with respect
to the magnitude and location of available resources. To address
these issues we propose an elegant and simple-to-implement framework
for performing out-of-core visualization and view-dependent refinement
of large volume datasets. We adopt a method for view-dependent
refinement that relies on longest-edge bisection strategies, while
introducing a new way of extending the technique to Volume
Visualization that leaves the simplicity of the technique untouched.
Results in this field are applicable in parallel and distributed
computing, ranging from clusters of PCs to more complex and expensive
architectures.
In our work we present a new progressive visualization algorithm
where the input grid is traversed and organized in a hierarchical
structure (from coarse level to fine level) and subsequent levels
of detail are constructed and displayed to improve the output
image. We uncouple the data extraction from its display: the
hierarchy is built by one process that traverses the input 3D
mesh, while a second process traverses the hierarchy and displays it.
The scheme allows us to render partial results at any given time
while the computation of the complete hierarchy makes progress.
The regularity of the hierarchy permits a good data-partitioning
scheme that balances processing time and data-migration time while
maintaining simplicity and memory/computing efficiency.
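An illustrative sketch of the producer/consumer decoupling described above (our simplification, not the authors' implementation: block subsampling stands in for true longest-edge bisection, which operates on tetrahedral meshes):

```python
# One thread emits coarse-to-fine approximations of a volume; the display
# loop consumes whatever partial result is available at any moment.
import threading
import queue
import numpy as np

def build_hierarchy(volume, levels, out_queue):
    """Producer: emit coarse-to-fine approximations of the volume."""
    for level in range(levels, -1, -1):
        step = 2 ** level                   # coarser levels use larger strides
        approx = volume[::step, ::step, ::step]
        out_queue.put((level, approx))      # hand the partial result to the display side
    out_queue.put(None)                     # signal completion

def display_loop(in_queue):
    """Consumer: 'render' (here, just summarize) each partial result as it arrives."""
    while True:
        item = in_queue.get()
        if item is None:
            break
        level, approx = item
        print(f"level {level}: shape={approx.shape}, mean={approx.mean():.3f}")

if __name__ == "__main__":
    vol = np.random.rand(64, 64, 64)        # stand-in for one large simulation time-step
    q = queue.Queue(maxsize=2)              # small buffer: display never waits for the full hierarchy
    builder = threading.Thread(target=build_hierarchy, args=(vol, 4, q))
    builder.start()
    display_loop(q)
    builder.join()
```

In an out-of-core setting the producer would stream blocks from disk and the consumer would be an actual renderer, but the progressive, partially renderable structure is the same.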
3. Knowledge Management in Physicochemical
Property Databases - Knowledge Recovery and Retrieval of NIST/TRC
Source Data System
Qian Dong, Thermodynamics Research Center (TRC), National Institute
of Standards and Technology (NIST), USA
Xinjian Yan, Robert D. Chirico, Randolph C. Wilhoit, Michael
Frenkel
Knowledge management has become increasingly important to physicochemical
databases, which are generally characterized by their complexity
in terms of chemical system identifiers, sets of property values,
the relevant state variables, estimates of uncertainty, and
a variety of other metadata. The need for automation of database
operation, for assurance of high data quality, and for the availability
and accessibility of data sources and knowledge is a driving
force toward knowledge management in the scientific database
field. Nevertheless, current relational database technology
makes the construction and maintenance of database systems of
this kind tedious and error-prone, and it provides less support
than the development of physicochemical databases requires.
The NIST/TRC SOURCE data system is an extensive repository system
of experimental thermophysical and thermochemical properties
and relevant measurement information that have been reported
in the world's scientific literature. It currently consists
of nearly 2 million records for 30,000 chemicals, including pure
compounds, mixtures, and reaction systems, which have already
created both a need and an opportunity for establishing a knowledge
infrastructure and intelligent supporting systems for the core
database. Every major stage of database operations and management,
such as data structure design, data entry preparation, effective
data quality assurance, as well as intelligent retrieval systems,
depends to a degree on substantial domain knowledge. Domain
knowledge regarding characteristics of compounds and properties,
measurement methods, sample purity, estimation of uncertainties,
data ranges and conditions, as well as property data consistency
is automatically captured and then represented within the database.
Based upon this solid knowledge infrastructure, intelligent
supporting systems are being built to assist (1) complex data
entry preparation, (2) effective data quality assurance, (3)
best data and model recommendation, and (4) knowledge retrieval.
In brief, the NIST/TRC SOURCE data system has a three-tier architecture.
The first tier is a relational database management system, the second
tier is the knowledge infrastructure, and the last represents intelligent
supporting systems consisting of computing algorithms, methods, and
tools that carry out particular tasks of database development and
maintenance. The goal of the latter two tiers is to realize the
intelligent management of scientific databases based on the relational
model. The development of the knowledge infrastructure and intelligent
supporting systems is described in the presentation.
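A toy illustration of how a knowledge tier can support quality assurance on top of a relational tier (the table layout, property names, and plausibility ranges below are hypothetical and are not the NIST/TRC SOURCE schema):

```python
# Hypothetical illustration only: invented table, properties, and ranges.
import sqlite3

# "Knowledge tier": per-property plausibility ranges (invented values).
KNOWLEDGE_RULES = {
    "normal_boiling_temperature_K": (20.0, 700.0),
    "density_kg_per_m3": (0.01, 25000.0),
}

def create_db():
    """Relational tier: a minimal table of property records."""
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE property_data (
                    compound TEXT, property TEXT, value REAL,
                    uncertainty REAL, source TEXT)""")
    return db

def quality_check(db):
    """Supporting-system tier: flag records that violate the knowledge rules."""
    suspect = []
    for prop, (lo, hi) in KNOWLEDGE_RULES.items():
        rows = db.execute(
            "SELECT compound, property, value FROM property_data "
            "WHERE property = ? AND (value < ? OR value > ?)", (prop, lo, hi))
        suspect.extend(rows.fetchall())
    return suspect

if __name__ == "__main__":
    db = create_db()
    db.executemany("INSERT INTO property_data VALUES (?, ?, ?, ?, ?)", [
        ("water", "normal_boiling_temperature_K", 373.1, 0.1, "journal A"),
        ("water", "density_kg_per_m3", -997.0, 1.0, "journal B"),  # sign error, flagged below
    ])
    print(quality_check(db))   # -> [('water', 'density_kg_per_m3', -997.0)]
```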
4. Multi-Aspect Evaluation of Data
Quality in Scientific Databases
Juliusz L. Kulikowski, Institute of Biocybernetics and Biomedical
Engineering c/o the Polish Academy of Sciences, Poland
The problem of data quality evaluation arises both when a database
is to be designed and when database users are going to use
data in investigations, learning and/or decision making. However,
it is not entirely clear what it means, exactly, for the quality
of some given data to be high, or even to be higher than that
of some other data. Of course, this suggests that a data quality
evaluation method is possible. If so, it should reflect the
data's utility value, but can it be based on a numerical quality
scale? It was shown by the author (1982) that information utility
value is a multi-component vector rather than a scalar. Its components
should characterise such information features as relevance,
actuality, credibility, accuracy, completeness, acceptability,
etc. Therefore, data quality evaluation should be based on
vector-ordering concepts. For this purpose Kantorovich's concept
of a semi-ordered linear space (K-space) can be used. In this
case vector components should satisfy the general vector-algebra
assumptions concerning additivity and multiplication by real
numbers. This is possible if data quality features are defined
in an adequate way. It is also desirable that data quality evaluation
be extended to data sets. In K-space this can be achieved in
several ways, by introducing the notions of (1) minimum guaranteed
and maximum possible data quality, (2) average data quality, and
(3) median data quality. In general, the systems for evaluating the
quality of single data items and of data sets are not identical.
For example, the notion of data set redundancy (an important
component of its quality evaluation) is not applicable to single
data items. Redundancy also plays different roles depending on
whether a data set is to be used for selecting specific data or as
a basis for statistical inference. Therefore, data set quality
depends on the user's point of view. On the other hand, the users'
and the database designers' points of view on data set quality are
not identical, the latter being intended to satisfy various
and divergent user requirements. The aim of this paper is
to present, in more detail, the data quality evaluation method
based on vector ordering in K-space.
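A minimal sketch of the idea, under our own illustrative assumption that each quality feature is scored on a [0, 1] scale: quality vectors are compared under the component-wise partial order, and set-level quality is summarized by the minimum guaranteed, maximum possible, average, and median vectors.

```python
# Quality vectors under the component-wise partial order of a semi-ordered
# linear space; the [0, 1] scoring and feature set are our assumptions.
import numpy as np

# Component order assumed here: relevance, actuality, credibility, accuracy, completeness.

def dominates(q1, q2):
    """True if q1 is at least as good as q2 in every quality feature."""
    return bool(np.all(np.asarray(q1) >= np.asarray(q2)))

def set_quality(quality_vectors):
    """Aggregate quality of a data set via the notions listed in the abstract."""
    q = np.asarray(quality_vectors)
    return {
        "minimum_guaranteed": q.min(axis=0),   # worst case per feature over the set
        "maximum_possible": q.max(axis=0),     # best case per feature over the set
        "average": q.mean(axis=0),
        "median": np.median(q, axis=0),
    }

if __name__ == "__main__":
    a = [0.9, 0.8, 0.7, 0.9, 0.6]
    b = [0.5, 0.8, 0.6, 0.7, 0.6]
    c = [0.7, 0.9, 0.5, 0.8, 0.9]
    print(dominates(a, b))    # True: a is at least as good in every feature
    print(dominates(a, c))    # False: a and c are incomparable in the partial order
    print(set_quality([a, b, c]))
```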
5.
Modeling the Earth's Subsurface Temperature Distribution From
a Stochastic Point of View
Kirti Srivastava, National Geophysical Research Institute, India
Stochastic modeling has played an important role in the quantification
of errors in various scientific investigations. In the quantification
of errors one looks for the first two moments, i.e., the mean and
variance of the system output due to errors in the input parameters.
Modeling a given physical system with the available information
and obtaining meaningful insight into its behavior is of vital
importance in any investigation. One such investigation in Earth
sciences is to understand crustal/lithospheric evolution
and temperature-controlled geological processes. For this, an
accurate estimation of the subsurface temperature field is essential.
The thermal structure of the Earth's crust is influenced by
its geothermal controlling parameters such as thermal conductivity,
radiogenic heat sources and initial and boundary conditions.
Modeling the subsurface temperature field is done using either
a deterministic or a stochastic approach. In the deterministic
approach the controlling parameters are assumed to be known with
certainty and the subsurface temperature field is obtained. However,
due to the inhomogeneous and anisotropic character of the Earth's
interior, some amount of uncertainty in the estimation of the
geothermal parameters is bound to exist. Uncertainties in these
parameters may arise from the inaccuracy of measurements or from
the lack of information available about them. Such uncertainties
in the parameters are incorporated in the stochastic approach, and
an average picture of the thermal field along with its associated
error bounds is obtained.
The quantification of uncertainty in the temperature field
is obtained using both random simulation and stochastic analytical
methods. The random simulation method is a numerical method
in which the uncertainties in the thermal field due to uncertainties
in the controlling thermal parameters are quantified. The stochastic
analytical method is generally solved using the small-perturbation
method, and closed-form analytical solutions for the first two
moments are obtained. The stochastic solution to the steady-state
heat conduction equation has been obtained for two different
conditions, i.e., when the heat sources are random and when the
thermal conductivity is random. Closed-form analytical expressions
for the mean and variance of the subsurface temperature distribution
and the heat flow have been obtained. This study has been applied
to understand the thermal state of a tectonically active region
in the Indian Shield.
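For the random simulation route only (the paper's closed-form perturbation results are not reproduced here, and all parameter values below are invented for illustration), a Monte Carlo sketch of one-dimensional steady-state conduction, d/dz(k dT/dz) = -A(z), with a random radiogenic heat source A, estimating the first two moments of the temperature field:

```python
# Monte Carlo quantification of temperature uncertainty due to a random heat
# source; illustrative parameter values, not those of the study.
import numpy as np

def solve_steady_state(k, A, dz, T_surface, q_basal):
    """Finite-difference solve of d/dz(k dT/dz) = -A(z) with a fixed surface
    temperature and prescribed basal heat flow (z measured downward)."""
    # heat flow at the top of each layer: basal flow plus heat produced below
    q = q_basal + np.cumsum(A[::-1])[::-1] * dz
    # temperature increases downward: dT/dz = q / k
    return T_surface + np.cumsum(q / k) * dz

rng = np.random.default_rng(0)
n_layers, dz = 100, 350.0                 # ~35 km of crust in 350 m steps
depth = np.arange(1, n_layers + 1) * dz
k = 2.5                                   # W/(m K), kept deterministic here
A_mean, A_std = 1.0e-6, 0.3e-6            # W/m^3, random radiogenic heat production

samples = []
for _ in range(2000):                     # Monte Carlo realizations
    A = rng.normal(A_mean, A_std, n_layers).clip(min=0.0)
    samples.append(solve_steady_state(k, A, dz, T_surface=300.0, q_basal=0.03))

samples = np.array(samples)
mean_T, std_T = samples.mean(axis=0), samples.std(axis=0)   # first two moments
print(f"T at {depth[-1] / 1000:.1f} km depth: {mean_T[-1]:.0f} K +/- {std_T[-1]:.0f} K")
```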
Track IV-B-4:
Emerging Concepts of Data-Information-Knowledge Sharing
Henri Dou, Université Aix Marseille III, Marseille,
France, and
Clément Paoli, Université de Marne-la-Vallée (UMLV), Champs-sur-Marne, France
In various academic or professional activities, the need to use
distributed Data, Information and Knowledge (D-I-K) features, either
as resources or in cooperative action, often becomes very critical.
It is not enough to limit oneself to interfacing existing resources
such as databases or management systems. In many instances, new
actions and information tools must be developed. These are often
critical aspects of some global changes required in existing
information systems.
The complexity of situations to be dealt with implies an increasing
demand for D-I-K attributes in large problems, such as environmental
studies or medical systems. Hard and soft data must be joined to
deal with situations where social, industrial, educational, and
financial considerations are all involved. Cooperative work already
calls for these intelligent knowledge management tools. Such changes
will certainly induce new methodologies in management, education,
and R&D.
This session will emphasize the conceptual level of an emerging
global methodology as well as the implementation level of working
tools for enabling D-I-K sharing in existing and future information
systems. Issues that might be examined in greater detail include:
- Systems to develop knowledge on a cooperative basis;
- Access to D-I-K in remote teaching systems, virtual laboratories, and financial aspects;
- Corporate universities (case studies will be welcomed): alternating teaching and industrial D-I-K confidentiality, innovation supported by information technology in educational systems, and data format interchange in SEWS (Strategic Early Warning Systems) applied to education and the usage of data;
- Ethics in distance learning; and
- Case studies on various experiments and the standardization of curricula.
1.
Data Integration and Knowledge Discovery in Biomedical Databases.
A Case Study
Arnold Mitnitski,
Department of Medicine, Dalhousie University, Halifax, Canada
Alexander Mogilner, Montreal, Canada
Chris MacKnight, Division of Geriatric Medicine, Dalhousie University,
Halifax, Canada
Kenneth Rockwood, Division of Geriatric Medicine, Dalhousie
University, Halifax, Canada.
Biomedical (epidemiological) databases generally contain information
about large numbers of individuals (health-related variables:
diseases, symptoms and signs, physiological and psychological
assessments, socio-economic variables, etc.). Many include information
about adverse outcomes (e.g., death), which makes it possible
to discover links between health outcomes and other variables
of interest (e.g., diseases, habits, function). Such databases
can also be linked with demographic surveys, which themselves
contain large amounts of data aggregated by age and sex, and
with genetic databases. While each of these databases is usually
created independently, for discrete purposes, the possibility
of integrating knowledge from several domains across databases
is of significant scientific and practical interest. One example
is discussed: linking a biomedical database (the National Population
Health Survey), containing more than 80,000 records on the Canadian
population in 1996-97 and 38 variables (disabilities, diseases,
health conditions), with mortality statistics obtained for Canadian
males and females. First, the problem of redundancy
in the variables is considered. Redundancy makes it possible
to derive a simple score as a generalized (macroscopic) variable
that reflects both individual and group health status.
This macroscopic variable reveals a simple exponential relation
with age, indicating that the process of accumulation of deficits
(damage) is a leading factor causing death. The age trajectory
of the statistical distribution of this variable also suggests
that redundancy exhaustion is a general mechanism, reflecting
different diseases. The relationship between generalized variables
and the hazard (mortality) rate reveals that the latter can
be expressed in terms of variables generally available from
any cross-sectional database. In practical terms, this means
that the risk of mortality might readily be assessed from standard
biomedical appraisals collected on other grounds. This finding
is an example of how knowledge from different data sources can
be integrated to common good ends. Additionally, Internet-related
technologies might provide ready means to facilitate interoperability
and data integration.
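A hedged sketch of the kind of computation involved, on synthetic data rather than the NPHS: a deficit-accumulation score is formed from binary health variables, and its exponential relation with age is checked by a log-linear fit.

```python
# Synthetic illustration of a generalized (macroscopic) deficit score and its
# exponential age trajectory; the cohort and prevalence model are invented.
import numpy as np

rng = np.random.default_rng(1)

# Synthetic cohort: age plus 38 binary deficit variables whose prevalence rises with age.
n, n_vars = 5000, 38
age = rng.uniform(20, 90, n)
prevalence = 0.02 * np.exp(0.03 * (age - 20))          # illustrative, not estimated from data
deficits = rng.random((n, n_vars)) < prevalence[:, None]

score = deficits.mean(axis=1)                          # generalized variable in [0, 1]

# Mean score per 5-year age group, then fit log(mean score) ~ intercept + slope * age.
bins = np.arange(20, 95, 5)
centers = bins[:-1] + 2.5
group_mean = np.array([score[(age >= lo) & (age < hi)].mean()
                       for lo, hi in zip(bins[:-1], bins[1:])])
slope, intercept = np.polyfit(centers, np.log(group_mean), 1)
print(f"mean deficit score ~ exp({intercept:.2f} + {slope:.3f} * age)")
```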
2.
A Framework for Semantic Context Representation of Multimedia
Resources
Weihong Huang, Yannick Prié, Pierre-Antoine Champin,
Alain Mille, LISI, Université Claude Bernard Lyon 1,
France
With the explosion of online multimedia resources, the demand for
intelligent content-based multimedia services is increasing rapidly.
One of the key challenges in this area is the representation of
semantic contextual knowledge about multimedia resources. Although
current image and video indexing techniques enable efficient
feature-based operation on multimedia resources, there still exists
a "semantic gap" between users and computer systems, which refers
to the lack of coincidence between the information that one
can extract from the visual data and the interpretation that
the same data has for a user in a given situation.
In this paper, we present a novel model, the annotation graph (AG),
for modeling and representing contextual knowledge about various
types of resources such as text, images, and audio-visual resources.
Based on the AG model, we attempt to build an annotation graph
framework that bridges the "semantic gap" by offering
universal, flexible knowledge creation, organization and retrieval
services to users. In this framework, users not only benefit
from semantic query and navigation services, but are also able
to contribute to knowledge creation via semantic annotation.
In the AG model, four types of concrete description elements
are designed for concrete descriptions in specific situations,
while two types of abstract description elements are designed
for knowledge reuse across situations. With these elements
and directed arcs between them, contextual knowledge at different
semantic levels can be represented through semantic annotation.
Within the global annotation graph constructed from all AGs, we
provide flexible semantic navigation using derivative graphs
(DGs) and AGs. DGs complement the contextual knowledge representation
of AGs by focusing on different types of description elements.
For semantic query, we present a potential graph (PG) tool
that helps users visualize query requests as PGs and execute queries
by performing sub-graph matching with the PGs. The prototype system
design and implementation aim at an integrated, user-centered
system for semantic contextual knowledge creation, organization
and retrieval.
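A deliberately simplified sketch of graph-based annotation and querying (our own construction, far coarser than the actual AG/DG/PG formalism): labelled nodes with directed, labelled arcs, and a naive backtracking matcher standing in for sub-graph matching against a query pattern.

```python
# Simplified annotation-graph sketch: labelled nodes, directed labelled arcs,
# and a naive backtracking matcher in the role of sub-graph matching.

class AnnotationGraph:
    def __init__(self):
        self.nodes = {}      # node id -> label (description element type)
        self.arcs = set()    # (source id, relation label, target id)

    def add_node(self, node_id, label):
        self.nodes[node_id] = label

    def add_arc(self, src, relation, dst):
        self.arcs.add((src, relation, dst))

    def match(self, pattern_nodes, pattern_arcs):
        """Bind pattern variables to node ids so that labels match and every
        pattern arc (var, relation, var) exists in the graph."""
        candidates = {v: [n for n, lab in self.nodes.items() if lab == want]
                      for v, want in pattern_nodes.items()}
        results = []

        def backtrack(remaining, binding):
            if not remaining:
                if all((binding[a], r, binding[b]) in self.arcs
                       for a, r, b in pattern_arcs):
                    results.append(dict(binding))
                return
            var = remaining[0]
            for node in candidates[var]:
                if node not in binding.values():      # keep bindings injective
                    binding[var] = node
                    backtrack(remaining[1:], binding)
                    del binding[var]

        backtrack(list(pattern_nodes), {})
        return results

if __name__ == "__main__":
    g = AnnotationGraph()
    g.add_node("n1", "person"); g.add_node("n2", "scene"); g.add_node("n3", "person")
    g.add_arc("n1", "appears_in", "n2"); g.add_arc("n3", "appears_in", "n2")
    # Query pattern: which persons appear in which scenes?
    print(g.match({"?p": "person", "?s": "scene"}, [("?p", "appears_in", "?s")]))
```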
3.
From Representing the Present to a Prospective Vision of the Future - From Technology Forecast to Technology Foresight
Henri Dou, CRRM, Université Aix Marseille III, Centre
Scientifique de Saint Jérôme, France
Jin Zhouiyng, Institute of Techno-Economics, Chinese Academy
of Social Science (CASS), China
Nowadays, the move from a technology forecast system to a technology
foresight system is inevitable, to prevent scientific development
from being oriented only vertically, to the detriment of its possible
repercussions (positive or negative) for society. In this paper,
the authors will address the methodological aspects of this transition,
as well as the different stages that have marked this evolution since
1930. The analyses carried out by various countries will be presented,
with an international overview of the actions currently under way in
this field.
The Technology Foresight concept will then be introduced into the
methodology of Technical or Economic Competitive Intelligence, in
order to give companies a vision of sustainable and ethical development
and to create new advantages.
The international implementation of the concept will be analysed, at
the European level (6th Framework Programme), at the level of the
Bologna Declaration (June 1999), and in the actions carried out in
Japan and China (China 2020).
4.
Setting Up a Dynamic and Interactive System for Managing the Activity and Knowledge of a Laboratory
Mylène Leitzelman, Intelligence Process SAS, France
Valérie Léveillé, Case 422, Centre Scientifique de Saint-Jérôme, France
Jacky Kister, UMR 6171 S.C.C, Faculté des Sciences et Techniques de St Jérôme, France
The aim is to set up, on an experimental basis and on behalf of UMR
6171, associated with the CRRM, an interconnected activity- and
knowledge-management system for managing the scientific activity of
a research unit. This system will be equipped with modules for
synthetic visualization, statistics and mapping, relying on data
mining and bibliometric methodologies. The key point of this system
will be to offer, at the same time, a tool for the strategic management
and organization of a laboratory and a tool enabling inter-laboratory
compilation, so as to turn it into an instrument of analysis or
strategy on a larger scale, with more or less open access so that
outside agents can use the data to generate indicators of performance,
valorization, quality of scientific output, and laboratory/industry
relations.
5.
The Ethical Dimension of the Pedagogical Relationship in Distance Learning
M. Lebreton, C. Riffaut, H. Dou, Faculté des sciences et techniques de Marseille Saint-Jérôme (CRRM), France
Teaching has always meant being placed in a relationship with someone
in order to teach them something. The link that unites the teacher
and the learner is knowledge. This forms an educational triangle [1]
whose sides constitute the pedagogical relationship(s). To activate
this structure, each actor must know his or her own motivations and
objectives with clarity and precision. Moreover, it seems obvious
that in order to transmit and acquire knowledge, the partners in the
learning process must share a certain number of common values, the
true cement of the educational act.
To the above-mentioned triangle corresponds an ethical triangle, at
each vertex of which one can place one of the educational missions:
to instruct, to socialize, and to qualify.
To instruct is above all to acquire knowledge. To socialize is above
all to acquire values. To qualify is to become integrated into a
productive organization.
These two triangles functioned for centuries, and the arrival of new
multimedia and communication technologies has broken down the rule
of the three unities: time, place and action [2]. This whole is now
cracking apart, giving birth to a new educational landscape in which
the classroom will no longer be the only place of training, in which
the transfer of knowledge can occur at any time and in any place,
and in which, finally, pedagogical action will be individualized and
individualizable.
In this new context, the pedagogical relationship in distance learning
will require new technical, intellectual, and social or ethical
competences.
To address these new challenges, it seems necessary first to ask how
ethics can help us understand how the fundamental mechanisms of
knowledge production have evolved and what changes have occurred in
the system of knowledge transfer, while also attending to the
adaptation and the necessary permanent updating of the educational
content that will henceforth be required.
Subsequently, ethical questioning must lead us to address the
consequences of the depersonalization of the learning relationship.
To this end, it seems appropriate to try to answer two fundamental
questions. The first concerns the teacher: is he or she still in
control of the socialization process, and does distance learning
still carry values and, if so, which ones? The second concerns the
learner: on the one hand, what becomes of his or her identity in the
digital and virtual universe, and on the other hand, what recourse
does he or she have in the face of the commodification of knowledge
and the monopolization of knowledge by informational empires?
Taken together, these ethical questions may allow us to begin sketching
solutions to problems that know no borders and are of formidable
complexity, in which the rational and the irrational, the material
and the immaterial, the personal and the impersonal now coexist, all
immersed in the digital, the foundation of virtuality.
1. J. Houssaye, Le triangle pédagogique, Berne, Ed. Peter Lang.
2. Report to the Prime Minister by Senator A. Gérard, 1997.