CODATA 2002: Frontiers of Scientific and Technical Data

Montréal, Canada — 29 September - 3 October
 

Data Science Abstracts

 


Track I-D-5:
Data Science

Chair: Jacques-Emile Dubois, ITODYS, Université de Paris VII - France and Past-President, CODATA

1. Quality Control of Data in Data-Sharing Practices and Regulations
Paul Wouters and Anne Beaulieu, Networked Research and Digital Information (Nerdi), NIWI-KNAW, The Royal Netherlands Academy of Arts and Sciences, The Netherlands

Scientific research is generating increasing amounts of data. Each year, more data is generated than in all previous years combined. At the same time, knowledge production is becoming more dependent on data sets. This puts the question of quality control of data center stage. How is the scientific system coping with the formidable task of controlling the quality of this flood of data? One area in which this question has not yet been fully explored is the domain of data-sharing practices and regulations. The need to share data among researchers and between researchers and the public has been put on the agenda at the level of science policy (Franken 2000), partly out of fear that the system might not be able to cope with the abundance of data. Data sharing is not only a technical issue, but a complex social process in which researchers have to balance different pressures and tensions.

Basically, two different modes of data sharing can be distinguished: peer-to-peer data sharing and repository-based data sharing. In the first mode, researchers communicate directly with each other. In the second mode, there is a distance between the supplier of data and the user, in which the rules of the specific data repository determine the conditions of data sharing. In both modes, the existence or lack of trust between the data supplier and the data user is crucial, though in different configurations. If data sharing becomes increasingly mediated by information and communication technologies, and hence less dependent on face-to-face communication, the generation of trust will have to be organised differently (Wouters and Beaulieu 2001). The same holds for forms of quality control of the data. How do researchers check for quality in peer-to-peer data sharing? And how have data repositories and archives taken care of the need for quality control of the data supplied? Which dimensions of social relationships seem to be crucial in data quality control? Which technical solutions have been embedded in this social process, and what role has been played by information and communication technologies?

This paper addresses these questions in a number of different scientific fields (among others functional brain imaging, high energy physics, astronomy, and molecular biology) because different scientific fields tend to display different configurations of these social processes.

References:

H. Franken (2000), “Conference Conclusions” in: Access to Publicly Financed Research, The Global Research Village III Conference, Conference Report (P. Schröder, ed.), NIWI-KNAW, Amsterdam.

Paul Wouters and Anne Beaulieu (2001), Trust Building and Data Sharing - an exploration of research practices, technologies and policies. Research Project Proposal, OECD/CSTP Working Group on Datasharing.

 

2. Distributed Oriented Massive Data Management: Progressive Algorithms and Data Structures
Rita Borgo, Visual Computing Group, Consiglio Nazionale delle Ricerche (C.N.R.), Italy
Valerio Pascucci, Lawrence Livermore National Laboratory (LLNL), USA


Projects dealing with massive amounts of data need to consider carefully all aspects of data acquisition, storage, retrieval and navigation. The recent growth in the size of large simulation datasets still surpasses the combined advances in hardware infrastructure and processing algorithms for scientific visualization. The cost of storing and visualizing such datasets is prohibitive, so that only one out of every hundred time-steps can actually be stored and visualized.

As a consequence, interactive visualization of results is becoming increasingly difficult, especially as a daily routine from a desktop. The high frequency of I/O operations starts to dominate the overall running time. The visualization stage of the modeling-simulation-analysis activity, still the ideal way for scientists to gain qualitative understanding of simulation results, then becomes the bottleneck of the entire process. In this panorama, the efficiency of a visualization algorithm must be evaluated in the context of end-to-end systems instead of being optimized individually. At the system level, the visualization process needs to be designed as a pipeline of modules that process data in stages, creating a flow of data that must itself be optimized globally with respect to the magnitude and location of available resources.

To address these issues we propose an elegant and simple-to-implement framework for performing out-of-core visualization and view-dependent refinement of large volume datasets. We adopt a method for view-dependent refinement that relies on longest-edge-bisection strategies, and we introduce a new method for extending the technique to volume visualization while keeping the simplicity of the technique itself untouched. Results in this field are applicable in parallel and distributed computing, ranging from clusters of PCs to more complex and expensive architectures. We present a new progressive visualization algorithm in which the input grid is traversed and organized in a hierarchical structure (from coarse to fine levels) and subsequent levels of detail are constructed and displayed to improve the output image. We decouple the data extraction from its display: the hierarchy is built by one process that traverses the input 3D mesh, while a second process performs the traversal and display. The scheme allows us to render partial results at any given time while the computation of the complete hierarchy makes progress. The regularity of the hierarchy permits a good data-partitioning scheme that allows us to balance processing time and data migration time while maintaining simplicity and memory/computing efficiency.
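The decoupling of hierarchy construction from display described above can be illustrated with a minimal sketch: one thread emits successively finer levels of detail while a consumer renders whatever levels are already available. This is only an illustration under simplifying assumptions (naive subsampling of a 1D array stands in for longest-edge bisection of a 3D grid); the function names are hypothetical and this is not the authors' implementation.

```python
# Illustrative sketch of progressive, coarse-to-fine refinement with the
# hierarchy producer decoupled from the display consumer (hypothetical names).
import queue
import threading
import time

def build_hierarchy(volume, max_level, out_queue):
    """Producer: emit successively finer levels of detail of `volume`."""
    for level in range(max_level + 1):
        step = 2 ** (max_level - level)   # coarse level -> large stride
        lod = volume[::step]              # naive subsampling stands in for
        out_queue.put((level, lod))       # longest-edge bisection of a 3D grid
    out_queue.put(None)                   # signal completion

def render_partial(out_queue):
    """Consumer: display partial results as soon as a level is available."""
    while True:
        item = out_queue.get()
        if item is None:
            break
        level, lod = item
        print(f"rendering level {level} with {len(lod)} samples")
        time.sleep(0.01)                  # stands in for actual drawing

if __name__ == "__main__":
    data = list(range(1024))              # stands in for a volume dataset
    q = queue.Queue()
    threading.Thread(target=build_hierarchy, args=(data, 4, q)).start()
    render_partial(q)
```

The point of the sketch is only the structure: rendering begins as soon as the coarsest level exists, and the image improves as finer levels arrive.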



3. Knowledge Management in Physicochemical Property Databases - Knowledge Recovery and Retrieval of NIST/TRC Source Data System
Qian Dong, Thermodynamics Research Center (TRC), National Institute of Standards and Technology (NIST), USA
Xinjian Yan, Robert D. Chirico, Randolph C. Wilhoit, Michael Frenkel

Knowledge management has become more and more important to physicochemical databases, which are generally characterized by their complexity in terms of chemical system identifiers, sets of property values, the relevant state variables, estimates of uncertainty, and a variety of other metadata. The need for automation of database operation, for assurance of high data quality, and for the availability and accessibility of data sources and knowledge is a driving force toward knowledge management in the scientific database field. Nevertheless, current relational database technology makes the construction and maintenance of database systems of this kind tedious and error-prone, and it provides less support than the development of physicochemical databases requires.

The NIST/TRC SOURCE data system is an extensive repository of experimental thermophysical and thermochemical properties and relevant measurement information reported in the world's scientific literature. It currently consists of nearly 2 million records for 30,000 chemicals, including pure compounds, mixtures, and reaction systems, which has already created both a need and an opportunity for establishing a knowledge infrastructure and intelligent supporting systems for the core database. Every major stage of database operation and management, such as data structure design, data entry preparation, effective data quality assurance, and intelligent retrieval, depends to a degree on substantial domain knowledge. Domain knowledge regarding characteristics of compounds and properties, measurement methods, sample purity, estimation of uncertainties, data range and conditions, and property data consistency is automatically captured and then represented within the database. Based upon this solid knowledge infrastructure, intelligent supporting systems are being built to assist (1) complex data entry preparation, (2) effective data quality assurance, (3) best data and model recommendation, and (4) knowledge retrieval.

In brief, the NIST/TRC SOURCE data system has a three-tier architecture. The first tier is a relational database management system; the second tier is the knowledge infrastructure; and the third is a set of intelligent supporting systems consisting of computing algorithms, methods, and tools that carry out particular tasks of database development and maintenance. The goal of the latter two tiers is to realize the intelligent management of scientific databases based on the relational model. The development of the knowledge infrastructure and the intelligent supporting systems is described in the presentation.
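As a toy illustration of how a record in the relational tier might carry the metadata listed above (chemical system, property value, state variable, uncertainty estimate) and how a rule in the knowledge tier could flag questionable entries, consider the sketch below. The field names and range rules are hypothetical and do not reflect the actual NIST/TRC SOURCE schema.

```python
# Toy sketch of a property record plus a rule-based quality check; field names
# and plausibility rules are hypothetical, not the NIST/TRC SOURCE schema.
from dataclasses import dataclass

@dataclass
class PropertyRecord:
    compound: str          # chemical system identifier
    prop: str              # e.g. "density"
    value: float           # reported property value
    uncertainty: float     # reported estimate of uncertainty
    temperature_k: float   # state variable at which the value was measured

# Domain knowledge encoded as simple range rules (illustrative only).
PLAUSIBLE_RANGE = {"density": (0.0, 25_000.0)}   # kg/m^3

def quality_flags(rec: PropertyRecord) -> list:
    """Return a list of issues found; an empty list means no rule fired."""
    flags = []
    lo, hi = PLAUSIBLE_RANGE.get(rec.prop, (float("-inf"), float("inf")))
    if not lo <= rec.value <= hi:
        flags.append("value outside plausible range")
    if rec.uncertainty <= 0 or rec.uncertainty > abs(rec.value):
        flags.append("suspicious uncertainty estimate")
    if rec.temperature_k <= 0:
        flags.append("non-physical temperature")
    return flags

print(quality_flags(PropertyRecord("benzene", "density", 876.5, 0.5, 298.15)))
```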


4. Multi-Aspect Evaluation of Data Quality in Scientific Databases
Juliusz L. Kulikowski, Institute of Biocybernetics and Biomedical Engineering c/o the Polish Academy of Sciences, Poland

The problem of data quality evaluation arises both when a database is being designed and when database users are going to use data for investigation, learning and/or decision making. However, it is not quite clear what it means, exactly, for the quality of some given data to be high, or even to be higher than that of some other data. Of course, this suggests that a data quality evaluation method is possible. If so, it should reflect the utility value of the data, but can it be based on a numerical quality scale? It was shown by the author (1982) that information utility value is a multi-component vector rather than a scalar. Its components should characterise such information features as relevance, actuality, credibility, accuracy, completeness, acceptability, etc. Therefore, data quality evaluation should be based on concepts of vector ordering. For this purpose Kantorovitsch's proposal of a semi-ordered linear space (K-space) can be used. In this case vector components should satisfy the general vector-algebra assumptions concerning additivity and multiplication by real numbers. This is possible if data quality features are defined in an adequate way. It is also desirable to extend data quality evaluation to data sets. In K-space this can be achieved in several ways, by introducing the notions of (1) minimum guaranteed and maximum possible data quality, (2) average data quality, and (3) median data quality. In general, the systems for evaluating the quality of single data items and of data sets are not identical.

For example, the notion of data set redundancy (an important component of quality evaluation) is not applicable to single data items. It also plays different roles depending on whether a data set is to be used for specific data selection or as a basis for statistical inference. Therefore, data set quality depends on the user's point of view. On the other hand, there is no identity between the points of view on data set quality of the users and of the database designers, the latter being intent on satisfying various and divergent user requirements. The aim of this paper is to present, in more detail, the data quality evaluation method based on vector ordering in K-space.
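The componentwise semi-ordering appealed to above can be illustrated with a small sketch: one quality vector dominates another only if it is at least as good in every component, so some pairs remain incomparable, and set-level notions such as minimum guaranteed and maximum possible quality become componentwise minima and maxima. The component names and values are illustrative, not the author's formalism.

```python
# Sketch of componentwise semi-ordering of data-quality vectors (illustrative
# component names); mirrors the idea that quality is a vector, not a scalar.
COMPONENTS = ("relevance", "actuality", "credibility", "accuracy", "completeness")

def dominates(q1, q2):
    """q1 >= q2 in the componentwise semi-order (not a total order)."""
    return all(a >= b for a, b in zip(q1, q2))

def min_guaranteed(qualities):
    """Componentwise minimum over a data set: quality guaranteed for all items."""
    return tuple(min(col) for col in zip(*qualities))

def max_possible(qualities):
    """Componentwise maximum: the best quality any item in the set attains."""
    return tuple(max(col) for col in zip(*qualities))

def average(qualities):
    """Componentwise mean quality over the data set."""
    return tuple(sum(col) / len(col) for col in zip(*qualities))

q_a = (0.9, 0.8, 0.7, 0.9, 0.6)
q_b = (0.8, 0.9, 0.6, 0.7, 0.8)
print(dominates(q_a, q_b), dominates(q_b, q_a))   # False False -> incomparable
print(min_guaranteed([q_a, q_b]), max_possible([q_a, q_b]))
```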


5. Modeling the Earth's Subsurface Temperature Distribution From a Stochastic Point of View
Kirti Srivastava, National Geophysical Research Institute, India

Stochastic modeling has played an important role in the quantification of errors in various scientific investigations. In quantifying errors one looks for the first two moments, i.e., the mean and variance of the system output due to errors in the input parameters. Modeling a given physical system with the available information and obtaining meaningful insight into its behavior is of vital importance in any investigation. One such investigation in the Earth sciences is to understand crustal/lithospheric evolution and temperature-controlled geological processes. For this, an accurate estimation of the subsurface temperature field is essential. The thermal structure of the Earth's crust is influenced by its controlling geothermal parameters, such as thermal conductivity, radiogenic heat sources, and initial and boundary conditions.

Modeling the subsurface temperature field is done using either a deterministic or a stochastic approach. In the deterministic approach the controlling parameters are assumed to be known with certainty, and the subsurface temperature field is obtained. However, due to the inhomogeneous and anisotropic character of the Earth's interior, some amount of uncertainty in the estimation of the geothermal parameters is bound to exist. Uncertainties in these parameters may arise from the inaccuracy of measurements or from a lack of information about them. Such uncertainties are incorporated in the stochastic approach, and an average picture of the thermal field along with its associated error bounds is obtained.

The quantification of uncertainty in the temperature field is obtained using both random simulation and stochastic analytical methods. The random simulation method is a numerical method in which the uncertainties in the thermal field due to uncertainties in the controlling thermal parameters are quantified. The stochastic analytical method is generally solved using the small-perturbation approach, and closed-form analytical solutions for the first two moments are obtained. The stochastic solution to the steady-state heat conduction equation has been obtained for two different conditions, i.e., when the heat sources are random and when the thermal conductivity is random. Closed-form analytical expressions for the mean and variance of the subsurface temperature distribution and the heat flow have been obtained. This study has been applied to understand the thermal state of a tectonically active region in the Indian Shield.
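The random simulation route can be illustrated with a short sketch: for 1D steady-state conduction with surface temperature T0, basal heat flow q_m, constant conductivity k and a uniform radiogenic heat source A over a layer of thickness L, the deterministic profile is T(z) = T0 + (q_m + A*L)*z/k - A*z^2/(2k); drawing A from a distribution and averaging the resulting profiles gives the mean and standard deviation of temperature with depth. The parameter values below are illustrative, not those of the Indian Shield study, and this is not the authors' closed-form solution.

```python
# Monte Carlo sketch: propagate a random radiogenic heat source A through the
# 1D steady-state conduction profile
#   T(z) = T0 + (q_m + A*L) * z / k - A * z**2 / (2 * k)
# and estimate the mean and standard deviation of temperature with depth.
import random
import statistics

T0, q_m, k, L = 25.0, 0.03, 2.5, 40_000.0   # degC, W/m^2, W/(m K), m
A_mean, A_std = 1.5e-6, 0.3e-6              # W/m^3: uncertain heat production
depths = [5_000.0, 15_000.0, 30_000.0]      # m

def temperature(z, A):
    """Deterministic steady-state profile for one realization of A."""
    return T0 + (q_m + A * L) * z / k - A * z**2 / (2 * k)

samples = {z: [] for z in depths}
for _ in range(10_000):
    A = random.gauss(A_mean, A_std)         # one realization of the random source
    for z in depths:
        samples[z].append(temperature(z, A))

for z in depths:
    mean = statistics.fmean(samples[z])
    sd = statistics.stdev(samples[z])
    print(f"z = {z/1000:5.1f} km : T = {mean:7.1f} +/- {sd:5.1f} degC")
```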


Track IV-B-4:
Emerging Concepts of Data-Information-Knowledge Sharing

Henri Dou, Université Aix Marseille III, Marseille, France, and
Clément Paoli, Université de Marne la Vallée (UMLV), Champs-sur-Marne, France

In various academic or professional activities the need to use distributed Data, Information and Knowledge (D-I-K) features, either as resources or in cooperative action, often becomes very critical. It is not enough to limit oneself to interfacing existing resources such as databases or management systems. In many instances, new actions and information tools must be developed. These are often critical aspects of some global changes required in existing information systems.

The complexity of situations to be dealt with implies an increasing demand for D-I-K attributes in large problems, such as environmental studies or medical systems. Hard and soft data must be joined to deal with situations where social, industrial, educational, and financial considerations are all involved. Cooperative work already calls for these intelligent knowledge management tools. Such changes will certainly induce new methodologies in management, education, and R&D.

This session will emphasize the conceptual level of the emerging global methodology as well as the implementation level of working tools for enabling D-I-K sharing in existing and future information systems. Issues that might be examined in greater detail include:

  • Systems to develop knowledge on a cooperative basis;
  • Access to D-I-K in remote teaching systems, virtual laboratories and financial aspects;
  • Corporate universities (case studies are welcome): alternating teaching and industrial D-I-K confidentiality, innovation supported by information technology in educational systems, and data format interchange in SEWS (Strategic Early Warning Systems) applied to education and the usage of data;
  • Ethics in distance learning; and
  • Case studies of various experiments and the standardization of curricula.

 

1. Data Integration and Knowledge Discovery in Biomedical Databases: A Case Study

Arnold Mitnitski, Department of Medicine, Dalhousie University, Halifax, Canada
Alexander Mogilner, Montreal, Canada
Chris MacKnight, Division of Geriatric Medicine, Dalhousie University, Halifax, Canada
Kenneth Rockwood, Division of Geriatric Medicine, Dalhousie University, Halifax, Canada.

Biomedical (epidemiological) databases generally contain information about large numbers of individuals (health-related variables: diseases, symptoms and signs, physiological and psychological assessments, socio-economic variables, etc.). Many include information about adverse outcomes (e.g., death), which makes it possible to discover links between health outcomes and other variables of interest (e.g., diseases, habits, function). Such databases can also be linked with demographic surveys, which themselves contain large amounts of data aggregated by age and sex, and with genetic databases. While each database is usually created independently for discrete purposes, the possibility of integrating knowledge from several domains across databases is of significant scientific and practical interest. One example is the linking of a biomedical database (the National Population Health Survey), containing more than 80,000 records of the Canadian population in 1996-97 and 38 variables (disabilities, diseases, health conditions), with mortality statistics for Canadian males and females. First, the problem of redundancy in the variables is considered. Redundancy makes it possible to derive a simple score as a generalized (macroscopic) variable that reflects both individual and group health status.

This macroscopic variable reveals a simple exponential relation with age, indicating that the process of accumulation of deficits (damage) is a leading factor causing death. The age trajectory of the statistical distribution of this variable also suggests that redundancy exhaustion is a general mechanism, reflecting different diseases. The relationship between generalized variables and the hazard (mortality) rate reveals that the latter can be expressed in terms of variables generally available from any cross-sectional database. In practical terms, this means that the risk of mortality might readily be assessed from standard biomedical appraisals collected on other grounds. This finding is an example of how knowledge from different data sources can be integrated to common good ends. Additionally, Internet related technologies might provide ready means to facilitate interoperability and data integration.
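The "simple score" described above is, in effect, a deficit count normalized by the number of variables considered. The sketch below builds such an index from synthetic binary deficit data and fits a log-linear model to check for the exponential rise with age that the abstract reports. The data and parameter values are synthetic (only the count of 38 variables echoes the abstract); this is not the authors' analysis.

```python
# Sketch of a "generalized variable": a deficit index computed as the fraction
# of recorded deficits per individual, then checked for an exponential trend
# with age via a log-linear least-squares fit. Data are synthetic.
import math
import random

random.seed(0)

def deficit_index(deficits):
    """Fraction of binary health deficits present for one individual."""
    return sum(deficits) / len(deficits)

# Synthetic cross-section: deficit probability itself grows with age.
ages = list(range(20, 90, 5))
mean_index_by_age = []
for age in ages:
    p = 0.02 * math.exp(0.03 * (age - 20))           # assumed ground truth
    people = [[1 if random.random() < p else 0 for _ in range(38)]
              for _ in range(500)]
    mean_index_by_age.append(sum(deficit_index(d) for d in people) / len(people))

# Log-linear least squares: log(index) ~ a + b*age  =>  index ~ exp(a) * e^(b*age)
xs, ys = ages, [math.log(m) for m in mean_index_by_age]
n = len(xs)
b = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / \
    (n * sum(x * x for x in xs) - sum(xs) ** 2)
a = (sum(ys) - b * sum(xs)) / n
print(f"fitted index ~ {math.exp(a):.4f} * exp({b:.4f} * age)")
```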

 

2. A Framework for Semantic Context Representation of Multimedia Resources
Weihong Huang, Yannick Prié, Pierre-Antoine Champin, Alain Mille, LISI, Université Claude Bernard Lyon 1, France

With the explosion of online multimedia resources, the demand for intelligent content-based multimedia services is increasing rapidly. One of the key challenges in this area is the representation of semantic contextual knowledge about multimedia resources. Although current image and video indexing techniques enable efficient feature-based operations on multimedia resources, there still exists a "semantic gap" between users and computer systems, which refers to the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data has for a user in a given situation.

In this paper, we present a novel model, the annotation graph (AG), for modeling and representing contextual knowledge about various types of resources such as text, image, and audio-visual resources. Based on the AG model, we attempt to build an annotation graph framework aimed at bridging the "semantic gap" by offering universal, flexible knowledge creation, organization and retrieval services to users. In this framework, users not only benefit from semantic query and navigation services, but can also contribute to knowledge creation via semantic annotation.

In the AG model, four types of concrete description elements are designed for concrete descriptions in specific situations, while two types of abstract description elements are designed for knowledge reuse across situations. With these elements and directed arcs between them, contextual knowledge at different semantic levels can be represented through semantic annotation. Within the global annotation graph constructed from all AGs, we provide flexible semantic navigation using derivative graphs (DGs) and AGs. DGs complement the contextual knowledge representation of AGs by focusing on different types of description elements. For semantic query, we present a potential graph (PG) tool that helps users visualize query requests as PGs and execute queries by performing sub-graph matching with the PGs. The prototype system design and implementation aim at an integrated, user-centered system for semantic contextual knowledge creation, organization and retrieval.
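A minimal sketch of the idea of matching a small query graph against an annotation graph by brute-force sub-graph matching is shown below. The node types, labels and relations are invented for illustration; they are not the AG/DG/PG formalism of the paper.

```python
# Minimal sketch of an annotation graph and a brute-force sub-graph match for a
# small query ("potential graph"); labels and element types are illustrative.
from itertools import permutations

# Annotation graph: nodes map id -> (type, label); edges are (src, dst, relation).
nodes = {
    "n1": ("video-segment", "interview"),
    "n2": ("person", "Alice"),
    "n3": ("concept", "data sharing"),
}
edges = {("n1", "n2", "shows"), ("n1", "n3", "about")}

# Query: find a segment that shows some person and is about "data sharing".
query_nodes = {"q1": ("video-segment", None), "q2": ("person", None),
               "q3": ("concept", "data sharing")}
query_edges = {("q1", "q2", "shows"), ("q1", "q3", "about")}

def node_matches(query_node, graph_node):
    """A query node matches on type, and on label only if the label is given."""
    qtype, qlabel = query_node
    ntype, nlabel = graph_node
    return qtype == ntype and (qlabel is None or qlabel == nlabel)

def match(query_nodes, query_edges, nodes, edges):
    """Try every assignment of query nodes to graph nodes (brute force)."""
    qids, nids = list(query_nodes), list(nodes)
    for perm in permutations(nids, len(qids)):
        mapping = dict(zip(qids, perm))
        if all(node_matches(query_nodes[q], nodes[mapping[q]]) for q in qids) and \
           all((mapping[a], mapping[b], r) in edges for a, b, r in query_edges):
            return mapping
    return None

print(match(query_nodes, query_edges, nodes, edges))
```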

 

3. From Representing the Present to a Prospective Vision of the Future - From Technology Forecast to Technology Foresight
Henri Dou, CRRM, Université Aix Marseille III, Centre Scientifique de Saint Jérôme, France
Jin Zhouiyng, Institute of Techno-Economics, Chinese Academy of Social Science (CASS), China

Nowadays, the shift from a technology forecast system to a technology foresight system is inevitable, so that scientific development is not oriented purely vertically to the detriment of its possible repercussions (positive or negative) for society. In this paper the authors will address the methodological aspects of this shift as well as the different stages that have marked this evolution since 1930. Analyses carried out by different countries will be presented, together with an international panorama of the actions currently under way in this field.

The Technology Foresight concept will then be introduced into the methodology of Competitive Technical and Economic Intelligence, in order to give companies a vision of sustainable and ethical development from which to create new advantages.

The international implementation of the concept will be analysed, both at the European level (6th Framework Programme) and at the level of the Bologna Declaration (June 1999), as well as the actions carried out in Japan and China (China 2020).

 

4. Setting Up a Dynamic and Interactive System for Managing a Laboratory's Activity and Knowledge
Mylène Leitzelman, Intelligence Process SAS, France
Valérie Léveillé, Case 422, Centre Scientifique de Saint-Jérôme, France
Jacky Kister, UMR 6171 S.C.C., Faculté des Sciences et Techniques de St Jérôme, France

The aim is to set up, on an experimental basis and on behalf of UMR 6171 in association with the CRRM, an interconnected activity and knowledge management system for managing the scientific activity of a research unit. The system will include modules for synthetic visualization, statistics and mapping, based on data mining and bibliometric methodologies. The key point of the system is to offer, at the same time, a tool for the strategic management and organization of a laboratory and a tool allowing inter-laboratory compilation, turning it into an instrument of analysis or strategy on a larger scale, with more or less open access so that external agents can use the data to generate indicators of performance, valorization, quality of scientific output, and laboratory/industry relations.

 

5. The Ethical Dimension of the Pedagogical Relationship in Distance Learning
M. Lebreton, C. Riffaut, H. Dou, Faculté des sciences et techniques de Marseille Saint-Jérôme (CRRM), France

Teaching has always meant being placed in a relationship with someone in order to teach them something. The link that unites the teacher and the learner is knowledge. This forms an educational triangle (1) whose sides constitute the pedagogical relationship(s).
To activate this structure, each actor must know his or her own motivations and objectives with clarity and precision. Moreover, it seems obvious that in order to transmit and acquire knowledge, the partners in the learning process must share a certain number of common values, the true cement of the educational act.

To the above-mentioned triangle corresponds an ethical triangle in which, at each vertex, one can place one of the educational missions: to instruct, to socialize, and to qualify.

To instruct is, above all, to acquire knowledge. To socialize is, above all, to acquire values. To qualify is to integrate into a productive organization.

These two triangles functioned for centuries, and the arrival of new multimedia and communication technologies has dismantled the rule of the three unities: time, place and action (2). This whole is now cracking apart, giving birth to a new educational landscape in which the classroom will no longer be the only place of learning, in which knowledge transfer can take place at any time and in any place, and in which, finally, pedagogical action will be individualized and individualizable.

In this new context, the pedagogical relationship in distance learning will require new technical, intellectual, and social or ethical skills.
To meet these new challenges, it first seems necessary to ask how ethics can help us understand how the fundamental mechanisms of knowledge production have evolved and what changes have occurred in the system of knowledge transfer, while also attending to the adaptation and the necessary, permanent updating of the educational content that will henceforth be required.

Ethical questioning must then lead us to address the consequences of the depersonalization of the learning relationship. To this end, it seems appropriate to try to answer two fundamental questions. The first concerns the teacher: is he or she still in charge of the socialization process, and, further, does distance learning still carry values and, if so, what are they?

The second concerns the learner: on the one hand, what becomes of his or her identity in the digital and virtual universe, and on the other, what salvation remains in the face of the commodification of knowledge and the monopolization of knowledge by informational empires?

Together, these ethical questions may allow us to begin to sketch out solutions to problems that are borderless and formidably complex, in which the rational and the irrational, the material and the immaterial, the personal and the impersonal now coexist, all immersed in the digital, the foundation of virtuality.

1. Le triangle pédagogique, J. Houssaye, Berne, Ed. Peter Lang
2. Rapport au Premier ministre du Sénateur A. Gérard, 1997

 

 
