Track I-D-5:
Data Science
Chair:
Jacques-Emile Dubois, ITODYS, Université de Paris VII, France, and Past-President, CODATA
1.
Quality Control of Data in Data-Sharing Practices and Regulations
Paul Wouters and Anne Beaulieu, Networked Research and Digital
Information (Nerdi), NIWI-KNAW, The Royal Netherlands Academy
of Arts and Sciences, The Netherlands
Scientific research is generating increasing amounts of data.
Overall, each year more data have been generated than in all previous years combined. At the same time, knowledge production is becoming more dependent on data sets. This puts the question of data quality control center stage. How is the scientific
system coping with the formidable task of controlling for the
quality of this flood of data? One area in which this question
has not yet been fully explored is the domain of data-sharing
practices and regulations. The need to share data among researchers
and between researchers and the public has been put on the agenda
at the level of science policy (Franken 2000), partly out of
fear that the system might not be able to cope with the abundance
of data. Data sharing is not only a technical issue, but a complex
social process in which researchers have to balance different
pressures and tensions.
Basically, two different modes of data sharing can be distinguished:
peer-to-peer forms of data sharing and repository-based data
sharing. In the first mode, researchers communicate directly
with each other. In the second mode, there is a distance between
the supplier of data and the user in which the rules of the
specific data repository determine the conditions of data sharing.
In both modes, the existence or lack of trust between the data
supplier and the data user is crucial, though in different configurations.
If data sharing becomes increasingly mediated by information
and communication technologies, and hence less dependent on
face to face communication, the generation of trust will have
to be organised differently (Wouters and Beaulieu 2001). The
same holds for forms of quality control of the data. How do researchers check data quality in peer-to-peer data sharing?
And how have data repositories and archives taken care of the
need for quality control of the data supplied? Which dimensions
of social relationships seem to be crucial in data quality control?
Which technical solutions have been embedded in this social
process and what role has been played by information and communication
technologies?
This paper addresses
these questions in a number of different scientific fields (among
others functional brain imaging, high energy physics, astronomy,
and molecular biology) because different scientific fields tend
to display different configurations of these social processes.
References:
H. Franken (2000), Conference Conclusions in: Access
to Publicly Financed Research, The Global Research Village III
Conference, Conference Report (P. Schröder, ed.), NIWI-KNAW,
Amsterdam.
Paul Wouters and Anne Beaulieu (2001), Trust Building and Data
Sharing - an exploration of research practices, technologies
and policies. Research Project Proposal, OECD/CSTP Working Group
on Datasharing.
2.
Distributed Oriented Massive Data Management: Progressive Algorithms
and Data Structures
Rita Borgo, Visual Computing Group, Consiglio Nazionale delle
Ricerche (C.N.R.), Italy
Valerio Pascucci, Lawrence Livermore National Laboratory (LLNL),
USA
Projects dealing with massive amounts of data need to carefully
consider all aspects of data acquisition, storage, retrieval
and navigation. The recent growth in size of large simulation
datasets still surpasses the combined advances in hardware infrastructure
and processing algorithms for scientific visualization. The
cost of storing and visualizing such datasets is prohibitive, so that only one out of every hundred time-steps can actually be stored and visualized.
As a consequence, interactive visualization of results is becoming increasingly difficult, especially as a daily routine from a desktop. The high frequency of I/O operations starts to dominate the overall running time. The visualization stage of the modeling-simulation-analysis activity, still the most effective way for scientists to gain qualitative understanding of simulation results, then becomes the bottleneck of the entire process. In this setting the efficiency of a visualization algorithm must be evaluated in the context of end-to-end systems instead of being optimized individually. At the system level, the visualization process needs to be designed as a pipeline of modules that process data in stages, creating a data flow that must itself be optimized globally with respect to the magnitude and location of available resources. To address these issues we propose an elegant and simple-to-implement framework for performing out-of-core visualization and view-dependent refinement of large volume datasets. We adopt a method for view-dependent refinement that relies on longest-edge-bisection strategies, introducing a new way of extending the technique to volume visualization while leaving the simplicity of the technique itself untouched. Results in this field are applicable in parallel and distributed computing, ranging from clusters of PCs to more complex and expensive architectures.
In our work we present a new progressive visualization algorithm in which the input grid is traversed and organized in a hierarchical structure (from coarse level to fine level) and subsequent levels of detail are constructed and displayed to improve the output image. We uncouple the data extraction from its display: the hierarchy is built by one process that traverses the input 3D mesh, while a second process performs the traversal and display. The scheme allows us to render partial results at any given time while the computation of the complete hierarchy makes progress. The regularity of the hierarchy allows the creation of a good data-partitioning scheme that lets us balance processing time and data-migration time while maintaining simplicity and memory/computing efficiency.
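As a rough sketch of the decoupling described above (illustrative Python with invented names, not the authors' implementation), one worker can emit coarse-to-fine levels of detail while the display loop renders whatever has been produced so far:

```python
# A minimal, self-contained sketch of the progressive, decoupled scheme:
# one worker builds coarse-to-fine levels of detail while the main loop
# displays whatever level has been produced so far. The toy "refinement"
# just subsamples a list per level; the real method refines a 3D mesh by
# longest-edge bisection and renders volumes instead of printing counts.
import queue
import threading
import time

def build_hierarchy(samples, max_level, out_q):
    """Producer: emit progressively finer approximations of the data."""
    for level in range(max_level + 1):
        step = 2 ** (max_level - level)          # coarser levels skip samples
        out_q.put((level, samples[::step]))
        time.sleep(0.05)                         # stand-in for real refinement work
    out_q.put(None)                              # completion marker

def display_loop(in_q):
    """Consumer: render partial results as soon as they arrive."""
    while (item := in_q.get()) is not None:
        level, approx = item
        print(f"level {level}: rendering {len(approx)} samples")

data = list(range(1024))                         # stand-in for a volume dataset
q = queue.Queue()
threading.Thread(target=build_hierarchy, args=(data, 5, q)).start()
display_loop(q)
```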
3. Knowledge Management in Physicochemical
Property Databases - Knowledge Recovery and Retrieval of NIST/TRC
Source Data System
Qian Dong, Thermodynamics Research Center (TRC), National Institute
of Standards and Technology (NIST), USA
Xinjian Yan, Robert D. Chirico, Randolph C. Wilhoit, Michael
Frenkel
Knowledge management has become more and more important to physicochemical
databases that are generally characterized by their complexity
in terms of chemical system identifiers, sets of property values,
the relevant state variables, estimates of uncertainty, and
a variety of other metadata. The need for automation of database
operation, for assurance of high data quality, and for the availability
and accessibility of data sources and knowledge is a driving
force toward knowledge management in the scientific database
field. Nevertheless, current relational database technology makes the construction and maintenance of such database systems tedious and error-prone, and it provides less support than the development of physicochemical databases requires.
The NIST/TRC SOURCE data system is an extensive repository system
of experimental thermophysical and thermochemical properties
and relevant measurement information that have been reported
in the world's scientific literature. It currently consists
of nearly 2 million records for 30,000 chemicals including pure
compounds, mixtures, and reaction systems, which have already
created both a need and an opportunity for establishing a knowledge
infrastructure and intelligent supporting systems for the core
database. Every major stage of database operations and management,
such as data structure design, data entry preparation, effective
data quality assurance, as well as intelligent retrieval systems,
depends to a degree on substantial domain knowledge. Domain knowledge regarding characteristics of compounds and properties, measurement methods, sample purity, estimation of uncertainties, data ranges and conditions, as well as property data consistency, is automatically captured and then represented within the database.
Based upon this solid knowledge infrastructure, intelligent
supporting systems are being built to assist (1) complex data
entry preparation, (2) effective data quality assurance, (3)
best data and model recommendation, and (4) knowledge retrieval.
In brief, the NIST/TRC SOURCE data system has a three-tier architecture. The first tier is a relational database management system, the second is the knowledge infrastructure, and the third comprises intelligent supporting systems consisting of computing algorithms, methods, and tools to carry out particular tasks of database development and maintenance. The goals of the latter two tiers are to realize intelligent management of scientific databases based on the relational model. The development
of knowledge infrastructure and intelligent supporting systems
is described in the presentation.
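To make the idea of knowledge-driven quality assurance concrete, the following is a hedged sketch (not the NIST/TRC implementation; the rules, bounds, and field names are invented for illustration) of how captured domain rules might flag suspect property records before they enter the relational store:

```python
# Illustrative sketch only: domain rules captured as data are applied to
# incoming property records for automated quality checking. All bounds,
# rule names, and fields here are hypothetical.
from dataclasses import dataclass

@dataclass
class PropertyRecord:
    compound: str
    prop: str            # e.g. "density"
    value: float
    uncertainty: float
    temperature_k: float

# Domain knowledge expressed as simple per-property bounds (hypothetical values).
DOMAIN_RULES = {
    "density":     {"min": 0.0, "max": 30000.0},   # kg/m^3
    "temperature": {"min": 0.0, "max": 6000.0},    # K
}

def quality_flags(rec: PropertyRecord) -> list[str]:
    """Return a list of rule violations for a single record."""
    flags = []
    rule = DOMAIN_RULES.get(rec.prop)
    if rule and not (rule["min"] <= rec.value <= rule["max"]):
        flags.append(f"{rec.prop} value {rec.value} outside physical bounds")
    if rec.uncertainty <= 0:
        flags.append("uncertainty must be positive")
    t = DOMAIN_RULES["temperature"]
    if not (t["min"] <= rec.temperature_k <= t["max"]):
        flags.append(f"state variable T={rec.temperature_k} K implausible")
    return flags

print(quality_flags(PropertyRecord("benzene", "density", 876.5, 0.5, 298.15)))
```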
4. Multi-Aspect Evaluation of Data
Quality in Scientific Databases
Juliusz L. Kulikowski, Institute of Biocybernetics and Biomedical
Engineering c/o the Polish Academy of Sciences, Poland
The problem of data quality evaluation arises both when a database is to be designed and when database customers are going to use data in investigations, learning, and/or decision making. However, it is not entirely clear what it means, exactly, for the quality of some given data to be high, or even to be higher than that of some other data. Of course, such statements suggest that a data quality evaluation method is possible. If so, it should reflect the data utility value, but can it be based on a numerical quality scale? It was shown by the author (1982) that information utility value is a multi-component vector rather than a scalar. Its components should characterise such information features as its relevance, actuality, credibility, accuracy, completeness, acceptability, etc. Therefore, data quality evaluation should be based on vector-ordering concepts. For this purpose Kantorovitsch's concept of a semi-ordered linear space (K-space) can be used. In this case vector components should satisfy the general vector-algebra assumptions concerning additivity and multiplication by real numbers. This is possible if data quality features are defined in an adequate way. It is also desirable that data quality evaluation be extended to data sets. In K-space this can be achieved in several ways, by introducing the notions of (1) minimum guaranteed and maximum possible data quality, (2) average data quality, and (3) median data quality. In general, the systems for evaluating the quality of single data items and of data sets are not identical. For example, the notion of data set redundancy (an important component of its quality evaluation) is not applicable to single data items. Redundancy also plays different roles if a data set is to be used for specific data selection and if it is taken as a basis for statistical inference. Therefore, data set quality depends on the user's point of view. On the other hand, the users' and the database designers' points of view on data set quality do not coincide, the latter being intended to satisfy various and divergent user requirements. The aim of this paper is to present, in more detail, the data quality evaluation method based on vector ordering in K-space.
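As a hedged illustration of the vector-ordering idea (the notation below is assumed for this summary, not taken from the paper), data quality can be written as a vector of feature scores compared component-wise, with set-level bounds taken component by component:

```latex
% Illustrative notation only (not the author's): quality as a vector of
% feature scores (relevance, actuality, credibility, accuracy, completeness),
% compared by the component-wise (semi-)order of a K-space.
\[
  q(d) = \bigl( q_{\mathrm{rel}}(d),\, q_{\mathrm{act}}(d),\, q_{\mathrm{cred}}(d),\,
                q_{\mathrm{acc}}(d),\, q_{\mathrm{compl}}(d) \bigr) \in \mathbb{R}^{5},
\]
\[
  q(d_1) \preceq q(d_2) \iff q_i(d_1) \le q_i(d_2) \ \text{for all } i,
\]
% and, for a data set D, the minimum guaranteed and maximum possible
% quality are the component-wise bounds
\[
  q_{\min}(D) = \Bigl( \min_{d \in D} q_i(d) \Bigr)_{i}, \qquad
  q_{\max}(D) = \Bigl( \max_{d \in D} q_i(d) \Bigr)_{i}.
\]
```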
5.
Modeling the Earth's Subsurface Temperature Distribution From
a Stochastic Point of View
Kirti Srivastava, National Geophysical Research Institute, India
Stochastic modeling has played an important role in the quantification
of errors in various scientific investigations. In the quantification
of errors one looks for the first two moments, i.e., the mean and variance of the system output due to errors in the input parameters.
Modeling a given physical system with the available information
and obtaining meaningful insight into its behavior is of vital
importance in any investigation. One such investigation in Earth
sciences is to understand the crustal/lithospheric evolution
and temperature controlled geological processes. For this an
accurate estimation of the subsurface temperature field is essential.
The thermal structure of the Earth's crust is influenced by
its geothermal controlling parameters such as thermal conductivity,
radiogenic heat sources and initial and boundary conditions.
Modeling the subsurface temperature field is done using either a deterministic or a stochastic approach. In the deterministic approach the controlling parameters are assumed to be known with certainty and the subsurface temperature field is obtained. However, due to the inhomogeneous and anisotropic character of the Earth's interior, some amount of uncertainty in the estimation of the geothermal parameters is bound to exist. Uncertainties
in these parameters may arise from the inaccuracy of measurements
or lack of information available on them. Such uncertainties
in parameters are incorporated in the stochastic approach and
an average picture of the thermal field along with its associated
error bounds is obtained.
The quantification of uncertainty in the temperature field
is obtained using both random simulation and stochastic analytical
methods. The random simulation method is a numerical method
in which the uncertainties in the thermal field due to uncertainties
in the controlling thermal parameters are quantified. The stochastic
analytical method is generally solved using the small perturbation
method and closed form analytical solutions to the first two
moments are obtained. The stochastic solution to the steady
state heat conduction equation has been obtained for two different conditions, i.e., when the heat sources are random and when the
thermal conductivity is random. Closed form analytical expressions
for mean and variance of the subsurface temperature distribution
and the heat flow have been obtained. This study has been applied
to understand the thermal state in a tectonically active region
in the Indian Shield.
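A minimal sketch of the kind of formulation described, assuming a one-dimensional, constant-conductivity simplification with a random heat source (the notation is this summary's, not the author's):

```latex
% Illustrative only: a one-dimensional, constant-conductivity version of the
% stochastic steady-state conduction problem with a random heat source.
\[
  K \frac{d^{2}T(z)}{dz^{2}} + A(z) = 0 , \qquad
  A(z) = \bar{A}(z) + A'(z) , \qquad \langle A'(z) \rangle = 0 ,
  \qquad 0 \le z \le L .
\]
% Writing T = <T> + T' and averaging gives the mean temperature field
\[
  K \frac{d^{2}\langle T \rangle}{dz^{2}} + \bar{A}(z) = 0 ,
\]
% while the first-order perturbation T' is driven by A', so the temperature
% variance follows from the source covariance via the Green's function G of
% the conduction operator for the given boundary conditions:
\[
  \sigma_{T}^{2}(z) = \langle T'(z)^{2} \rangle
  = \frac{1}{K^{2}} \int_{0}^{L}\!\!\int_{0}^{L}
    G(z,\xi)\, G(z,\eta)\, \langle A'(\xi)\, A'(\eta) \rangle \, d\xi \, d\eta .
\]
```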
Track IV-B-4:
Emerging Concepts of Data-Information-Knowledge Sharing
Henri Dou, Université Aix Marseille III, Marseille,
France, and
Clément Paoli, Université of Marne la Vallée
UMLV, Champ sur Marne, France
In various
academic or professional activities the need to use
distributed Data, Information and Knowledge (D-I-K)
features, either as resources or in cooperative action,
often becomes very critical. It is not enough to limit
oneself to interfacing existing resources such as
databases or management systems. In many instances,
new actions and information tools must be developed.
These are often critical aspects of some global changes
required in existing information systems.
The complexity
of situations to be dealt with implies an increasing
demand for D-I-K attributes in large problems, such
as environmental studies or medical systems. Hard
and soft data must be joined to deal with situations
where social, industrial, educational, and financial
considerations are all involved. Cooperative work
already calls for these intelligent knowledge management
tools. Such changes will certainly induce new methodologies
in management, education, and R&D.
This session will emphasize the conceptual level of an emerging global methodology as well as the implementation level of working tools for enabling D-I-K sharing in existing and future information systems. Issues that might be examined in greater detail include:
- Systems to develop knowledge on a cooperative basis;
- Access to D-I-K in remote teaching systems, virtual laboratories, and financial aspects;
- Corporate universities (case studies will be welcomed): alternating teaching and industrial D-I-K, confidentiality, innovation supported by information technology in educational systems, and data format interchange in SEWS (Strategic Early Warning Systems) applied to education and the usage of data;
- Ethics in distance learning; and
- Case studies on various experiments and standardization of curriculum.
1.
Data Integration and Knowledge Discovery in Biomedical Databases.
A Case Study
Arnold Mitnitski,
Department of Medicine, Dalhousie University, Halifax, Canada
Alexander Mogilner, Montreal, Canada
Chris MacKnight, Division of Geriatric Medicine, Dalhousie University,
Halifax, Canada
Kenneth Rockwood, Division of Geriatric Medicine, Dalhousie
University, Halifax, Canada.
Biomedical (epidemiological) databases generally contain information about large numbers of individuals (health-related variables: diseases, symptoms and signs, physiological and psychological assessments, socio-economic variables, etc.). Many include information about adverse outcomes (e.g., death), which makes it possible to discover links between health outcomes and other variables of interest (e.g., diseases, habits, function). Such databases can also be linked with demographic surveys, which themselves contain large amounts of data aggregated by age and sex, and with genetic databases. While each of these databases is usually created independently and for a discrete purpose, the possibility of integrating knowledge from several domains across databases is of significant scientific and practical interest. One example is discussed: linking a biomedical database (the National Population Health Survey), containing more than 80,000 records on the Canadian population in 1996-97 and 38 variables (disabilities, diseases, health conditions), with mortality statistics obtained for Canadian males and females. First, the problem of redundancy
in the variables is considered. Redundancy makes it possible
to derive a simple score as a generalized (macroscopic) variable
that reflects both individual and group health status.
This macroscopic variable reveals a simple exponential relation
with age, indicating that the process of accumulation of deficits
(damage) is a leading factor causing death. The age trajectory
of the statistical distribution of this variable also suggests
that redundancy exhaustion is a general mechanism, reflecting
different diseases. The relationship between generalized variables
and the hazard (mortality) rate reveals that the latter can
be expressed in terms of variables generally available from
any cross-sectional database. In practical terms, this means
that the risk of mortality might readily be assessed from standard
biomedical appraisals collected on other grounds. This finding
is an example of how knowledge from different data sources can
be integrated to common good ends. Additionally, Internet related
technologies might provide ready means to facilitate interoperability
and data integration.
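As a hedged illustration of the kind of macroscopic variable described (the notation is assumed here, not taken from the authors), a deficit score over n binary health variables and its reported exponential dependence on age might be written as:

```latex
% Illustrative notation (assumed, not the authors'): a generalized deficit
% score over n binary health variables for individual i, and the reported
% exponential dependence of its mean on age.
\[
  f_i = \frac{1}{n} \sum_{k=1}^{n} x_{ik}, \qquad x_{ik} \in \{0, 1\},
\]
\[
  \bar{f}(\mathrm{age}) \approx f_{0}\, e^{\, b \,\cdot\, \mathrm{age}},
\]
% so that a hazard (mortality) rate modeled as a function of the score,
% mu = mu(f), can be estimated from cross-sectional data alone.
```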
2.
A Framework for Semantic Context Representation of Multimedia
Resources
Weihong Huang , Yannick Prié , Pierre-Antoine Champin,
Alain Mille, LISI, Université Claude Bernard Lyon 1,
France
With the explosion of online multimedia resources, the requirement for intelligent content-based multimedia services is increasing rapidly.
One of the key challenges in this area is semantic contextual
knowledge representation of multimedia resources. Although current
image and video indexing techniques enable efficient feature-based
operation on multimedia resources, there still exists a "semantic
gap" between users and the computer systems, which refers
to the lack of coincidence between the information that one
can extract from the visual data and the interpretation that
the same data has for a user in a given situation.
In this paper, we present a novel model, the annotation graph (AG), for modeling and representing contextual knowledge of various types of resources such as text, images, and audio-visual material.
Based on the AG model, we attempt to build an annotation graph
framework towards bridging the "semantic gap" by offering
universal flexible knowledge creation, organization and retrieval
services to users. In this framework, users will not only benefit
from semantic query and navigation services, but also be able
to contribute in knowledge creation via semantic annotation.
In the AG model, four types of concrete description elements are designed for concrete descriptions in specific situations, while two types of abstract description elements are designed for knowledge reuse in different situations. With these elements and directed arcs between them, contextual knowledge at different semantic levels can be represented through semantic annotation. Within the global annotation graph constructed by all AGs, we provide flexible semantic navigation using derivative graphs (DGs) and AGs. DGs complement the contextual knowledge representation of AGs by focusing on different types of description elements.
Towards semantic query, we present a potential graph (PG) tool
to help users visualize query requests as PGs, and execute queries
by performing sub-graph matching with PGs. Prototype system
design and implementation aim at an integrated user-centered
semantic contextual knowledge creation, organization and retrieval
system.
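The following is a minimal Python sketch of an annotation-graph-like structure (the element kinds and the naive label-based matching are assumptions made for illustration, not the AG model's actual specification):

```python
# Minimal sketch of an annotation-graph-like structure: typed description
# elements connected by directed arcs, with a naive query check standing in
# for potential-graph matching. Names and types are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Element:
    id: str
    kind: str          # e.g. "concrete" or "abstract" description element
    label: str

@dataclass
class AnnotationGraph:
    elements: dict = field(default_factory=dict)   # id -> Element
    arcs: set = field(default_factory=set)         # directed (src_id, dst_id) pairs

    def add(self, elem: Element):
        self.elements[elem.id] = elem

    def link(self, src: str, dst: str):
        self.arcs.add((src, dst))

    def matches(self, wanted_labels: set[str]) -> bool:
        """Naive query: does the graph contain elements carrying all
        requested labels?"""
        have = {e.label for e in self.elements.values()}
        return wanted_labels <= have

ag = AnnotationGraph()
ag.add(Element("e1", "concrete", "video-shot"))
ag.add(Element("e2", "abstract", "interview"))
ag.link("e1", "e2")
print(ag.matches({"interview"}))   # True
```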
3.
From Representing the Present to a Prospective Vision of the Future: From Technology Forecast to Technology Foresight
Henri Dou, CRRM, Université Aix Marseille III, Centre
Scientifique de Saint Jérôme, France
Jin Zhouiyng, Institute of Techno-Economics, Chinese Academy
of Social Science (CASS), China
Nowadays, the shift from the technology forecast system to the technology foresight system is inevitable if scientific development is not to be oriented only vertically, to the detriment of its possible impacts (positive or negative) on society. In this paper the authors address the methodological aspects of this transition as well as the various stages that have marked this evolution since 1930. Analyses carried out by different countries are presented, together with an international panorama of the actions under way in this field.
The Technology Foresight concept is then introduced into the methodology of technical or economic competitive intelligence in order to give enterprises a vision of sustainable and ethical development and so create new advantages.
The international implementation of the concept, at the European level (6th Framework Programme), in the Bologna declaration (June 1999), and in actions carried out in Japan and China (China 2020), is analyzed.
4.
Setting Up a Dynamic and Interactive System for Managing a Laboratory's Activity and Knowledge
Mylène Leitzelman, Intelligence Process SAS, France
Valérie Léveillé, Case 422, Centre Scientifique de Saint-Jérôme, France
Jacky Kister, UMR 6171 S.C.C, Faculté des Sciences et Techniques de St Jérôme, France
The aim is to set up, on an experimental basis and on behalf of UMR 6171, which is associated with the CRRM, an interconnected activity- and knowledge-management system for managing the scientific activity of a research unit. The system will include modules for synthetic, statistical, and cartographic visualization based on data-mining and bibliometric methodologies. Its key feature will be to offer, at the same time, a tool for the strategic management and organization of a laboratory and a tool allowing inter-laboratory compilation, so that it becomes an instrument of analysis or strategy at a larger scale, with more or less open access so that external agents can use the data to generate indicators of performance, valorization, quality of scientific output, and laboratory/industry relations.
5.
The Ethical Dimension of the Pedagogical Relationship in Distance Learning
M. Lebreton, C. Riffaut, H. Dou, Faculté des sciences
et techniques de Marseille Saint-Jérôme (CRRM),
France
Teaching has always meant being put in relation with someone in order to teach them something. The link that unites the teacher and the learner is knowledge. This forms an educational triangle (1) whose sides constitute the pedagogical relationship(s). For this structure to work, each actor must know his or her own motivations and objectives clearly and precisely. Moreover, it seems evident that in order to transmit and acquire knowledge, the partners in the learning process must share a certain number of common values, the true cement of the educational act.
To the above-mentioned triangle corresponds an ethical triangle, at each vertex of which one can place one of the educational missions: to instruct, to socialize, and to qualify.
To instruct is above all to acquire knowledge. To socialize is above all to acquire values. To qualify is to integrate into a productive organization.
These two triangles functioned for centuries, but the arrival of new multimedia and communication technologies has broken down the rule of the three unities: time, place, and action (2). This whole edifice is cracking apart, giving birth to a new educational landscape in which the classroom will no longer be the only place of training, in which the transfer of knowledge can take place at any time and in any place, and in which, finally, pedagogical action will be individualized and individualizable.
In this new context, the pedagogical relationship in distance learning will require new technical, intellectual, and social or ethical competences.
To address these new challenges, it seems necessary first to ask how ethics can help us understand how the fundamental mechanisms of knowledge production have evolved and what changes have occurred in the system of knowledge transfer, while also attending to the adaptation and the necessary permanent updating of the educational content that will henceforth be required.
Ethical questioning must then lead us to address the consequences of the depersonalization of the learning relationship. To this end, it seems appropriate to seek answers to two fundamental questions. The first concerns the teacher: is he or she still master of the socialization process, and does distance learning still carry values and, if so, which ones? The second concerns the learner: on the one hand, what becomes of his or her identity in the digital and virtual universe, and on the other, what recourse does he or she have in the face of the commodification of knowledge and the appropriation of knowledge by informational empires?
Taken together, these ethical questions can help us begin to sketch solutions to problems that know no borders and are of formidable complexity, in which the rational and the irrational, the material and the immaterial, the personal and the impersonal now coexist, all immersed in the digital, the foundation of virtuality.
1. Le triangle pédagogique, J. Houssaye, Berne, Ed.
Peter Lang
2. Rapport au Premier ministre du Sénateur A. Gérard,
1997
Track I-D-4:
The Public Domain in Scientific and Technical Data: A
Review of Recent Initiatives and Emerging Issues
Chair: Paul F. Uhlir,
The National Academies, USA
The body of
scientific and technical data and other information in
the public domain is massive and has contributed broadly
to the scientific, economic, social, cultural, and intellectual
vibrancy of the entire world. The "public domain"
may be defined in legal terms as sources and types of
data and information whose uses are not restricted by
statutory intellectual property regimes and that are accordingly
available to the public without authorization. In recent
years, however, there have been growing legal, economic,
and technological pressures on public-domain information, scientific and otherwise, forcing a reevaluation of the role and value of the public domain. Despite these pressures, some well-established
mechanisms for preserving the public domain in scientific
data exist in the government, university, and not-for-profit
sectors. In addition, very innovative models for promoting
various public-domain digital information resources are
now being developed by different groups in the scientific,
library, and legal communities. This session will review
some of the recent initiatives for preserving and promoting
the public domain in scientific data within CODATA and
ICSU, the US National Academies, OECD, UNESCO, and other
organizations, and will highlight some of the most important
emerging issues in this context.
1.
International Access to Data and Information
Ferris Webster, University of Delaware, USA
Access to data and
information for research and education is the principal concern
of the ICSU/CODATA ad hoc Group on Data and Information. The
Group tracks developments by intergovernmental organizations
with influence over data property rights. Where possible, the
Group works to assure that the policies of these organizations
recognize the public good to be derived by assuring access to
data and information for research and education.
A number of international
organizations have merited attention recently. New proprietary
data rights threaten to close off access to data and information
that could be vital for progress in research. The European Community
has been carrying out a review of its Database Directive. The
World Meteorological Organization's resolution on international
exchange of meteorological data has been the subject of continuing
debate. The Intergovernmental Oceanographic Commission is drafting
a new data policy that may have constraints that are parallel
to those of the WMO. The World Intellectual Property Organization
has had a potential treaty on databases simmering for several
years.
The latest developments
in these organizations will be reviewed, along with the activities
of the ICSU/CODATA Group.
2.
The OECD Follow up Group on Issues of Access to Publicly Funded
Research Data: A Summary of the Interim Report
Peter Arzberger, University of California at San Diego, USA
This talk will present
a summary of the interim report of the OECD Follow up Group
on Issues of Access to Publicly Funded Research Data. The Group's
efforts have origins in the 3rd Global Research Village conference
in Amsterdam, December 2000. In particular, it will include
issues of global sharing of research data. The Group has conducted
case studies of practices across different communities, and
looked at factors such as sociological, economic, technological
and legal issues that either enhance or inhibit data sharing.
The presentation will also address issues such as data ownership
and rights of disposal, multiple uses of data, the use of ICT
for widening the scale and scope of data-sharing, effects of
data-sharing on the research process, and co-ordination in data
management. The ultimate goal of the Group is to articulate principles, based on best practices, that can be translated into the science policy arena. Some initial principles will
be discussed. Questions such as the following will be addressed:
- What principles should govern science policy in this area?
- What is the perspective of social informatics in this field?
- What role does the scientific community play in this?
It is intended that
this presentation will generate discussion and feedback on key
points of the Group's interim report.
3.
An Overview of Draft UNESCO Policy Guidelines for the Development
and Promotion of Public-Domain Information
John B. Rose, UNESCO, Paris, France
Paul F. Uhlir, The National
Academies, Washington, DC, USA
A significantly underappreciated, but essential, element of
the information revolution and emerging knowledge society is
the vast amount of information in the public domain. Whereas
the focus of most policy analyses and law making is almost exclusively
on the enhanced protection of private, proprietary information,
the role of public-domain information, especially of information
produced by the public sector, is seldom addressed and generally
poorly understood.
The purpose of UNESCO's Policy Guidelines for the Development
and Promotion of Public-Domain Information, therefore, is to
help develop and promote information in the public domain at
the national level, with particular attention to information
in digital form. These Policy Guidelines are intended to better
define public-domain information and to describe its role and
importance, specifically in the context of developing countries;
to suggest principles that can help guide the development of
policy, infrastructure and services for provision of government
information to the public; to assist in fostering the production,
archiving and dissemination of an electronic public domain of
information for development, with emphasis on ensuring multicultural,
multilingual content; and to help promote access of all citizens,
especially including disadvantaged communities, to information
required for individual and social development. This presentation
will review the main elements of the draft Policy Guidelines,
with particular focus on scientific data and information in
the public domain.
Complementary to, but distinct from, the public domain is the wider range of information and data that could be made available by rights holders under specific "open access"
conditions, as in the case of open source software, and the
free availability of protected information for certain specific
purposes, such as education and science under limitations and
exceptions to copyright (e.g., "fair use" in U.S.
law). UNESCO is working to promote international consensus on
the role of these facilities in the digital age, notably through
a recommendation under development on the "Promotion and
Use of Multilingualism and Universal Access to Cyberspace,"
which is intended to be presented to the World Summit on the
Information Society to be organized in Geneva (2003) and Tunis
(2005), as well as a number of other relevant programme actions
which will also be presented at the Summit.
4.
Emerging Models for Maintaining the Public Commons
in Scientific Data
Harlan Onsrud, University of Maine, USA
Scientists need full and open disclosure and the ability to
critique in detail the methods, data, and results of their peers.
Yet scientific publications and data sets are burdened increasingly
by access restrictions imposed by legislative acts and case
law that are detrimental to the advancement of science. As a
result, scientists and legal scholars are exploring combined
technological and legal workarounds that will allow scientists
to continue to adhere to the mores of science without being
declared as lawbreakers. This presentation reviews three separate
models that might be used for preserving and expanding the public
domain in scientific data. Explored are the technological and
legal underpinnings of Research Index, the Creative Commons
Project and the Public Commons for Geographic Data Project.
The first project relies heavily on protections granted to web
crawlers under the U.S. Digital Millennium Copyright Act while
the latter two rely on legal approaches utilizing open access
licenses.
5.
Progress, Challenges, and Opportunities for Public-Domain S&T
Data Policy Reform in China
Liu Chuang, Chinese Academy of Sciences, Beijing, China
China has experienced four different stages for public-domain
S&T data management and policy during the last quarter century.
Before 1980, most of the government funded S&T data were
free to be accessed, and the services received a good reputation
from the scientific community. Most of these data were recorded
on paper media, however, and took time to be accessed.
With the computer developments in the early 1980s, digital
data and databases increased rapidly. The data producers and
holders began to realize that digital data could be an important resource for scientific activities. The policy of charging fees for data access gained prominence between the early 1980s and approximately 1993. During this period, China experienced new problems in S&T data management. For example, there was an increase in parallel database development work and in data controlled by individuals, with a high risk of losing the data, and access to data became very expensive in most cases.
In the 1994-2000 period, members of the scientific community
asked for data policy reform, and for lower costs of access
to government funded databases for non-profit applications.
The Ministry of Science and Technology (MOST) set up a group
to investigate China's S&T data sharing policies and practices.
A new program for S&T data sharing was initiated by MOST
in 2001. This was a major milestone for enhancing access to
and the application of public-domain S&T data. This new
program, along with the current development of a new data access
policy and support system, is expected to be greatly expanded
during next decade.
Track IV-A-4:
Confidentiality Preservation Techniques in the Behavioral,
Medical and Social Sciences
D. Johnson, Building Engineering and Science Talent,
San Diego, CA, USA
John L. Horn, Department of Psychology, University of
Southern California, USA
Julie Kaneshiro, National Institutes of Health, USA
Kurt Pawlik, Psychologisches Institut
I, Universität Hamburg, Germany
Michel Sabourin, Université
de Montréal, Canada
In
the behavioral and social sciences and in medicine, the
movement to place data
in electronic databases is hampered by considerations
of confidentiality. The data collected on individuals
by scientists in these areas of research are often highly
personal. In fact, it is often necessary to guarantee
potential research participants that the data collected
on them will be held in strictest confidence and that
their privacy will be protected. There has even been debate
in these sciences about whether data collected under a
formal confidentiality agreement can be placed in a database,
because such use might constitute a use of the data to
which the research participants did not consent.
The members of this panel will discuss a broad range of
techniques that are being used across the behavioral and
social sciences and medicine to protect the confidentiality
of individuals whose data are entered into an electronically
accessible database. Among the highly controversial data
to which these techniques are being applied are data on
accident avoidance by pilots of commercial aircraft and
data on medical errors. The stakes in finding ways to
use these data without violating confidentiality are high,
since the payoff from learning how to reduce airplane
accidents and medical mistakes is saved lives.
Standard techniques for separating identifier information
from data, as well as less common techniques such as the
introduction of systematic error in data, will be discussed.
Despite the methods that are in place and those that are
being experimented with, there is evidence that even sophisticated
protection techniques may not be enough. The group will
conclude its session with a discussion of this challenge.
1.
Issues in Accessing and Sharing Confidential Survey and Social
Science Data
Virginia A. de Wolf, USA
Researchers collect data from both individuals and organizations
under pledges of confidentiality. The U.S. Federal statistical
system has established practices and procedures that enable
others to access the confidential data it collects. The two
main methods are to restrict the content of the data (termed
"restricted data") prior to release to the general
public and to restrict the conditions under which the data can
be accessed, i.e., at what locations, for what purposes (termed
"restricted access"). This paper reviews restricted
data and restricted access practices in several U.S. statistical
agencies. It concludes with suggestions for sharing confidential
social science data.
2. Contemporary Statistical Techniques
for Closing the "Confidentiality Gap" in Behavioral
Science Research
John L. Horn, Department of Psychology, University of Southern
California, USA
Over the past three
decades, behavioral scientists have become acutely aware of
the need for both the privacy of research participants and the
confidentiality of research data. During this same time period,
knowledgeable researchers have created a variety of methods
and procedures to ensure confidentiality. But many of the best
techniques used were not designed to permit the sharing of research
data with other researchers outside of the initial data collection
group. Since a great deal of behavioral science data collected at the individual level requires such protections, it cannot easily be shared with others in a confidential way. These practical
problems have created a great deal of confusion and a kind of
"confidentiality gap" among researchers and participants
alike. This presentation will review some available "statistical"
approaches to deal with these problems, and examples will be
drawn from research projects on human cognitive abilities. These
statistical techniques range from the classical use of replacement
or shuffled records to more contemporary techniques based on
multiple imputations. In addition, new indices will be used to relate the potential loss of data accuracy to the loss of confidentiality. These indices will help researchers define
the confidentiality gap in their own and any other research
project.
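As a hedged illustration of two of the classical ideas mentioned above, record shuffling and noise addition (not the presenter's method; names and parameters are invented):

```python
# Hedged sketch of two classical disclosure-limitation ideas: swapping
# (shuffling) a sensitive variable across records, and adding random noise.
# Illustrative only, with invented data.
import random

def swap_column(records, column, seed=0):
    """Return copies of the records with one column's values shuffled
    across individuals, breaking the link to identities."""
    rng = random.Random(seed)
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]

def add_noise(records, column, sd=1.0, seed=0):
    """Return copies with zero-mean Gaussian noise added to one column."""
    rng = random.Random(seed)
    return [{**r, column: r[column] + rng.gauss(0.0, sd)} for r in records]

data = [{"id": i, "score": 100 + i} for i in range(5)]
print(swap_column(data, "score"))
print(add_noise(data, "score", sd=2.0))
```

Indices of the kind described would then compare analytic results on the masked data against the original to quantify the accuracy lost for a given gain in protection.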
References
- Feinberg, S.E., & Willenborg, L.C.R.J. (1998). Special issue on "Disclosure limitation methods for protecting confidentiality of statistical data." Journal of Official Statistics, 14(4), 337-566.
- Willenborg, L.C.R.J., & de Waal, T. (2001). Elements of Statistical Disclosure Control. Lecture Notes in Statistics, 155. New York: Springer-Verlag.
- Clubb, J.M., Austin, E.W., Geda, C.L., & Traugott, M.W. (1992). Sharing research data in the social sciences. In G.H. Elder, Jr., E.K. Pavalko, & E.C. Clipp, Working with Archival Data: Studying Lives (pp. 39-75). SAGE Publications.
- Willenborg, L.C.R.J., & de Waal, T. (1996). Statistical Disclosure Control in Practice. Lecture Notes in Statistics, 111. New York: Springer-Verlag.
3. NASA Aviation Safety Reporting System (ASRS)
Linda J. Connell, NASA Ames Research Center, USA
In 1974, the United States experienced a tragic aviation accident
involving a B-727 on approach to Dulles Airport in Virginia.
All passengers and crew were killed. The accident was classified
as a Controlled Flight Into Terrain event. During the NTSB accident
investigation, it was discovered from ATC and cockpit voice
recorder tapes that the crew had become confused over information
regarding the approach instructions, both in information provided
in approach charts and the ATC instruction "cleared for
the approach". It was discovered that another airline had
experienced a similar chain of events, but they detected the
error and increased their altitude. This action allowed them to avoid the oncoming mountain. The second event would be classified as an incident. The benefit of this information spread rapidly within that airline, but it had not reached other airlines. As a result
of the NTSB findings, the FAA and NASA created the Aviation
Safety Reporting System in 1976. The presentation will describe
the background and principles that guide the operation of the
ASRS. The presentation will also include descriptions of the
uses of and products from approximately 490,000 incident reports.
Track II-D-2:
Technical Demonstrations
Chairs:
Richard Chinman, University Corporation for Atmospheric Research, Boulder, CO, USA
Robert S. Chen, CIESIN, Columbia University, USA
1.
World Wide Web Mirroring Technology of the World Data Center
System
David M. Clark, World Data Center Panel, NOAA/NESDIS, USA
The widespread implementation and acceptance of the World Wide
Web (WWW) has changed many facets of the techniques by which
Earth and environmental data are accessed, compiled, archived,
analyzed and exchanged. The ICSU World Data Centers, established
over 50 years ago, are beginning to use this technology as they
evolve into a new way of operations. One key element of this
new technology is known as WWW "mirroring." Strictly speaking, mirroring is reproducing web content exactly from one site to another at a physically separated location. However, there are other types of mirroring that use the same technology but differ in the appearance and/or content of the site. The WDCs are beginning to use these three types
of mirroring technology to encourage new partners in the WDC
system. These new WDC partners bring a regional diversity or
a discipline specific enhancement to the WDC system. Currently
there are ten sites on five continents mirroring a variety of
data types using the different modes of mirroring technology.
These include paleoclimate data mirrored in the US, Kenya, Argentina
and France, and space environment data mirrored in the US, Japan,
South Africa, Australia and Russia. These mirror sites have
greatly enhanced the exchange and integrity of the respective
discipline databases. A demonstration of this technology will
be presented.
2.
Natural Language Knowledge Discovery: Cluster Grouping Optimization
Robert J. Watts, U.S. Army Tank-automotive and Armaments
Command, National Automotive Center, USA
Alan L. Porter, Search Technology, Inc. and Georgia Tech, USA
Donghua Zhu, Beijing Institute of Technology, China
The Technology Opportunities
Analysis of Scientific Information System (Tech OASIS), commercially
available under the trade name VantagePoint, automates the identification
and visualization of relationships inherent in sets (i.e., hundreds
or thousands) of literature abstracts. A Tech OASIS proprietary
approach applies principal components analysis (PCA), multi-dimensional
scaling (MDS) and a path-erasing algorithm to elicit and display
clusters of related concepts. However, cluster groupings and
visual representations are not singular for the same set of
literature abstracts (i.e., user selection of the items to be
clustered and the number of factors to be considered will generate
alternative cluster solutions and relationships displays). Our
current research, the results of which shall be demonstrated,
seeks to identify and automate selection of a "best"
cluster analysis solution for a set of literature abstracts.
How then can a "best" solution be identified? Research
on quality measures of factor/cluster groups indicates that
those that appear promising are entropy, F measure and cohesiveness.
Our developed approach strives to minimize the entropy and F
measures and maximize cohesiveness, and also considers set coverage.
We apply this to automatically map conceptual (term) relationships
for 1202 abstracts concerning "natural language knowledge
discovery."
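As a hedged illustration of the kind of cluster-quality measures mentioned (not the Tech OASIS implementation; the scoring and data are invented), entropy and coverage of a candidate grouping can be computed as follows:

```python
# Illustrative scoring of a candidate clustering by entropy and coverage,
# two of the measures mentioned above; here lower entropy and higher
# coverage are preferred. Data and labels are invented.
import math

def cluster_entropy(clusters, labels):
    """Weighted average entropy of reference labels within each cluster."""
    total = sum(len(c) for c in clusters)
    h = 0.0
    for members in clusters:
        counts = {}
        for item in members:
            lab = labels[item]
            counts[lab] = counts.get(lab, 0) + 1
        h_c = -sum((n / len(members)) * math.log2(n / len(members))
                   for n in counts.values())
        h += (len(members) / total) * h_c
    return h

def coverage(clusters, n_items):
    """Fraction of the abstract set assigned to any cluster."""
    return len({i for c in clusters for i in c}) / n_items

labels = {0: "nlp", 1: "nlp", 2: "kdd", 3: "kdd", 4: "nlp"}
candidate = [[0, 1, 4], [2, 3]]
print(cluster_entropy(candidate, labels), coverage(candidate, 5))
```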
3.
ADRES: An online reporting system for veterinary hospitals
P.K. Sidhu and N.K. Dhand, Punjab Agricultural University, India
An animal husbandry department reporting system (ADRES) has
been developed for online submission of monthly progress reports
of veterinary hospitals. It is a database prepared under Microsoft
Access 2000, which has records of all the veterinary hospitals
and dispensaries of animal husbandry department, Punjab, India.
Every institution has been given a separate ID. The codes for
various infectious diseases have been selected according to
the codes given by OIE (Office International des Epizooties).
In addition to reports about disease occurrence, information
can also be recorded on the progress of the insemination program, animals slaughtered in abattoirs, animals exported to other states and countries, animal welfare camps held, farmer training camps organized, etc. Records can be easily compiled on sub-division,
district and state basis and reports can be prepared online
for submission to the Government of India. It is envisaged that the system will make report submission digital, efficient, and accurate. Although the database has been developed primarily for Punjab State, other states of India and other countries may also easily use it.
4.
PAU_Epi~AID: A relational database for epidemiological, clinical
and laboratory data management
N.K. Dhand, Punjab Agricultural University, India
A veterinary database (Punjab Agricultural University Epidemiological
Animal Disease Investigation Database, PAU_Epi~AID) has been
developed to meet the requirements of data management during
outbreak investigations, monitoring and surveillance, clinical
and laboratory investigations. It is based on Microsoft Access
2000 and includes a databank of digitalized information of all
states and union territories of India. Information of districts,
sub divisions, veterinary institutions and important villages
of Punjab (India) has also been incorporated, every unit being
represented by an independent numeric code. More than 60 interrelated tables have been prepared for registering information on animal disease outbreaks; farm data (viz., housing, feeding, management, past disease history, vaccination history, etc.); and general animal information, production, reproduction, and disease data. Findings of various laboratory sections such as bacteriology, virology, pathology, parasitology, molecular biology, toxicology, and serology can also be documented. Data can be easily entered in simple forms hyperlinked to one another, which allow queries and report preparation at the click of a mouse. Flexibility has been provided
for additional requirements due to diverse needs. The database
may be of immense use in data storage, retrieval and management
in epidemiological institutions and veterinary clinics.
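A hedged sketch, using SQLite rather than the Microsoft Access system described, of the kind of interrelated, code-keyed tables involved (table and column names, and the sample disease code, are illustrative assumptions):

```python
# Illustrative analogue of interrelated, code-keyed epidemiological tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE village (
    village_code INTEGER PRIMARY KEY,
    name TEXT, district TEXT, state TEXT
);
CREATE TABLE outbreak (
    outbreak_id INTEGER PRIMARY KEY,
    village_code INTEGER REFERENCES village(village_code),
    disease_code TEXT,          -- e.g. an OIE-style disease code
    reported_on TEXT
);
CREATE TABLE lab_finding (
    finding_id INTEGER PRIMARY KEY,
    outbreak_id INTEGER REFERENCES outbreak(outbreak_id),
    section TEXT,               -- bacteriology, virology, pathology, ...
    result TEXT
);
""")
conn.execute("INSERT INTO village VALUES (1, 'Ludhiana', 'Ludhiana', 'Punjab')")
conn.execute("INSERT INTO outbreak VALUES (1, 1, 'FMD', '2002-06-01')")
print(conn.execute(
    "SELECT v.name, o.disease_code FROM outbreak o "
    "JOIN village v ON v.village_code = o.village_code").fetchall())
```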
5.
Archiving Technology for Natural Resources and Environmental
Data in Developing Countries, A Case Study in China
Wang Zhengxing, Chen Wenbo, Liu Chuang, Ding Xiaoqiang, Chinese
Academy of Sciences, China
Data archiving has long been regarded as a less important sector
in China. As a result, there is no long-term commitment at the
national level to preserve natural resources data, and usually
smaller budgets for data management than for research. Therefore,
it is essential to develop a feasible strategy and technology
to manage the exponential growth of the data. The strategy and
technology should be cost-saving, robust, user-friendly, and
sustainable in the long run. A PC-based system has been developed
to manage satellite imagery, Geographic Information System (GIS)
maps, tabular attribute data, and text data. The data in text
format include data policies compiled from international, national,
and regional organizations. Full documentation on these data is on-line and free to download. Only metadata and documentation
are on-line for GIS maps and tabular data; the full datasets
are distributed by CD-ROM, e-mail, or ftp.
Remote sensing data are often too expensive for developing
countries. An agreement has been reached between GCIRC and remote
sensing receiving station vendors. According to the agreement,
GCIRC can freely use the remote sensing data (MODIS) from the
receiving station, conditional on making their system available
to demonstrate to potential buyers. This assures the most important
data source for archiving. Considering the huge volumes of data
and limited PC capacity, only quick-look images and metadata
are permanently on-line. Users can search for data by date,
geolocation, or granule. Full 1B images are updated daily and
kept on-line for one week; users can download the recent data
for free. All raw data (direct broadcast) and 1B images are
archived on CD-ROMs, which are easy to read using a personal
computer.
6.
Delivering interdisciplinary spatial data online: The Ramsar
Wetland Data Gateway
Greg Yetman and Robert S. Chen, Columbia University, USA
Natural resource managers and researchers around the world are
facing a range of cross-disciplinary issues involving global
and regional environmental change, threats to biodiversity and
long-term sustainability, and increasing human pressures on
the environment. They must increasingly harness a range of socioeconomic
and environental data to better understand and manage natural
resources at local, regional, and global scales.
This demonstration will illustrate an online information resource
designed to help meet the interdisciplinary data needs of scientists
and resource managers concerned with wetlands of international
importance. The Ramsar Wetland Data Gateway, developed in collaboration
with the Ramsar Bureau and Wetlands International, combines
relational database technology with interactive mapping tools
to provide powerful search and visualization capabilities across
a range of data from different sources and disciplines. The
Gateway is also being developed to support interoperable data
access across distributed spatial data servers.
Track I-D-3:
Land Remote Sensing - Landsat Today and Tomorrow
Chairs: Hedy Rossmeissl and John Faundeen, US Geological Survey, USA
Scientists in earth science research and applications
and map-makers have for many years been avid users of
remotely sensed Landsat data. Remote sensing technology, and Landsat data in particular, is extremely useful for illustrating current conditions and temporal change in order to monitor and assess the impacts of natural disasters; for aiding in the management of water, biological, energy, and mineral resources; for evaluating environmental conditions; and for enhancing the quality of life for citizens
across the globe. The size of the image files, however,
raises a variety of data management challenges. This session
will focus specifically on the 30-year experience with
Landsat image data and will examine four components: 1)
image tasking, access, and dissemination, 2) applications
and use of the imagery, 3) data archiving, and 4) the
future of the Landsat program.
1.
Tasking, Archiving & Dissemination of Landsat Data
Thomas J. Feehan, Canada Centre for Remote Sensing, Natural
Resources Canada, Canada
The Canada Centre
for Remote Sensing of Natural Resources Canada (CCRS) operates
two satellite ground receiving stations, the Prince Albert Satellite
Station located in Prince Albert Saskatchewan and the Gatineau
Satellite Station located in Cantley, Quebec. The CCRS stations
provide a North American data reception capability, acquiring
data to generate knowledge and information critical to resource
use decision making on local, regional, national and global
scales. CCRS' primary role is to provide data related to land
resources and climate change, contributing to sustainable land
management in Canada.
Operating in a multi-mission environment, including LANDSAT,
the CCRS stations have accumulated an archive in excess of 300
TeraBytes, dating back to 1972, when CCRS started receiving
LANDSAT-1 (ERTS-1) data at the Prince Albert Satellite Station.
Data are made available to support near-real time applications
including ice monitoring, forest fire monitoring and mapping,
as well as non-real time applications such as climate change,
land use and topographic mapping. LANDSAT MSS, TM and ETM+ data
constitute a significant portion of the CCRS archive holdings.
In addition to Canadian public-good data use, a spin-off benefit is commercial exploitation by a CCRS distributor and value-added services network.
2.
The Work of the U.S. National Satellite Land Remote Sensing
Data Archive Committee: 1998 - 2000
Joanne Irene Gabrynowicz, National Remote Sensing and Space
Law Center, University of Mississippi School of Law, USA
Earth observation data have been acquired and stored since the
early 1970s. One of the world's largest, and most important,
repositories for land satellite data is the Earth Resources
Observation Systems (EROS) Data Center (EDC). It is a data management,
systems development, and research field center for the U.S.
Geological Survey's (USGS) National Mapping Discipline in Sioux
Falls, South Dakota, USA. It was established in the early 1970s
and in 1992, the U.S. Congress established the National Satellite
Land Remote Sensing Data Archive at EDC. Although data have
been acquired and stored for decades, the world's remote sensing
community has only recently begun to address long-term data
preservation and access. One such effort was made recently by
remote sensing leaders from academia, industry and government
as members of a federal advisory committee from 1998 to 2000.
This presentation provides a brief account of the Committee's
work product.
3.
An Overview of the Landsat Data Continuity Mission (LDCM)
Bruce K. Quirk and Darla M. Duval*, U.S. Geological Survey
EROS Data Center, USA
Since 1972, the Landsat program has provided continuous observations
of the Earth's land areas, giving researchers and policy makers
an unprecedented vantage point for assessing global environmental
changes. The analysis of this record has driven a revolution
in terrestrial remote sensing over the past 30 years. Landsat
7, which was successfully launched in 1999, returned operation
of the Landsat program to the U.S. Government. Plans have been
made for the follow-on to Landsat 7, the Landsat Data Continuity
Mission (LDCM), which has a planned launch date of late 2006.
The scientific need for Landsat-type observations has not diminished
through time. Changes in global land cover have profound implications
for the global carbon cycle, climate, and functioning of ecosystems.
Furthermore, these changes must be monitored continually in
order to link them to natural and socioeconomic drivers. Landsat
observations play a key role, because they occupy that unique
part of the spatial-temporal domain that allows human-induced
changes to be separated from natural changes. Coarse-resolution
sensors, such as the Moderate-Resolution Imaging Spectroradiometer
(MODIS) and the Advanced Very High Resolution Radiometer (AVHRR)
are ideal for monitoring the daily and weekly changes in global
biophysical conditions but lack the resolution to accurately
measure the amount and origin of land cover change. High-resolution
commercial systems, while valuable for validation, cannot acquire
sufficient global data to meet scientific monitoring needs.
Landsat-type observations fill this unique niche.
A joint effort between NASA, the U.S. Geological Survey (USGS),
and the private sector, LDCM will continue the Landsat legacy
by incorporating enhancements that reduce system cost and improve
data quality. Following the 1992 Land Remote Sensing Policy
Act, the LDCM seeks a commercially owned and operated system
selected through a competitive procurement. Unlike earlier Landsat
commercialization efforts, however, the LDCM procurement is
based on a rigorous Science Data Specification and Data Policy,
which seeks to guarantee the quantity and quality of the data
while preserving reasonable cost and unrestricted data rights
for end users. Thus the LDCM represents a unique opportunity
for NASA and the USGS to provide science data in partnership
with private industry and to reduce cost and risk to both parties,
while creating an environment to expand the commercial remote
sensing market.
The data specification requires the provision of 250 scenes
per day, globally distributed, with modest improvements in radiometric
signal-to-noise ratio (SNR) and dynamic range. Two additional bands
have been added: an "ultra-blue" band centered at
443 nm for coastal and aerosol studies, and a band at either
1,375 or 1,880 nm for cirrus cloud detection. No thermal bands
will be included on this mission. Additional details are available
on the LDCM specification, mission concept, and status.
* Raytheon. Work performed under U.S. Geological Survey contract
1434-CR-97-CN-40274.
4.
Current Applications of Landsat 7 Data in Texas
Gordon L. Wells, Center for Space Research, The University of
Texas at Austin, USA
The rapid delivery of timely information useful to decision
makers is one of the primary goals of the data production and
application programs developed by the Mid-American Geospatial
Information Center (MAGIC) located at the University of Texas
at Austin's Center for Space Research. In a state the size and
nature of Texas, geospatial information collected by remote
sensing satellites can assist a broad range of operational activities
within federal, state, regional and local government departments.
In the field of emergency management, the state refreshes its
imagery basemap using Landsat 7 data on a seasonal basis to
capture the locations of recent additions to street and road
networks and new structures that might be vulnerable to wildfires
or flash floods. Accurately geolocated satellite imagery can
be incorporated into the geographic information system used
by the Governor's Division of Emergency Management much more
rapidly than updated records received from the department of
transportation or local entities. For many activities involving
the protection and enhancement of natural resources, Landsat
7 data offer the most economical and effective means to address
problems that affect large areas. Invasive species detection
and eradication is a current concern of the Texas Department
of Agriculture, Texas Soil and Water Conservation Board and
the Upper Colorado River Authority. Invasive saltcedar is one
noxious species that can be identified and removed with the
help of satellite remote sensing. The information required by
policy makers may extend beyond state borders into regions where
satellite reconnaissance is the only practical tool available.
For international negotiations involving the shared water resources
of Texas and Mexico, satellite imagery has made a valuable contribution
to the monitoring of irrigation activities and the local effects
of drought conditions. In the future, there will be increasing
concentration on shortening the time lag between the collection
of instrument data by MAGIC's satellite receiving station and
final product delivery in the projection, datum and file format
required for immediate inclusion into operational analyses by
the various agencies in the region.
5.
Development of Land Cover Database of East Asia
Wang Zhengxing, Zhao Bingru, Liu Chuang, Global Change Information
and Research Center, Institute of Geography and Natural Resource
Research, Chinese Academy of Sciences, China
Land cover plays
a major role in a wide range of fields from global change to
regional sustainable development. Although land cover has dramatically
changed over the last few centuries, until now there has been
no consistent way of quantifying the changes globally (Nemani
and Running, 1995). The land cover datasets currently used for
parameterization of global climate models are typically derived
from a range of preexisting maps and atlases (Olson and Watts,
1982; Matthews, 1983; Wilson and Henderson-Sellers, 1985), an
approach that has several limitations (Strahler and Townshend,
1996). Another important data source is statistical reporting,
but some statistical land cover data seem unreliable. At present,
the only practical way to develop a land cover dataset consistently,
continuously, and at a global scale is satellite remote sensing.
This is also true for the development of a land cover dataset
of East Asia.
The 17-class IGBP land cover classification includes eleven classes
of natural vegetation, three classes of developed and mosaic lands,
and three classes of non-vegetated lands. This system may be
useful at the global level, but it has a serious shortcoming:
only one class for arable land. Since arable land is the
most dynamic and important component of the man-nature system, it
is essential to characterize the arable land sub-system in more
detail.
There is still some potential for finer classification within the
current 1-km AVHRR-NDVI data sets. A decision tree classifier
is used to assign all input data to various pre-defined
classes. The key to accurate interpretation is to identify more
reliable links (decision rules) between the input data and the
output classes. The basic premise underlying the decision tree is
that any land cover class should correspond to an identifiable point
in a multi-dimensional feature space, including multi-temporal NDVI,
phenology, ecological region, DEM, census data, etc. The preliminary
research shows that stratification by ecological region and DEM can
simplify the decision tree structure and yield more meaningful
classes in China's major agricultural regions. Arable land cover
may be classified at two levels, as sketched below: the first level
describes how many times per year crops are planted, and the second
level describes the crop characteristics.
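As a rough illustration of this rule-based approach, the fragment below trains a shallow decision tree on stacked per-pixel features (multi-temporal NDVI, elevation, and an ecological-region code). It is a minimal sketch only: the feature layout, class labels, and synthetic training data are assumptions for demonstration, not the project's actual inputs or rules, and scikit-learn's DecisionTreeClassifier stands in for whatever classifier was actually used.

# Minimal sketch (assumed feature layout and labels, not the authors' code):
# classify pixels with a shallow decision tree over stacked features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Each pixel: 12 monthly NDVI values, elevation (m), ecological-region code.
n_train = 500
X_train = np.column_stack([
    rng.uniform(0.0, 0.9, size=(n_train, 12)),    # multi-temporal NDVI
    rng.uniform(0.0, 3000.0, size=(n_train, 1)),  # DEM elevation
    rng.integers(1, 10, size=(n_train, 1)),       # ecological-region code
])
# Hypothetical labels: 0 = single-cropped arable, 1 = double-cropped arable,
# 2 = forest, 3 = grassland (stand-ins for the pre-defined classes).
y_train = rng.integers(0, 4, size=n_train)

# A shallow tree keeps the decision rules explicit and interpretable,
# mirroring the idea of reliable links between inputs and output classes.
clf = DecisionTreeClassifier(max_depth=6, random_state=0)
clf.fit(X_train, y_train)

print(clf.predict(X_train[:10]))  # predicted classes for the first ten pixels

With real inputs, the synthetic arrays would be replaced by per-pixel stacks drawn from the NDVI time series, DEM, and ecological-region layers, and the labels by training sites of known land cover.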
The current land cover classification based on 1-km AVHRR-NDVI
data sets still has serious limitations for the parameterization
of some models. The nominal 1-km spatial resolution images
produce quite a lot of mixed pixels, yet some models, e.g. the
DNDC model, need pure pixels. However, the forthcoming 250-m
MODIS-EVI data set will narrow the gap between model needs and
data supply to some extent, as the sketch below illustrates. Using
the approaches developed for AVHRR, MODIS will yield more reliable
land cover data for East Asia.
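The resolution argument can be made concrete with a small, purely illustrative calculation: aggregate a patchy synthetic 250-m class map to nominal 1-km and 500-m pixels and count how many aggregated pixels remain "pure" (a single class throughout). Everything here, including the synthetic landscape, is an assumption for demonstration and is not derived from the actual data sets.

# Illustrative sketch only: why coarser pixels are more often mixed, and why
# finer pixels narrow the gap for models that need pure pixels.
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(1)

# Patchy synthetic 250-m classification: smoothed random noise binned into
# four hypothetical land cover classes over a 400 x 400 grid (~100 x 100 km).
smooth = uniform_filter(rng.random((400, 400)), size=25)
fine = np.digitize(smooth, np.quantile(smooth, [0.25, 0.5, 0.75]))

def pure_fraction(classmap, block):
    """Fraction of block x block aggregates whose cells all share one class."""
    n = classmap.shape[0] // block
    b = classmap[:n * block, :n * block].reshape(n, block, n, block)
    b = b.swapaxes(1, 2).reshape(n, n, block * block)
    return (b == b[..., :1]).all(axis=-1).mean()

# Aggregating 250-m cells to nominal 1-km pixels leaves many mixed pixels;
# a 500-m aggregation retains noticeably more pure pixels.
print(f"pure at ~1 km  (4x4 of 250 m): {pure_fraction(fine, 4):.1%}")
print(f"pure at ~500 m (2x2 of 250 m): {pure_fraction(fine, 2):.1%}")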
Track
II-D-1:
Roundtable Discussion on Preservation and Archiving of
Scientific and Technical Data in Developing Countries
Chair: William
Anderson, Praxis101, Rye, NY, USA
Session Organizers:
William Anderson, US National Committee for CODATA
Steve Rossouw, South African National Committee for CODATA
Liu Chuang, Chinese Academy of Sciences, Beijing, China
Paul F. Uhlir, US National
Committee for CODATA
|
A Working Group
on Scientific Data Archiving was formed following the 2000 CODATA
Conference. The primary objective of this Working Group has
been to create a focus within CODATA on the issues of scientific
and technical data preservation and access. This Working Group,
co-chaired by William Anderson and Steve Rossouw, has co-organized
a workshop on data archiving with the South African National
Research Foundation in Pretoria in May 2002. The Working Group
is preparing a report of its activities from 2001-2002.
Another initiative
of the Working Group has been to propose the creation of a CODATA
"Task Group on Preservation and Archiving of S&T Data
in Developing Countries." The proposed objectives of that
Task Group are to: promote a deeper understanding of the conditions
in developing countries with regard to long-term preservation,
archiving, and access to scientific and technical (S&T)
data; advance the development and adoption of improved archiving
procedures, technologies, standards, and policies; provide an
interdisciplinary forum and mechanisms for exchanging information
about S&T data archiving requirements and activities, with
particular focus on the concerns of developing countries; and
publish and disseminate broadly the results of these efforts.
The proposed Task Group would be co-chaired by William Anderson
and Liu Chuang.
An additional related
proposal of the Working Group has been to create a Web portal
on archiving and preservation of scientific and technical data
and information. This portal, which would be developed jointly
by CODATA with the International Council for Scientific and
Technical Information and other interested organizations, would
provide information about and links to online:
- Scientific and technical data and information archiving procedures,
technologies, standards, and policies;
- Discipline-specific and cross-disciplinary archiving projects
and activities; and
- Expert points of contact in all countries, with particular
attention to those in developing countries.
Reports on all these
activities will be given at the Roundtable and will then be
discussed with the individuals who attend this session.
Overview
and Grand Challenges
Thursday, 3 October
1245 - 1330
Chair: Fedor
Kuznetzov, Institute of Inorganic Chemistry, Novosibirsk,
Russia |
Preserving
Scientific Data: Supporting Discovery into the Future
John Rumble, CODATA President
A wide variety of methods have been used to save and preserve
scientific data for thousands of years. The physical nature
of these means and the inherent difficulties of sharing the
physical media with others who need the data have been major
barriers in advancing research and scientific discovery. The
information revolution is changing this in many significant
ways: ease of availability, breadth of distribution, size and
completeness of data sets, and documentation. As a consequence,
scientific discovery itself is changing now, and in the future,
perhaps even more dramatically. In this talk I will review some
historical aspects of data preservation and the use of data
in discovery, and offer some speculations on how preserving
data digitally might revolutionize scientific discovery.
|