Semantic Description, Publication and Discovery of Workflows in myGrid
Simon Miles1, Juri Papay1, Chris Wroe2, Phillip Lord2, Carole Goble2, Luc Moreau1
sm@ecs.soton.ac.uk; jp@ecs.soton.ac.uk; chris.wroe@cs.man.ac.uk; p.lord@russet.org.uk; carole@cs.man.ac.uk; L.Moreau@ecs.soton.ac.uk
1 School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK;
2 Department
of Computer Science, University of Manchester, Oxford Road, Manchester
M13 9PL, UK
Abstract
The bioinformatics
scientific process relies on in silico
experiments, which are experiments executed in full in a computational
environment. Scientists wish to encode the designs of these experiments
as workflows because they provide minimal, declarative descriptions
of the designs, which overcome many
barriers to the sharing and re-use of these designs between
scientists and enable the use of the most appropriate services
available at any one time. We anticipate that the number of workflows
will increase quickly as more scientists begin to make use of existing
workflow construction tools to express their experiment designs,
but discovery then becomes an increasingly
hard problem, as it becomes more difficult for a scientist to identify
the workflows relevant to their particular research goals amongst all
those on offer. While many approaches
exist for the publishing and discovery
of services, there have been no attempts to
address where and how authors of experimental designs should advertise
the availability of their work or how relevant workflows can be discovered
with minimal effort from the user. As the users designing and adapting
experiments will not necessarily have a computer science background,
we also have to consider how publishing and discovery can be achieved
in such a way that they are not required to have detailed technical
knowledge of workflow scripting languages. Furthermore, we believe they
should be able to make use of others’ expert knowledge (the semantics)
of the given scientific domain. In this paper, we define the issues
related to the semantic description, publishing and discovery of workflows,
and demonstrate how the architecture created by the myGrid
project aids scientists in this process. We give a walk-through of how
users can construct, publish, annotate, discover and enact workflows
via the user interfaces of the myGrid architecture; we then
describe novel middleware protocols, making use of the Semantic Web
technologies RDF and OWL to support workflow publishing and discovery.
1. Introduction
Traditionally,
the biological scientific process has involved experiments on living
systems, in vivo, or on parts of a living system in a test tube,
in vitro. Bioinformatics has focused on supporting the experimental
biologist by enabling many more experiments to be carried out in
silico, that is computationally. As this better supports automation
and also harnesses the collective knowledge of the discipline, in
silico biological experiments have greatly enabled the process of
validating hypotheses, or gathering additional
information to shape the design of future experiments. For science
to become efficient, experiments need to be
shared, adapted and reused; distributed architectures on the Internet promise to
be the most effective mechanism to achieve this goal for in silico
experiments.
Both
the Web Services and Grid architectures [WSArch, OGSA] have adopted a service-oriented
approach, in which computational resources, storage resources, programs
and databases can be represented by services. In such a
context, a service is a network-enabled entity capable of encapsulating
diverse implementations behind a common interface. The benefit of such
a uniform view is that it facilitates the composition of services into
more sophisticated services, thereby promoting sharing and reuse of resources
in distributed environments. To this end, a number of workflow
languages have emerged which are capable of describing complex compositions
of services, e.g. WSFL [WSFL], XLANG [XLANG], BPEL [BPEL], XScufl [SCUFL].
However, service-oriented architectures currently provide no mechanism to facilitate the sharing of workflows. At present, workflow authors simply list their available workflows on a Web page. With an increasing number of workflows, and of sites listing them, searching for them in this way will soon become untenable.
DAML-S [DAMLS] considers workflows
as largely equivalent to services with regard to publishing and discovery
because they are functional entities that are identified by their interface
(inputs and outputs) and overall function. Therefore, as with
services, workflows need to be published in order to be discovered and
reused. However, publishing a workflow involves two distinct steps:
first, the workflow script must be archived
in a repository from which it can be publicly retrieved; next, a description
of and a reference (e.g. a URI) to its script need to be advertised
in a registry. In this context, a registry is defined as
a service holding descriptions of workflows and services. Many protocols
for publishing service descriptions, including de facto standards such
as UDDI [UDDI], Jini [Jini], and BioMoby [WL02a], do not, in themselves,
address publication and discovery of workflows. DAML-S, on the other
hand, is an ontology capable of describing complex processes, but is
not a registry system for publishing and discovery.
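The two publishing steps described above (archiving the script in a repository, then advertising a description and a reference in a registry) can be sketched as follows. This is a minimal in-memory illustration, not the myGrid protocol: the class names `Repository` and `Registry`, the `publish` helper, the example base URI and the placeholder script text are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Archives workflow scripts and returns a URI for each (hypothetical)."""
    base_uri: str
    scripts: dict = field(default_factory=dict)

    def archive(self, name: str, script_text: str) -> str:
        uri = f"{self.base_uri}/{name}"
        self.scripts[uri] = script_text
        return uri

    def retrieve(self, uri: str) -> str:
        return self.scripts[uri]


@dataclass
class Registry:
    """Holds descriptions of workflows and services, keyed by script URI."""
    entries: dict = field(default_factory=dict)

    def advertise(self, uri: str, description: dict) -> None:
        self.entries[uri] = description

    def find(self, **criteria) -> list:
        # Return the URIs whose description matches every given criterion.
        return [uri for uri, desc in self.entries.items()
                if all(desc.get(k) == v for k, v in criteria.items())]


def publish(repo: Repository, registry: Registry,
            name: str, script_text: str, description: dict) -> str:
    uri = repo.archive(name, script_text)   # step 1: archive the script
    registry.advertise(uri, description)    # step 2: advertise description + reference
    return uri


# Assumed example values; the script body is a placeholder, not real SCUFL.
repo = Repository("http://repository.example.org/workflows")
registry = Registry()
uri = publish(repo, registry, "CandidateGeneAnalysis.scufl", "<scufl/>",
              {"task": "SNP_annotation", "input": "probe set ID"})
```

Keeping the two stores separate mirrors the distinction drawn in the text: the registry only ever holds a description and a reference, so a discovered workflow is fetched from the repository in a second step via `repo.retrieve(uri)`.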
Once
a published workflow has been discovered, scientists use their expert
knowledge of the scientific field to judge whether a design is applicable
to their own work. Unfortunately, such domain-specific knowledge is
not readily available from workflow scripts, which are engineered in
terms of programmatic notions such as interfaces, ports, operations
and messages of the service-oriented architecture in use [WSDL]; furthermore, domain
knowledge cannot be inferred from these low-level notions. However,
semantic descriptions can be added to workflows, in order to make
high-level knowledge explicit; these must be machine interpretable if
tools are to be capable of recommending the applicability of workflows
based on the domain-specific knowledge of a scientist.
myGrid
[myGrid] is a pilot project
funded by the UK e-Science programme to develop Grid middleware in a biological
sciences context. The goal of the myGrid project [myGrid] is to develop a
software infrastructure that provides support for bioinformaticians
in the design and execution of workflow-based in silico experiments
using the resources of the Grid [FK03].
In silico experiments can operate over the Grid, in which resources
are geographically distributed and managed by multiple institutions,
and the necessary tools, services and workflows are discovered and invoked
dynamically. It is a data-intensive Grid, where the complexity
is in the data itself, the number of repositories and tools that need
to be involved in the computations, and the heterogeneity of the data,
operations and tools. The myGrid architecture includes components
for composing workflows, annotating them with semantic descriptions,
publishing semantically described workflows, reasoning over semantic
descriptions, discovering workflows from semantic queries and executing
them. In previous papers, we have discussed various facets of our approach
to service publication and discovery, namely its preliminary design
[LWS+03b], its protocol for
annotating service descriptions [MPP+04] and its performance
[MLM+04, MPD+03a]. The purpose of
this paper is to discuss the final design of our architecture for workflow
publication and discovery, and its implementation and integration in
an electronic lab-book, for manipulation by the scientist. Specifically,
this paper focuses on the following technical contributions of the
myGrid architecture for publishing and discovering workflow-based
in silico experiments.
Section 2 presents an illustrative bioinformatics case study, including a representative workflow, to aid in describing and demonstrating the usefulness of our work. Section 3 shows the users’ perspective in sharing workflows from composition through publishing and description to discovery. In Section 4, we examine the use of semantic technologies to represent the knowledge used for discovery, and in Section 5 we define the protocols used in myGrid to process this information. The implementation of the middleware using this protocol is given in Section 6, the scope of our work and related work is discussed in Section 7 and we draw conclusions and suggest further work in Section 8.
We now present a case study to illustrate
our approach to semantic description and discovery of workflows.
The Graves' Disease application, an exemplar
application for myGrid, is intended to help the investigation
of a thyroid disorder [Graves]. Specifically, the
purpose of the application is to help biologists identify gene mutations
that may be involved in causing the condition.
The
Graves’ Disease scenario uses a well known and common "candidate
gene" approach. We assume that previous biological investigations
have been used to isolate a region on the genome in which genes affecting
Graves’ Disease may lie. By looking through this region for
variations between Graves’ Disease and normal patients, then determining
whether these variations lie within a gene, a number of candidate genes
can be found. One of the most common variations is called a Single Nucleotide
Polymorphism (SNP), which is a variation involving only a single nucleotide,
rather than a large scale change affecting many nucleotides. But often
many of these polymorphisms occur in
a region, most of which will not be related to Graves’ Disease. The
in silico process consists of gathering information from several
publicly available data resources, many of which have been made
available as services at one or more locations, describing the
current state of knowledge about the genes in question. Once such information
has been obtained for a set of candidate genes, the scientist can design
an in vitro experiment that will test their likelihood of being
involved in the disease.
To
enable re-running of the experiment and best use of Grid resources,
the experiment is encoded as a workflow, a composed set of services
or other workflows, which we refer to as CandidateGeneAnalysis.
This workflow takes a "probe set ID" referring to a gene sequence
in the Affymetrix database [Affymetrix] as input and
ultimately returns a record from the EMBL database containing information
about the sequence including SNPs. The workflow’s structure can be
seen on the left hand side of Figure 2. The specific details can be found in its script, available at http://www.ecs.soton.ac.uk/~sm/myGrid/AffyIdToEmblSnps.scufl, encoded in the SCUFL workflow language.
The SCUFL workflow language, developed as part of myGrid,
simplifies the process of creating workflows for biology by making the
language granularity and concepts as close as possible to those
that a potential user of the system would already be using [SCUFL].
Since
this experiment is more widely applicable than just for the study of
Graves’ Disease, the biologists may wish to share it with others,
and would want to do so in such a way that it can usefully be discovered
and re-used. User requirements [SGG+03a] have identified
some questions that scientists commonly ask about such kinds of experiment.
Specifically, since they aim to discover SNPs from gene sequence data,
they will seek experiments that:
Since experiments
are represented as workflows, and workflows are characterised by
their inputs, their outputs and their function, a user will specifically
seek published workflows that:
This use case indicates that we need a
large number of entities in order to perform in silico experiments;
Figure 1 summarises the terminology we adopt in this paper.
Next, we examine how users go about publishing workflows, so that
the questions above can be answered by the architecture to support
the discovery process.
Term | Description | Example |
Basic Concepts |
Workflow Language | A language for specifying a workflow. | BPEL, Scufl |
Service | An atomic entity that can be invoked. | Blast service at EBI | Concrete
Workflow | A composed set of services or other workflows and a specification of the data flow between them. | Candidate Gene Analysis |
Activity | A workflow or a service. | |
Service Type | Abstract activity definition that represents the class of a service or a workflow template. | Sequence Alignment, BLAST | Abstract
Workflow language independent |
Workflow | The salient features of a workflow that are desirable to be described conceptually or syntactically: inputs, outputs, task(s), component resources. | Input: probe set ID; Output: EMBL_SNP; Using: Affymetrix database, EMBL database, BLAST; Task: SNP_annotation |
Workflow language dependent |
Workflow Template | A workflow in which one or more of the activities are not directly invokable, but represented as a specification which can be resolved into an invokable activity. | The Candidate Gene Analysis data and control flow, choreographing service types (e.g. BLAST) instead of, or as well as, activities (e.g. BLAST at EBI). |
Workflow Script | A specific specification, defined in terms of the workflow language, that we can directly enact. | http://www.ecs.soton.ac.uk/~sm/myGrid/AffyIdToEmblSnps.scufl | Concrete
Figure 1: myGrid terminology.
The purpose
of this section is twofold. On the one hand, we illustrate our approach
to workflow publication and discovery, using snapshots of the graphical
user interface that the scientist is presented with when using the
myGrid system; the functionality of this interface was derived
from the user requirements we captured at the beginning of the
myGrid project [SGG+03a]. On the other
hand, we identify key technical requirements for the knowledge representation
that is required to support our approach. With the user-centric perspective
adopted by myGrid, we analyse the kinds of discovery that
scientists are confronted with: when composing workflows and when deciding
which scientific experiments to run. In order to be discovered, workflows
need to have been published, and we examine how suitable semantic descriptions
of these workflows can be made available to the system.
Designing a
workflow means linking together functional entities such as Web Services
or other workflows, which we refer
to as activities (see Figure 1), so that the outputs of some are used as the inputs of
others. Workflows are constructed by linking together sub-activities
that pass data between each other. Figure 2 shows the myGrid graphical workflow construction
tool Taverna [Taverna].
Figure 2: Workflow construction using Taverna. The left-hand panel contains
a depiction of the workflow itself with each box representing an activity
in the workflow; when the workflow is enacted, this activity results
in a Web Service operation call or the invocation of another workflow.
Data flows from the inputs, represented by inverted triangles, through
the linked services to the output triangle at the bottom of the workflow.
The ‘Scufl Model Explorer’ panel shows a hierarchical view of the
workflow and ‘Enactor launch’ relates to test runs of the workflow.
Workflows
need not be created from scratch: they can
be adapted and personalised from previously written workflows. As part
of personalisation, the workflow’s author needs to discover existing
activities (workflows and services) so as to include them in their design.
Hence, since both services and workflows need to be discovered, both
are listed in the ‘Available services’ panel of Figure 2. Crucially, user requirements have identified that:
1) biologists require activities to be discoverable by the
function they perform, that is, task-orientated discovery;
and 2) final selection ultimately rests with the scientist, who will select
those to be included according to the goal of the experiment they are
designing. To this end, scientists need to be able to draw on a wide
range of information about activities in order to inform their decisions.
Specifically, the following workflow descriptions
are used: the workflow’s author and their institution, the function
of the workflow, the sub-activities it may invoke (and their function),
and the inputs and outputs of the workflow expressed in biological terms.
It is interesting to note that when the scientist
considers a workflow for insertion in an experiment, they do not
regard it as a “black-box”,
because they want to know about the activities it is composed of, though
the fine details of their dependencies, control and data flows do not
matter at this stage.
Selecting
activities based on the functions they
perform helps guarantee that the overall experiment has the intended
behaviour. However, further care is needed to ensure that the composition
is operationally consistent at the transport level: data types and formats
of outputs must be compatible with the inputs they feed into.
In order to verify such constraints, service interfaces [WSDL] and an equivalent
concept for workflows need to be made available to the scientists, who
will make sure that all data are suitably converted to ensure a coherent
composition. In Taverna, the scientist is made aware of incompatible
data types and formats (encoded as MIME types) because the tool allows them only
to make links between the output of one activity and the input of another
with the same type. To that end, Taverna relies on the WSDL interface
files of services and workflows, the details of which are hidden from
the scientists by the user interface.
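The link constraint described above, where an output may only be connected to an input of the same MIME type, can be sketched as follows. This is an illustration of the rule only and does not reproduce Taverna's code: the `Port` type, the `can_link` and `link` functions, and the activity names, port names and MIME types in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """An input or output of an activity, tagged with its MIME type."""
    activity: str
    name: str
    mime_type: str


def can_link(output: Port, input_: Port) -> bool:
    """A link is allowed only when both ports carry the same MIME type."""
    return output.mime_type == input_.mime_type


def link(output: Port, input_: Port, links: list) -> None:
    """Record a data link, refusing incompatible connections."""
    if not can_link(output, input_):
        raise ValueError(
            f"{output.activity}.{output.name} ({output.mime_type}) cannot feed "
            f"{input_.activity}.{input_.name} ({input_.mime_type})"
        )
    links.append((output, input_))


# Assumed ports for illustration.
blast_out = Port("BLASTn", "report", "text/plain")
parser_in = Port("ParseReport", "report", "text/plain")
image_in = Port("RenderImage", "picture", "image/png")

links = []
link(blast_out, parser_in, links)        # accepted: same MIME type
try:
    link(blast_out, image_in, links)     # rejected: types differ
    rejected = False
except ValueError:
    rejected = True
```

Strict equality is the simplest reading of "the same type"; a real tool might also accept wildcard or more general types, which this sketch does not attempt.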
Scientists
undertake their research by iteratively selecting and running workflows
and further analysing the data they produce. myGrid aids
this process by providing the myGrid workbench, a client
side electronic lab-book through which users can perform their in
silico experiments, as well as storing and organising their data.
A typical work pattern of the scientist consists of selecting a piece
of data stored locally and asking which workflows will accept inputs
with that biological type. In the screenshot of Figure 3, the user has selected a piece of data, which is an Affymetrix
Probe Set ID referring to candidate gene data in the Affymetrix database
[Affymetrix], and asked to
find a workflow that is capable of taking this data as input.
User
requirements [SGG+03a] have identified
that bioinformaticians also want to be involved in the process of choosing
which experiments to run, and therefore, the myGrid system
does not offer fully automated workflow selection. Instead, the user
is presented with a list of workflow scripts and invited to make the
final selection. In Figure 3, two applicable workflows have been discovered and displayed
in a list with the workflow graphical depiction shown to the right,
on selection.
Figure 3: Selecting a workflow that takes an Affymetrix probe set
ID as input. The user has selected a piece of data, which is an Affymetrix
Probe Set ID referring to candidate gene data in
the Affymetrix database [Affymetrix], and
asked to find a workflow that is capable of taking this data as input.
Besides this data-driven, context-sensitive
method for discovering experiments for execution,
two more discovery modes are commonly used by the scientific community:
in a task-orientated approach, scientists attempt to discover workflows
capable of a given function, whereas in a
result-driven approach, they are looking for workflows producing
a given type of biological data.
To this end, scientists need to be able to browse through published
workflows, which have been categorised according to their inputs (data-driven),
their functionality (task-orientated)
and their outputs (result-driven).
Figure 4 illustrates the user browsing
available activities
categorised according to those three axes: the type of
input they take, the type of output
they produce and the type of task they perform.
Figure 4: Browsing categorised workflows and services. As shown,
the user can see two services/workflows available to do sequence alignment
on a gene sequence, using the services BLASTn and BLASTx.
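The three browsing axes (input, task and output) can be sketched as three indexes built over the same set of activity descriptions. This is a minimal illustration rather than the myGrid implementation: the `build_index` function and the dictionary layout are assumptions, and the semantic types given for the BLAST services are hypothetical; only `Affymetrix_probe_set_id`, `EMBL_SNP` and `SNP_annotation` are terms taken from this paper.

```python
from collections import defaultdict

# Activity descriptions; the BLAST entries use assumed type names.
ACTIVITIES = [
    {"name": "CandidateGeneAnalysis", "input": "Affymetrix_probe_set_id",
     "output": "EMBL_SNP", "task": "SNP_annotation"},
    {"name": "BLASTn", "input": "nucleotide_sequence",
     "output": "BLAST_report", "task": "sequence_alignment"},
    {"name": "BLASTx", "input": "nucleotide_sequence",
     "output": "BLAST_report", "task": "sequence_alignment"},
]


def build_index(activities, axis):
    """Group activity names by the value they have on one axis."""
    index = defaultdict(list)
    for activity in activities:
        index[activity[axis]].append(activity["name"])
    return dict(index)


by_input = build_index(ACTIVITIES, "input")    # data-driven discovery
by_task = build_index(ACTIVITIES, "task")      # task-orientated discovery
by_output = build_index(ACTIVITIES, "output")  # result-driven discovery
```

For instance, `by_input["Affymetrix_probe_set_id"]` answers the data-driven question "which workflows accept this piece of data?", while `by_task["sequence_alignment"]` lists the candidates from which the scientist makes the final selection.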
Scientists
require descriptions so they can judge which workflow is applicable
amongst the many available.
While the final decision remains with the scientists, we expect the
system to help them by sorting workflows according to
various aspects (inputs, outputs, functionality) of
their signature, and possibly to rank them. Therefore, descriptions
need to be easily processable by the computer.
Workflow descriptions can be produced by workflow authors, but need not be. Indeed, our experience in myGrid shows that it is useful for a third party to be able to provide such descriptions. For example, a description that contains useful information about the quality, accuracy or trustworthiness of the results produced by an experiment should typically be provided by end users, rather than the workflow authors. Likewise, a reference ontology of the application domain may be revised after some experiments have been designed; it may then be useful for an ontology expert to refine semantic descriptions according to the revised ontology.
Therefore,
in myGrid, we allow third-party users to generate workflow
descriptions, and provide a separate tool to help users to construct
such descriptions. The tool, called Pedro, is displayed in Figure 5, which illustrates its use to create descriptions pertaining
to the CandidateGeneAnalysis
workflow.
Figure 5: Screenshot
of a workflow being annotated with semantic descriptions using Pedro.
The various components of the workflow that can be annotated with descriptions
are displayed in the left hand panel. At a high level, the workflow
can be annotated with the organisation that has produced it and with
information about the type of biological data it takes as input and
produces as output and the overall biological function it performs.
The user has focused on a particular workflow sub-activity (named here
WORKFLOWOPERATION) and is providing information about one of the inputs
(called a PARAMETER) to that sub-activity, specifying a bioinformatics
term, ‘Affymetrix_probe_set_id’, that refers to the type and
origin of the data taken by the operation as input.
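The idea that descriptions of one workflow may come from several parties (its author, end users, an ontology expert) can be sketched by recording who made each statement. This is an assumed data model for illustration, not the myGrid annotation protocol: the `Annotation` and `AnnotationStore` names, the predicate strings, the annotator identifiers and the quality note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    subject: str    # URI of the workflow (or of one of its parts)
    predicate: str  # the property being described
    value: str      # an ontology term or free text
    annotator: str  # who supplied the description


class AnnotationStore:
    """Keeps every party's statements side by side instead of overwriting."""

    def __init__(self):
        self._annotations = []

    def annotate(self, subject, predicate, value, annotator):
        self._annotations.append(Annotation(subject, predicate, value, annotator))

    def about(self, subject, annotator=None):
        """Return the statements about a subject, optionally from one annotator."""
        return [a for a in self._annotations
                if a.subject == subject
                and (annotator is None or a.annotator == annotator)]


WORKFLOW = "http://www.ecs.soton.ac.uk/~sm/myGrid/AffyIdToEmblSnps.scufl"

store = AnnotationStore()
# The author describes the input type; an end user adds a free-text quality note.
store.annotate(WORKFLOW, "inputSemanticType", "Affymetrix_probe_set_id", "author")
store.annotate(WORKFLOW, "qualityNote", "example free-text note", "end-user")
```

Because each statement carries its annotator, a client can choose whose descriptions to trust, for example by calling `store.about(WORKFLOW, annotator="end-user")`.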
Although
we wish descriptions to be easily processable by the computer, some
descriptions may be solely aimed at users in judging the applicability
of a workflow and so can be written in free text. Figure 5 illustrates both forms of annotation. In “parameterDescription”, a free text description has been
added to assist manual search and browsing of workflow
descriptions. However, fields marked with an asterisk (“semanticType”
and “transportDataType”) are populated with concepts from
a controlled vocabulary. So, for example, Affymetrix_probe_set_id is a term in the myGrid
bioinformatics ontology, which provides a controlled vocabulary for
bioinformatics terms.
Pedro has the ability to choose the controlled vocabulary
that is applicable for each field of the annotation by focusing in on
a particular region of an ontology. To aid the user in identifying the
suitable terms of an ontology to select, the concepts of the bioinformatics
ontology can be browsed, as illustrated by Figure 6.
Figure 6: Finding
the ontology term for describing the workflow’s output in the
myGrid ontology. The user has followed a classification
of the ontology terms, and has found the term
‘EMBL_record_accession_number’, which represents an entry in
the EMBL database for a gene, which
should contain SNP data, and
will be an output of the
CandidateGeneAnalysis workflow.
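Restricting an annotation field to one region of an ontology, as described above, amounts to offering only the terms classified beneath a chosen root concept. The following is a minimal sketch under stated assumptions: the class hierarchy below is a made-up fragment (only the leaf terms `Affymetrix_probe_set_id`, `EMBL_record_accession_number` and `SNP_annotation` appear in this paper), and `is_within` and `vocabulary_for` are hypothetical helper names, not part of Pedro.

```python
# child -> parent; an invented fragment, not the real myGrid ontology.
PARENT = {
    "bioinformatics_data": None,
    "identifier": "bioinformatics_data",
    "Affymetrix_probe_set_id": "identifier",
    "EMBL_record_accession_number": "identifier",
    "bioinformatics_task": None,
    "SNP_annotation": "bioinformatics_task",
}


def is_within(term, region_root):
    """True if the term is the region root or is classified beneath it."""
    while term is not None:
        if term == region_root:
            return True
        term = PARENT.get(term)
    return False


def vocabulary_for(region_root):
    """The controlled vocabulary offered for a field tied to one region."""
    return sorted(t for t in PARENT
                  if t != region_root and is_within(t, region_root))


# A field such as "semanticType" on a parameter might be tied to the data region,
# so task terms are never offered for it.
data_terms = vocabulary_for("bioinformatics_data")
task_terms = vocabulary_for("bioinformatics_task")
```

A production system would obtain the hierarchy from an OWL reasoner rather than from a hand-written parent map; the map is used here only to keep the example self-contained.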
We have found
that users wish to be involved in making the final selection of workflows
to be included in their scientific experiments. Therefore, all
experiment-related workflows will be chosen at composition time, and
we do not anticipate that any of these will be discovered at run-time,
i.e. when experiments are being enacted.
On the other hand, there exist experimentally neutral workflows, which
are composed of activities without any specific biological function
ascribed to them (e.g., format conversions, pretty printers). Such workflows
could be discoverable at run-time without involving the user.
Likewise, multiple providers may host instances of the same service, and
these should be automatically discoverable to make better use of resources
that are available at runtime. Currently, we consider that discovery
can only take place for workflows (and services) that have a functionality
and fully-defined interface identified at composition time.
We shall discuss how these restrictions can be relaxed in Section 7.Semantic
Description, Publication and Discovery of Workflows in myGridSimon Miles1,
Juri Papay1, Chris Wroe2, Phillip Lord2, Carole Goble2, Luc Moreau1sm@ecs.soton.ac.uk; jp@ecs.soton.ac.uk; chris.wroe@cs.man.ac.uk; p.lord@russet.org.uk; carole@cs.man.ac.uk; L.Moreau@ecs.soton.ac.uk1
School of Electronics and Computer Science, University of Southampton,
Southampton SO17 1BJ, UK;2 Department of Computer Science, University
of Manchester, Oxford Road, Manchester M13 9PL, UKAbstractThe bioinformatics
scientific process relies on in silico experiments, which are experiments
executed in full in a computational environment. Scientists wish to
encode the designs of these experiments as workflows because they provide
minimal, declarative descriptions of the designs, which overcome many
barriers to the sharing and re-use of the designs between scientists
and enable the use of the most appropriate services available at any
one time. We anticipate that the number of workflows will increase quickly
as more scientists begin to make use of existing workflow construction
tools to express their experiment designs, but discovery then becomes
an increasingly hard problem, as it becomes more difficult for a scientist
to identify the workflows relevant to their particular research goals
amongst all those on offer. While many existing approaches exist to
the publishing and discovery of services, there have been no attempts
to address where and how authors of experimental designs should advertise
the availability of their work or how relevant workflows can be discovered
with minimal effort from the user. As the users designing and adapting
experiments will not necessarily have a computer science background,
we also have to consider how publishing and discovery can be achieved
in such a way that they are not required to have detailed technical
knowledge of workflow scripting languages. Furthermore, we believe they
should be able to make use of others’ expert knowledge (the semantics)
of the given scientific domain. In this paper, we define the issues
related to the semantic description, publishing and discovery of workflows,
and demonstrate how the architecture created by the myGrid project aids
scientists in this process. We give a walk-through of how users can
construct, publish, annotate, discover and enact workflows via the user
interfaces of the myGrid architecture; we then describe novel middleware
protocols, making use of the Semantic Web technologies RDF and OWL to
support workflow publishing and discovery.1. IntroductionTraditionally,
the biological scientific process has involved experiments on living
systems, in vivo, or on parts of a living system in a test tube, in
vitro. Bioinformatics has focused on supporting the experimental biologist
by enabling many more experiments to be carried out in silico, that
is computationally. As this better supports automation and also harnesses
the collective knowledge of the discipline, in silico biological experiments
have greatly enabled the process of validating hypotheses, or gathering
additional information to shape the design of future experiments. For
science to become efficient, experiments need to be shared, adapted
and reused; distributed architectures on the Internet promise to be
the most effective mechanism to achieve this goal for in silico experiments.Both
the Web Services and Grid architectures [WSArch, OGSA] have adopted
a service-oriented approach, in which computational resources, storage
resources, programs and databases can be represented by services.
In such a context, a service is a network-enabled entity capable of
encapsulating diverse implementations behind a common interface. The
benefit of such a uniform view is that it facilitates the composition
of services into more sophisticated services, hereby promoting sharing
and reuse of resources in distributed environments. To this end,
a number of workflow languages have emerged which are capable of describing
complex compositions of services, e.g. WSFL [WSFL], XLANG [XLANG], BPEL
[BPEL], XScufl [SCUFL].However, service-oriented architectures currently
provide no mechanism to facilitate the sharing of workflows. At present,
workflow authors simply make a list of the available workflows available
via a Web page. With an increasing number of workflows, and sites listing
them, searching for them in this way will soon become untenable. DAML-S
[DAMLS] considers workflows as largely equivalent to services with regards
to publishing and discovery because they are functional entities that
are identified by their interface (inputs and outputs) and overall function.
Therefore, as with services, workflows need to be published in order
to be discovered and reused. However, publishing a workflow involves
two distinct steps: first, the workflow script must be archived in a
repository from which it can be publicly retrieved; next, a description
of and a reference (e.g. a URI) to its script need to be advertised
in a registry. In this context, a registry is defined as a service
holding descriptions of workflows and services. Many protocols for
publishing service descriptions, including de-facto standards, such
as UDDI [UDDI], Jini [Jini], and BioMoby [WL02a] do not, in themselves,
address publication and discovery of workflows. DAML-S, on the other
hand, is an ontology capable of describing complex processes, but is
not a registry system for publishing and discovery.Once a published
workflow has been discovered, scientists use their expert knowledge
of the scientific field to judging whether a design is applicable to
their own work. Unfortunately, such domain-specific knowledge is not
readily available from workflow scripts, which are engineered in terms
of programmatic notions such as interfaces, ports, operations and messages
of the service-oriented architecture in use [WSDL]; furthermore, domain
knowledge cannot be inferred from these low-level notions. However,
semantic descriptions can be added to workflows, in order to make high-level
knowledge explicit; these must be machine interpretable if tools are
to be capable of recommending the applicability of workflows based on
the domain-specific knowledge of a scientist.
myGrid [myGrid] is a pilot
project funded by the UK e-Science programme to develop Grid middleware
in a biological sciences context. The goal of the myGrid project [myGrid]
is to develop a software infrastructure that provides support for bioinformaticians
in the design and execution of workflow-based in silico experiments
using the resources of the Grid. In silico experiments can operate
over the Grid, in which resources are geographically distributed and
managed by multiple institutions, and the necessary tools, services
and workflows are discovered and invoked dynamically. It is a
data-intensive Grid, where the complexity is in the data itself, the
number of repositories and tools that need to be involved in the computations,
and the heterogeneity of the data, operations and tools. The myGrid
architecture includes components for composing workflows, annotating
them with semantic descriptions, publishing semantically described workflows,
reasoning over semantic descriptions, discovering workflows from semantic
queries and executing them. In previous papers, we have discussed various
facets of our approach to service publication and discovery, namely
its preliminary design [LWS+03b], its protocol for annotating service
descriptions [MPP+04] and its performance [MLM+04, MPD+03a]. The purpose
of this paper is to discuss the final design of our architecture for
workflow publication and discovery, and its implementation and integration
in an electronic lab-book, for manipulation by the scientist. Specifically,
this paper focuses on the following technical contributions of the myGrid
architecture for publishing and discovering workflow-based in silico
experiments:
A definition of the protocol used to publish, annotate and
discover workflows in a registry. The protocol is independent of the
actual language used to encode workflows. To this end, it relies on
a notion of workflow skeleton which identifies, in an extensible manner,
the salient features of a workflow that can be described semantically.
The use of RDF (the W3C Resource Description Framework) [RDF], which underpins
the Semantic Web effort [BHL01a], as the underlying representation to
express service and workflow descriptions and to
facilitate the attachment of metadata to them. Besides being a flexible
and powerful representation formalism, RDF provides for uniform graph-based
querying using RDQL [RDQL], which is used in our registry to support
workflow discovery.
The use of OWL ontologies to encode domain-specific
knowledge and to allow the inferences required by the discovery process.
Specifically, ontologies are used to index workflows according to their
functionality and the semantic types of their inputs and outputs, expressed
as biological concepts. A semantic find component, which uses a description
logic reasoner, provides complete reasoning over the rich OWL-based
descriptions of workflows, and facilitates discovery with complex queries
over these descriptions.
A complete implementation of the architecture,
organised as a set of Web Services and associated user interfaces,
all available for download from http://www.myGrid.org.uk/myGrid/web/download/
Section
2 presents an illustrative bioinformatics case study, including a representative
workflow, to aid in describing and demonstrating the usefulness of our
work. Section 3 shows the users’ perspective in sharing workflows
from composition through publishing and description to discovery. In
Section 4, we examine the use of semantic technologies to represent
the knowledge used for discovery, and in Section 5 we define the protocols
used in myGrid to process this information. The implementation of the
middleware using this protocol is given in Section 6, the scope of our
work and related work are discussed in Section 7, and we draw conclusions
and suggest further work in Section 8.
2. Case Study
For clarity of explanation,
we now present a case study to illustrate our approach to semantic description
and discovery of workflows. It is a part of a myGrid exemplar bioinformatics
application, namely the Graves' Disease application, intended to help
the investigation of a thyroid disorder [Graves]. Specifically, the
purpose of the application is to help biologists identify gene mutations
that may be involved in causing the condition.
The Graves’ Disease
scenario uses a well known and common "candidate gene" approach.
We assume that previous biological investigations have been used to
isolate a region on the genome in which genes affecting Graves’ Disease
may lie. By looking through this region for variations between
Graves’ Disease patients and normal patients, then determining whether these
variations lie within a gene, a number of candidate genes can be found.
One of the most common variations is
called a Single Nucleotide Polymorphism (SNP), which is a variation
involving only a single nucleotide, rather than a large scale change
affecting many nucleotides. Often, however, many of these mutations occur
in a region, most of which will not be related to Graves’ Disease.
The in silico process consists of gathering information from several
publicly available data resources describing the current state of knowledge
about the genes in question. Once such information has been obtained
for a set of candidate genes, the scientist can design an in vitro experiment
that will test their likelihood of being involved in the disease.
To
enable re-running of the experiment and best use of Grid resources,
the experiment is encoded as a workflow, which we refer to as CandidateGeneAnalysis.
This workflow takes a "probe set ID" referring to a gene sequence
in the Affymetrix database [Affymetrix] as input and ultimately returns
a record from the EMBL database containing information about the sequence
including SNPs. The workflow’s structure can be seen on the left hand
side of Figure 1. The specific details can be found in its script available
at http://www.ecs.soton.ac.uk/~sm/myGrid/AffyIdToEmblSnps.scufl, encoded in the SCUFL workflow
language. The SCUFL workflow language, developed as part of myGrid,
simplifies the process of creating workflows for biology by making the
language granularity and concepts as close as possible to those that
a potential user of the system would already be using [SCUFL].
Since
this experiment is more widely applicable than just for the study of
Graves’ Disease, the biologists may wish to share it with others,
and would want to do so in such a way that it can usefully be discovered
and re-used. User requirements [SGG+03a] have identified some questions
that scientists commonly ask about such kinds of experiment. Specifically,
since they aim to discover SNPs from gene sequence data, they will seek
experiments that:
process a given sort of data (e.g. genes),
provide extra information on given data (e.g. gene data),
provide a given type of output (e.g. SNPs),
perform searches over named public databases.
Since experiments
are represented as workflows, and workflows are characterised by their
inputs, their outputs and their function, a user will specifically seek
published workflows that:
have a given semantic type (e.g. gene data) as one of their inputs,
perform a given type of function (e.g. to provide extra information on a gene),
have a specific semantic type (e.g. SNPs) as one of their outputs,
use certain services (e.g. named public genetic information databases).
Next, we examine how users go about publishing
workflows, so that such questions can be answered by the architecture
to support the discovery process.
3. The Users’ Perspective
The purpose
of this section is twofold. On the one hand, we illustrate our approach
to workflow publication and discovery, using snapshots of the graphical
user interface that the scientist is presented with when using the myGrid
system; the functionality of this interface was derived from the user
requirements we captured at the beginning of the myGrid project [SGG+03a].
On the other hand, we identify key technical requirements for the knowledge
representation that is required to support our approach. With the user-centric
perspective adopted by myGrid, we analyse the kinds of discovery that
scientists are confronted with: when composing workflows and when deciding
which scientific experiments to run. In order to be discovered, workflows
need to have been published, and we examine how suitable semantic descriptions
of these workflows can be made available to the system.
3.1. Construction-time Discovery
Designing a workflow means linking together functional entities
such as Web Services or other workflows, which will be referred to as
activities, so that the outputs of some are used as the inputs of others.
Workflows are constructed by linking together sub-activities that pass
data between each other. Figure 1 shows the myGrid graphical workflow
construction tool Taverna [Taverna].
Figure 1. Workflow construction using Taverna. The left-hand panel contains
a depiction of the workflow itself with each box representing an activity
in the workflow; when the workflow is enacted, this activity results
in a Web Service operation call or the invocation of another workflow.
Data flows from the inputs, represented by inverted triangles, through
the linked services to the output triangle at the bottom of
the workflow. The ‘Scufl Model Explorer’ panel shows a hierarchical
view of the workflow and ‘Enactor launch’ relates to test runs of
the workflow.
Workflows need not be created from scratch: they can be
adapted and personalised from previously written workflows. As part
of personalisation, the workflow’s author needs to discover existing
activities (workflows and services) so as to include them in their design.
Hence, since both services and workflows need to be discovered, both
are listed in the ‘Available services’ panel of Figure 1. Crucially,
user requirements have identified that biologists require activities
to be discoverable by the function they perform, and that final selection
ultimately rests with the scientist, who will select those to be included
according to the goal of the experiment they are designing. To this
end, scientists need to be able to draw on a wide range of information
about activities in order to inform their decisions. Specifically,
the following workflow descriptions are used:
the workflow’s author and their institution, the function of the workflow,
the sub-activities it may invoke (and their function), and the inputs
and outputs of the workflow expressed in biological terms. It
is interesting to note that when the scientist considers a workflow
for insertion in an experiment, they do not regard the workflow as a
black-box, because they want to know about the activities it is composed
of, though the fine details of their dependencies, control and data
flows do not matter at this stage.
Selecting activities for the functions they perform helps guarantee
that the overall experiment has the intended behaviour. However, further
care is needed to ensure that the composition is operationally consistent
at the transport level: data types and formats of outputs must be compatible
with the inputs they feed into. In order to verify such constraints,
service interfaces [WSDL] and an equivalent concept for workflows need
to be made available to the scientists, who will make sure that all
data are suitably converted to ensure a coherent composition. In Taverna,
the scientist is made aware of the incompatibility of data types and
formats (encoded as MIME types) by allowing them only to make links
between the output of one activity and the input of another with the
same type. To that end, Taverna relies on the WSDL interface files of
services and workflows, the details of which are hidden from the scientists
by the user interface.
3.2. Experiment-time Workflow Discovery
Scientists
undertake their research by iteratively selecting and running workflows
and further analysing the data they produce. myGrid aids this process
by providing the myGrid workbench, a client side electronic lab-book
through which users can perform their in silico experiments, as well
as storing and organising their data. A typical work pattern of the
scientist consists of selecting a piece of data stored locally and asking
which workflows will accept inputs of that biological type.
In the screenshot of Figure 2, the user has selected a piece of data,
which is an Affymetrix Probe Set ID referring to candidate gene data
in the Affymetrix database [Affymetrix], and asked to find a workflow
that is capable of taking this data as input.
User requirements [SGG+03a] have identified that bioinformaticians also
want to be involved in the process of choosing which experiments to
run, and therefore, the myGrid system does not offer fully automated
workflow selection. Instead, the user is presented with a list of workflow
scripts and invited to make the final selection. In Figure 2, two applicable
workflows have been discovered and displayed in a list with the workflow
graphical depiction shown to the right, on selection.
Figure 2: Selecting a workflow that takes an Affymetrix probe
set ID as input.
Besides this data-driven method for discovering
experiments for execution, two more discovery modes are commonly used
by the scientific community: in a goal-driven approach, scientists
attempt to discover workflows capable of a given function, whereas in
a result-driven approach, they are looking for workflows producing a
given type of biological data. To this end, scientists need to be able
to browse through published workflows, which have been categorised according
to their inputs, their outputs and their functionality.
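The three discovery modes can be pictured as filters over the same categorised descriptions. The following is only a minimal sketch in Python, not the myGrid implementation: the `WorkflowDescription` record, the `discover` function, the `SequenceAlignment` workflow and the terms `retrieving`, `gene_sequence` and `alignment_report` are hypothetical names introduced for illustration, while `Affymetrix_probe_set_id`, `EMBL_record_accession_number` and `aligning` are terms used elsewhere in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowDescription:
    """Semantic description of one published workflow (illustrative only)."""
    name: str
    inputs: frozenset   # semantic types accepted as input
    outputs: frozenset  # semantic types produced as output
    tasks: frozenset    # functions the workflow performs


def discover(registry, input_type=None, task=None, output_type=None):
    """Return the names of workflows matching every criterion supplied."""
    matches = []
    for wf in registry:
        if input_type is not None and input_type not in wf.inputs:
            continue
        if task is not None and task not in wf.tasks:
            continue
        if output_type is not None and output_type not in wf.outputs:
            continue
        matches.append(wf.name)
    return matches


REGISTRY = [
    WorkflowDescription(
        name="CandidateGeneAnalysis",
        inputs=frozenset({"Affymetrix_probe_set_id"}),
        outputs=frozenset({"EMBL_record_accession_number"}),
        tasks=frozenset({"retrieving"}),          # hypothetical task term
    ),
    WorkflowDescription(
        name="SequenceAlignment",                 # hypothetical workflow
        inputs=frozenset({"gene_sequence"}),      # hypothetical type term
        outputs=frozenset({"alignment_report"}),  # hypothetical type term
        tasks=frozenset({"aligning"}),
    ),
]

# Data-driven: which workflows accept the selected piece of data?
data_driven = discover(REGISTRY, input_type="Affymetrix_probe_set_id")
# Goal-driven: which workflows perform a given function?
goal_driven = discover(REGISTRY, task="aligning")
# Result-driven: which workflows produce a given type of data?
result_driven = discover(REGISTRY, output_type="EMBL_record_accession_number")
```

In each mode the function only narrows the list; the final choice among the returned workflows remains with the scientist.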
In Figure 3, the user browses available services and workflows by the
type of input they take, the type of output they produce or the type
of task they perform.
Figure 3: Browsing categorised workflows and services.
As shown, the user can see two services/workflows available to do sequence
alignment on a gene sequence, using the services BLASTn and BLASTx.
3.3. Workflow Descriptions
Scientists
require descriptions so they can judge which workflow is applicable
amongst the many available. While the final decision remains with
the scientists, we expect the system to help them by sorting workflows
according to the multiple relevant criteria discussed above, e.g. input
and output types, functions, and possibly to rank them. Therefore, descriptions
need to be easily processable by the computer.
Workflow descriptions
can be produced by workflow authors, but need not be. Indeed, our
experience in myGrid shows that it is useful for a third party to be
able to provide such descriptions. For example, a description that contains
useful information about the quality, accuracy or trustworthiness of the
results produced by an experiment should typically be provided by end
users, rather than the workflow authors. Likewise, a reference ontology
of the application domain may be revised after some experiments have
been designed; it may then be useful for an ontology expert to refine
semantic descriptions according to the revised ontology.
Therefore, in
myGrid, we allow third-party users to generate workflow descriptions,
and provide a separate tool to help users to construct such descriptions.
The tool, called Pedro, is displayed in Figure 4, which illustrates
its use to create descriptions pertaining to the CandidateGeneAnalysis
workflow.
Figure 4: Screenshot of a workflow being annotated with semantic
description using Pedro. The various components of the workflow that
can be annotated with descriptions are displayed in the left hand panel.
At a high level, the workflow can be annotated with the organisation
that has produced it and with information about the type of biological
data it takes as input and produces as output and the overall biological
function it performs. The user has focused on a particular workflow
sub-activity (named here WORKFLOWOPERATION) and is providing information
about one of the inputs (called a PARAMETER) to that sub-activity, specifying
a bioinformatics term, ‘Affymetrix_probe_set_id’, that refers to
the type and origin of the data taken by the operation as input.
Although we wish descriptions to be easily processable by the computer,
some descriptions may be solely aimed at users in judging the applicability
of a workflow and so can be written in free text. Figure 4 illustrates
both forms of annotation. In “parameterDescription”, a free text
description has been added to assist manual search and browsing of workflow
descriptions. However, fields marked with an asterisk (“semanticType”
and “transportDataType”) are populated with concepts from a controlled
vocabulary. So, for example, Affymetrix_probe_set_id is a term in the
myGrid bioinformatics ontology, which provides a controlled vocabulary
for bioinformatics terms. Pedro has the ability to choose the controlled
vocabulary that is applicable for each field of the annotation by focusing
in on a particular region of an ontology. To aid the user in identifying
the suitable terms of an ontology to select, the concepts of the bioinformatics
ontology can be browsed, as illustrated by Figure 5.
Figure 5: Finding the ontology term for describing the workflow’s
output in the myGrid ontology. The user has followed a classification
of the ontology terms, and has found the term
‘EMBL_record_accession_number’, which represents an entry in the EMBL
database for a gene, which should contain SNP data, and will be an
output of the CandidateGeneAnalysis workflow.
3.5. Requirements for Discovery and Publication
The previous discussion has shown that workflows and
services share many common requirements in terms of discovery.
During the composition phase, they are nearly indistinguishable, except
for the fact that workflows capture a scientific process, and therefore
need to expose some of their internal activities to support the scientist’s
judgement. As far as discovery of experimental workflows is concerned,
full automation is not desirable, because scientists care about the
process design.
In order to support the discovery process we have described,
the descriptions associated with workflows will have to contain the
following semantically rich information:
The authors of a workflow and their institution;
The function that a workflow performs expressed in biological terms;
The type of biological data that it takes as input and/or produces as
output, and their relationship with the transport data types and formats
expected and produced by the workflow;
The activities that a workflow is composed of (and their respective descriptions).
These descriptions will have to be:
Computer processable so that the system can present the user with relevant choices;
Based on ontologies so that suitable classifications can be shown to users;
Producible by authors and third-party users.
Different kinds of descriptions will be used at
different times. Descriptions pertaining to functionality and sub-activities
will only be used at composition and experiment selection time, with
scientist involvement, whereas interfaces (with transport types and
formats) will also be used at runtime.
4. Knowledge Representation
Fully automatic discovery of potential
workflows is undesirable; it would be equivalent to automating scientific
investigations, and would rob scientists of essential control over their
own experiments. Examples of “experimentally neutral” workflows are
comparatively rare and confined to sub-workflows such as format transformations
or data cleaning [WGG+04a].
To support the discovery process, a range of descriptions are associated
with a workflow. As identified in Section 3.5, these descriptions should be
computer processable, based on ontologies, and producible by authors and
third-party users.
Following this set of requirements, we introduce the notion of a workflow
executive summary, which captures the aspects that can facilitate the
discovery of a workflow. Specifically, the executive summary includes the
following descriptions:
The overall functional task (or tasks if there is more than one interpretational
viewpoint) that a workflow performs, expressed in biological terms;
The type of data that it takes as input and/or produces as output;
The activities that a workflow is composed of (and their respective descriptions).
For completeness, we note that the workflow executive summary should be
differentiated from operational descriptions, which contain information
about workflow execution, such as cost, quality of service, usage
statistics and access rights. Figure 7 shows the three categories of
descriptions commonly used when making a choice: those catering for the
executive summary of the workflow, those covering general metadata about
the operational context of the workflow as a whole, and those covering the
metadata about the provenance of the workflow as a whole, such as the
authors, creation date, update history and institution (we do not discuss
the provenance metadata further in this paper).
The executive summary requires descriptions at three levels of abstraction
(Table 1): conceptual descriptions, syntactic descriptions (e.g. MIME types)
and invokable interface descriptions (e.g. XML data types).
The development of controlled vocabularies and the annotation of workflows with them at publication time are both labour intensive activities. We do not wish to preclude those registered workflows that do not enjoy these descriptions, and so we make them optional, with the commensurate diminished functionality that attends such an omission.
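To make the consequence of optional descriptions concrete, the following minimal Python sketch models a registry entry whose descriptions may be absent. It is an illustration under stated assumptions rather than the myGrid data model: the class names, field names and the `discovery_modes` method are hypothetical, and only the script URI and the semantic type term are taken from this paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutiveSummary:
    """Optional semantic descriptions used for discovery (hypothetical layout)."""
    inputs: tuple = ()      # semantic types of inputs
    outputs: tuple = ()     # semantic types of outputs
    tasks: tuple = ()       # functional tasks performed
    components: tuple = ()  # activities the workflow is composed of


@dataclass
class RegistryEntry:
    """A registered workflow; everything except the script URI is optional."""
    script_uri: str
    summary: Optional[ExecutiveSummary] = None
    operational: Optional[dict] = None  # e.g. cost, quality of service
    provenance: Optional[dict] = None   # e.g. authors, creation date

    def discovery_modes(self):
        """List the discovery modes this entry can take part in."""
        modes = []
        if self.summary is None:
            return modes
        if self.summary.inputs:
            modes.append("data-driven")
        if self.summary.tasks:
            modes.append("goal-driven")
        if self.summary.outputs:
            modes.append("result-driven")
        return modes


URI = "http://www.ecs.soton.ac.uk/~sm/myGrid/AffyIdToEmblSnps.scufl"
bare = RegistryEntry(script_uri=URI)
partly_described = RegistryEntry(
    script_uri=URI,
    summary=ExecutiveSummary(inputs=("Affymetrix_probe_set_id",)),
)
```

An entry registered without an executive summary is still retrievable by its URI, but it cannot be found through any semantic query; an entry annotated only with its input type can be found by data-driven discovery alone.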
The previous section has identified the need for a controlled vocabulary
of terms that describe biological data types and workflow functions.
In combination with interface information, these descriptions
are intended to achieve various discovery capabilities
at different times (composition, experiment selection, and run
time). Interface and syntactic descriptions are used at run
time; semantic and syntactic descriptions at
the point of composition and experiment selection; operational descriptions
at all times [WGG+04a].
To represent the knowledge embodied in these descriptions, we
have adopted a hybrid approach, combining two Semantic Web technologies,
namely OWL and RDF, the use of which we present in this section.
[Figure: a workflow registry entry (URI; script stored and encoded in Scufl)
with its associated operational descriptions (cost, QoS, access rights),
provenance descriptions (authors, creation date, institution) and executive
summary descriptions (inputs, outputs, tasks, component resources), given as
conceptual descriptions, syntactic descriptions (e.g. MIME types) and
invokable interface descriptions (e.g. XML data types).]
Figure 7: The metadata associated with a registered workflow, giving their
knowledge representational forms (RDF, OWL, WSDL), all of which bottom out
in an RDF store, for which we use the Jena toolkit [Jena]. Broken lines
indicate optional metadata; shadows indicate that multiple metadata entries
are possible.
Table 1: Kinds of metadata.
4.1. The Use of OWL
The representation of conceptual metadata requires encoding a large body
of domain knowledge, with a large and highly interconnected set of terms.
There has been a significant amount of work on using ontologies to describe
Web Services to enable their discovery and composition [DAMLS].
Although DAML-S also aims to address the semantic encoding of
invocation and execution monitoring of services and service compositions,
the use of semantics in myGrid has
focused primarily on service and workflow discovery.
Within myGrid, we built a suite of ontologies, describing biology,
bioinformatics, web services and workflows [WGG+04a, WSG+03a]. We
based the workflow ontology on the service profile from DAML-S [DAMLS],
the domain ontologies on various de facto community standard ontologies
such as the Gene Ontology [GO00] and TAMBIS [BGB+99], and models of
publication and organisation on the AKT ontology [AKT].
The OWL Web Ontology Language has recently emerged as the W3C Proposed
Recommendation for representing ontologies [OWL].
The majority of work in Semantic Web Services has used either OWL
or its predecessor, DAML+OIL, and we fall in line with this practice,
not only because it is an exchange standard, but because
the use of OWL provides us with a number of advantages.
One of the main advantages of OWL is that its
underlying formal semantics enables reasoners to classify
descriptions based on the properties of those descriptions. This provides
computational support to enable the building of complex ontologies of
the domain. Additionally, when applied to workflow
discovery, we automatically classify and discover workflows
described in terms of a domain and workflow ontology. As such, it provides
a good fit for the user requirements for enabling discovery.
Using tools such as Pedro (Figure 4) to support these descriptions, we can
now describe services and workflows semantically in terms of the domain,
and can form queries for workflows (and services)
in terms of their properties. For example, the query below describes
a service in terms of the task that it performs. Equally, we can express
queries for workflows and
services by each of their input or output types, or
query against the component resources that they use. Queries of this form
can be presented naturally
in a browsable interface, as shown in Figure
3. This interface takes advantage of the simple expressive capabilities
of OWL, in that workflows and services will classify under many different
parents; for instance, most of the services shown as performing “aligning”
will also classify under “local_aligning”, the latter being a specialisation
of the former. The reasoning capabilities of OWL mean that we are not
required to pre-enumerate at design time all of the possible
workflow classifications, but can generate new ones “on
the fly”, or even change the classifications of services as we change
our ideas about the domain.
intersectionOf(
myGrid_bioinformatics_primitive_service_operation
restriction(performs_task someValuesFrom(aligning))
)
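The effect of such a query, in which a request for “aligning” also returns services described as “local_aligning”, can be imitated in a few lines of Python. The sketch below is a deliberately simplified stand-in: it walks a hand-written hierarchy rather than invoking a description logic reasoner, and the dictionary contents (including `global_aligning`, `bioinformatics_task`, `retrieving` and `EMBL_lookup`) are hypothetical, as is the assignment of BLASTn and BLASTx to `local_aligning`.

```python
# Hypothetical fragment of a task hierarchy: each concept maps to its parent.
PARENT = {
    "local_aligning": "aligning",
    "global_aligning": "aligning",
    "aligning": "bioinformatics_task",
}

# Hypothetical annotations: each service is described by the task it performs.
SERVICES = {
    "BLASTn": "local_aligning",
    "BLASTx": "local_aligning",
    "EMBL_lookup": "retrieving",
}


def ancestors(concept):
    """Return every concept that is more general than the given one."""
    found = []
    while concept in PARENT:
        concept = PARENT[concept]
        found.append(concept)
    return found


def is_subsumed_by(concept, general):
    """True if `concept` is `general` itself or one of its specialisations."""
    return concept == general or general in ancestors(concept)


def performs(task):
    """Names of services whose task is `task` or a specialisation of it."""
    return sorted(
        name for name, described_task in SERVICES.items()
        if is_subsumed_by(described_task, task)
    )


aligning_services = performs("aligning")
```

Because the match is computed from the hierarchy at query time, revising the hierarchy changes the classification of services without touching their individual descriptions.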
In many cases, this use of reasoning to classify
services is sufficient. The multi-axial classifications shown
in Figure 3 actually present a large number of different workflow/service
classifications, which narrow the choices
to a point where users can select for themselves those
that they require. By using OWL, however,
we can also exploit the full expressiveness of the language
to build highly complex queries, which we can use to enable more automated
service selection.
Although the use of
OWL provides considerable advantages, it
also brings some difficulties. The use of reasoning
technology can complicate the architecture required to support it. Furthermore,
while OWL can be used to present relatively straightforward interfaces
for the selection of workflows, it comes
with an upfront cost, namely that of producing a large domain ontology,
and then describing the
workflows in terms of that ontology. At the current time
this cost is considerable, although it is hoped that it will lessen
as tools such as Pedro develop further.
For these reasons, we would expect that the primary use of OWL-based
service or workflow descriptions will be in a curated set of service,
workflow or third-party descriptions, for use within a system such as
myGrid, rather than as a general tool for describing Web
Services. As a
result, within myGrid, we also provide
support for workflow discovery based on other description technologies,
as detailed in the next section.
Our workflow
descriptions have to draw on and seamlessly integrate multiple existing
information models, namely WSDL, the DAML-S profile, and UDDI, and have to
support metadata attachment, as we now explain.
(i) Interfaces have been identified as useful information in
the discovery process. As we focus on Web Services, we adopt WSDL as
the interface language for services, and we propose to
use the same language to define the interface of a workflow, composed
of its inputs and outputs. (ii) Semantic augmentation by
authors and third parties requires a mechanism by which additional semantic
descriptions can be attached to existing workflow descriptions;
hence, our information model requires support for metadata. (iii)
Furthermore, the semantic functionality of a workflow will be structured
according to the DAML-S profile, using OWL and the
myGrid ontologies, as discussed in Section 4.1. (iv)
Additionally, we have identified that runtime discovery could take place
for workflows and services, for which an interface and functionality
have been identified at composition time. The de-facto standard for
Web Service discovery is UDDI; adopting the UDDI information model will
help us preserve compatibility with existing systems (such as enactment
engines).
Therefore,
we have adopted RDF [RDF] as the representation
formalism to express such complex service descriptions. RDF is a very
flexible language in which relations are described between resources,
in the form of triples. A triple associates one resource, the subject,
to another, the object, by a relation labelled with a specific
property. Our reasons for using RDF are based on the technical requirements
of publishing and discovery.
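The triple model and the style of graph-based querying it permits can be illustrated with plain Python tuples. This is a minimal sketch only: it does not use RDQL or the Jena toolkit, the resource identifiers and the property names `hasInput` and `hasOutput` are hypothetical, and terms beginning with `?` play the role of query variables.

```python
# Each triple is (subject, property, object); identifiers are hypothetical.
TRIPLES = [
    ("wf:CandidateGeneAnalysis", "hasInput", "in:probeSet"),
    ("in:probeSet", "semanticType", "Affymetrix_probe_set_id"),
    ("wf:CandidateGeneAnalysis", "hasOutput", "out:record"),
    ("out:record", "semanticType", "EMBL_record_accession_number"),
]


def is_variable(term):
    """Query variables are written with a leading question mark."""
    return term.startswith("?")


def extend(pattern, triple, bindings):
    """Try to match one pattern against one triple under existing bindings.

    Returns the extended bindings, or None if the triple does not match.
    """
    new_bindings = dict(bindings)
    for wanted, actual in zip(pattern, triple):
        if is_variable(wanted):
            # Bind the variable, or check it against its earlier binding.
            if new_bindings.setdefault(wanted, actual) != actual:
                return None
        elif wanted != actual:
            return None
    return new_bindings


def query(patterns, triples):
    """Return all variable bindings that satisfy every pattern at once."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = extend(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Which workflows have an input whose semantic type is a probe set ID?
results = query(
    [
        ("?wf", "hasInput", "?port"),
        ("?port", "semanticType", "Affymetrix_probe_set_id"),
    ],
    TRIPLES,
)
```

Attaching further metadata to a workflow or to one of its ports amounts to adding another triple about the same resource, which is why the representation suits third-party annotation.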
4.3. Benefit of the Hybrid Approach
In summary, myGrid has adopted a hybrid approach for its knowledge
representation. RDF is the underpinning format in which all workflow
(and service) descriptions are encoded. This is an extensible
format, which provides us with a powerful graph-based query capability
using RDQL. The rest of the paper will discuss how this RDF-based
information model is used in
a “registry” that holds all
workflow descriptions, and which provides us with the efficient query
capabilities necessary for run-time discovery. Within the registry, some
of the executive summary
metadata and some of the operational metadata
will contain semantic descriptions referring to OWL concepts.
Semantic reasoning will be undertaken by a “semantic
find component”, which will deal with the semantically rich
discovery process taking place at construction and experiment selection
time.
The myGrid architecture
defines protocols for publishing the semantic descriptions of workflows,
and for
performing discovery based on those descriptions. The two principal
components involved in publication and discovery are the registry,
which holds the advertisements for workflows, and the semantic find
component, which aids discovery of workflows by matching semantic
queries against the semantic descriptions in the registry. In order
to aid the publishing process, we introduce a new abstraction, a
workflow skeleton,
which is used in the publishing protocol.
The starting point for advertising
a workflow is the authoring of semantic descriptions, as described in
Section 3.2, and this requires the author of the workflow description
to know what components of the workflow they can describe and how. A
key requirement of our architecture is that it must support multiple
workflow languages, or versions of them, because the
myGrid SCUFL workflow language is still evolving.
Ultimately, this should
help to ensure that the architecture is future-proof.
So, we have introduced the notion of a workflow
skeleton as an abstraction of a
workflow, independent of any particular
scripting language. At
authoring time, usually within the Pedro tool, this
skeleton is represented in an XML schema,
which is shown in Figure 6.
At an abstract level,
a workflow can be described by some simple information such as
its name or the organisation which owns it. Each sub-activity that takes
place in a workflow can have further description, including the task
it performs, the resources it uses and further details on its inputs
and outputs (as for the workflow as a whole). As identified in the technical
requirements, for each input and output of a workflow, we provide three
forms of type information: the syntactic type of a parameter gives the
structure of the data, typically an XSD type, used by the SOAP transport
layer (encoded as ‘transportDataType’ in the workflow skeleton);
the MIME type gives technical information about the data used for display
(encoded as ‘format’); the semantic type gives
a conceptual description of the type of data as described in Section
4.1 (encoded as ‘semanticType’).
Figure 6 summarises the contents of the workflow skeleton, in which
a sub-activity is called a WORKFLOWOPERATION.
Figure 6: Contents of workflow skeleton.
The workflow skeleton
entities that can be annotated by semantic descriptions are shown
in the left hand panel of Pedro in Figure 4, and are the same as those
in this figure.
Data derived from the invocation metadata, typically an XSD type used
by the SOAP transport layer, is encoded as
“transportDataType” in the
skeleton, while the conceptual
metadata is encoded as an arbitrary OWL concept in
“semanticType”. Finally, syntactic metadata, normally represented
as a MIME type, is encoded in “format”. For clarity we have omitted
the workflow provenance metadata from this figure.
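As an illustration of the information a workflow skeleton carries, the sketch below models it as plain data classes. The field names `transportDataType`, `format` and `semanticType` follow the text; the class names, the remaining field names, the organisation and the sub-activity are assumptions made for the example, and the real skeleton is defined by an XML schema rather than by this code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Parameter:
    name: str
    transportDataType: str  # syntactic type, typically an XSD type
    format: str             # MIME type used for display
    semanticType: str       # ontology concept describing the data


@dataclass
class WorkflowOperation:
    """A sub-activity of the workflow (no control or data flow is recorded)."""
    name: str
    task: str
    inputs: List[Parameter] = field(default_factory=list)
    outputs: List[Parameter] = field(default_factory=list)


@dataclass
class WorkflowSkeleton:
    name: str
    organisation: str
    inputs: List[Parameter] = field(default_factory=list)
    outputs: List[Parameter] = field(default_factory=list)
    operations: List[WorkflowOperation] = field(default_factory=list)


def annotatable_entities(skeleton: WorkflowSkeleton) -> List[str]:
    """List the entities to which semantic descriptions can be attached."""
    names = [skeleton.name]
    names += [p.name for p in skeleton.inputs + skeleton.outputs]
    names += [op.name for op in skeleton.operations]
    return names


probe_set = Parameter(
    name="probeSetId",
    transportDataType="xsd:string",
    format="text/x-record-ids",
    semanticType="biodata:Affymetrix_probe_set_id",
)
skeleton = WorkflowSkeleton(
    name="CandidateGeneAnalysis",
    organisation="Example Organisation",            # hypothetical value
    inputs=[probe_set],
    operations=[WorkflowOperation(name="lookupProbeSet",  # hypothetical
                                  task="retrieving")],
)
entities = annotatable_entities(skeleton)
```

Note that the structure enumerates sub-activities but deliberately has no field describing how they are connected, which is what keeps the skeleton independent of any one scripting language.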
The process of publishing a
workflow in myGrid is shown in Figures 7 and 8.
Overall, the publishing process involves the user, the workflow construction
and annotation tools, a storage device to archive workflows, a registry
in which advertising and searching are performed and a semantic find
component performing any necessary reasoning over any ontology-based
semantic descriptions. Our sequence diagrams regard these components
as separate, but any specific deployment may seek to integrate some
of them. The script is archived in a store and made available via a
URI, which is advertised in the registry by the user, possibly using
a workflow construction tool. Then semantic descriptions, and other
metadata, are attached to the workflow, its inputs and its outputs through
successive calls to the directory (see Figure 8). Whenever a new
workflow and new metadata are added to the directory, a notification
is sent to the semantic find component, with the advertisement
referring to the workflow by a unique key; as a result,
the semantic find component creates optimised and de-normalised
indexes of the workflows by their semantic types.
These optimisations support efficient discovery,
and use technology described elsewhere [BHLT].
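The publishing sequence described above can be sketched as follows. This is a simplified, in-memory illustration under stated assumptions: the class and method names (`Registry.publish`, `Registry.attach_metadata`, `SemanticFind.on_metadata_added`) are hypothetical and are not the myGrid or UDDI APIs, and the index is a plain dictionary rather than the optimised structure referred to in the text.

```python
import uuid
from collections import defaultdict


class SemanticFind:
    """Keeps a de-normalised index: semantic type -> keys of workflows."""

    def __init__(self):
        self.index = defaultdict(set)

    def on_metadata_added(self, key, metadata_type, value):
        # Only conceptual metadata is of interest to this component.
        if metadata_type == "semanticType":
            self.index[value].add(key)


class Registry:
    """Holds adverts and their metadata; notifies a listener on changes."""

    def __init__(self, listener):
        self.adverts = {}
        self.metadata = defaultdict(list)
        self.listener = listener

    def publish(self, name, script_uri):
        """Advertise an archived workflow script; return its unique key."""
        key = str(uuid.uuid4())
        self.adverts[key] = {"name": name, "accessPoint": script_uri}
        return key

    def attach_metadata(self, key, metadata_type, value):
        if key not in self.adverts:
            raise KeyError(key)
        self.metadata[key].append((metadata_type, value))
        self.listener.on_metadata_added(key, metadata_type, value)


finder = SemanticFind()
registry = Registry(listener=finder)

# Step 1: the script is assumed to be archived already and reachable by URI.
key = registry.publish(
    "CandidateGeneAnalysis",
    "http://www.ecs.soton.ac.uk/~sm/myGrid/CandidateGeneAnalysis.scufl",
)
# Step 2: successive calls attach metadata to the advert.
registry.attach_metadata(key, "isWorkflowScript", "yes")
registry.attach_metadata(key, "semanticType",
                         "biodata:Affymetrix_probe_set_id")
```

Keeping the index inside the semantic find component means that a semantic query never has to scan the registry's full metadata store, at the cost of having to keep the two in step through notifications.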
Figure 7: Sequence of actions taken in publishing a workflow
Figure 8: Sequence of actions taken in attaching metadata to a workflow
5.3. Discovery Protocol
The discovery of workflows,
or other activities, is shown in Figure 9. Within myGrid, there are
two main reasons for discovery: firstly, in response to a user request,
usually through interaction with the workbench; and secondly, during
the process of resolving the abstract activity specifications in
a workflow template into invokable instances (see Section 3.4).
As users generally wish
to discover services in terms of their own domain, user-driven discovery
normally involves the conceptual metadata. The user performs an action
through the context-sensitive workflow selection or browsing interfaces
shown in Section 3, such as clicking to display the workflows
that would take a given piece of data as an input, which causes the
workbench to generate a semantic query.
This query is sent to the semantic find component, which uses the retrieved
semantic descriptions to determine which workflows match the query.
The technical details, including the name, interface and endpoint of
each applicable workflow script, are extracted from the registry and
returned to be displayed to the user. The user can then select the final
workflow, if there is more than one, and the workflow script is sent to
an enactor service.
Figure 9: Sequence of actions taken in discovering a workflow by a user
through a user interface
Figure 10: Sequence of actions taken in discovering services during workflow
enactment
The enactor may also use the
registry at run-time. As described earlier, a workflow template
[WGG+03a] describes an in silico experiment in which some activity
definitions have been defined abstractly by service types rather than
by the endpoints of specific Web Services; it is a template for a
given in silico experiment rather than an instance
of that experiment. Workflow templates should not be confused with
workflow skeletons, described in Section 5.1, which also provide
a high-level abstraction of an experiment but contain no information
about the flow of the experiment, are language-independent and are used
for description rather than enactment.
For a template to be enactable, queries need to be issued to resolve
its abstract activities. These queries will generally
involve the invocation metadata, and will involve only the registry,
as shown in Figure 10: they are sent to the registry from the enactor,
and a service is chosen by the enactor from those matching the query.
Following discovery, the enactor can then continue by invoking
the returned service.
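A minimal sketch of this run-time resolution step is given below. It assumes that matching on invocation metadata reduces to comparing the transport types of inputs and outputs, and that the enactor simply takes the first match; both are simplifications made for illustration, and the service names and endpoints are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ServiceAdvert:
    name: str
    endpoint: str
    input_types: Tuple[str, ...]
    output_types: Tuple[str, ...]


@dataclass(frozen=True)
class AbstractActivity:
    """An activity in a workflow template, fixed only by its interface."""
    input_types: Tuple[str, ...]
    output_types: Tuple[str, ...]


def find_by_interface(adverts: List[ServiceAdvert],
                      activity: AbstractActivity) -> List[ServiceAdvert]:
    """Registry-side query: adverts whose interface fits the activity."""
    return [a for a in adverts
            if a.input_types == activity.input_types
            and a.output_types == activity.output_types]


def resolve(adverts: List[ServiceAdvert],
            activity: AbstractActivity) -> Optional[ServiceAdvert]:
    """Enactor-side choice: here, simply the first matching advert."""
    candidates = find_by_interface(adverts, activity)
    return candidates[0] if candidates else None


# Hypothetical adverts; the endpoints are placeholders.
adverts = [
    ServiceAdvert("probeLookupA", "http://example.org/a",
                  ("xsd:string",), ("xsd:string",)),
    ServiceAdvert("sequenceCount", "http://example.org/b",
                  ("xsd:string",), ("xsd:int",)),
]
chosen = resolve(adverts, AbstractActivity(("xsd:string",), ("xsd:int",)))
```

No ontology is consulted at any point in this sketch, which reflects why such queries need only the registry and not the semantic find component.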
The design decisions involved
in developing the above protocol are driven by the user and technical
requirements. For instance, while the workflow skeleton does not
reveal details of the workflow script’s internal execution, it does
include an enumeration of its sub-activities, and this latter point
was identified by user requirements: some bioinformaticians like to
know which services are involved in a workflow to help them decide whether
the workflow is appropriate for their specific goals. In general, the
absence of details of the script in the workflow skeleton both enhances
clarity in attaching descriptions and ensures that the workflow skeleton
is independent of any one scripting language, because it does not make
any notion of control or data flow explicit.
The motivation for treating the registry and
the semantic find component as two separate modules, and passing messages
between them, is that only
discovery involving conceptual metadata requires semantic
reasoning; other discovery does not need the semantic find component.
In particular, discovery by the
workflow enactment engine will attempt to match a service by its interfaces,
ensuring that it can accept the data produced by earlier activities
in the workflow, rather than by its domain-specific, e.g. biological, type.
Additionally, this architecture means that the semantic find component
can be deployed independently of the registry, enabling it to perform
reasoning over queries to other
data sources such as databases or Web pages. While conceptually
separate, these two modules can be tightly integrated
in any specific implementation in order to improve efficiency. The following
section will discuss alternative deployments of the semantic find component.
In this section, we describe the design and implementation of the main components of the myGrid architecture used in publishing and discovering workflows, namely the registry and the semantic find component.
Existing
protocols for service publishing and discovery, such as UDDI for Web
Services [UDDI],
do not provide adequate support for publishing workflow
scripts, for attaching semantic descriptions to services or workflows,
or for discovering services or workflows based on their semantics.
We have taken the approach that workflow scripts and
services are almost equivalent for the purpose of discovery. Both are
functional entities taking inputs and producing outputs according to
some interface and internal algorithm, and both are available from a given
endpoint (in the case of a workflow, the location from which the script
can be downloaded).
Executing them requires different processes, but this is relevant only
to enactment and not to the advertising of the workflow or service.
By drawing this equivalence between services and workflows, we can reuse
the UDDI API to enable their registration and discovery.
The
difference in execution, however, does mean that it needs to be obvious
which type of activity the advert applies to. This requires us to attach
additional metadata to the advert, marking it as being for a workflow
script. Extra optional information may also be advertised for workflow
scripts: the sub-activities of a workflow may be advertised and used
to filter the workflows discovered.
In previous sections we have also identified the need to attach other
metadata, in the form of OWL or RDF, to the activity in the
registry. Therefore we have built the
myGrid registry to be UDDI-compliant but, in addition,
we have specified a protocol for attaching metadata to entities
described in the service registry [MPP+04]. The metadata can
be a simple string value for recording, for example, an estimate of
the average time a workflow takes to execute. Alternatively, it can
be a URI referring to a concept in the ontology. For a more complex semantic
description, for example one in which ontology concepts are qualified by
property values, structured RDF [RDF] metadata can be attached.
The API message structure
for one metadata attachment method is given in Figure 11;
similar methods also exist to attach metadata to a service
(or workflow) or a business, and to query for services or workflows by
the metadata attached to them.
Figure 11: API for attaching metadata to WSDL message parts (inputs or outputs of workflows). To attach metadata the client must identify the entity to which metadata is attached and provide all details of the metadata itself. In this case, a message part is uniquely identified, according to the WSDL specification, by the namespace and local name of the message containing that part plus the part name. Metadata in our registry is given a type, by which the client can determine what the metadata is about, and a value. The value may be either a string, a URI (usually an ontology term) or structured metadata expressed as an assertion in one of the triple languages (such as RDF/XML or N3).
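The shape of such a metadata attachment call can be sketched as follows. The sketch follows the caption's description (a message part identified by message namespace, message local name and part name; metadata with a type and a string, URI or structured value), but the class and method names are assumptions and do not reproduce the actual registry API or its message format.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ValueKind(Enum):
    STRING = "string"
    URI = "uri"
    STRUCTURED = "structured"  # e.g. an assertion in RDF/XML or N3


@dataclass(frozen=True)
class MessagePartRef:
    """Uniquely identifies a WSDL message part."""
    message_namespace: str
    message_local_name: str
    part_name: str


@dataclass(frozen=True)
class Metadata:
    type: str          # tells the client what the metadata is about
    kind: ValueKind
    value: str


class MetadataStore:
    def __init__(self):
        self._by_part = defaultdict(list)

    def add_metadata_to_message_part(self, part: MessagePartRef,
                                     metadata: Metadata) -> None:
        self._by_part[part].append(metadata)

    def get_metadata(self, part: MessagePartRef,
                     type: Optional[str] = None) -> List[Metadata]:
        found = self._by_part.get(part, [])
        return [m for m in found if type is None or m.type == type]


part = MessagePartRef(
    message_namespace="http://www.ecs.soton.ac.uk/~sm/myGrid/myGrid.daml",
    message_local_name="WholeWorkflowRunRequest",
    part_name="probeSetId",
)
store = MetadataStore()
store.add_metadata_to_message_part(
    part, Metadata("semanticType", ValueKind.URI,
                   "biodata:Affymetrix_probe_set_id"))
store.add_metadata_to_message_part(
    part, Metadata("format", ValueKind.STRING, "text/x-record-ids"))
semantic = store.get_metadata(part, type="semanticType")
```

Because several metadata items can be attached to the same part, third parties can add their own descriptions alongside the author's without replacing them.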
A key characteristic
of the registry is that the underlying information is stored as RDF
[RDF] in a Jena [Jena] triple
store, for reasons discussed in Section 4.3.
For completeness, Appendix 1 contains the RDF representation of the
CandidateGeneAnalysis workflow advertisement, as contained in the registry.
Figure 12 illustrates how the advertisement for the
CandidateGeneAnalysis workflow has been annotated with a semantic
description in the registry’s RDF store.
Figure 13 shows the architecture of the registry, which is available
as a Web Service in the myGrid distribution. The client interacts
with the registry through a set of interfaces, which allow services
and workflows to be published and discovered as UDDI business
services, conceptual or operational metadata to
be attached and used in discovery, and semantic discovery to take
place. Other features of the registry include the sending of notifications
when services, workflows and metadata are added or removed, third party
annotations of services, federation of the registry and policy-based
management of its contents, but these are beyond the scope of the paper.
# Base:
@prefix biodata: <http://www.ecs.soton.ac.uk/~sm/myGrid/myGrid.daml#> .
@prefix registry: <http://www.myGrid.ecs.soton.ac.uk/myGrid.rdfs#> .
@prefix wsdl: <http://www.myGrid.ecs.soton.ac.uk/wsdl.rdfs#> .
@prefix uddiv2: <http://www.myGrid.ecs.soton.ac.uk/uddiv2.rdfs#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

[] a <uddiv2:BusinessService> ;
   <uddiv2:hasName>
     [ a <uddiv2:NameBag> ;
       <rdf:_1> "CandidateGeneAnalysis" ] ;
   <uddiv2:hasServiceKey> "d0892afd-198d-404b-bfdf-31c7fa4df8f3" ;
   <uddiv2:hasMetadata>
     [ a <isWorkflowScript> ;                                        (1)
       <rdf:value> "yes" ; ] ;
   <uddiv2:hasBindingTemplate> ...
     [ a <uddiv2:AccessPoint> ;                                      (2)
       <uddiv2:hasText>
         "http://www.ecs.soton.ac.uk/~sm/myGrid/CandidateGeneAnalysis.scufl" ;
       <uddiv2:hasURLType> "http" ] ...
   <uddiv2:hasOverviewDoc>
     [ a <uddiv2:OverviewDoc> ;
       <uddiv2:hasOverviewURL>
         "http://www.ecs.soton.ac.uk/~sm/myGrid/CandidateGeneAnalysis.wsdl"
   ...

[] a <wsdl:WSDLOverviewDoc> ;
   <wsdl:hasFilename>
     "http://www.ecs.soton.ac.uk/~sm/myGrid/CandidateGeneAnalysis.wsdl" ;
   <wsdl:hasMessage>
     [ a <wsdl:MessageBag> ;
       <rdf:_1>
         [ _:b1 <wsdl:Message> ;                                     (3)
           <wsdl:hasQName>
             [ a <wsdl:QName> ;
               <wsdl:hasLocalName> "WholeWorkflowRunRequest" ;
               <wsdl:hasNameSpace> "http://www.ecs.soton.ac.uk/~sm/myGrid/myGrid.daml" ]
           <wsdl:hasMessagePart>
             [ a <wsdl:PartBag> ;
               <rdf:_1>
                 [ a <wsdl:MessagePart> ;
                   <wsdl:hasName> "probeSetId" ;
                   <wsdl:hasTypeName>
                     [ a <wsdl:QName> ;
                       <wsdl:hasLocalName> "string" ;                (4)
                       <wsdl:hasNameSpace> "http://www.w3.org/2001/XMLSchema" ] ;
                   <uddiv2:hasMetadata>
                     [ a <biodata:semanticType> ;                    (5)
                       <rdf:value> "biodata:Affymetrix_probe_set_id" ; ] ;
                   <uddiv2:hasMetadata>
                     [ a <biodata:formats> ;
                       <rdf:value>
                         [ a <biodata:formatBag> ;
                           <rdf:_1>
                             [ a <biodata:format> ;
                               <biodata:hasFormatSystem> "MIME" ;    (6)
                               <biodata:hasFormatIdentifier> "text/x-record-ids" ;
                             ] ]
                     ] ;
   ...
Figure 12: Representation of published workflow description stored in RDF (in
N3 format). The workflow is advertised, following the UDDI specification,
as a BusinessService, and marked as a workflow by attaching metadata
(‘isWorkflowScript’) (1). The workflow refers
to the location of the workflow script (‘AccessPoint’) (2) and its WSDL interface.
The interface element is further expanded to show the messages that
are accepted as input (3) and returned as output, and metadata
is added to provide the syntactic (4), semantic (‘biodata:Affymetrix_probe_set_id’) (5) and
MIME (6) types.
Figure 13: Architecture of the Registry.
The myGrid
semantic find component is responsible for analysing and making
inferences over conceptual metadata with reference to an ontology,
and is used for querying over and categorising activities, such as
services and workflows, described with this metadata.
As this component receives queries expressed in OWL,
we can use it to broaden or narrow searches as required, using
the structure of the ontology. For example,
by adding properties to an OWL concept expression we specialise the
query and narrow the number of candidate workflows (we travel
down the classification lattice);
by removing properties we broaden the query and
extend the number of candidate workflows that will be classified by
the expression (we travel up the classification lattice)
[WSG+03a]. Its architecture is
depicted in Figure 14. The semantic
find component itself is responsible for the following.
Figure 14: Architecture of the Semantic Find Component. The description
database holds semantic descriptions gathered from resources published
in the registry; the ontology server provides access to the domain ontologies
and manages interaction with the description logic reasoner FaCT [H99a].
Two
deployments of the semantic find component are considered. As illustrated
in Figure 13, the semantic find component can be embedded in the registry,
with queries over the conceptual metadata being processed
by the semantic find component, while non-conceptual
queries would be answered by the registry. Alternatively, the
component can be deployed as an autonomous service able to reason over
semantic descriptions from a variety of sources including databases
and Web pages.
Exact
details of the semantic matching algorithms whereby a resource description
is matched to a semantic query should not impact directly on the architecture
described in this paper. In early implementations of this service, we
have performed simple subsumption matching between query and description,
although matching algorithms such as those described elsewhere [PKPS02a]
could also be supported. An
example of the more complex reasoning, using the ontology and its reasoner,
required for some queries would be when a user asks for a service taking
a given semantic type of input (‘Affymetrix_probe_set_id’) and performing
a given semantic type of task (‘retrieving’). In this case, the
index would not contain a specific term representing this conjunctive
concept, and so reasoning must occur to find the parent concepts in
the ontology that can be used for discovery. Clients can present a semantic
query to the semantic find component, making use of the metadata structure
described in Figure 11 to encapsulate the semantic query expressed in
OWL, and are returned the locations of archived workflows or services that
match.
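The effect of subsumption matching, and of narrowing a query by adding properties, can be illustrated with a toy example. The sketch below replaces description logic reasoning with a walk up a single-parent hierarchy, which is a deliberate simplification; the hierarchy, the second workflow and the descriptions given to both workflows are hypothetical.

```python
from typing import Dict, List, Optional

# Toy is-a hierarchy: child concept -> parent concept (hypothetical).
PARENTS: Dict[str, str] = {
    "Affymetrix_probe_set_id": "record_id",
    "record_id": "data",
    "retrieving": "task",
}


def subsumes(general: str, specific: Optional[str]) -> bool:
    """True if `general` equals `specific` or is one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENTS.get(specific)
    return False


def matches(query: Dict[str, str], description: Dict[str, str]) -> bool:
    """Each property in the query must be met by a subsumed concept."""
    return all(prop in description and subsumes(concept, description[prop])
               for prop, concept in query.items())


# Hypothetical semantic descriptions of two published workflows.
DESCRIPTIONS: Dict[str, Dict[str, str]] = {
    "CandidateGeneAnalysis": {"input": "Affymetrix_probe_set_id",
                              "task": "retrieving"},
    "OtherWorkflow": {"input": "record_id", "task": "task"},
}


def find(query: Dict[str, str]) -> List[str]:
    return sorted(name for name, description in DESCRIPTIONS.items()
                  if matches(query, description))


# One property: a broad query.
broad = find({"input": "record_id"})
# Adding a property specialises the query and narrows the candidates.
narrow = find({"input": "record_id", "task": "retrieving"})
```

In this toy setting the conjunctive query is answered by checking each property separately; with a real ontology the conjunction is itself a concept expression that the reasoner must classify, which is the more complex case described above.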
The
Web Service Architecture details the existence of a directory service
for the registration and subsequent discovery of services, and languages
for the composition of services into workflows. For directories,
the UDDI [UDDI] registry (Universal
Description, Discovery, and Integration) has become the de-facto standard
for service discovery. Service
descriptions in UDDI are composed from a limited set of high-level data
constructs (Business Entity, Business Service etc.) which can include
other constructs following a rigid schema. Some of these constructs,
such as tModels, Category Bags and Identifier Bags, can be seen as metadata
associated with the service description. However, while useful in a
limited way, they are all restrictive in their scope of description and
in their use in searching the directory. We
extend
UDDI by allowing arbitrary structured metadata to be attached to not
only the services and workflows published, but also their interfaces.
For
workflow languages, numerous candidates have been
proposed, including
BPEL4WS (Business Process Execution Language for Web Services) [BPEL],
Web Services Flow
Language (WSFL) [WSFL], XLANG (Web Services
for Business Process Design) [XLANG] and Scufl (Simple
Conceptual Unified Flow Language) [SCUFL]. These languages
differ in their expressiveness and flexibility. It is unlikely that
in the foreseeable future a single workflow language will emerge as
a universal standard, although there is some encouraging development
in this direction represented by BPEL4WS, which integrates the key features
of WSFL and XLANG. In myGrid, we have used Scufl to provide
a simple representation of the activities of a workflow in such a way
that it is easy for a bioinformatician to conceptualise and manipulate
the overall experimental design by abstracting away from
the details of low level service orchestration [Addis03].
The
motivation to discover and compose Web Services in automated and intelligent
ways has led many researchers from the Semantic Web community to
apply knowledge technologies to service descriptions,
often building on past work in Problem Solving Methods [WSFM].
Early work has focused on semantic service discovery [DAMLS]; more recent
work has shifted to automated intelligent service composition, primarily
through the use of AI planning techniques [WPS03].
Our semantic descriptions support the composition of services by enabling
semantic and syntactic capability checking of input and output types;
however, we do not support automated
workflow planning, as the plan is the biologist’s experiment and our
experience suggests that biologists demand complete control over its
definition. DAML-S [DAMLS] attempts a full
description of a service as a process that can be enacted to achieve
a goal. A full DAML-S service description incorporates three component
perspectives: a planning view of service based on
“inputs, outputs, preconditions, and effects” (the service profile);
the workflow view of the more primitive services needed to accomplish
a complex goal (the service process); the mapping of the atomic parts
of this workflow to their concrete WSDL [WSDL] descriptions
(the service grounding). DAML-S provides an alternative mechanism that
allows service publishers to attach semantic information to the parameters
of a service. Indeed, the argument types referred to by the profile
input and output parameters are semantic. Such semantic types are mapped
to the syntactic type specified in the WSDL interface through
the service grounding. Such a mechanism is
welcome but convoluted and limited.
The mapping from semantic to syntactic types involves the process model,
and it only supports semantic annotations provided by the publisher,
and not by third party annotators; a profile only supports one semantic
description per parameter and does not allow multiple interpretations.
Finally, such semantic annotations are restricted to input and output
parameters, but may not be applied in a similar manner to other elements
of a WSDL interface specification, e.g. operations or sets of operations
collected in port types.
From
the distributed Grid computing community,
the ICENI project also
uses the Web Ontology Language (OWL) for semantic
annotation [HLN03a] and, while
a promising approach, it
so far deals only with the ontological description of service interfaces,
ignoring other aspects such
as the semantic annotation of WSDL documents and support for
workflow discovery. Also, because the descriptions
are added directly to the interfaces in the source code, only the service
provider can publish semantic descriptions (not third parties), which
imposes restrictions on the community using the system. In
the myGrid project, we have considered these aspects
highly relevant and opted for a flexible structure
which enables annotation with both semantic and other metadata, by both
service providers and third parties.
Finally,
the biology domain has been investigating its own mechanisms for publishing
bio-Web Services. The most well known proposal is
BioMOBY [WL02a],
a service discovery architecture based on a view of a
service as an atomic process or operation that takes a set of inputs
and produces a set of outputs. The service, inputs and outputs
are given semantic types, which also define
the message format. However,
BioMOBY has a number of limitations:
it does not support the UDDI protocol, so specialist
clients have to be developed;
it does not have a general attachment mechanism
for service descriptions; and
it does not explicitly address the publishing or discovery of workflows.
myGrid and BioMOBY are working closely together to develop
a common semantic registry framework.
In
this paper, we have demonstrated how the myGrid architecture
can be used to construct, publish, semantically describe, annotate and
discover workflows as part of scientists’ experimental processes.
Scientists without detailed computer science knowledge wish to share
and use each other’s experimental designs, but discovering the designs
available becomes difficult when there is a large and increasing number
available in a distributed system such as the Web. The myGrid
architecture, making use of Web Services, workflows, enhanced service
discovery technologies, Semantic Web technologies and semantic descriptions,
enables scientists to do this more easily.
We have shown how the process takes place from the users’ perspective
and presented the underlying protocol implemented by our middleware.
We recognise that there can be multiple registries owned by different people and organisations, in which many useful workflows may be published. For this reason, future work on the registry will concentrate on federation of registries and the personalisation of registries to contain the information most useful to individuals, which could include semantic descriptions other than those provided by the workflow author.
It is useful to specify activities at construction time without restricting
them to a particular interface. These workflow templates contain abstract
descriptions in place of one or more services or sub-workflows.
However, in practice we find that services with exactly the same functionality
still often require different ways of being enacted and
so cannot be easily substituted for one another [WGG+03a]. For instance,
one of the ways in which an activity can be distinguished from another
is in its invocation model, so that one service may perform a
function with one operation call that requires multiple calls to another
service (the example given in [WGG+03a] is of different
deployments of the BLAST service discussed in Section 3).
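The difference in invocation model described above can be illustrated with a minimal sketch. It is not taken from the myGrid middleware: the class names, method names and the fake "hit" results are all hypothetical, and the two deployments are simulated in memory. The point is only that one abstract activity can hide a single-call service and a multi-call service behind the same interface.

```python
from __future__ import annotations

from typing import Protocol


class SequenceSearch(Protocol):
    """Abstract activity that a workflow template might name (hypothetical)."""

    def search(self, sequence: str) -> list[str]: ...


class SingleCallSearch:
    """Hypothetical deployment: one operation call returns the results."""

    def run(self, sequence: str) -> list[str]:
        return [f"hit-for-{sequence}"]


class MultiCallSearch:
    """Hypothetical deployment: submit a job, poll it, then fetch the results."""

    def __init__(self) -> None:
        self._jobs: dict[int, str] = {}

    def submit(self, sequence: str) -> int:
        job_id = len(self._jobs) + 1
        self._jobs[job_id] = sequence
        return job_id

    def is_finished(self, job_id: int) -> bool:
        # The simulated job completes as soon as it has been submitted.
        return job_id in self._jobs

    def fetch(self, job_id: int) -> list[str]:
        return [f"hit-for-{self._jobs[job_id]}"]


class SingleCallAdapter:
    """Maps the abstract activity onto the single-call deployment."""

    def __init__(self, service: SingleCallSearch) -> None:
        self._service = service

    def search(self, sequence: str) -> list[str]:
        return self._service.run(sequence)


class MultiCallAdapter:
    """Maps the abstract activity onto the submit/poll/fetch deployment."""

    def __init__(self, service: MultiCallSearch) -> None:
        self._service = service

    def search(self, sequence: str) -> list[str]:
        job_id = self._service.submit(sequence)
        while not self._service.is_finished(job_id):
            pass  # a real client would wait between polls
        return self._service.fetch(job_id)


def enact(activity: SequenceSearch, sequence: str) -> list[str]:
    """Enact the abstract activity without knowing which deployment is used."""
    return activity.search(sequence)
```

Under these assumptions, `enact(SingleCallAdapter(SingleCallSearch()), "ACGT")` and `enact(MultiCallAdapter(MultiCallSearch()), "ACGT")` return the same result, although the second makes three calls to the underlying service.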
In testing myGrid, we have found that the discovery
of workflows by the type of input, and their classification by
function for browsing by the user, have been
the most helpful applications of the semantic descriptions
provided. It has become clear that better tools for the attachment and,
later, maintenance (if mistakes or imprecision are found) of semantic
descriptions are required, as the annotator should be an expert in the
domain of the descriptions rather than in the languages and structures
in which the description is expressed. Future work on tools concentrates
on two areas: making the publication of semantic annotations incidental,
and making discovery invisible, in the sense that the user sees workflow
discovery as a part of their natural scientific environment, expressed
in their own terms.
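The two uses of semantic descriptions highlighted above, discovery by input type and classification by function, can be sketched as follows. This is an illustrative sketch only, not the registry implementation: the small type hierarchy, the function labels, and the workflows other than CandidateGeneAnalysis are assumptions made for the example. Only the semantic type Affymetrix_probe_set_id and the workflow name CandidateGeneAnalysis appear in the workflow description in Appendix 1.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical fragment of a semantic type hierarchy: child -> parent.
PARENT = {
    "Affymetrix_probe_set_id": "record_id",
    "EMBL_accession": "record_id",
    "record_id": "bioinformatics_data",
}


@dataclass(frozen=True)
class WorkflowDescription:
    name: str
    input_type: str  # semantic type of the workflow's input
    function: str    # hypothetical label for the task performed


def is_a(semantic_type: str, ancestor: str) -> bool:
    """True if semantic_type is ancestor or one of its descendants."""
    current: str | None = semantic_type
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


def discover_by_input(registry: list[WorkflowDescription],
                      data_type: str) -> list[str]:
    """Names of workflows whose input type accepts data of data_type."""
    return [wf.name for wf in registry if is_a(data_type, wf.input_type)]


def classify_by_function(registry: list[WorkflowDescription]) -> dict[str, list[str]]:
    """Group workflow names by function, for browsing."""
    groups: dict[str, list[str]] = {}
    for wf in registry:
        groups.setdefault(wf.function, []).append(wf.name)
    return groups


# Example registry; every entry except the first is invented for illustration.
REGISTRY = [
    WorkflowDescription("CandidateGeneAnalysis", "Affymetrix_probe_set_id", "gene_analysis"),
    WorkflowDescription("RecordLookup", "record_id", "retrieval"),
    WorkflowDescription("AccessionToSequence", "EMBL_accession", "retrieval"),
]
```

A scientist holding an Affymetrix probe set identifier would, under these assumptions, be offered both the workflow that asks for exactly that type and the one that accepts any record identifier, but not the workflow that requires an EMBL accession.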
[Addis03] Matthew
Addis, Justin Ferris, Mark Greenwood, Darren Marvin, Peter Li, Tom Oinn
and Anil Wipat. Experiences with eScience workflow specification and
enactment in bioinformatics. In Proceedings of the UK OST e-Science second
All Hands Meeting 2003 (AHM’03),
pages 459-467, Nottingham, UK,
September 2003.
[Affymetrix] Affymetrix. http://www.affymetrix.com. Last visited 2003.
[AGM+90a] S.F. Altschul, W. Gish, W. Miller, E.W. Myers
and D.J. Lipman. Basic Local Alignment Search Tool. In Journal
of Molecular Biology, 215:403-410, 1990.
[AKT] AKT Project. http://www.aktors.org/. Last visited 2003.
[BGB+99] Patricia G. Baker, Carole Goble, Sean Bechhofer, Norman Paton, Robert Stevens, Andy Brass. An Ontology for Bioinformatics Applications. Bioinformatics, 15(6) pp 510--520, 1999.
[BHL01a] Tim Berners-Lee, James Hendler, and Ora Lassila. The Semantic Web. Scientific
American, 284(5):34–43, 2001.
[BHLT] Instance Store. http://instancestore.man.ac.uk/. Last visited 2003.
[BioJava] BioJava. http://www.biojava.org/. Last visited 2003.
[BioPerl] BioPerl. http://bioperl.org/. Last visited 2003.
[BPEL] Business Process Execution Language for Web Services. http://www-106.ibm.com/developerworks/webservices/library/ws-bpel/. Last visited 2003.
[DAMLS] The DAML
Services Coalition (alphabetically Anupriya Ankolenkar, Mark Burstein,
Jerry R. Hobbs, Ora Lassila, David L. Martin, Drew McDermott, Sheila
A. McIlraith, Srini Narayanan, Massimo Paolucci, Terry R. Payne and
Katia Sycara), "DAML-S: Web Service Description for the Semantic
Web", The First International Semantic Web Conference (ISWC),
Sardinia (Italy), June, 2002.
[EMBOSS] EMBOSS. http://www.hgmp.mrc.ac.uk/Software/EMBOSS. Last visited 2003.
[FK03] Ian Foster and Carl Kesselman. The Grid, Blueprint for a New Computing Infrastructure. 2nd edition, Morgan Kaufmann, 2003.
[FKNT02a] Ian Foster, Carl Kesselman, Jeffrey Nick and Steve Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, Globus, 2002.
[FreeFluo] FreeFluo. http://freefluo.sourceforge.net/. Last visited 2003.
[GGS+03a] Mark
Greenwood, Carole Goble, Robert Stevens, Jun Zhao, Matthew Addis, Darren
Marvin, Luc Moreau, and Tom Oinn. Provenance of e-science experiments
- experience from bioinformatics. In Proceedings of the UK OST e-Science
second All Hands Meeting 2003 (AHM'03), pages 223-226, Nottingham,
UK, September 2003. ISBN 1-904425-11-9.
[Graves] National Graves’ Disease Foundation Frequently Asked Questions. http://www.ngdf.org/faq.htm. Last visited 2003.
[GO00] The Gene Ontology Consortium. 2000. Gene Ontology: tool for the unification of biology. Nat Genet 25: 25-29.
[H99a] Ian Horrocks. FaCT and iFaCT. In P. Lambrix, A. Borgida, M. Lenzerini, R. Möller, and P. Patel-Schneider, editors, Proceedings of the International Workshop on Description Logics (DL’99), pages 133-135, 1999.
[HLN03a] J. Hau, W. Lee, and S. Newhouse. Autonomic Service Adaptation using Ontological Annotation. In 4th International Workshop on Grid Computing, Grid 2003, Phoenix, USA, November 2003.
[HP-Sv-H03] Ian Horrocks, Peter F. Patel-Schneider, and Frank van Harmelen. From SHIQ and RDF to OWL: The making of a web ontology language. Journal of Web Semantics, Vol. 1, No. 1 December 2003, Elsevier.
J. Hau, W. Lee, S. Newhouse. The ICENI Semantic Service Adaptation
Framework. In UK e-Science All Hands Meeting, p. 79-86, Nottingham,
UK, September 2003. ISBN 1-904425-11-9.
[Jena] Jena Semantic Web Toolkit. http://www.hlp.hp.com/semweb/jena.htm. Last visited 2003.
[Jini] Jini. http://www.jini.org/. Last visited 2003.
[LWS+03b] Phillip Lord, Chris Wroe, Robert Stevens, Carole Goble, Simon Miles, Luc Moreau, Keith Decker, Terry Payne, and Juri Papay. Semantic and Personalised Service Discovery. In W. K. Cheung and Y. Ye, editors, Proceedings of Workshop on Knowledge Grid and Grid Intelligence (KGGI'03), in conjunction with 2003 IEEE/WIC International Conference on Web Intelligence/Intelligent Agent Technology, pages 100-107, Halifax, Canada, October 2003. Department of Mathematics and Computing Science, Saint Mary's University, Halifax, Nova Scotia, Canada.
[MLM+04] Luc Moreau, Mike Luck, Simon Miles, Juri Papay, Keith Decker, and Terry
Payne. Methodologies and Software Engineering for Agent Systems, chapter
Agents and the Grid: Service Discovery. Kluwer, 2004.
[MPD+03a] Simon Miles, Juri Papay, Vijay Dialani, Michael Luck, Keith Decker, Terry Payne, and Luc Moreau. Personalised grid service discovery. IEE Proceedings Software: Special Issue on Performance Engineering, 150(4):252-256, August 2003.
[MPP+04] Simon
Miles, Juri Papay, Terry Payne, Keith Decker and Luc Moreau. Towards
a Protocol for the Attachment of Semantic Descriptions to Grid Services.
To appear in Proceedings of the 2nd
European Across Grids Conference (AxGrids 2004). 2004.
[myGrid] myGrid UK e-Science Project. http://www.myGrid.org.uk. Last visited 2003.
[OGSA] OGSA. https://forge.gridforum.org/projects/ogsa-wg. Last visited 2003.
[OWL] Web Ontology Language Overview. http://www.w3.org/TR/owl-features/. Last visited 2003.
[Pedro] Pedro. http://pedrodownload.man.ac.uk/. Last visited 2003.
[PKPS02a] Massimo Paolucci, Takahiro Kawamura, Terry Payne and Katia Sycara. Semantic Matching of Web Services Capabilities. In The First International Semantic Web Conference (ISWC), 2002.
[RDF] Resource Description Framework (RDF). http://www.w3.org/RDF/, Created 2001.
[RDQL] RDQL. http://www.hpl.hp.com/semweb/rdql.htm. Last visited 2003.
[SGG+03a] Robert
Stevens, Kevin Glover, Chris Greenhalgh, Claire Jennings, Simon Pearce,
Peter Li, Milena Radenkovic, Anil Wipat. In
Proceedings of the UK OST e-Science second All Hands Meeting 2003 (AHM'03),
pages 43-50, Nottingham, UK, September 2003. ISBN 1-904425-11-9.
[SCUFL] Simple Conceptual Unified Flow Language (SCUFL). http://taverna.sourceforge.net/schemata/XScufl.html. Last visited 2003.
[Taverna] Taverna. http://taverna.sourceforge.net/. Last visited 2003.
[UDDI] Universal
Description, Discovery and Integration of Business for the Web. www.uddi.org,
2001.
[WGG+03a] Chris
Wroe, Carole Goble, Mark Greenwood, Phillip Lord, Simon Miles, Luc Moreau,
Juri Papay, Terry Payne. Experiment automation using semantic data on
a bioinformatics Grid. Submitted for publication in
IEEE Intelligent Systems, Jan/Feb
2004.
[WL02a] M.D. Wilkinson and M. Links. BioMoby: an open source biological web services proposal. Briefings In Bioinformatics, 4(3), 2002.
[WSDL] Web Services Description Language (WSDL) 1.1. http://www.w3c.org/TR/wsdl. Last visited 2003.
[WSArch] Web Services Architecture. Latest version available from http://www.w3.org/2002/ws/arch/. Last visited 2003.
[WSFL] Web Services Flow Language.
http://www-3.ibm.com/software/solutions/webservices/pdf/WSFL.pdf. Last visited 2003.
[WSFM] D. Fensel, C. Bussler, "The Web Service Modeling Framework WSMF", Technical Report, Vrije Universiteit Amsterdam.
[WPS03] Dan Wu, Bijan Parsia, Evren Sirin, et al. Automating DAML-S Web Services Composition Using SHOP2. In Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, Volume 2870 / 2003, pp. 195 – 210, October 2003.
[WSG+03a] Chris Wroe, Robert Stevens, Carole Goble, Angus Roberts, and Mark Greenwood. A suite of DAML+OIL ontologies to describe bioinformatics web services and data. International Journal of Cooperative Information Systems, 12(2):197–224, 2003.
[XLANG] XLANG. http://www.gotdotnet.com/team/xml_wsspecs/xlang-c/default.htm. Last visited 2003.
Appendix 1. RDF Representation of a Published Workflow
Below, we find the representation of a published workflow description stored in RDF (in N3 format). The workflow is advertised, following the UDDI specification, as a BusinessService, and marked as a workflow by attaching metadata (‘isWorkflowScript’) (1). The workflow refers to the location of the workflow script (‘AccessPoint’) (2) and its WSDL interface. The interface element is further expanded to show the messages that are accepted as input (3) and returned as output, and metadata is added to provide the syntactic (4), semantic (‘biodata:Affymetrix_probe_set_id’) (5) and MIME (6) types.
# Base:
@prefix biodata: <http://www.ecs.soton.ac.uk/~sm/myGrid/myGrid.daml#> .
@prefix registry: <http://www.myGrid.ecs.soton.ac.uk/myGrid.rdfs#> .
@prefix wsdl: <http://www.myGrid.ecs.soton.ac.uk/wsdl.rdfs#> .
@prefix uddiv2: <http://www.myGrid.ecs.soton.ac.uk/uddiv2.rdfs#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
[] a <uddiv2:BusinessService> ;
  <uddiv2:hasName>
    [ a <uddiv2:NameBag> ;
      <rdf:_1> "CandidateGeneAnalysis"
    ] ;
  <uddiv2:hasServiceKey> "d0892afd-198d-404b-bfdf-31c7fa4df8f3" ;
  <uddiv2:hasMetadata>
    [ a <isWorkflowScript> ; (1)
      <rdf:value> "yes" ;
    ] ;
  <uddiv2:hasBindingTemplate> ...
    [ a <uddiv2:AccessPoint> ; (2)
      <uddiv2:hasText>
        "http://www.ecs.soton.ac.uk/~sm/myGrid/CandidateGeneAnalysis.scufl" ;
      <uddiv2:hasURLType> "http"
    ] ...
  <uddiv2:hasOverviewDoc>
    [ a <uddiv2:OverviewDoc> ;
      <uddiv2:hasOverviewURL>
        "http://www.ecs.soton.ac.uk/~sm/myGrid/CandidateGeneAnalysis.wsdl" ...
[] a <wsdl:WSDLOverviewDoc> ;
  <wsdl:hasFilename> "http://www.ecs.soton.ac.uk/~sm/myGrid/CandidateGeneAnalysis.wsdl" ;
  <wsdl:hasMessage>
    [ a <wsdl:MessageBag> ;
      <rdf:_1>
        [ a <wsdl:Message> ; (3)
          <wsdl:hasQName>
            [ a <wsdl:QName> ;
              <wsdl:hasLocalName> "WholeWorkflowRunRequest" ;
              <wsdl:hasNameSpace> "http://www.ecs.soton.ac.uk/~sm/myGrid/myGrid.daml"
            ] ;
          <wsdl:hasMessagePart>
            [ a <wsdl:PartBag> ;
              <rdf:_1>
                [ a <wsdl:MessagePart> ;
                  <wsdl:hasName> "probeSetId" ;
                  <wsdl:hasTypeName>
                    [ a <wsdl:QName> ;
                      <wsdl:hasLocalName> "string" ; (4)
                      <wsdl:hasNameSpace> "http://www.w3.org/2001/XMLSchema"
                    ] ;
                  <uddiv2:hasMetadata>
                    [ a <biodata:semanticType> ; (5)
                      <rdf:value> "biodata:Affymetrix_probe_set_id" ;
                    ] ;
                  <uddiv2:hasMetadata>
                    [ a <biodata:formats> ;
                      <rdf:value>
                        [ a <biodata:formatBag> ;
                          <rdf:_1>
                            [ a <biodata:format> ;
                              <biodata:hasFormatSystem> "MIME" ; (6)
                              <biodata:hasFormatIdentifier> "text/x-record-ids" ;
                            ]
                        ]
                    ] ;
                  ...
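To illustrate how the 'isWorkflowScript' metadata (1) lets a client separate workflows from other advertised services, the following sketch models a few of the statements above as plain (subject, predicate, object) tuples and selects the service keys of entries marked as workflows. It is a stand-in written in plain Python rather than with an RDF toolkit, and it is not the registry's query mechanism. The blank-node labels (_:svc, _:meta, _:other) and the second service with its key are invented for the example.

```python
from __future__ import annotations

Triple = tuple[str, str, str]

# A few statements from the description above, flattened into triples.
# Blank-node labels and the "_:other" service are hypothetical.
TRIPLES: list[Triple] = [
    ("_:svc", "rdf:type", "uddiv2:BusinessService"),
    ("_:svc", "uddiv2:hasServiceKey", "d0892afd-198d-404b-bfdf-31c7fa4df8f3"),
    ("_:svc", "uddiv2:hasMetadata", "_:meta"),
    ("_:meta", "rdf:type", "isWorkflowScript"),
    ("_:meta", "rdf:value", "yes"),
    # A service without the workflow marker, added only for contrast.
    ("_:other", "rdf:type", "uddiv2:BusinessService"),
    ("_:other", "uddiv2:hasServiceKey", "hypothetical-service-key"),
]


def objects(triples: list[Triple], subject: str, predicate: str) -> list[str]:
    """All objects of statements with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def workflow_service_keys(triples: list[Triple]) -> list[str]:
    """Service keys of BusinessService entries marked as workflow scripts."""
    services = [s for s, p, o in triples
                if p == "rdf:type" and o == "uddiv2:BusinessService"]
    keys: list[str] = []
    for service in services:
        for meta in objects(triples, service, "uddiv2:hasMetadata"):
            is_marker = "isWorkflowScript" in objects(triples, meta, "rdf:type")
            is_yes = "yes" in objects(triples, meta, "rdf:value")
            if is_marker and is_yes:
                keys.extend(objects(triples, service, "uddiv2:hasServiceKey"))
    return keys
```

Calling `workflow_service_keys(TRIPLES)` selects only the CandidateGeneAnalysis entry, since the contrasting service carries no workflow metadata.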
1 available at http://www.ecs.soton.ac.uk/~sm/myGrid/AffyIdToEmblSnps.scufl
2 In this paper, we focus on workflow descriptions, but we note that service
descriptions are similar. Service descriptions differ in that
the institution is the one hosting the service, and that services do
not tend to have sub-activities associated with them. We
shall come back to this specific point in Section 7.
3 BLAST, “the Basic Local Alignment
Search Tool” [AGM+90a],
is an application that encompasses a number of services used to compare
a DNA or protein sequence with the large public databases of known sequences.
It can therefore accept as input different types of sequence data whether
protein or DNA, perform a search over one or more databases and produce
its results in a variety of formats. BLAST is highly parameterisable,
able to search over many databases with many types of sequence. In fact,
BLAST has several instantiations specialised for different sequence
types: BLASTn for searching nucleotide sequences over nucleotide sequence
databases, BLASTx for nucleotide sequences over protein databases.