Semantic Description, Publication and Discovery of Workflows in myGrid

Simon Miles1, Juri Papay1, Chris Wroe2, Phillip Lord2, Carole Goble2, Luc Moreau1

sm@ecs.soton.ac.uk; jp@ecs.soton.ac.uk; chris.wroe@cs.man.ac.uk; p.lord@russet.org.uk; carole@cs.man.ac.uk; L.Moreau@ecs.soton.ac.uk

1 School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK;

2 Department of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK

Abstract

The bioinformatics scientific process relies on in silico experiments, which are experiments executed in full in a computational environment. Scientists wish to encode the designs of these experiments as workflows because they provide minimal, declarative descriptions of the designs, which overcome many barriers to the sharing and re-use of these designs between scientists and enable the use of the most appropriate services available at any one time. We anticipate that the number of workflows will increase quickly as more scientists begin to make use of existing workflow construction tools to express their experiment designs, but discovery then becomes an increasingly hard problem, as it becomes more difficult for a scientist to identify the workflows relevant to their particular research goals amongst all those on offer. While many approaches exist for the publishing and discovery of services, there have been no attempts to address where and how authors of experimental designs should advertise the availability of their work or how relevant workflows can be discovered with minimal effort from the user. As the users designing and adapting experiments will not necessarily have a computer science background, we also have to consider how publishing and discovery can be achieved in such a way that they are not required to have detailed technical knowledge of workflow scripting languages. Furthermore, we believe they should be able to make use of others’ expert knowledge (the semantics) of the given scientific domain. In this paper, we define the issues related to the semantic description, publishing and discovery of workflows, and demonstrate how the architecture created by the myGrid project aids scientists in this process.
We give a walk-through of how users can construct, publish, annotate, discover and enact workflows via the user interfaces of the myGrid architecture; we then describe novel middleware protocols, making use of the Semantic Web technologies RDF and OWL to support workflow publishing and discovery.

1. Introduction

Traditionally, the biological scientific process has involved experiments on living systems, in vivo, or on parts of a living system in a test tube, in vitro. Bioinformatics has focused on supporting the experimental biologist by enabling many more experiments to be carried out in silico, that is, computationally. As this better supports automation and also harnesses the collective knowledge of the discipline, in silico biological experiments have greatly enabled the process of validating hypotheses or gathering additional information to shape the design of future experiments. For science to become efficient, experiments need to be shared, adapted and reused; distributed architectures on the Internet promise to be the most effective mechanism to achieve this goal for in silico experiments.

Both the Web Services and Grid architectures [WSArch, OGSA] have adopted a service-oriented approach, in which computational resources, storage resources, programs and databases can be represented by services. In such a context, a service is a network-enabled entity capable of encapsulating diverse implementations behind a common interface. The benefit of such a uniform view is that it facilitates the composition of services into more sophisticated services, thereby promoting sharing and reuse of resources in distributed environments. To this end, a number of workflow languages have emerged which are capable of describing complex compositions of services, e.g. WSFL [WSFL], XLANG [XLANG], BPEL [BPEL], XScufl [SCUFL].

However, service-oriented architectures currently provide no mechanism to facilitate the sharing of workflows. At present, workflow authors simply publish a list of their available workflows on a Web page. With an increasing number of workflows, and of sites listing them, searching for them in this way will soon become untenable.

DAML-S [DAMLS] considers workflows as largely equivalent to services with regard to publishing and discovery because they are functional entities that are identified by their interface (inputs and outputs) and overall function. Therefore, as with services, workflows need to be published in order to be discovered and reused. However, publishing a workflow involves two distinct steps: first, the workflow script must be archived in a repository from which it can be publicly retrieved; next, a description of the script and a reference to it (e.g. a URI) need to be advertised in a registry. In this context, a registry is defined as a service holding descriptions of workflows and services. Many protocols for publishing service descriptions, including de facto standards such as UDDI [UDDI], Jini [Jini] and BioMoby [WL02a], do not, in themselves, address publication and discovery of workflows. DAML-S, on the other hand, is an ontology capable of describing complex processes, but is not a registry system for publishing and discovery.
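The two publishing steps described above can be sketched as follows. This is a minimal illustration only, not the myGrid registry interface: the `Repository` and `Registry` classes, their method names, the `repo://` URI scheme and the placeholder script text are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Hypothetical store from which workflow scripts can be retrieved by URI."""
    scripts: dict = field(default_factory=dict)

    def archive(self, name: str, script: str) -> str:
        uri = f"repo://workflows/{name}"  # made-up URI scheme for illustration
        self.scripts[uri] = script
        return uri


@dataclass
class Registry:
    """Hypothetical registry holding descriptions that point at archived scripts."""
    adverts: list = field(default_factory=list)

    def advertise(self, description: str, script_uri: str) -> None:
        self.adverts.append({"description": description, "script": script_uri})

    def find(self, keyword: str) -> list:
        # Naive keyword match; it merely stands in for real discovery.
        return [a["script"] for a in self.adverts
                if keyword.lower() in a["description"].lower()]


def publish(repo: Repository, registry: Registry,
            name: str, script: str, description: str) -> str:
    uri = repo.archive(name, script)      # step 1: archive the script
    registry.advertise(description, uri)  # step 2: advertise description and reference
    return uri


repo, registry = Repository(), Registry()
uri = publish(repo, registry, "CandidateGeneAnalysis",
              "<placeholder script>", "Finds SNPs for a candidate gene")
found = registry.find("snp")
```

Keeping the two stores separate mirrors the distinction drawn in the text: the registry only ever holds a description and a reference, so the script itself can live wherever its author chooses to archive it.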

Once a published workflow has been discovered, scientists use their expert knowledge of the scientific field to judge whether a design is applicable to their own work. Unfortunately, such domain-specific knowledge is not readily available from workflow scripts, which are engineered in terms of programmatic notions such as interfaces, ports, operations and messages of the service-oriented architecture in use [WSDL]; furthermore, domain knowledge cannot be inferred from these low-level notions. However, semantic descriptions can be added to workflows in order to make high-level knowledge explicit; these must be machine interpretable if tools are to be capable of recommending the applicability of workflows based on the domain-specific knowledge of a scientist.

myGrid [myGrid] is a pilot project funded by the UK e-Science programme to develop Grid middleware in a biological sciences context. The goal of the myGrid project is to develop a software infrastructure that provides support for bioinformaticians in the design and execution of workflow-based in silico experiments using the resources of the Grid [FK03]. In silico experiments can operate over the Grid, in which resources are geographically distributed and managed by multiple institutions, and the necessary tools, services and workflows are discovered and invoked dynamically. It is a data-intensive Grid, where the complexity is in the data itself, the number of repositories and tools that need to be involved in the computations, and the heterogeneity of the data, operations and tools. The myGrid architecture includes components for composing workflows, annotating them with semantic descriptions, publishing semantically described workflows, reasoning over semantic descriptions, discovering workflows from semantic queries and executing them. In previous papers, we have discussed various facets of our approach to service publication and discovery, namely its preliminary design [LWS+03b], its protocol for annotating service descriptions [MPP+04] and its performance [MLM+04, MPD+03a]. The purpose of this paper is to discuss the final design of our architecture for workflow publication and discovery, and its implementation and integration in an electronic lab-book, for manipulation by the scientist. Specifically, this paper focuses on the following technical contributions of the myGrid architecture for publishing and discovering workflow-based in silico experiments.

  • A definition of the protocol used to publish, annotate and discover workflows in a registry. The protocol is independent of the actual language used to encode workflows. To this end, it relies on a notion of workflow executive summary, which identifies, in an extensible manner, the salient features of a workflow that can be described conceptually or syntactically.
  • The use of RDF (the W3C Resource Description Framework) [RDF], which underpins the Semantic Web effort [BHL01a], as the underlying representation to express service and workflow descriptions and to facilitate the attachment of metadata to them. Besides being a flexible and powerful representation formalism, RDF provides for uniform graph-based querying using RDQL [RDQL], which is used in our registry to support workflow discovery.
  • The use of OWL ontologies to encode domain-specific knowledge and to allow the inferences required by the discovery process. Specifically, ontologies are used to index workflows according to their functionality and the semantic types of their inputs and outputs, expressed as biological concepts. A semantic find component, which uses a description logic reasoner, provides complete reasoning over the rich OWL-based descriptions of workflows, and facilitates discovery with complex queries over these descriptions.
  • A complete implementation of the architecture, organised as a set of Web Services and associated user interfaces, all of which are available for download from http://www.myGrid.org.uk/myGrid/web/download/
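To give a feel for what graph-based querying over descriptions involves, the sketch below stores descriptions as subject-predicate-object triples and answers a conjunctive pattern query. It is an assumption-laden illustration using only the Python standard library: it does not use RDQL syntax or any RDF toolkit, and the `ex:` property and resource names are invented for the example rather than taken from the myGrid vocabulary.

```python
# Descriptions held as (subject, predicate, object) triples, as in RDF.
# The "ex:" names are invented for this sketch, not myGrid vocabulary.
triples = {
    ("ex:CandidateGeneAnalysis", "ex:hasInput", "ex:probe_set_id"),
    ("ex:CandidateGeneAnalysis", "ex:hasOutput", "ex:EMBL_SNP"),
    ("ex:CandidateGeneAnalysis", "ex:performsTask", "ex:SNP_annotation"),
    ("ex:OtherWorkflow", "ex:hasInput", "ex:protein_sequence"),
}


def match(pattern, store):
    """Yield variable bindings for one triple pattern.

    Terms starting with '?' are variables; anything else must match exactly.
    """
    for triple in store:
        bindings = {}
        for want, got in zip(pattern, triple):
            if want.startswith("?"):
                if bindings.setdefault(want, got) != got:
                    break
            elif want != got:
                break
        else:
            yield bindings


def query(patterns, store):
    """Join several patterns: a solution must satisfy all of them."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for sol in solutions:
            # Substitute already-bound variables before matching.
            bound = tuple(sol.get(term, term) for term in pattern)
            for b in match(bound, store):
                next_solutions.append({**sol, **b})
        solutions = next_solutions
    return solutions


# "Which workflows take a probe set ID and produce EMBL SNP data?"
results = query(
    [("?w", "ex:hasInput", "ex:probe_set_id"),
     ("?w", "ex:hasOutput", "ex:EMBL_SNP")],
    triples,
)
```

Because every description is just a set of triples, third-party metadata can be attached by adding further triples about the same subject, without changing the query machinery.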

Section 2 presents an illustrative bioinformatics case study, including a representative workflow, to aid in describing and demonstrating the usefulness of our work. Section 3 shows the users’ perspective in sharing workflows from composition through publishing and description to discovery. In Section 4, we examine the use of semantic technologies to represent the knowledge used for discovery, and in Section 5 we define the protocols used in myGrid to process this information. The implementation of the middleware using these protocols is given in Section 6, the scope of our work and related work are discussed in Section 7, and we draw conclusions and suggest further work in Section 8.

2. Workflows in Graves’ Disease Experiments

We now present a case study to illustrate our approach to semantic description and discovery of workflows. The Graves’ Disease application, an exemplar application for myGrid, is intended to help the investigation of a thyroid disorder [Graves]. Specifically, the purpose of the application is to help biologists identify gene mutations that may be involved in causing the condition.

The Graves’ Disease scenario uses a well-known and common "candidate gene" approach. We assume that previous biological investigations have been used to isolate a region on the genome in which genes affecting Graves’ Disease may lie. By looking through this region for variations between Graves’ Disease and normal patients, then determining whether these variations lie within a gene, a number of candidate genes can be found. One of the most common variations is called a Single Nucleotide Polymorphism (SNP), which is a variation involving only a single nucleotide, rather than a large-scale change affecting many nucleotides. Often, however, many of these mutations occur in a region, and most of them will not be related to Graves’ Disease. The in silico process consists of gathering information from several publicly available data resources, many of which have been made available as services at one or more locations, describing the current state of knowledge about the genes in question. Once such information has been obtained for a set of candidate genes, the scientist can design an in vitro experiment that will test their likelihood of being involved in the disease.

To enable re-running of the experiment and best use of Grid resources, the experiment is encoded as a workflow, a composed set of services or other workflows, which we refer to as CandidateGeneAnalysis. This workflow takes a "probe set ID" referring to a gene sequence in the Affymetrix database [Affymetrix] as input and ultimately returns a record from the EMBL database containing information about the sequence including SNPs. The workflow’s structure can be seen on the left hand side of Figure 1. The specific details can be found in its script[1], available at http://www.ecs.soton.ac.uk/~sm/myGrid/AffyIdToEmblSnps.scufl and encoded in the SCUFL workflow language. The SCUFL workflow language, developed as part of myGrid, simplifies the process of creating workflows for biology by making the language granularity and concepts as close as possible to those that a potential user of the system would already be using [SCUFL].

Since this experiment is more widely applicable than just for the study of Graves’ Disease, the biologists may wish to share it with others, and would want to do so in such a way that it can usefully be discovered and re-used. User requirements [SGG+03a] have identified some questions that scientists commonly ask about such kinds of experiment. Specifically, since they aim to discover SNPs from gene sequence data, they will seek experiments that:

1) process a given sort of data (e.g. genes),

2) retrieve information from a public database about a specific gene,

3) provide a given type of output (e.g. SNPs),

4) perform searches over named public databases.

Since experiments are represented as workflows, and workflows are characterised by the kind of their inputs, their outputs and their function, a user will specifically seek published workflows that:

1) have a given semantic type (e.g. gene sequence data) as one of their inputs,

2) perform a given type of function (e.g. retrieve a database record about a gene),

3) have a specific semantic type (e.g. SNPs) as one of their outputs,

4) use certain services (e.g. named public genetic information databases).
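The four criteria above translate naturally into a filter over workflow descriptions. The sketch below is illustrative only: the `Summary` class and `discover` function are hypothetical names, the values for CandidateGeneAnalysis are those given for it in the terminology table of this section, and `HypotheticalAlignment` is an invented second entry included so that the filter has something to exclude.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Salient features of a workflow (field names are illustrative)."""
    name: str
    inputs: frozenset
    outputs: frozenset
    task: str
    uses: frozenset


def discover(summaries, input_type=None, task=None, output_type=None, service=None):
    """Return names of workflows satisfying every criterion that is given."""
    hits = []
    for s in summaries:
        if input_type is not None and input_type not in s.inputs:
            continue
        if task is not None and task != s.task:
            continue
        if output_type is not None and output_type not in s.outputs:
            continue
        if service is not None and service not in s.uses:
            continue
        hits.append(s.name)
    return hits


published = [
    Summary("CandidateGeneAnalysis",
            frozenset({"probe set ID"}), frozenset({"EMBL_SNP"}),
            "SNP_annotation",
            frozenset({"Affymetrix database", "EMBL database", "BLAST"})),
    # Invented entry, present only to exercise the filter.
    Summary("HypotheticalAlignment",
            frozenset({"gene sequence"}), frozenset({"alignment"}),
            "sequence_alignment", frozenset({"BLAST"})),
]

by_output = discover(published, output_type="EMBL_SNP")
by_service = discover(published, service="BLAST")
```

Exact string matching is the weakest possible notion of "a given type"; it is used here only to keep the example short, and it ignores any relationships between concepts.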

This use case indicates that we need a large number of entities in order to perform in silico experiments; Table 1 summarises the terminology we adopt in this paper. Next, we examine how users go about publishing workflows, so that the questions above can be answered by the architecture to support the discovery process.


| Group | Term | Description | Example | Abstract/Concrete |
|---|---|---|---|---|
| Basic concepts | Workflow Language | A language for specifying a workflow. | BPEL, Scufl | |
| Basic concepts | Service | An atomic entity that can be invoked. | Blast service at EBI | Concrete |
| Basic concepts | Workflow | A composed set of services or other workflows and a specification of the data flow between them. | Candidate Gene Analysis | |
| Basic concepts | Activity | A workflow or a service. | | |
| Basic concepts | Service Type | Abstract activity definition that represents the class of a service or a workflow template. | Sequence Alignment, BLAST | Abstract |
| Workflow language independent | Workflow Executive Summary | The salient features of a workflow that are desirable to be described conceptually or syntactically: inputs, outputs, task(s), component resources. | Input: probe set ID; Output: EMBL_SNP; Using: Affymetrix database, EMBL database, BLAST; Task: SNP_annotation | |
| Workflow language dependent | Workflow Template | A workflow in which one or more of the activities are not directly invokable, but are represented as a specification which can be resolved into an invokable activity. | The Candidate Gene Analysis data and control flow, choreographing service types (e.g. BLAST) instead of, or as well as, activities (e.g. BLAST at EBI). | |
| Workflow language dependent | Workflow Script | A specific specification, defined in terms of the workflow language, that we can directly enact. | http://www.ecs.soton.ac.uk/~sm/myGrid/AffyIdToEmblSnps.scufl | Concrete |

Table 1: myGrid terminology.

3. The Users’ Perspective

The purpose of this section is twofold. On the one hand, we illustrate our approach to workflow publication and discovery, using snapshots of the graphical user interface that the scientist is presented with when using the myGrid system; the functionality of this interface was derived from the user requirements we captured at the beginning of the myGrid project [SGG+03a]. On the other hand, we identify key technical requirements for the knowledge representation that is required to support our approach. With the user-centric perspective adopted by myGrid, we analyse the kinds of discovery that scientists are confronted with: when composing workflows and when deciding which scientific experiments to run. In order to be discovered, workflows need to have been published, and we examine how suitable semantic descriptions of these workflows can be made available to the system.

3.1. Construction-time Discovery

Designing a workflow means linking together functional entities such as Web Services or other workflows, which we refer to as activities (see Table 1), so that the outputs of some are used as the inputs of others. Figure 1 shows the myGrid graphical workflow construction tool Taverna [Taverna].

Figure 1: Workflow construction using Taverna. The left-hand panel contains a depiction of the workflow itself with each box representing an activity in the workflow; when the workflow is enacted, this activity results in a Web Service operation call or the invocation of another workflow. Data flows from the inputs, represented by inverted triangles, through the linked services to the output triangle at the bottom of the workflow. The ‘Scufl Model Explorer’ panel shows a hierarchical view of the workflow and ‘Enactor launch’ relates to test runs of the workflow.

Workflows need not be created from scratch: they can be adapted and personalised from previously written workflows. As part of personalisation, the workflow’s author needs to discover existing activities (workflows and services) so as to include them in their design. Hence, since both services and workflows need to be discovered, both are listed in the ‘Available services’ panel of Figure 1. Crucially, user requirements have identified that: 1) biologists require activities to be discoverable by the function they perform, that is, task-oriented discovery; and 2) final selection ultimately rests with the scientist, who will select those to be included according to the goal of the experiment they are designing. To this end, scientists need to be able to draw on a wide range of information about activities in order to inform their decisions. Specifically, the following workflow descriptions[2] are used: the workflow’s author and their institution, the function of the workflow, the sub-activities it may invoke (and their function), and the inputs and outputs of the workflow expressed in biological terms. It is interesting to note that when the scientist considers a workflow for insertion in an experiment, they do not regard it as a black box, because they want to know about the activities it is composed of, though the fine details of their dependencies, control and data flows do not matter at this stage.

Selecting activities based on the functions they perform helps guarantee that the overall experiment has the intended behaviour. However, further care is needed to ensure that the composition is operationally consistent at the transport level: data types and formats of outputs must be compatible with the inputs they feed into. In order to verify such constraints, service interfaces [WSDL] and an equivalent concept for workflows need to be made available to the scientists, who will make sure that all data are suitably converted to ensure a coherent composition. In Taverna, the scientist is made aware of the incompatibility of data types and formats (encoded as MIME types) because links can only be made between the output of one activity and an input of another with the same type. To that end, Taverna relies on the WSDL interface files of services and workflows, the details of which are hidden from the scientists by the user interface.
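The transport-level constraint can be illustrated with a very small check. This is a sketch under the assumption, taken from the paragraph above, that two ports may be linked only when their MIME types are identical; it is not Taverna code, and the port names and MIME types below are made up.

```python
def can_link(output_port: dict, input_port: dict) -> bool:
    """Permit a link only when both ports declare the same MIME type."""
    return output_port["mime_type"] == input_port["mime_type"]


def linkable_inputs(output_port: dict, candidate_inputs: list) -> list:
    """Names of the inputs that the given output may be connected to."""
    return [p["name"] for p in candidate_inputs if can_link(output_port, p)]


# Hypothetical ports; the names and MIME types are invented for the example.
blast_report = {"name": "blast_report", "mime_type": "text/plain"}
inputs = [
    {"name": "report_parser.text", "mime_type": "text/plain"},
    {"name": "image_viewer.picture", "mime_type": "image/png"},
]
allowed = linkable_inputs(blast_report, inputs)
```

A check of this kind says nothing about whether the link is scientifically meaningful; that judgement relies on the functional and semantic descriptions discussed in the rest of this section.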

3.2. Experiment-time Workflow Discovery

Scientists undertake their research by iteratively selecting and running workflows and further analysing the data they produce. myGrid aids this process by providing the myGrid workbench, a client-side electronic lab-book through which users can perform their in silico experiments, as well as storing and organising their data. A typical work pattern of the scientist consists of selecting a piece of data stored locally and asking which workflows will accept inputs of that biological type. In the screenshot of Figure 2, the user has selected a piece of data, which is an Affymetrix Probe Set ID referring to candidate gene data in the Affymetrix database [Affymetrix], and asked to find a workflow that is capable of taking this data as input.

User requirements [SGG+03a] have identified that bioinformaticians also want to be involved in the process of choosing which experiments to run, and therefore the myGrid system does not offer fully automated workflow selection. Instead, the user is presented with a list of workflow scripts and invited to make the final selection. In Figure 2, two applicable workflows have been discovered and displayed in a list; when one is selected, its graphical depiction is shown to the right.

Figure 2: Selecting a workflow that takes an Affymetrix probe set ID as input. The user has selected a piece of data, which is an Affymetrix Probe Set ID referring to candidate gene data in the Affymetrix database [Affymetrix], and asked to find a workflow that is capable of taking this data as input.
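The data-driven selection described above can be pictured as a lookup that also accepts workflows whose input type is more general than the selected data. The sketch below walks a toy subclass chain; it is far simpler than reasoning over OWL descriptions with a description logic reasoner, and it makes several assumptions: only `Affymetrix_probe_set_id` is a term mentioned in this paper, while the parent concepts and the workflows other than CandidateGeneAnalysis are invented.

```python
# Toy subclass hierarchy: child -> parent. The parent concepts are
# invented for this sketch and are not taken from the myGrid ontology.
PARENT = {
    "Affymetrix_probe_set_id": "database_identifier",
    "database_identifier": "bioinformatics_data",
}


def ancestors(concept):
    """Return the concept followed by each of its more general concepts."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def accepting(workflows, data_type):
    """Workflows whose declared input type is data_type or a more general concept."""
    acceptable = set(ancestors(data_type))
    return sorted(name for name, input_type in workflows.items()
                  if input_type in acceptable)


workflows = {
    "CandidateGeneAnalysis": "Affymetrix_probe_set_id",
    "GenericIdLookup": "database_identifier",   # hypothetical workflow
    "ProteinFolding": "protein_sequence",       # hypothetical workflow
}
candidates = accepting(workflows, "Affymetrix_probe_set_id")
```

The function returns a list of candidates rather than a single answer, in keeping with the requirement that the final selection rests with the scientist.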

As well as this data-driven, context-sensitive method for discovering experiments, two more discovery modes are commonly used by the scientific community: in a task-oriented approach, scientists attempt to discover workflows capable of a given function, whereas in a result-driven approach, they look for workflows producing a given type of biological data. To this end, scientists need to be able to browse through published workflows, which have been categorised according to their inputs (data-driven), their functionality (task-oriented) and their outputs (result-driven). Figure 3 illustrates the user browsing available activities categorised by the type of input they take, the type of output they produce or the type of task they perform.

Figure 3: Browsing categorised workflows and services. As shown, the user can see two services/workflows available to do sequence alignment on a gene sequence, using the services BLASTn and BLASTx.[3]
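Browsing along three axes amounts to keeping three indexes over the same set of descriptions. The following sketch shows one way this could be organised; the `build_indexes` function and the dictionary layout are hypothetical, and the output type assigned to BLASTn and BLASTx is an assumption made only to fill in the example.

```python
from collections import defaultdict


def build_indexes(descriptions):
    """Group activity names by input type, output type and task."""
    indexes = {axis: defaultdict(list) for axis in ("input", "output", "task")}
    for d in descriptions:
        for axis in indexes:
            indexes[axis][d[axis]].append(d["name"])
    return indexes


descriptions = [
    {"name": "BLASTn", "input": "gene sequence",
     "output": "alignment report",   # assumed output type
     "task": "sequence alignment"},
    {"name": "BLASTx", "input": "gene sequence",
     "output": "alignment report",   # assumed output type
     "task": "sequence alignment"},
]

indexes = build_indexes(descriptions)
aligners = indexes["task"]["sequence alignment"]
```

Each activity appears once in every index, so the same entry can be reached whether the scientist starts from the data they hold, the result they want or the task they have in mind.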

3.3. Workflow Descriptions

Scientists require descriptions so they can judge which workflow is applicable amongst the many available. While the final decision remains with the scientists, we expect the system to help them by sorting workflows according to the multiple relevant criteria discussed above (inputs, outputs and functionality), and possibly by ranking them. Therefore, descriptions need to be easily processable by the computer.

Workflow descriptions can be produced by workflow authors, but they need not be. Indeed, our experience in myGrid shows that it is useful for a third party to be able to provide such descriptions. For example, a description that contains useful information about the quality, accuracy or trustworthiness of the results produced by an experiment should typically be provided by end users, rather than the workflow authors. Likewise, a reference ontology of the application domain may be revised after some experiments have been designed; it may then be useful for an ontology expert to refine semantic descriptions according to the revised ontology.

Therefore, in myGrid, we allow third-party users to generate workflow descriptions, and provide a separate tool to help users to construct such descriptions. The tool, called Pedro, is displayed in Figure 4, which illustrates its use to create descriptions pertaining to the CandidateGeneAnalysis workflow.

Figure 4: Screenshot of a workflow being annotated with semantic description using Pedro. The various components of the workflow that can be annotated with descriptions are displayed in the left hand panel. At a high level, the workflow can be annotated with the organisation that has produced it and with information about the type of biological data it takes as input and produces as output and the overall biological function it performs. The user has focused on a particular workflow sub-activity (named here WORKFLOWOPERATION) and is providing information about one of the inputs (called a PARAMETER) to that sub-activity, specifying a bioinformatics term, ‘Affymetrix_probe_set_id’, that refers to the type and origin of the data taken by the operation as input.

Although we wish descriptions to be easily processable by the computer, some descriptions may be solely aimed at helping users judge the applicability of a workflow and so can be written in free text. Figure 4 illustrates both forms of annotation. In “parameterDescription”, a free text description has been added to assist manual search and browsing of workflow descriptions. However, fields marked with an asterisk (“semanticType” and “transportDataType”) are populated with concepts from a controlled vocabulary. So, for example, Affymetrix_probe_set_id is a term in the myGrid bioinformatics ontology, which provides a controlled vocabulary for bioinformatics terms. Pedro has the ability to choose the controlled vocabulary that is applicable for each field of the annotation by focusing in on a particular region of an ontology. To aid the user in identifying the suitable terms of an ontology to select, the concepts of the bioinformatics ontology can be browsed, as illustrated by Figure 5.

Figure 5: Finding the ontology term for describing the workflow’s output in the myGrid ontology. The user has followed a classification of the ontology terms, and has found the term ‘EMBL_record_accession_number’, which represents an entry in the EMBL database for a gene, which should contain SNP data, and will be an output of the CandidateGeneAnalysis workflow.
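The distinction between free-text fields and fields restricted to a controlled vocabulary can be sketched as a simple validation step. This is not how Pedro is implemented; the `validate` function is hypothetical, the field names are those mentioned in the text, the two semantic type terms are the ones that appear in this section, and the transport data type vocabulary is assumed to be a set of MIME types.

```python
# Fields that must take their value from a controlled vocabulary.
CONTROLLED = {
    "semanticType": {"Affymetrix_probe_set_id", "EMBL_record_accession_number"},
    "transportDataType": {"text/plain"},  # assumed vocabulary of MIME types
}


def validate(annotation):
    """Return a list of problems; an empty list means the annotation is acceptable."""
    problems = []
    for field_name, vocabulary in CONTROLLED.items():
        value = annotation.get(field_name)
        if value not in vocabulary:
            problems.append(
                f"{field_name}: {value!r} is not in the controlled vocabulary")
    return problems


good = {
    "parameterDescription": "Identifier of the probe set to analyse",  # free text
    "semanticType": "Affymetrix_probe_set_id",
    "transportDataType": "text/plain",
}
bad = dict(good, semanticType="probe id")

good_problems = validate(good)
bad_problems = validate(bad)
```

The free-text field is deliberately left unchecked: it exists for human readers, whereas the controlled fields are the ones a discovery component can reason over.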

3.4. Run-time Discovery

We have found that users wish to be involved in making the final selection of workflows to be included in their scientific experiments. Therefore, all experiment-related workflows will be chosen at composition time, and we do not anticipate that any of these will be discovered at run-time, i.e. when experiments are being enacted.

On the other hand, there exist experimentally neutral workflows, which are composed of activities without any specific biological function ascribed to them (e.g. format conversions, pretty printers). Such workflows could be discoverable at run-time without involving the user. Likewise, multiple providers may host instances of the same service, and these should be automatically discoverable to make better use of resources that are available at run-time. Currently, we consider that discovery can only take place for workflows (and services) that have a functionality and a fully-defined interface identified at composition time. We shall discuss how these restrictions can be relaxed in Section 7.

Semantic Description, Publication and Discovery of Workflows in myGrid

Simon Miles1, Juri Papay1, Chris Wroe2, Phillip Lord2, Carole Goble2, Luc Moreau1

sm@ecs.soton.ac.uk; jp@ecs.soton.ac.uk; chris.wroe@cs.man.ac.uk; p.lord@russet.org.uk; carole@cs.man.ac.uk; L.Moreau@ecs.soton.ac.uk

1 School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK;

2 Department of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK

Abstract

The bioinformatics scientific process relies on in silico experiments, which are experiments executed in full in a computational environment. Scientists wish to encode the designs of these experiments as workflows because they provide minimal, declarative descriptions of the designs, which overcome many barriers to the sharing and re-use of the designs between scientists and enable the use of the most appropriate services available at any one time. We anticipate that the number of workflows will increase quickly as more scientists begin to make use of existing workflow construction tools to express their experiment designs, but discovery then becomes an increasingly hard problem, as it becomes more difficult for a scientist to identify the workflows relevant to their particular research goals amongst all those on offer. While many existing approaches exist to the publishing and discovery of services, there have been no attempts to address where and how authors of experimental designs should advertise the availability of their work or how relevant workflows can be discovered with minimal effort from the user. As the users designing and adapting experiments will not necessarily have a computer science background, we also have to consider how publishing and discovery can be achieved in such a way that they are not required to have detailed technical knowledge of workflow scripting languages. Furthermore, we believe they should be able to make use of others’ expert knowledge (the semantics) of the given scientific domain. In this paper, we define the issues related to the semantic description, publishing and discovery of workflows, and demonstrate how the architecture created by the myGrid project aids scientists in this process. 
We give a walk-through of how users can construct, publish, annotate, discover and enact workflows via the user interfaces of the myGrid architecture; we then describe novel middleware protocols, making use of the Semantic Web technologies RDF and OWL to support workflow publishing and discovery.

1. Introduction

Traditionally, the biological scientific process has involved experiments on living systems, in vivo, or on parts of a living system in a test tube, in vitro. Bioinformatics has focused on supporting the experimental biologist by enabling many more experiments to be carried out in silico, that is computationally. As this better supports automation and also harnesses the collective knowledge of the discipline, in silico biological experiments have greatly enabled the process of validating hypotheses, or gathering additional information to shape the design of future experiments. For science to become efficient, experiments need to be shared, adapted and reused; distributed architectures on the Internet promise to be the most effective mechanism to achieve this goal for in silico experiments.

Both the Web Services and Grid architectures [WSArch, OGSA] have adopted a service-oriented approach, in which computational resources, storage resources, programs and databases can be represented by services. In such a context, a service is a network-enabled entity capable of encapsulating diverse implementations behind a common interface. The benefit of such a uniform view is that it facilitates the composition of services into more sophisticated services, thereby promoting sharing and reuse of resources in distributed environments. To this end, a number of workflow languages have emerged which are capable of describing complex compositions of services, e.g. WSFL [WSFL], XLANG [XLANG], BPEL [BPEL], XScufl [SCUFL].

However, service-oriented architectures currently provide no mechanism to facilitate the sharing of workflows. At present, workflow authors simply make a list of their workflows available via a Web page. With an increasing number of workflows, and of sites listing them, searching for them in this way will soon become untenable.

DAML-S [DAMLS] considers workflows as largely equivalent to services with regards to publishing and discovery because they are functional entities that are identified by their interface (inputs and outputs) and overall function. Therefore, as with services, workflows need to be published in order to be discovered and reused. However, publishing a workflow involves two distinct steps: first, the workflow script must be archived in a repository from which it can be publicly retrieved; next, a description of and a reference (e.g. a URI) to its script need to be advertised in a registry. In this context, a registry is defined as a service holding descriptions of workflows and services. Many protocols for publishing service descriptions, including de-facto standards, such as UDDI [UDDI], Jini [Jini], and BioMoby [WL02a] do not, in themselves, address publication and discovery of workflows. DAML-S, on the other hand, is an ontology capable of describing complex processes, but is not a registry system for publishing and discovery.
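The two publication steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the myGrid protocol: the `Repository` and `Registry` classes, the `example.org` base URI, the file name and the keyword-matching `find` method are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Archives workflow scripts and returns a URI for later retrieval."""
    base_uri: str  # hypothetical base URI
    scripts: dict = field(default_factory=dict)

    def archive(self, name: str, script: str) -> str:
        uri = f"{self.base_uri}/{name}"
        self.scripts[uri] = script
        return uri


@dataclass
class Registry:
    """Holds descriptions, each paired with a reference to an archived script."""
    entries: list = field(default_factory=list)

    def advertise(self, description: str, script_uri: str) -> None:
        self.entries.append({"description": description, "script_uri": script_uri})

    def find(self, keyword: str) -> list:
        # Naive keyword match, standing in for richer discovery mechanisms.
        return [e for e in self.entries
                if keyword.lower() in e["description"].lower()]


def publish(repository, registry, name, script, description):
    script_uri = repository.archive(name, script)  # step 1: archive the script
    registry.advertise(description, script_uri)    # step 2: advertise description and URI
    return script_uri


repository = Repository("http://example.org/workflows")
registry = Registry()
uri = publish(repository, registry, "snp-lookup.xml", "<workflow/>",
              "Retrieves SNP records for a gene sequence")
```

Keeping the repository and the registry separate mirrors the distinction drawn in the text: the registry stores only the description and a reference, so the script itself can be hosted anywhere that is publicly retrievable.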

Once a published workflow has been discovered, scientists use their expert knowledge of the scientific field to judge whether a design is applicable to their own work. Unfortunately, such domain-specific knowledge is not readily available from workflow scripts, which are engineered in terms of programmatic notions such as interfaces, ports, operations and messages of the service-oriented architecture in use [WSDL]; furthermore, domain knowledge cannot be inferred from these low-level notions. However, semantic descriptions can be added to workflows, in order to make high-level knowledge explicit; these must be machine interpretable if tools are to be capable of recommending the applicability of workflows based on the domain-specific knowledge of a scientist.

myGrid [myGrid] is a pilot project funded by the UK e-Science programme to develop Grid middleware in a biological sciences context. The goal of the myGrid project is to develop a software infrastructure that provides support for bioinformaticians in the design and execution of workflow-based in silico experiments using the resources of the Grid. In silico experiments can operate over the Grid, in which resources are geographically distributed and managed by multiple institutions, and the necessary tools, services and workflows are discovered and invoked dynamically. It is a data-intensive Grid, where the complexity is in the data itself, the number of repositories and tools that need to be involved in the computations, and the heterogeneity of the data, operations and tools. The myGrid architecture includes components for composing workflows, annotating them with semantic descriptions, publishing semantically described workflows, reasoning over semantic descriptions, discovering workflows from semantic queries and executing them. In previous papers, we have discussed various facets of our approach to service publication and discovery, namely its preliminary design [LWS+03b], its protocol for annotating service descriptions [MPP+04] and its performance [MLM+04, MPD+03a]. The purpose of this paper is to discuss the final design of our architecture for workflow publication and discovery, and its implementation and integration in an electronic lab-book, for manipulation by the scientist. Specifically, this paper focuses on the following technical contributions of the myGrid architecture for publishing and discovering workflow-based in silico experiments:

A definition of the protocol used to publish, annotate and discover workflows in a registry. The protocol is independent of the actual language used to encode workflows. To this end, it relies on a notion of workflow skeleton which identifies, in an extensible manner, the salient features of a workflow that can be described semantically.

The use of RDF (the W3C Resource Description Framework) [RDF], which underpins the Semantic Web effort [BHL01a], as the underlying representation to express service and workflow descriptions and to facilitate the attachment of metadata to them. Besides being a flexible and powerful representation formalism, RDF provides for uniform graph-based querying using RDQL [RDQL], which is used in our registry to support workflow discovery.

The use of OWL ontologies to encode domain-specific knowledge and to allow the inferences required by the discovery process. Specifically, ontologies are used to index workflows according to their functionality and the semantic types of their inputs and outputs, expressed as biological concepts. A semantic find component, which uses a description logic reasoner, provides complete reasoning over the rich OWL-based descriptions of workflows, and facilitates discovery with complex queries over these descriptions.

A complete implementation of the architecture, organised as a set of Web Services and associated user interfaces, all available for download from http://www.myGrid.org.uk/myGrid/web/download/

Section 2 presents an illustrative bioinformatics case study, including a representative workflow, to aid in describing and demonstrating the usefulness of our work. Section 3 shows the users’ perspective in sharing workflows from composition through publishing and description to discovery. In Section 4, we examine the use of semantic technologies to represent the knowledge used for discovery, and in Section 5 we define the protocols used in myGrid to process this information. The implementation of the middleware using this protocol is given in Section 6, the scope of our work and related work is discussed in Section 7 and we draw conclusions and suggest further work in Section 8.

2. Case Study

For clarity of explanation, we now present a case study to illustrate our approach to semantic description and discovery of workflows. It is a part of a myGrid exemplar bioinformatics application, namely the Graves' Disease application, intended to help the investigation of a thyroid disorder [Graves]. Specifically, the purpose of the application is to help biologists identify gene mutations that may be involved in causing the condition.

The Graves’ Disease scenario uses a well known and common "candidate gene" approach. We assume that previous biological investigations have been used to isolate a region on the genome in which genes affecting Graves’ Disease may lie. By looking through this region for variations between Graves’ Disease and normal patients, then determining whether these variations lie within a gene, a number of candidate genes can be found. One of the most common variations is called a Single Nucleotide Polymorphism (SNP), which is a variation involving only a single nucleotide, rather than a large scale change affecting many nucleotides. However, many of these mutations often occur in a region, and most of them will not be related to Graves’ Disease. The in silico process consists of gathering information from several publicly available data resources describing the current state of knowledge about the genes in question. Once such information has been obtained for a set of candidate genes, the scientist can design an in vitro experiment that will test their likelihood of being involved in the disease.

To enable re-running of the experiment and best use of Grid resources, the experiment is encoded as a workflow, which we refer to as CandidateGeneAnalysis. This workflow takes a "probe set ID" referring to a gene sequence in the Affymetrix database [Affymetrix] as input and ultimately returns a record from the EMBL database containing information about the sequence including SNPs. The workflow’s structure can be seen on the left hand side of Figure 1. The specific details can be found in its script available at http://www.ecs.soton.ac.uk/~sm/myGrid/AffyIdToEmblSnps.scufl, encoded in the SCUFL workflow language. The SCUFL workflow language, developed as part of myGrid, simplifies the process of creating workflows for biology by making the language granularity and concepts as close as possible to those that a potential user of the system would already be using [SCUFL].

Since this experiment is more widely applicable than just for the study of Graves’ Disease, the biologists may wish to share it with others, and would want to do so in such a way that it can usefully be discovered and re-used. User requirements [SGG+03a] have identified some questions that scientists commonly ask about such kinds of experiment. Specifically, since they aim to discover SNPs from gene sequence data, they will seek experiments that:

process a given sort of data (e.g. genes),

provide extra information on given data (e.g. gene data),

provide a given type of output (e.g. SNPs),

perform searches over named public databases.

Since experiments are represented as workflows, and workflows are characterised by their inputs, their outputs and their function, a user will specifically seek published workflows that:

have a given semantic type (e.g. gene data) as one of their inputs,

perform a given type of function (e.g. to provide extra information on a gene),

have a specific semantic type (e.g. SNPs) as one of their outputs,

use certain services (e.g. named public genetic information databases).
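The four criteria listed above can be read as optional filters over workflow descriptions. The sketch below shows that reading only; the `WorkflowDescription` fields, the `discover` function and the sample entries (including the function labels and the second workflow) are assumptions made for illustration and are not taken from the myGrid registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowDescription:
    name: str
    inputs: frozenset    # semantic types accepted as input
    outputs: frozenset   # semantic types produced as output
    function: str        # the task the workflow performs
    services: frozenset  # named services or databases it uses


def discover(descriptions, input_type=None, output_type=None,
             function=None, service=None):
    """Return the names of descriptions satisfying every criterion that is given."""
    matches = []
    for d in descriptions:
        if input_type is not None and input_type not in d.inputs:
            continue
        if output_type is not None and output_type not in d.outputs:
            continue
        if function is not None and function != d.function:
            continue
        if service is not None and service not in d.services:
            continue
        matches.append(d.name)
    return matches


# Illustrative entries; the labels are hypothetical.
published = [
    WorkflowDescription("CandidateGeneAnalysis",
                        frozenset({"gene_data"}), frozenset({"SNPs"}),
                        "annotate_gene", frozenset({"EMBL"})),
    WorkflowDescription("FormatConversion",
                        frozenset({"sequence"}), frozenset({"sequence"}),
                        "convert_format", frozenset()),
]
```

For instance, `discover(published, input_type="gene_data", output_type="SNPs")` would select only the first entry, while a call with no criteria returns every published workflow.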

Next, we examine how users go about publishing workflows, so that such questions can be answered by the architecture to support the discovery process.

3. The Users’ Perspective

The purpose of this section is twofold. On the one hand, we illustrate our approach to workflow publication and discovery, using snapshots of the graphical user interface that the scientist is presented with when using the myGrid system; the functionality of this interface was derived from the user requirements we captured at the beginning of the myGrid project [SGG+03a]. On the other hand, we identify key technical requirements for the knowledge representation that is required to support our approach. With the user-centric perspective adopted by myGrid, we analyse the kinds of discovery that scientists are confronted with: when composing workflows and when deciding which scientific experiments to run. In order to be discovered, workflows need to have been published, and we examine how suitable semantic descriptions of these workflows can be made available to the system.

3.1. Construction-time Discovery

Designing a workflow means linking together functional entities such as Web Services or other workflows, which will be referred to as activities, so that the outputs of some are used as the inputs of others; a workflow is thus constructed from sub-activities that pass data between each other. Figure 1 shows the myGrid graphical workflow construction tool Taverna [Taverna].

Figure 1. Workflow construction using Taverna. The left-hand panel contains a depiction of the workflow itself with each box representing an activity in the workflow; when the workflow is enacted, this activity results in a Web Service operation call or the invocation of another workflow. Data flows from the inputs, represented by inverted triangles, through the linked services to the output triangle at the bottom of the workflow. The ‘Scufl Model Explorer’ panel shows a hierarchical view of the workflow and ‘Enactor launch’ relates to test runs of the workflow.

Workflows may not be created from scratch: they can be adapted and personalised from previously written workflows. As part of personalisation, the workflow’s author needs to discover existing activities (workflows and services) so as to include them in their design. Hence, since both services and workflows need to be discovered, both are listed in the ‘Available services’ panel of Figure 1. Crucially, user requirements have identified that biologists require activities to be discoverable by the function they perform, and that final selection ultimately rests with the scientist, who will select those to be included according to the goal of the experiment they are designing. To this end, scientists need to be able to draw on a wide range of information about activities in order to inform their decisions. Specifically, the following workflow descriptions[4] are used: the workflow’s author and their institution, the function of the workflow, the sub-activities it may invoke (and their function), and the inputs and outputs of the workflow expressed in biological terms. It is interesting to note that when the scientist considers a workflow for insertion in an experiment, they do not regard the workflow as a black-box, because they want to know about the activities it is composed of, though the fine details of their dependencies, control and data flows do not matter at this stage.

Selecting activities for the functions they perform helps guarantee that the overall experiment has the intended behaviour. However, further care is needed to ensure that the composition is operationally consistent at the transport level: data types and formats of outputs must be compatible with the inputs they feed into. In order to verify such constraints, service interfaces [WSDL] and an equivalent concept for workflows need to be made available to the scientists, who will make sure that all data are suitably converted to ensure a coherent composition. In Taverna, the scientist is made aware of the incompatibility of data types and formats (encoded as MIME types) by being allowed to make links only between the output of one activity and the input of another with the same type. To that end, Taverna relies on the WSDL interface files of services and workflows, the details of which are hidden from the scientists by the user interface.
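The linking rule just described, that an output may only feed an input of the same MIME type, can be sketched as below. This is not Taverna's code: the `Port` class, the `add_link` helper, and the activity names and MIME types are hypothetical and chosen only to show the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    activity: str   # hypothetical activity name
    name: str
    mime_type: str  # data type and format, encoded as a MIME type


def can_link(output: Port, input_: Port) -> bool:
    """A link is allowed only when both ports carry the same MIME type."""
    return output.mime_type == input_.mime_type


def add_link(links: list, output: Port, input_: Port) -> None:
    if not can_link(output, input_):
        raise ValueError(
            f"{output.activity}.{output.name} ({output.mime_type}) cannot feed "
            f"{input_.activity}.{input_.name} ({input_.mime_type})")
    links.append((output, input_))


seq_out = Port("fetch_sequence", "sequence", "text/plain")
seq_in = Port("align", "query", "text/plain")
xml_in = Port("render_report", "record", "text/xml")

links = []
add_link(links, seq_out, seq_in)  # same type: the link is accepted
```

Attempting `add_link(links, seq_out, xml_in)` raises `ValueError`, which corresponds to the point at which the scientist would need to insert a conversion step.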

3.2. Experiment-time Workflow Discovery

Scientists undertake their research by iteratively selecting and running workflows and further analysing the data they produce. myGrid aids this process by providing the myGrid workbench, a client side electronic lab-book through which users can perform their in silico experiments, as well as storing and organising their data. A typical work pattern of the scientist consists of selecting a piece of data stored locally and asking which workflows will accept inputs with such biological type. In the screenshot of Figure 2, the user has selected a piece of data, which is an Affymetrix Probe Set ID referring to candidate gene data in the Affymetrix database [Affymetrix], and asked to find a workflow that is capable of taking this data as input.

User requirements [SGG+03a] have identified that bioinformaticians also want to be involved in the process of choosing which experiments to run, and therefore, the myGrid system does not offer fully automated workflow selection. Instead, the user is presented with a list of workflow scripts and invited to make the final selection. In Figure 2, two applicable workflows have been discovered and displayed in a list with the workflow graphical depiction shown to the right, on selection.

Figure 2: Selecting a workflow that takes an Affymetrix probe set ID as input. The user has selected a piece of data, which is an Affymetrix Probe Set ID referring to candidate gene data in the Affymetrix database [Affymetrix], and asked to find a workflow that is capable of taking this data as input.

Besides this data-driven method for discovering experiments for execution, two more discovery modes are commonly used by the scientific community: in a goal-driven approach, scientists attempt to discover workflows capable of a given function, whereas in a result-driven approach, they are looking for workflows producing a given type of biological data. To this end, scientists need to be able to browse through published workflows, which have been categorised according to their inputs, their outputs and their functionality. In Figure 3, the user browses available services and workflows by the type of input they take, the type of output they produce or the type of task they perform.

Figure 3: Browsing categorised workflows and services. As shown, the user can see two services/workflows available to do sequence alignment on a gene sequence, using the services BLASTn and BLASTx.[5]

3.3. Workflow Descriptions

Scientists require descriptions so they can judge which workflow is applicable amongst the many available. While the final decision remains with the scientists, we expect the system to help them by sorting workflows according to the multiple relevant criteria discussed above, e.g. input and output types, functions, and possibly to rank them. Therefore, descriptions need to be easily processable by the computer.

Workflow descriptions can be produced by workflow authors, but they need not be. Indeed, our experience in myGrid shows that it is useful for a third party to be able to provide such descriptions. For example, a description that contains useful information about the quality, accuracy or trustworthiness of the results produced by an experiment should typically be provided by end users, rather than the workflow authors. Likewise, a reference ontology of the application domain may be revised after some experiments have been designed; it may then be useful for an ontology expert to refine semantic descriptions according to the revised ontology.

Therefore, in myGrid, we allow third-party users to generate workflow descriptions, and provide a separate tool to help users to construct such descriptions. The tool, called Pedro, is displayed in Figure 4, which illustrates its use to create descriptions pertaining to the CandidateGeneAnalysis workflow.

Figure 4: Screenshot of a workflow being annotated with semantic description using Pedro. The various components of the workflow that can be annotated with descriptions are displayed in the left hand panel. At a high level, the workflow can be annotated with the organisation that has produced it and with information about the type of biological data it takes as input and produces as output and the overall biological function it performs. The user has focused on a particular workflow sub-activity (named here WORKFLOWOPERATION) and is providing information about one of the inputs (called a PARAMETER) to that sub-activity, specifying a bioinformatics term, ‘Affymetrix_probe_set_id’, that refers to the type and origin of the data taken by the operation as input.

Although we wish descriptions to be easily processable by the computer, some descriptions may be solely aimed at helping users judge the applicability of a workflow and so can be written in free text. Figure 4 illustrates both forms of annotation. In “parameterDescription”, a free text description has been added to assist manual search and browsing of workflow descriptions. However, fields marked with an asterisk (“semanticType” and “transportDataType”) are populated with concepts from a controlled vocabulary. So, for example, Affymetrix_probe_set_id is a term in the myGrid bioinformatics ontology, which provides a controlled vocabulary for bioinformatics terms. Pedro has the ability to choose the controlled vocabulary that is applicable for each field of the annotation by focusing in on a particular region of an ontology. To aid the user in identifying the suitable terms of an ontology to select, the concepts of the bioinformatics ontology can be browsed, as illustrated by Figure 5.

Figure 5: Finding the ontology term for describing the workflow’s output in the myGrid ontology. The user has followed a classification of the ontology terms, and has found the term ‘EMBL_record_accession_number’ which represents an entry in EMBL database for a gene, which should contain SNP data, and will be an output of the CandidateGeneAnalysis workflow.

3.5. Requirements for Discovery and Publication

The previous discussion has shown that workflows and services share many common requirements in terms of discovery. During the composition phase, they are nearly indistinguishable, except for the fact that workflows capture a scientific process, and therefore need to expose some of their internal activities to support the scientist’s judgement. As far as discovery of experimental workflows is concerned, full automation is not desirable, because scientists care about the process design.

In order to support the discovery process we have described, the descriptions associated with workflows will have to contain the following semantic-rich information.

The authors of a workflow and their institution;

The function that a workflow performs expressed in biological terms;

The type of biological data that it takes as input and/or produces as output, and their relationship with the transport data types and formats expected and produced by the workflow;

The activities that a workflow is composed of (and their respective descriptions).

These descriptions will have to be:

Computer processable so that the system can present the user with relevant choices;

Based on ontologies so that suitable classifications can be shown to users;

Produceable by authors and third-party users.

Different kinds of descriptions will be used at different times. Descriptions pertaining to functionality and sub-activities will only be used at composition and experiment selection time, with scientist involvement, whereas interfaces (with transport types and formats) will also be used at runtime.

4. Metadata for workflow discovery

The previous discussion has shown that workflows and services share many common requirements in terms of discovery. During the composition phase, they are nearly indistinguishable, except for the fact that workflows capture a scientific process, and therefore need to expose some of their internal activities to support the scientist’s judgement. Fully automatic discovery of potential workflows is undesirable; this would be equivalent to automating scientific investigations and would rob the scientist of the essential control of their own experiment. Examples of experimentally neutral workflows are comparatively rare and confined to sub-workflows such as format transformations or data cleaning [WGG+03a].

To support the discovery process a range of descriptions are associated with a workflow. These descriptions should be:

  • Produceable by authors and third-party users;
  • Computer processable so that the system can present the user with relevant choices;
  • Extensible;
  • Based on ontologies so that suitable classifications can be shown to users.

Following this set of requirements, we introduce the notion of a workflow executive summary, which captures the aspects that can facilitate the discovery of a workflow. Specifically, the executive summary includes the following descriptions:

  • The overall functional task (or tasks if there is more than one interpretational viewpoint) that a workflow performs expressed in biological terms;
  • The type of data that it takes as input and/or produces as output;
  • The activities that a workflow is composed of (and their respective descriptions);
  • Factual information about the workflow, such as name, organisation producing it, and location.
  • Factual information about the provenance of the workflow, such as the authors, and its creation and update history.

For completeness, we note that the workflow executive summary should be differentiated from operational descriptions, which contain information about workflow execution, such as cost, quality of service and access rights.

Figure 7 shows the three categories of descriptions commonly used when making a choice: those catering for the executive summary of the workflow, those covering general metadata about the operational context of the workflow as a whole, and those covering the metadata about the provenance of the workflow as a whole (we do not discuss the provenance metadata further in this paper).

Operational descriptions include:

  • Provenance records such as the authors, creation date, update history and institution body;
  • Operational records such as the cost, quality of service, usage statistics and access rights.

The executive summary requires descriptions at three levels of abstraction (Table 1):

  • A mandatory interface description and workflow script URI that specify how to enact the workflow, and express the transport data types and formats expected and produced by it;
  • Optional syntactic descriptions, which might include MIME types of the input and output data, expressing the format in which data is encoded, and words that describe the task and components that might be used by a search engine;
  • Optional conceptual descriptions that enable users to discover services based on their knowledge of the specific domain, in this case bioinformatics. We use a controlled vocabulary of terms to describe the biological data types, functions and component resources.

The development of controlled vocabularies and the annotation of workflows with them at publication time are both labour intensive activities. We do not wish to preclude those registered workflows that do not enjoy these descriptions, and so we make them optional, with the commensurate diminished functionality that attends such an omission.
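The effect of making the syntactic and conceptual levels optional can be sketched as follows: every registered workflow remains enactable, but only those that carry a given level of description are visible to searches at that level. The `RegisteredWorkflow` class, its field names, the two search functions and the `example.org` URIs are assumptions made for this illustration, not part of the myGrid information model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegisteredWorkflow:
    script_uri: str                             # mandatory: where the script lives
    interface_uri: str                          # mandatory: how to enact it
    keywords: frozenset = frozenset()           # optional syntactic description
    input_concepts: Optional[frozenset] = None  # optional conceptual description


def find_by_concept(workflows, concept):
    """Conceptual search: only workflows annotated with ontology terms can match."""
    return [w.script_uri for w in workflows
            if w.input_concepts is not None and concept in w.input_concepts]


def find_by_keyword(workflows, word):
    """Syntactic search: matches lower-case keywords, if any were supplied."""
    return [w.script_uri for w in workflows if word.lower() in w.keywords]


# Hypothetical registry contents with decreasing levels of description.
workflows = [
    RegisteredWorkflow("http://example.org/wf/annotated", "http://example.org/wf/annotated.wsdl",
                       frozenset({"snp", "gene"}), frozenset({"Affymetrix_probe_set_id"})),
    RegisteredWorkflow("http://example.org/wf/keywords-only", "http://example.org/wf/keywords-only.wsdl",
                       frozenset({"gene"})),
    RegisteredWorkflow("http://example.org/wf/bare", "http://example.org/wf/bare.wsdl"),
]
```

Here a conceptual query for `Affymetrix_probe_set_id` finds only the fully annotated workflow, a keyword query for `gene` finds the first two, and the bare entry can be reached only by someone who already knows its URI, which is the diminished functionality referred to above.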

The previous section has identified the need for a controlled vocabulary of terms that describe biological data types and workflow functions. In combination with interface information, this rich descriptive framework is intended to achieve various discovery capabilities at different times (composition, experiment selection, and run time). Interface and syntactic descriptions are used at run time; semantic and syntactic descriptions at the point of composition and experimental selection; operational descriptions at all times [WGG+03a].

To represent the knowledge embodied in the descriptions, we have adopted a hybrid approach, combining two Semantic Web technologies, namely OWL and RDF, the use of which we present in this section.

Figure 7: The metadata associated with a registered workflow, giving their knowledge representational forms (RDF, OWL, WSDL), all of which bottom out in an RDF store, for which we use the Jena toolkit [Jena]. Broken lines indicate optional metadata; shadows indicate multiple metadata entries are possible.

Table 1: Kinds of metadata

4.1. Representing Semantic Metadata: OWL

The representation of conceptual metadata requires encoding a large body of domain knowledge, with a large and highly interconnected set of terms. There has been a significant amount of work on using ontologies to describe Web Services to enable their discovery and composition [DAMLS]. Although DAML-S aims to address the semantic encoding of invocation and execution monitoring of services and service compositions, the use of semantics in myGrid has focused primarily on service and workflow discovery.

Within myGrid, we built a suite of ontologies, describing biology, bioinformatics, web services and workflows [WSG+03a]. We based the workflow ontology on the service profile from DAML-S [DAMLS], the domain ontologies on various de facto community standard ontologies such as the Gene Ontology [GO00] and TAMBIS [BGB+99], and models of publication and organisation based on the AKT ontology [AKT].

The OWL Web Ontology Language has recently emerged as the W3C Proposed Recommendation for representing ontologies [OWL]. The majority of work in Semantic Web Services has used either OWL or its predecessor, DAML+OIL, and we fall in line with this practice, not only because it is an exchange standard, but because the use of OWL provides us with a number of advantages.

One of the main advantages of OWL is that its underlying formal semantics enables reasoners to classify descriptions based on the properties of those descriptions. This provides computational support to enable the building of complex ontologies of the domain. Additionally, when applied to workflow discovery, we can automatically classify and discover workflows described in terms of a domain and workflow ontology. As such, it provides a good fit for the user requirements for enabling discovery.

In the myGrid ontologies [WSG+03a], we have used the service profile from DAML-S and can now describe services semantically in terms of the domain, using tools such as Pedro (Figure 4) to support these descriptions.

Describing workflows and services in OWL in this way lets us form queries for them in terms of their properties. For example, the query below describes a service in terms of the task that it performs. Equally, we can express queries for workflows and services by each of their executive summary components, such as their input or output types or the resources that they use. Queries of this form can be presented naturally in a browsable interface, as shown in Figure 3. This interface takes advantage of the simple expressive capabilities of OWL, in that workflows and services will classify under many different parents; for instance, most of the services shown as performing “aligning” will also classify under “local_aligning”, the latter being a specialisation of the former. The reasoning capabilities of OWL mean that we are not required to pre-enumerate at design time all of the possible workflow classifications, but can generate new ones “on the fly”, or even change the classifications of services as we change our ideas about the domain.

intersectionOf(
    myGrid_bioinformatics_primitive_service_operation
    restriction(performs_task someValuesFrom(aligning))
)

In many cases, this use of reasoning to classify services is sufficient. The multi-axial classifications shown in Figure 3 actually present a large number of different workflow and service classifications, which narrow the choices to a point where users can select for themselves those that they require. However, by using OWL, we can also exploit the full expressiveness of the language to build highly complex queries, which we can use to enable more automated service selection.
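As an informal illustration of the classification behaviour described above, the following Python sketch mimics subsumption over a tiny hand-written task hierarchy. The hierarchy fragment, the service names and the `classify` helper are assumptions made for this example only; in myGrid this work is delegated to an OWL reasoner rather than to hand-written code.

```python
# Toy task hierarchy: child -> parent (hypothetical fragment, not the myGrid ontology).
TASK_PARENT = {
    "local_aligning": "aligning",
    "global_aligning": "aligning",
    "aligning": "task",
    "retrieving": "task",
}

# Hypothetical service descriptions: service name -> the task it performs.
SERVICES = {
    "blast_service": "local_aligning",
    "needle_service": "global_aligning",
    "fetch_service": "retrieving",
}


def ancestors(task):
    """Return the task itself followed by all of its ancestors."""
    chain = [task]
    while task in TASK_PARENT:
        task = TASK_PARENT[task]
        chain.append(task)
    return chain


def classify(query_task):
    """Return services whose task is the query task or a specialisation of it."""
    return sorted(
        name for name, task in SERVICES.items() if query_task in ancestors(task)
    )


if __name__ == "__main__":
    # Loosely corresponds to restriction(performs_task someValuesFrom(aligning)).
    print(classify("aligning"))
    print(classify("local_aligning"))
```

Because membership is computed from the hierarchy at query time, adding a new task concept changes the classification without re-describing any service, which is the property the text attributes to reasoning "on the fly".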

Although the use of OWL provides considerable advantages, it also brings some difficulties. The use of reasoning technology can complicate the architecture required to support it. Furthermore, while OWL can be used to present relatively straightforward interfaces for the selection of workflows, it comes with an upfront cost, namely that of producing a large domain ontology and then describing the workflows in terms of that ontology. At the current time this cost is considerable, although it is hoped that it will lessen as tools such as Pedro develop further. For these reasons, we would expect that the primary use of OWL-based service or workflow descriptions will be in a curated set of service, workflow or third-party descriptions, for use within a system such as myGrid, rather than as a general tool for describing Web Services. As a result, within myGrid, we also provide support for workflow discovery based on other description technologies, as detailed in the next section.

4.2. The Use of RDF

Our workflow descriptions have to draw on and seamlessly integrate multiple existing information models, namely WSDL, the DAML-S profile and UDDI, and have to support metadata attachment, as we now explain. (i) Interfaces have been identified as useful information in the discovery process. As we focus on Web Services, we adopt WSDL as the interface language for services, and we propose to use the same language to define the interface of a workflow, composed of its inputs and outputs. (ii) Semantic augmentation by authors and third parties requires a mechanism by which additional semantic descriptions can be attached to existing workflow descriptions; hence, our information model requires support for metadata. (iii) Furthermore, the semantic functionality of a workflow will be structured according to the DAML-S profile, using OWL and the myGrid ontologies, as discussed in Section 4.1. (iv) Additionally, we have identified that runtime discovery could take place for workflows and services for which an interface and functionality have been identified at composition time. The de-facto standard for Web Service discovery is UDDI; adopting the UDDI information model will help us preserve compatibility with existing systems (such as enactment engines).

Therefore, we have adopted RDF [RDF] as the representation formalism to express such complex service descriptions. RDF is a very flexible language in which relations are described between resources, in the form of triples. A triple associates one resource, the subject, to another, the object, by a relation labelled with a specific property. Our reasons for using RDF are based on the technical requirements of publishing and discovery.

  • RDF can store arbitrarily structured metadata, including semantic descriptions that refer to terms in an ontology; it provides a uniform language in which to express multiple information models (UDDI, WSDL, DAML-S).
  • RDF is naturally designed to express the attachment of metadata to existing concepts of a workflow description; such a capability is ideally suited to our semantic augmentations.
  • Once all the information is expressed uniformly in RDF, it can be searched uniformly (both for data and metadata) using graph-based queries, which can easily be expressed in languages such as RDQL.
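To make the triple model and graph-based querying concrete, here is a minimal Python sketch of a triple list and a pattern matcher in the spirit of an RDQL query. It is only an illustration: the matcher is hand-written rather than an RDQL engine, the resource identifiers (`wf1`, `part1`, `svc2`) and the `registry:hasInput` property are invented for the example, and the remaining property names are borrowed loosely from the registry representation shown later in the paper.

```python
# Triples are (subject, property, object); identifiers are illustrative only.
TRIPLES = [
    ("wf1", "rdf:type", "uddiv2:BusinessService"),
    ("wf1", "uddiv2:hasName", "CandidateGeneAnalysis"),
    ("wf1", "registry:hasInput", "part1"),  # hypothetical property
    ("part1", "biodata:semanticType", "biodata:Affymetrix_probe_set_id"),
    ("svc2", "rdf:type", "uddiv2:BusinessService"),
    ("svc2", "uddiv2:hasName", "OtherService"),
]


def is_var(term):
    """Variables are written with a leading question mark, as in RDQL."""
    return term.startswith("?")


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for p_term, t_term in zip(first, triple):
            if is_var(p_term):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(p_term, t_term) != t_term:
                    ok = False
                    break
            elif p_term != t_term:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)


# "Find the names of entries that take an Affymetrix probe set id as input."
QUERY = [
    ("?w", "registry:hasInput", "?p"),
    ("?p", "biodata:semanticType", "biodata:Affymetrix_probe_set_id"),
    ("?w", "uddiv2:hasName", "?name"),
]

if __name__ == "__main__":
    for result in match(QUERY, TRIPLES):
        print(result["?name"])
```

The same query mechanism covers both data (the name) and metadata (the semantic type), which is the uniformity argued for in the last bullet above.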

4.3. Benefit of the Hybrid Approach

In summary, myGrid has adopted a hybrid approach for its knowledge representation. RDF is the underpinning format in which all workflow (and service) descriptions are encoded. This is an extensible format, which provides us with a powerful graph-based query capability using RDQL. The rest of the paper will discuss how this RDF-based information model is used in a registry that holds all workflow descriptions, and which provides us with the efficient query capabilities necessary for run-time discovery. Within the registry, some of the workflow skeleton metadata and some of the operational metadata will contain semantic descriptions referring to OWL concepts. Semantic reasoning will be undertaken by a semantic find component, which will deal with the semantic-rich discovery process taking place at construction and experiment selection time.

5. myGrid Protocols for Publishing and Discovery

The myGrid architecture defines protocols for publishing the semantic descriptions of workflows, and for performing discovery based on those descriptions. The two principal components involved in publication and discovery are the registry, which holds the advertisements for workflows, and the semantic find component, which aids discovery of workflows by matching semantic queries against the semantic descriptions in the registry. In order to aid the publishing process, we introduce a new abstraction, a workflow skeleton, which is used in the publishing protocol.

5.1. Workflow Skeleton

The starting point for advertising a workflow is the authoring of semantic descriptions, as described in Section 3.2, and this requires the author of the workflow description to know what components of the workflow they can describe and how. A key requirement of our architecture is that it must support multiple workflow languages, or versions of them, because the myGrid SCUFL workflow language is still evolving; ultimately, this should help to ensure that the architecture is future-proof. So, we have introduced the notion of a workflow skeleton as an abstraction of a workflow, independent of any particular scripting language.

At authoring time, usually within the Pedro tool, this skeleton is represented using an XML schema, which is shown in Figure 6.

At an abstract level, a workflow can be described by some simple information such as its name or the organisation which owns it. Each sub-activity that takes place in a workflow can have further description, including the task it performs, the resources it uses and further details on its inputs and outputs (as for the workflow as a whole). As identified in the technical requirements, for each input and output of a workflow, we provide three forms of type information: the syntactic type of a parameter gives the structure of the data, typically an XSD type, used by the SOAP transport layer (encoded as ‘transportDataType’ in the workflow skeleton); the MIME type gives technical information about the data used for display (encoded as ‘format’); the semantic type gives a conceptual description of the type of data as described in Section 4.1 (encoded as ‘semanticType’). Figure 6 summarises the contents of the workflow skeleton, in which a sub-activity is called a WORKFLOWOPERATION.

Figure 6: Contents of the workflow skeleton. The workflow skeleton entities that can be annotated by semantic descriptions are shown in the left hand panel of Pedro in Figure 4, and are the same as those in this figure. The syntactic type, typically an XSD type used by the SOAP transport layer, is encoded as “transportDataType” in the skeleton, while the semantic type is encoded as an arbitrary OWL concept in “semanticType”. Finally, the MIME type is represented in “format”. For clarity we have omitted the workflow provenance metadata from this figure.
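The skeleton contents described in this section can be sketched as a small set of Python data classes. This is a reading of the prose, not the Pedro XML schema itself: the class and field names are assumptions chosen to mirror ‘transportDataType’, ‘format’ and ‘semanticType’, and the organisation string is a placeholder. The example values for the `probeSetId` input follow the CandidateGeneAnalysis advertisement shown later in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Parameter:
    name: str
    transport_data_type: str               # syntactic type, typically an XSD type
    mime_format: Optional[str] = None      # MIME type ('format' in the skeleton)
    semantic_type: Optional[str] = None    # ontology concept ('semanticType')


@dataclass
class WorkflowOperation:
    """A sub-activity of the workflow; no control or data flow is recorded."""
    name: str
    task: Optional[str] = None
    resources: List[str] = field(default_factory=list)
    inputs: List[Parameter] = field(default_factory=list)
    outputs: List[Parameter] = field(default_factory=list)


@dataclass
class WorkflowSkeleton:
    name: str
    organisation: str
    inputs: List[Parameter] = field(default_factory=list)
    outputs: List[Parameter] = field(default_factory=list)
    operations: List[WorkflowOperation] = field(default_factory=list)

    def semantic_input_types(self):
        """Return the semantic types of all annotated workflow inputs."""
        return [p.semantic_type for p in self.inputs if p.semantic_type]


skeleton = WorkflowSkeleton(
    name="CandidateGeneAnalysis",
    organisation="example organisation",   # placeholder value
    inputs=[
        Parameter(
            name="probeSetId",
            transport_data_type="xsd:string",
            mime_format="text/x-record-ids",
            semantic_type="Affymetrix_probe_set_id",
        )
    ],
)

if __name__ == "__main__":
    print(skeleton.semantic_input_types())
```

Note that the structure lists sub-activities but holds nothing about their ordering, which is what keeps it independent of any one workflow language.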

5.2. Publishing and Discovery Protocols

The process of publishing a workflow in myGrid is shown in Figures 7 and 8. Overall, the publishing process involves the user, the workflow construction and annotation tools, a storage device to archive workflows, a registry in which advertising and searching are performed, and a semantic find component performing any necessary reasoning over ontology-based semantic descriptions. Our sequence diagrams regard these components as separate, but any specific deployment may seek to integrate some of them. The script is archived in a store and made available via a URI, which is advertised in the registry by the user, possibly using a workflow construction tool. Then semantic descriptions, and other metadata, are attached to the workflow, its inputs and its outputs through successive calls to the registry (see Figure 8). Whenever a new workflow or new metadata is added to the registry, a notification is sent to the semantic find component, with the advertisement referring to the workflow by a unique key; as a result, the semantic find component creates optimised and de-normalised indexes of the workflows by their semantic types. These optimisations, which use technology described elsewhere [BHLT], enable us to perform discovery more rapidly than would otherwise be possible.

Figure 7: Sequence of actions taken in publishing a workflow

Figure 8: Sequence of actions taken in attaching metadata to a workflow
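The publishing sequence above (archive the script, advertise its URI, attach metadata, notify the semantic find component) can be sketched in Python as follows. This is a simplified stand-in under stated assumptions: the class and method names are hypothetical and do not reproduce the myGrid or UDDI APIs, the store is a plain dictionary, and the index is a simple mapping from semantic type to workflow keys rather than the optimised structures described in [BHLT].

```python
import uuid


class SemanticFind:
    """Keeps a de-normalised index: semantic type -> set of workflow keys."""

    def __init__(self):
        self.index = {}

    def notify(self, key, metadata_type, value):
        # Only semantic annotations are indexed in this sketch.
        if metadata_type == "semanticType":
            self.index.setdefault(value, set()).add(key)


class Registry:
    """Holds adverts and their metadata, and notifies the semantic find component."""

    def __init__(self, semantic_find):
        self.adverts = {}
        self.metadata = {}
        self.semantic_find = semantic_find

    def publish(self, name, script_uri):
        key = str(uuid.uuid4())  # unique key referring to the advertisement
        self.adverts[key] = {"name": name, "access_point": script_uri}
        self.metadata[key] = []
        return key

    def attach_metadata(self, key, metadata_type, value):
        self.metadata[key].append((metadata_type, value))
        self.semantic_find.notify(key, metadata_type, value)


def publish_workflow(store, registry, name, uri, script, annotations):
    """Archive the script, advertise its URI, then attach each annotation."""
    store[uri] = script
    key = registry.publish(name, uri)
    for metadata_type, value in annotations:
        registry.attach_metadata(key, metadata_type, value)
    return key


if __name__ == "__main__":
    finder = SemanticFind()
    registry = Registry(finder)
    key = publish_workflow(
        {}, registry, "CandidateGeneAnalysis",
        "http://example.org/CandidateGeneAnalysis.scufl",  # placeholder URI
        "<scufl/>",
        [("semanticType", "Affymetrix_probe_set_id")],
    )
    print(finder.index["Affymetrix_probe_set_id"] == {key})
```

Keeping the notification as an explicit call between two objects reflects the separation of components in the sequence diagrams; a deployment that integrates them could replace the call with a direct index update.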

The discovery of workflows, or other activities, is shown in Figure 9. Within myGrid, there are two main reasons for discovery: firstly, in response to a user request, usually through interaction with the workbench; and secondly, during the process of resolving the abstract activity specifications in a workflow template into invokable instances (see Section 3.4).

As users generally wish to discover workflows and services in terms of their own domain, user-driven discovery normally involves the conceptual metadata. Following user activity involving either the context-sensitive workflow selection or the browsing interfaces shown in Section 3 (for example, clicking to display the workflows that would take a given piece of data as an input), the workbench generates a semantic query. This query is sent to the semantic find component, which uses the retrieved semantic descriptions to determine which workflows match the query. The technical details, including the name, interface and endpoint of each applicable workflow script, are extracted from the registry and returned to be displayed to the user. The user can then select the final workflow, if there is more than one, which will then be sent to the enactor.

Figure 9: Sequence of actions taken in discovering a workflow by a user through a user interface

The enactor may also use the registry at run-time. As described earlier, a workflow template [WGG+03a] describes an in silico experiment in which some activity definitions are given abstractly, by service types, rather than as end points of specific Web Services; it is thus a template for the experiment rather than an instance of it. Workflow templates should not be confused with workflow skeletons, described in Section 5.1, which also provide a high-level abstraction of an experiment but contain no information about the flow of the experiment, are language-independent and are used for description rather than enactment. For a template to be enactable, queries need to be issued to resolve its abstract activities; these queries generally involve the invocation metadata and only the registry, as shown in Figure 10. The queries are sent to the registry from the enactor, and a service is chosen by the enactor from those matching the query. Following discovery, the enactor can then continue by invoking the returned service.

Figure 10: Sequence of actions taken in discovering services during workflow enactment
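A minimal Python sketch of this run-time resolution step is given below. It assumes that invocation metadata can be reduced to tuples of input and output type names and that the enactor simply takes the first match; both are simplifications for illustration, and the registry entries and `example.org` endpoints are placeholders.

```python
# Hypothetical registry entries; endpoints and type names are placeholders.
REGISTRY = [
    {"name": "lookup_a", "endpoint": "http://example.org/lookup_a",
     "inputs": ("xsd:string",), "outputs": ("xsd:string",)},
    {"name": "count_b", "endpoint": "http://example.org/count_b",
     "inputs": ("xsd:string",), "outputs": ("xsd:int",)},
]


def find_by_interface(registry, inputs, outputs):
    """Return entries whose input and output types match exactly."""
    return [
        entry for entry in registry
        if entry["inputs"] == tuple(inputs) and entry["outputs"] == tuple(outputs)
    ]


def resolve(template, registry):
    """Turn a workflow template into a list of concrete endpoints."""
    endpoints = []
    for activity in template:
        if "endpoint" in activity:  # already a concrete activity
            endpoints.append(activity["endpoint"])
            continue
        matches = find_by_interface(registry, activity["inputs"], activity["outputs"])
        if not matches:
            raise LookupError("no service matches " + repr(activity))
        endpoints.append(matches[0]["endpoint"])  # the enactor chooses one match
    return endpoints


# One concrete activity followed by one abstract activity.
TEMPLATE = [
    {"endpoint": "http://example.org/start"},
    {"inputs": ("xsd:string",), "outputs": ("xsd:int",)},
]

if __name__ == "__main__":
    print(resolve(TEMPLATE, REGISTRY))
```

No ontology is consulted here, which matches the observation that enactment-time discovery involves only the registry.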

5.3. Discussion

The design decisions involved in developing the above protocol are driven by the user and technical requirements. For instance, while the workflow skeleton does not reveal details of the workflow script’s internal execution, it does include an enumeration of its sub-activities; this was identified by user requirements: some bioinformaticians like to know which services are involved in a workflow to help them decide whether the workflow is appropriate for their specific goals. In general, the absence of details of the script in the workflow skeleton both enhances clarity in attaching descriptions and ensures that the workflow skeleton is independent of any one scripting language, because it does not make any notion of control or data flow explicit.

The motivation for treating the registry and the semantic find component as two separate modules, and passing messages between them, is that only discovery involving conceptual metadata requires semantic reasoning; other discovery does not need the semantic find component. In particular, discovery by the workflow enactment engine will attempt to match a service by its interfaces, ensuring that it can accept the data produced by earlier activities in the workflow, rather than by its domain-specific, e.g. biological, type. Additionally, this architecture means that the semantic find component can be deployed independently of the registry, to perform reasoning over queries to other data sources such as databases or Web pages. While conceptually separate, these two modules can be tightly integrated in any specific implementation in order to improve efficiency. The following section will discuss alternative deployments of the semantic find component.

6. Implementation

In this section, we describe the design and implementation of the main components of the myGrid architecture used in publishing and discovering workflows, namely the registry and the semantic find component.

6.1. Registry

Existing protocols for service publishing and discovery, such as UDDI for Web Services [UDDI], do not provide support for workflows. We have taken the approach that workflow scripts and services are almost equivalent for the purpose of discovery. Both are functional entities taking inputs and producing outputs according to some interface and internal algorithm, and are available from a given endpoint (where to download the script from in the case of a workflow). Executing them requires different processes, but this is relevant only to enactment and not to the advertising of the workflow or service. By drawing this equivalence between services and workflows, we can reuse the UDDI API to enable their registration and discovery.

The difference in execution, however, does mean that it needs to be obvious which type of activity the advert applies to. This requires us to attach additional metadata to the advert. In previous sections we have also identified the need to attach other additional metadata, in the form of OWL or RDF, to the activity in the registry. Existing UDDI registries do not provide adequate functionality for publishing workflow scripts, for attaching semantic descriptions to services or workflows, or for discovering services or workflows based on their semantics. Therefore we have built the myGrid registry to be UDDI-compliant but, in addition, extended so as to allow workflow scripts to be published and discovered, and arbitrary structured metadata, including semantic descriptions, to be added to entities in the published workflows.

Extra optional information may be available for workflow scripts, in that the sub-activities of a workflow may be advertised and used to filter the workflows discovered. Our extension of UDDI’s service publishing and discovery APIs therefore adds metadata for marking an advert as being for a workflow script, and for specifying the activities used within the workflow.

We have specified a protocol for attaching metadata to entities described in the service registry [MPP+04]. The metadata can be a simple string value for recording, for example, an estimate of the average time a workflow takes to execute. Alternatively, it can be a URI referring to a concept in the ontology. For a more complex semantic description, for example one in which ontology concepts are qualified by property values, structured RDF [RDF] metadata can be attached. The API message structure for one metadata attachment method is given in Figure 11; similar methods also exist to attach metadata to a service (or workflow) or a business, and to query for services or workflows by the metadata attached to them.

Figure 11: API for attaching metadata to WSDL message parts (inputs or outputs of workflows). To attach metadata the client must identify the entity to which metadata is attached and provide all details of the metadata itself. In this case, a message part is uniquely identified, according to the WSDL specification, by the namespace and local name of the message containing that part plus the part name. Metadata in our registry is given a type, by which the client can determine what the metadata is about, and a value. The value may be either a string, a URI (usually an ontology term) or structured metadata expressed as an assertion in one of the triple languages (such as RDF/XML or N3).
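The shape of such an attachment request, as described in the caption, can be sketched in Python as below. The class names, field names and the `attach` helper are hypothetical and are not the registry's actual message schema; only the identifying triple for a message part (namespace, local name, part name) and the three kinds of value come from the text. The example values are taken from the CandidateGeneAnalysis advertisement.

```python
from dataclasses import dataclass

VALUE_KINDS = ("string", "uri", "rdf")


@dataclass(frozen=True)
class MessagePartRef:
    """Identifies a WSDL message part by message QName plus part name."""
    message_namespace: str
    message_local_name: str
    part_name: str


@dataclass(frozen=True)
class Metadata:
    type: str         # tells the client what the metadata is about
    value: str        # a string, a URI, or serialised RDF
    value_kind: str   # one of VALUE_KINDS

    def __post_init__(self):
        if self.value_kind not in VALUE_KINDS:
            raise ValueError("unknown value kind: " + self.value_kind)


def attach(store, part, metadata):
    """Record metadata against a message part in a dictionary-based store."""
    store.setdefault(part, []).append(metadata)


part = MessagePartRef(
    message_namespace="http://www.ecs.soton.ac.uk/~sm/myGrid/myGrid.daml",
    message_local_name="WholeWorkflowRunRequest",
    part_name="probeSetId",
)

if __name__ == "__main__":
    store = {}
    attach(store, part, Metadata("semanticType", "biodata:Affymetrix_probe_set_id", "uri"))
    print(len(store[part]))
```

Making the reference immutable lets it serve directly as a lookup key, so repeated attachments by the author or by third parties accumulate against the same message part.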

A key characteristic of the registry is that the underlying information is stored as RDF [RDF] in a Jena [Jena] triple store, for reasons discussed in Section 4.2. For completeness, Appendix 1 contains the RDF representation of the CandidateGeneAnalysis workflow advertisement, as contained in the registry. Figure 12 illustrates how the advertisement for the CandidateGeneAnalysis workflow has been annotated with a semantic description in the registry’s RDF store.

Figure 13 shows the architecture of the registry, which is available as a Web Service in the myGrid distribution. The client interacts with the registry through a set of interfaces, which allow services and workflows to be published and discovered as UDDI business services, metadata (either conceptual or operational) to be attached and used in discovery, and semantic discovery to take place. Other features of the registry include the sending of notifications when services, workflows and metadata are added or removed, third-party annotation of services, federation of the registry and policy-based management of its contents, but these are beyond the scope of this paper.

# Base:
@prefix biodata: <http://www.ecs.soton.ac.uk/~sm/myGrid/myGrid.daml#> .
@prefix registry: <http://www.myGrid.ecs.soton.ac.uk/myGrid.rdfs#> .
@prefix wsdl: <http://www.myGrid.ecs.soton.ac.uk/wsdl.rdfs#> .
@prefix uddiv2: <http://www.myGrid.ecs.soton.ac.uk/uddiv2.rdfs#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

[] a <uddiv2:BusinessService> ;
  <uddiv2:hasName>
    [ a <uddiv2:NameBag> ;
      <rdf:_1> "CandidateGeneAnalysis"
    ] ;
  <uddiv2:hasServiceKey> "d0892afd-198d-404b-bfdf-31c7fa4df8f3" ;
  <uddiv2:hasMetadata>
    [ a <isWorkflowScript> ; (1)
      <rdf:value> "yes" ;
    ] ;
  <uddiv2:hasBindingTemplate> ...
    [ a <uddiv2:AccessPoint> ; (2)
      <uddiv2:hasText>
        "http://www.ecs.soton.ac.uk/~sm/myGrid/CandidateGeneAnalysis.scufl" ;
      <uddiv2:hasURLType> "http"
    ] ...
  <uddiv2:hasOverviewDoc>
    [ a <uddiv2:OverviewDoc> ;
      <uddiv2:hasOverviewURL>
        "http://www.ecs.soton.ac.uk/~sm/myGrid/CandidateGeneAnalysis.wsdl" ...

[] a <wsdl:WSDLOverviewDoc> ;
  <wsdl:hasFilename> "http://www.ecs.soton.ac.uk/~sm/myGrid/CandidateGeneAnalysis.wsdl" ;
  <wsdl:hasMessage>
    [ a <wsdl:MessageBag> ;
      <rdf:_1>
        [ _:b1 <wsdl:Message> ; (3)
          <wsdl:hasQName>
            [ a <wsdl:QName> ;
              <wsdl:hasLocalName> "WholeWorkflowRunRequest" ;
              <wsdl:hasNameSpace> "http://www.ecs.soton.ac.uk/~sm/myGrid/myGrid.daml"
            ]
          <wsdl:hasMessagePart>
            [ a <wsdl:PartBag> ;
              <rdf:_1>
                [ a <wsdl:MessagePart> ;
                  <wsdl:hasName> "probeSetId" ;
                  <wsdl:hasTypeName>
                    [ a <wsdl:QName> ;
                      <wsdl:hasLocalName> "string" ; (4)
                      <wsdl:hasNameSpace> "http://www.w3.org/2001/XMLSchema"
                    ] ;
                  <uddiv2:hasMetadata>
                    [ a <biodata:semanticType> ; (5)
                      <rdf:value> "biodata:Affymetrix_probe_set_id" ;
                    ];
                  <uddiv2:hasMetadata>
                    [ a <biodata:formats> ;
                      <rdf:value>
                        [ a <biodata:formatBag> ;
                          <rdf:_1>
                            [ a <biodata:format> ;
                              <biodata:hasFormatSystem> "MIME" ; (6)
                              <biodata:hasFormatIdentifier> "text/x-record-ids" ;
                            ]
                        ]
                    ] ;
                  ...

Figure 12: Representation of published workflow description stored in RDF (in N3 format). The workflow is advertised, following the UDDI specification, as a BusinessService, and marked as a workflow by attaching metadata (‘isWorkflowScript’) (1). The workflow refers to the location of the workflow script (‘AccessPoint’) (2) and its WSDL interface. The interface element is further expanded to show the messages that are accepted as input (3) and returned as output, and metadata is added to provide the syntactic (4), semantic (‘biodata:Affymetrix_probe_set_id’) (5) and MIME (6) types.

Figure 13: Architecture of the Registry.

6.2. Semantic Find Component

The myGrid semantic find component is responsible for analysing and making inferences over conceptual metadata with reference to an ontology, and is used for querying over and categorising activities described with this metadata, such as services and workflows. As this component receives queries expressed in OWL, we can use it to broaden or narrow searches as required, using the structure of the ontology. For example, by adding properties to an OWL concept expression we specialise the query and narrow the number of candidate workflows (we travel down the classification lattice); by removing properties we broaden the query and extend the number of candidate workflows that will be classified by the expression (we travel up the classification lattice) [WSG+03a]. Its architecture is depicted in Figure 14. The semantic find component itself is responsible for the following.

  • Every time a new service is advertised or metadata is updated, the ontology service and associated reasoner index items in a descriptions database to ensure efficient retrieval of entries at time of discovery. Storing the descriptions in a commodity database, as opposed to the mature description logic reasoner technology, also has obvious advantages for scalability of the system in practice. A fuller description of this technology is available elsewhere [BHLT].
  • Discovery queries are processed using the pre-built index or if necessary the ontology service and associated reasoner.

Figure 14: Architecture of the Semantic Find Component. The description database holds semantic descriptions gathered from resources published in the registry; the ontology server provides access to the domain ontologies and manages interaction with the description logic reasoner FaCT [H99a].

Two deployments of the semantic find component are considered. As illustrated in Figure 13, the semantic find component can be embedded in the registry, with queries over the conceptual metadata being processed by the semantic find component, while non-conceptual queries are answered by the registry. Alternatively, the component can be deployed as an autonomous service able to reason over semantic descriptions from a variety of sources including databases and Web pages.

Exact details of the semantic matching algorithms whereby a resource description is matched to a semantic query should not impact directly on the architecture described in this paper. In early implementations of this service, we have performed simple subsumption matching between query and description, although matching algorithms such as those described elsewhere [PKPS02a] could also be supported.

An example of the more complex reasoning, using the ontology and its reasoner, required for some queries would be when a user asks for a service taking a given semantic type of input (‘Affymetrix_probe_set_id’) and performing a given semantic type of task (‘retrieving’). In this case, the index would not contain a specific term representing this conjunctive concept, and so reasoning must occur to find the parent concepts in the ontology that can be used for discovery. Clients can present a semantic query to the semantic find component, making use of the metadata structure described in Figure 11 to encapsulate the semantic query expressed in OWL, and be returned locations of archived workflows or services that match.
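The interplay between the pre-built index and extra reasoning for conjunctive queries can be sketched in Python as follows. This is a deliberately simplified stand-in, not the FaCT-based implementation: the ontology fragment, the workflow keys and the strategy of decomposing a conjunction into separately indexed concepts and intersecting the results are all assumptions made for illustration.

```python
# Hypothetical ontology fragment: concept -> parent concept.
PARENT = {
    "Affymetrix_probe_set_id": "record_id",
    "record_id": "data",
    "retrieving": "task",
    "aligning": "task",
}

# Hypothetical semantic descriptions gathered from the registry.
DESCRIPTIONS = {
    "wf-1": {"input": "Affymetrix_probe_set_id", "task": "retrieving"},
    "wf-2": {"input": "Affymetrix_probe_set_id", "task": "aligning"},
    "wf-3": {"input": "record_id", "task": "retrieving"},
}


def subsumers(concept):
    """Return the concept followed by all of its ancestors."""
    result = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        result.append(concept)
    return result


def build_index(descriptions):
    """Index every entry under each (property, concept) that subsumes it."""
    index = {}
    for key, description in descriptions.items():
        for prop, concept in description.items():
            for ancestor in subsumers(concept):
                index.setdefault((prop, ancestor), set()).add(key)
    return index


def find(index, query):
    """Answer single-term queries from the index; intersect for conjunctions."""
    terms = list(query.items())
    if len(terms) == 1:
        return set(index.get(terms[0], set()))
    result = None
    for term in terms:  # no index entry exists for the conjunction itself
        hits = index.get(term, set())
        result = set(hits) if result is None else result & hits
    return result if result is not None else set()


if __name__ == "__main__":
    index = build_index(DESCRIPTIONS)
    print(find(index, {"input": "Affymetrix_probe_set_id", "task": "retrieving"}))
```

The index is built once, when descriptions are added, so single-concept queries never touch the hierarchy at discovery time; only the conjunctive case does extra work.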

7. Discussion and Related Work

The Web Service Architecture details the existence of a directory service for the registration and subsequent discovery of services, and languages for the composition of services into workflows. For directories, the UDDI [UDDI] registry (Universal Description, Discovery, and Integration) has become the de-facto standard. Service descriptions in UDDI are composed from a limited set of high-level data constructs (Business Entity, Business Service etc.) which can include other constructs following a rigid schema. Some of these constructs, such as tModels, Category Bags and Identifier Bags, can be seen as metadata associated with the service description. However, while useful in a limited way, they are all restrictive in scope of description and in their use in searching the directory. We extend UDDI by allowing arbitrary structured metadata to be attached not only to the services and workflows published, but also to their interfaces.

In Section 3.4, we considered automated discovery of activities in workflows at enactment time that have a given function and interface, but it would be useful in many cases to specify activities at construction time without restricting them to a particular interface. As mentioned in Section 3.2, we refer to workflows containing these abstract descriptions in place of one or more services or sub-workflows as workflow templates. However, we have discovered that services with the exact same functionality still often require different ways of being enacted and so cannot be easily substituted one for another [WGG+03a]. For instance, one of the ways in which an activity can be distinguished from another is in its invocation model, so that one service may perform a function with one operation call that requires multiple calls to another service (the example given in [WGG+03a] is of different deployments of the BLAST service discussed in Section 3).

For workflow languages, numerous candidates have recently been proposed that enable the composition of Web Services into workflows. These include BPEL4WS (Business Process Execution Language for Web Services) [BPEL], Web Services Flow Language (WSFL) [WSFL], XLANG (Web Services for Business Process Design) [XLANG] and Scufl (Simple Conceptual Unified Flow Language) [SCUFL]. These languages differ in their expressiveness and flexibility. It is unlikely that in the foreseeable future a single workflow language will emerge as a universal standard, although there is some encouraging development in this direction represented by BPEL4WS, which integrates the key features of WSFL and XLANG. In myGrid, we have used Scufl to provide a simple representation of the activities of a workflow, in such a way that it is easy for a bioinformatician to conceptualise and manipulate the overall experimental design by abstracting away from the details of low-level service orchestration [Addis03].

The motivation to discover and compose Web Services in automated and intelligent ways has led many researchers from the Semantic Web community to apply knowledge technologies to service descriptions, often building on past work in Problem Solving Methods [WSFM]. Early work focused on semantic service discovery [DAMLS]; more recent work has shifted to automated intelligent service composition, primarily through the use of AI planning techniques [WPS03]. Our semantic descriptions support the composition of services by enabling semantic and syntactic capability checking of input and output types; however, we do not support automated workflow planning, as the plan is the biologist’s experiment and our experience suggests that biologists demand complete control over its definition.

DAML-S [DAMLS] attempts a full description of a service as a process that can be enacted to achieve a goal. A full DAML-S service description incorporates three component perspectives: a planning view of the service based on “inputs, outputs, preconditions, and effects” (the service profile); the workflow view of the more primitive services needed to accomplish a complex goal (the service process); and the mapping of the atomic parts of this workflow to their concrete WSDL [WSDL] descriptions (the service grounding). DAML-S provides an alternate mechanism that allows service publishers to attach semantic information to the parameters of a service. Indeed, the argument types referred to by the profile input and output parameters are semantic. Such semantic types are mapped to the syntactic type specified in the WSDL interface by the intermediary of the service grounding. Such a mechanism is welcome but convoluted and limited. It is convoluted because the mapping from semantic to syntactic types involves the process model. It is limited because it only supports semantic annotations provided by the publisher, and not by third party annotators, and because a profile only supports one semantic description per parameter and does not allow multiple interpretations. Finally, such semantic annotations are restricted to input and output parameters, and may not be applied in a similar manner to other elements of a WSDL interface specification, e.g. operations or sets of operations collected in port types.

From the distributed Grid computing community, the ICENI project also uses OWL for semantic annotation [HLN03a], but it so far deals only with the ontological description of service interfaces; it ignores other aspects such as the semantic annotation of WSDL documents, and does not support workflow discovery. Also, because the descriptions are added directly to the interfaces in the source code, only the service provider can publish semantic descriptions (not third parties), which imposes restrictions on the community using the system. In the myGrid project, we have considered these aspects highly relevant and opted for the use of a flexible structure which enables annotation with both semantic and other metadata, by both service providers and third parties.
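To make the idea of annotation by both providers and third parties concrete, the following is a minimal sketch of such a flexible structure. The class and method names (`Registry`, `publish`, `attach`, `find`) and the `#` key convention for message parts are assumptions made for illustration only; they are not the myGrid registry interface.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    kind: str       # e.g. "semanticType"
    value: str
    annotator: str  # who attached it: the provider or a third party


@dataclass
class Entity:
    """Anything that can be described: a service, workflow or message part."""
    key: str
    metadata: list = field(default_factory=list)


class Registry:
    def __init__(self):
        self._entities = {}

    def publish(self, key):
        """Register an entity so that metadata can be attached to it."""
        self._entities[key] = Entity(key)
        return self._entities[key]

    def attach(self, key, kind, value, annotator):
        """Attach metadata to a published entity; any annotator may do so,
        and several values of the same kind may coexist."""
        self._entities[key].metadata.append(Metadata(kind, value, annotator))

    def find(self, kind, value, annotator=None):
        """Keys of entities carrying the given metadata, optionally
        restricted to a single annotator's descriptions."""
        return sorted(
            e.key for e in self._entities.values()
            if any(m.kind == kind and m.value == value
                   and (annotator is None or m.annotator == annotator)
                   for m in e.metadata))
```

Because every piece of metadata records its annotator, a message part can carry several semantic interpretations at once, and a search can either use all of them or only those from a trusted source.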

The UDDI [UDDI] registry (Universal Description, Discovery, and Integration) has become the de facto standard for service discovery in the Web Services community. Service descriptions in UDDI are composed from a limited set of high-level data constructs (Business Entity, Business Service etc.), which can include other constructs following a rigid schema. Some of these constructs, such as tModels, Category Bags and Identifier Bags, can be seen as metadata associated with the service description. However, while useful in a limited way, they are all restrictive in the scope of description they allow and in their use in searching the directory. We improve upon UDDI by allowing arbitrary structured metadata to be attached not only to the services and workflows published, but also to their interfaces.

DAML-S [DAMLS] attempts a full description of a service as a process that can be enacted to achieve a goal. A full DAML-S service description incorporates three component perspectives: a planning view of the service based on “inputs, outputs, preconditions, and effects” (the service profile); the workflow view of the more primitive services needed to accomplish a complex goal (the service process); and the mapping of the atomic parts of this workflow to their concrete WSDL [WSDL] descriptions (the service grounding). DAML-S provides an alternative mechanism that allows service publishers to attach semantic information to the parameters of a service. Indeed, the argument types referred to by the profile input and output parameters are semantic. Such semantic types are mapped to the syntactic type specified in the WSDL interface by the intermediary of the service grounding. We feel that such a mechanism is a step in the right direction, but it is convoluted because the mapping from semantic to syntactic types involves the process model. It also has some limitations: it only supports semantic annotations provided by the publisher, and not by third-party annotators, and a profile only supports one semantic description per parameter and does not allow multiple interpretations. Finally, such semantic annotations are restricted to input and output parameters, and may not be applied in a similar manner to other elements of a WSDL interface specification, e.g. operations or sets of operations collected in port types.

Finally, the biology domain has been investigating its own mechanisms for publishing bio-Web Services. The most well known proposal is BioMOBY [WL02a], a service discovery architecture based on a view of a service as an atomic process or operation that takes a set of inputs and produces a set of outputs. The service, inputs and outputs are given semantic types, which also define the message format. However, BioMOBY has a number of limitations: it does not support the UDDI protocol, so specialist clients have to be developed; it does not have a general attachment mechanism for service descriptions; and it does not explicitly address the publishing or discovery of workflows. myGrid and BioMOBY are working closely together to develop a common semantic registry framework.

8. Discussion and Future Work

In this paper, we have demonstrated how the myGrid architecture can be used to construct, publish, semantically describe, annotate and discover workflows as part of scientists’ experimental processes. Scientists without detailed computer science knowledge wish to share and use each other’s experimental designs, but discovering the designs available becomes difficult when there is a large and increasing number of them in a distributed system such as the Web. The myGrid architecture, making use of Web Services, workflows, enhanced service discovery technologies, Semantic Web technologies and semantic descriptions, enables scientists to do this more easily. We have shown how the process takes place from the users’ perspective and presented the underlying protocol implemented by our middleware.

We recognise that there can be multiple registries owned by different people and organisations, in which many useful workflows may be published. For this reason, future work on the registry will concentrate on federation of registries and the personalisation of registries to contain the information most useful to individuals, which could include semantic descriptions other than those provided by the workflow author.

It is useful to specify activities at construction time without restricting them to a particular interface. Such workflow templates contain abstract descriptions in place of one or more services or sub-workflows. However, in practice we find that services with exactly the same functionality still often require different ways of being enacted and so cannot easily be substituted for one another [WGG+03a]. For instance, one of the ways in which one activity can be distinguished from another is in its invocation model: one service may perform a function with a single operation call, while another service requires multiple calls to do the same (the example given in [WGG+03a] is of different deployments of the BLAST service discussed in Section 3).

In testing myGrid, we have found that discovering workflows by the type of their input, and classifying them by function for browsing by the user, have been the most helpful applications of the semantic descriptions provided. It has also become clear that better tools are required for the attachment and, later, maintenance (if mistakes or imprecision are found) of semantic descriptions, as the annotator should be an expert in the domain of the descriptions rather than in the languages and structures in which the descriptions are expressed. Future work on tools will concentrate on two areas: making the publication of semantic annotations incidental, and making discovery invisible, in the sense that users see workflow discovery as part of their natural scientific environment and in their own terms.
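The two uses of semantic descriptions singled out above, discovery by input type and classification by function for browsing, can be sketched as follows. This is an illustrative toy rather than the myGrid registry: only the workflow name CandidateGeneAnalysis and the type Affymetrix_probe_set_id come from Appendix 1, while the other entries, the function labels and the helper names are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowDescription:
    name: str
    input_types: frozenset  # semantic types accepted as input
    function: str           # semantic classification of the task performed


# Illustrative registry contents; see the lead-in for what is assumed.
REGISTRY = [
    WorkflowDescription("CandidateGeneAnalysis",
                        frozenset({"Affymetrix_probe_set_id"}),
                        "gene_analysis"),
    WorkflowDescription("ExampleSequenceSearch",
                        frozenset({"protein_sequence"}),
                        "similarity_search"),
    WorkflowDescription("ExampleProbeLookup",
                        frozenset({"Affymetrix_probe_set_id"}),
                        "annotation_retrieval"),
]


def find_by_input(registry, semantic_type):
    """Names of workflows that accept data of the given semantic type."""
    return [w.name for w in registry if semantic_type in w.input_types]


def group_by_function(registry):
    """Workflow names grouped by function, for presentation in a browser."""
    groups = defaultdict(list)
    for w in registry:
        groups[w.function].append(w.name)
    return dict(groups)
```

A scientist holding a probe set identifier would call `find_by_input(REGISTRY, "Affymetrix_probe_set_id")` to see what can be done with it, whereas a browsing interface would render the result of `group_by_function(REGISTRY)` as a category tree.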

References

[Addis03] Matthew Addis, Justin Ferris, Mark Greenwood, Darren Marvin, Peter Li, Tom Oinn and Anil Wipat. Experiences with eScience workflow specification and enactment in bioinformatics. In Proceedings of the UK OST e-Science second All Hands Meeting 2003 (AHM'03), pages 459-467, Nottingham, UK, September 2003.

[Affymetrix] Affymetrix. http://www.affymetrix.com. Last visited 2003.

[AGM+90a] S.F. Altschul, W. Gish, M. Miller, E. W. Myers and D.J. Lipman. Basic Local Alignment Search Tool. In Journal of Molecular Biology, 215:403-410, 1990.

[AKT] AKT Project. http://www.aktors.org/. Last visited 2003.

[BGB+99] Patricia G. Baker, Carole Goble, Sean Bechhofer, Norman Paton, Robert Stevens, Andy Brass. An Ontology for Bioinformatics Applications. Bioinformatics, 15(6) pp 510--520, 1999.

[BHL01a] Tim Berners-Lee, James Hendler, and Ora Lassila. The Semantic Web. Scientific American, 284(5):34-43, 2001.

[BHLT] Instance Store. http://instancestore.man.ac.uk/. Last visited 2003.

[BioJava] BioJava. http://www.biojava.org/. Last visited 2003.

[BioPerl] BioPerl. http://bioperl.org/. Last visited 2003.

[BPEL] Business Process Execution Language for Web Services. http://www-106.ibm.com/developerworks/webservices/library/ws-bpel/. Last visited 2003.

[DAMLS] The DAML Services Coalition (alphabetically Anupriya Ankolenkar, Mark Burstein, Jerry R. Hobbs, Ora Lassila, David L. Martin, Drew McDermott, Sheila A. McIlraith, Srini Narayanan, Massimo Paolucci, Terry R. Payne and Katia Sycara), "DAML-S: Web Service Description for the Semantic Web", The First International Semantic Web Conference (ISWC), Sardinia (Italy), June, 2002.

[EMBOSS] EMBOSS. http://www.hgmp.mrc.ac.uk/Software/EMBOSS. Last visited 2003.

[FK03] Ian Foster and Carl Kesselman. The Grid, Blueprint for a New Computing Infrastructure. 2nd edition, Morgan Kaufmann, 2003.

[FKNT02a] Ian Foster, Carl Kesselman, Jeffrey Nick and Steve Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. Globus, 2002.

[FreeFluo] FreeFluo. http://freefluo.sourceforge.net/. Last visited 2003.

[GGS+03a] Mark Greenwood, Carole Goble, Robert Stevens, Jun Zhao, Matthew Addis, Darren Marvin, Luc Moreau, and Tom Oinn. Provenance of e-science experiments - experience from bioinformatics. In Proceedings of the UK OST e-Science second All Hands Meeting 2003 (AHM'03), pages 223-226, Nottingham, UK, September 2003. ISBN 1-904425-11-9.

[Graves] National Graves’ Disease Foundation Frequently Asked Questions. http://www.ngdf.org/faq.htm. Last visited 2003.

[GO00] The Gene Ontology Consortium. 2000. Gene Ontology: tool for the unification of biology. Nat Genet 25: 25-29.

[H99a] Ian Horrocks. FaCT and iFaCT. In P. Lambrix, A Borgida, M. Lenzerini, R Möller, and P. Patel-Schneider, editors. Proceedings of the International Workshop on Description Logics (DL’99), pages 133-135, 1999.

[HLN03a] J. Hau, W. Lee, and S. Newhouse. Autonomic Service Adaptation using Ontological Annotation. In 4th International Workshop on Grid Computing, Grid 2003, Phoenix, USA, November 2003.

[HP-Sv-H03] Ian Horrocks, Peter F. Patel-Schneider, and Frank van Harmelen. From SHIQ and RDF to OWL: The making of a web ontology language. Journal of Web Semantics, Vol. 1, No. 1 December 2003, Elsevier.

J. Hau, W. Lee, S. Newhouse. The ICENI Semantic Service Adaptation Framework. In UK e-Science All Hands Meeting, p. 79-86, Nottingham, UK, September 2003. ISBN 1-904425-11-9.

[Jena] Jena Semantic Web Toolkit, http://www.hlp.hp.com/semweb/jena.htm. Last visited 2003.

[Jini] Jini. http://www.jini.org/. Last visited 2003.

[LWS+03b] Phillip Lord, Chris Wroe, Robert Stevens, Carole Goble, Simon Miles, Luc Moreau, Keith Decker, Terry Payne, and Juri Papay. Semantic and Personalised Service Discovery. In W. K. Cheung and Y. Ye, editors, Proceedings of Workshop on Knowledge Grid and Grid Intelligence (KGGI'03), in conjunction with 2003 IEEE/WIC International Conference on Web Intelligence/Intelligent Agent Technology, pages 100-107, Halifax, Canada, October 2003. Department of Mathematics and Computing Science, Saint Mary's University, Halifax, Nova Scotia, Canada.

[MLM+04] Luc Moreau, Mike Luck, Simon Miles, Juri Papay, Keith Decker, and Terry Payne. Methodologies and Software Engineering for Agent Systems, chapter Agents and the Grid: Service Discovery. Kluwer, 2004.

[MPD+03a] Simon Miles, Juri Papay, Vijay Dialani, Michael Luck, Keith Decker, Terry Payne, and Luc Moreau. Personalised grid service discovery. IEE Proceedings Software: Special Issue on Performance Engineering, 150(4):252-256, August 2003.

[MPP+04] Simon Miles, Juri Papay, Terry Payne, Keith Decker and Luc Moreau. Towards a Protocol for the Attachment of Semantic Descriptions to Grid Services. To appear in Proceedings of the 2nd European Across Grids Conference (AxGrids 2004), 2004.

[myGrid] myGrid UK e-Science Project. http://www.myGrid.org.uk. Last visited 2003.

[OGSA] OGSA. https://forge.gridforum.org/projects/ogsa-wg. Last visited 2003.

[OWL] Web Ontology Language Overview. http://www.w3.org/TR/owl-features/. Last visited 2003.

[Pedro] Pedro. http://pedrodownload.man.ac.uk/. Last visited 2003.

[PKPS02a] Massimo Paolucci, Takahiro Kawamura, Terry Payne and Katia Sycara. Semantic Matching of Web Services Capabilities. In The First International Semantic Web Conference (ISWC), 2002.

[RDF] Resource Description Framework (RDF). http://www.w3.org/RDF/, Created 2001.

[RDQL] RDQL. http://www.hpl.hp.com/semweb/rdql.htm. Last visited 2003.

[SGG+03a] Robert Stevens, Kevin Glover, Chris Greenhalgh, Claire Jennings, Simon Pearce, Peter Li, Mielena Radenkovic, Anil Wipat. In Proceedings of the UK OST e-Science second All Hands Meeting 2003 (AHM'03), pages 43-50, Nottingham, UK, September 2003. ISBN 1-904425-11-9.

[SCUFL] Simple Conceptual Unified Flow Language (SCUFL). http://taverna.sourceforge.net/schemata/XScufl.html. Last visited 2003.

[Taverna] Taverna. http://taverna.sourceforge.net/. Last visited 2003.

[UDDI] Universal Description, Discovery and Integration of Business of the Web. www.uddi.org, 2001.

[WGG+03a] Chris Wroe, Carole Goble, Mark Greenwood, Phillip Lord, Simon Miles, Luc Moreau, Juri Papay, Terry Payne. Experiment automation using semantic data on a bioinformatics Grid. Submitted for publication in IEEE Intelligent Systems, Jan/Feb 2004.

[WL02a] M.D. Wilkinson and M. Links. BioMoby: an open source biological web services proposal. Briefings in Bioinformatics, 4(3), 2002.

[WSDL] Web Services Description Language (WSDL) 1.1. http://www.w3c.org/TR/wsdl. Last visited 2003.

[WSArch] Web Services Architecture. Latest version available from http://www.w3.org/2002/ws/arch/. Last visited 2003.

[WSFL] Web Services Flow Language. http://www-3.ibm.com/software/solutions/webservices/pdf/WSFL.pdf. Last visited 2003.

[WSFM] D. Fensel, C. Bussler, "The Web Service Modeling Framework WSMF", Technical Report, Vrije Universiteit Amsterdam

[WPS03] Dan Wu, Bijan Parsia, Evren Sirin, et al. Automating DAML-S Web Services Composition Using SHOP2. In Proceedings of the 2nd International Semantic Web Conference (ISWC 2003), Lecture Notes in Computer Science, Volume 2870, pp. 195-210, Springer-Verlag, Heidelberg, October 2003.

[WSG+03a] Chris Wroe, Robert Stevens, Carole Goble, Angus Roberts, and Mark Greenwood. A suite of DAML+OIL ontologies to describe bioinformatics web services and data. International Journal of Cooperative Information Systems, 12(2):197-224, 2003.

[XLANG] XLANG. http://www.gotdotnet.com/team/xml_wsspecs/xlang-c/default.htm. Last visited 2003.

Appendix 1. RDF Representation of a Published Workflow

Below, we find the representation of a published workflow description stored in RDF (in N3 format). The workflow is advertised, following the UDDI specification, as a BusinessService, and marked as a workflow by attaching metadata (‘isWorkflowScript’) (1). The workflow refers to the location of the workflow script (‘AccessPoint’) (2) and its WSDL interface. The interface element is further expanded to show the messages that are accepted as input (3) and returned as output, and metadata is added to provide the syntactic (4), semantic (‘biodata:Affymetrix_probe_set_id’) (5) and MIME (6) types.

# Base:
@prefix biodata: <http://www.ecs.soton.ac.uk/~sm/myGrid/myGrid.daml#> .
@prefix registry: <http://www.myGrid.ecs.soton.ac.uk/myGrid.rdfs#> .
@prefix wsdl: <http://www.myGrid.ecs.soton.ac.uk/wsdl.rdfs#> .
@prefix uddiv2: <http://www.myGrid.ecs.soton.ac.uk/uddiv2.rdfs#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

[] a <uddiv2:BusinessService> ;
  <uddiv2:hasName>
    [ a <uddiv2:NameBag> ;
      <rdf:_1> "CandidateGeneAnalysis"
    ] ;
  <uddiv2:hasServiceKey> "d0892afd-198d-404b-bfdf-31c7fa4df8f3" ;
  <uddiv2:hasMetadata>
    [ a <isWorkflowScript> ; (1)
      <rdf:value> "yes" ;
    ] ;
  <uddiv2:hasBindingTemplate> ...
    [ a <uddiv2:AccessPoint> ; (2)
      <uddiv2:hasText>
        "http://www.ecs.soton.ac.uk/~sm/myGrid/CandidateGeneAnalysis.scufl" ;
      <uddiv2:hasURLType> "http"
    ] ...
  <uddiv2:hasOverviewDoc>
    [ a <uddiv2:OverviewDoc> ;
      <uddiv2:hasOverviewURL>
        "http://www.ecs.soton.ac.uk/~sm/myGrid/CandidateGeneAnalysis.wsdl" ...

[] a <wsdl:WSDLOverviewDoc> ;
  <wsdl:hasFilename> "http://www.ecs.soton.ac.uk/~sm/myGrid/CandidateGeneAnalysis.wsdl" ;
  <wsdl:hasMessage>
    [ a <wsdl:MessageBag> ;
      <rdf:_1>
        [ _:b1 <wsdl:Message> ; (3)
          <wsdl:hasQName>
            [ a <wsdl:QName> ;
              <wsdl:hasLocalName> "WholeWorkflowRunRequest" ;
              <wsdl:hasNameSpace> "http://www.ecs.soton.ac.uk/~sm/myGrid/myGrid.daml"
            ] ;
          <wsdl:hasMessagePart>
            [ a <wsdl:PartBag> ;
              <rdf:_1>
                [ a <wsdl:MessagePart> ;
                  <wsdl:hasName> "probeSetId" ;
                  <wsdl:hasTypeName>
                    [ a <wsdl:QName> ;
                      <wsdl:hasLocalName> "string" ; (4)
                      <wsdl:hasNameSpace> "http://www.w3.org/2001/XMLSchema"
                    ] ;
                  <uddiv2:hasMetadata>
                    [ a <biodata:semanticType> ; (5)
                      <rdf:value> "biodata:Affymetrix_probe_set_id" ;
                    ] ;
                  <uddiv2:hasMetadata>
                    [ a <biodata:formats> ;
                      <rdf:value>
                        [ a <biodata:formatBag> ;
                          <rdf:_1>
                            [ a <biodata:format> ;
                              <biodata:hasFormatSystem> "MIME" ; (6)
                              <biodata:hasFormatIdentifier> "text/x-record-ids" ;
                            ]
                        ]
                    ] ;
...


[1] available at http://www.ecs.soton.ac.uk/~sm/myGrid/AffyIdToEmblSnps.scufl

[2] In this paper, we focus on workflow descriptions, but we note that service descriptions are similar. Service descriptions differ in that the institution is the one hosting the service, and that services do not tend to have sub-activities associated with them. We shall come back to this specific point in Section 7.

[3] BLAST, “the Basic Local Alignment Search Tool” [AGM+90a], is an application that encompasses a number of services used to compare a DNA or protein sequence with the large public databases of known sequences. It can therefore accept as input different types of sequence data, whether protein or DNA, perform a search over one or more databases and produce its results in a variety of formats. BLAST is highly parameterisable, able to search over many databases with many types of sequence. In fact, BLAST has several instantiations specialised for different sequence types, for example BLASTn for searching nucleotide sequences over nucleotide sequence databases and BLASTx for searching nucleotide sequences over protein databases.

