[provenance-challenge] Re: review of workflows for pc3
Yogesh Simmhan
yoges at microsoft.com
Tue Nov 25 13:33:58 GMT 2008
Hi Paul,
Thanks for your comments. Regarding the ease of portability of the Pan-STARRS Load/Merge workflow, all our activities are either SQL queries and updates, or file system operations. While our current executables are for MSSQL/C#, the SQL activities are simple enough to port to any relational DBMS (MySQL, Apache Derby, ...) and programming language. The main workflows operate on 3 relational tables with about 50 columns.
If selected, we can provide Java source code using Derby, in addition to the C# version using MSSQL. We'll also provide textual descriptions of the activities so that they can be ported to other databases and languages.
While the typical Pan-STARRS workflows operate on large datasets, there is nothing that prevents the challenge workflows from operating on a subset of those. Indeed, we use small CSV files and databases (<1MB) for our own testing that we can provide for the challenge.
Metadata about the telescope is not part of the normal workflow pipeline, but we can consider incorporating supplementary annotations about the telescope outside the scope of the workflow to see how the provenance systems embed annotations in OPM and handle annotation queries.
Best,
--Yogesh
|
| pgroth at ISI.EDU wrote:
| > Hi,
| >
| > To kick start our discussion about what workflows should be used for the
| > third provenance challenge, below are my thoughts on which would be most
| > appropriate and some questions to the authors. First, let me say that I
| > thought all the workflows would provide a good basis for an interesting
| > challenge, but to be decisive I've selected two.
| >
| > The two selection criteria I used were the complexity of the structures
| > within the workflows (i.e. did it have loops, hierarchies, collections,
| > etc.) and how easy it would be for other teams to get the workflows up
| > and running. I believe given the complex control structures in some of
| > these workflows that it would be difficult to provide intermediary data
| > sets and thus teams would need to execute the workflows themselves
| > unlike previous challenges where dummy components could be used.
| >
| > 1. Build and test workflow
| > In terms of being able to execute the workflows, the Software build and
| > testing workflow seems by far the easiest to get up and running. Most
| > systems have ant and java and the build file can be easily adapted to
| > use Makefiles. Likewise, the ant file has a multi-level hierarchy, which
| > is an interesting structure. The downside to the workflow is its lack of
| > complexity: it does not have collections or nested data sets. However, I
| > think the workflow would make for a simple starting point for testing
| > interoperability before moving on to the more complex second workflow.
| > Furthermore, by using an ant file the challenge does not become too
| > workflow specific.
| >
| > 2. MSR-WSU Pan-STARRS workflow
| > My first choice for the second workflow is the MSR-WSU Pan-STARRS
| > workflow. It has a number of interesting workflow structures such as
| > if/else as well as loops over collections. I also like the idea of
| > having multiple levels of abstraction around database tables. It would
| > be interesting to ask for the provenance of an individual item in a
| > table and retrieve all the modifications on each table including
| > modifications to individual items. The explicit use of database tables
| > might also encourage the database community to get involved with the
| > challenge. What do others think on this issue?
| >
| > I'm wondering if the questions about external details from the Neptune
| > workflow (e.g. the types of sensor detail) could be incorporated in the
| > Pan-STARRS workflow? For example, the telescope from which the data was
| > collected?
| >
| > The major reservation I have with this workflow is how easy it would be
| > for others to execute. Given the Pan-STARRS workflow is designed to work
| > with large data, can the MSR team comment on whether small data sets are
| > available? Also, given that the implementation requires .NET, how easily
| > could this be run on non-Windows machines? Are there non-Windows
| > executables available?
| >
| > * myExperiment & Brain Imaging Workflows
| > If the Pan-STARRS workflow cannot be executed by different teams easily,
| > I think we should look at selecting one of these options. Can these two
| > teams comment on how easy it would be for others to use the components
| > within their workflows without invoking their particular workflow
| > enactment engines?
| >
| > I did like the dynamic nature of the Taverna workflow as it makes for a
| > good case for provenance (e.g. the abstracts returned from PubMed will
| > vary over time). Could we incorporate this into our selections?
| >
| > With that, what do you think?
| >
| > Thanks,
| > Paul
| >
| > --------------------------------------------------------------
| > Paul Groth, Ph.D.
| > Postdoctoral Research Associate
| > Information Sciences Institute
| > University of Southern California
| > pgroth at isi.edu
| > Tel: 310 448 8482 Fax: 310 822 0751
| > http://www.isi.edu/~pgroth/
| > http://thinklinks.wordpress.org
| >
|
|
| --
| Professor Luc Moreau tel: +44 23 8059 4487
| Electronics and Computer Science email: l.moreau at ecs.soton.ac.uk
| University of Southampton www: www.ecs.soton.ac.uk/~lavm
| Southampton SO17 1BJ skype: prof.luc.moreau
| United Kingdom fring: Luc
|
|
|