[provenance-challenge] Re: review of workflows for pc3

Roger Barga barga at microsoft.com
Wed Nov 26 20:50:23 GMT 2008


Hi Luc,

I am joining this thread a bit late in the day, but it looks like Yogesh and Satya have provided the materials you requested.  Something worth sharing is that I have been working with a group on the design of an open source scientific database (SciDB), and I am working on its provenance support.  One idea I am interested in exploring is how OPM might fit with SciDB.  That is, if we use OPM to represent provenance from the workflow system and then store the data in SciDB, could this provenance also be stored in the database?  As the DML operates on data in SciDB, the provenance data is augmented, so if the data is pulled out by a workflow system, the provenance again propagates and continues to be augmented.  There are obvious variations, such as one in which an adapter layer translates OPM into the provenance model supported by SciDB (and vice versa).



Are you going to attend the upcoming eScience conference and/or workshop in Indiana?  Can't imagine why you wouldn't want to visit lovely Indiana in Dec.  If so, I would like to discuss this with you; otherwise, perhaps we can arrange a phone conference or meet at some other location.  I am keen to see us build a bridge between workflow provenance and database provenance, and this does seem like an opportunity.



Let me know.



Cheers,

Roger

PS - the last time we met you mentioned a single malt with 'provenance' in the name.  Was that an Ardbeg Provenance by chance?  If so, I had a chance to try it on a recent trip to Edinburgh - absolutely wonderful.
________________________________________
From: provenance-challenge-ipaw-info-bounces at ipaw.info [provenance-challenge-ipaw-info-bounces at ipaw.info] On Behalf Of Luc Moreau [L.Moreau at ecs.soton.ac.uk]
Sent: Wednesday, November 26, 2008 6:44 AM
To: provenance-challenge at ipaw.info
Cc: Satya Sahoo; Paul Groth
Subject: [provenance-challenge] Re: review of workflows for pc3

Thanks Yogesh.  Are there any slides or papers about Roger's work?

From a challenge viewpoint, it would be useful to characterise the
type of provenance we would ideally like
to capture within the database. It seems that a layered model is
particularly appropriate here: the activity-level
description could constitute an OPM account, whereas a more fine-grained
provenance (in the database sense) could
form another account.
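
As a rough illustration of the layered idea, the sketch below keeps one
set of edges and tags each edge with the account it belongs to, so the
same step can be read at the activity level or at the row level. The
edge names ("used", "wasGeneratedBy", "wasDerivedFrom") are OPM terms;
everything else (the Edge and ProvenanceGraph classes, the account
labels "activity" and "database", and the node names) is hypothetical
and only there to make the example run.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    """One causal dependency, tagged with the account it belongs to."""
    kind: str      # OPM edge type, e.g. "used" or "wasGeneratedBy"
    effect: str    # node the edge starts from
    cause: str     # node the edge points to
    account: str   # account label (hypothetical names, see below)


@dataclass
class ProvenanceGraph:
    edges: list = field(default_factory=list)

    def add(self, kind, effect, cause, account):
        self.edges.append(Edge(kind, effect, cause, account))

    def view(self, account):
        """Return only the edges asserted in one account."""
        return [e for e in self.edges if e.account == account]


graph = ProvenanceGraph()

# Coarse account: what the workflow engine sees for one load step.
graph.add("used", "LoadActivity", "input.csv", "activity")
graph.add("wasGeneratedBy", "table:Detections", "LoadActivity", "activity")

# Fine-grained account: what a database might report for the same step.
graph.add("wasDerivedFrom", "row:Detections/1", "input.csv#line2", "database")
graph.add("wasDerivedFrom", "row:Detections/2", "input.csv#line3", "database")

for edge in graph.view("activity"):
    print(edge.kind, edge.effect, "<-", edge.cause)
```

A query at the coarse level then reads only the "activity" view, while a
row-level question reads the "database" view of the same graph.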

Luc


Yogesh Simmhan wrote:
> Hi Luc,
>
> In the current system, we work around having to instrument the DB by having individual SQL queries wrapped as C# activities. The activities pass the input params through to the parameterized SQL queries. Provenance is captured at the activity level. We also capture the actual queries and query plans from MSSQL server, but don't integrate them with the provenance yet.
>
> Roger B. is working on a design and prototype for a more DB centric and semantic approach using materialized views and first class provenance operators. His presentation at the recent provenance in workflows workshop at Utah talked about it (http://wiki.esi.ac.uk/ProvenanceInWorkflows).
>
> Best,
> --Yogesh
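
The wrapping described in the message above might look roughly like the
following. This is only a sketch under stated assumptions: it uses
Python and an in-memory SQLite database as stand-ins for the C# and
MSSQL setup, and the function run_sql_activity, the provenance_log
list, the detections table and the "SelectBright" activity are all
invented for illustration. Storing the query text in the provenance
record is also an assumption; the message says queries are captured but
not yet integrated with the provenance.

```python
import sqlite3

provenance_log = []  # activity-level records, one per invocation


def run_sql_activity(conn, name, sql, params):
    """Run one parameterized query and record it as a single activity."""
    cursor = conn.execute(sql, params)
    rows = cursor.fetchall()
    provenance_log.append({
        "activity": name,
        "query": sql,             # assumption: query text kept with the record
        "inputs": dict(params),   # parameters passed through unchanged
        "rows_returned": len(rows),
    })
    return rows


# Tiny stand-in database (hypothetical schema and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detections (id INTEGER, mag REAL)")
conn.executemany("INSERT INTO detections VALUES (?, ?)",
                 [(1, 18.2), (2, 21.7), (3, 19.5)])

bright = run_sql_activity(
    conn,
    "SelectBright",
    "SELECT id FROM detections WHERE mag < :limit ORDER BY id",
    {"limit": 20.0},
)
print(bright)
print(provenance_log[0]["activity"], provenance_log[0]["inputs"])
```

Nothing inside the database is instrumented here: the record describes
the activity as a whole, which is why row-level questions cannot be
answered from it alone.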
>
>
> | -----Original Message-----
> | From: provenance-challenge-ipaw-info-bounces at ipaw.info
> | [mailto:provenance-challenge-ipaw-info-bounces at ipaw.info] On Behalf Of
> | Luc Moreau
> | Sent: Wednesday, November 26, 2008 4:02 AM
> | To: provenance-challenge at ipaw.info; Paul Groth
> | Cc: Satya Sahoo
> | Subject: [provenance-challenge] Re: review of workflows for pc3
> |
> | Yogesh,
> |
> | There is however an interesting technical challenge (probably
> | appropriate for a provenance challenge!).
> | If we intend to export provenance information into the OPM format, we
> | probably need
> | to capture this information (in part) inside the database processing
> | SQL
> | queries.
> | Are you already doing this in your system?
> |
> | This presents us with an opportunity to have contributions from members
> | of the database community.
> | Who is on this list at this moment? (James? Peter? Val? Jan?  Natalia?)
> |
> | This will require us to structure the workflow in different "stages"
> | where different technologies (including databases)
> | are involved.
> |
> | Can you comment on this?
> |
> | Cheers,
> | Luc
> |
> | Yogesh Simmhan wrote:
> | > Hi Paul,
> | >
> | > Thanks for your comments. Regarding the ease of portability of the
> | Pan-STARRS Load/Merge workflow, all our activities are either SQL
> | queries and updates, or file system operations. While our current
> | executables are for MSSQL/C#, the SQL activities are simple enough to
> | port to any relational DBMS (MySQL, Apache Derby, ...) and programming
> | language. The main workflows operate on 3 relational tables with about
> | 50 columns.
> | >
> | > If selected, we can provide Java source code using Derby, in addition
> | to the C# version using MSSQL. We'll also provide textual descriptions
> | of the activities to enable them to be ported to other DB/languages.
> | >
> | > While the typical Pan-STARRS workflows operate on large datasets,
> | there is nothing that prevents the challenge workflows from operating
> | on a subset of those. Indeed, we use small CSV files and databases
> | (<1MB) for our own testing that we can provide for the challenge.
> | >
> | > Metadata about the telescope is not part of the normal workflow
> | pipeline, but we can consider incorporating supplementary annotations
> | about the telescope outside the scope of the workflow to see how the
> | provenance systems embed annotations in OPM and handle annotation
> | queries.
> | >
> | > Best,
> | > --Yogesh
> | >
> | >
> | > |
> | > | pgroth at ISI.EDU wrote:
> | > | > Hi,
> | > | >
> | > | > To kick start our discussion about what workflows should be used
> | for
> | > | the third
> | > | > provenance challenge, below are my thoughts on which would be
> | most
> | > | appropriate
> | > | > and some questions to the authors. First, let me say that I
> | thought
> | > | all the
> | > | > workflows would provide a good basis for an interesting challenge
> | but
> | > | to be
> | > | > decisive I've selected two.
> | > | >
> | > | > The two selection criteria I used were the complexity of the
> | > | structures within
> | > | > the workflows (i.e. did it have loops, hierarchies, collections,
> | etc.)
> | > | and how
> | > | > easy it would be for other teams to get the workflows up and
> | running.
> | > | I believe
> | > | > given the complex control structures in some of these workflows
> | that
> | > | it would
> | > | > be difficult to provide intermediary data sets and thus teams
> | would
> | > | need to
> | > | > execute the workflows themselves unlike previous challenges where
> | > | dummy
> | > | > components could be used.
> | > | >
> | > | > 1. Build and test workflow
> | > | > In terms of being able to execute the workflows, the Software
> | build
> | > | and testing
> | > | > workflow seems by far the easiest to get up and running. Most
> | systems
> | > | have ant
> | > | > and java and the build file can be easily adapted to use
> | Makefiles.
> | > | Likewise,
> | > | > the ant file has a multi-level hierarchy, which is an interesting
> | > | structure.
> | > | > The downside to the workflow is its lack of complexity: it does
> | not
> | > | have
> | > | > collections or nested data sets. However, I think the workflow
> | would
> | > | make for a
> | > | > simple starting point for testing interoperability before moving
> | on
> | > | to the more
> | > | > complex second workflow. Furthermore, by using an ant file the
> | > | challenge does
> | > | > not become too workflow specific.
> | > | >
> | > | > 2. MSR-WSU Pan-STARRS workflow
> | > | > My first choice for the second workflow is the MSR-WSU Pan-STARRS
> | > | workflow. It has a
> | > | > number of interesting workflow structures such as if/else as well
> | as
> | > | loops over
> | > | > collections. I also like the idea of having multiple levels
> | of
> | > | abstraction
> | > | > around database tables. It would be interesting to ask for the
> | > | provenance of an
> | > | > individual item in a table and retrieve all the modifications on
> | > | each table
> | > | > including modifications to individual items. The explicit use of
> | > | database
> | > | > tables might also encourage the database community to get
> | involved
> | > | with the
> | > | > challenge. What do others think on this issue?
> | > | >
> | > | > I'm wondering if the questions about external details from the
> | > | Neptune workflow
> | > | > (e.g. the types of sensor detail) could be incorporated in the
> | > | Pan-STARRS
> | > | > workflow? For example, the telescope which the data was collected
> | > | from?
> | > | >
> | > | > The major reservation I have with this workflow is how easy it
> | would
> | > | be for
> | > | > others to execute. Given the Pan-STARRS workflow is designed to
> | work
> | > | with large
> | > | > data, can the MSR team comment on whether small data sets are
> | > | available? Also,
> | > | > given that the implementation requires .NET, how easily could this
> | > | be
> | > | run on
> | > | > non-Windows machines? Are there non-Windows executables available?
> | > | >
> | > | > * myExperiment & Brain Imaging Workflows
> | > | > If the Pan-STARRS workflow cannot be executed by different teams
> | > | easily, I think
> | > | > we should look at selecting one of these options. Can these two
> | teams
> | > | comment
> | > | > on how easy it would be for others to use the components within
> | their
> | > | workflows
> | > | > without invoking their particular workflow enactment engines?
> | > | >
> | > | > I did like the dynamic nature of the Taverna workflow as it makes
> | for
> | > | a good
> | > | > case for provenance (e.g. the abstracts returned from PubMed will
> | > | vary over
> | > | > time). Could we incorporate this into our selections?
> | > | >
> | > | > With that, what do you think?
> | > | >
> | > | > Thanks,
> | > | > Paul
> | > | >
> | > | > --------------------------------------------------------------
> | > | > Paul Groth, Ph.D.
> | > | > Postdoctoral Research Associate
> | > | > Information Sciences Institute
> | > | > University of Southern California
> | > | > pgroth at isi.edu
> | > | > Tel:  310 448 8482  Fax: 310 822 0751
> | > | > http://www.isi.edu/~pgroth/
> | > | > http://thinklinks.wordpress.org
> | > | >
> | > | >
> | > | >
> | > | >
> | > | >
> | > |
> | > |
> | > | --
> | > | Professor Luc Moreau               tel:   +44 23 8059 4487
> | > | Electronics and Computer Science   email: l.moreau at ecs.soton.ac.uk
> | > | University of Southampton          www:   www.ecs.soton.ac.uk/~lavm
> | > | Southampton SO17 1BJ               skype: prof.luc.moreau
> | > | United Kingdom                     fring: Luc
> | > |
> | > |
> | > |
> | >
> | >
> | >
> |
> |
> | --
> | Professor Luc Moreau               tel:   +44 23 8059 4487
> | Electronics and Computer Science   email: l.moreau at ecs.soton.ac.uk
> | University of Southampton          www:   www.ecs.soton.ac.uk/~lavm
> | Southampton SO17 1BJ               skype: prof.luc.moreau
> | United Kingdom                     fring: Luc
> |
> |
> |
>
>
>


--
Professor Luc Moreau               tel:   +44 23 8059 4487
Electronics and Computer Science   email: l.moreau at ecs.soton.ac.uk
University of Southampton          www:   www.ecs.soton.ac.uk/~lavm
Southampton SO17 1BJ               skype: prof.luc.moreau
United Kingdom                     fring: Luc