[provenance-challenge] Re: review of workflows for pc3

Yogesh Simmhan yoges at microsoft.com
Wed Nov 26 13:42:27 GMT 2008


Hi Luc,

In the current system, we work around having to instrument the DB by wrapping individual SQL queries as C# activities. The activities pass their input params through to the parameterized SQL queries, and provenance is captured at the activity level. We also capture the actual queries and query plans from MSSQL server, but don't integrate them with the provenance yet.
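
As an illustration only (not the actual Pan-STARRS code), the wrapping pattern described above might look like the following Python sketch. It uses the standard-library sqlite3 module in place of C# and MSSQL, and the table, column, activity, and function names are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical in-memory provenance log; a real system would persist this.
provenance_log = []


def run_sql_activity(conn, name, sql, params):
    """Run one parameterized SQL statement as an 'activity' and record
    provenance at the activity level: the statement, its input params,
    and how many rows it returned or affected."""
    cur = conn.execute(sql, params)
    rows = cur.fetchall()
    conn.commit()
    provenance_log.append({
        "activity": name,
        "sql": sql,
        "params": dict(params),
        "rows_returned": len(rows),
        "rows_affected": cur.rowcount,  # -1 for SELECT statements in sqlite3
        "finished_at": datetime.now(timezone.utc).isoformat(),
    })
    return rows


# Assumed toy schema, unrelated to the real Pan-STARRS tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE detections (id INTEGER PRIMARY KEY, batch TEXT, mag REAL)"
)

run_sql_activity(
    conn,
    "LoadRow",
    "INSERT INTO detections (batch, mag) VALUES (:batch, :mag)",
    {"batch": "b1", "mag": 21.5},
)
rows = run_sql_activity(
    conn,
    "SelectBatch",
    "SELECT id, mag FROM detections WHERE batch = :batch",
    {"batch": "b1"},
)
```

Each call adds one record to provenance_log, so the provenance describes the activity and its inputs without any instrumentation inside the database itself.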

Roger B. is working on a design and prototype for a more DB-centric and semantic approach using materialized views and first-class provenance operators. His presentation at the recent Provenance in Workflows workshop at Utah covered it (http://wiki.esi.ac.uk/ProvenanceInWorkflows).

Best,
--Yogesh


| -----Original Message-----
| From: provenance-challenge-ipaw-info-bounces at ipaw.info
| [mailto:provenance-challenge-ipaw-info-bounces at ipaw.info] On Behalf Of
| Luc Moreau
| Sent: Wednesday, November 26, 2008 4:02 AM
| To: provenance-challenge at ipaw.info; Paul Groth
| Cc: Satya Sahoo
| Subject: [provenance-challenge] Re: review of workflows for pc3
|
| Yogesh,
|
| There is however an interesting technical challenge (probably
| appropriate for a provenance challenge!).
| If we intend to export provenance information into the OPM format, we
| probably need to capture this information (in part) inside the database
| processing SQL queries.
| Are you already doing this in your system?
|
| This presents us with an opportunity to have contributions from members
| of the database community.
| Who is on this list at this moment? (James? Peter? Val? Jan?  Natalia?)
|
| This will require us to structure the workflow in different "stages"
| where different technologies (including databases) are involved.
|
| Can you comment on this?
|
| Cheers,
| Luc
|
| Yogesh Simmhan wrote:
| > Hi Paul,
| >
| > Thanks for your comments. Regarding the ease of portability of the
| > Pan-STARRS Load/Merge workflow, all our activities are either SQL
| > queries and updates, or file system operations. While our current
| > executables are for MSSQL/C#, the SQL activities are simple enough to
| > port to any relational DBMS (MySQL, Apache Derby, ...) and programming
| > language. The main workflows operate on 3 relational tables with about
| > 50 columns.
| >
| > If selected, we can provide Java source code using Derby, in addition
| > to the C# version using MSSQL. We'll also provide textual descriptions
| > of the activities to enable them to be ported to other DB/languages.
| >
| > While the typical Pan-STARRS workflows operate on large datasets,
| > there is nothing that prevents the challenge workflows from operating
| > on a subset of those. Indeed, we use small CSV files and databases
| > (<1MB) for our own testing that we can provide for the challenge.
| >
| > Metadata about the telescope is not part of the normal workflow
| > pipeline, but we can consider incorporating supplementary annotations
| > about the telescope outside the scope of the workflow to see how the
| > provenance systems embed annotations in OPM and handle annotation
| > queries.
| >
| > Best,
| > --Yogesh
| >
| >
| > |
| > | pgroth at ISI.EDU wrote:
| > | > Hi,
| > | >
| > | > To kick start our discussion about what workflows should be used for
| > | > the third provenance challenge, below are my thoughts on which would
| > | > be most appropriate and some questions to the authors. First, let me
| > | > say that I thought all the workflows would provide a good basis for an
| > | > interesting challenge, but to be decisive I've selected two.
| > | >
| > | > The two selection criteria I used were the complexity of the
| > | > structures within the workflows (i.e. did they have loops,
| > | > hierarchies, collections, etc.) and how easy it would be for other
| > | > teams to get the workflows up and running. I believe, given the
| > | > complex control structures in some of these workflows, that it would
| > | > be difficult to provide intermediary data sets, and thus teams would
| > | > need to execute the workflows themselves, unlike previous challenges
| > | > where dummy components could be used.
| > | >
| > | > 1. Build and test workflow
| > | > In terms of being able to execute the workflows, the Software build
| > | > and testing workflow seems by far the easiest to get up and running.
| > | > Most systems have ant and java, and the build file can be easily
| > | > adapted to use Makefiles. Likewise, the ant file has a multi-level
| > | > hierarchy, which is an interesting structure. The downside to the
| > | > workflow is its lack of complexity: it does not have collections or
| > | > nested data sets. However, I think the workflow would make for a
| > | > simple starting point for testing interoperability before moving on to
| > | > the more complex second workflow. Furthermore, by using an ant file
| > | > the challenge does not become too workflow specific.
| > | >
| > | > 2. MSR-WSU Pan-STARRS workflow
| > | > My first choice for the second workflow is the MSR-WSU Pan-STARRS
| > | > workflow. It has a number of interesting workflow structures such as
| > | > if/else as well as loops over collections. I also like the idea of
| > | > having multiple levels of abstraction around database tables. It would
| > | > be interesting to ask for the provenance of an individual item in a
| > | > table and retrieve all the modifications on each table, including
| > | > modifications to individual items. The explicit use of database tables
| > | > might also encourage the database community to get involved with the
| > | > challenge. What do others think on this issue?
| > | >
| > | > I'm wondering if the questions about external details from the
| > | > Neptune workflow (e.g. the types of sensor detail) could be
| > | > incorporated in the Pan-STARRS workflow? For example, the telescope
| > | > from which the data was collected?
| > | >
| > | > The major reservation I have with this workflow is how easy it would
| > | > be for others to execute. Given that the Pan-STARRS workflow is
| > | > designed to work with large data, can the MSR team comment on whether
| > | > small data sets are available? Also, given that the implementation
| > | > requires .NET, how easily could this be run on non-Windows machines?
| > | > Are there non-Windows executables available?
| > | >
| > | > * myExperiment & Brain Imaging Workflows
| > | > If the Pan-STARRS workflow cannot be executed by different teams
| > | > easily, I think we should look at selecting one of these options. Can
| > | > these two teams comment on how easy it would be for others to use the
| > | > components within their workflows without invoking their particular
| > | > workflow enactment engines?
| > | >
| > | > I did like the dynamic nature of the Taverna workflow, as it makes for
| > | > a good case for provenance (e.g. the abstracts returned from PubMed
| > | > will vary over time). Could we incorporate this into our selections?
| > | >
| > | > With that, what do you think?
| > | >
| > | > Thanks,
| > | > Paul
| > | >
| > | > --------------------------------------------------------------
| > | > Paul Groth, Ph.D.
| > | > Postdoctoral Research Associate
| > | > Information Sciences Institute
| > | > University of Southern California
| > | > pgroth at isi.edu
| > | > Tel:  310 448 8482  Fax: 310 822 0751
| > | > http://www.isi.edu/~pgroth/
| > | > http://thinklinks.wordpress.org
| > | >
| > | >
| > | >
| > | >
| > | >
| > |
| > |
| > | --
| > | Professor Luc Moreau               tel:   +44 23 8059 4487
| > | Electronics and Computer Science   email: l.moreau at ecs.soton.ac.uk
| > | University of Southampton          www:   www.ecs.soton.ac.uk/~lavm
| > | Southampton SO17 1BJ               skype: prof.luc.moreau
| > | United Kingdom                     fring: Luc
| > |
| > |
| > |
| >
| >
| >
|
|
| --
| Professor Luc Moreau               tel:   +44 23 8059 4487
| Electronics and Computer Science   email: l.moreau at ecs.soton.ac.uk
| University of Southampton          www:   www.ecs.soton.ac.uk/~lavm
| Southampton SO17 1BJ               skype: prof.luc.moreau
| United Kingdom                     fring: Luc
|
|
|



