[provenance-challenge] Re: review of workflows for pc3
Luc Moreau
L.Moreau at ecs.soton.ac.uk
Wed Nov 26 22:51:19 GMT 2008
Roger Barga wrote:
>
> Hi Luc,
>
> I am joining this thread a bit late in the day, but it looks like Yogesh
> and Satya have provided the materials you requested. Something worth
> sharing is that I have been working with a group on the design of an
> open source scientific database (SciDB) and I am working on the
> provenance support. One idea I am interested in exploring is how
> OPM might fit with SciDB. That is, if we use OPM to represent
> provenance from the workflow system and then store the data in SciDB,
> could this provenance also be stored in the database? As the DML
> operates on data in SciDB the provenance data is augmented, so if the
> data is pulled out by a workflow system the provenance again
> propagates and continues to be augmented. There are obvious
> variations, in which an adapter layer translates OPM into the
> provenance model supported by SciDB (and vice-versa).
>
Roger,
I received the papers sent earlier in the mailing list. Thanks!
I think it's important to bridge the gap between so-called workflow
provenance and database provenance.
For the challenge, we should find out if OPM is right for that. I am
unclear about what the answer is.
>
>
> Are you going to attend the upcoming eScience conference and/or
> workshop in Indiana? Can't imagine why you wouldn't want to visit
> lovely Indiana in Dec. If so I would like to discuss this with you;
> otherwise, perhaps we can arrange a phone conference or some other
> location to discuss. I am keen to see us build a bridge between
> workflow provenance and database provenance and this does seem like an
> opportunity.
>
It's still teaching term for me, and I won't go to Indiana. I have
suggested to Paul I could join your meeting by
phone/skype. This may unnecessarily complicate logistics. Alternatively,
we should definitely talk.
For the challenge, it's important that we identify scientific goals (such
as the one just mentioned; collections are another interesting one). We
also have to be practical about the complexity of what people can
achieve. So PC3 should be structured in multiple stages so that teams
can focus on the issues they are interested in (or have the bandwidth to
contribute to).
Luc
>
>
> Let me know.
>
>
>
> Cheers,
>
> Roger
>
>
> PS - the last time we met you mentioned a single malt with
> 'provenance' in the name. Was that an Ardbeg Provenance by chance?
> If so, I had a chance to try it on a recent trip to Edinburgh -
> absolutely wonderful.
> ________________________________________
> From: provenance-challenge-ipaw-info-bounces at ipaw.info
> [provenance-challenge-ipaw-info-bounces at ipaw.info] On Behalf Of Luc
> Moreau [L.Moreau at ecs.soton.ac.uk]
> Sent: Wednesday, November 26, 2008 6:44 AM
> To: provenance-challenge at ipaw.info
> Cc: Satya Sahoo; Paul Groth
> Subject: [provenance-challenge] Re: review of workflows for pc3
>
> Thanks Yogesh. Are there any slides or papers about Roger's work?
>
> From a challenge viewpoint, it would be useful to characterise the
> type of provenance we would ideally like to capture within the
> database. It seems that a layered model is particularly appropriate
> here: the activity-level description could constitute an OPM account,
> whereas a more fine-grained provenance (in the database sense) could
> form another account.
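
Editorial sketch of the layered idea above: a minimal illustration, assuming
each provenance edge is simply tagged with the account it belongs to. The
class name, node identifiers and account labels are hypothetical; only the
edge names (used, wasGeneratedBy, wasDerivedFrom) come from OPM, and this is
not an OPM serialization.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Toy store of causal edges, partitioned by account (hypothetical)."""

    def __init__(self):
        # account name -> set of (effect, relation, cause) triples
        self._edges = defaultdict(set)

    def add(self, account, effect, relation, cause):
        self._edges[account].add((effect, relation, cause))

    def causes(self, account, node):
        """Return all transitive causes of `node` within a single account."""
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            for effect, _relation, cause in self._edges[account]:
                if effect == current and cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen


g = ProvenanceGraph()

# Coarse account: what the workflow system sees (tables and activities).
g.add("activity", "table:merged", "wasGeneratedBy", "process:MergeActivity")
g.add("activity", "process:MergeActivity", "used", "table:loaded")

# Fine-grained account: what the database could report (individual rows).
g.add("database", "row:merged/7", "wasDerivedFrom", "row:loaded/7")
g.add("database", "row:loaded/7", "wasDerivedFrom", "csv:batch1/line3")

coarse = g.causes("activity", "table:merged")
fine = g.causes("database", "row:merged/7")
```

Keeping the two accounts separate means a row-level question is answered
only from the database account, while the activity account stays as compact
as the workflow system recorded it.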
>
> Luc
>
>
> Yogesh Simmhan wrote:
> > Hi Luc,
> >
> > In the current system, we work around having to instrument the DB by
> > having individual SQL queries wrapped as C# activities. The activities
> > pass their input parameters through to the parameterized SQL queries.
> > Provenance is captured at the activity level. We also capture the
> > actual queries and query plans from MSSQL server, but don't integrate
> > them with the provenance yet.
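
Editorial sketch of the approach described above: a minimal illustration in
Python with sqlite3, not the MSR C#/MSSQL implementation. It assumes an
"activity" is a named wrapper around one parameterized query and that one
provenance record is written per activity run. All names here (the
`detections` table, `run_sql_activity`, `ActivityRecord`) are hypothetical.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class ActivityRecord:
    """One activity-level provenance entry (hypothetical structure)."""
    activity: str
    sql: str
    params: tuple
    rows_affected: int


@dataclass
class ProvenanceLog:
    records: list = field(default_factory=list)


def run_sql_activity(conn, log, name, sql, params=()):
    """Run one parameterized query as an activity and record its provenance."""
    cur = conn.execute(sql, params)
    rows = cur.fetchall()
    # sqlite3 reports rowcount -1 for non-DML statements such as SELECT,
    # so fall back to the number of rows fetched in that case.
    affected = cur.rowcount if cur.rowcount >= 0 else len(rows)
    log.records.append(ActivityRecord(name, sql, tuple(params), affected))
    return rows


conn = sqlite3.connect(":memory:")
log = ProvenanceLog()

run_sql_activity(conn, log, "CreateTable",
                 "CREATE TABLE detections (id INTEGER, mag REAL)")
run_sql_activity(conn, log, "LoadRow",
                 "INSERT INTO detections VALUES (?, ?)", (1, 21.5))
rows = run_sql_activity(conn, log, "ReadRow",
                        "SELECT mag FROM detections WHERE id = ?", (1,))
```

The database itself is not instrumented: the log only knows which query text
and parameters each activity passed through, which is the activity-level
granularity mentioned in the message.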
> >
> > Roger B. is working on a design and prototype for a more DB centric
> > and semantic approach using materialized views and first class
> > provenance operators. His presentation at the recent provenance in
> > workflows workshop at Utah talked about it
> > (http://wiki.esi.ac.uk/ProvenanceInWorkflows).
> >
> > Best,
> > --Yogesh
> >
> >
> > | -----Original Message-----
> > | From: provenance-challenge-ipaw-info-bounces at ipaw.info
> > | [mailto:provenance-challenge-ipaw-info-bounces at ipaw.info] On Behalf Of
> > | Luc Moreau
> > | Sent: Wednesday, November 26, 2008 4:02 AM
> > | To: provenance-challenge at ipaw.info; Paul Groth
> > | Cc: Satya Sahoo
> > | Subject: [provenance-challenge] Re: review of workflows for pc3
> > |
> > | Yogesh,
> > |
> > | There is however an interesting technical challenge (probably
> > | appropriate for a provenance challenge!).
> > | If we intend to export provenance information into the OPM format, we
> > | probably need to capture this information (in part) inside the
> > | database processing SQL queries.
> > | Are you already doing this in your system?
> > |
> > | This presents us with an opportunity to have contributions from
> > | members of the database community.
> > | Who is on this list at this moment? (James? Peter? Val? Jan?
> > | Natalia?)
> > |
> > | This will require us to structure the workflow in different "stages"
> > | where different technologies (including databases)
> > | are involved.
> > |
> > | Can you comment on this?
> > |
> > | Cheers,
> > | Luc
> > |
> > | Yogesh Simmhan wrote:
> > | > Hi Paul,
> > | >
> > | > Thanks for your comments. Regarding the ease of portability of the
> > | > Pan-STARRS Load/Merge workflow, all our activities are either SQL
> > | > queries and updates, or file system operations. While our current
> > | > executables are for MSSQL/C#, the SQL activities are simple enough to
> > | > port to any relational DBMS (MySQL, Apache Derby, ...) and programming
> > | > language. The main workflows operate on 3 relational tables with about
> > | > 50 columns.
> > | >
> > | > If selected, we can provide Java source code using Derby, in
> > | > addition to the C# version using MSSQL. We'll also provide textual
> > | > descriptions of the activities to enable them to be ported to other
> > | > DB/languages.
> > | >
> > | > While the typical Pan-STARRS workflows operate on large datasets,
> > | > there is nothing that prevents the challenge workflows from operating
> > | > on a subset of those. Indeed, we use small CSV files and databases
> > | > (<1MB) for our own testing that we can provide for the challenge.
> > | >
> > | > Metadata about the telescope is not part of the normal workflow
> > | > pipeline, but we can consider incorporating supplementary annotations
> > | > about the telescope outside the scope of the workflow to see how the
> > | > provenance systems embed annotations in OPM and handle annotation
> > | > queries.
> > | >
> > | > Best,
> > | > --Yogesh
> > | >
> > | >
> > | > |
> > | > | pgroth at ISI.EDU wrote:
> > | > | > Hi,
> > | > | >
> > | > | > To kick start our discussion about what workflows should be used
> > | > | > for the third provenance challenge, below are my thoughts on which
> > | > | > would be most appropriate and some questions to the authors. First,
> > | > | > let me say that I thought all the workflows would provide a good
> > | > | > basis for an interesting challenge, but to be decisive I've
> > | > | > selected two.
> > | > | >
> > | > | > The two selection criteria I used were the complexity of the
> > | > | > structures within the workflows (i.e. did it have loops,
> > | > | > hierarchies, collections, etc.) and how easy it would be for other
> > | > | > teams to get the workflows up and running. I believe given the
> > | > | > complex control structures in some of these workflows that it would
> > | > | > be difficult to provide intermediary data sets and thus teams would
> > | > | > need to execute the workflows themselves unlike previous challenges
> > | > | > where dummy components could be used.
> > | > | >
> > | > | > 1. Build and test workflow
> > | > | > In terms of being able to execute the workflows, the software build
> > | > | > and testing workflow seems by far the easiest to get up and
> > | > | > running. Most systems have Ant and Java, and the build file can be
> > | > | > easily adapted to use Makefiles. Likewise, the Ant file has a
> > | > | > multi-level hierarchy, which is an interesting structure. The
> > | > | > downside to the workflow is its lack of complexity: it does not
> > | > | > have collections or nested data sets. However, I think the workflow
> > | > | > would make for a simple starting point for testing interoperability
> > | > | > before moving on to the more complex second workflow. Furthermore,
> > | > | > by using an Ant file the challenge does not become too workflow
> > | > | > specific.
> > | > | >
> > | > | > 2. MSR-WSU Pan-STARRS workflow
> > | > | > My first choice for the second workflow is the MSR-WSU Pan-STARRS
> > | > | > workflow. It has a number of interesting workflow structures such
> > | > | > as if/else as well as loops over collections. I also like the idea
> > | > | > of having multiple levels of abstraction around database tables. It
> > | > | > would be interesting to ask for the provenance of an individual
> > | > | > item in a table and retrieve all the modifications on each table,
> > | > | > including modifications to individual items. The explicit use of
> > | > | > database tables might also encourage the database community to get
> > | > | > involved with the challenge. What do others think on this issue?
> > | > | >
> > | > | > I'm wondering if the questions about external details from the
> > | > | > Neptune workflow (e.g. the types of sensor detail) could be
> > | > | > incorporated in the Pan-STARRS workflow? For example, the telescope
> > | > | > which the data was collected from?
> > | > | >
> > | > | > The major reservation I have with this workflow is how easy it
> > | > | > would be for others to execute. Given the Pan-STARRS workflow is
> > | > | > designed to work with large data, can the MSR team comment on
> > | > | > whether small data sets are available? Also, given that the
> > | > | > implementation requires .NET, how easily could this be run on
> > | > | > non-Windows machines? Are there non-Windows executables available?
> > | > | >
> > | > | > * myExperiment & Brain Imaging Workflows
> > | > | > If the Pan-STARRS workflow cannot be executed by different teams
> > | > | > easily, I think we should look at selecting one of these options.
> > | > | > Can these two teams comment on how easy it would be for others to
> > | > | > use the components within their workflows without invoking their
> > | > | > particular workflow enactment engines?
> > | > | >
> > | > | > I did like the dynamic nature of the Taverna workflow as it makes
> > | > | > for a good case for provenance (e.g. the abstracts returned from
> > | > | > PubMed will vary over time). Could we incorporate this into our
> > | > | > selections?
> > | > | >
> > | > | > With that, what do you think?
> > | > | >
> > | > | > Thanks,
> > | > | > Paul
> > | > | >
> > | > | > --------------------------------------------------------------
> > | > | > Paul Groth, Ph.D.
> > | > | > Postdoctoral Research Associate
> > | > | > Information Sciences Institute
> > | > | > University of Southern California
> > | > | > pgroth at isi.edu
> > | > | > Tel: 310 448 8482 Fax: 310 822 0751
> > | > | > http://www.isi.edu/~pgroth/
> > | > | > http://thinklinks.wordpress.org
> > | > | >
> > | > | >
> > | > | >
> > | > | >
> > | > | >
> > | > |
> > | > |
> > | > | --
> > | > | Professor Luc Moreau tel: +44 23 8059 4487
> > | > | Electronics and Computer Science email: l.moreau at ecs.soton.ac.uk
> > | > | University of Southampton www: www.ecs.soton.ac.uk/~lavm
> > | > | Southampton SO17 1BJ skype: prof.luc.moreau
> > | > | United Kingdom fring: Luc
> > | > |
> > | > |
> > | > |
> > | >
> > | >
> > | >
> > |
> > |
> > | --
> > | Professor Luc Moreau tel: +44 23 8059 4487
> > | Electronics and Computer Science email: l.moreau at ecs.soton.ac.uk
> > | University of Southampton www: www.ecs.soton.ac.uk/~lavm
> > | Southampton SO17 1BJ skype: prof.luc.moreau
> > | United Kingdom fring: Luc
> > |
> > |
> > |
> >
> >
> >
>
>
> --
> Professor Luc Moreau tel: +44 23 8059 4487
> Electronics and Computer Science email: l.moreau at ecs.soton.ac.uk
> University of Southampton www: www.ecs.soton.ac.uk/~lavm
> Southampton SO17 1BJ skype: prof.luc.moreau
> United Kingdom fring: Luc
>
--
Professor Luc Moreau tel: +44 23 8059 4487
Electronics and Computer Science email: l.moreau at ecs.soton.ac.uk
University of Southampton www: www.ecs.soton.ac.uk/~lavm
Southampton SO17 1BJ skype: prof.luc.moreau
United Kingdom fring: Luc