[provenance-challenge] Re: review of workflows for pc3
Roger Barga
barga at microsoft.com
Wed Nov 26 22:46:42 GMT 2008
This was no doubt one of the finest whiskies I have ever tasted. However, the saying "champagne tastes on a beer budget" comes to mind, so I will enjoy my 10-year-old Ardbeg and Ardbeg Airigh Nam Beist.
roger
________________________________________
From: provenance-challenge-ipaw-info-bounces at ipaw.info [provenance-challenge-ipaw-info-bounces at ipaw.info] On Behalf Of Luc Moreau [L.Moreau at ecs.soton.ac.uk]
Sent: Wednesday, November 26, 2008 2:43 PM
To: provenance-challenge at ipaw.info
Cc: Satya Sahoo; Paul Groth
Subject: [provenance-challenge] Re: review of workflows for pc3
Roger Barga wrote:
>
> PS - the last time we met you mentioned a single malt with
> 'provenance' in the name. Was that an Ardbeg Provenance by chance?
> If so, I had a chance to try it on a recent trip to Edinburgh -
> absolutely wonderful.
>
It was! (and for the couple remaining glasses, no doubt will be).
Trying to search for the provenance of 'provenance whisky', I found two
interesting pages:
http://www.thewhiskyexchange.com/P-6279.aspx
http://www.whiskymag.com/whisky/brand/ardbeg/whisky615.html
It is really great!
Luc
> ________________________________________
> From: provenance-challenge-ipaw-info-bounces at ipaw.info
> [provenance-challenge-ipaw-info-bounces at ipaw.info] On Behalf Of Luc
> Moreau [L.Moreau at ecs.soton.ac.uk]
> Sent: Wednesday, November 26, 2008 6:44 AM
> To: provenance-challenge at ipaw.info
> Cc: Satya Sahoo; Paul Groth
> Subject: [provenance-challenge] Re: review of workflows for pc3
>
> Thanks Yogesh. Are there any slides or papers about Roger's work?
>
> From a challenge viewpoint, it would be useful to characterise the
> type of provenance we would ideally like to capture within the
> database. It seems that a layered model is particularly appropriate
> here: the activity-level description could constitute an OPM account,
> whereas a more fine-grained provenance (in the database sense) could
> form another account.
>
> Luc
>
>
> Yogesh Simmhan wrote:
> > Hi Luc,
> >
> > In the current system, we work around having to instrument the DB by
> > having individual SQL queries wrapped as C# activities. The activities
> > pass through the input params to the parameterized SQL queries.
> > Provenance is captured at the activity level. We also capture the
> > actual queries and query plans from MSSQL server, but don't integrate
> > them with the provenance yet.
> >
> > Roger B. is working on a design and prototype for a more DB-centric
> > and semantic approach using materialized views and first-class
> > provenance operators. His presentation at the recent provenance in
> > workflows workshop at Utah talked about it
> > (http://wiki.esi.ac.uk/ProvenanceInWorkflows).
> >
> > Best,
> > --Yogesh
> >
> >
> > | -----Original Message-----
> > | From: provenance-challenge-ipaw-info-bounces at ipaw.info
> > | [mailto:provenance-challenge-ipaw-info-bounces at ipaw.info] On Behalf Of
> > | Luc Moreau
> > | Sent: Wednesday, November 26, 2008 4:02 AM
> > | To: provenance-challenge at ipaw.info; Paul Groth
> > | Cc: Satya Sahoo
> > | Subject: [provenance-challenge] Re: review of workflows for pc3
> > |
> > | Yogesh,
> > |
> > | There is however an interesting technical challenge (probably
> > | appropriate for a provenance challenge!).
> > | If we intend to export provenance information into the OPM format, we
> > | probably need to capture this information (in part) inside the
> > | database processing SQL queries.
> > | Are you already doing this in your system?
> > |
> > | This presents us with an opportunity to have contributions from
> > | members of the database community.
> > | Who is on this list at this moment? (James? Peter? Val? Jan?
> > | Natalia?)
> > |
> > | This will require us to structure the workflow in different "stages"
> > | where different technologies (including databases)
> > | are involved.
> > |
> > | Can you comment on this?
> > |
> > | Cheers,
> > | Luc
> > |
> > | Yogesh Simmhan wrote:
> > | > Hi Paul,
> > | >
> > | > Thanks for your comments. Regarding the ease of portability of the
> > | > Pan-STARRS Load/Merge workflow, all our activities are either SQL
> > | > queries and updates, or file system operations. While our current
> > | > executables are for MSSQL/C#, the SQL activities are simple enough to
> > | > port to any relational DBMS (MySQL, Apache Derby, ...) and programming
> > | > language. The main workflows operate on 3 relational tables with about
> > | > 50 columns.
> > | >
> > | > If selected, we can provide Java source code using Derby, in addition
> > | > to the C# version using MSSQL. We'll also provide textual descriptions
> > | > of the activities to enable them to be ported to other DB/languages.
> > | >
> > | > While the typical Pan-STARRS workflows operate on large datasets,
> > | > there is nothing that prevents the challenge workflows from operating
> > | > on a subset of those. Indeed, we use small CSV files and databases
> > | > (<1MB) for our own testing that we can provide for the challenge.
> > | >
> > | > Metadata about the telescope is not part of the normal workflow
> > | > pipeline, but we can consider incorporating supplementary annotations
> > | > about the telescope outside the scope of the workflow to see how the
> > | > provenance systems embed annotations in OPM and handle annotation
> > | > queries.
> > | >
> > | > Best,
> > | > --Yogesh
> > | >
> > | >
> > | > |
> > | > | pgroth at ISI.EDU wrote:
> > | > | > Hi,
> > | > | >
> > | > | > To kick start our discussion about what workflows should be used for
> > | > | > the third provenance challenge, below are my thoughts on which would
> > | > | > be most appropriate and some questions to the authors. First, let me
> > | > | > say that I thought all the workflows would provide a good basis for an
> > | > | > interesting challenge, but to be decisive I've selected two.
> > | > | >
> > | > | > The two selection criteria I used were the complexity of the
> > | > | > structures within the workflows (i.e. did it have loops, hierarchies,
> > | > | > collections, etc.) and how easy it would be for other teams to get the
> > | > | > workflows up and running. I believe that, given the complex control
> > | > | > structures in some of these workflows, it would be difficult to
> > | > | > provide intermediary data sets, and thus teams would need to execute
> > | > | > the workflows themselves, unlike previous challenges where dummy
> > | > | > components could be used.
> > | > | >
> > | > | > 1. Build and test workflow
> > | > | > In terms of being able to execute the workflows, the Software build
> > | > | > and testing workflow seems by far the easiest to get up and running.
> > | > | > Most systems have Ant and Java, and the build file can be easily
> > | > | > adapted to use Makefiles. Likewise, the Ant file has a multi-level
> > | > | > hierarchy, which is an interesting structure. The downside to the
> > | > | > workflow is its lack of complexity: it does not have collections or
> > | > | > nested data sets. However, I think the workflow would make for a
> > | > | > simple starting point for testing interoperability before moving on
> > | > | > to the more complex second workflow. Furthermore, by using an Ant file
> > | > | > the challenge does not become too workflow specific.
> > | > | >
> > | > | > 2. MSR-WSU Pan-STARRS workflow
> > | > | > My first choice for the second workflow is the MSR-WSU Pan-STARRS
> > | > | > workflow. It has a number of interesting workflow structures such as
> > | > | > if/else as well as loops over collections. I also like the idea of
> > | > | > having multiple levels of abstraction around database tables. It would
> > | > | > be interesting to ask for the provenance of an individual item in a
> > | > | > table and retrieve all the modifications on each table, including
> > | > | > modifications to individual items. The explicit use of database
> > | > | > tables might also encourage the database community to get involved
> > | > | > with the challenge. What do others think on this issue?
> > | > | >
> > | > | > I'm wondering if the questions about external details from the
> > | > | > Neptune workflow (e.g. the types of sensor detail) could be
> > | > | > incorporated in the Pan-STARRS workflow? For example, the telescope
> > | > | > which the data was collected from?
> > | > | >
> > | > | > The major reservation I have with this workflow is how easy it would
> > | > | > be for others to execute. Given the Pan-STARRS workflow is designed to
> > | > | > work with large data, can the MSR team comment on whether small data
> > | > | > sets are available? Also, given that the implementation requires
> > | > | > .NET, how easily could this be run on non-Windows machines? Are there
> > | > | > non-Windows executables available?
> > | > | >
> > | > | > * myExperiment & Brain Imaging Workflows
> > | > | > If the Pan-STARRS workflow cannot be executed easily by different
> > | > | > teams, I think we should look at selecting one of these options. Can
> > | > | > these two teams comment on how easy it would be for others to use the
> > | > | > components within their workflows without invoking their particular
> > | > | > workflow enactment engines?
> > | > | >
> > | > | > I did like the dynamic nature of the Taverna workflow as it makes for
> > | > | > a good case for provenance (e.g. the abstracts returned from PubMed
> > | > | > will vary over time). Could we incorporate this into our selections?
> > | > | >
> > | > | > With that, what do you think?
> > | > | >
> > | > | > Thanks,
> > | > | > Paul
> > | > | >
> > | > | > --------------------------------------------------------------
> > | > | > Paul Groth, Ph.D.
> > | > | > Postdoctoral Research Associate
> > | > | > Information Sciences Institute
> > | > | > University of Southern California
> > | > | > pgroth at isi.edu
> > | > | > Tel: 310 448 8482 Fax: 310 822 0751
> > | > | > http://www.isi.edu/~pgroth/
> > | > | > http://thinklinks.wordpress.org
> > | > | >
> > | > | >
> > | > | >
> > | > | >
> > | > | >
> > | > |
> > | > |
> > | > | --
> > | > | Professor Luc Moreau tel: +44 23 8059 4487
> > | > | Electronics and Computer Science email: l.moreau at ecs.soton.ac.uk
> > | > | University of Southampton www: www.ecs.soton.ac.uk/~lavm
> > | > | Southampton SO17 1BJ skype: prof.luc.moreau
> > | > | United Kingdom fring: Luc
> > | > |
> > | > |
> > | > |
> > | >
> > | >
> > | >
> > |
> > |
> > |
> > |
> > |
> >
> >
> >
>
>
>
--
Professor Luc Moreau tel: +44 23 8059 4487
Electronics and Computer Science email: l.moreau at ecs.soton.ac.uk
University of Southampton www: www.ecs.soton.ac.uk/~lavm
Southampton SO17 1BJ skype: prof.luc.moreau
United Kingdom fring: Luc
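
Editor's sketch: Yogesh's message earlier in this thread describes wrapping each SQL query as an activity, passing the input parameters through to a parameterized query, and capturing provenance at the activity level. The following is only a rough illustration of that idea, not the MSR/Pan-STARRS implementation (which is C#/MSSQL). It uses Python with SQLite purely for convenience; the function name run_sql_activity, the provenance_log structure, and the detections table are all hypothetical, and the log format is not OPM.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical in-memory provenance log (illustrative only, not OPM).
provenance_log = []


def run_sql_activity(conn, name, sql, params=()):
    """Run one parameterized SQL statement as an 'activity' and record
    an activity-level provenance entry for it."""
    started = datetime.now(timezone.utc).isoformat()
    cursor = conn.execute(sql, params)   # parameters are passed through
    rows = cursor.fetchall()             # empty list for non-SELECT statements
    conn.commit()
    provenance_log.append({
        "activity": name,                # which activity ran
        "sql": sql,                      # the query text as submitted
        "inputs": tuple(params),         # the input parameters
        "rows_returned": len(rows),
        "started": started,
    })
    return rows


# Hypothetical toy table standing in for a real workflow table.
conn = sqlite3.connect(":memory:")
run_sql_activity(conn, "create_table",
                 "CREATE TABLE detections (id INTEGER PRIMARY KEY, mag REAL)")
run_sql_activity(conn, "load_row",
                 "INSERT INTO detections (id, mag) VALUES (?, ?)", (1, 17.5))
bright = run_sql_activity(conn, "select_bright",
                          "SELECT id FROM detections WHERE mag < ?", (18.0,))

for entry in provenance_log:
    print(entry["activity"], entry["inputs"], entry["rows_returned"])
```

Note that a wrapper like this records only what went into and came out of each activity. It says nothing about which rows inside the database were touched, which is the finer-grained, database-level provenance Luc suggests could form a separate account.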