[provenance-challenge] Re: review of workflows for pc3
Jose Manuel Gómez Pérez
jmgomez at isoco.com
Wed Nov 26 23:23:59 GMT 2008
Luc, Roger,
Sorry to break in, but perhaps you are referring to this whisky? I found
it in Inverness, though to be honest I didn't dare buy it. Anyway, I
couldn't resist taking a picture.
Cheers,
Jose
Luc Moreau wrote:
> Roger Barga wrote:
>>
>> PS - the last time we met you mentioned a single malt with
>> 'provenance' in the name. Was that an Ardbeg Provenance by chance?
>> If so, I had a chance to try it on a recent trip to Edinburgh -
>> absolutely wonderful.
>>
>
> It was! (and for the couple remaining glasses, no doubt will be).
>
> Trying to search for the provenance of 'provenance whisky', I found two
> interesting pages:
>
> http://www.thewhiskyexchange.com/P-6279.aspx
> http://www.whiskymag.com/whisky/brand/ardbeg/whisky615.html
>
> It is really great!
> Luc
>
>> ________________________________________
>> From: provenance-challenge-ipaw-info-bounces at ipaw.info
>> [provenance-challenge-ipaw-info-bounces at ipaw.info] On Behalf Of Luc
>> Moreau [L.Moreau at ecs.soton.ac.uk]
>> Sent: Wednesday, November 26, 2008 6:44 AM
>> To: provenance-challenge at ipaw.info
>> Cc: Satya Sahoo; Paul Groth
>> Subject: [provenance-challenge] Re: review of workflows for pc3
>>
>> Thanks Yogesh. Are there any slides or papers about Roger's work?
>>
>> From a challenge viewpoint, it would be useful to characterise the
>> type of provenance we would ideally like to capture within the
>> database. It seems that a layered model is particularly appropriate
>> here: the activity-level description could constitute one OPM account,
>> whereas a more fine-grained provenance (in the database sense) could
>> form another account.
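
A minimal sketch of the layered model suggested above, assuming each OPM
edge simply carries the set of accounts it belongs to. The class names,
node names and the two account labels ("activity" and "database") are
hypothetical illustrations, not taken from the OPM specification or from
any challenge entry.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    kind: str            # OPM edge type, e.g. "used" or "wasGeneratedBy"
    effect: str          # node at the effect end of the edge
    cause: str           # node at the cause end of the edge
    accounts: frozenset  # accounts (views) this edge belongs to


@dataclass
class ProvenanceGraph:
    edges: list = field(default_factory=list)

    def add(self, kind, effect, cause, accounts):
        self.edges.append(Edge(kind, effect, cause, frozenset(accounts)))

    def view(self, account):
        """Return only the edges that belong to the given account."""
        return [e for e in self.edges if account in e.accounts]


g = ProvenanceGraph()

# Coarse account: one workflow activity reads a file and writes a table.
g.add("used", "LoadActivity", "batch.csv", {"activity"})
g.add("wasGeneratedBy", "DetectionsTable", "LoadActivity", {"activity"})

# Fine-grained account: the same step described per row (hypothetical names).
g.add("used", "InsertRow1", "batch.csv:line1", {"database"})
g.add("wasGeneratedBy", "Detections.row1", "InsertRow1", {"database"})

for edge in g.view("database"):
    print(edge.kind, edge.effect, "<-", edge.cause)
```

Both accounts describe the same execution at different granularity, so a
query can pick whichever view it needs.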
>>
>> Luc
>>
>>
>> Yogesh Simmhan wrote:
>> > Hi Luc,
>> >
>> > In the current system, we work around having to instrument the DB by
>> > having individual SQL queries wrapped as C# activities. The activities
>> > pass the input params through to the parameterized SQL queries.
>> > Provenance is captured at the activity level. We also capture the
>> > actual queries and query plans from MSSQL server, but don't integrate
>> > them with the provenance yet.
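
A rough sketch of activity-level capture as described above, written in
Python with SQLite rather than the C#/MSSQL setup mentioned in the thread.
The function run_sql_activity, the provenance_log list and the detections
table are all hypothetical names used only for illustration.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical in-memory store of activity-level provenance records.
provenance_log = []


def run_sql_activity(conn, name, sql, params):
    """Run one parameterized SQL statement as an activity and record it."""
    started = datetime.now(timezone.utc).isoformat()
    cursor = conn.execute(sql, params)  # params are passed through unchanged
    rows = cursor.fetchall()            # empty list for non-SELECT statements
    provenance_log.append({
        "activity": name,
        "sql": sql,
        "params": dict(params),
        "rows_affected": cursor.rowcount,  # -1 for SELECT in sqlite3
        "started": started,
    })
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detections (id INTEGER PRIMARY KEY, flux REAL)")

run_sql_activity(
    conn, "load_detection",
    "INSERT INTO detections (id, flux) VALUES (:id, :flux)",
    {"id": 1, "flux": 0.5},
)
rows = run_sql_activity(
    conn, "select_bright",
    "SELECT id FROM detections WHERE flux > :min_flux",
    {"min_flux": 0.1},
)
```

The database itself is not instrumented here: the record only says which
activity ran which query with which parameters, not which rows it touched.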
>> >
>> > Roger B. is working on a design and prototype for a more DB-centric
>> > and semantic approach using materialized views and first-class
>> > provenance operators. His presentation at the recent Provenance in
>> > Workflows workshop at Utah talked about it
>> > (http://wiki.esi.ac.uk/ProvenanceInWorkflows).
>> >
>> > Best,
>> > --Yogesh
>> >
>> >
>> > | -----Original Message-----
>> > | From: provenance-challenge-ipaw-info-bounces at ipaw.info
>> > | [mailto:provenance-challenge-ipaw-info-bounces at ipaw.info]
>> > | On Behalf Of Luc Moreau
>> > | Sent: Wednesday, November 26, 2008 4:02 AM
>> > | To: provenance-challenge at ipaw.info; Paul Groth
>> > | Cc: Satya Sahoo
>> > | Subject: [provenance-challenge] Re: review of workflows for pc3
>> > |
>> > | Yogesh,
>> > |
>> > | There is however an interesting technical challenge (probably
>> > | appropriate for a provenance challenge!).
>> > | If we intend to export provenance information into the OPM format, we
>> > | probably need to capture this information (in part) inside the
>> > | database processing SQL queries.
>> > | Are you already doing this in your system?
>> > |
>> > | This presents us with an opportunity to have contributions from
>> > | members of the database community.
>> > | Who is on this list at this moment? (James? Peter? Val? Jan?
>> > | Natalia?)
>> > |
>> > | This will require us to structure the workflow in different "stages"
>> > | where different technologies (including databases)
>> > | are involved.
>> > |
>> > | Can you comment on this?
>> > |
>> > | Cheers,
>> > | Luc
>> > |
>> > | Yogesh Simmhan wrote:
>> > | > Hi Paul,
>> > | >
>> > | > Thanks for your comments. Regarding the ease of portability of the
>> > | > Pan-STARRS Load/Merge workflow, all our activities are either SQL
>> > | > queries and updates, or file system operations. While our current
>> > | > executables are for MSSQL/C#, the SQL activities are simple enough
>> > | > to port to any relational DBMS (MySQL, Apache Derby, ...) and
>> > | > programming language. The main workflows operate on 3 relational
>> > | > tables with about 50 columns.
>> > | >
>> > | > If selected, we can provide Java source code using Derby, in
>> > | > addition to the C# version using MSSQL. We'll also provide textual
>> > | > descriptions of the activities to enable them to be ported to other
>> > | > DB/languages.
>> > | >
>> > | > While the typical Pan-STARRS workflows operate on large datasets,
>> > | > there is nothing that prevents the challenge workflows from
>> > | > operating on a subset of those. Indeed, we use small CSV files and
>> > | > databases (<1MB) for our own testing that we can provide for the
>> > | > challenge.
>> > | >
>> > | > Metadata about the telescope is not part of the normal workflow
>> > | > pipeline, but we can consider incorporating supplementary
>> > | > annotations about the telescope outside the scope of the workflow
>> > | > to see how the provenance systems embed annotations in OPM and
>> > | > handle annotation queries.
>> > | >
>> > | > Best,
>> > | > --Yogesh
>> > | >
>> > | >
>> > | > |
>> > | > | pgroth at ISI.EDU wrote:
>> > | > | > Hi,
>> > | > | >
>> > | > | > To kick-start our discussion about what workflows should be
>> > | > | > used for the third provenance challenge, below are my thoughts
>> > | > | > on which would be most appropriate and some questions to the
>> > | > | > authors. First, let me say that I thought all the workflows
>> > | > | > would provide a good basis for an interesting challenge, but to
>> > | > | > be decisive I've selected two.
>> > | > | >
>> > | > | > The two selection criteria I used were the complexity of the
>> > | > | > structures within the workflows (i.e. did it have loops,
>> > | > | > hierarchies, collections, etc.) and how easy it would be for
>> > | > | > other teams to get the workflows up and running. I believe,
>> > | > | > given the complex control structures in some of these workflows,
>> > | > | > that it would be difficult to provide intermediary data sets and
>> > | > | > thus teams would need to execute the workflows themselves,
>> > | > | > unlike previous challenges where dummy components could be used.
>> > | > | >
>> > | > | > 1. Build and test workflow
>> > | > | > In terms of being able to execute the workflows, the Software
>> > | > | > build and testing workflow seems by far the easiest to get up
>> > | > | > and running. Most systems have ant and java and the build file
>> > | > | > can be easily adapted to use Makefiles. Likewise, the ant file
>> > | > | > has a multi-level hierarchy, which is an interesting structure.
>> > | > | > The downside to the workflow is its lack of complexity: it does
>> > | > | > not have collections or nested data sets. However, I think the
>> > | > | > workflow would make for a simple starting point for testing
>> > | > | > interoperability before moving on to the more complex second
>> > | > | > workflow. Furthermore, by using an ant file the challenge does
>> > | > | > not become too workflow specific.
>> > | > | >
>> > | > | > 2. MSR-WSU Pan-STARRS workflow
>> > | > | > My first choice for second workflow is the MSR-WSU Pan-STARRS
>> > | > | > workflow. It has a number of interesting workflow structures
>> > | > | > such as if/else as well as loops over collections. I also like
>> > | > | > the idea of having multiple levels of abstraction around
>> > | > | > database tables. It would be interesting to ask for the
>> > | > | > provenance of an individual item in a table and retrieve all
>> > | > | > the modifications on each table, including modifications to
>> > | > | > individual items. The explicit use of database tables might
>> > | > | > also encourage the database community to get involved with the
>> > | > | > challenge. What do others think on this issue?
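
One possible way to answer the item-level question raised above is to
keep a history table filled by triggers. This is only a sketch using
SQLite from Python; the objects table, its columns, the trigger names and
the item_history helper are hypothetical and are not part of the
Pan-STARRS schema or of any workflow submitted to the challenge.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE objects (id INTEGER PRIMARY KEY, magnitude REAL);

-- One row per modification of an item in the objects table.
CREATE TABLE objects_history (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    object_id INTEGER,
    op TEXT,
    old_magnitude REAL,
    new_magnitude REAL
);

CREATE TRIGGER objects_ins AFTER INSERT ON objects BEGIN
    INSERT INTO objects_history (object_id, op, old_magnitude, new_magnitude)
    VALUES (NEW.id, 'INSERT', NULL, NEW.magnitude);
END;

CREATE TRIGGER objects_upd AFTER UPDATE ON objects BEGIN
    INSERT INTO objects_history (object_id, op, old_magnitude, new_magnitude)
    VALUES (NEW.id, 'UPDATE', OLD.magnitude, NEW.magnitude);
END;
""")

conn.execute("INSERT INTO objects VALUES (1, 21.5)")
conn.execute("INSERT INTO objects VALUES (2, 19.0)")
conn.execute("UPDATE objects SET magnitude = 21.3 WHERE id = 1")


def item_history(conn, object_id):
    """Return every recorded modification of one item, oldest first."""
    return conn.execute(
        "SELECT op, old_magnitude, new_magnitude FROM objects_history "
        "WHERE object_id = ? ORDER BY seq",
        (object_id,),
    ).fetchall()


print(item_history(conn, 1))
```

Dropping the WHERE clause in item_history would give the table-level view,
that is, every modification made to the table.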
>> > | > | >
>> > | > | > I'm wondering if the questions about external details from the
>> > | > | > Neptune workflow (e.g. the types of sensor detail) could be
>> > | > | > incorporated in the Pan-STARRS workflow? For example, the
>> > | > | > telescope which the data was collected from?
>> > | > | >
>> > | > | > The major reservation I have with this workflow is how easy it
>> > | > | > would be for others to execute. Given the Pan-STARRS workflow
>> > | > | > is designed to work with large data, can the MSR team comment
>> > | > | > on whether small data sets are available? Also, given that the
>> > | > | > implementation requires .NET, how easily could this be run on
>> > | > | > non-Windows machines? Are there non-Windows executables
>> > | > | > available?
>> > | > | >
>> > | > | > * myExperiment & Brain Imaging Workflows
>> > | > | > If the Pan-STARRS workflow cannot be executed by different
>> > | > | > teams easily, I think we should look at selecting one of these
>> > | > | > options. Can these two teams comment on how easy it would be
>> > | > | > for others to use the components within their workflows
>> > | > | > without invoking their particular workflow enactment engines?
>> > | > | >
>> > | > | > I did like the dynamic nature of the Taverna workflow as it
>> > | > | > makes for a good case for provenance (e.g. the abstracts
>> > | > | > returned from PubMed will vary over time). Could we incorporate
>> > | > | > this into our selections?
>> > | > | >
>> > | > | > With that, what do you think?
>> > | > | >
>> > | > | > Thanks,
>> > | > | > Paul
>> > | > | >
>> > | > | > --------------------------------------------------------------
>> > | > | > Paul Groth, Ph.D.
>> > | > | > Postdoctoral Research Associate
>> > | > | > Information Sciences Institute
>> > | > | > University of Southern California
>> > | > | > pgroth at isi.edu
>> > | > | > Tel: 310 448 8482 Fax: 310 822 0751
>> > | > | > http://www.isi.edu/~pgroth/
>> > | > | > http://thinklinks.wordpress.org
>> > | > | >
>> > | > | >
>> > | > | >
>> > | > | >
>> > | > | >
>> > | > |
>> > | > |
>> > | > | --
>> > | > | Professor Luc Moreau tel: +44 23 8059 4487
>> > | > | Electronics and Computer Science email: l.moreau at ecs.soton.ac.uk
>> > | > | University of Southampton www: www.ecs.soton.ac.uk/~lavm
>> > | > | Southampton SO17 1BJ skype: prof.luc.moreau
>> > | > | United Kingdom fring: Luc
>> > | > |
>> > | > |
>> > | > |
>> > | >
>> > | >
>> > | >
>> > |
>> > |
>> > |
>> > |
>> > |
>> >
>> >
>> >
>>
>>
>>
>
>
--
Jose Manuel Gomez-Perez
Research Manager
jmgomez at isoco.com
#T +34913349778
#M +34609077103
Pedro de Valdivia, 10
28006 Madrid, Spain
iSOCO
enabling the networked economy
www.isoco.com
Please consider your environmental responsibility before printing this
e-mail
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 100_1251.JPG
Type: image/jpeg
Size: 586794 bytes
Desc: not available
Url : http://mailman.ecs.soton.ac.uk/pipermail/provenance-challenge-ipaw-info/attachments/20081127/202c75d7/attachment-0001.jpe