Remote execution (code to the data)

Dave Morris dave.morris at metagrid.co.uk
Mon Oct 23 18:26:46 CEST 2023


Hi Markus,

Thank you for your feedback. Good points, and I agree with your main 
concern: that we should not try to re-invent the wheel.

> What scares me a bit in that context is that we try and develop
> our own standard for compute service discovery and job submission
> when my impression is everyone around us is doing that, too.

You are right, there are a lot of people working in this field. This is 
the next problem that needs to be solved to enable remote analysis of 
huge datasets. However, as far as I can tell, almost all of them are 
concentrating on the job submission part. Few, if any, are looking in 
detail at the service discovery part.

The ExecutionPlanner interface is designed to address the service 
discovery part of the question from a science user’s perspective.

I've already come across two of the examples you gave, CERN's Reana and 
the Common Workflow Language (CWL). I haven't come across TARDIS before; 
I will look into it.

Taking the first two, they are both platforms that enable the user to 
describe a workflow consisting of an arbitrarily complex sequence of 
steps, with details for where to get the data from and what executables 
to run.

If I am part of a small project, I may have permission to run my 
workflows on a compute platform provided by my university. One project, 
one local Reana instance provided by my institute.

If I am part of two projects, I may have access to two CWL execution 
platforms, one for each project. Which one I use depends on which data 
set I want to analyse.

If I am a member of a large project like SKA, I may have access to 16 
different execution platforms, one at each regional centre, each of 
which may have different capabilities in terms of compute resources, 
storage space and data access. Which one I should use depends on the 
complexity of my code, what compute resources it needs, which data I 
want to analyse, and the current load and capacity of each of the 
platforms.

If I am a member of the IVOA, I could query the registry and find 30+ 
different execution platforms, with a mixture of Reana, CWL and some 
custom execution platforms.

The ExecutionPlanner interface is designed to answer one question from 
the science user's perspective: can I execute this <Reana script> on 
<your platform>?

I don't need to know the details of how. I just want a simple yes|no: 
can you execute it, and if so, when and with what resources?

The platform behind the ExecutionPlanner interface may be a fully 
functional Reana deployment, in which case the answer is yes. It could 
be a custom deployment that is capable of executing some Reana 
functionality, in which case the answer will depend on the details of 
the <Reana script> submitted. Or it could be a next-generation 2028 
workflow execution system that has a backward-compatibility module for 
executing a legacy <Reana script> from 2023.

The ExecutionPlanner data model allows the user to describe the task 
they want to execute, e.g. <this Reana script>, and get a response that 
contains the information they need to make an informed choice between 
different offers from different platforms.
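
As an illustration, a 'yes' response might carry one or more offers 
describing when and with what resources the task could run. Again, the 
field names below are just my own sketch of the idea, not the agreed 
data model.

    # Illustrative only: field names are assumptions, not an agreed model.
    example_response = {
        "answer": "yes",
        "offers": [
            {
                "offer_id": "offer-001",
                "start": "2023-10-23T14:00:00Z",  # 2pm this afternoon
                "duration": "PT4H",               # a 4 hour slot (ISO 8601)
                "resources": {"cores": 8, "memory": "16GiB"},
            },
        ],
    }

    for offer in example_response["offers"]:
        print(offer["start"], offer["duration"], offer["resources"])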

The ExecutionPlanner does not replace the execution service. The actual 
execution step would still be done by the underlying Reana or CWL 
platform.

In terms of the token delegation process you describe, I totally agree. 
We should re-use what is already standard practice. If we add a security 
token to the mix, the ExecutionPlanner question becomes "Can I execute 
<this script> on <your platform> with <this token>?".

If the platform does not know how to handle the token, the 
ExecutionPlanner answer is simply 'no, unknown token type'.
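
Sketching how the token might ride along with the question (once more, 
the field names and the refusal message are my own illustration, not 
anything agreed):

    # Illustrative sketch: the token travels with the question and stays
    # opaque to the user; the platform either recognises the token type
    # or refuses the whole request.
    import requests

    question = {
        "executable": {"type": "reana-workflow",
                       "location": "https://example.org/my-workflow.tar.gz"},
        "token": {
            "type": "oidc-refresh-token",    # assumed label for the type
            "value": "opaque-credential-string",
        },
    }

    answer = requests.post(
        "https://platform.example.org/execution-planner/offers",
        json=question, timeout=30,
    ).json()

    if answer["answer"] == "no":
        print("refused:", answer.get("reason"))  # e.g. "unknown token type"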

If the platform does have the right mechanism for renewing and 
respawning the initial token into all the right sub-tokens needed to 
perform the specified workflow on the specified data, then the 
ExecutionPlanner answer is 'yes, I can offer you a 4 hour slot starting 
at 2pm this afternoon'.

If another service responds with 'yes, I can offer you an 8 hour slot 
starting at 8pm this evening', the user now has the information needed 
to choose between them. Do they go for the short slot this afternoon, or 
the longer slot in the evening?
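
In code, the user (or their tooling) might simply put the same question 
to several planners and lay the offers side by side; how they choose is 
entirely up to them. As before, the endpoints and fields are 
illustrative, and the question payload is the one shown in the earlier 
sketches.

    # Illustrative: ask the same question of several platforms and list
    # the offers side by side so the user can choose between them.
    import requests

    question = {"executable": {
        "type": "reana-workflow",
        "location": "https://example.org/my-workflow.tar.gz",
    }}

    platforms = [
        "https://platform-a.example.org/execution-planner/offers",
        "https://platform-b.example.org/execution-planner/offers",
    ]

    all_offers = []
    for url in platforms:
        reply = requests.post(url, json=question, timeout=30).json()
        if reply.get("answer") == "yes":
            all_offers.extend((url, offer) for offer in reply["offers"])

    for url, offer in all_offers:
        print(url, "->", offer["start"], "for", offer["duration"])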

The assumption is that by saying 'yes' the two services are undertaking 
to perform the necessary magic to renew and respawn tokens as required. 
However, all the details of how this happens are internal to the 
execution platform.

The science user makes their choice based on the complexity of their 
script and the urgency of their project. The science user doesn't need 
to know about the cryptographic magic involved.

Hope this helps to explain.

Cheers,
-- Dave


On 2023-10-23 08:25, Markus Demleitner wrote:
> Dear GWS,
> 
> On Thu, Oct 19, 2023 at 06:45:18AM +0100, Dave Morris wrote:
>> Vicente was concerned that this may be too complex to implement in
>> fast-moving digital scenario depicted by Cloud and Container players.
>> It might be better to concentrate on what we already have and see if
>> we can define some common patterns for accessing platforms based on a
>> minimum compatibility at technology stack level (Kubernetes, S3 etc.).
>> Based on this assessment we could measure the gap across organisations
>> in order to implement PoCs for those that are close to each other.
>> 
>> I'm following up on this in a mailing list thread because I have also
>> heard similar concerns and suggestions from other people too.
>> 
>> This thread is somewhere for us to discuss the pros and cons of the
>> two directions, the abstract ExecutionPlanner interface, and the more
>> pragmatic approach looking for common patterns in how we use the
>> technologies.
> 
> Disclaimer: I'm not actually running any compute services and don't
> intend to.  I'm an outsider in this business, and I'd have kept my
> mouth shut if others had chimed in.  Since they haven't, and since I
> think we need to discuss this, let me throw in some probably fairly
> incompetent words.
> 
> As a general stance, I am very much in favour of giving users a
> chance to avoid lock-ins, which (with network services) means
> creating facilities to discover services -- e.g., Dave's
> ExecutionPlanner -- and to have *somewhat* common interfaces to using
> them -- e.g., Dave's ExecutionWorker.
> 
> I am hence very much in favour of attempting to adopt or develop
> abstraction mechanisms where reasonable.  What scares me a bit in
> that context is that we try and develop our own standard for
> compute service discovery and job submission when my impression is
> everyone around us is doing that, too.
> 
> For instance, in the context of a national federation effort I've
> been asked to participate in, there is
> https://cobald-tardis.readthedocs.io/.  Many other things are going
> on that seem at least related, such as CERN's reana or -- this one
> I'm planning to have a closer look at -- the common workflow language
> CWL.
> 
> In comparison to many of these other efforts, the Execution Planner
> in its iWD form seems simple and pragmatic to me.  On the other hand,
> these people solve lots of hard problems that we are probably
> glossing over; I was fairly impressed by a talk about a complex
> machinery that enables *a particular* submission service to feed
> containers access tokens to *a particular* storage service so that
> long-running jobs can keep writing when the storage access tokens
> expire rapidly.  The thought of having to write an interoperable
> standard catering to this kind of thing makes me want to end this
> mail.
> 
>             -- Markus
> 
> (who's still dreaming that some advanced array manipulation scheme
> like the one I talked about in Santiago --
> http://wiki.ivoa.net/internal/IVOA/InterOpOct2017DAL/arraysql.pdf --
> accessible over plain old TAP would cover a substantial portion of
> our code-to-data requirement beyond ADQL).

