Remote execution (code to the data)
Dave Morris
dave.morris at metagrid.co.uk
Mon Oct 23 18:26:46 CEST 2023
Hi Markus,
Thank you for your feedback. Good points, and I agree with your main
concern: that we should not try to re-invent the wheel.
> What scares me a bit in that context is that we try and develop
> our own standard for compute service discovery and job submission
> when my impression is everyone around us is doing that, too.
You are right, there are a lot of people working in this field. This is
the next step that needs to be solved to enable remote analysis of very
large datasets. However, as far as I can tell, almost all of them are
concentrating on the job submission part. Few, if any, are looking in
detail at the service discovery part.
The ExecutionPlanner interface is designed to address the service
discovery part of the question from a science user’s perspective.
I've already met two of the examples you gave, CERN's Reana and the
Common Workflow Language (CWL). I haven't come across TARDIS before;
I will look into it.
Taking the first two, they are both platforms that enable the user to
describe a workflow consisting of an arbitrarily complex sequence of
steps, with details for where to get the data from and what executables
to run.
If I am part of a small project, I may have permission to run my
workflows on a compute platform provided by my university. One project,
one local Reana instance provided by my institute.
If I am part of two projects, I may have access to two CWL execution
platforms, one for each project. Which one I use depends on which data
set I want to analyse.
If I am a member of a large project like SKA, I may have access to 16
different execution platforms, one at each regional centre, each of
which may have different capabilities in terms of compute resources,
storage space and data access. Which one I should use depends on the
complexity of my code, what compute resources it needs, which data I
want to analyse, and the current load and capacity of each of the
platforms.
If I am a member of the IVOA, I could query the registry and find 30+
different execution platforms, with a mixture of Reana, CWL and some
custom execution platforms.
The ExecutionPlanner interface is designed to answer one question from
the science user’s perspective: can I execute this <Reana script> on
<your platform>?
I don't need to know the details of how. I just want a simple yes|no:
can you execute it, and if so, when and with what resources?
The platform behind the Execution Planner interface may be a fully
functional Reana deployment, in which case the answer is yes. It could
be a custom deployment that is capable of executing some Reana
functionality, in which case the answer will depend on the details of
the <Reana script> submitted. Or it could be a next generation 2028
workflow execution system that has a backward compatibility module for
executing legacy <Reana script> from 2023.
The ExecutionPlanner data model allows the user to describe the task
they want to execute e.g. <this Reana script> and get a response that
contains the information they need to make an informed choice between
different offers from different platforms.
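As a rough illustration, the question-and-answer exchange might look
something like this. To be clear, the class and field names below are
invented for this sketch; they are not taken from any draft
specification, and the platform name is made up.

```python
from dataclasses import dataclass

# Hypothetical sketch of the ExecutionPlanner exchange described above.
# All names here are illustrative assumptions, not part of a real spec.

@dataclass
class ExecutableTask:
    """What the science user wants to run, e.g. <this Reana script>."""
    executable_type: str      # e.g. "reana-workflow" or "cwl"
    executable_ref: str       # where to fetch the script itself
    min_cores: int = 1        # resource hints the planner can match
    min_memory_gb: int = 1

@dataclass
class Offer:
    """A platform's answer: yes, and here is what I can give you."""
    platform: str
    start_time: str           # e.g. "2023-10-23T14:00:00Z"
    duration_hours: int
    cores: int
    memory_gb: int

def can_execute(task: ExecutableTask, supported_types: set) -> bool:
    """The simple yes|no part of the question."""
    return task.executable_type in supported_types

task = ExecutableTask("reana-workflow",
                      "https://example.org/my-workflow.yaml")

# A Reana-capable platform says yes and attaches an offer ...
if can_execute(task, {"reana-workflow"}):
    offer = Offer("regional-centre-a", "2023-10-23T14:00:00Z",
                  4, task.min_cores, task.min_memory_gb)

# ... while a CWL-only platform simply says no.
print(can_execute(task, {"cwl"}))
```

The point of the sketch is only the shape of the conversation: a task
description goes in, and either a plain 'no' or a concrete offer comes
back.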
The ExecutionPlanner does not replace the execution service. The actual
execution step would still be done by the underlying Reana or CWL
platform.
In terms of the token delegation process you describe, I totally agree.
We should re-use what is already standard practice. If we add a security
token to the mix, the ExecutionPlanner question becomes "Can I execute
<this script> on <your platform> with <this token>?"
If the platform does not know how to handle the token, the
ExecutionPlanner answer is simply 'no, unknown token type'.
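In sketch form, that check could be as simple as the following. The
token type names are invented for illustration; the point is only that
an unknown type yields a plain 'no' rather than an error somewhere deep
inside the platform.

```python
# Hypothetical sketch: the platform checks the token type before
# promising anything. The type names are assumptions, not real values.
SUPPORTED_TOKEN_TYPES = {"oidc-access-token", "x509-proxy"}

def check_token(token_type: str) -> str:
    """Return the ExecutionPlanner-style answer for a given token."""
    if token_type not in SUPPORTED_TOKEN_TYPES:
        return "no, unknown token type"
    return "yes"

print(check_token("oidc-access-token"))   # -> yes
print(check_token("kerberos-ticket"))     # -> no, unknown token type
```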
If the platform does have the right mechanism for renewing and
respawning the initial token into all the right sub-tokens needed to
perform the specified workflow on the specified data, then the
ExecutionPlanner answer is 'yes, I can offer you a 4 hour slot starting
at 2pm this afternoon'.
If another service responds with 'yes, I can offer you an 8 hour slot
starting at 8pm this evening', the user now has the information needed
to choose between them. Do they go for the short slot this afternoon, or
the longer slot in the evening?
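That choice could be sketched like this. The selection rules here
(earliest finish versus longest slot) are just examples of what a user
or their tooling might apply; they are not part of any interface.

```python
# Hypothetical sketch of choosing between two offers, matching the
# 2pm/4-hour vs 8pm/8-hour example above. Platform names are invented.
from datetime import datetime, timedelta

offers = [
    {"platform": "centre-a",
     "start": datetime(2023, 10, 23, 14, 0), "hours": 4},
    {"platform": "centre-b",
     "start": datetime(2023, 10, 23, 20, 0), "hours": 8},
]

def finish(offer):
    """When a given slot would end."""
    return offer["start"] + timedelta(hours=offer["hours"])

# An urgent project picks the slot that finishes first ...
urgent = min(offers, key=finish)
# ... a long-running workflow picks the slot with the most hours.
long_job = max(offers, key=lambda o: o["hours"])

print(urgent["platform"])    # -> centre-a
print(long_job["platform"])  # -> centre-b
```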
The assumption is that by saying 'yes' the two services are undertaking
to perform the necessary magic to renew and respawn tokens as required.
However, all the details of how this happens are internal to the
execution platform.
The science user makes their choice based on the complexity of their
script and the urgency of their project. The science user doesn't need
to know about the cryptographic magic involved.
Hope this helps to explain.
Cheers,
-- Dave
On 2023-10-23 08:25, Markus Demleitner wrote:
> Dear GWS,
>
> On Thu, Oct 19, 2023 at 06:45:18AM +0100, Dave Morris wrote:
>> Vicente was concerned that this may be too complex to implement in
>> the fast-moving digital scenario depicted by Cloud and Container
>> players. It might be better to concentrate on what we already have
>> and see if we can define some common patterns for accessing platforms
>> based on a minimum compatibility at technology stack level
>> (Kubernetes, S3 etc.). Based on this assessment we could measure the
>> gap across organisations in order to implement PoCs for those that
>> are close to each other.
>>
>> I'm following up on this in a mailing list thread because I have also
>> heard similar concerns and suggestions from other people too.
>>
>> This thread is somewhere for us to discuss the pros and cons of the
>> two directions, the abstract ExecutionPlanner interface, and the more
>> pragmatic approach looking for common patterns in how we use the
>> technologies.
>
> Disclaimer: I'm not actually running any compute services and don't
> intend to. I'm an outsider in this business, and I'd have kept my
> mouth shut if others had chimed in. Since they haven't, and since I
> think we need to discuss this, let me throw in some probably fairly
> incompetent words.
>
> As a general stance, I am very much in favour of giving users a
> chance to avoid lock-ins, which (with network services) means
> creating facilities to discover services -- e.g., Dave's
> ExecutionPlanner -- and to have *somewhat* common interfaces to using
> them -- e.g., Dave's ExecutionWorker.
>
> I am hence very much in favour of attempting to adopt or develop
> abstraction mechanisms where reasonable. What scares me a bit in
> that context is that we try and develop our own standard for
> compute service discovery and job submission when my impression is
> everyone around us is doing that, too.
>
> For instance, in the context of a national federation effort I've
> been asked to participate in, there is
> https://cobald-tardis.readthedocs.io/. Many other things are going
> on that seem at least related, such as CERN's reana or -- this one
> I'm planning to have a closer look at -- the common workflow language
> CWL.
>
> In comparison to many of these other efforts, the Execution Planner
> in its iWD form seems simple and pragmatic to me. On the other hand,
> these people solve lots of hard problems that we are probably
> glossing over; I was fairly impressed by a talk about a complex
> machinery that enables *a particular* submission service to feed
> containers access tokens to *a particular* storage service so that
> long-running jobs can keep writing when the storage access tokens
> expire rapidly. The thought of having to write an interoperable
> standard catering to this kind of thing makes me want to end this
> mail.
>
> -- Markus
>
> (who's still dreaming that some advanced array manipulation scheme
> like the one I talked about in Santiago --
> http://wiki.ivoa.net/internal/IVOA/InterOpOct2017DAL/arraysql.pdf --
> accessible over plain old TAP would cover a substantial portion of
> our code-to-data requirement beyond ADQL).