UWS 1.1 alternative, WAIT

Dave Morris dave.morris at metagrid.co.uk
Wed Jun 4 16:16:25 PDT 2014


Hi all,

The following ideas are based on suggestions raised at the meeting, plus 
some ideas discussed with Pat after the meeting. Please bear in mind that 
some or all of this may be wrong; it is still a work in progress.

---- DRAFT IDEA ----

What if we keep the existing endpoint and add a WAIT parameter to the 
UWS methods?

The WAIT parameter could be passed to any of the existing UWS methods, 
including both the GET and POST requests.

The meaning of the WAIT parameter is defined as :

a) If a 'state change' is not possible, because the job has reached a 
static 'end state', then the request returns immediately as normal.

b) If a 'state change' is still possible, then the server may block 
until a 'state change' does occur, up to the maximum number of seconds 
given by the value of the WAIT parameter.

Where a 'state change' is defined as :

c) A change to the job status or phase, or the status or value of one or 
more of the job results.

This could allow us to add a level of blocking behaviour to all of the 
UWS methods, without having to make any additional requests or redirects 
beyond what is already required for normal UWS operation.
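
As a rough illustration of rules (a) and (b), here is a minimal Python 
sketch of how a server might hold a request open. The Job class, its 
version counter and the wait_for_change() method are all hypothetical 
names invented for this example, and the set of terminal phases is an 
assumption; none of this comes from the working draft.

```python
import threading

# Assumed terminal phases: no further 'state change' is possible.
TERMINAL_PHASES = {"COMPLETED", "ERROR", "ABORTED"}


class Job:
    """Hypothetical in-memory job, for illustration only."""

    def __init__(self, phase="PENDING"):
        self.phase = phase
        self.version = 0  # incremented on every 'state change'
        self._cond = threading.Condition()

    def set_phase(self, phase):
        # Record a 'state change' and wake any blocked WAIT requests.
        with self._cond:
            self.phase = phase
            self.version += 1
            self._cond.notify_all()

    def wait_for_change(self, wait_seconds):
        """Return the current phase, blocking for at most wait_seconds
        if a 'state change' is still possible."""
        with self._cond:
            if self.phase in TERMINAL_PHASES:
                return self.phase  # rule (a): return immediately
            seen = self.version
            # rule (b): block until a change is recorded or time runs out
            self._cond.wait_for(lambda: self.version != seen,
                                timeout=wait_seconds)
            return self.phase
```

A change to a result value could call the same notify step as 
set_phase(), which is one way to get the broad definition in (c).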

----

For example :

To start a UWS job, we currently use a POST request to set the phase, 
and then use GET requests to poll the server to see when the job 
completes.

If we add a WAIT parameter to the initial POST request, then it may block 
until either the state changes or the time limit is reached.

A POST request with ?PHASE=RUN&WAIT=60 means the client is willing to 
wait for 60 seconds before it expects a response.

If a 'state change' occurs before the 60 seconds has elapsed, then the 
server may respond as soon as the event occurs, otherwise the server may 
block until the 60 seconds has expired before responding.

To poll the status a client normally sends a GET request to one of the 
job endpoints.

A GET request with ?WAIT=60 means that the client is willing to wait for 
60 seconds before it expects a response.

If a 'state change' occurs during this time, then the server may respond 
immediately. Otherwise the server may block until the 60 seconds has 
expired before responding.
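
To make the two requests concrete, here is a small Python sketch that 
builds them with the standard library without sending anything. The 
function names and the example job URL are made up for illustration. 
Sending WAIT in the form body of the POST alongside PHASE is an 
assumption; the text above only says the parameter is added to the 
request.

```python
from urllib.parse import urlencode
from urllib.request import Request


def run_request(job_url, wait_seconds):
    """Build (but do not send) a POST that sets PHASE=RUN and offers to
    wait. Assumes WAIT travels in the form body next to PHASE."""
    body = urlencode({"PHASE": "RUN", "WAIT": wait_seconds}).encode("ascii")
    return Request(job_url + "/phase", data=body, method="POST")


def poll_request(job_url, wait_seconds):
    """Build (but do not send) a GET on the phase resource with WAIT."""
    query = urlencode({"WAIT": wait_seconds})
    return Request(job_url + "/phase?" + query, method="GET")
```

When actually sending these with urllib.request.urlopen(), the client 
would probably want a timeout somewhat longer than the WAIT value, so 
that the HTTP connection does not give up before the server responds.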

----

Note - the use of 'may' rather than 'will' or 'must' in the descriptions 
is deliberate. As someone said in the meeting, implementing blocking 
calls over a network connection is error-prone and should not be relied 
upon. This means client code should always plan for and expect errors 
to occur, and we should design the specification with that in mind.

----

The definition of a 'state change' is deliberately broad, so as to 
include things like a change in the value of a result while the job is 
executing.

One of the use cases we would like to be able to support is to use an 
inline result value (planned for the next version of UWS) to include a 
row count in our TAP service results. The row count would be updated 
regularly during the query processing, enabling the client to 'peek' at 
the current value using a GET request. In this example, a change to the 
value of the row counter would constitute a 'state change' and trigger a 
response to any waiting WAIT requests.

----

The idea behind the blocking request(s) was to avoid the additional 
processing and network traffic caused by clients using short polling 
loops with little or no delay to make the client appear responsive to 
changes in the job state.

Given that the version 1.0 client would already be making multiple 
polling requests anyway, it makes sense to define 'state change' in as 
broad a way as possible, placing the onus on the client to detect and 
handle false positives when they occur.

Specifically, even if the client requests WAIT=60, the server may still 
respond before the 60 seconds have elapsed, even if there has been 
no 'state change'. It is up to the client to determine what, if any, 
changes have occurred by checking the content of the response - which is 
what the client would have had to do anyway if it had been making 
repeated polling requests without the blocking WAIT.
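
A minimal Python sketch of that client-side check, assuming the client 
keeps a simple dictionary snapshot of the job fields it cares about. 
The function name and the 'phase' and 'rowcount' keys are illustrative 
only, not part of UWS.

```python
_MISSING = object()  # marks a key that is absent from one snapshot


def changed_fields(previous, current):
    """Return the sorted keys whose values differ between two snapshots.

    An empty list means the response was a false positive: the server
    answered, but nothing the client tracks has changed.
    """
    keys = set(previous) | set(current)
    return sorted(
        k for k in keys
        if previous.get(k, _MISSING) != current.get(k, _MISSING)
    )
```

For example, comparing {"phase": "EXECUTING", "rowcount": 1200} with a 
later {"phase": "EXECUTING", "rowcount": 1500} reports only 'rowcount', 
and comparing a snapshot with an identical one reports nothing, in 
which case the client just issues the next request.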

----

Allowing for false positives in the specification has the effect of 
making the WAIT parameter optional, avoiding the need to wait for 
version 2.0 of the specification.

Any existing version 1.0 server would not understand a WAIT parameter, 
and would respond immediately as normal.

A version 1.1 server _may_ recognise and understand the WAIT parameter, 
and it _may_ delay responding until a 'state change' occurs or the time 
limit is reached. However, a version 1.1 server _may_ still respond 
earlier than the specified time limit.

The interaction between client and server for a version 1.1 service with 
the WAIT parameter would be the same as version 1.0 without the WAIT. 
Adding the WAIT parameter just delays the response from the server, 
reducing the overhead of frequent polling. The rest of the client / 
server interaction remains the same.
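
One consequence is that a client cannot tell in advance whether the 
server will honour WAIT, and a server that answers immediately would 
turn a WAIT-only loop into a tight polling loop. The following Python 
sketch shows one possible guard, which is an assumption of this example 
rather than anything proposed above: never start two requests less than 
min_interval seconds apart. The read() callable, paced_reads() and 
min_interval are hypothetical names.

```python
import time


def paced_reads(read, wait_seconds, min_interval,
                clock=time.monotonic, sleep=time.sleep):
    """Yield successive read(wait_seconds) results, sleeping first if the
    previous request started less than min_interval seconds ago."""
    last_start = None
    while True:
        if last_start is not None:
            remaining = min_interval - (clock() - last_start)
            if remaining > 0:
                # The server answered early (or ignored WAIT): back off.
                sleep(remaining)
        last_start = clock()
        yield read(wait_seconds)
```

If the server does block for longer than min_interval, the guard never 
sleeps and the loop behaves exactly like the plain WAIT version.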

Using Mark's example, if the current interaction is something like

     do {
         sleep(DELAY)
         phase = read(job-url/phase)
         displayToUser(phase)
     } while ( isNotTerminal(phase) )

then the new version would be

     do {
         phase = read(job-url/phase, DELAY)
         displayToUser(phase)
     } while ( isNotTerminal(phase) )

The WAIT parameter on the read() moves the sleep(DELAY) step from the 
client to the server, where it can be interrupted if something 
interesting happens. This allows us to set a longer delay, reducing the 
overhead caused by repeated polling, while still being able to respond 
rapidly when something does happen.
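
For anyone who wants to try the second loop, here is a runnable Python 
rendering of it. The read_phase callable stands in for the GET with 
WAIT and is supplied by the caller; follow_job() and the set of 
terminal phases are assumptions made for this sketch.

```python
# Assumed terminal phases, after which polling stops.
TERMINAL_PHASES = {"COMPLETED", "ERROR", "ABORTED"}


def follow_job(read_phase, display, wait_seconds=60):
    """Poll until the job reaches a terminal phase and return that phase.

    read_phase(wait_seconds) performs the GET with WAIT and returns the
    phase as a string; display(phase) shows it to the user.
    """
    while True:
        phase = read_phase(wait_seconds)
        display(phase)
        if phase in TERMINAL_PHASES:
            return phase
```

Because the phase is displayed on every pass, a response that arrives 
with no change simply shows the same phase again and the loop carries 
on.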

----

Hope this helps,
Dave



--------
Dave Morris
Software Developer
Wide Field Astronomy Unit
Institute for Astronomy
University of Edinburgh
--------

On 2014-06-04 07:58, Paul Harrison wrote:
> Hi Mark/Pat
> 
> We certainly could expand the blocking behaviour beyond what is in the
> WD - I kept it reasonably conservative (i.e. the change of the
> EXECUTING phase) as I felt that was the general consensus of the
> meeting - At a minimum I think we should adopt Mark’s idea if we want
> to restrict the blocking to phase, but allow for different phase
> transitions to block.
> 
> We could go further and do as Pat/Dave say and allow the endpoint
> (though we should change the blocking URL from {jobid}/blockingphase
> to something with a more general end path segment if adopted) to block
> until the UWS deems that something “interesting” has happened, and I
> agree that in this case it would be better to redirect to the job URL
> as the response so that the client could detect what has changed (to
> allow for the partial results use case).
> 
> I think that we could avoid the race conditions that Mark is worried
> about with some explicit rule on the server that it never blocks for
> one of the terminal phases (as Pat said below). Additionally saying
> that the server must return always when phase changes (but can return
> sooner) should address Mark’s concerns I think.
> 
> It seems to me a pretty sensible suggestion to make this full blocking
> generalisation  - my only concern is really that we might be missing
> something that does not allow this to be a “1.1” release - i.e. breaks
> backward compatibility. If no-one comes up with any objections I will
> make the changes to the WD.
> 
> Paul.
> 
> On 2014-06-04, at 00:17, Mark Taylor <M.B.Taylor at bristol.ac.uk> wrote:
> 
>> Pat,
>> 
>> I'm fine with the sentiment of this: blocking poll returns whenever
>> anything changes, and leaves you to go and find out the new state
>> of the job.  But it's hard to make it robust against race conditions
>> unless the pre-change behaviour is defined implicitly (as in the text
>> from the current WD) or explicitly (as in my suggestion).
>> 
>> If the behaviour is simply, as you say:
>> 
>>> - block until something in the job changes (any phase change,
>>> addition of result(s), etc) and then redirects (to the job url)
>> 
>> then to, e.g., wait for a change you will have to do something like:
>> 
>>    do {
>>        phase = read(job-url/phase)
>>        displayToUser(phase)
>>        waitFor(job-url/blocking-phase)
>>    } while ( isNotTerminal(phase) )
>> 
>> but the trouble is that between the invocations:
>> 
>>   phase=read(job-url/phase)
>> 
>> and
>> 
>>   waitFor(job-url/blocking-phase)
>> 
>> the phase might have changed and you'd never know, so you could
>> miss a transition, or in the worst case (e.g. if it went from
>> EXECUTING to COMPLETED between those calls) be sat there for ever,
>> or at least until timeout, waiting for a change that would never
>> happen.
>> 
>> Mark
>> 
>> On Tue, 3 Jun 2014, Patrick Dowler wrote:
>> 
>>> 
>>> When Dave and I were talking about this on the bus, we came to the
>>> conclusion that the block behaviour could be quite simple and more
>>> general:
>>> 
>>> - block until something in the job changes (any phase change,
>>> addition of result(s), etc) and then redirects (to the job url)
>>> 
>>> - if the job is in a state where it cannot change (one of the
>>> terminal phases) then the resource no longer blocks, immediate
>>> redirect
>>> 
>>> 
>>> I think this way you don't really need anything extra at all. The
>>> client does need to check the job to see exactly what changed, but
>>> they can just get the phase if that is what they care about.
>>> 
>>> The client is "notified" of every state change, but we don't ever
>>> convey the change itself in the notification (typical event-handler
>>> patterns try to do that but the payload continually changes as you
>>> adapt to new use cases; I think we need to avoid that trap). For
>>> example, say we decided to add (or allow as an extension) a progress
>>> indicator in the job; an implementer could unblock whenever the
>>> progress indicator changed - the payload doesn't have to change and
>>> the client could decide to get the phase, the progress indicator,
>>> or the whole job.
>>> 
>>> In principle, the response from the block could still be the
>>> (text/plain) phase, but I find that to be marginally misleading if
>>> other changes of state are also triggering the unblock... I'm more
>>> in favour of redirecting to the job url when unblocking as it is
>>> semantically correct although less efficient in many typical cases.
>>> 
>>> 
>>> One of Dave's use cases was that they want the client to be able to
>>> see and react to results as they are added to the job (during the
>>> EXECUTING phase). I think being able to expose partial results is a
>>> nice feature that enables some interesting things without changing
>>> anything for those that don't need this.
>>> 
>>> My personal use case was similar: a web UI issuing TAP queries and
>>> trying to avoid (i) hammering the service with polling and (ii) extra
>>> apparent latency by polling less frequently.  Dave's idea to expose
>>> partial results would also allow one to support starting to read the
>>> async results before they were completely written, which would let me
>>> have the robustness of async with the faster response of sync.
>>> 
>>> Pat
>>> 
>>> On 03/06/14 09:33 AM, Mark Taylor wrote:
>>>> Paul+GWS,
>>>> 
>>>> the Blocking Endpoints addition is a great idea, I hadn't noticed
>>>> this was under consideration, sorry I couldn't attend the relevant
>>>> GWS session at ESAC since I was elsewhere.  However, if I understand
>>>> the current proposal as written in WD-UWS-1.1-20140527, it doesn't
>>>> quite fit my requirements at least.
>>>> 
>>>> In topcat and stilts, I don't just want to wait for the job to
>>>> complete, I'd like to display the current phase to the user as it
>>>> transitions between phases, in particular the potentially long-lived
>>>> phases not under client control, i.e. QUEUED, EXECUTING and maybe
>>>> SUSPENDED. Since the current proposal only lets me do a long poll to
>>>> detect the end of EXECUTING, I'm still going to have to do normal
>>>> polling to find out when EXECUTING starts.
>>>> 
>>>> I thought about suggesting that the blockingphase endpoint should
>>>> return whenever the phase changes, but that doesn't really work
>>>> because you don't know for sure what the phase was when you called
>>>> it. So instead, could that endpoint have a parameter, something like
>>>> 
>>>>    /{jobs}/(job-id)/blockingphase?not_equal_to=EXECUTING
>>>> 
>>>> which returns immediately if the phase is different from the
>>>> given parameter value (EXECUTING in the example above), and
>>>> otherwise blocks until it changes?
>>>> 
>>>> Mark

