UWS 1.1 alternative, WAIT

Paul Harrison paul.harrison at manchester.ac.uk
Wed Jun 4 23:56:47 PDT 2014


Hi Dave,

I like the ?WAIT idea, but I think that it would be preferable to restrict the functionality to a new endpoint rather than put it everywhere (as I think you are suggesting). I feel this mainly because existing client and server implementations would need extensive reworking of existing functionality, whereas a new endpoint could be “added in” as extra functionality, which seems more appropriate for a point change in the standard. I also think that it is more difficult for a client to deal with a mixed ecology of 1.0 and 1.1 servers if you allow this behaviour on existing endpoints; with a single new endpoint, a simple 404 tells the client that the blocking functionality is not there.

I have answered a few specific points inline below.


On 2014-06-05, at 00:16, Dave Morris <dave.morris at metagrid.co.uk> wrote:

> Hi all,
> 
> The following ideas are based on suggestions raised at the meeting, plus some ideas discussed with Pat after the meeting. Please bear in mind some or all of this may be wrong, it is still a work in progress.
> 
> ---- DRAFT IDEA ----
> 
> What if we keep the existing endpoint and add a WAIT parameter to the UWS methods.
> 
> The WAIT parameter could be passed to any of the existing UWS methods, including both the GET and POST requests.
> 
> The meaning of the WAIT parameter is defined as :
> 
> a) If a 'state change' is not possible, because the job has reached a static 'end state', then the request returns immediately as normal.
> 
> b) If a 'state change' is still possible, then the server may block until a 'state change' does occur, up to the maximum number of seconds given by the value of the WAIT parameter.
> 
> Where a 'state change' is defined as :
> 
> c) A change to the job status or phase, or the status or value of one or more of the job results.
> 
> This could allow us to add a level of blocking behaviour to all of the UWS methods, without having to make any additional requests or redirects beyond what is already required for normal UWS operation.
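A minimal sketch of how a server might implement this blocking with a condition variable per job (the `Job` class and its field names here are hypothetical, not from any UWS implementation):

```python
import threading
import time

class Job:
    """Toy job holding a phase and a condition variable for state changes."""
    TERMINAL = {"COMPLETED", "ERROR", "ABORTED"}

    def __init__(self):
        self.phase = "PENDING"
        self._cond = threading.Condition()

    def set_phase(self, phase):
        with self._cond:
            self.phase = phase
            self._cond.notify_all()   # wake any requests blocked in handle_get

    def handle_get(self, wait=0):
        """Return the phase, blocking up to `wait` seconds for a change.

        Case (a): terminal phase, return immediately, ignoring WAIT.
        Case (b): otherwise block until a state change or the timeout.
        """
        with self._cond:
            if self.phase in self.TERMINAL or wait <= 0:
                return self.phase
            start = self.phase
            deadline = time.monotonic() + wait
            while self.phase == start:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                self._cond.wait(remaining)
            return self.phase
```

A POST that changes the phase calls `set_phase`, which wakes every request currently blocked in `handle_get`; a job in an end state ignores WAIT entirely, as in case (a) above.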
> 
> ----
> 
> For example :
> 
> To start a UWS job, we currently use a POST request to set the phase, and then use GET requests to poll the server to see when the job completes.
> 
> If we add a WAIT parameter to the initial POST request, then it may block until either the state changes or the time limit is reached.
> 
> A POST request with ?PHASE=RUN&WAIT=60 means the client is willing to wait for 60 seconds before it expects a response.

The ?PHASE=RUN is a command to the UWS to start the job if possible - however, if the UWS is busy then the job might just be QUEUED rather than immediately EXECUTING. The client would probably want to behave differently in these two cases and would prefer to know immediately, which would make WAIT superfluous for the initial POST. I also think that waiting for a response on the initial creation of the job blurs the distinction between the acceptance (or not) of the job by the UWS and some sort of network failure, and makes the logic that the client needs to employ more difficult - it is simpler if the initial POST for job creation returns immediately and the blocking (or not) interactions start from there.

> 
> If a 'state change' occurs before the 60 seconds has elapsed, then the server may respond as soon as the event occurs, otherwise the server may block until the 60 seconds has expired before responding.
> 
> To poll the status a client normally sends a GET request to one of the job endpoints.
> 
> A GET request with ?WAIT=60 means that the client is willing to wait for 60 seconds before it expects a response.
> 
> If a 'state change' occurs during this time, then the server may respond immediately. Otherwise the server may block until the 60 seconds has expired before responding.
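On the client side the request itself is unchanged apart from the extra parameter; the one practical point is that the client's network timeout must exceed the WAIT value, or a correctly blocking server will look like a dead one. A small sketch (helper names are mine; the parameter name follows the draft):

```python
from urllib.parse import urlencode

def blocking_poll_url(job_url, wait_seconds):
    """Build the URL for a long-poll GET, e.g. .../phase?WAIT=60.

    Hypothetical helper; only the WAIT parameter comes from the draft.
    """
    return "%s?%s" % (job_url, urlencode({"WAIT": wait_seconds}))

def client_timeout(wait_seconds, slack=10):
    """Network timeout for the request: the server may legitimately hold
    the connection open for the full WAIT period, so the client must wait
    at least that long, plus some slack, before treating silence as a
    network error."""
    return wait_seconds + slack
```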
> 
> ----
> 
> Note - the use of 'may' rather than 'will' or 'must' in the descriptions is deliberate. As someone said in the meeting, implementing blocking calls over a network connection is error prone and should not be relied upon. Which means client code should always plan for and expect errors to occur, and we should design the specification with that in mind.
> 
> ----

This is an area that does need a bit of thought with the blocking idea - clearly some sort of network error is different from a blocked response - and if these errors never occurred there would be no point in the asynchronous UWS anyway; all calls could just be synchronous. I think we can simply expect clients to behave as they would normally for a communications error with the server, and we probably do not have to say much in the standard about this.

> 
> The definition of a 'state change' is deliberately broad, so as to include things like a change in the value of a result while the job is executing.
> 
> One of the use cases we would like to be able to support is to use an inline result value (planned for the next version of UWS) to include a row count in our TAP service results. The row count would be updated regularly during the query processing, enabling the client to 'peek' at the current value using a GET request. In this example, a change to the value of the row counter would constitute a 'state change' and trigger a response to any waiting WAIT requests.
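One way a server might implement this broad definition is to snapshot everything that counts as state and compare snapshots; a sketch (the job and result field names here are hypothetical):

```python
def job_state(job):
    """Snapshot of everything the draft counts as 'state': the phase plus
    the status and value of each result. Field names are assumed, not
    taken from the UWS schema."""
    return (job["phase"],
            tuple(sorted((r["id"], r.get("status"), r.get("value"))
                         for r in job.get("results", []))))

def state_changed(before, after):
    """True if any part of the snapshot differs - e.g. a row-count result
    updated mid-query counts as a change even though the phase did not."""
    return job_state(before) != job_state(after)
```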
> 
> ----
> 
> The idea behind the blocking request(s) was to avoid the additional processing and network traffic caused by clients using short polling loops with little or no delay to make the client appear responsive to changes in the job state.
> 
> Given that a version 1.0 client would already be making multiple polling requests anyway, it makes sense to define 'state change' as broadly as possible, placing the onus on the client to detect and handle false positives when they occur.
> 
> Specifically, even if the client requests WAIT=60, the server may still respond before the 60 seconds has been reached, even if there has been no 'state change'. It is up to the client to determine what if any changes have occurred by checking the content of the response - which is what the client would have to have done anyway if it had been making repeated polling requests without the blocking WAIT.
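A client loop tolerating false positives might look like this sketch, where `read_phase` stands in for a GET with ?WAIT= (a hypothetical callable, not a real UWS client API):

```python
def wait_for_change(read_phase, wait=60, max_attempts=5):
    """Poll until the phase actually changes, tolerating false positives.

    `read_phase(wait)` stands in for a GET with ?WAIT=...; a 1.1 server
    may return early, and a 1.0 server will return immediately, so the
    loop compares each response against the last known phase itself.
    """
    last = read_phase(0)             # immediate snapshot, no blocking
    for _ in range(max_attempts):
        phase = read_phase(wait)
        if phase != last:            # a real state change
            return phase
        # false positive (or a 1.0 server): just ask again
    return last
```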

> 
> ----
> 
> Allowing for false positives in the specification has the effect of making the WAIT parameter optional, avoiding the need to wait for version 2.0 of the specification.

We would need to think about “allowing” a server to respond before the 60s is up, because in extremis that is just like saying that the server need not block at all, and this is another reason for not putting WAIT on all endpoints in a 1.1 version. In addition it is perfectly possible that there is no change after the 60s, and this is why I prefer the idea of a single special blocking endpoint that always redirects to the /jobs/{jobid} endpoint - all the information about what has or has not changed is in the full job response.

> 
> Any existing version 1.0 server would not understand a WAIT parameter, and would respond immediately as normal.
> 
> A version 1.1 server _may_ recognise and understand the WAIT parameter, and it _may_ delay responding until a 'state change' occurs or the time limit is reached. However, a version 1.1 server _may_ still respond earlier than the specified time limit.
> 
> The interaction between client and server for a version 1.1 service with the WAIT parameter would be the same as version 1.0 without the WAIT. Adding the WAIT parameter just delays the response from the server, reducing the overhead of frequent polling. The rest of the client / server interaction remains the same.
> 
> Using Mark's example, if the current interaction is something like
> 
>    do {
>        sleep(DELAY)
>        phase = read(job-url/phase)
>        displayToUser(phase)
>    } while ( isNotTerminal(phase) )
> 
> then the new version would be
> 
>    do {
>        phase = read(job-url/phase, DELAY)
>        displayToUser(phase)
>    } while ( isNotTerminal(phase) )
> 
> The WAIT parameter on the read() moves the sleep(DELAY) step from the client to the server, where it can be interrupted if something interesting happens. This allows us to set a longer delay, reducing the overhead caused by repeated polling, while still being able to respond rapidly when something does happen.
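The control flow of the second loop can be made concrete in Python, with `read` simulated by a callable rather than a real HTTP GET (a sketch of the control flow only):

```python
TERMINAL = {"COMPLETED", "ERROR", "ABORTED"}

def is_not_terminal(phase):
    return phase not in TERMINAL

def poll_until_done(read, delay=60, display=print):
    """Mark's second loop: the DELAY moves into the read itself, where
    the server can cut it short as soon as something interesting happens."""
    while True:
        phase = read(delay)          # GET job-url/phase?WAIT=delay
        display(phase)
        if not is_not_terminal(phase):
            return phase
```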
> 
> ----
> 
> Hope this helps,

it does!

Paul.




