TAP Implementation issues (cont'd): UWS

Tue Nov 3 06:55:05 PST 2009

Tom,

if it was broken for UWS, then it would be broken for all dynamic  
content produced by any web server, of any kind, anywhere. Please stop  
scaremongering. UWS works, without caching problems, in several  
implementations. Dynamic content works, and we don't have to mess  
around with HTTP details to make it work. What does not work is  
misusing HTTP GET to change state "because it's simpler".

Guy

On 3 Nov 2009, at 14:46, Tom McGlynn wrote:

> By the by, presumably this issue of whether we can count on getting  
> an up-to-date version of a UWS resource accessed via a GET call  
> applies not just to the job list but to any resource called by GET.   
> They are all dynamic.  The phase request in particular seems  
> designed to be called repeatedly in a loop and might be especially  
> liable to caching. So this needs to be resolved clearly.  The text  
> that you quoted suggests that the proposed UWS standard is on pretty  
> thin ice and that we are relying on the vagaries of current server  
> implementations.  I was surprised that CGI-like requests are  
> specifically exempted from being cached so that they are not subject  
> to this unless the user explicitly sets a future date for  
> expiration.  Perhaps going back to a more CGI-like syntax
>
>   $rooturl/$jobid?phase
> rather  than
>   $rooturl/$jobid/phase
>
> should be considered, since that seems to get the HTTP standard a  
> little closer to our side.  Or warn users to specify  
> $rooturl/$jobid/phase?junk if they wish to be careful about avoiding caching.
>
> Though I'm not sure that's really enough since even for a CGI-like  
> GET there's nothing that the current standard says that precludes a  
> service from setting an expiration date in the future.  It seems  
> that to do any GET requests properly the protocol needs to say  
> something more proactive to ensure that caching is turned off.
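
A minimal sketch of what "something more proactive" could look like, assuming the service sets its own response headers. The helper name no_cache_headers is hypothetical and not part of UWS; the header names themselves are standard HTTP.

```python
def no_cache_headers():
    """Return response headers asking caches not to reuse this response.

    Hypothetical helper for illustration; not part of the UWS text.
    """
    return {
        # HTTP/1.1 caches: do not store, and revalidate before any reuse.
        "Cache-Control": "no-store, no-cache, max-age=0",
        # Older HTTP/1.0 caches that do not understand Cache-Control.
        "Pragma": "no-cache",
        # An invalid date such as "0" is treated as already expired.
        "Expires": "0",
    }


headers = no_cache_headers()
```

A service would attach these to every GET response for a dynamic resource (job list, phase, and so on), whatever framework it uses to send headers.
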
>
>
> 	Regards,
> 	Tom
>
>
> Tom McGlynn wrote:
>> I can't really see the difference here between the create case  
>> (i.e., my
>> use of GET to create a job) and the job list case.  In fact the  
>> text you
>> quote seems to suggest that the job list is more likely to be cached
>> than a job creation request with parameters.
>> In both cases we start out with some URL (in fact the same URL for  
>> both
>> cases).  There are two differences in the processing that HTTP  
>> sees:
>> The return code is different (303 for a creation versus 200 for a job
>> list).  There is nothing that I know of in the HTTP protocol that
>> suggests that a 200 is less likely to get an expiration-time header
>> associated with it than a 303 (or more likely for that matter).
>> The more significant difference is that the job creation URL is  
>> going to
>> have job creation parameters (well at least for me!).  That  
>> suggests --
>> looking at the text you quoted below -- that it is less likely to  
>> have a
>> problem with an expiration time being associated with it.  It's the  
>> job
>> list that is more likely.
> -- I got slightly confused here in discussing the consequences.  The  
> expiration time header is a side issue.  The question is whether the  
> server (or some proxy in the path) is allowed to cache.  The words  
> below suggest it is for a URL that does not include ? (e.g., a  
> request to list the jobs or get the current phase) when either there  
> is no expiration date or the expiration date is in the future.  For  
> CGI-like requests with parameters, caching is allowed only when  
> there is an expiration date in the future.
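
The rule as summarised in the paragraph above can be written down as a small predicate. This is only a sketch of that reading of RFC 2616 section 13.9, not a full freshness calculation; the function name and the example.org URLs are made up for illustration.

```python
from datetime import datetime, timedelta


def cache_may_reuse(url, expires, now):
    """Say whether a cache may serve a stored response under this reading.

    url     -- request URL as a string
    expires -- explicit expiration time (datetime), or None if absent
    now     -- current time (datetime)
    """
    if expires is not None:
        # Explicit expiration: reusable only while it is still in the future.
        return expires > now
    # No explicit expiration: query URLs (containing "?") must not be
    # treated as fresh; other URLs may be cached heuristically.
    return "?" not in url


now = datetime(2009, 11, 3, 12, 0, 0)
later = now + timedelta(minutes=5)

plain_no_expiry = cache_may_reuse("http://example.org/jobs/42/phase", None, now)
query_no_expiry = cache_may_reuse("http://example.org/jobs?REQUEST=doQuery", None, now)
query_future = cache_may_reuse("http://example.org/jobs?REQUEST=doQuery", later, now)
```

Under these assumptions the plain phase URL is the one a cache is free to reuse by default, which is the concern raised above.
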
>> Now you may be entirely correct that the servlet containers that  
>> you've
>> used take care of this very nicely.  But it doesn't sound like it's
>> required by the protocol.  So to ensure that users get an up to  
>> date job
>> list I think the protocol needs to require that any job list  
>> response specify an
>> immediate expiration to disable caching - the very details that you  
>> were
>> leery of below.
>>        Regards,
>>        Tom
>> Guy Rixon wrote:
>>> FWIW, this is the passage in RFC 2616 that restricts the use of  
>>> GET for
>>> state-changing requests:
>>>
>>>
>>>      13.9 Side Effects of GET and HEAD
>>>
>>> Unless the origin server explicitly prohibits the caching of their
>>> responses, the application of GET and HEAD methods to any resources
>>> SHOULD NOT have side effects that would lead to erroneous behavior  
>>> if
>>> these responses are taken from a cache. They MAY still have side
>>> effects, but a cache is not required to consider such side effects  
>>> in
>>> its caching decisions. Caches are always expected to observe an  
>>> origin
>>> server's explicit restrictions on caching.
>>>
>>> We note one exception to this rule: since some applications have
>>> traditionally used GETs and HEADs with query URLs (those  
>>> containing a
>>> "?" in the rel_path part) to perform operations with significant  
>>> side
>>> effects, caches MUST NOT treat responses to such URIs as fresh  
>>> unless
>>> the server provides an explicit expiration time. This specifically  
>>> means
>>> that responses from HTTP/1.0 servers for such URIs SHOULD NOT be  
>>> taken
>>> from a cache. See section 9.1.1
>>> <http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html#sec9.1.1> for
>>> related information.
>>>
>>>
>>> Reading the details of this, we /could/ have written UWS to start
>>> queries using GET, but then we'd have to qualify it with "and then  
>>> you
>>> must do x, y and z with the response headers to turn off caching".  
>>> This
>>> is extra work for the implementor and quite unnecessary because a  
>>> POSTed
>>> request is naturally uncached. Creating a sub-resource is exactly  
>>> what
>>> POST is for.
>>>
>>> The particular caching cock-up I had in mind works like this:
>>>
>>> 1. You GET the job list with query-starting parameters.
>>> 2. The service starts a query and returns 303 "See other"  
>>> redirecting to
>>> the new job
>>> 3. The web server tacks on some default cache-expiration because,  
>>> hey,
>>> it's a GET response and we're supposed to cache those.
>>> 4. You GET the job list again with parameters for another query.
>>> 5. The web-cache sends you back the 303 for the first query and  
>>> the TAP
>>> service never sees the request.
>>>
>>> Maybe this only catches you out if the queries in steps 1 and 4 are  
>>> the
>>> same - i.e. same URL including query string. Not sure about the  
>>> details
>>> there...meaning that it's subtle and scary and dangerous to depend  
>>> on
>>> this kind of usage. On the other hand, using POST makes it 100%  
>>> certain
>>> that the cache won't swallow your request.
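
The five steps above can be played through with a toy model. This is a sketch only: NaiveCache stands in for a cache that has been told (step 3) that the GET response is fresh, and Origin stands in for the TAP service. Both classes and the QUERY parameter are invented for illustration.

```python
class Origin:
    """Toy service that creates one job per request it actually receives."""

    def __init__(self):
        self.jobs = []

    def get(self, url):
        job_id = len(self.jobs) + 1
        self.jobs.append(url)
        # 303 "See Other" pointing at the newly created job (step 2).
        return (303, "/jobs/%d" % job_id)


class NaiveCache:
    """Toy cache that stores every GET response, keyed by the full URL."""

    def __init__(self, origin):
        self.origin = origin
        self.store = {}

    def get(self, url):
        if url not in self.store:
            # Cache miss: forward to the origin and remember the answer.
            self.store[url] = self.origin.get(url)
        return self.store[url]


origin = Origin()
cache = NaiveCache(origin)

first = cache.get("/jobs?QUERY=a")
second = cache.get("/jobs?QUERY=a")  # same URL: replayed, origin not called
third = cache.get("/jobs?QUERY=b")   # different URL: reaches the origin
```

In this model the second request gets the redirect to the first job and no new job is created, while the request with a different query string goes through. That matches the guess that the problem bites when the URLs in steps 1 and 4 are identical.
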
>>>
>>> In respect of job lists getting cached, what seems to happen with  
>>> Java is
>>> this: when you GET a response from a servlet or JSP, everything  
>>> works as
>>> expected with dynamic content; you never seem to get a stale page.
>>> However, when the same servlet engine delivers a static HTML page,  
>>> or a
>>> CSS stylesheet or (particularly) an XSLT for in-browser  
>>> transformation
>>> then it caches like hell and you can easily miss updates. (This  
>>> bit me
>>> repeatedly when I was setting up the browser-side styling of the job
>>> resources for my TAP implementation.) Something in the servlet  
>>> engine
>>> seems to know when the content should be dynamic and adjusts the  
>>> HTTP
>>> headers appropriately.
>>>
>>> Cheers,
>>> Guy
>>>
>>>
>>> On 2 Nov 2009, at 16:56, Tom McGlynn wrote:
>>>
>>>> Guy Rixon wrote:
>>>>> Tom,
>>>>> if you want to get the job list then go ahead and do HTTP-GET.  
>>>>> That's
>>>>> part of UWS (although implementations may restrict the set of jobs
>>>>> reported to those owned by the caller). What you can't do is  
>>>>> use
>>>>> HTTP-GET to submit a query via UWS. If you want to use GET to do a
>>>>> query then you're doing a synchronous query by definition.
>>>>> Cheers,
>>>>> Guy
>>>> But I recall from earlier in this discussion some clever fellow  
>>>> said...
>>>>
>>>>> GET responses can be cached, and the caching is out of your  
>>>>> control
>>>>> as a service provider - it may be on the user's LAN (HTTP proxy)  
>>>>> or
>>>>> in their client (browser cache). If you send the same query twice
>>>>> via GET, then for the second request you could get the response
>>>>> for the first, pulled from the cache, and no new job.
>>>> If what you said earlier is correct, then I can't rely on what I  
>>>> get
>>>> back from a GET call.  I might be getting a cached response and not
>>>> the current state of the system.  If caching is truly an issue,  
>>>> then
>>>> there doesn't seem to be any reliable way to get the job list.
>>>>
>>>> Tom
>>>>
>>>>> On 2 Nov 2009, at 14:13, Tom McGlynn wrote:
>>>>>> Paul Harrison wrote:
>>>>>>> Guy has already done a good job of answering most of these  
>>>>>>> points - I
>>>>>>> The UWS design of the two stage process is for two principal  
>>>>>>> reasons
>>>>>>> a) to be able to manipulate job metadata parameters before the  
>>>>>>> job is
>>>>>>> run - e.g. the DestructionTime - and receive the feedback from  
>>>>>>> the
>>>>>>> service whether it is prepared to honour such requests before   
>>>>>>> actually
>>>>>>> committing the job.
>>>>>>> b) to allow complete parameter namespace freedom on the job  
>>>>>>> creation
>>>>>>> step - i.e. if PHASE is used by UWS then it could not be a  
>>>>>>> parameter
>>>>>>> for the implementation protocol.
>>>>>>> So if for a particular implementation using UWS there is no  
>>>>>>> problem
>>>>>>> with meeting that second condition, then there is no  
>>>>>>> particular  reason
>>>>>>> why job metadata parameters could not be included with the  
>>>>>>> initial  UWS
>>>>>>> job creation step if desired - this would require revision of  
>>>>>>> the UWS
>>>>>>> specification to include this possibility - I think that this  
>>>>>>> is a
>>>>>>> small enough change to be added into the document as part of  
>>>>>>> the  RFC -
>>>>>>> it does have a larger impact on possible service implementers  
>>>>>>> however
>>>>>>> - their code might not be structured to allow this change  
>>>>>>> easily. For
>>>>>>> a generalized UWS client the change would not be so great, all  
>>>>>>> that
>>>>>>> would happen is that after the initial submission a job object  
>>>>>>> would
>>>>>>> be returned with the PHASE=EXECUTING, and general clients  
>>>>>>> should not
>>>>>>> make any assumptions about state in their coding, so should  
>>>>>>> probably
>>>>>>> still be able to react appropriately.
>>>>>>> Just as a side note to show that UWS is not so strange in this
>>>>>>> multiple interaction between client and server - consider  
>>>>>>> what  happens
>>>>>>> when a web browser loads a web page - it does the initial get  
>>>>>>> of the
>>>>>>> html, then parses this html and then gets images, javascript  
>>>>>>> etc.
>>>>>>> before the page is shown to the user.
>>>>>> I trust the goal is not to require that UWS services need to have
>>>>>> the complexity of an interactive visual Web browser.  The  
>>>>>> protocol
>>>>>> should cater to simple applications as well.
>>>>>>
>>>>>> I'd be perfectly happy with a change that made it clear that the
>>>>>> request that created the job could return it in any state.  In
>>>>>> fact,  even without the desire to be able to start jobs at  
>>>>>> creation,
>>>>>> that  is probably needed to accommodate the situation where  
>>>>>> there is
>>>>>> a  problem detected in creating the job but we want the user to  
>>>>>> be
>>>>>> able  to parse the error with IVOA protocol level error handling.
>>>>>>
>>>>>> The specific text that I find problematic is in 2.1.3:
>>>>>>
>>>>>> PENDING: the job is accepted by the service but not yet committed
>>>>>> for execution by the client.  In this state the job quote can be
>>>>>> read and evaluated.  This is the state into which a job enters  
>>>>>> when
>>>>>> it is first created.
>>>>>>
>>>>>> in conjunction with
>>>>>>
>>>>>> 2.2.3.1 Creating a job.
>>>>>>
>>>>>> Posting a request to the job list creates a new job (unless the
>>>>>> service rejects the request).
>>>>>>
>>>>>>
>>>>>>
>>>>>> I would suggest something like the following changes
>>>>>> -
>>>>>> In 2.1.3 add somewhere
>>>>>>
>>>>>> Phases are ordered with PENDING before QUEUED, QUEUED before
>>>>>> EXECUTING and EXECUTING before the trio of COMPLETED, ABORTED and
>>>>>> ERROR.  The state of a job may change only to a later state but
>>>>>> need  not pass through any intermediate state.  A job may be  
>>>>>> created
>>>>>> in  any state.
>>>>>> -
>>>>>> Delete the last sentence in the definition of PENDING.
>>>>>> -
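
The ordering proposed for 2.1.3 above can be checked mechanically. A minimal sketch, assuming the three terminal phases share one rank so that no move between them is allowed; the names RANK and may_change are illustrative only.

```python
# Rank of each phase in the proposed ordering; COMPLETED, ABORTED and
# ERROR share the final rank, so a job cannot move between them.
RANK = {
    "PENDING": 0,
    "QUEUED": 1,
    "EXECUTING": 2,
    "COMPLETED": 3,
    "ABORTED": 3,
    "ERROR": 3,
}


def may_change(current, new):
    """Allow a change only to a strictly later phase.

    Intermediate phases may be skipped, as the proposed text permits.
    """
    return RANK[new] > RANK[current]


skip_ahead = may_change("PENDING", "EXECUTING")
go_back = may_change("EXECUTING", "QUEUED")
```

Creation in any state needs no check at all under this proposal; only later changes are constrained.
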
>>>>>>
>>>>>>
>>>>>> I think it would be desirable to suggest how a job could be  
>>>>>> created
>>>>>> in the run state even if this is not required by the standard.  It
>>>>>> would be possible to do this without polluting the parameter name
>>>>>> space by specifying a new URI for that.  E.g., ${jobs} creates a
>>>>>> new  request but does not start it.  ${jobs}/run could create and
>>>>>> start  the job if that is permitted in the given service.   
>>>>>> However I
>>>>>> find  the worry about pollution of the parameter name space less
>>>>>> than  compelling, since we require certain parameters to be  
>>>>>> used in
>>>>>> calls  to start the job running or to alter other aspects of the
>>>>>> job.  It  would be poor practice to have phase= mean one thing  
>>>>>> in a
>>>>>> job  creation request and mean something else in a job update
>>>>>> request.   In effect, these parameters are reserved already.
>>>>>>
>>>>>>
>>>>>> This discussion brought up another thing that's not really clear.
>>>>>> 2.2.3.1 has the little parenthetical phrase "(unless the service
>>>>>> rejects the request)" which is neither explained, nor is the  
>>>>>> action
>>>>>> implied in rejection specified.  I think something like:
>>>>>>
>>>>>> Errors may occur in the creation of the job.  Where possible a  
>>>>>> job
>>>>>> should be created in the ERROR phase with an error message that
>>>>>> describes the problems.  If this is not possible, an HTTP 500  
>>>>>> error
>>>>>> must be returned.
>>>>>>
>>>>>> would be clearer for implementation and should replace that  
>>>>>> phrase.
>>>>>>
>>>>>>
>>>>>> Finally,  one new issue/concern.  [This probably reflects a  
>>>>>> lack of
>>>>>> understanding on my part but if so then perhaps it could be
>>>>>> clarified in the standard.]:  It doesn't seem like there is any
>>>>>> valid way to get the current job list.
>>>>>>
>>>>>> I can't do a GET request for  /{$jobs} because that's cacheable  
>>>>>> and
>>>>>> the list is dynamic.  And I can't do a POST request for it  
>>>>>> since a
>>>>>> post to /{$jobs} means that I'm creating a new job and I'm  
>>>>>> supposed
>>>>>> to be redirected to the job information for the newly created  
>>>>>> job.
>>>>>>
>>>>>> So how do I get to it?
>>>>>>
>>>>>> In my TAP implementation I assume that any request to create a  
>>>>>> new
>>>>>> job needs to have a REQUEST= parameter and if I don't see this I
>>>>>> return the job list.  If I do, I create the new job.  However  
>>>>>> this
>>>>>> doesn't seem to be literally correct.  I suppose you could say  
>>>>>> that
>>>>>> I'm 'rejecting' the request and since that behavior is  
>>>>>> undefined I
>>>>>> can do anything I want. Relying on undefined behavior doesn't  
>>>>>> seem
>>>>>> satisfactory.
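
The dispatch described above, keyed on the presence of REQUEST=, could look roughly like this. It is a sketch of the behaviour as described, not the actual TAP implementation; the function name, the in-memory jobs dict and the /jobs/ path are assumptions.

```python
def handle_job_list_request(params, jobs):
    """Dispatch a request to the job list URL.

    params -- dict of request parameters
    jobs   -- dict mapping job id to stored parameters (toy job store)
    Returns (status, payload): a 303 with the new job's location when a
    job is created, or a 200 with the list of job ids otherwise.
    """
    if "REQUEST" in params:
        # REQUEST= present: treat this as a job creation.
        job_id = len(jobs) + 1
        jobs[job_id] = dict(params)
        return (303, "/jobs/%d" % job_id)
    # No REQUEST=: fall back to returning the current job list.
    return (200, sorted(jobs))


jobs = {}
created = handle_job_list_request({"REQUEST": "doQuery"}, jobs)
listing = handle_job_list_request({}, jobs)
```

The fallback branch is exactly the part that leans on the undefined "rejects the request" wording.
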
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>>
