table_name syntax

Mark Taylor M.B.Taylor at bristol.ac.uk
Thu Apr 30 00:48:20 CEST 2015


Dave,

On Wed, 29 Apr 2015, Dave Morris wrote:

> Hi Mark, Laurent,
> 
> On 2015-04-29 10:18, Mark Taylor wrote:
> 
> > Most clients do not need to do that kind of parsing.  If all you want
> > to do is display table metadata or assemble ADQL text (these are the
> > things topcat needs to do, and I guess TapHandle too), or ensure
> > uniqueness of the table+column content as mentioned by Marco,
> > you can just use the table_name value.  This is a guaranteed unique
> > identifier for the table; it may or may not include a schema
> > (or catalog.schema) prefix according to whether it's required for
> > uniqueness, which is something the service knows and the client
> > doesn't need to worry about.
> > 
> 
> Yep, agreed.
> 
> This is established functionality, so I'm not suggesting we change this.

Good!
 
> However ..
> 
> > Admittedly, if you need to recover the unqualified and unquoted
> > table name for some reason (I get the impression this is one of
> > Dave's requirements, though I may be wrong) it's a bit more fiddly
> > under the current scheme, but it's not impossible, and I would
> > say that's a niche requirement.
> 
> Yes, for our use case we need to extract the original, unquoted, catalog,
> schema, table and columns names as separate items.
> 
> You are right, it is not impossible to parse the concatenated string.
> 
> It will however, become that bit more tricky once we start to have nested
> double quotes and full stops inside a variable number of double quote
> delimited identifiers which are themselves separated by full stops.

To be concrete about this, I attach a java class that I believe does
this parsing comprehensively for all legal table_name values,
including as many double quotes and full stops as you like.
The whole thing is a few tens of lines of code.

> If it is at all possible I'd prefer to avoid the process of encoding the
> separate names into a concatenated string on the server, only to have to
> decode and parse the string to produce the separate unquoted names again on
> the client.
> 
> The server must have the original unquoted names in order to be able to create
> the concatenated string, the request was to make these available to the client
> as two additional columns in the TAP_SCHEMA table.
> 
> The cost would be adding two rows to the table description in the
> specification, and two cells per row in TAP_SCHEMA.tables on each server,
> populated with data we know the server must already have.
> 
> No cost at all to existing clients that use the concatenated table_name, the
> existing functionality would stay the same and you wouldn't even notice the
> extra columns were there.
> 
> You may be right, this may be a niche requirement, but even so, given the cost
> if fairly low - it doesn't seem that big an ask.

I'm not worried about the cost to clients.
In my view the thing that's high cost is the standardisation work:
making modifications or adding content to standards.
The text has to get argued over and written correctly and reviewed
by all TCG members, validation suites have to be updated (e.g. in this
case to check that the raw_*_name columns are consistent with the
existing table_name column) and mistakes can be made, in the drafting
and implementation of additional features.
Client-side implementation against existing standard features
is cheap by comparison (Case in point: I've already spent more
time arguing about this possible new feature than I did to write
the code that proves you can do without it.  Making the relevant
changes to taplint would also take longer).

In cases where a standard makes it impossible or inefficient
to do something for which there is a pressing need,
it is clearly in order to consider making changes.  But in this
case the requirement affects only a few TAP_SCHEMA consumers
and the client-side effort required to do what they need to do
with the existing standards amounts to a few lines of regular
expressions.  So, I still don't support this suggestion.

> It also means that searching for "table name starts with" becomes much much
> easier.

I will grant you that; is there much call for that kind of query?

Mark

--
Mark Taylor   Astronomical Programmer   Physics, Bristol University, UK
m.b.taylor at bris.ac.uk +44-117-9288776  http://www.star.bris.ac.uk/~mbt/
-------------- next part --------------

import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Utility class to parse TAP_SCHEMA table_name columns.
 *
 * @author   Mark Taylor
 * @since    29 Apr 2015
 */
public class TName {

    private static final Pattern CST_REGEX = createCatalogSchemaTablePattern();

    /**
     * Returns a regex pattern with three capturing groups, for Catalog,
     * Schema and Table regular-or-delimited-identifiers.
     */
    private static Pattern createCatalogSchemaTablePattern() {
        String regularIdentifier = "[a-zA-Z][a-zA-Z0-9_]*";
        String delimitedIdentifier =  "\"(?:[^\"]|\"\")+\"";
        String identifierCapture =
            "(" + regularIdentifier + "|" + delimitedIdentifier + ")";
        String dot = "[.]";
        String catSchTab =
             "(?:" + identifierCapture + dot + ")??"
           + "(?:" + identifierCapture + dot + ")??"
           + identifierCapture;
        return Pattern.compile( catSchTab );
    }

    /**
     * Takes a regular-or-delimited-identifier and returns its raw form.
     * 
     * @param  identifier  regular or delimited identifier, or null
     * @return  identifier with no surrounding quotes or other escaping,
     *          or null for null input
     */
    private static String unescape( String identifier ) {
        if ( identifier == null ) {
            return null;
        }
        int leng = identifier.length();
        if ( identifier.charAt( 0 ) == '"' &&
             identifier.charAt( leng - 1 ) == '"' ) {
            return identifier.substring( 1, leng - 1 )
                             .replaceAll( "\"\"", "\"" );
        }
        else {
            assert identifier.matches( "[a-zA-Z][a-zA-Z0-9_]*" );
            return identifier;
        }
    }
   
    /**
     * Takes a string of the form that appears in TAP_SCHEMA table_name
     * columns ([catalog.[schema.[table]]]) and returns a 3-element array
     * giving the unescaped forms of the three elements.
     * Either catalog alone, or both catalog and schema, may be null.
     * The return value is null in the case that the input string is
     * not a valid table_name.
     *
     * @param   tname   table_name string
     * @return   array giving raw (catalog, schema, table) values,
     *           or null for parse failure
     */
    public static String[] getRawAtoms( String tname ) {
        Matcher matcher = CST_REGEX.matcher( tname );
        if ( matcher.matches() ) {
            return new String[] {
                unescape( matcher.group( 1 ) ),
                unescape( matcher.group( 2 ) ),
                unescape( matcher.group( 3 ) ),
            };
        }
        else {
            return null;
        }
    }

    /**
     * Supply the table_name value on the command line, and the 3-element
     * raw catalog, schema, table array will be printed out.
     */
    public static void main( String[] args ) {
        System.out.println( Arrays.toString( getRawAtoms( args[ 0 ] ) ) );
    }
}


More information about the dal mailing list