Friday, April 25, 2008

IvanAnywhere gets a puppy!

Check it out!

With This Hammer

This article should be required reading for every single developer, analyst, architect and manager involved in business software projects: The Mythical Business Layer...

"Programmers... often take refuge in an understandable, but disastrous, inclination towards complexity and ingenuity in their work. Forbidden to design anything larger than a program, they respond by making that program intricate enough to challenge their professional skill."



"It was as if its architects were given a perfectly good hammer and gleefully replied, neat! With this hammer, we can build a tool that can pound in nails."



"With simple, consistent code, change analysis is a breeze. A simple trace through the layers of code with a text-search tool will generally reveal exactly what needs to be changed, and where. A solid test plan will identify what changes broke the system and how, and allow developers to respond appropriately.

No, it’s not fun and not challenging. It’s just good software."

Drinks are on me!

At a dinner just outside San Francisco last weekend, the two annual "TeamSybase MVP" awards were handed out:

The "Iron Man Award" is given to the TeamSybase member who answered the most questions in the Sybase newsgroups, and this year it went to Bruce Armstrong with over 1,000 posts.

Awesome! Bruce, you rock!
The "People's Choice" award is based on email votes cast by folks who use the newsgroups:
"...cast your vote for the member of TeamSybase who has provided the most assistance to you in the Sybase Newsgroups over the past year (2007)."
The runners-up for "People's Choice" were...
  • Bruce Armstrong

  • Terry Dykstra

  • Dave Fish

  • Dean Jones

  • Mark Parsons

  • Michael Peppler

  • Kevin Sherlock

  • Jerry Siegel

  • Terry Voth
...and the winner is... me! Woohoo!
Come to Techwave, drinks are on me!

Wednesday, April 23, 2008

OpenXML() Rocks!

But XML, maybe not so much...

XML files are sometimes used to transmit data between relational databases using a hierarchical structure that is non-relational in nature. SQL Anywhere provides very fast and fairly-easy-to-use facilities to parse and load this XML data into relational tables: xp_read_file() and INSERT t SELECT ... FROM OPENXML().

Here's a story some of you will be painfully familiar with: Company A is sending data to Company B, from one computer system to another. Company A has the power to define what the data transmission looks like (no, it's not Walmart, but it could be), and it chooses to switch to XML.

Company A used to let Company B get the data in a synchronous fashion by running SQL queries against Company A's database. There's no real reason to stop doing that, especially since Company B is a wholly-owned subsidiary of Company A. Nevertheless, it's a new world, everyone's learning about XML in school, and everyone wants to use it as soon as they graduate.

So now, instead of nine queries to copy about 31M of database data from nine SQL Server tables, a single 84M XML file is created by Company A, transmitted across the internet and then processed by Company B. This file contains all the data in a deeply nested hierarchical structure under one single top-level XML tag.

... and that's reminiscent of IBM's mainframe "Information Management System" hierarchical DBMS from the 1970s.
Once upon a time, about 30 or 40 years ago, there were three kinds of database management systems competing for popularity: Mind-numbingly complex "network" DBMS products, not-quite-so-bad "hierarchical" products, and the brand-new, wonderfully-simple (and eventually victorious) table or "relational" databases. With network and hierarchical databases, programmers wrote lots of code that simply navigated from node to node, with instructions like "Get Unique", "Get Next", "Get Next Within Parent" to traverse the tree, and "Get Hold Next Within Parent" to signal update intent. Similarly, with hierarchical representations of data in XML, the notion of navigation becomes important once again... except today it's given a patina of modernity with terms like "parsing" and "shredding".
But enough ranting... The engineers behind SQL Anywhere have given us very fast and fairly-easy-to-use facilities to parse and load XML data into relational tables, which you can code inside a stored procedure:
  • Step 1: Call xp_read_file() to load the XML data into a local LONG VARCHAR string variable. That's fast and easy.

  • Step 2: Execute INSERT t SELECT ... FROM OPENXML() to parse the XML and load the table. Also fast, but easy? We'll see...
Step 1 is currently limited to 2G for the string value. For larger files, say, the entire contents of Wikipedia, you're going to have to do something different (don't laugh, you *can* download all of Wikipedia in a single XML file, Jason Hinsperger's done it, and loaded it into a SQL Anywhere database, and I hope he'll talk about it here.)

Step 2 is executed once for each table. For this particular story, that means parsing the XML string nine times, with nine different INSERT t SELECT ... FROM OPENXML() statements. Sounds wasteful, but it's fast, so it falls into the "who cares?" performance category.

The Input

Here's a snippet from the giant XML file:



For this discussion, we're looking for the XML <Line...> element at the bottom of the picture.

The Output

The data from each <Line...> element (and there are thousands of them in the full file) is going to be inserted into a row in the RDN_Line table:
CREATE TABLE RDN_Line (
   artificial_pkey BIGINT NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
   InternalId      VARCHAR ( 15 )  NOT NULL,
   Line            VARCHAR ( 50 )  NOT NULL,
   RDNFlag         VARCHAR ( 5 )   NOT NULL,
   RDNUpdated      VARCHAR ( 100 ) NOT NULL );
  • RDN_Line.Line = the data value between the <Line...> and </Line> tags; e.g., "Linguistics".

  • RDN_Line.RDNUpdated = the RDNUpdated attribute from the <Line...> tag; e.g., "2008-03-18T22:29:07.0000000-05:00".

  • RDN_Line.RDNFlag = the RDNFlag attribute from the <Line...> tag; e.g., "true".

  • RDN_Line.InternalId = the data value between the <InternalId...> and </InternalId> tags; e.g., "2195445AG".
That last one is really tricky: just where is the <InternalId...> tag? This is what makes the XML hierarchical, like good old IMS, and not relational, like databases in modern times... you have to navigate from the <Line...> tag, up past <Lines> to the parent <Agency...>, and then back down past <InternalIds> to the <InternalId...> tag. That's where you'll find the value to put in RDN_Line.InternalId.
And that's what makes this stuff complex, not the SQL syntax, but the way people create XML. It's a license to obfuscate, and the best (worst) part is this: those same people claim XML is simple, straightforward and self-documenting... sure it is, and the emperor is fully clothed.

Oh, and what about the artificial_pkey column? Logically the primary key to RDN_Line is ( InternalId, Line ) but because the XML is stored in an ordinary text file there's no guarantee it's valid. The code discussed here simply loads the XML data into a table for further processing which includes edits for validity that aren't shown here. The key word is "robust": no runtime exceptions, no matter what kind of garbage is received.

The Processing

Here's the INSERT statement that parses each <Line...> element and puts the data into RDN_Line:
INSERT RDN_Line (
   InternalId,
   Line,
   RDNFlag,
   RDNUpdated )
SELECT TRIM ( COALESCE ( InternalId, '???' ) ),
       TRIM ( COALESCE ( Line,       '???' ) ),
       TRIM ( COALESCE ( RDNFlag,    '???' ) ),
       TRIM ( COALESCE ( RDNUpdated, '???' ) )
  FROM OPENXML (
          @xml,
          '/Branches/Agencies/Agency/Lines' )
       WITH ( InternalId VARCHAR ( 15 )  '../InternalIds/InternalId',
              Line       VARCHAR ( 50 )  'Line',
              RDNFlag    VARCHAR ( 5 )   './Line/@RDNFlag',
              RDNUpdated VARCHAR ( 100 ) './Line/@RDNUpdated' );

Yes, it looks ugly, and yes, it *is* ugly... so why did I call it "fairly-easy-to-use"?

Because it's short, for one thing. Ten or twenty lines of complex code are usually easier to deal with than a hundred or a thousand lines of complex code.

And the processing logic is divided into separate components, each of which can be understood once someone explains it to you. Which is what I'm going to try to do: divide and conquer.

First, a high-level view, reading from the bottom up:
  • The WITH clause at the bottom tells the builtin OPENXML() procedure how to parse the XML data to find the column values you're looking for.

  • The two arguments passed to OPENXML() do more of the same thing: the first supplies the XML string to parse, and the second tells OPENXML() where in that string to start.

  • The FROM clause makes use of SQL Anywhere's ability to treat a procedure call just like a table. If a procedure returns a result set (and OPENXML does that), you can call it in a FROM clause.

  • The SELECT list tells SQL Anywhere what to do with each column in the result set returned by OPENXML().

  • The INSERT list tells SQL Anywhere where to put columns from the SELECT list. And (finally :) it's done!
No, it's still not easy, not even "fairly easy".
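The "procedure call in a FROM clause" idea has a rough analogue in most languages: a function that returns rows, consumed directly by the insert. Here's a sketch using Python's stdlib sqlite3 module standing in for SQL Anywhere; the row source and its hardcoded values are made up for illustration, not taken from the real feed:

```python
import sqlite3

def openxml_like():
    """A row source consumed like a table -- roughly the role
    FROM OPENXML(...) plays for the INSERT ... SELECT."""
    yield ('2195445AG', 'Linguistics', 'true',
           '2008-03-18T22:29:07.0000000-05:00')
    yield ('2195445AG', 'Economics', 'false',
           '2008-03-19T10:00:00.0000000-05:00')

conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE RDN_Line (
    artificial_pkey INTEGER PRIMARY KEY AUTOINCREMENT,
    InternalId TEXT NOT NULL,
    Line       TEXT NOT NULL,
    RDNFlag    TEXT NOT NULL,
    RDNUpdated TEXT NOT NULL )""")

# The INSERT list names the target columns; the row source supplies
# values positionally, just as the SELECT list does in the SQL above.
conn.executemany(
    "INSERT INTO RDN_Line ( InternalId, Line, RDNFlag, RDNUpdated ) "
    "VALUES ( ?, ?, ?, ? )",
    openxml_like())
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM RDN_Line").fetchone()[0])
```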

The Syntax

Here is the pseudo-SQL for Steps 1 and 2 described earlier, the call to xp_read_file() and the INSERT t SELECT ... FROM OPENXML():
BEGIN
DECLARE @xml LONG VARCHAR;

SET @xml = xp_read_file ( 'filespec-for-XML-data' );

SET TEMPORARY OPTION STRING_RTRUNCATION = 'OFF';

INSERT target-table (
   target-column-name-1,
   target-column-name-2,
   ... )
SELECT column-name-1,
       column-name-2,
       ...
  FROM OPENXML ( @xml,
          'Xpath-query-to-find-starting-point' )
       WITH ( column-name-1 data-type-1 'Xpath-subquery-to-get-value-1',
              column-name-2 data-type-2 'Xpath-subquery-to-get-value-2',
              ... );
COMMIT;
END;
  • The SET @xml = xp_read_file loads the entire contents of the XML file into a local LONG VARCHAR variable.

  • Setting STRING_RTRUNCATION = 'OFF' is done so the OPENXML call will truncate XML data values that are wayyyyy too long, rather than raising an exception; more on this later.

  • The target-column-names in the INSERT list point to the target-table. They don't have to be the same as the column-names used in the SELECT; the correspondence is positional.

  • The column-names in the SELECT correspond by name to the column-names in the WITH clause.

  • The first argument passed to OPENXML is the @xml string filled by the earlier call to xp_read_file().

  • The column-names in the WITH clause are made up by you, and only have local scope (this one SQL statement). Personally, I like to use exactly the same names here as the names that appear in the corresponding XML tags. For example, if <node>value</node> appears in the XML, "node" is what I use as the corresponding column name in the WITH clause.
    The WITH keyword may be the most popular keyword with the designers of SQL Anywhere... try looking it up in the Help and you'll be hard pressed to find what you're looking for, there are so many. In fact, you can have several different styles of WITH clauses in one single SELECT: WITH index hints, OPENSTRING WITH, WITH RECURSIVE, WITH local view. The syntax and semantics of any particular WITH clause depends on exactly where it's used, and *this* particular WITH clause applies only to the OPENXML call.

  • The data-types in the WITH clause are chosen by you, to tell the OPENXML call how to convert the XML string data values when it finds them.

    If you are expecting garbage values in the XML the only truly safe data type is LONG VARCHAR which will accept pretty much anything. But that's extreme; personally, I prefer VARCHAR with some reasonable maximum length, and set STRING_RTRUNCATION = 'OFF' to chop off unreasonably long values. You can still perform edit checks for the "too long" condition by specifying a length that is longer than what you really want, but there is no point storing the entire contents of an infinitely long string (where "infinite" means "way too long").
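That "declare it a bit longer than you really want" trick is easy to sketch outside SQL. In this Python sketch the limits are made-up numbers, and the truncating function plays the role that STRING_RTRUNCATION = 'OFF' plays in the OPENXML call:

```python
REAL_MAX = 50            # the length you actually want to keep (assumed)
DECLARED = REAL_MAX + 1  # declare one longer so "too long" stays detectable

def shred(value):
    """Truncate like STRING_RTRUNCATION = 'OFF' does: keep the first
    DECLARED characters, never raise an exception."""
    return value[:DECLARED]

def too_long(stored):
    """Edit check: anything that filled all DECLARED characters must
    have been longer than REAL_MAX in the raw XML."""
    return len(stored) > REAL_MAX

ok = shred('Linguistics')
bad = shred('x' * 10000)   # an "infinitely" long garbage value

print(too_long(ok), too_long(bad), len(bad))
```

The garbage value is chopped to 51 characters instead of being stored whole, yet the edit check can still flag it.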

The Xpath Zone

So what's left? The two kinds of 'Xpath-...' strings, that's what... by far the weirdest parts of this whole INSERT t SELECT ... FROM OPENXML() statement.

The 'Xpath-...' strings take us out of the SQL language and into something called the "Xpath Query" language... you can read about that language here, but file a flight plan first; i.e., let your next-of-kin know where you're going in case you need rescuing later.

Once again, here's the actual code for the INSERT t SELECT ... FROM OPENXML():
INSERT RDN_Line (
   InternalId,
   Line,
   RDNFlag,
   RDNUpdated )
SELECT TRIM ( COALESCE ( InternalId, '???' ) ),
       TRIM ( COALESCE ( Line,       '???' ) ),
       TRIM ( COALESCE ( RDNFlag,    '???' ) ),
       TRIM ( COALESCE ( RDNUpdated, '???' ) )
  FROM OPENXML (
          @xml,
          '/Branches/Agencies/Agency/Lines' )
       WITH ( InternalId VARCHAR ( 15 )  '../InternalIds/InternalId',
              Line       VARCHAR ( 50 )  'Line',
              RDNFlag    VARCHAR ( 5 )   './Line/@RDNFlag',
              RDNUpdated VARCHAR ( 100 ) './Line/@RDNUpdated' );

The 'Xpath-query-to-find-starting-point' string is '/Branches/Agencies/Agency/Lines', and it means this: Start at the root "/" of the XML document, and navigate down the tree to the <Lines> tag (that's Lines plural, not Line singular), and start the parsing there. Not just the first <Lines> tag, but each and every one of them: find them all, parse them all.

The first 'Xpath-subquery-to-get-value-...' is '../InternalIds/InternalId' which means this: Go up one level "../" from the <Lines> tag to the <Agency...> tag, then down two levels to the <InternalId...> tag, and pick up the data value "2195445AG" from between the <InternalId...> and </InternalId> tags.
If you're wondering why we need this InternalId value, it's because it is the logical primary key for all the data under the <Agency...> tag. In the XML it only appears once, in the Agency element, but on the relational database side of things it must be repeated as a logical foreign key column in each of the child tables (RDN_Line.InternalId in this case). So that's why all the navigating, up-and-down... air-sickness bags optional but recommended.

The second Xpath subquery is simply 'Line' which means this: Pick up the data value "Linguistics" from between the <Line...> and </Line> tags.

The third Xpath subquery './Line/@RDNFlag' means this: Look inside the <Line...> tag itself, and pick up the value for the RDNFlag attribute, which is "true". That's what the "@" means: look for an attribute value, not an element data value.
As far as I know the "./" directive means "start here, where you are, and look down". That's rather redundant, and I'm not sure it's necessary... but it works, and I have less-than-zero interest in exploring alternative Xpath queries to do the same thing :)

The last Xpath subquery './Line/@RDNUpdated' picks up the value for the RDNUpdated attribute, which is "2008-03-18T22:29:07.0000000-05:00", a kind of funky committee-designed timestamp-with-timezone format.
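All four navigations can be checked outside SQL Anywhere. Here's a sketch in Python's stdlib xml.etree; the sample XML is a reconstruction assumed from the Xpath strings and data values quoted above (the real file's exact shape may differ), and since ElementTree has no "go up one level" axis, a hand-built parent map stands in for the "../" step:

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the XML snippet; element and attribute
# names come from the Xpath strings in the article, the shape is assumed.
xml = """<Branches>
  <Agencies>
    <Agency>
      <InternalIds>
        <InternalId>2195445AG</InternalId>
      </InternalIds>
      <Lines>
        <Line RDNFlag="true"
              RDNUpdated="2008-03-18T22:29:07.0000000-05:00">Linguistics</Line>
      </Lines>
    </Agency>
  </Agencies>
</Branches>"""

root = ET.fromstring(xml)

# Child-to-parent map to emulate the '../InternalIds/InternalId' climb.
parent = {child: p for p in root.iter() for child in p}

rows = []
for lines in root.findall('./Agencies/Agency/Lines'):  # the starting point
    agency = parent[lines]                             # '..' : up to <Agency>
    internal_id = agency.find('InternalIds/InternalId').text
    for line in lines.findall('Line'):
        rows.append((internal_id,
                     line.text,                 # element data value
                     line.get('RDNFlag'),       # '@' : attribute value
                     line.get('RDNUpdated')))

print(rows)
```

Running this yields one row holding '2195445AG', 'Linguistics', 'true' and the timestamp, the same four values the INSERT statement shreds out of the real file.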

The Results

Here's a look at that one row in RDN_Line...



Here are the numbers for a full test on a laptop...
  • 84M XML file containing data for 9 tables in a deeply nested tag hierarchy

  • Adaptive Server Anywhere Database Engine Version 9.0.2.3575

  • HP Pavilion zd8000 laptop, 3.4GHz Pentium 4, running XP SP2

  • Method: xp_read_file into a LONG VARCHAR variable then parse 9 times via INSERT t SELECT ... FROM OPENXML()

  • 9 tables, 71 columns, 338,917 rows

  • 36.5M total database space = 30.7M table + 5.8M index

  • 2.5 minutes on a cold cache

Sunday, April 20, 2008

Inside Techwave

PowerPoint junkies may not be fully satisfied by Techwave 2008: Only two keynotes, by Sybase CEO John Chen and chief marketing officer Raj Nathan.

For the rest of us, that's good news. There's nothing wrong with a good show, or two good shows, but that's enough.

Or do you disagree? Did you stay to the end of the third keynote last year? ...didn't think so. I stayed, I had to stay, Raj made me stay, and I was lonely... but you didn't know that, did you, you'd gone to lunch long before it was over :)

In other Techwave news, the agenda, or at least part of the agenda, will be published earlier this year so folks can meet deadlines for getting approval to attend.

Plus the SQL Anywhere folks have been given a free hand in picking presentations for their conference tracks.

All good news, doncha think?

Saturday, April 19, 2008

Panorama Goes Public

No, it's not Generally Available, it's not that kind of "Goes Public" yet. What just happened is that Sybase has kicked off the official promotion of Panorama, the beta for SQL Anywhere 11.

Here are the Top 4 Release Defining Features, according to the promotion:

  • Enhanced parallel query capabilities – includes parallel sort, parallel index scans and index-only retrieval to make the most of today’s laptop, desktop, and server machines

  • Embedded full-text search - provides a powerful yet compact facility to effectively search the ever increasing amounts of unstructured data in the front lines

  • New data synchronization options - builds on our rich, heterogeneous synchronization platform with the addition of new failover and load balancing options, background synchronization, and support for MySQL

  • .NET and Perl stored procedures – ensures developers can leverage their programming language experience to implement database business logic
Here's the official Sybase press release: Sybase iAnywhere Unveils SQL Anywhere 11 Beta

Here's an article with more detail: Sybase Unveils Feature-Packed Beta of Embedded Database

...and if you don't have it yet, here's how to get the Panorama beta for yourself.

Wednesday, April 16, 2008

SQL Anywhere Blog Center

The brand-new SQL Anywhere Blog Center is pretty slick, and not just because it links back here :)... everyone there actually has something to say!

Tuesday, April 15, 2008

PHP IDE Supports SQL Anywhere

Among many other new features, Codegear/Borland's Delphi for PHP version 2 supports SQL Anywhere databases.

Tuesday, April 8, 2008

Panorama Beta Refresh

Dullards might call it "Beta 2" but the cool doodz at iAnywhere Solutions prefer "Beta Refresh".

If you've already signed up for the SQL Anywhere Panorama Beta, you've probably also received the email that says to go here.

If you want to sign up, go here.

Monday, April 7, 2008

On Display in Japan

I knew the book was popular in Japan, I didn't know it was on display in the iAnywhere office...



Putting food on the cover is another Japanese trend the rest of the world should adopt; that ain't no McDonald's burger...



BTW, that's a bookend in front of the book, not a dragon Shisa on the cover protecting the burger.

Thursday, April 3, 2008

Attention Techwave Submitters!

The voting process begins next week. The good news is, here's how the voters were chosen:

1. Broad/specific knowledge of product areas.

2. Understanding of what customers are requesting & would like to be educated on.

3. Be willing and able to provide feedback on individual sessions where necessary.
Bad news, they've invited me to be a voter.

What was it Groucho Marx said?
"PLEASE ACCEPT MY RESIGNATION. I DON'T WANT TO BELONG TO ANY CLUB THAT WILL ACCEPT PEOPLE LIKE ME AS A MEMBER".
Worse news, I don't have the same high standards as Groucho Marx, so I will be voting on your submissions.

But at least if you're rejected you'll have someone to blame :)

Wednesday, April 2, 2008

How To Make SQL Anywhere Slow

Oh, boy, am I gonna get in trouble with this one... talking about Bad Stuff.

But, sometimes it's good to look at a problem (say, performance) from a different angle. This list actually started as a conversation in the bar, way back in 2001, and I just stumbled on a follow-up email today.

Most of the original points still applied, and with a little cleanup and a few points from the new Capacity Planning whitepaper, here it is... in no particular order...

How To Make SQL Anywhere Slow

  1. Turn AutoCommit on. Or forget to turn it off when using ODBC, ADO.NET, JDBC, PHP, etc.

  2. Use 1K or 16K or 32K pages.

  3. Set ROW_COUNTS to ON so every query gets executed twice.

  4. Set ISOLATION_LEVEL to 2 or 3.

  5. Set MAX_PLANS_CACHED to 0 so there's no access plan caching.

  6. Use SET OPTION instead of SET TEMPORARY OPTION to get all those SYSOPTION changes recorded in the log.

  7. Set OPTIMIZATION_GOAL to First-row for complex queries, especially when your application really wants all the rows (which it probably does, all of the time).

  8. Set PREFETCH to OFF when scrolling through a result set in sequence.

  9. Set PREFETCH to ON or Conditional when doing ABSOLUTE FETCHes randomly in the result set.

  10. Set QUERY_PLAN_ON_OPEN to ON.

  11. Set RECOVERY_TIME to 1 minute to force lots of unnecessary checkpoints.

  12. Set a small cache size.

  13. Put lots of logic in user-defined functions, like SQL statements using many joins and temporary tables.

  14. Call user-defined functions in the WHERE clause.

  15. DROP STATISTICS to make the optimizer stupid; it's DROP OPTIMIZER STATISTICS in Version 5 through 7.

  16. Code DDL in stored procedures to increase the chances of single-threading.

  17. Use base tables as temporary tables to get lots of unnecessary transaction log I/O.

  18. In Version 7 and earlier, create indexes where the first 9 bytes are almost universally the same value.

  19. Run without a transaction log, thus forcing a checkpoint on every single commit.

  20. Put the transaction log mirror on the same physical device as the transaction log, to get twice the I/O without any benefit.

  21. Do joins in the application.

  22. Move WHERE clause logic into the application.

  23. Choose inappropriate indexes.

  24. In Version 6 and earlier, don't put indexes on AUTOINCREMENT columns, thus forcing SELECT MAX() after startup.

  25. Leave off primary keys so recovery will be slow.

  26. Try to emulate the query engine by using intermediate tables step-by-step, in procedures and application code.

  27. INSERT rows with missing values, then UPDATE to fill them in, so you get lots and lots of row splits.

  28. Don't use wide fetches when fetching large result sets.

  29. Don't use wide inserts when inserting lots of data.

  30. Use INSERT or INPUT FROM instead of LOAD TABLE when initially populating a database.

  31. Use Java in the database.

  32. Specify redundant CHECK constraints.

  33. PREPARE the same statement over and over again, every single time it's used.

  34. Use jConnect instead of the iAnywhere JDBC driver.

  35. Use Open Client so you can get TDS.

  36. Create convoys on hot rows and pages, say by making thousands of connections use the same rows in the same tables.

  37. Send lots of small requests across a high latency network.

  38. Use a custom key table instead of DEFAULT AUTOINCREMENT so all your inserts are serialized.
Caveat: Almost none of these points describe default behavior. Generally, you have to work really hard to make SQL Anywhere slow... alas, people have been known to work hard.
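Points 28, 29 and 33 share a theme: do the expensive setup once and push rows in bulk. A sketch of the chatty-versus-batched contrast using Python's stdlib sqlite3 (not SQL Anywhere, and sqlite3 caches statements internally, so the point here is the per-row round trips the chatty pattern implies, not sqlite's own speed):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE t ( pkey INTEGER PRIMARY KEY, data TEXT )")
rows = [(i, 'row %d' % i) for i in range(1000)]

# The chatty way (point 33): one statement execution per row,
# i.e. one client/server round trip per row in a real network setup.
for pkey, data in rows[:500]:
    conn.execute("INSERT INTO t VALUES ( ?, ? )", (pkey, data))

# The batched way (point 29's "wide insert"): prepare once, send in bulk.
conn.executemany("INSERT INTO t VALUES ( ?, ? )", rows[500:])
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])
```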

Plus: Google says "No results found" for "How To Make Oracle Slow", possibly because Oracle runs slow out of the box (zing! :)

Tuesday, April 1, 2008

Are you a "Develooper"?



Personally, I've been known to code a loop or two, but I'm not entirely one-dimensional... my toolbox also includes branching logic and subroutine calls :)