Martin Probst's weblog

Reducing XPath

Tuesday, January 5, 2010, 14:43 — 1 comment

Michael Kay writes on his blog, “Could XPath have been better?”, suggesting that XPath would have been a nicer language without all the little inconsistencies. Instead, he’d rather map more or less everything to built-in functions and their application to sequences, including the axes, predicates, and so on.

This sounds very much like an implementor’s pipe dream: remove all the annoying inconsistencies and make it easier to create fast implementations.

Once you are done replacing all the implicit syntax with function calls, I think you might find that you have written a LISP interpreter with built-in functions (some with funny or punctuation-only names) for DOM navigation. Not that that would be a bad thing.

This makes one wonder, though, what the actual value of XPath is once you have reduced it to a LISP dialect. Probably the restricted expressiveness, and with it the ability to analyze the function applications and produce a clever execution strategy.

This always reminds me of Erik Meijer and his presentation on LINQ at VLDB 2005 (?), where he demonstrated how LINQ effectively maps certain function applications (selection, projection) to different repositories. I still like the approach: provide a somewhat unified syntax, hand over an Abstract Syntax Tree at run time to the data source/repository, and let that find a good way of executing the query. Integrating the query language into the programming language very much reduces the pain for users, and creates a uniform interface for many different data sources.

This is of course limited to the .NET platform and, AFAIK, effectively SQL only, and I have never actually used LINQ, so I have no idea how well it works out in practice. I imagine that tool support (profiling! indexes!) can be difficult.

Generating Eclipse build files with XQuery

Monday, November 16, 2009, 20:27 — 0 comments

A friend of mine had a problem today. He was trying to make a huge Ant-based project usable from within Eclipse. The build file would manage dependencies through XML property files within about 30 subdirectories, each declaring which sub-projects need to be compiled before it.

Writing all those .project and .classpath files is a very tedious task, and clicking them together in Eclipse is even more tedious. XQuery to the rescue!

I simply imported all the property files into xDB and ran this query:

import module namespace fw = '';
import module namespace xn = 'java:com.xhive.dom.interfaces.XhiveNodeIf';

for $project in /project
let $name := substring-before(substring-after(document-uri($project/root()), '/build/'), '/')
where $name != 'swami'
(:let $name := if ($name = 'daogen') then 'DaoGenerator' else $name:)
let $dep-internal := tokenize($project/property[@name='${}.depend.internal']/@value, ',')
let $dep-common := tokenize($project/property[@name='${}.depend.common']/@value, ',')[not(ends-with(., '.jar'))]
let $deps := ('libs', distinct-values(($dep-internal, $dep-common)))
let $fw := fw:new(concat('/Users/martin/tmp/build/', $name, '/.classpath'))
let $pw := fw:new(concat('/Users/martin/tmp/build/', $name, '/.project'))
let $classpath :=
  <classpath>
    { comment { $name } }
    <classpathentry kind="src" path="src"/>
    <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
    { for $dep in $deps return
      <classpathentry combineaccessrules="false" kind="src" path="/SW {$dep}"/> }
    <classpathentry kind="output" path="bin"/>
  </classpath>
let $project :=
  <projectDescription>
    <name>SW {$name}</name>
  </projectDescription>
return (
  fw:write($fw, xn:to-xml($classpath)),
  fw:write($pw, xn:to-xml($project)),
  fw:close($fw),
  fw:close($pw)
)

What does this do? It iterates over all project descriptions, takes the project name from the document URI, extracts the dependency information (those are the tokenize calls), and creates a .project and .classpath XML snippet for each sub-project.

xDB does not include functions to write to the file system out of the box. We could create our own custom extension functions in Java and put them on the classpath, but there is a much easier way through the Java Module Import functionality.

We simply import the Java classes as modules, create a FileWriter for the right file location through the fw:new(…) calls, serialize the XML snippets using xn:to-xml(…), and then make sure to close the file writers.

We still needed to do some fixups in Eclipse (mainly adding .jar files to the build path and fixing some wrong circular dependencies), but this certainly saved us hours. The world is a much nicer place when you have effective XML tools at hand :-)

Mac Mini Media Center (M³C)

Saturday, September 12, 2009, 19:23 — 0 comments

So, after much procrastination I bought myself a new Mac Mini and have set up my Mac Mini Media Center (M³C).

Some observations:

Server move

Sunday, September 6, 2009, 19:16 — 0 comments

As you might know, I used to run this weblog on a virtual server hosted by Hosteurope, the smallest possible configuration. However, a small virtual server doesn’t seem to be enough for even the smallest weblog possible (at least when it’s written in Rails…), so I moved this weblog to my own server today.

The trick is that I got a DynDNS domain and pointed my real domain to it through a CNAME record, and the media server / TV in my living room happily serves the files.

$ dig
[... snip ...]
;; ANSWER SECTION:
  81215   IN  CNAME
  60      IN  A

The server is a new Mac mini, so it will certainly not have the dreaded out-of-memory problems. I think I’m even saving money - Apple claims an idle energy consumption of about 14 watts, which should be slightly cheaper overall than my server hosting. Of course this calculation doesn’t include the hardware, but I wanted that media server anyway ;-)

In the process I also upgraded Rails to 2.3.4, which was a bit painful. But I came from 1.2.something, so some friction probably has to be expected.

Hardware on Ubuntu, once again

Wednesday, March 11, 2009, 17:54 — 0 comments

This is really getting ridiculous. Today I wanted to scan a document, and after some googling and searching I found out that the old & crappy USB scanner I have here (Mustek Bearpaw 1200 CU) doesn’t work on Mac OS X, theoretically works on Windows XP (but the driver is so bad it crashes the OS all the time), and is trivial to get working on Ubuntu.

I still remember the times when hardware support on Linux was really bad, and getting your Wifi to work was a matter of luck. For Wifi, I hear it’s still not totally easy, but my experience with Windows is that it’s no better on the Wifi front…

Mobile phone contracts

Tuesday, February 17, 2009, 20:55 — 1 comment

Recently, I changed my mobile phone provider from O2 to Simyo. It’s quite funny - the regular, contract-based mobile phone providers should be delivering a premium in exchange for the monthly fee and the typically two-year contract you bind yourself to. Yet it’s quite the opposite. With Simyo, I can now actually understand my bills, they have web tools that are actually useful, and I’m paying a lot less. O2 and the other providers appear to be investing the premium money mostly into commercials and sales (all those mobile phone shops in town must be really expensive…).

To me, usable web tools and understandable bills are a major feature in a provider of anything, even at a slight premium. The complete failure of most phone-related companies at this is really a shame. I would happily switch my fixed line provider (Alice) for another one, if I just knew a German telephone company that was actually any better.

Shell meta programming

Monday, December 8, 2008, 11:08 — 2 comments

I’m currently reworking X-Hive/DB’s command line startup scripts for various utilities, and I’m facing an interesting challenge with shell programming.

The issue is that I want to have a ".xhiverc" file that contains various settings in Java property file style. Normally, I would simply read those settings from within Java, and everything would be nice and fine. But this file is supposed to contain, among other things, the memory settings for the virtual machine - and once the JVM is running, it’s of course too late to read those.

So I need to somehow read the file from the shell. That should be easy, right? ". ~/.xhiverc" and everything is fine - or maybe not. What if the user wants to override those settings from the environment? E.g., we have XHIVE_MAX_MEMORY defined in the .xhiverc, but the user has exported XHIVE_MAX_MEMORY="2G". This is where the meta programming comes in: we have to operate on variables whose names we don’t know statically.

Current solution: iterate through all legal variable names, save their state in ${VARNAME}_BACKUP, source the .xhiverc, and then re-set them to the previous value if they were non-empty. As the scripts need to be POSIX compliant (i.e., no bashisms), we don’t have ${!VARNAME}, so this already involves some interesting eval constructs (eval "export ${var}_BACKUP=\"\$${var}\"" - the backslashes are not a Wordpressian/PHP escaping problem).

Now the next interesting thing: how do you test whether a variable is set at all? Testing whether it’s non-empty is [ -n "${VARNAME}" ], but what if someone wants to override a default setting to be undefined? If you know the name, it’s [ "${XHIVE_MAX_MEMORY+x}" = "x" ]. If you don’t, it’s again some horrible eval combination - maybe I’m missing something, but there doesn’t seem to be a standard "defined" command/test.

I have the feeling I’m doing something wrong - this should be easier™. Maybe I should just forget about the whole thing and have an XHIVE_DEFAULT_MAX_MEMORY next to a second XHIVE_MAX_MEMORY, and the same for the other variables…
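For illustration, here is a minimal, self-contained sketch of the backup/source/restore dance in POSIX sh. The rc file name and the variable list are made up for the example; the real scripts obviously use the actual .xhiverc and the full list of supported variables.

```shell
#!/bin/sh
# Sketch: environment variables take precedence over the rc file.
# File name and variable list are hypothetical, for demonstration only.
XHIVERC="./.xhiverc.demo"
VARS="XHIVE_MAX_MEMORY XHIVE_LOG_DIR"

# Simulate the situation: the rc file sets a default, the user exported an override.
printf 'XHIVE_MAX_MEMORY="1G"\n' > "$XHIVERC"
export XHIVE_MAX_MEMORY="2G"

# 1. Back up any values already exported in the environment.
#    POSIX sh has no ${!var} indirection, hence the eval.
for var in $VARS; do
  eval "export ${var}_BACKUP=\"\$${var}\""
done

# 2. Source the defaults file.
. "$XHIVERC"

# 3. Environment wins: restore every non-empty backup over the rc value.
for var in $VARS; do
  if eval "[ -n \"\$${var}_BACKUP\" ]"; then
    eval "export ${var}=\"\$${var}_BACKUP\""
  fi
done

# ${VAR+x} expands to x if (and only if) VAR is set, even to the empty string.
[ "${XHIVE_MAX_MEMORY+x}" = "x" ] && echo "XHIVE_MAX_MEMORY=$XHIVE_MAX_MEMORY"
rm -f "$XHIVERC"
```

Running this prints the user’s exported value, not the rc default, which is exactly the precedence we want.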

Making things more interesting, this of course also has to work in Windows batch. And everyone knows that Windows batch is probably one of the most horrible programming environments ever “invented”. But this particular problem is actually not too difficult. Once someone enlightened me about the byzantine details of the Windows batch FOR loop, it’s a relatively simple loop containing an IF DEFINED %%i:

  FOR /F "eol=# tokens=1,2* delims==" %%i in ('type "!XHIVERC!"') do (
    REM only set variables if not already defined as environment variables (they take precedence)
    IF NOT DEFINED %%i (
      SET %%i=%%~j
    )
  )

SSD is the new disk, disk is the new tape

Friday, November 21, 2008, 09:34 — 0 comments

Tim Bray has some very interesting performance numbers for storage systems.

There is this saying that memory is the new disk, disk is the new tape. I think we have to insert something there - SSD is the new disk, disk is the new tape, and memory is somewhere between the CPU cache and the SSD.

The problem, then, is how to benefit from these enhancements. If you have ye olde database system, you could simply put all of the data on SSD. This would be fast, but quite a waste. DBMSes currently manage the cache hierarchy on their own, with a memory cache for the really hot data, disk storage for the not-so-hot, and tapes for backups.

It would be really nice if the DBMS were aware of the wildly different seek times of SSDs and disks, and could thus manage this aspect of the storage hierarchy, too. Ideally, it would lazily track which data was accessed recently, and move the old stuff to disk. For example, in everyone’s favorite running performance example - called “Twitter” - presumably next to no one cares about tweets that are older than a month or so, so you could move them off to disk.

This is again a good example of a change in requirements for databases that, as things stand, requires developers to implement the smarts themselves. Let’s hope databases will learn this…

Java & Ruby complexity

Wednesday, November 19, 2008, 07:33 — 0 comments

Patrick Mueller writes:

Same sort of nutso thinking with Java. A potentially decent systems-level programming language, it could have been a successor to C and C++ had things worked out a bit differently. But as an application programming language? Sure, some people can do it. But there's a level of complexity there, over and above application programming languages we've used in the past - COBOL and BASIC, for instance - that really renders it unsuitable for a large potential segment of the programmer market. [...] We're seeing an upswing in alternative languages where Java used to be king: Ruby, Python, Groovy, etc

I really don’t agree with the notion of complexity in Java. Complexity as a term is IMHO highly imprecise, so maybe we’re just thinking differently about it here.

Much of the stuff people don’t like about Java is actually its verbosity (compared to, e.g., Ruby), but that’s nearly the opposite of complexity. The inventors of Java explicitly left a lot of features out - like closures - because they feared they would create too complex a programming language.

Ruby & Co have all these features, plus a lot of nice meta programming and a somewhat weird module/inclusion/inheritance system. I personally think that Ruby is much more complex than Java. The interesting question is whether people will be happy with the added complexity in the long term.

I see this as a trade-off in programming languages: language features like cool meta programming, closures, or a really worked-out type system (à la OCaml & Haskell) can remove a lot of accidental complexity: with them, you’re able to write programs much more succinctly, or have proofs of global properties of your program that weren’t possible before.

On the other hand, language features can create a lot of complexity if not done really well. I’m reading the Scala mailing list, and I remember discussions of the sort “is this code legal Scala? and if it is, what does it mean?” (usually from a type system point of view), and if I remember correctly the language designers weren’t quite sure about it either. This is exactly what you don’t want in a language: ambiguity of expressions and unexpected “side effects” of expressions.

Quite a lot of the Ruby/Rails code one happens to see is clever in very interesting ways. But I really see that cleverness as a problem: who will understand the tricks that made the code a bit shorter in five years? Probably someone, but it might take them a long time. Already it’s sometimes quite difficult to find documentation on a particular library method/class in Ruby, as the documentation system is apparently not up to handling the language’s module inclusion features.

At what point do all these clever tricks add up to something that is no longer understandable? Are we really sure that the modularization works out well enough that we don’t have to be afraid of it all ending up as one large meta-closure soup? ;-)

Don’t get me wrong: I like dynamic languages for a lot of their features. I’m just wary of some of the effects. Pushing accidental complexity out of the application and into the programming language (now as feature complexity) should normally be a good thing: it sounds reasonable that this should reduce overall complexity and give programmers a broader understanding of what’s happening. But we need a good modularity system and proper abstractions to get a real positive effect from this - and I’m not sure I see that in, e.g., Ruby.

Databases and Caching

Tuesday, November 11, 2008, 18:24 — 1 comment

Dare Obasanjo compares database caching with how compilers manage the various CPU caches (e.g., L1, L2). Surprisingly, he comes to the conclusion that you need to implement your own caching scheme through memcached and friends, because in the database situation the amount of data is so large:

The biggest problem is hardware limitations. A database server will typically have twenty to fifty times more hard drive storage capacity than it has memory. Optimistically this means a database server can cache about 5% to 10% of its entire data in memory before having to go to disk. [...] So the problem isn't a lack of transparent caching functionality in relational databases today. The problem is the significant differences in the storage and I/O capacity of memory versus disk in situations where a large percentage of the data set needs to be retrieved regularly.

I wonder how this is different from the size of the L2 cache compared to main memory?

It’s even worse: on a regular PC you might have 4 MB of L2 cache, but 4 GB of main memory. That’s about 0.1% - so, comparing relative data sizes, databases are actually in a relatively luxurious position.
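A quick back-of-the-envelope check of that 0.1% figure (plain awk, nothing database-specific):

```shell
# 4 MB of L2 cache vs. 4 GB (= 4096 MB) of RAM, as a percentage
awk 'BEGIN { printf "%.2f%%\n", 4 / 4096 * 100 }'   # prints 0.10%
```

Roughly 0.1%, versus the 5% to 10% Dare cites for database memory vs. disk - a difference of about two orders of magnitude in the CPU cache’s disfavor.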

Application knowledge

Quite the contrary, I believe the problem is not in the data sizes, but in the optimization hints available to databases (and potentially the smartness of the database caching methods). With a good compiler, in particular a JIT, it is easy to judge which data will be used in the near future during execution; through data flow analysis and fancy register allocation tricks, a compiler has pretty complete knowledge of what the code tries to do, so it can optimize cache usage (and a whole lot of other things) very efficiently.

Compared to this, databases have little knowledge of the data access patterns of the application. They can only reconstruct this knowledge from observations of the queries hitting them, and they don’t seem to be very successful there, judging by how often you hear people talking about memcached. I’m not sure why that is, exactly; maybe because it’s always more difficult to implement optimizations based on observations of dynamic data than on static knowledge?

One problem is probably that the database doesn’t necessarily know - and in many situations probably cannot even guess - which data structures are commonly displayed together on a certain webpage, so that the corresponding cache entries could be invalidated together, or even stored together.

It might be an interesting thought experiment to consider what would be possible if the database were integrated with the application logic in a way that made the application knowledge available to the database. This could lead to interesting changes regarding invalidation and cache organization. I know there were (and probably still are) some things going in this direction in the Smalltalk environment, but I have no idea whether they really take advantage of the application knowledge. Probably not, as most Smalltalk is highly dynamic, and I don’t think the emphasis on declarative programming that this would need is there.

Cache Granularity

Also interesting is the fact that the granularity of objects in a relational database is quite different from the application perspective. Relational databases store entities as rows in tables, but (web) applications have a hugely different data model. A single application entity, e.g., a user, will span several tables. But the pages the database uses as the unit of caching usually only contain data from one table. If you have 16k of cache, that might be enough memory for several hot data entities; but because the database caches tables, not application entities, much of those 16k will be filled with rarely used rows - e.g., 4k from the users table, 4k from the mood messages table, 4k from the friends table, and so on. Application developers fight this (and the processing cost of join operations) with denormalization, which is basically a hack to reduce the number of tables an entity spans.

This all boils down to the fact that relational databases were designed for mass-data processing, as in financial institutions, where large calculations over huge tables of uniform data with little nested structure are the common operation.

I think this is one of the areas where non-relational databases, like XML databases, have a bright future. The data model, and thus the unit of caching, is much closer to what today’s content-centric applications’ data actually looks like. It’s not only much easier to program without that impedance mismatch; it can also have significant performance advantages over RDBMSes.
