Martin Probst's weblog

Evolution & Spam filtering

Wednesday, June 29, 2005, 10:50 — 0 comments

After quite a long and annoying hunt I think I have found out why Evolution refuses to filter spam for me. Evolution uses SpamAssassin as its backend, and SpamAssassin has a feature called bayes_auto_learn.

It basically means that everything classified as definitely spam (score > 15) or definitely not spam (score <= 0.1) is also automatically fed to the Bayesian filter as training data.

I really wonder what use this is. If I'm not mistaken, the Bayesian filter will just learn the same rules that are already implemented in SpamAssassin.

Apart from that, for me this turned into a nice bug. When you mark a message as spam in Evolution, it is supposed to train the filter. But the spam I'm getting (advertisements for stock options and such) is always rated 0.1 by SpamAssassin and therefore automatically trained as not spam. Since SpamAssassin tries to avoid training the same message multiple times, Evolution would have to call sa-learn with the --forget option first to force training the message as spam.
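
For what it's worth, a manual workaround along these lines should do the trick: turn off auto-learning and re-train misclassified messages by hand. This is only a sketch; the per-user config location and the placeholder file some-spam-message (a file containing the raw mail) are assumptions on my part, while the sa-learn options themselves are standard.

# stop SpamAssassin from auto-training on its own verdicts
$ echo "bayes_auto_learn 0" >> ~/.spamassassin/user_prefs
# make the Bayes database forget the message first, then train it as spam
$ sa-learn --forget some-spam-message
$ sa-learn --spam some-spam-message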

So basically the spam filtering worked, but all the spam I got was automatically trained to be ham, no matter what I clicked. I wish spam filtering in Evolution was as easy and helpful as in Thunderbird…

Off to XIME-P

Wednesday, June 15, 2005, 04:23 — 0 comments

I’ll be off to XIME-P, the International Workshop on XQuery Implementation, Experience and Perspectives. There will be a number of talks about directions and future development of XQuery. I’m especially interested in the upcoming Update language.

Also, I'll be spending 4 days in Baltimore, so I have two free days. Everybody told me Baltimore is not that interesting, so I will try to get to Washington and do some sightseeing.

Beagle

Friday, June 10, 2005, 18:39 — 2 comments

I've got a new hobby: watching Beagle index all the data that has accumulated in my home directory.

The installation on Ubuntu is pretty straightforward, except that some libraries/symlinks don’t seem to be created correctly.

$ sudo apt-get install libgsf-cil libgmime-cil libebook1.2-3 
$ sudo ln -s /usr/lib/libebook1.2.so /usr/lib/libebook1.2.so.0

I think there were some more libraries missing, but running "beagled --fg --debug" will tell you about them. Every time it throws a DllNotFoundException for some .so or complains about a missing .dll, just install the corresponding package and everything works fine.
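
In other words, something like this, with the package names being whatever the error messages point at (libgmime-cil is just the example from above):

$ beagled --fg --debug
# if it stops with a DllNotFoundException or a missing .dll,
# search for the package that provides the library and install it, e.g.:
$ apt-cache search gmime
$ sudo apt-get install libgmime-cil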

Pretty amazing that it runs so smoothly, at least up until now. Kudos to the developers.

PS: Yes, this is about Beagle version 0.0.11.1 for Hoary. 0.0.12 has not been backported yet, so I won't install it, even though 0.0.11.1 has serious issues for me (memory consumption with blam! is insane).

External functions in XQuery

Saturday, June 4, 2005, 12:20 — 0 comments

I recently implemented a (IMHO) much handier way to provide external functions to XQueries in X-Hive/DB.

External functions can be declared in XQuery like this:

declare function myfunc($a, $b, $c) external;

In X-Hive, you can now create a statement on an arbitrary XML node, register functions, and execute the query (this is from memory and will probably not compile like that):

  XhiveNodeIf node = …;
  XhiveXQueryQuery statement = node.createXQuery(
    "declare function extract-post($author, $title, $content, $time) external; " +
    "declare namespace dc = 'http://purl.org/dc/elements/1.1/'; " +
    "declare namespace content = 'http://purl.org/rss/1.0/modules/content/'; " +
    "for $item in /rss/channel/item " +
    "return extract-post($item/dc:creator, $item/title, $item/content:encoded, $item/pubDate)");
  // the list must be final so the anonymous inner class can use it
  final ArrayList<RSSPost> posts = new ArrayList<RSSPost>();
  statement.setExternalFunction(null, "extract-post", new XhiveExtensionFunctionIf() {
    public Object[] call(Iterator<? extends XhiveXQueryValueIf>[] params) {
      // one iterator per argument of the declared external function
      String author = params[0].next().toString();
      …
      posts.add(new RSSPost(author, title, content, date));
      return null;
    }
  });
  statement.execute();

While in general you have to be very careful with functions that have side effects, this is a pretty handy way to extract Java objects from a given XML source. As long as you do not make any assumptions about the order in which the function calls happen, it should not break either.

There are quite a lot of other projects for converting XML into Java objects (e.g. Apache XMLBeans or DAX). Using XQuery has the advantage of giving you a real XML query language for value extraction, and in combination with an XML database you can also handle really large documents very efficiently.

Microsoft good at competing

Friday, May 20, 2005, 06:54 — 0 comments

Dare Obasanjo writes:

The main problem is that Microsoft is good at competing but not good at caring for customers. The focus of the developer division at Microsoft is the .NET Framework and related technologies which is primarily a competitor to Java/JVM and related technologies. However when it comes to areas where there isn’t a strong, single competitor that can be focused on (e.g. RAD development, scripting languages, web application development) we tend to flounder and stagnate. Eventually I’m sure customer pressure will get us of our butts, it’s just unfortunate that we have to be forced to do these things instead of doing them right the first time around.

That is probably a very insightful comment. Also, I can't remember Microsoft ever creating a whole new market sector to compete in. Microsoft seems to always enter markets very late, then take over the whole market after some time by producing arguably quite good products, and then not much happens anymore. The stagnation is probably due to the complete lack of any serious competition. Does anyone remember a really innovative feature in MS Office since it shook off its competition?

Visited Countries

Monday, May 16, 2005, 13:31 — 0 comments

This is cool:


create your own visited country map

[via Daniel Holbach].

Firefox Extensions

Monday, May 16, 2005, 13:02 — 0 comments

So it seems the Firefox extensions webpage is very smart and checks whether you are using the latest Firefox version. Great.

And if you or your Linux distribution somehow fail to install the latest Firefox update 10 seconds after it has been released, you suck and are thereby sentenced to the "no extensions" penalty.

Hello? It's nice to add something like that, but a "no, I don't want to upgrade, take me to the extensions" button would be even nicer. This somewhat reminds me of the old Windows installers that insisted you reboot your system after installation, no matter what. I can remember using my computer with finished but unclosed installers for long periods because I didn't want to reboot…

By the way, I only ran into this because some extensions must somehow have interfered with each other, and as a net result middle-click-open-tab-in-background stopped working, which is, at least for me, one of the most important features a tabbed browser has…

Update: removing all extensions and reinstalling them didn't help. Instead, if I uncheck the "Open middle-clicked links in background" option in tabbrowser-preferences, it works. A magic inverted checkbox. Gnarf.

Continuations explained

Wednesday, April 13, 2005, 22:01 — 0 comments

Sam Ruby writes about Continuations. I stumbled over the name once but never cared to read more as it seemed somewhat obscure to me. Sam starts his article with this:

This essay is for people who, in web years, are older than dirt. More specifically, if there was a period of time in which you programmed in a language which did not have garbage collection, then I mean you. For most people these days, that means that you had some experience with a language named C.

While this definitely means the article is not intended for me, it's actually a good explanation and easy to understand for people who have some basic understanding of how computers actually work, at least at the level of call stacks and the like.

So while I'm definitely not older than dirt, my mind seems to have aged a lot from studying and programming in C++ over the last two years. I knew it was unhealthy ;-)

Subversion and Eclipse again

Wednesday, April 6, 2005, 18:16 — 0 comments

Some time ago Lars Trieloff helped me out with my Eclipse/Subclipse/Subversion problems by pointing me to a pure Java implementation of an SVN client, which can be used to work around Subclipse's deficiencies.

This really made my day and solved everything, for about two days. Then I was back to normal: Subclipse didn't work and everything just annoyed me.

Today I took another shot at it. The problem with JavaSVN was that it couldn't find some jar (SVNClient.jar) in the (…).subclipse.core_0.9.28.1 plugin directory. It turns out there are two directories, one with version 0.9.28 and one with 0.9.28.1, but only the first one contains the jar and is actually used according to plugin.xml and feature.xml. After some fiddling around with text editors, the XML files and a few attempts at copying files, I gave up.

Subclipse itself is intended to work with the JNI javahl bindings, which it couldn't find on my system. The Subclipse page states that Linux distributions should actually provide these bindings; Subclipse only ships them for Windows systems. Following that hint I found out that Gentoo includes these bindings in the subversion ebuild (when it is compiled with the USE flags java and berkdb). The only problem is that by default $LD_LIBRARY_PATH is apparently not set on Gentoo Linux systems.

Long story, short fix:

martin@perseus ~ $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib
martin@perseus ~ $ eclipse-3.1

And everything (?) works again. On Gentoo you can set this environment variable permanently by editing e.g. /etc/env.d/00basic, adding the line "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib" and running /sbin/env-update && source /etc/profile (as root). On the next login you should have your environment set up.
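
The same steps as shell commands, for reference; this is only a sketch, and appending to a separate file such as /etc/env.d/99local (a name I made up) instead of editing 00basic should work just as well:

perseus ~ # echo 'LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib' >> /etc/env.d/00basic
perseus ~ # /sbin/env-update && source /etc/profile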

Applied XML in RSS, SOA(P) and REST

Tuesday, March 29, 2005, 15:46 — 3 comments

Lately there have been lots of debates about SOAP vs. REST. To quickly summarize: both SOAP and REST are about making resources accessible via the web using XML, and they take two different approaches. SOAP provides a toolkit that (in the best case) seamlessly integrates with your programming language of choice and gives access to resources via e.g. method calls on an object. This is done by providing a machine-readable definition of the interface (a WSDL file).

REST is not about providing a toolkit but rather about defining simple APIs that use HTTP GET, POST, PUT and DELETE as the commands and pass around chunks of XML. People argue that this is simpler than SOAP with all its toolkits (which are partially incompatible with each other, etc.) and much more similar to the current architecture of the web, which is a great success after all.
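
To make that concrete, a REST-style API boils down to plain HTTP requests; something along these lines, where the URLs and post.xml are made up purely for illustration:

# fetch a resource
$ curl http://example.com/blog/posts/42
# create or replace it by sending a chunk of XML
$ curl -X PUT -H "Content-Type: text/xml" --data-binary @post.xml http://example.com/blog/posts/42
# delete it again
$ curl -X DELETE http://example.com/blog/posts/42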

Now, all this stuff about Service Oriented Architectures seems to be marketing speak. But will REST really be better than SOAP? RSS is an XML application that is already deployed very widely. It is supported by many different tools in read-only mode, and by some even with a posting method.

The pro for REST is that RSS somehow works. People started providing content via HTTP GET in a rather loosely defined format, other people wrote aggregators for these resources, and today we can easily read news from many different sites on our desktops.

The con is the effort needed to provide RSS support. Some of this comes from the "loosely defined format", but the biggest part comes from malformed XML, at least as far as I can see (blog posts by prominent RSS tool authors seem to support this). There is hardly any feed out there that really adheres to the format. Additionally, there are lots of feeds out there which aren't even XML. There is a huge number of issues, from simple bad nesting to unescaped content, double-escaped content, bad namespaces, bad content encodings, and so on. And this is not even about adhering to a particular schema. XML looks as if it were very easy to write and read, but in reality it's a lot more complicated than you might think.
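
You can check this for yourself against any feed with xmllint from libxml2 (the feed URL here is just a placeholder):

$ curl -s http://example.com/index.rss | xmllint --noout -
# a well-formed feed prints nothing; a broken one produces
# parser errors pointing at the offending line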

Now what does this mean for REST? I think it is very likely that all custom XML applications where people don't use toolkits, but rather write the angle brackets themselves, will suffer from the same problems. You might argue that this is not such a big problem, since RSS reader authors seem to get along with it too. But then everyone who wanted to use a specific REST application would need to write some magic, ultra-liberal parser. She wouldn't be able to use XML technologies like XPath, DOM or XSLT out of the box; she wouldn't even be able to use SAX.

Apart from that, let's assume people were at least able to provide valid XML. What about the API? With RSS it's relatively easy: there is only one big GET. But what if you wanted to provide more, like getting the last 5 posts? You would start inventing an API, e.g. via query strings. Nothing bad about that, but how do you document that API? As far as I have seen, most documentation in companies is done in Word documents, and most of it is either too old, barely understandable, much too short, simply wrong, or a combination of these.

A decent SOAP toolkit would provide the XML serialization of objects. It would also provide a bare minimum of documentation of the API (people would at least know which functions exist). No one really has to care about the exotic WS-* stuff, but it's there if you need it.

REST might be good for really small, really simple applications. But if you want to start something bigger, something that might involve lots of developers, change over time, or have an API with more than two methods, you should really use a toolkit or it will become a mess.

