Martin Probst's weblog

InputManagers in Leopard

Tuesday, November 20, 2007, 21:12 - 1 comment

In Leopard, InputManagers need to be installed in /Library and owned by root, for security reasons. Tutorials on how to re-enable them can be found e.g. on Mac OS X hints or in the TextMate blog.

Something not mentioned in those tips is that not only the input managers themselves but also the InputManagers directory must be owned by root and writable only by the owner (go-w).

These two commands did the trick for me:

$ sudo chown -R root:wheel /Library/InputManagers
$ sudo chmod -R go-w /Library/InputManagers
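To double-check the result, a small Ruby sketch like the following can walk the directory and report anything that still violates the two conditions. The helper name `insecure_entries` is made up for this example, and the rule it checks (owner uid 0, no write bit for group or others) is my reading of the requirement above, not an official specification.

```ruby
require 'find'

# Returns every path below `root` that is not owned by `required_uid`
# or that is writable by group or others. Symlinks are skipped because
# their own mode bits are not meaningful.
def insecure_entries(root, required_uid = 0)
  bad = []
  Find.find(root) do |path|
    stat = File.lstat(path)
    next if stat.symlink?
    wrong_owner = stat.uid != required_uid
    group_or_other_writable = (stat.mode & 0o022) != 0
    bad << path if wrong_owner || group_or_other_writable
  end
  bad
end

dir = '/Library/InputManagers'
if File.directory?(dir)
  problems = insecure_entries(dir)
  if problems.empty?
    puts "#{dir} looks fine"
  else
    puts "Still needs fixing:"
    problems.each { |path| puts "  #{path}" }
  end
end
```

The check uses `lstat` so that it looks at the entries themselves rather than at whatever a symlink points to.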

By the way, does anyone know what the ‘@’-sign after the rights in a directory listing means? As in drwxr-xr-x@ 11 root wheel 374B 11 Sep 04:25 SafariBlock?

MySQL backup/restore task for Capistrano

Tuesday, November 20, 2007, 10:59 - 0 comments

This is a simple backup task for Capistrano (which could really use a lot more documentation…).

The tool reads database configuration from the local database.yml. This Works For Me ™ as I keep the local and remote database configuration identical - YMMV.

While the script doesn’t require you to type the database user’s password, it will echo it to the console for the restore task. Avoiding that seems to be quite tricky - I tried sending the backup directly over the stream and piping in the password before, but that gives an obscure error.

So the following will have to do for now, but I’m quite pleased with it. I should probably include a warning/confirmation before restoring, but hey, command lines are for experts ;-)

require 'yaml'

# Read the database settings from the local config/database.yml.
$config = YAML.load_file(File.join('config', 'database.yml'))

desc "Backup the database to db/" + Time.now.strftime("backup_#{$config['production']['database']}_%Y-%m-%d.sql")
task :backup, :roles => :db, :only => { :primary => true } do 
  # %d zero-pads the day (%e pads with a space), so the file name
  # contains no blanks and sorts chronologically.
  backup_path = File.join('db', Time.now.strftime("backup_#{$config['production']['database']}_%Y-%m-%d.sql"))
  on_rollback { delete backup_path, :recursive => false }
  backup_file = File.new(backup_path, 'w+')
  run "mysqldump --default-character-set=utf8 " +
    "--user=#{$config['production']['username']} " +
    "--password " +
    "-B #{$config['production']['database']}" do |channel,stream,data|
    if stream == :out
      # Everything on stdout is the dump itself.
      backup_file.write(data)
    else
      # The password prompt arrives on stderr; answer it over the channel.
      if data =~ /^Enter password:/
        channel.send_data($config['production']['password'])
        channel.send_data("\n")
      else
        raise Capistrano::Error, "unexpected output from mysqldump: " + data
      end
    end
  end
  backup_file.close
  logger.info "Database dumped to #{backup_path} successfully."
end

desc "Restore the database from backup"
task :restore, :roles => :db do
  backups = Dir[File.join('db', "backup_#{$config['production']['database']}_*.sql")]
  raise Capistrano::Error, "no backup found!" if backups.size == 0
  # Pick the backup whose file name sorts last.
  last_backup = backups.sort[-1]
  put(File.read(last_backup), "#{current_path}/db/restore.sql")
  logger.info "Restoring from #{last_backup}"
  # Feed the uploaded dump to mysql. The dump was made with -B, so it
  # contains the CREATE DATABASE and USE statements itself.
  run "mysql --default-character-set=utf8 " +
    "--user=#{$config['production']['username']} " +
    "--password=#{$config['production']['password']} " +
    "< #{current_path}/db/restore.sql" do |channel, stream, data|
    raise Capistrano::Error, "unexpected output from mysql: " + data
  end
  logger.info "Restored successfully."
end
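The warning/confirmation before restoring that I mentioned could look roughly like this. It is only a sketch: `confirm_restore?` is a hypothetical helper of my own, not something Capistrano provides, and the input and output streams are parameters purely so the prompt can be tried out without a terminal.

```ruby
# Asks for confirmation before a restore. Only an explicit "y" counts as
# consent; anything else, including end of input, means "no".
def confirm_restore?(backup_file, input = $stdin, output = $stdout)
  output.print "Really restore from #{backup_file}? " \
               "This overwrites the current database. [y/N] "
  answer = input.gets
  !answer.nil? && answer.strip.downcase == 'y'
end
```

Inside the restore task, one way to use it would be a line like `raise Capistrano::Error, "restore aborted" unless confirm_restore?(last_backup)` just before uploading the dump.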

New blog engine

Tuesday, November 20, 2007, 10:59 - 1 comment

I ported my old WordPress blog over to a hand-written Ruby solution. You probably already noticed that my permalinks were not that perma, so apologies for re-appearing entries in your feed readers.

I decided to move away from WordPress after taking a look in my archives. Through various import/export operations and the liberal re-formatting of entries - done by WordPress itself or various plugins - the data in the database was a complete mess. Corrupt UTF-8, double, triple and quad escaped anything, mixed encoded and non-encoded HTML… took me quite some time to clean it up (thank God for RegExps).
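To give an idea of the multiply-escaped entries: one part of such a cleanup can be sketched as unescaping repeatedly until the text stops changing. This is an illustration, not the code I actually ran; `fully_unescape` and its pass limit are made up for the example, and it would also unescape entities that were meant to stay escaped.

```ruby
require 'cgi'

# Undoes double, triple, ... HTML escaping by unescaping until the text
# no longer changes, with an upper bound on the number of passes.
def fully_unescape(text, max_passes = 5)
  max_passes.times do
    unescaped = CGI.unescapeHTML(text)
    return text if unescaped == text
    text = unescaped
  end
  text
end

puts fully_unescape('&amp;amp;lt;b&amp;amp;gt;bold&amp;amp;lt;/b&amp;amp;gt;')
```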

Writing a simple blog in Ruby on Rails is an easy exercise, at first. It gets a lot more complicated once you consider trackbacks/pingbacks, proper permalinks, comment spam, etc., but more on that in separate entries.

Migrating to Google Apps (copying IMAP mails)

Monday, October 29, 2007, 12:35 - 0 comments

Now that Google has announced IMAP support for Gmail I’m migrating my email to Google Apps.

I’ve always had a HostEurope WebPack that provides some webspace, PHP, MySQL and IMAP. Some time ago I also ordered a virtual root server, to have some fun with rails, and a general space for experimentation. Then I wanted to take the webpack down as I didn’t need it anymore. But to be honest, I soon figured out that configuring and properly maintaining a whole email setup (MTA, IMAP, various spam filters, …) is indeed a lot of work.

So I moved all my email related stuff to Google Apps. So far it looks quite nice. It’s a bit strange that my regular Google user account didn’t integrate with the new one, but I simply dropped the old account.

Now I’m copying all my IMAP emails over to Google Mail. Surprisingly, I couldn’t find an easy to use, readily working script to copy IMAP messages from one host to another. There are several, but they seem to be unmaintained, to need obscure dependencies, or to require a bizarrely complicated setup.

So, in a first-class act of wheel reinvention, I wrote my own IMAP copy tool in Ruby: imapcopy.rb. The only dependency is highline for the password prompt, but if you don’t want that, you can easily adapt the code.

I really like it: it does everything I needed, doesn’t require any configuration, it only copies messages that are not present on the new host, and even prints a nice spinner ;-). Sample usage:

ruby imapcopy.rb user1@somehost.com user2@gmail.com@gmail.com
Password for user1@somehost.com:
Password for user2@gmail.com@gmail.com:
...
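The "only copies messages that are not present" part can be sketched like this. It is not the actual imapcopy.rb: matching messages by their Message-ID header is an assumption made for this example, and `copy_missing_messages` and `message_ids` are names invented here. Both connection arguments are expected to behave like logged-in Net::IMAP objects.

```ruby
require 'set'
require 'time'

# Layout of INTERNALDATE strings such as "01-Jun-2007 09:04:04 +0200".
INTERNALDATE_FORMAT = '%d-%b-%Y %H:%M:%S %z'

# Collects the Message-IDs of all messages in the selected mailbox.
def message_ids(imap)
  uids = imap.uid_search(['ALL'])
  return Set.new if uids.empty?
  fetched = imap.uid_fetch(uids, ['ENVELOPE']) || []
  Set.new(fetched.map { |msg| msg.attr['ENVELOPE'].message_id })
end

# Copies every message from source to dest whose Message-ID is not yet
# in the destination mailbox. Returns the number of copied messages.
def copy_missing_messages(source, dest, mailbox)
  source.examine(mailbox) # read-only, so fetching does not mark mail as seen
  dest.select(mailbox)
  known = message_ids(dest)
  copied = 0
  source.uid_search(['ALL']).each do |uid|
    envelope = source.uid_fetch(uid, ['ENVELOPE']).first.attr['ENVELOPE']
    next if known.include?(envelope.message_id)
    attr = source.uid_fetch(uid, ['RFC822', 'FLAGS', 'INTERNALDATE']).first.attr
    flags = attr['FLAGS'] - [:Recent] # \Recent cannot be set by a client
    date = Time.strptime(attr['INTERNALDATE'], INTERNALDATE_FORMAT)
    dest.append(mailbox, attr['RFC822'], flags, date)
    known << envelope.message_id
    copied += 1
  end
  copied
end
```

Checking the envelope first means the full message body is only downloaded for mails that really need copying.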

RFC (2)822 dates in IMAP and Courier

Wednesday, October 10, 2007, 07:44 - 0 comments

I’m writing a little Ruby script to download emails from my IMAP server and put them in a Maildir structure. It’s more of a learning exercise - I’m aware that there are working tools for this task, but they all seem a bit complicated to use.

Something strange I noticed is that Courier-IMAP seems to return INTERNALDATE in non-RFC822 (or RFC2822) compliant format:

irb(main):008:0> imap.uid_fetch(16860, 'INTERNALDATE')[0].attr['INTERNALDATE']
=> "01-Jun-2007 09:04:04 +0200"

That should have been “01 Jun 2007 09:04:04 +0200”, with an optional “Fri, ” in front (no dashes!).

This is probably a problem with the standard. While RFC 3501 (IMAP) does not say anything specific about the correct date format to use, it seems to implicitly reference RFC 2822 for that. It also contains examples in RFC2822 format. Another hint that writing good standards and specs is really hard.

Interesting thing: this does not seem to be a problem in practice. Except for my little script, no mail client seems to be bothered by it; they probably all do some fuzzy parsing.
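For the script itself, strict parsing with an explicit format string is enough to cope with the dashed form. A minimal sketch, with `imap_date_to_rfc2822` being a name invented for this example; as far as I can tell the dashed layout matches the `date-time` rule in the formal syntax section of RFC 3501, so a fixed format string should do:

```ruby
require 'time'

# Layout of the INTERNALDATE value shown above: day-month-year with dashes.
IMAP_DATE_FORMAT = '%d-%b-%Y %H:%M:%S %z'

# Parses an IMAP INTERNALDATE string and re-emits it in RFC 2822 style.
def imap_date_to_rfc2822(internal_date)
  Time.strptime(internal_date, IMAP_DATE_FORMAT).rfc2822
end

puts imap_date_to_rfc2822('01-Jun-2007 09:04:04 +0200')
```

`Time.strptime` raises an `ArgumentError` for input that does not fit the format, which is useful here: unexpected date strings show up immediately instead of being guessed at.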

Wide Finder in Scala

Monday, September 24, 2007, 22:42 - 4 comments

Tim Bray:

In my Finding Things chapter of Beautiful Code, the first complete program is a little Ruby script that reads the ongoing Apache logfile and figures out which articles have been fetched the most. It's a classic example of the culture, born in Awk, perfected in Perl, of getting useful work done by combining regular expressions and hash tables. I want to figure out how to write an equivalent program that runs fast on modern CPUs with low clock rates but many cores; this is the Wide Finder project.

So while it’s probably most sensible to do this with some map/reduce library, I tried implementing it using Scala actors. I’m not a Scala programmer, and have no clue about the Actors library, so this code is probably totally wrong, inefficient etc. But at least I can learn something this way :-)

First the original Ruby script:

counts = {}
counts.default = 0

ARGF.each_line do |line|
  if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
    counts[$1] += 1
  end
end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0 .. 9].each do |key|
  puts "#{counts[key]}: #{key}"
end

Converted to Scala, that gives for the serial case:

import java.io.{BufferedReader, FileReader}
import java.util.regex.Pattern
import scala.collection.mutable.HashMap

object SerialAnalyzer extends Application {
  val pattern = Pattern.compile("GET /root/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+)")
  val reader = new BufferedReader(new FileReader("/Users/martin/tmp/log"))

  // Count the hits per URI, one line at a time.
  val counts = new HashMap[String, Int]
  var line = reader.readLine
  while (line != null) {
    val matcher = pattern.matcher(line)
    if (matcher.find()) {
      val uri = matcher.group(1)
      val count = counts.getOrElse(uri, 0)
      counts(uri) = count + 1
    }
    line = reader.readLine
  }
}

This takes about 1.5 seconds to go through 250 MB of log files on a dual core MacBook Pro 2GHz.

And the version using actors:

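import java.io.{BufferedReader, FileReader}
import java.util.regex.Pattern
import scala.actors.Actor
import scala.actors.Actor._
import scala.collection.mutable.HashMap

// Message that tells an analyzer to report its counts and exit.
case object Stop

object Analyzer {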
  def main(args: Array[String]): Unit = {
    val numAnalyzers = if (args.length > 0) Integer.parseInt(args(0)) else 4
    val logreader = new LogReader(numAnalyzers)
    logreader.start
  }
}

class LogReader(numAnalyzers: Int) extends Actor {
  val reader = new BufferedReader(new FileReader("/Users/martin/tmp/log"))
  // Despite its name, this reads a batch of 10001 lines (nulls once the file ends).
  def hundredLines = (for (i <- 0 to 10000) yield reader.readLine).toList
  
  val analyzers = (for (i <- 1 to numAnalyzers) yield new LogMatcher).toList
  analyzers.foreach(_.start)
  
  def act = {
    while (reader.ready) analyzers.foreach(_ ! hundredLines)
    analyzers.foreach(_ ! Stop)
    for (analyzer <- analyzers) {
      receive {
        case result: HashMap[String, Int] => print("Done.\n")
      }
    }
    val resultMap = new HashMap[String,Int]
    for (map <- analyzers.map(_.counts); (uri, count) <- map) {
      resultMap(uri) = resultMap.getOrElse(uri, 0) + count
    }
    for (entry <- resultMap) print(entry._1 + ": " + entry._2 + "\n")
  }
}

object LogMatcher {
  val pattern = Pattern.compile("GET /root/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+)")
}
class LogMatcher extends Actor {
  val counts = new HashMap[String, Int]
  
  def act = {
    loop {
      react {
        case lines: List[String] =>
          for (line <- lines if line != null) { 
            val matcher = LogMatcher.pattern.matcher(line)
            if (matcher.find()) {
              val uri = matcher.group(1)
              val count = counts.getOrElse(uri, 0) 
              counts(uri) = count + 1
            }
          }
        case Stop => 
          sender ! counts
          exit()
      }
    }
  }
}

The code does work, but sadly the Actors version is not faster than the single threaded version on my dual core MacBook Pro. No idea why… also the program exhibits some sort of a memory leak - it seems to keep the whole file in memory, thus giving OutOfMemoryErrors if you don’t run it with a Java heap big enough for the whole log file. Again, no idea why, I don’t seem to keep any nasty pointers to anyone.
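One possible explanation for the memory use (a guess, not verified against the program above): the reader hands batches to the analyzers as fast as it can read them, and nothing limits how many unprocessed batches may pile up in their mailboxes. The sketch below shows the same reader/worker split in Ruby with a bounded queue, so the reader blocks once too many batches are pending. The name `count_hits` and all its parameters are invented for this example, and because standard Ruby threads do not run Ruby code in parallel, it illustrates the structure rather than a speedup.

```ruby
require 'thread'

PATTERN = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

# Counts hits per article with several worker threads. `lines` is anything
# that supports each_slice, e.g. an array or ARGF.each_line.
def count_hits(lines, workers: 4, batch_size: 10_000, max_pending: 8)
  queue = SizedQueue.new(max_pending) # push blocks while the queue is full
  threads = Array.new(workers) do
    Thread.new do
      local = Hash.new(0)
      while (batch = queue.pop) != :stop
        batch.each do |line|
          match = PATTERN.match(line)
          local[match[1]] += 1 if match
        end
      end
      local # becomes the thread's value
    end
  end

  lines.each_slice(batch_size) { |batch| queue.push(batch) }
  workers.times { queue.push(:stop) }

  # Merge the per-worker counts, like the LogReader does at the end.
  threads.map(&:value).each_with_object(Hash.new(0)) do |local, total|
    local.each { |uri, count| total[uri] += count }
  end
end
```

Called as `count_hits(ARGF.each_line)`, it takes its input the same way as the original Ruby script.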

So what does this give? Ruby is an elegant language with a nice collections API. Scala is much nicer than Java, but still quite talkative. And I obviously didn’t really get something about the Scala actors…

PS: The Ruby version takes about 20 seconds to go through 270 MB of logs. The serial, no concurrency Scala version takes 18.5 seconds. Simply reading the data line-by-line using Scala takes over 12 seconds.

Type inference for Java

Tuesday, April 17, 2007, 22:23 - 0 comments

InfoQ has an article on Type inference for Java.

Commenter Steve Jones states the following:

Type inference is just a case of complete laziness and is brought to us by the same sort of people who think that typing is the most time consuming part of the exercise. Are there people who really think that the problem with Java is that there are too many characters in a .java file? It would be great to see some efforts focused around making Java a better language for support and professional development. Things like contracts on methods and classes (ala Eiffel) would be nice. Saving 5 characters just because you think that will be quicker? Complete and utter muppetry.

Actually, yes I do. One problem with Java is indeed the amount of code that needs to be written. And generating code using some IDE is not a solution - code gets read a lot more often than written, so helping with writing doesn’t help at all. Code needs to be easier to understand, not easier to write.

Java files tend to get enormous in size, even for really mundane tasks. This is partially due to bad APIs (I blogged about my experience of putting an XSLT transformation in a self-contained JAR). The other part is just the language itself. I’d really hope that good type inference could clean up a lot of the ugly-to-read code. It will require quite a change in how APIs are written, i.e. being more explicit about the return type of things in the method name.

But still, reducing the SLOC count must be the major goal of any language. Reduce complexity, and the best measure for complexity is still to this day the number of lines of code. For any given task, the solution that requires less code is almost always easier to understand.

The holy grail CSS

Wednesday, April 4, 2007, 08:55 - 0 comments

Elliotte Rusty Harold writes about a slightly modified version of the “Holy Grail” CSS layout on his weblog.

Something that really annoys me with this sort of CSS layout is the need to specify actual widths; be it in percentage or in ems or whatever. What I really want is a CSS layout that has a left and/or right column, and the columns extend to the size they actually need, and the center div shrinks accordingly, maybe in some min/max boundary. This very useful behaviour that HTML tables had seems to be simply impossible with current CSS standards.

It’s surprising how much unintuitive, hackish CSS is needed for such stuff, even without the IE hacks, just regular CSS. If one needs to go to these lengths just to get some really basic stuff everyone needs, maybe the spec is simply not that good? I always struggle with CSS, and the number of web pages with CSS layout templates suggests other people have problems, too.

Packages, Classes, Methods - Scopes?

Sunday, April 1, 2007, 16:57 - 1 comment

I wonder why there is a distinction between the concepts of packages, classes, and methods. If you think about it a bit, it’s all just scopes. You could simply drop the distinctions.

A scope is a basic building block in a programming language that has

Scopes can be instantiated - in the case of packages, it’s loading the package. In the case of classes, it’s actually instantiating the class. In the case of methods, it’s calling the method. Instantiating a scope will initialize the non-static fields to something and yield a scope instance - the activated package, an object instance or the function’s stack frame/closure. The scope instance can then be passed around and members of it can be called.

To make it look nicer, you would probably end up providing some syntactic sugar like how function calls return their ‘return value’ member by default or something. But there doesn’t really seem to be a big conceptual difference here. Packages are currently not used or implemented in this instantiation way (or are they? OSGi?), but I think this might be really beneficial…

The idea probably doesn’t hold up, but whatever, it sounded nice at first :-)
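Ruby happens to make part of this tangible, because a method call can hand back its own local scope as a `Binding` object. The following sketch puts a class instance and a "method instance" side by side; `Counter` and `counter_scope` are names made up for the illustration, and it only covers classes and methods, not packages.

```ruby
# A class instance as a scope instance: fields plus callable members.
class Counter
  attr_reader :count

  def initialize(start)
    @count = start
  end

  def increment
    @count += 1
  end
end

# A method call treated the same way: instead of a return value, the call
# hands back its own local scope, which keeps the locals alive.
def counter_scope(start)
  count = start
  increment = -> { count += 1 }
  binding
end

object = Counter.new(5)
object.increment

scope = counter_scope(5)
scope.local_variable_get(:increment).call

puts object.count
puts scope.local_variable_get(:count)
```

Both instances hold a `count` field and an `increment` member; the only real difference is the syntax needed to reach them.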

DocBook to Word (with free bibliography converter!)

Friday, February 16, 2007, 14:02 - 1 comment

My girlfriend is writing her PhD thesis at the moment. This gives the big problem of what format to use for the document. The options are (as far as I know) LaTeX, DocBook, OpenOffice.org and Word.

LaTeX

I’ve tried LaTeX on her in the past, and she was not too happy. First I tried LyX, which simply did not work for the task we had. It’s actually really confusing, unstable, ugly etc. Not a choice. Editing LaTeX source by hand is nothing she wants to do. Also, she needs to be able to manage the document structure etc. by herself, so this simply doesn’t work. Pro for LaTeX: really nice documents as a result, citation support is ok, though customising it is a pain.

DocBook

So I went on and looked for a more modern publishing format with decent editors. As Lars Trieloff is a friend of mine I know DocBook and have used it several times in the past. There are editors for it that do not totally suck and it’s fairly easy to customise using XSLT. So we tried that for several months.

Net result: citation support totally sucks in DocBook. There are bibliography elements available, but no one really seems to know which elements to use for what (e.g. how to properly document an article, inproceedings, etc.). This was a major obstacle, and tool support for it is also really bad in the editors we tried (XXE and Serna). XXE is a nice editor, but we settled on Serna because its output is closer to the real end result and my girlfriend somehow preferred it.

Serna itself is a good XML editor, and they have a very strong technology behind it. The problem is that it doesn’t really support DocBook very well out of the box. They ship some stylesheets for a very old DocBook version with it, but apart from that there isn’t much.

Another major problem along with citations: tables. There are the CALS tables, which are ridiculously complex to edit, and there are HTML tables, which are complicated to edit. Tool support in Serna is really bad for tables, I’m sorry to say.

So basically, after a long time trying, we gave up on that.

OpenOffice.org

We didn’t try that. I’ve used OpenOffice in the past and I remember it as a horribly slow and ugly copy of MS Word, with a totally weird user interface. I’ve tried to get citation support to work in OpenOffice but didn’t manage to. Also, I’ve had issues with OpenOffice and corrupted files… so if I want an unstable office suite, I can also use MS Office which is at least a bit easier to use.

Microsoft Word 2007

So we ended up here again. Which is really, really sad. Word 2007 sure is better than previous versions; the new user interface in particular is very good. They’ve also added a lot of tools for academia, and the citation support looks really good. But there are still the same stability problems. I’ve just tried saving a large document, and Word simply crashed. Plus, it somehow killed the backup copy - after restarting, 30 mins of work were gone. I’m quite scared at the prospect of creating a 300+ page document with graphics, footnotes, citations etc. in this tool.

But there is no apparent alternative. OpenOffice is no better, plus it sucks more, and the other tools are simply not usable for someone without a strong technical or computer science background. This means frequent backups and crossing the fingers for the next two years…

DocBook exodus and the bibliography

So how did I get the existing document from DocBook to Word? I couldn’t get the available transformation stylesheets to run, so in the end I more or less simply imported the full text from the XML document.

I’ve written a small stylesheet to convert the docbook bibliography to Word 2007 format, available here. It’s probably incomplete, erroneous etc. and will kill all your data, but hey, it’s at least something ;-)

You can simply run your existing bibliography through it and then select that file as the bibliography source in Word, or alternatively merge it with the existing file using your favorite VIM, uh, text editor I mean ;-)

