Gathan Beaga

migrating from Textpattern to Octopress

Textpattern is a great PHP/MySQL-based CMS for blogs and small websites. It’s small, elegant, fast, and well-featured… and also sufficiently obscure that it does not attract the kind of black hat attention that WordPress does. I’ve been using it on my website for over six years now.

Recently, though, I decided to switch my blog to possibly the most popular of the emerging “baked” blog solutions: Octopress. At the same time, I decided to switch domains, just to make things more interesting. (My other two Textpattern sites will remain as they are: it’s still a fantastic lightweight CMS solution for me.)

The following describes how I did it.

Octopress

Octopress is a blogging framework that works the way programmers do, and uses programming tools to do so. Everyone else can use it too, though it does take a bit of mucking about.

Blog posts in Octopress are written in a simplified markup - usually Markdown, but Textile can be used too - and saved to your hard drive with your favourite text editor. At the top of each text file is a metadata “header” written in YAML, a format that is easy for both machines and humans to read. These files constitute what may be thought of as the “source code” for your blog, and this source code is looked after and versioned by a source code manager: Git.

New posts you write and save are then “compiled” into valid HTML and uploaded to your website.
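As a rough illustration of what one of these source files contains, here is a minimal Ruby sketch that assembles a post from a YAML header and a body. The helper name render_post and the sample values are made up for this example; the field names mirror the ones used by the importer later in this article.

```ruby
require 'yaml'

# Build the text of a post source file: a YAML header, a closing "---"
# separator, then the body in Textile or Markdown.
# render_post is a hypothetical helper, not part of Jekyll or Octopress.
def render_post(meta, body)
  meta.to_yaml + "---\n" + body + "\n"
end

post = render_post(
  { 'layout' => 'post', 'title' => 'Hello world', 'comments' => true },
  'First paragraph of the post.'
)
puts post
```

Saved under a name like 2012-03-01-hello-world.textile in the _posts directory, a file of this shape is what gets “compiled” into HTML.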

The advantages of this approach are:

  • Your website is made of static files, and is thus fast and easy for the webserver to serve. If you ever get Fireballed (unlikely in my case) your site is much more likely to survive it.
  • Your website is much more secure. Blog software installs are notorious for introducing vulnerabilities that a hacker or skiddy can exploit.
  • Everything is kept locally in a readable format, not far away in a database you can’t easily see into.

Some of the disadvantages of this approach are:

  • It’s potentially complex: you’ll need to learn some new stuff (like you did when you installed your old blog software, right?).
  • Doing everything locally means you need to look after backup yourself.
  • There’s no fancy user interface.

The problem: my blog

My blog consists of about 850 blog postings spread across 10 years.

In addition to this are around 3,000 comments attached to those postings. A few postings still attract many visits and comments to this day and some might even be thought to have a community around them. So, pretentious though it sounds, I felt I had a duty to try and migrate them across.

What you’ll need

I’m assuming you already have

Migrating the blog postings

Octopress is a sophisticated layer of scripts, templates, and conveniences on top of the Jekyll blogging framework. It is this lower level of software that provides the import capability from many other blogging systems.

There’s an import command in Jekyll that will connect to a local database, pull out your old blog postings, and create a text file for each one. You’ll need to have a look at the detailed instructions on the Jekyll site - there are a few other bits and pieces you’ll need to install first.

Unfortunately, having followed all those instructions, it turned out the migration script provided by Jekyll didn’t quite work the way I wanted.

What URLs to use?

I realised that I wanted a bit more control over the URLs of my new blog. In particular:

  • the new URLs needed to be relatively “cruft-free”
  • I needed predictability of the new URLs so I could load a whole lot of Redirect 301 lines into the .htaccess file of the old blog - this way anyone looking for a particular old blog posting would be painlessly redirected to the appropriate page on the new site.
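With predictable URLs, the redirect lines can be generated rather than typed. The sketch below is only an illustration: it assumes old URLs of the form /article/<id>/<slug> and new ones of the form <new_base>/<slug>, and the helper name redirect_lines and the sample rows are hypothetical.

```ruby
# Build one mod_alias "Redirect 301" line per old post.
# Assumption: old URLs look like /article/<id>/<slug> and the new site
# serves each post at <new_base>/<slug>.
def redirect_lines(posts, new_base)
  posts.map do |post|
    "Redirect 301 /article/#{post[:id]}/#{post[:slug]} #{new_base}/#{post[:slug]}"
  end
end

posts = [
  { :id => 12,  :slug => 'first-post' },  # hypothetical sample rows
  { :id => 345, :slug => 'second-post' }
]
puts redirect_lines(posts, 'http://gath.co.nz/bhalg')
```

The output can be pasted into the .htaccess file of the old blog; in practice the id and slug pairs would come from the same database query the importer uses.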

By default the Jekyll/Octopress import process from Textpattern creates blog postings with URLs of the form /year/month/day/title-of-blog, whereas I just wanted /title-of-blog [1]. It is possible to effect this for new postings through a simple settings change in the _config.yml file in Octopress… but this wouldn’t work with imported ones, which would be stuck with the default.

So some hacking of Jekyll’s Textpattern importer was required.

Hacking Jekyll

I don’t know enough about Ruby to really understand what I was doing here (otherwise I would be able to bundle my changes up and send them back to the Jekyll owner on GitHub). But after a lot of trial and error, I got it to work how I wanted.

The main changes I made to the importer - which on my machine I found at ~/.rvm/gems/ruby-1.9.2-p290/gems/jekyll-0.11.2/lib/jekyll/migrators/textpattern.rb - were the following:

  1. Added require 'yaml' at the top of the file - this is an oversight that has been fixed by Jekyll’s owner in the source, but the fix has not been released yet.
  2. Added YAML::ENGINE.yamler='psych' at the top of the file - I found this helped with the import of some UTF-8 characters that the default YAML parser apparently cannot handle.
  3. Changed the SQL query that extracts the old blog post data from MySQL so that it also brings back a string containing the list of categories, plus the comment count.
  4. Added the lines for the date posted and the category list to the part of the code that generates the YAML metadata for the blog posting file.
  5. Constructed the permalink format that I wanted (it turns out that Jekyll allows a chosen permalink to be baked into the YAML metadata).
  6. Added a comments: true / comments: false YAML header based on the number of comments, so that only blog posts with existing comments would ask Disqus to display them (see Comments below).

The resulting code looked like this:

hacked textpattern.rb (textpattern.rb)
require 'rubygems'
require 'sequel'
require 'fileutils'
require 'yaml'
# below fixes problems with import of UTF8 characters
YAML::ENGINE.yamler='psych'

# NOTE: This converter requires Sequel and the MySQL gems.
# The MySQL gem can be difficult to install on OS X. Once you have MySQL
# installed, running the following commands should work:
# $ sudo gem install sequel
# $ sudo gem install mysql -- --with-mysql-config=/usr/local/mysql/bin/mysql_config

module Jekyll
  module TextPattern
    # Reads a MySQL database via Sequel and creates a post file for each post.
    # The only posts selected are those with a status of 4 or 5, which means
    # "live" and "sticky" respectively.
    # Other statuses are 1 => draft, 2 => hidden and 3 => pending.
    QUERY = "SELECT Title, \
                    id, \
                    url_title, \
                    Posted, \
                    Body, \
                    concat(Category1, case when Category2 <> '' then concat(' ', Category2) else '' end) as Categories, \
                    comments_count \
             FROM textpattern \
             WHERE Status = '4' OR \
                   Status = '5'"

    def self.process(dbname, user, pass, host = 'localhost')
      db = Sequel.mysql(dbname, :user => user, :password => pass, :host => host, :encoding => 'utf8')

      FileUtils.mkdir_p "_posts"

      db[QUERY].each do |post|
        # Get required fields and construct Jekyll compatible name.
        title = post[:Title]
        slug = post[:url_title]
        date = post[:Posted]
        content = post[:Body]
        post_id = post[:id]
        categories = post[:Categories]
        #category2 = post[:Category2]
        comments_count = post[:comments_count]

        name = [date.strftime("%Y-%m-%d"), slug].join('-') + ".textile"

        # Get the relevant fields as a hash, delete empty fields and convert
        # to YAML for the header.
        data = {
          'layout' => 'post',
          'title' => title,
          'date' => date.strftime("%Y-%m-%d %H:%M"),
          # Alternative that keeps the Textpattern post id in the URL:
          # 'permalink' => '/article/' + post_id.to_s + '/' + slug,
          'permalink' => '/bhalg/' + slug,
          # Only posts that already have comments ask Disqus to display them.
          'comments' => comments_count != 0,
          'categories' => categories
          # 'tags' => post[:Keywords].split(',')
        }.delete_if { |k,v| v.nil? || v == ''}.to_yaml

        # Write out the data and content to file.
        File.open("_posts/#{name}", "w") do |f|
          f.puts data
          f.puts "---"
          f.puts content
        end
      end
    end
  end
end

I renamed my new file as textpattern2.rb and ran it from the same location as the old file.

[The command listing that belongs here failed to render in the original post.]

This created all 850 or so blog postings in less than a second. (I had to add that “export” line in response to problems loading MySQL as described here.)

Tidying up and getting online

There were of course lots of cross links between old postings that needed correcting: this called for a regular expression find and replace to be run over the entire directory of blog postings.

Unfortunately this locked up TextMate 2, so I found a specialised app on the Mac App Store (Find and Replace It!) which fixed all these files in a couple of seconds [2]. A regular expression that works as intended is a beautiful thing.
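For anyone who would rather script this step, a find and replace over the whole directory might look something like the following Ruby sketch. It is not what I actually ran: LINK_PATTERN is an assumption about how the old cross links looked (/article/<id>/<slug>), and both method names are hypothetical.

```ruby
# Assumed shape of the old internal links: /article/<id>/<slug>.
# Adjust the pattern to match your own posts.
LINK_PATTERN = %r{/article/\d+/([a-z0-9-]+)}

# Rewrite old-style internal links in a piece of post text.
def rewrite_links(text)
  text.gsub(LINK_PATTERN) { "/bhalg/#{$1}" }
end

# Apply the rewrite to every .textile file in a directory,
# writing back only the files that actually change.
def rewrite_directory(dir)
  Dir.glob(File.join(dir, '*.textile')).each do |path|
    original = File.read(path)
    updated = rewrite_links(original)
    File.write(path, updated) unless updated == original
  end
end

puts rewrite_links('see "this post":/article/42/old-post for details')
```

Calling rewrite_directory('_posts') from the Octopress source directory would then process every imported posting; since the files are under Git, the changes are easy to review and undo.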

Now, I could:

$ rake generate
$ rake deploy

…to get my new site online. Stage 1: complete.

Comments

Octopress blogs are made up of static HTML files and, unlike database-plus-scripting-language blog tools such as Movable Type and WordPress, cannot support commenting natively. However, Octopress does integrate very closely with a third-party commenting service called Disqus. When the user loads an Octopress blog page, a little piece of JavaScript is fired which fetches the comments associated with that page’s URL from Disqus.

Getting the data

Disqus allows import of comments in WordPress’s WXR format. Instructions for coercing your Textpattern blog into generating a WXR file for upload to Disqus may be found on the Textpattern forum [3]. Basically, you create a new Textpattern Section, Page, and Forms which will create one giant “page” whose source code is the WXR XML file you want. Genius.

Massaging the data

Disqus needs to be told which URLs go with which comments. This means more Regular Expressions to convert the old blog URLs in the WXR file into the new ones. In my case this meant not only changing the domain name, but also the blog subdirectory and removing the Textpattern post IDs.

In other words…

  • Look for this: http://oldblog.net/article/[0-9][0-9]?[0-9]?/; and
  • Replace with this: http://gath.co.nz/bhalg/
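Expressed in Ruby, that substitution could be sketched as below. The pattern is the one given above with the dots in the domain escaped; the constant and method names, and the sample line, are made up for illustration.

```ruby
# Old domain, /article/, then a one- to three-digit post id and a slash.
OLD_URL = %r{http://oldblog\.net/article/[0-9][0-9]?[0-9]?/}
NEW_PREFIX = 'http://gath.co.nz/bhalg/'

# Replace every old-style URL prefix in the WXR text with the new one.
def convert_urls(wxr)
  wxr.gsub(OLD_URL, NEW_PREFIX)
end

sample = '<link>http://oldblog.net/article/123/some-post</link>' # hypothetical WXR line
puts convert_urls(sample)
```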

Another problem I ran into during testing was that all the HTML entities for things like smart quotes, angle brackets, and apostrophes were not being converted into their actual characters. This meant that every instance of “I’m” was coming through in the final Disqus comments as “I&#38m”. More regular expressions were required here to hunt down and replace all of these before I could upload the resulting file to Disqus.
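A scripted version of that clean-up might look like the following sketch. It is an illustration rather than the expressions I used: it handles decimal and hexadecimal character references plus a handful of named entities, and the names NAMED and decode_entities are hypothetical.

```ruby
# Named entities handled by this sketch; extend the table as needed.
NAMED = { 'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"', 'apos' => "'" }

# Replace decimal (&#8217;), hex (&#x2019;) and a few named entities with
# the characters they stand for. One pass only: "&amp;lt;" becomes "&lt;".
def decode_entities(text)
  text.gsub(/&(#\d+|#x[0-9a-fA-F]+|[a-z]+);/) do |match|
    ref = $1
    if ref.start_with?('#x')
      [ref[2..-1].to_i(16)].pack('U')
    elsif ref.start_with?('#')
      [ref[1..-1].to_i].pack('U')
    else
      NAMED.fetch(ref, match) # leave unknown names untouched
    end
  end
end

puts decode_entities('I&#8217;m')
```

If the entities in your export turn out to be encoded twice, running the text through the method a second time would be one way to deal with that.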

After upload, it took about 12 hours for the comments to show up in Disqus. Stage 2: Complete.

Next Steps

While Octopress’s default template is rather cool, it’s a tad over-used. At the very least I thought I should make some colour and typeface changes - Octopress makes this very easy - to distinguish my site from everyone else’s. Hence the purple.

Another thing worth thinking about is backup. If the local machine you are using to build your blog fails or disappears, then you can’t update your blog. You need regular backups of the entire Octopress directory. Luckily this is already a Git repository, so anyone used to working with Git won’t have a problem here. Unfortunately I don’t know enough about Git yet to do this easily, so I have set up an rsync job to mirror the files to an online storage space I have.

Everyday usage

Octopress takes about one and a half minutes to generate my website, occupying fully one of the four CPUs of my 2011 MacBook Air during this time. I suspect my site is bigger than most, so I would expect your figures to be a bit less. Even so, I’m quite happy with the time this takes.

There is a really nice “rake preview” command that watches your directory of blog postings for changes and auto-generates these for viewing locally at http://127.0.0.1:4000/. This works really well for small sites… but on mine tends to use enough CPU to get the fan running and therefore isn’t something to do if I want to make the laptop’s battery last. I have a second install of Octopress which I use for preview purposes - it has very few blog postings in it and consequently is very fast and undemanding to regenerate continuously.

Conclusion

This has been a fun process, despite some moments of high-velocity head-desk contact. Octopress really is nice software, and will be worth your effort to investigate. Thanks to Brandon and all the other contributors!


[1] I also had to check all the titles of my old blog postings to make sure there were no collisions (as this would have unpredictable results when Octopress generated my new blog) or interesting non-Latin characters in them that might not work as URLs (e.g., macron characters like the ū in “kererū”).

[2] I couldn’t get the command-line tool sed to work for me and gave in to the temptations of the Graphical User Interface…!

[3] Incidentally, Disqus does not actually require you to include the body of your blog posting along with the comments, so you can actually leave the <txp:body /> out of the disqus-article Textpattern form if you are finding, as I did, that the resulting giant XML file is too big for your server to reliably generate.
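The title check described in the first footnote could be automated with something like this Ruby sketch. The method names and the sample slugs are hypothetical; in practice the list would come from the url_title column of the Textpattern database.

```ruby
# Return the slugs that occur more than once in the list.
def duplicate_slugs(slugs)
  slugs.group_by { |s| s }.select { |_, group| group.size > 1 }.keys
end

# Return the slugs containing anything other than plain ASCII characters.
def non_ascii_slugs(slugs)
  slugs.reject { |s| s.ascii_only? }
end

slugs = ['first-post', 'kererū', 'first-post'] # hypothetical sample data
p duplicate_slugs(slugs)
p non_ascii_slugs(slugs)
```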
