Victor Costan: March 2009

Monday, March 30, 2009

Managing Software Dependencies

Some software development decisions are more important than others. This post argues that decisions involving dependencies are among the very important ones, and describes my approach to managing dependencies.

What Are Dependencies
For the purpose of this post, dependencies are pieces of software outside the project or component that you are considering. Software development does entail other dependencies, like the value of a local currency, but those are outside the scope of my write-up.

Why Worry About Dependencies
Decisions where we take dependencies are among the most important software development decisions we take, because dependencies come with costs and constraints.

Maintenance costs are the ongoing cost associated with keeping the dependency. This cost does include traditional maintenance, such as staying informed about new versions, and applying security updates, but it can go much further. For example, taking a dependency on a Windows-only API in a Web server imposes the cost of a Windows license on every machine running the server. Furthermore, maintenance costs aren't always easy to estimate. For example, the biggest cost in using a library developed by a small group of people is not licensing or integration, but rather the potential cost of having to take on the development of that library, if the initial developers cease working on the library.

Replacement costs are more straightforward -- they are the price paid to completely remove the dependency on a piece of software. They matter because the replacement cost is the maximum "premium" that you will pay in maintenance cost for a dependency, over the optimum cost. Here is why: if the maintenance cost for using Windows becomes so large that it's cheaper to pay the replacement cost for Linux, plus the maintenance cost for Linux, then you will switch to Linux. So the biggest premium that you will pay to stick with Windows is how much it would take to replace it.
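A rough sketch of this reasoning in Python. The function name and the numbers are made up for illustration, and all costs are assumed to be totals over the remaining life of the project, in the same unit.

```python
def premium_paid(current_maintenance, alternative_maintenance, replacement_cost):
    """Extra cost paid over the cheaper alternative, assuming we switch when it pays off."""
    # What staying costs us, compared to the alternative (never negative).
    stay_premium = max(0, current_maintenance - alternative_maintenance)
    # Switching costs the replacement price once, so we take the cheaper path.
    return min(stay_premium, replacement_cost)


# Hypothetical numbers in arbitrary currency units.
print(premium_paid(100, 70, 50))  # staying is cheaper than switching
print(premium_paid(200, 70, 50))  # switching is cheaper, so the premium is capped
```

Whatever the maintenance costs are, the result never exceeds the replacement cost, which is the point of the paragraph above.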

Incompatibility constraints come with every dependency taken. Technical incompatibilities tend to be obvious, for example DirectX requires Windows, Cocoa requires MacOS, so there is no straightforward way to write a Cocoa application using DirectX. Other incompatibilities are more subtle, like licensing. The GPL license is the most well-known pain, because GPL code cannot be linked together with code released under some other free licenses. Last but not least, there are "versioning hell" incompatibilities, where library A requires library B, at most version 1.0, and library C requires library B, version 1.1 or above, and for this reason, A and C cannot be used together.
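The "versioning hell" case can be checked mechanically. Below is a minimal Python sketch, assuming purely numeric dotted versions; the helper names are hypothetical, and real package managers handle far more cases (pre-releases, exclusions, and so on).

```python
def parse(version):
    """Turn '1.10' into (1, 10) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def compatible(max_allowed, min_required):
    """True if some version is both at most max_allowed and at least min_required."""
    return parse(min_required) <= parse(max_allowed)


# One library accepts the shared dependency up to 1.0, the other needs 1.1 or above.
print(compatible("1.0", "1.1"))
```

The two ranges do not overlap, so the check fails and the two libraries cannot be used together.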

These costs and constraints are the first factors I weigh when taking a new dependency. My approach is described below.

Managing Dependencies
In a nutshell, my strategy around dependencies is as follows. Avoid unnecessary dependencies, and take cheap dependencies. Failing that, make the expensive dependencies easy to replace.

Unnecessary Dependencies
To me, the most important aspect of managing dependencies is being aware when I'm taking them. For example, Linux or OSX developers can habitually use fork or POSIX filesystem permissions. This habit becomes a problem when developing multi-platform code, because the features are not present on Windows. Higher-level languages are not immune to platform dependencies either. In SQL, it's all too easy to use a database-specific extension, and popular scripting languages (Ruby, Python) have extensions which may not be available on Windows, or may crash on OSX. Versioning hell dependencies are also a pain, and keeping track of them requires a perspective that is more commonly possessed by accountants than by coders.

Fortunately, continuous builds can be used to delegate the tedious bookkeeping to computers. Continuous builds set up to run on Windows and Mac OSX protect from taking an unwanted dependency on Linux. A continuous build running tests against SQLite and PostgreSQL database backends protects from dependencies on MySQL. Continuous builds warn about troublesome code early on, when programmers will still be inclined to fix it. For example, it's easier to replace the fork / exec pair with a call to system before it becomes a pattern sprinkled around the entire codebase.
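To make the fork / exec example concrete, here is a small Python sketch. Python's os.fork is only available on Unix, while the standard library's subprocess module works on Windows as well. The run_child helper is a name I made up for illustration.

```python
import subprocess
import sys


def run_child(args):
    """Run a child process without fork/exec; return its exit code and output."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout.strip()


# Use the current Python interpreter as the child, so the example runs anywhere.
code, out = run_child([sys.executable, "-c", "print('hello from child')"])
print(code, out)
```

Keeping process creation behind one helper like this also means there is a single place to change if the portable call ever needs replacing.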

Awareness is only the first step. Most of the time, a dependency has to be taken in return for extra functionality, and I have to decide what dependency I'm taking, and write the integration code. In this case, I consider the issues I presented in the previous section.

Cheap Dependencies
If the maintenance cost will clearly be low, I don't worry too much about the dependency. For example, if I'm using ruby, I assume the Rubygems library is installed or easily available, so I don't think twice before using its functionality. When figuring out maintenance cost, I pay most attention to incompatibility constraints. The following findings ring alarm bells in my head:

platform dependencies; Example: if it doesn't work on Windows, I can't use it in a desktop application.
restrictive licenses; Examples: GPL, licenses forbidding using code in a commercial setting
patents; A subtle example is that Adobe's supposedly open Flex platform uses the Flash file format, which is patented by Adobe. Though Adobe published a specification of the Flash format, it prohibits the use of the specification to build competing Flash players
niche open-source; Ohloh tracks some statistics that can indicate a potentially troublesome open-source project, like a short revision history, a single committer, and uncommented code

Expensive Dependencies
When the maintenance cost of a dependency will be high, I take extra precautions to lower the replacement cost. I try to learn about at least one alternative, and write the integration code in such a way that it would be easy to swap that alternative in. The goal behind this is to develop a good abstraction layer that insulates the rest of my application from the dependency, and keeps the replacement cost low. Two common examples of this practice are JavaScript frameworks, which insulate application code from browser quirks, and ORM layers such as ActiveRecord that put a lot of work into database independence.
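Here is a minimal sketch of such an abstraction layer in Python. The class and function names are hypothetical. SQLite (through the standard sqlite3 module) plays the role of the expensive dependency, and an in-memory dictionary plays the role of the alternative I would learn about.

```python
import sqlite3


class MemoryStore:
    """The alternative implementation: a plain dictionary."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class SqliteStore:
    """The dependency, hidden behind the same put/get interface."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")

    def put(self, key, value):
        self._db.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))
        self._db.commit()

    def get(self, key):
        row = self._db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None


def remember_visit(store, user):
    """Application code: it only knows about put/get, not about SQLite."""
    count = int(store.get(user) or 0) + 1
    store.put(user, str(count))
    return count


# The same application code runs unchanged against either store.
for store in (MemoryStore(), SqliteStore()):
    print(type(store).__name__, remember_visit(store, "ann"), remember_visit(store, "ann"))
```

Swapping the dependency means writing one new class with put and get, and the loop at the end doubles as a tiny test that both implementations behave the same way.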

Having good automated tests provides many advantages that prolong the life of a codebase. One of them is reducing the replacement costs for all the dependencies. Uprooting a dependency is a nightmare when developers have to sift through piles of code by hand. The same task becomes routine when the computer can point at the code that needs to be changed. Without a good automated test suite, dependencies can become really rigid ("this application only works with Rails 2.2, it'd take forever to port to Rails 2.3" versus "we spend a few hours to update the application when a new version of Rails comes out").

The effort that goes into keeping replacement costs low is typically repaid many times over by the benefits of being able to replace old or troublesome dependencies. Of course, this only holds for long-lived projects, and I wouldn't pay as much attention to how I integrate my dependencies when I'm exploring or building a throw-away prototype.

Conclusion
Many good software projects don't shine because of their dependencies (example: Cocoa, because it only runs on Mac OS X). The total cost of long-lived projects is largely influenced by the cost of living with their dependencies. Therefore, it makes sense to invest effort into steering away from dependencies that may bring trouble or even doom the project down the line. Hopefully, this post has presented a few considerations that will help you spot these troublesome dependencies, and either avoid them or at least insulate your codebase from them.

One More Thing
I promise I won't make this a habit, but I want to end this post with something for you to think about. As a programmer, choosing which skill to learn next is closely related to the dependencies problem explored above. We learn new technologies to use them in our projects, which means the projects will take dependencies on those technologies. So, we might not want to learn technologies which translate into troublesome dependencies.

I will write more about looking at dependencies from this different angle, next week.

Wednesday, March 25, 2009

Removing Default Ruby Gems on OSX Leopard

This post describes a quick way to remove the gems that come pre-installed on OSX Leopard.

Method
First, you should update your gems, so you have newer versions for all the gems you're about to remove. While you're at it, update rubygems as well.

sudo gem update --system

sudo gem update

Now blast the directory containing the gems that came with OSX.
sudo rm -r /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/gems/1.8

If, for some reason, that directory does not exist on your system, you can see where rubygems stores its gems by running gem env paths. Most likely, the old gems have already been cleaned.

Enjoy being able to clean up all the old gems on your system.
sudo gem clean

Warning
Removing the gems this way is permanent. If you don't like that thought, rename the 1.8 directory to 1.8.dead, and create an empty 1.8. This way, rubygems doesn't see the old gems, but they are still around, if you need them for some reason. So, instead of rming,
sudo mv /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/gems/1.8 /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/gems/1.8.dead
sudo mkdir /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/gems/1.8

Motivation
The pre-installed gems were released 2 years ago, so they're really old by now. They need to go away. Doing a gem clean will fail to remove them (tested with Rubygems 1.3.1 and below). What's worse, gem clean will also fail to remove other old gems that you have installed so, after a while, you'll have a lot of cruft on your system.

I wrote this post because, up until now, I've been too lazy to figure out the gem cleanup situation. Now that I finally did, I want to make it easy for others to get their systems clean.

Conclusion
I've described a quick way to remove the old ruby gems that come preinstalled with OSX Leopard. This is useful because gem clean is non-functional in the presence of those gems. I hope you have found the post useful. Please comment if you have better or quicker solutions to this problem.

Sunday, March 22, 2009

Your Web Server and Dynamic IPs

This post describes the techniques I'm using to host my application from a server whose IP changes over time. The post assumes the server's IP only changes when the server is not in use, and therefore I do not address servicing requests during the IP change. Instead, I am concerned with restoring the mapping between the server's DNS entries and its IP in an automated and reasonably quick manner.

Overview
I signed up for a dynamic DNS service. This gives me a DNS name that points to any IP I want, and some software that I install on my server to automatically update the IP behind that DNS name. Then I set the user-visible DNS hostname (www.something.com) as a CNAME pointing to the dynamic DNS hostname.

The technique generalizes to serving multiple applications (with separate domains) from a single server. The DNS entries for all the applications are set as CNAMEs pointing to the server's dynamic DNS entry. The HTTP port on the server is owned by a reverse proxy and load balancer dispatching requests to each application's backends based on the Host: header in the HTTP request.

Dynamic DNS Service
You can get dynamic DNS for free. I use dyndns.com's service, and it worked for me. If you want to shop around, here's a list of providers, courtesy of Google Search.

Once you sign up for service, you should get a hostname (like victor.dyndns.com) that you can point to any IP. This host name will be transparent to your users, so you don't need to worry about branding when choosing it. Your only worry is having to remember it.

The important decision you have to make here is the TTL (time-to-live) of your entry. This is the time it takes to propagate an IP change. Shorter values have the advantage that your server can be accessed quickly after it is moved. Longer values mean the IP address stays longer in the users' browser cache, so they have to do DNS queries less often. This matters because the dynamic DNS adds an extra DNS query that users' browsers must perform before accessing your site, which in turn adds to the perceived latency of your site. Your TTL choice will be a compromise between availability after a move and the average latency increase caused by the extra DNS lookup.
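To get a feel for the compromise, here is a toy model in Python. It is only a sketch under strong assumptions that are mine, not measurements: a single cache per user, visits arriving at random (Poisson) times, and a fixed cost per uncached lookup. Under those assumptions, each cache miss is followed by a TTL-long window of hits, so the fraction of visits that miss is 1 / (1 + rate * TTL).

```python
def average_dns_penalty_ms(lookup_ms, ttl_seconds, visits_per_hour):
    """Average extra latency per visit caused by the extra DNS lookup (toy model)."""
    rate = visits_per_hour / 3600.0  # visits per second
    miss_fraction = 1.0 / (1.0 + rate * ttl_seconds)
    return lookup_ms * miss_fraction


# Hypothetical user visiting once an hour, with a 200 ms uncached lookup.
for ttl in (60, 600, 3600):
    print(ttl, round(average_dns_penalty_ms(200, ttl, 1), 1))
```

Longer TTLs lower the average penalty, but the TTL is also roughly how long users can be stuck with the old IP after the server moves.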

Dynamic DNS Updater
To make the most out of your dynamic DNS service, you need software that updates the IP associated with the DNS hostname.

My Rails deployment script automatically configures the updater for me (source code here). I use ddclient, because it's recommended by my dynamic DNS service provider.

In order to use DynDNS on Ubuntu:

sudo apt-get install ddclient
Edit /etc/init.d/ddclient and replace run_daemon=false with run_daemon=true
Use the following configuration in your /etc/ddclient.conf

pid=/var/run/ddclient.pid
use=web, web=checkip.dyndns.com/, web-skip='IP Address'
protocol=dyndns2
server=members.dyndns.org
login=dyndns_username
password='dyndns_password'
dyndns_hostname

The updater will start on reboot. If you want to start it right away,
sudo /etc/init.d/ddclient start
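
For the curious, the use=web and web-skip settings above tell the updater to fetch a page and read the IP that follows the 'IP Address' text. The Python sketch below illustrates that idea only. It is not ddclient's code, the extract_ip helper is a made-up name, and the sample response text is an assumption about what such a page contains.

```python
import re


def extract_ip(body, skip="IP Address"):
    """Return the first IPv4-looking address after the skip marker, or None."""
    _, found, rest = body.partition(skip)
    if not found:
        return None  # the marker is missing, so we don't trust the page
    match = re.search(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b", rest)
    return match.group(1) if match else None


# Assumed shape of the page text; the real page may differ.
sample = "Current IP Address: 18.242.5.133"
print(extract_ip(sample))
```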

Other Options
If you use DynDNS, but don't run Linux, they have clients for Windows and OSX. If you don't use DynDNS, this Google search might be a good start.

My home router (running dd-wrt) uses inadyn. I don't like that on my server, because it takes my password on the command-line, so anyone that can run ps will see my password.

Application DNS Setup
Having done all the hard work, you close the loop by setting up a CNAME mapping your application's pretty DNS name to the dynamic DNS hostname. If you don't want to pay for a domain, you can give out the dynamic DNS hostname to your users... but it's probably not as pretty.

The process for setting up the CNAME mapping depends on your domain name provider (who sold you www.something.com). The best source of instructions I know is the Google Apps Help. If you use that, remember to replace ghs.google.com with your dynamic DNS hostname.

Debugging
Chances are, your setup will not work the first time. Even if it does, your setup might break at some point. Your best aid in debugging the DNS setup is dig, which comes pre-installed on Mac OSX and most Linux distributions.

Run dig www.something.com, and you'll get an output that looks like this:

moonstone:~ victor$ dig www.mymovienights.com
(irrelevant header, removed)
;; QUESTION SECTION:
;www.mymovienights.com.        IN    A

;; ANSWER SECTION:
www.mymovienights.com.    1742    IN    CNAME    chubby.kicks-ass.net.
chubby.kicks-ass.net.    2    IN    A    18.242.5.133

;; Query time: 211 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)

(irrelevant footer, removed)

I removed the parts that are completely uninteresting. The parts worth reading are the answer section and the query time. The answer section shows a DNS chain built following this post: a CNAME from the application's hostname to the dynamic DNS hostname, followed by an A record with the server's IP. If your chain doesn't look like this, you know where to fix the error. If everything looks good here, but you still can't reach your server, the problem is either at the networking layer (can you ping the server?) or at the application layer (your load balancer or application server is misconfigured).
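
If you check this often, the answer section is easy to parse. The Python sketch below uses hypothetical helper names and works on text copied from dig, such as the output above. It extracts the records and verifies that each CNAME points at the next record, ending in an A record.

```python
def parse_answer_section(dig_output):
    """Return (name, ttl, record_type, value) tuples from dig's ANSWER SECTION."""
    records = []
    in_answer = False
    for line in dig_output.splitlines():
        line = line.strip()
        if line.startswith(";; ANSWER SECTION"):
            in_answer = True
            continue
        if in_answer:
            if not line or line.startswith(";"):
                break  # a blank line or a new ";;" header ends the section
            name, ttl, _klass, rtype, value = line.split(None, 4)
            records.append((name, int(ttl), rtype, value))
    return records


def chain_is_consistent(records):
    """True if every CNAME targets the next record's name and the chain ends in an A record."""
    if not records or records[-1][2] != "A":
        return False
    return all(
        rtype == "CNAME" and value == records[i + 1][0]
        for i, (_name, _ttl, rtype, value) in enumerate(records[:-1])
    )


sample = """\
;; ANSWER SECTION:
www.mymovienights.com.    1742    IN    CNAME    chubby.kicks-ass.net.
chubby.kicks-ass.net.    2    IN    A    18.242.5.133

;; Query time: 211 msec
"""
records = parse_answer_section(sample)
print(records, chain_is_consistent(records))
```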

Another interesting result you get from dig is the query time, which shows the latency introduced by DNS to the users who visit your site for the first time. Unfortunately, this doesn't give accurate numbers if dig's answer is in some DNS cache, so be sure to account for that in some way when measuring latency.

Monitoring
I use Google's Webmaster Tools because they provide free monitoring. The overview is sufficient to see if the site is up or down. If you have a Gmail account and use it frequently, you can embed a gadget showing your site's status into your Gmail view.

Multiple Applications
I use the same server for multiple Web applications. I have a separate DNS hostname for each application, and they all point to the same dynamic DNS hostname via CNAMEs.

On the server, I use nginx as my reverse proxy because it is fast and it can be reconfigured with no downtime, while it's serving user requests. You can use Apache if you prefer, using these instructions.

My reverse proxy setup is done automatically by my Rails deployment script (source code here). Here's how you can get a similar configuration:

sudo apt-get install nginx
For each application, create a file in /etc/nginx/sites-enabled/ with the following configuration

upstream application_name {
  # Your application server (see the note on ports below).
  server 127.0.0.1:8080;
}

server {
  listen 80;
  server_name www.something.com;
  root /path/to/your/application/html/files;
  client_max_body_size 48M;
  location / {
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header Host $host;
    proxy_redirect off;
    proxy_connect_timeout 2;
    proxy_read_timeout 86400;

    # Serve the file directly if it exists on disk.
    if (-f $request_filename) {
      break;
    }

    # Otherwise, try the directory's index.html, then the .html variant.
    if (-f $request_filename/index.html) {
      rewrite (.*) $1/index.html break;
    }
    if (-f $request_filename.html) {
      rewrite (.*) $1.html break;
    }
    # Everything else goes to the application server.
    if (!-f $request_filename) {
      proxy_pass http://application_name;
      break;
    }
  }
}

This configuration handles requests for www.something.com by serving static files directly through nginx when they are available, and by forwarding the HTTP requests to your application server at port 8080 otherwise. If you do not want to serve static files from nginx, remove the root clause, and all the if clauses. Tweak any other numbers as needed.

Of course, you cannot use port 80 for any of your application servers.
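
Conceptually, serving several applications from one server comes down to a lookup on the Host: header. The Python sketch below illustrates only that idea, not how nginx is implemented. The second hostname and its port are hypothetical.

```python
BACKENDS = {
    "www.something.com": ("127.0.0.1", 8080),
    "www.another-app.example": ("127.0.0.1", 8081),  # hypothetical second application
}


def pick_backend(headers, backends=BACKENDS):
    """Choose an application server based on the request's Host header."""
    # Drop an optional ":port" suffix and ignore case, since hostnames are case-insensitive.
    host = headers.get("Host", "").split(":")[0].strip().lower()
    return backends.get(host)  # None means no application is configured for this host


print(pick_backend({"Host": "www.something.com"}))
print(pick_backend({"Host": "unknown.example"}))
```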

The server will start on reboot. If you want to start it right away,
sudo /etc/init.d/nginx start

DNS Prefetching
If you're worried about the latency added by the extra layer of DNS, you can use prefetching to work around this limitation. DNS prefetching is a fancy name for tricking the user's browser into doing a DNS lookup for your hostname before the user interacts with your application.

If you're wondering whether this prefetching thing actually matters, know that Google uses DNS prefetching in Chrome. Sadly, most Web developers don't have enough leverage over their users to convince them to install custom software.

Firefox supports link prefetching, and you can find it useful if your users install a widget / gadget that's served from a CDN (e.g. Google Gadgets).

You can also be more creative by looking at the bigger picture. For instance, if your users install an application of yours on their mobile phones, those phones will likely do DNS queries using your users' home routers. So, if your mobile application synchronizes with the server using a sync interval that's smaller than the TTL on your DNS entries... you've killed most of the latency.

Motivation
My servers have been hosted in random places. I've had my application server in my dorm room, in my friends' dorm rooms, and in random labs around MIT.

Given that my servers travel so much, I like to keep them light (Mac Mini or Dell Studio Hybrid) and I want to be able to move them without any manual configuration change. This means the servers can be headless, and that my friends can move the servers for me, without the need for any training.

Conclusion
Thanks for reading, and I hope you found this post useful. Please leave a comment if you have any suggestion for an easier or better setup.

Wednesday, March 11, 2009

Great Time To Be a Web Programmer

If you don't know client-side Web programming (HTML, CSS, and Javascript) already, it should be the next technology you learn! I'm pretty sure that 2009 starts the golden era of these technologies, and this post explains why. Aside from making my point, I highlight some cool and very useful pieces of technology along the way.

Overview
My argument goes along the following lines: Javascript has evolved into a mature language, with good frameworks. Browsers got faster at Javascript, and better at standard compliance. Major Web sites offer easy access to their data through APIs. Applications and widgets based on Web technologies can be easily integrated into the desktop, or other Web applications. Last, but definitely not least, generous free hosting is available, and can be set up quickly.

Read on, and find out what technologies I'm referring to.

The Platform Is Ready
Javascript got off to a really bad start. Starting from the language's name itself, and continuing with the horribly slow and buggy browser implementations, Javascript got a bad reputation.

However, today's Javascript is a well-understood and pretty productive language. Libraries like Dojo, Prototype/scriptaculous, and jQuery abstract browser incompatibilities away, and insulate programmers from the less inspired DOM APIs. The HTML5 draft, which is being adopted pretty quickly (compared to the time it took to get CSS2 in) by the leading quality browsers, specs out many goodies, such as offline Web applications, push notifications, and better graphics.

Equally important, browsers are in a Javascript speed race, and the winners are us. Between Safari 4, Google Chrome, and Firefox 3.1, we should have fast Javascript execution on all major operating systems before the end of 2009.

Integration Opportunities Abound
Integration comes in two flavors. First, you might want to use data from other sources, like Google, Facebook, and Twitter. Second, your idea may not be suitable for an entire Web application, and might fare better on the desktop, or as a widget. There is great news coming from both fronts.

JSONP is an easy way to get data, despite the cross-domain restriction policy, and major companies have been taking it seriously. Google's search API and Twitter's API have JSONP support. Yahoo's Query Language goes one step further and lets you get other sites' content wrapped up in nice JSONP. Did I mention Dojo's seamless support for Google's search API?
If you want to integrate your application with your user's desktop, you have Google Gears and Mozilla Prism today, and HTML5 in the future.
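
In case JSONP is new to you: the server wraps its JSON answer in a call to a function named by the client, so the response can be loaded with a script tag from another domain. Here is a minimal server-side sketch in Python. The function name and the sample data are made up, and the callback check is a precaution I added, not something mandated by the APIs mentioned above.

```python
import json
import re

# Accept only plain (optionally dotted) Javascript identifiers as callback names.
CALLBACK_PATTERN = r"[A-Za-z_$][\w$]*(\.[A-Za-z_$][\w$]*)*"


def jsonp_response(callback, data):
    """Wrap JSON data in a call to the client-supplied callback function."""
    if not re.fullmatch(CALLBACK_PATTERN, callback):
        raise ValueError("unsafe callback name")
    return f"{callback}({json.dumps(data)});"


print(jsonp_response("handleResults", {"query": "dojo", "hits": 3}))
```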

Applications that don't need a lot of screen space can be packaged effectively as widgets. Widgets are supported natively in Mac OS by Dashboard, and in Vista's Sidebar. For a more cross-platform solution, you should check out Google Gadgets, which work the same on the Web, in the Mac OS dashboard, in Linux, and in Windows.

Oh, and one more thing. Google's gadgets also work in their productivity suite - in Gmail, in Spreadsheets, and in Google Sites. So you could impress your boss by building a dashboard with important numbers straight into their Gmail.

REST Decouples Client From Server
Remember ugly long URLs? REST (Representational State Transfer) is a collection of design principles which yields the opposite of those long URLs. It matters because, once your client-server API obeys REST, your client is not dependent on your server implementation.

Using REST works out very well with the approach of pushing most of the application logic to the client-side Javascript code. An argument for why most of your code should be client-side follows.

If you're looking for free hosting (covered in the next section), the server code will not be in Javascript, but rather a server-side language, like Ruby, Python, or Java. Choosing a server language narrows down your platform choice (for example, at the moment, Google's App Engine only works with Python). If you're looking for free hosting, you want to be able to port your code quickly to whichever platform offers better free quotas at the moment.

Using REST designs with JSON input and output gives you "standardized" servers that are easy to interact with, and easy to code. On the client side, for example, Dojo automates the data exchange for you. On the server side, Rails scaffolds have built-in REST/JSON support, or you can pick up ready-made App Engine code.
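
To show how little server code a REST/JSON design needs, here is a toy resource in Python. It is a sketch only: the MovieResource class is hypothetical, it keeps its data in memory, and it is not tied to Rails, App Engine, or any real framework.

```python
import json


class MovieResource:
    """A toy REST resource: /movies and /movies/<id>, with JSON bodies."""

    def __init__(self):
        self._movies = {}
        self._next_id = 1

    def handle(self, method, path, body=None):
        """Return a (status code, response body) pair for one request."""
        parts = [p for p in path.split("/") if p]
        if len(parts) == 1 and parts[0] == "movies":
            if method == "GET":  # list the collection
                return 200, json.dumps(list(self._movies.values()))
            if method == "POST":  # create a new item
                movie = dict(json.loads(body), id=self._next_id)
                self._movies[self._next_id] = movie
                self._next_id += 1
                return 201, json.dumps(movie)
            return 405, ""
        if len(parts) == 2 and parts[0] == "movies" and parts[1].isdigit():
            movie_id = int(parts[1])
            if movie_id not in self._movies:
                return 404, ""
            if method == "GET":  # show one item
                return 200, json.dumps(self._movies[movie_id])
            if method == "DELETE":  # remove one item
                del self._movies[movie_id]
                return 204, ""
            return 405, ""
        return 404, ""


api = MovieResource()
print(api.handle("POST", "/movies", '{"title": "Alien"}'))
print(api.handle("GET", "/movies/1"))
```

The client only sees URLs, HTTP verbs, and JSON, so the same client code would keep working if this server were rewritten in another language.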

Hosting Can Be Free
Web applications are very easy to access, but the servers are a pain to set up. Furthermore, hosting costs money - even if you're on Amazon EC2, there's a non-zero amount of money that you have to pay. Most of us programmers don't like to pay for stuff.

Fortunately, there's the Google App Engine, and it has a free tier which is pretty much equivalent to running your own server. "Pretty much" covers everything except storage, which is currently capped at 1GB.

If you prefer gems to snakes, like me, check out Heroku for Rails hosting. Heroku's beta platform is free, and they promised a free tier on their production version. Their free tier may not end up being as generous as Google's, but you can always downgrade to Python if your application becomes successful. Update: Google's App Engine can run Java now, which leads to support for Ruby and other languages. This post has more details.

Conclusion
I hope I have convinced you to make learning HTML, CSS, and Javascript a priority this year. If not, here's "one more thing" - you can build hosted solutions for small companies (50 people or fewer) with zero infrastructure cost. Google Apps, together with the App Engine, gives you SSO (single sign-on), and Gadgets can be used to integrate into Gmail.

Thanks for reading this far, and I hope you have found this post to be helpful!