Archive for June 2008

Weekly Summary

It was a good week. I finally got all of the automated TC Spring tests to pass for Spring 2.5.4, so I was able to mark that issue done. Terracotta now clusters Spring 2.0.x through 2.5.x. That code base is due for a refactoring, though. Our code for clustering Spring uses AspectWerkz to define join points all over the Spring source code, not just the public API. What this means, as I’ve ranted about before, is that even minor changes to Spring’s source code (as occur even between minor releases such as 2.0.5 and 2.0.8) have broken our clustering code. What I’d like to do, when time permits, is see if we can rewrite our aspects to only use methods of the public Spring API as join points. That should give us a whole lot more stability.

My boss Alex is prepping me to help him do some more performance testing. He recently wrote some great blog entries about that. We met with the product management team this week to brainstorm what sort of testing we want to do, what sort of data they might want to have from a marketing/sales perspective, etc. As Alex pointed out, it’s a tricky thing - this sort of testing always leads to finding bugs, which leads to bug fixes, which invalidates any prior testing, and so you have to start over. Luckily, we already have a very capable distributed testing framework, developed in-house by Alex, in which we can pretty easily script tests with Groovy. We can have agents on multiple machines (i.e. L1 nodes, talking to a TC L2 server) and have the agents start workers to run tests. The agents can do things like kill and restart workers, to test having to repartition a distributed cache. Sounds like the first thing we’re going to measure is the load time and then the TPS (transactions per second) for a couple of different kinds of distributed caches: ConcurrentHashMap and Ehcache.

We found out this week our next big company-wide gathering in San Francisco will be the week of Oct. 13-18. I’ve already booked my flight and hotel room. I’m excited - these trips have so far been a lot of fun.

I did a phone interview for a candidate to join my team. Probably shouldn’t elaborate on that yet, but I will say that Terracotta is very thorough with candidates. When I interviewed back in January, I did five phone interviews, four of them with other engineers, before being invited to come out in person. When I did fly out, I was interviewed by another five people, including the CEO and CTO! Honestly, although it was exhausting, I had a great time! I loved being challenged by, and having conversations with, some very smart and talented people who have produced some amazing software.

New software this week: OmniGraffle, which I’ve heard from everyone is the only graphics editing software you need on a Mac. I’ve got a copy now, which I will hopefully be using in the not-too-distant future to write some more technical blog entries about Terracotta. Also, Alex encouraged us to try out FindBugs, including its Eclipse plugin (available from an update site). I’ve added both of these to my list of essential Mac software for the Terracotta developer.

Weekly Summary - Slow, Painful TC Spring Progress

After two weeks of debugging, on Friday I added five magic lines of code to make the last three automated Spring tests pass. With this change we can say we support Spring up to 2.0.8 in our upcoming 2.6.2 release. We are still targeting Spring 2.5.x support for an upcoming summer release.

Two weeks of debugging, five lines of code, one bug fix. This has to be improved on.

Just for my own amusement, I’m going to try to list and describe all of the things that accounted for all the time spent on this.

By midweek last week, I had finally gotten into a groove with the Eclipse debugger, stepping through one of the failing Spring container tests. By “container” test, we mean parts of the test were actually running in a web app container, Tomcat in this case. Even so, this was not a very good, tight feedback loop. The basic pattern was for me to start the test, then attach the debugger to each of the two processes running in Tomcat. (Luckily, our code was already set up to allow for debugging, although it took me a while to find and enable the magic property.) Then I just had to step through code in the debugger, not really sure what I was looking for. I had to alternate doing this, first with Spring 2.0.1 in the classpath, since our test was passing for that version, then with Spring 2.0.8 in the classpath, and look for what was going wrong. (We have a way of easily varying the version of a library such as Spring. However, I spent basically all of Tuesday helping Hung debug a problem in our build process concerning those variants.) Each time I alternated Spring variants, I also had to tweak the source lookup in my Eclipse remote debug configuration, to pull up source from the right Spring version.

The test uses our DSO functionality; it’s basically a semi-complete running instance of the DSO client. At one point I found I needed to hit breakpoints that were occurring during DSO bootstrapping, which meant I had to set yet another magic property (in ClassProcessorHelper) to enable debugging, which is normally not available prior to TC instrumentation. I had to dig through my IM chat transcripts to find the name and location of that property, which my boss had mentioned once weeks ago, then enable it, and then find through painful trial and error that it didn’t work unless I did a full clean recompile (but it worked like a charm after that).

The code I was stepping through was instrumented using AspectWerkz, although for the most part the breakpoints seemed to work fine. But the amount of code was just vast. All I can say is, I spent probably twelve to fourteen hours of straight debugging on Thursday and Friday, just hitting breakpoints, comparing state, following hunches and wild goose chases and red herrings. In the end I found the Spring class whose source had changed between 2.0.1 and 2.0.5, and again in 2.0.8. (It was org.springframework.aop.config.ScopedProxyBeanDefinitionDecorator, deep in the guts of Spring’s AOP framework.)

So in the final tally, I spent unexpected amounts of time

  • helping debug a build problem which prevented us from using the “variants” feature, to vary the Spring library
  • setting up Eclipse projects for different versions of Spring, so I could browse 2.0.1 and 2.0.8 source code side by side
  • tweaking the Spring source lookup for the Eclipse remote debugging configuration, to alternate between 2.0.1 and 2.0.8
  • figuring out how to enable debugging of our container tests
  • figuring out how to enable debugging during DSO client startup
  • trying (in vain) to write a more lightweight unit test to tackle the problem with
  • debugging for many, many hours once it was all working

And that’s not even counting actually learning AspectWerkz and writing a new pointcut and advice, which I would have had to do anyway, but which itself involved some painful trial and error.

And at the end of all of this, I sort of feel like I’ve just patched some big honking beast that I don’t fully understand. I feel like our code as it stands now (the AspectWerkz pointcuts and advices for clustering Spring) is still very tightly coupled to Spring source code in a very fragile way, and it’s only a matter of time before a new minor Spring version comes along and breaks it again. All week I was thinking, there has got to be a better way.

Our code for instrumenting and clustering Spring is really cool and mind-blowing, don’t get me wrong. I never would have thought of it in a million years. But it is a classic case of code that is tightly coupled, not modular at all. There is an impressive amount of test coverage through automated integration and system tests, which we need. But the Spring test suite takes about an hour to run, and that’s not even every test. There are a lot of AspectWerkz advices and pointcuts being used to cluster Spring, and none of them can be run and tested in isolation. You just have to start up the whole shebang and debug.

I consider myself a TDD and refactoring proponent. All week long, the pain of trying to debug this was screaming out to me that this part of the code needed refactoring and unit tests. Honestly, I’m not sure I could have done any worthwhile refactoring and still figured out the problem in the same amount of time. That’s the classic TDD/refactoring chicken and egg problem - when you’re in a time crunch, the prospect of trying to clean up some code in order to make things easier at some undetermined future point seems insurmountable, and meanwhile you always tend to mentally downplay the amount of time it will take to just debug and fix the code as is. That’s the fear that comes with TDD.

In an interesting twist, I was talking to my boss Steve, and he asked me why I wasn’t refactoring. My boss wants me to refactor! He didn’t order me to or anything, but he sounds very much in favor of it. He said he almost always regrets not refactoring as he goes.

I think keeping the code clean and testable is so important to the long term health of a large, constantly evolving code base. I mean, we can’t just keep having software developers spend two weeks debugging one bug.

There’s a lot more I want to think about along these lines, but I’ve written enough for now.

Bash and TC Build Hacks I Learned in the Last Two Hours

There’s very good documentation about Terracotta’s in-house TC Build system already. But I’ve been doing some intense debugging with Hung, and have learned some things that I want to write down before I forget.

run without ivy: tcbuild blah blah --no-ivy - I’m assuming this runs faster because it skips using Ivy to check that all dependencies are in place.

run without compiling: tcbuild --no-compile blah... - useful when just shuffling some runtime dependency or something.

put environment stuff in .bashrc

check trunk/buildsystem to find things like jruby

For our automated container tests, individual jar files are placed in one huge WAR file. This is not true for ordinary unit tests.

Doing something like ./tcbuild check_one CustomScopedBeanTest --no-ivy > log.txt 2>&1 puts the output in a file; the trailing 2>&1 sends the error stream to the same place as the output stream, so both end up in log.txt.
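The order of the two redirections matters, which is easy to forget. Here is a minimal sketch that shows the difference, using a made-up noisy function as a stand-in for the real build command (the function and the file names are purely illustrative, not part of tcbuild):

```shell
#!/bin/sh
# Hypothetical stand-in for a build command: one line to stdout, one to stderr.
noisy() {
  echo "to stdout"
  echo "to stderr" >&2
}

workdir=$(mktemp -d)

# stdout is pointed at the file first, then stderr is pointed at the same place,
# so both lines land in both.txt.
noisy > "$workdir/both.txt" 2>&1

# Reversed order: stderr is duplicated onto the current stdout (the terminal)
# before stdout is redirected, so only the stdout line lands in the file.
noisy 2>&1 > "$workdir/stdout_only.txt"

# Print the line count of each file to compare.
wc -l < "$workdir/both.txt"
wc -l < "$workdir/stdout_only.txt"
```

So if log.txt is missing the error messages, the first thing to check is whether 2>&1 came before the > log.txt.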

Important shared stuff lives at /shares/terra/jdk/, such as Java, Ant, etc.

Grep trick 1: ps -ef | grep java to see details about Java processes running

Grep trick 2: env | grep JAVA to see environment variables I should have set up to run tcbuild

Grep trick 3: find <path> -name <filenamepattern> | xargs grep <searchstring> - finds all files matching filenamepattern that also have the search string within them
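One caveat with grep trick 3: a plain find | xargs pipeline splits on whitespace, so paths containing spaces get mangled. A minimal sketch of a safer variant, assuming a find and xargs that support -print0 and -0 (the GNU and BSD versions do); the directory layout and file contents below are invented just for the demo:

```shell
#!/bin/sh
# Build a throwaway tree with a space in one directory name (hypothetical layout).
workdir=$(mktemp -d)
mkdir -p "$workdir/src/my tests"
printf 'class CustomScopedBeanTest {}\n' > "$workdir/src/my tests/CustomScopedBeanTest.java"
printf 'class Other {}\n' > "$workdir/src/Other.java"
printf 'CustomScopedBeanTest notes\n' > "$workdir/src/notes.txt"

# -print0 / -0 pass NUL-separated paths, so the space in "my tests" is harmless.
# grep -l prints only the names of the files that contain the search string.
matches=$(find "$workdir/src" -name '*.java' -print0 | xargs -0 grep -l 'CustomScopedBeanTest')
echo "$matches"
```

Only the .java file containing the string is reported; notes.txt is skipped by the -name filter and Other.java by grep.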

find trick: rm -rf `find . -type d -name .svn` - removes all .svn directories recursively
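The backtick version of the find trick can misbehave if any path contains spaces, since the shell splits the substituted output into words. A minimal sketch of an alternative that lets find run rm itself, demonstrated on a throwaway directory tree (the tree is made up for illustration):

```shell
#!/bin/sh
# Build a fake working copy with .svn directories, one under a path with a space.
workdir=$(mktemp -d)
mkdir -p "$workdir/proj/.svn/text-base" "$workdir/proj/sub dir/.svn" "$workdir/proj/src"
touch "$workdir/proj/src/Keep.java"

# -prune stops find from descending into each .svn directory it is about to delete;
# -exec ... {} + passes the matched paths to rm as whole arguments.
find "$workdir/proj" -type d -name .svn -prune -exec rm -rf {} +

# Count any .svn entries left behind.
remaining=$(find "$workdir/proj" -name .svn | wc -l | tr -d ' ')
echo "$remaining"
```

It is worth running the find part without the -exec first, to eyeball what is about to be deleted.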

~/.tc/appserver is where Tomcat is stored during automated tests - may want to remove as sanity check sometimes.

~/.ivy* is where ivy stuff is stored - may want to remove prior to doing total clean rebuild.

Weekly Summary - TC Spring again

This weekly summary actually encompasses the last three weeks. Sigh.

Lots of activity throughout dev is centered around the Terracotta 2.6 and 2.6.1 releases, as well as the upcoming 2.6.2 release.

Primarily I’ve been working on updating Terracotta’s Spring support to 2.5.x. Currently we only support up to 2.0.5. I had thought I had gotten it working up through Spring 2.0.8, but late last week we fixed a bug in our build process which then revealed three failing automated TC Spring tests which were previously (incorrectly) passing. So Spring 2.0.8 is not quite there…but close. Meanwhile, my compadre Nitin had made some changes that got TC working with Spring 2.5, but those changes are not backwards compatible with Spring 2.0.x, so I’m investigating whether they can be merged together somehow. Since we are dependent on the Spring source code in order to instrument their code (by using AspectWerkz), we are subject to the whims of whatever source code changes occur between even minor releases (such as differences between Spring 2.0.5 and 2.0.8).

The other thing of note that I got accomplished was to respond to a post on our forums about a deadlock occurring in Terracotta L1. The poster had nicely laid it all out for us, with a stack trace excerpt clearly showing the deadlock. My teammates and I reviewed the pertinent class, and I cleaned up a number of synchronization bugs and cases of missing synchronization. The deadlock itself was fixed by moving a collection to a CopyOnWriteArrayList; previously the collection was being locked while we iterated through it (read-only) and did expensive stuff. The fix will be in the 2.6.2 release.

I was without internet connection at my house a couple weeks ago for a few days. I had to do bloody battle with Charter to get that fixed. Ultimately a technician came and found that the line to my house had been put on a splitter at some undetermined point in the past, and so my signal strength was no longer strong enough. Meanwhile, luckily, I was able to go to my parents’ house and get some work done there. Have I mentioned that I love my MacBook Pro, and wireless internet?