J. Brisbin

Just another Wordpress.com weblog

Archive for April 2010

Even more reason why Oracle/Sun isn’t really competitive in the cloud

I blogged recently about our terrible experience with Oracle/Sun support. I’d like to say it’s just because they’re getting out of the hardware business and their heart isn’t in it any more. I’d like to say it’s not a systemic problem with the company as a whole. I’d like to say that (honestly) only because I recommended we buy Sun hardware so we could run Solaris 10 with the virtualization offered by zones. We continue to be frustrated with Oracle/Sun as a company and we see systemic problems in the organization that make it unreliable as a vendor for our mission-critical applications.

There are too many problems to go into much detail on each one, but suffice it to say that our experience with Oracle/Sun support has been so bad, we have a demo next week to check out HP’s newest virtualization offering because the company we’re working with on that:

a) Answered a question I had about hooking up Solaris 10 to our SAN. The guy who helped me wasn’t a Solaris guy, just a UNIX guy. But we got it done. The support tech at Sun who was supposed to be helping me had a list of about 6 or 8 questions about our environment that he wanted me to answer first. Things it would have taken me more than an hour to answer. This HP guy helped me with a Sun problem on the fly with no questionnaires.

b) Will take our Sun servers on trade-in.

If the price is right and we like what we see, we’re going to dump Oracle/Sun entirely and trade in 5 Sun servers for something else–anything else.

To prove I’m not just exaggerating the atrociousness of the situation, our primary warehouse server running Solaris 10 on Sun hardware crashed on Sunday morning. The tech finally showed up Wednesday after lunch and proceeded to replace the main system board. He fired the server back up (at least tried to) and got the same result as before: nothing. Completely dead. No BIOS, no boot screen, no nothing. He tried taking the processors out and swapping sockets, updating ILOM firmware, everything. It’s just as dead as a doornail. Then he tried to get someone at Oracle/Sun on the phone to discuss the next steps. After wading through automated menus and never getting to talk to a real person (this is the Oracle/Sun tech, remember, not Joe the Customer) he smiled and told me: “I feel your pain.” So the tech left last night with our server riding shotgun with him back to The Big City to operate on it.

If the field tech can’t even contact the engineer who’s supposed to be helping us with our problem, how can that company hope to support the Vast Throng of cloud-computing sycophants who are getting tech envy and want a piece of the action? How can a company that does business like this be serious competition for those (claim jumpers though they may be) already in this space?

But that’s just the hardware side of things, I hear. I use product X and I get great support, I also hear. Consider yourself lucky, then, to have avoided the vast bureaucracies within the company that make it an inefficient behemoth. I understand now why the company was failing before Oracle bailed them out and why they off-loaded Java to the community (which I’m glad they did, of course). But I can see what this merger with Oracle has done to the company and I haven’t seen any net gain yet.

Written by J. Brisbin

April 28, 2010 at 9:34 pm

Oracle/Sun can’t answer anyone, let alone SpringSource

I had to chuckle when I read the headline that Oracle “answers” VMware’s purchase of SpringSource with their new WebLogic tools. I’ve just spent two days knee-deep in problems caused by a catastrophic hardware failure in one of our Sun servers and I’m convinced that a company that does business the way Oracle/Sun does cannot survive in this new cloud ecology. Actually, I’m surprised they’ve made it this far.

When we found out (on a Sunday afternoon, no less) that the particulars of the support contract we had were Monday-Friday, 8:00-5:00 and not 24/7 we dutifully kicked ourselves in the rears and called them back. What can we do to get an upgraded support contract? Where can we send the check? We weren’t asking them to do something for free. We were willing to upgrade our support contract, pay for a support incident, or go to the friggin bank and bring back a roll of non-sequential Benjamins for them for crying out loud. Monday rolls around and we start calling our vendor, who supposedly calls the Oracle/Sun account rep, who never gets back to us. We spend most of the day getting bounced from one department within Oracle/Sun to another. Level one has me upgrade some firmware to the latest version so I can tell them the mainboard is fried much more elegantly than I can with the old version that’s currently on the machine. I talk to at least three different people within support before they begin the gradual process of bouncing my issue to the field techs, who will have to come onsite with a part. All the while, we still couldn’t get anyone on the phone to take our money. We were begging them to give us a chance to pay them whatever they wanted to get our support contract bumped up to the level where they actually start taking you seriously and don’t promise to call you back in 30 minutes when they mean two hours.

Oh, and that tech who’s supposed to come out? His pager number we were given isn’t valid. We tried paging him and the familiar ascending tones and the pleasant voice: “that number is no longer in service.”

As of now, the Sun hardware is still kaput. Maybe a tech will show up tomorrow, maybe they won’t. I’m not going to hold my breath. Oracle as a company is a vast, inhuman labyrinth of bureaucracies, none of which knows what the others are doing. You can’t talk to a tech and then ask them to transfer you to someone who can take your money. You also can’t upgrade support plans on the fly. Buy a time machine instead and get the support plan you should have bought the first time.

All day I kept thinking about how Oracle was trying to “answer” what SpringSource was doing in the cloud computing space. It’s great that they understand where the industry is headed. But I can’t see a company that does business like this succeeding in anything it does. The last Sun server we bought (because we were running everything in Solaris 10, but we’ve moved completely away from that now to VMware and Ubuntu Linux) came, literally, in pieces. It’s like they just put all the parts required to construct a server into a box and shipped that to us. I couldn’t believe it. The actual CPUs weren’t even in the chassis. I had to install the heat sinks myself with that gray putty from the hardware guys. A $12,000 server and I’m putting in the parts myself. The developer. The system admin.

I will not willingly do business with Oracle or Sun ever again. Their support has been of no use, their account reps are unreliable, and their offerings pale in comparison to the robust and flexible solutions being offered by SpringSource and the community of cloud implementors. I don’t usually try to dissuade anyone from the product or vendor of their choice. Do what you want. Whatever works, right? But know that my experience with this new Oracle/Sun behemoth has been nothing but frustratingly schizophrenic. I’m embarrassed now to have ever suggested we use Sun for anything. Operating system, hardware, or what have you. I should have stuck with Linux.

I’ve learned my lesson. Never again, Oracle.

Written by J. Brisbin

April 27, 2010 at 1:10 am

Distributed Atomicity in the cloud with RabbitMQ

The private cloud Tomcat/tcServer session manager I’m working on has a huge job cut out for it. Maintaining the state of an object that exists in possibly more than one location at any given point in time is not an easy task, I know. To be honest, if it weren’t for my Midwestern stubbornness, I might not take the time to work through these hefty issues. I might follow the path of least resistance, like most of the industry has done so far.

I just don’t like the idea of sticky sessions. I look at my pool of tcServer instances as one big homogeneous group of available resources. In my mind, there should be no distinction made between machines running in different VMs–or even on different hardware. They should exist and cooperate together as a single unit.

But in “replicated” mode, each server has a copy of the object. This is great for failover and it makes the session manager extremely performant. But yet another sticky wicket rears its ugly head. How do I protect this object and make sure it gets updated properly before someone else has a chance to operate on it?

Call it distributed atomicity if you want–the idea being that an object exists within the context of a cloud of compute resources (in this case, a Tomcat/tcServer user session object) and needs to be updated with all the right attributes when code in a different physical process operates on that object. I’m attacking this problem by implementing a form of distributed atomicity that uses RabbitMQ to send the contents of newly-added attributes to any interested parties throughout the cloud. I already replicate the session object by grabbing it with a Valve, just before the request is completed. This session object gets serialized to the cloud before the response is sent, the idea being that this particular object will be updated in all the places it is needed before another server has a chance to operate on that object.

By using the messaging infrastructure of RabbitMQ, I can at least make updates to this object reasonably atomic. Now the question becomes: where does this object live? For performance reasons, it’s probably not realistic to have just one object to share among web application servers. In the case of Tomcat/tcServer, the internal code is requesting the session object so often (multiple times during a single request) that each server simply has to cache a session object for the length of the user’s request.

A tool like ZooKeeper might be helpful in this case. If code has to set an attribute on a session object, the session would set a barrier in ZooKeeper that lets other code know it is in the process of being altered. Once setAttribute() is finished, a message is then sent with the serialized attribute. The other interested parties could update their local copies of the object with the new attribute until they receive a full replication of the object. Would the second, full replication be superfluous? At this point I can’t say. In the interest of completeness, I feel compelled to issue a second replication event, but in the interest of performance and bandwidth conservation, I wonder if it’s really necessary.
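
The two kinds of message described above (a single-attribute update, then a full replication) could be handled on the receiving side roughly as follows. This is a minimal sketch, not the session manager’s actual code: the class name LocalSessionCopies and its methods are hypothetical, sessions are modeled as plain attribute maps, and the ZooKeeper barrier and the RabbitMQ plumbing are left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical local cache of session copies; not the author's actual session store.
final class LocalSessionCopies {
  // sessionId -> (attribute name -> attribute value)
  private final Map<String, Map<String, Object>> sessions = new ConcurrentHashMap<>();

  // Apply a single-attribute update message to the local copy.
  void applyAttributeUpdate(String sessionId, String name, Object value) {
    sessions.computeIfAbsent(sessionId, id -> new ConcurrentHashMap<>()).put(name, value);
  }

  // Replace the local copy wholesale when a full replication message arrives.
  void applyFullReplication(String sessionId, Map<String, Object> attributes) {
    sessions.put(sessionId, new ConcurrentHashMap<>(attributes));
  }

  // Returns null when the session or the attribute is unknown locally.
  Object getAttribute(String sessionId, String name) {
    Map<String, Object> attributes = sessions.get(sessionId);
    return attributes == null ? null : attributes.get(name);
  }

  public static void main(String[] args) {
    LocalSessionCopies copies = new LocalSessionCopies();
    copies.applyAttributeUpdate("abc123", "cart", "1 item");
    System.out.println(copies.getAttribute("abc123", "cart"));

    Map<String, Object> full = new HashMap<>();
    full.put("cart", "2 items");
    copies.applyFullReplication("abc123", full);
    System.out.println(copies.getAttribute("abc123", "cart"));
  }
}
```

If the attribute updates always arrive and are applied in order, the full replication only overwrites the local copy with the same state, which is one way to frame the question of whether it is superfluous.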

I’m far from finished with the cloud-based session manager. I’m trying to get it to a stable point so that I can migrate my cloud away from sticky sessions. The “replicated” mode seems to work fine; and I’m okay with sending too many messages–I’d rather have that than have too few and end up with page loads blocking because the session can’t be tracked down.

Distributed, asynchronous programming isn’t easy. It isn’t for the faint of heart or those with pesky bosses breathing down their necks to meet arbitrary and usually unhelpful deadlines. It also doesn’t help if you’re not a bona-fide genius. I often feel a little out of my league given the number of CompSci grads that are doing fantastic work in this interesting and growing segment of the industry. But I’m stubborn enough to keep plugging away when I should probably give up.

Written by J. Brisbin

April 22, 2010 at 6:17 pm

Tomcat/tcServer cloud session manager now has “replicated” mode

I’ve updated my virtual/hybrid cloud Tomcat/tcServer session manager to use two different modes of operation. The default mode is what I’ve described previously. The new mode of operation is called “replicated” and it, as the name implies, replicates the user’s session object to every node consuming events on that exchange. This might be the whole cloud, it might not, depending on how you have your exchanges configured.

I’m working on code to only replicate the session if it sees changes in the MD5 signature of the serialized session. Otherwise, it’ll conserve your bandwidth and not replicate the session until it has to. Until then, though, the entire session gets replicated after every request. Excessive? Maybe. 🙂
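
The MD5 check could look something like the sketch below. It is an illustration under assumptions, not the code in the repository: the class name SessionChangeDetector and the method shouldReplicate are made up here, and the session is assumed to be already serialized to a byte array by whatever mechanism the manager uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper that remembers the last MD5 digest seen for each session ID.
final class SessionChangeDetector {
  private final Map<String, byte[]> lastDigests = new ConcurrentHashMap<>();

  // Returns true when the serialized session differs from the last one seen for this ID.
  boolean shouldReplicate(String sessionId, byte[] serializedSession) {
    byte[] digest = md5(serializedSession);
    byte[] previous = lastDigests.put(sessionId, digest);
    return previous == null || !Arrays.equals(previous, digest);
  }

  private static byte[] md5(byte[] data) {
    try {
      return MessageDigest.getInstance("MD5").digest(data);
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide MD5, so this should not happen.
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  public static void main(String[] args) {
    SessionChangeDetector detector = new SessionChangeDetector();
    byte[] v1 = "cart=1".getBytes(StandardCharsets.UTF_8);
    byte[] v2 = "cart=2".getBytes(StandardCharsets.UTF_8);
    System.out.println(detector.shouldReplicate("abc123", v1)); // true: first sighting
    System.out.println(detector.shouldReplicate("abc123", v1)); // false: unchanged
    System.out.println(detector.shouldReplicate("abc123", v2)); // true: contents changed
  }
}
```

One caveat with this approach: it only works if serializing an unchanged session always produces the same bytes, which depends on how the session’s attributes serialize.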

I’m also trying a different approach to loading user sessions. Rather than contacting the direct queue of the node that advertises itself as the owner of that session, I’m sending a load message to the fanout exchange. This way, dedicated replicator/failover consumers can also respond to load requests in case a node goes down unexpectedly.

At the moment, there’s still no persisting of sessions to disk since a server replicates all its sessions off that node when it’s going down. I’m not sure I really need to dump a node’s sessions to disk when it goes down. I think I want to have dedicated consumers for that purpose.

With dedicated failover consumers, when new servers come up, they get the list of current sessions from the failover node. I don’t see that restoring things from disk would add significant functionality to this store. If you feel differently, be sure and let me know. It wouldn’t be difficult to implement a disk-based persistence mechanism for restarts.

You can check out the source code from GitHub:

git clone git://github.com/jbrisbin/vcloud.git

The only other change is that you now need to add a special replication valve that calls the replicateSession() method after each request invocation.

<Valve className="com.jbrisbin.vcloud.session.CloudSessionReplicationValve"/>

Written by J. Brisbin

April 20, 2010 at 4:08 pm

Posted in The Virtual Cloud

Change logging package to SLF4J/Log4J in tcServer/Tomcat

I really dislike the JULI logging package, which is Tomcat’s (and thus tcServer’s) default. Its configuration seems uncomfortable and the log files are almost unreadable without grepping out what you’re looking for. In all my other applications I use SLF4J, powered by Log4J. This combination is powerful, easy to configure, and I like that it doesn’t put the date in the log file’s name until after it’s rotated. There’s been discussion on the Tomcat list recently about maybe changing this in the future, but I’m not very patient and I’d rather not spend the precious little time I do have mucking about with things that are difficult.

The documentation describing the switch from JULI to Log4J isn’t very long or informative, though the process itself–to be fair–isn’t very complicated. But I get the sense that not many Tomcat developers want to discuss switching from JULI to Log4J, hence the lack of documentation.

Making the switch for tcServer is really only one additional step, though the way tcServer structures its instance directories makes it slightly more complex to configure for use with Log4J.

Due Diligence

Please read the official documentation on switching from Tomcat JULI to Log4J first. We’ll be doing things a little bit differently, but you should understand where we’re coming from before simply jumping into this.

Building Tomcat

In order to switch from the default Tomcat JULI package, you’ll need to build Tomcat from source, then build the “extras” module. The official documentation leaves out that you have to build the whole server first, then build the extras. If you build only the extras, without building the whole server, you’ll end up with ClassNotFound errors when you try to start Tomcat/tcServer.

UPDATE: You can build the extras module from source, but, come to find out, SpringSource has helpfully included the two jar files mentioned in “tomcat-6.0.20.C/bin/extras”. You can simply copy those jar files to the locations discussed here rather than building the whole server from source.

  1. I’m using tcServer 6.0, so download the source tarball for Tomcat 6.0.20 and unzip it somewhere.
  2. “cd” into that directory.
  3. Copy the build.properties.default file to build.properties.
  4. “vi” build.properties and uncomment the “jdt.loc” property, which will allow the Ant build to download the JDT compiler, which is a requirement of the build process.
  5. Increase Ant’s heap size: export ANT_OPTS=-Xmx256m
  6. Build the server: ant
  7. Once the Tomcat server has been successfully built, build the “extras” module: ant -f extras.xml

When that’s finished:

  1. Copy extras/tomcat-juli.jar (from either $TCSERVER_HOME/tomcat-6.0.20.C/bin/extras/ or $TOMCAT_SRC/output/extras/) to $TCSERVER_HOME/tomcat-6.0.20.C/bin/tomcat-juli.jar.
  2. Copy extras/tomcat-juli-adapters.jar (from the same extras directory) to $TCSERVER_HOME/tomcat-6.0.20.C/lib/
  3. Delete $TCSERVER_INSTANCE_DIR/conf/logging.properties.

Now, copy the Log4J and SLF4J jars. I used the ones from my personal Maven repository (from the $TCSERVER_HOME directory):

cp ~/.m2/repository/log4j/log4j/1.2.15/log4j-1.2.15.jar tomcat-6.0.20.C/lib
cp ~/.m2/repository/org/slf4j/slf4j-api/1.5.8/slf4j-api-1.5.8.jar tomcat-6.0.20.C/lib
cp ~/.m2/repository/org/slf4j/slf4j-log4j12/1.5.8/slf4j-log4j12-1.5.8.jar tomcat-6.0.20.C/lib
cp ~/.m2/repository/org/slf4j/jcl-over-slf4j/1.5.8/jcl-over-slf4j-1.5.8.jar tomcat-6.0.20.C/lib

Configuration

When you’ve got all the dependencies copied over, you need to put a configuration file in one of two places, depending on how you want to configure logging for your instances. In my case, I use three identical instances (actually, the names of the instances are different, but other than that, they’re identical) of tcServer, so I could put my log4j.xml file in tomcat-6.0.20.C/lib/. In your case, though, assuming your instances are configured differently from one another, you might want to put your log4j.xml file in (assuming an instance name of “dev1”) dev1/lib/.

NOTE: You also need to “vi” the tcServer start script (tcserver-ctl.sh) and comment out the lines that deal with a logging manager and a logging config file (lines 261-262 and 268-269). UPDATE: I actually don’t think this is necessary now. I think my errors were caused by something else. I think it’s safe to leave these be.

If you’re already using Log4J and SLF4J, you’ve likely already got an example XML file lying around that you could use. Copy that file to one of the locations mentioned previously. Mine looks something like this:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">

  <appender name="console" class="org.apache.log4j.ConsoleAppender">
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d %-5p %c{1} - %m%n"/>
    </layout>
  </appender>

  <appender name="catalina" class="org.apache.log4j.DailyRollingFileAppender">
    <param name="File" value="${catalina.base}/logs/catalina.log"/>
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d %-5p %c{1} - %m%n"/>
    </layout>
  </appender>

  <appender name="vcloud" class="org.apache.log4j.DailyRollingFileAppender">
    <param name="File" value="${catalina.base}/logs/vcloud.log"/>
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d %-5p %c{1} - %m%n"/>
    </layout>
  </appender>

  <category name="org.springframework">
    <level value="INFO"/>
  </category>
  <category name="org.quartz">
    <level value="INFO"/>
  </category>
  <category name="org.apache.catalina">
    <level value="INFO"/>
    <appender-ref ref="catalina"/>
  </category>
  <category name="com.jbrisbin.vcloud">
    <level value="DEBUG"/>
    <appender-ref ref="vcloud"/>
  </category>

  <root>
    <level value="INFO"/>
    <appender-ref ref="console"/>
  </root>

</log4j:configuration>

You can now add categories and appenders to suit your particular needs. You can also change the pattern to suit your tastes.

Written by J. Brisbin

April 20, 2010 at 2:54 pm

Securing services within a hybrid, private cloud

It seems to me there is an inverse relationship between security and functionality. To gain functionality in my hybrid/virtual cloud environment, I have to sacrifice security. Or do I?

To be honest, I’m not entirely certain yet. I keep running into problems I need to solve within my cloud that I can’t easily address. Case in point: keeping configuration files in sync.

In order to keep a configuration file in sync, I have to provide some mechanism to overwrite a local file with the contents of a remote file (either one that’s downloaded for that purpose, or comes as the body of a RabbitMQ message). Even though our user pool is limited to only employees of the company, that still exposes (at least theoretically) a vulnerability in that a user with access to our LAN could manipulate the file over-writing process to inject their own script onto a cloud-based machine. To combat this, I’m only exposing limited files for updating based on a key value (rather than the full, real path). Even if a malicious user was able to force a config file to be updated with the contents of their script, it would only be of a limited number of files, and those files would not have their direct paths exposed because the updater takes a key value which maps to the real local path inside the update consumer and this map is never exposed to users.
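
A rough sketch of the key-based lookup, assuming a fixed map that lives only on the machine running the update consumer. The class name ConfigFileRegistry, the key "apache.vhosts", and the example path are all hypothetical and only illustrate the idea that a message carries a key, never a path.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical registry mapping opaque keys to the real local paths of updatable files.
final class ConfigFileRegistry {
  private final Map<String, Path> pathsByKey;

  ConfigFileRegistry(Map<String, Path> pathsByKey) {
    // Defensive copy, so the mapping cannot be changed after construction.
    this.pathsByKey = Collections.unmodifiableMap(new HashMap<>(pathsByKey));
  }

  // Only exact, pre-registered keys resolve; anything else yields an empty Optional.
  Optional<Path> resolve(String key) {
    return Optional.ofNullable(pathsByKey.get(key));
  }

  public static void main(String[] args) {
    Map<String, Path> known = new HashMap<>();
    known.put("apache.vhosts", Paths.get("/etc/apache2/sites-enabled/vhosts.conf")); // example path
    ConfigFileRegistry registry = new ConfigFileRegistry(known);

    System.out.println(registry.resolve("apache.vhosts").isPresent());    // true
    System.out.println(registry.resolve("../../etc/passwd").isPresent()); // false
  }
}
```

Because the key is only ever used as a map lookup and never concatenated into a path, a path-traversal string in a message is simply an unknown key.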

Assuming this malicious user was able to get their script on a server, they’d still have to invoke it somehow. My config file updater, which is meant to only update configuration files, writes its updates to the filesystem with the executable bit turned off, so the script would not be executable.
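
Writing a file with the executable bit off might be done along these lines. This is a sketch that assumes a POSIX filesystem (on others, setPosixFilePermissions throws UnsupportedOperationException); the class name NonExecutableWriter and the rw-r--r-- mode are choices made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical writer for config updates; assumes a POSIX filesystem.
final class NonExecutableWriter {
  // rw-r--r--: no execute bit for owner, group, or others.
  private static final Set<PosixFilePermission> CONFIG_PERMS =
      PosixFilePermissions.fromString("rw-r--r--");

  static void writeConfig(Path target, byte[] contents) throws IOException {
    // Creates the file if needed, otherwise truncates and overwrites it.
    Files.write(target, contents);
    // Set permissions explicitly so an existing file's execute bit is cleared too.
    Files.setPosixFilePermissions(target, CONFIG_PERMS);
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("vhosts", ".conf");
    writeConfig(tmp, "# example\n".getBytes(StandardCharsets.UTF_8));
    System.out.println(PosixFilePermissions.toString(Files.getPosixFilePermissions(tmp)));
    Files.delete(tmp);
  }
}
```

Clearing the bit only stops the file from being run directly; an interpreter invoked with the file as an argument would still read it, so this is one layer of protection and not a complete answer.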

After a configuration file has been updated, it’s very likely a service will need to be restarted. Exposing some functionality to restart services also gives me pause because I’ll be running commands in the shell at the behest of a message which I can’t really be certain isn’t coming from a malicious user. By using the key-based method previously described, though, I can severely limit the number of commands that might potentially be run. Exposing only the key value means the full path to the script is not derived from the data coming from the message but from data already living on the machine which is executing the command. If a malicious user was able to manipulate this side of the system, they’d still only be able to invoke restarts on specific services.
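
The same key-based idea applied to restarts could be sketched like this. The class name RestartCommands, the key "apache", and the init script path are hypothetical; the point is only that the command line comes from local configuration and the message contributes nothing but a lookup key.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical allowlist of restart commands, keyed by an opaque service key.
final class RestartCommands {
  private final Map<String, List<String>> commandsByKey;

  RestartCommands(Map<String, List<String>> commandsByKey) {
    this.commandsByKey = Collections.unmodifiableMap(new HashMap<>(commandsByKey));
  }

  // Looks up the locally configured command; message content never becomes part of it.
  List<String> commandFor(String key) {
    List<String> command = commandsByKey.get(key);
    if (command == null) {
      throw new IllegalArgumentException("No restart command registered for key: " + key);
    }
    return command;
  }

  // Runs the command directly, without a shell, and returns its exit code.
  int restart(String key) throws IOException, InterruptedException {
    return new ProcessBuilder(commandFor(key)).inheritIO().start().waitFor();
  }

  public static void main(String[] args) {
    Map<String, List<String>> known = new HashMap<>();
    known.put("apache", Arrays.asList("/etc/init.d/apache2", "restart")); // example command
    RestartCommands commands = new RestartCommands(known);
    // Only the lookup is shown here; restart("apache") would actually run the command.
    System.out.println(commands.commandFor("apache"));
  }
}
```

Passing the command to ProcessBuilder as an argument list avoids handing anything to a shell, so there is no shell syntax for a crafted message to exploit.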

To fully exploit the system, a malicious user would have to inject their own configuration into a service (maybe by putting in Proxy directives to an Apache configuration file or something) and invoke a restart of the service. To do this, they’d need to:

  1. Intercept a message containing the key of the configuration file to inject their own configuration into.
  2. Intercept a message containing the key of the service that needs restarting after injecting their own configuration.
  3. Send the appropriately-crafted message through the message broker which is secured with a username and password.

We use an internal firewall here to segregate our AS/400 from the rest of the LAN for PCI compliance reasons. I suspect we could do the same thing for our cloud so that a malicious user would first have to gain access to a machine behind the internal firewall before they could contact the RabbitMQ server to even exploit these services.

I honestly don’t know what security best-practices are going to look like for hybrid/virtual private clouds. Thankfully, I’m not the only one asking these questions. As the cloud environment commands a greater share in the market, I’m sure people who are much smarter than I am can offer more practical solutions than a simple firewall and username/password security plan.

Written by J. Brisbin

April 19, 2010 at 3:16 pm

Posted in The Virtual Cloud

Publish Tomcat/tcServer lifecycle events into the cloud with RabbitMQ

One of the common tasks in any cloud environment is to manage membership lists. In the case of a cloud of Tomcat or SpringSource tcServer instances, I wrote a simple JMX MBean class that exposes my tcServer instances to RabbitMQ and serves two functions:

  1. Expose the calling of internal JMX methods to management tools that send messages using RabbitMQ.
  2. Expose the Catalina lifecycle events to the entire cloud.

To maintain a membership list of tcServer instances, I now just have to listen to the events exchange and respond to the lifecycle events I’m interested in:

def members = []
mq.exchange(name: "vcloud.events", type: "topic") {
  // Anonymous queue bound with "#" so it receives every lifecycle event
  queue(name: null, routingKey: "#") {
    consume onmessage: {msg ->
      def key = msg.envelope.routingKey
      def msgBody = msg.bodyAsString
      // The routing key appears to be "<event>.<source>"; strip the event prefix
      def source = key.substring(msgBody.length() + 1)
      println "Received ${msgBody} event from ${source}"
      if ( msgBody == "start" ) {
        members << source
      } else if ( msgBody == "stop" ) {
        members.remove(source)
      }
      // Print the membership list after each event, as in the console output
      println "members=${members}"

      return true
    }
  }
}

Starting and stopping the tcServer instance yields this in the console:

Received init event from instance.id
members=[]
Received before_start event from instance.id
members=[]
Received start event from instance.id
members=[instance.id]
Received after_start event from instance.id
members=[instance.id]
Received before_stop event from instance.id
members=[instance.id]
Received stop event from instance.id
members=[]
Received after_stop event from instance.id
members=[]
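
For anyone consuming these events from plain Java instead of Groovy, the membership bookkeeping might be sketched as below. It assumes the routing key has the form "<event>.<instance id>" with the event name repeated in the message body, which is inferred from the listener above and not confirmed anywhere else; the class name MembershipTracker is hypothetical and the RabbitMQ consumer wiring is left out.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical tracker; feed it the routing key and body of each lifecycle message.
final class MembershipTracker {
  private final Set<String> members = new LinkedHashSet<>();

  // routingKey is assumed to look like "<event>.<instance id>"; body is the event name.
  void onEvent(String routingKey, String body) {
    if (!routingKey.startsWith(body + ".")) {
      return; // ignore messages that do not match the assumed format
    }
    String source = routingKey.substring(body.length() + 1);
    if ("start".equals(body)) {
      members.add(source);
    } else if ("stop".equals(body)) {
      members.remove(source);
    }
  }

  // Read-only view of the current members, in the order they started.
  Set<String> members() {
    return Collections.unmodifiableSet(members);
  }

  public static void main(String[] args) {
    MembershipTracker tracker = new MembershipTracker();
    tracker.onEvent("before_start.instance.id", "before_start");
    tracker.onEvent("start.instance.id", "start");
    System.out.println("members=" + tracker.members()); // members=[instance.id]
    tracker.onEvent("stop.instance.id", "stop");
    System.out.println("members=" + tracker.members()); // members=[]
  }
}
```

Using a set instead of a list means a repeated start event from the same instance does not produce a duplicate entry.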

It seems to me one of the defining characteristics of cloud computing versus traditional clusters is the transparency between runtimes and what used to be separate servers. To that end, I’ve exposed the inner workings of my tcServers both to other servers of their kind in the cloud, and to sundry management and monitoring tools I may choose to write in the future.

If you’re concerned with security, opening up the JMX MBeans of your server may give you pause. Fair enough. In my case, that’s not as big of a concern because these servers are protected from the outside world. Only LAN and WAN users can access these servers, so I don’t mind exposing JMX methods to trivially-secured message brokers, particularly if it gives me this kind of inexpensive and direct control over the services I’m exposing to the cloud.

Written by J. Brisbin

April 15, 2010 at 9:32 pm