Avoiding Repetition & Torture

Apparently Gacela does stuff people want. Which is cool.

So within 48 hours of launching Gacela’s love child Kacela officially to the Kohana community, I got the following tweet:

Do you think gacela could be a solution for #civicrm?

So I checked the link I was sent and realized that, you know, Gacela does in fact do all of this:

  • elimination of boilerplate code when accessing the db
  • being able to swap out databases from different vendors, such as MySQL, Oracle, DB2, etc. without changing code.
  • Simplified error/exception handling.
  • Simplified mapping result sets to objects/arrays.
  • Standard data mappings done automatically, such as mapping SQL dates to programming language dates.

And holy crap! I was amazed that someone had just boiled down some of the most important features in Gacela into that list. My primary motivators for building Gacela were 1) I had a hard time doing common tasks (like inheritance relationship mapping) in the ORMs I had used, and 2) I hate repetition. In fact, I consider repetition to be one of the most miserable, torturous things for a programmer to deal with. If I do something somewhere, at some point in time, I don’t EVER want to have to do it again. Since Gacela definitely reflects my bias, I feel fairly safe stating that, point for point, Gacela actually makes it possible to get all of these features:

  • About the only thing Gacela doesn’t do is configure itself automatically and create the Data Mapper and Model classes from your database.
  • This is why there is a Criteria object in the first place, and why we have DataSource adapters that can handle vendor-specific options. It’s also why PDO is the PHP library Gacela uses.
  • The DataSources themselves are pretty bare bones when it comes to running queries. Everything is passed to the PDO instance as quickly as possible.
  • Anytime a findAll() query is issued, it returns a Gacela\Collection object that provides all of the standard Array-type access as well as a few other goodies.
  • This last item deserves its own section in the documentation as well as a full post, but suffice it to say that a DataSource has Resources (tables) which contain Fields. One of the primary purposes of the Field objects is to translate the value of the field between the database and the application. For example, a database bool (0,1) becomes a PHP bool (false, true) and vice versa automatically.
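
To make that last point concrete, here’s a rough sketch of the idea (illustrative only; this is not Gacela’s actual Field API): the field’s type drives the conversion in both directions between the database representation and the PHP representation.

```php
<?php
// Minimal sketch of a field translator. The names and structure here are
// hypothetical -- Gacela's real Field classes are more involved.
class FieldTranslator
{
    // Database value -> PHP value
    public static function toPhp($type, $value)
    {
        switch ($type) {
            case 'bool':
                return (bool) $value;          // '0'/'1' -> false/true
            case 'int':
                return (int) $value;
            case 'date':
                return new DateTime($value);   // 'YYYY-MM-DD' -> DateTime
            default:
                return $value;
        }
    }

    // PHP value -> database value
    public static function toDb($type, $value)
    {
        switch ($type) {
            case 'bool':
                return $value ? 1 : 0;
            case 'date':
                return $value->format('Y-m-d');
            default:
                return $value;
        }
    }
}
```

Once every value passes through a layer like this on its way in and out, the bool-to-bool and date-to-date mapping happens automatically everywhere.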

My favorite statement from this post was:

The 20% that needed a full ORM were teams that had the luxury of starting fresh, where they allowed the tool to generate the db tables and all needed SQL based on the objects in their application.

One of the things I love most about Gacela is its awesome ability to support existing databases. You don’t have to completely restructure the database to make it work because Gacela inspects the database structure as it is and infers as much as it can from there.
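
On MySQL, that kind of inspection mostly boils down to reading `information_schema`. The query below is standard MySQL; the mapping function is a hypothetical helper showing how fetched foreign-key rows could become a relationship map, not Gacela’s actual code:

```php
<?php
// Foreign keys can be discovered from MySQL's information_schema; an ORM can
// use rows like these to infer relationships instead of making you declare them.
const RELATION_SQL = "
    SELECT TABLE_NAME, COLUMN_NAME, REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
    FROM information_schema.KEY_COLUMN_USAGE
    WHERE TABLE_SCHEMA = :schema
      AND REFERENCED_TABLE_NAME IS NOT NULL";

// Hypothetical helper: turn the fetched rows into a relationship map,
// e.g. 'orders.customer_id' => 'customers.id'
function mapRelations(array $rows)
{
    $relations = array();
    foreach ($rows as $row) {
        $key = $row['TABLE_NAME'] . '.' . $row['COLUMN_NAME'];
        $relations[$key] = $row['REFERENCED_TABLE_NAME'] . '.' . $row['REFERENCED_COLUMN_NAME'];
    }
    return $relations;
}
```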

Posted in Data Mapper, Gacela | Tagged , , , | Leave a comment

Caching Data for Scalability and Performance

Caching is Easy

In the PHP world, when people talk about scalability, caching is inevitably mentioned as a means to increasing the performance of a site. Of course, setting up and using something like memcache is pretty trivial. If anything qualifies as the hardest part, it’s figuring out how to install memcached along with the PHP memcache module. Using memcache itself is about this hard:

$memcache = new Memcache;
$memcache->connect('localhost', 11211);
 
$memcache->set('key', 'foo');
 
echo $memcache->get('key');

Which is awesome if you just want to store results from a news feed or implement a session handler that doesn’t use the database.
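
A session handler along those lines really is little more than a thin wrapper. Here’s a sketch that delegates to any cache object exposing get/set/delete (the real Memcache class fits this shape, and so does the tiny in-memory stand-in below); the class names here are illustrative, not from any particular library.

```php
<?php
// Tiny in-memory stand-in for memcache, just so the sketch is self-contained.
class ArrayCache
{
    private $_data = array();

    public function get($key)
    {
        return isset($this->_data[$key]) ? $this->_data[$key] : false;
    }

    public function set($key, $value, $ttl = 0)
    {
        $this->_data[$key] = $value;
        return true;
    }

    public function delete($key)
    {
        unset($this->_data[$key]);
        return true;
    }
}

// Sketch of a session handler that stores session data in a cache
// instead of the database.
class CacheSessionHandler
{
    private $_cache;
    private $_ttl;

    public function __construct($cache, $ttl = 1440)
    {
        $this->_cache = $cache;
        $this->_ttl = $ttl;
    }

    public function read($id)
    {
        $data = $this->_cache->get('sess_' . $id);
        return $data === false ? '' : $data;
    }

    public function write($id, $data)
    {
        return $this->_cache->set('sess_' . $id, $data, $this->_ttl);
    }

    public function destroy($id)
    {
        return $this->_cache->delete('sess_' . $id);
    }
}
```

Register the read/write/destroy methods with session_set_save_handler() and your sessions live in memcache instead of a database table.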

Caching Query Results is not Easy

Caching database results on the other hand, presents a number of difficulties:

  1. For one thing, many ORM implementations store the database resource, a non-serializable result set, or both in the actual ORM objects. Since caching requires that all objects be serializable, this puts a serious roadblock in our way.
  2. How do you track one set of data versus another in the cache?
  3. How do you notify the cache that a particular data set has changed?
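
To see why #1 bites, consider what happens when the connection handle lives inside the mapped object: serializing the object for the cache blows up on the handle. The standard fix is to exclude the handle on serialize and reattach it after the object comes back out of the cache. A generic sketch (not Gacela code):

```php
<?php
// Sketch: a record object that holds a connection handle. __sleep() keeps the
// handle out of the serialized form so the object can live in memcache;
// the handle is reattached after the object is pulled back out.
class Record
{
    public $data = array();
    private $_conn; // e.g. a PDO instance -- not serializable

    public function __construct($conn, array $data)
    {
        $this->_conn = $conn;
        $this->data = $data;
    }

    public function __sleep()
    {
        return array('data'); // serialize the data, never the handle
    }

    public function setConnection($conn)
    {
        $this->_conn = $conn;
    }
}
```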

Which is why, when people talk about using memcache with data sets, you generally see examples like this:

<?php
$memcache = new Memcache;
$memcache->connect('127.0.0.1', 11211) or die("Could not connect");
 
include('./db.php');
 
$key = md5("SELECT * FROM memc WHERE FirstName='Memory'");
$get_result = $memcache->get($key);
 
if ($get_result) {
        echo "<pre>\n";
        echo "FirstName: " . $get_result['FirstName'] . "\n";
        echo "LastName: " . $get_result['LastName'] . "\n";
        echo "Age: " . $get_result['Age'] . "\n";
        echo "Retrieved From Cache\n";
        echo "</pre>";
 
} else {
 
        // Run the query, display the data from the database, then cache it
        $query = "SELECT * FROM memc WHERE FirstName='Memory'";
        $result = mysql_query($query);
 
        $row = mysql_fetch_array($result);
        echo "<pre>\n";
        echo "FirstName: " . $row['FirstName'] . "\n";
        echo "LastName: " . $row['LastName'] . "\n";
        echo "Age: " . $row['Age'] . "\n";
        echo "Retrieved from the Database\n";
        echo "</pre>";
 
        $memcache->set($key, $row, MEMCACHE_COMPRESSED, 20); // Store the result of the query for 20 seconds
 
        mysql_free_result($result);
}

The query is set up, the cache is checked, and if it’s empty for that query, the query is run against the database and the result is stored. Of course, almost all examples show this being done in a context that would typically mean putting this process in your domain layer everywhere a query is run. Not only is this cumbersome, but most examples ignore everything aside from #1 in the list above.

Actually, Caching Query Results isn’t Hard

Where I currently work, our director has had a habit of mentioning the need for memcache and how we’d eventually have to ditch using an ORM altogether in order to implement it, and something about that sentiment has never sat quite right with me. It seems like if you can implement an external check on queries and cache them, then couldn’t an ORM library implement an internal caching mechanism to reduce database load? Since I was in the middle of developing the initial beta version of Gacela at the time, I decided to step in and see how hard it would be to build caching into its foundation. The truth of the matter is, it wasn’t hard. I just had to rethink the general architecture of the library so that caching could be supported from the ground up.

Dealing with #1 wasn’t that hard, since Gacela already contained DataSources that were separate from the rest of the objects. There were a couple of other objects that also used the database handle, and removing their dependence on it, making the DataSources the only place the database handle is used, is something I see as an additional plus. In my opinion, one place where ORM libraries that use the database schema to discover information fail is that they frequently have to pull the same schema information several times during a single request. So a planned feature of Gacela was storing all loaded mappers and resources in a repository. Implementing cache support means that mappers and resources (which generally only change when there is an update to the application) never HAVE to be loaded more than once.
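
That repository idea is really just memoization keyed by resource name: the expensive schema query runs once, and everything after that is a lookup. A stripped-down sketch (names are illustrative, not Gacela’s internals):

```php
<?php
// Sketch of a resource repository: the expensive loader (a schema query in
// the real library) runs at most once per resource name.
class ResourceRepository
{
    private $_loaded = array();
    private $_loader;
    public $loads = 0; // exposed only to illustrate how often the loader runs

    public function __construct($loader)
    {
        $this->_loader = $loader;
    }

    public function get($name)
    {
        if (!isset($this->_loaded[$name])) {
            $this->loads++;
            $this->_loaded[$name] = call_user_func($this->_loader, $name);
        }
        return $this->_loaded[$name];
    }
}
```

Drop the `$_loaded` array into memcache instead of instance state and the schema survives across requests, not just within one.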

Dealing with #2 seems straightforward if you are going to cache each record from each resource individually, but what about when you want to pull back a collection of records for a parent resource? Do you perform an initial query to pull back the ids of the related records and then only hit the database for individual records that aren’t already in the cache?

I ultimately decided to cache data based first on the resource associated with the Mapper being queried and second on the actual query itself. So DataSource::query() came to look something like this:

public function query(\Gacela\DataSource\Resource $resource, $query, $args = null)
{
	if($query instanceof Query)  {
		// Using the _lastQuery variable so that we can see the query when debugging
		list($this->_lastQuery['query'], $this->_lastQuery['args']) = $query->assemble();
	} else {
		$this->_lastQuery = array('query' => $query, 'args' => $args);
	}
 
	$key = hash('whirlpool', serialize(array($this->_lastQuery['query'], $this->_lastQuery['args'])));
 
	$cached = $this->_cache($resource->getName(), $key);
 
	// If the query is cached, return the cached data
	if($cached !== false) {
		return $cached;
	}
 
	$stmt = $this->_conn->prepare($this->_lastQuery['query']);
 
	if($stmt->execute($this->_lastQuery['args']) === true) {
		$return = $stmt->fetchAll(\PDO::FETCH_OBJ);
		$this->_cache($resource->getName(), $key, $return);
		return $return;
	}

	// ... error handling elided ...
}

Once we were caching unique sets of data based on specific queries for specific resources, fixing #3 became easy after I stumbled upon this suggestion from the guys who develop memcached. Since every DataSource supports exactly three methods for changing the state of data, it wasn’t hard to add the following into the insert(), update(), delete() methods:

if($query->execute($binds)) {
 
	// Increment cache version for the resource being updated, inserted, or deleted	
	$this->_incrementCache($name);
 
}

And then to create the _incrementCache() method:

protected function _incrementCache($name)
{
	// Get the Gacela instance which holds the Memcache instance
	$instance = $this->_singleton();
 
	// Bypass if memcache is not enabled
	if(!$instance->cacheEnabled()) {
		return;
	}
 
	$cached = $instance->cache($name.'_version');
 
	// If there isn't a cache version already, then no need to increment it
	if($cached === false) {
		return;
	}
 
	// Increment the cache version
	$instance->incrementCache($name.'_version');
}
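
Putting the two halves together: the read side folds the resource’s version number into every cache key, so incrementing the version effectively invalidates every cached query for that resource at once, without deleting anything. Here’s a self-contained sketch of the scheme (array-backed for illustration; memcache works the same way, and the method names are mine, not Gacela’s):

```php
<?php
// Sketch of version-namespaced caching: each resource has a version counter,
// and the version is part of every query's cache key. Bumping the counter on
// insert/update/delete orphans all old entries for that resource at once.
class VersionedCache
{
    private $_store = array(); // stand-in for memcache

    public function get($key)
    {
        return isset($this->_store[$key]) ? $this->_store[$key] : false;
    }

    public function set($key, $value)
    {
        $this->_store[$key] = $value;
    }

    private function _version($resource)
    {
        $v = $this->get($resource . '_version');
        return $v === false ? 1 : $v;
    }

    public function getQuery($resource, $queryHash)
    {
        return $this->get($resource . '_' . $this->_version($resource) . '_' . $queryHash);
    }

    public function setQuery($resource, $queryHash, $rows)
    {
        $this->set($resource . '_' . $this->_version($resource) . '_' . $queryHash, $rows);
    }

    public function increment($resource)
    {
        $this->set($resource . '_version', $this->_version($resource) + 1);
    }
}
```

The stale entries never have to be found and deleted; they simply become unreachable and age out of memcache on their own.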

When all is said and done, a caching solution is very important for scaling in a LAMP environment, and given a little time and effort, caching database results doesn’t have to be rocket science. It can be rolled into an ORM library so that your code isn’t littered with manual checks against the cache.

Posted in Gacela | Tagged , , , | 1 Comment

Reinventing the Wheel (Part 1)

Some people, when confronted with a project, think “I know, I’ll use a framework.” Now they have two problems.

I love frameworks. I still shudder when I remember the long days I spent working in spaghetti code when I first started programming. It was shortly into my first experiences with PHP that I was introduced to Zend Framework and the idea of an MVC design pattern that would keep my business logic, data access, and HTML all in separate files. I seriously thought I’d found the best thing since sliced bread.

It was while studying the MVC design pattern that I stumbled upon Rasmus Lerdorf’s post about “No Framework MVC” and realized that it was possible to create a well-formed architecture without using a third-party framework. This naturally led to several projects I built in a bare-bones style without an existing framework. Let me just say that it was a great learning experience. In the end, we switched those projects to Zend Framework because it brought a lot of needed tools to the table, as well as a less buggy implementation of the core MVC framework.

When I came to my current position with Lendio, they were already using Kohana 2, so I learned a new framework, and to be honest I’ve been a pretty hard sell. I follow ZF bloggers and I prefer ZF’s conventions and general style, but I’ve come to appreciate K2 for the lightweight rapid application development tool that it is. K2 has been great: it allowed Lendio to roll out an awesome product on an unheard-of timeline, and it’s been stable since our launch, so no complaints here.

But as awesome as I’ve found both Zend Framework and Kohana to be, my projects have been far from painless. What I’ve found in each of my projects, whether with my own framework or a third-party one, is that at some point someone made a choice (or an assumption) in the framework, and that choice cost me development time, headaches, and in a few instances never-ending frustration.

One of the first problems I stumbled upon was using the Zend_Db Row Data Gateway implementation in a way that allowed one class (or table) to inherit from another class (table). One only has to look at the series of questions I’ve posted to Stack Overflow about the topic to see that it isn’t a great solution to a lot of these issues. In the end, I had to roll a custom data mapper that could support inheritance between database tables as well as other, more complex relationships between data.

When I first started working with K2’s ORM and Database Querybuilder, I kept running into an annoying situation: I’d start building a query for one thing, then need to run another query to get some other piece of data, pass that piece of data into the first query, and then run the first query. But every time I set this up, one or both queries would fail. It wasn’t until I did some digging that we found out the Querybuilder object actually wrote the query information into the Database driver, and that information was flushed every time a query was run. There was no way to build a query in K2 and store it for later use. At first we just made sure that queries were built and run in the proper sequence. Eventually it occurred to me that we could build a class that contains all of the necessary arguments for running a query in K2’s Database Querybuilder, but allows us to build a query and hang on to it until we’re ready to use it.
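
The workaround amounts to a small value object that accumulates the pieces of a query without touching the database driver until you ask for them. Something along these lines (a sketch of the idea, not our actual K2 class):

```php
<?php
// Sketch of a deferred query: it accumulates table and WHERE arguments and
// only produces SQL + bind values when assemble() is called, so two queries
// can be built up side by side without clobbering each other in the driver.
class DeferredQuery
{
    private $_table;
    private $_wheres = array();
    private $_binds = array();

    public function from($table)
    {
        $this->_table = $table;
        return $this;
    }

    public function where($column, $value)
    {
        $this->_wheres[] = $column . ' = ?';
        $this->_binds[] = $value;
        return $this;
    }

    public function assemble()
    {
        $sql = 'SELECT * FROM ' . $this->_table;
        if ($this->_wheres) {
            $sql .= ' WHERE ' . implode(' AND ', $this->_wheres);
        }
        return array($sql, $this->_binds);
    }
}
```

Because nothing is written into the driver until assemble(), you can build query A, run query B, feed B’s result into A, and only then execute A.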

There are more times than I care to count when I’ve had to give up on a framework tool because it just didn’t get the job done. Has anyone else tried to use Zend_Date::isValid()? As far as I know, it still takes any string, forces it into a valid-looking date, and thus always returns true. Or in the case of Zend_Db, there was a time when you could use named parameters with the PDO database adapter, but they removed that when they abstracted variable escaping into the framework and away from the native database driver. In K2’s case, have you ever tried to build an ACL or navigation tree based on controllers and actions? Maybe it’s my ZF background, but I figured neither should be that difficult to implement, either as ports from ZF or natively. However, since Kohana supports nested controllers, you couldn’t just work with a simple module/controller/action path. Implementing both the ACL and the navigation tree in a more or less infinitely nested architecture required tons of development time from our team. We’ve been live with our new site for several months now, and just recently a method call on a K2 helper failed miserably in our live environment. Once again we spent hours debugging through framework code, only to find that our environment didn’t like the way the helper did things, so we had to ‘massage’ the framework into doing it a different way.

I know, you’re either annoyed because you get my point or because you’re wondering when I’ll make it. Basically, I think Terry Chay is on the money when he says that frameworks suck. And why do they suck? As Terry Chay put it: “It is not so much an assumption as a fact: when you develop software, it is about making choices. It is about tradeoffs. You can do ‘A’ but you can’t do ‘B.’”

He later says:

And why I mention this is because I want to mention that whenever I rail on Rails or rip on your framework, every criticism is coming from this concept of There are consequences. When you use a framework, you make a choice. When you adopt a framework, you adopt all the choices that that framework designer has adopted—so you have to be very careful in which framework you are going to use.

So frameworks suck because when you choose a framework, you’re locking yourself into a lot of choices that have been made for you. Even worse, you’re locking yourself into choices you don’t even know were made, at least not until they bite you in the rear end. And that’s not even the worst of it; again, I’ll appeal to Terry Chay:

It’s probably because framework code is the antithesis of the design principles that went behind PHP. Frameworks try to do too much; PHP tries to pass the buck. Frameworks are complete; PHP is a component (the ‘P’ in LAMP). Frameworks are standard; almost nothing in PHP is standard, that’s why you need a website to document its quirks. Frameworks surround; PHP sits inside. Frameworks are complicated; PHP is simple.

As with Zend_Db and escaping values for queries, most frameworks end up redoing things that were already done by something that did the job really well, all in the name of proper code reuse, or keeping abstractions intact, or some such nonsense. Maybe that works in environments and languages where the cheapest and fastest solution is code-related. PHP, however, is a single component of an entire ecosystem, and ultimately PHP programmers MUST be good at identifying the easiest, cheapest, and quickest way to deal with any given problem. Frameworks, in their completeness, have a tendency to reduce our ability to do that.

But, if frameworks suck so bad, why are we here on a blog for another PHP library / framework / thing? And why is everyone, including me, still using frameworks in PHP? And is it possible to have our cake and eat it too? Please stay tuned for our next article in this series about reinventing the wheel.

Posted in Gacela | Tagged , , | Leave a comment

Why another ORM?

I love reusing code as much as the next fellow, and I have used a number of existing ORM solutions extensively. There’s the great and perfectly valid Zend_Db component from Zend Framework, the ORM from Kohana, or even Doctrine for those who want something stand-alone and robust. So, if all these other solutions are out there, why did I go and build my own?

I guess my ultimate inspiration came from Jeff Atwood. The beginning of this story goes into the history of Gacela’s development (another topic for another day), because Gacela certainly isn’t my first time at this particular rodeo and I’ll be surprised if it’s my last. The one nice thing about doing it yourself is that you get to learn about all the common mistakes (faulty assumptions, stupid assumptions) that eventually box in the design, and then you get to start the process all over again. Suffice it to say that between using other people’s ORMs and my own, I found a number of hurdles in the existing solutions that I wanted to get past:

  • Manually building out information, such as relationships between entities, that my database already knew about. Why couldn’t the ORM just figure it out from there?
  • Table inheritance issues, such as how to perform a find() operation and have it return two different models based on a role or type.
  • Separation of data access from business logic. Simply put, most ORMs use the Active Record pattern to combine data access with business logic in the same object. I want to use the more robust Data Mapper pattern in my applications.
  • Personally, I don’t believe that doc blocks should be used for programmatic purposes.
  • My boss kept suggesting that at some point we’d have to ditch the ORM and go to straight queries with hard-coded checks against a cache (memcache, for instance). I kept thinking: there has to be a way to combine the versatility of an object-oriented framework for data mapping with the scalability of caching.
  • In my opinion, domain-level code (controllers, models) should not contain database queries; those should all live at the data mapper level.
  • I think that in the long run, the best thing for the PHP community as a whole will not be “the one framework to rule them all” but rather a number of small, highly specific frameworks that integrate well with each other, letting framework developers focus on the features within their highly specific framework rather than building just another generic, do-everything framework.
  • The default functionality (by convention) should be easy to use with relatively little setup. But I should be able to completely bypass the defaults with ease and go bare metal if need be.
  • In Patterns of Enterprise Application Architecture, Fowler suggested that Data Mappers should be able to map data from any data source: XML, RESTful web services, a database, or anything else that might come our way. I have yet to find a solution that stepped into the dubious territory of supporting a non-database-backed ORM.

To this end, I started Gacela from the ground up to meet these lofty requirements:

  • Whenever possible, the DataSource pulls in relationship and field metadata so that you don’t have to hand-code that information
  • Data Mappers support Concrete Inheritance, Association relationships, and Dependent Relationships out of the box.
  • Mappers contain all data access logic. Models contain only business logic.
  • Uh, yeah. Doc blocks just explain what’s going on.
  • Mappers, Models, and Resources are all fully serializable, because only the DataSources and the core Gacela class contain non-serializable resources. As a result, it was a cinch to implement storing these items in a cache like memcache. Models and Collections of Models only have to be reloaded from the DataSource as a result of insert, update, or delete operations.
  • If you want to find all entities by a shared piece of data, you can build a Criteria object. But if you want to build a custom query to pull information from the DataSource, you’ll have to implement a custom find method in the Mapper. This means that you could theoretically switch out your DataSource with minimal impact on the domain layer (Models, Controllers) because they don’t contain any logic that ties to a particular data structure or DataSource.
  • Gacela is completely stand-alone, and its required components come with a stock PHP 5.3 LAMP server (except memcache, of course).
  • If you look through the documentation, you should see how simple it is to setup a Mapper and Model based on a database table just like you would with standard ORM’s. You should also see how the Mapper can pick up standard relationships and plug them in for you. Lastly, you should notice how quickly you can override the defaults to work things however you need.
  • Even though only MySQL is currently implemented, the entire framework is designed to separate the Models and Mappers from a specific type of DataSource. I fully anticipate implementing MSSQL and REST support very soon.

  • Taking this idea even further, Mappers and Resources don’t change from request to request because they represent objects that are more or less static. It’s not like you’re going to change the structure of your DataSources mid-request. Therefore, the only time your application has to reload these objects into memory is when you do an application update that changes a Mapper or Resource.

So, why did I decide to build just another ORM? I guess it’s because I agree with Jeff:

Indeed. If anything, “Don’t Reinvent The Wheel” should be used as a call to arms for deeply educating yourself about all the existing solutions — not as a bludgeoning tool to undermine those who legitimately want to build something better or improve on what’s already out there. In my experience, sadly, it’s much more the latter than the former. So, no, you shouldn’t reinvent the wheel. Unless you plan on learning more about wheels, that is.

Posted in Gacela | Tagged , | Leave a comment