'joins' vs. 'includes' in Arel

This week at Flatiron, we began a short unit on Arel (“A Relational Algebra”). Arel is an SQL AST (“Abstract Syntax Tree”) manager, meaning that it provides us a set of tools and methods to more easily generate complex SQL queries, to more easily sort and access data from our databases.

Arel provides many helper methods, with “where”, “order”, “joins”, “group”, “having”, “pluck”, and “includes” being the ones we found ourselves using. These methods all accept as arguments either a hash, in which the name of the attribute in question is keyed to the desired value, such as:

where(:name => "Catamaran")

or as a string, which must be formatted as if it were pure SQL , such as:

where("length < 20")

This post will focus on and explicate two similar, but subtly distinct, methods – “joins” and “includes” – used for joining multiple tables in order to run cross-table queries.

In brief, “joins” will join the two tables, but only preserve the data specified by the query. “includes”, on the other hand, will generate a more complex SQL query and pull every column from the tables being joined, and preserve them all in the resulting object. The advantage to “includes”, then, is that if you find yourself wanting to do additional queries on the result of your initial query, there is no need to re-open the database to pull more data – the information you want is already there.

But what do I mean, when I say that “the information is already there”? When you run an Arel query, in most cases, the result will not be an ActiveRecord object, but something called an “ActiveRecord::Relation”. This object represents the result of the query to the database, and can be filtered further as-is. So while “joins” and “includes” both return instances of “ActiveRecord::Relation”, they differ in the amount of data in the objects they return. Let’s have an example.

Below is some code from a pry session I ran on a recent nautically-themed Arel lab, which will illustrate this distinction. I entered this pry session within a “Boat” class method, and so the following methods are being called on the Boat class. Regarding the schema, the “boats” table is associated with the “classifications” table by way of a mutual “boat_classifications” table. Also, I turned on the logger. Onwards.

With “joins”, saving the value as “j”:

Now the same query, made with “includes”, saving the value as “i”:

Note first how both queries return objects of the same class: “ActiveRecord::Relation::ActiveRecord_Relation_Boat”, due to being the results of Arel queries run on the Boat class. Note also how the return values of both queries appear to be the same. This is deceptive. Look at the actual SQL queries generated by the Arel. See how they differ? “joins” created a query which selected “boats.*” after performing an inner join on the three tables involved. “includes”, on the other hand, selected every column individually from both “end” tables (excluding “boat_classifications”), after performing a left outer join.

What are the implications? Let’s attempt some gentle acrobatics and try printing out the classifications for every boat. First, using the result of the “joins”:

Notice how a SQL query is run for every boat? The “joins” command only kept the information for the boats, so our program has to re-query the database now in order to access the classifications information.

Now, let’s try the same command, except being called on the results of our “includes” query.

Remarkable. This call has no need to re-query the database, since the “includes” command already pulled and stored this information. But wait. There’s a bit of weirdness here, and something’s not right. If you look closely, you’ll note that the call to “j” prints 13 classifications, our call to “i” prints only 6. If you look a bit closer, you’ll realize that the objects being printed by our call to “j” are all unique, while the objects being printed by our call to “i” are all the same. This is definitely not what I expected.

Sorcery? Perhaps. Let’s delve. Calculemus!

Initial investigation suggests that the results, at least on the boat side, are the same for both methods. Let’s keep going.

The plot starts to thicken as the fog starts to clear. It seems here that sending “.first.classifications” to i and j returns different instances of the same class of object, “ActiveRecord::Associations::CollectionProxy::ActiveRecord_Associations_CollectionProxy_Classification”. Note that our initial query (defining j and i) returned an object of the class “ActiveRecord::Relation”. It seems that accessing data across tables through Arel results in a change in class, from “Relation” to “Associations::CollectionProxy”. As an aside, if I’m remembering the Arel lecture correctly, instances of “Associations::CollectionProxy” have a unique property of being able to receive messages defined as class methods for the class which they represent (in this case, “Classification”). This behavior is somewhat beyond the scope of this blog post, however. Let’s return to the mysteries of “joins” vs. “includes”.

A ha! Drilling down to the level of actual Classification instances, we find that the “Sailboat” classification pulled by “joins” and “includes” is in fact the same object. Pop champagne.

So, what’s going on here? I’m not 100% sure, but my guess is that when we made our initial “includes” query, we stored in memory the Sailboat classification and the Sailboat classification only. If you go back and look at the SQL generated by our “includes” request, you’ll find this is true – we performed a left outer join, storing only the information from the “classifications” table that related to Sailboat. When we iterated over the boats to print their classification, we found that every boat had the same, identical classification – that one instance of Sailboat, queried when i was defined.

When we called “joins”, however, something very different happened. We performed an inner join across the three tables, but only to identify which boats satisfied our query requirements. Once we had established which boats were “sailboats”, we tossed the rest of the classification data. Then, when we iterated over the boats, the interpreter re-queried the database for each boat, returning all the classifications that it found. When “each” iterated over the boats, it had no idea that we had assembled the boats based on their sail-ness, and so pulled every classification associated with the boat. This is why all the classifications printed to the screen had different identity numbers – they were all the result of independent queries.

There’s even a bit of an easter egg here. We’ve made much of “includes” ability to store query results to avoid re-accessing the database. It seems, though, that “joins” isn’t quite as amnesiac as it’s eidetic sibling would want us to believe. Calling the “each” iteration over j a second time, we see, surprisingly, no calls to the database.

Q.E.D.

Much thanks to @GeorgeMayer for equipping me for this little adventure.

For more information about includes and Rails’ “Eager Loading”, read on.

Full code below:

https://gist.github.com/kronosapiens/9707658

Comments