Skip to content

Modify Django To Support Group By Query Function

by Paul Kenjora on September 9th, 2007

Group by is used extensively in many day to day SQL queries. Although most often it is used for counting sometimes group by comes in handy for creating an artificial unique constraint. For example say you have a list of URLS that are described by a checksum and you wish to query a list that includes only unique pages. Even though the URLS are unique the checksums may not be, you want to throw away duplicate checksums. The group by keyword accomplishes this fairly quickly while allowing the database to maintain duplicates for other purposes.

The Basics

Pages.objects.all() produces (in essence) SELECT * FROM pages;

All Pages

  1. 23, http://blog.awarelabs.com/?p=18
  2. 78, http://blog.awarelabs.com/?p=19
  3. 23, http://blog.awarelabs.com/?p=18&preview=true
  4. 23, http://blog.awarelabs.com/archive/?p=18

The last 3 pages have the same checksum. We consider them duplicates and do not wish to query them.

Pages.objects.all().group_by(‘checksum’) produces (in essence) SELECT * FROM pages GROUP BY checksum;

Unique Pages

  1. 23, http://blog.awarelabs.com/?p=18
  2. 78, http://blog.awarelabs.com/?p=19

The query returned only unique checksums. The 1st occurrence of each unique checksum page was returned.

Modify Django To Perform Group By

To get things to work right I modified about 15 repeating lines of code and made sure the change was transparent to existing logic. The theory is to insert GROUP BY with a list of fields just like the order_by function. I basically copied that and changed it to group_by. All changes are only within "django/db/models/query.py"

First add a new group_by member to QuerySet.

class QuerySet


class QuerySet(object):
"Represents a lazy database lookup for a set of objects"
def __init__(self, model=None):
self.model = model
self._filters = Q()
self._order_by = None # Ordering, e.g. ('date', '-name'). If None, use model's ordering.
self._select_related = False # Whether to fill cache for related objects.
self._distinct = False # Whether the query should use SELECT DISTINCT.
self._select = {} # Dictionary of attname -> SQL.
self._where = [] # List of extra WHERE clauses to use.
self._params = [] # List of params to use for extra WHERE clauses.
self._tables = [] # List of extra tables to use.
self._offset = None # OFFSET clause.
self._limit = None # LIMIT clause.
self._result_cache = None
self._group_by = []

Add the new group_by method.

def group_by


def group_by(self, *field_names):
"Returns a new QuerySet instance with the grouping changed."
assert len(self._group_by) == 0, \
"Cannot regroup a query once a group has been taken."
return self._clone(_group_by=field_names)

Add support for copying queries.

def _clone


def _clone(self, klass=None, **kwargs):
if klass is None:
klass = self.__class__
c = klass()
c.model = self.model
c._filters = self._filters
c._order_by = self._order_by
c._select_related = self._select_related
c._distinct = self._distinct
c._select = self._select.copy()
c._where = self._where[:]
c._params = self._params[:]
c._tables = self._tables[:]
c._offset = self._offset
c._limit = self._limit
c._group_by = self._group_by
c.__dict__.update(kwargs)
return c

Ensure that group_by does not interfere with combining queries. This could be smarter but it seems to cover the error cases.

def _combine


def _combine(self, other):
assert self._limit is None and self._offset is None \
and other._limit is None and other._offset is None, \
"Cannot combine queries once a slice has been taken."
assert self._distinct == other._distinct, \
"Cannot combine a unique query with a non-unique query"
assert len(self._group_by) == 0 and len(other._group_by) == 0, \
"Cannot combine queries once a group has been taken."

Add the code to construct the raw SQL query string.

def _get_sql_clause


def _get_sql_clause(self):
opts = self.model._meta

# Construct the fundamental parts of the query: SELECT X FROM Y WHERE Z.
select = ["%s.%s" % (backend.quote_name(opts.db_table), backend.quote_name(f.column)) for f in opts.fields]
tables = [quote_only_if_word(t) for t in self._tables]
joins = SortedDict()
where = self._where[:]
params = self._params[:]
group_by = self._group_by[:]

# Convert self._filters into SQL.
joins2, where2, params2 = self._filters.get_sql(opts)
joins.update(joins2)
where.extend(where2)
params.extend(params2)

# Add additional tables and WHERE clauses based on select_related.
if self._select_related:
fill_table_cache(opts, select, tables, where, opts.db_table, [opts.db_table])

# Add any additional SELECTs.
if self._select:
select.extend(['(%s) AS %s' % (quote_only_if_word(s[1]), backend.quote_name(s[0])) for s in self._select.items()])

# Start composing the body of the SQL statement.
sql = [" FROM", backend.quote_name(opts.db_table)]

# Compose the join dictionary into SQL describing the joins.
if joins:
sql.append(" ".join(["%s %s AS %s ON %s" % (join_type, table, alias, condition)
for (alias, (table, join_type, condition)) in joins.items()]))

# Compose the tables clause into SQL.
if tables:
sql.append(", " + ", ".join(tables))

# Compose the where clause into SQL.
if where:
sql.append(where and "WHERE " + " AND ".join(where))

# Compose the where clause into SQL.
if group_by:
sql.append("GROUP BY " + ", ".join(group_by))

# ORDER BY clause

Handle date query special case, GROUP BY is already used and we do not want to cause errors, for now just ignore GROUP BY (Yes its bad, will fix when needed!).

class DateQuerySet


class DateQuerySet(QuerySet):
def iterator(self):
from django.db.backends.util import typecast_timestamp
self._order_by = () # Clear this because it'll mess things up otherwise.
self._group_by = [] # Clear this because it'll mess things up otherwise.
if self._field.null:
self._where.append('%s.%s IS NOT NULL' % \
(backend.quote_name(self.model._meta.db_table), backend.quote_name(self._field.column)))

Gets do not have a GROUP BY since they return only one object.

def get


def get(self, *args, **kwargs):
"Performs the SELECT and returns a single object matching the given keyword arguments."
clone = self.filter(*args, **kwargs)
# clean up SQL by removing unneeded ORDER BY
if not clone._order_by:
clone._order_by = ()
# clean up SQL by removing unneeded GROUP BY
if not clone._group_by:
clone._group_by = []

Deletes also do not use GROUP BY.

def delete


def delete(self):
"""
Deletes the records in the current QuerySet.
"""
assert self._limit is None and self._offset is None, \
"Cannot use 'limit' or 'offset' with delete."

del_query = self._clone()

# disable non-supported fields
del_query._select_related = False
del_query._order_by = ()
del_query._group_by = []

Ensure that GROUP BY is not included in count, it messes up.

def count


def count(self):
"Performs a SELECT COUNT() and returns the number of records as an integer."
counter = self._clone()
counter._order_by = ()
counter._offset = None
counter._limit = None
counter._select_related = False
counter._group_by = []
select, sql, params = counter._get_sql_clause()

Conclusion

The above code could use some improvements. Better support for count and DateQuerySet are definitely needed. Other than that I think everything is OK. I have not tested it thoroughly on tables requiring complex joins, so testing is needed. Hope it helps, the code is currently used to make the web a better place on Arkayne.

  • And as of a few months back (with the landing of queryset-refactor), there's some (under-the-hood, NOT exposed through any high-level API methods) support for directly dropping a GROUP BY clause into a query.
  • I know this is an old post, though just wanted to let you know that grouping in postgres needs to use aggregate functions, otherwise it throws an error :(
  • Have you submitted this as a patch?
blog comments powered by Disqus