Mimsy Were the Borogoves

Hacks: Articles about programming in Python, Perl, PHP, and whatever else I happen to feel like hacking at.

Converting an existing Django model to Django-MPTT

Jerry Stratton, February 12, 2016

For a long time I’ve been painfully aware of a glaring inefficiency in my custom blog software built off of Django. A lot of what it does relies on the tree model: folders within folders. Finding all the front-page articles for my main blog means getting all the children of each category, and then any descendants of those children, and so on. Previewing a new blog article can take a minute or more, mainly because of the recursive nature of building the sidebar. There are only seven articles on the front page, but they require looking through a tree with 1,505 articles spread over five levels.

I’ve been aware of various solutions for turning Django into more of a tree-hugger, but they’ve tended either to not work well with an existing installation, or if they did, their documentation hid that. For example, django-mptt until a few years ago hid the very important .rebuild method, which is necessary for building the tree data for an existing dataset.

Having just finished the basic installation, however, django-mptt is very helpful. On my very simple, not very tree-like blog, the Walkerville Weekly Reader, publish time dropped by around 13% to 21%; since it only took about two minutes anyway, that’s not necessarily a big deal. But Mimsy Were the Borogoves, which I’ve been running since the nineties, has a lot more articles and a much more complex structure. Publish time dropped from about 30 minutes to about 10 minutes. That’s worth the installation troubles.

Converting my model to use django-mptt

I installed django-mptt by downloading version 0.8.0, unpacking it, and running setup, but, despite the documentation claiming that it’s okay with Django 1.8.x, it performed an automatic upgrade to Django 1.9.1. I had to downgrade again using:

  • sudo pip uninstall django
  • sudo pip install django==1.8.8

I’m using django-ajax-selects version 1.3.6, which doesn’t work with Django 1.9.x. While the 1.4 version is compatible with Django 1.9, the 1.4 version doesn’t work on inlines, and inlines are the main reason I use ajax-selects. I have a list of 9,000 URLs and 14,000 pages that can be attached via inline to any page, not to mention the keywords and authors and so forth. Without ajax-select, it takes forever for browsers to render pages because those become pull-down menus.

Converting to django-mptt is very easy. I added this to the top of my models.py:

  • from mptt.models import MPTTModel, TreeForeignKey, TreeManager

I had to include TreeManager because I have a custom manager. If you use the standard manager, you shouldn’t need it.

[toggle code]

  • class PageManager(models.Manager):
  • class Page(models.Model):
    • parent = models.ForeignKey(‘self’, blank=True, null=True)

became:

[toggle code]

  • class PageManager(TreeManager):
  • class Page(MPTTModel):
    • parent = TreeForeignKey(‘self’, blank=True, null=True)

I also had a “level” method on my Page model, which I needed to rename because it interfered with the MPTTModel’s level column.

I used “./manage.py sqlall pages” to find the new columns and indexes, and then added them by hand in SQLite:

  • ALTER TABLE pages_page ADD COLUMN lft INTEGER UNSIGNED NOT NULL DEFAULT 0;
  • ALTER TABLE pages_page ADD COLUMN rght INTEGER UNSIGNED NOT NULL DEFAULT 0;
  • ALTER TABLE pages_page ADD COLUMN tree_id INTEGER UNSIGNED NOT NULL DEFAULT 0;
  • ALTER TABLE pages_page ADD COLUMN level INTEGER UNSIGNED NOT NULL DEFAULT 0;
  • CREATE INDEX "pages_page_329f6fb3" ON "pages_page" ("lft");
  • CREATE INDEX "pages_page_e763210f" ON "pages_page" ("rght");
  • CREATE INDEX "pages_page_ba470c4a" ON "pages_page" ("tree_id");
  • CREATE INDEX "pages_page_20e079f4" ON "pages_page" ("level");

Your index names are likely to be different.

After verifying that everything still worked, I built the tree data:

  • Page.objects.rebuild()

It took about ten minutes for it to create the tree data.

At that point, I left it alone for a few days to make sure I hadn’t introduced anything that broke the blog.

Code

Your code changes will depend on your code, of course. In my case, I had two methods that needed to go:

[toggle code]

  • #get all levels of responses to a blog article or members of a blog category
  • def getBlogArticles(self):
    • articles = self.children().exclude(provider='nisus')
    • possiblyUpdated = Page.objects.live().filter(parent__in=articles, provider='db').exclude(pagekey__keyword__key='Voices')
    • for article in possiblyUpdated:
      • articles = articles | article.parent.getBlogArticles()
    • if self.blogCategory:
      • articles = articles.order_by('rank', '-added')
    • else:
      • articles = articles.order_by('-added')
    • return articles
  • #uses list to reduce size of query, as I think that without it the query is too long for sqlite
  • def alldescendants(self):
    • idescendants = list(Page.objects.filter(parent=self).values_list('id', flat=True))
    • descendants = []
    • while idescendants:
      • descendants.extend(idescendants)
      • idescendants = list(Page.objects.filter(parent__in=idescendants).values_list('id', flat=True))
    • descendants = Page.objects.filter(parent__in=descendants, live='public')
    • return descendants
  • def modified(self):
    • descendants = self.alldescendants()
    • try:
      • mostrecent = descendants.order_by('-edited')[:1].get().edited
      • lastmodified = max(mostrecent, self.edited)
    • except:
      • lastmodified = self.changed()
    • return lastmodified

As you can see, I had two methods of getting all descendants before installing django-mptt. One was to recurse the method on every child and then every child’s child, as I did with getBlogArticles. This had the advantage of always working, but being very slow. The second, in alldescendants was to get the id of all children, and then get every page that had a parent in that list of ids, and then repeat that for the new set of pages, and so on, until no pages had parents in the furthest generation. This had the advantage of being faster, but the disadvantage of creating gigantic SQL statements.1

[toggle code]

  • #get all levels of responses to a blog article or members of a blog category
  • def getBlogArticles(self):
    • articles = self.get_descendants().filter(provider='db').exclude(pagekey__keyword__key='Voices')
    • articles = articles & Page.objects.live()
    • if self.blogcategory:
      • articles = articles.order_by('rank', '-added')
    • else:
      • articles = articles.order_by('-added')
    • return articles
  • def alldescendants(self, include_self=False):
    • return self.get_descendants(include_self=include_self).filter(live='public')
  • def modified(self):
    • descendants = self.alldescendants(include_self=True)
    • return descendants.order_by('-edited')[:1].get().edited

One immediate benefit, besides the massive speed increase for the more complex blog, is that I no longer have to deal with those incredibly large flat lists of ids. When trying to find the most recent modification within a folder, doing so under a parent with too many descendants ran the risk of an error, at which point I just defaulted to the modification time of the parent2. Using MPTT, however, the modified function no longer needs the try/except clause.

If you’re running a large tree, and traversing the tree is becoming time-consuming, look into django-mptt.

  1. Note that there’s actually a bug in alldescendants which I only noticed checking the behavior of the new method against the behavior of the old one: the final parent__in should have included not just the descendants of self, but self as well. Otherwise, it won’t include self’s immediate children, but only its grandchildren and greater grandchildren.

  2. This also had the effect of hiding a bug in modified, which is that if the page has no descendants, .get() will fail, and so use the exception code which is to get the current page’s edited date—which turns out to be the correct date.

  1. <- Stealing focus
  2. Goodreads advanced search ->