Dennis Hackethal’s Blog

My blog about philosophy, coding, and anything else that interests me.

Extracting References from HTML in Rails

Published · 3-minute read

Say you run a blog. Underneath each post, you wish to list all the other posts that link to it – the 'incoming' references. Maybe you also wish to list outgoing references – all the other posts that a post links to.

What you need is a way to extract references from a post programmatically, and then store them so you can display them elsewhere. Doing this in Rails is easy. I'm using Rails 6.1.1 and Ruby 2.6.5p114.

First, let's think about how to store references. You need to store a 'referencer' and a 'referenced'. I chose polymorphic associations for my blog since I wanted to accommodate references to and from comments as well, and between posts and comments, but depending on your use case, you may not need polymorphism.

Create your Reference model like so:

$ rails g model reference referencer:references{polymorphic} referenced:references{polymorphic}

Say a post references the same post twice. If you want that to result in only one reference, add a unique index across the entire reference row. This index will ensure that multiple connections to the database can't write the same reference twice:

$ rails g migration add_index_to_references

Change this migration to look as follows:

class AddIndexToReferences < ActiveRecord::Migration[6.1]
  def change
    add_index :references, [:referencer_id, :referencer_type, :referenced_id, :referenced_type], unique: true, name: 'row_index'
  end
end

Without a custom name, the migration will raise an exception because the default name, which is basically a concatenation of all the column names, exceeds the database's limit on index name length.
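To see why the default name is a problem, here is a small plain-Ruby sketch that rebuilds it. It assumes Rails derives the default as `index_<table>_on_<columns joined by _and_>`; the `default_index_name` helper is made up for illustration, and the 63-byte figure is PostgreSQL's identifier limit (other databases differ slightly).

```ruby
# Assumption: Rails builds the default index name as
# "index_<table>_on_<columns joined by _and_>".
# `default_index_name` is a hypothetical helper, not a Rails method.
def default_index_name(table, columns)
  "index_#{table}_on_#{columns.join('_and_')}"
end

columns = %i[referencer_id referencer_type referenced_id referenced_type]
name = default_index_name(:references, columns)

puts name
puts name.length       # 91 characters
puts name.length > 63  # true: longer than PostgreSQL's 63-byte limit
```

A short custom name such as 'row_index' sidesteps the limit entirely.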

Next, run your migrations:

$ rails db:migrate

Now, in your Reference model, you need to set up the associations and validate uniqueness across all the model's fields. (In case this looks like an unnecessary repetition of the index: the index is meant as a failsafe at the database level. You don't want things to get that far, so you still want your standard model validations to catch duplicates before your model writes to the database.)

class Reference < ApplicationRecord
  belongs_to :referencer, polymorphic: true
  belongs_to :referenced, polymorphic: true

  validates :referencer_id, uniqueness: { scope: [:referencer_type, :referenced_id, :referenced_type] }
end

That's all we need to store references.

To extract references we need to install the Nokogiri gem. In your Gemfile, add:

gem 'nokogiri'

Then install:

$ bundle

Next, we parse a post's or comment's body whenever it is saved and see if it contains any references. Since this logic is shared, I decided to put it in a module which I then include in both my post and comment models:

module Referencer
  def self.included(receiver)
    receiver.has_many :out_references, class_name: 'Reference', as: :referencer, dependent: :destroy
    receiver.has_many :in_references, class_name: 'Reference', as: :referenced, dependent: :destroy
    receiver.has_many :post_referencers, through: :in_references, source: :referencer, source_type: 'Post'
    receiver.has_many :comment_referencers, through: :in_references, source: :referencer, source_type: 'Comment'
    receiver.has_many :post_references, through: :out_references, source: :referenced, source_type: 'Post'
    receiver.has_many :comment_references, through: :out_references, source: :referenced, source_type: 'Comment'

    receiver.send :include, InstanceMethods

    receiver.after_save :create_references
  end

  module InstanceMethods
    def create_references
      # Parse the body using Nokogiri. I create the HTML string
      # using markdown, but it's up to you, as long as you end up
      # with an HTML string that you can parse with Nokogiri.
      doc = Nokogiri.HTML(ApplicationController.helpers.markdown(self.body))

      # Get all links
      doc.css('a')
        .map { |link| link['href'] } # 'pluck' the `href` off those links
        .map do |href|
          # This regex matches:
          # - post id, with or without domain, both for localhost
          #   and blog.example.com
          # - comment id, with or without domain, either as a
          #   standalone URL fragment or in conjunction with a
          #   post path and, optionally, the domain
          # Please note: I am a regex noob, don't trust the regex below.
          # It assumes that your links to posts and comments look
          # like https://blog.example.com/posts/123#comment-456,
          # the post id being 123 and the comment id being 456.
          regex = Regexp.new '\A((https?:\/\/(localhost:3000|blog\.example\.com))?\/posts\/(?<post_id>\S+?))?(?<fragment>#(comment-(?<comment_id>\d+)|\S+?)?)?\z'

          regex.match href
        end
        .filter(&:present?)
        .map do |result|
          begin
            # At this point, we usually have a post_id, a comment_id, or both.
            # The regex also matches links with neither (e.g. '#' or '#section');
            # for those, Post.find(nil) below raises and the rescue returns nil.
            post_id, comment_id = result.named_captures.values_at('post_id', 'comment_id')

            # If the comment_id is set, we want to reference the comment,
            # not the post, even if the post_id is also set.
            if comment_id
              comment = Comment.find(comment_id)

              # If the post_id was also given...
              if post_id
                # Make sure the post exists
                post = Post.find(post_id)

                # Make sure the comment was made on that post
                comment.post == post ? comment : nil
              else
                comment
              end
            # Only the post_id was given, so we want to reference
            # the post
            else
              Post.find(post_id)
            end
          rescue ActiveRecord::RecordNotFound
            nil
          end
        end # at this point, what's returned could be a post or comment
        .filter(&:present?)
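        .each do |referenced|
          # `create` (without a bang) does not raise when the uniqueness
          # validation rejects a duplicate; it just returns the unsaved record.
          Reference.create(
            referencer: self,
            referenced: referenced
          )
        end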
    end

    def referencers
      [*self.post_referencers, *self.comment_referencers].sort_by(&:created_at)
    end

    def references
      [*self.post_references, *self.comment_references].sort_by(&:created_at)
    end
  end
end

I called this file referencer.rb and put it in my models/concerns folder (although I think technically it's not a proper concern). Check out the comments above; they should make the code self-explanatory. You will definitely need to change the regular expression to match your needs. I set it up for you in a regex debugger so you can understand how it works and how to change it.
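If you'd rather poke at the regular expression from a Ruby console before adapting it, a minimal sketch like the following may help. The regex is copied unchanged from the module; the `extract_ids` helper and the sample hrefs are made up for illustration.

```ruby
# The regex from #create_references, unchanged.
REFERENCE_REGEX = Regexp.new '\A((https?:\/\/(localhost:3000|blog\.example\.com))?\/posts\/(?<post_id>\S+?))?(?<fragment>#(comment-(?<comment_id>\d+)|\S+?)?)?\z'

# Hypothetical helper: returns [post_id, comment_id] for an href,
# or nil if the href does not match at all.
def extract_ids(href)
  match = REFERENCE_REGEX.match(href)
  match && match.named_captures.values_at('post_id', 'comment_id')
end

samples = [
  '/posts/123',                                     # ["123", nil]
  'https://blog.example.com/posts/123#comment-456', # ["123", "456"]
  '#comment-456',                                   # [nil, "456"]
  '#section',                                       # [nil, nil]
  '/posts/123/edit',                                # ["123/edit", nil]
  'https://example.org/other'                       # nil (no match)
]

samples.each do |href|
  puts "#{href.inspect} => #{extract_ids(href).inspect}"
end
```

Note the two loose cases: a plain fragment such as '#section' matches without yielding any id, and `\S+?` lets the post id swallow trailing path segments. In the module, both end up being discarded only because the subsequent lookup fails or finds nothing useful, so tightening the pattern (for example, `\d+` for numeric post ids) may be worthwhile.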

Side note: you may wish to consider removing references when the corresponding links are removed from the body. One rather crude way to do that would be to destroy all of self’s references at the beginning of the #create_references method. That way, any needed references will be recreated while the ones that are gone will stay gone. This is assuming you don't mind changing created_at fields for the same references.
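If changing created_at does bother you, a gentler alternative is to compare the references the body currently contains with the ones already stored, and only touch the difference. The sketch below shows the idea in plain Ruby, with `[type, id]` pairs standing in for records. `diff_references` is a hypothetical helper, not part of the module above, and wiring it into #create_references is left to you.

```ruby
require 'set'

# Hypothetical helper. Both arguments are collections of [type, id] pairs,
# e.g. ['Post', 123]. Returns which pairs to create and which to destroy.
def diff_references(stored, current)
  stored  = stored.to_set
  current = current.to_set

  {
    to_create:  (current - stored).to_a,
    to_destroy: (stored - current).to_a
  }
end

stored  = [['Post', 1], ['Comment', 7]]            # already in the references table
current = [['Post', 1], ['Post', 2], ['Post', 2]]  # parsed from the edited body

diff = diff_references(stored, current)
puts "create:  #{diff[:to_create].inspect}"
puts "destroy: #{diff[:to_destroy].inspect}"
```

Here the reference to post 1 is left alone (so its created_at survives), the duplicate link to post 2 collapses into a single new reference, and the reference to comment 7 is marked for removal.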

In your Post and/or Comment model, include the Referencer module:

class Post < ApplicationRecord
  include Referencer

  # other stuff
end

Now, whenever a post or comment is created or updated, the after_save callback will extract references from its body and store them.

How you display incoming and outgoing references is up to you. I added a method called #referencers and one called #references to the Referencer module. Those give me the posts and comments directly so I can iterate over them in a view and display titles, links, and whatnot. At the time of writing, I display references underneath blog posts and comments. For example, scroll to the end of this post. And since I'm linking to that post, you should see a reference to it underneath this here post you're currently reading.
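As a rough illustration of the view side, the sketch below renders a list of referencers with ERB from Ruby's standard library. The `Item` struct, the titles, and the paths are invented stand-ins. In a Rails view you would iterate over something like `@post.referencers` and use your own route helpers, and Rails would escape the output for you.

```ruby
require 'erb'

# Stand-in for the posts and comments returned by #referencers.
Item = Struct.new(:title, :path)

TEMPLATE = ERB.new(<<~HTML, trim_mode: '-')
  <ul>
  <%- referencers.each do |item| -%>
    <li><a href="<%= item.path %>"><%= ERB::Util.html_escape(item.title) %></a></li>
  <%- end -%>
  </ul>
HTML

# Renders the list for any collection of objects that respond to
# #title and #path.
def render_referencers(referencers)
  TEMPLATE.result_with_hash(referencers: referencers)
end

puts render_referencers([
  Item.new('Q & A about references', '/posts/1'),
  Item.new('Another post', '/posts/2')
])
```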


References

This post makes 1 reference to:

