Dennis Hackethal’s Blog
My blog about philosophy, coding, and anything else that interests me.
Extracting References from HTML in Rails
Say you run a blog. Underneath each post, you wish to list all the other posts that link to it – the 'incoming' references. Maybe you also wish to list outgoing references – all the other posts that a post links to.
What you need is a way to extract references from a post programmatically, and then store them so you can display them elsewhere. Doing this in Rails is easy. I'm using Rails 6.1.1 and Ruby 2.6.5p114.
First, let's think about how to store references. You need to store a 'referencer' and a 'referenced'. I chose polymorphic associations for my blog since I wanted to accommodate references to and from comments as well, and between posts and comments, but depending on your use case, you may not need polymorphism.
Create your Reference model like so:
$ rails g model reference referencer:references{polymorphic} referenced:references{polymorphic}
Say a post references the same post twice. If you want that to result in only one reference, add a unique index across the entire reference row. This index will ensure that multiple connections to the database can't write the same reference twice:
$ rails g migration add_index_to_references
Change this migration to look as follows:
class AddIndexToReferences < ActiveRecord::Migration[6.1]
  def change
    add_index :references, [:referencer_id, :referencer_type, :referenced_id, :referenced_type], unique: true, name: 'row_index'
  end
end
Without a custom name, the migration will throw an exception because the default name, which is basically a concatenation of all the column names, will be too long.
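To see roughly how long the default name gets, here is a small sketch in plain Ruby. It assumes Rails' default naming scheme (index_<table>_on_<columns joined by _and_>) and uses PostgreSQL's 63-character identifier limit as the yardstick; default_index_name is a made-up helper, and other databases have different limits, so treat the numbers as illustrative.

```ruby
# Hypothetical helper that mimics Rails' default index naming scheme
# (an assumption for illustration, not a Rails API).
def default_index_name(table, columns)
  "index_#{table}_on_#{columns.join('_and_')}"
end

# Assumed limit for PostgreSQL identifiers; check your own database.
POSTGRES_IDENTIFIER_LIMIT = 63

REFERENCE_COLUMNS = %i[referencer_id referencer_type referenced_id referenced_type].freeze

default_name = default_index_name(:references, REFERENCE_COLUMNS)

puts default_name
puts "default name: #{default_name.length} characters (limit: #{POSTGRES_IDENTIFIER_LIMIT})"
puts "custom name:  #{'row_index'.length} characters"
```

The custom name 'row_index' stays well under any such limit, which is why the migration passes it explicitly.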
Next, run your migrations:
$ rails db:migrate
Now, in your Reference model, you need to set up the associations and validate uniqueness on all the model's fields. (If you're wondering if this isn't an unnecessary repetition after setting up the index: the index is meant as a failsafe on a database level. You don't want things to get that far, so you still want your standard model validations to catch duplicates before your model writes to the database.)
class Reference < ApplicationRecord
  belongs_to :referencer, polymorphic: true
  belongs_to :referenced, polymorphic: true

  validates :referencer_id, uniqueness: { scope: [:referencer_type, :referenced_id, :referenced_type] }
end
That's all we need to store references.
To extract references we need to install the Nokogiri gem. In your Gemfile, add:
gem 'nokogiri'
Then install:
$ bundle
Next, we parse a post's or comment's body whenever it is saved and see if it contains any references. Since this logic is shared, I decided to put it in a module which I then include in both my post and comment models:
module Referencer
  def self.included(receiver)
    receiver.has_many :out_references, class_name: 'Reference', as: :referencer, dependent: :destroy
    receiver.has_many :in_references, class_name: 'Reference', as: :referenced, dependent: :destroy
    receiver.has_many :post_referencers, through: :in_references, source: :referencer, source_type: 'Post'
    receiver.has_many :comment_referencers, through: :in_references, source: :referencer, source_type: 'Comment'
    receiver.has_many :post_references, through: :out_references, source: :referenced, source_type: 'Post'
    receiver.has_many :comment_references, through: :out_references, source: :referenced, source_type: 'Comment'
    receiver.send :include, InstanceMethods
    receiver.after_save :create_references
  end

  module InstanceMethods
    def create_references
      # Parse the body using Nokogiri. I create the HTML string
      # using markdown, but it's up to you, as long as you end up
      # with an HTML string that you can parse with Nokogiri.
      doc = Nokogiri.HTML(ApplicationController.helpers.markdown(self.body))

      # Get all links
      doc.css('a')
        .map { |link| link['href'] } # 'pluck' the `href` off those links
        .map do |href|
          # This regex matches:
          # - post id, with or without domain, both for localhost
          #   and blog.example.com
          # - comment id, with or without domain, either as a
          #   standalone URL fragment or in conjunction with a
          #   post path and, optionally, the domain
          # Please note: I am a regex noob, don't trust the regex below.
          # It assumes that your links to posts and comments look
          # like https://blog.example.com/posts/123#comment-456,
          # the post id being 123 and the comment id being 456.
          regex = Regexp.new '\A((https?:\/\/(localhost:3000|blog\.example\.com))?\/posts\/(?<post_id>\S+?))?(?<fragment>#(comment-(?<comment_id>\d+)|\S+?)?)?\z'
          regex.match href
        end
        .filter(&:present?)
        .map do |result|
          begin
            # At this point, we usually have at least one of post_id
            # or comment_id, potentially both. If neither is set
            # (e.g. for a bare '#some-fragment' link), Post.find(nil)
            # below raises and the rescue turns that into nil.
            post_id, comment_id = result.named_captures.values_at('post_id', 'comment_id')

            # If the comment_id is set, we want to reference the comment,
            # not the post, even if the post_id is also set.
            if comment_id
              comment = Comment.find(comment_id)

              # If the post_id was also given...
              if post_id
                # Make sure the post exists
                post = Post.find(post_id)
                # Make sure the comment was made on that post
                comment.post == post ? comment : nil
              else
                comment
              end
            else
              # Only the post_id was given, so we want to reference
              # the post
              Post.find(post_id)
            end
          rescue
            # Any failed lookup means there is nothing to reference
            nil
          end
        end # at this point, what's returned could be a post or comment
        .filter(&:present?)
        .each do |referenced|
          Reference.create(
            referencer: self,
            referenced: referenced
          )
        end
    end

    def referencers
      [*self.post_referencers, *self.comment_referencers].sort_by(&:created_at)
    end

    def references
      [*self.post_references, *self.comment_references].sort_by(&:created_at)
    end
  end
end
I called this file referencer.rb and put it in my models/concerns folder (although I think technically it's not a proper concern). Check out the comments above; it should all be self-explanatory. You will definitely need to change the regular expression to match your needs. I set it up for you in a regex debugger so you can understand how it works and how to change it.
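If you'd rather poke at the regex from a console before adapting it, a sketch like the following may help. It copies the regex from the module as-is and runs it against a handful of made-up hrefs; extract_ids and the sample links are hypothetical and exist only for this illustration.

```ruby
# The regex from the Referencer module, copied verbatim.
REFERENCE_REGEX = Regexp.new '\A((https?:\/\/(localhost:3000|blog\.example\.com))?\/posts\/(?<post_id>\S+?))?(?<fragment>#(comment-(?<comment_id>\d+)|\S+?)?)?\z'

# Hypothetical helper: returns [post_id, comment_id] for an href,
# or nil if the regex does not match at all.
def extract_ids(href)
  match = REFERENCE_REGEX.match(href)
  return nil unless match

  match.named_captures.values_at('post_id', 'comment_id')
end

# Made-up sample links covering the cases described in the comments.
SAMPLE_HREFS = [
  '/posts/123',
  'https://blog.example.com/posts/123#comment-456',
  '#comment-456',
  '#top',
  '/posts/123/edit',
  'https://example.org/'
].freeze

SAMPLE_HREFS.each do |href|
  puts "#{href.inspect} => #{extract_ids(href).inspect}"
end
```

Two things stand out when you run this. A link like '#top' matches without yielding either id, so the module relies on the failed lookup being rescued. And '/posts/123/edit' yields '123/edit' as the post id, so you may want to tighten the post_id group (for example to digits only) if your post ids are numeric.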
Side note: you may wish to consider removing references when the corresponding links are removed from the body. One rather crude way to do that would be to destroy all of self’s outgoing references at the beginning of the #create_references method. That way, any needed references will be recreated while the ones that are gone will stay gone. This is assuming you don't mind changing created_at fields for the same references.
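If the changing created_at fields do bother you, a gentler variant would be to compare the stored references with the ones found in the freshly parsed body and only touch the difference. Below is a minimal sketch of that idea in plain Ruby, with each reference modeled as a [type, id] pair. reference_changes is a made-up helper, and wiring it into #create_references is left out.

```ruby
require 'set'

# Hypothetical helper: given the references currently stored and the
# ones found in the new body, work out which to remove and which to add.
# Each reference is a [type, id] pair, e.g. ['Post', 1].
def reference_changes(stored, found)
  stored_set = stored.to_set
  found_set = found.to_set

  {
    remove: (stored_set - found_set).to_a, # stored, but link is gone
    add: (found_set - stored_set).to_a     # linked, but not stored yet
  }
end

# Made-up example data.
stored = [['Post', 1], ['Post', 2], ['Comment', 7]]
found  = [['Post', 2], ['Comment', 7], ['Comment', 9]]

changes = reference_changes(stored, found)
puts "remove: #{changes[:remove].inspect}"
puts "add:    #{changes[:add].inspect}"
```

References that appear in both lists are left alone, so their created_at fields stay as they were.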
In your Post and/or Comment model, include the Referencer module:
class Post < ApplicationRecord
  include Referencer
  # other stuff
end
Now, whenever a post or comment is created or saved, the after_save callback will extract references from its body and store them.
How you display incoming and outgoing references is up to you. I added a method called #referencers and one called #references to the Referencer module. Those give me the posts and comments directly so I can iterate over them in a view and display titles, links, and whatnot. At the time of writing, I display references underneath blog posts and comments. For example, scroll to the end of this post. And since I'm linking to that post, you should see a reference to it underneath this here post you're currently reading.
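To illustrate what #referencers and #references boil down to, here is a small plain-Ruby sketch that merges two lists and sorts them by created_at. The Item struct, the merge_by_created_at helper, and the sample data are made up and merely stand in for real posts and comments.

```ruby
# Stand-in for a post or comment; only the fields needed here.
Item = Struct.new(:title, :created_at)

# Hypothetical helper doing the same merge-and-sort as #referencers.
def merge_by_created_at(posts, comments)
  [*posts, *comments].sort_by(&:created_at)
end

# Made-up sample data.
posts = [
  Item.new('Second post', Time.utc(2021, 2, 1)),
  Item.new('First post', Time.utc(2021, 1, 1))
]
comments = [
  Item.new('A comment', Time.utc(2021, 1, 15))
]

merge_by_created_at(posts, comments).each do |item|
  puts "#{item.created_at.strftime('%Y-%m-%d')}  #{item.title}"
end
```

Note that the sort uses the created_at of the posts and comments themselves, not of the Reference rows connecting them.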