Integrate the Tika REST service with Rails paperclip attachments to extract text from PDF documents and store in Elasticsearch

In this post I'll share some code that integrates the Tika REST service with Rails Paperclip file attachments to extract text from PDF documents and store it in Elasticsearch. Every service runs in a Docker container.

I installed my Docker dependencies via Homebrew on OS X.

$ brew list --versions | grep -i docker
docker 1.12.5
docker-compose 1.10.0
docker-machine 0.9.0
docker-machine-nfs 0.4.1

# [optional] create VirtualBox docker-machine with increased resources
docker-machine create --virtualbox-memory "4096" --virtualbox-disk-size "40000" -d virtualbox docker-machine

# [optional] enable NFS support
docker-machine-nfs docker-machine

Initial Rails setup

# rvm files
echo docker_rails_tika_elasticsearch > .ruby-gemset
echo ruby-2.3.3 > .ruby-version
cd .

gem install rails

rails -v
Rails 5.0.1

# new project with postgresql connection
rails new . --api -d postgresql

# init database
rake db:create && rake db:migrate

Add gems, edit file: Gemfile

gem 'dotenv-rails'
gem 'elasticsearch-model'
gem 'elasticsearch-rails'
gem 'paperclip', '~> 5.0'
gem 'sidekiq'

Execute bundle install to install the new gems.

Update the application configuration to set Sidekiq as the Active Job queue adapter and enable Elasticsearch logging, edit file: config/application.rb

# ...snip...
require 'elasticsearch/rails/instrumentation'

module DockerRailsTikaElasticsearch
  class Application < Rails::Application
    # ...snip...
    config.active_job.queue_adapter = :sidekiq
  end
end

Added a dotenv file for development environment variables, new file: .env.development

ELASTICSEARCH_HOST=localhost
POSTGRES_HOST=localhost
POSTGRES_PASSWORD=postgres
POSTGRES_USER=postgres
RAILS_HOST=localhost
REDIS_HOST=localhost
TIKA_HOST=localhost

Updated database config to use environment variables, edit file: config/database.yml

default: &default
  adapter: postgresql
  encoding: unicode
  host: <%= ENV.fetch('POSTGRES_HOST', 'localhost') %>
  password: <%= ENV.fetch('POSTGRES_PASSWORD', 'postgres') %>
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  username: <%= ENV.fetch('POSTGRES_USER', 'postgres') %>
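
To see how the ERB tags in this file resolve, here is a minimal standalone sketch that renders a trimmed copy of the config with Ruby's ERB and YAML libraries. The names DATABASE_YML and render_database_config are made up for illustration and are not part of the project. Rails performs the same two steps (ERB first, then YAML) when it loads config/database.yml.

```ruby
require 'erb'
require 'yaml'

# Trimmed copy of config/database.yml, for illustration only.
DATABASE_YML = <<~CONFIG
  default:
    adapter: postgresql
    host: <%= ENV.fetch('POSTGRES_HOST', 'localhost') %>
    pool: <%= ENV.fetch('RAILS_MAX_THREADS') { 5 } %>
    username: <%= ENV.fetch('POSTGRES_USER', 'postgres') %>
CONFIG

# Evaluate the ERB tags first, then parse the result as YAML.
def render_database_config(template)
  YAML.safe_load(ERB.new(template).result)
end

config = render_database_config(DATABASE_YML)
puts config['default'].inspect
```

When POSTGRES_HOST is unset, the fallback passed to ENV.fetch is used, which is why the same file works both on the host machine and inside the containers.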

Added an Elasticsearch initializer to set the host from an environment variable, new file: config/initializers/elasticsearch.rb

Elasticsearch::Model.client = Elasticsearch::Client.new(host: ENV.fetch('ELASTICSEARCH_HOST', 'localhost'))

Added a Sidekiq initializer to set the host from an environment variable and configure the server and client, new file: config/initializers/sidekiq.rb

redis_url = "redis://#{ENV.fetch('REDIS_HOST', 'localhost')}:6379/0"

Sidekiq.configure_server do |config|
  config.redis = { url: redis_url }
end

Sidekiq.configure_client do |config|
  config.redis = { url: redis_url }
end

Added a new Rails migration to create the table for the FileUpload model.

class CreateFileUploads < ActiveRecord::Migration[5.0]
  def change
    create_table :file_uploads do |t|
      t.timestamps
    end
  end
end

Executed the paperclip generator (rails generate paperclip file_upload document) to create a migration for the attachment, which created this migration:

class AddAttachmentDocumentToFileUploads < ActiveRecord::Migration
  def self.up
    change_table :file_uploads do |t|
      t.attachment :document
    end
  end

  def self.down
    remove_attachment :file_uploads, :document
  end
end

I then created the FileUpload model (see comments for functionality), new file: app/models/file_upload.rb

class FileUpload < ApplicationRecord
  include Elasticsearch::Model

  # used to kick off Sidekiq job after model is saved (committed);
  # skipped on destroy, since a deleted record cannot be loaded by the job
  after_commit :process_document, on: [:create, :update]

  # used to send model data to Elasticsearch when indexed
  attr_accessor :document_content

  # paperclip integration
  has_attached_file :document

  # model validation per paperclip attachment
  validates :document, attachment_presence: true
  validates_attachment_content_type :document, content_type: 'application/pdf'

  # Elasticsearch mapping for document content
  mapping do
    indexes :document_content, type: 'multi_field' do
      indexes :document_content
      indexes :raw, index: :no
    end
  end

  # Elasticsearch interface to construct document structure
  def as_indexed_json(options={})
    as_json.merge(document_content: document_content)
  end

  # loads model data from Elasticsearch
  def from_elasticsearch
    search_definition = {
      query: {
        filtered: {
          filter: {
            term: {
              _id: id
            }
          }
        }
      }
    }
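    # the document is indexed asynchronously by the Sidekiq job, so poll until
    # it shows up; give up after 30 attempts instead of retrying forever
    attempts = 0
    begin
      self.class.__elasticsearch__.search(search_definition).first._source
    rescue StandardError
      raise if (attempts += 1) >= 30
      sleep 1
      retry
    end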
  end

  # called from Sidekiq job
  def set_document_content
    self.document_content = document_content_from_tika
    __elasticsearch__.index_document
  end

  # loads document contents from Elasticsearch
  def document_content_from_elasticsearch
    from_elasticsearch[:document_content]
  end

  # loads document contents from Tika REST API
  def document_content_from_tika
    meta_data = JSON.parse(`curl -H "Accept: application/json" -T "#{document.path}" http://#{ENV['TIKA_HOST']}:9998/meta`)
    `curl -X PUT --data-binary "@#{document.path}" --header "Content-type: #{meta_data['Content-Type']}" http://#{ENV['TIKA_HOST']}:9998/tika --header "Accept: text/plain"`.strip
  end

  private

  # after commit hook to create Sidekiq job
  def process_document
    DocumentProcessorJob.perform_later self
  end
end
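
The model shells out to curl to talk to Tika. As a rough alternative sketch (not the code used in this project), the same PUT request to the /tika endpoint can be made with Ruby's Net::HTTP, which avoids interpolating a file path into a shell command. The method names below are hypothetical, and the sketch assumes the content type is already known (the model only accepts PDFs) instead of asking the /meta endpoint first.

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) a PUT request for the Tika server's /tika endpoint.
# Port 9998 and the Accept header match the curl command in the model.
def build_tika_text_request(host, file_bytes, content_type)
  uri = URI("http://#{host}:9998/tika")
  request = Net::HTTP::Put.new(uri)
  request['Content-Type'] = content_type
  request['Accept'] = 'text/plain'
  request.body = file_bytes
  [uri, request]
end

# Send the file at `path` to Tika and return the extracted plain text.
# Assumption: the caller passes the content type; it defaults to PDF.
def extract_text_with_tika(host, path, content_type = 'application/pdf')
  uri, request = build_tika_text_request(host, File.binread(path), content_type)
  response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
  response.value # raises unless the response status is 2xx
  response.body.force_encoding('UTF-8').strip
end
```

Inside the model, a call such as extract_text_with_tika(ENV.fetch('TIKA_HOST', 'localhost'), document.path) could stand in for the two curl commands, under the assumptions above.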

I created an Active Job class (run by Sidekiq) which calls back to an instance method in the model, new file: app/jobs/document_processor_job.rb

class DocumentProcessorJob < ApplicationJob
  queue_as :default

  def perform(file_upload)
    file_upload.set_document_content
  end
end

Next I added a migration to force create the Elasticsearch index.

class CreateFileUploadIndex < ActiveRecord::Migration[5.0]
  def change
    FileUpload.__elasticsearch__.create_index! force: true
  end
end

Lastly, I added a simple rake task to create a new FileUpload model instance and output the contents of the document stored in Elasticsearch, new file: lib/tasks/file_upload.rake

namespace :file_upload do
  desc 'File upload test'
  task test: :environment do
    file_upload = FileUpload.new
    file = File.open Rails.root.join('eric-london-blog.pdf')
    file_upload.document = file
    file_upload.save!
    puts file_upload.document_content_from_elasticsearch
  end
end
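
The task reads from Elasticsearch right after saving, so it depends on the background job having already indexed the document. A minimal sketch of a bounded polling helper for that kind of wait is shown here. The name poll_until and its keyword arguments are hypothetical and not part of the project.

```ruby
# Call the block until it returns a truthy value, sleeping between attempts.
# Raises a RuntimeError if nothing truthy comes back within `attempts` tries.
def poll_until(attempts: 30, delay: 1)
  attempts.times do
    result = yield
    return result if result
    sleep delay
  end
  raise "gave up after #{attempts} attempts"
end

# Example with a stand-in for the asynchronous work: the block only
# succeeds on its third call.
calls = 0
value = poll_until(attempts: 5, delay: 0) do
  calls += 1
  calls == 3 ? 'document text' : nil
end
puts value
```

Bounding the number of attempts means a job that never runs (for example, because Sidekiq or Tika is down) ends in an error rather than a task that hangs.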

Docker Integration

I defined a basic Dockerfile to update packages and set the working directory:

FROM ruby:2.3.3

RUN apt-get update -qq && apt-get install -y build-essential netcat imagemagick

# node/npm
RUN apt-get install -y nodejs npm

ENV APP_HOME /rails
WORKDIR $APP_HOME

The containers, environment settings, and persistent volumes are all defined in the Docker Compose file (docker-compose.yaml). The Sidekiq container shares a filesystem with Rails so it can access the file uploads and send them to the Tika REST API.

version: '2'
services:
  elasticsearch:
    image: elasticsearch:2.4.4
    ports:
      - '9200:9200'
      - '9300:9300'
    volumes:
      - elasticsearch:/usr/share/elasticsearch/data
  postgres:
    image: postgres:latest
    ports:
      - '5432:5432'
    volumes:
      - postgres:/var/lib/postgresql/data
  rails:
    build: .
    command: bin/docker-start-rails
    depends_on:
      - elasticsearch
      - postgres
      - redis
    environment:
      - ELASTICSEARCH_HOST=elasticsearch
      - POSTGRES_HOST=postgres
      - POSTGRES_PASSWORD=postgres
      - POSTGRES_USER=postgres
      - RAILS_ENV=development
      - REDIS_HOST=redis
    ports:
      - '3000:3000'
    volumes:
      - .:/rails
      - bundle:/usr/local/bundle
  redis:
    image: redis:latest
    ports:
      - '6379:6379'
    volumes:
      - redis:/data
  sidekiq:
    command: bin/docker-start-sidekiq
    environment:
      - ELASTICSEARCH_HOST=elasticsearch
      - POSTGRES_HOST=postgres
      - POSTGRES_PASSWORD=postgres
      - POSTGRES_USER=postgres
      - RAILS_ENV=development
      - RAILS_HOST=rails
      - REDIS_HOST=redis
      - TIKA_HOST=tika
    depends_on:
      - elasticsearch
      - postgres
      - rails
      - redis
      - tika
    image: dockerrailstikaelasticsearch_rails
    volumes_from:
      - rails
  tika:
    image: logicalspark/docker-tikaserver:latest
    ports:
      - '9998:9998'
volumes:
  elasticsearch: {}
  bundle: {}
  postgres: {}
  redis: {}

I created a custom start script for Rails, new file: bin/docker-start-rails

#!/bin/sh

set -x

# wait for postgresql
until nc -vz $POSTGRES_HOST 5432 2>/dev/null; do
  echo "Postgresql is not ready, sleeping."
  sleep 1
done

# wait for elasticsearch
until nc -vz $ELASTICSEARCH_HOST 9200 2>/dev/null; do
  echo "Elasticsearch is not ready, sleeping."
  sleep 1
done

gem install bundler
bundle check || bundle install

rake db:create
rake db:migrate

bundle exec puma -C config/puma.rb

And a similar start script for Sidekiq, new file: bin/docker-start-sidekiq

#!/bin/sh

set -x

# wait for rails
until nc -vz $RAILS_HOST 3000 2>/dev/null; do
  echo "Rails is not ready, sleeping."
  sleep 1
done

sidekiq
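
Both start scripts use the same pattern: loop until a TCP port accepts connections, then continue. For readers who would rather do that check from Ruby, here is a small sketch of the equivalent logic using the socket library. The method names port_open? and wait_for_port are hypothetical; the scripts in this post use nc.

```ruby
require 'socket'

# True if a TCP connection to host:port succeeds within `timeout` seconds.
def port_open?(host, port, timeout: 1)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  false
end

# Block until the port accepts connections, like the `until nc -vz` loops.
def wait_for_port(host, port, delay: 1)
  until port_open?(host, port)
    puts "#{host}:#{port} is not ready, sleeping."
    sleep delay
  end
end
```

A call such as wait_for_port(ENV.fetch('POSTGRES_HOST', 'localhost'), 5432) would mirror the first loop in bin/docker-start-rails.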

I then executed the following Docker commands to build and run the containers:

docker-compose build
docker-compose up

Once the containers were up and running, I executed the Rake task inside the container to generate the FileUpload, extract the text from a PDF, and output the content:

$ docker exec -it dockerrailstikaelasticsearch_rails_1 rake file_upload:test | egrep "^\d{4}-\d{2}-\d{2}"
2017-01-26
2016-12-01  Docker Hadoop Streaming Map Reduce Scala Job (/2016/12/01/docker-hadoop-streaming-map-reduce-scala-job.html)
2016-10-29  Dockerize a Rails development environment integrated with Postgresql, Redis, and Elasticsearch using Docker Compose (/2016/10/29/dockerize-rails-development-
2016-10-16  Track memory utilization of processes and graph the data via Chartkick, Highcharts, and Rails (/2016/10/16/track-memory-utilization-of-processes-and-graph-the-
2016-10-09  Create an iOS Swift app to send your location to a Rails API and display on Google Maps (/2016/10/09/create-an-ios-swift-app-to-send-your-location-to-a-rails-api-
2016-10-06  Elasticsearch autocomplete with a Rails API backend and an Angular frontend (/2016/10/06/elasticsearch-autocomplete-with-a-rails-api-backend-and-an-angular-
2015-12-15  Integrating a Rails API backend with an Angular frontend using token authentication (/2015/12/15/integrating-a-rails-api-backend-with-an-angular-frontend-using-
2015-12-09  Sending messages between a Swift webview and a Rails backend using Javascript (/2015/12/09/sending-messages-between-a-swift-webview-and-a-rails-backend-
2015-07-15  Rails 4: searching for related models with Elasticsearch and tagged content via acts-as-taggable-on (/2015/07/15/rails-4-searching-for-related-models-with-
2015-06-25  Rails 4 Elasticsearch geospatial searching and model integration (/2015/06/25/rails-4-elasticsearch-geospatial-searching-and-model-integration.html)
2015-06-13  Using JRuby native Queue to manage work across threads (/2015/06/13/using-jruby-native-queue-to-manage-work-across-threads.html)
2015-03-08  Using Redis sets (unique lists) to track relationships between users and their friends, and making friend suggestions (/2015/03/08/using-redis-sets-unique-lists-to-
2014-11-16  JRuby: Bulk index Rails model data into Elasticsearch via Sidekiq (Redis queue) (/2014/11/16/jruby-bulk-index-rails-model-data-into-elasticsearch-via-sidekiq-redis-
2014-11-07  Rails4 Javascript-integrated unit tests via PhantomJS, RSpec, and Capybara (/2014/11/07/rails4-javascript-integrated-unit-tests-via-phantomjs-rspec-capybara.html)
2014-09-02  Rails 4 Elasticsearch integration with dynamic facets and filters via model concern (/2014/09/02/rails-4-elasticsearch-integration-with-dynamic-facets-and-filters-via-
2014-08-01  Hadoop, Pig, Ruby, Map/Reduce, on OSX via Homebrew (/2014/08/01/hadoop-pig-ruby-map-reduce-on-osx-via-homebrew.html)

Source code on GitHub