In this post I’ll share some code that integrates the Tika REST service with Rails Paperclip file attachments, extracts text from PDF documents, and stores the data in Elasticsearch, with every service running in Docker.
I installed my Docker dependencies via Homebrew on OS X.
$ brew list --versions | grep -i docker
docker 1.12.5
docker-compose 1.10.0
docker-machine 0.9.0
docker-machine-nfs 0.4.1
# [optional] create VirtualBox docker-machine with increased resources
docker-machine create --virtualbox-memory "4096" --virtualbox-disk-size "40000" -d virtualbox docker-machine
# [optional] enable NFS support
docker-machine-nfs docker-machine
Initial Rails setup
# rvm files
echo docker_rails_tika_elasticsearch > .ruby-gemset
echo ruby-2.3.3 > .ruby-version
cd .
gem install rails
rails -v
Rails 5.0.1
# new project with postgresql connection
rails new . --api -d postgresql
# init database
rake db:create && rake db:migrate
Add gems, edit file: Gemfile
gem 'dotenv-rails'
gem 'elasticsearch-model'
gem 'elasticsearch-rails'
gem 'paperclip', '~> 5.0'
gem 'sidekiq'
Execute bundle install to install the new gems.
Update the application configuration to set Sidekiq as the Active Job queue adapter and enable Elasticsearch logging, edit file: config/application.rb
# ...snip...
require 'elasticsearch/rails/instrumentation'

module DockerRailsTikaElasticsearch
  class Application < Rails::Application
    # ...snip...
    config.active_job.queue_adapter = :sidekiq
  end
end
Added a dotenv file for development environment variables, new file: .env.development
POSTGRES_HOST=localhost
POSTGRES_USER=postgres
RAILS_HOST=localhost
REDIS_HOST=localhost
TIKA_HOST=localhost
Updated database config to use environment variables, edit file: config/database.yml
default: &default
  adapter: postgresql
  encoding: unicode
  host: <%= ENV.fetch('POSTGRES_HOST', 'localhost') %>
  password: <%= ENV.fetch('POSTGRES_PASSWORD', 'postgres') %>
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  username: <%= ENV.fetch('POSTGRES_USER', 'postgres') %>
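To see how the ENV.fetch fallbacks in this file resolve, here is a minimal sketch that renders a similar template with plain ERB and YAML, outside of Rails. The render_database_config helper and the hash passed in place of ENV are illustrative assumptions, not part of the project.

```ruby
require 'erb'
require 'yaml'

# Trimmed-down template modelled on config/database.yml.
TEMPLATE = <<~YAML
  default:
    adapter: postgresql
    host: <%= env.fetch('POSTGRES_HOST', 'localhost') %>
    username: <%= env.fetch('POSTGRES_USER', 'postgres') %>
    pool: <%= env.fetch('RAILS_MAX_THREADS') { 5 } %>
YAML

# Hypothetical helper: renders the ERB against the given hash (standing in
# for ENV), then parses the resulting YAML.
def render_database_config(env)
  YAML.safe_load(ERB.new(TEMPLATE).result(binding))
end

config = render_database_config('POSTGRES_HOST' => 'postgres')
puts config['default']['host']     # taken from the hash
puts config['default']['username'] # falls back to the default
puts config['default']['pool']     # falls back to the block value
```

Inside the Rails container POSTGRES_HOST is set to the service name, so the same file works unchanged on the host machine and under Docker.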
Added an Elasticsearch initializer to set the host from an environment variable, new file: config/initializers/elasticsearch.rb
Elasticsearch::Model.client = Elasticsearch::Client.new(host: ENV.fetch('ELASTICSEARCH_HOST', 'localhost'))
Added a Sidekiq initializer to set the host from an environment variable and configure the server and client, new file: config/initializers/sidekiq.rb
redis_url = "redis://#{ENV.fetch('REDIS_HOST', 'localhost')}:6379/0"

Sidekiq.configure_server do |config|
  config.redis = { url: redis_url }
end

Sidekiq.configure_client do |config|
  config.redis = { url: redis_url }
end
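As a quick sanity check on the URL this initializer builds, the sketch below constructs the same string from a hash standing in for ENV and parses it with the standard library. The redis_url helper is a hypothetical wrapper added only for illustration.

```ruby
require 'uri'

# Hypothetical helper mirroring the initializer: the host comes from the
# environment hash, while the port and database number are fixed.
def redis_url(env)
  "redis://#{env.fetch('REDIS_HOST', 'localhost')}:6379/0"
end

uri = URI.parse(redis_url('REDIS_HOST' => 'redis'))
puts uri.host
puts uri.port
puts uri.path
```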
Added a new Rails migration to create the table for the FileUpload model.
class CreateFileUploads < ActiveRecord::Migration[5.0]
  def change
    create_table :file_uploads do |t|
      t.timestamps
    end
  end
end
I ran the Paperclip generator (rails generate paperclip file_upload document), which created this migration for the attachment:
class AddAttachmentDocumentToFileUploads < ActiveRecord::Migration
  def self.up
    change_table :file_uploads do |t|
      t.attachment :document
    end
  end

  def self.down
    remove_attachment :file_uploads, :document
  end
end
I then created the FileUpload model (see comments for functionality), new file: app/models/file_upload.rb
class FileUpload < ApplicationRecord
  include Elasticsearch::Model

  # used to kick off Sidekiq job after model is saved (committed)
  after_commit :process_document

  # used to send model data to Elasticsearch when indexed
  attr_accessor :document_content

  # paperclip integration
  has_attached_file :document

  # model validation per paperclip attachment
  validates :document, attachment_presence: true
  validates_attachment_content_type :document, content_type: 'application/pdf'

  # Elasticsearch mapping for document content
  mapping do
    indexes :document_content, type: 'multi_field' do
      indexes :document_content
      indexes :raw, index: :no
    end
  end

  # Elasticsearch interface to construct document structure
  def as_indexed_json(options = {})
    as_json.merge(document_content: document_content)
  end

  # loads model data from Elasticsearch
  def from_elasticsearch
    search_definition = {
      query: {
        filtered: {
          filter: {
            term: {
              _id: id
            }
          }
        }
      }
    }
    self.class.__elasticsearch__.search(search_definition).first._source
  rescue => e
  end

  # called from Sidekiq job
  def set_document_content
    self.document_content = document_content_from_tika
    __elasticsearch__.index_document
  end

  # loads document contents from Elasticsearch
  def document_content_from_elasticsearch
    from_elasticsearch[:document_content]
  end

  # loads document contents from Tika REST API
  def document_content_from_tika
    meta_data = JSON.parse(`curl -H "Accept: application/json" -T "#{document.path}" http://#{ENV['TIKA_HOST']}:9998/meta`)
    `curl -X PUT --data-binary "@#{document.path}" --header "Content-type: #{meta_data['Content-Type']}" http://#{ENV['TIKA_HOST']}:9998/tika --header "Accept: text/plain"`.strip
  end

  # after commit hook to create Sidekiq job
  def process_document
    DocumentProcessorJob.perform_later self
  end
end
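The model shells out to curl for both Tika calls, which means the file path and host are interpolated into a shell command. As an alternative, the same two PUT requests (to /meta and /tika) could be made with Net::HTTP from the standard library. The sketch below is one possible shape, not the code used in this post: build_tika_request and send_tika_request are hypothetical helpers, and the placeholder bytes stand in for the real PDF.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) a PUT request for a Tika
# server endpoint such as "meta" or "tika" on port 9998.
def build_tika_request(host, endpoint, body, accept:, content_type: nil)
  uri = URI("http://#{host}:9998/#{endpoint}")
  request = Net::HTTP::Put.new(uri)
  request['Accept'] = accept
  request['Content-Type'] = content_type if content_type
  request.body = body
  request
end

# Hypothetical helper: sends a request built above and returns the response
# body. It needs a reachable Tika server, so it is not called in this sketch.
def send_tika_request(request)
  uri = request.uri
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }.body
end

# In the model the body would be File.binread(document.path).
pdf_bytes = '%PDF-1.4 placeholder'
request = build_tika_request('localhost', 'tika', pdf_bytes,
                             accept: 'text/plain',
                             content_type: 'application/pdf')
puts request.method
puts request.path
puts request['Accept']
```

Keeping request construction separate from sending makes the headers easy to inspect in a test without a running Tika container.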
I created an Active Job class, processed by Sidekiq, which calls back to an instance method in the model, new file: app/jobs/document_processor_job.rb
class DocumentProcessorJob < ApplicationJob
  queue_as :default

  def perform(file_upload)
    file_upload.set_document_content
  end
end
Next I added a migration to force create the Elasticsearch index.
class CreateFileUploadIndex < ActiveRecord::Migration[5.0]
  def change
    FileUpload.__elasticsearch__.create_index! force: true
  end
end
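Once the index exists, from_elasticsearch looks a record up by _id using a filtered query. That query type is deprecated in Elasticsearch 2.x and was removed in 5.0, so on a newer cluster the lookup would need a bool query with a filter clause instead. The sketch below only builds the search definition hash; id_lookup_definition is a hypothetical helper name, and the id value is made up.

```ruby
require 'json'

# Hypothetical helper: builds an _id lookup that uses bool/filter in place
# of the filtered query.
def id_lookup_definition(id)
  {
    query: {
      bool: {
        filter: {
          term: { _id: id }
        }
      }
    }
  }
end

puts JSON.pretty_generate(id_lookup_definition(42))
```

The resulting hash could be passed to __elasticsearch__.search in the same way as the original search_definition.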
Last, I added a simple rake task to create a new FileUpload model instance and output the contents of the document stored in Elasticsearch, new file: lib/tasks/file_upload.rake
namespace :file_upload do
  desc 'File upload test'
  task test: :environment do
    file_upload = FileUpload.new
    file = File.open Rails.root.join('eric-london-blog.pdf')
    file_upload.document = file
    file_upload.save!
    puts file_upload.document_content_from_elasticsearch
  end
end
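Because the text extraction runs in a background job, the document may not be indexed yet when the rake task reads it back immediately after save!. One way to make such a task more tolerant of timing is to poll for the result. The sketch below shows a generic retry helper. The name wait_for and its parameters are assumptions for illustration, and the block is a stand-in for the real Elasticsearch lookup.

```ruby
# Hypothetical helper: calls the block until it returns a truthy value or
# the attempts run out. Returns that value, or nil if every attempt failed.
def wait_for(attempts: 10, interval: 0.5)
  attempts.times do
    result = yield
    return result if result
    sleep interval
  end
  nil
end

# Stand-in for a lookup that yields nothing until the background job has
# indexed the document (here: on the third call).
calls = 0
content = wait_for(attempts: 5, interval: 0.01) do
  calls += 1
  calls >= 3 ? 'extracted text' : nil
end
puts content # prints "extracted text"
```

In the rake task the block would wrap the Elasticsearch lookup, assuming that lookup returns nil rather than raising while the document is still missing.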
Docker Integration
I defined a basic Dockerfile to update packages and set the working directory:
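FROM ruby:2.3.3
RUN apt-get update -qq && apt-get install -y build-essential netcat imagemagick
# node/npm
RUN apt-get install -y nodejs npm
# working directory (assumed to be /rails, matching the .:/rails volume in docker-compose.yaml)
WORKDIR /rails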
The containers, environment settings, and persistent volumes are all defined in the Docker compose file (docker-compose.yaml). The Sidekiq container shares a filesystem with Rails so it can access the file uploads and send them to the Tika REST API.
version: '2'
services:
  elasticsearch:
    image: elasticsearch:2.4.4
    ports:
      - '9200:9200'
      - '9300:9300'
    volumes:
      - elasticsearch:/usr/share/elasticsearch/data
  postgres:
    image: postgres:latest
    ports:
      - '5432:5432'
    volumes:
      - postgres:/var/lib/postgresql/data
  rails:
    build: .
    command: bin/docker-start-rails
    depends_on:
      - elasticsearch
      - postgres
      - redis
    environment:
      - ELASTICSEARCH_HOST=elasticsearch
      - POSTGRES_HOST=postgres
      - POSTGRES_USER=postgres
      - RAILS_ENV=development
      - REDIS_HOST=redis
    ports:
      - '3000:3000'
    volumes:
      - .:/rails
      - bundle:/usr/local/bundle
  redis:
    image: redis:latest
    ports:
      - '6379:6379'
    volumes:
      - redis:/data
  sidekiq:
    command: bin/docker-start-sidekiq
    environment:
      - ELASTICSEARCH_HOST=elasticsearch
      - POSTGRES_HOST=postgres
      - POSTGRES_USER=postgres
      - RAILS_ENV=development
      - RAILS_HOST=rails
      - REDIS_HOST=redis
      - TIKA_HOST=tika
    depends_on:
      - elasticsearch
      - postgres
      - rails
      - redis
      - tika
    image: dockerrailstikaelasticsearch_rails
    volumes_from:
      - rails
  tika:
    image: logicalspark/docker-tikaserver:latest
    ports:
      - '9998:9998'
volumes:
  elasticsearch: {}
  bundle: {}
  postgres: {}
  redis: {}
I created a custom start script for Rails, new file: bin/docker-start-rails
#!/bin/bash
set -x

# wait for postgresql
until nc -vz $POSTGRES_HOST 5432 2>/dev/null; do
  echo "Postgresql is not ready, sleeping."
  sleep 1
done

# wait for elasticsearch
until nc -vz $ELASTICSEARCH_HOST 9200 2>/dev/null; do
  echo "Elasticsearch is not ready, sleeping."
  sleep 1
done

gem install bundler
bundle check || bundle install
rake db:create
rake db:migrate
bundle exec puma -C config/puma.rb
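The nc -vz loops in this script wait until a TCP port accepts connections. The same check can be written in Ruby with the standard socket library, which could be handy inside a rake task or initializer. This is a sketch only: port_open? and wait_for_port are hypothetical helpers, and the demonstration connects to a throwaway local server rather than to Postgres or Elasticsearch.

```ruby
require 'socket'

# Returns true when a TCP connection to host:port can be opened.
def port_open?(host, port, timeout: 1)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError
  false
end

# Hypothetical helper: polls until the port accepts connections or the
# attempts run out. Returns true on success, false otherwise.
def wait_for_port(host, port, attempts: 30, interval: 1)
  attempts.times do
    return true if port_open?(host, port)
    sleep interval
  end
  false
end

# Demonstration against a throwaway local server bound to a free port.
server = TCPServer.new('127.0.0.1', 0)
puts wait_for_port('127.0.0.1', server.addr[1], attempts: 3, interval: 0.1)
server.close
```

Unlike the shell loop, this version gives up after a fixed number of attempts, so a container that never comes up does not block the caller forever.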
And a similar start script for Sidekiq, new file: bin/docker-start-sidekiq
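#!/bin/bash
set -x
# wait for rails
until nc -vz $RAILS_HOST 3000 2>/dev/null; do
  echo "Rails is not ready, sleeping."
  sleep 1
done
# start the Sidekiq worker once Rails is reachable (assumed final step)
bundle exec sidekiq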
I then executed the following Docker commands to build and run the containers:
docker-compose build
docker-compose up
Once the containers were up and running, I executed the Rake task inside the container to generate the FileUpload, extract the text from a PDF, and output the content:
$ docker exec -it dockerrailstikaelasticsearch_rails_1 rake file_upload:test | egrep "^\d{4}-\d{2}-\d{2}"
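The egrep filter keeps only the output lines that start with a date. Since \d is not part of POSIX extended regular expressions, some grep builds may not accept it. The same filter is easy to express in Ruby. In this sketch date_lines is a hypothetical helper and the sample text is made up for illustration.

```ruby
# Matches lines that begin with a date such as 2017-01-15.
DATE_LINE = /\A\d{4}-\d{2}-\d{2}/

# Hypothetical helper: keeps only the lines that start with a date.
def date_lines(text)
  text.each_line.map(&:chomp).select { |line| line =~ DATE_LINE }
end

# Made-up sample standing in for the extracted PDF text.
sample = "Some heading\n2017-01-15 First post\nnot a date\n2016-12-01 Older post\n"
puts date_lines(sample)
```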
