In this post I’ll share some code to integrate the Tika REST service with Rails Paperclip file attachments, extract text from PDF documents, and store the extracted text in Elasticsearch, with the whole stack running under Docker.
I installed my Docker dependencies via Brew on OSX.
$ brew list --versions | grep -i docker
docker 1.12.5
docker-compose 1.10.0
docker-machine 0.9.0
docker-machine-nfs 0.4.1
# [optional] create VirtualBox docker-machine with increased resources
docker-machine create --virtualbox-memory "4096" --virtualbox-disk-size "40000" -d virtualbox docker-machine
# [optional] enable NFS support
docker-machine-nfs docker-machine
Initial Rails setup
# rvm files
echo docker_rails_tika_elasticsearch > .ruby-gemset
echo ruby-2.3.3 > .ruby-version
cd .
gem install rails
rails -v
Rails 5.0.1
# new project with postgresql connection
rails new . --api -d postgresql
# init database
rake db:create && rake db:migrate
Added gems, edit file: Gemfile
gem 'dotenv-rails'
gem 'elasticsearch-model'
gem 'elasticsearch-rails'
gem 'paperclip', '~> 5.0'
gem 'sidekiq'
Executed bundle install to install the new gems.
Updated the application configuration to set Sidekiq as the Active Job queue adapter and enable Elasticsearch logging, edit file: config/application.rb
# ...snip...
require 'elasticsearch/rails/instrumentation'

module DockerRailsTikaElasticsearch
  class Application < Rails::Application
    # ...snip...
    config.active_job.queue_adapter = :sidekiq
  end
end
Added a dotenv file for development environment variables, new file: .env.development
ELASTICSEARCH_HOST=localhost
POSTGRES_HOST=localhost
POSTGRES_PASSWORD=postgres
POSTGRES_USER=postgres
RAILS_HOST=localhost
REDIS_HOST=localhost
TIKA_HOST=localhost
Updated database config to use environment variables, edit file: config/database.yml
default: &default
  adapter: postgresql
  encoding: unicode
  host: <%= ENV.fetch('POSTGRES_HOST', 'localhost') %>
  password: <%= ENV.fetch('POSTGRES_PASSWORD', 'postgres') %>
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  username: <%= ENV.fetch('POSTGRES_USER', 'postgres') %>
Added an Elasticsearch initializer to set the host from an environment variable, new file: config/initializers/elasticsearch.rb
Elasticsearch::Model.client = Elasticsearch::Client.new(host: ENV.fetch('ELASTICSEARCH_HOST', 'localhost'))
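To sanity check the connection, the client can be asked for cluster health from a Rails console. This is not part of the original setup, just a quick check using the standard elasticsearch-ruby API:
# rails console
# returns a hash including the cluster status: 'green', 'yellow', or 'red'
Elasticsearch::Model.client.cluster.health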
Added a Sidekiq initializer to set the host from an environment variable and configure the server and client, new file: config/initializers/sidekiq.rb
redis_url = "redis://#{ENV.fetch('REDIS_HOST', 'localhost')}:6379/0"

Sidekiq.configure_server do |config|
  config.redis = { url: redis_url }
end

Sidekiq.configure_client do |config|
  config.redis = { url: redis_url }
end
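Similarly, the Redis connection can be verified from the console; Sidekiq.redis yields a pooled connection and ping should return "PONG" when REDIS_HOST is reachable (again, just a quick check, not part of the original post):
# rails console
Sidekiq.redis { |conn| conn.ping }
# => "PONG"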
Added a new Rails migration to create the table for the FileUpload model.
class CreateFileUploads < ActiveRecord::Migration[5.0]
  def change
    create_table :file_uploads do |t|
      t.timestamps
    end
  end
end
Executed the paperclip generator (rails generate paperclip file_upload document) to create a migration for the attachment, which created this migration:
class AddAttachmentDocumentToFileUploads < ActiveRecord::Migration[5.0]
  def self.up
    change_table :file_uploads do |t|
      # adds the document_file_name, document_content_type,
      # document_file_size, and document_updated_at columns
      t.attachment :document
    end
  end

  def self.down
    remove_attachment :file_uploads, :document
  end
end
I then created the FileUpload model (see comments for functionality), new file: app/models/file_upload.rb
class FileUpload < ApplicationRecord
  include Elasticsearch::Model

  # used to kick off Sidekiq job after model is saved (committed)
  after_commit :process_document

  # used to send model data to Elasticsearch when indexed
  attr_accessor :document_content

  # paperclip integration
  has_attached_file :document

  # model validation per paperclip attachment
  validates :document, attachment_presence: true
  validates_attachment_content_type :document, content_type: 'application/pdf'

  # Elasticsearch mapping for document content
  mapping do
    indexes :document_content, type: 'multi_field' do
      indexes :document_content
      indexes :raw, index: :no
    end
  end

  # Elasticsearch interface to construct document structure
  def as_indexed_json(options = {})
    as_json.merge(document_content: document_content)
  end

  # loads model data from Elasticsearch
  def from_elasticsearch
    search_definition = {
      query: {
        filtered: {
          filter: {
            term: {
              _id: id
            }
          }
        }
      }
    }
    begin
      self.class.__elasticsearch__.search(search_definition).first._source
    rescue => e
      # the document may not be indexed yet; wait briefly and retry
      sleep 1
      retry
    end
  end

  # called from Sidekiq job
  def set_document_content
    self.document_content = document_content_from_tika
    __elasticsearch__.index_document
  end

  # loads document contents from Elasticsearch
  def document_content_from_elasticsearch
    from_elasticsearch[:document_content]
  end

  # loads document contents from Tika REST API
  def document_content_from_tika
    meta_data = JSON.parse(`curl -H "Accept: application/json" -T "#{document.path}" http://#{ENV['TIKA_HOST']}:9998/meta`)
    `curl -X PUT --data-binary "@#{document.path}" --header "Content-type: #{meta_data['Content-Type']}" http://#{ENV['TIKA_HOST']}:9998/tika --header "Accept: text/plain"`.strip
  end

  private

  # after commit hook to create Sidekiq job
  def process_document
    DocumentProcessorJob.perform_later self
  end
end
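The model shells out to curl for the two Tika REST calls (metadata, then plain-text extraction). The same requests could be made with Ruby's built-in Net::HTTP; the sketch below is a hypothetical drop-in replacement for document_content_from_tika, not the code used in this post, so treat the details (no timeouts or error handling) as assumptions:
require 'json'
require 'net/http'

# hypothetical alternative to document_content_from_tika, defined on FileUpload
def document_content_via_net_http
  tika = URI("http://#{ENV.fetch('TIKA_HOST', 'localhost')}:9998")
  body = File.binread(document.path)

  Net::HTTP.start(tika.host, tika.port) do |http|
    # first request: fetch metadata to discover the content type
    meta_request = Net::HTTP::Put.new('/meta', 'Accept' => 'application/json')
    meta_request.body = body
    meta_data = JSON.parse(http.request(meta_request).body)

    # second request: ask Tika for the plain-text extraction
    text_request = Net::HTTP::Put.new('/tika',
                                      'Accept' => 'text/plain',
                                      'Content-Type' => meta_data['Content-Type'])
    text_request.body = body
    http.request(text_request).body.strip
  end
end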
I created a job class (run by Sidekiq via Active Job) which calls back to an instance method on the model, new file: app/jobs/document_processor_job.rb
class DocumentProcessorJob < ApplicationJob
  queue_as :default

  def perform(file_upload)
    file_upload.set_document_content
  end
end
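Because the model passes itself to perform_later, Active Job serializes the record with GlobalID and reloads it from the database before calling perform in the Sidekiq process. A job can also be enqueued manually from the console (a small sketch assuming a saved record already exists):
# rails console
file_upload = FileUpload.last
DocumentProcessorJob.perform_later(file_upload) # pushes the job onto the :default queue in Redis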
Next I added a migration to force create the Elasticsearch index.
class CreateFileUploadIndex < ActiveRecord::Migration[5.0]
  def change
    FileUpload.__elasticsearch__.create_index! force: true
  end
end
Lastly, I added a simple rake task to create a new FileUpload model instance and output the contents of the document stored in Elasticsearch, new file: lib/tasks/file_upload.rake
namespace :file_upload do
  desc 'File upload test'
  task test: :environment do
    file_upload = FileUpload.new
    file = File.open Rails.root.join('eric-london-blog.pdf')
    file_upload.document = file
    file_upload.save!
    puts file_upload.document_content_from_elasticsearch
  end
end
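Once the job has run and the document is indexed, the extracted text is searchable through the class-level search method mixed in by Elasticsearch::Model. A quick illustrative example from the console (the query string is arbitrary):
# rails console
response = FileUpload.search('elasticsearch')
response.results.total  # number of matching Elasticsearch documents
response.records.to_a   # the matching FileUpload records loaded from Postgres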
Docker Integration
I defined a basic Dockerfile to update packages and set the working directory:
FROM ruby:2.3.3
RUN apt-get update -qq && apt-get install -y build-essential netcat imagemagick
# node/npm
RUN apt-get install -y nodejs npm
ENV APP_HOME /rails
WORKDIR $APP_HOME
The containers, environment settings, and persistent volumes are all defined in the Docker compose file (docker-compose.yaml). The Sidekiq container shares a filesystem with Rails so it can access the file uploads and send them to the Tika REST API.
version: '2'
services:
  elasticsearch:
    image: elasticsearch:2.4.4
    ports:
      - '9200:9200'
      - '9300:9300'
    volumes:
      - elasticsearch:/usr/share/elasticsearch/data
  postgres:
    image: postgres:latest
    ports:
      - '5432:5432'
    volumes:
      - postgres:/var/lib/postgresql/data
  rails:
    build: .
    command: bin/docker-start-rails
    depends_on:
      - elasticsearch
      - postgres
      - redis
    environment:
      - ELASTICSEARCH_HOST=elasticsearch
      - POSTGRES_HOST=postgres
      - POSTGRES_PASSWORD=postgres
      - POSTGRES_USER=postgres
      - RAILS_ENV=development
      - REDIS_HOST=redis
    ports:
      - '3000:3000'
    volumes:
      - .:/rails
      - bundle:/usr/local/bundle
  redis:
    image: redis:latest
    ports:
      - '6379:6379'
    volumes:
      - redis:/data
  sidekiq:
    command: bin/docker-start-sidekiq
    environment:
      - ELASTICSEARCH_HOST=elasticsearch
      - POSTGRES_HOST=postgres
      - POSTGRES_PASSWORD=postgres
      - POSTGRES_USER=postgres
      - RAILS_ENV=development
      - RAILS_HOST=rails
      - REDIS_HOST=redis
      - TIKA_HOST=tika
    depends_on:
      - elasticsearch
      - postgres
      - rails
      - redis
      - tika
    image: dockerrailstikaelasticsearch_rails
    volumes_from:
      - rails
  tika:
    image: logicalspark/docker-tikaserver:latest
    ports:
      - '9998:9998'
volumes:
  elasticsearch: {}
  bundle: {}
  postgres: {}
  redis: {}
I created a custom start script for Rails, new file: bin/docker-start-rails
#!/bin/sh
set -x
# wait for postgresql
until nc -vz $POSTGRES_HOST 5432 2>/dev/null; do
echo "Postgresql is not ready, sleeping."
sleep 1
done
# wait for elasticsearch
until nc -vz $ELASTICSEARCH_HOST 9200 2>/dev/null; do
echo "Elasticsearch is not ready, sleeping."
sleep 1
done
gem install bundler
bundle check || bundle install
rake db:create
rake db:migrate
bundle exec puma -C config/puma.rb
And a similar start script for Sidekiq, new file: bin/docker-start-sidekiq
#!/bin/sh
set -x
# wait for rails
until nc -vz $RAILS_HOST 3000 2>/dev/null; do
echo "Rails is not ready, sleeping."
sleep 1
done
sidekiq
I then executed the following Docker commands to build and run the containers:
docker-compose build
docker-compose up
Once the containers were up and running, I executed the Rake task inside the container to generate the FileUpload, extract the text from a PDF, and output the content:
$ docker exec -it dockerrailstikaelasticsearch_rails_1 rake file_upload:test | egrep "^\d{4}-\d{2}-\d{2}"
2017-01-26
2016-12-01 Docker Hadoop Streaming Map Reduce Scala Job ( /2016/12/01/docker-hadoop-streaming-map-reduce-scala-job.html)
2016-10-29 Dockerize a Rails development environment integrated with Postgresql, Redis, and Elasticsearch using Docker Compose ( /2016/10/29/dockerize-rails-development-
2016-10-16 Track memory utilization of processes and graph the data via Chartkick, Highcharts, and Rails ( /2016/10/16/track-memory-utilization-of-processes-and-graph-the-
2016-10-09 Create an iOS Swift app to send your location to a Rails API and display on Google Maps ( /2016/10/09/create-an-ios-swift-app-to-send-your-location-to-a-rails-api-
2016-10-06 Elasticsearch autocomplete with a Rails API backend and an Angular frontend ( /2016/10/06/elasticsearch-autocomplete-with-a-rails-api-backend-and-an-angular-
2015-12-15 Integrating a Rails API backend with an Angular frontend using token authentication ( /2015/12/15/integrating-a-rails-api-backend-with-an-angular-frontend-using-
2015-12-09 Sending messages between a Swift webview and a Rails backend using Javascript ( /2015/12/09/sending-messages-between-a-swift-webview-and-a-rails-backend-
2015-07-15 Rails 4: searching for related models with Elasticsearch and tagged content via acts-as-taggable-on ( /2015/07/15/rails-4-searching-for-related-models-with-
2015-06-25 Rails 4 Elasticsearch geospatial searching and model integration ( /2015/06/25/rails-4-elasticsearch-geospatial-searching-and-model-integration.html)
2015-06-13 Using JRuby native Queue to manage work across threads ( /2015/06/13/using-jruby-native-queue-to-manage-work-across-threads.html)
2015-03-08 Using Redis sets ( unique lists) to track relationships between users and their friends, and making friend suggestions ( /2015/03/08/using-redis-sets-unique-lists-to-
2014-11-16 JRuby: Bulk index Rails model data into Elasticsearch via Sidekiq ( Redis queue) ( /2014/11/16/jruby-bulk-index-rails-model-data-into-elasticsearch-via-sidekiq-redis-
2014-11-07 Rails4 Javascript-integrated unit tests via PhantomJS, RSpec, and Capybara ( /2014/11/07/rails4-javascript-integrated-unit-tests-via-phantomjs-rspec-capybara.html)
2014-09-02 Rails 4 Elasticsearch integration with dynamic facets and filters via model concern ( /2014/09/02/rails-4-elasticsearch-integration-with-dynamic-facets-and-filters-via-
2014-08-01 Hadoop, Pig, Ruby, Map/Reduce, on OSX via Homebrew ( /2014/08/01/hadoop-pig-ruby-map-reduce-on-osx-via-homebrew.html)
Source code on Github