Integrate the Tika REST service with Rails paperclip attachments to extract text from PDF documents and store in Elasticsearch
In this post I’ll share some code to integrate the Tika REST service with Rails Paperclip file attachments, extract text from PDF documents, and store the data in Elasticsearch, with everything running in Docker containers.
I installed my Docker dependencies via Homebrew on OS X.
$ brew list --versions | grep -i docker
docker 1.12.5
docker-compose 1.10.0
docker-machine 0.9.0
docker-machine-nfs 0.4.1
# [optional] create VirtualBox docker-machine with increased resources
docker-machine create --virtualbox-memory "4096" --virtualbox-disk-size "40000" -d virtualbox docker-machine
# [optional] enable NFS support
docker-machine-nfs docker-machine
Initial Rails setup
# rvm files
echo docker_rails_tika_elasticsearch > .ruby-gemset
echo ruby-2.3.3 > .ruby-version
cd . # re-enter the directory so rvm picks up the gemset/version files
gem install rails
rails -v
Rails 5.0.1
# new project with postgresql connection
rails new . --api -d postgresql
# init database
rake db:create && rake db:migrate
Added the following gems, edit file: Gemfile
gem 'dotenv-rails'
gem 'elasticsearch-model'
gem 'elasticsearch-rails'
gem 'paperclip', '~> 5.0'
gem 'sidekiq'
Executed bundle install to install the new gems.
Updated the application configuration to set Sidekiq as the Active Job queue adapter and enable Elasticsearch logging, edit file: config/application.rb
# ...snip...
require 'elasticsearch/rails/instrumentation'
module DockerRailsTikaElasticsearch
class Application < Rails::Application
# ...snip...
config.active_job.queue_adapter = :sidekiq
end
end
Added a dotenv file for development environment variables, new file: .env.development
ELASTICSEARCH_HOST=localhost
POSTGRES_HOST=localhost
POSTGRES_PASSWORD=postgres
POSTGRES_USER=postgres
RAILS_HOST=localhost
REDIS_HOST=localhost
TIKA_HOST=localhost
Updated database config to use environment variables, edit file: config/database.yml
default: &default
adapter: postgresql
encoding: unicode
host: <%= ENV.fetch('POSTGRES_HOST', 'localhost') %>
password: <%= ENV.fetch('POSTGRES_PASSWORD', 'postgres') %>
pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
username: <%= ENV.fetch('POSTGRES_USER', 'postgres') %>
Added an Elasticsearch initializer to set the host from an environment variable, new file: config/initializers/elasticsearch.rb
Elasticsearch::Model.client = Elasticsearch::Client.new(host: ENV.fetch('ELASTICSEARCH_HOST', 'localhost'))
Added a Sidekiq initializer to set the host from an environment variable and configure the server and client, new file: config/initializers/sidekiq.rb
redis_url = "redis://#{ENV.fetch('REDIS_HOST', 'localhost')}:6379/0"
Sidekiq.configure_server do |config|
config.redis = { url: redis_url }
end
Sidekiq.configure_client do |config|
config.redis = { url: redis_url }
end
Added a new Rails migration to create the table for the FileUpload model.
class CreateFileUploads < ActiveRecord::Migration[5.0]
def change
create_table :file_uploads do |t|
t.timestamps
end
end
end
Executed the paperclip generator (rails generate paperclip file_upload document) to create a migration for the attachment, which created this migration:
class AddAttachmentDocumentToFileUploads < ActiveRecord::Migration[5.0]
def self.up
change_table :file_uploads do |t|
t.attachment :document
end
end
def self.down
remove_attachment :file_uploads, :document
end
end
I then created the FileUpload model (see comments for functionality), new file: app/models/file_upload.rb
class FileUpload < ApplicationRecord
include Elasticsearch::Model
# used to kick off Sidekiq job after model is saved (committed)
after_commit :process_document
# used to send model data to Elasticsearch when indexed
attr_accessor :document_content
# paperclip integration
has_attached_file :document
# model validation per paperclip attachment
validates :document, attachment_presence: true
validates_attachment_content_type :document, content_type: 'application/pdf'
# Elasticsearch mapping for document content
mapping do
indexes :document_content, type: 'multi_field' do
indexes :document_content
indexes :raw, index: :no
end
end
# Elasticsearch interface to construct document structure
def as_indexed_json(options={})
as_json.merge(document_content: document_content)
end
# loads model data from Elasticsearch
def from_elasticsearch
search_definition = {
query: {
filtered: {
filter: {
term: {
_id: id
}
}
}
}
}
begin
self.class.__elasticsearch__.search(search_definition).first._source
rescue
# the document may not be searchable yet (indexing is asynchronous),
# so wait briefly before retrying
sleep 1
retry
end
end
# called from Sidekiq job
def set_document_content
self.document_content = document_content_from_tika
__elasticsearch__.index_document
end
# loads document contents from Elasticsearch
def document_content_from_elasticsearch
from_elasticsearch[:document_content]
end
# loads document contents from Tika REST API
def document_content_from_tika
meta_data = JSON.parse(`curl -H "Accept: application/json" -T "#{document.path}" http://#{ENV['TIKA_HOST']}:9998/meta`)
`curl -X PUT --data-binary "@#{document.path}" --header "Content-type: #{meta_data['Content-Type']}" http://#{ENV['TIKA_HOST']}:9998/tika --header "Accept: text/plain"`.strip
end
private
# after commit hook to create Sidekiq job
def process_document
DocumentProcessorJob.perform_later self
end
end
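As an aside, the curl shell-outs in document_content_from_tika could also be written with Ruby’s stdlib Net::HTTP. The sketch below only builds the PUT request that Tika’s /tika endpoint expects (binary body, the source file’s Content-Type, Accept: text/plain) and leaves the actual send commented out; treat it as a hedged alternative to the backtick calls above, not the code this post runs:

```ruby
require 'net/http'
require 'uri'

tika_host = ENV.fetch('TIKA_HOST', 'localhost')
uri = URI("http://#{tika_host}:9998/tika")

# Build the PUT request: PDF bytes as the body, plain text back.
request = Net::HTTP::Put.new(uri)
request['Content-Type'] = 'application/pdf'
request['Accept']       = 'text/plain'
# request.body = File.binread(document.path)
# text = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }.body.strip
```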
I created an Active Job job (processed by Sidekiq) which calls back to an instance method in the model, new file: app/jobs/document_processor_job.rb
class DocumentProcessorJob < ApplicationJob
queue_as :default
def perform(file_upload)
file_upload.set_document_content
end
end
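One caveat: the bare retry in FileUpload#from_elasticsearch will spin forever if the document never becomes searchable. A bounded variant, sketched here in plain Ruby with a hypothetical with_retries helper, re-raises after a fixed number of attempts instead:

```ruby
# Hypothetical helper: run a block up to `attempts` times,
# sleeping `wait` seconds between failures, then re-raise.
def with_retries(attempts: 5, wait: 1)
  tries = 0
  begin
    yield
  rescue => e
    tries += 1
    raise e if tries >= attempts
    sleep wait
    retry
  end
end

# Simulated usage: the block succeeds on the third try.
calls = 0
result = with_retries(attempts: 3, wait: 0) do
  calls += 1
  raise 'not indexed yet' if calls < 3
  :found
end
```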
Next I added a migration to force create the Elasticsearch index.
class CreateFileUploadIndex < ActiveRecord::Migration[5.0]
def change
FileUpload.__elasticsearch__.create_index! force: true
end
end
Lastly, I added a simple Rake task to create a new FileUpload model instance and output the contents of the document stored in Elasticsearch, new file: lib/tasks/file_upload.rake
namespace :file_upload do
desc 'File upload test'
task test: :environment do
file_upload = FileUpload.new
file = File.open Rails.root.join('eric-london-blog.pdf')
file_upload.document = file
file_upload.save!
puts file_upload.document_content_from_elasticsearch
end
end
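For reference, the document that ends up in Elasticsearch is just the model’s attributes plus the transient document_content, since as_indexed_json merges the two. A plain-Ruby sketch of that merge (the attribute names here are hypothetical stand-ins for the model’s as_json output):

```ruby
require 'json'

# Hypothetical column attributes, standing in for the model's as_json output.
record_attributes = { 'id' => 1, 'document_file_name' => 'eric-london-blog.pdf' }
document_content  = 'Text extracted from the PDF by Tika'

# Merge in the transient accessor, as as_indexed_json does.
indexed_json = record_attributes.merge(document_content: document_content)
puts JSON.generate(indexed_json)
```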
Docker Integration
I defined a basic Dockerfile to update packages and set the working directory:
FROM ruby:2.3.3
RUN apt-get update -qq && apt-get install -y build-essential netcat imagemagick
# node/npm
RUN apt-get install -y nodejs npm
ENV APP_HOME /rails
WORKDIR $APP_HOME
The containers, environment settings, and persistent volumes are all defined in the Docker compose file (docker-compose.yaml). The Sidekiq container shares a filesystem with Rails so it can access the file uploads and send them to the Tika REST API.
version: '2'
services:
elasticsearch:
image: elasticsearch:2.4.4
ports:
- '9200:9200'
- '9300:9300'
volumes:
- elasticsearch:/usr/share/elasticsearch/data
postgres:
image: postgres:latest
ports:
- '5432:5432'
volumes:
- postgres:/var/lib/postgresql/data
rails:
build: .
command: bin/docker-start-rails
depends_on:
- elasticsearch
- postgres
- redis
environment:
- ELASTICSEARCH_HOST=elasticsearch
- POSTGRES_HOST=postgres
- POSTGRES_PASSWORD=postgres
- POSTGRES_USER=postgres
- RAILS_ENV=development
- REDIS_HOST=redis
ports:
- '3000:3000'
volumes:
- .:/rails
- bundle:/usr/local/bundle
redis:
image: redis:latest
ports:
- '6379:6379'
volumes:
- redis:/data
sidekiq:
command: bin/docker-start-sidekiq
environment:
- ELASTICSEARCH_HOST=elasticsearch
- POSTGRES_HOST=postgres
- POSTGRES_PASSWORD=postgres
- POSTGRES_USER=postgres
- RAILS_ENV=development
- RAILS_HOST=rails
- REDIS_HOST=redis
- TIKA_HOST=tika
depends_on:
- elasticsearch
- postgres
- rails
- redis
- tika
image: dockerrailstikaelasticsearch_rails
volumes_from:
- rails
tika:
image: logicalspark/docker-tikaserver:latest
ports:
- '9998:9998'
volumes:
elasticsearch: {}
bundle: {}
postgres: {}
redis: {}
I created a custom start script for Rails, new file: bin/docker-start-rails
#!/bin/sh
set -x
# wait for postgresql
until nc -vz $POSTGRES_HOST 5432 2>/dev/null; do
echo "Postgresql is not ready, sleeping."
sleep 1
done
# wait for elasticsearch
until nc -vz $ELASTICSEARCH_HOST 9200 2>/dev/null; do
echo "Elasticsearch is not ready, sleeping."
sleep 1
done
gem install bundler
bundle check || bundle install
rake db:create
rake db:migrate
bundle exec puma -C config/puma.rb
And a similar start script for Sidekiq, new file: bin/docker-start-sidekiq
#!/bin/sh
set -x
# wait for rails
until nc -vz $RAILS_HOST 3000 2>/dev/null; do
echo "Rails is not ready, sleeping."
sleep 1
done
sidekiq
I then executed the following Docker commands to build and run the containers:
docker-compose build
docker-compose up
Once the containers were up and running, I executed the Rake task inside the container to generate the FileUpload, extract the text from a PDF, and output the content:
$ docker exec -it dockerrailstikaelasticsearch_rails_1 rake file_upload:test | egrep "^\d{4}-\d{2}-\d{2}"
2017-01-26
2016-12-01 Docker Hadoop Streaming Map Reduce Scala Job (/2016/12/01/docker-hadoop-streaming-map-reduce-scala-job.html)
2016-10-29 Dockerize a Rails development environment integrated with Postgresql, Redis, and Elasticsearch using Docker Compose (/2016/10/29/dockerize-rails-development-
2016-10-16 Track memory utilization of processes and graph the data via Chartkick, Highcharts, and Rails (/2016/10/16/track-memory-utilization-of-processes-and-graph-the-
2016-10-09 Create an iOS Swift app to send your location to a Rails API and display on Google Maps (/2016/10/09/create-an-ios-swift-app-to-send-your-location-to-a-rails-api-
2016-10-06 Elasticsearch autocomplete with a Rails API backend and an Angular frontend (/2016/10/06/elasticsearch-autocomplete-with-a-rails-api-backend-and-an-angular-
2015-12-15 Integrating a Rails API backend with an Angular frontend using token authentication (/2015/12/15/integrating-a-rails-api-backend-with-an-angular-frontend-using-
2015-12-09 Sending messages between a Swift webview and a Rails backend using Javascript (/2015/12/09/sending-messages-between-a-swift-webview-and-a-rails-backend-
2015-07-15 Rails 4: searching for related models with Elasticsearch and tagged content via acts-as-taggable-on (/2015/07/15/rails-4-searching-for-related-models-with-
2015-06-25 Rails 4 Elasticsearch geospatial searching and model integration (/2015/06/25/rails-4-elasticsearch-geospatial-searching-and-model-integration.html)
2015-06-13 Using JRuby native Queue to manage work across threads (/2015/06/13/using-jruby-native-queue-to-manage-work-across-threads.html)
2015-03-08 Using Redis sets (unique lists) to track relationships between users and their friends, and making friend suggestions (/2015/03/08/using-redis-sets-unique-lists-to-
2014-11-16 JRuby: Bulk index Rails model data into Elasticsearch via Sidekiq (Redis queue) (/2014/11/16/jruby-bulk-index-rails-model-data-into-elasticsearch-via-sidekiq-redis-
2014-11-07 Rails4 Javascript-integrated unit tests via PhantomJS, RSpec, and Capybara (/2014/11/07/rails4-javascript-integrated-unit-tests-via-phantomjs-rspec-capybara.html)
2014-09-02 Rails 4 Elasticsearch integration with dynamic facets and filters via model concern (/2014/09/02/rails-4-elasticsearch-integration-with-dynamic-facets-and-filters-via-
2014-08-01 Hadoop, Pig, Ruby, Map/Reduce, on OSX via Homebrew (/2014/08/01/hadoop-pig-ruby-map-reduce-on-osx-via-homebrew.html)