Part 1: Rails API
Initial project setup
I added additional gems to the Gemfile
Install the gems via: bundle install
I added a basic CORS configuration to the file: config/initializers/cors.rb. This allows the React frontend to make API requests.
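A minimal sketch of what that initializer might contain, assuming the rack-cors gem and a frontend served from localhost:3000 (both assumptions):

```ruby
# config/initializers/cors.rb -- a sketch; the allowed origin is an assumption
Rails.application.config.middleware.insert_before 0, Rack::Cors do
  allow do
    origins "http://localhost:3000"
    resource "*",
      headers: :any,
      methods: [:get, :post, :put, :patch, :delete, :options, :head]
  end
end
```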
I executed rails active_storage:install and rake db:migrate to create and run the necessary Active Storage database migrations.
I added a migration to create a pictures table, and executed rake db:migrate.
I added the Picture model (file: app/models/picture.rb). The model implements methods for JSON serialization and defines a single Active Storage attachment. The JSON contains an attachment_url with a resized [200, 200] variant.
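A sketch of what that model might look like, assuming the attachment is named attachment and Rails 6-style variant options (both assumptions):

```ruby
# app/models/picture.rb -- a minimal sketch; attachment name and variant options are assumptions
class Picture < ApplicationRecord
  has_one_attached :attachment

  def as_json(options = {})
    super(options).merge(
      attachment_url: Rails.application.routes.url_helpers.rails_representation_url(
        attachment.variant(resize_to_limit: [200, 200]),
        only_path: true
      )
    )
  end
end
```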
The controller (file: app/controllers/pictures_controller.rb) implements the index and create methods.
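A sketch of the two actions, assuming the file is posted as params[:picture][:attachment] (the param name is an assumption):

```ruby
# app/controllers/pictures_controller.rb -- a minimal sketch; the param name is an assumption
class PicturesController < ApplicationController
  def index
    render json: Picture.all
  end

  def create
    picture = Picture.new
    picture.attachment.attach(params[:picture][:attachment])

    if picture.save
      render json: picture, status: :created
    else
      render json: picture.errors, status: :unprocessable_entity
    end
  end
end
```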
Last I added the picture controller routes to the file: config/routes.rb
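Roughly, routing only the two implemented actions:

```ruby
# config/routes.rb -- a sketch
Rails.application.routes.draw do
  resources :pictures, only: [:index, :create]
end
```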
Part 2: Testing
From the console, I changed into a directory containing test images to upload.
Next I set up RSpec for unit tests. I executed rails generate rspec:install to generate the configuration files.
I added a DatabaseCleaner strategy and included FactoryBot methods in the file: spec/rails_helper.rb
I added a FactoryBot factory for the picture model, file: spec/factories/pictures.rb, and copied the Rails logo into spec/fixtures/files/
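A sketch of the factory; the exact fixture filename and content type are assumptions:

```ruby
# spec/factories/pictures.rb -- a sketch; the fixture filename is an assumption
FactoryBot.define do
  factory :picture do
    after(:build) do |picture|
      picture.attachment.attach(
        io: File.open(Rails.root.join("spec/fixtures/files/rails-logo.png")),
        filename: "rails-logo.png",
        content_type: "image/png"
      )
    end
  end
end
```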
Here is a sample controller test, file: spec/controllers/pictures_controller_spec.rb
I executed rspec to ensure the tests run successfully.
Part 3: React front end
I created a new React project.
I included the Bootstrap CSS, file: src/index.js
I added a constants file to define the API host URL, new file: src/constants.js
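Something along these lines; the exported name and port are assumptions:

```javascript
// src/constants.js -- a sketch; the export name and port are assumptions
export const API_HOST = 'http://localhost:3001';
```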
I revised the main App component to include the Pictures component, file: src/App.js
I created a basic Pictures component (file: src/Pictures.js). On mount, it loads the existing pictures from the API and renders them in a defined number of columns. It also provides a file input which submits (on change) to create a new picture via the API.
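A sketch of such a component; the API_HOST import, column count, and param names are assumptions:

```jsx
// src/Pictures.js -- a minimal sketch; API_HOST, COLUMNS, and param names are assumptions
import React, { Component } from 'react';
import { API_HOST } from './constants';

const COLUMNS = 4;

class Pictures extends Component {
  state = { pictures: [] };

  componentDidMount() {
    // load existing pictures on mount
    fetch(`${API_HOST}/pictures`)
      .then(response => response.json())
      .then(pictures => this.setState({ pictures }));
  }

  handleFileChange = (event) => {
    // submit the selected file as multipart form data
    const data = new FormData();
    data.append('picture[attachment]', event.target.files[0]);
    fetch(`${API_HOST}/pictures`, { method: 'POST', body: data })
      .then(response => response.json())
      .then(picture =>
        this.setState({ pictures: [...this.state.pictures, picture] })
      );
  };

  render() {
    const { pictures } = this.state;
    return (
      <div>
        <input type="file" onChange={this.handleFileChange} />
        <div className="row">
          {pictures.map(picture => (
            <div key={picture.id} className={`col-sm-${12 / COLUMNS}`}>
              <img src={picture.attachment_url} alt="" className="img-fluid" />
            </div>
          ))}
        </div>
      </div>
    );
  }
}

export default Pictures;
```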
Last I added a bit of CSS to improve the pictures layout, file: src/App.css
The React front end can be started via: npm start.
A screenshot:
I installed the Java JDK from Oracle; Spark, Hadoop, Postgresql, and Scala via Homebrew; and downloaded Apache Zeppelin manually.
Part 1: Hadoop Setup
Ensure you can ssh to localhost without a password.
Changes I made to Hadoop configuration files, located in $HADOOP_CONF_DIR:
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
Prepare HDFS and start Hadoop services
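For a single-node setup, the typical sequence looks roughly like this (assuming the Hadoop sbin scripts are on the PATH):

```bash
# format the namenode (first run only), then start HDFS and YARN
hdfs namenode -format
start-dfs.sh
start-yarn.sh

# create a home directory for the current user and confirm the daemons are up
hdfs dfs -mkdir -p /user/$(whoami)
jps
```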
Test HDFS, Hadoop, MapReduce:
Part 2: Postgresql Setup
Ensure Postgresql is running
Create a Postgresql database and user for Spark development
Part 3: Apache Zeppelin Setup
I encountered an issue installing Apache Zeppelin via homebrew, so I manually downloaded the full package.
Part 4: Scala and Spark development via Zeppelin
I added my Postgresql credentials for the JDBC interpreter
I added a new Zeppelin Notebook (with default interpreter: spark/scala) and began adding paragraphs.
In my first paragraph, I included the postgresql JDBC driver jar, ex:
Ensure the Postgresql driver is loaded
Create 2 Postgresql tables which I plan to populate and join for demonstration purposes
Define credentials in Scala code and create an initial JDBC DataFrame connection variable for reading and writing
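In a Zeppelin spark paragraph this might look roughly like the following; the database name, credentials, and table names are assumptions:

```scala
// a sketch of the JDBC connection setup; database name and credentials are assumptions
import java.util.Properties

val jdbcUrl = "jdbc:postgresql://localhost:5432/spark_dev"
val connectionProperties = new Properties()
connectionProperties.setProperty("user", "spark")
connectionProperties.setProperty("password", "secret")
connectionProperties.setProperty("driver", "org.postgresql.Driver")

// jdbcUrl and connectionProperties are then passed to spark.read.jdbc / DataFrame.write.jdbc
// in the reading and writing paragraphs below
```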
Create a list of Account names as a DataFrame
Write Accounts DataFrame to Postgresql table
Load Accounts (with IDs) into a new DataFrame
Collect a list of Account IDs to populate foreign keys in the other table
Create a function to randomly select an Account ID
Create a list of Report names and a function to select a random name
Generate a million row DataFrame containing randomized Account IDs and Report names to populate the Reports table
Write DataFrame rows to the Postgresql Reports table
Load the Reports table into a DataFrame
Using Spark/SQL to join the tables together
Using Spark/Scala to join the tables together
Showing an aggregate count of Report records per Account
Joining and counting the records
For the final Spark operation, write out a CSV file to HDFS containing Report data for each Account.
Inspect CSV files in HDFS
Copy HDFS CSV files to local filesystem
Count records in each CSV file (the extra ten rows are the CSV headers)
Ensuring each CSV file was partitioned by Account
Inspecting the contents of a CSV file
Part 1: miniDC/OS installation
Initial installation via Homebrew.
Checking network setup.
Create local DCOS cluster.
At this point, the web interface should be accessible (ex: http://172.17.0.3), but you will need to authenticate using the dcos cli.
Part 2: DCOS CLI and cluster setup
Install the DCOS CLI tool via Homebrew and setup the DCOS instance.
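With a recent CLI, the setup is roughly as follows; the cluster URL is the example address from the minidcos output:

```bash
# install the CLI and point it at the local cluster
brew install dcos-cli
dcos cluster setup http://172.17.0.3
dcos auth login

# quick sanity check
dcos node
```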
At this point you should be able to authenticate to the web interface.
Mesosphere DC/OS Dashboard:
Show nodes:
In addition, a health report is available at the telemetry URL: http://172.17.0.3/system/health/v1/report
Part 3: Spark package installation
I used Spark to demonstrate installing a package.
Viewing Spark from the services page
The Marathon UI can be accessed directly or from the services page
Part 4: Deploying a Marathon Pod
To demonstrate deploying a Marathon Pod I created 3 containers (Rails API, Postgresql, and Nginx). I put the full source code for these containers on GitHub. I also provided a docker-compose file to test the container connectivity outside DCOS/Marathon.
I created a script to build, tag, and push each Docker container to Docker Hub. file: rails-stack/build-images.sh
I created an example pod JSON file for the three containers. file: rails-stack/rails-stack-pod.json
Deploying the Marathon Pod and testing container functionality
Viewing the Rails stack service in DCOS
…Next part coming soon!
Part 1: Rails API
Scaffold Rails project
Create Post model with title and body
Update the migration to disallow null values for the fields. Edit file: db/migrate/SOMEDATE_create_posts.rb
Execute rake db:migrate to create the posts table.
Add basic model validation, edit file: app/models/post.rb
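Roughly:

```ruby
# app/models/post.rb -- a minimal sketch of the validations
class Post < ApplicationRecord
  validates :title, presence: true
  validates :body, presence: true
end
```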
Create controller
Update the generated controller file to set the required params and remove location: @post from the create method. Edit file: app/controllers/api/posts_controller.rb
Add API namespaced controller routes, edit file: config/routes.rb
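Roughly:

```ruby
# config/routes.rb -- a sketch of the namespaced routes
Rails.application.routes.draw do
  namespace :api do
    resources :posts
  end
end
```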
Enable CORS for frontend access by adding gem 'rack-cors' to the Gemfile and executing bundle install. Update the CORS initializer, file: config/initializers/cors.rb
Start Rails API via: rails s -p 3000 -b 0.0.0.0
Test API endpoints via CURL
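For example, creating and then listing posts (host and port per the rails s command above):

```bash
# create a post
curl -H "Content-Type: application/json" \
  -d '{"post": {"title": "First post", "body": "Hello world"}}' \
  http://localhost:3000/api/posts

# list posts
curl http://localhost:3000/api/posts
```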
Part 2: React frontend
To scaffold the React frontend I decided to use reactstrap for Bootstrap 4 components and React Router for navigational components.
To get started I added a JS module to handle all Rails API calls using the Fetch API. Each module export method returns an Array containing error (Boolean) and data (or errors). new file: src/Api.js
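A sketch of such a module; the API URL and the exact endpoint paths are assumptions:

```javascript
// src/Api.js -- a minimal sketch; API_URL and endpoint paths are assumptions
const API_URL = 'http://localhost:3000/api';

async function request(path, options = {}) {
  try {
    const response = await fetch(`${API_URL}${path}`, {
      headers: { 'Content-Type': 'application/json' },
      ...options,
    });
    const data = response.status === 204 ? null : await response.json();
    // [error (Boolean), data or errors]
    return [!response.ok, data];
  } catch (error) {
    return [true, error];
  }
}

export const getPosts = () => request('/posts');
export const getPost = (id) => request(`/posts/${id}`);
export const createPost = (post) =>
  request('/posts', { method: 'POST', body: JSON.stringify({ post }) });
export const updatePost = (id, post) =>
  request(`/posts/${id}`, { method: 'PATCH', body: JSON.stringify({ post }) });
export const deletePost = (id) => request(`/posts/${id}`, { method: 'DELETE' });
```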
Next I updated the main App file and integrated with React Router. It defines a Router component with a list of Routes mapping to components. edit file: src/App.js
Next is the top level Posts component. It fetches existing posts, conditionally renders the PostsTable, and provides a button to add a new post. new file: src/Posts.jsx
Here is the PostsTable component; it utilizes ReactStrap for Bootstrap form components and provides a link to edit and delete each post. new file: src/PostsTable.jsx
Here is the PostForm component; it is used for editing and creating posts. On mount it conditionally (based on passed params) fetches the existing post and sets the initial state. As the user enters field data, onChange callbacks set the state of the component, and onSubmit the API is called to save the post. new file: src/PostForm.jsx
Below is the PostDelete component. It simply calls the Api delete method and redirects the user back to the Posts component. new file: src/PostDelete.jsx
Last I updated the index.js file to add the Bootstrap CSS include, edit file: src/index.js
Frontend screenshot
I first defined a RedisBase parent class that the workers and producer will inherit from. It contains all the Redis client methods. On initialize it creates a connection to Redis from environment variables. new file: redis_base.rb
I defined the producer class (RedisProducer) below. In a loop, it queues work by pushing a task into the work queue, and then publishes to the pub/sub channel to inform subscribers there is new work to complete. new file: producer.rb
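A sketch of the producer; the queue/channel names and the @redis client exposed by RedisBase are assumptions:

```ruby
# producer.rb -- a minimal sketch; names and the @redis client from RedisBase are assumptions
require_relative "redis_base"
require "securerandom"

class RedisProducer < RedisBase
  WORK_QUEUE = "work_queue".freeze
  CHANNEL    = "work_channel".freeze

  def run
    loop do
      task = SecureRandom.uuid
      @redis.lpush(WORK_QUEUE, task)       # queue the task
      @redis.publish(CHANNEL, "new_work")  # tell subscribed workers there is new work
      sleep 1
    end
  end
end

RedisProducer.new.run if __FILE__ == $PROGRAM_NAME
```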
Next I defined the worker class (RedisWorker). On initialize, it creates a pub/sub client, checks if there is incomplete work to resume, and then subscribes to the pub/sub channel for new work tasks. new file: worker.rb
I created a monitor script to show queued tasks and the completed tasks for each worker. new file: monitor.rb
Here is a Dockerfile definition to run the workers and producer Ruby code, new file: Dockerfile
I defined a docker compose file to start Redis, a producer, and 10 workers. new file: docker-compose.yml
I started the docker containers via compose and then executed the monitor script inside the producer container to show the results.
First I created a RabbitMQ base class to contain shared functionality between the producer and workers. On initialize, the base class waits for the RabbitMQ and Elasticsearch services to be available before starting. file: rabbitmq_base.rb
The producer subclass publishes a set number of tasks to complete and then exits. file: producer.rb
The worker subclass subscribes to the queue, checks if the task matches an available worker method, and then generates a person document in Elasticsearch. file: worker.rb
I created a Ruby-based Dockerfile for the producer and workers, file: Dockerfile
I used docker compose to create a cluster of services. I implemented a deploy/replicas configuration to spin up 10 worker apps to distribute the load. file: docker-compose.yml
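A sketch of that compose file; the image tags and service names are assumptions:

```yaml
# docker-compose.yml -- a sketch; image tags and service names are assumptions
version: "3.7"
services:
  rabbitmq:
    image: rabbitmq:3-management
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.10.2
    environment:
      - discovery.type=single-node
  producer:
    build: .
    command: ruby producer.rb
    depends_on:
      - rabbitmq
      - elasticsearch
  worker:
    build: .
    command: ruby worker.rb
    depends_on:
      - rabbitmq
      - elasticsearch
    deploy:
      replicas: 10
```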
Here are the commands I executed to run the apps and verify the results:
Project setup
Extraction script, file: extract.py
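A minimal sketch of such a script using OpenCV's Haar cascade face detector; the actual detection library used may differ:

```python
# extract.py -- a sketch, assuming OpenCV; the original may use a different detector
import sys

import cv2


def extract_faces(image_path):
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    # crop each detected face and write it out as person_N.jpg
    for i, (x, y, w, h) in enumerate(faces, start=1):
        cv2.imwrite(f"person_{i}.jpg", image[y:y + h, x:x + w])


if __name__ == "__main__":
    extract_faces(sys.argv[1])
```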
Example usage:
My test image:
Output images:
person_1.jpg
person_2.jpg
person_3.jpg
A pull request should be small and enforce the single responsibility principle, as in the “S” in “SOLID”. If your pull request is too complex, separate functional components into multiple pull requests.
Create a pull request template for your repository. This will help users fill in important information when creating a new pull request.
Example file: .github/pull_request_template.md, contents:
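For example, a minimal template might look like:

```markdown
## What does this PR do?

## How should this be tested?

## Screenshots (if appropriate)

## Related issues
```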
Provide important information in a pull request description, answering the following questions:
If the feature change can be shown visually, provide a screenshot or GIF. I use LICEcap to capture functionality in an animated GIF.
Integrate your Github project with continuous integration service(s); example: Jenkins, CodeShip, CircleCI, Travis. A pull request should have a successful build before it is reviewed by your team.
Code tips:
Append ?w=1 to a diff URL to hide whitespace-only changes.
Use code blocks and syntax highlighting (using triple backticks) to make code comments more legible.
Preview:
Use task lists in your pull request description to track progress.
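The markdown source uses checkbox syntax, for example:

```markdown
- [x] Add model validations
- [x] Update controller tests
- [ ] Update documentation
```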
Preview:
Fully utilize the functionality in the discussion sidebar. Request reviewers relevant to your project and use the pull request review workflow. Assign the team members who are responsible for the pull request. Define workflow labels to track the status of a pull request (ex: On hold, Do not review, Help wanted).
Mention @somebody in comments to involve another Github user in conversation.
As a reviewer of a pull request, add inline code comments from the Files changed tab. Additional commits can be pushed to the PR branch. When the feedback has been addressed, click “Resolve conversation” on the Conversation tab. A running conversation on a pull request is good collaboration, and the history can be helpful in the future. When the pull request has been approved, be sure to squash all commits and rebase before merging.
Initial project setup
I created a Terraform file to setup the backend S3 state configuration and AWS provider version, new file: main.tf
Terraform file for configurable parameters, variables.tf
Terraform S3 bucket and bucket notification (to lambda), s3.tf
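A sketch of that file; the resource names, the variable, and the lambda permission reference (defined in lambda.tf) are assumptions:

```hcl
# s3.tf -- a minimal sketch; resource and variable names are assumptions
resource "aws_s3_bucket" "uploads" {
  bucket = var.bucket_name
}

resource "aws_s3_bucket_notification" "uploads_notification" {
  bucket = aws_s3_bucket.uploads.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.meta_lambda.arn
    events              = ["s3:ObjectCreated:*"]
  }

  # the invoke permission is assumed to be defined in lambda.tf
  depends_on = [aws_lambda_permission.allow_s3]
}
```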
This following file defines the lambda resources, its IAM role, policy, and permissions. file: lambda.tf
Below is the Node.js Lambda script. It pulls environment variables, defines the exports handler, receives the S3 bucket notification event, collects the metadata from the S3 object/file path, makes an S3 HEAD request to get the S3 metadata, and publishes to the SNS topic. file: meta_lambda.js
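A sketch of such a handler, assuming the aws-sdk v2 client and a TOPIC_ARN environment variable (both assumptions):

```javascript
// meta_lambda.js -- a minimal sketch; TOPIC_ARN and attribute names are assumptions
const AWS = require('aws-sdk');

const s3 = new AWS.S3();
const sns = new AWS.SNS();

exports.handler = async (event) => {
  const record = event.Records[0].s3;
  const bucket = record.bucket.name;
  const key = decodeURIComponent(record.object.key.replace(/\+/g, ' '));

  // HEAD request to read the object's user metadata
  const head = await s3.headObject({ Bucket: bucket, Key: key }).promise();

  await sns.publish({
    TopicArn: process.env.TOPIC_ARN,
    Message: JSON.stringify({ bucket, key, metadata: head.Metadata }),
    MessageAttributes: {
      // message attributes can drive the filtered SQS subscription
      object_key: { DataType: 'String', StringValue: key },
    },
  }).promise();
};
```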
Terraform SNS topic and filtered SQS subscription, file: sns.tf
Terraform SQS queue and its IAM policy, file: sqs.tf
I put my configuration variables in secrets.auto.tfvars
Here is a BASH script to pass environment variables to the Terraform backend configuration, and execute Terraform init, plan, and apply. file: main.sh
To test SQS queue delivery I created an SQS client. First, add the SQS NPM dependency:
And created the NodeJS script, file: sqs-client.js
I executed the terraform apply script, pushed a file to S3 with metadata, and executed the SQS client script to E2E test this functionality:
I used Gradle as the build tool and for dependency management. I created a new project via: gradle init.
I added the dependencies to the build file: kafka-log-aggregator/build.gradle
I created a class to represent a log entry consisting of a code, message, and the aggregate count. It uses Gson to deserialize and serialize JSON. new file: kafka-log-aggregator/src/main/java/LogEntry.java
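A sketch of such a class; the field and method names are assumptions based on the description:

```java
// LogEntry.java -- a minimal sketch; field and method names are assumptions
import com.google.gson.Gson;

public class LogEntry {
    private static final Gson GSON = new Gson();

    private String code;
    private String message;
    private long count;

    public LogEntry() {}

    public LogEntry(String code, String message, long count) {
        this.code = code;
        this.message = message;
        this.count = count;
    }

    public String getCode() { return code; }
    public String getMessage() { return message; }
    public long getCount() { return count; }

    // serialize this entry to JSON
    public String toJson() { return GSON.toJson(this); }

    // deserialize an entry from JSON
    public static LogEntry fromJson(String json) {
        return GSON.fromJson(json, LogEntry.class);
    }
}
```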
Next I created the LogAggregator class, which will be used by the Kafka Streams app to contain and aggregate all the log entries, file: kafka-log-aggregator/src/main/java/LogAggregator.java
The LogAggregator class requires a serializer class to convert it to a byte array. file: kafka-log-aggregator/src/main/java/LogAggregatorSerializer.java
And here is the class to deserialize from the byte array. file: kafka-log-aggregator/src/main/java/LogAggregatorDeserializer.java
Next I created the main class to build and run the log aggregator Kafka Streams app. file: kafka-log-aggregator/src/main/java/LogAggregatorApp.java
I added a Scala unit test to ensure the aggregation of logs works as planned. file: kafka-log-aggregator/src/test/scala/LogAggregatorAppTest.scala
I created a simple Kafka producer Ruby script to pipe messages onto the topic, wait a while (in this case a minute, for the next session window), and pipe some more. file: kafka-log-aggregator/ruby/producer.rb
At this point I was ready to start Zookeeper, Kafka, and build/run the streams app:
I develop on a Mac using Homebrew or Docker; here is my environment for this post: