Rebuild Rails Part 3

Let’s continue our rebuild Rails series by adding support for controllers. We simplify things by assuming a controller/action path format and only supporting GET requests.

We adjust our .call implementation to handle the new path format. We also introduce another method, .render_controller_action, that inspects the path info and instantiates the right controller using the Rails NameController convention.

# lib/tracks.rb
def self.call(env)
  path_info = env["PATH_INFO"]
  if path_info == "/"
    text = Tracks::Controller.render_default_root
  else
    text = Tracks::Controller.render_controller_action(env)
  end

  [200, {"Content-Type" => "text/html"}, [text] ]
end

# lib/tracks/controller.rb
def self.render_controller_action(env)
  path_info = env["PATH_INFO"]
  _, controller, action, after = path_info.split("/")

  controller_class_name = controller.capitalize + "Controller"
  controller_class = Object.const_get(controller_class_name)
  controller_class.new.send(action)
end
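To see the convention in action, here is the same dispatch logic exercised standalone with a toy controller (this PostsController is defined here only for illustration):

```ruby
# Standalone sketch of the dispatch above: path -> controller class -> action.
def render_controller_action(env)
  _, controller, action, _after = env["PATH_INFO"].split("/")
  controller_class = Object.const_get(controller.capitalize + "Controller")
  controller_class.new.send(action)
end

# Toy controller, for illustration only.
class PostsController
  def index
    "hello from tracks /posts/index"
  end
end

render_controller_action("PATH_INFO" => "/posts/index")
# => "hello from tracks /posts/index"
```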

$ ruby spec/application_spec.rb
Run options: --seed 55835
# Running:

.E

Finished in 0.015837s, 126.2865 runs/s, 126.2865 assertions/s.

  1) Error:
CrazyApp::Application#test_0002_should respond with different path:
NameError: uninitialized constant PostsController
    /code/crazy/lib/tracks/controller.rb:13:in `const_get'

When we run the test, we see it fails on PostsController, which is what we expect since we haven’t implemented PostsController yet. Let’s add the controller now.

# app/controllers/posts_controller.rb
class PostsController < Tracks::Controller
  def index
    "hello from tracks /posts/index"
  end
end

# config/application.rb
require "./app/controllers/posts_controller"

$ ruby spec/application_spec.rb
Run options: --seed 2475

# Running:
..

Finished in 0.019152s, 104.4277 runs/s, 208.8555 assertions/s.

2 runs, 4 assertions, 0 failures, 0 errors, 0 skips

Automatic loading

Our tests pass but something’s not right. In Rails, there is no need to require every controller (or pretty much anything else) to make things work. To support this feature in our framework, we need two things:

  • convert the class name PostsController to the file name posts_controller.rb; and
  • auto-require ‘posts_controller’

To tie these two together, we tap into Object.const_missing so our framework knows when a class has been used but not yet loaded.

We also update the $LOAD_PATH to include the app/controllers folder so Ruby knows where to look.

# lib/tracks.rb
require File.expand_path("../tracks/helper", __FILE__)
require File.expand_path("../tracks/object", __FILE__)

# config/application.rb
require './lib/tracks'
$LOAD_PATH << File.expand_path("../../app/controllers", __FILE__)

module CrazyApp
  class Application < Tracks::Application
  end
end

# lib/tracks/helper.rb
module Tracks
  module Helper
    def self.to_underscore(string)
      string.scan(/[A-Z][a-z]+/).
      join('_').
      downcase
    end
  end
end

# lib/tracks/object.rb
class Object
  def self.const_missing(c)
    require Tracks::Helper.to_underscore(c.to_s)
    const_get(c)
  end
end

Our implementation of .to_underscore is limited compared to what’s supported in Rails.

irb(main):019:0> s = "PostsController"
=> "PostsController"
irb(main):020:0> s.scan(/[A-Z][a-z]+/).join('_').downcase
=> "posts_controller"

Also, since we call const_get inside const_missing, you will run into serious trouble if your file does not contain the expected class. Try changing the name of the class inside app/controllers/posts_controller.rb to something else and you will get this error.

$ ruby spec/application_spec.rb
Run options: --seed 60433

# Running:

.E
Finished in 0.043152s, 46.3478 runs/s, 46.3478 assertions/s.

  1) Error:
CrazyApp::Application#test_0002_should respond with different path:
SystemStackError: stack level too deep
    /Users/greg/.rbenv/versions/2.0.0-p247/lib/ruby/2.0.0/forwardable.rb:174

2 runs, 2 assertions, 0 failures, 1 errors, 0 skips
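The recursion happens because the require succeeds but the file never defines the constant, so the const_get inside const_missing triggers const_missing all over again. One possible guard (my own sketch, not from the original implementation) is to remember which constants we are already resolving and fail fast on re-entry:

```ruby
class Object
  def self.const_missing(c)
    @_resolving ||= {}
    # If we re-enter for the same constant, the required file did not
    # define it; raise instead of recursing forever.
    raise NameError, "uninitialized constant #{c}" if @_resolving[c]

    @_resolving[c] = true
    begin
      require c.to_s.scan(/[A-Z][a-z]+/).join("_").downcase
      const_get(c)
    ensure
      @_resolving.delete(c)
    end
  end
end
```

With this guard, a file that defines the wrong class produces a NameError instead of a SystemStackError.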

Rebuild Rails Part 2

Before we continue with our rebuilding series, we should write some tests first :)

# spec/spec_helper.rb
ENV["RAILS_ENV"] ||= "test"

require "rack/test"
require "minitest/autorun"

require File.expand_path("../../config/application", __FILE__)


# spec/application_spec.rb
require_relative "spec_helper"

describe CrazyApp::Application do
  include Rack::Test::Methods

  def app
    CrazyApp::Application
  end

  it "should respond with /" do
    get "/"
    last_response.ok?.must_equal true
    last_response.body.strip.must_equal "hello from index.html"
  end

  it "should respond with different path" do
    get "/posts"
    last_response.ok?.must_equal true
    last_response.body.strip.must_equal "hello from tracks /posts"
  end

end

Just like your typical Rails test setup, we have a common spec_helper file. We use minitest/autorun, which gives us an rspec-style DSL out of the box. For our tests, we need Rack::Test::Methods to use get and the other HTTP methods. We also need an app method that returns our Rack application to make the tests work.

$ ruby spec/application_spec.rb
Run options: --seed 57256

# Running:
..

Finished in 0.015864s, 126.0716 runs/s, 252.1432 assertions/s.

2 runs, 4 assertions, 0 failures, 0 errors, 0 skips

We’re all good. Awesome!

#ruby Array of Hashes Quiz

Found this interesting Ruby quiz from AlphaSights. Given an array of hashes, collapse it into an array of hashes containing one entry per day. You can only reference the :time key and not the rest.

log = [
  {time: 201201, x: 2},
  {time: 201201, y: 7},
  {time: 201201, z: 2},
  {time: 201202, a: 3},
  {time: 201202, b: 4},
  {time: 201202, c: 0}
]

# result should be
[
  {time: 201201, x: 2, y: 7, z: 2},
  {time: 201202, a: 3, b: 4, c: 0},
]

The first thing that came to mind was to use Enumerable#group_by.

grouped = log.group_by { |i| i[:time] }
collapsed = grouped.collect do |t, a|
  no_time_h = a.inject({}) do |others, h|
    others.merge h.reject { |k, v| k.to_sym == :time }
  end

  {time: t}.merge(no_time_h)
end

puts collapsed.inspect

However, after reading this a couple of times, I still find the solution hard to follow. For starters, group_by returns a hash whose values are arrays of hashes, which brings me back to the original problem even though the data is already grouped by time. That, I feel, made the rest of the code more complicated.

# result of group_by
{201201=>[{:time=>201201, :x=>2}, {:time=>201201, :y=>7}, {:time=>201201, :z=>2}], 201202=>[{:time=>201202, :a=>3}, {:time=>201202, :b=>4}, {:time=>201202, :c=>0}]}

For my second version, I simply loop over the array and compose a hash using :time as the key. Afterwards, I use the key-value pairs to compose the resulting array. The code may be longer but it is more readable. Remember: Correct, Beautiful, Fast (in That Order).

hash_by_time = {}
log.each do |h|
  time = h[:time]
  others = h.reject { |k,v| k.to_sym == :time }

  if hash_by_time[time]
    hash_by_time[time].merge! others
  else
    hash_by_time[time] = others
  end
end

collapsed = hash_by_time.collect do |k, v|
  {time: k}.merge(v)
end
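As a third take (my own, not part of the quiz answers), the two ideas can be combined: since every hash in a group shares the same :time value, merging each group’s hashes produces the collapsed entry directly:

```ruby
log = [
  { time: 201201, x: 2 }, { time: 201201, y: 7 }, { time: 201201, z: 2 },
  { time: 201202, a: 3 }, { time: 201202, b: 4 }, { time: 201202, c: 0 }
]

# group_by buckets the hashes per day; reduce(:merge) collapses each bucket.
# The :time keys collide on merge, but they all hold the same value.
collapsed = log.group_by { |h| h[:time] }
               .map { |_, hashes| hashes.reduce(:merge) }

collapsed
# => [{:time=>201201, :x=>2, :y=>7, :z=>2}, {:time=>201202, :a=>3, :b=>4, :c=>0}]
```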

To the Crazy Ones

Though I’ve seen this video a gazillion times, I still find it fresh and inspiring.

To The Crazy Ones

Rebuild Rails Part 1

I have no plans to build another Rails clone. Let’s leave that work to other, smarter people with more time. But wouldn’t it be fun if we could learn how Rails works under the hood and find out what makes it “magical”? In this post I will only cover what happens from the moment you type a URL until you get back an HTML page. We’ll simplify further by not using any database access. If you would like to go deeper and wider, there is a book devoted entirely to this topic that I highly recommend.

Let’s call our application CrazyApp and let’s build our first web app.

# config.ru
app = proc do |env|
  [200, {'Content-Type' => 'text/html'}, ["hello from crazy app"] ]
end

run app

$ rackup config.ru -p 3000
Thin web server (v1.6.2 codename Doc Brown)
Maximum connections set to 1024
Listening on 0.0.0.0:3000, CTRL+C to stop
127.0.0.1 - - [16/Nov/2014 20:42:02] "GET / HTTP/1.1" 200 - 0.0008

It’s all about the Rack

Rack is a gem that sits between your framework (e.g. Rails) and Ruby-based application servers like Thin, Puma, Unicorn, and WEBrick. When you type a URL, the request goes through several layers of software until it hits our application, which in this case just returns “hello from crazy app”. Rack simplifies the interface to web servers so that we only have to worry about a few things to handle an HTTP request:

  • HTTP status, e.g. 200
  • Response headers. There are a lot of things you can set here but for now let’s stick to Content-Type.
  • The actual content. In our case, an HTML page.

Let’s look at a boilerplate config.ru that you get from Rails.

# This file is used by Rack-based servers to start the application.

require ::File.expand_path('../config/environment',  __FILE__)
run Rails.application

One step in Rails’ boot process is to define Rails.application as class MyApp::Application < Rails::Application. Both Rails::Application and a proc provide a call method, which is why both versions of config.ru work.
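The shared contract is simply: respond to call(env) and return the [status, headers, body] triple. A quick sketch of both shapes side by side (MiniApp is a made-up name for illustration):

```ruby
# A Rack app can be anything that responds to #call...
app_proc = proc do |env|
  [200, { "Content-Type" => "text/html" }, ["hello"]]
end

# ...including a class with a .call class method, the shape
# Rails.application (and our Tracks::Application) takes.
class MiniApp
  def self.call(env)
    [200, { "Content-Type" => "text/html" }, ["hello"]]
  end
end

app_proc.call({}) == MiniApp.call({})  # => true
```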

Now, let’s move our initial config.ru code to a different class that we can later extract into a gem for our framework that we shall call Tracks. From here on, we shall follow Rails conventions and build our gem from there.

# config.ru
require './config/application'
run CrazyApp::Application

# config/application.rb
require './lib/tracks'

module CrazyApp
  class Application < Tracks::Application
  end
end

# lib/tracks.rb
module Tracks
  class Application
    def self.call(env)
      [200, {'Content-Type' => 'text/html'}, ["hello from tracks"] ]
    end
  end
end

Exit from your rackup process and re-run it because we are not supporting auto-reloading yet. This time you will see a message from Tracks - our super awesome Rails-like framework.

Render a default page

We now introduce a very simple root controller and use it to render a default index page. We also modify our route handling by inspecting the env object that Rack passes to our framework. The env hash packs a lot of information about the request; for our routing, we are interested in PATH_INFO, which is the part of the URL after the domain minus the query string.
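To make that concrete, PATH_INFO is roughly the path component of the request URI, while the query string lives separately in QUERY_STRING. Using Ruby’s URI as a stand-in illustration (the example URL is made up):

```ruby
require "uri"

uri = URI("http://example.com/posts/index?page=2")
uri.path   # => "/posts/index"  (what Rack exposes as env["PATH_INFO"])
uri.query  # => "page=2"        (env["QUERY_STRING"])
```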

# lib/tracks/controller.rb
module Tracks
  class Controller
    def self.render_default_root
      filename = File.join('public', 'index.html')
      File.read filename
    end
  end
end


# lib/tracks.rb
require File.expand_path('../tracks/controller', __FILE__)

module Tracks
  class Application
    def self.call(env)
      path_info = env['PATH_INFO']
      if path_info == '/'
        text = Tracks::Controller.render_default_root
      else
        text = "hello from tracks #{path_info}"
      end

      [200, {'Content-Type' => 'text/html'}, [text] ]
    end
  end
end

That’s it for now. Next time, we will create our own controllers, actions, and dynamic pages using ERB.

#ruby Counting Vowels

Just saw a simple exercise in my Facebook feed and I thought I’d give it a shot. The problem is simple:

Write a function that returns the number of vowels in the string.

Here’s my ruby solution:

  require 'minitest/autorun'

  def vowel_count(s)
    vowels = %w[a e i o u]
    s.to_s.scan(/\w/).select { |i| vowels.include?(i.downcase) }.count
  end

  describe "#vowel_count" do
    it "should count upcase lowercase" do
      test = "I wanted to be an astronaut"
      vowel_count(test).must_equal 10
    end

    it "should be zero for empty string" do
      vowel_count("").must_equal 0
    end

    it "should be zero for nil" do
      vowel_count(nil).must_equal 0
    end
  end

Sounds simple, right? But there are subtle things you should watch out for.

  • Upper and lower cases may seem trivial but programmers are often bitten by these when comparing strings.
  • An initial solution would be to access each character via [index] and increment a counter for vowels. Here is where familiarity with your language’s libraries becomes useful. While I didn’t get the right method initially, I know Ruby’s String library offers a way to extract regex matches. From then on, it’s just a matter of using Enumerable#select which is a common Ruby idiom for filtering elements.
  • Having tests even for simple code is a good discipline to have. My initial test only covered the functional requirement. When I added the case of nil, it quickly showed the flaw in my code, which brings me to my next point.
  • Produce sensible results as much as possible. While you can argue the requirement states a string and not a nil, it is a good habit to defend your code in case the caller passes an invalid value. Hence, I converted the parameter to a string to ensure the rest of the code works with a string object and gives a sensible result even if the passed parameter is not a string.
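As an aside (not part of the original solution), Ruby’s String#count can express the same idea even more compactly, with the same nil-safety via to_s:

```ruby
def vowel_count(s)
  # Downcase first so the single "aeiou" character set covers both cases.
  s.to_s.downcase.count("aeiou")
end

vowel_count("I wanted to be an astronaut")  # => 10
vowel_count(nil)                            # => 0
```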

Minimalist testing

If you have been working with Rails for a while, you have probably been pampered by its seamless integration with testing frameworks, so you’ll be forgiven for thinking such support is only available within Rails.

Ruby comes with minitest/autorun, a minimalist testing framework. Just require it in your code and you are good to go with rspec-style testing right off the bat.

$ ruby vowelcount.rb
Run options: --seed 47907

# Running:

...

Finished in 0.001155s, 2597.4026 runs/s, 2597.4026 assertions/s.

3 runs, 3 assertions, 0 failures, 0 errors, 0 skips

#hacking Ansible

Before you continue, let me congratulate myself. It’s been years since I’ve written a post. Hooray! Also, thank you to Rad for guiding me on my Ansible adventure. The Ansible code I reference here is available on GitHub.

There are tons of posts about Ansible already so this is more about gotchas I learned along the way.

Variables

Just like in good coding, you should isolate things that change. Store variables in vars/defaults.yml for things like the initial database password, deployment directory, Ruby version, and whatnot. I started out using snake case but later learned you can have a hierarchy for your variables as well, for example:

app:
  name: google.app

Inside your tasks, you can reference it with {{ app.name }}. Of course, nothing prevents you from simply naming your variable app_name and using it as {{ app_name }}.

Hosts and groups

You can use Ansible to work on one or more servers at a time. These servers are specified in the hosts file (see hosts in the example). You can also group servers using [groupname]. In the example below, you can have Ansible target qa only (using the --limit option) and it will update the two servers listed under the [qa] group.

[local]
localhost ansible_python_interpreter=/usr/local/bin/python

[www]
www.yourapp.com

[staging]
staging.yourapp.com

[qa]
qa1.yourapp.com
qa2.yourapp.com

Use roles

Roles allow you to separate common tasks into something like modules, giving you flexibility. Some use roles to group common tasks like ‘db’, ‘web’, etc. For starters like me, especially if you are playing with combining different software, you can use roles to define specific software: I have a role named ‘mysql’, a role ‘nginx-puma’, and a role ‘nginx-passenger’. Over time, you may split the roles into functional distinctions like web, db, etc.

Creating an EC2 instance

In my example, just update the variables below (included in vars/defaults.yml) to suit your requirements.

ec2_keypair: yourapp-www
ec2_instance_type: t1.micro
ec2_security_group: web-security-group
ec2_region: us-west-1
ec2_zone: us-west-1c
ec2_image: ami-f1fdfeb4 # Ubuntu Server 14.04 LTS (PV) 64-bit
ec2_instance_count: 1

You need your AWS_ACCESS_KEY, AWS_SECRET_KEY, and PEM file to create an instance. Creating the AWS access keys requires access to the IAM (Identity and Access Management).

When you’re ready, run the command:

AWS_ACCESS_KEY="access" AWS_SECRET_KEY="secret" ansible-playbook -i hosts ec2.yml

I have set ec2.yml to run against ‘local’ group (specified in the ’hosts’ file) because you don’t have a host yet. The Ansible EC2 module is smart enough to know that you are still creating an instance at this point.

After running the playbook, notice that it creates a new entry in the [launched] group in your hosts file. The new entry points to the EC2 instance you just created. At this point, you now have a server to run the rest of the playbooks.

Creating the deploy user

When the instance is created, it comes with a default user. In my project where I use Ubuntu, it creates an ‘ubuntu’ user. In other cases, it might only create a ‘root’ user. Regardless, your next step is to create a ‘deploy’ user because it is not a good idea to keep using root. You can change the name of the deploy user (see vars/defaults.yml) but I prefer ‘deploy’ because it is also the default user in Capistrano and I’m a Rails guy.

ansible-playbook -i hosts create-user.yml --user root --limit launched --private-key ~/.ssh/yourapp.pem

This playbook does several things:

  • Creates the ‘deploy’ user.
  • Adds your ssh key to /home/deploy/.ssh/authorized_keys
  • Gives deploy sudo rights.
  • Disables ssh and password access for root.

Note that I specified ‘root’ as the initial user. Depending on the instance you just created, it might be ‘ubuntu’ or some other initial user.

Creating an instance in other providers

In Digital Ocean (and other similar providers), you can create an instance using their admin interface. The initial user is ‘root’ and you can specify a password and/or public key. If you forgot the public key, you must add it before you continue.

scp ~/.ssh/id_rsa.pub root@yourapp.com:~/uploaded_key.pub
ssh root@staging.app.com
mkdir -m og-rwx .ssh
cat ~/uploaded_key.pub >> ~/.ssh/authorized_keys

Bootstrapping

Now that you have your deploy user, just run the command below, grab a coffee, and come back after 30 minutes. That’s how awesome Ansible is.

ansible-playbook -i hosts bootstrap.yml --limit launched

You may need to tweak the roles or update a specific server; for example, when you’re testing an upgrade on a staging server. In that case, make sure you specify the correct host group in the --limit parameter.

A few more things I learned with this exercise:

  • I can set the file mode with mkdir.
  • Use bash --login to load the login profile. When I ssh using ‘deploy@domain’, my Ruby gets loaded correctly. However, when using Ansible (which uses ssh), my Ruby is not loaded and thus you will get errors like ‘bundle not found’.
  • I’m torn between using system Ruby and chruby. On one hand, I feel there is no need for chruby because this is a server and there is no need to switch Rubies from time to time. On the other hand, why not try it and see how it works.

#ruby Method_missing Gotchas

Forgetting ‘super’ with ‘method_missing’

‘method_missing’ is one of Ruby’s powers that makes frameworks like Rails seem magical. When you call a method on an object (or “send a message to the object”), the object executes the first method it finds. If the object can’t find the method, it complains. This is pretty much what every modern programming language does. Except in Ruby you can guard against a non-existent method call by defining ‘method_missing’ in your object. If you are using Rails, this technique enables dynamic record finders like User.find_by_first_name.

require "rspec"

class RadioActive
  def to_format(format)
    format
  end

  def method_missing(name, *args)
    if name.to_s =~ /^to_(\w+)$/
      to_format($1)
    end
  end
end

describe RadioActive do
  it "should respond to to_format" do
    format = stub
    subject.to_format(format).should == format
  end

  it "should respond to to_other_format" do
    subject.to_other_format.should == "other_format"
  end

  it "should raise a method missing" do
    expect do
      subject.undefined_method
    end.to raise_error
  end
end

However, improper use of ‘method_missing’ can introduce bugs in your code that are hard to track down. To illustrate: our example code above intercepts methods whose names are in the ‘to_name’ format. It works fine, as our tests tell us, except when we try to call an undefined method that does not follow the “to_name” format. The default behavior for an undefined method is for the object to raise a NoMethodError exception.

$ rspec method_missing_gotcha-01.rb
..F

Failures:

  1) RadioActive should raise a method missing
     Failure/Error: expect do
       expected Exception but nothing was raised
     # ./method_missing_gotcha-01.rb:30:in `block (2 levels) in <top (required)>'

Finished in 0.00448 seconds
3 examples, 1 failure

Failed examples:

rspec ./method_missing_gotcha-01.rb:29 # RadioActive should raise a method missing

You can easily catch this bug if you have a test. It would be a different story if you just used your class straight away.

irb(main):001:0> require './method_missing_gotcha-01.rb'
=> true
irb(main):002:0> r = RadioActive.new
=> #<RadioActive:0x007fd232a4d8a8>
irb(main):003:0> r.to_format('json')
=> "json"
irb(main):004:0> r.to_json
=> "json"
irb(main):005:0> r.undefined
=> nil

The undefined method just returns nil instead of raising an exception. When we defined our method_missing, we accidentally removed the default behavior. Oops!

Fortunately, the fix is easy. There is no need to raise ‘NoMethodError’ in your code. Instead, simply call ‘super’ if you are not handling the method. Whether you wrote the class yourself or are inheriting from another, do not forget to call ‘super’ in your ‘method_missing’. And that makes our tests happy :)

--- 1/method_missing_gotcha-01.rb
+++ 2/method_missing_gotcha-02.rb
@@ -9,6 +9,8 @@ class RadioActive
   def method_missing(name, *args)
     if name.to_s =~ /^to_(\w+)$/
       to_format($1)
+    else
+      super
     end
   end

$ rspec method_missing_gotcha-02.rb
...

Finished in 0.00414 seconds
3 examples, 0 failures

Calling ‘super’ is not just for ‘method_missing’. You also need to do the same for other hook methods like ‘const_missing’, ‘append_features’, or ‘method_added’.

Forgetting respond_to?

When we modified ‘method_missing’, we essentially introduced ghost methods. They exist but you cannot see them. (You can call them spirit methods if that suits your beliefs.) In our example, we were able to call a method named ‘to_json’, but if we look at the list of methods defined for RadioActive, we will not see ‘to_json’.

irb(main):002:0> RadioActive.instance_methods(false)
=> [:to_format, :method_missing]
irb(main):003:0> r = RadioActive.new
=> #<RadioActive:0x007f88b2a151c0>
irb(main):004:0> r.respond_to?(:to_format)
=> true
irb(main):005:0> r.respond_to?(:to_json)
=> false

Before we introduce a fix, let us first write a test that shows this bug. It’s TDD time baby!

@@ -32,4 +34,8 @@ describe RadioActive do
     end.to raise_error
   end

+  it "should respond_to? to_other format" do
+    subject.respond_to?(:to_other_format).should == true
+  end
+
 end

...F

Failures:

  1) RadioActive should respond_to? to_other format
     Failure/Error: subject.respond_to?(:to_other_format).should == true
       expected: true
            got: false (using ==)
     # ./method_missing_gotcha-02.rb:38:in `block (2 levels) in <top (required)>'

Finished in 0.00444 seconds
4 examples, 1 failure

Failed examples:

rspec ./method_missing_gotcha-02.rb:37 # RadioActive should respond_to? to_other format

The fix: every time you modify ‘method_missing’, you also need to update ‘respond_to?’. And don’t forget to call ‘super’ there as well.

+  def respond_to?(name)
+    !!(name.to_s =~ /^to_/ || super)
+  end
+
 end

And with that, we are all green.

....

Finished in 0.00443 seconds
4 examples, 0 failures
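As a final aside (beyond what this post covers), Ruby 1.9.2 and later prefer overriding respond_to_missing? instead of respond_to?; respond_to? then honors it automatically, and so does Object#method. A sketch of the same class in that style:

```ruby
class RadioActive
  def to_format(format)
    format
  end

  def method_missing(name, *args)
    # Handle to_xxx, delegate everything else to the default behavior.
    name.to_s =~ /^to_(\w+)$/ ? to_format($1) : super
  end

  # Preferred hook: respond_to? and #method pick this up automatically.
  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?("to_") || super
  end
end

r = RadioActive.new
r.respond_to?(:to_json)  # => true
r.method(:to_json).call  # => "json"
```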

Mining Twitter Data With Ruby - Visualizing User Mentions

In my previous post on mining Twitter data with Ruby, we laid our foundation for collecting and analyzing Twitter updates. We stored these updates in MongoDB and used map-reduce to implement a simple counting of tweets. In this post, we’ll show relationships between users based on mentions inside their tweets. Fortunately for us, there is no need to parse each tweet just to get the list of users mentioned because Twitter provides the “entities.user_mentions” field that contains what we need. After we collect the “who mentions whom” data, we construct a directed graph to represent these relationships and convert it to an image so we can actually see it.

First, we start with the aggregation of mentions per user. We will use the same code base as last time, so if this is your first time, I recommend reading my previous related post, or you can follow the changes on GitHub. Note to self: convert this to an actual gem in the next post.

# user_mention.rb
module UserMention

  def mentions_by_user
    map_command = %q{
      function() {
        var mentions = this.entities.user_mentions,
            users = [];
        if (mentions.length > 0) {
          for(i in mentions) {
            users.push(mentions[i].id_str)
          }

          emit(this.user.id_str, { mentions: users });
        }
      }
    }

    reduce_command = %q{
      function(key, values) {
        var users = [];

        for(i in values) {
          users = users.concat(values[i].mentions);
        }

        return { mentions: users };
      }
    }

    options = {:out => {:inline => 1}, :raw => true, :limit => 50 }
    statuses.map_reduce(map_command, reduce_command, options)
  end

end

We again use map-reduce in MongoDB to implement our aggregation. Of course, this sort of thing can be done in Ruby directly, but it is way more efficient in MongoDB, especially if you have a big collection to process. Note that we limit the number of documents to process because we don’t want our graph to look unrecognizable when we display it.
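To see what the map-reduce computes without MongoDB, here is the same aggregation in plain Ruby over a couple of hand-made status documents (the sample data is made up for illustration):

```ruby
statuses = [
  { "user" => { "id_str" => "1" },
    "entities" => { "user_mentions" => [{ "id_str" => "7" }, { "id_str" => "8" }] } },
  { "user" => { "id_str" => "1" },
    "entities" => { "user_mentions" => [{ "id_str" => "9" }] } },
  { "user" => { "id_str" => "2" },
    "entities" => { "user_mentions" => [] } }
]

# map: emit (author id, mentioned ids) only when there are mentions;
# reduce: concatenate the mention lists per author.
results = statuses
  .reject { |s| s["entities"]["user_mentions"].empty? }
  .group_by { |s| s["user"]["id_str"] }
  .map do |id, docs|
    mentions = docs.flat_map { |s| s["entities"]["user_mentions"].map { |m| m["id_str"] } }
    { "_id" => id, "value" => { "mentions" => mentions } }
  end

results
# => [{"_id"=>"1", "value"=>{"mentions"=>["7", "8", "9"]}}]
```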

Now that we have our aggregation working, we construct a directed graph of user mentions using the rgl library.

require "bundler"
Bundler.require
require File.expand_path("../tweetminer", __FILE__)
settings = YAML.load_file File.expand_path("../mongo.yml", __FILE__)
miner = TweetMiner.new(settings)

require "rgl/adjacency"
require "rgl/dot"

graph = RGL::DirectedAdjacencyGraph.new
miner.mentions_by_user.fetch("results").each do |user|
  user.fetch("value").fetch("mentions").each do |mention|
    graph.add_edge(user.fetch("_id"), mention)
  end
end

# creates graph.dot, graph.png
graph.write_to_graphic_file

Once you have the user-mention relationships in a graph, you can do interesting things like finding who is connected to whom and the degrees of separation. But for now, we are just interested in showing who mentioned whom. Our sample program saves the graph to the file graph.dot (using the DOT language) and to a PNG file. But the default PNG output is not laid out nicely. Instead, we will use the “neato” program to convert graph.dot into a nicer-looking PNG file.

$ neato -Tpng graph.dot -o mentions.png

When you view “mentions.png”, you should see something similar as the one below. The labels are user IDs and the arrows show the mentioned users.

It would be cool to modify our program to use the users' avatars and also make it interactive. Or use Twitter’s streaming API and create an auto-updating graph. I haven’t done any research yet but I’m sure there is some JavaScript library out there that can help us display graph relationships.

Mining Twitter Data With Ruby, MongoDB and Map-Reduce

When is the best time to tweet? If you care about reaching a lot of users, the best time is probably when your followers are also tweeting. In this exercise, we will try to figure out the day and time users are most active. Since there is no way for us to do this for all users in the twitterverse, we will only use the users we follow as our sample.

What do we need

  • mongodb
  • tweetstream gem
  • awesome_print gem for awesome printing of Ruby objects
  • oauth credentials

Visit http://dev.twitter.com to get your OAuth credentials. You just need to log in, create an app, and the OAuth credentials you need will be there. Copy the OAuth settings to the twitter.yml file because that is where our sample code will look.

Collect status updates

We use the TweetStream gem to access the Twitter Streaming API, which allows our program to receive updates as they occur without the need to regularly poll Twitter.

# Collects user tweets and saves them to a mongodb
require 'bundler'
require File.dirname(__FILE__) + '/tweetminer'

Bundler.require

# We use the TweetStream gem to access Twitter's Streaming API
# https://github.com/intridea/tweetstream

TweetStream.configure do |config|
  settings = YAML.load_file File.dirname(__FILE__) + '/twitter.yml'

  config.consumer_key       = settings['consumer_key']
  config.consumer_secret    = settings['consumer_secret']
  config.oauth_token        = settings['oauth_token']
  config.oauth_token_secret = settings['oauth_token_secret']
end

settings = YAML.load_file File.dirname(__FILE__) + '/mongo.yml'
miner = TweetMiner.new(settings)
stream = TweetStream::Client.new

stream.on_error do |msg|
  puts msg
end

stream.on_timeline_status do |status|
  miner.insert_status status
  print '.'
end

# Do not forget this to trigger the collection of tweets
stream.userstream

The code above handles the collection of status updates. The actual saving to MongoDB is handled by the TweetMiner class.

# tweetminer.rb

require 'mongo'

class TweetMiner
  attr_writer :db_connector
  attr_reader :options

  def initialize(options)
    @options = options
  end

  def db
    @db ||= connect_to_db
  end

  def insert_status(status)
    statuses.insert status
  end

  def statuses
    @statuses ||= db['statuses']
  end

  private

  def connect_to_db
    db_connector.call(options['host'], options['port']).db(options['database'])
  end

  def db_connector
    @db_connector ||= Mongo::Connection.public_method :new
  end

end

We will be modifying our code along the way, and if you want to follow each step, you can view this commit on GitHub.

Depending on how active the people you follow are, it may take a while before you get a good sample of tweets. Actually, it would be interesting to run the collection for several days.

Assuming we have several days' worth of data, let us proceed with the “data mining” part. Data mining would not be fun without a mention of map-reduce, a strategy for processing data popularized by Google. The key innovation of map-reduce is its ability to take a query over a data set, divide it, and run it in parallel over many nodes. “Counting”, for example, is a task that fits nicely into the map-reduce framework. Imagine you and your friends are counting the number of people in a football stadium. First, you divide yourselves into two groups: group A counts the people in the lower deck while group B does the upper deck. Group A in turn divides the task into north, south, and end zones. When group A is done counting, they tally all their results. After group B is done, they combine their results with group A's, and the total gives us the number of people in the stadium. Dividing the work among your friends is the “map” part while the tallying of results is the “reduce” part.
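The stadium analogy maps directly onto code. Here is a toy in-memory map-reduce (my own sketch, nothing MongoDB-specific) where map emits key/value pairs per document and reduce folds the values per key:

```ruby
# map emits [key, value] pairs per document; reduce folds values per key.
def map_reduce(docs, map, reduce)
  docs.flat_map { |doc| map.call(doc) }
      .group_by(&:first)
      .map { |key, pairs| [key, reduce.call(key, pairs.map(&:last))] }
      .to_h
end

statuses = [{ "user" => "alice" }, { "user" => "bob" }, { "user" => "alice" }]

map_fn    = ->(s) { [[s["user"], 1]] }       # emit (user, 1) per status
reduce_fn = ->(_key, values) { values.sum }  # tally the emitted 1s

map_reduce(statuses, map_fn, reduce_fn)
# => {"alice"=>2, "bob"=>1}
```

This is the same shape as the status_count_by_user map and reduce commands below, just without the parallelism.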

Updates per user

First, let us do a simple task: count the number of updates per user. We introduce a new module, StatusCounter, which we include in our TweetMiner class. We also add a new program to execute the map-reduce task.

# counter.rb

require 'bundler'
Bundler.require
require File.dirname(__FILE__) + '/tweetminer'
settings = YAML.load_file File.dirname(__FILE__) + '/mongo.yml'

miner = TweetMiner.new(settings)

results = miner.status_count_by_user
ap results

Map-reduce commands in MongoDB are written in JavaScript. When writing the JavaScript, just be conscious about string interpolation because Ruby sees it as a bunch of characters and nothing else. In the example below, we use a heredoc, which interprets backslashes. In later examples, we switch to single quotes when we use regular expressions within our JavaScript.

module StatusCounter
  class UserCounter
    def map_command
      <<-EOS
        function() {
          emit(this.user.id_str, 1);
        }
      EOS
    end

    def reduce_command
      <<-EOS
        function(key, values) {
          var count = 0;
          for(i in values) {
            count += values[i]
          }

          return count;
        }
      EOS
    end
  end

  def status_count_by_user
    counter = UserCounter.new
    statuses.map_reduce(counter.map_command, counter.reduce_command, default_mr_options)
  end

  def default_mr_options
    {:out => {:inline => 1}, :raw => true }
  end
end
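As an aside on the interpolation issue above, Ruby also offers a single-quoted heredoc (`<<-'EOS'`) that passes backslashes and `#{}` through verbatim, which is handy when embedding JavaScript. A quick sketch:

```ruby
# A double-quoted heredoc processes interpolation and escape sequences.
interpolated = <<-EOS
total: #{1 + 1}
EOS

# A single-quoted heredoc passes the text through untouched.
verbatim = <<-'EOS'
total: #{1 + 1}
EOS

puts interpolated # => total: 2
puts verbatim     # => total: #{1 + 1}
```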

Follow this commit to view the changes from our previous examples.

When you run ‘ruby counter.rb’, you should see output similar to the screenshot below:

Tweets per Hour

Now, let’s do something a little harder than the previous example. This time, we want to know how many tweets are posted per hour. Every tweet has a created_at field of type String, so we use a regular expression to extract the hour component.

created_at:  'Tue Sep 04 22:04:40 +0000 2012'
regex:  (\d{2,2}):\d{2,2}:\d{2,2}
match: 22
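Before embedding the pattern in JavaScript, we can sanity-check it in Ruby with the sample created_at value above:

```ruby
created_at = 'Tue Sep 04 22:04:40 +0000 2012'

# String#[] with a regex and a capture-group index returns that group.
hour = created_at[/(\d{2,2}):\d{2,2}:\d{2,2}/, 1]

puts hour # => 22
```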

The only significant change is the addition of a new map command. Note the reduce command did not change from the previous example. See the commit.

class HourOfDayCounter
  def map_command
    'function() {
      var re = /(\d{2,2}):\d{2,2}:\d{2,2}/;
      var hour = re.exec(this.created_at)[1];

      emit(hour, 1);
    }'
  end

  def reduce_command
    <<-EOS
      function(key, values) {
        var count = 0;

        for(i in values) {
          count += values[i]
        }

        return count;
      }
    EOS
  end

end

def status_count_by_hday
  counter = HourOfDayCounter.new
  statuses.map_reduce(counter.map_command, counter.reduce_command, default_mr_options)
end

Now run ‘ruby counter.rb’ again with the new method, and the result should look something like the output below.

Filtering records

Our examples so far include every status since the beginning of time, which is not very useful. What we want is to apply the counting tasks to, say, statuses posted in the past 7 days. MongoDB allows you to pass a query to your map reduce so you can filter the data it is applied to. One problem, though: the created_at field is a string. To get around this, we introduce a new field, created_at_dt, of type Date. You could hook this up in the insert_status method, but since we already have our data, we instead run a script (in the MongoDB console) to update our records. Note that the collection we are using is statuses and the new field is created_at_dt.

var cursor = db.statuses.find({ created_at_dt: { $exists: false } });
while (cursor.hasNext()) {
  var doc = cursor.next();
  db.statuses.update({ _id : doc._id }, { $set : { created_at_dt : new Date(doc.created_at) } } )
}

Now that we have a Date field, let’s modify our method to accept a days_ago parameter and pass a query to our map reduce.

def status_count_by_hday(days_ago = 7)
  date   = Date.today - days_ago
  cutoff = Time.utc(date.year, date.month, date.day)
  query  = { 'created_at_dt' => { '$gte' => cutoff } }

  options = default_mr_options.merge(:query => query)

  counter = HourOfDayCounter.new
  statuses.map_reduce(counter.map_command, counter.reduce_command, options)
end

Since we’re now getting the hang of it, why don’t we add another layer of complexity? This time, let us count by day of the week and include a breakdown per hour. Luckily for us, the day of the week is also included in the created_at field, so it is just a matter of extracting it. (Of course, if Twitter decides to change the format, this will break.) Let’s visit rubular.com and try our regular expression.

Now that we have our regex working, let’s include this in our new map command.

def map_command
  'function() {
    var re = /(^\w{3,3}).+(\d{2,2}):\d{2,2}:\d{2,2}/;
    var matches = re.exec(this.created_at);

    var wday = matches[1],
        hday = matches[2];

    emit(wday, { count: 1, hdayBreakdown: [{ hday: hday, count: 1 }] });
  }'
end
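As before, the combined pattern can be checked in Ruby first, using the same sample created_at value:

```ruby
created_at = 'Tue Sep 04 22:04:40 +0000 2012'

# Group 1 grabs the three-letter weekday; group 2 grabs the hour.
matches = created_at.match(/(^\w{3,3}).+(\d{2,2}):\d{2,2}:\d{2,2}/)
wday = matches[1]
hour = matches[2]

puts "#{wday} #{hour}" # => Tue 22
```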

Note the difference in the emit function from our previous examples. Before, we emitted a single numeric value, which is why our reduce command was a simple array loop. This time, our reduce command requires more work.

def reduce_command
  'function(key, values) {
     var total = 0,
         hdays = {},
         hdayBreakdown;

     for(i in values) {
       total += values[i].count

       hdayBreakdown = values[i].hdayBreakdown;

       for(j in hdayBreakdown) {
         hday  = hdayBreakdown[j].hday;
         count = hdayBreakdown[j].count;

         if( hdays[hday] == undefined ) {
           hdays[hday] = count;
         } else {
           hdays[hday] += count;
         }
       }
     }

     hdayBreakdown = [];
     for(k in hdays) {
       hdayBreakdown.push({ hday: k, count: hdays[k] })
     }

     return { count: total, hdayBreakdown: hdayBreakdown }
   }'
end

In our previous examples, the values parameter was a simple array of numbers. Now, it becomes an array of objects, and one of the properties (i.e. hdayBreakdown) is itself an array. If everything works according to plan, you should see something like the image below when you run counter.rb.
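To convince ourselves the merging logic is right, we can mimic the reduce step in plain Ruby over a couple of hand-built values. This is a sketch of the algorithm only, not of MongoDB itself, and the sample documents are made up:

```ruby
# Two partial results for the same weekday key, shaped like the
# documents our map command emits.
values = [
  { count: 2, hdayBreakdown: [{ hday: '22', count: 1 }, { hday: '23', count: 1 }] },
  { count: 3, hdayBreakdown: [{ hday: '22', count: 2 }, { hday: '07', count: 1 }] }
]

# Tally the totals, merging per-hour counts just like the JavaScript loop.
total = values.sum { |v| v[:count] }

hdays = Hash.new(0)
values.each do |v|
  v[:hdayBreakdown].each { |h| hdays[h[:hday]] += h[:count] }
end

result = {
  count: total,
  hdayBreakdown: hdays.map { |hday, count| { hday: hday, count: count } }
}

p result[:count] # => 5
p hdays['22']    # => 3
```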

Did you have fun? I hope so :)