Twitalyzr – Twitter stream

Industry Category
Marketing and Media Open source

Twitalyzr (Twitter stream analyzer) is a small application that consumes a Twitter stream filtered by a particular hashtag (in our case #Cassandra) and marks each tweet as either related or unrelated to Apache Cassandra. The decision is based on the content of the tweet and the profile description of the user who posted it. We built the application because many tweets tagged #cassandra have nothing to do with the database: because of the name, they are about porn stars, soap operas, book characters, game characters, various sales, and so on.

The application contains two Spark jobs:

  1. Training Spark job: builds and optimizes the preprocessing and training pipeline
  2. Streaming Spark job: applies the model to the stream to classify incoming tweets in real time
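
As a rough illustration of how the two jobs divide the work, the sketch below uses plain Python rather than Spark. The function names `train` and `classify_stream` and the word-overlap "model" are hypothetical stand-ins chosen for brevity; they are not the project's code, and the project itself trains a logistic regression model.

```python
from collections import Counter
from typing import Iterable, Iterator, Set, Tuple


def train(examples: Iterable[Tuple[str, bool]]) -> Set[str]:
    """Stand-in for the training job: learn words that appear more often
    in related tweets than in unrelated ones (NOT logistic regression)."""
    related, unrelated = Counter(), Counter()
    for text, is_related in examples:
        target = related if is_related else unrelated
        target.update(text.lower().split())
    return {word for word, count in related.items() if count > unrelated[word]}


def classify_stream(model: Set[str], tweets: Iterable[str]) -> Iterator[Tuple[str, bool]]:
    """Stand-in for the streaming job: apply the already trained model
    to each incoming tweet, one at a time."""
    for tweet in tweets:
        words = set(tweet.lower().split())
        yield tweet, bool(words & model)
```

The point of the split is that `train` runs once, offline, while `classify_stream` only needs the resulting model and can process tweets as they arrive.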

Because a pre-trained model is provided, there is no need to run the training job (unless you want to experiment). The pre-trained model uses logistic regression preceded by these preprocessing steps: cleaning, tokenization, stop-word removal, n-gram generation, and term-frequency computation. The preprocessing steps are applied to the tweet text, the user description, and the hashtags. The model was trained with k-fold cross-validation and achieved an accuracy of 0.99 on the test set.
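
The preprocessing steps listed above can be sketched in plain Python. This is only an illustration of the idea, not the project's Spark pipeline: the stop-word list, the regular expressions, the choice of bigrams, and the function names (`clean`, `tokenize`, `remove_stop_words`, `ngrams`, `featurize`) are all assumptions made for the example.

```python
import re
from collections import Counter
from typing import List

# Illustrative stop-word list; a real pipeline would use a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on", "for"}


def clean(text: str) -> str:
    """Lowercase, drop URLs and punctuation (keeping # and @), collapse spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z0-9#@\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> List[str]:
    return text.split()


def remove_stop_words(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens: List[str], n: int = 2) -> List[str]:
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(tweet_text: str, user_description: str, hashtags: List[str]) -> Counter:
    """Term frequencies over unigrams and bigrams, computed separately for
    the tweet text, the user description, and the hashtags."""
    features: Counter = Counter()
    fields = (
        ("text", tweet_text),
        ("user", user_description),
        ("tags", " ".join(hashtags)),
    )
    for field, raw in fields:
        tokens = remove_stop_words(tokenize(clean(raw)))
        for term in tokens + ngrams(tokens, 2):
            features[f"{field}:{term}"] += 1
    return features
```

Prefixing each term with its field keeps a word in the user description distinct from the same word in the tweet text, which is one simple way to let a classifier weigh the two sources differently.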

Start streaming

If you want to run the streaming job on AWS, automation is already provided via Ansible in the automation folder.

  1. Compile the project with sbt:

     ```
     sbt assembly
     ```
  2. Export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for AWS. Additional AWS configuration, such as the instance type, is in spark.yml.
  3. The Twitter API keys are encrypted with ansible-vault and stored in twitter_api_keys.yml. They are used to fill in a template with:

     ```
     token = "{{ token }}"
     tokenSecret = "{{ token_secret }}"
     consumerKey = "{{ consumer_key }}"
     consumerSecret = "{{ consumer_secret }}"
     ```
  4. Export the variables that hold the paths to the jar and the model:

     ```
     export TWITALYZR_SRC_JAR=<path_to_directory_that_contains_jar>
     export TWITALYZR_SRC_MODEL=<path_to_tar_containing_model>
     ```
  5. Run the following command (from the [automation](automation) directory) if the instance has already been created; if not, change the tag to `aws-setup`:

     ```
     ansible-playbook -vv -i inventory/ --private-key <path_to_pem> spark.yml --tags aws-start --ask-vault-pass
     ```
