Industry | Category |
---|---|
Marketing and Media | Open source |
Twitalyzr (Twitter stream analyzer) is a small application that processes an input Twitter stream for a particular hashtag (in our case #Cassandra) and marks each tweet as either related or unrelated to Apache Cassandra. The decision is based on the content of the tweet and the profile description of the user who posted it. The application exists because many tweets tagged #cassandra have nothing to do with the database. Because of the name, they are about porn stars, soap operas, book characters, game characters, various sales, and so on.
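The application contains two Spark jobs: a training job that builds the classification model and a streaming job that classifies incoming tweets.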
Because a pre-trained model is provided, there is no need for training (unless you want to experiment). The pre-trained model uses logistic regression with the following preprocessing steps: cleaning, tokenization, stop-word removal, n-grams, and term frequency. Preprocessing is applied to the text of the tweet, the text of the user description, and the hashtags. The model was trained with k-fold cross-validation and achieved an accuracy of 0.99 on the test set.
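To make the preprocessing chain more concrete, here is a minimal plain-Scala sketch of the clean, tokenize, stop-word removal, n-gram, and term-frequency steps. It is only an illustration, not the project's actual code: the object name `PreprocessSketch`, the stop-word list, the cleaning rules, and the choice of bigrams are all assumptions, and the real jobs presumably rely on Spark's own feature transformers instead.

```scala
// Illustrative sketch only: names, stop words, and cleaning rules are assumptions.
object PreprocessSketch {
  // Hypothetical stop-word list; the list used by the real model is not given here.
  val stopWords: Set[String] =
    Set("a", "an", "the", "is", "of", "and", "to", "in", "for", "with")

  // Clean: lowercase, drop URLs, keep only letters, digits, '#', and whitespace.
  def clean(text: String): String =
    text.toLowerCase
      .replaceAll("https?://\\S+", " ")
      .replaceAll("[^a-z0-9#\\s]", " ")

  // Tokenize on whitespace, discarding empty tokens.
  def tokenize(text: String): Seq[String] =
    text.split("\\s+").filter(_.nonEmpty).toSeq

  // Remove tokens that appear in the stop-word list.
  def removeStopWords(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(stopWords.contains)

  // Build word n-grams, joined by a single space; short inputs yield none.
  def ngrams(tokens: Seq[String], n: Int): Seq[String] =
    tokens.sliding(n).filter(_.size == n).map(_.mkString(" ")).toSeq

  // Term frequency: how many times each term occurs.
  def termFrequency(terms: Seq[String]): Map[String, Int] =
    terms.groupBy(identity).map { case (term, hits) => term -> hits.size }

  // Full chain: unigrams plus n-grams, counted together.
  def features(text: String, n: Int = 2): Map[String, Int] = {
    val tokens = removeStopWords(tokenize(clean(text)))
    termFrequency(tokens ++ ngrams(tokens, n))
  }
}
```

Calling `PreprocessSketch.features(tweetText)` returns a map from terms to counts. In a setup like the one described above, the same chain would be applied to the tweet text, the user description, and the hashtags, and the resulting counts would be turned into a feature vector for the logistic regression classifier.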
If you want to run the streaming job on AWS, automation is already provided via Ansible in the automation folder.
sbt assembly
```
token = "{{ token }}"
tokenSecret = "{{ token_secret }}"
consumerKey = "{{ consumer_key }}"
consumerSecret = "{{ consumer_secret }}"
```
4. Export the variables that hold the paths to the jar and the model:
```
export TWITALYZR_SRC_JAR=<path_to_directory_that_contains_jar>
export TWITALYZR_SRC_MODEL=<path_to_tar_containing_model>
```
5. Run the following command from the [automation](automation) directory if the instance has already been created; if it has not, change the tag to ```aws-setup```:
ansible-playbook -vv -i inventory/ --private-key <path_to_pem> spark.yml --tags aws-start --ask-vault-pass