Analyzing & Visualizing Tweets using Python, PySpark, Kafka, ZooKeeper and Tableau

Handling humongous streaming data and providing real-time analysis on top of it feels revolutionary. Python, in combination with PySpark, makes it possible with ease. Here, we have the advantage of getting data through the API Twitter provides. The Twitter API lets us pull data in real time; we process it using RDDs, pass it on to display the counts, and then build a dashboard to show the analysis of all the processed data. It is interesting and full of learnings, and it is simpler to implement than the problem statement makes it look.
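To make the flow concrete, below is a minimal sketch of the counting stage of such a pipeline. It assumes tweets are already being pushed into a Kafka topic named "tweets" on localhost:9092 (the topic name and broker address are placeholders, not taken from the post), and it uses PySpark's Structured Streaming API instead of the raw RDD processing mentioned above; it also needs the spark-sql-kafka connector package available to Spark.

```python
# Minimal sketch, assuming a Kafka topic "tweets" on localhost:9092
# (placeholder names) and the spark-sql-kafka connector on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("TweetHashtagCounts").getOrCreate()

# Read the tweet stream from Kafka; each record's value holds the tweet text.
tweets = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "tweets")
          .load()
          .selectExpr("CAST(value AS STRING) AS text"))

# Split each tweet into words, keep only hashtags, and maintain running counts.
hashtag_counts = (tweets
                  .select(explode(split(col("text"), r"\s+")).alias("word"))
                  .filter(col("word").startswith("#"))
                  .groupBy("word")
                  .count())

# Write the running counts to the console; in the post these feed a dashboard.
query = (hashtag_counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```

From here the counts can be pushed to a chart (for example with Chart.js) or exported for Tableau, as described above.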

Technologies used: Python with PySpark, ZooKeeper, Kafka, Chart.js and Tableau

Our Goals

  1. Scalability…


Boxplot: Different Statistical Measures in a Single Plot

A box plot is a graphical presentation of data commonly used for finding outliers. As we know, data plays a very important role in end-to-end machine learning. The better the data you give the model to train on, the better you will notice it generalizing to unseen data. So, data is at the heart of solving any problem statement, and defining the problem and collecting the data is where solving it starts.
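As a quick illustration of how a box plot surfaces outliers, here is a small matplotlib sketch; the sample values are made up for demonstration and are not from the post.

```python
import matplotlib.pyplot as plt

# Made-up sample: most values sit between 7 and 12, while 25 is an outlier.
data = [7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 25]

fig, ax = plt.subplots()
# By default the whiskers extend to 1.5 * IQR; points beyond them are drawn
# individually, which is how the plot flags 25 as an outlier.
ax.boxplot(data)
ax.set_title("Box plot flags 25 as an outlier")
ax.set_ylabel("Value")
plt.show()
```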

Once we are ready with the problem statement, we start collecting data from different sources. And, most of the time, the data is input or provided…


Importance of controlling and resolving this for NN models

Machine learning is a buzzword, and neural networks are the must-learn technique for training models that generalize better to unseen data. Before NNs, it was well known that earlier models could overfit or underfit depending on various conditions, and the usual remedies were more data, early stopping, techniques to normalize the data, and so on. With NNs, people assume training should converge in each and every situation. Somehow, that is not the case, and surprisingly, a NN algorithm in machine learning may not perform better due to vanishing and exploding…
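As a rough numerical illustration of that vanishing and exploding behaviour, consider what happens when a gradient picks up one multiplicative factor per layer across many layers; the depth and the factors below are made-up numbers, not from the post.

```python
import numpy as np

depth = 50  # number of layers the gradient is propagated through

# If every layer contributes a derivative factor below 1, the product vanishes.
vanishing = np.prod(np.full(depth, 0.25))   # roughly 7.9e-31

# If every layer contributes a factor above 1, the product explodes.
exploding = np.prod(np.full(depth, 1.8))    # roughly 5.9e+12

print(f"product of {depth} factors of 0.25: {vanishing:.3e}")
print(f"product of {depth} factors of 1.8:  {exploding:.3e}")
```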


Understand Bias-Variance to Help Your Model Generalize Better

Bias-variance trade-off is a term you commonly hear whenever you talk about data, machine learning, and training a model on data to make predictions. The other big terms that come out of this trade-off are underfitting and overfitting. These are a few important terms to understand, and they help you analyze your data and the reasons an ML model is not performing up to expectations. Let us make these terms simple and understand them, so that in the future they won't trouble us much. It is a one-time effort and will help you evaluate your ML models for a lifetime. …
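For a hands-on feel of underfitting versus overfitting, here is a small scikit-learn sketch that fits polynomials of increasing degree to noisy data; the data-generating function and the chosen degrees are illustrative assumptions, not taken from the post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples from a sine curve (made-up data for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Degree 1 underfits (high bias), degree 15 overfits (high variance).
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

A large gap between train and test error is the variance (overfitting) signal; both errors being high is the bias (underfitting) signal.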


Using an ML Model and Making the Data Scientist's Job Simpler by Packaging the Solution for Delivery

Developing, training, and deploying a machine learning model is an end-to-end exercise. Here, I will explain how a model can be packaged in such a way that it can be used in your Python program effectively and intelligently. It is not as difficult as it sounds. Once your code is done and ready to be deployed, you will see how much flexibility packaging gives you to do things in a more efficient and organized manner. Let us get our hands dirty. I am starting with the below code in setup.py …
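The author's actual setup.py is cut off in this excerpt, so the snippet below is only a hypothetical, minimal sketch of what such a file can look like; the package name, version, and dependencies are placeholders.

```python
# Hypothetical minimal setup.py; name, version and dependencies are placeholders.
from setuptools import setup, find_packages

setup(
    name="my_ml_model",
    version="0.1.0",
    description="A trained ML model packaged for reuse in other Python programs",
    packages=find_packages(),
    include_package_data=True,  # ship serialized model files listed in MANIFEST.in
    install_requires=[
        "scikit-learn>=1.0",
        "numpy>=1.21",
    ],
    python_requires=">=3.8",
)
```

With a file like this in place, `pip install .` builds and installs the package so the model can simply be imported from any Python program.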


Filters Have Meaning: The Context Filter Is the Best of Them

Tableau is a new-generation visualization tool. I consider it the easiest one for implementing visualizations for your audience. It has several features that distinguish it from other existing tools, and it is capable of handling all sorts of problems around representing data in different formats.

Tableau is capable of achieving smart tasks in a smart way. It is better to learn and master them.

There are certain problems which can't be handled directly. Or, I would say, you have to tweak the existing features to achieve your requirement. Here, I will explain how we can handle the requirement where we want…


CODEX

Truly based on Data and Problem Statements

Machine learning has become a buzzword in the last few years. Everybody knows about it and likes to experience it, which is one of the good things to have happened and the reason it keeps getting more popular: more people explore it, and then more ideas come. There are many algorithms and models out there, and it becomes a problem to choose the correct one for your problem statement. Let us walk through the common steps of choosing an ML algorithm and see if we can reach a standard where the chances of a wrong selection are reduced.

Common and Simple Steps before ML models…


CODEX

Use Cases and Innovation Will Make You Experience Visualization Differently

Tableau is the most elaborate and easiest-to-implement visualization tool professionals have experienced. It is one of the most in-demand tools. I have already explored the basics and important topics in the blog below. Now it is time to see a more inventive way of adapting the existing charts in Tableau to your requirements so that you can produce different charts. To your surprise, you will see that it is not that difficult: in a single click you can create the chart options Tableau already offers, and with a little bit of innovation we can turn them into the required ones.

Tableau is capable of providing the favorite…


Overfitting | Less Data | Data Simulation — Solution is CV

Cross validation is one of the techniques that can make the training of your model more reliable with the given data. It is also known as rotation estimation or out-of-sample testing. You will understand in a while why that is!

Cross validation, or simply CV, can also be referred to as out-of-sample testing or rotation estimation. Suppose you have a model which is not generalizing well to test data; in other words, we can say the model is overfitting. It may mean you have too little data for the model to learn from and converge on. And, the solution is…
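As a rough illustration of what that solution looks like in practice, here is a minimal k-fold cross-validation sketch with scikit-learn; the dataset and model choice are assumptions for demonstration, not from the post.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: each fold is held out once as "out of sample" test data,
# so every observation is used for both training and validation.
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:  ", round(scores.mean(), 3))
```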

Laxman Singh

Machine Learning Engineer | Data Science | MTECH NUS, Singapore
