Real-time vs. Batch Analytics
Visual analytics has become essential for businesses around the world. It supports day-to-day operations and helps companies get closer to their customers, which makes the field a real opportunity.
Although visualizations may look like mere decoration, from a data science perspective they are a way to extract crucial business insights. Three main types of data analytics can be distinguished.
1. Predictive analytics – Predicts future patterns by studying existing data. Techniques such as data mining, machine learning and modelling are used to produce the predictions.
2. Interactive analytics – A stored dataset is queried in an ad-hoc manner to find useful insights.
3. Real-time analytics – Processes data at high speed as it arrives, without loading it into storage first. It is used to obtain analytical information quickly for critical applications such as credit card transaction monitoring, heart rate tracking devices and so on.
With that introduction to the types of analytics, this blog article focuses on a solution built for the Uber and Lyft dataset from Boston, USA. The original dataset can be found on the Kaggle website. Batch and real-time analytics were used to assess the market situation and to support business planning for Uber and Lyft taxi trips.
The research question revolves around quantitative data, as the dataset consists mostly of numerical values. It also contains important fields such as product_id and cab_type, as shown in the CSV file below. The original dataset contained 57 columns, but only 13 were used for the analysis, as shown in the diagram below. While analyzing the dataset, it was found that the ‘price’ column contained ‘NA’ values for the Taxi type, with a row count of 55000. The data was therefore preprocessed and the rows containing NA values were removed from the dataset.
The analytical results were obtained using both batch analytics and real-time analytics.
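The preprocessing step described above can be sketched in plain Python. This is only an illustration of the idea (keep a subset of columns, drop rows whose price is 'NA'), not the code used for the article. Apart from product_id, cab_type, name and price, the column names and all sample values are assumptions.

```python
import csv
import io

# Hypothetical subset of the columns kept for analysis. Only product_id,
# cab_type, name and price are named in the article; the rest are assumed.
KEEP_COLUMNS = ["id", "cab_type", "product_id", "name", "price",
                "source", "destination"]


def clean_rows(reader, keep=KEEP_COLUMNS):
    """Yield rows restricted to `keep`, skipping rows with a missing price."""
    for row in reader:
        price = (row.get("price") or "").strip()
        if price in ("", "NA"):
            continue  # drop rows without a usable fare
        cleaned = {col: row.get(col, "") for col in keep}
        cleaned["price"] = float(price)
        yield cleaned


# Tiny in-memory sample standing in for the Kaggle CSV (made-up values).
SAMPLE = """id,cab_type,product_id,name,price,source,destination,extra
a1,Lyft,lyft_line,Shared,5.0,Back Bay,North End,x
a2,Uber,taxi_id,Taxi,NA,Fenway,West End,y
a3,Uber,uber_x,UberX,9.5,Fenway,West End,z
"""

rows = list(clean_rows(csv.DictReader(io.StringIO(SAMPLE))))
print(len(rows), [r["id"] for r in rows])
```

For the real file, the same function could be fed a csv.DictReader opened on the downloaded CSV instead of the in-memory sample.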
Batch analytics
For this part, the Databricks platform was used to process the dataset and produce the required visualizations. In batch analytics, unlike real-time analytics, the whole dataset is loaded and processed first, and only then are queries run against it.
The following are the observations I studied.
1. Number of rides over time, grouped by product_id
As seen in the diagram, there are no prominent spikes in the graph, as the number of rides is almost equal for each product_id.
2. Taxi fare for each cab_type and name
According to the visualization, the average taxi fares are broadly similar across cab_type and name. Lux Black XL from Lyft and Black SUV from Uber have the highest taxi fares.
3. Interesting observations identified
1. In total, Lyft cabs earned slightly more income than Uber cabs.
2. The number of rides from each source is almost equal, implying that rides are evenly distributed across sources.
4. This graph shows the monthly income for each cab type. Lyft cabs have the higher income in both months. Overall, both Uber and Lyft earned their highest income in month 12.
The final dashboard for batch analytics is shown below.
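The batch observations above all come down to group-by aggregations over a dataset that is already fully loaded. The article ran these on Databricks; the sketch below only mirrors the logic in plain Python. The `month` field and all sample prices are made up for illustration.

```python
from collections import defaultdict


def average_fare(rides):
    """Average price per (cab_type, name) over the whole loaded dataset."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of prices, count]
    for r in rides:
        acc = totals[(r["cab_type"], r["name"])]
        acc[0] += r["price"]
        acc[1] += 1
    return {key: s / n for key, (s, n) in totals.items()}


def monthly_income(rides):
    """Total price per (cab_type, month)."""
    totals = defaultdict(float)
    for r in rides:
        totals[(r["cab_type"], r["month"])] += r["price"]
    return dict(totals)


# Made-up sample rides; `month` is an assumed, pre-extracted field.
rides = [
    {"cab_type": "Lyft", "name": "Lux Black XL", "month": 11, "price": 30.0},
    {"cab_type": "Lyft", "name": "Lux Black XL", "month": 12, "price": 34.0},
    {"cab_type": "Uber", "name": "Black SUV", "month": 12, "price": 28.0},
    {"cab_type": "Uber", "name": "UberX", "month": 11, "price": 10.0},
    {"cab_type": "Uber", "name": "UberX", "month": 12, "price": 12.0},
]

avg = average_fare(rides)
income = monthly_income(rides)
print(avg)
print(income)
```

Because every row is available before the query runs, a single pass is enough and the result never has to be revised, which is the main contrast with the streaming queries in the next section.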
Real-time analytics
The real-time analytics were processed using the Siddhi editor provided by WSO2. The CSV file was read in the Siddhi editor and pushed to the Siddhi engine. Once processing starts, the consumed data can be seen in the Siddhi console, as in the screenshot below.
After running the respective queries in the editor, their outputs were written to a MySQL database. The resulting tables were then queried, and the visualizations were built in Power BI.
1. The total number of rides over the last 1 minute for each product_id.
As the first visualization, the total number of rides was counted using the window.time() function to obtain the counts for the last 1 minute.
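To make the idea of a 1-minute sliding time window concrete, here is a minimal Python sketch. It is not Siddhi code and does not claim to reproduce the engine's exact semantics. It only keeps the events from the last 60 seconds and maintains a count per key. The class name and the integer timestamps are assumptions for illustration.

```python
from collections import Counter, deque


class SlidingCount:
    """Counts events per key over the last `window_seconds` seconds."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()    # (timestamp, key) in arrival order
        self.counts = Counter()  # current count per key inside the window

    def add(self, timestamp, key):
        """Register one event and return the counts currently in the window."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)
        return dict(self.counts)

    def _expire(self, now):
        # Drop events that are a full window (or more) older than `now`.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]


window = SlidingCount(window_seconds=60)
for ts, product_id in [(0, "a"), (30, "b"), (59, "a"), (61, "b")]:
    snapshot = window.add(ts, product_id)
print(snapshot)
```

Each incoming ride produces an updated count, and older rides fall out of the window as time advances, so the result keeps changing while the stream is running.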
2. Average taxi fare for the last 1 minute for each product_id, considering rides between the same source and destination.
This graph was complex because the final output contained a large number of rows: it was grouped by source and destination so that only rides between the same two points were averaged together. A tree map was therefore used to visualize the data.
3. Highest taxi fare for the last 1 minute for each cab_type.
Using the built-in window.time(1 min) function, the maximum fare was taken, grouped by cab type. The final result was visualized as a graph so that the values can be compared easily.
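The average-fare and highest-fare queries share one shape: restrict to the last minute, group by a key, then aggregate the price. The sketch below expresses that shape as a single hypothetical helper, `window_aggregate`, in plain Python. The event fields `ts`, `source` and `destination` and all values are assumed for illustration.

```python
from statistics import mean


def window_aggregate(events, now, key_fn, agg, window_seconds=60):
    """Group events from the last `window_seconds` by key and aggregate prices."""
    groups = {}
    for e in events:
        if now - window_seconds < e["ts"] <= now:
            groups.setdefault(key_fn(e), []).append(e["price"])
    return {key: agg(prices) for key, prices in groups.items()}


# Made-up events; `ts` is seconds since the start of the stream.
events = [
    {"ts": 5, "cab_type": "Uber", "product_id": "p1",
     "source": "A", "destination": "B", "price": 10.0},
    {"ts": 50, "cab_type": "Lyft", "product_id": "p2",
     "source": "A", "destination": "B", "price": 22.5},
    {"ts": 70, "cab_type": "Uber", "product_id": "p1",
     "source": "A", "destination": "B", "price": 14.0},
    {"ts": 80, "cab_type": "Uber", "product_id": "p1",
     "source": "A", "destination": "B", "price": 16.0},
    {"ts": 90, "cab_type": "Lyft", "product_id": "p2",
     "source": "C", "destination": "B", "price": 8.0},
]

# Highest fare per cab_type in the last minute.
highest = window_aggregate(events, now=100,
                           key_fn=lambda e: e["cab_type"], agg=max)

# Average fare per product_id for rides between the same two points.
average = window_aggregate(
    events, now=100,
    key_fn=lambda e: (e["product_id"], e["source"], e["destination"]),
    agg=mean,
)
print(highest)
print(average)
```

Grouping by product_id together with source and destination multiplies the number of output rows, which is consistent with the large result set that motivated the tree map.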
4. Top 3 taxi fares of each minute for each cab_type.
Calculating the top 3 taxi fares requires care, because streaming analytics is unlike batch analytics. In batch analytics the data is already loaded, so there is little to worry about. In real-time analytics the results are updated almost every second, so the analysis has to account for which records are current.
Using the ‘limit’ function alone was therefore not enough to obtain the top 3 taxi fares. The current time and an id attribute were used to determine the result. The resulting table is shown below.
As shown in the table, the data can be sorted in descending order by both id and time. This retrieves the latest records, which a traditional limit query on its own would not. The final output was plotted in the graph below.
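One way to read the approach above is: bucket the rides by minute, keep the three highest fares per cab type in each bucket, and then pick the most recent bucket rather than simply limiting the whole table to three rows. The Python sketch below follows that reading. It is an interpretation, not the article's actual query, and the field names and values are assumed.

```python
import heapq
from collections import defaultdict


def top_fares_per_minute(events, n=3):
    """Return {(minute, cab_type): [n highest prices, descending]}."""
    buckets = defaultdict(list)
    for e in events:
        minute = int(e["ts"] // 60)  # one-minute bucket index
        buckets[(minute, e["cab_type"])].append(e["price"])
    return {key: heapq.nlargest(n, prices) for key, prices in buckets.items()}


def latest_minute(result):
    """Keep only the most recent minute, mimicking 'order by time descending'."""
    newest = max(minute for minute, _ in result)
    return {cab: fares for (minute, cab), fares in result.items()
            if minute == newest}


# Made-up events; `ts` is seconds since the start of the stream.
events = [
    {"id": 1, "ts": 3, "cab_type": "Uber", "price": 12.0},
    {"id": 2, "ts": 10, "cab_type": "Uber", "price": 30.0},
    {"id": 3, "ts": 20, "cab_type": "Uber", "price": 8.0},
    {"id": 4, "ts": 45, "cab_type": "Uber", "price": 19.5},
    {"id": 5, "ts": 50, "cab_type": "Lyft", "price": 27.0},
    {"id": 6, "ts": 65, "cab_type": "Lyft", "price": 16.0},
    {"id": 7, "ts": 70, "cab_type": "Uber", "price": 22.0},
]

per_minute = top_fares_per_minute(events)
latest = latest_minute(per_minute)
print(per_minute)
print(latest)
```

A plain limit of 3 over all rows would mix fares from different minutes and cab types; keying on the minute first is what keeps the answer tied to the current window.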
5. Source location with the least number of rides in each 1 minute.
The same approach of ordering by the current time was used to obtain this visualization. The graph is shown below.
The final dashboard for real-time analytics is shown below.
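The least-used source per minute can be sketched the same way: count rides per source inside each one-minute bucket and take the minimum. This is an illustrative Python version with made-up data, not the query used in the article. Ties are broken alphabetically, which is an arbitrary choice, and a source with no rides at all in a given minute never appears in that minute's counts.

```python
from collections import Counter, defaultdict


def least_used_source_per_minute(events):
    """Return {minute: (source, ride_count)} for the least-used source."""
    counts = defaultdict(Counter)
    for e in events:
        counts[int(e["ts"] // 60)][e["source"]] += 1
    return {
        minute: min(per_source.items(), key=lambda kv: (kv[1], kv[0]))
        for minute, per_source in sorted(counts.items())
    }


# Made-up events; `ts` is seconds since the start of the stream.
events = [
    {"ts": 5, "source": "Back Bay"},
    {"ts": 15, "source": "Back Bay"},
    {"ts": 30, "source": "Fenway"},
    {"ts": 61, "source": "Fenway"},
    {"ts": 62, "source": "Fenway"},
    {"ts": 80, "source": "North End"},
    {"ts": 90, "source": "Back Bay"},
]

least = least_used_source_per_minute(events)
print(least)
```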











