Processing and Analyzing IoT data in real-time at scale
Contributed By - Hari Gottipati
With the explosion of IoT over the past few years, the amount of data produced has been stunning. For example, Virgin Atlantic's new fleet of highly connected planes (Boeing 787s) generates over half a terabyte of data per flight. The sensors in a self-driving car generate nearly 1 GB of data every second. On average, Americans drive 600 hours per year in their cars, which is 2,160,000 seconds of driving, or approximately 2 petabytes of data per car per year at that rate.
The IoT market saw an unexpected acceleration in 2018, lifting the total number of IoT devices to 7 billion. According to IT Pro, around 3.6 billion IoT devices were connected to the Internet in 2018 alone. Including smartphones, tablets, and laptops, the number of connected devices in the world exceeds 17 billion. The IoT market is expected to grow to $1,567B by 2025. According to SC Magazine, by 2020, 90% of vehicles will be connected to the internet and 173.47 million wearable devices will be in use.
The sheer amount of data coming from IoT sensors in real time leads to new challenges and new use cases. One such use case is time-series data, which helps monitor various systems to improve efficiency and avoid failures in real time. However, real-time data brings new challenges in handling high throughput with highly available systems. While your real-time monitoring dashboard queries the underlying datastore every 1 to 3 seconds, sensors flood that datastore with thousands upon thousands of write queries per second. Processing frameworks consume data from message brokers and run computations on that data before storing it in a datastore.
Anatomy of end-to-end real-time data systems
Just like big data systems, real-time data systems are composed of four stages - data acquisition, data processing, data storage, and data exploration - but the technologies used in each stage are completely different between big data and real-time data. Real-time has two aspects. One - how fresh the data in the real-time system is compared to the transactional system. Two - how fast you can access the data once it is in the real-time system. It is important to always distinguish between real-time data freshness and real-time data retrieval. A good real-time system exhibits the following characteristics:
Low latency - it processes data as soon as it enters the system, usually within milliseconds.
Guaranteed message delivery - irrespective of failures, the real-time system processes all the messages.
Scalability - it can scale to any volume.
Fault tolerance - when a failure occurs, the system can automatically restart and continue processing without disruption.
Data acquisition
This area is quite different from batch acquisition, as it involves acquiring the data from sensors over wireless networks. Instead of each sensor sending its data to the cloud directly over a wireless network, it sends the data to an edge device, and the edge device in turn transmits the data to the cloud or back-end servers. Sensors use protocols such as Bluetooth, Bluetooth Low Energy, Wi-Fi, Zigbee, Z-Wave, NFC, RFID, MEMS, HAN, LoRaWAN, etc. to transmit the data to an edge device, and the edge device uses protocols such as MQTT, AMQP, CoAP, JMS, HTTP, DDS, etc. to transmit the data to an IoT gateway. In turn, the IoT gateway makes the data available for processing on messaging platforms such as Apache Kafka, Apache Pulsar, Akka Streams, Amazon Kinesis, ActiveMQ, and RabbitMQ, as sketched below.
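As a rough illustration of the sensor-to-gateway hop, the sketch below publishes a JSON sensor reading to an MQTT broker using the paho-mqtt client library. The broker host, topic name, and payload fields are assumptions for the example, not part of any particular product.

```python
# Minimal sketch: an edge device publishing a sensor reading over MQTT.
# Assumes the paho-mqtt library and a broker reachable at
# edge-gateway.local (hypothetical host and topic names).
import json
import time

import paho.mqtt.client as mqtt

client = mqtt.Client(client_id="edge-device-001")
client.connect("edge-gateway.local", 1883)   # default MQTT port
client.loop_start()

reading = {
    "device_id": "thermostat-42",
    "temperature": 21.7,
    "ts": int(time.time() * 1000),           # epoch milliseconds
}

# QoS 1 asks the broker to acknowledge delivery at least once.
client.publish("sensors/thermostat-42/temperature", json.dumps(reading), qos=1)

client.loop_stop()
client.disconnect()
```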
Data processing
Unlike batch data processing, real-time systems process data message by message. Real-time processing consumes an endless stream of data captured in real time and processes it with minimal latency to generate real-time (or near-real-time) reports or automated responses. Frameworks such as Apache Spark, Apache Flink, Apache Apex, Apache Samza, Apache Storm, and Apache Beam can process messages in sub-second times at scale. Frameworks such as TensorFlow, PyTorch, scikit-learn, Keras, etc. can help run machine learning models as part of the data processing.
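To make the processing step concrete, here is a minimal Spark Structured Streaming sketch that reads sensor readings from a Kafka topic and computes a windowed average per device. The broker address, topic name, and JSON payload fields are assumptions for illustration, and running it requires the Spark Kafka connector package on the classpath.

```python
# Minimal sketch: streaming aggregation with PySpark Structured Streaming.
# Assumes Kafka at localhost:9092 and a "sensor-readings" topic carrying
# JSON messages shaped like the hypothetical schema below.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window, avg
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("iot-stream").getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType()),
])

readings = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sensor-readings")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("r"))
    .select("r.*"))

# One-minute tumbling-window average temperature per device,
# tolerating data that arrives up to two minutes late.
averages = (readings
    .withWatermark("event_time", "2 minutes")
    .groupBy(window(col("event_time"), "1 minute"), col("device_id"))
    .agg(avg("temperature").alias("avg_temp")))

query = (averages.writeStream
    .outputMode("update")
    .format("console")
    .start())
query.awaitTermination()
```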
Data storage
The datastore needs to support strong consistency while enabling fast, inexpensive reads and writes of large streams of data. Different datastores have different strengths - some are better at reading, some at inserting, some at updating, and some are better for time-series data. Druid, InfluxDB, and Apache Pinot are good for time-series data, while Apache Cassandra, Couchbase, Amazon DynamoDB, Redis, Riak, Neo4j, and MongoDB are suitable for various other real-time datastore needs.
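For the time-series case, the sketch below writes a processed reading into InfluxDB (1.x) with its Python client and runs the kind of aggregate query a dashboard would issue. The database, measurement, tag, and field names are made up for the example.

```python
# Minimal sketch: storing and querying a time-series point in InfluxDB 1.x
# (pip install influxdb). Database, measurement, and tag/field names
# are hypothetical.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="iot")
client.create_database("iot")                 # no-op if it already exists

point = {
    "measurement": "temperature",
    "tags": {"device_id": "thermostat-42", "site": "plant-1"},
    "time": "2018-11-05T12:00:00Z",           # RFC3339 timestamp
    "fields": {"value": 21.7},
}

client.write_points([point])

# Dashboards typically issue aggregate queries like this one.
result = client.query(
    "SELECT MEAN(value) FROM temperature WHERE time > now() - 1h GROUP BY time(1m)"
)
print(list(result.get_points()))
```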
Data exploration/visualization
Real-time data visualization and exploration become increasingly critical for accessing, analyzing, visualizing, and exploring live data arriving at lightning speed. Tools like Kibana, Grafana, and Banana are good at visualizing time-series data; D3.js, Taucharts, and Gephi libraries are good for charts, graphs, and metrics; Apache Zeppelin and Jupyter Notebooks are good for creating and sharing documents that contain live code, equations, visualizations, and narrative text. Open source tools like Apache Superset, Redash, and Metabase are competing with commercial vendors Tableau and Qlik to provide interactive data visualization products.
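As a small example of notebook-style exploration, the sketch below pulls the last hour of readings from the same hypothetical InfluxDB database used above and plots them with pandas and matplotlib inside a Jupyter notebook.

```python
# Minimal sketch: ad-hoc exploration of time-series data in a notebook.
# Assumes the hypothetical "iot" database and "temperature" measurement
# from the storage example.
import pandas as pd
import matplotlib.pyplot as plt
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="iot")
points = client.query(
    "SELECT value FROM temperature WHERE time > now() - 1h"
).get_points()

df = pd.DataFrame(list(points))
df["time"] = pd.to_datetime(df["time"])

df.plot(x="time", y="value", title="Temperature over the last hour")
plt.show()
```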
Note: InfluxDB (data storage) is accompanied by Telegraf (data acquisition), Kapacitor (data processing), and Chronograf (data exploration) to support an end-to-end real-time system.