To identify unusual behaviors of tracked targets in a large and ever-growing set of raw geospatial track data, we built an analytics application based on Cassandra, Hadoop, and proprietary geospatial algorithms. These algorithms consumed track data and generated summary data, both of which were stored in Cassandra; Hadoop served as the processing engine.
The “NoSQL” database Cassandra has established itself as a leader in raw I/O throughput. Its eventual-consistency model and consistent-hashing scheme let it distribute data efficiently across the nodes of a cluster, yielding a streamlined path from the point where data is ingested to the point where it is used.
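The hashing idea behind this data distribution can be illustrated with a toy consistent-hash ring. This is a minimal sketch, not Cassandra's actual implementation: the node names and key format are hypothetical, and MD5 is used only because Cassandra's classic RandomPartitioner hashed keys similarly.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a key to a large integer position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: each node owns the arc ending at its token."""
    def __init__(self, nodes):
        # One token per node; a production system would use many virtual nodes.
        self.tokens = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        h = _hash(key)
        # First token clockwise from the key's hash; wrap around at the end.
        idx = bisect.bisect(self.tokens, (h, "")) % len(self.tokens)
        return self.tokens[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("track:1234")  # every replica computes the same owner
```

Because placement is a pure function of the key, any node can locate a record's owner without consulting a central directory, which is what keeps the ingest-to-use path short.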
Hadoop is fast becoming the industry standard for distributed processing. Its ability to scale individual stages of an algorithm, as well as the job as a whole, helps developers eliminate bottlenecks in the data flow and utilize all of the hardware at their disposal.
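The map-shuffle-reduce pattern Hadoop distributes across a cluster can be sketched in a few lines of single-process Python. This is an illustrative model only, with hypothetical track records; a real job would implement Hadoop's Mapper and Reducer interfaces and run across many nodes.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal single-process MapReduce: map each record to (key, value)
    pairs, shuffle the values by key, then reduce each group."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase
            shuffled[key].append(value)     # shuffle: group values by key
    return {key: reducer(key, values) for key, values in shuffled.items()}

# Hypothetical track fixes: (target_id, lat, lon)
points = [("cab-1", 37.77, -122.42),
          ("cab-2", 37.78, -122.41),
          ("cab-1", 37.76, -122.43)]

# Count the fixes observed per target.
counts = map_reduce(points,
                    mapper=lambda rec: [(rec[0], 1)],
                    reducer=lambda key, vals: sum(vals))
# counts == {"cab-1": 2, "cab-2": 1}
```

Each phase scales independently: mappers parallelize over input splits and reducers over keys, which is what lets a bottlenecked stage be given more hardware without rewriting the rest of the job.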
Together, Hadoop and Cassandra provide a highly scalable framework on which to build analytical applications that operate over massive amounts of data.
OSC developed a custom map-reduce job that fed all of the existing data through various web services and stored the resulting tracks alongside the originals. The job aligned each I/O record with its Hadoop processing node and an instance of the third-party web services, dividing the workload evenly and keeping network traffic to a minimum. After the historical data had been processed, the client could reuse the same map-reduce process to quickly ingest new records, both in bulk and ad hoc. The new metadata produced an order-of-magnitude increase in the number of track data points available for analysis.
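The alignment of records with co-located processing can be sketched as deterministic key-based routing. This is a simplified stand-in for what the job actually did: the worker names and record keys are hypothetical, and the point is only that the same record key always lands on the same node (and therefore the same local web-service instance), so no record has to cross the network twice.

```python
import hashlib

# Hypothetical worker nodes, each co-located with a web-service instance.
WORKERS = ["worker-0", "worker-1", "worker-2"]

def assign_worker(record_key: str) -> str:
    """Route a record to a worker purely from its key, so routing is
    deterministic and needs no central coordinator."""
    h = int(hashlib.md5(record_key.encode()).hexdigest(), 16)
    return WORKERS[h % len(WORKERS)]

# Partition a batch of track IDs into per-worker work lists.
batch = [f"track-{i}" for i in range(9)]
partitions = {w: [] for w in WORKERS}
for key in batch:
    partitions[assign_worker(key)].append(key)
```

Because the same function works on one historical backlog or one freshly arrived record, the same partitioning logic serves both the bulk reprocessing run and the ad hoc ingest path described above.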
The system was tested against publicly available GPS traces of taxi cabs driving in San Francisco.
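One simple kind of "unusual behavior" check that such taxi traces support is flagging physically implausible speeds between consecutive GPS fixes. The sketch below is an assumed example of that idea, not one of the proprietary algorithms: the threshold, record layout, and helper names are all hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two fixes, in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def flag_speed_anomalies(track, max_kmh=120.0):
    """Return (timestamp, speed) for consecutive fixes whose implied speed
    is implausibly high. `track` is time-ordered (t_seconds, lat, lon)."""
    anomalies = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(track, track[1:]):
        dt_h = (t1 - t0) / 3600.0
        if dt_h <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speed = haversine_km(la0, lo0, la1, lo1) / dt_h
        if speed > max_kmh:
            anomalies.append((t1, speed))
    return anomalies

# A plausible hop, then a one-degree latitude jump in 60 s (thousands of km/h).
track = [(0, 37.77, -122.42), (60, 37.78, -122.42), (120, 38.77, -122.42)]
flags = flag_speed_anomalies(track)  # only the jump at t=120 is flagged
```

Expressed as a mapper over per-target tracks, a check like this drops straight into the map-reduce pipeline described earlier.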