On a technical level, Splunk is a key-value store in which the key is a timestamp. In addition, it uses MapReduce to process data.
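To make the idea concrete, here is a minimal sketch of a timestamp-keyed store with a map step and a reduce step that count events per hour. It is a toy model only: the names `store`, `put`, `map_phase`, and `reduce_phase` and the sample events are assumptions made for illustration, not Splunk internals or Splunk APIs.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical toy store: each timestamp key maps to the raw events seen at that time.
store = defaultdict(list)

def put(ts, raw):
    """Store a raw event under its timestamp key."""
    store[ts].append(raw)

def map_phase(ts, raw):
    """Map step: emit (hour bucket, 1) for a single event."""
    return (ts.replace(minute=0, second=0, microsecond=0), 1)

def reduce_phase(pairs):
    """Reduce step: sum the emitted values for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Made-up sample events.
put(datetime(2013, 1, 1, 10, 5), "GET /index.html 200")
put(datetime(2013, 1, 1, 10, 45), "GET /missing 404")
put(datetime(2013, 1, 1, 11, 2), "POST /login 200")

pairs = [map_phase(ts, raw) for ts, events in store.items() for raw in events]
events_per_hour = reduce_phase(pairs)
```

Because every event is keyed by time, a time-bucketed aggregation like this one needs nothing more than the key itself in the map step.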
Whereas most of the products that we described earlier had their origins in processing human-generated digital footprints, Splunk started as a product designed to process machine data. Because of these humble beginnings, Splunk is not always considered a player in big data. But that should not prevent you from using it to analyze big data belonging in the digital footprint category, because, as this book shows, Splunk does a great job of it. Splunk has three main functionalities:
- Data collection, which can be done for static data or by monitoring changes and additions to files or complete directories in real time. Data can also be collected from network ports or directly from programs or scripts. Additionally, Splunk can connect with relational databases to collect, insert, or update data.
- Data indexing, in which the collected data is broken down into events, roughly equivalent to database records, or simply lines of data. Then the data is processed and a high-performance index is updated, which points to the stored data.
- Search and analysis. Using the Splunk Processing Language, you are able to search for data and manipulate it to obtain the desired results, whether in the form of reports or alerts. The results can be presented as individual events, tables, or charts.
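The three functionalities can be pictured with a small sketch: collect raw text, break it into events, build an index that points back to the stored events, and then search through that index. This is a simplified illustration under stated assumptions. The functions `collect`, `build_index`, and `search` are hypothetical, the index is a plain in-memory inverted index, and the search is ordinary Python rather than the Splunk Processing Language.

```python
from collections import defaultdict

def collect(raw_text):
    """Break raw input into events, here simply one event per non-empty line."""
    return [line for line in raw_text.splitlines() if line.strip()]

def build_index(events):
    """Map each lowercase token to the positions of the events that contain it."""
    index = defaultdict(set)
    for pos, event in enumerate(events):
        for token in event.lower().split():
            index[token].add(pos)
    return index

def search(index, events, *terms):
    """Return the events that contain every term, in their original order."""
    if not terms:
        return list(events)
    hits = set.intersection(*(index.get(t.lower(), set()) for t in terms))
    return [events[i] for i in sorted(hits)]

# Made-up sample data standing in for a monitored log file.
raw = """2013-01-01 10:05:00 GET /index.html 200
2013-01-01 10:45:00 GET /missing 404
2013-01-01 11:02:00 POST /login 200
"""

events = collect(raw)
index = build_index(events)
matches = search(index, events, "GET", "404")
```

The index stores only positions, not the events themselves, which mirrors the description above of an index that points to the stored data.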
Each one of these functionalities can scale independently; for example, the data collection component can scale to handle hundreds of thousands of servers. The data indexing functionality can scale to a large number of servers, which can be configured as distributed peers, and, if necessary, with a high availability option to transparently handle fault tolerance. The search heads, as the servers dedicated to the search and analysis functionality are known, can also scale to as many as needed. Additionally, each of these functionalities can be arranged in such a way that they can be optimized to accommodate geographical locations, time zones, data centers, or any other requirements. Splunk is so flexible regarding scalability that you can start with a single instance of the product running on your laptop and grow from there.
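The division of labor between search heads and indexing peers can be sketched as a fan-out and merge: the search head sends the same query to every peer, each peer searches only its own events, and the partial results are merged by timestamp. The peer names, the sample events, and the functions `search_peer` and `search_head` are assumptions made for illustration; this is not Splunk's actual distributed search protocol.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical peers, each holding its own events sorted by ISO timestamp.
peers = {
    "indexer-a": [
        ("2013-01-01T10:05:00", "GET /index.html 200"),
        ("2013-01-01T11:02:00", "POST /login 200"),
    ],
    "indexer-b": [
        ("2013-01-01T10:45:00", "GET /missing 404"),
        ("2013-01-01T12:30:00", "GET /about 200"),
    ],
}

def search_peer(events, term):
    """Run on a peer: return its matching events, still sorted by timestamp."""
    return [(ts, raw) for ts, raw in events if term in raw]

def search_head(peers, term):
    """Fan the query out to all peers in parallel, then merge the sorted partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda events: search_peer(events, term), peers.values()))
    return list(heapq.merge(*partials))

results = search_head(peers, "200")
```

Adding a peer in this model only adds one more sorted list to merge, which is one way to picture why the indexing and search tiers can grow independently.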
You can interact with Splunk by using SplunkWeb, the browser-based user interface, or directly using the command-line interface (CLI). Splunk is flexible in that it can run on Windows or just about any variation of Unix.
Splunk is also a platform that can be used to develop applications to handle big data analytics. It has a powerful set of APIs that can be used with Python, Java, JavaScript, Ruby, PHP, and C#. The development of apps on top of Splunk is beyond the scope of this book; however, we do describe how to use some of the popular apps that are freely available. We will leave it at that, as the rest of the book is about Splunk.