Tuesday, September 1, 2015

AWS S3 logs in Splunk

With Splunk it is possible to solve almost any task where we have lots of unstructured data and want to quickly get insight from it. Let's consider an example from the world of cloud technologies: the AudienceStream DMP. This is an amazing data management platform that serves us for real marketing automation.

One of the capabilities of the AS platform is enriching the current customer database with external data. For example, we can use machine learning and internal transaction data to define a score for every customer who is online. We can calculate the scoring via R or an enterprise data mining platform such as SAS or SPSS and send the result to Amazon S3. AS can connect to S3 and ingest the files. But how can we analyze this process? There are lots of S3 logs in our bucket with GET and PUT methods. Splunk can easily connect to AWS S3 via a Splunk app: the Splunk Add-on for Amazon Web Services.
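
For reference, S3 server access logs are plain-text lines with space-separated fields: bucket owner, bucket, time, remote IP, requester, request ID, operation, object key, request URI, HTTP status, and so on. The entry below is a synthetic example in that format (the owner/requester IDs and object key are made up; the IP is the AS address used later in this post):

EXAMPLEBUCKETOWNERID mybucket [01/Sep/2015:10:15:38 +0000] 107.14.21.138 EXAMPLEREQUESTERID 3E57427F3EXAMPLE REST.GET.OBJECT scores/customers.csv "GET /mybucket/scores/customers.csv HTTP/1.1" 200 - 2662992 2662992 70 10 "-" "aws-sdk-java/1.9.0" -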

As usual, we download this app and install it. Then we have to connect to our AWS account:
We only need to copy and paste the Access Key ID and Secret Key.
After connecting to the AWS account, we should set up a new data input. Let's go to AWS S3:
And click "Add New".
Then choose the existing account, the S3 host name (in my case it is the default), and the S3 bucket:
Moreover, we can specify the index, white/blacklists, and some other options. When we finish, Splunk will update the inputs.conf file, which we can open and adjust at any time; it contains all the parameters of the S3 input.
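
As an illustration, the resulting stanza in inputs.conf might look roughly like the following. This is a sketch: the stanza and parameter names follow the Splunk Add-on for AWS but may differ between add-on versions, and the account and bucket names are placeholders.

[aws_s3://s3logs_input]
aws_account = my_aws_account
bucket_name = my-log-bucket
host_name = s3.amazonaws.com
sourcetype = aws:s3
index = s3logs
polling_interval = 1800
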
When we finish, Splunk begins to ingest the logs. We can easily search them by typing the name of the index; in my case it is "s3logs":
By default, Splunk doesn't extract any fields from the S3 logs. But we can easily find the structure of the log and create all the fields using the field extractor:
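
The field extractor saves its result as an inline extraction in props.conf. A simplified sketch of what such an extraction could look like is below; the regex is a hypothetical, abridged version covering only the three fields used later in this post (clientip, method_detail, object_name), keyed to the log format shown earlier:

[aws:s3]
EXTRACT-s3access = ^\S+\s+\S+\s+\[[^\]]+\]\s+(?<clientip>\S+)\s+\S+\s+\S+\s+(?<method_detail>\S+)\s+(?<object_name>\S+)
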
We now have all the information we need to create a report that shows us hourly uploads by AS for the last 48 hours.

index=* sourcetype=aws:s3 earliest=-48h method_detail="REST.GET.OBJECT" clientip = 107.14.21.138 object_name=*.csv | timechart span=1h count by object_name

We filter by file type, method, and the IP of AS.

Moreover, we can use the Splunk CLI to run the search automatically via crontab and export the result to a flat file, in order to visualize and update the report every 5 minutes.

Here is a shell script for Linux:

#!/bin/bash
# File: test_action.sh
# Description: Run the search via the Splunk CLI and write the result to a flat file

# Path to the Splunk installation (adjust to your environment)
SPLUNK_HOME="/Applications/Splunk"
OUTPUT="test_output.log"
SPLUNK_USER=admin
SPLUNK_PASSWORD=changeme

# Run the search and redirect the tabular result to the output file
"$SPLUNK_HOME"/bin/splunk search 'index=* sourcetype=aws:s3 earliest=-48h method_detail="REST.GET.OBJECT" clientip=107.14.21.138 object_name=*.csv | timechart span=1h count by object_name' -auth "${SPLUNK_USER}:${SPLUNK_PASSWORD}" > "${OUTPUT}"
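
To refresh the report every 5 minutes, as mentioned above, the script can be scheduled with a crontab entry along these lines (the script path is a placeholder for wherever you keep it):

*/5 * * * * /opt/scripts/test_action.sh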