Notice: Technically speaking, CloudFront and ELB logs are delivered “best effort” by AWS and may be delayed by up to 24 hours. That said, for light and moderate workloads at least, I haven’t found any major discrepancies that would keep me from using them for analysis at ~15 minute intervals.

Step 1 - Turn on Logging

  • Select your load balancer
  • Actions -> Edit Attributes
  • Access Logs -> Enable
    • S3 location
    • Create this location for me
  • Go to S3 and adjust the bucket permissions so that Athena can read the logs (see the policy sketch below).
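
For reference, below is a minimal sketch of a bucket policy that grants read access to the IAM principal you run Athena queries as. The user ARN and bucket name are placeholders rather than values from a real setup, so substitute your own; depending on your account layout, attaching an equivalent policy directly to the IAM user can work just as well.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowAthenaPrincipalToReadLogs",
          "Effect": "Allow",
          "Principal": { "AWS": "arn:aws:iam::<AWS ACCOUNT NUMBER>:user/<your-athena-user>" },
          "Action": ["s3:GetObject", "s3:ListBucket"],
          "Resource": [
            "arn:aws:s3:::<YOUR_LOG_BUCKET>",
            "arn:aws:s3:::<YOUR_LOG_BUCKET>/*"
          ]
        }
      ]
    }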

Step 2 - Create Athena Table

  • Ignore their silly tutorial; it’s useless.
  • In the left menu, choose “Create Table”.
  • General instructions are in the [Athena from ALB docs](https://docs.aws.amazon.com/athena/latest/ug/application-load-balancer-logs.html#create-alb-table).
  • I have found that the query below works with ALB, ELB, and CloudFront logs, though.
  • I have also found that the documented queries for using Athena with [Legacy ELBs](https://docs.aws.amazon.com/athena/latest/ug/cloudfront-logs.html) and [CloudFront](https://docs.aws.amazon.com/athena/latest/ug/cloudfront-logs.html) are wrong and will result in empty data sets.
  • The only things you should need to edit are the table_name at the top and the s3://bucket/location/dir at the bottom. Look in your S3 bucket for the best subdirectory to use; the patterns are subtly different for each service.
    CREATE EXTERNAL TABLE IF NOT EXISTS alb_logs (
        type string,
        time string,
        elb string,
        client_ip string,
        client_port int,
        target_ip string,
        target_port int,
        request_processing_time double,
        target_processing_time double,
        response_processing_time double,
        elb_status_code string,
        target_status_code string,
        received_bytes bigint,
        sent_bytes bigint,
        request_verb string,
        request_url string,
        request_proto string,
        user_agent string,
        ssl_cipher string,
        ssl_protocol string,
        target_group_arn string,
        trace_id string,
        domain_name string,
        chosen_cert_arn string,
        matched_rule_priority string,
        request_creation_time string,
        actions_executed string,
        redirect_url string,
        lambda_error_reason string,
        new_field string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'serialization.format' = '1',
        'input.regex' =
        '([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*) ([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" \"([^\"]*)\" ([A-Z0-9-]+) ([A-Za-z0-9.-]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\" \"([^\"]*)\" ([-.0-9]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\"($| \"[^ ]*\")(.*)')
    LOCATION 's3://<YOUR_LOG_BUCKET>/us-east-1/<alb-name>/AWSLogs/<AWS ACCOUNT NUMBER>/elasticloadbalancing/us-east-1';
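
Before moving on, it’s worth a quick sanity check, since a regex that doesn’t match your log format produces empty result sets rather than errors. A minimal query against the alb_logs table defined above:

    select * from default.alb_logs limit 10;

If that comes back empty even though the S3 location clearly contains log files, re-check the LOCATION path first.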

Step 3 - Create Athena Query

So far, the queries I have written in Athena have been compatible with MySQL syntax (under the hood, Athena runs Presto, which speaks largely ANSI SQL).

Considering it’s pulling data out of compressed log files in S3, I’m surprised at how snappy it is.

select count(*), client_ip from default.alb_logs
where domain_name = 'nelsonenzo.com'
group by client_ip;
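
Building on that, below is a sketch of the kind of near-real-time query that the ~15 minute delivery lag still allows. It assumes the alb_logs table defined above; from_iso8601_timestamp is the Presto function that parses the ISO8601 strings in the time column:

    select count(*) as hits, elb_status_code
    from default.alb_logs
    where from_iso8601_timestamp(time) > current_timestamp - interval '15' minute
    group by elb_status_code
    order by hits desc;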

Final Words

Ultimately, Athena feels very clunky. The tabbed query interface is kind of, um, interesting. I would much rather have an ETL process that ships these logs into something like Elasticsearch, or some paid 3rd-party log aggregator, for querying. Not only is the Athena interface clunky, but getting the results into a service where they would be actionable requires using AWS’s SDK; it’s nice that it exists, but it’s far from as easy as hitting an API or using a webhook from a proper log aggregating service.