Amazon Athena is an interactive query service that enables us to run SQL queries on raw data stored in S3 buckets. It is a completely serverless solution, meaning you do not need to deploy or manage any infrastructure to use it, and, most importantly, you do not pay for any data warehouse resources such as clusters when you are not using it. It is built on Presto, a distributed SQL query engine that Facebook developed in 2012 and open-sourced in 2013.
When we use Athena to query data, we actually use three AWS services together: Amazon S3 to store the data, the AWS Glue Data Catalog to catalog it, and Athena to query it. Note that Athena does not store your data or a copy of it; since the data already lives in your S3 bucket and is cataloged with AWS Glue, it doesn't need to.
AWS Glue stores metadata about the cataloged data, such as the table definition, column names and types, the location of the data on S3, its format and SerDe, and partition information.
Unlike a traditional RDBMS, Athena works asynchronously: when you send a query, you get a query execution ID as the response. You can poll Athena for the status of the query, and when it is completed you can use this ID to fetch the result through the AWS CLI, the SDKs, or the Management Console. Athena also saves the query result to an S3 bucket that you configure during setup.
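As a rough sketch of this flow with the AWS CLI (the database, table, and output bucket names below are the ones we will create later in this post):

# Submit the query; Athena responds immediately with a QueryExecutionId.
QUERY_ID=$(aws athena start-query-execution \
  --query-string 'SELECT count(*) FROM covid' \
  --query-execution-context Database=covidprod \
  --result-configuration OutputLocation=s3://sufle-athena-output-bucket/ \
  --output text --query 'QueryExecutionId')

# Poll the status until it is no longer QUEUED or RUNNING.
aws athena get-query-execution \
  --query-execution-id "$QUERY_ID" \
  --query 'QueryExecution.Status.State'

# Once the state is SUCCEEDED, fetch the results (they are also saved to the output bucket).
aws athena get-query-results --query-execution-id "$QUERY_ID"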
Two options are available as the data source:
Amazon S3, using SerDe (serializer/deserializer) libraries. The data formats you can query on S3 vary: CSV (comma-separated), TSV (tab-separated), custom-delimited, JSON, Apache Avro, Parquet, and ORC. Using line-based text SerDes such as Regex and Grok, you can even handle your own format, which gives you a great deal of flexibility.
Data sources other than S3, using Athena Federated Query, which puts AWS Lambda between Athena and your data store to run serverless queries. Data connectors available for Athena Federated Query include Amazon CloudWatch Logs, Amazon DynamoDB, MongoDB (AWS-managed or self-hosted), MySQL (Amazon RDS or self-hosted), PostgreSQL (Amazon RDS or self-hosted), Elasticsearch (Amazon Elasticsearch Service or self-hosted), and Redis (Amazon ElastiCache or self-hosted). One great advantage is that you can use this feature to run asynchronous queries against a relational database from within your application.
I used WHO's COVID-19 data, published as a CSV file on their website, which contains the official daily counts of COVID-19 cases, deaths, and vaccine utilization reported to WHO by countries, territories, and areas. In this data the country codes are in ISO Alpha-2 format, so to visualize it on a map we also need the coordinates (latitude/longitude) of each country keyed by its Alpha-2 code. For that I used a CSV shared as a GitHub gist by a community member.
Covid Data: https://covid19.who.int/info/
Country List: https://gist.github.com/tadast/8827699#file-countries_codes_and_coordinates-csv
Download the files and upload them to different folders in an Amazon S3 bucket using the structure below (the corresponding upload commands are sketched after the listing):
➜ ~ aws s3 ls s3://sufle-athena-bucket --recursive
2021-10-18 22:52:01 0 country/
2021-10-18 22:55:01 12720 country/countries_codes_and_coordinates.csv
2021-10-18 22:51:47 0 covid/
2021-10-18 22:55:40 4802871 covid/WHO-COVID-19-global-data.csv
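For reference, this is roughly how the two files can be uploaded with the AWS CLI (adjust the bucket name to your own):

# Upload each CSV into its own prefix ("folder") of the bucket.
aws s3 cp WHO-COVID-19-global-data.csv s3://sufle-athena-bucket/covid/
aws s3 cp countries_codes_and_coordinates.csv s3://sufle-athena-bucket/country/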
Let's go to the Athena service on the AWS Management Console. If this is the first time you are using the service, it will initially ask you for a few settings, the most important one being an S3 bucket to store the query results in; we'll use sufle-athena-output-bucket.
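If you prefer to script this step, the same query result location can be set on the default workgroup with the AWS CLI; a sketch using the bucket chosen above:

# Point the primary (default) workgroup's query results at the output bucket.
aws athena update-work-group \
  --work-group primary \
  --configuration-updates 'ResultConfigurationUpdates={OutputLocationUpdate=s3://sufle-athena-output-bucket/}'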
After you are done with the initial configuration, click the "Connect data source" button on the "Data sources" tab to start creating your first catalog.
On "Step 1: Choose a data source", you will choose;
And on the next one "Step 2: Connection details" you need to select,
These sections will redirect you to the AWS Glue Data Catalog creation page.
Before we move on, what is a "Crawler"?
A crawler is a tool that scans your data and populates the AWS Glue Data Catalog automatically for you. It is the easiest way to create a catalog.
Let's get back to creating the data catalog. On the step-by-step form we configure the following for the COVID data (the equivalent CLI calls are sketched right after this list):

Crawler name: covid
S3 path of the data store: s3://sufle-athena-bucket/covid
IAM role: in our example, AWSGlueServiceRole-covid
Output database: covidprod

We'll do this for the country data as well, very similarly:

Crawler name: country
S3 path of the data store: this time s3://sufle-athena-bucket/country
IAM role: in our example, AWSGlueServiceRole-country
Output database: covidprod
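If you prefer not to click through the wizard, the same two crawlers could be created with the AWS CLI; a rough sketch, assuming the IAM roles above already exist:

# Create one crawler per dataset; both write their tables into the covidprod database.
aws glue create-crawler \
  --name covid \
  --role AWSGlueServiceRole-covid \
  --database-name covidprod \
  --targets '{"S3Targets": [{"Path": "s3://sufle-athena-bucket/covid"}]}'

aws glue create-crawler \
  --name country \
  --role AWSGlueServiceRole-country \
  --database-name covidprod \
  --targets '{"S3Targets": [{"Path": "s3://sufle-athena-bucket/country"}]}'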
After you have created the crawlers, you can select them and click "Run crawler" so that AWS Glue creates the data catalog automatically for you. Once the runs are complete, you can go to "Tables" to see details of the cataloged data such as record counts, delimiters, columns, etc.
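The crawlers can also be started and monitored from the CLI; for example:

# Start both crawlers, then check their state and the tables they produced.
aws glue start-crawler --name covid
aws glue start-crawler --name country
aws glue get-crawler --name covid --query 'Crawler.State'
aws glue get-tables --database-name covidprod --query 'TableList[].Name'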
Now that the data and the catalogs are ready, we can start querying our data using Amazon Athena. Let's go back to the Athena service and select "Query editor". Then, from the sidebar, select the database that you created in the crawler configuration and use the following queries to test it:
SELECT * FROM "covidprod"."covid" limit 10;
SELECT * FROM "covidprod"."covid" where country="Turkey";
If you check the output bucket that you selected earlier, you will see the query results that have been saved there.
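For example, listing the bucket with the same CLI command we used earlier shows one result file (plus a metadata file) per query execution:

aws s3 ls s3://sufle-athena-output-bucket/ --recursive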
If you query the country table that we have created, you'll see that the values come back wrapped in double-quote characters ("), since the CSV data file contains them, and that latitude and longitude are empty. To fix this, we'll go back to the table in AWS Glue and click "Edit schema". There are a couple of things we'll change to make the table easier to query:
alpha-2 code → alpha2
alpha-3 code → alpha3
numeric code → numeric
latitude (average) → latitude
longitude (average) → longitude
Save these changes. AWS Glue uses LazySimpleSerDe by default; if you have a different format, you can use the "Supported Data Formats and SerDes" section of the AWS Glue documentation to select a different serializer/deserializer for your catalog. As the documentation states, "Use the OpenCSVSerDe for Processing CSV when your data includes quotes in values"; so we will set the serialization library to org.apache.hadoop.hive.serde2.OpenCSVSerde, and add quoteChar with value " and escapeChar with value \ to the SerDe parameters section.
After you are done with these configuration changes, you'll see all of the data in the Results section of the Amazon Athena console when you query the table once more.
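As an alternative to editing the table in the Glue console, the same OpenCSVSerDe configuration could be expressed as an Athena DDL statement and submitted through the CLI. A sketch, assuming the crawler-created country table has been dropped first and using the renamed columns from above:

# Re-create the country table with OpenCSVSerDe and its quote/escape characters set explicitly
# (skip.header.line.count skips the CSV header row).
DDL=$(cat <<'SQL'
CREATE EXTERNAL TABLE covidprod.country (
  country   string,
  alpha2    string,
  alpha3    string,
  `numeric` string,
  latitude  string,
  longitude string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('quoteChar' = '"', 'escapeChar' = '\\')
LOCATION 's3://sufle-athena-bucket/country/'
TBLPROPERTIES ('skip.header.line.count' = '1')
SQL
)

aws athena start-query-execution \
  --query-string "$DDL" \
  --result-configuration OutputLocation=s3://sufle-athena-output-bucket/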
Now you can also join the two tables:
SELECT * FROM "covidprod"."covid" AS cvd JOIN "covidprod"."country" AS ctr
ON cvd.country_code=ctr.alpha2
LIMIT 3;
Amazon QuickSight is a fast, easy-to-use, cloud-powered business analytics service from AWS that enables you to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data, anytime, on any device. You can upload CSV and Excel files; connect to SaaS applications like Salesforce; access on-premises databases like SQL Server, MySQL, and PostgreSQL; and seamlessly discover your AWS data sources such as Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon Athena, and Amazon S3. We will use QuickSight to visualize our data.
On the QuickSight dashboard, let's create a new dataset, select Athena as the data source, and name the dataset something like covid-data. It will bring up the list of databases and tables to select from. Since we want to use both tables, we'll use the custom SQL option and enter the following join query:
SELECT ctr.country, ctr.latitude, ctr.longitude, cvd.date_reported, cvd.new_cases
FROM "covidprod"."covid" AS cvd
JOIN "covidprod"."country" AS ctr
ON cvd.country_code=ctr.alpha2;
Then go back to Datasets, click on the dataset that you created, and select Create analysis to start visualizing. On the sidebar that appears on the left side of your window you'll see different Visual types; we'll select the Points on map icon (the one with an earth image). At the top you'll see that it requires a Geospatial input, so we need to create coordinates from the latitude and longitude values that we have in our data.

Let's go back to Datasets and use Edit to customize our dataset. On the left sidebar you'll see our latitude and longitude fields; if you click the three dots next to one of them, you'll see that the first option is Add to coordinates. Name the coordinates coordinate and select the fields from the drop-down lists.
Go back to the analysis that you created a couple of minutes ago and drag and drop the fields from the Field list into the field wells:
And the final result:
Once a software developer and now an AWS Certified Solutions Architect Professional, Gizem is always eager to take on professional challenges. Her meticulous work is driven by her passion for learning and sharing her knowledge with tech-savvy professionals and communities.