What is Apache Superset?
"Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application" (Apache Software Foundation).
Comparable tools you might have heard of are Tableau and Power BI, but both are commercially licensed software, whereas Superset is open source.
What about Amazon S3 and Athena?
S3: "Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance." (Amazon Web Services)
Athena: "Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run." (Amazon Web Services)
What You'll Need Beforehand
- An AWS account (and cash, duh).
- AWS credentials configured.
- An Ubuntu 18.04+ environment.
- A Mapbox account.
- pip installed.
Installation
PyAthena
- Apache Superset needs a database driver to talk to AWS Athena. PyAthena provides one.
pip install "PyAthena>1.2.0"
Apache Superset
- Install Superset
pip install apache-superset
- Initialize the database
superset db upgrade
- Create an admin user (you will be prompted for a username, first and last name, and email before setting a password)
export FLASK_APP=superset
superset fab create-admin
- Load some data to play with
superset load_examples
- Create default roles and permissions
superset init
Workflow
- Start a development web server on port 8088 (change the value after -p to bind to another port)
superset run -p 8088 --with-threads --reload --debugger
- Switch to your browser and go to http://127.0.0.1:8088/. You should now see something resembling the following
- Log in with the admin account you have just created. You'll see that some examples have been loaded if you followed the tutorial. Play with them if you want to, but we'll be using some other data for demonstration purposes.
- Upload some data to an AWS S3 bucket to work with. I'll be using this Airbnb data from Kaggle.
aws s3 cp ~/PATH/TO/AB_NYC_2019.csv s3://YOUR-BUCKET
- Now, go back to the Apache Superset UI and click on Databases, then the + button in the top right-hand corner.
- Fill in the placeholders in the following text and add it to SQL Alchemy URI.
awsathena+rest://{aws_access_key_id}:{aws_secret_access_key}@athena.{region_name}.amazonaws.com/{schema_name}?s3_staging_dir={s3_staging_dir}
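AWS secret keys often contain characters such as / or +, and the staging directory is itself a URL, so pasting them into the template unescaped can produce a URI that fails to parse. A minimal sketch of filling in the template with URL-encoding, assuming the same URI layout as above; the function name and the sample values are made up for illustration:

```python
from urllib.parse import quote_plus


def build_athena_uri(access_key_id, secret_access_key, region_name,
                     schema_name, s3_staging_dir):
    """Fill in the awsathena+rest template, URL-encoding the unsafe parts."""
    return (
        "awsathena+rest://{key}:{secret}@athena.{region}.amazonaws.com/"
        "{schema}?s3_staging_dir={staging}"
    ).format(
        key=quote_plus(access_key_id),
        secret=quote_plus(secret_access_key),
        region=region_name,
        schema=schema_name,
        staging=quote_plus(s3_staging_dir),
    )


if __name__ == "__main__":
    # Placeholder credentials and bucket; substitute your own values.
    print(build_athena_uri(
        "AKIAEXAMPLE",
        "abc/def+ghi",
        "us-east-1",
        "default",
        "s3://YOUR-BUCKET/athena-results/",
    ))
```

The printed string is what goes into the SQL Alchemy URI field.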
- Log into the AWS Athena console and define a table with the columns you need. For simplicity, I won't be using all the columns. (Make sure your AWS Athena region and your S3 bucket's region are the same.) If you're familiar enough with AWS Athena, you can run the same query from Apache Superset's UI instead.
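For reference, here is a rough sketch of what such a table definition could look like. It is not the exact query used in this post. The column list is my reading of the AB_NYC_2019.csv header, so check it against your copy. The table name, the schema name, and the choice of OpenCSVSerde (so that quoted fields containing commas parse) are assumptions as well. Because CSV columns are matched by position, the sketch declares every column as a string, and you select and cast the ones you need later.

```python
# Assumed header of AB_NYC_2019.csv; verify against your own file.
COLUMNS = [
    "id", "name", "host_id", "host_name", "neighbourhood_group",
    "neighbourhood", "latitude", "longitude", "room_type", "price",
    "minimum_nights", "number_of_reviews", "last_review",
    "reviews_per_month", "calculated_host_listings_count",
    "availability_365",
]


def build_create_table_sql(schema_name, table_name, s3_location):
    """Return an Athena DDL statement for a CSV-backed external table."""
    cols = ",\n".join(f"  `{c}` STRING" for c in COLUMNS)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {schema_name}.{table_name} (\n"
        f"{cols}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'\n"
        f"LOCATION '{s3_location}'\n"
        # Skip the header row so it is not read as data.
        "TBLPROPERTIES ('skip.header.line.count' = '1')"
    )


if __name__ == "__main__":
    # Hypothetical schema and table names; paste the output into Athena.
    print(build_create_table_sql("default", "airbnb_nyc_2019",
                                 "s3://YOUR-BUCKET/"))
```

Athena reads every file under the table's LOCATION, so it is worth keeping the query-results staging directory outside that prefix. A query such as SELECT CAST(latitude AS DOUBLE), CAST(longitude AS DOUBLE), CAST(price AS DOUBLE) would then give numeric values to plot.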
- Going back to your Apache Superset UI, you should see the following
- Run a query of your preference and click on Explore.
- Running a deck.gl visualization gives us ...
- Oops, seems like we need a map token from Mapbox. Register an account and export your token as the MAPBOX_API_KEY environment variable. See the official documentation.
export MAPBOX_API_KEY=your-token
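The export only affects the shell it is run in, so it has to happen in the same session that launches the server. If you would rather keep the setting in a file, Superset also picks up overrides from a superset_config.py module on your PYTHONPATH. A minimal sketch, which mirrors how the default configuration reads the variable as far as I can tell:

```python
# superset_config.py (must be importable, e.g. placed on PYTHONPATH)
import os

# Read the token from the environment; fall back to an empty string
# so that importing this module never fails when the variable is unset.
MAPBOX_API_KEY = os.environ.get("MAPBOX_API_KEY", "")
```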
- Restart your server and you should now see ...
Darker grid cells are where the average Airbnb price is higher.
Here's a place where some of the more expensive Airbnb rooms are clustered, and the reasons might be apparent.
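To make the "average price per grid cell" idea concrete, here is a simplified sketch of that kind of aggregation in plain Python. It bins points into fixed-size latitude/longitude cells, which is an assumption for illustration and not how deck.gl sizes its cells. The sample rows are made up.

```python
import math
from collections import defaultdict


def average_price_by_cell(rows, cell_size_deg=0.01):
    """Average price per grid cell.

    rows: iterable of (latitude, longitude, price) tuples.
    Returns a dict mapping (lat_index, lon_index) to the mean price.
    """
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [price sum, count]
    for lat, lon, price in rows:
        cell = (math.floor(lat / cell_size_deg),
                math.floor(lon / cell_size_deg))
        totals[cell][0] += price
        totals[cell][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}


if __name__ == "__main__":
    # Made-up sample listings: two share a cell, one sits elsewhere.
    sample = [
        (40.751, -73.981, 100),
        (40.752, -73.982, 300),
        (40.801, -73.951, 50),
    ]
    for cell, avg in sorted(average_price_by_cell(sample).items()):
        print(cell, avg)
```

A cell with a higher average would be drawn darker on the map.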
Conclusions
There is a lot left to talk about with Apache Superset, AWS S3, and AWS Athena, but the general idea here is to demonstrate a data analysis workflow that combines several tools. Of course, you can achieve the same thing without any of the above, for instance with a combination of Tableau and Google BigQuery.
Reference
- Apache Software Foundation, "Apache Superset (incubating)", Apache Software Foundation. https://superset.incubator.apache.org/#apache-superset-incubating. 22 August 2020.
- Amazon Web Services, "Amazon S3", Amazon Web Services. https://aws.amazon.com/s3/. 22 August 2020.
- Amazon Web Services, "Amazon Athena", Amazon Web Services. https://aws.amazon.com/athena/. 22 August 2020.