Final project for the Cloud Computing Specialization on Coursera: Analysis of Airline On-Time Performance Data in Hadoop and Spark.
- Launched and configured multiple AWS EC2 instances with appropriate security settings (IAM role, security group, private-key access).
- Installed and deployed Hadoop and Spark on a cluster of AWS EC2 instances using Apache Ambari.
- Designed and implemented Hadoop and Spark solutions to several practical problems in analyzing the Airline On-Time Performance Data (all non-canceled flights between 1988 and 2008) from the BTS (US Bureau of Transportation Statistics).
- Installed and deployed a Cassandra database on multiple nodes and stored the analysis results in the Cassandra cluster.
- Applied system-level optimizations by creating instances with a higher ratio of vCPUs to memory, and application-level optimizations by adjusting the spark.locality.* properties in SparkConf to improve data locality.
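
The locality tuning in the last item can be illustrated with a minimal sketch. It collects a few `spark.locality.wait*` properties in a dictionary and renders them as `spark-submit --conf` arguments. The values, the helper names (`to_spark_submit_args`, `build_command`), and the application file name `analysis.py` are assumptions for illustration only; the project's actual settings are not stated here.

```python
import shlex

# Hypothetical values for illustration; the settings actually used
# in the project are not recorded in this summary.
LOCALITY_SETTINGS = {
    "spark.locality.wait": "6s",       # base wait before falling back to a less-local level
    "spark.locality.wait.node": "6s",  # wait for node-local placement
    "spark.locality.wait.rack": "3s",  # wait for rack-local placement
}


def to_spark_submit_args(settings):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


def build_command(app_path, settings):
    """Assemble a spark-submit command line as a list of arguments."""
    return ["spark-submit", *to_spark_submit_args(settings), app_path]


if __name__ == "__main__":
    # "analysis.py" is a placeholder application name.
    print(shlex.join(build_command("analysis.py", LOCALITY_SETTINGS)))
```

The same key/value pairs could instead be applied in application code by calling `SparkConf.set(key, value)` for each entry. Longer waits give the scheduler more time to place a task next to its data, at the cost of possible idle executors, so suitable values depend on the workload.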