Skip to content

This is my capstone project for the Google Data Analytics Professional Certificate. You can find the presentation for this case study below:

Notifications You must be signed in to change notification settings

fern996/Cyclistic-CaseStudy

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 

Repository files navigation

How does a bike-share navigate speedy success?

A Case Study

Stakeholders:

  • Lily Moreno: Director of marketing and my manager.
  • Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic's mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
  • Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

The Data Analyst Process:

1.ASK:

Expected Deliverable:

A clear statement of the business task.

Guiding Question:

  • What is the problem you are trying to solve?

    • I am trying to figure out how casual riders differ from annual riders to maximize the number of annual riders. I will need to identify the trends and patterns between the two.
  • How can your insights drive business decisions?

    • By learning how annual members and casual riders use Cyclistic bikes differently, Cyclistic can then developer marketing strategies around the habits of casual riders and demonstrate how they can benefit from an annual pass like annual members.

Deliverable: Business Task:

The business task for this study will be to identify the differences between the trends and patterns of casual and Cyclistic members and then present these findings to the executive team.

2.Prepare:

Note: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. The data has been made available by Motivate International Inc. under this license.

Expected Deliverable:

A description of all data sources used.

Guiding questions:

  • Where is your data located?
    • The data was in a database provided by Motivate International. For the purposes of this study the data is Internal and primary data.
  • How is the data organized?
    • The data is structured and in a wide format.
  • Are there issues with bias or credibility in this data? Does your data ROCCC?
    • Using the ROCCC test:
      • The data is Reliable. It has missing cells for the start_station_name, end_station_name, end_lat and end_lng, but I may not need to use these columns in my final analysis. There are no biases in the database or spelling mistakes. But I will need to clean the data for mistakes, etc.
      • The data is Original: it is first party data collected from the Divvy data logging system.
      • The data is Comprehensive: the data is comprehensive as it contains all that I need to make my analysis.
      • The data is Current: It contains data from the last 12 months i.e., March 2023 to May 2022.
      • The data is Cited: The data source is first party.
  • How are you addressing licensing, privacy, security, and accessibility?
  • How did you verify the data's integrity?
    • By checking data schema, and filtering for blanks via BigQuery.
  • How does it help you answer your question?
  • Are there any problems with the data?

Attributes of the data and will I need it:

  1. Ride_id: No nulls found.
  2. Rideable_type: No nulls found.
  3. Started_at: No nulls found.
  4. Ended_at: No nulls found.
  5. Start_station_name: Nulls found.
  6. Start_station_id: Nulls found.
  7. End_station_name: Nulls found.
  8. End_station_id: Nulls found.
  9. Start_lat: No nulls found.
  10. Start_lng: No Nulls found.
  11. End_lat: Nulls found.
  12. End_lng: Nulls found.
  13. Member_casual: No nulls found.
  • Since I have no direction on what to do with attributes with nulls, I will discard them. I do not need Ride_id as it is autogenerated string when trips are begun and will not be meaningful in my analysis.

  • Start_lat and start_lng may be discarded as it will not be meaningful.

  • I can use Rideable_type to compare against member types.

  • I can use Started_at and Ended_at to find out travel time.

  • I need to use Member_casual as it contains what type of membership the customer had.

So, the attributes I'll be using are rideable_type, started_at, ended_at and member_casual. Will need to rename member_casual to member_type.

Deliverable: Data Sources Used:

The data source used is public data provided by Motivate International Inc under thislicense. The files from the database used are:

  1. 202204-divvy-tripdata.zip
  2. 202205-divvy-tripdata.zip
  3. 202206-divvy-tripdata.zip
  4. 202207-divvy-tripdata.zip
  5. 202208-divvy-tripdata.zip
  6. 202209-divvy-tripdata.zip
  7. 202210-divvy-tripdata.zip
  8. 202211-divvy-tripdata.zip
  9. 202212-divvy-tripdata.zip
  10. 202301-divvy-tripdata.zip
  11. 202302-divvy-tripdata.zip
  12. 202303-divvy-tripdata.zip

I loaded the files into BigQuery and combined all the tables into one. Using the aggregated table, I was able to select and find the attributes with and without blank cells. I chose the attributes that were relevant to my analyses which were also complete. The resulting data is anonymized i.e., there is no personal information (credit card, personal details, etc) which also means there will be no personal bias. But, the data will have to be further cleaned and transformed to be made reliable and comprehensive.

3.Process:

Expected Deliverable:

Documentation of any cleaning or manipulation of data.

Guiding Questions:

  • What tools are you choosing and why?
    • For cleaning I'll be using Excel. It's a lot more flexible in terms of calculations
  • Have you ensured your data's integrity?
    • Data replication: I've had no issues with data replication. The number of rows of all csv files have been the same.
    • Data Transfer: I was having issues with exporting files to bigquery due to the data type of a column being set to FLOAT instead of TIME. I re-did my data cleaning procedures and was able to export successfully.
    • Data Manipulation: Had to clean data while calculating ride_length. Turns out, some of the started_at and ended_at times were switched, i.e. trips had end times before their start times. This would give an invalid result during calculations. Most of my cleaning efforts were aimed at fixing this.
  • How can you verify that your data is clean and ready to analyze?
    • Exported all files to big query and combined them into one table. I selected the columns that I will be using for analysis plus the newly created ones and created a new table for it. I have checked the

Deliverables:

Steps followed in excel:

  1. Unzip the files.
  2. Opened each file and saved as .xlsx since I chose to work with excel.
  3. Then opened each file and added columns "ride_length" and "day_of_week".
  4. For ride_length column, I found difference between ended_at and started_at to give length of ride. I used format cells to change format to h:mm:ss.
  5. For day_of_week column, I used the weekday function to get the day of the week the ride started on.
  6. I then sorted or filtered ride_length column to look for "######". This indicates an issue in the started_at and ended_at fields where the start time is later than the end time.
  7. To change more than one row, I sorted ride_length in ascending order, then switched the started_at and ended_at with each other.
  8. Repeated steps 6 to 7 for the rest of the files.

Note: The ride_length field became a means to find dirty data in the started_at and ended_at fileds. But the ride_length fields has a limitation where the data format of time (hh:mm:ss) can only accurately represent ride lengths less than 24 hours. For e.g., ride_id: 7D4CB0DD5137CA9A rented a bike for nearly the whole month of October in 2022. But the ride_length field only shows a time of 17:47:15.

For this reason, I uploaded the data to BigQuery and calculated ride length using the TIMESTAMP_DIFF to output the time difference in minutes. Using the above example, its ride length is now accurately shown to be 41387 minutes or 28.7 days.

Because the collated table would be too large for a spreadsheet tool like excel or sheets, naturally SQL or R became the options for handling this large data set. I chose SQL and the following steps detail the process I used to import, create, and transform the collated table in BigQuery.

Full Steps followed in BigQuery:

  1. Imported all data sets to BigQuery then merged all datasets into one table.

image

  1. Created new filtered table with only the fields that were relevant to my analysis, renamed member_casual to member_type, changed day of week to type string and calculated time difference using TIMESTAMP_DIFF function.

image

  1. Updated the new table day of the week numbers to actual days of the week.

image

  1. Removed 18 rows of null values found in the table.

image

  1. To extract Month from the table I used

image

  1. Because the extracted value is in INT64 it will have to be casted as string before I can update the number values to literal months. This will involve creating a new table with the casted data type and then replacing the original table with the casted table.

image

  1. Copying back the table with the modified string back onto the original YearData table.

image

  1. Changing numeric month to literal month.

image

  1. To represent yearly quarters, I created a new column of type string and I used this set of queries to update the new column.

image

  1. At this stage, the number of rows that I had were 5,803,720. Of these, 633,307 entries (10.7%) were rides that were 3 minutes or less. I decided to filter these rides out as customers could have made mistakes or they could have canceled the ride.

  2. The final table has 5,170,413 rows of data.

Schema of the final columns used:

Column name Description
ride_id To reference ride_id with source data
member_type Type of customer membership
rideable_type Type of bike used
started_at Date and time of trip start
ended_at Date and time of trip end
day_of_week Day trip started
ride_length Excel calculations for ride length
ride_length_mins BigQuery calculations for ride length
month Month of the year, extracted from started_at
year_quarter Year and quarter based on the month

4.Analyze:

Expected Deliverable:

A summary of the analysis

Guiding questions:

  • How should you organize your data to perform analysis on it?
    • Ill be importing my data into BigQuery and compiling all the datasets into one table. Ill be then using aggregate functions to create summary tables, or filter the table for a column to answer specific questions.
  • Has your data been properly formatted?
    • Yes, I went through each table and made sure that the data type is of the proper formats and that the table dimensions for all the tables are the same.
  • What surprises did you discover in the data?
    • That casual customers may have longer ride lengths, maybe up to a month. We may want to track such behaviours in real time to maybe persuade them to switch over to annual memberships.
  • What trends or relationships did you find in the data?
    • Members are more consistent with their usage during day hours and during the week. They are also more likely to use the service during colder months, where casual members sharply drop off in usage.
  • How will these insights help answer your business questions?
    • These insights highlight how both member types use the service differently. Hence it does allow me to answer the business question.

Deliverable: Analysis:

For the purposes of the analysis, I will be referring to our annual customers as members and our casual customers as casuals.

First, I wanted to find the distribution of rides between both member types. I used the following query,

image

And got the following result,

Row member_type percentage_of_rides count_of_rides
1 casual 41.86 2164550
2 member 58.14 3005863

We can see that for 2022 Q2-2023 Q1, annual members account for a bit more than half of the total trips. Annual members have 16.28% greater usage than casual members.

Next, I wanted to look at average ride_length and frequency of rides but at a per quarter level. To do that I first created table showing then number of rides between member types for each quarter.

image

Then I created a second table with average ride length between member types for each quarter.

image

Finally, I created the summary table showing number of rides and average ride length.

image

year_quarter member_type avg_ride_length_mins count_of_rides percentage_num_rides total_rides_per_quarter
22 Q2 casual 33.08 730187 45.45 1606634
22 Q2 member 14.71 876447 54.55 1606634
22 Q3 casual 30.95 988103 47.16 2095416
22 Q3 member 14.82 1107313 52.84 2095416
22 Q4 casual 26.9 319351 34.15 935093
22 Q4 member 13.06 615742 65.85 935093
23 Q1 casual 25.28 126909 23.8 533270
23 Q1 member 12.32 406361 76.2 533270

This table gives us more insights into how both member types use the service. By looking at the avg_ride_length_mins column, during Q2 and Q3, we find that casual member on average have a ride length of more than half of the ride length of annual members. This figure drops to just about half more than annual members.

When it comes to frequency of rides taken, members are our more common riders throughout the year with them consistently representing more than half of the share of rides. Casuals make up around 46% of the number or rides in Q2 and Q3 (Spring and Summer), with their usage lowering during Q4 and 2023 Q1 (Fall and Winter).

We see this trend even with the total number of rides, Usage peaks at summer and bottoms out during the winter.

Created a table showing quarterly frequency of rides for each day in the week.

image

I'll use visuals for large tables as they are easier to analyze and explain.

image

I chose to represent the above query as the bar graph above for brevity's sake. The orange are members and blue are casuals, and they follow different trends as the week goes on throughout the year. Casuals peak during the weekend (Saturday, Sunday) and trend downwards during the middle of the week (Tuesday, Wednesday). Members peak during the middle of the week (Wednesday and Thursday) and they don't have as steep a trend downwards like the casuals.

I also wanted to look at the start time between both members, so I created another table from my main table.

image

And then, I selected the year_quarterly, month, member_type and start_time fields from the new table.

image

image In all four seasons, members use more of the service throughout the day. Casuals usage rises steadily throughout the day to peak at around five in the evening.

I then looked at bike type usage between members for the year. This is the query used on. image image There are differences between the ride choices of casuals and members. I see that casuals are the only ones who used the dock bikes for the year 2022-2023. I see that they prefer electric bikes over classic bikes in any given quarter.

Members change their usage of bike types depending on the seasons. In Q2 and Q3 (around spring or summer), members prefer to use classic bikes over electrics and in Q4 and Q1, they prefer to use electric bikes over classic bikes. Members have not used docked bikes in any of the four quarters for the year. The below pie chart does a better job of highlighting the differences between ride types for each member type.

image

I also wanted to look at the maximum ride length that taken by each member. This represents how long a customer in any member type is willing to use the service for.

This is the SQL query used:

image

Row year_quarter member_type max_ride_length_hrs
1 22 Q2 casual 604.3
2 22 Q2 member 25.0
3 22 Q3 casual 570.15
4 22 Q3 member 172.55
5 22 Q4 casual 689.78
6 22 Q4 member 25.0
7 23 Q1 casual 560.07
8 23 Q1 member 26.0

Here the max length is represented in hours. The longest ride was for 689.78 hours or around 28 days from a casual member. In all four quarters, casual customers have long rides of over 500 hours. Meanwhile members have only around 25-hour long rides in three of the quarters, but there was one ride in Q3 that was for 172 hours.

5.Share:

Expected Deliverable:

Supporting visuals and key findings.

Guiding questions:

  • Were you able to answer the question of how annual members and casual riders use Cyclistic bikes differently?
    • Yes, I was able to answer how members and casuals use the service differently, based on seasonal, time, ride length and day of the week.
  • What story does your data tell?

  • Who is your audience? What is the best way to communicate with them?
    • The key audience here is the executive team, as they will be deciding on whether the marketing strategy moves ahead.
  • Can data visualization help you share your findings?
    • Data visualization will definitely help me share my findings, as there is just to much information and not enough time to explain the whole thing in text or table.

Deliverable: Visualizations and Key Findings:

Presentation Material can be found here:

Tableau Story

Supporting visualizations:

image

image

image

image

image

image

Key Findings:

Both member types differ in how they use the Cyclistic's services:

  1. Members take more rides than Casuals in a given time period.

  2. Overall ridership drops during the colder seasons like fall and winter (Q4 & Q1)

  3. Members prefer riding during weekdays. Conversely, Casuals prefer riding during weekends.

  4. During a given day, both member types peak at around 7pm. But Members ride more often throughout the rest of the day.

  5. On average Casuals have longer trips than members.

  6. Also, Casuals have done some of the longest trips that can last several days or weeks.

  7. When it comes to ride preference, Casuals can prefer electric cycles in any given quarter. Members are more balanced in their preferences but have not used docked bikes in the last 12 months.

6.Act:

#### Guiding questions:

  • How should you organize your data to perform analysis on it?
    • Ill be importing my data into BigQuery and compiling all the datasets into one table. Ill be then using aggregate functions to create summary tables, or filter the table for a column to answer specific questions.
  • Has your data been properly formatted?
    • Yes, I went through each table and made sure that the data type is of the proper formats and that the table dimensions for all the tables are the same.
  • What surprises did you discover in the data?
    • That casual customers may have longer ride lengths, maybe up to a month. We may want to track such behaviours in real time to maybe persuade them to switch over to annual memberships.
  • What trends or relationships did you find in the data?
    • Members are more consistent with their usage during day hours and during the week. They are also more likely to use the service during colder months, where casual members sharply drop off in usage.
  • How will these insights help answer your business questions?
    • These insights highlight how both member types use the service differently. Hence it does allow me to answer the business question.Deliverable:

Your top three recommendations based on your analysis.

Guiding questions:

  • What is your final conclusion based on your analysis?
    • Both member types do use the service differently. Based on start time during a given day, members have successfully integrated using the service in their everyday life, like traveling to and from work, school, etc. Casuals seem to mainly use the service for leisure or will take very long trips.
  • How could your team and business apply your insights?
    • They could look at the differences and craft advertisement, pricing changes, pricing tiers to suit the different needs of the casuals.
  • What next steps would you or your stakeholders take based on your findings?
    • They should keep the findings in mind as they go about answering the other business questions.
  • Is there additional data you could use to expand on your findings?
    • I could use the starting and ending stations. I did not use it here because of its incomplete nature. But it would have given me locations based usage data between both member groups.

Deliverable: Recommendations:

Top Three Recommendations:

  • Advertise to Casuals the benefit of using Cyclistic long term as a mode of transport for everyday activities and not as a leisure activity.
  • Keep a track of Casuals who are renewing trips for multiple days and advertise to them the cost-savings of having an annual plan for longer trips.
  • Begin ad campaign that explains the benefits of cold cycling:
    • Deals with holiday weight gain.
    • Counteracts Seasonal Affective Disorder.
    • Spending less time indoors, means less time being in contact with others, means less time getting sick.
    • Saves gas, money, and the environment.

Releases

No releases published

Packages

No packages published