GitHub - fern996/Cyclistic-CaseStudy: This is my capstone project for the Google Data Analytics Professional Certificate. You can find the presentation for this case study below:

A Case Study

Stakeholders:

Lily Moreno: Director of marketing and my manager.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic's mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

The Data Analyst Process:

1.ASK:

Expected Deliverable:

A clear statement of the business task.

Guiding Question:

What is the problem you are trying to solve?
- I am trying to figure out how casual riders differ from annual riders to maximize the number of annual riders. I will need to identify the trends and patterns between the two.
How can your insights drive business decisions?
- By learning how annual members and casual riders use Cyclistic bikes differently, Cyclistic can then developer marketing strategies around the habits of casual riders and demonstrate how they can benefit from an annual pass like annual members.

Deliverable: Business Task:

The business task for this study will be to identify the differences between the trends and patterns of casual and Cyclistic members and then present these findings to the executive team.

2.Prepare:

Note: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. The data has been made available by Motivate International Inc. under this license.

Expected Deliverable:

A description of all data sources used.

Guiding questions:

Where is your data located?
- The data was in a database provided by Motivate International. For the purposes of this study the data is Internal and primary data.
How is the data organized?
- The data is structured and in a wide format.
Are there issues with bias or credibility in this data? Does your data ROCCC?
- Using the ROCCC test:
  - The data is Reliable. It has missing cells for the start_station_name, end_station_name, end_lat and end_lng, but I may not need to use these columns in my final analysis. There are no biases in the database or spelling mistakes. But I will need to clean the data for mistakes, etc.
  - The data is Original: it is first party data collected from the Divvy data logging system.
  - The data is Comprehensive: the data is comprehensive as it contains all that I need to make my analysis.
  - The data is Current: It contains data from the last 12 months i.e., March 2023 to May 2022.
  - The data is Cited: The data source is first party.
How are you addressing licensing, privacy, security, and accessibility?
- The data does not contain any private information i.e., credit card information which could connect the rides to the customer. Also, licensing is covered in https://ride.divvybikes.com/data-license-agreement.
How did you verify the data's integrity?
- By checking data schema, and filtering for blanks via BigQuery.
How does it help you answer your question?
Are there any problems with the data?

Attributes of the data and will I need it:

Ride_id: No nulls found.
Rideable_type: No nulls found.
Started_at: No nulls found.
Ended_at: No nulls found.
Start_station_name: Nulls found.
Start_station_id: Nulls found.
End_station_name: Nulls found.
End_station_id: Nulls found.
Start_lat: No nulls found.
Start_lng: No Nulls found.
End_lat: Nulls found.
End_lng: Nulls found.
Member_casual: No nulls found.

Since I have no direction on what to do with attributes with nulls, I will discard them. I do not need Ride_id as it is autogenerated string when trips are begun and will not be meaningful in my analysis.
Start_lat and start_lng may be discarded as it will not be meaningful.
I can use Rideable_type to compare against member types.
I can use Started_at and Ended_at to find out travel time.
I need to use Member_casual as it contains what type of membership the customer had.

So, the attributes I'll be using are rideable_type, started_at, ended_at and member_casual. Will need to rename member_casual to member_type.

Deliverable: Data Sources Used:

The data source used is public data provided by Motivate International Inc under thislicense. The files from the database used are:

202204-divvy-tripdata.zip
202205-divvy-tripdata.zip
202206-divvy-tripdata.zip
202207-divvy-tripdata.zip
202208-divvy-tripdata.zip
202209-divvy-tripdata.zip
202210-divvy-tripdata.zip
202211-divvy-tripdata.zip
202212-divvy-tripdata.zip
202301-divvy-tripdata.zip
202302-divvy-tripdata.zip
202303-divvy-tripdata.zip

I loaded the files into BigQuery and combined all the tables into one. Using the aggregated table, I was able to select and find the attributes with and without blank cells. I chose the attributes that were relevant to my analyses which were also complete. The resulting data is anonymized i.e., there is no personal information (credit card, personal details, etc) which also means there will be no personal bias. But, the data will have to be further cleaned and transformed to be made reliable and comprehensive.

3.Process:

Expected Deliverable:

Documentation of any cleaning or manipulation of data.

Guiding Questions:

What tools are you choosing and why?
- For cleaning I'll be using Excel. It's a lot more flexible in terms of calculations
Have you ensured your data's integrity?
- Data replication: I've had no issues with data replication. The number of rows of all csv files have been the same.
- Data Transfer: I was having issues with exporting files to bigquery due to the data type of a column being set to FLOAT instead of TIME. I re-did my data cleaning procedures and was able to export successfully.
- Data Manipulation: Had to clean data while calculating ride_length. Turns out, some of the started_at and ended_at times were switched, i.e. trips had end times before their start times. This would give an invalid result during calculations. Most of my cleaning efforts were aimed at fixing this.
How can you verify that your data is clean and ready to analyze?
- Exported all files to big query and combined them into one table. I selected the columns that I will be using for analysis plus the newly created ones and created a new table for it. I have checked the

Deliverables:

Steps followed in excel:

Unzip the files.
Opened each file and saved as .xlsx since I chose to work with excel.
Then opened each file and added columns "ride_length" and "day_of_week".
For ride_length column, I found difference between ended_at and started_at to give length of ride. I used format cells to change format to h:mm:ss.
For day_of_week column, I used the weekday function to get the day of the week the ride started on.
I then sorted or filtered ride_length column to look for "######". This indicates an issue in the started_at and ended_at fields where the start time is later than the end time.
To change more than one row, I sorted ride_length in ascending order, then switched the started_at and ended_at with each other.
Repeated steps 6 to 7 for the rest of the files.

Note: The ride_length field became a means to find dirty data in the started_at and ended_at fileds. But the ride_length fields has a limitation where the data format of time (hh:mm:ss) can only accurately represent ride lengths less than 24 hours. For e.g., ride_id: 7D4CB0DD5137CA9A rented a bike for nearly the whole month of October in 2022. But the ride_length field only shows a time of 17:47:15.

For this reason, I uploaded the data to BigQuery and calculated ride length using the TIMESTAMP_DIFF to output the time difference in minutes. Using the above example, its ride length is now accurately shown to be 41387 minutes or 28.7 days.

Because the collated table would be too large for a spreadsheet tool like excel or sheets, naturally SQL or R became the options for handling this large data set. I chose SQL and the following steps detail the process I used to import, create, and transform the collated table in BigQuery.

Full Steps followed in BigQuery:

Imported all data sets to BigQuery then merged all datasets into one table.

Created new filtered table with only the fields that were relevant to my analysis, renamed member_casual to member_type, changed day of week to type string and calculated time difference using TIMESTAMP_DIFF function.

Updated the new table day of the week numbers to actual days of the week.

Removed 18 rows of null values found in the table.

To extract Month from the table I used

Because the extracted value is in INT64 it will have to be casted as string before I can update the number values to literal months. This will involve creating a new table with the casted data type and then replacing the original table with the casted table.

Copying back the table with the modified string back onto the original YearData table.

Changing numeric month to literal month.

To represent yearly quarters, I created a new column of type string and I used this set of queries to update the new column.

At this stage, the number of rows that I had were 5,803,720. Of these, 633,307 entries (10.7%) were rides that were 3 minutes or less. I decided to filter these rides out as customers could have made mistakes or they could have canceled the ride.
The final table has 5,170,413 rows of data.

Schema of the final columns used:

Column name	Description
ride_id	To reference ride_id with source data
member_type	Type of customer membership
rideable_type	Type of bike used
started_at	Date and time of trip start
ended_at	Date and time of trip end
day_of_week	Day trip started
ride_length	Excel calculations for ride length
ride_length_mins	BigQuery calculations for ride length
month	Month of the year, extracted from started_at
year_quarter	Year and quarter based on the month

4.Analyze:

Expected Deliverable:

A summary of the analysis

Guiding questions:

How should you organize your data to perform analysis on it?
- Ill be importing my data into BigQuery and compiling all the datasets into one table. Ill be then using aggregate functions to create summary tables, or filter the table for a column to answer specific questions.
Has your data been properly formatted?
- Yes, I went through each table and made sure that the data type is of the proper formats and that the table dimensions for all the tables are the same.
What surprises did you discover in the data?
- That casual customers may have longer ride lengths, maybe up to a month. We may want to track such behaviours in real time to maybe persuade them to switch over to annual memberships.
What trends or relationships did you find in the data?
- Members are more consistent with their usage during day hours and during the week. They are also more likely to use the service during colder months, where casual members sharply drop off in usage.
How will these insights help answer your business questions?
- These insights highlight how both member types use the service differently. Hence it does allow me to answer the business question.

Deliverable: Analysis:

For the purposes of the analysis, I will be referring to our annual customers as members and our casual customers as casuals.

First, I wanted to find the distribution of rides between both member types. I used the following query,

And got the following result,

Row	member_type	percentage_of_rides	count_of_rides
1	casual	41.86	2164550
2	member	58.14	3005863

We can see that for 2022 Q2-2023 Q1, annual members account for a bit more than half of the total trips. Annual members have 16.28% greater usage than casual members.

Next, I wanted to look at average ride_length and frequency of rides but at a per quarter level. To do that I first created table showing then number of rides between member types for each quarter.

Then I created a second table with average ride length between member types for each quarter.

Finally, I created the summary table showing number of rides and average ride length.

year_quarter	member_type	avg_ride_length_mins	count_of_rides	percentage_num_rides	total_rides_per_quarter
22 Q2	casual	33.08	730187	45.45	1606634
22 Q2	member	14.71	876447	54.55	1606634
22 Q3	casual	30.95	988103	47.16	2095416
22 Q3	member	14.82	1107313	52.84	2095416
22 Q4	casual	26.9	319351	34.15	935093
22 Q4	member	13.06	615742	65.85	935093
23 Q1	casual	25.28	126909	23.8	533270
23 Q1	member	12.32	406361	76.2	533270

This table gives us more insights into how both member types use the service. By looking at the avg_ride_length_mins column, during Q2 and Q3, we find that casual member on average have a ride length of more than half of the ride length of annual members. This figure drops to just about half more than annual members.

When it comes to frequency of rides taken, members are our more common riders throughout the year with them consistently representing more than half of the share of rides. Casuals make up around 46% of the number or rides in Q2 and Q3 (Spring and Summer), with their usage lowering during Q4 and 2023 Q1 (Fall and Winter).

We see this trend even with the total number of rides, Usage peaks at summer and bottoms out during the winter.

Created a table showing quarterly frequency of rides for each day in the week.

I'll use visuals for large tables as they are easier to analyze and explain.

I chose to represent the above query as the bar graph above for brevity's sake. The orange are members and blue are casuals, and they follow different trends as the week goes on throughout the year. Casuals peak during the weekend (Saturday, Sunday) and trend downwards during the middle of the week (Tuesday, Wednesday). Members peak during the middle of the week (Wednesday and Thursday) and they don't have as steep a trend downwards like the casuals.

I also wanted to look at the start time between both members, so I created another table from my main table.

And then, I selected the year_quarterly, month, member_type and start_time fields from the new table.

In all four seasons, members use more of the service throughout the day. Casuals usage rises steadily throughout the day to peak at around five in the evening.

I then looked at bike type usage between members for the year. This is the query used on. There are differences between the ride choices of casuals and members. I see that casuals are the only ones who used the dock bikes for the year 2022-2023. I see that they prefer electric bikes over classic bikes in any given quarter.

Members change their usage of bike types depending on the seasons. In Q2 and Q3 (around spring or summer), members prefer to use classic bikes over electrics and in Q4 and Q1, they prefer to use electric bikes over classic bikes. Members have not used docked bikes in any of the four quarters for the year. The below pie chart does a better job of highlighting the differences between ride types for each member type.

I also wanted to look at the maximum ride length that taken by each member. This represents how long a customer in any member type is willing to use the service for.

This is the SQL query used:

Row	year_quarter	member_type	max_ride_length_hrs
1	22 Q2	casual	604.3
2	22 Q2	member	25.0
3	22 Q3	casual	570.15
4	22 Q3	member	172.55
5	22 Q4	casual	689.78
6	22 Q4	member	25.0
7	23 Q1	casual	560.07
8	23 Q1	member	26.0

Here the max length is represented in hours. The longest ride was for 689.78 hours or around 28 days from a casual member. In all four quarters, casual customers have long rides of over 500 hours. Meanwhile members have only around 25-hour long rides in three of the quarters, but there was one ride in Q3 that was for 172 hours.

5.Share:

Expected Deliverable:

Supporting visuals and key findings.

Guiding questions:

Were you able to answer the question of how annual members and casual riders use Cyclistic bikes differently?
- Yes, I was able to answer how members and casuals use the service differently, based on seasonal, time, ride length and day of the week.
What story does your data tell?
Who is your audience? What is the best way to communicate with them?
- The key audience here is the executive team, as they will be deciding on whether the marketing strategy moves ahead.
Can data visualization help you share your findings?
- Data visualization will definitely help me share my findings, as there is just to much information and not enough time to explain the whole thing in text or table.

Deliverable: Visualizations and Key Findings:

Presentation Material can be found here:

Tableau Story

Supporting visualizations:

Key Findings:

Both member types differ in how they use the Cyclistic's services:

Members take more rides than Casuals in a given time period.
Overall ridership drops during the colder seasons like fall and winter (Q4 & Q1)
Members prefer riding during weekdays. Conversely, Casuals prefer riding during weekends.
During a given day, both member types peak at around 7pm. But Members ride more often throughout the rest of the day.
On average Casuals have longer trips than members.
Also, Casuals have done some of the longest trips that can last several days or weeks.
When it comes to ride preference, Casuals can prefer electric cycles in any given quarter. Members are more balanced in their preferences but have not used docked bikes in the last 12 months.

6.Act:

#### Guiding questions:

How should you organize your data to perform analysis on it?
- Ill be importing my data into BigQuery and compiling all the datasets into one table. Ill be then using aggregate functions to create summary tables, or filter the table for a column to answer specific questions.
Has your data been properly formatted?
- Yes, I went through each table and made sure that the data type is of the proper formats and that the table dimensions for all the tables are the same.
What surprises did you discover in the data?
- That casual customers may have longer ride lengths, maybe up to a month. We may want to track such behaviours in real time to maybe persuade them to switch over to annual memberships.
What trends or relationships did you find in the data?
- Members are more consistent with their usage during day hours and during the week. They are also more likely to use the service during colder months, where casual members sharply drop off in usage.
How will these insights help answer your business questions?
- These insights highlight how both member types use the service differently. Hence it does allow me to answer the business question.Deliverable:

Your top three recommendations based on your analysis.

Guiding questions:

What is your final conclusion based on your analysis?
- Both member types do use the service differently. Based on start time during a given day, members have successfully integrated using the service in their everyday life, like traveling to and from work, school, etc. Casuals seem to mainly use the service for leisure or will take very long trips.
How could your team and business apply your insights?
- They could look at the differences and craft advertisement, pricing changes, pricing tiers to suit the different needs of the casuals.
What next steps would you or your stakeholders take based on your findings?
- They should keep the findings in mind as they go about answering the other business questions.
Is there additional data you could use to expand on your findings?
- I could use the starting and ending stations. I did not use it here because of its incomplete nature. But it would have given me locations based usage data between both member groups.

Deliverable: Recommendations:

Top Three Recommendations:

Advertise to Casuals the benefit of using Cyclistic long term as a mode of transport for everyday activities and not as a leisure activity.
Keep a track of Casuals who are renewing trips for multiple days and advertise to them the cost-savings of having an annual plan for longer trips.
Begin ad campaign that explains the benefits of cold cycling:
- Deals with holiday weight gain.
- Counteracts Seasonal Affective Disorder.
- Spending less time indoors, means less time being in contact with others, means less time getting sick.
- Saves gas, money, and the environment.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
Assets		Assets
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A Case Study

Stakeholders:

The Data Analyst Process:

1.ASK:

Expected Deliverable:

Guiding Question:

Deliverable: Business Task:

2.Prepare:

Expected Deliverable:

Guiding questions:

Deliverable: Data Sources Used:

3.Process:

Expected Deliverable:

Guiding Questions:

Deliverables:

Steps followed in excel:

Full Steps followed in BigQuery:

4.Analyze:

Expected Deliverable:

Guiding questions:

Deliverable: Analysis:

5.Share:

Expected Deliverable:

Guiding questions:

What story does your data tell?

Deliverable: Visualizations and Key Findings:

Presentation Material can be found here:

Supporting visualizations:

Key Findings:

6.Act:

#### Guiding questions:

Guiding questions:

Deliverable: Recommendations:

About

Releases

Packages

fern996/Cyclistic-CaseStudy

Folders and files

Latest commit

History

Repository files navigation

A Case Study

Stakeholders:

The Data Analyst Process:

1.ASK:

Expected Deliverable:

Guiding Question:

Deliverable: Business Task:

2.Prepare:

Expected Deliverable:

Guiding questions:

Deliverable: Data Sources Used:

3.Process:

Expected Deliverable:

Guiding Questions:

Deliverables:

Steps followed in excel:

Full Steps followed in BigQuery:

4.Analyze:

Expected Deliverable:

Guiding questions:

Deliverable: Analysis:

5.Share:

Expected Deliverable:

Guiding questions:

What story does your data tell?

Deliverable: Visualizations and Key Findings:

Presentation Material can be found here:

Supporting visualizations:

Key Findings:

6.Act:

#### Guiding questions:

Guiding questions:

Deliverable: Recommendations:

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages