Data Analyst Portfolio Project | Correlation in Python | Project 4/4

Published 2021-06-22

Today we continue our Data Analyst Portfolio Project Series. In this project we will be working in Python to find correlations between variables.

Please remember to save this project and add it to your GitHub once you are done!

LINKS:
Project Dataset: www.kaggle.com/danielgrijalvas/movies

Python IDE: www.anaconda.com/products/individual

Link to Python Code: github.com/AlexTheAnalyst/PortfolioProjects/blob/m…

____________________________________________

SUBSCRIBE!
Do you want to become a Data Analyst? That's what this channel is all about! My goal is to help you learn everything you need in order to start your career or even switch your career into Data Analytics. Be sure to subscribe to not miss out on any content!
____________________________________________

RESOURCES:

Coursera Courses:
Google Data Analyst Certification: coursera.pxf.io/5bBd62
Data Analysis with Python - coursera.pxf.io/BXY3Wy
IBM Data Analysis Specialization - coursera.pxf.io/AoYOdR
Tableau Data Visualization - coursera.pxf.io/MXYqaN

Udemy Courses:
Python for Data Analysis and Visualization- bit.ly/3hhX4LX
Statistics for Data Science - bit.ly/37jqDbq
SQL for Data Analysts (SSMS) - bit.ly/3fkqEij
Tableau A-Z - bit.ly/385lYvN

Please note I may earn a small commission for any purchase through these links - Thanks for supporting the channel!
____________________________________________

SUPPORT MY CHANNEL - PATREON/MERCH

Patreon Page - www.patreon.com/AlexTheAnalyst

Alex The Analyst Shop - teespring.com/stores/alex-the-analyst-shop

____________________________________________

Websites:
GitHub: github.com/AlexTheAnalyst
____________________________________________

All opinions or statements in this video are my own and do not reflect the opinion of the company I work for or have ever worked for

0:00 Introduction
0:58 Download Dataset
1:45 Download Python IDE
3:16 Import Python Libraries
4:38 Read in Data using Pandas
8:43 Look for Missing Data
12:30 Data Cleaning
25:08 Finding Correlations in the Data
54:21 Saving and Uploading to GitHub

All Comments (21)

@danielbristow6954 2 years ago

Update: Alex, I just accepted my first job as a junior data analyst. This completes my 6-month journey to learn data analytics and change careers, and I could not have done it without your excellent Portfolio videos. Thank you so much for making these available to your viewers for free. After I built my portfolio, companies started taking a second look at my resume and inviting me to interviews. BEFORE the portfolio, I received ONLY rejection emails. Thank you, thank you, thank you!
@neella97 1 year ago

If anyone else is having issues due to IntCastingNaNError, I advise trying the following:
    df['budget'] = pd.to_numeric(df['budget'], errors='coerce').fillna(0).astype(int)
    df['gross'] = pd.to_numeric(df['gross'], errors='coerce').fillna(0).astype(int)
It worked! :) Thank you Alex for your amazing videos!
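A minimal sketch of what that fix does, on a few made-up rows (the values and the two-column frame are assumptions for illustration, not the Kaggle data). Note that `fillna(0)` makes a missing budget look like a real budget of zero, which can pull later correlations around, so it is worth checking how many rows were filled.

```python
import pandas as pd

# Made-up rows for illustration; the real dataset has many more columns.
df = pd.DataFrame({
    "budget": [19000000.0, None, 4500000.0],
    "gross": [46998772.0, 58853106.0, None],
})

# Count the gaps before filling them, so the effect of fillna(0) is visible.
filled_per_column = df.isnull().sum()

for col in ["budget", "gross"]:
    # errors="coerce" turns unparseable entries into NaN,
    # fillna(0) replaces NaN so the integer cast cannot fail.
    df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype("int64")

print(filled_per_column)
print(df.dtypes)
```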
@izzyinsc 2 years ago

Hello! The dataset appears to have been updated on Kaggle, so anyone new will run into some issues that need fixing in order to follow along.
1. Missing data. Unlike in this video, the dataset now has missing values, so you will need to handle them. There are many ways to handle missing values but, for the sake of time, I decided to drop all rows that have missing data. You will have about 71% of your data remaining. Run the following if your dataframe is named df:
    df = df.dropna()
2. Extracting the year is different because the formatting is different. Running the following should extract the correct year:
    df['yearcorrect'] = df['released'].str.extract(pat = '([0-9]{4})').astype(int)
3. Duplicates. There aren't any in this dataset, so you should be fine on that.
I hope this helps anyone who is working on this, and best of luck on your analytics journey!
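The first two steps can be sketched on a tiny made-up frame. The rows, the name `clean` and the "Month day, year (Country)" format of `released` are assumptions for illustration; `expand=False` is used here so that `str.extract` returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Made-up rows; only the column names follow the comment above.
df = pd.DataFrame({
    "name": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "released": [
        "June 13, 1980 (United States)",
        "July 2, 1980 (United States)",
        None,                      # missing value, so this row is dropped
        "1981 (United States)",    # hypothetical row with no month or day
    ],
    "budget": [19000000.0, 4500000.0, 8000000.0, 6000000.0],
})

rows_before = len(df)
clean = df.dropna().copy()           # drop every row with any missing value
share_kept = len(clean) / rows_before

# Take the first run of four digits in the release string as the year.
clean["yearcorrect"] = (
    clean["released"].str.extract(r"([0-9]{4})", expand=False).astype(int)
)

print(f"{share_kept:.0%} of rows kept")
print(clean[["name", "yearcorrect"]])
```

Printing the share of rows kept is a quick way to confirm how much data the drop cost you on your own copy of the dataset.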
@woahnelly3286 1 year ago

Hey all, just a "stats" heads up/correction you might want to make for your portfolio: In this video, Alex wanted to see if the company was "correlated" with gross revenue. What he did was assign values (randomly, I think) to companies, countries, etc. Then he tried to see if those values were related to the gross revenue.
Those randomly assigned values are "measuring" the company, country, etc. at the Nominal scale. In other words, they're essentially just being used as a numeric "name"; the values themselves don't mean anything. What that means is that one value being higher than the other doesn't represent an increase in the thing being measured (for example, the USA was assigned a 54 and the UK was assigned a 53. Those are just names... the USA isn't one more of something than the UK).
Because the values themselves don't represent anything, it doesn't make sense to do a correlation with them. Correlations tell us, as one variable increases, what happens to the other? So in the first question, as the budget increased, what happened to revenue? It increased. But with country, company, or other categorical variables, correlations don't make sense. The values for country and company are random, so the numbers that represent them going up doesn't tell us anything. It's no wonder, then, that the correlations weren't large.
Instead, it would make sense to do a t-test or ANOVA and compare means. In that case, the question would be, "Do some companies tend to produce higher revenue than others?" Or, "Do some countries tend to produce higher revenue?" etc. (For more discussion, see: https://www.webpages.uidaho.edu/~stevel/519/Correlation%20between%20a%20nominal%20(IV)%20and%20a%20continuous%20(DV)%20variable.html).
Since this is a portfolio project and you want to show potential employers the result, maybe just take that part out. You wouldn't want to make a mistake like that in an application to a potential employer! (Alex, thanks so much for doing these videos! They're super helpful and I'm very very grateful!)
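As a rough sketch of the "compare means" idea, a one-way ANOVA F statistic can be computed by hand with pandas. The data and the helper name `one_way_anova_f` are made up for illustration; `scipy.stats.f_oneway` reports the same statistic together with a p-value if SciPy is available.

```python
import pandas as pd

# Made-up numbers purely for illustration; not taken from the Kaggle dataset.
df = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "gross":   [10, 12, 14, 20, 22, 24, 30, 32, 34],
})

def one_way_anova_f(frame, group_col, value_col):
    """F statistic: between-group variance divided by within-group variance."""
    groups = frame.groupby(group_col)[value_col]
    grand_mean = frame[value_col].mean()
    k = groups.ngroups          # number of groups
    n = len(frame)              # total number of observations

    # Variation of the group means around the overall mean.
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    # Variation of each observation around its own group mean.
    ss_within = ((frame[value_col] - groups.transform("mean")) ** 2).sum()

    return (ss_between / (k - 1)) / (ss_within / (n - k))

print(df.groupby("company")["gross"].mean())   # the means being compared
f_stat = one_way_anova_f(df, "company", "gross")
print(f_stat)  # 75.0
```

A large F value says the group means differ by more than the spread inside each group would explain. With many companies that each have only a handful of films, grouping the small ones first is probably needed before a test like this is meaningful.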
@darkavenger100 3 years ago

I can't wait for the beginner, intermediate, and advanced Python series by Alex the Analyst. It's what the people want, besides a happy Alex.
@OmarJimenez-dq8sr 10 months ago

The dataset has been updated and is not the same as the one in the video. If you have problems in the 'Create correct year' section, you can split the data to get only the year:
    df['yearcorrect'] = df['released'].astype(str).str.split().str[2]
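One caveat with a positional split, sketched below: it assumes every `released` value has the year as its third whitespace-separated token. The second row is a hypothetical value with no month or day, included only to show what happens when that assumption breaks; whether such rows exist in your copy of the dataset is worth checking with the last two lines.

```python
import pandas as pd

released = pd.Series([
    "June 13, 1980 (United States)",
    "1981 (United States)",   # hypothetical row with no month or day
])

# Third whitespace-separated token, as in the comment above.
by_position = released.astype(str).str.split().str[2]
print(by_position.tolist())  # ['1980', 'States)']

# Flag rows where the extracted token is not a four-digit year.
looks_like_year = by_position.str.match(r"^[0-9]{4}$", na=False)
print(released[~looks_like_year].tolist())
```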
@omashan6634 2 years ago

Honestly, you're an absolute legend. You really break down some of the technical barriers that exist for people entering the field of data science. You really are a gem to the community.
@tylerlaquinta2996 2 years ago

Hey guys, the dataset got updated since this video was posted. While I was going through the project I was able to google the problems as they came up. In case you get stumped, here's what I found that works.
This will drop any rows with null values:
    df = df.dropna(how='any',axis=0)
This will extract the release year from the released column into a separate column:
    df['yearcorrect'] = df['released'].astype(str).str.split(', ').str[-1].astype(str).str[:4]
Let me know if that works for y'all.
@naincypushpad2093 9 months ago

If df.corr() raises an error saying that a string can't be converted to a number, pass the numeric_only parameter: df.corr(numeric_only=True)
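A small sketch of that fix on made-up data (the frame below is an assumption for illustration). `numeric_only` is only accepted by `DataFrame.corr` in recent pandas releases; selecting the numeric columns first with `select_dtypes` gives the same matrix and also works where that parameter is not available.

```python
import pandas as pd

# Made-up rows; "name" is a text column that cannot be correlated.
df = pd.DataFrame({
    "name": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "budget": [10, 20, 30, 40],
    "gross": [25, 45, 65, 85],
})

# Skip non-numeric columns instead of trying to convert them.
corr = df.corr(numeric_only=True)

# Alternative: pick the numeric columns explicitly, then correlate.
corr_alt = df.select_dtypes(include="number").corr()

print(corr)
```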
@snudgegalbraith3447 3 years ago

I have recently decided on becoming a data analyst and your videos are really helping me understand what I need to do and keep me motivated on that goal which will improve my life. I want to say thank you for your content and your honest helpfulness.
@rickydonne802 2 years ago

At 11:08, instead of printing the null percentage, we can use:
    for col in df.columns:
        print(df[col].isnull().value_counts(), "\n")
This will print how many values are null, because you might have 1 missing value in 10k, and a percentage would need high decimal precision to show it.
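A compact variant of the same check, as a sketch on made-up data: `isnull().sum()` gives the count of missing values per column in one call, and the percentage can sit next to it. The frame and the name `summary` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows with one missing budget.
df = pd.DataFrame({
    "budget": [1.0, None, 3.0, 4.0],
    "gross": [10.0, 20.0, 30.0, 40.0],
})

missing = df.isnull().sum()   # True counts as 1, so the sum is the null count
summary = pd.DataFrame({
    "missing": missing,
    "percent": 100 * missing / len(df),
})
print(summary)
```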
@Dpereira96 2 years ago

Man, thank you so much for this, I know you've put a lot of effort into this project series and I can definitely say that I'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people pursue their dreams! Greetings from Brazil :)
@shanali3473 2 years ago

Man, thank you so much for this, I know you've put a lot of effort into this project series and I can definitely say that I'm a huge fan of your work! Thank you for sharing this, you're helping a lot of people pursue their dreams! Greetings from the UK :)
@gastonsuarez5320 1 year ago

I really appreciate the fact that you did not edit out the parts where you made "mistakes" and actually fixed them.
@rebekhathangam7466 9 months ago

If you are facing an error in the datatype change, try the following:
    df_copy = df.copy()
    df_copy['budget'] = df_copy['budget'].astype('int64')
    df_copy['gross'] = df_copy['gross'].astype('int64')
    df_copy
Thank you Alex for this amazing video
@vishakhasingh3162 8 months ago

At 13:40, if you are facing an error in the datatype change, try the following:
    df['budget'] = df['budget'].round().astype('Int64')
Hope it helps!
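A short sketch of why the capital-I `'Int64'` matters, using a made-up Series: it is pandas' nullable integer type, so a missing value stays missing instead of raising an error or being turned into 0.

```python
import pandas as pd

# Made-up budgets with one gap.
budget = pd.Series([19000000.0, None, 4500000.0])

# round() first, so only whole numbers are cast; the gap becomes <NA>.
nullable = budget.round().astype("Int64")

print(nullable)
print(nullable.isna().sum())  # 1
```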
@netol02 1 year ago

This 4-part tutorial is pure gold! After your announcement that you are launching your own data analyst course/certification, I can't wait for it to go live so I can follow up in more depth on the concepts presented in this series. I really appreciate the time, dedication and quality of the content you produce, Alex.
@reezalzainudin8097 2 years ago

Hey guys, at 46:24, we can simply call the .copy() method when creating our new variable if we want to loop over it without affecting the original dataframe df:
    df_numerized = df.copy()
    for col_name in df_numerized.columns:
        if(df_numerized[col_name].dtype == 'object'):
            df_numerized[col_name] = df_numerized[col_name].astype('category')
            df_numerized[col_name] = df_numerized[col_name].cat.codes
    df_numerized
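To see what the numbers produced by `cat.codes` stand for, here is a sketch on a made-up column. For text values the categories are sorted, so each code is just the label's position in that sorted list. It names the category rather than measuring anything about it.

```python
import pandas as pd

# Made-up values for illustration.
country = pd.Series(["United States", "United Kingdom", "France", "United States"])

as_category = country.astype("category")

# Code -> label lookup, handy for reading a numerized frame back.
mapping = dict(enumerate(as_category.cat.categories))
print(mapping)  # {0: 'France', 1: 'United Kingdom', 2: 'United States'}

codes = as_category.cat.codes.tolist()
print(codes)    # [2, 1, 0, 2]
```

Keeping a mapping like this next to the numerized frame makes it possible to translate codes back into names when reading a correlation matrix or plot.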
@hazimrashid1231 2 years ago

Hi Alex, just finished the project. It’s awesome. Thanks for everything. I pray for your success in the future.
@NroShock 3 years ago

Thank you Alex, this has been a great project! You are a great teacher and this has been very helpful. Looking forward to everything you release in the future!