# Full Data Science
How do you become a successful data scientist? Most people get this one wrong: they focus only on the technical skills, the math, the statistics, the machine learning, the programming languages they need to master, and all the libraries. The list could go on. But you know what? That is actually the wrong approach. Let me reveal the secret to you: a successful data scientist focuses on something totally different, the data science workflow, on what adds value to their end clients. They need to generate useful, valuable, actionable insights for their customers, and to do that they need to understand the full data science workflow: how to acquire, prepare, and analyze data, and how to turn it into reports and actionable insights in the end.

In this 15-part data science course with Python we will focus on the full data science workflow. We will teach you how to think, how to prepare, how to analyze, how to validate that you are doing everything right, and how to deliver actionable, valuable insights to your end clients. Most courses only focus on the technical aspect, giving you as many tools as possible, while this course focuses on giving you the best tools to cover the full process of the data science workflow. So, are you ready for this? I hope so.

Over the next 15 lessons we will work in Jupyter Notebook using Python. You will learn how to explore, understand, and define the data science problem, and where to identify additional data if your client does not give it to you. We will use the most versatile tools to achieve that, and actually the pandas DataFrame connects all we need. We will start by learning how to import data from various sources and how to combine data in DataFrames. You will learn how to explore data quality, how to group data, how to compute simple statistics, and how to read advanced statistical box plots. You will learn how to visualize data, and visualization does three things for a data scientist: first, it makes it easy to spot outliers and discrepancies, the quality of the data; second, it is a tool for understanding the data, whether there are any connections, whether things are correlated; and finally, it is a great way to present your results. We will also learn to clean data, with examples of what can go wrong if you don't and the different techniques for doing it. We will learn how to analyze data with machine learning models, and about feature scaling.
We will learn about feature selection: did you know you can get higher accuracy by selecting the right features? You get a simpler model and you reduce the risk of overfitting. We will also learn about model selection; the big question is often: I have this data and I need to create a model, but which one? And to verify that we did our analysis well, we have a checklist.

Then we learn about reporting. How do you present your findings? One key thing is to understand your audience and give them key messages. Visualizing results we already talked about, but we need to master it, because it is a key skill for selling your message: visualization often builds up the confidence that your model and your findings are correct. Credibility counts, always remember that. Sometimes you find things you would prefer not to show your clients, but remember that in the long run credibility counts, so don't leave out results.

Actions are where you really create value. Use the insight you have: how can you turn it into valuable, actionable insight for your customer? This also includes measuring impact: you need to be able to measure the impact your findings have on your customers and clients. Remember, this is the main goal of a data scientist; your success is measured by the valuable, actionable insights you create. At the end of this course you will get a template with all the resources and all the things you need to master, with pointers to where you can find more in-depth information on each specific topic. This is what makes data scientists successful, so I hope you're ready for this journey together with me.

The course is structured in 15 lessons. For each lesson there is a Jupyter Notebook with the material I am presenting, and then a project notebook. They are numbered 0, 1, 2 and so forth, stopping at 14 because there are 15 lessons. In the first lesson, all you need to do is open the lesson notebook and follow along with what I'm doing. When we get to the project, open the project notebook and try it on your own; after you get stuck, or when you're finished, you can see how I would make a solution too. I hope you are ready to join this 15-part journey, because it's going to be amazing. I really enjoyed making this, and I hope you will enjoy it too. Want to get started immediately? Here is what you need.
In the description there is a link to Anaconda. Go to that webpage and download Anaconda: it includes Python and all the libraries you need to install, so it's all done for you, and it has the Jupyter Notebook that we're using. Click download, then follow the installation process. When you have installed Anaconda, launch it, and inside Anaconda launch Jupyter Notebook. Then go to my GitHub repository (there is also a link below) and download everything as a zip file. It contains all the notebooks we're using, all the files, and all the data, so you don't need to do anything else. Inside Jupyter Notebook you will see a screen like this; navigate to where you downloaded the zip file, unzip it, and you will find all the files we're using. Click the first lesson and you're ready to go.

Why data science? Why is it so amazing? Well, I don't know why you think it's amazing, but I know why I like it. Today we are surrounded by things that collect data about us, whether on social media or from your smartphone, and most businesses know one thing:
they need to make data-driven decisions. Data is everywhere now and available to you, so you can learn about the world, about human behavior, about customer behavior, about everything, and that is just so amazing. Companies know that data-driven decisions are the way forward. And don't forget the job opportunities: there are over 2.5 million data scientist jobs listed today, and that's just the top of it.

Python is the most popular language for data science, and for good reason: it is easy to learn, it has an enormous community, and there are so many libraries connected to Python that make your job as a data scientist easy. You should not underestimate the size of the community; the bigger it is, the easier it is to get help, and the more certain you can be that people will continue to build new modules that make your life as a data scientist even easier. And I must say, already today it is insanely easy to use Python as a data scientist, so you can be certain it is the right choice. Actually, data scientists and analysts have a starting salary of, hold on here, $80,000 per year in the US. Whoa. For many people that's a lot of money, and imagine, that's a starting salary. In this course we make a project where we look at what the salaries are across different company
sizes and experience levels, and you'll be surprised by that. Isn't that exciting? I think so.

This will surprise you: did you know how data science started? It was actually before computers, I kid you not. I have two links here to two great stories about how data science started. The first one is about how they used data to defeat a disease: read how they used a map to pinpoint where people were dying from the disease, and how that helped identify the source of it. It's a great story. The story about weather forecasting is just as amazing. That was also before the computer, actually before the phone was invented; they only had the telegraph system, and that was the key. This guy realized that if he knew what the weather was like next to where he was, he could probably predict the weather coming to his region. So he thought: if we divide the entire world into cells, and we know the weather in each cell, we can predict the weather coming near us in the future.

He used this data to predict storms. He was living at the coast and wanted to help the fishermen predict weather storms: if a storm came while the fishermen were out sailing, there could be casualties, so he wanted to be able to predict the storms. And there is a great insight here: what happens if your predictions are not accurate? Take this weather forecast. Imagine he predicts that a storm is coming, the fishermen say, okay, we stay at home, and then it turns out no storm came. How will the fishermen react? Well, they didn't catch any fish, so they're probably hungry, and hungry people are often not very happy. You need your models to be accurate, or you need to inform people how to evaluate the results you are giving them. This is such a great
insight, to understand the consequences of false predictions.

So what about this data science workflow I talk about? How do you use it? It all starts at step one: you need to understand the problem. What is the problem we are trying to solve, and how can it form a data science problem? For example: sales figures from a call center, you have logs on that, and they want to evaluate new products; that's one kind of problem. It could also be sensor data from multiple sensors, used for instance to detect equipment failure, where you need to predict when equipment fails and when modules need replacing. Or it could be customer data and marketing data, used to better target your marketing.

There are a few steps in this. You need to understand the problem and define a data science problem. You also need to assess the situation so you understand the problem better: there can be risks, there can be benefits from what you are doing, and there can be contingencies, regulations, and resource requirements for the job. All of this shapes the data science problem you're working with. You also need to define a goal: what is the objective of your research, what are the success criteria, what is the thing you're looking for? The more you can define these things, the better you understand your data science problem, and it keeps you on track doing the work that adds value to your clients. So, in conclusion: defining the problem is key to a successful data science project.

Before we take an example, let's look at this one again. I want you to understand that this is not a one-way, step-by-step guide; sometimes we walk back and forth. You might understand the problem, look at what kind of data there is, do some preparation, explore the data, and realize you need more data. You go back to step one, get more data, prepare it, analyze it, figure out you need even more data, analyze again, realize you need to talk to your client again, and so on, back and forth. Even when you get down to the actions, the actionable insights you develop and provide to your customer or client, it might be a long-term relationship: you might come with advice that they need to acquire more data from their sources in order to make better analyses and get better insights. My point is that the data science workflow is not step, done, step, done, step, done. No, it can go back and forth all the time; you get smarter and you learn more along the way. It's a back-and-forth battle. What you need to know is that the key to customer value is still to understand and explore the problem.
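The back-and-forth loop described above can be sketched in code. This is purely an illustration: the step functions below are made-up stubs, not real library calls, showing how one step can send you back to an earlier one.

```python
# Illustrative sketch of the iterative workflow: analysis can send
# you back to acquiring more data. All functions are invented stubs.
def acquire(state):
    state['rows'] += 100          # pretend we fetched 100 more rows
    return state

def prepare(state):
    state['clean'] = True         # pretend we cleaned the data
    return state

def analyze(state):
    # Pretend a trustworthy model needs at least 200 rows.
    state['need_more_data'] = state['rows'] < 200
    return state

def report(state):
    return f"model built on {state['rows']} rows"

state = {'rows': 0, 'clean': False}
while True:
    state = analyze(prepare(acquire(state)))
    if not state['need_more_data']:
        break                     # otherwise: back to acquire
print(report(state))              # model built on 200 rows
```

The point of the loop is exactly the back-and-forth: the workflow only moves on to reporting once analysis stops demanding more data.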
Good, let's take an example: the weather forecast. We already talked about this from ancient times. So what is the problem we are trying to solve? It can differ: for him it was to predict weather storms; for us it might be to predict the weather tomorrow, or whether it's going to rain tomorrow, or something else. Let's assume it is to predict the weather tomorrow. What data do we have? It might be a time series of temperature, air pressure, humidity, rain, wind speed, and so on.

In step two you need to investigate the data quality. If it's sensor data, sometimes a sensor has wrong readings; for instance, the vane that measures wind direction might get stuck. So there might be faulty data, and you need to be able to identify and deal with that. There might also be missing data; data quality is a wide topic. A great way to understand data is to visualize it: visualization helps you understand the data quality, but also whether there are any connections. So: cleaning data, handling missing data and faulty data.

In step three you need to figure out which features to select, and we will learn about feature selection and feature scaling and understanding the model. I actually already have an older tutorial on predicting rain versus no rain in our machine learning course; you can check it out if you want. In analyzing, you need to evaluate your prediction model: if it makes poor predictions, is that good or bad? You need to be smart about that, and we will learn how to evaluate models later in the course. Then we need to present it, and in this case it could be a weather forecast: some charts, some maps, or something like that.
And credibility, right? Remember: even though you want to make everybody happy by telling them every day that there will be sunshine, it will not be good for your credibility in the future. Inaccurate results, overconfidence, and not presenting findings honestly all hurt you. A learning from that first weather forecaster: he should have had a metric for how confident he was in his results. He could say there might be a storm, and one time it is a 20% risk of a storm, another time a 90% risk. That would give his clients, the sailors and the fishermen, a better way to evaluate how to use his results.

Finally, what insights? What can you use a weather forecast for? What to wear, for example, or what happens with outdoor events, like the fishermen. These are insights, something your client can act on. If your client sells ice cream, maybe they want to know they need a lot of ice cream available because a lot of sun is coming. And impact: you need to be able to measure what impact your work has, for instance on sales of umbrellas, ice cream, and so on. This is what makes good data science, this is what makes it valuable: somebody can use it to make predictions, like when are we selling umbrellas and when are we selling ice cream. These are the key insights from your work, and understanding this connection is the key success of a data scientist. I hope you're hooked, because this is going to be amazing.

So what skills do you need? You need some math and statistics, programming, domain knowledge, and data visualization. These are the hard skills, and most people get scared because they think: whoa, do I need to be an expert in all of these? No, actually not. Let me tell you
a funny story. I've been working in data science for many years, and sometimes I wondered: why did this guy get hired? He has no math skills, no statistics skills, no programming skills, he does nothing with data visualization. But he had one thing, and that was domain knowledge, and they probably realized: he will learn the math and stats he needs, he will learn the programming, and the data visualization. It is not so important to be a master of all aspects; domain knowledge weighs high in the recruiting process. Of the soft skills, you could be curious, good at communicating, at storytelling, at structured thinking. I know hearing all of this can be scary, but let me let you in on a small secret: data scientists most often work in groups, so you don't need to be an expert in all of them. If you're not good at storytelling or communicating, don't be scared; it's a team effort. Some are extremely good at math and stats, some at programming, some at communicating, some at storytelling. It's all a package in a team, so don't worry too much: if you master the data science workflow, you are pretty good to go.

The difference between beginner and expert data scientists is that beginners focus only on the technical aspects, learning as many technologies and languages as possible, while an expert looks at the bigger picture. This is the key to your success: understanding that most beginners and most data science courses only focus on the technical aspect, but an expert, successful data scientist knows to focus on the full data science workflow. He knows that in the end, if you don't add value, you don't have a job and you don't have customers. Remember this when you get scared, when you hear somebody talking about some framework or technology you never heard of and they say every data scientist needs to know it: remember the bigger picture. And I can reveal to you that some of the best data scientists I worked with understood this. They knew it's not about knowing the most technologies; it's about giving customers insights, customer value.
In this course we will focus on how to connect these steps together with the most efficient technologies, so you don't waste too much time learning too many of them. Okay, I hope you're ready.

Student grade prediction. We found a data set on Kaggle, and if you don't know Kaggle yet, it's a great place to play around with data; there are a lot of projects and so on. I'm not going to dive into that now, we'll get back later to where you can find data, but it is basically a playground for data analysts and data scientists, which is what we are right now. So I found this data set: student grade prediction. It is about predicting the final grades of Portuguese high school students; it says so on Kaggle, so that is what the job was. We're going to twist it a tiny bit to make it more realistic, because maybe as a school you're not interested in predicting the final grades; maybe that's not what you want, you want to help the students. We'll get back to that.

The first thing you look at is the features. What are features? When you have a spreadsheet with all the data, the features are, you could say, the column names, or the columns of a database table if you relate to that. Don't think too much about what things are called; it doesn't matter whether you remember the word "features" or not. There is a feature description: there is a school feature, and you can see the student's school is binary, there are two schools in this data set, GP and MS. Then the sex of the student, female or male; the age of the student, 15 to 22; an address; family size; mother's education and father's education; mother's job and father's job, and so on. We can read what each one means, and this makes you understand the data. We're not going to go through all of them, but there is one called "higher": does the student want to take a higher education, yes or no.

Then there are the targets we're looking at, meaning the things we want to predict. In this data set we will twist that a bit, but
that's often called the targets. What they wanted to predict here were the final grades, G1, G2, and G3. So how can you use all this data you have on the students to predict those? Can you make something? That was the original goal, but what I'm thinking about is: if you have grade one and grade two, or maybe all this other data, how can you propose activities to improve the G3 grades, the final grades? Because ultimately that is the school's purpose: to increase the level of the students in the country. This is Portugal, so think of yourself as the school ministry in Portugal; your goal is to get the smartest people, because the smarter the people are, the better it may be for the economy, because they can do more by themselves. This is also what you and I are doing right now: we are becoming data scientists, getting smarter so we can get a better job doing what we like, and maybe a better salary. We want to do something similar for the school: we want to guide the school in how to help students get higher grades, and I'll get back to that in a moment, because this sounds like our goal. What activities can they do to get higher grades, right? How can they help the students?

Just a few notes here. I will not normally cover notes like this, but I will introduce them along the way. In this first one we're going to use something called pandas, which we'll get into later in the course, and CSV files, read_csv, correlation, group by, mean, standard deviation, and count. Because we don't know anything yet, I added these programming notes with links, so if you want to read about what a CSV is, there is actually a lecture about CSV that you can follow along, and there is the read_csv documentation we are using. Normally I would say the documentation is pretty good, and at the end there are often examples of code. read_csv is an enormous one, so as you can see, it is actually a pretty bad example of good documentation, but it is there. We could look at something like group by instead. That is also quite a complex one, but there are examples of how you use group by further down; they make examples, and documentation is a really good source to learn and understand things. I highly encourage people to learn to look at documentation, because in the beginning you're often frightened of it, but it's not that difficult once you have some practice, so perhaps start from the beginning.

Good. Acquire the data. I already told you this data is from Kaggle,
but I did something to make it easier for you: I downloaded it for you. We'll get to that in a moment, but first, remember our process, our five-step process: steps one, two, three, four, five. We'll follow along with it down here.

Step one: acquire. Explore the problem, identify data, import data. Get the right question, that is the problem; exploring the problem forms a data science problem, what it is we want to do. We had some examples: sales figures and evaluating a new product; sensor data from multiple sensors to detect equipment failure; customer data and marketing data for better targeted marketing. But what do we need to understand in our context? The data. What student ages are possible? We know the student ages are 15 to 22, and that gives you an idea: if they were first graders, seven-, eight-, nine-year-old kids, you would propose something different than for 15-to-22-year-old students. You cannot advise eight-year-olds to read a book per week; it doesn't make any sense. So the age, for instance, sets limitations on what's possible. And also the budget: what is the budget? Maybe you say, okay, what we need to do is give all students in the lower half extra classes, or give them a student helper, so the school needs to set up a program for that. But maybe it costs too much, maybe it's not possible. So there are limitations, and obviously we don't know the budget, but keep it in the back of your mind. This is a Portuguese school and we don't know much about it; you could maybe research it, since the name is up there: is it a rich school or not, how much funding does it have, what is the budget?

But import the data, that we can do, and I prepared it for you: inside the files folder here we actually have the data. I think it is
this one here, a CSV file. Again, if you don't know what a CSV file is, it doesn't matter right now; it is the file representing all the data, and we'll get back to that later. What do we need to do with it? We need to import it. How do you do that? If you're starting from scratch, follow along: `import pandas as pd`. pandas is the library providing the data structure inside our notebook. Then `data = pd.read_csv(...)`, the read_csv from the library we had up here: we found the CSV file, we imported pandas, and now we're going to read it. The file is in the files folder and is called student-mat, so we read that, and now we have the data inside the variable `data`. You can get the first rows of the data and inspect that it is similar to the data we saw over on Kaggle; it is actually the same, just structured a bit better and easier to read. Good, so far so good. One of the things you often do is check the length of the data: just take `len` of it and you have the number of rows. And you can do more, like `data.columns`, which gives you all the column names. These are the features and targets: remember the targets are G1, G2, G3, and you have the features here, school, sex, age, address, family size, and so on.
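These import steps can be sketched as follows. In the notebook the file would be read directly with `pd.read_csv('files/student-mat.csv')` (the path comes with the course repository); here an inline sample with a subset of the real columns is used so the snippet runs on its own.

```python
import io

import pandas as pd

# In the notebook you would read the downloaded file directly:
#   data = pd.read_csv('files/student-mat.csv')
# Here we use a small inline sample (a subset of the real columns)
# so the example is self-contained.
csv_text = """school,sex,age,higher,failures,G1,G2,G3
GP,F,18,yes,0,5,6,6
GP,F,17,yes,0,5,5,6
MS,M,20,no,3,8,7,7
GP,M,15,yes,0,14,15,15
"""
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())          # inspect the first rows
print(len(data))            # number of rows: 4
print(list(data.columns))   # features plus the targets G1, G2, G3
```

`head()` shows the first five rows by default, which is usually enough to check that the import worked.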
So now we have the data, and that is basically the first step. In normal cases you figure out where to get the data and how to import it; maybe it's not as simple as this, but we will get back to more complex cases later.

A first thing to think about: are the data types as expected? What do I mean by that? We don't know the data types of these columns yet. Age, for instance, sounds like an integer, and sex sounds like a string, but sometimes values that look like integers are not stored as integers, so that is one of the things you want to investigate. How do you get them? `data.dtypes`, and then execute it. Here you see that age is an integer, and so are mother's education, father's education, travel time, and so on. Right now we're not going to investigate much more of it, because we are basically beginners, so don't worry too much about these things.

What I also want to look at is: are there any missing values? This is a main thing. Looking at the raw data it is difficult to see whether some data points are missing, whether some of them don't have values where they are supposed to. That is a basic check you do all the time. How can you do it? Again, we'll get back to all of this, but what you see here is a perfect data set: it says False for every column, so all the data points are actually available, and again, we'll get back to doing these things, so don't worry too much about it.
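These two checks, data types and missing values, look like this in pandas (again with a small made-up sample standing in for the real file):

```python
import io

import pandas as pd

# Small invented sample standing in for the real CSV file.
csv_text = """school,age,G3
GP,18,6
MS,20,7
GP,15,15
"""
data = pd.read_csv(io.StringIO(csv_text))

# Are the data types as expected? Numbers should come out as
# int64/float64, not as strings (dtype 'object').
print(data.dtypes)

# Any missing values? isnull() marks missing cells, and .any()
# reduces that to one True/False per column.
print(data.isnull().any())
```

On the real data set every column comes back False in the missing-value check, which is what "perfect data set" means here.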
Now to the analysis. We need feature selection, model selection, and then to analyze the data, and again, it is a different journey depending on what we're doing, but the first thing we want is correlation. As with everything in this first lesson, we'll get back to it, so don't worry too much. Basically, correlation gives a number on a scale from -1 to 1; it can be anything between -1 and 1. When you compare two columns, two features, against each other, it tells you whether they are uncorrelated, or highly correlated, positively or negatively. What does that mean? If two features are uncorrelated, you cannot predict anything: if one number goes up, it says nothing about the other. What you want is something highly correlated, positively or negatively; what you are not looking for is something close to zero, you're looking for something close to 1 or close to -1. Highly positively correlated means that if one value goes up, the other one goes up too; negatively correlated means that if one goes up, the other most likely goes down. And again, it is not a 100% certain fit unless it is exactly -1 or 1.

So let's try to do that; it's basically easy. We use this correlation here; remember, up in the programming notes I added a link to correlation, and you can
Read a bit more about it if you want and what we’re interested in is not correlation of everything but let’s just do that so what you get here is an enormous matrix here of data and you notice actually some of the data is correlated and for instance the age thing here is
Correlated with one by h but that makes sense right h is correlated with one by h because if h goes up each goes up it’s the same thing right it’s the same number but if you look at the different things what we’re interested in is actually g3 right the
Final exam is something is some of the features correlated with that so let’s just focus on that and take g3 out how do you do that you do it like this and again the details don’t worry too much about them so what we’re looking at
Is we want to notice here as it says up here that the correlation is only on the numeric features not on the other ones and it makes sense right because these are numbers right so what are correlated right so it might not surprise you that if you had high
Grades at g1 and g2 you’re most likely to get high grades on g3 that’s what it says it’s highly correlated right and g2 more than g1 right so you might have not done well on g1 but you can still do insanely good on g3 or you might have done really well on
G1 and then things happen you didn’t study and then you do bad on g3 but they’re highly correlated but maybe that’s not the most interesting thing right so if you look at it are there anything interesting right so there’s mother educations it is somewhat lowly correlated uh travel time is negatively correlated
Right: the longer your travel time, the worse the grades you get. Then health, absences... actually nothing really jumps out except failures. We can look up what 'failures' means: it's the number of past class failures. That makes sense, right? The more classes you've failed before, the worse the grades you'd expect. So that's one of the more correlated features.
But we have a problem here: we're only comparing numeric features with numeric features. What about all the features that are non-numeric? How can we say anything about those? This is where groupby comes into the picture, and we're going to explore that now. Let's group the data by 'higher'. This is about your expectations: do you want to go on to higher
education, or not? It says the average G3 grade is about 10.6 if you want higher education and about 6.8 if you don't. Why is that interesting? Let's present it a bit better: this answer actually makes a huge gap in the average grade. If you take only the G3 column of the data and compute its mean, the overall mean is 10.4, but those who answered 'no' average only 6.8. So it seems like wanting higher education really divides the students.
One great thing to look at here is how many students are actually in each category, and you can do that with count; let's just count the G3 column, because the count is the same for every column. You see there are only 20 students who don't want higher education, and those 20 are the ones with the low grades. So you can actually identify a group of students based on this single answer, and it predicts their grades pretty well, because the gap is enormous.
Can we use that? I don't know; let's look at something called standard deviation. You want to see how spread out the grades are. Right now you're happy because you found this insanely good-looking number: 10.6 for those who want higher education. The standard deviation tells you how much the values spread: if it's small, they don't spread a lot; if it's high, they do. What does that mean for grades? A high standard deviation means the grades are spread out enormously; a small one means the mean is precise. Let's compute it on the G3 column. You see that the spread for the 'no' group is bigger than for the 'yes' group. For the 'yes' group, with a mean grade of 10.6 and one standard deviation of about 4.5, roughly 68.2% of the students lie within 10.6 minus 4.5 up to 10.6 plus 4.5, so from about 6.1 up to about 15.1. The point is that most 'yes' students are in that range, while the 'no' group is wider: its grades cover a bigger range.
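In code, the mean, count, and standard deviation per answer look roughly like this; a minimal made-up frame stands in for the survey data (on the real data set the means are about 10.6 for 'yes' and 6.8 for 'no'):

```python
import pandas as pd

# Made-up miniature of the 'higher' survey column and the final grades.
data = pd.DataFrame({
    "higher": ["yes", "yes", "yes", "yes", "no", "no"],
    "G3":     [12, 10, 14, 11, 6, 8],
})

grouped = data.groupby("higher")["G3"]
means  = grouped.mean()   # average final grade per answer
counts = grouped.count()  # number of students behind each average
stds   = grouped.std()    # spread of the grades within each group

# For roughly normal data, about 68.2% of values lie within one standard
# deviation of the mean, so this is the "typical" grade range for 'yes'.
low  = means["yes"] - stds["yes"]
high = means["yes"] + stds["yes"]
```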
Good. So, what to do, what to do? Well, credibility counts here, so we need to make a report with the skills we have now. This is where we present our findings. One thing to present is, for instance, the overall mean of the G3 grades, which is 10.4, and then the numbers we figured out above: the group means and the counts. These are our findings: we can identify a group of students, and they're easy to identify, who need help in order to improve their grades.
One caveat: if you advise based on this survey question about wanting higher education or not, the group is not that big, so the impact is not big either. Let's see how many: take the 20 students over the length of the data, times 100 to get a percentage, and it's only about five percent. So five percent of the students are ones we could somehow help, trying to get them motivated and up to higher grades. What will the impact be? Small, in that sense. But this is just to get you started; you can play around with it and find something more interesting. You need to identify the students that need help, decide what you can do, and work out how you can measure the impact.
In the future, for instance, you could run the same scoring again and see whether you made an impact, whether it influenced the grades. Again, focus on where we can add value. For instance, the recommendation could be: we know how to identify the students that are doing poorly, these are the students who answered 'no' to this question; we advise doing this and this and that, and it has to fit within some kind of budget. How can we measure impact in the future? We can run the same scoring, the same survey, and see whether it had an effect. The thing to notice is that we're only making an impact on five percent of the students, so we cannot assume that next year the average grade will jump to 11 or 12 or whatever. By the way, I believe the grading scale here goes from 0 to 20, or some scale like that.
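The five-percent figure is a one-liner; the counts below are the ones from the lesson's data set:

```python
# 20 students answered 'no' out of 395 in total (numbers from the lesson).
no_count = 20
total = 395

share = no_count / total * 100
print(f"We can target about {share:.1f}% of the students")
# prints: We can target about 5.1% of the students
```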
This is basically how you do the data science thing. Let me shortly recap before we continue. We imported the data (we had already downloaded it, so we used read_csv). We inspected the data: the length, the columns, the feature names, the data types, and we checked whether anything was missing; nothing was. We built a small understanding with correlation (we'll get back to the details later). We grouped the data; how does groupby actually work? We'll get back to that later too. We looked at the mean, the average, per group. We counted how many were in each group, wanting higher education or not. And finally we took the standard deviation, to see how precise the mean number is.
Then we went back to the analysis: what did we figure out? Remember, credibility counts. Maybe it's not a big impact, maybe it is, but finding 20 students you can actually do something about is better than finding fifty percent of the students and not knowing what to do. For these students, it's about motivating them to see the long-term perspective: 'I want to get higher education', for instance. Okay, so that's basically it. In the next one we're gonna do a project, and it's gonna be amazing.
This is gonna be your first project, and I'll guide you on how to do it, and afterwards I'll show you how I would try to solve it. It's gonna be exciting, I promise you that. Welcome to your first project.
Let me just emphasize one thing: why do we do projects? As you probably already saw, each lesson comes with a project. The purpose of the lesson is to learn something new, and in the project you use it in the bigger context: the context of the data science workflow. That's the goal: you learn something new and then apply it in a realistic setting. Obviously, in the beginning it's going to be simple, and it will get more and more complex, because we don't know so much in the beginning, and as we continue our journey it gets harder.
I often get this: 'I do understand what you're saying, I understand everything you taught me in the lesson, so why do I need to do the projects?' Don't cheat yourself on this one, because often it is more complex, or there are details you didn't really notice, until you do it yourself. One thing is understanding on a mental level; another thing is understanding so you are able to apply it yourself, and that's often way more difficult than you expect. Trust me, I've been a learner my entire life, and every time I hear something new I say, 'oh, that's very simple', and then I try to apply it myself and it's difficult; there are details I didn't understand. That is the main purpose of projects. So here we
have it: 'Project: guide to school activities to improve G3 grades'. That's basically the data set we worked on. In each project (almost all of them, except maybe one or two) there will be this data science workflow, because we need to remind ourselves why we're doing this. It's crucial for your success: remember, if you don't understand the problem, how can you deliver insights to the end customer? That's why we always have it. The project is guided in steps, as you see here: step one, and further down step two, three, four, and five. The project is structured so you realize where the work we're doing lies in the overall process, so you keep it in mind all the time. And again, it's going to be quite simple in the beginning and more complex as we go on.
So what are we going to do in this project? The goal is to investigate, or explore, the data set from the lesson further. We're going to follow the data science process to understand it better, and it will be your task to identify possible activities to improve G3 grades. We did a job inside the lesson; can you do a better job? That's basically what we want to find out. Note: we have very limited skills so far, hence we must limit our ambitions in the analysis. Don't expect to solve this school-grades problem, because we don't have many tools yet, and you cannot work wonders; you cannot build a high school with insanely high grades afterwards.
Good. What I want to do now is introduce the project quickly, what you're supposed to do, and then you should do it yourself. Here I introduce it, and in the next part I'll go through a possible solution and help you with the solution steps.
Okay, so in step 1a we import libraries. It's already written for you, 'import pandas as pd', and all you need to do is execute the cell; Shift+Enter is how you execute it. Then, step 1b, you read the file, in this case the student-mat file. You need to use read_csv, and remember to assign the result to a variable, just as we did in the lesson: data equals pd.read_csv with the file name inside.
Then we inspect the data. We also did this in the lesson, but do it yourself: call head on the data. What does it give you? Remember, if you get stuck you can watch along in the next step of this lesson, where I'll show you how I would do it. Then check the length of the data; we also did this in the lesson. And if you think this is too easy, don't worry, the projects get more complex later.
Then we explore the data; we're at step 2. Notice that we will not cover visualization in this lecture, and we also know the data is clean, but we will do the validation here anyway, because it's part of the process. First, in step 2a, we check the data types: this step tells you if some numeric column is not represented as numeric, and we get the data types with dtypes, as in the lesson. Then we check for missing values. If data has missing entries, there can be many reasons for it, and we need to deal with it; we'll do that later in the course. Use isnull().any(). From the lesson we remember there is no missing data, so it's a perfect data set, but again, this is the mindset you need to get into.
Step 3: analyze the data, feature selection, and analysis. What we want to do is find three features to use in our report. The three features should be selected based on actionable insights (what can you actually do?), on conveying credibility in the report, and on what is realistic within the possibilities, including a budget; you need to think about what kind of budget there is and what kind of measures are possible. Note: this step is where you explore with the tools you now know: correlation, and groupby with mean, count, and standard deviation. These are what you'll use, and present, in step four.
Step 3a: investigate correlation. Correlation is an easy measure for finding insights that are actionable. Use correlation and only show the G3 column, since that's what we're interested in. Notice that G1 and G2 are highly correlated with G3, but they are not intended to be used. Step 3b: get the feature names.
This can help you understand the features better: get the available features with .columns applied to the DataFrame you're using. Step 3c: investigate the features, repeating this step. You take one feature at a time and see how it behaves; we did this with .mean(), .count(), and .std(), remember. If you're a bit confused, you can re-watch the lesson part of this lecture and try it yourself. Step 3d: select three features. Decide on three features to use in the report; the decision should be based on actionable insights, conveying credibility in the report, and what is realistic within the possibilities. Remember what we talked about: this is where you base the findings in your report.
Step 4: present findings. Normally you would visualize results (credibility counts), but we don't visualize anything here, because we don't have that tool at our disposal yet. So, with the three features from step 3, create a presentation; as we have not learned visualization yet, keep it simple, and remember that credibility counts. Notice that at this stage it is not supposed to be perfect. Present the findings here in the notebook; don't go crazy. I mean, if you enjoy doing these things, go crazy, I'm not going to judge you for that, I like it too.
Step 5: actions. Use the insights, measure the impact; that's the main goal. Describe what action the school could or should take: what are you proposing, and how can they evaluate the impact? One goal could be, as in the lesson, to do the same measurements next year, and the year after, and so on. Remember, this is the main goal; this is why we're here. Take actions. We need to do that. Okay.
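Putting the five steps together, a bare skeleton of the project notebook might look like this; the file path and the tiny inline frame are placeholders of mine, not part of the official project:

```python
import pandas as pd

# Step 1: acquire. In the project you would load the real file:
#     data = pd.read_csv("files/student-mat.csv")
# Here a tiny made-up frame keeps the sketch self-contained.
data = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "higher": ["yes", "no", "yes"],
    "G3":     [12, 6, 10],
})

# Step 2: prepare. Inspect and validate the load.
data.head()
assert len(data) > 0
assert not data.isnull().any().any()  # no missing values

# Step 3: analyze. Correlation plus groupby statistics, feature by feature.
by_higher = data.groupby("higher")["G3"].mean()

# Step 4: report. Present the three chosen features (e.g. markdown cells).
# Step 5: actions. Propose measures and how to re-measure impact next year.
```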
So are you ready? You basically just follow it step by step and write the code in the cells, and if you really don't know what to do right now, don't worry: in the next part I will guide you through it. And again, this is for beginners; if you get stuck at a step, don't worry about it. Follow along, try your best, and then you can modify things afterwards. And if you find it too easy, don't worry either, it gets more difficult later. Basically: if you get stuck, watch how I'm doing it, try it yourself again, then continue, and repeat if you get stuck again. And in the end, if you got it all done, maybe you want to see my solution anyway, just to get inspired and see whether it could be done differently. Okay, so are you ready? Stop now and try it on your own. See you.
So, did you try it yourself? If you didn't, try it yourself first and then continue; that's our deal. But if you got stuck, don't worry, I'm here to guide you through how I would do this. Let's get started.
I already introduced the project, so let's just get started. Step 1a (I wrote 'a1' in the notebook; it should be 1a): import libraries, execute the cell below. So let's do that: boom. That's all you need to do, because pandas is the library you're using, and now it's loaded into the environment. Step 1b: read the data. (Let me just correct that 'a1' cell while we're talking, so we don't forget it in the future.) Use read_csv to read the data from the file, and remember to assign it to a variable; let's call it data. So, pd.read_csv, and notice the autocompletion: if you press Tab, the method name is completed for you, and the same with the file name, files/student-mat, you can use autocompletion there too. When it runs, it reads the data into the variable data, so now you have the data in data. Use Tab for autocompletion; it helps you along.
Good, then we call head on data. You see the first couple of rows. Why is this a good thing to do? Because if you continue and something went wrong, the data isn't what you expect it to be, maybe you gave the wrong file name, or something else failed, it's really nice to validate that the data is there. You see some dots in the output; that's just the summarizing, and that's basically what head does: it doesn't show you all the data, just the first five rows, and there are 33 columns, but it doesn't show all of the columns either in this summarized view. So that's basically it; it's not that difficult.
Good, then check the length of the data, so call len (I wrote dots where you should insert something, namely the data). It should be 395, it says, and did we get 395? Yes, that's correct. Perfect. That's basically step 1 done: we understand the problem (we need to guide the school), we have identified the data, and we have imported it. Now for step 2.
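Step 1 in code might look like this; to keep the sketch runnable without the CSV on disk, the same kind of data is read from an inline string (in the notebook the real call is simply pd.read_csv with the path to student-mat.csv):

```python
import io
import pandas as pd

# Inline stand-in for files/student-mat.csv; same idea, only 3 rows.
csv_text = "school,age,higher,G3\nGP,16,yes,12\nGP,17,no,6\nMS,15,yes,10\n"
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())   # quick sanity check: the first (up to five) rows
n_rows = len(data)   # the real file has 395 rows; this stand-in has 3
```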
Explore, visualize, clean. Again, we can't do the visualization part yet, but don't worry about that. Check the data types: this step tells you if some numeric column is not represented as numeric. This is often what we're looking for with dtypes, because if age, for instance, were not an integer, you couldn't use mean and the other mathematical and statistical operations on it. So it's important to know that the data types are as they should be. It might be a bit confusing, because there's something called object and something called int64. In this data set there isn't anything else, but basically: object is most of the time a string (or some similar object), and integers can be int64, int32, and so on; I think they go up to int128 and down to int8, but most of the time it's int64, meaning 64 bits are used to represent the value. If you're familiar with C programming, these are basically the types from C. There can also be floats, which are decimal numbers, but there are none of them in this data set.
Good, then check for null, for missing values. Null is how you represent missing data, and you do the check by calling .isnull().any(). What isnull does is check, for every value, whether it is null, and any summarizes whether any of them were null. I said any summarizes, but you still get a long list, one entry per column. Let's just try, for fun, to remove the any: then you see, for each row, whether each value is False or True. So what any does is summarize over all the rows, so only the columns are left. And you can actually call any again, and then you get one single value: first any over the rows, then any over the columns, and if any column came out True, the result would be True, which is basically what we were asked to do.
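The two validation checks, sketched on a stand-in frame ('higher' is a string column, so its dtype shows as object):

```python
import pandas as pd

# Made-up stand-in; 'higher' is a string column, the others are numeric.
data = pd.DataFrame({
    "age":    [15, 16, 17],          # int64: 64-bit integers
    "G3":     [12, 6, 10],
    "higher": ["yes", "no", "yes"],  # object: usually a string
})

dtypes = data.dtypes

per_cell   = data.isnull()              # True/False for every single value
per_column = data.isnull().any()        # one True/False per column
anywhere   = data.isnull().any().any()  # one True/False for the whole frame
```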
Okay, perfect. Now step 3: prepare and analyze the data. Here it's basically your task to do these things, but for the fun of it I'll help you get started. The first one was to investigate correlation, and remember, we're only interested in G3, and G1 and G2 are already known to be highly correlated, so we're not looking into them. What I want you to see is, let's take a look again: G1 and G2 are highly correlated with G3, meaning values close to one or minus one, but we're not interested in them, because schools already know that. What we're looking for is: is anything else highly correlated? Basically, highly correlated means above 0.7 or 0.8 (people differ on the exact cutoff), and you can see the highest value here is 0.36, so less than 0.4; in general terms that's weakly, not highly, correlated.
Good, but the point is to see that these are only the numeric values, and as we saw in the lesson, there are some non-numeric features that give greater insights, and I want you to explore those. You can get all the feature names, which is what you do in step 3b; I'm preparing you to ask: what are all the column names, all the features? You get that with data.columns, and then you have them all. Remember, in the lesson we used 'higher', which was the question of whether they wanted higher education or not.
Then: investigate the features. Select a feature and calculate the groupby mean. Basically, what I want you to do is data.groupby, and let's take 'school', the first column, just to get started. There were two schools, if you remember; we take the G3 column and compute the mean, and there's actually a difference between the schools: students at one school have higher grades in general (I wouldn't say they're 'better'). What else can we do? Count and standard deviation. With count we see that most students are at the bigger school, and the bigger school also has the better grades on average. Finally, the standard deviation: how big is the spread? The spread is higher at that same school, so there's a bigger variety of grades there; remember the bell curve. What it says is that at the school with more data (we don't actually know it's a bigger school), the grades are more spread out, some higher and some lower, while at the smaller school they're not spread so much. That might be a consequence of having more students, or it might be that the spread between good students and bad students really is bigger at that school.
This step basically means you should try all the different features, first school and then so on: you change the variable and see whether something stands out. That's your goal: you change them until you find something you feel comfortable making a report about. In the lecture we used 'higher', and as you saw, 'higher' split the students into a big group and a small group; that's what we're looking for, a way to divide the people. Just because I used it in the lecture doesn't mean you cannot use it too, but the goal is to explore iteratively, trying all of the features, or some of them, to get an idea of the data.
Down here and i’m not going to do that down here is to decide three features right one like higher maybe you take school maybe you take something else and what you need to think about is three that you can use in number four right so a good idea is to go to step
Four and just think about it right here we need to present findings uh don’t worry about not being perfect you just need to think about you need to be able to sell it is tell a story that that can be backed up by the data right
So the data says that for instance on this hire that we can identify some students immediately i mean we don’t know when they’re asking ask this question if they want a higher or not higher education but when they do we can actually identify whether they’re good or bad in school right because
There’s a big difference on them right and that’s your job right so you say okay if we want to make an impact on these 20 students here how can we do that and then it’s your your job here to make find those three three features and try to write something here
Right and again you can just write it here in the notebook and if you want to change the cell to a markdown where you can just write here write my answer you can see here then you can execute it and it will write my answer and you go
Back here right and if you don’t know markdown yet the basic thing is if you need something new line here here my next answer you can do it by two two spaces no no two a new line here between here because if you don’t have a new line here oh
you'll see that markdown immediately joins them onto the same line. This is not a course about markdown, but that's my advice on how to write it up. It doesn't have to be perfect; just write it down.
And in step 5, actions: what actions do you propose for the school? These are your findings, but you have to formalize them so the school understands what you're saying, and they should be based on the findings in your report. Basically, this is what the school cares about: what action should they take, and how can they evaluate the impact? Again, beyond doing the same measurements again, I would also advise the school to take more surveys among the students, ask them more questions, so there is more data; that's often a normal piece of advice to give.
I hope you enjoyed this one, because the lecture is done, and I hope you learned something new. If it was too easy, don't worry: in the next one we're actually going to dive into visualization. And you might wonder why we start with visualization so fast, here in your data-processing journey, because there's more to learn about step one and maybe also step two. But visualization is the best way to understand big data: when you have a lot of data points, the best way to understand them is to visualize them. And let's get started with that, because this will be a key to your success: understanding how to visualize data to get a fast understanding of it. We as humans, when we see a list of numbers, we don't understand it, we don't know what to look for, it doesn't make sense; but visualized, it makes sense. So are you excited? Let's get started with the next one; I mean, see you there. And if you know somebody who would like this content, please share
it with them, and also like, subscribe, and all those kinds of things for my channel; it helps me grow, and I will make more great content for you in the future. See you there.
What is the best way to get an understanding of the data you're working with? How can you find correlations in the data? How can you find outliers and judge the data quality? How can you present your findings? Wow: that's data visualization. Why? Because if you look at a table full of data, how can you find discrepancies, connections, correlations in the data? You don't. The human mind is just not wired to find those connections. But if you visualize it, you find these patterns immediately; it's a no-brainer, you just look at it and you see it. You find outliers like nothing; it's easy. And finally, there's presenting your findings: people are eager to, and have an easy time with, understanding visual presentations, but they don't understand raw tables of data. They don't. So this is actually one of the main skills of a data scientist: understanding visualization and how it can help you.
Most data scientists in the beginning, me included, when they think of visualization, only think about the findings, presenting the result. That's basically because in the world we live in, we're presented with charts and all kinds of visual things in the media, and that's how we perceive it. But there are two other main aspects. One is finding data quality issues: is there missing data, are there discrepancies in the data, outliers in the data? You can see that immediately in a chart; just visualize it, and you see straight away that there's a problem with the data. The other is data exploration: if you need a connection, you look at a raw table and you don't see it, but you see the connection in a chart immediately: oh wow, there is a connection in the data. Our human brains are wired to understand visual presentations of data, but not data laid out in columns and rows. No, we aren't.
So, are you excited? This is what we're going to do in this lecture. We are going to dive into how to visualize data for these three purposes: one, data quality; two, data exploration; three, data presentation.
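As a tiny taste of what's coming (an assumed minimal sketch of mine using pandas plotting on top of matplotlib, which is a common setup; the course introduces its own plotting tools later): a trend that is invisible in a list of numbers jumps out in a line chart.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs anywhere
import pandas as pd

# As a list these numbers are hard to read; as a chart the upward
# trend is obvious at a glance.
values = pd.Series([2, 3, 5, 4, 8, 13, 12, 21], name="value")

ax = values.plot(title="A pattern you see immediately")
ax.set_xlabel("index")
ax.get_figure().savefig("pattern.png")
```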
And in the end we're going to create a project where we look at the CO2 data set from the World Bank, and we'll try to do some visual things to send a message to the people out there. If you're new to this one: this is the second part of the data science with Python course, and we are focused on learning the data science workflow; we want to add value as fast as possible for our customers. So are you ready to get started? I hope so. Let's jump into the notebook and get started.
Inside here we have data visualization, and we have the five-step data science workflow. If you're new to this, I advise you to watch the first lecture, so you know about this process. It's basically the steps you need to think of as a data scientist. Most people only focus on the two steps in the middle, prepare and analyze, and not so much on the report, the actions, and actually understanding the problem. What a customer wants is useful insights that add value, and a way to measure the impact of them; that's actually the main goal of a data scientist. It's not about the work in the middle; of course that's part of the job, but the value is out at the ends. You might produce the most amazing analysis, but if it doesn't add value for the end users, the customers, it doesn't matter.
Good. So, data visualization: where do we use it? Well, we use it here in prepare, step two, when you explore the data quality and explore the data in general. It can also help you with the analysis, step three, if you want to do data exploration: what kinds of connections are there? And finally in step four, when you are presenting the results. So this time we don't go through the whole
Process step by step we look at data visualization which is focused on these three steps here and the reason we do that is because data visualization is one of the most important things to understand data and the first step you need to do is understand the data and this is
Where visualization comes in good are you ready good so data visualization it’s a key skill today and i have a quote here from google chief economist hal varian what does it say the ability to take data to be able to
Understand it to process it to extract value from it to visualize it to communicate it that’s going to be a hugely important skill in the next decade right so understand it right this is visualization right to process it again visualization can be used to do that extract value from it data visualization again
To visualize it yeah that’s basically what it is right communicate it that’s also data visualization so mastering data visualization is so important with data it cannot be overemphasized and as i said in the introduction data visualization for a data scientist is actually three things most people only focus on or
Understand the data presentation right it is to present the result that’s because we’ve been bombarded with charts and all kinds of stuff in the media and we love to look at them but actually data visualization is way more for a data scientist it’s about understanding the data right
Data quality we’ll come back to these things don’t worry about it but data quality is like missing values remember that values that are stored incorrectly or outliers in the data data quality is a key thing because the worse the quality of your data is the worse the conclusions you make you need to deal
With the data quality you need to do that and we’ll get to that later also in the course how to deal with data quality data exploration right it is to understand the data again if you look at numbers we don’t understand it we need to be able to
See patterns and when we visualize it we do see patterns in the data so it’s easy to get an understanding of it and you’ll be surprised how as we’ll see in a moment so the power of data visualization so consider the following data set and
There’s no data set here yet to talk about so let’s try to import it actually so let’s start by importing pandas and then we’ll make a data set here actually we just call it sample here and we do read csv as we will learn more about
Later so files and sample here sample correlation it’s called and you can use auto completion here and if you look at the data here my question is we don’t know anything about this data can we tell anything about this data
Here we look at this data here we have some x data some y data so it’s probably some coordinate data right and we have some numbers here some numbers are lower some are higher 1.1 1.3 1.1 1.4 and so on and it goes down here right does it tell us anything i don’t know it’s difficult to see right and remember this is just a tiny data set of 20 entries right 20 rows here right 0 to 19.
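To make this first step concrete, here is a sketch of it in code; the lecture’s csv file is not included here, so a synthetic stand-in with the same shape is generated instead:

```python
import numpy as np
import pandas as pd

# In the lecture this comes from read_csv on the sample correlation file
# Synthetic stand-in: 20 rows of x and y values, indexed 0 to 19
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, 20)
sample = pd.DataFrame({"x": x.round(2),
                       "y": (1.1 * x + rng.normal(0, 0.15, 20)).round(2)})

print(sample.head())   # raw numbers - hard to see any pattern by eye
print(sample.shape)    # (20, 2)
```

Staring at the printed table it is hard to say anything about the data, which is exactly the point made next.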
It could be enormous data sets we’re talking about this is just to exemplify it okay so visualizing visualizing the same data let’s try to visualize it right so one of the great things is to use matplotlib and you hear a lot of people talk bad about matplotlib but basically
You need to master matplotlib and it’s integrated well with a lot of other visualization platforms and many of them are built upon matplotlib and matplotlib is tightly integrated with the pandas library that we’re gonna use and everybody uses so you need to
Have some basic understanding of that it’s easy to use and it’s a nice library and if you want to read a bit more about it you can click on the link here and then you can read about it and there are a
Lot of things here and i’m not going to go into details of that good if you want to use it actually you can use it directly from the pandas data frames but i often advise people still to import this one and it consists of two lines
Here import matplotlib pyplot as plt that’s basically the one we are going to use the pyplot module and we import it as plt so we can access it this module here matplotlib pyplot as plt and then i have something written here it’s called matplotlib inline right so
You’re communicating to the notebook here so this is the environment we’re working with how to visualize it and we’re just going to use inline here you can use different other modes i’m not going to go into that inline is the most simple one and for our needs it’s fine
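The two setup lines described here look like this; the %matplotlib inline magic only works inside a notebook, so it is shown as a comment:

```python
# Standard plotting setup used throughout the lecture
import matplotlib.pyplot as plt  # pyplot is the module we use, aliased as plt

# In a Jupyter notebook you would also run the cell magic:
#   %matplotlib inline
# so that figures are rendered directly below the cell
print(plt.__name__)
```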
So what i want to do now is take our sample here and then we need to plot it so sample is a data frame and you call plot and scatter on it and for the first arguments here don’t care about what we’re doing we’ll get into
This a bit later so don’t worry about the details right now so what i’m doing here is i’m plotting the same data that we had here this data here and now we look at the data here suddenly we see something immediately we see somehow the data is
Kind of along the lines here right we have no data up here actually and we have no data down here actually right all the data is basically around this line here right what this tells me is there is a correlation between the data and i can
See that just by looking it takes a split second to realize that there’s some correlation here because there’s no data here and there’s no data here so it doesn’t take a phd degree to realize that there is something because if there were no correlation it would be scattered all
Over the place and it’s not perfect so this is the power of visualization and it could be millions of points and here we only have 20. and you look at this can you tell if there’s correlation or not and this here you immediately see it right perfect so
What is it that data visualization gives us right it lets us absorb information quickly and it improves our insights right so you absorb information you process all this data here and it could be millions of points in a split second and it improves insight right you see here there’s some correlation it
Might not be as clean as this it could be better than this and it could be worse than this and then we can make faster decisions based on that right because now we actually know okay there is a connection here now we know we need to do something about it before we wouldn’t know
Anything about the data especially if we looked at this one here so what can we say about this data here you might say yeah i could find the same conclusion now that or could we right we don’t know because the data set could be way larger than that good
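The scatter plot step can be sketched like this with synthetic correlated data, a stand-in for the lecture’s csv which isn’t reproduced here:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Synthetic stand-in: 20 (x, y) points with a linear relationship plus noise
rng = np.random.default_rng(7)
x = rng.uniform(0, 2, 20)
sample = pd.DataFrame({"x": x, "y": 1.1 * x + rng.normal(0, 0.15, 20)})

# One call turns the opaque table into a picture that reveals the correlation
ax = sample.plot.scatter(x="x", y="y")
```

In a notebook the chart appears below the cell; the points cluster along a line, which is the correlation you spot in a split second.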
Data quality so this is our first main goal is to understand that data quality so is the data quality usable right consider this data set here okay so let’s do that so data pd read csv and we do files sample height and then we take
I often like to just take the head here just to get an idea what the data is so we have some heights here right so you see here check for missing values right so this is what we can do we can use this
isnull combined with any and we can also use isna so what isnull does here you can see there’s actually a link here to the manual here it detects missing values right so that’s what it does
So far so good so let’s try to do that so data isnull any and it says false so what that means is actually there are none of them that are missing right so this is a great thing right so visualize data notice you’ll need
To know something about the data we know that it’s height of humans in centimeters this could be checked with a histogram okay so how do you do that so data plot hist and it’s basically that easy right and i also want to share with you there’s a way to get the docstring
Of things inside jupyter notebook it’s just to be on the method or function and press shift tab and then you get this one and you can expand it by the plus sign here so it draws one histogram of the dataframe’s columns okay perfect and what you see here is actually you
See data immediately here so we know there are no missing values but we also see the height here right so what we see here is human height is in this average a area here but we also see some data points down here actually so this is a frequency count right so we
Actually see that there are some 12 13 maybe humans that have a height down in this range here below 25 it seems like right that doesn’t seem real right this is something wrong and this is centimeters so for americans sorry it’s not in feet and inches but uh
Humans below 25 centimeters are not realistic even for the shortest adults it’s too small right and we have some tall ones here so this actually shows us immediately that there are some down here and one way to go around that is actually to get all the data height
You can actually find how many are less than 50 for instance and get them here right so now we actually get them here so actually you see here there are some heights here down here and what is funny it actually looks like okay this is great right because this actually looks like that
These are actually not typed in wrong maybe they’re just in meters and not centimeters right so 1.91 is 191 centimeters and 1.62 is 162 centimeters right so these could actually be correct numbers they’re just in meters wow so you see here we visualize it we identify some problems
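A sketch of this quality check with made-up heights; the column name and the values are assumptions, since the lecture’s csv isn’t included here:

```python
import pandas as pd

# Made-up data: most heights in centimeters, a couple typed in meters
data = pd.DataFrame({"Height": [182.0, 175.5, 1.91, 168.2, 1.62, 190.1]})

print(data.isnull().any())           # no missing values in this sample
suspect = data[data["Height"] < 50]  # implausibly small -> probably meters
print(suspect)

# Convert the suspected meter values to centimeters instead of discarding them
fixed = data.copy()
mask = fixed["Height"] < 50
fixed.loc[mask, "Height"] *= 100
print(fixed["Height"].tolist())
```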
We get the problems don’t worry about the syntax here because this we will cover later but we get all the data points that are less than 50 centimeters and then we actually see that these are maybe actually correct values they’re just typed in meters and not in
Centimeters and this is one of the things that happens often when you work with data it’s like wow you get data and it comes maybe from different sources and different people typing in the data and some type in meters and some in centimeters and it makes
Discrepancies but maybe this is not bad because we can actually use the values still good another thing could be identifying outliers so let’s try to do that so we use the data set here files sample age let’s try to do that read csv files age
Here and let’s just i like to do the head thing here so we see here we have some ages there it looks uh reasonable it looks in years and there’s no discrepancy from america and europe so americans you can read this age it is in years around the sun right as
One year is for the earth to go around the sun once i think that’s what i learned in school at least this gives fast insights so let’s first describe the data so let’s try to describe it what describe does it make simple statistics of the data frame so let’s
Try to do that so data describe and we see here we have 100 ages here we have the mean value the average value is 42.3 years old the standard deviation is 29 as you remember from the previous lecture we’ll get back to statistics so don’t worry about it but the
Standard deviation is almost 30 that means about 68 percent of the people are within a range of 60 years around the mean right so there are some 30 something percent which are outside we see the minimum value here is 18 the maximum value is 314.
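The describe step and the outlier filter can be sketched with made-up ages, a small stand-in for the lecture’s 100-row file:

```python
import pandas as pd

# Made-up ages with one impossible value, mimicking the lecture's data set
data = pd.DataFrame({"Age": [18, 25, 34, 47, 52, 61, 29, 44, 38, 314]})

stats = data["Age"].describe()   # count, mean, std, min, quartiles, max
print(stats["max"])              # 314 stands out immediately

# Isolate the suspicious entries the same way the lecture does
outliers = data[data["Age"] > 150]
print(outliers)
```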
Can somebody be 314 years old okay so you see here something is wrong right and you have the 25 50 and 75 percent quartiles and again describe here we will get back to that in the statistics lesson so don’t worry about all the details here good so again
We could also visualize this and we could do this with a histogram again so let’s just do that data plot hist and we will probably see something here yeah you see here we have actually it seems like one out here and the rest here right so again we have identified a
Data point which doesn’t look real right so data where data age is greater than 150 let’s see that here right we have one here right which is outside the scope right so easily you can with visuals like this identify
Problems in the data right again i want to emphasize that you need to have some context of the data like this is the age this doesn’t make sense right 314 if it’s the age of a human at least so there you go okay this is exciting now we want to
Explore data now we talked about data quality and now we want to explore data right so data visualization and again it is to absorb information quickly improve insights make faster decisions right so the world bank is a great source of data sets and i must say i just love it because it is
One of my favorite places to find data and it has a lot of interesting data and if that is your interest you can find amazing data about all kinds of things around the world this is amazing good and let’s explore the data set
This one here and i put a link here and this is a co2 emission metric ton per capita right and you can see how it has been evolving around the world so back in the 60s it was down here and then it’s been growing rapidly in the 70s and i guess oil
Crisis came along here i don’t know and we kept it on here and in the 90s we actually got back and then it’s been growing again good so far so good let’s go back to the lesson here so uh we haven’t talked about how to get
Data so right now i just prepared the data in this csv file here so let’s try to use that one so let’s say data pd read csv and we take files and we take the world bank file here perfect and basically i want to show you one thing here
It is if i read the data here you actually get the year out here as a column by itself but basically you want to use that as an index and you can do that again we come back to these things later so don’t worry too much about it you can write
index_col equals zero and you see here then this year column came down here as the index instead so now we have the year as index and that’s very convenient so a simple plot so now we’re going through a lot of things here so how to work with
It right so how to create a simple plot it’s just basically called plot so data for now we have a lot of countries along this axis here actually we have 266 so we don’t want all that but we have one which is called usa right so
Let’s just focus on the data of usa and let’s do a plot here right and what you see here is a plot here along the axis here so we start around here and then it goes up and the usa has been really really great in reducing the co2 per capita here as
You can see good but if you explore this one here you can actually see well we have years here but we don’t know anything about this one here so basically you can also add labels to them so let’s try to do that so actually it’s just taking this one here add
Adding here and say title and we say co2 per capita in usa right and then you see immediately you get a title here but you can also label the axes here you have an xlabel which is already here and a ylabel let me say co2
Per capita so you get that out here right you can also overwrite the xlabel but we don’t need that we need the year down there okay perfect sometimes you also want to represent a better axis i don’t know
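The title and axis label steps can be sketched like this; the numbers are made up for illustration, not real World Bank figures:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Made-up CO2-per-capita series with the year as index (as after index_col=0)
data = pd.DataFrame({"USA": [20.0, 19.3, 17.4, 16.0, 14.8]},
                    index=[2000, 2005, 2010, 2015, 2017])

ax = data["USA"].plot(title="CO2 per capita in USA")
ax.set_ylabel("CO2 per capita")  # the x axis already shows the year index
```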
You see this one starts on 15 and it goes up to 22 maybe it doesn’t really give a picture of what you want to tell right so you can also add ranges here right so you can say xlim or ylim right and
If you only give one end it sets that end of the range right so let’s set the ylim to start at zero here so what happens here is actually you get a chart here where you start from zero here so you get kind of a perspective of where we are at in the chart
Here right we’re up here and it goes down because often if you only have this chart here it looks wow we had so much progress going down with co2 per capita but if you look at it from this perspective you say yeah we are in the right direction but it’s still in
Perspective of what we are using right so if the goal is to get to zero at some point well you can see there’s still some work to get there you can also compare data inside here so here we have one chart here a simple way is
Actually to do a comparison here so let’s do that so inside here where we chose our data you can actually make a list so usa and the world and then we just make a plot here so
Let’s do that so here we get it right so here we see the world compared to the usa right so you see actually on average the world co2 per capita is all the way down here so let’s try to set the ylim here to zero
Now we actually see that this one here is all the way down here and you see the world population on average emits way way way less than the american again if you didn’t have the ylim you wouldn’t really get the perspective
Because without the ylim here you see this chart here starts on 2.5 so here it looks even more extreme but that’s not fair so you need to have the ylim here to be zero in order to represent it fairly here so again when you’re selling things
Your knowledge you need to be credible about that often you see charts where they don’t scale it like this and you get a different story right you see it gives different stories also comparing the us to the world on average yeah it makes a big difference of course
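The comparison with an anchored axis can be sketched like this, again with made-up numbers rather than real World Bank data:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Made-up figures for illustration only
data = pd.DataFrame(
    {"USA": [20.0, 19.3, 17.4, 16.0, 14.8],
     "World": [4.0, 4.2, 4.8, 4.9, 4.6]},
    index=[2000, 2005, 2010, 2015, 2017],
)

# A list of columns plots one line per country; ylim=(0, None) anchors the
# y axis at zero so the comparison is honest, figsize stretches the figure
ax = data[["USA", "World"]].plot(figsize=(20, 6), ylim=(0, None))
```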
You might complain and say yeah yeah but that’s not fair comparing american with everybody in the world because half the world population doesn’t have the same equipment that u.s citizen does so it’s not a fair comparison yeah i know i’m not trying to judge anything there i’m
Just saying it tells a different story and again as a storyteller you need to be able to communicate your message right and you need to do it with credibility always remember that another thing that you often might want is something like well you want to
Have the size of the figure right so the figsize here and i’m actually a bit curious here in denmark i live in denmark if you don’t know so let’s add denmark instead of the world there and then say figsize to be 20 by 6 okay so what i’m getting here is actually you see we stretch out the figure size here because that’s sometimes what we want right so now here we see we have the us co2 per capita and the danish co2 per capita and you see here both countries
Are going in the right direction the u.s citizen has a significantly higher co2 footprint than the danish citizen has and let’s just compare to the world just to have that we still see the world is way better than both of them right but actually the distance from the
World population to the danish citizen is way smaller than the distance to the american so denmark is like america a modern society with cars and all that kind of stuff that pollutes so yes there is a difference but so far so good we are not in a political fight here so
Let’s try to do a bar plot with this data so let’s go back to usa again and let’s say plot bar and actually i want the figsize here to be 20 by 6 again let’s just do that so what i’m getting here is you can make
A bar plot like this this is often very useful because then you can see all the numbers like this it’s a nice presentation and you’ll use that often right and you can do the same thing actually with multiple you can do the same here as we did with the other one here
You can add a list in here with world and then you get the same chart here and as you see down here you get multiple bars in our bar plot here amazing right perfect and you can actually also plot data in ranges so i think we just take
This one here and what you want to do is actually take a location here and what do you want to do you do for instance from year 2000 and forward so here it actually pays off to have the years here right because now we have from year 2000
Inside here so remember at the top where we did the index set index_col to zero because then we have the years over here so we can do these nice things with the location here and again we don’t know exactly all these things what i want you to focus on
Is what kind of charts and plots you can make from a data frame and this is why we love them because you can do a lot of things with ease good histograms so we already tried to do one of those histograms here so let’s try to do that
In data usa and we do plot hist and let’s just do the figsize 20 by 6 and we can set the number of bins i think we can actually find the default by looking at the docstring here sometimes you see bins here 10 is the default right
So here we have it right so what are we looking at we are looking at a histogram of the frequency of years where the co2 per capita was in each range up here so you can see how it is and you can make the
Bins more fine-grained 20 for instance you see there are some that are empty when you do this fine-grained so maybe 20 i mean it’s a balance right you can also do fewer like seven and then you see a pattern like this
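The bins experiment can be sketched like this, with made-up values:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Made-up yearly CO2-per-capita values for one country
values = pd.Series([20.0, 19.8, 19.3, 18.9, 18.1, 17.4,
                    17.0, 16.5, 16.0, 15.1, 14.8])

# bins controls the granularity: 10 is the default, fewer bins show a
# coarser pattern, more bins a finer (and possibly gappier) one
ax = values.plot.hist(bins=5)
```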
Good so far so good and another thing that you can actually do is make a pie chart and i actually don’t want to use this data for that so let’s make some fictional data here so let’s make a pd series and don’t
Worry about what we are doing right now three five seven comma index and this is just to create some data data one data two data three okay so this is a series and let’s just plot it as a pie here and you see here
What it does is actually we make a series with the numbers three five and seven and then we have data one data two data three as the index and then we make a pie right so data one is the smallest with three data two is
The second largest and data three is the largest one of them right so it’s represented like this and you can actually do a lot of things with a pie plot here you can for instance let’s try to look at data usa so what we’re trying to for instance
Look at is what is less than 17.5 right what you get is a long list of true false and so on and you can convert that to a series and then you apply value counts don’t worry about all the details here value counts counts these and then you actually get oh Here you actually get the numbers here and if you want to represent that as a pie chart you can do that by plot pie don’t worry about how complex this is right you see here and what you also can do inside this pie you can see the colors list right
Colors so you can say i want it to be red and green right and then you notice that the labels are true and false they don’t make sense so let’s set labels here and we take
This is greater than 17.5 and this is less than 17.5 and then you have these here and then you can have a title i don’t know we can have co2 per capita and And then we have that and we can actually have this one here set the percentages in the chart right so if you want to have it more accurate here we can add that to this one here right again so i don’t remember
All these by heart here that’s why i wrote them up here but it’s a great way to make some interesting figures or charts here that show something and you can play around with it right so you can see that in most years if the goal was to be less than 17.5
I don’t know if the goal is that but you can see how many years were below that that’s what we’re calculating up here in the value counts here right so again that’s an amazing way to play around with that okay so a great thing and the one you
Will use all the time is a scatter plot and let’s just try to investigate that so data pd read csv files co2 per capita perfect and we put the index_col to zero and again why do i do that so head so let’s just look here so we
See here we actually have the co2 per capita and the gdp per capita so the data set here is representing the co2 per capita and gdp per capita because one hypothesis could be well we want to investigate if the gdp per capita and co2 per capita are
Correlated right so how can we do that right so remember the scatter plot let’s try to do the scatter plot right data plot scatter oh that didn’t work we need the arguments sorry that’s actually what it says there right so we need this one here
So we need x equals and y equals here we go right so is there a correlation or not it could look like it but also the spread is getting bigger and bigger right the spread is getting bigger and bigger wow amazing and
We actually know that we can calculate that with data corr to see if it’s strong and remember when i said it has to be above 0.7 or 0.8 to be a strong correlation here it’s 0.63 this is what 0.63 looks like you see down here it seems to be more
But it’s also a smaller scale right and then it spreads out and out there are actually countries with high gdp or medium gdp per capita and high co2 right so the spread is big up here good so far so good exciting right so what we want to do now is go into data
Presentation right so the message here is assume we want to give a picture of how u.s co2 per capita compares to the rest of the world so let’s take 2017 as more recent data is incomplete what is the mean max and so on so let’s
Just read the data here again so we have it right so remember the data here take head and we see here this was the data we have the years here and so forth we have the countries along that axis here and let’s see the data in 2017 so you do
That by this trick here and describe it what does it tell us i don’t know actually we don’t know we know we have 239 countries we know that the mean is here the standard deviation we have the max here right so let’s look at what it is in the usa so data loc with the
Year and usa okay perfect so it’s 14 but again we know the mean is here and so on it’s difficult to tell a story by that but i want to show you that you can actually do that with data visualization and i’m not going to do that
I’m not going to do that from scratch here i put it in here but what is it we want to see that the u.s is above the mean the u.s is not the max and that it’s above the 75th percentile right look at the data here we have that 75 percent are below 6.1
So that means that 14 is way beyond that but how can we tell a story with data like that and this is actually what we’re going to do here so what i’m doing here i’m actually making a histogram with bins 15 bins and face color green
And then i put some label per capita and i put a number of countries on the labels and then i mark usa at a specific point and put the text and put an arrow and here we’ll see the result here so what we are doing here is actually we
Made this the count of the number of countries right the number of countries that have a co2 per capita in this bucket here and in this bucket and you see here there are actually the most countries in the smaller buckets and it goes less and less and less this is not surprising
Right we have one above 30 remember but we need to show where the usa is and it’s all the way over here so we actually get a picture that usa actually has one of the largest co2 per capita values in the world it’s above the 75th percentile as we know
And here we actually see that usa is here on this chart here right so it tells a story the majority is way up here and we actually have a few countries above 10 as it looks right actually we know that six point something is the 75th percentile but this picture itself
Tells a story co2 per capita the americans are here this is the number of countries in the co2 per capita buckets right wow so with that said i also advise you to understand that this is not the end of data visualization i advise you actually to be creative when making stories
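The presentation chart just described, a histogram of countries with a marker showing where the usa sits, can be sketched like this; the distribution is synthetic and 14 is the value read off for the usa in the lecture:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Made-up distribution of countries' CO2 per capita (not the real 2017 data)
rng = np.random.default_rng(3)
per_capita = rng.gamma(shape=1.5, scale=3.0, size=239)
usa_value = 14.0

fig, ax = plt.subplots()
ax.hist(per_capita, bins=15, facecolor="green")
ax.set_xlabel("CO2 per capita")
ax.set_ylabel("Number of countries")
# Mark where the USA sits with a line and an arrow so the message is obvious
ax.axvline(usa_value, color="red")
ax.annotate("USA", xy=(usa_value, 5), xytext=(usa_value + 3, 20),
            arrowprops={"arrowstyle": "->"})
```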
And for that actually there’s an awesome video and i’m not gonna show it here but i advise you to play this one here and that will show you a master of data visualization what you can do how you can tell a story it is blowing my mind how nice you can do stuff
How important it is to get this message out and what it is you want to tell right this is a masterpiece so i advise you to watch this one it’s amazing and after that we’ll dive into the project and i’m going to introduce it again and then i’m
Going to show you one way to solve it and again why do we do projects it’s in order to use the things that we have been learning and what are we gonna look at we’re going to look at this co2 per capita and we’re going to make a project and try to
Explore the data set and figure out what message you want to send and try to do that in our report okay are you excited i hope so let’s get to the project so in this one we’re gonna explore the co2 per capita data set so let’s get
Started and see what the project is and again here i described the project and after that you should try it on your own this is where you learn when you’re trying to do the things we just learned and apply them in real life and don’t get fooled by yourself and say i
Remember everything because often the details are coming when you try it on your own so let’s get started in this project again we have focus on the full data science workflow and visualization is mainly in the middle of these three steps here so let’s get rolling
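As a recap, the workflow the project walks through (load with the year as index, check and drop missing data, compute a quantity to report) can be sketched on synthetic data; the file layout and country codes here are assumptions:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the World Bank csv: years as index, countries as columns
data = pd.DataFrame(
    {"USA": [19.0, 17.4, 14.8],
     "DNK": [9.5, 8.3, 5.9],
     "XYZ": [np.nan, 2.0, 2.2]},  # one country with a missing entry
    index=[1998, 2008, 2018],
)

print(data.shape)                          # step 1: size of the data
print(data.isnull().any())                 # step 2: where data is missing
data_clean = data.dropna(axis="columns")   # drop columns with missing values
print(len(data_clean.columns))             # columns left after cleaning

# step 3: relative change from 1998 to 2018 for every remaining country
change = (data_clean.loc[2018] - data_clean.loc[1998]) / data_clean.loc[1998]
print(change.round(2))
```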
So the goal of this project is to explore how data and visualization can help present findings with a message we will explore the co2 per capita data set it will be your task to decide what kind of message you want the receiver to get note we still have
Limited skills hence we must limit our ambitions in our analysis okay just to put it there so step one and again often you explore the problem the problem is to give a message to the people about co2 per capita and then you need to import the data and
To do that we need to import the libraries pandas and matplotlib we will use those and then we read the data so we have the read csv and the file is here if you downloaded everything you should have done that and remember to assign it
To a variable data right and use index_col equals zero as argument to set the index column and apply head on the data to see that it is as expected so you do that below here if you need more cells down here you can always add them
With a plus sign up on this one here good size of data it’s always a good idea to see how big the data is you can see how many columns and the columns represent countries and rows years right so you can apply shape and that’s also a new thing for you
And you’ll get the data to see how many rows and columns there are that’s basically what you need to do in step one step two is preparation explore visualize and clean the data so check the data types right we do that with dtypes remember this
Will tell you if some numeric column is not represented as numeric so we do that by dtypes we don’t know what to do with it if it’s not but it’s good practice to do these things and then we get a new overview by using dot info
It’s pretty nice good so you do that in the cells down there then check for missing values right remember data is often missing entries there can be many reasons for this we need to deal with that we’ll do that later in the course right use isnull any right
This is expected but we need to be aware of it right so even though it says something is missing we are not going to do anything about it right now visualize numbers of missing data points so to get an idea of the magnitude of the problem uh you can visualize the
Numbers of missing rows of each country right so isn’t identifies the missing values is now some counts the number of missing values per country is now some plot hist plus how many countries have missing values in the ranges right you’ll see it so there’ll be like x
Countries that are missing 10 values and so on clean the data a simple way to clean data is to remove columns with missing data right so we have drop now axis columns to remove columns with missing data so you can do that it removes them basically and later in
The course we will learn that this is not the ideal way to do it but often when you do some research here it doesn’t really matter sometimes you just need to drop the missing data to have meaningful data check how many columns are left so you can do that by apply len
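A minimal sketch of those checks and the column-dropping cleanup, on a small made-up frame (the real values come from the CSV):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; column B has one missing entry.
data = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [4.0, np.nan, 6.0],
    "C": [7.0, 8.0, 9.0],
})

print(data.dtypes)           # are all columns numeric?
print(data.isnull().any())   # which columns have missing values?
print(data.isnull().sum())   # how many missing values per column?

# Drop every column that contains at least one missing value.
clean_data = data.dropna(axis="columns")
print(len(clean_data.columns))   # how many columns are left
```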
Good. Then we need to analyze the data. Step three is feature selection, model selection, and analysis, but we don't have many feature and model selection tools yet, so we keep it to simple per-country data: calculate the change in CO2 per capita from 1998 to 2018. Hint: the formula is the value in 2018 minus the value in 1998, divided by the value in 1998; that's the relative change, and we can calculate it for all countries simultaneously as one row operation. This assumes you have the cleaned data from above, so you might have applied that to a data frame called data_clean.

Then describe the data: a great way to understand data is to apply describe, so try data.describe() and see how it helps you understand the data better.

Next, visualization. We start with a histogram, and then we make a pie chart of the values below zero, over the reduced set of countries. Amazing. Play around with that and report the percentages.

Then present a chart. The goal here is to present your message: visualize one chart, add a headline title, and give the audience a message. That's basically what you want to do; maybe it's a question to people, something that hooks their attention.
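Putting the analysis step together on a tiny made-up frame (years as the index, countries as columns; the numbers are invented, so only the mechanics carry over to the real data):

```python
import pandas as pd

# Made-up CO2-per-capita values for three countries in two years.
data_clean = pd.DataFrame(
    {"Denmark": [8.0, 5.7], "USA": [20.0, 15.2], "India": [0.9, 1.8]},
    index=[1998, 2018],
)

# Relative change per country: (value in 2018 - value in 1998) / value in 1998.
change = (data_clean.loc[2018] - data_clean.loc[1998]) / data_clean.loc[1998]
print(change.describe())   # summary statistics of the changes

# Count how many countries decreased (change below zero) versus not.
counts = (change < 0).value_counts()
print(counts)
# counts.plot.pie() would turn these counts into the project's pie chart.
```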
There is also an optional part (and yes, there's a misspelling up there, sorry). Present another chart: you can make a supporting chart or dig deeper into the data. Does the first chart give the true picture of the situation? Ideas: look at the last 10 years at least; are many countries close to zero? Then actions: can you propose some actions based on this? Again, this is your message to the people. And finally, propose how to measure the impact of those actions. Try to figure it out; this is a good way to work, because it will help you become a data scientist.

Good, so the next thing now is to stop playing this video and try it yourself. If you get stuck, don't worry, I'll help you in the next part; see how far you get, and when you get unstuck, try yourself again. If you get stuck again, play the video again and see how I would do it. See you in a moment.

So, did you try it out yourself? Well, you should have. And if you are stuck, let's try to do it together. I already introduced the project, so I'm going to dive directly into it and we'll explore it together. Let's jump to the Jupyter
Notebook and get started. We have the project here, and I talked about all of it already, so let's just jump in.

The first cell just imports the libraries, so just press Shift+Enter. Let's do that. Did you manage it? Perfect, I like you. Then we need to read the data, and we assign it to data as suggested: pd.read_csv. Again, we haven't dived into all of these things yet, so if it's still a bit difficult to understand, don't worry about it. How did I type that so fast? It's auto-completion. And this is index_col=0, which is basically what the instructions say: read the data, assign it to data, index_col=0, and then something afterwards, but let's take it one step at a time. Apply head on the data to see if it is as expected: data.head(). Let's do that, and here we have it; it looks like what we expect.

The size of the data: in step 1c it says the columns represent countries and the rows years, so that's what we know. Apply shape and see what we can make of it: data.shape. Notice there are no parentheses here, because shape is not a method we call, it is just an attribute. It says 59, and what is the 59? The years of data we have, so 59 different years, and then we have 266 countries along the other axis. It's the same number as before; the only difference is that head showed 5, not 59, because it only displays the first five rows. Perfect.
Step two: check the data types. This step tells you if some numeric column is not represented as numeric, so let's do that with dtypes: data.dtypes. Again, you don't see the full picture here because the output is truncated, but basically you're looking for anything not called float in this case, because you want them all numeric. You don't see all 260-odd columns in this view, but don't worry, they're all correct here. Let's try data.info() instead. What it says is that we have a DataFrame with 59 entries, from 1960 to 2018, and that the data types are all floats. So dtypes didn't tell it all, but info down here does. It also says something about memory usage, which doesn't really matter; these data are so small that it doesn't matter.

Step 2b: check for missing values. Let's do that: data.isnull().any(), boom. Here you actually see that you have missing values in many places. It doesn't really matter right now, because all I want you to notice is that data is missing, and we don't really know how to deal with it yet; we'll get a better understanding of missing values later. So let's visualize it. How do we do that? Good: data.isnull() was the frame of all the Falses and Trues, and we took any() of them to see if something was missing. Now take the sum instead: data.isnull().sum() gives, for each country, the number of missing entries. Then we plot it as a histogram to see the ranges: data.isnull().sum().plot.hist(). You see that by far most countries are in the first bucket, missing only a few values (we have around 200 there), but a few countries are missing a lot of values.
And that's expected, I would say. Good, so now we have an idea of how much data is missing: basically not that much, just a few countries missing something. A simple way to clean the data is to remove the columns with missing data, so let's do that. First, data.dropna(axis='columns'), and again, we'll get back to all of this later; let's just assign it to clean_data. So now we have clean_data, with every column that had missing data removed. Now take len(data.columns) and len(clean_data.columns): we started out with 266 countries and have 191 countries left with all their data complete. This is not the advised way to clean data, but sometimes it's just good enough for a pre-analysis; we'll get back to cleaning data properly later in the course, so if you're excited about that, stay tuned on this channel, it will come to you in the future, or it might be released already.

Step three: analyze. This is about feature selection, model selection, and analyzing the data, and we don't have much
of that machinery yet, so let's look at what we are doing here: calculate the change in CO2 per capita from 1998 to 2018. We have the formula, and I actually wrote the cell using the name data_clean, not clean_data, so you see a mistake there. What I can do is reassign data_clean to be clean_data, and then we can actually calculate it. Boom, you see we get the change. What are we getting? The value in 2018 minus the value in 1998, divided by the value in 1998. So if the 1998 value is greater, we have a decline, a negative value; that's what we're looking for. Let's just keep the result in a variable to plot later.

A great way to understand data is to call describe on it. So here we have it: we still have 191 countries, the mean value is actually an increase in CO2 per capita, and the standard deviation is quite large, at least compared to the mean. The minimum is the great one, the country that declined the most; an increase is bad, so at the other end there was one big sinner. But you also have to understand that sometimes the data points are not perfectly clean; we need to understand the data. Is there something this country didn't measure 20 years ago that they do now? Because this is a 20-year difference, there might be reasons why the extreme value is so extreme, so don't judge by the first look at it.

Now visualize the data. We start with a histogram, .plot.hist(bins=30) on the changes. What does this tell us? It tells us that most countries are around the same range they were 20 years ago, but more countries have been increasing than decreasing: either you decreased a tiny bit or you increased a lot. And then there's this discrepancy
out at the far end, seemingly one single country. It might be that they measure things differently or something like that, so don't get too deep into the details of it.

Then try a pie chart of the values below zero. Remember value_counts? What it tells you here is how many countries decreased, that is, how many are below zero. Apply value_counts to the below-zero condition, and you see 68 countries did decrease and 123 didn't. And remember, we excluded a lot of countries earlier, so this is not an entirely accurate picture. So how can we make a chart of that? We can just copy that expression and append the plotting call (and actually we should count on the cleaned change data; I wanted to leave that part for you to enjoy).

Here we go. So what am I looking at? We are looking for the message. What we want to say here is: if we need to decrease from our level of 20 years ago, well, the majority of countries are not doing it. That could be one title; it's our finding, at least. Now we need to present these findings, and how can you do that? There are many ways: you need to make a title, you need to do all that kind of stuff. So let's copy the chart and set a title, 'Countries with decrease in CO2 per capita', and it's actually not 10 years as I was saying, it's 20 years, so let's make it '20 years of CO2 per capita development'. There we have it. This is just one idea of what you can do: it shows a story, what is below zero and what is above, with a title. I think you should play around with that.
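A sketch of that presentation step. The counts (68 decreasing, 123 not) are the ones found above, hard-coded so the example stands alone; the Agg backend line is only needed when running outside a notebook, and this assumes matplotlib is installed.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; skip this line in a notebook
import pandas as pd

# The two counts from the analysis, hard-coded for illustration.
counts = pd.Series({"decreased": 68, "did not decrease": 123})

# One chart plus one headline: that's the message.
ax = counts.plot.pie(title="20 years of CO2 per capita development")
print(ax.get_title())
```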
Another idea is to add an additional chart, because if you have one number, is it good or bad? This was the last 20 years; what about the last 10 years? Many countries are close to zero, maybe just above zero; how many countries are actually moving in the right direction, and how many are not? There are many questions you can ask, and a supporting chart adds credibility. The first chart might be shocking in one sense, but if you look at the histogram above, it seems most countries are just above zero, so maybe they're extremely close, and it's actually quite good that most countries are around the same level; maybe they had an increase and then a decrease. So again, what would a chart of the last 10 years tell you?

And now for the actions: what actions do you want to propose? Again, I don't know you and I don't know your perspective. Maybe you think CO2 is a great thing for the world economy, and maybe it is in the short term, but in the long term I think most scientists agree it's bad. Is there something you want to tell people with this? And again, this is powerful: you did the research yourself, so you have insight into it now. What is your message? What do you want to tell people? And once you've told people, how can you measure the impact? Most likely by measuring the same thing again. Is that possible? Well, I don't know; we'll see. Good.
I hope you enjoyed this lesson. In the next one we're going to learn about pandas DataFrames, because they're such a useful tool to work with; in fact this is basically the only tool for data representation that we use in this course, because it does everything so nicely integrated. So there's nothing to cry about: pandas DataFrames are the key thing to your success. If you are ready, you should continue to the next lecture; there should be a link up here somewhere, unless this is the first week. And if you liked this one, please help me spread the message: share it with somebody who could benefit from it, and like and subscribe and all those things. It helps me grow my channel, and it motivates me to share more with you. The feedback I've got so far is so amazing; I just love hearing from you guys, and I love how I can help you. I'm getting better and better at this, and I hope you enjoy it. See you in the future. Bye.

As a data scientist, how do you work with data? What do you actually do? I mean, how do
you model, and how do you integrate data with different models? This is where a data structure comes into the picture. You have the data on disk, in a CSV file, a database, or some big data store, and now you need to do something with it. What you use for that is a data structure, and pandas DataFrames are the most versatile, best integrated data structure out there. As a data scientist you can do practically everything from pandas DataFrames, and you need to know them by heart, because honestly you can do everything you want with them. Pandas is an enormous framework around the DataFrame; you don't need to know everything, but you need to know the basics, and that's what we're going to cover here.

So in this lesson we cover what you need to know as a data scientist, and along the way in the course we will learn a bit more, because it doesn't make sense to learn everything in one step. But you need to understand what a DataFrame is, how to use it, the basics of it, and so on. Are you excited about that?

If you're new here, this is a 15-part course on data science with Python. It's an amazing journey, and this is the third part, where we learn a bit about the data structures used as a data scientist. There's a link down below; click it and download everything. The notebooks are there, everything is prepared, so just go there and get started. Are you
ready? Let's jump to the Jupyter Notebook. Again, this is pandas for data science. We need to understand pandas because it's the main building block that glues everything together: it connects the data with the models, it connects data with everything, and data science is about data; pandas DataFrames represent your data.

Just to have a quick view of the workflow: you need to have the data somewhere along this pipeline, and pandas DataFrames are ideal because they integrate with everything along this axis. So there's no question about it, you need to understand pandas DataFrames. Are there other alternatives? Yes, there are, but pandas DataFrames are by far the most used, I would say. The only limitation I see is when you have enormous datasets; then you need to use something else, but other than that they are used for everything.

So what is pandas? When you are working with tabular data, that means spreadsheets and databases, things with rows of data and columns of features, pandas is the right tool. Pandas makes it easy to acquire, explore, clean, process, analyze, and visualize your data, and that basically covers the full data science workflow (I wrote 'process' on the slide; it should say 'workflow'). Pandas is a large tool, but also complex: pandas can do almost everything with data. If you can do it in Excel, you can do it in pandas, and you can do way more than you can do in Excel. Almost everybody who works with data has worked with Excel, and if you can do it there, you can do it in
pandas, and you can do even more. Pandas has a great cheat sheet that I advise you to download as well; it covers a lot of the things you need in order to work with it. It can be a bit scary at first to look at all those commands, but don't worry about it, it gets easier with time and practice, and that's what we're going to do here. Pandas also has great tutorials on its official page: what kind of data it handles, everything described in really nice tutorials, and you can get started with everything there. These are great resources, so if you want to dive deeper into pandas, I advise you to look at those tutorials. Here I'm just going to cover the essentials of DataFrames, the main data structure of pandas, and how to work with data; that's what we cover in this lesson.

Later in the course we also cover how pandas can get data from various sources. This is basically why I love it so much (and as you see, there are some typos once in a while on the slides): pandas does everything, web scraping, databases, CSV, Parquet, Excel files, and so on. It can just do everything for you, it's amazing. How to combine data from different sources, you can do that too, and how to deal with missing data; pandas has great tutorials on that in its own manual, but we're also going to cover it. Because it's used by so many people, the documentation is enormously good, and you can find help for everything on the internet.
If you use something that is not so well known, how are you going to find help? You need to use something that is used by many, because that makes it easier to get help.

So, how to get started: pandas is installed by default with Anaconda, which is what our Jupyter Notebooks use, so if you're using Jupyter Notebooks and installed Anaconda, everything is there. In other environments you can install it with pip install pandas in a terminal. To access pandas you need to import it, and the standard is import pandas as pd. The pd is just a shorthand so you don't have to write pandas every time. In theory you could write something other than pd, but everybody uses pd, so do the same; don't be a stranger. Remember, coding is about readability, not about being creative. I take that back: coding is about being creative, but not creative with your naming.

So what is pandas? Pandas is like an Excel sheet, just better. I told you, you will love learning pandas. Let's play with some data.
So what is a CSV file? I have a lecture on CSV files, basically a YouTube video, and I advise you to look at it if you want. I'm not going to play it here, but look at it, play with it, understand what a CSV file is: it's a comma-separated file, and if that doesn't make sense, you should see that lecture and you'll be fine.

Then we have read_csv, and again we have the documentation. I agree with you that this is by far some of the most complex documentation; there are so many parameters. But in all the documentation you can see how to use it further down, and the base case is just reading a CSV file. Often there are some arguments you need, and I've put them here: the file name, parse_dates, and index_col; those are the ones we want to use.

So let's try to read some data: data = pd.read_csv(...) with the Apple stock file from the files folder, and in this first attempt I don't pass any extra arguments; let's just call head() immediately after. What I want you to realize is that we have some dates here, and when we don't pass parse_dates=True, these are not actually dates; they're represented as strings. So what you need to do is pass parse_dates=True. You don't see any difference now, but the computer does: they're represented as dates now, and this makes it easier to manipulate things afterwards, and you'll see why.

index_col is the other one: you can see we now have an index column, 0, 1 and so on, which pandas creates by default, but maybe we want the dates to be the index. So we can just pass index_col=0, and you see the date moves down into the index; now we can index by time, which makes things really, really nice, and we'll come back to that later.

What we are seeing here is the stock price for Apple: the daily high, the daily low, the open, the close, the volume, and the adjusted close. I'm not going to go into what these mean specifically; it's just a dataset. I actually did this up top already, where it says: always check the data with head, which prints the first five lines. So let's do that here again. It's always a great idea when you read some data to check: is it as we expected, or are we seeing something different? Because if you work with the data afterwards and assume it is in some format, and you assume wrong, then you're doing wrong work. So these are the basic three steps in the beginning: import, read the data, check the data.
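As a self-contained sketch of those three steps (the two data rows below are made up, but the column layout matches the Apple stock file discussed here):

```python
import io
import pandas as pd

# Stand-in for the Apple stock CSV; values invented for illustration.
csv_text = """Date,High,Low,Open,Close,Volume,Adj Close
2020-01-02,75.15,73.80,74.06,75.09,135480400,74.33
2020-01-03,75.14,74.13,74.29,74.36,146322800,73.61
"""

# parse_dates=True makes the index real datetime objects instead of strings;
# index_col=0 makes the Date column the index.
data = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col=0)

print(data.head())        # always check the data after reading it
print(data.index.dtype)   # datetime64[ns], not plain strings
```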
Then we have the index: data.index, let's just explore it. What we have here is the index, the dates. Again you don't visually see much, but you see something interesting: the dtype is datetime64. So it tells you this is a datetime object and not a string; if we had not passed parse_dates=True, it would have been strings. And why does it matter? It matters because now you can use it as dates: you can say, I want all the data from 2020, or all the data from 2021, or all the data from February 2020, and so on. If you hadn't parsed it, pandas wouldn't know what it meant.

Another thing you have is data.columns: those are the column names you saw up there, High, Low, Open, Close, Volume, Adj Close, excluding the index.

Each column has a data type, and this is a bit different from Excel sheets, where individual cells don't need to agree; a pandas DataFrame needs to have exactly one data type per column. If you write data.dtypes you see float64 for each one: High has one data type, Low has one data type, and each needs to have exactly one. This is also why we were checking for missing values and so on: if you have one column where everything is numeric except one entry that is a string, say 'missing data' in the Open column, then everything in that column will be represented as strings, because it cannot represent the column as floats anymore. That's important, so you want to ensure, all the time, that the data has the right data types.

And the size of the data: you can use len(data), which shows we have 472 rows, and you can also take data.shape, which gives 472 rows and six columns. We have six columns, one two three four five six; perfect, it matches.
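The one-dtype-per-column point above can be seen directly. In this sketch, a made-up two-column CSV has the string 'missing' where a number should be, and the whole column falls back to the generic object (string) dtype:

```python
import io
import pandas as pd

# One non-numeric entry in Open drags the whole column to 'object',
# while Close stays float64.
csv_text = "Open,Close\n74.06,75.09\nmissing,74.36\n"
data = pd.read_csv(io.StringIO(csv_text))

print(data.dtypes)   # Open: object, Close: float64
```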
So another thing is slicing of data, and what does that mean? Well, first you can select a single column of data, and it becomes a Series; that's actually another data structure that pandas has. If it's Close, data['Close'], you get a Series (are you serious? yes, I'm serious), and you see it still has the index, then the values of Close, and one data type, float, which is the data type of that column inside the DataFrame. So that's how you select one column.

But you can also select multiple columns with specific names, so let's do that below: data[['Close', 'Open']] (and oh, I typed it wrong there the first time). Execute that and you get Close and Open as columns, represented as a DataFrame again, which looks a bit nicer than the Series. We're still keeping the index, and we have 472 rows and two columns. Perfect.

Then you can also select rows between two dates, including the dates, with loc: data.loc['2021-05-01':'2021-05-15']. I said including, but the 15th isn't shown, and that's because it was a weekend; this dataset doesn't have any numbers on weekends. That's why nothing changed when I put 14 instead: there's no data on the 15th. But the point is that both endpoints are included. And the first one shown isn't the 1st either, it's the 3rd; if I change the start to the 3rd it's the same, because the 1st and 2nd of May 2021 were apparently a weekend. I didn't know that by heart, but now we know; we learned something together. I hope you enjoyed that.

Good, let's continue our journey. Another funny thing about loc is that you can get the data for a specific month: say you want May, you can just do data.loc['2021-05'], and you get all the data from May. Again, this works because pandas knows it is a date; this is why we needed to parse it as a date.

Sometimes when you're working with data you don't know what is inside the index, and then you can use iloc, which indexes by integer position. So let's try data.iloc[50:55]: you get the rows from position 50 to 55, one two three four five, five rows, so this works as we know it from Python, start included, end excluded. So this is iloc, and the other one was loc; I remember them as 'integer location' versus just 'location'. Perfect.
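A self-contained sketch of label versus integer slicing; the index below is ten generated business days standing in for the real stock dates:

```python
import pandas as pd

# Ten business days starting Monday 2021-05-03.
idx = pd.date_range("2021-05-03", periods=10, freq="B")
data = pd.DataFrame({"Close": range(10)}, index=idx)

# .loc slices by label, and BOTH endpoints are included.
print(data.loc["2021-05-03":"2021-05-07"])   # Monday through Friday

# A partial date string selects a whole period, here all of May.
print(data.loc["2021-05"])

# .iloc slices by integer position: start included, end excluded.
print(data.iloc[2:5])                        # positions 2, 3 and 4
```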
Another great thing with DataFrames: remember how in Excel you can calculate over full columns? You can do that here as well, so let me demonstrate. If you take the Open and the Close values, so basically the Close value minus the Open value, you see the difference during the day, and you get the result as a Series, because you take the Close Series and the Open Series, subtract them, and get a full Series back, calculated one-to-one. Let's just demonstrate with head, because I want to make sure you understand: take the first row's Close minus its Open, and it matches the computed value, and as you check the rows, they're the same. So it takes Close minus Open, row by row, down the frame. Isn't that amazing? I think so.

But you're thinking: okay, this is not very useful, because what do I do with this result? Well, you can add a new column; you just assign to a new column name. Look: we have High, Low, Open, Close, Volume, Adj Close, and now we create a new one called New. Look at data again, and what do we have? New, holding the data of the calculation we just did, the same data we had up there. Amazing, isn't it? Good.

And something we've also done already is selecting data based on boolean expressions. For instance, if we are interested in figuring out where we have a positive value: was this row above zero? False, it was negative (actually I did the opposite calculation here compared to above; you didn't tell me that, but it doesn't matter, it's just for demonstration purposes). So you get False, False, True and so on, and you can then say: wait a minute, I don't care about the booleans, I want the data. Here you go: put the boolean expression inside the square brackets, and you get the full row every time the condition is True, so all the values in the new column are positive in the result. You could do the opposite by turning the condition around; I think you figured that one out.
Another thing we’ve been using a lot is group by and value counts and i love that too right so let’s try to do that so example let’s make a category here which has this boolean value now it has a boolean value and let’s try to group by and take the mean value of
Of that and what does it mean right what does mean mean so we have categories with false and true right so these are all the false values what is the mean value of these different columns here and what is the tr and when it’s true what is the mean
Value so we have categorized in true and false right remember we did that already we did that already in the course where we did it with the first project we did we did group by there and then then you could get the grades of the students right remember in the very
First lesson i know it’s a long time ago but i remember because i made it and i know you were there or if you weren’t you should go immediately to the first lesson in this course and see it because it’s amazing okay so here you get right so the difference it seems like
the funny thing: when the value grows, the mean change is higher than when it declines. What does that tell us? My first thought: it tells us that from the beginning of the dataset to the end, the series must end higher than it starts, because every time it goes down it goes down less, and every time it goes up it goes up more. Actually, I take that back, because maybe it goes up more times than it goes down, or down more times than up, so on its own that conclusion doesn't follow. Erase it, erase it. You don't remember I said that. Or should I cut this out? I don't know, what do you think? I'll just keep it, but you forget about it; I'll practice my Jedi mind trick: forget it. You forgot it? No? I have to work on that. Okay, perfect. And what you can do now is actually the
value counts, as you remember, so let's try that here. Here, true occurs less often than false. So what does that mean? And you can do it this other way too; what does it mean? It's just two different ways to do the same thing, because this is what was inside the category column, and you see we get the same numbers. So it goes down more often than it goes up.
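The group-by-mean and value-counts steps can be sketched like this, again with made-up numbers; the column names are illustrative, not the lesson's:

```python
import pandas as pd

# Hypothetical daily changes; "Up" marks the rows where the value grew.
df = pd.DataFrame({"Change": [1.5, -0.5, 2.0, -1.0, -0.25]})
df["Up"] = df["Change"] > 0

# Mean change within each category: the True rows vs the False rows.
print(df.groupby("Up")["Change"].mean())

# How often each category occurs.
print(df["Up"].value_counts())
```

Both calls use the same boolean column: `groupby` answers "how big is the average move in each category", and `value_counts` answers "how often does each category occur", which is why you need both to judge the overall movement.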
So basically, to figure out whether the total movement is more up than down, you need to weigh the mean values by these counts. Right, good. So that was basically what I wanted to show you about DataFrames, and if you really want to go into DataFrames and master them in detail, go to the tutorials down here; they're actually great. Remember to download the cheat sheet; it can help you learn some things faster. We don't remember everything, me neither, I look things up all the time, so don't worry about that. You just need to know what you can do, so you can say: okay, I remember I could do this, and then figure out how to do it. And basically, this is what gets you started. In the project we're going to work with DataFrames again. It's not going to be a project based on the workflow process here; I think it's almost the only project that isn't. But we just need to practice with DataFrames, and you need to learn that, I need to learn that, everybody needs to learn that as a data scientist, because it's amazing too. So see you in the next one. This project is about DataFrames, and
it's just to make you play around a bit with DataFrames. It's not really complex, but it helps you understand what a DataFrame is and how you can use it. So let's get to the Jupyter Notebook; I'll introduce the project, afterwards you should try it, and finally I'll show you how I would solve it. Okay, let's get started. Again, I told you what the goal is: to master the pandas DataFrame. Well, "master" is maybe a big word; let's just say you get started with it. Actually, I take that back too: mastering something isn't about being a wizard, it's about using it for what you need. Just like a car, which you can use for many things: if you can drive in a city, do you master driving? I think so, but it doesn't mean you can drive on a race track and set world records; some other people do that. It's the same with pandas DataFrames: you don't need to set world records with your
knowledge of them; you just need to use them for what you need. So that's enough about mastering, enough talking. The first task is to import pandas, import pandas as pd, so let's just execute it. The second is to read this dataset, and remember to assign it to a variable, data; again, when you read something, you need to assign it to a variable. This is a file, and if you're curious you can always find the files over here. What was it called? That's how good my memory is, it's hopeless: populations. So you can find it over here, and here we have it. It's a tiny dataset, and it's actually not fictional, which is kind of fun: you have Denmark, Denmark, Sweden, Sweden, Sweden, and we have the years and some populations. Okay.
Then you want to investigate the data: you check the data types, remember that, and convert Year to datetime. You can do that with this, using a format argument, and the format of the input here is a year. This is something new that we haven't done, and I sometimes like to put new things in here; the reason to do the project is to play around with things and see if you can figure them out. If not, don't worry, I'll help you in the next one. Then, scale populations to millions. To millions? I don't think we do that by multiplying by 1000, but we'll see, maybe I don't remember the dataset; anyhow, you multiply by 1000, and you can do that like this. We haven't tried that either, so this is also a great exercise. Then, calculate the mean population of each country: here we can use group by, which groups the data, and how do you take the mean? If you don't remember, don't worry, I'll help you again. Then, replace Denmark with DNK: given a column, you can access the string functions with .str. We didn't do that before either; it enables you to apply string functions on the column, so here we can do data country .str.replace Denmark with DNK. Perfect, good. So the project might not be 100% clear, but it's just to make you play around with it a bit, and I'll give you some ideas afterwards on how to do it. So are you excited? I hope so, because I am. So hit stop, try it yourself, and I'll help you when you get stuck. Okay, see you in a moment. Did you try it out yourself? If you
didn't, please hit stop and try it out yourself. If you did, let's get started. I introduced the project, so the first step here is just to import pandas; let's execute it with Shift+Enter. Did it? I mean, we did it, okay, perfect. Then we need to read the data, so let's do that: we have the file populations here, we need to assign it to data, and so on. This is what I've been doing a lot of times, but it's often good for you to try it out too; don't just watch me do it all the time. Perfect. I also often like to just show the data, and this dataset is so small that we can see it all here: we have Denmark, Denmark, Denmark, we have the years 2000, 2010, 2020, and this is the population, the millions of people in Denmark. Yeah, it's a small country; that's where I live. Our neighbor Sweden, or one of our neighbors, has a bit more; it's also a bigger country, and its population has been growing, as has Denmark's. Good. Investigate the data types: data.dtypes, and we have Country as an object
because it's a string, so it's an object; Year is an integer; and Population is a float, makes sense. Then, convert Year to datetime. How do we do that? This is where it's a bit new, and I'm teasing you a bit here, so let's use pd.to_datetime. pd accesses the pandas library, remember we imported it as pd up here, so we access that and call to_datetime, and what is it we want to convert? We want to convert the Year data, but there's a problem: you see it all becomes 1970, because it doesn't understand the input; an integer is often interpreted as seconds or microseconds since the epoch, and the epoch was the first of January 1970, and that's what it does. So you need to tell it there's a format, and in this case it is years, written as %Y, percent-Y; it doesn't know that by itself. So now you get them correctly here.
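The epoch problem and the fix can be sketched like this, with made-up years:

```python
import pandas as pd

years = pd.Series([2000, 2010, 2020])

# Without a format, the integers are read as nanoseconds since the
# epoch (1970-01-01), so every value collapses into 1970.
as_epoch = pd.to_datetime(years)

# With format="%Y" the integers are parsed as four-digit years.
as_years = pd.to_datetime(years, format="%Y")
print(as_years)
```

The `format` string tells pandas how to interpret the input instead of letting it guess, which is the difference between "1970 everywhere" and the years you actually meant.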
And the reason why this one is not listed here, I don't know, maybe it was missed out. You can read more about formats here; there's a list of these. Good. But now I have it as a series, and we actually want to assign it back to the Year column, so let's try that and see what happens. Nothing much happened; it looks a bit different now, it's the first of January of each year instead. But the key thing is, let's add data.dtypes here: we see that Year is now a datetime, and this is what we want, because we want to work with the years as datetimes instead of int64. Perfect. Then, scale populations to millions, and I think it shouldn't say millions here, because I'd multiply by 1000. So let's do data population times 1000, like this; I don't know if I'm making sense writing it this way, but now I did it, and we can assign it back, like we did above, so now we have the data like this. Perfect. Then, calculate the mean population for each country.
This is where we need group by, which is amazing. So let's try data.groupby... that went really well; I'm happy you're here to help me out. And we just need to apply mean on that. What this tells us is that the mean population over these two full decades (not counting the one we're in right now) was about 5.5 million in Denmark and 9.4 million in Sweden. These are not accurate numbers, because we don't have a value for each year, so it doesn't really tell us anything valuable, but never mind, it's just practice. And again, here we need to replace a string, and you can access a column as a string like this: data country .str.replace. If you don't know about replace, I advise you to take the Python course I also have for free on my channel; it explains a lot of the things you need. So here you see how it can be done, and if we assign it back immediately, oops, we see that it changes the original data to this format. Every time I assign back, it takes the values and replaces them, you see that? So this basically gets you started with pandas DataFrames. You need to get comfortable with them, and it's not that much; you just need to understand how to work with them. This was a great exercise to get you started playing around with group by, strings and all that kind of stuff. So I hope that made you excited.
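The whole solution can be collected into one compact sketch, with a tiny stand-in for the populations file; the numbers are illustrative, not the real dataset:

```python
import pandas as pd

# Stand-in for the populations file; values are made up.
data = pd.DataFrame({
    "Country": ["Denmark", "Denmark", "Sweden", "Sweden"],
    "Year": [2010, 2020, 2010, 2020],
    "Population": [5.5, 5.8, 9.3, 10.3],
})

# Convert Year to datetime and scale the population column by 1000.
data["Year"] = pd.to_datetime(data["Year"], format="%Y")
data["Population"] = data["Population"] * 1000

# Mean population per country.
means = data.groupby("Country")["Population"].mean()
print(means)

# Replace Denmark with DNK via the .str accessor.
data["Country"] = data["Country"].str.replace("Denmark", "DNK")
print(data["Country"].unique())
```

Each line mirrors one task from the project: to_datetime with a format, a vectorized multiplication, a group-by mean, and a string replace through `.str`.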
In the next lecture it's going to be amazing: we are going to work with web scraping. If there is something people love, it's web scraping; people just love that, and I love it too, because you see a table on a web page, on Wikipedia or somewhere, and you just want that data down in your DataFrame. We work with DataFrames, so how do you do that? Well, DataFrames are actually amazing for that, and I'll show you how. The funny thing is, when you look at web pages and tutorials on web scraping, they're insanely complex. Why not just use DataFrames directly? They do all the hard work for you. I'll show you in the next one; I promise you will be shocked by some of the complex tutorials on web scraping, when it's actually so simple. I guess those people just don't know how to use pandas, or maybe they don't know pandas at all, but it is so simple. Are you excited? I hope so. There should be a link to that tutorial up there, and remember, if you know somebody who needs to learn something about pandas DataFrames, please help them out and send them a link to this one. And if you like this video, please share, like, subscribe and all that kind of stuff; it helps me grow my channel, and I'll create more awesome content for you, for free, yes, for you. Good, perfect. And if my Jedi mind trick didn't work, I'm sorry, then I blew it. But see you in the next one, bye-bye.

Web scraping: why do people overcomplicate it? It's actually quite easy, and if you want to know how to do it the easy way, follow along. In this lesson we will
learn how to do web scraping, how to do the data wrangling that's often required afterwards, and in the end we'll do a project where we do all the web scraping and wrangling ourselves. Are you ready for that? I hope so. If you're new to this series, this is a 15-part series on data science with Python; it's a full course that focuses on what adds value to your customers, so you can focus on what really adds value in the process and not so much on too many technical details. So are you ready? I hope so, let's get started. Today our focus is on web scraping. As I mentioned, the focus of this 15-part course is the bigger picture: we want to focus on what adds value to the customer, and not only on the preparation and analysis that most courses focus on. Today, though, we are focused on acquiring and preparing data, and we will do that with web scraping. If you didn't notice, almost every piece of information is available out there on the internet, so all you need to do is scrape it and get the data, and we will do that. And I promise you, web scraping is one of the most fun things to do, obviously because you can get the data, work with it and create awesome analyses. Perfect. Are you ready? I hope so. Just a quick note on acquiring data, because
that's a big, key thing in doing your analysis: if you don't have the right data, how are you going to create awesome analyses and make great reports? The most common data sources are the internet (web scraping), databases, CSV files, Excel sheets and other file formats, and we will cover them all in this course, so that no matter where your data comes from, you know how to get it. The first one is the internet, web scraping. So what is web scraping? It's extracting data from websites; you probably knew that. But of course, web scraping is one
of these things that is in a gray zone, because the legislation on web scraping is not really clear. You can go to Wikipedia and read about common cases in different regions of the world and how they were treated, and you'll see it is honestly a mess. I wish there were simple, straightforward rules, but that is not the case. A summary could be: the legality of web scraping varies across the world; in general, web scraping may be against the terms of use of some websites, but the enforcement of these terms is unclear. So yeah, I wish I could give you a better answer. My best advice is: be ethical. What do I mean? Do not use it for commercial purposes without first confirming that it is legal; that means do it for private use only, I would start there, and for private use, in most cases there's nothing to be afraid of. But if you want to publish the data and go a step further, well, consult a lawyer. Okay, so, web scraping; again, I somehow got sidetracked. If you do a Google
search on web scraping, you see the most insane tutorials on how to create big frameworks that web scrape for you, and I'm just like: hmm, maybe those people should learn a bit about pandas. So that's what we're going to do here, we're going to use pandas, and for our purpose we're going to look at this web page; it's a fundraising page. One thing about web scraping: most of the time, almost always, the data you need is in a table, similar to this one. And I chose this one in particular because it's interesting; can you guess why? Let me know in the comments if you can. It's because when you look at it, there are some things that are annoying when you parse the data: you have dollar signs and comma signs and so on, so it's not easy to represent the values as numbers. And that's what we're going to do: we're going to extract this data into a DataFrame and then convert the values to integers, because when we work with the data after web scraping, we want it as integers. Perfect, awesome. So, first things first, our favorite library, and I promise you pandas can do everything for you, so let's start with importing it. So
the first thing you need is the web URL, so let me just copy it and add it in here; you have the URL, good. The next thing is to read the data, and pandas, if you didn't notice, has a lot of read functionality, a lot of read methods, and one of them is read_html. You can look at the documentation, I always advise it, and at the end of the documentation you often see examples of how to use a method; obviously this one doesn't have that, but most do. Perfect. Well, that was not perfect, it was semi-perfect. Good, here we go. So let's read the data: pd, that is our pandas, and just to emphasize, here are all the different read methods, and what we need is read_html. So we just read the URL, and what do we have in data? Let's take the type of data, and you actually realize this is a list. Hmm, a list? So let's just read the documentation again, even though I
said it was poor documentation; let's just read what it says. It says: read HTML tables into a list of DataFrame objects. Okay, so it's a list of DataFrame objects, interesting. So let's do type of data[0], the first element, if there is data, and that's a DataFrame. And let's also take the length of data, just to see: there are two tables. So if you look at the page, there should be two tables on it; sometimes it could maybe be this table down here, I don't really know, but the table we're interested in is this one, so it's probably the first one, and there's no real way to know without looking. So let's look at data[0], take the head of it, and okay: this is the first table, and it has the same figures as the one on the page. Perfect. Good.
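The read_html call can be sketched like this. Here I feed it a small inline HTML snippet through StringIO instead of a live URL, so it runs offline; note that read_html needs an HTML parser such as lxml installed:

```python
import pandas as pd
from io import StringIO

# A tiny HTML page with one table, standing in for the fundraising page.
html = """
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2020</td><td>$ 1,200</td></tr>
  <tr><td>2021</td><td>$ 1,500</td></tr>
</table>
"""

# read_html returns a LIST of DataFrames, one per table found on the page.
tables = pd.read_html(StringIO(html))
print(len(tables))
print(tables[0])
```

With a real page like the Wikipedia fundraising statistics, you would pass the URL instead of the StringIO object, and then pick the table you want out of the returned list, e.g. `tables[0]`.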
Let me just add some more room for us here. Fundraising: let's assign data[0] to a variable called fundraising. Why do I do that? It's easier to access, so I don't have to index the first entry all the time; now I have it in fundraising. So what's my first point? Let's explore the data types; remember, we've been doing that all along, though so far we haven't had to do much about them because they were fine. But as you see here, they're all objects. Objects can be many things, but one thing they are not: they're not integers and they're not floats, and we want integers in this case; floats could do, but not objects. Most of the time, when a column is an object, it's just strings, so you can usually assume that. So what do we want to do here? Let's look at the Expenses column, for instance, let's try to work with that one and convert it. So how do you do that? One thing
we see is that the first two characters are a dollar sign and a space. So one thing you can do is take fundraising expenses, and then you can access the string methods by typing .str, and then you can apply string methods on it. So let's try that: you see what happens now is that I applied the string method on it, and it applies it on every single row. So what did it do? As you see, it removed the dollar sign and the space; that's what I do here, I use list-style indexing from index two to the end. Okay, so far so good. So just for the fun of it, let's create a new column here and look at fundraising head, so we can see them together and what we did. The problem still is that this is still a string, an object, down there, and that's not good, but we cannot convert it yet, because it still has these comma signs
here, so let's also remove those. Fundraising... oops, that was not good... oops, I'll do it again, and this time it goes well. So how can we do that? Again, we access the column as a string, and then we do a replace, and the replace replaces the comma with nothing. That's basically what we do, and now we see we removed the... no, not the spaces, the comma signs. So now we have it, and it's ready to be converted to an integer. Let's try that on fundraising expenses, and then we use, do you remember, to_numeric? Remember that one? Again, if you're confused about these functions, you can get the manual with Shift+Tab: this one converts its argument to a numeric type, and the default return dtype is float64 or int64. So let's try that on fundraising expenses, and now the interesting part: fundraising dtypes, and you see that Expenses is now an integer, and before it was not an integer. That's what this
to_numeric does: it makes the column numeric. And why do we care about that? Because then we can use the numeric values. Perfect. So what we've been doing is actually data wrangling; this note should really go up here. Data wrangling means transforming and mapping data from one raw form into another format, with the intent of making it more appropriate and valuable for a variety of downstream purposes, such as analytics. So this is called data wrangling; people call it other things too, but we'll call it data wrangling here. And this is what we've done: we transformed Expenses from a string format to an integer format, so we can process it later, because we're not interested in the string
format of any of this. Perfect, good. Now I actually want to show another example: let's take Revenue, and let's say we have some discrepancies in the data. Let me explain. We take fundraising Revenue, access it with .str, drop the two first characters, and then on fundraising Revenue we take .str.replace to turn the commas into nothing; that's what replace does. Then we check fundraising dtypes and the head. What are we looking for? This is still an object: we converted it over here, so in the head we cannot see the difference, but in dtypes we see that Revenue is still an object. And then we say hmm, fundraising Revenue, and what do we do? We use .loc with row zero and the Revenue column, and write the string "spam" in there. Good. Why do I do that? Let me show you in a moment. So now we have possibly numeric values inside the column, but we
also have a "spam" in there. So the idea was, as before with Revenue, to do pd.to_numeric. The problem is now we get an error, and most people get scared when they get an error, but my point is: don't be scared, let's read it. It says ValueError: unable to parse string "spam" at position 0. This often happens when you do these things; in this case we had a really nice dataset, but often there will be things that are not easily convertible to an integer or a float with to_numeric. So what do we do? Let's look at the manual again; it says something about an errors argument, so let's read a bit more about that. Errors: "raise" means invalid parsing will raise an exception; "ignore" means invalid parsing will return the input; and "coerce" means invalid parsing will be set as NaN, not a number. So let's use "coerce", and I'll show you why in a moment. What we see now is that the first entry is actually NaN. And why is that good? Let's assign it to Revenue and look at fundraising head: this looks pretty good, and we have a NaN here. But the thing is, and this is
interesting: the NaN is actually a float, and a float is numeric; it doesn't matter whether it's an integer or a float, you can still add and multiply them. The point is that not-a-number can be represented in a column as a float, and hence it doesn't destroy the column's data type. If we didn't have NaN, we would not be able to represent this row of data, and in many cases you would not want to delete the row; you just keep it with the NaN, because then you know there was a problem here, but maybe the rest of the data in the row is useful. This one piece of data you don't have, and you have to deal with that; later in the course we will learn how to deal with missing values and discrepancies. For now it's just to show you that this is a common error you get with to_numeric: there's something it doesn't understand. So the work process is: look at what it doesn't understand, it will tell you in the exception, as we saw, and then figure out how to deal with it. In this case we used "coerce" to turn it into NaN; maybe in other cases you need to do some more wrangling before the conversion. Okay, I think that is basically what
I wanted to show you about web scraping. Isn't that crazy? It's so simple; all you need is pandas. It will take all the data that's in tables, and I would say in 98%, actually 99%, maybe even more, of all web scraping cases the data is in tables, so that's all you need. And often you need to do something with the data afterwards, and that's called data wrangling; that's what we were doing when we converted the columns to numeric and worked with the strings first. And pandas, again, is so insanely effective for that. Awesome. So in the next one we're going to work with this dataset and let you do it yourself, because often when you see somebody doing something, it looks easy, but you should try it yourself, so do that in the project. See you in a moment: I will introduce the project, then let you do it, and then show you how you could solve it. Okay, ready for that? I hope so, see you in a moment.
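The wrangling steps from this lesson can be collected into one sketch; the column values are made up, mirroring the "$ " prefix and comma separators from the page:

```python
import pandas as pd

# Columns scraped as strings, like the fundraising table.
df = pd.DataFrame({"Expenses": ["$ 1,234", "$ 56,789"],
                   "Revenue": ["$ 2,000", "spam"]})

# Drop the leading "$ " and the thousands separators, then convert.
df["Expenses"] = pd.to_numeric(df["Expenses"].str[2:].str.replace(",", ""))

# errors="coerce" turns anything unparseable ("spam") into NaN
# instead of raising a ValueError.
df["Revenue"] = pd.to_numeric(df["Revenue"].str[2:].str.replace(",", ""),
                              errors="coerce")
print(df.dtypes)
```

Expenses becomes an integer column, while Revenue becomes a float column, because NaN can only be represented as a float, which is exactly the point made in the lesson.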
This is the introduction to the project, where we're going to do the same as in the lesson, and a bit more, actually; I will teach you a few tricks in this project as well, so that's another reason to do it. It's explained quite straightforwardly, but doing it is the best way to learn, so let's jump into our Jupyter Notebook and get started. Here we have the project: acquire and prepare data from the web. Again, we have the big picture of our data science workflow, and in this particular project we're only going to focus on acquire and prepare; later we'll focus on other aspects. So the goal of the project is to focus on steps one and two, acquire and prepare: we will read data that needs data wrangling, prepare it, and follow a process that demonstrates how data scientists work. So again, we are at step one: first we need to import the libraries, which is straightforward, you just execute this cell. Then, in 1b, we need to retrieve and read the data, and here are two options, option one and option two. The reason I have two options is that sometimes people sit with poor internet connections, or the web page has disappeared or changed and it doesn't work, so I have the data as a CSV file; you can get it down here as a CSV file if you cannot read it directly from the internet. But I assume what you want to do is option one, because that is the actual scraping. So what you do is assign a variable with the URL you're looking for; in this case, if you want to follow along, it is the fundraising statistics on Wikipedia. Then you get the tables and assign the first DataFrame to a variable; tables is a list of DataFrames containing the data. This
is what we also did in the lesson. Then we need to prepare the data, and this is where you check the data types again; in this case you will see they are not numeric, and numeric values are what we're looking for. Then we check whether there are any missing values. We've done that before, and so far we don't really focus on it, but it's a good habit to check for it so it gets into your workflow. You can do that with isnull().any(): isnull checks whether each value is missing, and any checks whether any of them are. Good. Now to a new thing, and this is actually amazing: the Source column adds no value to our investigation. My advice is, if there's data you don't need and it has no value (do look at what's in the column first), just delete it, and you can do that with del data and the column name. Amazing. Then we want to convert the Year column to numeric. All the strings are years formatted as year/year; to get the last year as a string you can slice like this, and then you convert it to numeric, similar to what we did in the lesson, but now we're doing it on this column as well. Set the year as the index: this is also something new. To change a column to be the index, you use data set_index year with in place true, and you can also sort the index into the correct order with data sort_index in place true. What does this in place do? It means the operation is done immediately on the DataFrame you're working on; otherwise you would need to assign the result to a new DataFrame, and you could do that as well, I just want to show you that you can use in place here. So that's fine. Convert the remaining columns to numeric: you do the steps we did in the lesson, you access the string methods, remove the first two characters, then use the replace, then use to_numeric. Perfect. Then, visualize the data to investigate quality: make a simple plot of revenue, expenses and total assets, which will help you spot any outliers, and finally make a simple plot of the asset rise, which will also help you spot outliers. Yeah, I think so.
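The prepare steps listed here can be sketched end to end on a small stand-in table; the numbers and the "PDF" source strings are illustrative, not the real Wikipedia figures:

```python
import pandas as pd

# Stand-in for the scraped fundraising table.
data = pd.DataFrame({"Year": ["2020/2021", "2019/2020"],
                     "Revenue": ["$ 162,886", "$ 120,067"],
                     "Source": ["PDF", "PDF"]})

# The Source column adds no value, so delete it.
del data["Source"]

# Keep the last four characters of "YYYY/YYYY" and convert to numeric.
data["Year"] = pd.to_numeric(data["Year"].str[-4:])

# Make Year the index (in place) and sort it into the correct order.
data.set_index("Year", inplace=True)
data.sort_index(inplace=True)

# Convert the remaining column: strip "$ ", drop commas, to_numeric.
data["Revenue"] = pd.to_numeric(data["Revenue"].str[2:].str.replace(",", ""))
print(data)
```

Note how the rows end up sorted by year even though they started out of order, because `sort_index` reorders by the new index.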
So what I advise you to do now is hit stop. Did you? No, you didn't. Okay, but when you hit stop, try it on your own, and if you get stuck, play along and I'll show you how I would solve this problem. Okay, are you ready? I hope so, because this is web scraping, and it is one of the most amazing things, and I've just taught you how to do it more easily than most people do it online, so you are among the top people on the internet. Amazing, see you.

Did you manage? Let me know in the comments, and how do you like doing it this way? Try looking on the internet for other ways to do it; people are crazy, and people love web scraping, and I understand, I've done a lot of it myself, because you find amazing data on the internet, and then you can do a lot of awesome things, and you get smarter, understanding the world better. Perfect. Are you ready? I hope so, let's dive into the Jupyter Notebook. I introduced the project so far, so let's get started. So
Here we are and again the focus of the project were step one and two the first one here did you manage that one it was shift enter and let’s see it it it managed right so here we have like option one on option two here and uh
The idea is to take this one here and assign it down here and then we read the data into tables it says here pd read html is called and url right so that was the first thing here so now we have the data inside the tables and what does it
Say assign the first data frame to a variable so let’s just call it data i usually do that i think table 0 and apply head it only it doesn’t say it in this one here it was called tables and i wanted to write data head down there
So we see we have the data in this one here and it says hit apply head in option two and not an option one maybe we should do it in both i mean it’s a good practice to just to see that you have data so you don’t sit later and
Don’t understand it good and uh and now to the fun part here right data we already know it’s fun data types here and we see here we have object object objects so it basically means that it understands all these here as strings most likely good check for null value data is not
Any and i know we haven’t worked with this so far but it’s just good practice always to check if some data is missing and in this case here we have false false false false false so that means that we have all the data available and when there are strings Sometimes it’s not really a good indicator if the data quality is fine but for so far it’s also good when before you do something that you just check it and maybe you should check it after you’ve done your data wrangling delete the source column and again here so
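The two quality checks just described, dtypes and a null check, can be sketched on a toy frame (the column names and values are invented stand-ins for the scraped table):

```python
import pandas as pd

# Toy stand-in for a freshly scraped table
data = pd.DataFrame({
    "Year": ["2020/2021", "2019/2020"],
    "Revenue": ["$1,000", "$900"],
})

# 1) dtypes: scraped columns typically arrive as 'object' (strings)
print(data.dtypes)

# 2) missing values: .any() flags columns, .sum() counts per column
has_missing = data.isnull().any()
missing_counts = data.isnull().sum()
print(has_missing)
print(missing_counts)
```

Running `.any()` answers "is anything missing at all?", while `.sum()` tells you how bad it is per column; both are cheap enough to run before and after wrangling.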
If you look at the Source column, it has "PDF" written in it, and if you look at the web page, that's because each entry is a link to a source for these figures, a PDF file. We can take a look at one of them; okay, these are financial statements. So it's a link to them, but in our table we don't have anything except "PDF" written, so it's not really useful for us, as far as I know. Good. The point is, we want to delete it. First, let me emphasize: we have data.head(), this is how our data looks now; then we do del data['Source'], and then data.head() again, and we see what happened: the Source column is removed. Perfect. Convert the Year column to numeric. We know the Year column is in the format four-digit year, slash,
four-digit year. Get the last year as a string, so let's do that: data['Year'] = data['Year'].str[-4:]. I don't know if you're familiar with indexing, but -4 counts from the back. You could also count from the front, but then you would start at index 5, because positive indexing starts at 0 while negative indexing starts at -1, -2, -3, -4. Let's take it in steps. Here we have the year now: instead of "2020/2021" it says "2021", "2020", and so forth. Then we can do the conversion, data['Year'] = pd.to_numeric(data['Year']), and data.dtypes shows immediately that Year is an integer now, which is what we wanted. That is amazing. Perfect. Set Year as the index: that converts Year to the index, so let's just do that and run data.head(), and you see the year is the index now. Also sort the index into the correct order: instead of starting from now and going backwards, we want the opposite order, so we can add sort_index() in front, and now it starts from 2004 and goes forward. And remember, head only shows the first five rows. Perfect. Convert the remaining columns to numeric: we have Revenue, Expenses, Asset Rise, Total Assets. Okay, let's do that. Good.
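The whole cleanup just walked through can be sketched end to end on a toy frame (the values are invented, and the exact string cleanup in the notebook may differ from this sketch):

```python
import pandas as pd

# Toy stand-in for the scraped table
data = pd.DataFrame({
    "Year": ["2021/2022", "2020/2021", "2019/2020"],
    "Revenue": ["$1,200", "$1,000", "$900"],
})

# Keep the last four characters: '2021/2022' -> '2022', then make it numeric
data["Year"] = pd.to_numeric(data["Year"].str[-4:])

# Year becomes the index, sorted oldest-first
data = data.set_index("Year").sort_index()

# Remaining columns: strip '$' and thousands separators, then convert
data["Revenue"] = pd.to_numeric(data["Revenue"]
                                .str.replace("$", "", regex=False)
                                .str.replace(",", "", regex=False))
print(data.dtypes)
```

After this, every column is numeric and the rows run from the oldest year forward, which is what plotting needs.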
And let's actually try to do it in one go. You don't have to do it in one like I'm doing here, but let's see if we can: we do the conversion for all the remaining columns, and then data.dtypes. I'm actually curious what happens with the first one. Yeah, it doesn't want to convert twice. The reason I got an error is that Year is already converted to numeric, so you cannot apply the string methods to it anymore; that's why it failed. I just commented it out, and I could comment it back in if we redid this from the top, where it would have to be part of it. But now we see all of the data types are integers. Perfect. And just for the fun of it, data.head() shows it, and it looks more numeric than it did up there, where we had the dollar signs and commas and all that kind of stuff. Perfect. Awesome. Make a simple plot of revenue, expenses, and total assets. Okay, perfect. So, data, and then we have a list in here;
we take Revenue, Expenses, and Total Assets and call .plot(), just to show you again how easy it is. We can see that the expenses are the orange line, the revenue is here, and the total assets are growing faster than both. We're not going to dive into what these things mean; it's just to show you how easy it is. So, yeah, actually we want to look at data quality here: make a simple plot of Asset Rise; this will help you to spot any outliers. Okay, so data['Asset Rise'].plot(). What are we looking for? Basically, we're looking for any outliers in this data set, and honestly, I don't really remember what we were looking for; it seems to be growing. I think I had a point with this, but I honestly just forgot. Okay, but if you figure out any insights here, please let me know in the comments, because sometimes my brain turns off. Okay. Thank you for following along with this tutorial; I hope you enjoyed it. The next one is going to be amazing: we're looking at databases, and you say, oh, but databases are so old school. No, they're not; you will be surprised in
the real world how much data lives in databases. A lot of data is shared in databases because it's an easy way to represent multiple schemas of data inside one file, SQLite being one. So we'll dive into that, and I'll teach you how to use databases, connect to databases, and extract data out of databases in the next one. It's gonna be fun, so see you in the next one, and if you liked this one, please let me know: like, subscribe, and all that kind of stuff; it will help me grow. I really enjoy hearing back from you people. It's amazing how many people this has been helping, and I'm always eager to hear more stories of how I've helped you; it motivates me to help more people. This is all me helping you for free, so see you
in the future. Bye bye.
Databases. If you are a data scientist and you want to get access to a database, you might think, wow, that sounds difficult. But it doesn't have to be, because you can actually connect a database directly to your pandas DataFrame, and that's what we're going to do in this lesson. Yeah, you heard me right: you can access the database and put the data directly into a pandas DataFrame. We'll make a project where we access a database, and in that project we're going to visualize, on an interactive map, yeah, you heard me right, on an interactive map, the shootings in Dallas from a shootings database. It's gonna be great and awesome; I can't wait to get started. So are you ready? I hope so. If you're new to this series, this is a 15-part course on data science with Python. There should be a link down below where you can download all the material and get started immediately with the notebooks I'm using in the tutorial, as well as the project. Check it out down there, and let's get started. Databases. You probably know databases, and if you
didn't, well, you will as a data scientist. Again, the focus of our learning journey is always on what adds value to the end users, the customers: useful insights. But we need to master the entire chain of steps before we can add value, and right now we're focused on the first step, acquiring data, because your preparation, analysis, reports, and actions will depend highly on the data you get. The more different types of data you can access, the more valuable insights you can create in the end. Perfect.
So this is again focused on the first step, acquiring, and the most common data sources are the internet, which means web scraping, covered in the last one (if you missed it, check it out; the previous lesson in this course was about web scraping), databases, which we focus on here, and in the next one we'll focus on CSV, Excel, and Parquet files. So what are databases? When people refer to databases, they often actually mean relational databases. There are many other types, but we're not going to dive into all of those. A relational database models data in rows and columns in a series of tables. I should say it's similar to Excel sheets, in the same way DataFrames are similar to Excel sheets: you have columns and rows of data. That's basically what a database is; it's just that each column has a data type and each row represents a record in the database. And since you have multiple tables, it's like a collection of DataFrames, or an Excel workbook with different sheets inside.
SQL is the language we communicate with databases in, and it's often pronounced "sequel". Yeah, I know. So if you hear somebody talking about a "sequel", it's not a seagull; no, it's not the bird. It is SQL, the Structured Query Language, with which you communicate with the database. And to be honest with you, the amount of tutorials and knowledge about SQL is enormous, because SQL and databases have been, and will be for a long time, the main method of storing data, and the technology has proven to be amazing. Even though we data scientists think it's old school and that we should use NoSQL databases and big data things, a lot of data is still in databases, and it will be for a long time, because changing from one system to another is difficult, and the technology has so much support and so much experience behind it. But my point here is not about why people are still using it; it's more this: if you look at the hours of content about advanced SQL statements, you're going to be confused. The good news is that you don't need to master any of that, because all you need to do is be able to extract simple data, sometimes with a few filters, and we'll cover that in this one. The rest is for extreme optimizations of weird things and very specific queries, but we don't do that as data scientists. We just need the raw data in our pandas DataFrame, and that's all. We don't need anything else; everything else is for experts and special systems. So the good news is that you don't need to master much SQL at all. Specifically, in
this lesson we will look at SQLite databases. SQLite is a software library that provides a relational database management system, all in one thing, and you can read more about it on the "What is SQLite" page. Basically, the beautiful thing about it is that it's lightweight to set up and administrate and requires few resources, and it's used in mobile phones and that kind of stuff. The data set we're going to look at is Dallas police officer-involved shootings. It's interesting, and this database has three tables. What are tables? You know, like three different Excel sheets: incidents, officers, and subjects. Before we get started, I'll just mention that there are some other nice SQLite data sets online; I put links to three of the ones that are used all the time, and they're a good starting point to work with, so those links are for you. So how does this SQL thing work? Well, you need a database connector, and now you're thinking, oh, it's going to be so difficult. Don't worry, it's not; just follow along. The sqlite3 module is an interface for SQLite databases, and no installation is needed. That's also what's so amazing about it: you don't need to install anything; it works straight out of the box. Other database connectors exist for MySQL, for PostgreSQL, and pymssql for the Microsoft one. Good, but that's just so you know: if you have a different database, you need the matching connector, but it's basically the same thing you do. Perfect. To import the connector, you
just need to import the library, which has all the magic, and now we are looking at the database. The database here is just a single file that has all the data, structured in a specific way that is particular to databases, and that's why we need the connector. So let's connect. Perfect. What do we do? The interface to the database is through a cursor, which you create from the connection, and the execute method allows you to run SQL queries on the database. To get a list of all the tables, the following query can be applied, so let's try it: cursor.execute(...). Oops, I did the copy wrong; let's see. It has an error because I need double quotes around the query. Why? Because you have single quotes inside, in type='table', so you need double quotes around the whole string; then the single quotes inside are understood. Then the result is fetched with fetchall, so let's do that. Let's try to read this SQL statement, and don't worry, most of the time you don't need anything this complex: SELECT name FROM sqlite_master WHERE type='table' ORDER BY name. Basically, I want all the names from the master table, of type table, ordered by name so they come out in alphabetical order, and we get incidents, officers, subjects. Isn't that what we said up there: incidents, officers, subjects? Perfect.
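Put together, the connect-cursor-fetchall flow looks like this; an in-memory database with the same three table names stands in for the Dallas file:

```python
import sqlite3

# In-memory database standing in for the Dallas shootings file
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE incidents (case_number TEXT)")
cursor.execute("CREATE TABLE officers (case_number TEXT)")
cursor.execute("CREATE TABLE subjects (case_number TEXT)")

# List all tables, alphabetically; note the double quotes around the
# statement so the single quotes in type='table' survive
cursor.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
tables = cursor.fetchall()
print(tables)  # a list of 1-tuples, one per table
```

With the real file you would pass its path to sqlite3.connect instead of ":memory:"; everything after that is identical.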
So that's basically what you want here. Another thing you can do, and this is very specific to this type of database, is get table info. Let's take the officers table. This is what I would do here... and I did it again: I hit Ctrl+C instead of Ctrl+V when I wanted to paste it; I get punished for that. The reason I put fetchall here is that we want all the rows. What I want to show you is that now we get a description of the table officers: it has case number, which is text type, race, text type, gender, text type, last name, first name, full name, all text type. So it's basically like a DataFrame again: you have a column name and a data type. Perfect. A bit of SQL syntax is given here, and this is basically all you need to know. The first part is only about connecting to the database, looking at what's in the database, and getting a description of the tables.
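The table-description step can be sketched with SQLite's PRAGMA table_info; the simplified officers schema below is invented for illustration, and the video's exact query may differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
# Invented, simplified schema for the officers table
cursor.execute("CREATE TABLE officers (case_number TEXT, race TEXT, gender TEXT)")

# Each returned row describes one column:
# (cid, name, type, notnull, dflt_value, pk)
cursor.execute("PRAGMA table_info(officers)")
info = cursor.fetchall()
for cid, name, col_type, *_ in info:
    print(name, col_type)
```

This is the programmatic equivalent of reading a DataFrame's dtypes: one name and one type per column.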
Next is how you access the data. You have SELECT * FROM table, and this is basically all you need, because often you want to read all the data from a database table into a DataFrame, and this is how you access all of it. Sometimes the amount of data is so enormous that you want to limit the results; you can do that by just adding LIMIT 100, and then you get the first 100 rows of data. You can also filter data, and this is also straightforward: SELECT * FROM table WHERE column > 1, which means you get everything where this filter is true. It makes sense, right? It's basically very similar to the filters you can do inside a DataFrame; it's the same kind of thing. Perfect. Awesome. So, import the data into a DataFrame. This is actually what we have been waiting for, because we don't want to work with these connectors; they are pretty annoying, and you get the data back in this weird structure when you want it directly in a DataFrame. What you can do is take the connection we have and use it right away down here, and then you get all the officers. Let's look at officers.head(): we have the case number, the race, the gender, the last name, the first name, the full name. Basically we took all the data: we have a connector, and we call pd.read_sql with the SQL statement, which is often all you need, a SELECT * FROM officers, and then you get all the data and do your work with it afterwards inside your pandas DataFrame.
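A minimal sketch of pd.read_sql, including the LIMIT and WHERE variants, using an invented in-memory table in place of the Dallas file:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE officers (case_number TEXT, last_name TEXT)")
conn.executemany("INSERT INTO officers VALUES (?, ?)",
                 [("1", "Smith"), ("2", "Jones"), ("3", "Lee")])

# All rows straight into a DataFrame
officers = pd.read_sql("SELECT * FROM officers", conn)

# Only the first 2 rows
first_two = pd.read_sql("SELECT * FROM officers LIMIT 2", conn)

# Only rows matching a filter
filtered = pd.read_sql("SELECT * FROM officers WHERE case_number = '2'", conn)
print(len(officers), len(first_two), len(filtered))
```

The same three statements (plain SELECT, LIMIT, WHERE) cover the vast majority of what a data scientist needs from a database.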
In cases where you have a lot of data, you can use LIMIT, or you can use a filter to narrow things down if you already know you only need specific data. Sometimes you need some SQL syntax to join tables. I'll be honest with you: many times you don't really need to do this at the SQL level; you can do it on the DataFrames, and later in this course we will show you how to do it on the DataFrame directly. But sometimes it makes sense to do it at the database level, so I'll just show you how. Basically, most of the time you will just need an INNER JOIN: it returns records that have matching values in both tables. What does that mean? If you have something like the case number here, and you want to join on case number, you can do an inner join, which takes all the data from table 1 and joins it with table 2 where the column in table 1 equals the column in table 2. There are also different types of joins, a LEFT JOIN for instance, and many more; I'm not going to go into the details of those. Let's just try it: officers = pd.read_sql("SELECT * FROM officers JOIN incidents ON officers.case_number = incidents.case_number", conn); I need the connection there. So here we have it, and let's look at officers.head(). Now you can see we actually joined two tables: the officers we had up here with the incidents, and the incidents had a lot of further data on each incident. You can see that each case number can be represented multiple times, but we add the same incident data to each of those rows. Often you do that when you work with big data: you flatten the data out, so you don't have different tables; for each record in the data set you repeat the same incident fields, and that's what you get with a join like that. Perfect. This is basically how you connect them, and you could actually just read them
in directly: pd.read_sql("SELECT * FROM incidents", conn), and then you have it, and incidents.shape tells you how many there are, and subjects = pd.read_sql("SELECT * FROM subjects", conn). The idea is that often you just read the data into these DataFrames and do the joining afterwards. We haven't covered that yet, so right now the only way you can join them is by using JOIN statements like that. This is actually amazing: this is basically all you need to know about a database. You just need to figure out a way to get the raw data over into a DataFrame, and then you can do all the work there. Sometimes you need to join tables together to flatten the data out, so you don't have different tables, because that's not easy to work with. As a data scientist you work with flat data, long rows of all the data, and there might be duplicates in it, but that doesn't matter, like in this case here, because of the join. Okay. What we're going to do in the project is going to be fun, because we're actually going to make an interactive map plotting all these shootings on it with different colors. You're going to be amazed, so let's head for the project. I hope you enjoyed this one; let's continue. See you in a moment.
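Before the project, the inner-join pattern from this lesson can be sketched end to end with invented mini-tables; note how the incident row for case 1 is repeated for both of its officers, which is exactly the "flattening" described above:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE officers (case_number TEXT, last_name TEXT)")
conn.execute("CREATE TABLE incidents (case_number TEXT, location TEXT)")
conn.executemany("INSERT INTO officers VALUES (?, ?)",
                 [("1", "Smith"), ("1", "Jones"), ("2", "Lee")])
conn.executemany("INSERT INTO incidents VALUES (?, ?)",
                 [("1", "Main St"), ("2", "Elm St")])

# Inner join: incident data is duplicated for every matching officer
officers = pd.read_sql(
    "SELECT * FROM officers JOIN incidents"
    " ON officers.case_number = incidents.case_number", conn)
print(officers)
```

Case 1 has two officers, so its incident row appears twice in the result; the duplicates are harmless once the data is flat.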
Are you ready for this one? It's amazing what we're gonna do: we're gonna end up with an interactive map with colored plots on it. So are you ready? Let's get started. We are focused on this Dallas shootings data set, and we're going to focus on the entire process again. A lot of it is obviously getting the data, the first step, and you will learn to understand why that's so important. The reporting and actions, maybe not the actions so much, but it's up to you if you want to add some useful insights; it's still good to have that at heart. We're not going to analyze that much; we're just going to present our findings. Perfect. The goal of the project: a newspaper or an online media outlet wants to visualize the shootings in Dallas with a focus on the subjects. What is a subject? The subject is in the database; it will be clear in a moment. We will read data from the database and join it into broader data sets, we will explore ideas for visualizing it, and we will create a map with the shootings. Perfect. So, step one, the first thing we are
doing, is importing all the libraries, and here we have a bit more than usual; we have sqlite3, and that's just executed. Then we need to connect to the database: we use the connect function with the file name, and the file name is this database file. Perfect. Read the data into DataFrames: we will read all three tables, and later we'll explore the data, because this is a good workflow: read all the data, then explore it to understand it a bit better, instead of using those awkward SQL statements. So we have incidents, officers, subjects: read the data for each table into a DataFrame, so we have three DataFrames, using pd.read_sql(statement, connection). And you remember the SQL statement; it's down here: SELECT * FROM table, where you replace "table" with the corresponding table name. Perfect. Explore the length of the DataFrames, so we take the length of each. We want to explore the data based on officers and based on subjects, both with incident data. Notice it is difficult to create one data set for both problems; explore the data further to understand why. Why can't you make one single flat data set based on both officers and subjects? Because there can be many officers and many subjects per incident: many-to-many relations. Often it's easy when we have many-to-one relationships, but think about it when you see it. Read the data into a data set: create the first data set, subject_incidents, as subjects joined with the incidents. What does this data give us? We have table one and table two; you need to figure out what to replace: can you join on the column case number, and is all the data represented? That's what we're asking you here. Good. Prepare: check the data types, again with dtypes; when working with databases the types are often right, because databases are very strict about that. Check for null values, again with isnull().sum(); it's good practice to have that. Explore the subject statuses column: we know subject status is categorical, therefore we can use groupby and count. Repeat the previous step on the column
race; do that, and feel free to explore more. Visualization idea: we want to make a visual plot of the shooting incidents, so let's explore whether we can make a plot based on longitude and latitude; a hint: use a scatter plot of long and lat. Then the analysis phase will focus on how to make useful insights with feature selection. Analyze: here we will continue with our selected features; feel free to explore other features. Create a data set with the features race, subject status, latitude, and longitude. Select the features from the DataFrame subject_incidents by filtering with a list of the columns, to make further processing easier, and apply dropna to remove missing data; remember, dropna removes all the rows with missing data. Perfect. How to visualize the features: we want to visualize the two features, race and subject statuses.
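The selection-and-cleanup step above can be sketched on a toy frame (the column names and values are invented stand-ins for the joined data):

```python
import numpy as np
import pandas as pd

# Invented stand-in for the joined subject/incident data
subject_incidents = pd.DataFrame({
    "race": ["B", "W", "A"],
    "subject_statuses": ["Deceased", "Injured", "Shoot and Miss"],
    "latitude": [32.78, np.nan, 32.80],
    "longitude": [-96.80, -96.75, np.nan],
    "grand_jury": ["x", "y", "z"],   # a column we do not need
})

# Keep only the features we plot, then drop rows with missing values
features = ["race", "subject_statuses", "latitude", "longitude"]
data = subject_incidents[features].dropna()
print(data)
```

Two of the three toy rows are missing a coordinate, so dropna keeps only the one complete row; on the real data this is what removes the handful of records without lat/long.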
A way to visualize data is by color and size. The idea: map the race feature to colors, and map the subject statuses to a size. The race column has the values B, W, A, and L, and we can map those to color values; there's documentation on that. A simple way to map a column is by using apply with a lambda function. You haven't done this before, so it might be new to you; we cover it down here. You actually make a mapping: you want to map capital B to lowercase "b", W to "y", and so on; these are color indicators, and you can find the color maps in the documentation. We can do the mapping with apply on a lambda as follows: apply the lambda function that does the mapping, and it converts capital B to lowercase "b" and so forth. It's actually quite amazing. Then convert the subject statuses column. It has categories like "Deceased", "Injured", and mixed combinations of them, but the main categories are Deceased, Injured, and Shoot and Miss. A simple way is quite similar to the last step: we create a mapping to numeric values and do the mapping with apply and a lambda, and you have a default value of 100 here. I didn't really explain that: instead of listing every possible entry in the mapping,
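A sketch of both mappings; the B to "b" and W to "y" choices follow the lesson, while the remaining colors and the size numbers are my own illustration:

```python
import pandas as pd

data = pd.DataFrame({
    "race": ["B", "W", "A", "L"],
    "subject_statuses": ["Deceased", "Injured", "Shoot and Miss", "Unknown"],
})

# Map race codes to matplotlib single-letter colors
colors = {"B": "b", "W": "y", "A": "g", "L": "r"}
data["color"] = data["race"].apply(lambda x: colors[x])

# Map statuses to marker sizes; .get(...) falls back to 100 for
# anything outside the three main categories
sizes = {"Deceased": 300, "Injured": 200, "Shoot and Miss": 50}
data["size"] = data["subject_statuses"].apply(lambda x: sizes.get(x, 100))
print(data)
```

The dict's direct lookup (colors[x]) requires every value to be in the mapping, while sizes.get(x, 100) quietly handles the odd mixed categories with a default, which is exactly why .get is used for the statuses.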
you get a default value of 100 when you use .get instead. Visualize the data: now we can do a scatter plot, where s is the size and c is the color. We didn't actually use it here, but you can add figsize, which sets the size of the figure, and alpha sets the transparency of the dots. Perfect. Reporting: this is where it gets really interesting. The goal here is to present your message: visualize it in one chart, add headlines, a title, and so on; that's fine. But where it gets really interesting is with the Folium map, and if you haven't played with Folium, it's amazing. There's a quickstart guide that should inspire you to use it: you can make interactive maps and add things to them. For instance, I would advise you to take a look at some of the examples down there and see if something inspires you; maybe it does, maybe it doesn't, and I will make one map myself. You see here they added markers, "click me"; you can do all kinds of stuff with this, and it's actually quite fun to play around with. There are circle markers too, and I'm going to use something like that. You can do whatever you want; it's up to you, but here you get the map running. So far so good; play around with it. I'm going to, because I love this kind of stuff, and you should do the same. And finally: are there any useful insights you want to highlight, any future measurements? It's up to you. Perfect. Awesome. I hope you're ready for this one, because it's gonna be amazing, and I'm looking forward to seeing you in the next part. Right now you should stop the video and try it on your own, and if you get stuck, I'm here to help you in the next part of this video. See you in a moment. Bye bye.
Did you manage? Did you make a great interactive map? If not, don't worry; let's try to do it together. Let's jump into the Jupyter Notebook and get started. The first thing we need to do is import this piece here: Shift+Enter. Did you manage that? Maybe? Yeah, good.
Then we need to make a connector, and the trick here, I mean, the trick here is to remember to assign it to a variable; in the lecture I used conn: sqlite3.connect with the file name, which is the Dallas database file. Here you go. Perfect. That's basically what you need to do in this one. The next one, task 1c, is to read the data into DataFrames: it says to use pandas read_sql with the statement, and the statement is down here, so it's basically putting the pieces together. Let's do that: incidents = pd.read_sql("SELECT * FROM incidents", conn), then officers = pd.read_sql("SELECT * FROM officers", conn), and subjects = pd.read_sql("SELECT * FROM subjects", conn). Perfect. And actually here you see something very common: you type a lot of stuff and mistype something, like incident versus incidents. Many people think that as an experienced programmer you don't make any mistakes, but yes, you do: you make typos, you make mistakes all the time, you make false conclusions; it's just that you're so used to the loop where you make mistakes that you don't get scared, and you understand better and better what the error is trying to communicate to you. So that's basically what we wanted to do in this one. Explore the length of the data sets, so let's just do that: len(incidents), oh, I called it incident there as well; it doesn't really matter, let's just keep it, then len(officers) and len(subjects). Here we get the lengths of all of them: we have 219 incidents, we have 370 officers involved, and 223 subjects. So notice it's difficult to create one data set for both problems, the subjects and the officers; explore the data further to understand why. So
what connects officers to subjects? It's incidents, right? So let's look at incidents.head(): for each incident we have a case number, a date, a location, subject statuses, weapons, and subjects. This one has subject count one and officer count two, so if we wanted to represent it flat, we would need two rows for this one. But there's more: other cases, for instance, have three subjects, so that's three-to-one; for every single row there might be multiple subjects and multiple officers. For instance, with three subjects and two officers we would need six rows, three times two, because for each subject we would need both officers. In that way it becomes difficult to represent the data, but as I told you, if the focus is on officers you can flatten it out, and the same with subjects; flattening both of them out doesn't really make sense, or you explode the data set, and you need to know that. Perfect. Read the data: what we're going to focus on is subject_incidents, so that is pd.read_sql("SELECT * FROM subjects JOIN incidents ON subjects.case_number = incidents.case_number", conn). Good. That should basically do it, and subject_incidents has 223 rows; remember we had 223 subjects, so we have a few cases with more subjects on them. Perfect. And let's just look at subject_incidents.head():
Let’s just see it here right so for each subject name for each subject name right we have this the case joined to it here perfect awesome good prepare explore data visualize ideas cleaning data so let’s try to take it here sub checked here d types so what we have here is
Good. Prepare: explore the data, visualize ideas, clean the data. Let's take subject_incidents dtypes: everything is object except subject count and officer count, and of course lat and long, which are floats; the rest are strings, so that's a good start. Check for null values: on subject_incidents I use isnull with sum, and the reason I use sum instead of any is that sum counts how many values are missing in each column. The attorney general forms URLs column is missing a lot, but we don't care about that, and the same for grand jury disposition. There are nine rows missing longitude and latitude, so not a lot, and 18 first names are missing, which I would say doesn't really matter for our case. Good.
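The null check can be sketched like this on made-up data; the point is that isnull().sum() counts the missing values per column, where any() would only say whether there are any:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Ann", None, "Bo", "Cy"],
    "latitude":   [32.7, 32.8, np.nan, 32.9],
})

# sum() counts True values (missing entries) column by column.
missing = df.isnull().sum()
print(missing["first_name"])  # 1
print(missing["latitude"])    # 1
```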
Next, explore the subject statuses column. We have a subject statuses column, and we can use group by on it: on subject_incidents we do a group by on subject statuses and then a count. What you will notice immediately is that it shows the same count in every column, but that doesn't matter, because we're only interested in one of them, whichever we want to look at. You see the main categories are deceased, injured, and shoot and miss; other than that we have a few mixed values, like one "injured, deceased" and a few others. Out of all the 200-something cases, only seven are not in the three main categories. That tells us we can use this column: it has three main categories, and the rest don't really matter.
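The category count above can be sketched like this; the column name and values are made up to mirror the lesson:

```python
import pandas as pd

df = pd.DataFrame({
    "subject_statuses": ["Deceased", "Injured", "Shoot and Miss",
                         "Injured", "Deceased", "Deceased"],
    "case_number": ["a", "b", "c", "d", "e", "f"],
})

# count() fills every column with the same per-group count,
# so reading any single column is enough.
counts = df.groupby("subject_statuses").count()
print(counts.loc["Deceased", "case_number"])  # 3
```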
Repeat the previous step for race: you saw how I did it, so just copy paste. We have four races, plus one category with a really low count, which doesn't matter. Explore more columns if you want; I only proposed these two columns because they're the ones we want to use. Visualizing ideas: what we want to do is make a visual plot of the shootings, so let's explore a plot based on longitude and latitude. It's straightforward: on subject_incidents we make a scatter plot with x equal to longitude and y equal to latitude. We don't have any map yet, that will come later, but what we have here is all the places where there were shootings inside our city area, with longitude on one axis and latitude on the other. This in itself can be very interesting, because it shows you whether some areas have more shootings than others. For instance, if I'm buying a house in this city, maybe there are areas I want to stay out of, the main areas with a lot of shootings, and areas where I would rather live. So this is actually a useful insight for me as a buyer of a house or an apartment. Good, let's continue our exploration.
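A minimal sketch of the scatter plot, with made-up coordinates and assumed column names (the real dataset's columns may be named differently):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import pandas as pd

df = pd.DataFrame({"longitude": [-96.80, -96.70, -96.75],
                   "latitude":  [32.78, 32.80, 32.76]})

# Each dot is one shooting location; clusters hint at hotspots.
ax = df.plot.scatter(x="longitude", y="latitude")
print(ax.get_xlabel())  # longitude
```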
What we want to do next is feature selection: we're going to focus on a few features and make a dataset with race, subject statuses, latitude, and longitude. So let's do that; to make it easier to work with, we take subject_incidents with the list of those four columns. Then it says here: to make further processing easier, apply dropna to remove missing data, because remember, there were nine rows without data for lat and long. Dropna, perfect. I'd just like to take the length of the dataset to make sure we still have data, and you see we have plenty, it's fine.
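The feature selection and dropna steps above can be sketched as follows (column names assumed, data made up):

```python
import numpy as np
import pandas as pd

subject_incidents = pd.DataFrame({
    "race": ["B", "W", "H"],
    "subject_statuses": ["Deceased", "Injured", "Injured"],
    "latitude": [32.78, np.nan, 32.76],
    "longitude": [-96.80, -96.70, np.nan],
    "extra_column": [1, 2, 3],  # something we do not need
})

# Keep only the four features, then drop rows missing coordinates.
dataset = subject_incidents[
    ["race", "subject_statuses", "latitude", "longitude"]].dropna()
print(len(dataset))  # 1: two rows lost their lat or long
```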
How do we visualize the features? We start by doing some mapping of the data, so let's try that. It basically says here what to do, and you can more or less copy paste it, which I know is a bit lazy. So: create a dict which maps the categories. Let's take the head of data... oh, you see, it was not called data, it's called dataset; why did I call it dataset up there? Never mind. You see here how it works, and just for ease of use I corrected it, so it's easier for you to do. Convert the column: again, you see dataset here. We basically do the same thing, but with one difference: we're using get. What does get do? It looks the key up if it exists; let's see if we can find the docstring... no, that's the wrong manual, it's fine. Basically what get does here is look up x, and if x is not available it returns 100 as a default. Perfect.
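The dict lookup with a default can be sketched like this; the mapping values are invented, but the get(x, 100) fallback is the mechanism from the lesson:

```python
import pandas as pd

# Map each status to a plotting size; unknown values fall back to 100.
size_map = {"Deceased": 1000, "Injured": 500, "Shoot and Miss": 250}
statuses = pd.Series(["Deceased", "Injured", "Injured, Deceased"])

sizes = statuses.apply(lambda x: size_map.get(x, 100))
print(list(sizes))  # [1000, 500, 100]
```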
Visualizing the data. Okay, again I call it data here when it should be dataset, but it gives an idea of how to do it, so let's just try. Now we get all the dots in different sizes, and this might be a good plot or a bad one, I don't know. Maybe we should put the plot below; let's try that. You see here where the shootings happened, and the point is the circles: the circle size is based on the subject statuses, so the bigger the circle, the more severe the injury was, and the color shows the race, so you can see in which areas we have different races. It's very blue down here, and there are more different colors up here; these are the kinds of things you understand with this plot. One thing you could do is play around with this plot; I think the circles may be too big in this plotting, but it's fine, you can adjust that. You can also add a title; I'm not very good at making titles right now, so it will just be "Dallas shootings", and that's perfectly fine, you can do whatever you want and play with it. I would probably make the circles smaller, and how do you do that? The circle sizes are set inside here, and you can change them; I think I was overestimating them a bit. Good.
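A sketch of the sized and colored scatter plot with a title; the column names, the size divisor, and the colors are assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import pandas as pd

dataset = pd.DataFrame({"longitude": [-96.80, -96.70],
                        "latitude":  [32.78, 32.80],
                        "size":      [1000, 250],      # severity
                        "color":     ["blue", "red"]})  # race category

# s= scales each dot, c= colors it; shrink s if circles look too big.
ax = dataset.plot.scatter(x="longitude", y="latitude",
                          s=dataset["size"] / 10,
                          c=dataset["color"].tolist(),
                          title="Dallas shootings")
print(ax.get_title())  # Dallas shootings
```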
The next thing is really amazing, because now we want to make a map, and this is where folium comes into the picture. If you don't have it installed already, you can install folium by executing this cell; just comment it in and execute it. If you already have it, like I do, you don't need to install it again, so I'll just run this one. The first thing we do is create a map, and how does that work? Whoa, did you see that? I created a map with this location, and this location is Dallas, and as you see it's an interactive map: you can drag it around and zoom in and out. This is really, really fun. But we actually want to plot our data on it, and I told you to look in the manual; a quick look and you'll find CircleMarker. So what I want to do is use this CircleMarker and add one to the map for each row; the CircleMarker is one of the easier markers, so let's try it. We want to do it for all the rows, so we write: for underscore, row in dataset iterrows. Then we add them one by one to the map with a CircleMarker. We obviously don't want the same location for all of them, so the location comes from the row's latitude and longitude. Then we need the size: remember subject statuses; we have integer values, so we use integer division to scale it down. We will not have any popup, the color should be the row's race color, and we set fill to true with the same fill color, so there's both an outline color and a fill color. Good, this should add the markers to the map. So what we're doing here is: we iterate over each row in the dataset and add a CircleMarker at the latitude and longitude; each marker gets a radius which is the subject statuses size divided by 100 to make it a bit smaller, the color set to the race color, remember we added that in the race mapping, and fill true with the fill color also set to the race color. We do that for each and every row, and when we see the map afterwards, all the circles are marked on it.
And we can zoom in and out on these circles, and the severity is again shown by the size. As a user you can look around: where do the shootings happen? It seems like no shootings are happening in this area here, while a lot of shootings are happening over here, and you can also see which races are getting shot where, the blue category down here and so on. So we created an awesome mapping of this, and it's pretty amazing how little code it actually takes when you have prepared the data well. Finally, the fifth step, which I'm not going to go into, is optional: are there any insights you want to highlight, and is there any way to measure them? I really hope you enjoyed this one, because I really love doing these things; it's amazing, I love to play around with this just for fun. The next one is going to be great, because we're going to dive into CSV files, Excel files, and Parquet files. Many people, at least when they start out, don't know Parquet files, but it's an amazing file format for storing values, and it's used in many big data systems, so you need to know it. That will be the last lesson on how to get data inside your data frames and work with it, and then we'll continue the journey afterwards. So are you excited? I hope so. If you liked this one, please share your comments down below, and also subscribe and like and all that kind of stuff; it helps me grow my channel so I can help more people, and that is what drives me forward to create more free courses for you guys. See you in the future, bye bye.

When it comes to pandas data frames, this is the most amazing tool to import data; it supports so many formats. In this one we're going to look at how to import
CSV files, Excel files, and Parquet files. If you don't know what all of them are, especially Parquet files, you need to learn them. In the project we're going to find some data online on the internet, download it, and use pandas data frames to import it into our project, and then we're going to visualize it and do some magic with it. It's actually going to be pretty fun, because sometimes one chart gives you one idea of what's happening while another one shows you something different; it's going to be amazing, and I think you'll be quite surprised. So are you ready? I hope so. If you're new to this series: this is a 15-part course on data science with Python, and if you want to get started and know more about it, there's a link down in the description where you can get all the notebooks and all the data we're using here. So don't be shy, just get started, and you know what, it's also free; yeah, I know, I don't charge anything for it, I just enjoy helping people, and that's why I'm here: to help you become a data scientist. So are you ready? I hope so, let's get started.

In the Jupyter Notebook, again, we always focus on the bigger picture: the main reason we are data scientists is to make useful insights for the end customers, and if we don't do that well, we're failing at our job. And also, as I
mentioned many times before, most courses and tutorials focus only on preparing and analyzing data, and in the end that's all you master in those courses, but real data scientists need to look at the bigger picture. This one is more about acquiring data, because you need all the steps and all the tools in order to get to the useful insights. The good news is that you don't need to be a master or expert in all of them; you just need to know what to do in each step, and then, with little or no expert knowledge, you can actually create amazing results for your end users and customers. Perfect. So we are here in the acquiring step, and the most common data sources are web scraping, databases, CSV, Excel, and Parquet; in this one we focus on CSV, Excel, and Parquet files. Awesome. CSV files, let's start with them. Most people know what comma-separated values files are; if you want to read a bit more about them, you can do so on Wikipedia, I'm not going to go into depth with it. I also have a lecture on CSV files, a YouTube tutorial with an e-book, which is part of a full course, so if you want to learn a bit more Python, it's great advice to take that free course; it's eight hours, it's out there, and people love it. Perfect. So CSV is a common data exchange format, and with these old formats, the ones that succeed are the ones that are simple, and that is why CSV is so successful: it is very simple and easy to use. You don't need any extra modules, unlike databases, where you often need a database server and so on; that's why CSV has become so popular for keeping structured data. So how do you read CSV? The good news is, yes, you can use your pandas data frames to do all of it, and we will do that. First of all, we have some files here, and I actually think we're just going to explore
the files that we are looking at. Inside here we have the Apple files that we're going to use; we have them as CSV, Parquet, and xlsx Excel files. The CSV file looks something like this: it is comma separated, so the first row gives the column names, and the rows afterwards hold the data values, row by row. Perfect. Let's start by reading it and focus on the few arguments you need to master: data = pd.read_csv, files, and then the Apple CSV. Let's start by exploring that and see what the problems are. One problem we noticed before is that the date here is parsed as a text string, so it's not actually represented as a date, and we want it to be. Another problem is that we often want the date to be the index, rather than pandas adding an additional index; this is basically what we've seen before, but we're showing it again. Good. If we pass parse_dates as true and index_col as zero, then we immediately see it's nicer to look at: the date becomes the index, and in data.index we can see the index is now a DatetimeIndex. It wasn't before: if we remove the parse_dates argument, we see the index is an object, not a datetime, so that's why we need it, to make the index a datetime. A final thing, well, those two are the common arguments: often one of the columns is the index, and it doesn't always have to be the first one, zero; you can actually write the column name instead. It will understand if you put "Date" there, and if you don't know the column number, for instance if you wanted "Close" to be the index, which I don't think is a great idea, but still, you can write it by name like this. Finally, the argument I think you will use the most is the separator, because sometimes the separator is not a comma; when you look at a file, sometimes people use their own format, and semicolons and the like are often used, so you will need to change it. With formats, people tend to twist them a bit and do things differently, and it's kind of annoying that they don't all use the same conventions, but luckily this CSV reader is amazingly powerful and handles all of it; if something is failing, look through all the argument possibilities, because somebody else has had the same problem. Good, that's basically CSV files.
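The three arguments can be sketched on an inline CSV (in the lesson the data comes from the Apple file on disk):

```python
import io
import pandas as pd

csv_text = "Date;Close\n2020-01-02;75.1\n2020-01-03;74.4\n"

data = pd.read_csv(io.StringIO(csv_text),
                   sep=";",            # the separator is not always a comma
                   index_col="Date",   # by position (0) or by name
                   parse_dates=True)   # parse the index as datetimes
print(type(data.index).__name__)  # DatetimeIndex
```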
Let's go to Excel files. If you don't know what an Excel file is, I'll show you: what we have here is an Excel file, and I can see the top didn't come along, but it doesn't really matter; let me zoom a bit more. In the Excel file we have the same data: the date, then high, low, open, close, volume, and adjusted close, but it's inside an Excel sheet. This is the first sheet; we could have multiple sheets, but we're not going to go into the details of that. The reason I want to show you Excel sheets is that people work with them so often in organizations, and then you have some data in an Excel sheet, and what do you do? You don't want to compromise by copying your data into the Excel sheet; no, you want the Excel sheet's data inside the data frame, so you can do your mastery there. Luckily, we have the Excel file in the folder, as we saw, and you basically do the same thing: we have read_excel here, and if you look at it closely, you'll see it has a lot of arguments as well, but at the end of the documentation there are examples with Excel files: how to use index_col, and how to use sheet names instead. So you have examples of the most common use cases, and that's why I really like these documentation pages; I often go to the bottom, where the most common use cases are. I would say read_csv is one of the exceptions to that. Perfect. Here we do the same: we pass index_col, and let's look at the head of data. We have the same data, and the interesting part, as we see, is that the index is already a datetime. That's because Excel has information about the types: the date is not stored as a string, it's already converted, and you don't need to parse it, because Excel has data types in it. Perfect, that's basically Excel files.
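A sketch of the Excel round trip (requires the openpyxl engine installed; the file name and values are made up). Because Excel stores data types, the dates come back as a DatetimeIndex without any parse_dates argument:

```python
import pandas as pd

df = pd.DataFrame({"Close": [75.1, 74.4]},
                  index=pd.to_datetime(["2020-01-02", "2020-01-03"]))
df.index.name = "Date"
df.to_excel("apple_demo.xlsx")

# index_col is enough; no date parsing needed for Excel.
data = pd.read_excel("apple_demo.xlsx", index_col=0)
print(type(data.index).__name__)  # DatetimeIndex
```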
The important thing to understand is this: when a friend comes with a big dataset in an Excel sheet, you don't want to compromise by copying your data into the Excel sheet; no, you want to take their data, put it inside your awesome data frame, and continue working with it there. Perfect. Parquet is a free, open-source format, and Parquet files are amazing; I must say, when you first start using them, you can't stop. It's a big data format: a free, open-source, column-oriented data storage format for the Apache Hadoop ecosystem. You've probably heard about Hadoop if you've heard about big data, and big data is the future, so you need to understand this format, and it's not that difficult. One thing to look at, I don't know if you know it, but if you start a line with an exclamation mark, you're actually executing the code in the terminal. What I want you to see here is the CSV file, the Excel file, and the Parquet file side by side, and the great thing about the Parquet file is that it's smaller; you can see the Excel file is not that much bigger, but the CSV file is actually double the size. With small datasets like this one it doesn't have that great an impact, but on big datasets you will often see that storing the data in CSV format is a factor of 10 to 20 larger, so the compression in the Parquet file is actually amazing; it's an effective way to keep data compressed. You might wonder, why not just use zip files? Because Parquet is optimized for structured, column-based data; like it says, it is a format for column-oriented data, and it's optimized for exactly that. That's why you would want to use it, and you can do a lot of tricks on the Parquet files themselves, like filtering while you read them, but we're not going to go into those details, because most of the time you just need the data inside the data frame so you can continue your journey. Perfect, awesome. So that's the point: with the small file sizes here, where we don't have much data, it doesn't make a big difference, but later, with bigger datasets, you save a factor of 10 to 20, which is quite a lot. Another great thing, again compared to a CSV file, is that it has the data types in it: it knows which columns are data and which one is the index, and if we look at data.index, you can see this is also a datetime. It knows all these things because it has a strict schema: it knows that the date is the index and a datetime object, while the other columns are floats in this case. Perfect. So the key thing to understand again is that when you acquire data, you want to get all the data from the different data sources inside a data frame.
Everything you need to know about working with data frames, joining data from different sources and all that kind of stuff, we're going to cover in the next lesson, and it's actually pretty amazing. Good. That said, one of the most common questions you get is: where do I find data? When you work as a data scientist, you often get data from your customer, and that is your main focus, but sometimes you need to combine it with other data, or your client, like the newspaper we played with before that wanted insights into different things, needs you to find the data online, and that is the main focus of these lessons. Great places to find data: I listed some of my favorites, and there are more, so please let me know in the comments if you know an awesome place to find data, because I would love to build out this list. Just a few of them, and I'm not going to go through all: here's a great machine learning repository with some amazing datasets in it; I'm not going to show you any of them now. Then there's the World Bank, which I really like and actually use a lot; there you can find datasets on amazing things about the world economy and all kinds of metrics. Finally, Kaggle is a famous place for people starting out in data science, and not just for beginners: people share datasets there, show how they work with them, share best practices, and compete for good results, so you can practice yourself and see how other people do it. It's an amazing place. I'm not going to go through more of them, but here's a list for you to explore; look at them and try to play with them. In the project that we're going to start in a moment, I'm going to show you how you could work with, yes, the World Bank or some other places: extract data, put it in a data frame, and try to present something. And there's actually a twist in this one, because how you represent the data determines what message you actually give to people; it's going to be surprising. Good, so are you ready? I hope so; see you in a moment when we start the project.
Are you ready for the project? It's going to be fun, because I love doing these things; it's amazing, actually. So let's jump into our Jupyter Notebook and get started. Here we have it: identify data. We've been working a lot with acquiring data, and, well, it all starts with a customer. The goal of the project is to figure out what is happening with the world population growth rate. Are we continuing to grow? We hear a lot of estimates that population growth will stop at some point, but sometimes I'm just curious: what can we figure out ourselves, and how does data representation actually paint different pictures of what's happening? So: is the growth rate stable, increasing, or declining? That's the question. Just imagine this could be for your own sake, or you have a blog and want to write about it, or, since people don't really use blogs anymore, a social media post where you want to deliver a message, or it could be for a newspaper. This is about teaching you how to do this kind of research. The first thing is obviously to acquire the data... no, we start by exploring the problem: what data do we need to answer this question? Well, I would say we need... I'm getting ahead of myself; this is for you first, so let me not answer that. Of course, since I have to do this project with you, I need to guide you, but I want you to stop here and think about it yourself. So, for the next step:
there are multiple sources of this data, and you can find it in different places. I will use the World Bank; you can also look at Wikipedia, and there are certainly a lot of other places you can find it. Then we need to download the data, and basically, how do we do that? Browse around and figure it out. When we have downloaded the data, we need to import some libraries, so let's do that, and then we need to read the data. I'm showing you here that it is actually already downloaded for you, but I think you should do it on your own and then change the file path to read it from wherever you downloaded it to; and remember, we need to use a data variable. When you look at the data, like I told you, sometimes people represent the data a bit differently, and in this specific file from the World Bank, the first four lines are actually not useful, so you need to skip ahead with skiprows; skiprows skips rows while reading the CSV file into the data frame, it does not change the CSV file itself.
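The skiprows idea can be sketched on an inline CSV; the four header lines here are invented stand-ins for the World Bank file's metadata:

```python
import io
import pandas as pd

csv_text = ("Data Source,World Development Indicators\n"
            "\n"
            "Last Updated Date,2021-01-01\n"
            "\n"
            "Country Name,Country Code,Indicator Name\n"
            "World,WLD,Population growth\n")

# Skip the four metadata lines so line five becomes the header.
data = pd.read_csv(io.StringIO(csv_text), skiprows=4)
print(list(data.columns))  # ['Country Name', 'Country Code', 'Indicator Name']
```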
Perfect. Then apply head just to see it is as expected; if you don't use skiprows, it will not look nice. Get the world data: we are only interested in the data for the whole world, so identify the world row, where the country code is WLD, or the country name is World. This can be done as follows, and remember to keep the data, meaning you should save it in a variable; it can be data again. Check the data types; this is what we
always do, because maybe a data type is not as expected. Keep only the needed columns: you can drop the other columns with drop, passing the list of columns to delete and axis set to columns. Also notice there is an unnamed column; you remove that with dataset dropna, how all, axis columns; try to play around with it and do that. Then, it makes sense to have the years in the rows: what you will see in this dataset is that the years are on the columns, and it makes more sense to have it the other way around. There's something called transpose, and if you don't know what transposing is, you will now: it makes the columns into rows and the rows into columns, which is exactly what we need. Rename the column: the column named 259 can be renamed as follows.
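The cleanup steps above can be sketched like this; 259 is the World row's index in the lesson's file, but the column names and values here are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Country Name": ["World"],
                   "Indicator Code": ["SP.POP.GROW"],
                   "2000": [1.35],
                   "Unnamed: 65": [np.nan]},
                  index=[259])

df = df.drop(["Country Name", "Indicator Code"], axis=1)  # unneeded columns
df = df.dropna(how="all", axis=1)       # drop the all-NaN unnamed column
data = df.transpose().rename(columns={259: "World Population"})
print(list(data.columns))  # ['World Population']
print(list(data.index))    # ['2000']
```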
You can rename it to World Population. Perfect. Visualize the data to check the data quality: you can plot the data frame with plot. Analyze: calculate the yearly percentage change. We want to explore the growth of the world population, so the first step is to calculate the yearly growth, which we do with percentage change; you can see in the documentation what it does: it takes the current and the prior element and computes the change between them. Perfect. And this is fun, because now we visualize this one.
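The percentage-change step can be sketched like this on a toy series; pct_change divides each element by the previous one and subtracts 1:

```python
import pandas as pd

population = pd.Series([100.0, 102.0, 105.0], index=[2000, 2001, 2002])

growth = population.pct_change()
print(round(growth[2001], 4))  # 0.02  (102 / 100 - 1)
```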
It gives you an idea of the trend. If you look at the original plot, the one with the population itself, I want you to look at it carefully: what message does it tell you? And then the percentage change down here: what message does that tell you? Then we want to smooth the result, because often the results are a bit up one year, a bit down the next, up, down. So what you want to do is calculate a 10-year rolling average using rolling and mean. We haven't done that before, but you have links to the documentation here, and basically all you need to do is this; then you make a plot with the new calculation, and what does it tell you? It gives you a more averaged trend, not so focused on the specific years, and that's a common way of working with data.
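The smoothing step can be sketched like this; the lesson uses a 10-year window, and a window of 3 is used here so the toy data fits:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each value becomes the mean of itself and the two values before it;
# the first two positions have no full window and stay NaN.
smooth = s.rolling(3).mean()
print(smooth.iloc[-1])  # 4.0  (mean of 3, 4, 5)
```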
This is for when you're not interested in the specific years but in the overall trend. When you do the reporting, we need to transform the data to make the plot more readable: transform the data to percentages, so you multiply by 100, because it's not in percent to start with. Then set a title and labels on the axes; you can put whatever labels you prefer to make the message more readable and more consumable for the end users. Finally, you want to set some ranges on the axes. Remember when we talked about how people perceive charts: when the axis only spans the data and the line goes all the way up and all the way down, it looks really extreme, but when you instead scale the y-axis from zero, you might get a chart that looks much flatter, and it doesn't look as extreme anymore. That's the point. And finally, let's assume you're doing this for your own social media: are there any insights to use, and
what is it you want to propose to people? Is this a message you want to tell them, and is there a way to follow up on it? That's your job. Okay, so my advice now is to scroll up to the top and start working on this. If you get stuck: in the next part after this, I will do it together with you, but my advice is to try on your own first, because this is how you learn, not by watching me do it. If you get stuck, that's normal; I get stuck all the time in the work I do, so don't worry about it. Just watch how I do it, then try to do it yourself again, and continue like that; it's actually pretty fun. I hope you enjoyed this, and let's stop the video. Did you stop? Okay, I'll show you in a moment how I would solve it. But stop! Did you manage to do it all by yourself, and you just want to see how I do it? Amazing. So let's get started in the Jupyter Notebook; I will try to do it myself, and if I get stuck, I hope you can help me.
Good, let's jump in. We are at step 1a; I always want to say A1, but it's 1a, because it's step one, part a: explore and understand the problem. Again, what is it we need to answer? The growth rate: is it stable, is it increasing, is it declining? That's our interest; we read so much about it in the media, and we want to do our own research and figure it out. So what do we need to know? Well, we need the data for the world population, and a great place, which I really like, is the World Bank. Just clicking on that, you will actually get the data, and this chart is a pretty scary line, and it's the first thing we see: if you look at these numbers, nothing much seems to be happening, the growth seems straightforward. But that's not what we're going to settle for; we're going to get the data and explore it ourselves. One way to get it is as a CSV file, which is the common format, so you just download it here, boom, and it will be ready for you; I already downloaded it, so it's fine. I also just want to mention that there are many other places: Wikipedia would probably also be a decent source of information, it has the world population, and we could use web scraping for that if you prefer, but I'm not going to go into that, because the purpose of this one is not web scraping. So what we did now is download and import the data, or, well, we didn't import it yet:
Right we downloaded it from the world bank and i advise you to do the same because that’s the data we’re going to use here then we import libraries and in step one e we need to read the csv file write data pd read csv files and
I just want to emphasize that you might have a different location where you downloaded it, but if you don't want to download it, it is available here. So let's first see what goes wrong, because as I told you, sometimes things go wrong and it's like, oh, I'm in panic. Let's just explore it together. If we open the file, it's a big data set, and you see the first four rows are actually not useful; the real data starts at row five. That was the point. This happens many times: they add some additional information at the top that's not really useful for you, and that's why we need skiprows=4. How do I know about skiprows? Again, you look in the documentation. So now, with data.head(), we have all the data, and this is exactly what we want: we have Country Name, Country Code, Indicator Name, and Indicator Code — the same indicator all the way down — and then
we have the years, from 1960 up until 2020. So far so good. But you can see we need to do something with the data, because it's not tidy yet. We're only interested in the data for the world, and the world row can be identified by Country Name "World" or Country Code "WLD". This can be done as follows. Remember to keep the result, so let's assign it to a variable dataset. Now we actually have only one row of data, and it contains all the data we need. Perfect.
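The two steps above — reading past the junk rows and filtering down to the aggregate row — can be sketched like this. Since you might not have the file handy, this sketch reads a tiny in-memory stand-in for the World Bank CSV; in the lesson you would pass the downloaded file path instead.

```python
import io
import pandas as pd

# Tiny stand-in for the World Bank CSV: four metadata lines, then the real header.
csv_text = """Data Source,World Development Indicators
Last Updated Date,2021-12-16
,
,
Country Name,Country Code,1960,1961,1962
World,WLD,3032156070,3071596346,3124822788
Denmark,DNK,4579603,4611687,4647727
"""

# skiprows=4 jumps past the four junk lines so row five becomes the header
data = pd.read_csv(io.StringIO(csv_text), skiprows=4)

# Keep only the aggregate row; it can be identified by Country Code 'WLD'
dataset = data[data["Country Code"] == "WLD"]
print(dataset)
```

With the real file you would simply write `pd.read_csv("your-downloaded-file.csv", skiprows=4)`.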
Prepare. Let's look at the data types with dataset.dtypes. You see there are a lot of columns, and most of them are floats, but the first ones are object, object, object, and we probably don't need them; we're only interested in the year columns. So you can drop the other columns: dataset = dataset.drop(...). You could also use inplace=True in the argument list instead of assigning back to the same variable — I'm talking while typing, which is not a good idea — but I'm not going to do that here. So which columns are they? We had Country Name, Country Code, Indicator Name, and Indicator Code. Then dataset.head() — and something went wrong; it says "not found in axis". It's probably a misspelling — no, I forgot about it: we have to pass axis="columns". Perfect. So now we have the data, and at the end there's an "Unnamed" column, and it can be removed the same way. Let's do that; dataset.head() shows the one at the end is removed. Awesome. So now we have only the data we need, and this is often good practice when you work with data: remove all the things you don't need, because they're just going to add noise. And now to this transposing thing. Let's try dataset.T and see what happens. What happens is you get one column with all the data, and this is more convenient, so let's keep it like that and assign it back to dataset. Now, if we do head(), dataset is one column with all the values.
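The drop-and-transpose step can be sketched like this, on a minimal one-row stand-in for the filtered World data (the values are illustrative):

```python
import pandas as pd

# Minimal stand-in for the one-row World dataset
dataset = pd.DataFrame({
    "Country Name": ["World"],
    "Country Code": ["WLD"],
    "Indicator Name": ["Population, total"],
    "Indicator Code": ["SP.POP.TOTL"],
    "1960": [3032156070],
    "1961": [3071596346],
})

# Drop the text columns; note axis='columns' — forgetting it raises the
# "not found in axis" error seen in the video
dataset = dataset.drop(
    ["Country Name", "Country Code", "Indicator Name", "Indicator Code"],
    axis="columns",
)

# Transpose: the years become the index, leaving one column of values
dataset = dataset.T
print(dataset)
```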
And now to the renaming: the name of this column is not very convenient, so that's what we're going to fix in step 2d. You can do it like this: dataset = dataset.rename(...), then dataset.head(). Here we go — now it says "World Population" instead of that annoying 259. Visualize data. Let's visualize it with a simple dataset.plot(). This is a scary chart when you look at it: it's basically a straight line. It tells me one story when I see it, and if you don't think about it, it's just like, whoa, this is out of control, the world population is increasing at the same rate every year. Scary.
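The rename step can be sketched as follows. After the transpose, the single column is named after the original row label (259 in the video, since that was the World row's index); renaming gives it a readable name before plotting:

```python
import pandas as pd

# Transposed data: index = years, single column named after the old row label
dataset = pd.DataFrame(
    {259: [3032156070, 3071596346, 3126849612]},
    index=["1960", "1961", "1962"],
)

# Give the column a readable name instead of the row label 259
dataset = dataset.rename(columns={259: "World Population"})

# dataset.plot() would now draw the near-straight line discussed above,
# with a sensible legend entry
print(dataset.columns)
```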
Let's analyze it a bit more. What we're looking at is the absolute change every year, and that seems to be quite stable, but what we're looking for is actually the growth rate. And how do you get the growth rate? You get it with the percentage change. Let's try that: we add a new column, "Yearly Growth", and we take the percentage change — oh, we need to do it on a specific column, the one called "World Population" — so dataset["World Population"].pct_change(). Then dataset.head(). You see the first value is not a number, and that doesn't really matter: pct_change takes the current value and computes the change from the previous one, so you cannot do that for the first row, because there is no previous one. Then you see the growth rate was 1.3 percent — 0.013 is 1.3 percent — then 1.7, then 2, and 2, and so forth. So let's visualize that instead: take the "Yearly Growth" column and plot it. Now, when you look at this chart, you actually see something. In the 60s it grew a lot; it was up above two percent.
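The growth-rate step can be sketched like this (illustrative values; the real data has all years from 1960 on):

```python
import pandas as pd

dataset = pd.DataFrame(
    {"World Population": [3032156070.0, 3071596346.0, 3126849612.0]},
    index=["1960", "1961", "1962"],
)

# pct_change compares each row with the previous one, so the first
# value is NaN — there is nothing before it to compare against
dataset["Yearly Growth"] = dataset["World Population"].pct_change()
print(dataset["Yearly Growth"])
```

The second value comes out around 0.013, i.e. 1.3 percent growth from 1960 to 1961, matching the reading in the video.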
And then it's been declining, declining, and declining. So when you looked at the first chart, you thought: I can see the trend, it goes up in one straight line. But that's actually not the truth. The truth is that the rate of change is declining. Does it decline fast enough, though? Well, let's look at the trend more closely by smoothing the result, and you can do that with rolling and mean. What rolling does, in this case rolling(10), is take the last 10 entries and compute their mean value. Let me show you what that means. If you notice, the first 10 values will be NaN, because it needs 10 values in order to compute the mean; once it has the last 10 values, it produces a mean value. Let's call this dataset_smooth — I don't know what else to call it — and plot its "Yearly Growth". What it says is: once you have the first 10 years, around 1970, you get the first point,
and from then on you see a 10-year trend. You can read it like this: as long as the orange line, the actual yearly value, is lower than the smoothed line, it's pulling the average further down; if the orange line goes above the smoothing line, it's pulling it up. So what we're looking for is, for instance, this spot: here it was bad, the orange came above, and you can see the 10-year smoothing go up a bit. But as long as it stays below, it is a good sign, because it's pulling the average further down.
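The smoothing step can be sketched like this. To keep the example tiny it uses a 3-entry window on made-up growth values; the lesson uses rolling(10) on the real series:

```python
import pandas as pd

growth = pd.Series([1.3, 1.8, 2.0, 2.1, 1.9], name="Yearly Growth")

# rolling(3).mean() averages each value with the 2 before it; the first
# two entries are NaN because a full window of 3 is not yet available
smooth = growth.rolling(3).mean()
print(smooth)
```

Plotting `growth` and `smooth` on the same axes gives the orange actual line against the smoothed trend discussed above.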
Our goal is obviously to get down to zero, or even less if we want a smaller world population. Good. So what we want to do now is make a better chart. Let's do that. First we multiply the growth by 100, then make the chart again, and now you have the percentage on the axis, so it's easier to read for the user. That was the first improvement.
Perfect. Then we want to put labels on it: the x-label could be "Year", the y-label could be "Yearly growth", and the title could be "World population growth". So you just put some titles there; maybe you have better titles than mine, and that would be amazing. The final part is adding ylim=0, and why do I do that? Because this gives you a picture of how far we are from reaching our goal, while the zoomed-in chart looks like, whoa, it's going really, really well. You need to see where the values sit on the scale: if the growth were, say, 10 percent, this line up here would look basically flat, so you wouldn't see any difference. The zoomed chart gives the impression that this is amazing, enormous progress — and actually it is not that bad — but the zero-based axis gives you perspective on the goal: we need a stable world population, because we cannot keep growing. Or at least that would be my call on it; I don't know what your thoughts are.
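The labelled, zero-anchored chart can be sketched like this, assuming a recent pandas (1.1+) where plot accepts xlabel/ylabel directly; the data and the fixed upper limit are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import pandas as pd

# Illustrative growth values, already multiplied by 100 into percent
growth_pct = pd.Series([2.1, 1.9, 1.5, 1.2, 1.1], index=range(1970, 1975))

# Labels, a title, and a y-axis pinned at 0 so the decline is seen
# in proportion to the goal of zero growth (3 is just a sketch upper bound)
ax = growth_pct.plot(
    xlabel="Year",
    ylabel="Yearly growth (%)",
    title="World population growth",
    ylim=(0, 3),
)
print(ax.get_ylim())
```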
But anyhow, it shows you that the first chart looks crazy, while this last one gives a more honest picture of what's happening with the world population. So finally, I want you to consider: are there any insights, can you measure the impact, can you do anything about this? That's up to you. I hope you enjoyed this one, because it's been amazingly fun to create. In the next lesson we're going to continue our journey with our DataFrames. We have now learned to get data from an enormous number of sources: web scraping, databases, CSV files, Parquet files, Excel files. But sometimes we need to combine them, and that's what we're going to do next. Remember the databases lesson? If you didn't watch it, there is a lesson on it — there you could join data, and that's what we're going to look at in the next lecture: how to join data in DataFrames. So are you ready for that? If you are, and you liked this one, please subscribe and like this video, because it helps me grow and helps me understand whether people like it or not. Thank you so much; it's been awesome to have you in this lesson, and I'm looking forward to seeing you in the next one. Bye-bye.

There's a reason why we love DataFrames. First of all, it's really easy to get data from different sources, like
databases, web scraping, CSV files, and so forth. But secondly, when you work as a data scientist, you want to combine that data into one data set, and that's actually quite easy with pandas DataFrames. In this lesson I'm going to show you how to do that: how to get data from various sources and combine it into one big data set. This is the way data scientists work: they have a data set and enrich it with more data, to best support the task they're doing. In the project we're going to take a data set and enrich it with metadata, so we can present much richer findings. It's actually going to blow your mind what you can do by enriching data sets; this is going to be amazing. If you're new to this series, this is a 15-part course on data science with Python. In case you didn't notice, there's a link down below in the description where you can get all the material we're using in the course. So are you ready? I hope so. Let's jump to the Jupyter Notebook and get started. The purpose of this one is to work with combining data.
In the last few lessons we looked at how to acquire a lot of data. The next step is often: well, I get data from different sources, and I need to combine them. The way data scientists do that is to combine it into one big, rich data set they can do all the analysis on. So we will look at how to do that, and again we remind ourselves of the bigger scope of why we are data scientists: it is to get useful insights to the users. In the project we take a data set and enrich it with metadata, which makes the reporting much richer and easier to understand, so you can convey the message you want and let the user reach the insights. Okay, enough talking; let's see how this works.
So again, acquire data: often we need to combine data from different sources, and that's what this is about. I already showed you how to get data from various sources into a pandas DataFrame, and in the databases part I also showed you how to join and combine data. But sometimes you need to combine data from different types of sources — not all of them databases; one might be a database and another web scraping, and so on. In that case it's really good that you can just use pandas DataFrames to do it. You could even use DataFrames for combining database tables, so you don't need to learn all the magic with joins, left joins, and all that SQL syntax; just do it in pandas. One great resource when you want to learn about these things is the pandas cheat sheet — I showed it to you before — and it has a section on combining data sets that shows many of the common types of merging. merge is basically the fundamental one; there are also join and concat, where join builds on the merge functionality and the others exist for easier use in specific cases. You can see how each works: you have DataFrames adf and bdf, and the cheat sheet shows how the merging is done. We will have specific, easy cases here, so let's go back to our DataFrame. We will mention three different types, though we're only going to use one of them.
One of them I often use is concat: it concatenates pandas objects along an axis, and the manual has various examples of how to use it. join is a great one as well; I often use join when the index is the same in both DataFrames and the index is what we want to join on. If not, you can use set_index or the on key, but honestly I only use join when joining on the index, and it's a really great tool for that. merge, on the other hand, is more versatile. As we will see, we often have a column with the same name in both DataFrames, and that's the one we want to join on. There are inner and outer joins and so on; we're not going to go into the details, because most of the time you just need an inner join. In cases where the two frames have different column names, you can set left_on and right_on instead, but if they have the same column name you can just pass on=. I'll show you in a moment how to do it. So let's get started. The first thing you want to do is import pandas, because pandas does
everything for you, if you hadn't noticed that by now. We're looking at a data set we've seen before — I don't know if you remember it, but it was the world population one, and we downloaded it using the CSV download. When you got the files, you actually had more files in the download: the actual data in the first one, and then some metadata as well, including metadata on the countries. If you look at that metadata, it enriches the data with a lot of other information — for instance, which region and which income group each country belongs to, plus other things. We're going to look into that. So that's what I'm reading here: the data and the metadata. The data we already know: it's the populations, and we know from the previous lesson that we should skip the first four rows, because they are not part of the data set, just some extra rows. The metadata is luckily well formed, so I can read it directly. Let's explore it. The data is structured with Country Name, Country Code, Indicator Name — the same one, "Population, total", all the way down — and Indicator Code, which are internal codes of the World Bank data, and then we have the years all the way across, from 1960 to 2020. Perfect, and we have that for all the countries. When we look at the metadata, we see
immediately that it has a Country Code column too — ABW, AFE, AFG, and so on, the same codes as in the data — and then enriched columns, for instance Region and IncomeGroup (high income, low income, and so on), plus special notes, table name, and so forth. So far so good. One thing to check is whether the two have approximately the same number of rows, and we see there's one more row in the data than in the metadata. That might be because the data set contains "World" as a row, as we saw last time, and there's probably no metadata for the world — I don't know for sure, but that could be the reason. So what we want to do is use merge. Let me copy the call down here so it's easier to see. We take DataFrame one, which is our data, and we call merge on it. Just a note compared with the cheat sheet: the cheat sheet calls merge directly on the pandas library, not on a DataFrame, but that doesn't really matter, because calling it on a DataFrame simply makes that DataFrame the first argument and what you pass becomes the second; it can just be a bit confusing the first time. So DataFrame two is the metadata, and how do we want to merge? We want an inner join — you can read more about the different types — and which column do we want to merge on? Country Code is a great one, because it has the same name in both frames, so the frames can be combined on it. Let's do it and assign the result to dataset, then dataset.head(). Here we see it: we get additional columns at the end; the two sets are combined together. And how long is the data set? It's a really great thing to just check: we knew we had metadata for 265 rows, and we do have 265 of them here.
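The merge step can be sketched like this, on tiny stand-ins for the population data and the metadata file (country values are illustrative):

```python
import pandas as pd

# Tiny stand-ins for the population data and the country metadata
data = pd.DataFrame({
    "Country Name": ["Aruba", "Denmark", "World"],
    "Country Code": ["ABW", "DNK", "WLD"],
    "2020": [106766, 5831404, 7761620146],
})
metadata = pd.DataFrame({
    "Country Code": ["ABW", "DNK"],
    "Region": ["Latin America & Caribbean", "Europe & Central Asia"],
    "IncomeGroup": ["High income", "High income"],
})

# An inner merge keeps only country codes present in BOTH frames,
# so the 'World' row (which has no metadata) drops out
dataset = data.merge(metadata, how="inner", on="Country Code")
print(len(dataset))
```

Checking the length after the merge, as in the lesson, confirms how many rows survived the inner join.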
So that looks pretty awesome. One of the things you can do now is work with this enriched data; you have options you didn't have before. For instance, let's try to group by region: dataset.groupby("Region"), then sum everything within each region, take the 2020 column, and make a bar plot. What we're seeing here is the world population by region: in East Asia & Pacific we have over two billion people, in South Asia close to two billion, Sub-Saharan Africa is also a big region, Europe & Central Asia not so big, then Latin America & Caribbean, then the Middle East & North Africa — sorry, I was reading that badly — and then North America. So you can see where the population of the world is. Perfect. Another thing you could do is the same by income group.
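The regional grouping can be sketched like this (a handful of made-up rows standing in for the merged data set):

```python
import pandas as pd

dataset = pd.DataFrame({
    "Country Code": ["DNK", "DEU", "IND", "CHN"],
    "Region": ["Europe & Central Asia", "Europe & Central Asia",
               "South Asia", "East Asia & Pacific"],
    "2020": [5831404, 83160871, 1380004385, 1410929362],
})

# Sum the 2020 population within each region;
# calling .plot.bar() on the result gives the bar chart from the lesson
by_region = dataset.groupby("Region")["2020"].sum()
print(by_region)
```

Swapping "Region" for "IncomeGroup" gives the income-group chart discussed next.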
So I just copy it and do the same, just to show you. Is it called income group? No — it's "IncomeGroup", with a capital G. Good, here you see it. The World Bank has defined four income groups, and each country has one. Obviously not all the people in a country are in the same group; it's on average, I guess — you'd have to read a bit more about what it means — but it's still interesting to see that the high-income group of countries has a bit more than one billion people, the low-income group, the lowest, is actually small, the upper-middle group is larger, and the largest is the lower-middle group. So we have four categories of income groups, and while we don't know exactly what they mean, we get an indication of where the wealth in the current world is located, or how many people are in each wealth group. But it's based on countries, so I would assume a country like the US is in the high-income group; that doesn't mean everybody in the US is rich. It's just like where I live, in Denmark: it's also high income, but there are still poor people in Denmark. It gives you a rough indication of how a world map looks country by country, with each country in its income group. Perfect. So basically what we wanted to show is how easy it is to combine data sets, and the key thing is: look at the cheat sheet when you need to do
something; many times it's just merge or join. join is for when the index holds the thing you want to join on, and merge is for when you have two columns you want to join on. So I think we're ready. Are you ready for the project? I think so, so let's get moving. See you in a moment.

Are you ready for this project? I hope so, because it's going to be great. Inside our Jupyter Notebook we see that it's about the Statistical Performance Indicators, the SPI. And again, we always keep our focus on why we are doing data science: it is not for our own joy; in the end, if we want to be successful, we need to produce some useful insights. Good. So the goals of the project are: first, what can the Statistical Performance Indicators tell us? If you don't know them, don't worry — there's a link about them below, and we will investigate them together. Second, we want to investigate regional SPI scores. And the final thing I'm curious about, and maybe you are too: is there a correlation
between SPI and GDP per capita? Interesting. Good. The first step is to explore the problem: you need to understand what we are talking about. So read about the SPI on the World Bank site; there's a link, and you can read about it there. I don't encourage you to read everything, just enough to get an idea of what it is. I have a summary here: the SPI measures the capacity and maturity of a national statistical system by assessing the use of data, the quality of services, the coverage of topics, the sources of information, and the infrastructure and availability of resources. That is one big measure, and what does it mean? I know it covers all those things, but it's often difficult to comprehend. The goal is to improve development outcomes and track progress towards the Sustainable Development Goals. Okay, so this is really great: it tells us why the World Bank created this SPI metric. Good. Could there be a regional difference in SPI? That's an interesting question. And do we expect SPI to be correlated with GDP per capita? Interesting. The World Bank has data for both SPI and GDP per capita. The data is already downloaded, but you can find it at the links, and I encourage you to download your own versions; you might get more updated data, and it is good practice to do it yourself. Step two: import the libraries, so do that. Then we need to read the data from three different files, because there are three different things we want to read: the SPI, the metadata we want to enrich it with, and the GDP. You need skiprows=4 in read_csv, and remember to assign the results to the variables spi, meta, and gdp. Apply head() on the data to check that it is as expected — actually, we do the head() further down, so it shouldn't be applied here. Perfect: we now have three variables and three data sets. Then remove columns: we only focus on 2019, so from spi and gdp we keep only Country Code and 2019, and from meta we keep Country Code and Region; that's all we need. Then check for null (missing) values: data often has missing entries, and there can be many reasons for that. Use len() on each DataFrame, and on each DataFrame with dropna() applied, so you get an idea of how much data is missing; then apply dropna() to remove the missing data. Rename columns: we need to rename the 2019 columns to something appropriate — rename 2019 to SPI in spi, and rename 2019 to GDP per capita in gdp.
That makes sense, right? Because they are both called 2019, and there's a hint on how to rename them. Then merge the data: merge spi and gdp on Country Code, and then merge the result with meta as well. Investigate the length of the merged data. Then visualize: use groupby on Region with mean, and create a bar plot of the mean SPI values. Then create a scatter plot with GDP per capita on the x-axis and SPI on the y-axis. And why do we want that? Try a logarithmic scale, passing logx=True as an argument. Why does that make sense? Because incomes tend to grow exponentially, so a logarithmic scale makes the relationship look more linear. Present findings — oh, this is amazing: sort and make a horizontal bar plot; the starter code will get you going on the plot. You need to sort the values and make a bar plot; play around with it and see how far you get. Add colors to the regional plot: you can use factorize and assign the first index, it says here, and you can use colors and a colormap if you want to. And finally, actions: again, it's up to you — if you have any insights, are there any actions we need to take, and how would we measure them in the end? Okay.
So I hope you are ready and excited about this. It was really a fun project — I really like making these things — so I hope you enjoy it too. Stop the video, get as far as you can, and if you get stuck, play along and I'll show you how I would do it. See you in a moment.

Are you ready for this? Did you try it out yourself? I hope so, because trying things yourself is the only way you will learn. If you don't try and just sit and watch me do it, you might think you know how to do it, but you only really learn it when you try it yourself. So please do that; otherwise you're only cheating yourself. Okay, let's get started. Explore the problem: the World Bank has made this SPI indicator, and if you didn't read about it, at least I hope you read those two summary lines, so you get an idea of what the SPI is trying to achieve. Then you should go to the World Bank — there is a link — where you can download the SPI as a CSV, and you can do the same with GDP per capita; there's a download there as well. I already did that for you, but I think it's good practice to do it yourself too. Import libraries, shift-enter — did you manage that? I did.
Read the data. Now it's time to read the data, so let's do that. We make a variable spi = pd.read_csv() with the SPI file from the files folder — oh, it should be spi, not spr. The second one is the metadata: meta = pd.read_csv() with the metadata file. Which one is it? It's the first one here, perfect. And then gdp = pd.read_csv() with the GDP file. So we read them all, and obviously something went wrong; let's see what. The error complains about parsing a line of the file — okay, what did I forget? The skiprows, right? Is it needed in all of them? It probably is. Perfect — I got ahead of myself and simply forgot the skiprows. What I'm talking about: if you look at the SPI CSV file, the first four rows are actually not useful; the actual column names come after them, and then the data below. That was what I was forgetting. Perfect. It's always a great idea to inspect the data: here we have the data with Country Name, Country Code, the indicator names, and then the years over there, so that's perfect. Then meta.head() also looks fine: we have Country Code, Region, IncomeGroup, special notes, and so forth. Then gdp.head(), and that looks fine too: we have Country Name and Country Code, and the key thing to see is that we have Country Code in all three of them, which is what will let us combine the data. Perfect. Step 2a, remove columns: we only focus on 2019, so from spi we keep only Country Code and 2019, for gdp we keep the same, and for meta we keep Country Code and Region. And that worked, as you see.
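The column selection can be sketched like this, on a one-row stand-in for the SPI frame (values illustrative):

```python
import pandas as pd

spi = pd.DataFrame({
    "Country Name": ["Denmark"],
    "Country Code": ["DNK"],
    "2018": [89.1],
    "2019": [90.2],
})

# Select columns with a list of names; note the year is the STRING '2019',
# not the integer 2019 — the mistake made in the video
spi = spi[["Country Code", "2019"]]
print(list(spi.columns))
```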
Well, almost — what did I miss? I used an integer, not a string: the year columns are represented as strings, so it's "2019". My mistake. Perfect. Check for null (missing) values. What I can do here is use len() on the DataFrame itself, len(spi), and len(spi.dropna()), and the same for gdp. The key thing is to get an idea of how much data we lose. We see that for SPI there are a lot of countries that don't have a score yet — and this is our main interest, so those countries we simply can't use. There are some countries without GDP too, but not so many. So far so good; the conclusion is that we'll have to work with the data that is available. Let's do that.
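The missing-value check can be sketched like this, on a tiny frame where two of three stand-in countries lack an SPI score:

```python
import numpy as np
import pandas as pd

spi = pd.DataFrame({
    "Country Code": ["ABW", "DNK", "AFG"],
    "2019": [np.nan, 90.2, np.nan],
})

# Comparing lengths before and after dropna shows how many rows
# are missing an SPI score
print(len(spi), len(spi.dropna()))

# Keep only the rows that actually have data
spi = spi.dropna()
```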
So now we have made the actual reduction and assigned it to our variables. Rename: if we look at spi.head(), we have this 2019 column, and the same for gdp. That's exactly what we want to avoid: what does 2019 mean in each one? We only know because of the variable names, so we want to rename them. spi = spi.rename(columns={"2019": "SPI"}) — and the same for gdp, renaming "2019" to "GDP per capita" (oops, that was an equals sign, and close the parenthesis). Now spi.head() shows a more appropriate name, and gdp.head() has "GDP per capita". Awesome. Now we need to merge all of this together. Let's make a data set: we merge spi with gdp, and then we merge that data set with the metadata — again how="inner", and on "Country Code". Perfect. Let's look at our data set now: everything is combined, so we have the Country Code, we have the
Spi we have the gdp and then we have the region which we are interested in isn’t that crazy how easy it is i think so good now we want to visualize it right so use group by on region with mean create a bar plot on the means okay perfect so data set group
Group oh this is group by uh region main plot bar is not how we do it otherwise i’m pretty sure you’re gonna help me and actually we wanted only the spi value that was actually it i forgot that part okay here we go right
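The rename, merge, and group-by steps just walked through can be sketched end to end. The frame contents here are invented, and the bar-plot call is left as a comment since it needs matplotlib:

```python
import pandas as pd

spi = pd.DataFrame({'Country Code': ['DNK', 'USA', 'TCD'],
                    '2019': [92.7, 84.8, 41.4]})
gdp = pd.DataFrame({'Country Code': ['DNK', 'USA', 'TCD'],
                    '2019': [59830, 65280, 709]})
meta = pd.DataFrame({'Country Code': ['DNK', 'USA', 'TCD'],
                     'Region': ['Europe & Central Asia', 'North America',
                                'Sub-Saharan Africa']})

# Give the ambiguous '2019' columns self-explanatory names
spi = spi.rename(columns={'2019': 'SPI'})
gdp = gdp.rename(columns={'2019': 'GDP per capita'})

# Inner-merge everything on the shared key
dataset = (spi.merge(gdp, how='inner', on='Country Code')
              .merge(meta, how='inner', on='Country Code'))

# Mean SPI per region; in a notebook you would follow with .plot.bar()
region_means = dataset.groupby('Region')['SPI'].mean()
print(region_means)
```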
So what can we see about the SPI — are there any regional differences? Yes: North America is pretty high, Europe and Central Asia is pretty high too, and the rest of the regions are basically in the same range. Those regions are leading, with North America leading a bit more, and there is also more variety among the countries in Europe and Central Asia than in North America. Perfect. Scatter plot: we want to create a scatter plot with GDP per capita on the x-axis and SPI on the y-axis. Let's try: dataset.plot.scatter with x and y — and if you don't remember all these arguments, we had a lesson on this one, so you can go back in the lessons and check it out. Here I'm going a bit fast, because I don't remember exactly how to do it either. This is the point: when you do a plot like this, the data sometimes grows exponentially, which means the points get squeezed together at one end. So let's try logx=True. What happens now is that the x-axis gets a logarithmic scale. Actually, let's keep both versions — I'll put the new one here and keep the old one there,
So you can see what’s different okay it’s difficult to get both of them here but what you see here here’s the linear scale and down here is a logarithmic scale so that means it grows exponential this one here and then what it does is actually that the gdp per capita is
Spread out as a logarithmic scale because it turns out that when countries get richer it gets exponential uh so here we see we actually see a better connection here than we do in this one here it’s difficult to see you can see there is some kind of order but
It’s also difficult to see because it’s so scattered together here perfect sort and make horizontal bar plot this will get you started in creating a plot take a regional plot and sort it okay so how do you do that so let’s do it data set group group by region and mean sword
Value actually we only want the spi right maybe it’s not very specific in this one here but let’s just do that sort value i actually think it’s values scandin false plot bar age okay so here we go so i just noticed a typo here values right so
You don’t have to make the same mistake as i do perfect here right so so this actually shows a different way of representing the same data we had up here because what’s the difference here first of all it is horizontal right so we change it this way so it’s easier to
Read these over here and also it sorts them so you can see well here we have the highest score and so forth up is up upwards or downwards it should say uh so it it’s easy to digest right so sub-sahara is the bottom middle east and you don’t really get this this
Impression as easy on this one here right yeah you might see it but it’s also difficult so it’s a great way to do it and a great way we all custom to read in this direction here and not up and down so it makes it easier to digest also to order them perfect
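A sketch of the sort-and-horizontal-bar step, again on invented data; sort_values(ascending=False) does the ordering, and plot.barh would draw it:

```python
import pandas as pd

dataset = pd.DataFrame({
    'Region': ['North America', 'Sub-Saharan Africa', 'Europe & Central Asia',
               'Sub-Saharan Africa', 'North America'],
    'SPI': [84.8, 41.4, 92.7, 55.0, 86.0]})

# Mean SPI per region, sorted from highest to lowest
ranked = (dataset.groupby('Region')['SPI']
          .mean()
          .sort_values(ascending=False))
# ranked.plot.barh() would draw the horizontal bars (needs matplotlib)
print(ranked)
```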
Then we can make a regional plot. The exercise gives some hints and says to use factorize. What does factorize do? Let's look it up: "Encode the object as an enumerated type or categorical variable." There are often good examples further down in the docs: you have a list of values, ['b', 'b', 'a', 'c', 'b'], and factorize numbers them 0, 0, 1, 2, 0. So it changes labels into numbers — that's what it does. Interesting — let's see if we can use that. Let's make some colors from the dataset based on Region: we factorize it and take the first element of the result, as the docs show, because factorize returns a tuple. Then we can make a scatter plot on the dataset: x is GDP per capita, y is SPI, the color argument is our colors, logx=True, and a cmap — the exercise suggests 'tab10'. There are other color maps; this is just one of them. And I forgot one argument there — so what we're seeing here is a starting point of something: you have different colors for the different regions, so you can see where they sit. Is this the best color map? I don't know. And the labeling here is not good, because we turned the regions into numbers, so you need to know what the numbers mean. But this is a starting point for you, and you should obviously add some more things, like an x label — it's GDP per capita. I actually didn't add it, so never mind.
But you see there are things to play around with, and the colors are intended to make it easier for you to see what's happening. You could play around with making bigger dots, or making them a bit transparent so you can see how they overlap, and all that kind of stuff. Again, you get some idea about where the richest countries are on this scale: are they doing well on SPI, and which region are they from? To be honest, I would maybe reduce the number of regions, or perhaps split them into high and low income instead — I don't know, there are many options. The idea is for you to play around with this, and I find it fun. The final part is: use the insights. Are there any insights — do you want to tell some message with them? Maybe the idea was understanding the SPI: do you think it's a good indicator, does it tell you anything useful, is there a connection with GDP or not, and do you think it will encourage countries to become better places to live — does it have any good impact, or
Anything like that so i hope you enjoyed this one and in the next one it’s going to be amazing actually because we’re going to look in the statistics and you might no no no this is not for me the beautiful thing about being a data scientist is yeah you need some data
Statistics but it’s not that difficult and i’m going to show you it’s not difficult and i’m going to show you all you need to know actually to get started and yeah maybe when you’ve been working with data science for 5 10 15 years you become a statistical expert but i will
Also reveal to you i’ve been working with data scientists not all are people don’t often don’t think about it that data sciences scientists are often working in teams of three four five people and you don’t need to be an expert in every area some are extreme good at the
Domain some are good at the programming some are good at presenting some are good you know it’s a team effort it’s not like one person you need to master at all no you don’t okay so i hope you enjoyed this one if you did please subscribe and like and spread
The message that i’m creating this awesome free course for you to enjoy okay see you in the next one mean standard deviation box plot correlation when it comes to statistics most people are getting frightened but as a data scientist you need to have some basics in your pocket but don’t
Worry it’s not that difficult in this lesson here we’re gonna cover the basic statistical concepts you need to understand and how to interpret them okay so are you ready for that i hope so because it’s gonna be amazing and as a teaser in the project we’re actually going to look at
Data scientists average salaries around the world or at least the reported ones so you get an idea of how your salary is as a data scientist aren’t you excited about that well i am so let’s dive into it if you are new to this series this is
A 15 part journey of the course data science with python if you didn’t notice there’s a link down in the description you can download all the resources all the notebooks and all the stuff we’re using here for free so if you didn’t sign up yet do it immediately and let’s get started
Statistical concepts in data science. Again, we need to remember why we are doing this. This is the data science workflow that we've been focused on all the way through our journey, and to understand and analyze things we need some statistical measures. Again, it doesn't have to be complex — most of the concepts are actually quite straightforward, and everybody understands them because we use them so much. There are a few, like box plots: how do you use them, and what do they tell you? What information does a DataFrame give you, and how can you use it? What is correlation, and what does it mean? We'll dive into all of that. Again, we focus on each step of the journey so you can get the most value out for the customer — the end user of the insights — but you need tools in each box along the way to get to the goal. OK, perfect. Statistical concepts: what is statistics, you might ask? One way to say it is: the analysis and interpretation of data. Basically, statistics makes an analysis. Most people know what an average is, and
It’s kind of a way to summarize things but average doesn’t tell everything we need more tools in our box for that and i’ll explain that later but what is statistic also it’s a way to communicate findings efficiently again right what is the average age in your class well that tells you something one
Number you can communicate something about i don’t know how many students are in your class but let’s say there are 50 students you can communicate what is every average age are you above or below and so on so why statistics right statistic present information in an easy
Way right again it can summarize things really really efficient remember the visualization we had also it’s the same thing right it can summarize a large amount of of data in some specific figures right so it gives you an easy understanding but you still need to understand the things and i
Think the best way to learn is just to get into it and try it out so what we’re going to do is we’re going to use our pandas again and i told you pandas can give you everything you need and for this purpose here we have a data
Set which is called weight and height and let’s just explore it here to see what it is so basically it’s a list of gender i guess we have male and female probably and we have a height i will assume this is inches and then we have a weight which i assume is in
Pounds it’s not in kilograms i would not hope that and it’s not in centimeters here because i would be smaller than midgets okay perfect so a very simple statistics and often people think well this is not statistics it’s actually the count count is a descriptive statistics and counts observers say of observations right
Often when you do some work, you need to know how many observations — how many data points — you have, because what if you make a conclusion based on only, say, three data points? If you had three data points for the entire population of the world and said "the average age of the population is six years old", people would ask: how did you get that? Well, I took the average of some samples. How many samples — how many observations — did you have? Three. Then you can see that the quality of that work is not really high. So count is actually a very important statistical measure: count is the most used statistic, and it has high importance when evaluating findings. Example: a conclusion about childhood weights where the study only observed 12 children — is that trustworthy? That's my point: the count says something about the quality of the study. A simple way to get the count is just to see how many observations we have, and for the full dataset we have 10,000 observations of weight and height. Perfect.
But you can also count on groups, to see the significance across results, and that's what we're going to do next. You take data.groupby — I misspell groupby all the time — on 'Gender', with a capital G, and then take count. What we see is 5,000 females and 5,000 males, so we have an equal number of observations. Sometimes you'll see it's very skewed — you might have one group which is really small — and when you make conclusions based on that, the result can be skewed too; maybe the data isn't informing you enough. The point I want you to understand is that count is an extremely important statistic. As a beginner you might think: is that really a statistic? Yes, it is, and it is the most crucial one, because it tells you something about the quality of your research. So now I hope I shocked the world — I was shocked too, I kind of surprised myself there.
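The group-wise count just discussed, sketched on a small stand-in for the 10,000-row weight/height dataset:

```python
import pandas as pd

# Toy stand-in for the weight/height dataset
data = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male', 'Male'],
                     'Height': [69.1, 63.5, 64.2, 70.3, 68.0],
                     'Weight': [187.0, 135.2, 140.1, 195.4, 180.3]})

counts = data.groupby('Gender').count()
# A skewed count (say 4,900 vs. 100) would warn you that one group
# is too thin to support conclusions
print(counts)
```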
Just kidding. OK, so the mean: what is the mean value, and what does it mean? It returns the mean of the values over the requested axis. Let's try it: data.groupby('Gender').mean(). Here we have, for instance, the mean height of females, 63.7 inches, and their average weight, 135 pounds; for males it's 69 inches and 187 pounds. OK, this is our starting point, but what does the mean tell us? It tells you the average, but the mean value by itself is not that valuable — combined with the count it becomes more valuable, because then you can judge how valid the information is and whether it represents the data well. I want to show you one thing, so let me just demonstrate, and I'll get to the point: let's take the males and, for instance, the height — it doesn't matter, it's just an example — and plot a histogram with bins=20, just to have something. What I'm trying to show is that here we have plotted the heights of men, and we know the average height is 69, so it's around 70. What does this histogram show us? It shows something about how people's heights are spread.
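The group-wise mean and the histogram just described, sketched on toy data (the histogram call is commented out since it needs matplotlib):

```python
import pandas as pd

data = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male'],
                     'Height': [69.0, 63.0, 65.0, 71.0],
                     'Weight': [185.0, 130.0, 140.0, 190.0]})

# Mean height and weight per gender
means = data.groupby('Gender').mean()

# Histogram of male heights, as in the video:
# data[data['Gender'] == 'Male']['Height'].plot.hist(bins=20)
print(means)
```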
For instance, if the peak was really tall and narrow, it would say that everybody is close to the mean; but the data can also be spread out wide. And this is where the next really important statistical measure comes into the picture: the standard deviation. The mean itself tells you something, but it doesn't tell you how much the data spreads. You can have the same mean with an enormous amount of data down at one end and an enormous amount up at the other — a U-shape — where the mean value is not representative at all. Or it can be a bell curve with everybody around the middle, and that curve can be narrow or wide. That's what the standard deviation tells you: the standard deviation is a measure of how dispersed — how spread out — the data is in relation to the mean. A low standard deviation means that the data is close to the mean; a high standard deviation means it's spread out. This chart tells you everything, and if you're not familiar with the standard deviation, the first time you hear about it you think: what's the point? Well, the point is that it tells you how the data is distributed. One standard deviation — called sigma — from minus one sigma to plus one sigma contains 68.2 percent of the data: 34.1 plus 34.1 percent. So
a bit more than two-thirds of the data is inside one standard deviation, and the rest is outside. Let's immediately try it: data.groupby('Gender').std(). Here we see that the standard deviation of the height for men is around 1.8 — actually closer to 1.9 — and for females almost 1.7. The standard deviation of the height is a bit larger for the men, but men in general are also taller. The picture it paints is this: if you take a random sample of the men, there is a 68 percent chance it lies within the mean plus or minus one standard deviation. So if you take three random samples of males, two of them will most likely be within one standard deviation of the mean — taken from the middle pile — while the third one will be outside. Does that mean every time I take three it will be exactly like that? No, but on average: if you take three samples many times, on average two of them will be inside one standard deviation of the mean.
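The 68-percent rule is easy to check numerically: draw normally distributed heights and count how many fall within one standard deviation of the mean. This is a self-contained sketch, not code from the video:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated male heights: mean 69 inches, standard deviation 2 inches
heights = pd.Series(rng.normal(loc=69.0, scale=2.0, size=10_000))

mean, std = heights.mean(), heights.std()
# Fraction of samples within one standard deviation of the mean;
# for normally distributed data this lands close to 0.682
within_one_sigma = heights.between(mean - std, mean + std).mean()
print(round(within_one_sigma, 3))
```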
So isn’t that beautiful i think so perfect so now we understand that count is important we understand what the mean means and we understand the standard deviation actually we are pretty good now so describe is a method that has some statistics including that summarizes central tendency dispersion shape and dataset distribution including
Not an not a number of values right you can see the documentation here we’re not going to do that but we’re going to call this scribe here perfect so what are we seeing here is actually we have the height we have the weight we have the count the most
Important statistics we have 10 000 of both of them then we have the mean value of the height and the weight and now is the mean for both men and female because that’s what we what we did here then we have the standard deviation right and you see
Uh the standard deviation is actually bigger than when it’s on on not done on the gender and why can that be it’s because there’s a difference bigger difference on men and females when you mix them together there there’s a bigger range within the normal so that’s why they increase right
And some interesting figures are also like the minimum so this is the smallest one and this is a smallest weight we have seriously so that might be a mistake can you be away 64 64 pounds maybe never mind we’re not looking into data quality right now but it tells you something right
Then we have the 25th percentile: if you laid all the values out in order, the first 25 percent would be below 63 inches, and the 50th percentile is 66. Notice that the mean and the 50th percentile are not identical: the 50th percentile answers "where is the split with 50 percent of the values on each side?", and that is not the mean — the mean can land somewhere else. So if we put the mean right here, 50 percent of the data points are not necessarily on each side of it; those two values are not necessarily the same. The same goes for the 75th percentile, and then we have the maximum. So these statistics tell you a lot of valuable things: how many data points we have (the most important statistic), the mean, the standard deviation (how much spread there is), and the percentiles tell you how the data is located — if we had a flat distribution, you would see that in these numbers.
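The describe output discussed above can be reproduced on a tiny frame; note that the 50% row is the median, not the mean:

```python
import pandas as pd

data = pd.DataFrame({'Height': [63.0, 65.0, 69.0, 71.0],
                     'Weight': [130.0, 140.0, 185.0, 190.0]})

# Rows: count, mean, std, min, 25%, 50% (the median), 75%, max
summary = data.describe()
print(summary)
```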
Box plots. I almost agree with you: when I first saw a box plot I didn't really understand it, and most people don't at first — you need to use it for a while. But now we actually have the knowledge to understand it. Box plots are a great way to visualize descriptive statistics. Is it something non-technical readers will understand afterwards? Most likely not. Notice that Q1 is the 25th percentile, Q2 the 50th, and Q3 the 75th, and you also have the minimum and the maximum. Q2 is the median — the 50th percentile. That's what a box plot shows you. Another thing it shows is outliers — it's often used for that: in your dataset there might be some points that are outliers; are there a lot of them, or just a few? That is always useful information. So let's try it — make a box plot of a DataFrame column using plot.box, on the weight. What you see here — you can rotate it, but I'd like it vertical so it matches the diagram above — is that the wider the box, the more the data is spread out, and the narrower it is, the less spread out. You can see the low weights and the high weights, and we actually have an outlier on the high side here — that's what it tells us. Might it be real? It could be — what is it, just under 300 pounds? That's not unrealistic; it depends on who's represented in this dataset. But this is what it tells us: where the median is and how spread out the data is.
Perfect so let’s just do it to try to do this for the height as well so we can see that you see here and the interesting part here of height it is actually we have more outliers you see here you have multiple outliers here and again it’s actually less
It’s more centered so the spread is not as big as it is so what is telling us that the weight of people is more spread out it’s a more variety than the height and i think it makes sense right the difference in height of people is less than the difference on
The weight of people so it makes sense and we have a few outliers on that good perfect another great thing is actually if you want to make a box plot and that’s actually amazing i think i’m just gonna draw it to you box plot hide and wait so so so
So what i’m seeing here is and we will learn later about scaling data we haven’t done that yet but what we can see is actually you can use a box plot like this and you can make a chart like this and it tells you information about uh how many outliers are there what’s
The spread and so on and later in the course when we actually look at scaling data we will use these things again so it’s going to be amazing actually so that’s pretty much perfect and a great thing with this box plot here is actually you can do it with
groupby as well — this is actually amazing: boxplot with the columns — I don't want to type all of it, I'm typing too slowly today — and then we add the by argument, and this is the amazing part: here we actually see female versus male on the weight. The reason these plots look so compressed is that both charts share the same scale, and later we will look at how to scale these things. But you can see that on the weight, for instance, female weights are in general lower than male weights, and the same with height: male heights are higher than female heights. There you have it. It gives you easily digestible statistics, serving the same description we had in numbers, but in a visual way with these box plots, and you can group them by something — it's amazing. Correlation — this is actually amazing too. We've actually already
used correlation, but now I want to show you what it really means. Remember, correlation is just one number that describes the relationship between two variables, and it ranges from minus one to one: minus one means negatively correlated, and one means positively correlated. So what is a perfect positive correlation? It means the data lies on a straight upward line — if you made a scatter plot, it would look exactly like that. Highly positive is around 0.8, then low positive, no correlation, low negative, and perfect negative over at the other end. This chart explains everything you need to understand about it. So let's make a scatter plot: data.plot.scatter with x as Height and y as Weight, and let's put an alpha on it. Good. I used a really low alpha so the points are transparent, and then you can see where the data is dense. What would you say — is it close to a straight line? It is close to a positive correlation, meaning that the taller you are, the more you weigh. That should be the case, so let's check: data.corr() — and we have a positive correlation of 0.92, which is really high, so that is not surprising. And we can also try it grouped: data.groupby('Gender').corr().
We actually see that the correlation is lower now. Does that make sense? At first it seems odd, but I think it does: when you look at the females alone, you remove the between-group spread — men are in general heavier than women, and when you mix the groups, part of the correlation comes from exactly that (females lower weight, males higher weight), so the mixed data is more correlated than either gender on its own. OK, awesome. And are you ready for the project now? Yes, because now we're going to dive into the salaries of data scientists. Are you curious about that? I am, so let's get started — see you in a moment. Are you ready for this? I think so, because this is about data science salaries. Aren't you curious? I am — I have to check if I am on the right track with my salary, so let's get started. Project: data science salaries. Again, we focus on the bigger picture: we will try to make a great report in the end, and it's up to you whether you want to extract some insights you can use for
your own salary negotiations when you get there — because maybe you are not on the right path. The goal of the project is to present insightful statistics on data science salaries: a local newspaper and online site want to write an article on how lucrative it is to be a data scientist. Perfect, let's move forward. We need these libraries in order to work, and then we need the data. I have downloaded the data here, but there may be an updated dataset on Kaggle, where this dataset is from, so you can get it there if new data points have been added — I think it gets updated from time to time. Inspect the data: check the size of the dataset. Can you make conclusions based on it? Is the data representative? Again, this is the count — the most important metric when you do data science is the count. Are you surprised? You expected some advanced statistic, but no, it's the count.
Prepare the data: check the data types — can you do that? Check for missing values — we don't really deal with those yet, but it's still good practice: data.isnull().any(). Understand the features: most features have categories, and a way to explore them is by using data[column].unique(), which gives you the unique values of that column. Try it, and do something similar for the other categories — for example, experience level, where you have entry level, mid level, senior level, and executive level. See the full description on Kaggle: there should be a description there — I don't know if it's in the description tab or under the metadata, but it should be there. And here we have it: experience level, employment type, job title, salary, and so on — the full description is there. Perfect, we're not going to dive into all of it; that's for you to enjoy on your own. Salaries: notice that salaries are given in different currencies, and notice that we also have the salary in US dollars, which
We’re gonna use because you cannot compare them otherwise analyze explore features right so one way to explore features is as follows here we explore experience level right so you do that by describe right then you get some kind of insights in that explore other features explore data on two columns
Say you want to investigate columns of experience level and company size then you can group by on both of them and get the salaries and the mean try similar for other combinations describe data on two columns how does this spread look like can we conclude anything based on that
So again here we do have the group by on double here and then we do the describe again to get the statistics visualize the description so we do a box plot and we do it on a salary in u.s dollars and we do it by company size do this
Do this for the other features if for your own interests report so here we focus on company size and experience level create a data frame for the data to plot this makes it easier for reordered index and column notice do it for the features you want to present right so
Data grouped by we have this us dollars mean on stack on stack on stacks multi-index you’ll see why reorder index and columns right so to do this we do this to present the data in a logical way right so to use reindex we have small medium large assuming the same
example, and reorder the columns simply by selecting the experience levels in the order you want. Perfect. Visualize the result with a bar plot, and finalize the title and labels. Credibility considerations: with the insights we have from our analysis, could we tell another story? Think about the spread of the salaries, the outliers, the size of the dataset, and the categories used. Actions: how could we use the insights, and how would we measure the effect? One way to use your insights is obviously for your own benefit, or to help friends — to tell them they are underpaid, or "you are on the right track, my friend", and so on. Of course, the dataset is probably not big enough to draw firm conclusions, but that's up to us to explore in the project. My advice now is: stop here and try to do it on your own, and in the next part I'll do it together with you if you need help — but try it yourself first. See you in a moment. Are you ready for this? I hope so, because I'm so excited to find out what the salaries in data science are. OK, let's get started. First of all, let's import the libraries. Then read the data — you can find an updated dataset on Kaggle,
But let’s not go this way let’s just say data equals pd oops we need to equal here pd read csv and we take files and data science salaries and let’s take it here as its post right so here we have the reported years out here and then we have the experience level
Employment type job title salary salary currency salary in us dollars employer residence and so on perfect inspect the data check the size of the data set length data set so i’ll call the data in the data set uh 245 okay so also knowing that actually we have
Data from around the world is probably a pretty thin data set but often what you can say it’s an indicator of it right so it’s not really a you cannot make any conclusions based on it but it can give you an indication of where you are also often remember that there might be
A tendency that some people are more eager to tell their salary than others good check the data types so we are here at 2a okay let’s do that oops next data d types so we have experience level employment level job title there all objects salary integer seller currency
Seller us dollars integer that’s good employee residence remote ratio integer that’s also good company location company size company size is actually let’s see what it is okay so that’s a category of the company size right perfect so it seems all fine so also let’s check for null value missing values let’s do that
Data isnull any perfect so we see here this is a perfect data set all of them are there so there's nothing missing understand the features right most features have categories a way to explore them is using data and unique right so let's try to do that data
Work year unique so we actually see here we have two different values we have 2021e and 2020 the e probably means the year is not finalized yet so we get an idea of that an example is also experience level let's try to do that unique
And you see what values we have here we have entry level senior level expert level and mid-level how did i know that well i read it right up here right and you can see the full description on kaggle i’m not going to do that i’m not going to go through any
More oh let me just add one more actually company size right data we're going to use that down here is it called company size yeah unique so here we have it we have small medium and large right perfect salaries notice that salaries are given in different currencies also
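the inspection steps just described, the size, the dtypes, the missing values and the unique categories, can be sketched like this; the frame below is a tiny synthetic stand-in for the kaggle salaries csv, with column names assumed from the video:

```python
import pandas as pd

# Tiny stand-in for the Kaggle data science salaries CSV used in the video;
# there it is loaded with pd.read_csv -- these column names are assumptions
data = pd.DataFrame({
    'work_year': ['2021e', '2020', '2021e'],
    'experience_level': ['EN', 'SE', 'MI'],
    'salary_in_usd': [60000, 128000, 85000],
    'company_size': ['S', 'L', 'M'],
})

print(len(data))                          # number of rows
print(data.dtypes)                        # type of each column
print(data.isnull().any())                # True for any column with missing values
print(data['experience_level'].unique())  # the category values, e.g. EN/MI/SE
```

the same four calls work unchanged on the real data set once it is read from disk.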
Notice salaries in us dollars right so if you didn't notice that let's just show it here again then we have the salary here it's in euro here and then we have it in us dollars here right so it is important that we use the
Salary in us dollars column for comparison perfect analyze explore features one way to explore features is as follows so let's try to do that and see what we have here so what i'm looking at here is this experience level right so we have the entry how was it en entry level
En for entry level okay mid level senior level and expert level right so you see how many are in each one of them surprisingly enough there are not so many in expert level but if you look at the salary right so you see here the mean salary in us dollars is 226
000 us dollars and that's quite fine and then senior level here it is 128 000 dollars and mid level is 85 000 and entry level is around 60 000 that's also my claim that the entry level salaries are around 60 000 we have other data actually supporting that as well so
We also see something about the spread here the standard deviation here is actually quite large it's around 50 000 to 60 000 right i think it's because we have some entry level salaries which are pretty high and
We have an entry level here which is 250 000 that's insanely high right and we have the smallest here at 21 000 right but we see the median the 50 percent mark is 58 000 and the 75 percent mark means that the top 25 percent of entry levels get more than eighty thousand dollars per
Year that's pretty good right and if you look at the mid level 85 000 you actually see the spread here is insane right the maximum salaries of these are actually extremely big right we have one here at six hundred thousand us dollars right that's pretty amazing so also just
We don't know much about the data quality are these verified data points or not we don't know that okay but this is to give you an indication right so you understand the data better perfect i think we can do something similar to what we did with experience level let's look at
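the per-level summary just walked through can be sketched as one groupby plus describe; the numbers below are invented stand-ins, not the real kaggle data:

```python
import pandas as pd

# Synthetic salaries; column names assumed from the video, values illustrative
data = pd.DataFrame({
    'experience_level': ['EN', 'EN', 'MI', 'SE', 'SE', 'EX'],
    'salary_in_usd': [21000, 80000, 85000, 120000, 136000, 226000],
})

# One row of count/mean/std/min/quartiles/max per experience level
stats = data.groupby('experience_level')['salary_in_usd'].describe()
print(stats)
```

the count column is the one to check first, a mean built on two rows does not say much.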
The company size instead right company size so there's a difference in what kind of company you are working in actually the smaller companies have a lower salary and the mid value is bigger in larger ones so what can this tell you based on this data set it will indicate that actually
The larger the company the higher the salary you can get the median is telling the same here and the maximum salary is actually in the larger companies here so again what this can tell you and that's also what my experience is the smaller companies have a lower salary often what
You can get in smaller startup companies is part of the shares in the company so if you build it up you actually become an owner of the company or you are an owner so you get a benefit if the company grows perfect so here we actually explore two
Things at the same time and this is actually an amazing data set here so entry level we can have it at small medium large we see the same picture here right for the average salary though it depends on how big the groups are right
That’s where the count should come in we should do a count on this as well uh and then they have 75 here in the larger companies here and expert remember there are only a few experts this doesn’t really say much and you see here maybe there’s a great
Startup company that pays a great deal of money in the mid one we actually have the same pattern small companies don't pay much and senior level it's actually pretty much the same salary but again the same pattern smaller companies do not pay as much good describe data on two columns okay so you
Can do this this is actually what we want to do right so again now we get more insights into what's happening here right we have the count here right so for the large companies we have a count here in expert levels we have only eight here and here
We only have one here we have two so right this one here it dictates everything about it right so there’s no standard deviation there’s only one salary right so this gives you great insight this is what’s happening and often i like to see what is the
Middle value of it all and what is the mean value compare those to each other right because is there a big difference on that you know fifty percent get less than this and fifty percent get more than this right so this gives you an indication of it right
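grouping on two columns at once, as just described, can be sketched like this, again with made-up numbers; the multi-index result gives one statistics row per combination, including the all-important count:

```python
import pandas as pd

# Invented values; the (experience level, company size) pairing mirrors the video
data = pd.DataFrame({
    'experience_level': ['EN', 'EN', 'MI', 'MI', 'SE', 'SE'],
    'company_size':     ['S',  'L',  'S',  'L',  'S',  'L'],
    'salary_in_usd':    [40000, 70000, 60000, 95000, 90000, 140000],
})

# Grouping on a list of columns yields one row per combination
stats = data.groupby(['experience_level', 'company_size'])['salary_in_usd'].describe()
print(stats[['count', 'mean', '50%']])
```

combinations with a count of one have no spread at all, so their mean dictates everything, exactly the credibility check made above.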
So again isn’t that amazing how much information you can get from statistics now we know that and i mean you can dive more into it but let’s just get this moving here so here we get a box plot of large medium and small companies and we see
The spread of that right and you see the outliers uh you have some outliers here in the salary in u.s dollars here right so you see here in the medium you have one which is a great outlier way above everybody else right but you also see this is extremely high salaries right
In general you can see here on large companies you have a lot of people spread out above here and below here and the salaries are as expected smaller companies pay less perfect and we can do the same for let's just do it for experience level
No no that was not the one i wanted this one experience level and then we get the same one here right so what's kind of nice is it's not out of order right so entry level mid level senior expert right it tells you all the full information that we saw up in this one
Here it's just that when you get trained to read these it's easier to read these things and you get the outliers as well so you see how it works right perfect present your findings right so this is getting interesting now because now we need to
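the grouped box plots described above can be sketched with pandas' built-in boxplot on synthetic data; the Agg backend is only there so the sketch runs headless:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import pandas as pd

# Invented salaries, one small sample per company size
data = pd.DataFrame({
    'company_size':  ['S', 'S', 'M', 'M', 'L', 'L'],
    'salary_in_usd': [40000, 60000, 70000, 90000, 100000, 150000],
})

# One box (median, quartiles, whiskers, outliers) per company size
ax = data.boxplot(column='salary_in_usd', by='company_size')
ax.figure.savefig('salary_boxplot.png')
```

on the real data set the outliers show up as individual points above the whiskers, which is what the video points at.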
Kind of figure out a way to present this and we focus on company size and experience level those are the ones we use but if you have been working with different ones then feel free to do something else right but the first thing we need to do is using this unstack here and data
Or just call it plot data here so let's just look at this one here so basically what it does here we have unstacked it it puts company size and experience level over on these axes with the salary because that's what we want to use to present it
And we also want to reindex the data right so plot data equals plot data and then we do the reindex and i forgot a parenthesis there there's something wrong oh it's because this one was missing up there perfect and we still have this one
Here perfect good and plot data equals plot data okay i should just have copied that here so what did i do here basically what i did is putting entry level mid level senior level expert level and small medium large right in the correct order so this makes it easier
To visualize it right so let's do this plot data plot bar and basically what can you do here you have company size on the x axis already and for the y label you can have us dollar salary yeah so here you see it so now we
Actually have some information represented here right so you see mid-size companies you can see how the salaries are right so you can see the expert level is maybe an awkward measure because it doesn’t really show the same because in mid-size it doesn’t really pay that much but you see in general that
Small companies have lower salaries it's actually difficult to see you see the entry level is actually higher in smaller companies here but that might be because of the statistics right so again credibility counts here can we tell a different story now we looked at the data behind it then
This one actually shows here look at the counts is it representative but still we get a picture of the story right and you might possibly want to represent this differently but this is just an idea of how you can present it right
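the unstack, reindex and bar plot recipe from above, sketched on synthetic data; the category codes EN/MI/SE and S/L are assumptions based on the video:

```python
import matplotlib
matplotlib.use('Agg')  # headless rendering
import pandas as pd

# Invented numbers standing in for the salaries data
data = pd.DataFrame({
    'experience_level': ['EN', 'EN', 'MI', 'MI', 'SE', 'SE'],
    'company_size':     ['S', 'L', 'S', 'L', 'S', 'L'],
    'salary_in_usd':    [40000, 70000, 60000, 95000, 90000, 140000],
})

# Mean salary per combination; unstack moves company_size into the columns
plot_data = data.groupby(['experience_level', 'company_size'])['salary_in_usd'].mean().unstack()

# reindex puts the categories in a meaningful order instead of alphabetical
plot_data = plot_data.reindex(index=['EN', 'MI', 'SE'], columns=['S', 'L'])

# Grouped bar chart: one cluster per experience level, one bar per size
ax = plot_data.plot.bar(xlabel='Experience level', ylabel='Salary (USD)')
ax.figure.savefig('salary_bars.png')
```

without the reindex step pandas would sort the index alphabetically, which scrambles the natural entry-to-expert ordering.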
But again we learned a lot in this one here it was really really amazing i mean we learned a lot about statistics and how you can read it and yeah i must say it’s been an amazing journey so far in the next one actually we’re going to look into linear
Regression which is one of the most important tools in data science many of the analyses you do are done with linear regression so you need to have that so i'm looking forward to the next one because it's going to be great to have you there
If you enjoyed this one please like and subscribe and comment below if statistics was easier than you thought and are you surprised that the count is the most important statistic right it's not that complex when you think about it okay so hit like subscribe and
All that kind of stuff see you in the future bye bye the number one top goal of a data scientist is to create a model that predicts something valuable for your end customer really yeah if your data science findings do not predict anything what value are they adding to your customer
Right so you might wonder well isn’t that difficult to create a model predicting something well actually not and the most used one is a linear regression and in this lesson we’re gonna dive into a linear regression and break it down so you understand it step by step how to use it
And create valuable insights for your end customer in the project we're actually gonna apply our knowledge on a soccer database a european soccer database and try to predict good players based on metrics so you'll see how it works in real life if you're new to this series
This is a 15 part course on data science with python if you didn’t see it there’s a link down in the description you can download all the resources all the notebooks we’re using here so do that and let’s get started on this journey together linear regression this is the most important
Prediction model used in data science if you don't understand this one you're missing out on a lot of value as a data scientist and again remember we're looking at the bigger picture of the data science workflow and in the end it is creating valuable insights for the customer
That actually pays your bills that is your success criteria so again using a linear regression you can use that in your model to do some reporting and create some insights to your customer so your customer can use it to predict their customer’s behavior and that is where linear regression is a powerful tool
Perfect so what is this linear regression so you probably remember you remember we talked about correlation and actually they are kind of related and we’ll get back to that but even though they’re related they’re not they’re different so what is the goal of linear regression well it is given a data point
Can we predict the output often called the dependent variable so it is kind of like a mapping from input points to a value really yeah so here are two examples of predictions right so the prediction is actually the blue line so given any
Point here it will predict the value on the line given the points here it will predict the value on the line but there are differences as you see here one of them the points are actually pretty far away from it and the other one here they’re pretty close to it so in this
One here it actually does a better prediction because the value it predicts is not so far away from the actual values of the data points so when you create the model what you're trying to do is making a line that has the smallest distance to the points you're predicting so that's actually your
Success criteria in creating these linear regression models okay we’ll get back to this and we will create a measure measuring how accurate your prediction is and your goal is to find the input points the features that create the best predictions okay awesome and this again this is like a
Two-dimensional so you have one input variable and one output or dependent variable and in the general case you have way more and we'll do that in our project today so you see that but to understand it visually it's easiest just to think of it like simple points like this okay
As i said correlation and linear regression they’re kind of the same thing or actually not so what is a correlation right remember it’s a single measure of relationship between the two variables but again we remember something like images like this this had a worse correlation than this one here
Because these points here were more aligned to a line while these are not right remember it was in the last session if you didn't see that one i would advise you to go back and see the one with the statistics because there we actually looked into the correlation but again it's
Important to understand that correlation is just one number measuring how accurate the linear regression will be okay so what is linear regression it's actually an equation for this line here and this equation can be used to predict so again given a value
You can predict what is the outcome what is the output what is the dependent variable i know we're using a lot of different names for these but that's because that's how it is people have used different namings and it's kind of annoying actually
Why doesn't everybody just call it the same and again it's because linear regression is used in many fields and many fields have different terminology and then it's all a mess if you ask me so what are the similarities between correlation and linear regression well they describe a relationship between
Variables right so correlation describes in one number how the relationship is is it good or bad is it positive or negative right while the linear regression actually tries to describe it with a line a prediction line good enough talking now we need to get
Started with our notebook here so the first thing we do is we use pandas because pandas is our main data structure and then we read this data set we actually used it in the last one as well or we used it before and
What it has here is like weight and height so what it has is a gender column a height and a weight and you might already know that there might be a correlation between the height and the weight of people it's not a perfect correlation but it's
There and that's what we're going to investigate and again if you look at a customer perspective of this we might find a linear regression model which is decently accurate but does it give any useful insight to the end customer maybe not because it might be pure logic but it
Might be valuable for a customer as well so we are not always the ones to know because maybe your customers segment has some specific attributes which are interesting so let us just try to plot this here let’s do a scatter plot x it is oh did i do that again oops scatter plot
Let me do x here it's height here and y is weight and then i'll do alpha what do i mean here with alpha i'm just making it a bit more transparent because we have so many points so you can see where all the density is here right if i didn't have
Alpha here if i had alpha one for instance you'll see it looks like this right and it gives more information for me like this because then you can see where all the density is right so what we're looking at here is something that has a high correlation or
Yeah it has a high correlation you can easily see that this this creates a line but it’s not a perfect line of course right and again what we’re doing with linear regression is trying to find the best fitted line here and the closer they are the better the correlation right the better prediction
You can do with it okay good so far so good and just to memorize what we did in the last one we did the correlation right here we get a number a metric here around 0.9 and 0.9 is a really high correlation so it is useful while you
Say i mean it depends on the context but you would at least say that below 0.7 it's a poor correlation not really strong but sometimes that's all you have so you have to create a model from that perfect so far so good so what do we want to use right we want
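the correlation check from the last lesson can be sketched on a small synthetic height and weight frame; the real data set is much larger, and the column names are assumptions:

```python
import pandas as pd

# Small synthetic stand-in for the height/weight data set used in the video
data = pd.DataFrame({
    'Height': [60.0, 63.0, 66.0, 69.0, 72.0, 75.0],
    'Weight': [110.0, 125.0, 139.0, 158.0, 170.0, 188.0],
})

# Pearson correlation matrix; a value near 1 means a strong linear relationship
corr = data.corr()
print(corr.loc['Height', 'Weight'])
```

a correlation this close to one is exactly the situation where a linear regression line will fit well.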
To use some machine learning if you look at the course overview you'll see there'll be something about machine learning later and i'll also note that i actually have a full course on machine learning and i advise you to dive into that later because in this course we're actually
Only going to touch upon a few models but if you want to become really good at data science you need more models okay perfect so for machine learning in python we have scikit-learn it's a great resource it has a lot of the things that we need it doesn't
Have everything but most of what you need is actually taken from here and again i'm not going to dive into what machine learning is i have a lesson on that later so don't worry about that linear regression ordinary least squares linear regression right so there's a link also to it here
So you can see it and we’re not going to go into much we’re just going to use it straight out of the box here perfect so linear regression model takes a collection of observations right so the observations are points here right this is one observation this is one this
One this is one so far so good okay and each observation has features or variables right so often we call it features but they’re also other names like variables and so features are actually you know there’s a gender height and weight and in this one we’re going to ignore the gender we’re
Only going to focus on height and weight because they're numeric and the features the model takes as input are called independent variables often denoted with a capital x as a python programmer i never liked the capital x but that's what people do so we add that it's again when different fields
Meet each other yeah it's for readability right the feature the model outputs is called the dependent variable often denoted with y right so we have that okay perfect so that's what we're going to predict right so in our case here we might be
Able to say okay let’s create a model let’s first import it so what i’m doing here is importing the linear regression model here and let’s try to use these features here and try to make a prediction so let’s just call it linear here and linear regression so we create the
Linear regression model here and what we want to do is we want to fit it as it's called and if you want to you can look into the documentation okay perfect so what it does it takes the
Independent features and we'll denote them by x but we're not going to actually do that in this one here because we're just gonna pass it directly so what it does this one it takes a data frame so you need to do it like this so let's just take height
So that's why you have double brackets there because then it creates a data frame and you cannot do it with a series if you only took it with single square brackets here then it would not be working and we want to predict the weight here
Right so this is what we do here so now we created a model and you think um then what well let’s try to visualize it actually because what it does it has created a line so you can predict with it interesting so data plot oh scatter
Actually i want to do the same as i did up there and it's height weight and alpha 0.1 and the reason why i assign it here is because i get the axes back because i want to add more to the axes right so what i plot here is
Like the data height on the x axis and then i actually use our linear regression model and we predict with it and what do we predict from we predict from the data height right so we are actually not using
The weight in this one here right so this is actually plotting the height actually i should just make the plot first without right so this is the one we have already right this is the one we had above there and then i add to it a
Red line right so the red line here the color right here is actually our prediction right so it takes every single point and says this is how we are predicting right all the way up here so given some height so how would
You use this model right so you get somebody you know the height of this person and then you say this person probably has a weight around here right and then we would have some measure to say well what is the deviation from this prediction good so
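the fit-and-plot steps just walked through can be put together in one runnable sketch; the six height and weight points are made up, and the double brackets keep the input two-dimensional as noted above:

```python
import matplotlib
matplotlib.use('Agg')  # headless rendering
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the height/weight data
data = pd.DataFrame({
    'Height': [60.0, 63.0, 66.0, 69.0, 72.0, 75.0],
    'Weight': [110.0, 125.0, 139.0, 158.0, 170.0, 188.0],
})

lin = LinearRegression()
# Double brackets keep X a DataFrame (2-D), which .fit() requires
lin.fit(data[['Height']], data['Weight'])

# Scatter of the raw points plus the fitted prediction line in red
ax = data.plot.scatter(x='Height', y='Weight', alpha=0.5)
ax.plot(data['Height'], lin.predict(data[['Height']]), color='red')
ax.figure.savefig('regression_line.png')

print(lin.coef_[0], lin.intercept_)  # slope and intercept of the fitted line
```

given a new height the model returns the point on that red line, which is exactly the weight-around-here prediction described above.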
You need some kind of measure to say okay is this a good prediction is this a bad prediction and when you're trying out different features different ways we will learn about data cleaning also then again how do you create better quality data so you can do
Better predictions right and we'll look into that so the idea is this is how you can iteratively look into it can you create a better model can you create a better fit what do you need to do with the data in order to get it working better right perfect so
Yes so what you do here is you need a measure so in linear regression you most often use the r squared method and there are some docs here about what it actually does but basically it uses a measure called r
Square and it takes the squared distance to the line right so it takes each point and calculates the squared distance that means the further away you are the more impact it has on the score right so for points closer to the line the impact on the score is not as big
So the best possible score is one so that's what you're aiming for but you cannot say something general about the score it's very context dependent right so you make some measurements from the score and then you do some changes and see if the score
Is better or worse good so far so good so let's just say an r2 score so what are we scoring here well let's just see here it says the true values so what i'm looking for here is is it the true values first and the
Predicted values afterwards yes it is right so the true values here are the height and the predicted values are linear predict data actually i'm doing it wrong right so it shouldn't be height here it should be weight because that is what we're predicting right so we're predicting
This right so here we go so here we have a score of 0.85 and you say yeah but that's not the correlation score yeah this is a different measure you cannot compare a correlation and this one but again
Is it good or bad it's context dependent and i also just want to show you that actually the models themselves have a score as well right a scoring function and you'll see here the coefficient of determination which is r squared right so
It's actually the same one so we should actually get the same score here and again you do the x value here first so it does the calculations for you right height first and then you do data weight here good so as you see here you get the
Exact same score because it's the same calculations it's doing so it actually does this for you down in the scoring function here right but when doing it on your own sometimes you use other scoring functions but the standard one is r squared so this is the one you should focus on and
It's basically very context specific if you use something else than r squared then you should know what you're doing so r squared is just you know the go-to which is also the default one in the model okay i mean this is quite amazing because what you did actually even
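the two equivalent scoring routes just described, r2_score versus the model's own score method, sketched on the same kind of synthetic data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for the height/weight data
data = pd.DataFrame({
    'Height': [60.0, 63.0, 66.0, 69.0, 72.0, 75.0],
    'Weight': [110.0, 125.0, 139.0, 158.0, 170.0, 188.0],
})

lin = LinearRegression().fit(data[['Height']], data['Weight'])

# r2_score takes (true values, predicted values) -- in that order
score_a = r2_score(data['Weight'], lin.predict(data[['Height']]))

# model.score takes (X, y_true) and computes the same R squared internally
score_b = lin.score(data[['Height']], data['Weight'])

print(score_a, score_b)
```

the two numbers come out identical because the score method is just r squared under the hood, as the scikit-learn docs state.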
Though this looks kind of simple right you're focusing on this area can you make something that predicts some value for your end customer and this is what linear regression can do for you it can do some predictions right because now you're giving some feature values as input and you can predict some
Dependent value as output for the customer and you can measure how good it is and try to play around with it in this case here our input variables are actually just one right in our project we're going to do something amazing we're going to look at the european soccer
Data set it's out there it's really famous and we're going to try to predict good players and we're going to use a lot of features on it so it's going to be fun so you can see how it's used in more general settings okay are
You excited i hope so because this has been great let's jump to the project description see you so are you ready for the project i hope so let's jump to the jupyter notebook and let's look at it okay so here we are predict soccer player with regression right linear regression and
Again just to remind you we have the focus on the bigger picture here and sometimes we dive into smaller aspects of it good the little project is to make a model to predict player overall rating based on metrics hmm this is a subset of the kaggle data set
European soccer database right so i didn't take the entire one because it has a lot of other things and you can do a lot of fun things with it right here there are like 25 000 matches there are 10 000 players there are 11 european countries seasons from 2008 to 2016 so
There’s a lot of data it’s really really fun but also you see here it’s a big data set the big one and we are not going to use the entire one here so i took some of it here a bigger project would be to predict the
Outcome of games i mean that's fun right being the one that actually has some insight some advantage in predicting that i would say this is a difficult project because there are often variables that are difficult to understand right but maybe you can have
Some percentage of advantage right so you have some skewing that means it's probably difficult to predict specific games but if you try to predict 10 games maybe 20 games then you might have better odds than 50 50 right but here we only model to predict the player's overall rating based on the
Metrics given there right good so explore the problem identify data yeah so the first thing we need to do is to import a lot of libraries we have pandas here we have linear regression we have something new here called train test split which we will be using then
We will have r square and then we will have our matplotlib pyplot okay so first we read the data and we read the parquet file here so i have the soccer data we need here as a parquet file and the reason for that is again compression parquet files
Take up 10 to 20 times less space so that's why we have it as a parquet file and assign it to data and take head of that to get an idea if the data is correct and what kind of data we have perfect and then the data shape just to
Give you an idea of how big the data set is inspect the data so again we want to check the data types but what you will realize is it's a long list so what you want to do is actually you want to find all the columns with
Numeric data and you can do that by select dtypes including number right and that way you will see all the features which are numeric that's what we're looking for check for null and missing values does data have missing entries use isnull and any and then you actually want to
See how many there are you can calculate the percentage with this formula here right you take the sum of those which are missing then you divide by the total and multiply by 100 to get a percentage good so we haven't worked with cleaning data or missing data so far
So what we're going to do now is just to drop them all with dropna remember to assign it to the variable so you keep the changes visualize data make a histogram of the overall rating this gives you an understanding of the data what does it tell you right
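the missing-value percentage and the dropna step just described, sketched on a tiny frame with a couple of NaNs planted; the column names are borrowed from the project, the values are invented:

```python
import numpy as np
import pandas as pd

# Tiny stand-in with two missing entries planted on purpose
data = pd.DataFrame({
    'overall_rating': [70.0, np.nan, 65.0, 80.0],
    'potential':      [75.0, 72.0, np.nan, 85.0],
})

# Percentage of missing entries per column: sum of missing / total * 100
missing_pct = data.isnull().sum() / len(data) * 100
print(missing_pct)

# Drop rows with any missing value and reassign to keep the change
data = data.dropna()
print(len(data))
```

forgetting the reassignment is the classic mistake here, dropna returns a new frame and leaves the original untouched.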
I don't know what it tells you i'm curious about that feature and target selection so the target data is given by overall rating that's the one we want to predict right and we do not have a description
Of the data let's learn a bit about it right so use correlation on overall rating and sort values ascending right so what we're looking for here is the correlation so we find those features which are highly correlated and sort them according to that for simplicity deselect features you do
Not think should be part of the analysis right okay create data frames x and y containing the features and the target respectively to get all columns except one use drop with overall rating and axis one right so
Yeah to get all the columns except the one we want we need to drop that one here of course because this is the one we have for overall rating and then we have the other ones here we don't want to use perfect divide into a test and training set so
We didn't really talk about this inside here but one thing you can do here is actually you can divide it into a train and test set so you have x train and y train and x test and y test so the
Reason why we do that in machine learning is actually you need something to train on and you need something to test on so the data you are training on is not the data you use to test and why does that make sense it makes sense
Because if you train your model and test it on the same data maybe your accuracy is pretty high but when you get some new data which you didn't train on then actually the accuracy might be way off so that's why often you take some data and you train your model and
You have some data on the other side where you still know what it should predict and then you test it on that and check if it is predicting correctly or how accurate it’s predicting so this is pretty standard and we’ll get back to this in the machine learning part of it
And if you're really really excited about these kinds of things go to the machine learning course as well it's free and you should just get started so train the model so here we will use x train and y train predict on test data well then you
Predict on this one here and you assign it to y pred right and now you take the r square on y pred on data that the model has never seen when we trained the model when we fitted the model it has never seen the data we're predicting on here so
That's the point and we have the real values here and the predicted values here and then we're making the r square score on it perfect and in the end present the findings and this is more practice for creating a model but feel free to be creative an
Option could be to investigate the best indicator of a player right and use insights well this is also again it’s like a practice of the bigger picture here we know the goal is over here but sometimes we need to learn something we use some practice data and then
It actually this one here is more focused on these aspects i mean actually it’s it’s focused on this one here right perfect are you ready i hope so and uh i’m excited to see how it goes for you so please let me know in the comments don’t
Be shy so see you in a moment if you get stuck then i will try to solve it with you in a moment so see you in the next one so stop the video and try as far as you can if you get stuck watch me do it see you in a moment
So, did you get stuck, or did you manage it all by yourself? Please let me know in the comments. I know we are learning some new things inside these projects, but sometimes it's best to learn just by doing. Good. Step one is to import this beast, so let's do that. Then read the data. We have done this a few times now, so if you have been following the course you should know how to do it, but if not, don't worry: it's read_parquet, and it works basically the same way as reading a CSV file. Then data.head(). This is always a good habit: check the data with head just to see that you actually read something. You can see there are a lot of features; the dots in the middle indicate there are more columns than pandas will display. For more detail, use data.shape: it gives you the dimensions, and we have about 183,000 rows and 41 features. What data do we have? We have the overall rating of each player, we have the player names, and a lot of other columns; I honestly don't know what every one of them is. There's a lot of data here. Good, that's step one.
Step two: investigate and inspect the data. Often we start with dtypes, and when you run it here you will realize the list is really, really long. You have an id, a player_api_id, a date stored as an object, the rating, potential, and preferred foot, which are not numeric, and then a lot of floats. You can also select only the columns with numeric data; again the output is abbreviated with dots, but you can see all the numeric columns there. Then check for null and missing values: data.isnull().any() shows that every column except the first few has missing data. So let's get a percentage instead: we calculate the share of missing values per column. The top columns have zero percent missing, most of the others have less than half a percent, and a few are missing around one and a half percent. So as you see, it's not a big amount. The big question is whether the missing values are scattered all around or concentrated in the same players; we don't know. What we can do is compare the length of the data with the length of data.dropna(). What we see is that when we drop all the rows with missing data, we are not losing a lot, so it's not that dangerous to do here. So we just set data = data.dropna(), and now we have a complete data set. Good.
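The missing-value check and the drop can be sketched like this; the tiny frame below is invented just to show the calculation, not the real player data.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'overall_rating': [68.0, np.nan, 75.0, 61.0],
    'potential': [71.0, 80.0, np.nan, 65.0],
})

# Fraction of missing values per column
missing_pct = data.isnull().sum() / len(data)
print(missing_pct)

# Compare row counts before and after dropping rows with any NaN
print(len(data), len(data.dropna()))
data = data.dropna()
```

On the real data the lost share is small, which is what makes dropna an acceptable first move here.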
Next, make a histogram of the ratings: data['overall_rating'].plot.hist(bins=20), since I want a few more bins than the default. Remember the bell curve from the statistics lesson: this is often how data is distributed. Most values cluster around the mean, and the standard deviation tells you how much they spread. From the histogram it looks like the rating runs from roughly 0 to 100, so let's pin that down with data['overall_rating'].describe(); I should really have included that from the start, because it's so useful. The count is the first metric, and it matters: it says something about how solid the conclusions you can draw from this data are. We have about 180,000 values, which is a lot. The mean is about 68, which matches the peak of the histogram. The standard deviation is about 7, which means roughly 68% of the values lie within one standard deviation of the mean, so from about 61 to 75 (if we ignore that the distribution isn't perfectly normal). The median, the 50% quantile, is 69, meaning half the players are below 69 and half above. Then you have the 25% and 75% quantiles, the minimum is 33, and the maximum is 94. Okay, perfect, awesome.
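The numbers quoted here (mean about 68, std about 7, roughly 68% of values within one standard deviation) can be checked on any numeric series; the ratings below are generated from a normal distribution, not the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up ratings drawn from a normal distribution with mean 68 and std 7
ratings = pd.Series(rng.normal(68, 7, 10_000))

stats = ratings.describe()
mean, std = stats['mean'], stats['std']

# Share of ratings within one standard deviation of the mean (~0.68 for normal data)
within_one_std = ((ratings > mean - std) & (ratings < mean + std)).mean()
print(mean, std, within_one_std)  # roughly 68, 7, 0.68
```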
What we want to do now is find the interesting features, meaning the ones with high correlation to the rating. This is where one metric gets us started, because with so many features we need to figure out what matters and what doesn't. A key question: why not just use them all? Because that's often not a good idea; you overcomplicate your model and it doesn't really pay off. Good, so let's get started. The notebook already has the full expression written out, so I don't have to type anything; it just computes the correlations with the overall rating. Let's look at the result. Overall rating has a correlation of 1 with itself, which makes sense. Then potential scores really high; if you look at the data description, I'm pretty sure potential is a measure of what experts think a player can become, so it's almost another expert rating. Then something called reactions is also a really good indicator, and after that there is a long tail of features with really low correlation. Remember that strongly negative correlations, close to minus one, are also very informative, but in general, looking at this list, reactions is the best standalone feature.
For this exercise, as I wrote, for simplicity we deselect features we do not think should be part of the analysis; we haven't really learned about feature selection yet, and we'll do that later. So for now we say: potential might not be a fair feature to use, so let's drop it. Then the task says: assign X to the columns that are left. So we take data.drop(['overall_rating', 'potential'], axis=1); that is X, and y is data['overall_rating']. So now X holds all the features, and y holds the overall rating, the value we want to predict.
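The correlation ranking and the split into X and y look like this in isolation; the tiny frame below is invented (column names included) just to show the pattern.

```python
import pandas as pd

data = pd.DataFrame({
    'overall_rating': [60, 70, 80, 90],
    'reactions':      [55, 68, 79, 88],
    'potential':      [65, 72, 85, 92],
    'jersey_number':  [9, 3, 7, 11],
})

# Correlation of every column with the target, strongest first
corr = data.corr()['overall_rating'].sort_values(ascending=False)
print(corr)

# Features: everything except the target and any columns we deselect
X = data.drop(['overall_rating', 'potential'], axis=1)
y = data['overall_rating']
print(list(X.columns))  # ['reactions', 'jersey_number']
```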
Maybe there are other features like potential that shouldn't be used; if you find some, please add them to the list and let me know in the comments. You can read in the data set documentation what each feature specifically is; if you go back to the source, there should be a description of all these features. I actually don't know exactly where it is, and we're not going to look into that right now, but you can find it there, so feel free to extend the list. Later in the course we will learn about feature selection, because it doesn't pay off to keep every feature, but for now we're just going to have some fun.
Divide into test and training set. This is basically just executing the train_test_split call. If you look at the documentation, the train_size parameter defaults to None, in which case it is set to the complement of test_size, and if test_size is also None it defaults to 0.25. You can set it explicitly if you want; it doesn't really matter for us here. What the function does is randomly select some rows to be in the train set and the rest to be in the test set, keeping X and y aligned, so you get X_train, X_test, y_train, and y_test. If this doesn't fully make sense yet, I think it will when you work further down the notebook. Good.
Train the model. What did we do in the lecture? We created a LinearRegression model and called fit on it. Why is it called fit when we talk about training? Yeah, the terminology is a mess; in the beginning it confuses a lot of people, but fit and train mean the same thing here. And... nice, we got an error. So what did I forget? We still have strings and other non-numeric columns in the data. What I should have done earlier is select only the columns with numeric values; that wasn't really clear, so let's go one step back, set data to the numeric columns only, and re-execute from there. I apologize for the little mistake, but that's what happens all the time: you forget something, you fix it. Don't worry, it's part of the journey of being a developer or a data scientist.
Good, so now we've created a model and we need to test it, and we do that with a prediction: y_pred = lin.predict(X_test). It takes all the test data we held out and creates a new series of predicted values. Then we evaluate with r2_score; I always forget the argument order, but we just checked it: the true values first, so r2_score(y_test, y_pred). And the score is 0.78. I said "pretty good", but honestly, R-squared is context dependent, and whether 0.78 is good or bad here we don't really know. What we do know is the scale: the best possible R-squared is 1, and the worst is minus infinity. So being in this range indicates it's not entirely bad, but it could probably be better. I'm actually a bit surprised; I would have expected it to be worse, to be honest.
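The scoring step, true values first, predictions second, looks like this in isolation; the numbers below are made up.

```python
from sklearn.metrics import r2_score

y_test = [61, 75, 68, 80]   # the values we held back
y_pred = [60, 74, 70, 79]   # what the model guessed

# r2_score(y_true, y_pred): 1.0 is perfect, large negative means very poor
score = r2_score(y_test, y_pred)
print(score > 0.9)  # these predictions are close, so the score is high
```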
Perfect. If you want to, you should present your findings: you now have an indicator that can predict the overall score of players, so you can spot players when you have these metrics in your data set. It's up to you to have some fun with it, and I hope you will. I hope you enjoyed this one. In the next lesson we're going to do something amazing: we're looking at cleaning data. What does that mean? Data quality is something people love to talk about, but there is no single specific measure of it. When you look at data quality, it's about questions like: is there missing data, are there outliers, are there wrong values? If you don't deal with those, your models, like the linear regression model here or whatever model you're using, are not going to be very accurate. So far we've dealt with missing data by using dropna, but we will see that just skipping all the incomplete rows can be a bad choice; it can be better to fill in something like an average value, and you will get better results from your model. So cleaning data is very important. You might ask: is this really a problem in real life? It is, because where does data come from? It might come from sensors that malfunction, or from people typing it in, and when people type things in, they make mistakes and use different units; as we've already seen, some might type a height in meters and some in centimeters. There is a lot to figure out in data cleaning, and it is crucial for making your models good, so as a data scientist you cannot ignore this piece. Looking forward to meeting you in the next one. And if you liked this one: like, subscribe, and all that kind of stuff, and share it with somebody you think would benefit from it. That helps me a lot, and I'm thankful for any help, because I'm doing this for free so you can become a better data scientist. See you there. Bye bye.

As a data scientist, the accuracy of your models is in many ways the measure of your success, and cleaning data is key: the quality of the data going into your models largely determines the quality of the models themselves.
If your data is dirty and the quality is low, your model will be highly affected by that. So you want to learn how to clean data, dealing with missing data, outliers, and so forth, to get the best quality data into your model, so that the accuracy of your model's predictions is high. Why? Because that's basically what counts in the end: making a model that makes great predictions. So are you ready for that? I hope so. If you are new to this series, this is a 15-part course on data science with Python, and there's a link below where you can download all the resources and all the notebooks we're using, so you can follow along. In each lesson there's a project, and in this project we're going to investigate the quality of data, how you can improve it, and how it affects the model. Are you excited? I hope so. Let's begin.

Cleaning data: missing data and dealing with outliers. That's what we're focused on. Why do we do it? To increase the accuracy of our models and make better predictions, and if you don't know yet how high the impact actually is, follow along; this will be for you.
Because in the end what we need is high accuracy, and high accuracy gives better insights for your end customers. So let's dive in. What does cleaning data mean? Something people love to talk about is data quality, and data quality is often very vaguely defined and very context specific. It can be a lot of things: is there missing data, are there duplicates, are the measurements accurate (think sensor data), are there wrong data points, are there outliers that shouldn't be there? The goal of cleaning data is to increase data quality, and even though data quality is not precisely defined, that is the goal. There are basically three main things we're going to look at, which cover a lot of it; going further than this requires domain knowledge.
First, missing data, both whole rows and single entries. We already know that rows of data can be missing entirely, and single entries can be missing too; think of a survey where some people don't answer all the questions, so you don't have a full data set. Examples of handling it include replacing missing entries with the mean value, or interpolation, which works well in time series; we'll cover both in this lesson.
Second, dealing with outliers. What does that mean? Sometimes the default for a missing value in a system is simply zero, so instead of a proper not-a-number marker you just get a zero. Another case is plainly wrong values: if you know something about the domain, say, heights of humans in meters, and somebody is listed as 300 meters tall, you know it's wrong; but again, that requires domain knowledge.
Third, duplicates. Duplicates tend to sneak into systems, so the same data ends up represented twice, and that will also skew your models and any further calculations.
So far so good: missing data, outliers, and duplicates are what we will focus on, and going a step further requires domain knowledge. Good. Missing data is sometimes referred to as "not available" (NA) values in pandas, and a great source to learn about it is the pandas documentation on working with missing data. It's not just a great source, it's an amazing one: an entire, enormous page on how to handle it, which just emphasizes how important working with missing data is, because it will increase the accuracy of your models, as we will see.
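A small illustration of how pandas marks missing entries, in the spirit of the examples on that documentation page; the frame below is invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': [1.0, np.nan, 3.0],
    'two': [np.nan, 2.0, 3.0],
})

# isna() flags every NaN entry as True
print(df.isna())

# any() per column tells you which columns contain missing data at all
print(df.isna().any())
```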
I just want to show some examples here. We import pandas and NumPy; if you're not familiar with NumPy, it is the underlying data structure pandas is built on, and the reason we need it here is to insert the NaN (not-a-number) values you see. I create a DataFrame; we haven't covered creating DataFrames from scratch, so just ignore that part, it's not the point of this tutorial, and there are other courses that go into depth on it. What I want to show you are two types of missing data. The first is missing single entries: NaN, not a number. The data looks like a table with columns where some entries are present and some are missing; it could be surveys where somebody doesn't answer everything, or a malfunctioning detector, something like that. This is quite common. The second is missing full rows of data, which happens often and which you typically see in time series. Again, ignore the setup code; the point is that we have dates: the 1st of January, the 2nd, then the 3rd is missing, we have the 4th, and then the 7th, 8th, 9th, and 10th, with values 0, 1, 3, 6, 7, 8, 9 for the days that are present. You can see the pattern, and given this simple data set you could probably fill in the missing values yourself; this is where interpolation comes into the picture, and we'll get back to how to do it later. These are just examples of what missing data is.
Outliers. Dealing with them often, in fact almost always, requires domain knowledge, but here is an example. Imagine we have a list of weights and some of them are zero. The problem with zero is that it is a numeric value, not a NaN, so it's harder to catch. How do you know it's a wrong value? Maybe you don't, but in this case you do, because these look like human weights in kilograms, and nobody weighs zero kilograms.
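The zero-weight case can be handled by turning the zeros into proper NaN values with replace, so pandas treats them as missing; the weights below are made up.

```python
import numpy as np
import pandas as pd

weights = pd.Series([72.5, 0.0, 81.0, 0.0, 64.3])  # kilograms; 0 means "not recorded"

# A zero weight is impossible for a human, so mark it as missing instead
weights = weights.replace(0, np.nan)
print(weights.isna().sum())  # 2 entries are now flagged as missing
```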
Now let's demonstrate how this affects the results, because this is where it becomes really interesting. Why does it matter? Can't we just drop the missing values and delete them? Let's find out. We're looking at a specific data set: this house pricing data was a Kaggle competition. If you're new to Kaggle, they often run competitions where you can win money; this one is closed, I'm pretty sure, since it's quite old, but it's a great way to teach yourself things. The data contains a training and a testing set, and the goal is to predict prices on the testing set; that's how entries are evaluated. We will explore how the way we deal with missing values impacts the predictions of a linear regression model. Perfect. So we import LinearRegression, train_test_split, and r2_score, and I've downloaded the data and put it with the course resources for you. Let's read it and look at the shape: we have 1,460 rows of data and 80 features. Wow, that's a lot. What we want to do first is remove the non-numeric columns, which you can do like this, and then check for missing values. Why remove the non-numeric columns? There is value in them, but for our purpose here we don't need it; if you investigate further on your own, I highly advise you to look into them. We can see that a few columns have missing values, but it's a pretty decent data set: only a couple of them are affected.
Anyhow, what we're interested in, before doing predictions, is the correlation structure. Let's look at the correlations with the sale price, sorted in descending order. What I'm after here is: what are the strongest predictors in this data set? Obviously sale price correlates 1 with itself, then overall quality is the highest, and then you can see how the rest are correlated; remember to look down at the negative end too, since sometimes there's a strongly negatively correlated feature. This is just to get an idea of how the data set behaves. Good.
What I want to do next is write a helper function, a regression score test. It should take the independent features X and the dependent feature y, split them into a training and a testing set (as we did in the last project, and as we'll revisit in the machine learning lessons), fit a linear regression on the training part, predict on the test part, and return the R-squared score. One note: I use random_state=42 so you can replicate my runs and get the same results; if you use a different number, you'll get different results, because the split is random. Good. Now, this data set has a lot of values and almost no missing ones; there are a few missing entries here and there, but not a lot. What I want to do is run that calculation, split, train, predict, score, in a few different ways.
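The helper can be sketched end to end; the function name and the toy data below are my own, but the steps (split with random_state=42, fit, predict, score) follow what is described in the lecture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def regression_score(X, y):
    """Split, fit a linear regression, predict on the held-out part, return R^2."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=42  # fixed seed so runs are reproducible
    )
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return r2_score(y_test, y_pred)

# Toy data with a clear linear relationship, so the score should be near 1
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

print(regression_score(X, y))  # close to 1 for this nearly linear data
```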
First, I simply drop all the rows with missing values; this is our baseline case. Execute it, and that gives a score. What if we did something different? What if, instead of dropping them, we fill in the average value? This is a common approach: if you're missing values, put in the mean, because without any domain knowledge that's a reasonable guess. In our case we have no domain knowledge and haven't investigated anything, so it's a fairly blind guess, but maybe it's still better than skipping the rows. And, whoa, are you kidding me? That's an enormous improvement. Dropping the rows was actually a pretty bad idea; filling in the mean value was a great one.
Can we do better? Maybe, maybe not. Next we fill with the mode instead; let's look at data.mode() to see what that is. It gets the mode of each column along the selected axis; the mode of a set of values is the value that appears most often. So it fills each missing entry with the most frequently occurring value in that column. Let's see how that does, and it does pretty decently, actually a tiny bit better than the mean. With differences this small, it's difficult to conclude anything; it may depend on the specific data set, and we'll get back to how to judge such numbers. But these are two different approaches: one takes the mean, the other the most common value. Isn't it crazy that we went from the dropna score to these scores? And the two fill strategies seem about evenly good. Amazing.
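The comparison can be reproduced on a small invented frame; with the real house data the score gap was large, while this just shows the mechanics of fillna with mean and mode.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'area': [50.0, np.nan, 70.0, 70.0, 90.0],
    'price': [100, 120, 140, 145, 180],
})

dropped = data.dropna()                         # loses the whole second row
filled_mean = data.fillna(data.mean())          # NaN -> column mean (70.0 here)
filled_mode = data.fillna(data.mode().iloc[0])  # NaN -> most frequent value (70.0 here)

print(len(dropped), len(filled_mean), len(filled_mode))
print(filled_mean.loc[1, 'area'], filled_mode.loc[1, 'area'])
```

Filling keeps all the rows, so the model trains on more data, which is the usual reason it scores better than dropping.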
Perfect. Time series. What are we looking at here? Remember, in a time series there can be outliers or bad values, and also missing rows. We're looking at a weather data set with columns like summary, temperature, humidity, wind speed, visibility, and pressure. Missing time series rows: one way to find missing rows of data in a time series is as follows. You build a date range from the start to the end of the series, using the minimum and maximum of the index and an hourly frequency, since this data has a reading for each hour; that gives you all the indexes that should exist. Then you check which of those expected indexes are actually in the series' index with isin; negating that gives you a True/False mask, and applying the mask gives you all the missing timestamps. This is pretty neat: you can see exactly which dates you are missing data for.
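The mask trick looks like this on a small generated hourly series; the timestamps below are invented.

```python
import pandas as pd

# Hourly readings with two hours missing (02:00 and 05:00)
idx = pd.to_datetime([
    '2006-01-01 00:00', '2006-01-01 01:00', '2006-01-01 03:00',
    '2006-01-01 04:00', '2006-01-01 06:00',
])
series = pd.Series([10.0, 11.0, 13.0, 14.0, 16.0], index=idx)

# Build the full expected hourly range, then keep only the hours we do NOT have
full_range = pd.date_range(series.index.min(), series.index.max(), freq='h')
missing = full_range[~full_range.isin(series.index)]
print(missing)  # the two missing timestamps
```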
Just for fun, let's check one of them: take the first missing timestamp and look it up; there's no data at that location. If you look one hour later, there is data, but not at that specific hour, which confirms it's genuinely missing. Okay. Insert missing datetimes and interpolate; this is where interpolation comes into the picture. First you can reindex: reindex conforms the series to a new index, with optional filling logic, as the documentation puts it. So we reindex with the full hourly index, which inserts the missing hours as NaN rows. Then comes interpolate, the interesting one. There's a lot you can do with it, but we're going to use the default, linear interpolation, which does a decent job for most cases. What it does, as the documentation example shows, is fill each missing value on a straight line between its neighbors: if the values 1 and 3 surround a missing entry, interpolate fills in 2, because that's the most likely value between them.
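The reindex-then-interpolate step, on the same kind of small invented hourly series:

```python
import pandas as pd

idx = pd.date_range('2006-01-01', periods=6, freq='h')
# Readings with hours 2 and 4 missing entirely
series = pd.Series([1.0, 2.0, 4.0, 6.0], index=idx[[0, 1, 3, 5]])

series = series.reindex(idx)   # re-insert the missing hours as NaN rows
filled = series.interpolate()  # default linear interpolation fills the gaps
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```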
Let's apply that here. It interpolates, and we then look at the rows where the summary column is NA. Why check the summary? Because the rows we inserted by reindexing have NA in every column, including summary, so filtering on it shows exactly the rows the interpolation filled in on the numeric columns. If we had looked before interpolating, those rows would all be NaN; afterwards, each gap holds a value on the straight line between the closest observations on either side. Perfect, awesome.
Now let's focus on the pressure in millibars in 2006; one way to handle zero values is with replace. First look at the pressure in 2006: all the rows are there, which is interesting, and we assign the 2006 slice to a variable just to have it. But when we visualize it, and again, visualization is a great tool for finding outliers, we immediately see the problem: with a bit of domain knowledge we know millibar values should sit around 1000, and indeed most values are up there, but there are also a lot of zero values at the bottom, which are bad values. We can confirm it with a check: are any values equal to zero? Yes, there are. So what if we replace the zeros with NaN, interpolate them, and plot again? How would the chart look? It's the same data as above, but with all the zero values replaced by interpolated ones, and you immediately see the difference: the values now sit in a range from a bit below 1000 up to maybe 1050, whereas before the axis ran all the way down to zero because of the bad values. Good.
Finally, we want to be able to remove duplicates, and that's one of those things that is really easy, not in Python as such, but in pandas DataFrames: drop_duplicates returns a DataFrame with duplicate rows removed. Consider this example: do we have duplicates? Row one and row three are identical, so if we call drop_duplicates, row three disappears, because it's a duplicate. Awesome.
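The duplicates step in isolation; the frame below is invented.

```python
import pandas as pd

df = pd.DataFrame({
    'player': ['a', 'b', 'a', 'c'],
    'rating': [70, 80, 70, 65],
})

# Rows 0 and 2 are identical; drop_duplicates keeps only the first of them
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 4 rows before, 3 after
```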
So far so good. In the project we are going to do something very similar: we'll run a test, do some of this magic, and see how it affects performance, and you'll be quite amazed by it; I was myself. So I promise you it's going to be great. See you in a moment when I introduce the project.
Are you ready for this? I hope so, because it's going to be quite interesting: we're going to measure the impact of interpolation on a data set. I'll promise you one thing: I was surprised by how huge the impact is. This is why you need to understand data cleaning; it is an art in itself. So are you ready? I hope so. Let's jump into the Jupyter Notebook and get started. Measure the impact of interpolation.
And again we often have focus on the full picture here but in this one actually our focus is actually on step two right cleaning the data so the goal of the project is to see how big impact interpolation can have on results right and even though i say it’s only focus
Here in the end it is what adds value down here right because the better models and higher accuracy you have the better predictions you can make and the better predictions you can make the more value you can add to the customer so remember this always keep the big focus here good
So the focus is mainly on step two that’s what i said see the impact we will make on simple model usages uh the project will not go into details of step three to five perfect so the first one here we need to import it it’s just to what are the what are we
Using here we are using pandas numpy linear regression test train test split r2 square and matplotlib pipeline perfect then we need to read the data it is a weather predict park here data remember to assign it to a variable take ahead of it to just see how much data is there
Check the data types d types perfect check the length null values and zero values so we have length is not sum so that’s how many null values are there and data equals zero sum right that is how many are zero values perfect baseline so check the correlation to have a measure
If we did nothing right so this is correlation right so we make a correlation measure just to see what are the correlations right now we know pressure plus 24 has not a number and zero values uh we don’t know that yet but when you do it we know uh these are not correct
Values and we cannot use them in our model create a data set without these rows use filters like data there on drop down perfect check the size and serial variables check the data set of data set and data sets uh perfect so the data sets is the one
We created up here and data is the one we created originally check how many zero values each dataset has check if correlation for fun check the correlation of dataset we do have the same after the interpolated serial values apply replace and interpolate does the result surprise you notice how much interpolation
Improves the result right linear regression functions so here we create this function we also use used in the lesson it’s actually the exact same function so it takes features x and y uh yeah the feature is dependent and independent then we split it into training and testing set we fit it we
Predict and we create the r square score and then we taste the test the regression score function on data set and then finally we test the interpolated data set right so make an interpolated data set and get the result from regression score for interpolated data set whoa
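The null-value and zero-value checks from the steps above look like this in pandas (the four-row frame is made up just to show the calls):

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value and one bad zero reading
df = pd.DataFrame({'Pressure (millibars)': [1010.0, np.nan, 0.0, 1016.0]})

null_counts = df.isna().sum()   # missing values per column
zero_counts = (df == 0).sum()   # zero values per column

print(len(df), null_counts.iloc[0], zero_counts.iloc[0])  # 4 rows, 1 null, 1 zero
```

Note that NaN compares unequal to 0, so the two counts never overlap.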
Is this just exciting? I hope so. How much difference will it make? I said I was surprised, so maybe I know, maybe I don't; we'll see in a moment. But first, try to do it yourself: stop the video and give it a go. If you get stuck, play along and I'll help you, I'll cue you, don't worry about it. Everybody gets stuck all the time, but if you don't try by yourself, you won't learn; the best way to learn is to try and figure it out. Some details are difficult to remember, and when you've tried something yourself you're much more observant about those details when you see the solution. If you didn't try it yourself, you'll just say "yeah, that looks easy", but the key thing is to try. That's how you learn. Okay, see you in a moment when you have tried it yourself. Did you manage? Let me know in the comments what you found difficult and what was easier than you thought; I love to hear from you guys. Okay, perfect.
So let's jump into it. The first step, importing the libraries, did you manage that one? If you followed along you know you just have to press Shift+Enter, after marking the cell of course. Good. Next we read the data: read the Parquet file under the weather-prediction files, and take a head() of it. Perfect, we have a dataset. We don't know yet how big it is, but we have the pressure and the pressure 24 hours ahead. Basically, we want to take the current pressure and predict the pressure 24 hours into the future, and you see the data runs hour by hour. Checking data.shape, we have about 96,000 rows of data, or observations. Perfect.

Check the data types with data.dtypes. Did you manage this one? Both columns are float. Why do we check this? Because sometimes there's bad data in a column, and the data type ends up as something like a string instead of a float. Then check the length, null values, and zero values: len(data) we already knew, data.isna().sum() tells us we have 38 missing values, and (data == 0).sum() counts the data points that are zero; we have 1,288 of them. Hmm, that's actually a lot. The rows where pressure +24h is zero cannot be used, and neither can the rows where pressure or pressure +24h is NaN. Good.

As a baseline, let's look at the correlation: the correlation between pressure and pressure +24h is about 0.4. Interesting. We know pressure +24h has NaN and zero values; these are not correct values and we cannot use them in our model. So we create a dataset: first we copy the data, then we apply dropna() to remove all the rows with missing values, and we also remove all the rows with zero in pressure +24h. Let's look at the sizes: we lost something on the order of 25,000 rows; I don't do that calculation in my head.

For fun, let's check the correlation on this filtered dataset. Whoa, that's extremely fun: the correlation went really, really low, down to about 0.08. Then we do the same after interpolating the zero values, applying replace and interpolate: dataset.replace(0, np.nan).interpolate(), then .corr(), and you see we go from 0.08 up to 0.79. The first is by far not correlated at all, while the second is correlated. Where you put the mark is context dependent, but the threshold is often around 0.7 or 0.8 before you call something correlated.

Then we create this amazing function, regression_score. It takes X and y and does a train_test_split into X_train, X_test, y_train, y_test; I don't remember the exact test_size, I'm just using the same numbers as in the lesson. Then we make a LinearRegression model, fit it on X_train and y_train, predict on X_test, and return the r2_score of y_test against y_pred (I always forget the argument order: the true values come first). Perfect, so now we have this helper function. Let's use it: regression_score on the filtered dataset, with pressure as X and pressure +24h as y. The score is close to zero, which is pretty bad.
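A cleaned-up sketch of that helper function. The test_size and random_state values are my assumptions; the video doesn't state them, it just reuses the lesson's numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def regression_score(x, y):
    """Fit a linear regression on a train split; return R^2 on the held-out test split."""
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    # r2_score takes the true values first, then the predictions
    return r2_score(y_test, y_pred)

# Quick sanity check on perfectly linear made-up data: R^2 should be ~1
x = np.arange(100).reshape(-1, 1)
y = 3 * np.arange(100) + 5
print(regression_score(x, y))
```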
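The correlation jump from interpolating zeros, seen a few cells back (roughly 0.08 up to 0.79 on the weather data), can be reproduced on synthetic data; every number here is made up, only the pattern matters:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(1010, 10, 500)   # made-up "pressure" readings
y = x + rng.normal(0, 1, 500)   # "pressure +24h", closely tracking x
y[::10] = 0                     # corrupt every 10th reading with a zero

df = pd.DataFrame({'Pressure': x, 'Pressure+24': y})
before = df.corr().loc['Pressure', 'Pressure+24']

# Replace the bad zeros with NaN, then fill the gaps by interpolation
fixed = df.replace(0, np.nan).interpolate()
after = fixed.corr().loc['Pressure', 'Pressure+24']

print(round(before, 2), round(after, 2))  # correlation jumps once zeros are handled
```

The handful of zeros sits so far from the real pressure range that it wrecks the correlation; once they are interpolated away, the underlying relationship reappears.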
And then we do the same here Data set interpolated data set replace zero with n p not a number and again we need to replace this uh zeros to another number because those are the one that’s how interpolate does it takes all the all the how’s it called all the missing all the nana number and interpolate them
If this is zero it doesn’t know it has to interpolate it right so oh uh data set interpolated and then we have pressure and data set interpolated pressure oh pressure plus 25 right perfect so we have some keyer here press pressure isn’t it called pressure i don’t know i’m misspelling something here
Data sets interpolated oh i meant the the hour here that’s why i’m perfect good uh so here we have it right so this is enormous difference again right so we have something close to zero and something at 62 or 0.62 the ideal score is up around one but we’re not getting straight
There but again you see how interpolation actually does an enormous impact on your result of your model right this is this is great this is basically useless right this is quite useful uh i mean again the score here is uh context dependence we don’t know if it’s
Insanely good score or quite good score but we know up around one is good and around zero is really really bad and negative is even worse so this is basically what we’re doing i hope you enjoyed this one in the next one it’s actually going to be amazing because we’re going to
Look at classification we’re going to look at what is classification and why such a big thing in data science and learn a bit about machine learning we’ve already been using linear regression which is like the base model you need to know that one because you use it all the
Time but classification is also a big thing in data science so you need to master that and then we’re going to look at machine learning in general and how to continue with machine learning because the more models you know the more things you are able to do as a data scientist
But remember stay focused on the full path first the data science workflow instead of just knowing everything in details in each step you need to cover everything and create value and you can do that with all the tools i’m giving you here in this course so when
You master all the tools in the course you will be good to go and in the end please let me know how it went this thing where you’re surprised about data quality and data cleaning i was at first because i didn’t know it has such a big
Impact but please let me know and comment if you like this also please hit subscribe and hit like it helps me grow and uh please share with somebody it always it’s nice to be good to people that need this help so see you in the future i enjoy creating these things so
Bye-bye!

Classification is a common machine learning task, and you probably know it from, for instance, the spam filter in your mailbox: it's built with classification, spam or not spam. As a data scientist this is highly valuable too. Imagine that your end user has a lot of reviews and wants to figure out which of the bad reviews to focus their attention on; you can build a machine learning model to do that. Or they might say: we have a lot of customer data and we need to find the customers with high potential of buying a lot more, so where should we focus our effort? This is classification: you need to classify customers into different groups.

So this lesson is about classification, and to get started we need to understand a bit more about machine learning, so that's what we're going to do. At the end of this lesson you will know about machine learning in general, what classification is, and how to do it. And in the project, which is going to be amazing, you get a dataset that is hidden from you, and you need to build a model that makes accurate predictions, or classifications, on it. What? Yeah, it sounds crazy; don't worry, you'll get there. So by the end you'll know what machine learning is, how classification works, and how to apply it on a real project. Are you excited? I hope so.

If you're new to this series, it's a 15-part course on data science with Python where we cover everything you need to create customer value in the end. There's a link below where you can download all the resources and notebooks, follow along, and do the amazing projects with us. So are you excited? I hope so. Let's jump to the Jupyter Notebook and get started.

Classification. This is one of those lessons where you create a classification model to generate insights for your customer, so again, keep the bigger picture in mind: why are we doing our data science analysis? It is to create value for the end user. If you're new here, go back to the first lesson, where I explain why this data science workflow is so important to keep in focus all the time. Perfect.
So what is classification? Let's read: it's a machine learning algorithm. But what is machine learning and why does it matter? We'll get there. The main thing classification does is classify rows of data into categories, or classes. We all know classification tasks from everyday life, and that's what classification does: it can classify things as good or bad, or whatever it may be. Good.

Before we dive into what machine learning is, I just want to mention that I have a free 10-hour video course: 15 lessons, 30 Jupyter notebooks, and 15 projects. If you click the link (there should also be one in the description), you'll immediately see the full scope of the course, which is insane, and for each lesson there are Jupyter notebooks you can follow along with. It's all free for you to use, and it has been very popular so far. If you already know about machine learning and just want to skip ahead to the classification part, there should be chapters inside the video so you can fast-forward.

The best way to understand machine learning is to compare it with something you know: classical computing. So what are the differences? In classical computing you give specific instructions: when you see this, do this; when this happens, do that. It's static logic. Imagine you had to explain to your five-year-old how to get a glass of water in the kitchen: open the drawer, take the glass, put the glass under the tap, turn on the tap, turn it off when the glass is full, and so on, explaining everything in detail.

Machine learning, on the other hand, is a different approach. You feed data into the algorithm and essentially say: figure it out; here is the outcome I want when I see this kind of data. And the idea with machine learning is continuous improvement: the more data, and the more accurate and precise that data is, the better the model. This would be like telling the kid: when you come back from the kitchen, I want you to have a glass of water, and then just letting them go wild. The kid goes in, looks in all the drawers, figures out which one is a glass, checks around for the water, finds the tap, opens it, gets the water, and comes back. It's a different way of learning: you don't give specific instructions, you state the goal you want when they're done.

This is kind of mind-bending, because traditionally everybody was thinking the classical way. I'm not an expert on raising kids, but I know that for my own learning I like to experiment and figure things out; I don't like being instructed on everything I do. And I know computers don't have feelings, but still, the best results, as machine learning shows, come from that style of learning. Good; that's just a repeat of what I said above.
Machine learning works in basically two phases: the learning phase and the prediction phase. In the learning phase you build the model, which is then used to make predictions; the goal of the model is to make predictions.

How does it do that? There are a few steps. First you need to get some data, and this is where you identify data sources, quite similar to the data science workflow. Then you need to do some pre-processing: clean, prepare, and manipulate the data. We covered cleaning in the last lesson, and it's extremely important, because the quality of the data determines the quality of the model; the model's quality is highly dependent on the data's quality. Preparing and manipulating can mean things like converting the data into the specific formats a model needs, which can sometimes be a pain, but models are getting easier and easier to work with nowadays, so often you don't need to massage the data into special formats, and I appreciate that, because we just want to have it easy.

Then you train the model. In general, models are divided into three categories: supervised, unsupervised, and reinforcement learning; we'll get back to those in a moment. Finally you test the model. You need to figure out how accurate your model is, because you can train a model and it can still be bad, and you need to know that. So you keep a training set and a test set: you hold back some of the data, where you know the results you want, to test the accuracy of your model. We'll get back to all these things.

Now, a bit about supervised, unsupervised, and reinforcement learning, as it says down here. Supervised is when you tell the algorithm which category each data item is in: each item in the training set is tagged with the right answer. In supervised learning it doesn't have to be categories in general, but here we're working with categories. The algorithm knows exactly what you want: when I see this, I want this. And when you test it, you show it data where you know the answer and ask: what do you predict? That's the accuracy.

Unsupervised, on the other hand, is when the algorithm is not told what to do and has to find the structure itself. Wow, so how can you know the quality of that? Through future usage: when you use the model, you'll see whether it adds value or not. There are cases where you just want to create some groupings, for instance, and over time you'll see whether they're useful.

Reinforcement learning I actually love, and I've done some tutorials on it that you should check out, because it's one of those approaches that is easy to understand and that you can program yourself from scratch. It teaches the machine to think for itself based on rewards for past actions. The analogy often used is a dog: you don't speak the same language as a dog, so how do you communicate? When the dog does something good, you reward it. The analogy usually also includes punishing bad behavior, which is the part I'm not always happy about; I'd rather reward good behavior and do nothing for bad behavior. This way the dog learns: every time it gets a reward it thinks, "when I do this, I get a reward", which encourages better behavior. The same principle is used in reinforcement learning, and I must say, I love this stuff; it's amazing, it's crazy, so go check out that tutorial.

Test the model: yes, that's what we covered. So again, in the learning phase we get the data, pre-process it, train the model, and test it.
And we had the three categories of models we just went through. After you have the model, the prediction phase takes input into the model and makes predictions of the categories: it takes some features it knows, does the magic, and outputs a prediction. And just to mention it, this can be an iterative process: when you get more data, you go back and train the model again, and maybe you get more features, that is, more data types, more columns in your data, and the model becomes smarter, or better, and so forth. It's continuous learning.

Supervised learning: given a dataset of input-output pairs, learn a function that maps input to output. That's what I said. There are different supervised tasks, but we will start with classification, which is the task of learning a function mapping an input point to a discrete category; discrete just means it is a classification. Good.

Now we're ready to use the iris dataset. A bit about it: it's a classic dataset for practicing classification, and it consists of measurements of different iris flowers. The best way to get to know it is to look at it: it has three classes of flowers, and the point is, can you predict which flower each row is? I already wrote a few lines here, and you see the data: the sepal length and width and the petal length and width, so four features, plus one species column. Let's check the species: data['species'].value_counts(); I first typed count_values, but it's called value_counts. Here we see we have exactly 50 of each species. It's a small but fun dataset.
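The value_counts() check can be sketched like this; here I build the species column from scikit-learn's bundled iris data rather than the course's file, so the loading code is my assumption:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
# Map the numeric target (0, 1, 2) onto the species names
df['species'] = df['target'].map(dict(enumerate(iris.target_names)))

# Count how many rows each species has: 50 of each
print(df['species'].value_counts())
```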
It's one of those datasets you just need to have played with once. Okay, machine learning. These are the steps we're using here. We're going to look at two different models for classification: one is called SVC and the other is KNeighborsClassifier, and what you'll realize is that the journey is the same, it's just a different model. SVC is C-support vector classification; there are some details to it, but often you can just scroll to the bottom of the documentation and see exactly what to do, and that's basically what we're doing here.

So let's follow the seven steps listed here and see what happens. First we import: train_test_split, SVC, and accuracy_score to get the accuracy. You'll probably notice similarities to what we've done previously, because if you saw the linear regression parts in the two last lessons, this is basically the same thing. For the features, we can use data.drop('species', axis=1), so now we only have the four feature columns with species removed, and y is data['species']. That's steps one and two done. Then we divide the dataset, which is just taking the train_test_split piece and executing it, so now we have a training and a test set.

Next we create the model: svc = SVC(). Then we fit it, and again, this is basically the same pattern: svc.fit(X_train, y_train) (oops, I wasn't supposed to type that bit). After fitting you predict: y_pred = svc.predict(X_test) (SVC, I can barely pronounce it), and then the accuracy score: accuracy_score(y_test, y_pred). And whoa, we actually get full accuracy. That in itself maybe doesn't tell you much, but it tells you this model is pretty amazing.

What could be interesting is this: we have four features, so which ones tell us the most? For that we have a helper function using permutation importance for feature evaluation. We execute it and then run these few lines; there isn't much programming insight needed here, it's more about showing how it works. And you see that this feature is the most significant; the others do have some impact, this one a bit more, and this one the most. Good.

I also thought it would be fun to visualize the feature importances. I sorted them first, so we can see which one is most important, and then visualized them with df.plot.bar (not "par", I lost a letter there), and you can also pass something in figsize. Here you can see how important the features are: it's mainly this one that has the biggest impact.
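The whole SVC pass, including the permutation-importance check, can be sketched like this. I load iris from scikit-learn and pick the random_state myself, since the video doesn't show its exact split values:

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Four features (sepal/petal length and width), three species
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print(accuracy_score(y_test, y_pred))  # typically very close to 1.0 on iris

# Shuffle one feature at a time and measure how much accuracy drops;
# a big drop means the model relied heavily on that feature
result = permutation_importance(svc, X_test, y_test, n_repeats=10, random_state=42)
print(result.importances_mean)  # the petal features usually dominate
```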
One more interesting thing: these two features are the most important, so let's try to visualize them. I'm going a bit rogue here, I didn't prepare this, but we have one problem when we visualize it. Let me show you: we make a scatter plot of one feature, petal length (cm), against the other, petal width (cm). (I had a typo in the column name at first.) So here's the plot, and the problem is we can't see which species is which. What we need is a color mapping.

How do you do that? I like to make a dictionary mapping the species names to colors. (I went all rogue here and pressed Ctrl+C by mistake, so I copied the names down here; I'll delete them in a moment.) The point is, you make a dictionary saying this species should be blue, this one red, and this one, let's take yellow; I don't know if the coloring is conventional, it's just so you can see the difference between them. So that's the color map, how I want to map the colors. Then we make a Series: take data['species'], apply a lambda x that looks up color_map[x], and pass that as the c argument, the colors, to the scatter plot.

So what I'm doing here is coloring the irises by species, and you can really see it, this is amazing: all the blue ones are down in one corner, easy to detect, so the main feature is the one down on the x-axis here. All the red ones are over here, with a bit of overlap along the x-axis, and the yellow ones (this should apparently be yellow, though it doesn't look extremely yellow to me) are over there. And you also see the second most important feature at work on the other axis; you obviously see that. Perfect.
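That color-mapped scatter plot can be sketched like this. Again I load iris from scikit-learn, the blue/red/yellow choices just mirror the video, and the Agg backend line is only there so the sketch runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend: render without opening a window
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
df['species'] = df['target'].map(dict(enumerate(iris.target_names)))

# Dictionary mapping each species name to a plot color
color_map = {'setosa': 'b', 'versicolor': 'r', 'virginica': 'y'}
colors = df['species'].apply(lambda x: color_map[x])

# Color each point by its species so the clusters become visible
df.plot.scatter(x='petal length (cm)', y='petal width (cm)', c=colors)
plt.savefig('iris_scatter.png')
```

The apply-with-lambda step is exactly the trick from the video: it turns the species column into a column of color codes that matplotlib can consume directly.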
Good, I went a bit rogue there, but I like to visualize things because it improves my understanding, and I hope it helps yours as well. Awesome.

Okay, KNeighborsClassifier, and again, it's basically the same story we want to go through here. We already divided the data into training and test sets and all that, so you should be able to go straight forward: create the model, fit(X_train, y_train), and then predict; I need to recalculate this one. It complained about a missing argument at first: I had written fit twice when the second call should be predict, and I was staring at the first fit wondering what was missing. Again, multitasking is not a good thing. With that fixed we call predict(X_test), then accuracy_score(y_test, y_pred), and again we get a high accuracy. Maybe we want to check whether different features are important for this model, so let's run the permutation importance for KNN too, and we see it gives the same picture as for SVC; it is the same.
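The KNeighborsClassifier pass mirrors the SVC one; a sketch with scikit-learn's iris data and my own split values:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)   # predict on the test features (not fit again!)
print(accuracy_score(y_test, y_pred))
```

Only the model line changed compared to the SVC version; the split, fit, predict, score journey is identical, which is exactly the point of the lesson.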
And it kind of makes sense when you look at the picture: this is what's most important, and it looks easy. We haven't investigated the other features in depth, but I'm pretty sure this makes quite accurate predictions. So that's pretty much it, and now we are ready for the project. In the project we're going to look at a dataset where we don't know what the features are, but we still need to build a model that makes good predictions. Are you ready for that? I hope so, because it's going to be amazing, so see you in a moment.

Are you ready for this? I hope so, because it's going to be amazing. Let's look at this project: classification with hidden features. What does that mean? Don't worry, we'll get there. The goal of the project: you get hired by a company.
They classified that data set so classified means you are not able to read the data and is this unrealistic uh maybe maybe not because sometimes actually you have user data which is hidden by gdpr but they still want to use the data set so they need to hide
The features so it is not unlikely you would get a data set where you actually don’t know exactly a lot of things about it so the features are hidden that is you do not know what they are they ask you to create a model to predict classes right how accuracy can you
How high accuracy can you predict the classes right are some features more important than others right so they’re also interested in uh their findings for them is like what matters and what doesn’t matter so right so again is this a realistic scenario maybe maybe not but in the end you need
to focus on what gives the customer value, and if they give you data you know little about, well, the quality of the result will also depend on that, right? Good, perfect. So step a (not one, a) is just to execute this cell to import all the libraries you need. You have two models here, the two models from the lessons, which we're going to use and compare, and feel free to add more models later if you want. Then we have the classified data. There's a stray 'v' in the file name here which shouldn't be there; we'll see that when we go through it. Then inspect the data: take the length, and run value counts on the column containing the classes to see how many values there are of each.
Prepare: we run info to get some information about the data and see what it is, and check whether there are any missing values. Then we need to do our machine learning thing: we have the target (the dependent feature, the labels, whatever you want to call them) and the independent features. So many names, but it doesn't really matter. Do that, and then we have a train test split, and then we train, fit and score the model. Then we find the most important features and visualize the result, which is also what we did in the lesson, to see how it's done. Then do the same for the k-neighbors classifier, which means we need to repeat all the steps above for KNN. Then the conclusions: does the other model use the same features, does it conclude the same things? Then write down your findings: what would you write to the company, how would you present the findings? Good questions. And how do you follow up? This is a potential long-term relationship with the company: how can we follow up and improve the model once more data is available? Can you make any recommendations? How can you turn this into a long-term relationship? That's often what you're looking for when working with customers. So my advice now is: press stop and try to do it yourself. If you get stuck, play along and I'll help you, don't worry about it, we will manage to do it together. Thank you, so see you in a moment. Stop, play, play stop, press stop... I make no sense, stop.
Did you press play or stop, or stop play? What did you do? Let me know in the comments; I make no sense sometimes, I know, and I apologize for that. So let's get started. The first part is just to import this one, so that's Shift+Enter. Did you manage that? I almost didn't, it took some time. Then we need to read the data, so let's say data = pd.read_csv, and the file is the classified data. There was a 'v' in the file name which should not be there, and I told you that; when you watch this, it is actually repaired. Then we take data.head() as usual. Here we see an 'Unnamed: 0' column which I don't want, so we set index_col=0. We have the features, one two three four five six seven eight nine ten of them, and then we have the target class. So inspect the data: just use len. How much data do we have? We have 1000 rows. And then we take the target class column, and what do we do? We need value_counts, and we have 500 samples, or observations, for each class. That's pretty good, because sometimes you get something skewed, with 20 in one class and 1000 in the other, and it's difficult to make a good model of that. Good. Check the data types: data.info(). The good news is they're all floats; we like floats when we do regression and classification because it makes things easier, and the classification class is an integer.
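A minimal sketch of these loading and inspection steps. The real notebook reads a CSV (the path and the column name 'TARGET CLASS' are my guesses from the audio, so they are assumptions); here a tiny stand-in frame is used so the snippet runs on its own:

```python
import pandas as pd

# Stand-in for pd.read_csv("files/classified_data.csv", index_col=0);
# the path, the feature names and the "TARGET CLASS" column name are
# assumptions for illustration, not confirmed by the video.
data = pd.DataFrame({
    "WTT": [0.91, 0.64, 0.72, 1.23],
    "PTI": [1.16, 1.00, 1.20, 1.30],
    "TARGET CLASS": [1, 0, 1, 0],
})

print(len(data))                             # number of rows
print(data["TARGET CLASS"].value_counts())   # class balance check
print(data.dtypes)                           # check the data types
```

With the real file you would see 1000 rows and 500 observations per class, which is the balanced situation discussed above.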
Perfect. Check for null values: data.isnull().any(). We are lucky, the data set is really awesome, so there's nothing to take care of. And could there be data that should be cleaned? Remember the cleaning lesson. Well, we basically know nothing about the data, so it's difficult to say; we must assume that our customer knows what they're doing. Good, so far so good. Now we need to split into dependent and independent features, so let's do that: take data and use drop, and the column we drop is called 'TARGET CLASS', along axis=1. Then y is data['TARGET CLASS']. So now we have X and y, the independent features and the labels. Perfect. Divide into training and test set: this is X_train, X_test,
y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42). The test size here is 20 percent; the normal size is around 20 to 25 percent. The random state is there so that if you want to reproduce exactly what I'm doing here and get the same results, you can; otherwise it will use some randomness. And why do I use 42? Do you know the story about 42? Let me know in the comments. If you don't, check it out, google 42, and you'll realize what 42 is. As a real geek you should know 42.
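The split line spelled out as a runnable sketch. The real X and y come from the classified data; stand-in arrays are used here so the snippet is self-contained:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # stand-in features (10 rows)
y = np.arange(10) % 2              # stand-in labels

# 20% held out for testing; random_state=42 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
print(len(X_train), len(X_test))   # 8 2
```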
So, train, fit, score. Let's do that: svc = SVC(), that's the one I cannot pronounce. Then svc.fit(X_train, y_train), then y_pred = svc.predict(X_test), and accuracy_score(y_test, y_pred). And here we have it, I love it: 'X_train is not defined'. So what happened there? I didn't execute that cell. Boom, there you have it. Perfect: 0.995. That sounds pretty good. We don't know yet whether it's good; it basically means that only a small fraction of the predictions are wrong. Okay, to find the most important features, use this one here.
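A self-contained sketch of the train, fit, score step. The lesson scores the classified data set; synthetic data stands in here so the snippet runs on its own:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the classified data set
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

svc = SVC()
svc.fit(X_train, y_train)        # train the support vector classifier
y_pred = svc.predict(X_test)     # predict on unseen data
score = accuracy_score(y_test, y_pred)
print(score)                     # fraction of correct predictions
```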
Right, this is actually what we have done already. Let's take this one as well, and we see, hmm, interesting: it's difficult to figure anything out from the raw numbers, and that's why we visualize it. So let's set sorted_idx equal to this one, and then build a DataFrame, remember what we wrote in the lesson, and then we take the DataFrame and call plot.barh. And here you actually see pretty clearly what the important features are. But the big question is: if you use another model, are the same features still the ones that matter most? That's an interesting question, because if they are, we have more confidence to say that these features, let's say these three features, really matter. If the same features score high in the k-neighbors model, well, maybe we have more confidence to
say that these are the most important ones. So let's try to figure that out: kn = KNeighborsClassifier(), then kn.fit(X_train, y_train), I'm getting dizzy in my head here, then y_pred = kn.predict(X_test), and the accuracy score. We also just want to see whether the two models are doing the same job, so accuracy_score(y_test, y_pred), and it has the same accuracy. And what was the next thing we did? We did the permutation importance, so we just reuse the same code here to speed things up a bit. Normally I would not reuse the same variable names, because then you overwrite the ones above, but let's just do it here. So here we have it: are they actually identical? HQE, EQW, PJF... so actually the same three features, but it seems like this one feature scores a bit lower than the others. Here it has less confidence, or whatever you would call it, so this one is around 0.06, and it's the same here, but these score higher here. Basically it concludes with the same three main features, and that's actually the conclusion here.
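The comparison across the two models can be sketched like this; synthetic data stands in for the classified data set so the snippet is self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data; the lesson compares SVC and KNN on the
# classified data set.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

results = {}
for model in (SVC(), KNeighborsClassifier()):
    model.fit(X_train, y_train)
    # shuffle one feature at a time and record the drop in score
    r = permutation_importance(model, X_test, y_test, random_state=42)
    results[type(model).__name__] = r.importances_mean
    print(type(model).__name__, r.importances_mean.round(3))
```

If both models rank the same features highest, that gives you more confidence in the conclusion, which is exactly the argument made above.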
It is: HQE, EQW and PJF are the most important features. So, reporting: how would you present your findings? These are good questions. Actually, I would say a great way to present it is to include these two charts you generated in your findings. Why? Because they tell the story about the work you did to find the accuracy, and moreover you also present the accuracy itself and can say that both models reach the same accuracy, and that these three features show the most important impact. That tells a great story for your customer. I'm not going to do it here, because it should be your story, your findings. Finally, are there any actions you can take? I mean, it's difficult, right, but
you can have a dialogue with your customer and say that we actually don't know anything about the quality of the data. Do we need to do some data cleaning? You could do some research: are there any null values? No, we didn't find null values, but are there zero values or something else? You could do some mapping of outliers, that kind of testing too; I would do that if I had a customer, just to check. So this could open a dialogue: how confident are they in the data quality, and how accurate does the model need to be for them? Because if, say, 90 percent is good enough, then we're basically done. And are there other features they can get in the future? Does it make any difference to have these features or not? I would create the same models with only those three features and see how they score: does it score better, does it score worse? You know, do some experimentation to figure it out. Later in the course we'll
cover feature selection, which is also where you learn how to select features. I hope you enjoyed this one, and if you did, let me know in the comments, and please hit like, subscribe and share. The next one is going to be amazing, because it's going to be about feature scaling. What is that, actually? We will need our statistics again, and I'll show you what feature scaling is and why it's important. We'll investigate that, so it's going to be quite interesting, and I hope you're ready for it. Like, subscribe, all that kind of stuff; I appreciate it because it helps me grow and motivates me to create more awesome content for you guys, so you can become better at what you're doing. See you in the future, bye-bye. When it comes to machine learning, feature scaling is one of the things you have probably heard about. But what is feature scaling, and why does it matter? In this lesson we'll figure out what feature scaling is, what types of feature scaling there are, when to use it, and actually investigate
What kind of impact it has on the machine learning models so at the end of this lesson you’ll know which machine learning models to be aware of where feature scaling actually matters and how you can measure the impact of feature scaling and what feature scaling is in
the project you get a data set where you can investigate whether feature scaling matters for your machine learning model. Okay, I think you're ready for this. I'm very excited, because in the beginning feature scaling for me was just like, wow, does it really matter that much? Follow along and you'll see why it matters and when it matters. So let's get started. If you are new to this series, this is a 15-part course on data science with Python. There's a description down below; click the
Link and you can get all the jupyter notebooks all the resources we are using so you can follow along and do the projects along with us if you’re totally new to this i advise you to go to the webpage read about it get started on the level where you feel comfortable and
Let’s begin feature scaling so what is feature scaling about it’s again we are around the model working with the model analyzing things and why does feature scaling matter it’s because in the end you get more accurate insights more accurate predictions with your models so hence it improves the overall value
of your entire data science workflow, because what matters in the end are the useful insights and the predictions you can make. Good. So what is feature scaling? Feature scaling transforms values into similar ranges so machine learning algorithms can behave optimally; some machine learning algorithms require data to be in the same ranges, and we're going to explore when and how that matters. Features spanning different magnitudes can be a problem for machine learning algorithms: imagine one feature varies on one magnitude and another feature on a much bigger one. If the distance between points matters in the machine learning model, well, the distances on the big scale are much bigger than on the small scale, so that feature has more impact, and that might be a problem. Feature scaling can also make it easier to compare the results from your machine
learning model: when you scale, you get something that is easier to compare. There are two main categories of feature scaling. First there's normalization, a special case of min-max scaling, where you normalize the values to between zero and one. In some cases this is actually what you want, because when all the values lie in some arbitrary intervals, after normalization you know they are all between 0 and 1. But there are cases where this is not optimal, because sometimes you have extreme outliers, and these outliers are actually part of the data set. Then you can use something called standardization. In this case you subtract the mean from each value and divide by the standard deviation, which means the standard deviation will always be one. If you have extreme outliers, they are still outliers, but one standard deviation of your data is within a range of one, which means that 68.2 percent of the data points are within one standard deviation of the mean, and the mean value will be zero for all features. So the problem with normalization is that it is very sensitive to outliers, while standardization is less sensitive to outliers. Can you say that one is better than the other? No, it depends on the data, and you need to experiment.
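The two scalings in formula form, as a small self-contained sketch:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Normalization (min-max): (x - min) / (max - min) squeezes values into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std gives mean 0 and standard deviation 1
x_stand = (x - x.mean()) / x.std()

print(x_norm)                         # values between 0 and 1
print(x_stand.mean(), x_stand.std())  # numerically 0 and 1
```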
Some algorithms are more sensitive than others: the most sensitive ones are the distance-based algorithms, because what they use as a measure is the distance between things, and again, if you have big distances between points on one feature and small distances on another, it makes a difference. Some examples are KNN, k-means and SVM. Good. Again, I think the best way to learn is to dive right in and work with it, so the goal here is obviously to work on a data set
And then learn about normalization and standardization okay so let’s just dive into this data set here and look at it it’s a weather data set it has a row per day it seems like it and it has some features here some are numeric some
are not. Let's dive a bit into it and run info. We see we have wind directions, and what I'm looking for are objects: rain today and rain tomorrow, and the one we want to predict is rain tomorrow. Then there's the risk-of-rain column; I think we should exclude that one from our data set, if you ask me. Good, so far so good. We can also run data.describe() and take a look at the statistics. Remember, the most important one here is actually the
count, and the count is the same for all of them here, but it tells you something about how much data you have and how accurate your model can become. Then there are the mean values, and you see the means are different depending on the feature: here we have some around 50 or 60, and here we have some at three, five and seven, so it varies a lot. The standard deviation is also different for all of them: here we have 16 and 15, and here something around two. And then we have something about the distribution, which we're not going to go into the details of. So far so good. We want to simplify a tiny bit, so we will only focus on the numeric columns and we'll remove all missing values. Our goal is to explore normalization and standardization, as I already said, so let's just import this one here.
So we want to simplify our data set. How do we do it? The one above was called data, so let's call this one data_clean instead: data_clean = data.drop(...). And what do we drop? We drop the risk column, RISK_MM, because I think this is actually a prediction of the risk of rain tomorrow. We don't know, but it sounds like it, so I think we remove it just to be fair. Then we have data_clean = data_clean.dropna(). What we have now is a smaller data set: we had 3334 rows before, and if we take the length of data_clean we see that we actually dropped a lot of them. That's a lot to drop, but for simplicity let's not care about that right now. We could do a lot of things to improve the data quality, remember the data cleaning lesson; if you're interested in that you should check it out, but it is not the focus here, so right now we just ignore it. Perfect, because basically right now we
are just focused on single observations, and if some are missing it doesn't really matter in this case. So far so good. Let's make our feature selection: from data_clean we can actually just take all the columns that are numeric, with select_dtypes(include='number'), because the target, rain tomorrow, is yes or no, so it's non-numeric, and we can just take all the numeric ones as X. Then y is the rain tomorrow column of data_clean. And then we need to transform y. How can we do that? There are many ways; I'll just use a list comprehension here: 0 if value == 'No' (with a capital N) else 1, for each value in y. I think this should do the trick. Okay, it didn't quite do the trick at first: we needed an equals sign, at least. Good, now it did the trick. So we have y, and it is a lot of zeros; if we sum it we get 416 yes values, so there are some in there.
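The list comprehension in question, as a tiny runnable sketch with stand-in values:

```python
# Map the "No"/"Yes" target strings to 0/1 with a list comprehension,
# as in the video (note the capital N in "No").
y_raw = ["No", "Yes", "No", "No", "Yes"]
y = [0 if value == "No" else 1 for value in y_raw]

print(y)         # [0, 1, 0, 0, 1]
print(sum(y))    # 2 -- the number of "Yes" observations
```

On the real data this sum is the 416 yes values mentioned above.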
Okay, let's also divide it into X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42). Is there anything else we want to know? Not really... well, maybe, actually. What I'm doing here is that I just want to look at X_train, because we're going to compare it all the way down there. So describe: you see here we have the same description as before, almost the same, only now we only have it on the training data, so there might be small differences. Here we can see the mean values and the standard deviation, and those are the ones I want you to look at first, and then the minimum and the maximum, which are also quite interesting. The last thing we could do is X_train.plot.scatter, where we take x equal to the rainfall column, and for y we just take some random one, the wind speed at 9 a.m., that's it. Oh, we didn't have the key right: wind speed 9 a.m., there we go. Good. Here again we're looking at the scales: here we have this range and there we have that range. We're also interested in how the data is ordered: you see it's actually ordered in lines, and that's probably because they are discrete values. That's how it is. Good. Now I want to remind us a bit about the statistics we did; remember the box plot,
because what the box plot does is exactly summarize all the values: it has the median, it has Q1 and Q3, which are the 25th and 75th percentiles, it has the minimum and maximum values, and it has the outliers. So you basically get most of the information from that describe column in one chart. That's what I want us to do: look at X_train.plot.box, and because it's so big, we set figsize=(20, 5), I think that will be good, and rot=90, because we want to rotate the labels. What I want to see here is how the data is distributed right now: we see they're all flat down here and some are way up here, and that might be bad for our model, as we'll see later. But this is how the data is by default, and even though these are extreme up here, there might be others that have problems too. It's showing where the median is, how the data is spread, and where it sits on the scale.
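The five numbers a box plot draws can also be computed directly. A quick sketch with stand-in columns (the real frame is X_train from the weather data, so these names and values are illustrative):

```python
import pandas as pd

# Stand-in columns; "Rainfall" is heavily skewed, like in the weather data
df = pd.DataFrame({
    "Rainfall":    [0.0, 0.0, 0.0, 2.0, 40.0],
    "Humidity9am": [60.0, 70.0, 75.0, 80.0, 90.0],
})

# min, Q1, median, Q3, max -- exactly what the box plot summarizes
summary = df.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(summary)
```

The skewed rainfall column shows a median of zero with a far-away maximum, which is the flat-box-with-outliers shape discussed above.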
You obviously see this data is all over the place. Good. Our first one is the MinMaxScaler, to do some normalization, so let's try to do that together. We make a normalization here with a MinMaxScaler, and we're actually just going to call fit immediately. You can read about it in the documentation, where you have examples and the calculations, and you see how you do it: you create a MinMaxScaler, then you call fit on it (we just do it immediately), and then you can apply it to data with transform. Perfect. So, fit: and what do you want to fit on? You want to fit on the X_train data. You cannot fit it on the full data set; you only fit it on the training data, because otherwise you are implicitly leaking information. For instance, when you do normalization and there were some outliers in the test data, those should not be included in the normalization, because you shouldn't know about them. So X_train_norm = norm.transform(X_train), and we do the same for X_test_norm = norm.transform(X_test). Right, so... 'x_test is not defined'. Did I make a misspelling up there? Yes I did, why didn't you tell me? Okay, perfect, so here we go, now I should have it down here. Perfect.
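The fit-on-train-only pattern can be sketched with tiny stand-in arrays; the values here are made up to show why it matters:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the TRAINING data only, then transform both sets;
# fitting on everything would leak test-set information.
X_train = np.array([[1.0], [5.0], [9.0]])
X_test = np.array([[3.0], [11.0]])   # 11 lies outside the training range

norm = MinMaxScaler().fit(X_train)
X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)

print(X_train_norm.ravel())   # [0.  0.5 1. ]
print(X_test_norm.ravel())    # [0.25 1.25] -- test values can leave [0, 1]
```

Note the test value 1.25: a test-set outlier bigger than anything in training lands outside [0, 1], which is exactly the reason the scaler must not see the test data during fit.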
Thank you for correcting my spelling up there. One thing we can do is the same as before: take X_train_norm, but we actually need pd.DataFrame around it, because it's not a DataFrame anymore, and then call describe. Then we have the numbers, and what you're seeing now is that the minimum is zero every time and the maximum is 1 every time. Does that mean anything for the mean? No, the mean still varies, and you can see how the data is distributed differently. For instance with feature two here, it seems like most of the data is down at zero, and so on, so maybe there's a problem with the quality of this one; we don't know. Good. I also want to make a scatter plot, and it's the same story: we don't have it as a DataFrame anymore, because what the MinMaxScaler does is turn it into a NumPy array, and I transform that NumPy array back into a DataFrame. Good. For x we now use column numbers instead of names: rainfall was column two, and wind speed, let's count, zero one two three four five six, should be column number six; I don't have the names as strings in there anymore. So here we have it, and what I want you to see now is that we have scaled it between zero and one on both axes. That's basically the point, and as we saw, most of the data is down at zero; actually we can't
really see that, but that's apparently what it says. Perfect, so far so good. So far we don't know anything about the effect of these things, but let's also, yes, let's do that, I almost forgot: let's make this DataFrame again, and we can actually put column names on it, with columns=X_train.columns, to make it better. Then plot.box with figsize=(20, 5) and rot=90. What we are seeing now is that everything is between 0 and 1, all of them, but the median is different from one column to another, and you can see the outliers also differ. This one is the rainfall, where we have the outliers; yes, of course, that's what I didn't think about: a lot of days probably have zero rainfall, so it makes sense that most values are basically down at zero, and there might not be a problem with the quality after all, as I wondered earlier. But this is the picture I want you to see: everything is between zero and one. Perfect, but we might have a problem with the outliers, because now they have a higher impact. Are these outliers destroying the rainfall feature, for instance? I don't know, let's figure it out. This is where we use the Standard
Scaler, and we will try to do the same: scale = StandardScaler(), then fit on X_train, and then X_train_stand = scale.transform(X_train) and X_test_stand = scale.transform(X_test). Okay, here we go; we did the exact same as before and saved the values, and now we're going to repeat the exact same steps we did above. Let's first take a look at describe, changing the variable, obviously,
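The same pattern with StandardScaler, again with tiny stand-in arrays for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit on the training data only, then transform both sets
X_train = np.array([[2.0], [4.0], [6.0]])
X_test = np.array([[8.0]])

scale = StandardScaler().fit(X_train)
X_train_stand = scale.transform(X_train)
X_test_stand = scale.transform(X_test)

print(X_train_stand.mean())   # numerically zero -- the mean is centered
print(X_train_stand.std())    # numerically one -- unit standard deviation
print(X_test_stand.ravel())   # the test point in units of train std devs
```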
and we see something interesting now. First of all, the standard deviation here is almost exactly one every time; you see the scientific notation, and the mean value, you can tell from the scientific notation, is basically zero all the time. We could actually round to two decimals and see it way more cleanly: the standard deviation, if you round it, is one, and the mean is zero all the way. But the minimum and maximum are different; for instance this one with the rain, you see the maximum is enormous, while the mean of all of them is zero and the standard deviation is one. Perfect. Let's do the same with the scatter plot. What we see now is that in theory all the data should be centered around zero, within about one standard deviation, and the scales are different again; that's exactly the point. And let's do the box plot, this is going to be interesting. Perfect. As you see here, the median values (the middle values, not the mean values) are around zero, and you see it actually varies now; the outliers are still outliers. So this might be better or it might be worse, I
don't know; that's what we're going to try to figure out. To do that I just added some code down here, and I hope I've typed it correctly all the way through. What it does is take the model from last time, that machine learning thing, and compute an accuracy score. We keep the scores, and we try first the original training data, then the normalized and the standardized data, taking them pair by pair. Don't worry if the code doesn't make sense: it takes these two, then these two, then these two, fits a model, makes predictions, appends the accuracy score, and in the end builds a DataFrame where all the results are presented. So let's run it. You see the original data, where we did nothing, and that was the data that looked like this; the score of that is about 0.71. But if you take the normalized data, the data looking like this, the score is about 0.80, so this is actually doing way better, way better. And finally the standardized one does a bit better still, and that's when the data looks like this.
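The comparison loop described above can be sketched like this; synthetic data stands in for the weather features so the snippet runs on its own:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the weather features
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

norm = MinMaxScaler().fit(X_train)
stand = StandardScaler().fit(X_train)
variants = {
    "Original":     (X_train, X_test),
    "Normalized":   (norm.transform(X_train), norm.transform(X_test)),
    "Standardized": (stand.transform(X_train), stand.transform(X_test)),
}

# Fit, predict and score the same model on each variant of the data
rows = []
for name, (tr, te) in variants.items():
    model = SVC().fit(tr, y_train)
    rows.append({"data": name,
                 "accuracy": accuracy_score(y_test, model.predict(te))})

print(pd.DataFrame(rows))
```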
So you see it does have an impact on the accuracy score. I actually mislabeled it r2 in the code; it's an accuracy score, not an R squared, apologies for that. So it actually improved the accuracy score a lot. The accuracy score says what percentage of the predictions are correct: in the original case we predict 71 percent correctly, with the normalized data we predict whether it rains tomorrow 80 percent correctly, and with the standardized data 81 percent. That basically means that from all the features of a specific day, without anything else, we can actually predict with 81 percent accuracy whether it's going to rain tomorrow or not. That is actually crazy; that's pretty accurate for so little data. Well, the real weather forecasts have enormous amounts of data and are way more complex than this model; this model is insanely simple. So this is just to emphasize the insane improvement you can get just by normalizing and standardizing. Perfect. We're going to do something similar in the project, and you will enjoy that, because it's always nice to try it yourself. I think you should be ready for it, and if
any questions, let me know in the comments. Now you know about feature scaling, why it matters and how to do it, so see you in a moment. Awesome, are you ready for the project? Let's dive into it and see what it's all about. Of course it's about feature scaling and seeing the impact of it. So here we have it: feature scaling. Where are we? We are around step three, and we do it because of step five,
so the projects try to cover the full scope of the data science workflow. The project goal: a sports magazine is writing an article on soccer players, and they have a special interest in left-footed players. The question is whether playing style can predict if a player is left-footed. Concretely, the question they want to answer is: can you, from a feature set on players, predict if a player is left-footed, and if so, what features matter the most? So that's interesting. Okay, the first step here is to import libraries, and we have a lot of them: we have the MinMaxScaler and the StandardScaler, those are the new ones, and then we have permutation importance, to figure out which features matter the most.
The data set we are using is from the European Soccer Database. We have it already here, so you don't need to download anything; remember to assign it to a variable and take the head, that's basically it. Then you want to check the data types, just to check for numerics; you have info to get an idea of the data. Check for missing values: you have isnull().any() and isnull().sum() to see if there are any and how many there are. Drop missing values: we just skip them all here. We have already covered what to do about cleaning data, and it's an important skill to have, but for many projects we're just not interested in doing all that work, so we limit our focus here. Limit the data set size: because the data set is so big, we're just going to take a smaller sample; if you want to run it on the big one, feel free to do that. Analyze it: the class we want to predict is preferred foot, so that is our target; for now we keep the other numeric features as dependent features. Use info to see the numeric ones, use drop, and assign the independent features X and the target y. Good, we are only interested in the numerics. Then we split into train and test; we've done this many times, and I actually have the
Full line of code here i actually take 25 here perfect then we then we normalize the data and we create then transform the data like we did uh it should probably say train here apologize for that i also had some typos up there i saw them too so you’re not alone there uh
Create a StandardScaler and do the same: fit on the training set, then transform train and test. You see, I probably didn't copy-paste on this one, because I have the same typo here. Then we compare the sets; this is basically what we did inside the lesson. And then you want the accuracy score: you create the model, you fit the model, you predict, and you compute the accuracy score, and you probably want to make a loop to do that work for each version of the data.
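A minimal sketch of that loop, using synthetic data in place of the soccer features and an SVC as the classifier; the variable names here are my own, not necessarily the ones in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the numeric features and the preferred-foot target
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit the scalers on the training set only, then transform both sets
norm = MinMaxScaler().fit(X_train)
stand = StandardScaler().fit(X_train)

sets = {
    "original": (X_train, X_test),
    "normalized": (norm.transform(X_train), norm.transform(X_test)),
    "standardized": (stand.transform(X_train), stand.transform(X_test)),
}

scores = {}
for name, (Xtr, Xte) in sets.items():
    model = SVC()                 # create the model
    model.fit(Xtr, y_train)       # fit it
    y_pred = model.predict(Xte)   # predict
    scores[name] = accuracy_score(y_test, y_pred)

print(scores)
```

The exact numbers depend on the data; the point is the loop shape: one fit-predict-score pass per version of the data.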
We know that the features can predict whether a player is left-footed; now we need to find the most important features, using permutation importance to evaluate them. So we do these steps here, that's what we do, perfect. And then visualize the results; this is again something we've done before,
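A hedged sketch of the permutation-importance step on synthetic data; `permutation_importance` is scikit-learn's implementation of the idea:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = SVC().fit(X_train, y_train)

# Shuffle one feature at a time and measure how much the score drops:
# a big drop means the model relied heavily on that feature.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
print(result.importances_mean)  # one importance value per feature
```

Sorting `result.importances_mean` is what produces the bar chart discussed next.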
it's just making a chart, and if you didn't, you can go back to previous lessons to see that. Create the chart, and then it's about presenting the findings. Be creative here. Ideas: explore how the features are related to the target. Reflections: there might not be any actions, I don't know, that's up to you. Do you want a good relationship with this newspaper or sports magazine, probably an online one? These are questions you need to answer: can you measure the impact, can you help them in any other ways? Always look for opportunities to earn more and get more customers. Good, so this is basically it. What I advise you to do now is press stop and try it on your own. If you get stuck, don't worry, then play along and I'll show you how I would
solve this problem. See you in a moment. So, how are you doing, did you manage to do it? Let's jump to Jupyter Notebook and get started. The first step here is just importing the libraries, and I hope it works... it took a moment, and it did work, so that's all the libraries we need. Let's get started. In this one here we need to read the data, so let's try: `pd.read_parquet`, it goes really well, and we have the file, `files/soccer.parquet`, and take the head. That's what we usually do. We have already explored this data set, but now we're exploring it again, and what we are trying to predict now is which foot a player prefers, and we don't know that yet. So in this one here we just need to take
`info` on it to get a bit smarter about the data set. We have a lot of features here, a lot of them are numeric, and the one we want to predict is `preferred_foot`, which is an object, obviously; most of the others are numeric. So, checking for missing data: `isnull().any()` shows that we actually have missing data all over the place except the first two columns, and we knew that. `data.isnull().sum()` will sum it up for us, so here we have it: you actually have a lot of data missing. I mean, I don't know if it's a lot relative to how much data there is, but here we have it. Good. It would be a great idea to investigate missing data and outliers, but for this project we just ignore it: `data = data.dropna()`, boom, gone. Then we limit our data set: `data = data.iloc[:2000]`, the first two thousand rows,
and what you see with `len(data)` is that it has 2000 entries now. Analyze: now it's getting interesting, because what we want to do is focus on the features in the data set. Let's start with `data.info()`; we already did that, but the point is that the non-null count is 2000 on all of them, so that is a good start. What we want to do is make the X set, and in this case we could have used the other method we used before, but here I advise to just use `drop` and drop the columns that are not numeric: the date, `preferred_foot`, `attacking_work_rate`, and the last one, `defensive_work_rate`. There's also this trick we used before where you keep only the columns that are numeric, but we didn't do that here. And `preferred_foot` is our y. I thought we would get an error, "not found in axis"; no, it was the date column. Here we go, perfect. So now we have the independent and dependent variables by themselves. Good, and now we need to do a split into train and test, and I was actually so nice to keep the
line you need to type in here, so I'll just use that because I'm lazy. Good. Now we need to normalize the data, so we use the MinMaxScaler: create it and do a fit on X_train, remember all this, and then `X_train_norm = norm.transform(X_train)` and `X_test_norm = norm.transform(X_test)`. Good, here we go, that's basically what I want to do for that one. Then we want to standardize: create a StandardScaler and fit it on X_train. Again, remember we are only fitting on the training part of the set, because we don't have any knowledge about the test set, or we assume we don't when we create this model. It makes sense. Then `X_train_stand = stand.transform(X_train)` and `X_test_stand = stand.transform(X_test)`. Okay, so here we have it. Comparison of the sets: okay, now it becomes a bit more complex, because what did I do in the lecture? I
Didn’t put much hints here so if you got stuck here i can imagine you did but it’s also a great idea to look what i did in the lecture what i did was to create a score here and then i had the x trains let’s just call it x trains and
Then x train extra train norm extreme standard and then we have x test and we do the x test and we do the x test norm and we do the x test standard right that’s basically what we did here and then we did a for loop for x train set x
Test i called it something else in the lecture i don’t really remember sip uh x tray x extremes x t of bloody hell i did a mistake here right because now i need to it was supposed to be an s here tests it’s difficult so i just have to re-run this one here
To make sure i have a up-to-date value and then i need to reward this one and then i can do this one here right x tests so it’s sometimes you need to be very focused on having the correct names and can i just copy this one here almost i hope it’s correct
And then we have score append on this one here and i think that actually creates and let’s create a more data frame pd data frame and we do Score score index original normalized standard diced perfect here you see it and something went wrong in this one okay okay yeah i know what’s wrong here abs that’s because copy pasting is not always the right thing to do so what i did i was just being lazy copy pasting and
The truth is it’s always a better not always a bad idea but sometimes you need to remember what you’re doing and i didn’t hear so let’s read from this one and it didn’t like no no no so this was okay so the y train is okay but we need the this one here
You see how good it goes perfect here we go okay so score boom boom boom here we go so now we have it i apologize for the mistake here so the original data had a accuracy of zero point or seven almost 77 percent while the normalized data had a 90 almost 93 percent
and the standardized has 95 percent accuracy; that's pretty good. Now, finding the most important features: we know the model is actually pretty good, so let's make the model again, an SVC. We create it, `fit`, and we're fitting on X_train_stand, that was the best one, and y_train. So that's the model, and then we take the permutation importance.
Here we go, and it is pretty slow, I don't know why, sometimes the computer... there we go. It is pretty difficult to see which one is which, so let's try to focus our effort, and this is where I'm so annoying, I should not double-click on this one. So we sort the importances first, and then we make a DataFrame. Yeah, this is funny, I'm so funny. Then I take the DataFrame and `plot.bar()`, with a figsize of, I think we have to do something here, 15 by 15, because there's so much data. Here we go, so now we actually see which ones are the most important, and it's always a good idea to check the negative ones too, because they might be highly negatively correlated. We have these three here as the most important ones for predicting a lefty, strength and long passing among them. Amazing report. So this is actually our findings, and I want to leave this as it is, because how you want to present it, I don't know. I mean, this chart pretty much tells the story, and along with the accuracy it says that yes, it is possible, and it
Is extremely possible so just a word here about this normalization standardization it doesn’t really matter for the end customer they don’t give they don’t care about that what they care about is the accuracy that it is accurate and what features are there right so when you present something things to end
Customers right they don’t care about that they don’t only care about what matters to them they don’t care that you what you did to get the results right they don’t care good i hope you enjoyed this one and actually in the next one we’re gonna do feature selecting so
So far we ignored that we just use all the features basically all the time and that is actually often a bad idea you only want to have the features that matters so how can you select those features well find out in the next one and it’s gonna
Be exciting so if you like this one please hit like and subscribe and share this one share it with somebody that needs it well i would appreciate it it helps me grow and it helps me and motivates me to create more content so see you in the future bye bye Feature selection what is that well when you train your machine learning model you feed it with features your goal is to find the features that matter the most for your model to create a more accurate model so in this lesson here we’ll figure out what are the different
types of feature selection out there: how they work, how you use them, how you apply them, and we'll actually investigate the effect of feature selection. One of the great pitfalls of data science is to just use all the features in your machine learning model, but that's a mistake. What you want, as a data scientist, is to find the features that matter the most, to make the simplest, most accurate models, so you can have better findings. Feature selection is about improving the accuracy of your model and making it simpler; simpler models actually do better work than complex models. So in this lesson we'll figure out what feature selection is, what the main types of feature selection are, how you use them, and we'll make a project where we test out our new findings about feature selection and how it works in reality. Are you ready for this? I hope so, let's get started. If you're new to this series, this is a 15-part course on data science with Python. There's a link down in the description; click it and you can get all the resources, all the Jupyter Notebooks, all the things we've been working with, freely available for you to enjoy. So are you ready? Let's get started. Feature selection: again, we always focus on the bigger picture of the data science workflow, and with feature selection we are here in the middle. The reason we focus on it is to create better models, so we can create better insights; more accurate models make better insights for
the customer, and that's why we are data scientists: to create value for a customer. So what is feature selection? Feature selection is about selecting the attributes that have the greatest impact on the problem you're solving. Notice, it should be clear that all steps are interconnected; that's basically a comment on this process up here. It's not like you go step, step, step forward; no, sometimes you take step one, step two, back to step one, then step two, step three, back to step two, forward and back. Being a data scientist, it should be clear that this is continuous work with your client, where you jump back and forth all the time. Perfect. Feature selection: why do we even bother, why can't you just use all the features available in your data set? Well, feature selection actually gives you higher accuracy, a simpler model, and
reduced risk of overfitting. You can read more about it on Wikipedia; they actually have a great page about it, where they also mention shorter training times and so on. So it's a great idea to do feature selection. There are different techniques for feature selection, and we're going to dive into two of the three main types: there are filter methods, there are wrapper methods, and there are embedded methods. We're only going to touch the first two of them, because there's so much work done on feature selection that it is impossible to cover in one lesson; we will just give you the basic starting point, and when it comes to basics, it's often those that add the most value. The great thing about filter methods is that they are independent of the model you're using in your data science project. They are based on statistical scores, they're easy to understand, they do a good job of removing features that do not matter, and they have low computational requirements; some of the other methods actually require a lot of computation. I have added a list of some of them: there's the chi-square test, there's information gain, there's the correlation score, there's the correlation matrix with heat map, and we're going to touch upon some of them.
Then there are wrapper methods, which compare different subsets of features and run the model on them: they take a subset of the features, run the model, and repeat. So it's basically a search problem: you try to find the best subset. A couple of examples are listed here; I'm not going to go through them, and there's more on Wikipedia if you want to be inspired, but these are four of the main ones. Then there are embedded methods, which find the features that contribute the most to the accuracy of the model while it is created. We have a few of those here too; so far so good, I'm not going to go into those, and you have links here so you can read more about them. Finally, I just want to give you a few resources where you can read more about feature selection, because it is a wide subject, and it is way more important than you would give it credit for. Here's an introduction to feature selection, and a comprehensive guide on feature selection as well, so you can learn a bit more about it. And just to be clear: before you do feature selection, you should clean the data, you should divide it into training and test sets, and you should do feature scaling. These should be done before feature selection. We are not going to go into the details of those three in this lesson, because we already covered those aspects, but when you do your own work, you should do them. Only do feature selection on the training set, to avoid overfitting; you're not allowed to do it on the test set. So, we're going to use this data set here, which is a big customer satisfaction data set; it was a competition on Kaggle. I'm not going to go into that any more, but the
point was to figure out if customers are happy or not. Let's dive into this data set and see how it is. Immediately we see it has 370 features, so that's a big one, and as we know, the head only shows a summary, so we don't see all 370 features here. So far so good. Let's take the length of the data: we have 76,020 rows, and `len(data.columns)` confirms 370 columns. An interesting thing is that we have a target here, zero or one, so let's look at it with `value_counts`; we want to investigate how many of each there are. This is an interesting data set, because this is often difficult: we only have two classes, and about 96 percent of the rows are in one of them. That means, listen to this: if you have a model which always predicts zero, it will by default be 96 percent accurate. So a good model needs to do better than 96 percent, otherwise it's just predicting zero; if I make a model that predicts zero all the time, I have 96 percent accuracy. This is kind of a pitfall that's often there. Good. Let's also describe the data: now we have all the statistical measures, and we see that it has not been standardized, has not been normalized, and so on; you have a big
variety in the data. We're not going to look into all of it here, just some of it, to see how the data is. You see the data lies a bit awkwardly: many of the columns are zero, zero, zero all the way down, and then have some maximum values at the end. That's just to tell you that the data set is kind of awkward. So a first question you could ask is, and this is again often the case when you think about all these selection methods, you think it's going to be complex, but basically some of them are easy to understand. First of all: are any of the columns constant? What you're looking for are columns where all the rows are equal to the first row of the data; those columns are constant. You actually see that 34 of the 370 columns are constant. That gives you an indication that sometimes you have data sets where the values, I mean these features, do not add any value: these 34 features are always the same, always the same, so they add nothing to our analysis. So why not use scikit-learn to do that? They actually have `VarianceThreshold`, a feature selector that removes all
low-variance features; that's actually what we're looking for. We have an example here: they make a data set with three rows and four features, and you see the first feature and the last feature are constant, 0, 0, 0 and 3, 3, 3. When you do the variance threshold and a `fit_transform` on X, you only get the two varying features back. Notice that you have both `fit` and `fit_transform`: `fit` fits the model, and then you can do the `transform` later, instead of doing both at the same time. Then there's `get_support`, which gives you the mask or integer indices of the features selected, and `get_feature_names_out`, which gives the names of the selected features. So you have
tools to work with it. So let's try to use it. Good, so far so good. Let's do this VarianceThreshold: by default, the threshold is 0.0, which means it removes constant features. We will change that later,
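Here is the scikit-learn docs-style example of `VarianceThreshold` in runnable form:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Three samples, four features: feature 0 is constant (all 0)
# and feature 3 is constant (all 3).
X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])

selector = VarianceThreshold()        # default threshold=0.0 removes constants
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)                # (3, 2): only the two varying features
print(selector.get_support())         # [False  True  True False]
```

`get_feature_names_out()` works the same way when the input is a DataFrame with named columns.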
but right now we want all the constant features removed. This does the same thing as the check we wrote by hand, but it is more powerful, because later we can change the threshold. Good. So we can use `fit_transform` on our data, and then we have fitted it. Don't worry, it actually outputs the data here; we could also have done a `fit` and then a `transform`, that was what I said a moment ago, but `fit_transform` returns the array of data directly. Basically, to get to know our data a bit: take `data.columns` and then `get_support` on the selector, and we see we have 336 features left, the ones which are not constant; 336 plus 34 is 370, so it adds up. This selection has all the features that are non-constant, so far so good. We could also say `selector.get_feature_names_out()`, which gives the same thing: it contains the names of the features that are kept. So there are different ways to get it; that's just the point. So far so good, we're not going to do anything else with this one, because now we're going to use the same tool again, looking for quasi-constant features. What does that mean? It means a feature has the same value in the great majority of the observations. Good.
So basically we do the same thing: `VarianceThreshold(threshold=0.01)`, so that's one percent, and then let me just use `fit` here instead of `fit_transform`. So far so good. The length of `get_feature_names_out()` is 273 now, so we removed a lot of them: a lot of features vary in less than one percent of the rows. This is basically good; a lot of the features are almost constant.
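A small self-contained sketch of the quasi-constant filtering, with synthetic columns in place of the customer data:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 1000

varying = rng.normal(size=n)           # a normal, varying feature
quasi = np.zeros(n)
quasi[:5] = 1.0                        # same value in 99.5% of rows
constant = np.full(n, 7.0)             # exactly constant
X = np.column_stack([varying, quasi, constant])

# threshold=0.01 drops features whose variance is below 0.01:
# the quasi-constant column (variance ~0.005) and the constant one go away
selector = VarianceThreshold(threshold=0.01)
selector.fit(X)
print(selector.get_support())  # [ True False False]
```

For a 0/1 column with fraction p of ones the variance is p(1-p), so a 1 percent threshold removes columns where one value covers roughly 99 percent or more of the rows.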
Should I take more than one percent? I don't think so, because we have a data set where only about four percent of the rows have the positive target, so cutting harder is risky. What I want to do now is select all the columns that are about to be deleted, the quasi-constant features; those are the features we want to collect. So: `for col in data.columns if col not in selector.get_feature_names_out()`, then we add it to the list. So what did I do wrong here? I forgot to add the parentheses, perfect. The length of this list should be 97, and that should add up to
370, together with the 273 kept ones. It does, yeah, of course it does. Good, so far so good. The next thing we want to investigate is the correlation. Another simple concept to understand: if we have features which correlate a lot, they might be near identical, and then we don't need more than one of them. So one thing to do is take our data set and compute the correlation, but maybe we only want the columns we still use. So let's do that: select `get_feature_names_out()` from the data, this is the data set we need, and check its shape. Here we have all the features that matter for us, and after that we can do `train.corr()`. Just to make it more readable, we add a styling step, you'll see why in a moment, and this one takes a few seconds, because it does a lot of calculations, so just be patient. What we're looking for is
features that correlate a lot. For instance, this feature here correlates 0.8 with this other feature, so that might be a big correlation, and we have more like that: this one here, and this one here. So you see there are features that may not add value: if we have this one, does that one add value when the two are so highly correlated? You have all these questions, and that's the point, that's what we want to investigate: does it add value or not? We don't know. So how do you do that? This is an enormous matrix, as I said, and it's in multiple dimensions, so how can you do those calculations? First we want to create a
correlation matrix to figure this out: that is `train.corr()`. Basically, what this check does: for each feature, it takes that feature's column in the correlation matrix, looks at all the features up until the feature itself, and checks if any of them are greater than 0.8. For the first feature there are zero matches. Let's look at this one, one of the `imp_op_var` columns, which one is that... it is this one, and it says no match. But what if we take the next one? That's a great question, and it should be true, because now we have one which correlates above the threshold with a feature earlier in the matrix. While for the other one there was no match, for this one there is. That's what it does: it checks all the way up the column. So
how do you use this in your findings, how can you collect the correlated features? Let's try: `correlated_features`, and what we do is `for feature in correlation_matrix.columns if` and then the check we just walked through. And that should be it, perfect. Let's take the length of `correlated_features`: we actually get 149 features that are highly correlated. What it does: it flags a feature every time it is highly correlated with a feature already there.
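A sketch of that list comprehension on toy data; the exact check in the notebook may differ slightly, but the idea is to scan only the part of the correlation matrix above each feature, so that of a correlated pair only the later one is flagged:

```python
import numpy as np
import pandas as pd

# Toy data: "b" is almost a copy of "a"; "c" is independent noise.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),  # highly correlated with "a"
    "c": rng.normal(size=200),
})

corr = df.corr().abs()

# For each column, look only at the features BEFORE it in the matrix,
# so "a" is kept and its near-copy "b" is the one flagged.
correlated_features = [
    col for i, col in enumerate(corr.columns)
    if (corr.iloc[:i, i] > 0.8).any()
]
print(correlated_features)  # ['b']
```

Dropping `correlated_features` afterwards keeps exactly one representative of each correlated group.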
So what have we done so far? We have removed a lot of features, quasi-constant features and so on. This is the first step in our journey, the simple part: we removed 149 correlated features and 97 quasi-constant ones. Okay, perfect. In the next step we're going to use the next method, a wrapper method, so let's look at that: this is the SequentialFeatureSelector.
It takes an estimator, a classifier or a regressor; in our case we need a classifier. If you scroll a bit down there is often an example of how to use it... ah, never mind, that's why we wrote it here. Oh yeah, it's a different library, I forgot about that; this one is from mlxtend, and if you don't have it, you need to uncomment and execute this cell to install it, because it's not there by default. I have it already, so I don't want to install it again. Good. Now we drop all the quasi-constant features, I didn't call them that, quasi-constant, and the correlated features, along the columns axis. So now I have created my feature set X and my target y. The thing is, `len(X.columns)` is only 123; we started with 370, so we have reduced the number. And why do we need to reduce the number? Because this step is quite expensive, it takes quite some time. You see, we created a small training set, and actually I'm going to make it even smaller, because I know this is going to take a long time. What I do is increase the test size to 90 percent, so my training set is only 10 percent of the data. I know this is not going to be very accurate, but it's just for demonstration; you should do it differently, because you have more time than I have. Then we use our SequentialFeatureSelector: we make `sfs = SFS(...)`, using our SVC model. We need to look at the documentation to understand this one better. How many features do we want?
Let's go for two features; again, the more features you ask for, the longer it will take, and I'll just make a simple example here. I'll take `verbose=2` because I want some information on my screen; verbose zero gives nothing. And we need cv; what is cv? An integer, or an iterable yielding train/test splits. Okay, so this is cross-validation; the default is five, so we put it a bit lower to make it faster. And `n_jobs`: how many parallel jobs can my computer do? I think we can have eight. So this is basically the model, and now we need to fit it. I called it fsf, it should be sfs; so `sfs.fit` on our X_train, the training data we made, and y_train. Again, this is going to take a while; I highly optimized it, so I hope it's pretty fast now. Good, it's already done. The problem is that if I increase the training size, this will be insanely slow; right now our training data set is insanely small, so it's really not a good estimator. These have scores of about 96 percent. Remember what I said in the beginning: if I just predict zero all the time, I get 96 percent accuracy, and that's basically what this reminds me of, it's just 96.8. So our model might essentially be saying "just predict zero" and still look pretty good. You need more features, and maybe more data, but it's going to take a long time to run, and I didn't want to do that in this video. So please feel free to increase the number of features and the amount of data, meaning decreasing the test size; it will run way longer than it did here. I just wanted to make it
fast, and I hope you understand the point: this is not a good result, but this is the process of feature selection. In the project we're going to do the same. It's about a real estate dealer who wants to find out which features matter the most when selling a house. You have a data set with a lot of features, and you need to figure out which ones matter the most, because a real estate dealer needs to know exactly, when they need to sell a house, what the least effort is they can make to improve the sale. They are really interested in this, so this is interesting. This is the data set they have, and they need to figure that out. Okay, this is gonna be crazy, no, not crazy, amazing. See you in a moment.
Are you there? I hope so, so let's get started. In this project we are going to look at the parameters, or features, with the highest impact on house prices. Again, always remember why we are doing our research: it is to give higher value to our clients or customers. In this one we are looking at feature selection, and we are trying to find the features, or parameters as they call them, with the highest impact on the result. Our customer has one goal in mind: find out what parameters matter the most. They focus on the 10 parameters, or features, that matter the most, to present in the findings. Why does it add value to them? Because when they get a real estate property, or whatever they want to sell, they need to know what matters, what they should improve on to get the highest sale price. Good. In the first step, as always, we import the libraries, and if you haven't installed this beast, mlxtend, you should do it like I did in the lesson. Read the data: we have the house sales data in parquet format, and the target is the sale price. Then we need the number of rows and columns; you can use `shape` for that. Then we need to check the data types.
It’s always a good idea then we need to check for no we need to check for null missing values you can use info should we remove any we can remove features columns like this here analyze then you use this threshold here with one percent here remember that remember
The variance threshold with it and then you can get the features which are not quasi constant given by this one here perfect so now you have them then we look at correlated and inspect them and you can do this piece here like we did here that was one way to do it
Then we prepare the training set: we have the quasi-constant features and the correlated features, and we drop those, and we assign SalePrice as the target; remember, we also did that before. Then, best features with linear regression: here we use the 10 best features, so we take k_features=10 and verbose=2 and run it, like we did in the lecture, and you get the best feature indices from it. Test the result: create a normal linear regression model, run it on the full dataset, and calculate the R² score. Do the same with only the 10 best features. Did the score surprise you? Notice that the full-dataset test score is far from as good as the SFS test score. With the 10 highest correlated features: find the 10 highest correlated features, then calculate the R² score for them. Does that score surprise you? I don't know, does it? Present the findings: use the analysis from step three to figure out how to present your findings. Try to think about how a real estate dealer can use these findings, and whether there is anything we can do to help the dealer use these insights. Perfect. I think this is about as ready as you can get for this project. Try to follow along, try to think about the questions I'm asking you, and let's see how it goes. If you get stuck, hit play and I'll show you. So right now, stop the video and try it on your own, and if you get stuck, hit play and see how I would do it. See you in a moment, bye-bye. Did you manage? I'm always curious, so let me know in the comments how it went. Good, so let's jump to step one and import our things here.
And again, if you got an error here, remember you need to install the package we use for reading Parquet. Good, and let's read the data: data = pd.read_parquet('files/house_sales.parquet'), and then show data. Perfect: here we have the data, and it looks pretty fine, and the target is SalePrice over here. Perfect. Inspect the data: take data.shape, just to check that we have some data. We have 56 columns, one of them of course the target, and 1460 observations. Perfect.
Now you need to tell whether some numeric columns are not represented as numeric, so let's do that: data.dtypes. Here we go, and we see all of them are actually numeric; none of them is of type object. That is perfect: all our features are easy to work with. We do see some values over here that are not available (NaN). Then check for missing values: we can do that with data.info(), and what we see there is how many entries are non-null.
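As a minimal sketch of this inspection step, here is a tiny hypothetical frame standing in for the house-sales data (the values are invented; only PoolQC is a real column name from the lesson):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the house-sales data (values are made up).
data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "YearBuilt": [2003, 1976, 2001, 1915],
    "PoolQC": [np.nan, np.nan, np.nan, np.nan],  # almost entirely missing
    "SalePrice": [208500, 181500, 223500, 140000],
})

print(data.shape)           # (number of rows, number of columns)
print(data.dtypes)          # confirm every column is numeric
print(data.isnull().sum())  # missing values per column; data.info() shows the same
```

On the real dataset, data.info() prints the non-null count per column, which is how the mostly-empty columns stand out.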
This column has some missing values, and they are pretty difficult to notice here; here's one with a lot, and this one, PoolQC, is basically non-existent. So the suggestion is to remove some columns, and for the fun of it, let's do it: data = data.drop('PoolQC', axis=1). Perfect. It also says you should use fillna for the ones you are not removing, so instead of dropping the rest we use fillna. And it says filling with -1 is not a good approach, but for our purpose here it will be fine: the point is just not to have any NaN, and as you saw up in the dataset, we had a lot of NaN. Okay, so we removed the PoolQC column, but as we could see in the info() call, there are others that have missing values. Why -1? It's not a good approach, but it's just a number which is probably not otherwise represented; we didn't check it, so it might be fine, but again, this is not going to be perfect. It's just to get you started and keep the focus on what really matters here. Good, perfect. Now let's see if there are any quasi-constant features. Remember VarianceThreshold? We need the threshold to be 0.01.
Then we just create the selector and fit it on our data. Perfect. Now we can get the feature names it kept, so let's check how many there are. How many did we start with? I don't remember, actually; 56, and how many are kept? 53. So there should be some quasi-constant ones, and we need to invert what we have. Let's call them quasi_features and write: [col for col in data.columns if col not in sel.get_feature_names_out()]. In this case there should be... three? Two? I don't get it; wasn't it 56? Oh, we dropped one, I forgot: we dropped PoolQC, so we started from 55. My memory is not good. Okay, so there are only two quasi-constant features.
Next, find correlations: make a correlation matrix, corr_matrix = data.corr(). Let's do that first, and then, like we had before, what should we call it... correlated_features, and we do the same kind of comprehension: [feature for feature in corr_matrix.columns if it is too strongly correlated with another column]. Isn't that the one we want? I think so; let's just see how it looks. correlated_features actually has a few: that's interesting, one, two, three, four, five, six. It has six, actually.
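The lecture's exact comprehension isn't spelled out, so this is my reconstruction of one common pattern: take the absolute correlation matrix, keep only the upper triangle so each pair is counted once, and flag any column that correlates above 0.8 with an earlier one (the 0.8 cutoff and the data are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
data = pd.DataFrame({
    "A": base,
    "B": base * 2 + rng.normal(scale=0.01, size=200),  # nearly a copy of A
    "C": rng.normal(size=200),                          # independent feature
})

corr_matrix = data.corr().abs()
# Keep only the upper triangle so each pair of columns is considered once,
# then flag any column that correlates above 0.8 with an earlier column.
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
correlated_features = [col for col in upper.columns if (upper[col] > 0.8).any()]
print(correlated_features)
```

Only one feature of each highly correlated pair gets flagged, which is the point: you drop the duplicate information, not both columns.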
Awesome, perfect. Good, so let's see how this fits together. Now we need X: we take data and drop all of the quasi-constant features, the correlated features, and the target, and y = data['SalePrice']. Perfect, so now we have divided them: we have removed everything that is a quasi-constant feature or a correlated feature. Then we split it all: X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0), keeping the default test size. Perfect. And now we use a linear regression model with SFS, and as the notebook says, we give it the linear regression model, k_features=10,
and verbose=2. I don't know how fast this one is... oh yes, of course, we need to fit the SFS with X_train and y_train. Good. This is way faster than the run from the lecture, and we see some scores: the score increases as it goes along, starting around 60-something percent. We're selecting 10 features here, so it should end up with the 10 features it finds the highest score with. The notebook says you can find the feature indices in the attribute... how is it called? I forgot again: k_feature_idx_. These are the indices of the 10 features that matter the most. Good. What we need to do now is test the results, so create a linear regression model and run it on the full
data and calculate the R² score. Let's do that: lin = LinearRegression(), then lin.fit(X_train, y_train), then y_pred = lin.predict(X_test), and then the R² score. I always forget which argument comes first: it's the true values first, r2_score(y_test, y_pred). Perfect, so it's about 0.71; that's a score, not a percent. Then we test with the 10 best features from SFS, and afterwards with the 10 highest correlated features, and calculate the R² scores for those too.
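Since the argument order is easy to forget, a tiny reminder: r2_score takes the ground truth first, then the predictions (these numbers are just an illustration):

```python
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]   # ground truth comes first
y_pred = [2.5, 0.0, 2.0, 8.0]    # model predictions come second

print(r2_score(y_true, y_pred))  # R² of the predictions against the truth
```

Swapping the arguments gives a different number, because R² normalizes by the variance of its first argument.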
Oh yes, sorry, I forgot: first, the best 10 features. So what do we do here? A linear regression model again, and what we need is the column names for our selection: columns = list(X_train.columns[list(sfs.k_feature_idx_)]). That is one way to get them; those should be our column names. Let me just show them: here we have the columns, and there should be ten of them. One, two, three, four, five, six, seven, eight, nine, ten. Perfect. Then lin.fit(X_train[columns], y_train), then y_pred = lin.predict(X_test[columns]), and then r2_score(y_test, y_pred), and you see we actually get an increase in the score. It's not extreme, but it is better. The point you need to realize is that up here we used all the columns, the full dataset (well, the one where we had already removed some features), and down here we removed many more features, so this model has far fewer features and it's more accurate.
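The lesson runs mlxtend's SequentialFeatureSelector (SFS) with k_features=10 and verbose=2 on the house data. Since that data isn't reproduced here, this is a hedged sketch of the same compare-full-versus-selected idea on synthetic data, using scikit-learn's SequentialFeatureSelector (available since scikit-learn 0.24) in place of mlxtend's; all shapes and counts are invented:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: 20 features, only 5 of which carry signal, plus noise.
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Forward selection of 5 features (mlxtend's SFS does the same job,
# exposing the result as k_feature_idx_ instead of get_support).
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5)
sfs.fit(X_train, y_train)
idx = sfs.get_support(indices=True)  # indices of the selected features

# Score a model on all features versus only the selected ones.
full = LinearRegression().fit(X_train, y_train)
score_full = r2_score(y_test, full.predict(X_test))

sel = LinearRegression().fit(X_train[:, idx], y_train)
score_sel = r2_score(y_test, sel.predict(X_test[:, idx]))
print(score_full, score_sel)
```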
Isn't that amazing? I think so. Good, and let's actually look into the correlation part. What is it that we get here? Remember the correlation matrix: we get all the columns from it, so if you want the columns of the 10 highest correlated features, you get the names there, and if you take .index on it, you have the index of the most correlated columns. So if you take the columns of the ten highest, you have the column names. And now we have a bit of a problem, because what we need is all the data once again: up in our training data we removed a lot of the features, including the correlated features. So we need to do it one more time: we make X_train, X_test, y_train, y_test again, and it will be clear why right now. Here we take data and keep all of it except the target, data.drop('SalePrice', axis=1), y is still the same, and we set random_state=0 just to have something fixed. Then we take a LinearRegression, and we have the columns, then lin.fit(X_train[columns], y_train), which I think we already did, then y_pred = lin.predict(X_test), and then r2_score(y_test, y_pred)... hmm, what happened there, what did I do wrong? "LinearRegression is expecting 10 features as input." Oh yes, I forgot: the predict call obviously also needs the column selection, lin.predict(X_test[columns]). Perfect. So this is funny: we take the 10
highest correlated features, and we get a lower score than we got up there with all the features. How can that be? It's because having features that are highly correlated does not add much value, and that's the point I want to emphasize here. We get a lower score than the original score, where we had removed the highly correlated and the quasi-constant features, and the SFS model was obviously the best, as expected. Or maybe you didn't expect it, because it has fewer features than the others. So this is interesting. What can you say now; what are the features that matter most? There are some pitfalls here, and I'll help you along, because we did remove features which are highly correlated, and that shapes the insights.
So if we look at the correlated features: the features that matter the most, do we have the names of those? We have them here, and we also have the names of the features that were correlated and removed. Why is this interesting? Because we actually don't know which of the removed features are correlated with these 10 selected features. We have the first-floor feature here, and... ah, okay, this happened because I made a new training set: my mistake, I should have given them different names. I reassigned X_train, so what is called one thing up here is different down here; we had dropped some columns, so I cannot show you exactly what I wanted to show you. But the point is this: some of the removed correlated features might be highly correlated with some of these 10 features, and that matters, because what does the customer need? The customer may need to say: okay, even though those two features might be highly correlated, what does that mean for me? I know these two are maybe highly correlated, but I need that knowledge, because maybe I can influence one of them more than the other. I hope that makes sense: you need to look out for this. And I hope it's also clear that this last part went wrong because X_train had been redefined, so the names down here are wrong; I got confused because I saw features that I thought should have been removed, and they were not. Okay, I hope you enjoyed this one, and I hope it helps you understand why feature selection is such an important aspect of your work. In the next lesson we're going to look at model selection,
and that's a big one: how do you figure out which model to use when you do machine learning, when you do your data science project? We'll dive into that, and it's going to be interesting. If you liked this one, please hit subscribe, share it, hit like and all that kind of stuff, because the support from you helps me build this channel, and the bigger this channel is, the more content I will create for it. I love the feedback I'm getting from you; you YouTube guys are just amazing, and having you out there supporting me is mind-blowing. I'm just happy about it. So I hope you enjoyed it, and see you in the next one, where we're going to dive into model selection, an important aspect. See you there. So now you get your first data science
project, and the first question that comes to your mind is probably: what model should I use? What machine learning model? There are so many models out there, so it can be kind of confusing. In this lesson we're going to dive into some techniques for narrowing down the number of models you need to evaluate, and then how to evaluate which models are appropriate for your problem. And here is something funny that surprises many people: sometimes the problem seems to be of one type, for instance a regression problem, but the value your end client or customer needs is more of a classification problem. Imagine this: what we're going to look at is house pricing, and maybe your client is not so interested in predicting the specific sale prices, but more interested in predicting whether a house is a high-end, mid-range, or low-end house. So we're going to dive into how you can convert that kind of problem into a classification problem instead. Whoa, that sounds interesting, and we're going to do it in the project as well. So at
the end of this lesson you will know a lot of things: how to narrow down the type of models you want to use, how to evaluate the models among a subset of them and know which one to take, and how to convert, in this case, a regression problem into a classification problem and still add even more value to your end client or customer. Whoa, that's amazing. So are you ready? I hope so. If you're new to this series, this is a 15-part course on data science with Python; check down there, there's a link where you can download all the Jupyter notebooks and all the resources we're using, so you can follow along, do the exact same things we are doing, and do the project together with us. It's going to be amazing. Are you ready? I hope so; let's jump to our Jupyter notebook and get started. Model selection: again, we always focus on the data science workflow, and model selection sits here in the middle of it, but again, we do it in order to create
higher-value insights for our clients or customers. In the end, why are you doing this research? It is to add value to the end client or customer. So again: the better the model, the higher the value they get, and the more likely they are to use your material. We always need to keep the bigger picture in focus. Just a word on what model selection is; I mentioned it already, but let's be clear: it's about which model you should use in your data science project.
You have a lot of candidate machine learning models; which one should you choose? The first question is: there are so many, so how do you narrow it down? Well, let's look at it. In general there are three kinds of problems: classification, clustering, and regression. Classification, what is that? You want to predict a label on data with predefined classes, and this is supervised machine learning; remember, we talked about supervised and unsupervised learning. Clustering is more about when you don't have any labels and you cluster your observations into groups which have similarities. And then there's regression, which is where you want to predict a specific value. In this lesson we're actually going to take a regression problem and turn it into a classification problem, because that might add more value to your end client or customer. So first of all, how do you get started? scikit-learn has a great cheat sheet (I could barely pronounce that today), and it's actually a great resource for you to use; I advise you to use it often. So let's think about it: you have a problem, and you check, do you have enough samples? If no, get more data. It says 50 samples; I would actually set the number way higher than 50, but let's just follow along.
What is it that you want to predict? Do you want to predict a category? If yes you go one way, if no you go the other. Let's assume it's a category: do you have labeled data already? Then it's a classification problem. So let's focus on the classification branch: how many samples do you have? If not over 100,000, maybe you go this way, and if you do have more than 100,000 samples you go that way; is it text data, and so on. So the sheet is guiding you through the process of selecting models, and you see the models out on the edges, the leaf nodes; there you have the models you might use. And just a word again: each model, for instance this one, often has different modes of use, and you can tweak them, and that's also part of the process. A great thing with this cheat sheet is that the entries are actually links, so you can click on them and see a bit more about how that kind of classifier works. You can see there are different kernels (that was the word I was looking for, kernels): there's a linear kernel, there's an RBF kernel, there's a polynomial kernel, and so on, so you can see how each does its work. Perfect, awesome. So that is the first step you need to do: figuring out what kind of problem you are in, and you can use this cheat sheet to narrow
down the kind of model you want. Now I want to talk a bit about a common misconception, or at least something that people get wrong in the first place. The question is: what is the best model? You have to realize that all models have some predictive error; that means they are not 100% accurate, they all fail. If you have 100% accuracy, either you have an extremely trivial dataset or you have overfitted it. If you want to learn more about overfitting and so forth, I have the machine learning course; I advise you to take it. It's a 10-hour video course with the same principles as here, with notebooks and all that kind of stuff, so check it out. Good. So we're not trying to find the best model; we should actually seek a model that is good enough, because we're not going to find the perfect model. Write that down: we are looking for a model that is good enough, not the perfect model. Good enough. Okay, so
model selection: now you've narrowed it down; you went through the first step, what kind of problem is it, maybe using the cheat sheet, but you haven't narrowed down the model yet. For model selection we will mention two techniques. The first is probabilistic measures, that is, scoring by performance and complexity of the model. The simpler the model, the better: complex models are often overfitting to specific datasets, so always keep it simple. If your model becomes complex, I cannot say it's wrong, but it counts as a negative in the scoring. So you have the performance, and you weigh that against the complexity of the model: a simple model with a high score is good; a more complex model scoring the same is worse. Okay.
And another great tool, because often you look at one single train/test split of your dataset, is to use subsets: you iterate over resamples, so you have different training and test sets, you rerun the scoring of your algorithm, and you take the mean value of the repeated runs. So you might have 10 different ways of making training and test sets; you score them all, and you take the mean of that as your score. Because sometimes you can make a training set and a test set that scores insanely high or insanely low, and then you think, oh, this is the best, but then you retry with a different training set and a different test set and it doesn't score well. The point is: maybe the model is not that good; maybe you need to tweak it a bit, maybe you need to change the model. So again, these are two techniques, and now let's look at a dataset and play a bit around with it. The goal here is not to dive too deeply into these; we will score a few different models and compare them, so you can see how that's done. I think in general the resampling technique is a great one, and we're not going to walk through it, because it should be obvious what to do: you just rerun train_test_split, do it with different sets, and take the average of the scores. That's a really great one. The first technique is more like: okay, we look at the complexity, we look at the performance, and you can measure the performance by a resampling method. Okay.
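The resampling idea can be sketched like this (synthetic regression data; ten re-splits is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Re-split and re-score several times; report the mean instead of
# trusting a single (possibly lucky or unlucky) train/test split.
scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    scores.append(r2_score(y_test, model.predict(X_test)))

print(np.mean(scores), np.std(scores))
```

The spread of the scores is as informative as the mean: a model whose score swings wildly between splits is telling you something.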
Perfect, so let's take this problem and investigate it. We have some house sale prices with a lot of features; I actually think we saw this dataset before, and it has a SalePrice column. Let's look at this dataset and get a bit smarter about it. What do we do? data.describe(): we learned about statistics, so we look a bit at these numbers. In the first place we are maybe most interested in the sale price, and we see a mean sale price around 180,000 and an annoying standard deviation of 80,000; that says something about how much spread there is. But you also get an idea that most of the data lies closer to the low end than the high end: you see a big step between the upper quartile and the maximum. So a great idea is to visualize it. Let's do that: data['SalePrice'].plot.hist() with some bins. Good. That shows exactly what I said: the highest price is 700-something thousand, but most of the data is down at the low end, so the distribution is very skewed to one side, and that's important to understand.
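A quick numeric way to see this kind of skew, using made-up log-normal "prices" (the parameters are arbitrary, chosen only to land near house-price magnitudes): the long tail of expensive houses pulls the mean above the median.

```python
import numpy as np

rng = np.random.default_rng(42)
# Log-normal draws are right-skewed, like the sale prices: most values sit
# at the low end, with a long tail of large ones.
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1460)

print(np.mean(prices), np.median(prices))  # the mean sits above the median
```

On the real data you can check the same thing by comparing the mean from describe() with the 50% quantile.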
What I want to do now is convert the prices into categories, and there are two great tools for that. Because what we want is this: our customer doesn't really care about the specific prices; it's more about how we can make some categories. For simplicity we are only going to play with three categories: low, middle, high. We'll use two tools to investigate it, and the conclusion for this dataset doesn't necessarily generalize; that's why we have both tools. So what does cut do? It bins values into discrete intervals: you can give the number of bins, and there are some examples in the docs, but it's actually easier just to show it, so let's do that. And then we have qcut. The difference between the two is this: cut takes the full interval from the minimum to the maximum and makes three identically sized intervals, while qcut does it differently: it also makes three intervals, but it sizes the intervals by the number of observations going into each,
so it tries to fit it so that there's the same number of observations in each category. Okay, let's try it. How do we do this? We add a new column and take pd.cut; I don't use it too often, so let me remember. This is the data we want, data['SalePrice'] (it was called SalePrice), and then we have the number of bins, bins=3. Do we need more? Let's just put some labels: labels=[1, 2, 3]. So now we have the categories, and I actually want them in data['Target']. And what do we need? value_counts() divided by the length of the data. So what are we seeing here? What I'm calculating on the last line is how much percentage is in each category: in the first category we have about 89% of the values, then 10% in the next one, and half a percent in the last one. Why does that happen? Basically, cut makes the categories from three equal-sized intervals over the price range: it makes a group from here to here, one a bit smaller, and so on, and you see there's almost no data in the top interval, less in the middle, and almost all of it, 89% let's say, in the bottom one. So if you're doing your research and presenting this to your end customer or client, they'll probably say: okay, you made three categories, and how much value does that add? Almost 90% is in one category; that doesn't add much value, because how much prediction are you really doing? You can make a model which predicts with 90% accuracy by just putting everything in the first category all the time. Good luck with that. Okay. So what I want to do now is the same thing, and it goes really well with qcut, pd.qcut, and I think it works much the same way;
you see it has q (the number of quantiles) instead of bins. So data['SalePrice'], then q=3, and does it have labels too? Yes, so let's just make the same labels. The labels are not really good here, but you could have 'low', 'middle', and so on. And we calculate the same thing, and here you see it: about 33% in each of them.
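The cut-versus-qcut difference can be shown on a small skewed sample (synthetic, not the real sale prices): equal-width bins pile almost everything into the first category, while quantile bins hold roughly a third each.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Right-skewed values standing in for SalePrice.
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

equal_width = pd.cut(prices, bins=3, labels=[1, 2, 3])   # equal-width intervals
equal_size = pd.qcut(prices, q=3, labels=[1, 2, 3])      # equal-count intervals

print(equal_width.value_counts(normalize=True).sort_index())
print(equal_size.value_counts(normalize=True).sort_index())
```

value_counts(normalize=True) is a shortcut for the value_counts-divided-by-length calculation used in the lesson.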
So this is a much better way to categorize them, and in some cases qcut makes more sense than cut; in other cases cut might be the one you're going for. Perfect, so far so good. So what do I want to do now? Actually, I want
to create a model which does this categorization of the data. So we have the data, and again we're not going into data quality or data cleaning or anything like that, because that's covered already, and I just want to focus on the things that matter in this lecture. So X = data.drop(['SalePrice', 'Target'], axis=1), ignoring everything else, and y = data['Target']. And I forgot to import things; I sometimes do that. Okay, so let's try to create a model. Let's take the linear one here, svc.fit(X, y), y_pred = svc.predict(X)... but actually, don't we want to make a proper test set? We do. Okay, so let's do data = data.fillna(-1), because I think there's some missing data in this one, and then X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42). So what does it say? It cannot handle categorical values; can we fillna(-1) here? Sorry, there were these categorical columns it couldn't handle, and I have them removed in this version, so I can fill the rest. There are some missing values that we didn't actually investigate, and I apologize for not doing that, but we just filled them with -1. We are not looking for the highest quality of anything, because the purpose of this one
here is just to investigate different models and do a categorical prediction instead. Okay, so that was my point. We train, we test, and then we do accuracy_score(y_test, y_pred). Good, so here we actually get a pretty good score; I don't know if it's good, that's what we're going to investigate, but that's the default score out of the box. Now I want to do the same for different models, just to compare them. So let's do that: KNeighborsClassifier, then we fit with X_train and y_train, y_pred = neigh.predict(X_test), and accuracy_score(y_test, y_pred). You see these two actually get pretty similar scores; not much difference. And then
I actually also imported SVC, so let's try that as well. There are different kernels for this one: svc = SVC(kernel=...), so let's look at the kernels we have. We have linear and polynomial, so let's try this one; you can try more if you want, and maybe we'll try a bit more afterwards. svc.fit(X_train, y_train), y_pred = svc.predict(X_test), accuracy_score(y_test, y_pred), and here you see you get a lower score on this one. Let's try some of the others: we're trying different kernels, so they're doing it differently. Let's do sigmoid, and poly afterwards, and you see the sigmoid one is extremely low. This is just playing a bit around with the differences. The poly one is about 52%. And I think there are more parameters you can play around with: there's degree, for instance, the polynomial degree. If you set degree=1, that's like a linear kernel, so it's actually better; with degree=2 you see it changes, and with 4 instead of 3 you see you can tweak it around
To get different values okay so this is actually to get you started with how you can play around with the same same thing and you can actually make a function that takes the model because you can see you’re doing the exact steps same steps here so you can make a
Function that takes a model and you use that model to fit there so so you don’t have to write and copy and write the same code again and again and again okay so perfect yeah so now you kind of get an idea of how you can convert a problem because what is what
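The helper function suggested here can be sketched as below: one function that fits any model and returns its accuracy, so the fit/predict/score code is not repeated for every model we try. This is a sketch on synthetic data (make_classification), not the course's house dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=400, n_classes=3, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

def score_model(model):
    """Fit the given model on the training set and return its test accuracy."""
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# Now comparing models is one line each instead of four
results = {
    "knn": score_model(KNeighborsClassifier()),
    "svc-linear": score_model(SVC(kernel="linear")),
    "svc-poly-1": score_model(SVC(kernel="poly", degree=1)),
    "svc-sigmoid": score_model(SVC(kernel="sigmoid")),
}
print(results)
```

Any other scikit-learn classifier can be dropped into the same dictionary without touching score_model.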
What are we actually doing down here? I forgot to say that: what are we doing with the accuracy score? We're checking how many of the categories are predicted correctly. In the default case up here we actually had 73 percent correct, so with 73 percent accuracy we can predict which of the three price categories a house is in. We're not trying to predict the exact price, but rather which range it is in. Is that good? I would say I wouldn't be satisfied with that, but we didn't do any work on data quality and so on; we filled the missing values with minus one, so we did a lot of sloppy things. It was more about showing you how this piece works: model selection, how you can evaluate a model, and how to convert a problem from one type to another. Because maybe you don't actually care about the exact sale price, and three categories is maybe also on the small side, but it was just to emphasize the point. In the project we are going to dive further into this and investigate it more, so don't skip out on that one; it'll be
quite interesting. See you there in a moment, bye-bye. In this project we are going to use the knowledge we have from before and combine it with what we just learned a moment ago. Okay, so let's dive into the project. The project title is: parameters with highest impact on house price class. The real estate dealer from the last assignment calls back and clarifies his objective: he is not so interested in finding what matters most for house prices, but rather in which range a house is in. There are three classes: the 33 percent cheapest, the mid-range, and the expensive houses. He needs to find which 10 parameters matter most to determine that. This is interesting, right? So how do you do that? Well, we already learned about finding the 10 most important parameters in the last lesson; if you didn't, you should go back one lesson and check it
out, and in this one we learned how to make these classes, so can we use the same knowledge again? Let's try. We need to import this library; if you don't have it installed, you should execute this line in a cell. Then we need to read the data, which we have in a parquet file, and find the shape, just to check that the data is there. Then we check the data types with dtypes, just to ensure they are numeric, and we check for null values. You can remove feature columns, and we did that in our lesson with the quasi-constant features, remember, so we can use that again and get the selected feature names out. Then we need the correlated features; we have some code for how to do that, and we can use a list comprehension again to make a list, just like we did before. If this is all unfamiliar to you, it's probably because you didn't take the previous lesson, so please do that. Then we prepare the training and test set: we want to make the three categories, so we use the code from the lesson, and then we assign the features, where we remove the sale price and the target like we did in the lesson, and we also remove the quasi-constant features and the correlated features. Then we assign y to be the target and use train_test_split as we know it. Okay, and then we need to find the 10 best features for the KNeighborsClassifier model. You can play around with other
models as well, but this is the one we're using here because it seemed to be doing decent work. We need the 10 features, and we get the index from this step. Then we want to explore the features a bit: the features can be accessed as it says above, and you get the feature names from the selector. Try to list them according to the correlation score; this is slightly more advanced Python, so you basically copy this snippet down, execute it, and look at the result. Does it surprise you? Does it change your recommendation based on what you concluded before this step? Well, I don't know. So
then the report part: present your findings, figure out how to present them, try to think of how a real estate dealer can use these findings, and measure the impact. Remember the last assignment, where we tried to predict the house sale prices? He actually came back and said he needed something a bit different. Maybe you didn't think of this in the last one, and that doesn't matter, because this is about learning; but next time you have something like that, this could be your advice, because maybe this is what the client is actually looking for. Sometimes customers or clients don't know what is possible and what adds value, and that's where your experience comes into the picture. You are partly a salesperson: the client wants this and this, and you might think you'll just do it, but maybe the client is actually more interested in getting something else, because it adds more value, and you have seen from previous experience that it gives clients more valuable insights. I think you get my point. So what to do now? Now you should stop the video and try to solve this project as far as you can. If you get stuck, don't worry: in the next part, if you continue playing, I will try to solve it together with you, and sometimes I need your help, so it will be a cooperation. See you there if you get stuck; otherwise, you are awesome. Are you awesome? I know you are. So, did you manage it yourself? Let me know in the comments, and let's try to do this together, because I need help too; that's how it is,
right? So this first cell: it's Shift+Enter. Did you manage that one? Yeah? You are awesome, perfect. Let's try the next one: data = pd.read_parquet, and we take files/house_sales.parquet. Let me just do a head() to see that the data is there. Perfect. We also notice we have a lot of NaN values; we're going to deal with that as we know how. And let's just see the amount of data: we have about 1,400 rows and 56 features. Perfect. Then we want to check the data types, so let's run dtypes. The great thing about this dataset is that they're all numeric, all the way down, so that's a beauty: we don't need to do anything with categories and so on. Then we check for null values with info(), and this is again a long list. What I did last time was to remove just one column, and I think that's fine:
data.drop, and the column is called PoolQC. Perfect. I actually think we should take care of the missing values too, maybe something like data.fillna(-1); no, sorry, we're not doing that here, we're actually doing something different, so let's just follow along. We need axis=1 here, and then it's happy; I was trying to remove it along the rows, and we need to remove it along the columns instead. Perfect. So now we need to find the quasi-constant
features, and this is actually a great technique. So let's try it: we create a VarianceThreshold with threshold 0.1 and fit the selector on the data; then get_feature_names_out gives us the features that matter. But we're working with it a bit differently: we want the features that are to be removed, so we use a list comprehension, call it quasi_features maybe, and say: col for col in data.columns if col not in selector.get_feature_names_out(). Let's see how many we get; ah, invalid character in identifier, I guess there was a stray character I didn't see. So we get Street and Utilities in this one.
We worked with this last time, so it should be familiar to you. Then we're looking for correlated features, so let's build the correlation matrix on the data (I do it on the version where we didn't remove the quasi-constant features, but that's up to you). Then the list comprehension again, calling it feature this time: feature for feature in corr_matrix.columns if an earlier feature correlates with it above the threshold, and that's basically what we want. In corr_features we get five of them. Perfect.
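The correlated-features step just described can be sketched as below. The exact comprehension in the notebook may differ slightly; this version uses the upper triangle of the absolute correlation matrix, so only one of each correlated pair gets dropped. Column names and data are illustrative:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=100)
data = pd.DataFrame({
    "GrLivArea": base,
    "TotRmsAbvGrd": base * 0.95 + rng.normal(scale=0.1, size=100),  # near-duplicate
    "YearBuilt": rng.normal(size=100),                               # independent
})

corr_matrix = data.corr().abs()
# Upper triangle only, so each pair is considered once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
corr_features = [col for col in upper.columns if any(upper[col] > 0.8)]
print(corr_features)  # → ['TotRmsAbvGrd']
```

Keeping GrLivArea and dropping only TotRmsAbvGrd is the point: highly correlated pairs carry the same information, so one of them is enough.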
Let’s call it target like in the lecture and uh then i actually think we can use this one here straight out of the box here we go good and then we want to create the x data set here so data and drop and then we have we have something already here we have
The target at least and then we have a sale price and then we want to add quasi features and correlated features and then we want to do it along axis one here and y is equal to data what is going on here target here perfect so so
I think i don’t mention this but i mean uh there’s still some null values and i should probably have done a bit more accurate work on notice notifying on you and notifying you with that but here we have the data sets and i fill all the missing data so it’s still
Not a perfect data set i apologize for that uh good and then we do x train x test y train y test train to split x y y test size let’s do 20 and uh random state 42 good perfect and now let’s let’s find the 10 most important features let’s use this one
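The preparation above can be sketched as follows. The lesson has its own helper for making the classes; pd.qcut is one way to get three roughly equal-sized classes, and the synthetic prices and columns here are stand-ins for the real dataset:

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
data = pd.DataFrame({
    "SalePrice": rng.integers(50_000, 500_000, size=300),
    "YearBuilt": rng.integers(1900, 2010, size=300),
    "GrLivArea": rng.integers(500, 4000, size=300),
})

# Three ~equally populated classes: 0 = cheapest third, 1 = mid, 2 = expensive
data["Target"] = pd.qcut(data["SalePrice"], q=3, labels=[0, 1, 2]).astype(int)

# Features: drop the raw price and the target itself
X = data.drop(["SalePrice", "Target"], axis=1)
y = data["Target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
print(X_train.shape, X_test.shape)
```

Dropping SalePrice from X matters: leaving it in would let the model read the answer straight off the column the classes were built from.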
We use the sequential feature selector, sfs; it actually says what we should do here. We create it and then fit it with X_train and y_train. I hope this goes pretty fast. You should do a more thorough investigation than this; this is just to get you started, but it's interesting to see how the score goes up: it starts around 58 or 60 and ends at about 75, and it doesn't change much at the end. So it's a better score than in our lesson, where we got 72, I think. We've done some preparation, some better work, and there are still improvements to make with the null values, which you could handle better than I'm doing here; remember the cleaning-data part, check that one out and figure out how to do this better. Perfect. Then it says you get the features: sfs has an attribute with the feature indices, and you can get the names from there. It actually says: explore the features, you can access them like this, which we did, and you get the feature names with this one. So let's look at the feature names: here we have them, those are the 10 most important.
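The selection step can be sketched with scikit-learn's SequentialFeatureSelector; the notebook's setup may differ slightly, and this sketch picks 3 of 6 synthetic features instead of 10 from the house data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=1)

# Forward selection: add one feature at a time, keeping whichever
# addition gives the best cross-validated KNN score
sfs = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=3)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask over the 6 input features
```

On a DataFrame input, sfs.get_feature_names_out() would return the selected column names directly, which is how the project lists the 10 winners.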
So what we want to do now, as one of our conclusions, is to look at these and try to figure out: do they make sense to you? Is this what you would expect? Like YearBuilt: does it have a high impact? It seems like it. 2ndFlrSF: what does SF stand for? I actually don't know offhand, but you can look at the dataset description, and you can often tell what a column is when you look at its values; of course we don't have all of them here. OpenPorchSF, again with SF; why don't I know that? Okay, so what I want you to do here is look at these features and ask whether they make sense. YearBuilt
for me kind of makes sense: the year the house was built has a high impact, since newer houses often mean higher prices. Good. So let's run this cell and see what it actually shows us: it takes each of these features and looks up its location in the sorted correlation matrix, so you see how highly correlated they are. The one most correlated is actually this one here, GarageCars is second, then YearBuilt, then YearRemodAdd, and so on. So you can see how they're ordered and which index they actually have, and this is kind of surprising. If you look at the correlation matrix itself, you see OverallQual is not one of them, but that's probably because it's highly correlated with one of the others. So we've removed some features, and you
can check which ones we removed, and then you see it's kind of funny which one actually has the highest impact. So what I would do here is look at which ones we've removed; let's add them: quasi_features is Street and Utilities, so you can remove Street and Utilities from the picture; where are they? Street is here, Utilities is here, so they basically have no impact at all. Then we could add the, what did I call them, I forgot the name, and this is embarrassing: I'm sitting here teaching you what to do and I can't remember my own variable; corr_features, that's it. Here we have some of them, and you can remove those from the list, and then you get the picture of which ones are left and which actually matter most for getting the highest score with our 10 features. So this gives one picture, and the other gives another picture, and that's kind of interesting. Does that change your recommendation? Well, it might, it might not. The point is that this list is not ordered by what you'd call impact; that's more what the other one shows, a higher correlation. And OverallQual is not being removed, so it's interesting that it's not part of the list: the overall quality does not have the highest impact when it comes to predicting which category a house is in. That's actually quite interesting; I actually don't get that, but again,
these findings might come out a bit different if you run a different model, and that's also why you should do this work more thoroughly than I'm doing here. So this leads us down to step four, presenting your findings. Again, I think you should do somewhat deeper research and figure more out before you elaborate on your findings, because maybe this is just a coincidence, so I would run a few more tests and a deeper level of investigation before giving a recommendation. And with our knowledge now, how can he use these insights? Are three categories a good choice? Should you have more? Should you divide them in a different way? Sometimes the division between categories is not the perfect one, because maybe you actually want specific price ranges instead of a 33/33/33 split, and you can do that as well. Maybe that's actually what you want, because when people are looking to buy a house, is it dictated by the 33-percent slices of market prices, or is it more that your end customer, your client, knows what ranges of houses they sell? So again, I hope you enjoyed this one. The next one is going to be amazing, because we're going to sum up everything we've been doing, and I'm going to present you a template of the entire
process. Because we learned so many things, and now you might wonder: how can I remember it all? Well, good news: I made a template for that. You just follow the template, and if you want to dive deep into a step, it mentions which lesson you need to revisit, so you can refresh your material. Whoa, that's nice, I know. So if you're excited about that, please hit like and subscribe, share this with somebody, comment down below, and let me hear how you're doing. See you in the next one; it's going to be amazing to summarize it all and have a template so you can figure out how to work with your data science projects in the future. See you there, bye-bye. The full data science workflow: most aspiring data scientists actually get wrong what matters most as a data scientist.
They focus mostly on specifics, while a real, experienced data scientist focuses on the bigger picture. So, to ensure that you don't make the same mistakes, I created a template that will walk you through all the things you need to consider when creating a data science project, because in the end what matters is the value you add to your clients or customers. That's right: what value can you give them? That's your main objective. This template will guide you through all the things you need to do and consider, and where you can find more resources and information about the topics that are particularly relevant for your data science problem. We will also do a project using this template together, just for fun, to see how it works. And as you will notice, you don't always use all the steps, but the template ensures that you have considered whether each one is necessary. Also, if you're new to this series: this is actually a 15-part course on data science with Python, and this lesson kind of summarizes everything you need to know; if you use this template, you can be confident you create value for your end customer. Okay, I hope you're ready, and if you didn't notice, there's a link in the description where you can download all the notebooks, all the resources we're using, and of course this template. Okay, are you ready? I hope so, so let's get started. The data science workflow is represented here; we introduced it in our first lesson, lesson 00, and we have stepped through all of these steps and been working with them throughout the course.
The main thing is that the first step you need to master as a data scientist is to understand the problem of your end client, because in the end what matters most is that you can create useful insights that add value for your customer. Good. I also mentioned many times that most data science courses and resources only focus on the preparation and analysis steps, and a bit on reporting, but it's important that you don't, as a data scientist, become frustrated and think: oh, I need to master an enormous list of different technologies, different languages, and all that insane mathematics and statistics. It's not necessary; in this course we have covered what you need, and that is what is summarized in this template, to make sure you make great data science projects in the future. So let's dive into it. The first step in a data science project is that you need to understand the problem, and if you want to learn more about that, dive into lesson 0, as mentioned. You need to ask the right question: what is the problem we're trying to solve? This forms the data science problem. I'm not going to dive into everything here, just stay at a high level, but you can read more about examples, how to assess a situation, risks, benefits, and so on; these are important aspects. Then, when you understand the problem well, you need to identify data. Many times your client will provide you with data, but sometimes you need more, or they don't have any and you need to find it. Here's a list of great places where I like to find data, and I would love to hear from you about other great places, so we can build a bigger list of awesome places to get data. Then the main tool we've been using to represent data when
we work with it in Python is the pandas DataFrame, and the great thing about the pandas DataFrame is that it connects with all types of sources. So the first step is to import the data, and you can find the type of file you have and see how it's done. If it's a CSV, you use read_csv like this, the default arguments can be like this, there's actually a tutorial about that, and you can see the lesson where you can learn more. Sometimes people have Excel files, or parquet files. Web scraping is a big thing, and it's actually way easier than you think: you call read_html like this and you have all the tables; again, go to lesson 03 if you want to go more in depth. Databases are also a great source: there's so much data around the world in databases still, and it will continue that way for a long time, I promise you. Here's an example of how to use one: you create a connector and then you select all the data into your DataFrame, and most of the time that SELECT statement is all the SQL you need; you don't need more advanced stuff. Then, when you have a lot of data, you need to combine it, and there are three main ways to combine data: concatenation, when you need to stack data together; join, mostly used when the index is the same; and merge, where you specify which key you're merging on. Again, go to lesson 06 to learn more about it. So that was step one. Step two: you need to explore the data.
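The three combine operations mentioned here can be sketched in a few lines (tiny made-up tables for illustration):

```python
import pandas as pd

a = pd.DataFrame({"id": [1, 2], "price": [100, 200]})
b = pd.DataFrame({"id": [3], "price": [300]})
c = pd.DataFrame({"id": [1, 2, 3], "year": [1990, 2000, 2010]})

# concat: stack rows on top of rows
stacked = pd.concat([a, b], ignore_index=True)

# merge: match rows by an explicit key column
merged = stacked.merge(c, on="id")

# join: match rows by the index
joined = stacked.set_index("id").join(c.set_index("id"))
print(merged)
```

merge is the most explicit of the three, which is why it's the usual choice when the key is a normal column rather than the index.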
We learned about head, shape, dtypes, info, describe, isna, and any. describe gives the statistics that teach you a bit about the data, info is more about the missing values, and you can also use isna().any() to check for those. Another great thing to learn about is groupby with count and some basic statistics, which we did in lesson 08. You use groupby when you have categories in your columns, for instance gender: then you can count how many there are of each gender, and you can find the mean values per gender. Then we had something about statistics, and the great thing about statistics is that the most important statistic you already know by heart: the count, because it says something about the quality of your
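The exploration helpers just listed can be sketched like this (made-up toy data):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["f", "m", "f", "m", "f"],
    "income": [52, 48, 61, 55, 47],
})

print(df.describe())       # count / mean / std / quartiles per numeric column
print(df.isna().any())     # True for any column with missing values

# Per-category count and mean, as in the gender example above
print(df.groupby("gender")["income"].agg(["count", "mean"]))
```

The groupby line is the one-liner behind "count how many of each gender and find their mean values".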
research. If you do research with only 10 data points, 10 observations, you know that research, no matter what it is, is poor; you need more observations to do really good research. Then we learned about the standard deviation and what it means (again, go to that lecture), and then about the box plot. My advice with box plots: use them yourself, but unless your client is used to box plots, they will not make sense to them. Still, it's an amazing way to summarize the data you have: it summarizes visually what describe gives you as figures, and our brains are just hardwired to understand visual information way faster. So when you need to understand big datasets, it's way easier to do it visually than to read the actual numbers: you spot discrepancies, you spot outliers,
you spot patterns, and so on. That leads us to the next one: visualize data. One thing most people don't realize about visualization is that it's actually part of multiple steps: understanding the data, visualizing ideas to get smarter about it, and so on. I would say in general there are three things you use visualization for as a data scientist. One is exploring the data to find discrepancies, missing data, outliers, and so on. Another is exploring ideas, like whether there's a connection between data points, with a scatter plot. And finally there's presenting, making the report, where you present your findings. Why does it matter to do it visually? Because you need to convey that you did your work and that you have evidence for what you're saying to your end customer. So here we have a lot of plots with examples, and these examples can be run out of the box if you download all the resources, because we have all the data included; so if you want to try one, you can just add a cell and copy-paste the code in there. Good. We have a simple plot, labels, ranges, comparing data, then scatter plots, histograms, bar plots, pie charts, and so on. Then cleaning data, and this one is often a big surprise to anybody:
how much impact cleaning data actually has. dropna is kind of the lazy way to do it, because often it's not the best way; you need to figure out what you can do with the data, and what we learned is interpolation, or filling in with the mean value like this one here, or alternatives like the mode, which gives the most represented value, the one that occurs the most. That's cleaning data, and we also have drop_duplicates: duplicates are often a quality issue in your data, where you have the same data multiple times, and you don't want that, because it creates discrepancies in how much weight a data point gets if it's represented multiple times. Good. And there's a great resource on this in the pandas user guide, so use that one for cleaning data.
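The two cleaning moves just described, filling missing values instead of dropping them and removing duplicate rows, can be sketched like this (toy data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "area": [100.0, np.nan, 120.0, 100.0],  # one missing value
    "rooms": [4, 3, 5, 4],                  # rows 0 and 3 are identical
})

# Mean imputation: keep the row, fill the gap with the column mean
df["area"] = df["area"].fillna(df["area"].mean())

# Remove exact duplicate rows so no data point is double-weighted
df = df.drop_duplicates()
print(df)
```

Swapping .mean() for .mode()[0] gives the mode-based fill mentioned above.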
Then we have step three: analyze. What you do is take the dependent and independent features, the y and the X, and divide them into a training and a test set with train_test_split. Then you may want to do some feature scaling; that might be necessary because some machine learning models need it, especially those that use a distance metric between points: some features lie in a tight interval while others have a much bigger interval, and that makes a difference for the machine learning model. We learned about normalization and standardization, how they work and how to apply them, and it explains here what to do, with code examples for both. And again, there's not one guideline for which to use;
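The two scaling options can be sketched as below: MinMaxScaler for normalization (squashing into [0, 1]) and StandardScaler for standardization (mean 0, standard deviation 1):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

X_norm = MinMaxScaler().fit_transform(X)   # normalization → [0, 1]
X_std = StandardScaler().fit_transform(X)  # standardization → mean 0, std 1

print(X_norm.ravel())
print(X_std.mean(), X_std.std())
```

In a real pipeline the scaler is fit on the training set only and then used to transform the test set, so no information leaks from test to train.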
you need to try both and see what works. Then feature selection, and this one is also often a surprise, because most of the time people think the more data the better, but actually no. Why feature selection? Well, you get higher accuracy, you get simpler models, and you reduce the risk of overfitting, because maybe your dataset has some features that are a bit skewed and would drive wrong conclusions, so you want to find the features with the highest impact that create the most accuracy for your models. In general there are filter methods, wrapper methods, and embedded methods, and we have some examples here of how to remove constant and quasi-constant features: often you don't think about it, but some features have almost the same value all the time, and this is how you can remove them. Then remove correlated features: you don't want to use features that are highly correlated with each other, because they don't add extra value, and one measure would be that if two features are more than 0.8 correlated, get rid of one of them.
Model selection: this is also a big one, which model to use. We covered that in the last lesson, lesson 13. In general there are three types of models: classification, clustering, and regression, and there's a cheat sheet here for finding the type of model you want to use. Then there are model selection techniques, and there are two great criteria: scoring by performance, obviously, to compare models, and also the complexity of the model; the more complex, the less desirable, so you want simple models with high scores. And a great way to score a model is actually not just to run it once but to make different training sets: if you only have one training set, you can divide it into multiple different train/test splits, score each, and take the mean value; these are resampling methods. Here we have a few models that we covered, and again I would advise you to look at the full machine learning course I have on the channel: a 10-hour, 15-part tutorial covering machine learning from scratch.
Scratch and onwards it’s really amazing and people love it so i advise you to do that as a next step then we’re analyzing the result right so this is this is basically uh basically a checklist checkpoints for our analysis i’m not going to go through them all here
But review the problem right go back to the data science problem you started with because what happens sometimes when you work with something is you kind of screw and get interested in something and you dive into something that makes you interested not the customer you right you find oh this is more interesting
So yeah so that’s a focus point did we learn anything right again data driven insights should add value sometimes it doesn’t be be aware of not making conclusions like wealthy people buy more expensive cars right is it interesting to figure out as a data scientist yeah because you have the
Data that confirms what everybody thinks but does it add value does it add value does it i don’t know is it obvious that rich people buy more expensive cars i don’t know maybe not in the future because we need to buy green cars and electric cars and cheap
Cars and smaller cars i don’t know let me know in the comments what you think can we make any valuable insights of our analysis right that’s a checkpoint and there’s some questions to ask there do we have the right features right because sometimes you need more data a different
Type of data and this is also again when you think about the loop with our data science research project workflow another data science workflow it is a loop right don’t think of it as a static one two three four five step thing you might go back to the customer
And say okay we did the analysis we need more features we need this and this type of data do we need to try different models right you need to try to do thin can the result be inconclusive right can we still do some recommendation uh these are some some check questions
To finalize it and again sometimes you need to go back when you ask these questions and that’s how the process is i like this quote because this is where i see everybody not everybody but so many people make faults it’s politicians it is newspapers or online medias and all that
And i will probably myself will pray for this one as well so what is this quote saying it’s sherlock holmes so it’s a fictional character but it is amazing he says it’s a capital mistake to theorize before one has data in sensibly one begins to twist facts to suit theories
Instead of theories to suit facts so you know this one right so you have like this idea i want this to be true i want to find the data that confirms this and i had discussions with so many people during throughout my life right away but you just
Know, and i’m guilty as well, but you want this to be true, and then you find data supporting it. But then you see some data that does not support it, and you kind of argue against why that data should not be used.
Politicians do this all the time as well, right? They say oh, we don’t need experts to tell us this, it’s common sense, everybody knows it. You twist the facts to suit what you think, your hypothesis: you find things that support it, and the things that don’t
Support it you argue against. As a data scientist, a good data scientist, don’t do that. Don’t be that person. I love this quote, and it reminds me all the time: am i doing it
Correctly, am i trying to twist the facts? So he says: get the data, make the conclusions based on the data, don’t twist facts. It’s an iterative research process. You have some observation, you form a hypothesis, and this is a way of thinking about the process: what can you do before concluding? Often it goes back and forth all the
Time, right? You have some observations that start a question in your head, then maybe you make a hypothesis, then you get more data and test it, then you analyze whether it’s evidence, and then you can conclude. But there’s a catch. Often what people want to conclude from a correlation is causation: we want to conclude that this causes that. But a correlation between two things does not mean A causes B; correlation is not the same as causation. A silly example: you notice that there’s a correlation that people
Use umbrellas when it rains. That does not mean that umbrellas cause rain, makes sense, yeah? But that is often what people conclude when they have a correlation. So remember, try to make a simple, silly example and ask: can i conclude that based on this? Correlation is not causation.
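That umbrella point can be sketched in a few lines; the numbers below are made up purely for illustration:

```python
import numpy as np

# Made-up illustration: daily rainfall (mm) and umbrellas seen on the street.
# Both are driven by the same weather, so they correlate strongly,
# but umbrellas obviously do not cause the rain.
rain      = np.array([0, 2, 5, 10, 1, 0, 8, 12])
umbrellas = np.array([1, 15, 40, 80, 10, 2, 65, 95])

r = np.corrcoef(rain, umbrellas)[0, 1]
print(round(r, 2))  # strong positive correlation, zero causation
```

A confounder (the weather) drives both series, which is exactly why a high correlation coefficient on its own proves nothing about cause.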
Report and present the findings, and this is a great one: you need to sell and tell the story of the findings. The common mistake people make is that they forget who the audience is. You assume that your audience has the
Same interest as you. If you are a data scientist and your audience are data scientists, well, you’re lucky, because they probably have the same interest. But let me put some broad categories here: team manager, data engineer, data science team, business stakeholders. And business stakeholders are often those that pay your
Money, right? If you don’t create value for them, they don’t pay you anymore. They have a different interest, and you need to speak their language and make a story that makes sense to them. It
Doesn’t make sense to them that you say, oh, i twisted this model and i tweaked these parameters and i found this data. What they care about is: how can they make more sales, how can they optimize something, how can they predict something? That’s what matters to them. Add value to them.
When presenting, communicate actionable insights to key stakeholders. You don’t want to tell a story where, when they hear what you’ve done, there is no conclusion; they need to have actions, what should they do differently, that’s what they pay you for. Then there is TL;DR, too long; didn’t read: be clear and concise.
It’s often good to have that in the beginning, a clear, concise summary of the content, often in one line, because we are bombarded with information all the time. And i must say, me too: if information doesn’t
Catch my interest immediately, i don’t listen. So the point here is that you need to catch your audience, and i must say it’s difficult, i’m not an expert on that and i wish i was better. There’s a longer list of things
Here, but i think we have talked enough about that. Then it’s visualizing results, telling a story with the data. This is where you convince your audience that your findings and insights are correct. When you have a finding, you can say: i know that
A is correlated with B. And does that impress your audience? No, you need to have the data represented, and the best way to represent data is visually, because we can process that so fast. If you have a chart showing some
Connection, then they say, oh yeah, i see that, and then they make the same conclusions. Resources for visualization: we covered Matplotlib inside the course, and i added a few more, the list could be longer. I just want to mention a few of
Them: we actually used Folium in one of the projects, and Seaborn and Plotly are amazing tools. I wish i had more time in the course to add them here, but the point was not to go in depth on everything.
The point was to give you the full picture, but those are great resources to look into. And finally a reminder here: credibility counts. Often we want to leave out some results because they don’t tell the good story, but you need to remember, this is about
Your value; your credibility is your value. Then we have actions on the insights: how do we follow up after presenting the insights? There is no one-size-fits-all here, unfortunately. Some problem examples: which customers are most likely to cancel the subscription? Say we have insufficient knowledge of
The customer and need to get more; hence we can give a recommendation to gather more insights, but you should still try to add value. Or our problem is: here is our data, what valuable insights can we find? This is a challenge, as there is no given focus; it’s an iterative process involving the customer
And can leave you with no surprises. Okay, measure impact: as a customer you are interested in measuring, did it add value? And this is your sales point. You present the insights, what they should do differently, and you need to be able to measure the
Impact in a few months, or however long it takes, on some metrics. And if they can’t do that, why would they use you again? If you can, this is your sales point, i promise you your sales will go up. So, understanding the
Metrics: the metrics are indicators that your hypothesis is correct. You’re selling them insights, and if they take these actions, there will be some metrics that indicate your hypothesis is correct. Many times this does not
Go exactly as well as you assumed it would; then you also need to be able to tell a story, that things were a bit different, and what to do better in the next iteration. Remember, this is a long-term relationship with the customer or the client, and it
Can evolve along the way; you can read about it here. Main goal: again, the main goal for your success as a data scientist is to create valuable, actionable insights, that’s it. A great way to think about this is that any business or organization can be thought of as a
Complex system: nobody understands it perfectly, and it evolves organically all the time. Data describes some aspect of it. The organization can be thought of as a black box, and any insight you bring is like a window that sheds light on something that happens
Inside this black box. So imagine the organization as this black box nobody really understands, and when you have some data and some insight, it opens a window and lets light into it, and somebody gets some insight, some advantage. This is what you want to
Do: you want to find the right windows to open. So, some general advice. Expectations: when i started my journey as a PhD researcher, i expected to solve the biggest problems in the world. I was thinking, whoa, i’m gonna solve them.
I like physics too, so i thought i’m gonna be the next Einstein, creating the new relativity theory. When you read history books about the biggest discoveries ever made, you hope you will be big like that, but honestly, don’t expect it immediately, because
I expected to change the world for the better. When i wrote my application to become a PhD student, we wrote up the project we were supposed to do and how it would change the world for the better, and reality was a bit different: it was
Small increments. How research often works is that you provide some small, tiny insight built upon other research. I remember when i read my first couple of research papers, i was just amazed, whoa, this is big work. But if you read the sources
Building up to that research, you actually realize it’s just an incremental step, one tiny step, and that’s how research is: one small step at a time. Sometimes bigger steps come, and we all hope to get one of those. So start with a simple, interesting problem;
Do not expect to find insights that will change the world from day one. Learning, i like this one: this is a new field, but like any research field it evolves, and we will learn new techniques and new tools all the way. This course gives you
A solid basis, but there is a lot more to learn; don’t expect your learning to end. I always focus on learning new things all the time. Another thing is long-term focus: be clear on your goal, become a data scientist. This will help you
When things get difficult; sometimes, also for myself, it’s like, why am i doing this? Have a long-term focus, know exactly why you’re doing things, this will help you when things are not easy, because sometimes you sit there with a problem you don’t understand, and you can sit
For days sometimes, and you kind of want to give up. Don’t give up. I’ve been there, stuck on the same problem for weeks, i didn’t understand it, and then when you finally understand it and solve it, it’s so amazing, it’s a great feeling. Curiosity, this is one of
My favorites: if you’re curious, let it guide you, keep it playful. As i say to myself, you need to enjoy what you’re doing, and most people are curious, so let your curiosity guide you on your data science journey. There are so many amazing things
You can learn out there with data science, so please do that, let it guide you. Don’t get frustrated if there’s something you don’t really understand and you can’t get help; i mean, that’s almost impossible, because there’s so much help out on the internet. But if you feel stuck, remind yourself why you’re doing
It, and remind yourself you’re curious about it. Explore new things, learn something; learning is the best thing in the world. So now i have walked through the entire five-step process and where you can find resources. In the next one we’re going to do a small project together where you use this template.
We’re not going to use every single aspect of it, but i hope it will clearly make you understand how to use it. Perfect, see you in a moment, bye bye. Are you ready to try out the template? I hope so. This will be your capstone project; that means you connect all the pieces together, trying to use all the knowledge you have. So are you ready for that? I hope so. Let’s dive into the Jupyter notebook and let’s get started.
So, the capstone project. In this project the goal is to imagine your first client; for this first client you will create a problem, and you want to add value for your client. In the classical world you might think that the client comes with the problem and you add
Insight at the end; in the real world it’s often an iterative process where you talk to your client and so on. But since we don’t have a real client, we have to imagine we have one. So try to explore and find a problem, and i
Have some guidelines here for how you can start and try to figure out what kind of insights you should generate, and again, this is to teach you how to use this template, so let’s get started. The goal of the project
Is to put it all together. Ideally we would look at a real business or organization problem and turn it into a data science problem; as this can be hard, we just assume that we have a problem that we need to solve. This will be done either by making up a
Problem or by looking at some data that interests you and making up a question. I’ll just scroll through this entire template here, it is long as you can see, and i’m not going to go through it, because it’s basically what we did in the previous one, just mapped out in steps,
And it has most of the things in it. You can see it’s pretty long, and it’s there to guide you, but you need a problem, so define the problem. This is a fictional problem, but let’s just assume you have a
Customer, because it makes it a bit more fun and more realistic, and it gives you the focus you need as a data scientist: not just doing things that you think are 100% fun, but also what adds value. That’s the reason why
You’re there; if you’re not good at adding value, what’s the purpose of you being a data scientist, who’s going to pay you? So let’s try to define some problems, but don’t be too ambitious. Example: a green energy
Windmill producer needs to optimize distribution and needs better prediction of production based on weather forecasts. Why is that too ambitious? Probably because you don’t have the data necessary: for the windmill case you would need data on production and maintenance periods (windmills often need to be
Maintained), and detailed weather data you don’t have. This is just to get you thinking about what you need. Another one could be: an online news media is interested in a story on how CO2 per capita around the world has evolved over the years.
The data for CO2 per capita is available from the World Bank, but creating a visual story is difficult with our current capabilities. That said, it’s not impossible, but if you really want to make a compelling visual story it will take a bit more practice. Remember the video of
That guy walking around, yeah, exactly that one. If you don’t, go back to the lesson on visualization, it is amazing. But that’s also world class; you don’t need to get up to that level. Good, perfect. Here are some ideas for a project:
You could start by considering a data set to get some inspiration. Examples could be the World Bank database with the CO2 per capita and GDP per capita data sets, or you can take the IMDb movie data set. Sorry for the gesture, there’s a small fly flying around
Here, i don’t know where it comes from. But you can take the extensive IMDb movie data set. An example of a problem: what are the highest rated movies in a genre? I’m going to try to make a sample project, and i’m going to use the IMDb movie data set and look into the genres of
The movies and how they are evolving, perfect, that is actually what i’m gonna do. And if you want to do something different, please do; if you want inspiration for how to do it, watch along and see how
It’s done. What i was looking for was this one here: this is the problem i’m gonna look into in the next one, and you should do something else, right? So, sample problem project: an online media wants
To write an article on the trend of movie ratings over time. They want to explore what the overall trend is and whether there are different trends in different genres, and they ask you to make some charts showing the trends. That’s your job, you’re hired to do this single thing.
As you see here, there are sample projects and where you can find the files; the data for the sample project is stored in parquet files. There are some guidelines in the project i’m gonna do. If you want to do something else,
Feel free; if you want to see me do this one, feel free. But i advise you, if you want to see me do it, to also try it yourself, and try to do the same project i’m doing, because that will teach you the most. I hope you are excited about this,
Because i am, and i’m pretty curious myself about the movies: what is the overall trend, what happens, are some genres getting more popular or less popular? I don’t know, it’s going to be interesting. So get ready to do that, amazing, see you in a moment, if you
Dare, see you, bye-bye. Are you ready for this? Excuse me, it’s been exhausting: i love making these projects, but i’ve been making 15 projects, 15 tutorials, and right now i just need a cup of coffee. Oh, that’s good, perfect. So are you ready,
Did you get your coffee? Don’t worry. Let’s dive into how you can use this template and let’s try to make this project. And again, remember, we have the overall picture, and we create a problem that should add value to the customer; all the steps do not always make sense in
The project, but you need to focus on what the problem is and how to add value to your customer; you need to understand that. So let’s dive into it. I already mentioned that our problem will be that an online media wants to write an article on the trend of
Movie ratings over time. They want to explore what the overall trend is and whether there are different trends in different genres, and they ask you to make some charts showing the trends. They want it to be visual, because that is what people understand: they understand big data visually, not as numbers.
So the first step: it’s important we import the data. The dataset is available on Kaggle, but i actually downloaded it already, so i have it in files/imdb. If you go here you can find it in files/imdb; here we have the four tables that
The Kaggle dataset has, perfect. Oh, i’m scrolling down here, that’s the wrong one, okay, here we have it. So we have the data; now we need to import it. We notice it’s parquet files, so we use read_parquet. We are not using all four, we are using
Three of them. We have movies, i don’t remember the names actually, so pd.read_parquet on files/imdb and then movies, that’s the first one; we should actually just have movies
Here. Let’s just take one line with head, because there’s so much data. So this is the movies description, and then we have the names of the actors: pd.read_parquet on files/imdb and names, and afterwards we have ratings.
I don’t know if we’re actually going to use them all. And head(1) here shows only the first line; by default head takes five lines. Here we have the first actor, Fred, his height and some data on him, and up here we had the original title, Miss Jerry, from
1890-something. And then we have the ratings, which are probably what we are most interested in, but we probably need to combine the tables; we’ll see when we get started on the project. pd.read_parquet on files/imdb and ratings.
Do we need title principals? I don’t remember. ratings.head(1), perfect, so these are the ratings. Okay, actually we are not looking into this one, we actually want the title principals instead, so let’s do that. Title principals: again we need to look up the data in
Order to remember what it is. pd.read_parquet on files/imdb, title principals, head(1), oops, here we go. So here we have it: this is the table that connects the title id to the name id and who has been acting in the movie.
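The loading step can be sketched like this; the `files/imdb/...` paths follow the video's layout, and the tiny stand-in frame (column names assumed) demonstrates the same `head(1)` inspection pattern without needing the files:

```python
import pandas as pd

# In the video the tables are loaded from parquet files, roughly:
#   movies           = pd.read_parquet('files/imdb/movies.parquet')
#   names            = pd.read_parquet('files/imdb/names.parquet')
#   title_principals = pd.read_parquet('files/imdb/title_principals.parquet')
# A tiny stand-in frame shows the same inspection pattern:
movies = pd.DataFrame(
    {'original_title': ['Miss Jerry'], 'year': [1894]},
    index=pd.Index(['tt0000009'], name='imdb_title_id'),
)
print(movies.head(1))  # head(1) shows just the first row; head() defaults to 5
```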
So this is how you connect names with movies: you have the movies and you have the names, and here you have the connection between them, perfect. And actually the movies table,
That was not only the movies: there is the name id and the title ids, and in the movies table you actually have the ratings and how many votes there are, so it has a lot of the information we want. That was the first step.
Now we need to combine the data, so let’s just try to combine some data here, into combined_data maybe; let’s see, maybe we use it, maybe we don’t. movies.join with title principals, and then we join again with names on the imdb
Name id. So what am i doing here? combined_data.head(1): what i’m doing is taking the movies, and for each movie i combine it with the title principals, and then i join it on names to get the name of the actor or actress, so whoa.
Let’s just try to get a few more lines, combined_data.head(), because what you’re getting here is Miss Jerry multiple times, because you have multiple joins on it: there are multiple title principals that connect to it. Originally we only had
One of these, one movie, but then it becomes four rows, each combined with a person involved, and they can be an actor or an actress. We don’t have all the columns shown here, but if you looked at the
Principals data, it will have actress, or whatever kind of person it is. So now we have all the data inside this one combined_data, and maybe it makes sense later, maybe it doesn’t, but it’s about having an understanding of the data.
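The double join can be sketched with stand-in tables (column names here are assumptions based on the video, not the exact Kaggle schema):

```python
import pandas as pd

# movies joined to title_principals on the title id, then to names on the
# name id: one row per (movie, person), so 'Miss Jerry' repeats.
movies = pd.DataFrame({'title': ['Miss Jerry']},
                      index=pd.Index(['tt1'], name='imdb_title_id'))
principals = pd.DataFrame({
    'imdb_title_id': ['tt1', 'tt1'],
    'imdb_name_id':  ['nm1', 'nm2'],
    'category':      ['actor', 'actress'],
}).set_index('imdb_title_id')
names = pd.DataFrame({'name': ['Fred', 'Blanche']},
                     index=pd.Index(['nm1', 'nm2'], name='imdb_name_id'))

combined = movies.join(principals).join(names, on='imdb_name_id')
print(combined)  # two rows: the movie combined with each person involved
```

The first `join` matches on the shared index, the second uses `on=` to match a column in the combined frame against the index of `names`.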
Explore the data: let’s just look at the movies table, i think that’s the one we’re gonna use most likely. We have about 85,000 movies. With movies.dtypes, what are we interested in? We have something about duration, we have the year, we
Have something about votes, we have the average vote, we have a metascore, reviews from users, reviews from critics and so on, so we have some insight into what it contains. And with movies.info() we can see there are about 85,000
Entries; you can see some values are missing, and some columns are missing a lot, budget for instance, and USA gross income, and they’re not numeric anyway, so if we needed to use them we would need to convert them. Good. What we want to do now is use our group by,
And we want to group by year to look at the data. So movies.groupby, is it capital? No, it’s lowercase year, and we take the mean value, and then we take, for instance, the duration. Why the duration?
Let’s just look at the mean values first, because what we have here is the duration, the average vote, the votes, the metascore, the reviews from users, the reviews from critics; those are the numeric
Values we have. So what i wanted to look at first was the duration. Let’s just plot that: we see here that on average movies have been becoming longer and longer. We’re not asked anything about this, it’s
Just to show you how you can make charts on that. In general movie lengths have actually stabilized around 100 minutes; you cannot see that specifically here, you would need to do a bit more research, is it going a bit upwards or not, but it’s not a
Lot at least, so since the 80s the movie length has been on average the same, perfect. Now we need to understand our data set a bit better, and one way to do that is to look at the genres in the movies.
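The per-year duration chart comes from a groupby like this (tiny made-up numbers; the real table has roughly 85,000 rows):

```python
import pandas as pd

movies = pd.DataFrame({
    'year':     [1950, 1950, 1980, 1980, 2010, 2010],
    'duration': [85,   95,   100,  110,  100,  104],
})

# Average duration per year; in the notebook this series is then plotted
# with .plot() to show lengths levelling off around 100 minutes.
mean_duration = movies.groupby('year')['duration'].mean()
print(mean_duration)
```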
Let’s try to explore that. On movies, let’s do an explode; i haven’t introduced explode yet actually, but let’s do that: group by genre, mean, sort_values on the average vote. This gives us some kind of insight. Scanning ahead, do we take the first 30, or 20? (Hey, why didn’t you tell me? You knew that, you saw that, right?) That’s not it, we should use sort_values. Here we go. What i want you to notice now is that we have some weird genres here, like “Musical, Comedy,
Family”, so that’s one genre? Is that what the user wants? There might be almost no movies in this specific category, so it gives a skewed picture: the average year here is 2011, maybe there’s only one of them,
And it has a high score, the highest score, but maybe there’s only one movie, so how does that affect things? There’s no genre that stands alone, and that’s because the genre column itself has multiple elements. There are many
Ways to deal with this, but one way is to split it up when we have multiple genres and divide it into single genres. That’s actually what we’re gonna do here. We’re gonna copy our data: data = movies.copy().
It makes a copy, so we keep the original data. Then we do data genre equals data genre .str.split, and what do we want to split on? We want to split on comma, so you see here all
Of the genres are split on the comma. And what do we do then? We do data = data.explode. I didn’t really introduce explode to you, okay, we haven’t done that: explode transforms each element of a list-like to
A row, replicating index values. So if you have a list inside a cell, each element becomes its own row: explode on genre. We converted the column to lists, and now we have it like that. And then data genre, .str.
Strip: what i’m doing here is, if we have spaces in front or behind, we strip them away. Then we group by genre, take the mean, and sort_values on the average vote with ascending=False, and here we have it. (Did you see that again, or are you just keeping silent for me? Did you see that one too?) Here we go. So we have the full list now. And don’t worry about the details, because we didn’t learn about these things; it’s just
To show you that sometimes you need to transform your data. What i did here: the genre column can have multiple genres in it, comma separated, so i exploded it out to not be comma separated, okay.
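The split/explode/strip pipeline looks roughly like this (stand-in rows; the vote column name is an assumption):

```python
import pandas as pd

movies = pd.DataFrame({
    'genre':    ['Musical, Comedy, Family', 'Comedy', 'Documentary'],
    'avg_vote': [8.0, 6.0, 7.5],
})

data = movies.copy()                          # keep the original untouched
data['genre'] = data['genre'].str.split(',')  # 'A, B' -> ['A', ' B']
data = data.explode('genre')                  # one row per genre in the list
data['genre'] = data['genre'].str.strip()     # remove the leading spaces

per_genre = (data.groupby('genre')['avg_vote']
                 .mean()
                 .sort_values(ascending=False))
print(per_genre)  # Comedy now averages over two movies instead of one
```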
So, i think i don’t need more sugar for my coffee, thank you for that. And then you see what’s happening: what are the most popular genres now? You see documentary, film-noir, biography, history, news, and you probably
Have a higher count for each of them instead, so it makes the numbers more reliable. And a funny thing: reality TV is all the way down there, interesting. So far so good. What we want to show now is probably
What is the average vote over time? That’s the question we want now, so let’s do that. Actually this one is on movies, because we want all the data just as it is; in the exploded data we have the
Same data represented multiple times, so you need to take it from movies here. groupby, mean, then the average vote, and then we do a simple plot of that. Now we see the trend of the movies: one thing we see is that up until around the 60s
Movies were getting more and more highly rated, and then it’s been going downwards, the ratings have been getting worse and worse on average for all the movies. And just a note on the data set, because these are the things you need to consider: it only includes
Movies that have at least 100 ratings, and these are the kinds of things you need to research too: is that what the customer, the client, is expecting, how big should the movies be? Another thing we want to do is filter the data on a genre,
And it could be for instance biographies, it’s called Biography, then group by year, mean, average vote, plot. So here you get it, and you see something interesting: in biographies it seems quite stable, i mean you have the end bit here, but the trend actually seems to be,
It’s difficult to see, and we will do some more research on that a bit later, but it seems quite stable, or maybe even growing a bit. Honestly we could do more of them, but that’s up to you; i’m not gonna do more because it’s just
The same copy-paste, so i’m just going to focus on the overall picture and this one genre. What we concluded so far: the overall picture seems to be going downwards, and there are some genres that go upwards, but the overall picture is going downwards.
That means this genre is kind of an exception, and there might be other exceptions too, maybe growing even faster, i don’t know, i didn’t try it, you should try that out. Good. And cleaning data: we’re actually not gonna do anything about that in this
One here. We could go in and investigate whether data is missing or not, but in our case we’re only using data which is available, so we can’t say much about it. Good. What i want now is to create a data set
And try to make a model to see the overall trend, so let’s try to do that. All the data, that’s in movies, and we do groupby here and mean here, and then we
Make a bio data frame, and you could add more: the data filtered by genre, which is called, sorry, Biography, then groupby year, mean. Okay, so what i’ve done so far is create the data frames, and what i need is the x data, x_all.
It’s just a data frame with only the index, because that’s actually what we need: a DataFrame of the all data frame’s index. That’s the x data set. And y_all is basically the all data frame and the average
Vote column. And we want to do the same for bio, so let’s do that: a DataFrame of the bio data frame’s index, because we only need the index of it, and y_bio, oh no, i’m not listening to what i’m saying,
I’m just on autopilot: the bio data frame, average vote. So this is how you do it. For model selection, what we want to do is make a trend line on our data sets, and that’s why we’re going to use linear regression. We could
Copy this down here and modify it to our needs, so let’s do that. We actually don’t need the score, we’ll just keep it, maybe we’ll just look at it. Then we have linear_all, and we fit it with x_all and y_all,
And we’re going to do the same for bio. Okay, now we have fitted our models; let’s try to visualize this. On the all data frame, what is it we are looking for? We are looking for the average vote, and we make
A plot of that. But we want to do more: we also want to plot, against the all data frame index, our linear model, so linear_all.predict, this is where we predict, and we want to predict on x_all, perfect. So here we actually see
The trend line; maybe we make it red. Here we actually see the trend line on top of the data, and this is the best fit with linear regression, a fitted line on this one, so you can see a trend line going downwards.
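The trend line can be sketched like this; `numpy.polyfit` stands in for the sklearn `LinearRegression` used in the video, and the per-year votes below are made up:

```python
import numpy as np
import pandas as pd

# Stand-in for the per-year average vote from movies.groupby('year').mean().
avg_vote_by_year = pd.Series([6.8, 6.7, 6.5, 6.4, 6.2, 6.0],
                             index=[1960, 1970, 1980, 1990, 2000, 2010])

# Fit a straight line: polyfit with degree 1 plays the role of
# LinearRegression().fit(x, y) followed by .predict(x).
slope, intercept = np.polyfit(avg_vote_by_year.index,
                              avg_vote_by_year.values, 1)
trend = slope * avg_vote_by_year.index + intercept  # the red trend line
print(round(slope, 3))  # negative slope: ratings drifting downwards
```

Plotting `avg_vote_by_year` and `trend` on the same axes gives the chart with the red fitted line described above.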
And we’ll get back to presenting this when we’re analyzing the result; we’ll get a bit smarter in a moment, don’t worry, for now it’s presented. Then this is the bio one.
Actually for the bio i’m kind of surprised, it’s going downwards as well, there’s something wrong with this one. Oh yeah, now i know what’s wrong: this is the danger of copy-pasting, i forgot to change this one. So this one
Goes upwards; you could see the line didn’t fit. This one is actually trending a bit upwards, okay. Good so far. Analyzing the result: could we say anything here? I would say that more movies are produced, and does that mean worse movies?
That's the question now. Why do we ask it? Because we suspect that over time more movies are made. We didn't actually look at that in the data yet, but let's validate it: movies.groupby by year and count. It doesn't matter which column we count on, title works, since we're just counting rows. The point is: okay, the data isn't fully updated for 2020 (or whatever the latest year is), but you can see there are more and more movies made. Does that mean they are worse in general? That's a good question. Could the volume have impacted the quality? If you took the top 10 movies for each year, would the rating still be going downwards? These are interesting questions that you should consider.
You can also look at votes per movie. How do you do that? Describe the votes column of movies. What you see here is that the mean of the votes is almost 1,000, it's 949, and the midpoint (the median) is almost 500, and so on. The minimum is also a good point: the movie with the least votes has 99 (that's 9.99e+01 in scientific notation), and the one with the most votes has 2.2 million votes. Another thing to look at is movies grouped by year, taking the mean of the votes, and plotting it. It's interesting to see how the involvement has developed: you see it's been growing rapidly and then been pretty stable at the top. Votes per movie, okay. It gives you some kind of insight: there are more movies, and there are more votes per movie.
What does it mean? Does it mean that the quality of the evaluation is better? I mean, the people doing the evaluating, what about them? We don't know. Good. So now you need to present the findings, and one thing you could do is make a chart with that. We can take the charts we already made up here, we have two charts we're working with, and refine them a bit. One thing I often talk about: should the y-limit be zero? What does the chart say otherwise, and is it a good representation of the data? With the axis starting at zero you still see a decline in the vote, but which version are people more interested in seeing? I would say in this case this one makes more sense. But could we add something? Could we restrict the data set? Are people interested in movies back from the 20s? Maybe not. Maybe cut from the 40s, or maybe from the 60s. What tells the best story?
I'm not the judge of that; I think you should dive into it and figure out what you think is best. So far so good, but let's try to add something else to this chart: let's set alpha to 0.25, set the line width, and set the line style. I'm just making it look a bit different, right, this is one way to represent it, and we're not being very technical yet. set_xlabel is already there, so set_ylabel, what could that be? Rating (and remember the parentheses on the method call). And ax.set_title, 'Trend of movie ratings'. Okay, so here you have something to work with. Let's take the other one as well, because this is interesting: biographies are going upwards while the other one is going downwards. You can do something similar here: alpha 0.25, and a line width of 8, which I chose because I tried it just before and it looked better, with the same line style as before. Again, this is not the most amazing visual presentation, but we have really limited possibilities at the current stage. ax.set_title, 'Trend of biography ratings'. Perfect. So you see this one is trending upwards while the other one is trending downwards. What would that tell the reader of the newspaper? Well, in general, movies seem to be going downwards and biographies upwards.
Actually, I put these in the wrong order: presenting the findings should have been down there. But let's go to credibility counts now. What could you say here? A clickbait conclusion would be something like "movies are getting worse", but is that what you want? We noticed that more movies are made, right, but does that mean the production value goes down? Movies can be made by anyone, and what we know so far is that this list only includes movies with at least 100 ratings, and I don't know how difficult it is to get 100 ratings. We also noticed that each movie gets more votes over time. Does that have an impact? It's difficult to know whether more votes gives a better evaluation. So there are some questions to ask yourself in this credibility step, and if you investigate more, you might find more aspects to consider. Again, a presentation can be made very easy to digest if you really want a striking chart: cut the data from the 20s and you'll just see one chart going straight downwards, and it will look amazing. And if you wanted a positive story: biographies are getting better while movies are getting worse. But there's often a reason behind that, something behind the data, and that's what you need to explore. More movies are made; anybody, even me, could make a movie, and I'd just need 100 ratings to get on this list. Does that mean it's a good movie? I don't know.
Use insights: so what are our insights? Let's try to brainstorm a bit here. Movies get worse ratings, that is an insight, and it's an objective one. Biographies are getting better on average (and I got a triple "t" in there; apparently that's not how you spell it). More movies are being produced. And what else can we say? More votes per movie. Why do I do this? Because these are the insights we have at this stage. Maybe we're only considering them for ourselves, but we need to brainstorm on what adds value. Who is the end user, and what do they care about? Think about the newspaper, the media: is it an in-depth medium or a clickbait medium? These are questions to consider, and they will have an effect on our recommendations. But remember your brand: you are also a brand in this, so do you want to make clickbait news media or not? I don't know, that's up to you. Measure impact: can we do anything here for our customer? Yeah, maybe we can. This might be a one-off job for the media, but can we help them think long-term? You want to add value for them, and if you do a great job, they'll be more likely to hire you again. So what could you do? Ensure they measure as much about their users and visitors as possible, for instance which chart readers are most interested in. This helps them create better content in the future. So again, this is about being creative and thinking about your goal: your goal is to get more contracts in the future and create awesome content. Remember your brand value, and remember how to get more customers.
That is actually how I would do it in a fast way. This was done in 30 to 40 minutes, so it's not an in-depth analysis, but I hope you get an idea of how to use the workflow and how to work with it. There are many points we just skipped, and you could go deeper into them and investigate more, but that wasn't the purpose here. I am so honored that you took this journey with me all the way from the beginning to the end. I am so thankful for that, and I hope to hear back from you. Who are you? Why are you taking this course? I want to learn more about you so I can make better content to help you in the future. So thank you so much. If you found this valuable, please also share the template with anybody; there's a link down below, so share it along. Do that for me and help people: try to use it, and try to come up with improvement suggestions as you learn along. I hope you enjoyed this journey together in this 15-part course on data science with Python. It's been amazing. The template summarizes everything you need to do, and as you can see, you don't need to use every single step every single time; it depends on the problem. One great question you might have: what's next? Well, if you don't feel 100% comfortable with Python, I would advise you to take my Python course. It's an eight-hour course, it has resources and projects like this one, and it has a free ebook you can download, so you have a lookup where you can see how to do things. It's really amazing and very popular. If you are comfortable with your Python programming, maybe you want to expand your machine learning models. Well, lucky you, I have a machine learning course as well here on this channel. It is 10 hours, it is also free, it is structured in the same way, it has 15 projects, and you learn all the things you need to know about machine learning. It is just out there for grabs. And finally, I would love to hear something about you. You've been here all the way to the end, and I'm so grateful for that; I'm so happy you stuck it out to the very end. Let me know: who are you, what type of problems do you solve with your data science, what challenges do you see, and what do you want to learn more about? I wish to learn more about you so I can make better content. And finally, please hit like and subscribe; it helps me grow this channel and keeps me motivated to help more people. So see you in the future. Thank you!