Implementing an entire Data Science (AI) solution.
Part 1 of a three-part series on how to implement an AI solution from data cleaning and visualization to building a Machine Learning model and deploying it on the web!
Data Science? AI? Machine Learning? With these three buzzwords being thrown around everywhere, it is easy to get confused! This series is for those with little or no prior experience in these three trending areas. By the end of this series, you should have a basic understanding of them and even have your own AI solution (a Machine Learning model that takes different house features and returns the estimated price of the house) up and running, which you can add to your portfolio! This series assumes you know at least the basics of Python, or at least of programming in general. If you’re new to programming, refer to this quick read before jumping into this article.
Let me start by asking you a question: what is 3+3? Don’t sweat it, this is not a trick question. You probably answered 6, but why did you answer that? If we had similar childhoods, it’s because back in 1st or 2nd grade the Math instructor came in, drew three apples 🍎🍎🍎, drew another three apples 🍎🍎🍎, and told you to add them together, giving you 6! Pretty neat that we’re about to revisit something you learned ages ago and that is now second nature, huh? Let’s get back on track. You learned that and still remember it, so when I asked you what 3+3 was, you answered 6. That’s because you’re smart! You’re intelligent! Machines, on the other hand, not so much. That’s where Artificial Intelligence comes in! Artificial intelligence is when a machine tries to mimic the intelligence that humans come with by default, hence “Artificial” intelligence.
But how do machines achieve that intelligence? This is a fascinating question, isn’t it? Remember when you had to go to school to answer that 3+3 question? Machines have to go through, let’s just say, a similar process. Machines have to learn, hence Machine Learning!
We have covered two of the big three words, and now there is one left: Data Science! How does the machine actually learn in Machine Learning? Just as you used existing material such as textbooks and addition tables, machines also need things to “study” or learn from, and that is simply Data! That is where Data Science comes in. “Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains.” Well, that’s how Wikipedia likes to state it, but for now let’s focus on one aspect of it: the process of studying our Data, which will be the basis for our intelligent model, the thing our system learns from!
Setting up the execution environment
Setting up a local Python Data Science environment is tiresome and error-prone because it requires many external libraries, so let’s go with a tool that will make our lives much easier: Google Colab, which lets you write and execute arbitrary Python code in the browser. You can run the code in a cell by clicking the ► button.
The above image pretty much describes the things you’ll normally use as a beginner, but you can take a look at this article for a deeper dive into using Colab.
Explore and Analyze Data in Python
Data exploration and analysis is usually an iterative process, during which the Data Scientist takes a sample of data and performs various tasks to analyze it and test hypotheses.
Unsurprisingly, our project starts with exploring and analyzing data. The results of this analysis might form the basis of a report or a machine learning model, but it all begins with data. For this project, we’ll stick with the famous Boston House pricing Dataset, for which we’ll build a prediction model to estimate the median value of owner-occupied homes in thousands of USD.
Importing required libraries
import numpy as np  # for numerical array operations
import pandas as pd  # for loading and manipulating our data
import matplotlib.pyplot as plt  # for visualizations
Loading our data
We’ll start off by loading our dataset into our notebook. To do that, we use the pandas library we imported above. Pandas is a software library written for the Python programming language for data manipulation and analysis. If you have ever worked with Excel, pandas lets you work with data in a similar tabular format, using rows and columns.
df = pd.read_csv('https://raw.githubusercontent.com/Azariagmt/Implementing-an-entire-Data-Science-AI-solution/master/Data/Training_set_boston.csv?token=ANOIBQLX4ZBHXSUPYIWMMELANHH3O')
The pandas read_csv function loads data from comma-separated (and other delimited) text files, which covers most of the data files we’ll be working with. It can also load data directly from the URL where our CSV file is located, so let us load the data into our Colab environment, more specifically into the df variable.
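To get a feel for what read_csv does without depending on the download, here is a minimal sketch that parses a tiny CSV held in memory. The three rows are made-up illustrative values, not rows from the real dataset, and the names `csv_text` and `sample_df` are chosen just for this example.

```python
import io

import pandas as pd

# Made-up sample rows for illustration only; NOT real Boston housing data.
csv_text = """CRIM,RM,MEDV
0.1,6.5,24.0
0.2,6.0,21.6
0.3,7.1,34.7
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # the type pandas inferred for each column
```

The first line of the text becomes the column names, and every following line becomes one row of the DataFrame.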
The next step is to take a peek at our data and get a sense of what it looks like. Calling the head() method on our DataFrame, df.head(), gives us the first five rows of our dataset.
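As a small sketch of how head() behaves, the snippet below uses a hypothetical stand-in DataFrame (`demo_df`, filled with made-up values) so it runs on its own. In the notebook you would call the same method on df.

```python
import pandas as pd

# Hypothetical stand-in for df with made-up values, so the example is self-contained.
demo_df = pd.DataFrame({
    "RM": [6.5, 6.0, 7.1, 5.9, 6.3, 6.8, 5.5],
    "MEDV": [24.0, 21.6, 34.7, 19.9, 22.8, 28.1, 16.5],
})

first_rows = demo_df.head()   # the first five rows by default
first_two = demo_df.head(2)   # pass a number to change how many rows you get

print(first_rows)
```

In Colab, leaving df.head() as the last line of a cell displays the result as a formatted table, so the print call is optional there.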
If you’ve got the above output, great job! Each column refers to:
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per 10,000 USD
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)² where Bk is the proportion of blacks by town
LSTAT: lower status of the population (%)
MEDV: median value of owner-occupied homes in thousands of USD (target)
Handling missing values
Data is messy. The way it was collected might have made it a mess, and it might have a lot of missing values for different reasons. We as Data Scientists should know how to handle that missing Data. Luckily for us, pandas comes with two built-in methods we can combine to see how much data is missing from each column.
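A common way to do this, and presumably the pair meant here, is to chain isnull() and sum(). The sketch below uses a hypothetical DataFrame (`demo_df`) with made-up values and a few deliberately missing cells. On the real data the call would be df.isnull().sum().

```python
import numpy as np
import pandas as pd

# Hypothetical data with made-up values and some deliberately missing cells (NaN).
demo_df = pd.DataFrame({
    "RM": [6.5, np.nan, 7.1, 5.9],
    "AGE": [65.2, 78.9, np.nan, np.nan],
    "MEDV": [24.0, 21.6, 34.7, 19.9],
})

# isnull() marks every missing cell as True;
# sum() then counts the True values in each column.
missing_per_column = demo_df.isnull().sum()

print(missing_per_column)
```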
Calling the above methods on our DataFrame should return the number of null values in each column.
As you can see, this is a dataset with no missing values, so you’ve been spared for today. Most real-world data will not be this complete, though, and we’ll usually need to do some preprocessing to fill in those values.
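For a taste of what that preprocessing can look like, here is a minimal sketch that fills each missing cell with the median of its column. This is only one of several possible strategies and an assumption on my part, since our dataset does not need it. `demo_df` and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with made-up values and some missing cells (NaN).
demo_df = pd.DataFrame({
    "RM": [6.0, np.nan, 7.0, 5.0],
    "AGE": [60.0, 80.0, np.nan, np.nan],
})

# median() computes one median per column, ignoring missing cells;
# fillna() then uses each column's median to fill that column's gaps.
filled_df = demo_df.fillna(demo_df.median())

print(filled_df)
```

The median is less sensitive to extreme values than the mean, which is why it is a popular default for numeric columns.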
Gaining more insights
The next thing we want to do is view some basic statistical details of our continuous data, such as percentiles, mean, and standard deviation. The describe() method of pandas lets us do exactly that. But first, what are continuous values? Continuous data can take any value within a range, such as the average number of rooms per dwelling (RM). The other type is discrete data, which can only take certain values, such as CHAS, which is either 0 or 1.
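Here is a small sketch of describe() on a hypothetical DataFrame (`demo_df`, made-up values) with one continuous and one discrete column. On the real data you would simply call df.describe().

```python
import pandas as pd

# Hypothetical data with made-up values.
demo_df = pd.DataFrame({
    "RM": [5.0, 6.0, 7.0, 8.0],  # continuous: any value within a range
    "CHAS": [0, 1, 0, 0],        # discrete: only 0 or 1
})

# describe() returns count, mean, std, min, the 25%/50%/75% percentiles and max
# for every numeric column.
summary = demo_df.describe()
print(summary)

# Counting distinct values is a quick hint that a column is discrete.
print(demo_df["CHAS"].nunique())
```

Note that describe() treats CHAS as just another numeric column, so it is up to you to remember that a mean of 0.25 there really means "a quarter of the rows are 1".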
As you can see in the output, we get a range of statistical descriptions of our numeric data, including the quartiles found in the data.
“A picture is worth a thousand words.” You’ve heard it a ton of times, and that saying should be in your head whenever you are implementing a Data Science solution. You probably looked at the DataFrame above scratching your head, trying to see relationships and figure out what the heck is going on. When you show this data to other people, why should they go through the same hassle? Your job is to make the data easy to comprehend, so that it makes sense within seconds. That is where Matplotlib and other data visualization libraries come in.
We’ll dive deep into Data visualization and finish getting our data ready for our Machine Learning model in the next part of the series, which will be released on Friday, April 09. Stay tuned!