
A Beginner’s Guide to Structuring a Data Science Project’s Workflow


Finally, let’s display a correlation heatmap to visualize the strength of the relationships between the features in our dataset.

import seaborn as sns
import matplotlib.pyplot as plt

# compute pairwise correlations between the numeric features
# (on newer versions of pandas you may need iris_df.corr(numeric_only = True))
corr = iris_df.corr()

# plot the correlation heatmap
fig, ax = plt.subplots(figsize = (15, 6))
sns.heatmap(corr, annot = True, cmap = "coolwarm")
plt.show()
[Image: Correlation heatmap]

As we can see, the most correlated features are Petal_Length and Petal_Width, followed by Sepal_Length and Petal_Length.

Now that we have an idea of what the task is about, let’s go to the next step.

Preparation

The preparation step involves making the data ready for modeling. In many cases, the data collected will be a lot messier than this dataset.
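
The Iris dataset needs very little cleaning, but it is good practice to run a few quick sanity checks before transforming anything. The snippet below is a minimal sketch of such checks, assuming the data is still in the iris_df DataFrame used above.

# quick sanity checks before preprocessing
print(iris_df.isnull().sum())      # missing values per column
print(iris_df.duplicated().sum())  # number of fully duplicated rows
print(iris_df.dtypes)              # confirm each column has the expected type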

To prepare our dataset for modeling, let’s first label-encode the Species feature to convert it into numeric, machine-readable form.

from sklearn.preprocessing import LabelEncoder

# encode the string species labels as integers
le = LabelEncoder()
Species = le.fit_transform(iris_df['Species'])

We will be using a Support Vector Machine (SVM) classifier, and SVMs assume that the features they work with are in a standard range, usually either 0 to 1 or -1 to 1. So next, we will scale our features before feeding them to the model for training.

from sklearn.preprocessing import StandardScaler

# drop the Species column and scale the remaining features

iris_df = iris_df.drop(columns = ["Species"])

scaler = StandardScaler()

scaled = scaler.fit_transform(iris_df)

iris_df_1 = pd.DataFrame(scaled, index = iris_df.index, columns = iris_df.columns)

iris_df_1.head(2)

Modeling

We have explored the data to get a much better sense of the type of challenge we’re facing and prepared it for modeling. So it is time to train our model.

For training purposes, we’ll store Sepal_Length, Sepal_Width, Petal_Length, and Petal_Width in X, and our target column, Species, in y.

# Prepare the data for training

# X = feature values - all the columns except the target column

X = iris_df_1.copy()

# y = target values

y = Species

Next, we will use train_test_split to create our training and validation sets. The reason for creating a validation set is so we can test the trained model against data that is different from the training data.

30% of the data will be assigned to the validation set and held out from the training data. We will then configure an SVM classifier and train it on the X_train and y_train data.

from sklearn.model_selection import train_test_split

# Split the data into 70% training and 30% validation

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.30)


Now that we have split the data, it’s time to import the SVM classification model and train it on X_train and y_train.

from sklearn import svm

# Train the model

# linear-kernel SVM; probability = True enables class probability estimates
model = svm.SVC(C = 1, kernel = "linear", probability = True)

model.fit(X_train, y_train)
SVC(C=1, kernel="linear", probability=True)

Now, let’s pass the validation data to the trained model to predict the outcome.

y_pred = model.predict(X_val)
y_pred[:5]

array([1, 1, 0, 1, 0])
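
These predictions are the encoded class labels. If you would like to see the corresponding species names instead, you can invert the encoding with the le object fitted earlier (an optional quick check):

# map the encoded predictions back to the original species names
print(le.inverse_transform(y_pred[:5]))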

Evaluation

The next step in our workflow is to evaluate our model’s performance. To do this, we will use the validation set we held out during the Modeling step to check the accuracy, precision, recall, and F1-score.

from sklearn.metrics import accuracy_score

# Print the model's accuracy on the validation set
print("Accuracy: ", accuracy_score(y_val, y_pred) * 100)

Accuracy: 97.77777777777777

from sklearn.metrics import classification_report

# Check precision, recall, and F1-score
print(classification_report(y_val, y_pred))

Our model’s accuracy is 97.7%, which is great.

This will often not be the case in real-world projects; our dataset is very clean and well-behaved, which is why it gives such strong results.
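
As an optional extra check (not strictly required for this workflow), a confusion matrix shows exactly which species are being confused with one another. A minimal sketch using scikit-learn:

from sklearn.metrics import confusion_matrix

# rows correspond to actual classes, columns to predicted classes
print(confusion_matrix(y_val, y_pred))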

Deployment

We are now in the final step of our Data Science workflow: deploying the model. We will deploy it with Streamlit, an open-source Python framework for creating and sharing web apps and interactive dashboards for data science and machine learning projects.

First, we need to save the model, the label encoder, and the standard scaler. To do this, we pass these objects to Pickle’s dump() function, which serializes them into a byte stream that we can save as a file called model.pkl.

import pickle

# store the model, label encoder, and standard scaler in a single pickle file
dict_filename = "model.pkl"
data = {'le': le, 'model': model, 'scaler': scaler}
with open(dict_filename, 'wb') as file:
    pickle.dump(data, file)

Now that we have saved the model, the next thing to do is to serve it from a web app using Streamlit.

To do this, we will open a folder containing the dataset, the notebook we have been using, and the pickle file we just created (model.pkl) in an editor (e.g. Spyder), and create two files named web_app.py and predict_page.py.
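
The project folder might then look roughly like this (the dataset and notebook file names below are placeholders; use whatever yours are actually called):

project-folder/
├── iris.csv          # placeholder name for the dataset file
├── notebook.ipynb    # placeholder name for the analysis notebook
├── model.pkl         # saved model, label encoder, and scaler
├── web_app.py        # entry point that Streamlit runs
└── predict_page.py   # loads the pickle file and builds the prediction page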

Before we proceed, if you don’t have Streamlit installed, run the following command to set it up.

pip install streamlit

Next, in the predict_page.py script, we will load the saved model. To do this, we pass the pickle file to Pickle’s load() function, which deserializes it.

import pickle

def load_model():
    # deserialize the dictionary containing the model, label encoder, and scaler
    with open('model.pkl', "rb") as file:
        data = pickle.load(file)
    return data

Now we have our model, label encoder, and standard scaler stored in data, and we can easily extract them with the code below.

# extract model, label encoder, and standard scaler
data = load_model()
model = data['model']
le = data['le'] 
scaler = data['scaler']

We now have our model, label encoder, and standard scaler loaded. So, let’s start building the web app.

To do this, we will create a function containing Streamlit widgets.

import streamlit as st

def prediction_page():
    st.title("Iris Flower Species Prediction")
    st.write("""This app predicts the species of an Iris flower""")

To view the web app, we will go to our web_app.py file to import Streamlit and the function we created.

import streamlit as st
from predict_page import prediction_page

prediction_page()

Then we will open a terminal (e.g. Command Prompt), change the working directory to the folder containing the files, and run “streamlit run web_app.py”.
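
Concretely, the two commands look like this; the folder path is a placeholder, so substitute the path to your own project folder:

cd path\to\your\project\folder
streamlit run web_app.py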

If you followed these steps correctly, the output below will show in your browser.


Next, we will add sliders for the user to input values for Sepal Length, Sepal Width, Petal Length, and Petal Width. We do this by calling the slider method with a label text and a minimum and maximum value. These lines go inside the prediction_page function.

    st.write("""#### Input Parameters""")
    Sepal_Length = st.slider('Sepal Length', 8.0, 5.0)
    Sepal_Width = st.slider('Sepal Width', 5.0, 2.0)
    Petal_Length = st.slider('Petal Length', 7.0, 1.0)
    Petal_Width = st.slider('Petal Width', 3.0, 0.0)

After doing this, we will click save and go back to the browser to rerun.

Finally, let’s add a button to predict the Iris flower species. We will do this by calling the button method and assigning its result to a variable, as in the code below. These lines also go inside the prediction_page function and rely on numpy being imported at the top of the script (import numpy as np).

    ok = st.button('Check Flower Species')

    if ok:
        # collect the slider values into a single-row feature array
        Features = np.array([[Sepal_Length, Sepal_Width, Petal_Length, Petal_Width]])

        # apply the same scaling that was used during training
        Features = scaler.transform(Features)

        # predict the encoded label and map it back to the species name
        Prediction = model.predict(Features)
        st.subheader('The species of the iris flower is' + ' ' + le.classes_[Prediction[0]])

Let’s click save again, and go back to our browser to rerun. The app should look like the image below.

You can play around with the web app and try different input parameters. We are now done building the app, and it’s time to deploy it.
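
For reference, here is how the pieces above might fit together in predict_page.py. This is simply a sketch assembled from the snippets in this section, with no additional functionality:

import pickle
import numpy as np
import streamlit as st

def load_model():
    # deserialize the model, label encoder, and scaler saved earlier
    with open('model.pkl', 'rb') as file:
        data = pickle.load(file)
    return data

data = load_model()
model = data['model']
le = data['le']
scaler = data['scaler']

def prediction_page():
    st.title("Iris Flower Species Prediction")
    st.write("""This app predicts the species of an Iris flower""")

    st.write("""#### Input Parameters""")
    Sepal_Length = st.slider('Sepal Length', 5.0, 8.0)
    Sepal_Width = st.slider('Sepal Width', 2.0, 5.0)
    Petal_Length = st.slider('Petal Length', 1.0, 7.0)
    Petal_Width = st.slider('Petal Width', 0.0, 3.0)

    ok = st.button('Check Flower Species')
    if ok:
        Features = np.array([[Sepal_Length, Sepal_Width, Petal_Length, Petal_Width]])
        Features = scaler.transform(Features)
        Prediction = model.predict(Features)
        st.subheader('The species of the iris flower is' + ' ' + le.classes_[Prediction[0]])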

To deploy the app on Streamlit Sharing, we first need to upload all the files we have used to a repository on GitHub. Once that is done, we will go to https://share.streamlit.io/ and click on Continue with GitHub.


Before we click on the New App button, we need to add a requirements.txt to our GitHub repository for the app to build properly.
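
A requirements.txt is just a plain text file listing the packages the app needs, one per line. For this app, a minimal version might look like the following (these names reflect the libraries used in this article; exact pinned versions are optional and up to you):

streamlit
scikit-learn
numpy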


Fortunately, there is a package called pigar that can generate a requirements.txt file automatically by scanning the project’s source files, without needing a dedicated virtual environment. To install it, run the following:

pip install pigar

Then, change the working directory to the project’s folder and run the following to generate the requirements.txt file:

pigar

Now that the requirements.txt file is generated, we will upload it to the GitHub repository we created, then go back and click on the New App button.


Then we will select the repo we want to deploy from the dropdown, choose the appropriate branch, select the file that runs the app (web_app.py), and click Deploy.

In this article, we covered the following:

  • What a Data Science workflow is and why it is necessary to structure your Data Science project using a template.
  • The necessary steps to successfully structure a Data Science project’s workflow.
  • And a hands-on exercise using the famous Iris flower dataset from the UC Irvine Machine Learning Repository to build a multi-class classifier with an SVM and deploy it using Streamlit.

I hope you enjoyed reading and learned something new. Please feel free to comment if you need clarification or have any questions. Thank you.

