Automation of EDA & Text processing

14 min read · Apr 5, 2021

Exploring data and visualizing with Sentiment Analysis

In this article, I’m going to analyze the Women’s Clothing E-Commerce dataset, which contains numerical data and text reviews written by customers (available here).

The steps we are going to follow are listed below:

Data Description
Data Cleaning
Data Pre-Processing
Data Analysis
Data visualization
Data Modelling

Let’s start the fun with our first step: Data Description.

This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the variables:

Clothing ID: Integer Categorical variable that refers to the specific piece being reviewed.
Age: The reviewer’s age.
Title: The title of the review.
Review Text: The description of the product by the customer.
Rating: The rating the customer gave the product, from 1 (worst) to 5 (best).
Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
Positive Feedback Count: Positive integer documenting the number of other customers who found this review positive.
Division Name: Categorical name of the product’s high-level division.
Department Name: Categorical name of the product’s department.
Class Name: Categorical name of the product’s class.

So let's get some insights. How can we do this? No need to worry, we have the pandas library for it.

# Libraries for data manipulation and data exploration
import pandas as pd
import numpy as np

df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')
df.head()

# Dropping the 'Unnamed: 0' column, as it doesn't play any role in the data.
# drop() returns a new data frame, so the result is assigned back to df.
df = df.drop('Unnamed: 0', axis=1)

df.info() shows each column's data type and non-null count (and therefore how many null values each column has) and, last but not least, the number of rows and columns.

df.info()

The describe() function will help us with some basic statistical details:

1. Percentiles, mean, standard deviation, and quantile range of the data frame.

2. Skewness and outliers. We can also draw a boxplot to see the outliers more clearly.

How to find skewness from describe()?

Let’s take the feature Age: 50% of our data lies between ages 18 and 41, while the other half is spread over the much wider range from 41 up to the maximum age. So we can say that the Age feature is right-skewed.

df.describe()

df.describe(include='object')
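The skewness reading above can also be checked numerically. Below is a minimal sketch on a handful of made-up ages (not the real dataset): if the mean sits above the median and skew() is positive, the column leans to the right.

```python
import pandas as pd

# Hypothetical ages, chosen only to illustrate a long right tail
ages = pd.Series([18, 22, 25, 30, 35, 41, 41, 50, 60, 99])

summary = ages.describe()
mean_age = summary["mean"]
median_age = summary["50%"]   # the 50% row of describe() is the median
skewness = ages.skew()        # > 0 suggests right skew, < 0 left skew

print(f"mean={mean_age:.1f}, median={median_age:.1f}, skew={skewness:.2f}")
```

On the real data, the same idea would be df['Age'].skew() next to df['Age'].describe().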

Counting the unique values in each column. E.g., Age has 77 distinct values in the data.

df.nunique()

Finding the null values (NaN) in the data. As we can see, Title, Review Text, Division Name, Department Name, and Class Name have NaN values, so we will remove those rows.

df.isnull().sum()

Counting the unique and missing values and combining them into one table for better understanding.

unique_count = []
for x in df.columns:
    unique_count.append([x,len(df[x].unique()),df[x].isnull().sum()])
    
pd.DataFrame(unique_count, columns=["Column","Unique","Missing"]).set_index("Column")
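The same table can probably be built without the loop. Here is a small sketch on a toy frame (the values are made up and stand in for the review data), using nunique() and isnull().sum() directly:

```python
import pandas as pd

# Toy frame standing in for the review data (values are made up)
toy = pd.DataFrame({
    "Title": ["Nice", None, "Nice", "Bad"],
    "Rating": [5, 4, 5, 1],
})

summary = pd.DataFrame({
    # dropna=False counts NaN as its own value, like len(df[x].unique()) does
    "Unique": toy.nunique(dropna=False),
    "Missing": toy.isnull().sum(),
})
summary.index.name = "Column"
print(summary)
```

Both columns are aligned on the column names automatically, so no explicit set_index is needed.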

Dropping the null values and then checking the shape of the dataset. Before dropping null values, the dataset has 23486 rows and 10 columns; after dropping, we have 19662 rows and 10 columns.

df = df.dropna()
df.shape

Creating separate data frames for the numerical and the object columns makes it easier to decide which visualization to use for each.

df_numerical = df.select_dtypes(include=['int64'])
df_categorical = df.select_dtypes(include=['object'])
df_cat = df[['Division Name', 'Department Name', 'Class Name']]

print("Numerical col: ", df_numerical.columns)
print()
print("Object col: ", df_cat.columns)

Till now, we have explored several features, but we haven’t explored Title and Review Text. We will take a look at these two features later on. Let’s focus on the other columns for now.

Let’s check the correlation for each feature with the help of a heatmap.

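import matplotlib.pyplot as plt
import seaborn as sns

# annot=True writes the correlation value in each cell; only the numeric columns are correlated
sns.heatmap(df_numerical.corr(), annot=True)
plt.title('Heatmap for Whole Data', fontsize = 16, fontweight = 'bold')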

Interpret Correlation:

The corr() method calculates the pairwise correlation (Pearson, by default) between the columns in your data set.

Let’s take the correlation between Rating and Recommended IND: it is 0.79, which means they are highly correlated, as the value is close to 1.

In simple terms, clothing with higher ratings is more likely to be recommended.
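To see what that number stands for, here is a rough sketch that computes the Pearson correlation by hand and compares it with pandas. The ratings and flags are made up for illustration; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up ratings and recommendation flags, for illustration only
rating = pd.Series([5, 4, 1, 2, 5, 3, 1, 4])
recommended = pd.Series([1, 1, 0, 0, 1, 1, 0, 1])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are the deviations from the respective means
dx = rating - rating.mean()
dy = recommended - recommended.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = rating.corr(recommended)  # Pearson is the default method
print(round(r_manual, 3), round(r_pandas, 3))
```

The two values should agree, and a value near 1 means the two columns rise together.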

Using the pairplot function to draw the pairwise plots for easy analysis.

Let’s interpret the graph of Clothing ID and Positive Feedback Count: we can see that most of the clothing IDs are recommended by customers.

sns.pairplot(df, hue='Recommended IND')

# Import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Setting the theme for seaborn
sns.set_theme(style="darkgrid")
%matplotlib inline

Let’s do something fun: ready, set, go with some visualization.

Visualization will be divided into the format below:

1. Univariate Visualization

2. Bivariate Visualization

3. Trivariate Visualization

Univariate Visualization

How do we choose which chart is best for numerical or categorical data?

For categorical data: pie chart, bar chart

For numerical data: histogram, scatterplot

We will start with the target variable.

# Converting 0 & 1 to "Not Recommended" & "Recommended"
df['Recommended IND'] = df['Recommended IND'].replace({0: "Not Recommended", 1: "Recommended"})

plt.figure(figsize = (6,4))
x = df['Recommended IND'].value_counts()
# Take the labels from the counts so that each slice gets the matching label
labels = x.index
plt.pie(x = x,  labels = labels,
        autopct = '%.2f%%', 
        textprops = {'size' : 'x-large',
                   'fontweight' : 'bold'})
plt.title('Distribution of Recommended IND', fontsize = 14, fontweight = 'bold')
plt.legend(labels, loc="upper left", bbox_to_anchor = (1,1))
plt.tight_layout()
plt.show()

Interpretation

82% of the reviews recommend the product, which means most of the reviewed products were well received.

Univariate visualization for numerical data.

df.hist(bins=10, color='steelblue', edgecolor='black', linewidth=1.0,
           xlabelsize=8, ylabelsize=8, grid=False)    
plt.tight_layout(rect=(0, 0, 2, 2))

Interpretation:

Clothing ID: Looking at the graph, Clothing ID is heavily skewed.

Age: Age is skewed too, and it is surprising that 99-year-olds are giving ratings, so some reviewers may have entered a false age.

Rating: Some products have low ratings, but most of the data sits at a rating of 5.

Recommended IND: Most products are recommended by the customers.

Univariate Visualization for categorical data

plt.figure(figsize=[14,10])
n=1
for x in df_cat:
    plt.subplot(2,2,n)
    sns.countplot(x=df[x],data=df)
    sns.despine()
    plt.title("Distribution of {} ".format(x), fontsize=16, fontweight='bold')
    plt.xticks(rotation=55)
    n=n+1
plt.tight_layout()
plt.show()

Interpretation:

Division Name: General products are in higher demand than intimate products.

Department Name: The most frequently chosen products are Tops and Dresses.

Class Name: Dresses, Knits, and Blouses are the most popular.

Univariate distribution of clothing ID

plt.figure(figsize=(8,6))
ax = sns.countplot(x='Clothing ID', data = df, 
                   order = df['Clothing ID'].value_counts().index[:50])
plt.title(' Distribution of Top 50 Clothing ID ', fontsize = 16, fontweight = 'bold')
plt.xlabel('Clothing ID Number', fontsize = 13)
plt.xticks(rotation=90)
plt.ylabel('Count', fontsize = 13)
plt.tight_layout()
plt.show()

Interpretation:

Plotting the top 50 clothing IDs shows which items of clothing are most in demand.

Bivariate Visualization

Division, Department & Class Name Vs Recommended IND

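for x in df_cat:
    y = pd.crosstab(df[x], df['Recommended IND'])
    # Plot the raw counts. The original y.div(...) call discarded its result, so it had no effect;
    # to plot row-wise proportions instead, use y = y.div(y.sum(axis=1), axis=0)
    y.plot(kind='bar', stacked=True)
    plt.title("Distribution of {} vs Recommended IND".format(x), fontsize=16, fontweight='bold')
    plt.xticks(rotation=55)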

Interpretation:

Division Name: Most of the recommendations are for General products.

Department Name: Jackets are the only products with fewer recommendations than the other clothes.

Class Name: Interestingly, Trend clothes are only recommended; we don't have any negative reviews for them.
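If shares are preferred to raw counts, each row of the crosstab can be divided by its row total. A minimal sketch on a made-up toy frame follows; note that div() returns a new frame, so the result has to be kept, and that pd.crosstab can do the same thing with normalize="index".

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Division Name": ["General", "General", "General", "Intimates"],
    "Recommended IND": [1, 1, 0, 1],
})

counts = pd.crosstab(toy["Division Name"], toy["Recommended IND"])

# Divide every row by its own total; div() does not modify counts in place
shares_manual = counts.div(counts.sum(axis=1), axis=0)

# The built-in equivalent
shares_builtin = pd.crosstab(toy["Division Name"], toy["Recommended IND"],
                             normalize="index")
print(shares_builtin)
```

Calling shares_builtin.plot(kind='bar', stacked=True) would then draw bars that all reach 1.0, which makes the recommendation rate comparable across categories of different size.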

Age, Positive Feedback Count Vs Recommended IND

plt.figure(figsize=[17,10])
n=1
label = df[['Age', 'Positive Feedback Count']]
for x in label:
    plt.subplot(2,1,n)
    sns.countplot(x=df[x], hue='Recommended IND',data=df)
    sns.despine()
    plt.title("Frequency count of {} ".format(x), fontsize=16, fontweight='bold')
    plt.xticks(rotation=90)
    n=n+1
plt.tight_layout()
plt.show()

Interpretation:

The highest number of recommendations comes from 39-year-olds. The age range from 30 to 50 has the highest number of people recommending the product.

Positive Feedback Count doesn’t give much information.

Division, Department & Class Name Vs Rating

for x in df_cat:
    y = pd.crosstab(df[x],df['Rating'])
    y.div(y.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
    plt.title("Distribution of {} vs Rating ".format(x), fontsize=16, fontweight='bold')

Interpretation:

The lowest ratings, 1 and 2, are mostly given to General products, while most 5 ratings are given to Intimates.
Trend clothes have the lowest share of 5 ratings compared to the other departments.
Casual bottoms and chemises get only 4-star ratings and no other rating, which is strange.

Age Vs Rating

df1 = df.copy()
# Bin edges 0, 10, ..., 100, so that ages up to 99 fall into a bin
bins = np.arange(0, 110, 10)
df1['Age group'] = pd.cut(df1['Age'], bins)

# Count the reviews for each (Rating, Age group) pair
df1 = df1.groupby(['Rating', 'Age group']).size().reset_index(name='n')

ratings_count_df_pivot = pd.pivot_table(df1, index=["Age group"],
               values=["n"],
               columns=["Rating"],
               aggfunc="sum")

ratings_count_df_pivot.plot(kind = 'bar', stacked=True, fontsize = 14)
plt.title('Age vs Rating', fontweight='bold', fontsize = 16)
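One thing to watch with pd.cut is where the last bin edge lies. The sketch below uses three illustrative ages to show that edges stopping at 90 leave a 99-year-old without a bin, while edges going up to 100 do not.

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 41, 99])  # illustrative ages

# Edges 0, 10, ..., 90: the last bin is (80, 90], so 99 gets no bin (NaN)
narrow = pd.cut(ages, np.arange(0, 100, 10))

# Edges 0, 10, ..., 100: 99 now falls into (90, 100]
wide = pd.cut(ages, np.arange(0, 110, 10))

print(narrow.isna().tolist())
print(wide.astype(str).tolist())
```

Rows with a NaN age group silently drop out of a later groupby, so it is worth checking isna().sum() after binning.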

Division, Department & Class Name Vs Age

plt.figure(figsize=[8,10])
n=1
for x in df_cat:
    plt.subplot(3,1,n)
    sns.boxplot(x=df[x], y='Age',data=df)
    sns.despine()
    plt.title("Frequency count of {} Vs Age".format(x), fontsize=16, fontweight='bold')
    plt.xticks(rotation=55)
    n=n+1
plt.tight_layout()
plt.show()

Interpretation:

People in the age range of 30–40 buy products most frequently.

Tops are bought by people aged 60–70. Trendy clothes are not preferred by older people.

Swimsuits are bought even by people aged around 90, which is pretty rare.

Division, Department & Class Name Vs Division, Department & Class Name

f, ax = plt.subplots(1,2,figsize=(16, 2), sharey=True)
sns.heatmap(pd.crosstab(df['Division Name'], df["Department Name"]),
            annot=True, linewidths=.5, ax = ax[0],fmt='g', cmap="Blues",
                cbar_kws={'label': 'Count'})
ax[0].set_title('Division Name Count by Department Name (Count Distribution)', fontsize=16, fontweight='bold')

sns.heatmap(pd.crosstab(df['Division Name'], df["Department Name"], normalize=True).mul(100).round(0),
            annot=True, linewidths=.5, ax=ax[1],fmt='g', cmap="Blues",
                cbar_kws={'label': 'Percentage %'})
ax[1].set_title('Division Name Count by Department Name (Percentage Distribution)', fontsize=16, fontweight='bold')
ax[1].set_ylabel('')

plt.tight_layout(pad=0)
plt.show()

cbar_kws is used to label the colorbar.

normalize=True normalizes over all values, so each cell becomes its share of the grand total.
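To make the effect of normalize a bit more concrete, here is a small sketch on made-up rows. It compares normalize=True (every cell divided by the grand total) with normalize="columns" (every column sums to 100 after multiplying by 100).

```python
import pandas as pd

# Toy rows, made up for illustration
toy = pd.DataFrame({
    "Division Name": ["General", "General", "General Petite", "Intimates"],
    "Department Name": ["Tops", "Dresses", "Tops", "Intimate"],
})

# normalize=True (same as "all"): every cell is divided by the grand total
pct_all = pd.crosstab(toy["Division Name"], toy["Department Name"],
                      normalize=True).mul(100)

# normalize="columns": every column sums to 100 instead
pct_cols = pd.crosstab(toy["Division Name"], toy["Department Name"],
                       normalize="columns").mul(100)
print(pct_all)
print(pct_cols)
```

Which variant fits depends on the question: overall share of all reviews, or the split of divisions within one department.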

For more information, see the pandas.crosstab page (pandas 1.2.3 documentation) on the official pandas website, pandas.pydata.org.

Interpretation:

The General division dominates across the various categories within Department Name. There is also a notable overlap between General Petite and Department Name.

f, ax = plt.subplots(1,2,figsize=(16, 7), sharey=True)
sns.heatmap(pd.crosstab(df['Class Name'], df["Division Name"], normalize=True).mul(100).round(0),
            annot=True, linewidths=.5, ax = ax[0],fmt='g', cmap="Blues",
                cbar_kws={'label': 'Percentage %'})
ax[0].set_title('Class Name Count by Division Name (Percentage %)')

sns.heatmap(pd.crosstab(df['Class Name'], df["Department Name"], normalize=True).mul(100).round(0),
            annot=True, linewidths=.5, ax=ax[1],fmt='g', cmap="Blues",
                cbar_kws={'label': 'Percentage %'})
ax[1].set_title('Class Name Count by Department Name (Percentage %)')
ax[1].set_ylabel('')

plt.tight_layout(pad=0)
plt.show()

Interpretation:

We have seen that dresses are by far the most popular class, ahead of knits.

Up till now, we have seen basic manipulation and univariate and bivariate visualization for some insights. Keep it up!

This blog is going to be long, so if you want, take a break and rejoin for the rest of the procedure. I know it is sometimes hectic to do everything in one go, so take a break and have a KitKat.

So let’s jump to the very exciting part, which we have kept for last: my favorite part, NLP.

Let’s take a look at the features Title and Review Text.

pd.set_option('display.max_colwidth', 500)
df[['Title', 'Review Text']].head(3)

An expert in NLP can easily tell that the text needs to be cleaned by removing punctuation, stop words, etc.

Clean Text

# Import the nltk library for cleaning the text
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('all')  # large download; it already includes the other packages listed here
nltk.download('punkt')

# For stopword removal
from nltk.corpus import stopwords
# For word tokenizing
from nltk import word_tokenize
# For stemming
from nltk.stem import PorterStemmer
# For making the wordcloud
from wordcloud import WordCloud, STOPWORDS

The text needs to be cleaned, so we will start by converting all the text to lower case.

We make a cleaning function for the text and save the cleaned text in a new column of the data frame.

def clean_text(text):
    # Make lowercase
    text = text.apply(lambda x: " ".join(w.lower() for w in x.split()))

    # Remove special characters (anything outside printable ASCII)
    text = text.apply(lambda x: "".join(["" if ord(i) < 32 or ord(i) > 126 else i for i in x]))

    # Remove extra whitespace
    text = text.apply(lambda x: " ".join(w.strip() for w in x.split()))

    # Remove punctuation (regex=True is passed explicitly, since newer pandas
    # versions treat the pattern as a literal string by default)
    text = text.str.replace(r'[^\w\s]', '', regex=True)

    # Remove numbers
    text = text.str.replace(r'\d+', '', regex=True)

    # Convert to string
    text = text.astype(str)

    return text

# Applying the clean_text function to the data
df['Filtered Review Text'] = clean_text(df['Review Text'])
df['Filtered Review Text'].head(2)

Q. Why do we need to convert text to lower case?

Ans: Take an example like: A boy is having an apple. An Apple a day keeps the doctor away.

Here apple has the same meaning in both sentences, but one is in lower case and the other is capitalized. To avoid treating them as two different words, we convert the text to lower case.

Q. Removing special characters and numbers. Is it important?

Ans: We don’t have to remove numbers all the time; it depends on the use case. Here we don’t need them, which is why we remove the numbers and special characters.

We also remove unnecessary whitespace and punctuation for a better understanding of the data.
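The cleaning steps described above can be tried on a single sentence first. This is only a sketch with the standard re module; clean_review is a hypothetical helper name, not part of the notebook.

```python
import re

def clean_review(text):
    """Lowercase, then strip punctuation, digits and extra whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
    text = re.sub(r"\d+", "", text)       # drop digits
    return " ".join(text.split())         # collapse repeated whitespace

print(clean_review("An Apple a day, keeps 2 doctors away!"))
```

After cleaning, "Apple" and "apple" end up as the same token, which is the whole point of lowercasing.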

Q. Removing stop words. Why do we need to do it?

Ans: Stopwords are the most common words in any natural language. For analyzing text data and building NLP models, these stopwords might not add much value to the meaning of the document.

Stopwords are words like I, and, you, are, when, etc.

We do not always need to remove stopwords; it depends on the use case. In machine translation or text summarization, for example, removing them is not advisable. In our case, we are cleaning the text and finding out its sentiment (Positive, Negative, or Neutral).

# Removing stop words
stop = set(stopwords.words('english'))  # a set makes the membership test fast

df['Filtered Review Text'] = df['Filtered Review Text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
df['Filtered Review Text'][:2]
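To see the filtering step in isolation, here is a tiny sketch with a hand-picked stopword set. TOY_STOPWORDS and remove_stopwords are assumptions made for this example; the set is NOT the NLTK English list, which is much longer.

```python
# A tiny hand-picked stopword set, NOT the NLTK list
TOY_STOPWORDS = {"i", "and", "you", "are", "when", "the", "is", "this"}

def remove_stopwords(text, stop_words=TOY_STOPWORDS):
    """Keep only the words that are not in the stopword set."""
    return " ".join(w for w in text.split() if w not in stop_words)

print(remove_stopwords("i love this dress and the fit is perfect"))
```

The words that survive are the ones carrying the opinion, which is what the later sentiment and word-frequency steps care about.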

Now we are going to find the sentiment of the text. It will give us an idea of how Rating and Recommended IND are related to the reviews given by customers.

Sentiment Analysis

# Library for sentiment analysis
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Sentiment analysis
SIA = SentimentIntensityAnalyzer()
df["Review Text"] = df["Review Text"].astype(str)

# Applying the model and creating the score variables
df['Polarity Score']=df["Review Text"].apply(lambda x:SIA.polarity_scores(x)['compound'])
df['Neutral Score']=df["Review Text"].apply(lambda x:SIA.polarity_scores(x)['neu'])
df['Negative Score']=df["Review Text"].apply(lambda x:SIA.polarity_scores(x)['neg'])
df['Positive Score']=df["Review Text"].apply(lambda x:SIA.polarity_scores(x)['pos'])

# Converting the compound score (-1 to 1) into a Positive, Negative, or Neutral label
df['Sentiment']=''
df.loc[df['Polarity Score']>0,'Sentiment']='Positive'
df.loc[df['Polarity Score']==0,'Sentiment']='Neutral'
df.loc[df['Polarity Score']<0,'Sentiment']='Negative'
df[['Polarity Score', 'Neutral Score', 'Negative Score', 'Positive Score', 'Sentiment']][:3]

Explanation:

We used the NLTK SentimentIntensityAnalyzer (VADER) for sentiment analysis. Here, we classify the text into three classes: Positive, Neutral, and Negative, and the overall sentiment is stored in the Sentiment column.

Neutral/Negative/Positive Score: Indicates the strength of each class as a value between 0 and 1.
Polarity Score: The compound score, which combines the positive, neutral, and negative signals into a single value, where a positive number closer to 1 indicates positivity and a negative number closer to -1 indicates negativity.
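The labelling rule itself (above zero is Positive, exactly zero is Neutral, below zero is Negative) can be sketched without running the analyzer. The compound scores below are made up; np.select is just one possible way to express the three conditions in a single call.

```python
import numpy as np
import pandas as pd

# Made-up compound scores in [-1, 1]
scores = pd.Series([0.85, 0.0, -0.42])

labels = np.select(
    [scores > 0, scores < 0],     # conditions, checked in order
    ["Positive", "Negative"],     # label for each condition
    default="Neutral",            # everything else, i.e. a score of exactly 0
)
print(list(labels))
```

A stricter rule with a small dead zone around zero would be another reasonable choice, but that would be a change to the approach used in this post.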

Let’s do some analysis!!

sns.countplot(x='Sentiment', hue='Recommended IND', data=df)
plt.title("Distribution of Sentiment vs Recommended IND", fontsize=16, fontweight='bold')

sns.countplot(x='Sentiment', hue='Rating', data=df)
plt.title("Distribution of Sentiment vs Rating", fontsize=16, fontweight='bold')

sns.countplot(x='Rating', hue='Sentiment', data=df)
plt.title("Distribution of Rating vs Sentiment", fontsize=16, fontweight='bold')

Interpretation:

Most of the review texts have a positive sentiment and recommend the product.

Among the reviews with negative sentiment, rating 3 appears more often than the other ratings.

pd.crosstab(df['Sentiment'], df['Division Name']).plot(kind='bar')
plt.title('Distribution of Sentiment vs Division Name', fontweight='bold', fontsize=16)

pd.crosstab(df['Sentiment'], df['Department Name']).plot(kind='bar')
plt.title('Distribution of Sentiment vs Department Name', fontweight='bold', fontsize=16)

x = pd.crosstab(df['Class Name'], df['Sentiment'])
x.plot(kind='bar',stacked=True)
plt.title('Distribution of Sentiment vs Class Name', fontweight='bold', fontsize=16)

Dresses receive the most reviews with negative sentiment.

Wordcloud

# Library for wordcloud
from wordcloud import WordCloud, STOPWORDS

# Creating a function cloud
def cloud(text, stopwords=STOPWORDS):
    # Join the reviews into one string; str() on a large array would only give its shortened printout
    wordcloud = WordCloud(width=1600, height=800,
                          background_color='black',
                          stopwords=stopwords,
                         ).generate(" ".join(text))
    
    # Output Visualization
    fig = plt.figure(dpi=80, facecolor='k',edgecolor='k')
    plt.imshow(wordcloud,interpolation='bilinear')
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()

Most Frequent Words in Low Review Text for Class Name.

It helps us visualize the most frequent words in the low-rated comments.

print('Most Frequent Words in Low Rated Review Text for Class Name')
temp = df['Filtered Review Text'][df.Rating.astype(int) < 3]

# Modify the stopwords to exclude class types, such as "dress"
new_stop = set(STOPWORDS)
new_stop.update([x.lower() for x in list(df["Class Name"][df["Class Name"].notnull()].unique())])

# Cloud (pass new_stop so that the extended stopword list is actually used)
cloud(temp.values, stopwords=new_stop)

Most Frequent Words in High Rated Review Text for Class Name

print('Most Frequent Words in High Rated Review Text for Class Name')
temp = df['Filtered Review Text'][df.Rating.astype(int) >= 3]

# Modify the stopwords to exclude class types, such as "dress"
new_stop = set(STOPWORDS)
new_stop.update([x.lower() for x in list(df["Class Name"][df["Class Name"].notnull()].unique())]
                + ["dress", "petite", "skirt","shirt"])

# Cloud (pass new_stop so that the extended stopword list is actually used)
cloud(temp.values, stopwords=new_stop)

We can follow the same procedure that we used with Class Name for the low and high rated review text for Department Name and Division Name.

Visualizing Top n-gram

# Library
from sklearn.feature_extraction.text import CountVectorizer

# top_n_ngram function
def top_n_ngram(corpus, n=None, ngram=1):
    vec = CountVectorizer(stop_words = 'english',ngram_range=(ngram,ngram)).fit(corpus)
    # Have the count of  all the words for each review text
    bag_of_words = vec.transform(corpus) 
    # Calculates the count of all the word in the whole review text
    sum_words = bag_of_words.sum(axis =0) 
    words_freq = [(word,sum_words[0,idx]) for word,idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq,key = lambda x:x[1],reverse = True)
    return words_freq[:n]
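To get a feel for what is being counted, here is a rough pure-Python sketch of n-gram counting with collections.Counter. count_ngrams is a hypothetical helper, and it is not equivalent to CountVectorizer: this version splits on whitespace only and removes no stopwords.

```python
from collections import Counter

def count_ngrams(corpus, n=2, top=None):
    """Count word n-grams over all documents and return the most common ones."""
    counts = Counter()
    for doc in corpus:
        tokens = doc.lower().split()
        # Slide a window of length n over the tokens
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(top)

# Made-up reviews, for illustration only
reviews = ["love this dress", "love this top", "this dress runs small"]
print(count_ngrams(reviews, n=2, top=3))
```

With n=1 this gives unigram counts, with n=3 trigram counts, which mirrors the ngram argument of the function above.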

Top 20 Unigrams in Review Text

common_words = top_n_ngram(df['Filtered Review Text'], 20,1)
df1 = pd.DataFrame(common_words, columns = ['Review Text' , 'count'])
plt.figure(figsize =(10,4))
df1.groupby('Review Text').sum()['count'].sort_values(ascending=False).plot(
kind='bar', title='Top 20 unigrams in Filtered Review Text')

Top 20 Bigrams in Review Text

common_words = top_n_ngram(df['Filtered Review Text'], 20,2)
df2 = pd.DataFrame(common_words, columns = ['Filtered Review Text' , 'count'])
plt.figure(figsize =(10,5))
df2.groupby('Filtered Review Text').sum()['count'].sort_values(ascending=False).plot(
kind='bar', title='Top 20 bigrams in Filtered Review Text')

Top 20 Trigrams in Review Text

common_words = top_n_ngram(df['Filtered Review Text'], 20,3)
df2 = pd.DataFrame(common_words, columns = ['Filtered Review Text' , 'count'])
plt.figure(figsize =(10,5))
df2.groupby('Filtered Review Text').sum()['count'].sort_values(ascending=False).plot(
kind='bar', title='Top 20 trigrams in Filtered Review Text')

Top 20 Part-of-speech taggings

# Library for POS tagging
!pip install TextBlob
from textblob import TextBlob

# Join all reviews into one string; str() on a Series would only give its shortened printout
blob = TextBlob(' '.join(df['Filtered Review Text']))
pos = pd.DataFrame(blob.tags, columns=['word', 'pos'])
pos1 = pos['pos'].value_counts()[:20]
plt.figure(figsize = (10,5))
pos1.plot(kind='bar', title='Top 20 Part-of-speech taggings')

What is POS?

Part-of-speech (POS) tagging is the process of labeling each word in a sentence with its part of speech (noun, verb, adjective, and so on). The tags are useful, for example, when building lemmatizers, which reduce a word to its root form.

For more information, I would recommend reading these posts:

What is a POS tag, and why do we need this? (www.quora.com)

NLP Guide: Identifying Part of Speech Tags using Conditional Random Fields (medium.com)

So up until now we have completed:

Data cleaning
Data preprocessing
Data analysis (sentiment analysis)
Data visualization (uni/bi/tri-variate, word cloud)

In my next post, we will discuss data modelling and its steps.

If you find my post informative, do like it; comments are always welcome!

Published in Analytics Vidhya

Written by Kashish Rastogi