Build Your Own Product Classifier — Part 1

Ricky H. Putra
5 min read · Nov 15, 2023

Business Problem. The significance of data analysis in the ever-growing, ever-changing grocery retail industry cannot be overstated. It is an essential instrument for deciphering consumer trends, and it helps firms make wise decisions that lead to success. But among the thousands of products stored in a stock-keeping unit (SKU) database, data analysts face a formidable obstacle: correctly classifying these goods. This work is frequently tedious and prone to human error, which can introduce mistakes and inefficiencies into important business insights. Thanks to developments in machine learning, you can now create your own product classifier to automate this classification. Data analysts can train models to identify a product's category from its name.

A stock-keeping unit (SKU) product database usually contains the names of the products sold by a grocery store along with other data. Product-based analysis often requires data analysts to categorize these products to get insights at a higher product level; for example, an ABC brand dairy drink can be categorized as Milk Beverage. Grocery shop owners usually tend to diversify their products and brands. To understand their business better, they need to know how diverse their current stock is and how it has been changing over time. What is the mix of low-margin versus high-margin products? For example, are they selling more cigarettes than snack foods? Are they spread too thin across lower-margin products? Business owners need to understand their top-selling products and their categories so that they know their product diversity, sourcing stability, overall sales, and margins well enough to sustain the business.

Grocery stores usually store their products in the database as free text, and they do not share a common format or standard, which makes it difficult to analyze data across different stores. In addition, typical mom-and-pop grocery stores do not have a category labeled for every product, which makes analysis at a higher product level even more difficult.

I will share how we solved this problem by building a deep learning model that generates product categories from product names.

Part 1: Data Understanding and Preparation

Part 2: Data Modelling (coming soon)

Data Understanding and Preparation

To build a product classifier, we are given a CSV text file containing a list of product names and their respective categories. SKU product names usually follow this format (although it is not standardized across products or businesses): product name + variant + unit + size.

Example rows from the CSV file (product name, category):

Alkaline AAA PACK 24x2pcs, Battery

Bear Brand CTN 30x189ml, Milk Beverage

Biore BW Whitening Scrub 250ml, Soap
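
As a quick illustration of this layout, here is a minimal sketch that reads the three sample rows into (name, category) pairs using Python's standard csv module. The read_products helper and the SAMPLE_CSV string are hypothetical names introduced only for this example, and the sketch assumes product names contain no commas.

```python
import csv
import io

# The three sample rows shown above, as they would appear in a CSV file.
SAMPLE_CSV = """\
Alkaline AAA PACK 24x2pcs, Battery
Bear Brand CTN 30x189ml, Milk Beverage
Biore BW Whitening Scrub 250ml, Soap
"""


def read_products(csv_text):
    """Return a list of (product_name, category) tuples from CSV text.

    Hypothetical helper: assumes two fields per row and no commas in names.
    """
    # skipinitialspace drops the space that follows each comma.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    # Rows with fewer than two fields (e.g. blank lines) are skipped.
    return [(row[0].strip(), row[1].strip()) for row in reader if len(row) >= 2]


products = read_products(SAMPLE_CSV)
for name, category in products:
    print(f"{name!r} -> {category!r}")
```

A real export would likely need quoting rules for names that contain commas, which this sketch does not handle.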

In my case, we use Google Sheets to store the product database and connect it to a Google Colab notebook using the Python google.colab and google.auth packages. Running the code below prompts you to log in to your Google account.

from google.colab import auth
auth.authenticate_user()

from google.auth import default
creds, _ = default()

Before we can read the data from Google Sheets, we need to open the worksheet in our Google Colab notebook environment. In this case, I prepared each store's data in a separate worksheet, so that I can choose which stores to include or exclude.

# Load data from Google Sheets
import gspread

gc = gspread.authorize(creds)
spreadsheet = gc.open('Product Database')  # open the spreadsheet by its title
worksheet = spreadsheet.worksheet('Store A')  # select one store's worksheet

# get_all_values returns a list of rows, each a list of cell values.
rows = worksheet.get_all_values()

# Load the rows into a pandas DataFrame
import numpy as np  # needed for np.nan below
import pandas as pd

raw_pd = pd.DataFrame.from_records(rows)
new_header = raw_pd.iloc[0]  # grab the first row for the header
raw_pd = raw_pd[1:]  # take the data less the header row
raw_pd.columns = new_header  # set the header row as the df header
raw_pd = raw_pd.iloc[:, 0:2]  # keep all rows and only the first two columns
raw_pd = raw_pd.replace(r'^\s*$', np.nan, regex=True)  # replace empty or whitespace-only fields with NaN
raw_pd = raw_pd.dropna(subset=['Product Category'])  # drop rows where Product Category is missing
raw_pd = raw_pd.drop_duplicates(subset=['name'])  # drop rows with duplicate product names
raw_pd['name'] = raw_pd['name'].str.lower()  # convert all product names to lower case
raw_pd.columns = ['name', 'product_category']  # rename columns
  1. gspread is a Python API for Google Sheets.
  2. google.auth is the Google authentication library for Python; it lets you authenticate to Google APIs.
  3. gc.open() returns a Spreadsheet object that contains all the worksheets in one Google Sheets document; to access a specific worksheet, call its worksheet(sheet name) method.
  4. worksheet.get_all_values() returns all values from a worksheet as a list of rows (a list of lists).
  5. pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.
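
To try the cleaning steps without a Google account, the sketch below runs the same sequence on a small in-memory list that stands in for the output of get_all_values(). The clean_rows helper, the sample rows, and the third "notes" column are assumptions made for illustration; only the 'name' and 'Product Category' headers come from the code above.

```python
import numpy as np
import pandas as pd


def clean_rows(rows):
    """Hypothetical helper: clean raw sheet rows whose first row is the header."""
    df = pd.DataFrame(rows[1:], columns=rows[0])
    df = df.iloc[:, 0:2]  # keep only the first two columns
    # Treat empty or whitespace-only cells as missing values.
    df = df.replace(r"^\s*$", np.nan, regex=True)
    df = df.dropna(subset=["Product Category"])  # drop unlabeled products
    df = df.drop_duplicates(subset=["name"])  # drop repeated product names
    df["name"] = df["name"].str.lower()
    df.columns = ["name", "product_category"]
    return df.reset_index(drop=True)


# Made-up rows standing in for worksheet.get_all_values().
sample_rows = [
    ["name", "Product Category", "notes"],
    ["Bear Brand CTN 30x189ml", "Milk Beverage", ""],
    ["Biore BW Whitening Scrub 250ml", "Soap", ""],
    ["Unlabeled Item", "  ", ""],  # blank category, should be dropped
    ["Bear Brand CTN 30x189ml", "Milk Beverage", ""],  # exact duplicate
]

cleaned = clean_rows(sample_rows)
print(cleaned)
```

The sketch keeps the same order of operations as the code above, so duplicates are removed before names are lower-cased; names that differ only in letter case would therefore survive as separate rows.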
raw_pd['product_category'] = raw_pd['product_category'].astype("category")  # convert to a categorical column
raw_pd['product_category_code'] = raw_pd['product_category'].cat.codes  # generate integer codes from the category names

# Create a temporary product_type_pd data frame to look up the category name from its code later
product_type_pd = raw_pd[['product_category', 'product_category_code']]
product_type_pd = product_type_pd.drop_duplicates()
product_type_pd = product_type_pd.sort_values(by=['product_category'])

processed_pd = raw_pd[['name', 'product_category_code']]
  1. We use the pandas .cat.codes accessor to generate integer codes from the category names, so that we can use them to train our model in Part 2.
  2. We create the temporary data frame product_type_pd so that we can look up product category names from their codes. We want our model to return category names, not codes, because the codes alone are meaningless to us.
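
The two notes above can be sketched on the three sample products from earlier. This is a minimal illustration, not the full category list: the code_to_name dictionary and the predicted_code value are hypothetical names introduced here to show how a code could be turned back into a category name.

```python
import pandas as pd

# Three sample products, already cleaned and lower-cased.
df = pd.DataFrame({
    "name": [
        "alkaline aaa pack 24x2pcs",
        "bear brand ctn 30x189ml",
        "biore bw whitening scrub 250ml",
    ],
    "product_category": ["Battery", "Milk Beverage", "Soap"],
})

# Convert to a categorical column and derive one integer code per category.
df["product_category"] = df["product_category"].astype("category")
df["product_category_code"] = df["product_category"].cat.codes

# Each code is the position of its category in cat.categories,
# so enumerating the categories yields a code -> name lookup.
code_to_name = dict(enumerate(df["product_category"].cat.categories))

predicted_code = 1  # hypothetical value, standing in for a model prediction
print(code_to_name[predicted_code])
```

Because pandas sorts the inferred categories, the codes depend on the set of categories present in the data; adding a store with new categories can shift them, so the lookup should be rebuilt together with the training data.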

The product_type_pd data frame now holds each product category alongside the numeric code used to train our model.

Grocery stores stand to gain significant advantages from implementing such a system. By automating the categorization process, they can streamline their operations and allocate resources more efficiently. Additionally, they can leverage the insights gained from product-based analysis to optimize inventory management, pricing strategies, and even personalized marketing campaigns.

In conclusion, building your own product classifier with machine learning empowers data analysts in the grocery industry to overcome the challenges posed by SKU databases. By automating the categorization process using advanced algorithms, businesses can unlock valuable insights that drive growth and success in an increasingly competitive market landscape.

Stay tuned for Part 2, in which I will share how to use this data to train our custom product classifier model. If you find this article useful, please support me by following me on Medium and giving it your claps. Thank you.

Ricky H. Putra

Leading digitization initiatives in AwanTunai focusing on strengthening Indonesia MSME businesses with technology. Software Dev | Automation | Data Science | AI