0. Data Introduction & Package Imports

https://www.kaggle.com/datasets/rishikumarrajvansh/marketing-insights-for-e-commerce-company/data

This is e-commerce data containing online transaction records from 2019-01-01 through 2019-12-31.
It consists of five tables: CustomersData, Discount_Coupon, Marketing_Spend, Online_Sales, and Tax_amount.
There is no web log data that would allow user behavior to be tracked.

In [1]:

import pandas as pd
import seaborn as sns
import numpy as np
import os
import matplotlib.pyplot as plt
import pickle
import scipy.stats as st
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

1. Data Load

In [2]:

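'''
One year of data from an e-commerce company: customer list, coupon information, marketing spend, transactions, and tax rates.
'''

# NOTE: data_path must point to the directory that holds the downloaded Kaggle files.
# The original post does not show the actual path, so it is left unset here.
data_path = data_path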


# Customer Data: CustomerID, Gender, Location, Tenure_Months (tenure in months)
data_cus=pd.read_excel(os.path.join(data_path, "CustomersData.xlsx"))

# Coupon Data: Month, Product_Category, Coupon_Code, Discount_pct
data_cou=pd.read_csv(os.path.join(data_path, "Discount_Coupon.csv"))

# Marketing_Spend: Date, Offline_Spend, Online_Spend
data_ma=pd.read_csv(os.path.join(data_path, "Marketing_Spend.csv"))

# Online_Sales: CustomerID, Transaction_ID, Transaction_Date, Product_SKU,
#       Product_Description, Product_Category, Quantity, Avg_Price,
#       Delivery_Charges, Coupon_Status
data_on=pd.read_csv(os.path.join(data_path, "Online_Sales.csv"))

# Tax_amount: Product_Category, GST (tax rate)
data_tax=pd.read_excel(os.path.join(data_path, "Tax_amount.xlsx"))

2. EDA

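Explore the data to understand what each labeled column means.
For each column, check missing values, duplicates, dtypes, and outliers, and handle them so that the data is easy to analyze.
Data cleaning is not finished in a single pass; as the analysis progresses, earlier steps are revisited and adjusted where needed.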
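The checks listed above (missing values, duplicates, dtypes) repeat for every table, so they can be gathered into one small helper. The following is only a sketch: `summarize_frame` is a hypothetical name that does not appear in the original notebook, and the toy frame stands in for a real table.

```python
import pandas as pd

def summarize_frame(df: pd.DataFrame, key=None) -> dict:
    """Return basic data-quality counts for one table (hypothetical helper)."""
    return {
        "rows": len(df),
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        # duplicates on the key column only, if a key is given
        "duplicate_keys": int(df.duplicated(subset=key).sum()) if key else None,
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

# Toy data standing in for a real table (values are made up for illustration)
toy = pd.DataFrame({
    "CustomerID": [1, 2, 2, 4],
    "Gender": ["M", "F", "F", None],
})
summary = summarize_frame(toy, key="CustomerID")
print(summary)
```

Calling it as `summarize_frame(data_cus, key="CustomerID")` would give the same counts that the cells below print one by one.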

2-1. Customer data

In [3]:

# customer info: no nulls, no duplicates, 1,468 customers; assign category dtype
data_cus.info()
print(data_cus.isna().any())
data_cus[['Gender', 'Location']] = data_cus[['Gender', 'Location']].astype('category')

print("\nDuplicates by ID:", data_cus.duplicated(subset='CustomerID').sum(), "\n")
# Gender distribution: slightly more women
# pie chart helper
def plot_pie_chart(title, sizes, explode:list=[0,0], labels:list=['1', '2'], colors:list=['blue', 'red']):
    plt.title(title)
    plt.pie(sizes, explode=explode, labels=labels, colors=colors)
    plt.show()

plot_pie_chart('Distribution of Gender', [len(data_cus[data_cus['Gender']=='M']), len(data_cus[data_cus['Gender']=='F'])], labels=['M','F'])

# Customer distribution by tenure: no large differences across tenure lengths(?). How is tenure defined?
plt.figure(figsize=(4, 4))
sns.histplot(data_cus['Tenure_Months'])
plt.title('Distribution of Tenure_Months')
plt.show()

# Customer distribution by Location: Chicago, California >> New Jersey >> New York > Washington DC
plt.figure(figsize=(4, 4))
data_cus['Location'].value_counts().plot(kind='bar')
plt.title('Distribution of Location')
plt.xticks(rotation=90)
plt.show()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1468 entries, 0 to 1467
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   CustomerID     1468 non-null   int64 
 1   Gender         1468 non-null   object
 2   Location       1468 non-null   object
 3   Tenure_Months  1468 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 46.0+ KB
CustomerID       False
Gender           False
Location         False
Tenure_Months    False
dtype: bool

Duplicates by ID: 0

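The gender counts above are built with two separate filters, which means the label order has to be kept in sync by hand. As a hedged alternative, `value_counts` returns counts and labels together, and `normalize=True` gives shares directly. The toy frame below is made up for illustration and is not the real customer table.

```python
import pandas as pd

# Toy stand-in for data_cus (values are made up for illustration)
toy_cus = pd.DataFrame({"Gender": ["F", "M", "F", "F", "M"]})

counts = toy_cus["Gender"].value_counts()                # absolute counts per label
shares = toy_cus["Gender"].value_counts(normalize=True)  # fractions that sum to 1

print(counts.to_dict())
print(shares.round(2).to_dict())
```

Passing `counts.values` as the sizes and `list(counts.index)` as the labels to `plot_pie_chart` keeps each slice paired with its own label.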

2-2. Discount_Coupon

In [4]:

'''
Discount_Coupon
'''
# coupon info: add an ID column; no nulls, 204 rows; assign category dtype
data_cou.insert(0, 'CouponID', range(1, len(data_cou) + 1))
data_cou.info()
print(data_cou.isna().any())  # check the coupon table itself (the original checked data_cus here)
data_cou['Product_Category'] = data_cou['Product_Category'].astype('category')
print("\nDuplicates in Coupon table:", data_cou.duplicated().sum(), "\n")

# Month: convert names such as Jan, Feb to 1, 2; whether to convert to a date type later is still undecided
# (this relies on unique() returning the months in calendar order)
data_cou['Month'] = data_cou['Month'].replace(to_replace=data_cou['Month'].unique(), value=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

# Product_Category: coupons are issued evenly across categories; need to check whether these categories cover all categories
plt.figure(figsize=(4, 4))
coupon_cat_counts = data_cou.groupby('Product_Category')['Month'].count()
plt.bar(coupon_cat_counts.index.astype(str), coupon_cat_counts.values)  # labels and heights come from the same Series
plt.title('Distribution of \nProduct Category where coupon was used')
plt.xticks(rotation=90)
plt.show() 
print("Number of categories with coupons:", len(data_cou['Product_Category'].unique()))

# Discount_pct: [10, 20, 30]; coupons are evenly distributed across the three rates
plt.figure(figsize=(4, 4))
pct_counts = data_cou.groupby('Discount_pct')['Month'].count()
plt.bar(pct_counts.index, pct_counts.values, width=4)
plt.title('Distribution of Discount_pct')
plt.show()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   CouponID          204 non-null    int64 
 1   Month             204 non-null    object
 2   Product_Category  204 non-null    object
 3   Coupon_Code       204 non-null    object
 4   Discount_pct      204 non-null    int64 
dtypes: int64(2), object(3)
memory usage: 8.1+ KB
CouponID            False
Month               False
Product_Category    False
Coupon_Code         False
Discount_pct        False
dtype: bool

Duplicates in Coupon table: 0

Number of categories with coupons: 17
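The month conversion above pairs `unique()` with the list 1 to 12, so it is only correct if the labels first appear in calendar order. A sketch of a more defensive variant uses an explicit mapping instead. It assumes the column holds three-letter English abbreviations ("Jan", "Feb", ...), as the comment in the cell suggests; `MONTH_TO_NUM` and the toy Series are illustrative names, not part of the original notebook.

```python
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Explicit abbreviation -> month number mapping: {"Jan": 1, ..., "Dec": 12}
MONTH_TO_NUM = {abbr: num for num, abbr in enumerate(MONTHS, start=1)}

# Toy stand-in for data_cou['Month'], deliberately out of calendar order
toy_months = pd.Series(["Mar", "Jan", "Dec", "Jan"])
month_num = toy_months.map(MONTH_TO_NUM)

# Labels missing from the mapping become NaN, so fail loudly if any appear
assert month_num.notna().all(), "unexpected month label"
print(month_num.tolist())  # [3, 1, 12, 1]
```

Unlike the `unique()`-based replacement, this gives the same result regardless of row order and surfaces any unexpected label instead of silently misnumbering it.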

2-3. Marketing Spend

In [5]:

# marketing spend info: no nulls, 365 rows; assign datetime dtype
data_ma.info()
print(data_ma.isna().any())
data_ma['Date'] = pd.to_datetime(data_ma['Date'], format='%m/%d/%Y')
print("\nDuplicates in Marketing Spend table:", data_ma.duplicated(subset='Date').sum(), "\n")

print("\n Offline, Online describe") # offline marketing spend appears to be higher overall
print(data_ma[['Offline_Spend', 'Online_Spend']].describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           365 non-null    object 
 1   Offline_Spend  365 non-null    int64  
 2   Online_Spend   365 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 8.7+ KB
Date             False
Offline_Spend    False
Online_Spend     False
dtype: bool

Duplicates in Marketing Spend table: 0 


 Offline, Online describe
       Offline_Spend  Online_Spend
count     365.000000    365.000000
mean     2843.561644   1905.880740
std       952.292448    808.856853
min       500.000000    320.250000
25%      2500.000000   1258.600000
50%      3000.000000   1881.940000
75%      3500.000000   2435.120000
max      5000.000000   4556.930000
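The describe output shows a higher mean for offline spend than for online spend over the whole year. To see whether that holds month by month, the daily rows can be aggregated per calendar month. This is a sketch on made-up numbers; `toy_ma` and the `Offline_Share` column are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for data_ma (values are made up for illustration)
toy_ma = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-02-01", "2019-02-02"]),
    "Offline_Spend": [3000, 2500, 1000, 500],
    "Online_Spend": [1500.0, 2000.0, 1200.0, 900.0],
})

# Sum both channels per calendar month
monthly = (
    toy_ma.groupby(toy_ma["Date"].dt.to_period("M"))[["Offline_Spend", "Online_Spend"]]
    .sum()
)
# Share of total spend that went to offline channels in each month
monthly["Offline_Share"] = monthly["Offline_Spend"] / (
    monthly["Offline_Spend"] + monthly["Online_Spend"]
)
print(monthly)
```

A share above 0.5 in every month would support the reading that offline spend is consistently higher; a mix would mean the yearly mean hides month-to-month differences.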

2-4. Online_Sales

In [6]:

# Online_Sales info: 52,924 rows, no nulls; assign dtypes
data_on.info() 
print(data_on.isna().any())
data_on[['Product_Category', 'Product_SKU', 'Coupon_Status']] = data_on[['Product_Category', 'Product_SKU', 'Coupon_Status']].astype('category')
data_on['Transaction_Date'] = pd.to_datetime(data_on['Transaction_Date'], format='%m/%d/%Y')
print("\nDuplicates in Online_Sales table:", data_on.duplicated().sum(), "\n")
# Rows that match on CustomerID, Transaction_Date, Product_SKU, and Quantity do exist, but their Transaction_IDs differ, so they are treated as separate orders in this analysis.
print(data_on.describe())

# CustomerID: every registered customer appears to have made at least one purchase
print("\nTotal customers in Customer table:", len(data_cus['CustomerID']))
print("Unique customers with a purchase:", len(data_on['CustomerID'].drop_duplicates()), "\n")

# Product_Category: sales vary widely by category, and three leading categories stand out at a glance.
# There are more categories here than in the coupon table, so there is probably a reason coupons were issued only for some categories.
plt.figure(figsize=(4, 4))
sales_cat_counts = data_on.groupby('Product_Category')['CustomerID'].count()
plt.bar(sales_cat_counts.index.astype(str), sales_cat_counts.values)  # labels and heights come from the same Series so they stay aligned
plt.title('Distribution of Product Category')
plt.xticks(rotation=90)
plt.show() 
print("Total number of categories:", len(data_on['Product_Category'].unique()))

# Avg_Price: price before coupon. Prices cluster at the low end (median 16.99), with a long tail of expensive items (75th percentile 102.13, max 355.74).
plt.figure(figsize=(4, 4))
sns.histplot(data_on['Avg_Price'], kde=True, stat='density')  # histplot replaces the deprecated distplot
plt.title('Distribution of Avg_Price')
plt.show() 

# Delivery_Charges: mostly under 50 dollars, but some unusually large values exist. Need to check whether they are outliers.
plt.figure(figsize=(4, 4))
sns.histplot(data_on['Delivery_Charges'], kde=True, stat='density')
plt.title('Distribution of Delivery_Charges')
plt.show() 

# Quantity: mostly 5 or fewer, but outliers exist.
plt.figure(figsize=(4, 4))
sns.histplot(data_on['Quantity'], kde=True, stat='density')
plt.title('Distribution of Quantity')
plt.show()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52924 entries, 0 to 52923
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   CustomerID           52924 non-null  int64  
 1   Transaction_ID       52924 non-null  int64  
 2   Transaction_Date     52924 non-null  object 
 3   Product_SKU          52924 non-null  object 
 4   Product_Description  52924 non-null  object 
 5   Product_Category     52924 non-null  object 
 6   Quantity             52924 non-null  int64  
 7   Avg_Price            52924 non-null  float64
 8   Delivery_Charges     52924 non-null  float64
 9   Coupon_Status        52924 non-null  object 
dtypes: float64(2), int64(3), object(5)
memory usage: 4.0+ MB
CustomerID             False
Transaction_ID         False
Transaction_Date       False
Product_SKU            False
Product_Description    False
Product_Category       False
Quantity               False
Avg_Price              False
Delivery_Charges       False
Coupon_Status          False
dtype: bool

Duplicates in Online_Sales table: 0 

        CustomerID  Transaction_ID               Transaction_Date  \
count  52924.00000    52924.000000                          52924   
mean   15346.70981    32409.825675  2019-07-05 19:16:09.450532864   
min    12346.00000    16679.000000            2019-01-01 00:00:00   
25%    13869.00000    25384.000000            2019-04-12 00:00:00   
50%    15311.00000    32625.500000            2019-07-13 00:00:00   
75%    16996.25000    39126.250000            2019-09-27 00:00:00   
max    18283.00000    48497.000000            2019-12-31 00:00:00   
std     1766.55602     8648.668977                            NaN   

           Quantity     Avg_Price  Delivery_Charges  
count  52924.000000  52924.000000      52924.000000  
mean       4.497638     52.237646         10.517630  
min        1.000000      0.390000          0.000000  
25%        1.000000      5.700000          6.000000  
50%        1.000000     16.990000          6.000000  
75%        2.000000    102.130000          6.500000  
max      900.000000    355.740000        521.360000  
std       20.104711     64.006882         19.475613  

Total customers in Customer table: 1468
Unique customers with a purchase: 1468

Total number of categories: 20
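The comments in this cell note that Quantity and Delivery_Charges contain unusually large values that still need checking. One common way to list candidates is Tukey's IQR rule, sketched below on made-up numbers. This is an assumption about how the check could be done, not the method used in the original analysis, and `iqr_outlier_mask` is a hypothetical helper.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy stand-in for data_on['Quantity'] (values are made up for illustration)
toy_qty = pd.Series([1, 1, 1, 2, 2, 3, 900])
mask = iqr_outlier_mask(toy_qty)
print(toy_qty[mask].tolist())  # [900]
```

The rule only flags candidates. Whether a flagged order is a data error or a legitimate bulk purchase still has to be judged from the row itself, for example from its Product_Description and Delivery_Charges.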

2-5. Tax_amount

In [7]:

data_tax.info()
print(data_tax.isna().any())
data_tax['Product_Category'] = data_tax['Product_Category'].astype('category')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Product_Category  20 non-null     object 
 1   GST               20 non-null     float64
dtypes: float64(1), object(1)
memory usage: 452.0+ bytes
Product_Category    False
GST                 False
dtype: bool
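Online_Sales reports 20 categories, Tax_amount has 20 rows, and the coupon table covers 17 categories. Matching counts do not guarantee matching names, so a set comparison is a quick way to see which categories are missing where. The sketch below uses made-up category names; with the real tables the three Series would be the `Product_Category` columns of `data_on`, `data_cou`, and `data_tax`.

```python
import pandas as pd

# Toy stand-ins for the three tables (category names are made up for illustration)
toy_sales_cats = pd.Series(["Apparel", "Bags", "Office", "Apparel"])
toy_coupon_cats = pd.Series(["Apparel", "Office"])
toy_tax_cats = pd.Series(["Apparel", "Bags", "Office"])

sales_set = set(toy_sales_cats.unique())
coupon_set = set(toy_coupon_cats.unique())
tax_set = set(toy_tax_cats.unique())

# Categories that were sold but never had a coupon issued
no_coupon = sorted(sales_set - coupon_set)
# Categories that were sold but have no GST rate (should be empty for a clean join)
no_tax = sorted(sales_set - tax_set)

print(no_coupon)  # ['Bags']
print(no_tax)     # []
```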

3. Data Save

In [8]:

# Save with pickle to preserve the DataFrame dtypes.
data_cus.to_pickle("data_cus.pkl")
data_cou.to_pickle("data_cou.pkl")
data_ma.to_pickle("data_ma.pkl")
data_on.to_pickle("data_on.pkl")
data_tax.to_pickle("data_tax.pkl")
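The reason for choosing pickle here is that the category and datetime dtypes assigned during EDA survive a save and reload, whereas a plain CSV round trip loses them. A small sketch on a toy frame shows the difference; the frame and file name are made up for illustration.

```python
import io
import os
import tempfile

import pandas as pd

# Toy frame with the same kinds of dtypes assigned during EDA
toy = pd.DataFrame({
    "Product_Category": pd.Series(["Apparel", "Bags"], dtype="category"),
    "Transaction_Date": pd.to_datetime(["2019-01-01", "2019-01-02"]),
})

# Pickle round trip through a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(pkl_path)
    from_pickle = pd.read_pickle(pkl_path)

# CSV round trip through an in-memory buffer, with default read_csv settings
from_csv = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(from_pickle.dtypes.astype(str).to_dict())
print(from_csv.dtypes.astype(str).to_dict())
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code. For files shared with others, a typed format such as Parquet is a common alternative.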


[EDA] Marketing insights for E-commerce company

EDA

In this category, I use Kaggle's Marketing insights for E-commerce company dataset to analyze the data as if I were the company's data analyst, and consider which business strategies would give it an edge in the e-commerce industry.
