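To train a model, the data gets plugged into equations.
Equations are mathematical expressions, so every value in the data must be a number.
String data therefore has to be converted to numbers first.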
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
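LabelEncoder() : sorts the distinct string values and replaces each one with an integer, starting from 0, in sorted order
- When a categorical column has 3 or more categories, training on Label Encoded values tends to go poorly, because the integers suggest an order and magnitude that the categories do not actually have.
Handle categorical data with 3 or more categories using One Hot Encoding instead.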
encoder = LabelEncoder()
X['column'] = encoder.fit_transform(X['column'])  # 'column' stands for the name of the categorical column
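A minimal sketch of the sorted-order mapping described above. The DataFrame and the 'State' column name are made up for illustration and are not from the original data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; 'State' is an illustrative column name.
X = pd.DataFrame({'State': ['New York', 'California', 'Florida', 'California']})

encoder = LabelEncoder()
X['State'] = encoder.fit_transform(X['State'])

# classes_ holds the sorted distinct labels; a label's position is its code.
print(encoder.classes_.tolist())  # ['California', 'Florida', 'New York']
print(X['State'].tolist())        # [2, 0, 1, 0]
```

encoder.inverse_transform() maps the integers back to the original strings if they are needed later.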
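OneHotEncoder() : creates one new column per distinct string value, with the columns in sorted order of the values; each row gets 1 in the column for its own category and 0 in the others.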
>>> ct = ColumnTransformer([('encoder', OneHotEncoder(), [3])], remainder='passthrough')
# The one-hot encoded columns end up leftmost in the result, ahead of the columns kept by remainder='passthrough'.
>>> X = ct.fit_transform(X)
>>> X
array([[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.6534920e+05,
1.3689780e+05, 4.7178410e+05],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.6259770e+05,
1.5137759e+05, 4.4389853e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.5344151e+05,
1.0114555e+05, 4.0793454e+05],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.4437241e+05,
1.1867185e+05, 3.8319962e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.4210734e+05,
9.1391770e+04, 3.6616842e+05],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.3187690e+05,
9.9814710e+04, 3.6286136e+05],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.3461546e+05,
1.4719887e+05, 1.2771682e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.3029813e+05,
1.4553006e+05, 3.2387668e+05],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.2054252e+05,
1.4871895e+05, 3.1161329e+05],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.2333488e+05,
1.0867917e+05, 3.0498162e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.0191308e+05,
1.1059411e+05, 2.2916095e+05],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.0067196e+05,
9.1790610e+04, 2.4974455e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 9.3863750e+04,
1.2732038e+05, 2.4983944e+05],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 9.1992390e+04,
1.3549507e+05, 2.5266493e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.1994324e+05,
1.5654742e+05, 2.5651292e+05],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.1452361e+05,
1.2261684e+05, 2.6177623e+05],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 7.8013110e+04,
1.2159755e+05, 2.6434606e+05],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 9.4657160e+04,
1.4507758e+05, 2.8257431e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 9.1749160e+04,
1.1417579e+05, 2.9491957e+05],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 8.6419700e+04,
1.5351411e+05, 0.0000000e+00],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 7.6253860e+04,
1.1386730e+05, 2.9866447e+05],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 7.8389470e+04,
1.5377343e+05, 2.9973729e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 7.3994560e+04,
1.2278275e+05, 3.0331926e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 6.7532530e+04,
1.0575103e+05, 3.0476873e+05],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 7.7044010e+04,
9.9281340e+04, 1.4057481e+05],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 6.4664710e+04,
1.3955316e+05, 1.3796262e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 7.5328870e+04,
1.4413598e+05, 1.3405007e+05],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 7.2107600e+04,
1.2786455e+05, 3.5318381e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 6.6051520e+04,
1.8264556e+05, 1.1814820e+05],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 6.5605480e+04,
1.5303206e+05, 1.0713838e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 6.1994480e+04,
1.1564128e+05, 9.1131240e+04],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 6.1136380e+04,
1.5270192e+05, 8.8218230e+04],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 6.3408860e+04,
1.2921961e+05, 4.6085250e+04],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 5.5493950e+04,
1.0305749e+05, 2.1463481e+05],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 4.6426070e+04,
1.5769392e+05, 2.1079767e+05],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 4.6014020e+04,
8.5047440e+04, 2.0551764e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 2.8663760e+04,
1.2705621e+05, 2.0112682e+05],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 4.4069950e+04,
5.1283140e+04, 1.9702942e+05],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 2.0229590e+04,
6.5947930e+04, 1.8526510e+05],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 3.8558510e+04,
8.2982090e+04, 1.7499930e+05],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 2.8754330e+04,
1.1854605e+05, 1.7279567e+05],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 2.7892920e+04,
8.4710770e+04, 1.6447071e+05],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 2.3640930e+04,
9.6189630e+04, 1.4800111e+05],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.5505730e+04,
1.2738230e+05, 3.5534170e+04],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 2.2177740e+04,
1.5480614e+05, 2.8334720e+04],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.0002300e+03,
1.2415304e+05, 1.9039300e+03],
[0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.3154600e+03,
1.1581621e+05, 2.9711446e+05],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
1.3542692e+05, 0.0000000e+00],
[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 5.4205000e+02,
5.1743150e+04, 0.0000000e+00],
[1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
1.1698380e+05, 4.5173060e+04]])
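The same steps can be tried on a small input to see where the new columns land. This is only a sketch: the column names and category labels below are hypothetical, and sparse_threshold=0 is added here (it is not in the original code) so that the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: three numeric columns and one categorical
# column at position 3. All names and labels are illustrative.
X = pd.DataFrame({
    'RnD': [165349.20, 162597.70, 153441.51],
    'Admin': [136897.80, 151377.59, 101145.55],
    'Marketing': [471784.10, 443898.53, 407934.54],
    'State': ['New York', 'California', 'Florida'],
})

# One-hot encode column 3 and pass the other columns through unchanged.
# sparse_threshold=0 forces a dense NumPy array as the output.
ct = ColumnTransformer(
    [('encoder', OneHotEncoder(), [3])],
    remainder='passthrough',
    sparse_threshold=0,
)
X_encoded = ct.fit_transform(X)

# The first three columns are the one-hot columns, in sorted category
# order (California, Florida, New York); the numeric columns follow.
print(X_encoded.shape)             # (3, 6)
print(X_encoded[:, :3].tolist())   # [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```

Three categories turn into three columns, so the single string column at position 3 is replaced by columns 0 to 2 of the result.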