
[Data Preprocessing] Label Encoding and One Hot Encoding: LabelEncoder(), OneHotEncoder()

건휘맨 2024. 4. 12. 17:28

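Before a model can learn from data, the data has to be plugged into equations.

Equations are mathematical expressions, so every value must be numeric.
String data therefore has to be converted to numbers first.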

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

LabelEncoder() : sorts the distinct string values and replaces each one with an integer, starting from 0 in sorted order

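-  When a categorical feature has three or more categories, a model trained on Label Encoding often learns poorly, because the integer codes suggest an order and a distance between categories that do not actually exist.

 Use One Hot Encoding for categorical features with three or more categories.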

encoder = LabelEncoder()
# 'column' is a placeholder for the name of the categorical column in X
X['column'] = encoder.fit_transform(X['column'])
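
To make the sorted mapping concrete, here is a minimal sketch on toy data. The DataFrame and the column name 'State' are assumptions made up for illustration; they do not come from the dataset used in this post.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the column name 'State' is an assumption.
X = pd.DataFrame({'State': ['New York', 'California', 'Florida', 'New York']})

encoder = LabelEncoder()
X['State'] = encoder.fit_transform(X['State'])

# classes_ holds the sorted unique labels; each label's position is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)               # {'California': 0, 'Florida': 1, 'New York': 2}
print(X['State'].tolist())   # [2, 0, 1, 2]
```

The codes follow alphabetical order, not any meaningful ranking, which is why they can mislead a model when there are three or more categories.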

 

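OneHotEncoder() : creates one new column per category (categories in sorted order) and puts 1 in the column that matches the row's category and 0 in all the others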
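
As a small illustration of what OneHotEncoder produces on its own, the sketch below encodes a single toy column. The values are made up for illustration. By default the encoder returns a sparse matrix, so it is converted with toarray() for display.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column (2D, one feature), not the data from this post.
states = np.array([['New York'], ['California'], ['Florida'], ['New York']])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
onehot = encoder.fit_transform(states).toarray()

# categories_[0] lists the sorted categories; column i corresponds to category i.
print(encoder.categories_[0])
print(onehot)
```

Each row has exactly one 1, in the column of its category: 'New York' is the third category in sorted order, so the first and last rows are 0, 0, 1.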

>>> ct = ColumnTransformer([('encoder', OneHotEncoder(), [3])], remainder='passthrough')

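# The one hot encoded columns always come first (leftmost): ColumnTransformer outputs the transformed columns first and appends the passthrough columns after them.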

>>> X = ct.fit_transform(X)
>>> X
array([[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.6534920e+05,
        1.3689780e+05, 4.7178410e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.6259770e+05,
        1.5137759e+05, 4.4389853e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.5344151e+05,
        1.0114555e+05, 4.0793454e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.4437241e+05,
        1.1867185e+05, 3.8319962e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.4210734e+05,
        9.1391770e+04, 3.6616842e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.3187690e+05,
        9.9814710e+04, 3.6286136e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.3461546e+05,
        1.4719887e+05, 1.2771682e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.3029813e+05,
        1.4553006e+05, 3.2387668e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.2054252e+05,
        1.4871895e+05, 3.1161329e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.2333488e+05,
        1.0867917e+05, 3.0498162e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.0191308e+05,
        1.1059411e+05, 2.2916095e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.0067196e+05,
        9.1790610e+04, 2.4974455e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 9.3863750e+04,
        1.2732038e+05, 2.4983944e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 9.1992390e+04,
        1.3549507e+05, 2.5266493e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.1994324e+05,
        1.5654742e+05, 2.5651292e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.1452361e+05,
        1.2261684e+05, 2.6177623e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 7.8013110e+04,
        1.2159755e+05, 2.6434606e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 9.4657160e+04,
        1.4507758e+05, 2.8257431e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 9.1749160e+04,
        1.1417579e+05, 2.9491957e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 8.6419700e+04,
        1.5351411e+05, 0.0000000e+00],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 7.6253860e+04,
        1.1386730e+05, 2.9866447e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 7.8389470e+04,
        1.5377343e+05, 2.9973729e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 7.3994560e+04,
        1.2278275e+05, 3.0331926e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 6.7532530e+04,
        1.0575103e+05, 3.0476873e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 7.7044010e+04,
        9.9281340e+04, 1.4057481e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 6.4664710e+04,
        1.3955316e+05, 1.3796262e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 7.5328870e+04,
        1.4413598e+05, 1.3405007e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 7.2107600e+04,
        1.2786455e+05, 3.5318381e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 6.6051520e+04,
        1.8264556e+05, 1.1814820e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 6.5605480e+04,
        1.5303206e+05, 1.0713838e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 6.1994480e+04,
        1.1564128e+05, 9.1131240e+04],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 6.1136380e+04,
        1.5270192e+05, 8.8218230e+04],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 6.3408860e+04,
        1.2921961e+05, 4.6085250e+04],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 5.5493950e+04,
        1.0305749e+05, 2.1463481e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 4.6426070e+04,
        1.5769392e+05, 2.1079767e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 4.6014020e+04,
        8.5047440e+04, 2.0551764e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 2.8663760e+04,
        1.2705621e+05, 2.0112682e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 4.4069950e+04,
        5.1283140e+04, 1.9702942e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 2.0229590e+04,
        6.5947930e+04, 1.8526510e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 3.8558510e+04,
        8.2982090e+04, 1.7499930e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 2.8754330e+04,
        1.1854605e+05, 1.7279567e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 2.7892920e+04,
        8.4710770e+04, 1.6447071e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 2.3640930e+04,
        9.6189630e+04, 1.4800111e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.5505730e+04,
        1.2738230e+05, 3.5534170e+04],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 2.2177740e+04,
        1.5480614e+05, 2.8334720e+04],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.0002300e+03,
        1.2415304e+05, 1.9039300e+03],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.3154600e+03,
        1.1581621e+05, 2.9711446e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        1.3542692e+05, 0.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 5.4205000e+02,
        5.1743150e+04, 0.0000000e+00],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
        1.1698380e+05, 4.5173060e+04]])