
[XGBoost] Wisconsin Breast Cancer Data (1)

by goatlab 2022. 10. 4.

Wisconsin Breast Cancer Data

 

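scikit-learn ships the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset, and a copy of the data can also be downloaded. Note that the CSV file used in this post has nine cytological features scored from 1 to 10, which is the layout of the original Wisconsin dataset rather than the 30-feature Diagnostic version bundled with scikit-learn.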
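For comparison, the bundled Diagnostic data can be loaded directly from scikit-learn. This is a minimal sketch and is not used in the rest of the post. Also note that the bundled version encodes the target the other way around from the recoding applied to the CSV later on (here 0 is malignant and 1 is benign).

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects instead of NumPy arrays
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

print(X.shape)                      # (n_samples, n_features)
print(list(data.target_names))      # class names, indexed by target value
print(y.value_counts().to_dict())   # samples per class
```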

 

Installing the libraries

 

conda install -c conda-forge xgboost
conda install -c conda-forge imbalanced-learn

 

Loading the breast-cancer-wisconsin dataset

 

Attachment: breast-cancer-wisconsin.data.csv (0.02 MB)

import pandas as pd

df = pd.read_csv('breast-cancer-wisconsin.data.csv', names=['id','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Outcome'])
df.head()

# Recode Outcome: 2 (benign) -> 0, 4 (malignant) -> 1
df.loc[df['Outcome']==2,'Outcome'] = 0
df.loc[df['Outcome']==4,'Outcome'] = 1

df.describe()

# Check the dtype of each column
df.dtypes
# Treat '?' as missing, drop those rows, then check the Outcome class counts
for key in df.keys():
    df.loc[df[key]=='?', key] = None
    
df.dropna(inplace=True)
df.reset_index(inplace=True, drop=True)

df['Outcome'].value_counts()
0.0    444
1.0    239
Name: Outcome, dtype: int64
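An alternative way to handle the '?' markers is to declare them as missing values when the file is read, so the affected column is parsed as numeric from the start. The sketch below is not the post's method. It uses three made-up rows in the same column layout (the ids and values are hypothetical, not real records) so that it runs without the CSV file.

```python
import io
import pandas as pd

columns = ['id', 'Clump Thickness', 'Uniformity of Cell Size',
           'Uniformity of Cell Shape', 'Marginal Adhesion',
           'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
           'Normal Nucleoli', 'Mitoses', 'Outcome']

# Hypothetical rows for illustration only; the third has a missing 'Bare Nuclei'
sample = io.StringIO(
    "1001,5,1,1,1,2,1,3,1,1,2\n"
    "1002,8,10,10,8,7,10,9,7,1,4\n"
    "1003,6,8,8,1,3,?,3,7,1,2\n"
)

# na_values='?' turns the marker into NaN while parsing
demo = pd.read_csv(sample, names=columns, na_values='?')
demo = demo.dropna().reset_index(drop=True)

# Recode the labels in one step: 2 (benign) -> 0, 4 (malignant) -> 1
demo['Outcome'] = demo['Outcome'].map({2: 0, 4: 1})

print(demo.dtypes['Bare Nuclei'])
print(demo['Outcome'].tolist())
```

With this approach 'Bare Nuclei' is numeric rather than a column of strings, which avoids relying on a later implicit conversion.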

 

Data preprocessing

 

# Split into training and test sets
features = df[df.keys().drop(['id','Outcome'])].values
outcome = df['Outcome'].values.reshape(-1,1)
from sklearn.model_selection import train_test_split

train_features, test_features, train_target, test_target = train_test_split(features, outcome, stratify=outcome, test_size=0.3)
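The effect of `stratify` can be checked on labels alone. The sketch below builds a label array with the same class counts as the cleaned data (444 and 239) and placeholder features, then compares the positive-class ratio across the splits. The `random_state` value is an assumption added for reproducibility; the post's split does not set one, so its exact rows vary between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the cleaned data
y = np.array([0] * 444 + [1] * 239)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, not real data

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    stratify=y,        # keep the class ratio the same in both splits
    test_size=0.3,
    random_state=42,   # assumption: fixed seed, not in the original post
)

print(len(y_train), len(y_test))
print(round(y.mean(), 3), round(y_train.mean(), 3), round(y_test.mean(), 3))
```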

 

Feature scaling

 

from sklearn.preprocessing import MinMaxScaler

feature_scaler = MinMaxScaler()
train_features_scaled = feature_scaler.fit_transform(train_features)
test_features_scaled = feature_scaler.transform(test_features)
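The scaler is fitted on the training split only and then reused on the test split, so the test data does not influence the learned minimum and maximum. A toy sketch with made-up numbers shows the consequence: test values outside the training range map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (made-up values)
train = np.array([[1.0], [5.0], [9.0]])
test = np.array([[10.0], [3.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=1, max=9 from train
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Here 10.0 lies above the training maximum, so it scales to a value greater than 1. With the default tree booster, monotonic rescaling of this kind generally should not change the splits XGBoost finds, so the step is likely optional for this model.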

 

Training with XGBClassifier

 

from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(train_features_scaled, train_target)
result = xgb.predict(test_features_scaled)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy :',accuracy_score(test_target, result))
print('Precision :',precision_score(test_target, result))
print('Recall :',recall_score(test_target, result))
print('F1 score :',f1_score(test_target, result))
Accuracy : 0.9707317073170731
Precision : 0.9342105263157895
Recall : 0.9861111111111112
F1 score : 0.9594594594594595
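The four scores can be traced back to confusion-matrix counts. The counts below are not printed in the post. They are inferred from the reported scores, assuming a 205-row test split (30% of 683 rows, rounded up): 71 true positives, 5 false positives, 1 false negative, and 128 true negatives reproduce the scores above.

```python
# Counts inferred from the reported scores (an assumption, not original output)
tp, fp, fn, tn = 71, 5, 1, 128

accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct / all
precision = tp / (tp + fp)                          # predicted positives that are right
recall = tp / (tp + fn)                             # actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print('Accuracy :', accuracy)
print('Precision :', precision)
print('Recall :', recall)
print('F1 score :', f1)
```

Read this way, the model missed one malignant case and flagged five benign cases as malignant on that particular split.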