[Data Science] Random UnderSampling


Random UnderSampling

Random undersampling selects examples from the majority class at random and deletes them from the training data set. Majority-class instances are removed this way until the class distribution is more balanced.
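The idea can be sketched by hand with NumPy before turning to a library. The helper below, random_undersample, is a hypothetical name used only for illustration. It assumes the goal is full balance, so every class is cut down to the size of the smallest class. The toy arrays are made up for the example.

```python
import numpy as np


def random_undersample(X, y, random_state=None):
    """Randomly drop rows so that every class has as many rows as the smallest class."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the minority class

    kept = []
    for cls in classes:
        idx = np.flatnonzero(y == cls)
        # Sample without replacement, so no row is duplicated.
        kept.append(rng.choice(idx, size=n_keep, replace=False))

    # Sort the surviving indices to preserve the original row order.
    kept = np.sort(np.concatenate(kept))
    return X[kept], y[kept]


# Toy data: 8 majority rows (class 0) and 2 minority rows (class 1).
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_res, y_res = random_undersample(X, y, random_state=0)
print(np.bincount(y_res))  # [2 2]
```

All minority rows survive, and only the majority class loses data. That lost data is the main cost of the method.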

RandomUnderSampler

sampling_strategy

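Sampling information used to resample the data set. When given as a float, it is the desired ratio of the number of minority-class samples to the number of majority-class samples after resampling. That is, the ratio is α_us = N_m / N_rM, where N_m is the number of samples in the minority class and N_rM is the number of samples in the majority class after resampling. A float is only accepted for binary classification.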
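Rearranging the ratio gives the majority-class size left after resampling: N_rM = N_m / α_us. The pure-Python sketch below works through that arithmetic. The function name is hypothetical, and truncating the result to an integer is an assumption of this sketch, not a statement about any library's rounding.

```python
from collections import Counter


def majority_count_after_undersampling(y, ratio):
    """Return N_rM = N_m / ratio for a binary label list (illustrative helper)."""
    counts = Counter(y)
    n_minority = min(counts.values())
    n_majority = max(counts.values())

    # Assumption: truncate to an integer number of samples.
    target = int(n_minority / ratio)
    if target > n_majority:
        # Undersampling can only remove samples, never create them.
        raise ValueError("ratio would require more majority samples than exist")
    return target


y = [0] * 900 + [1] * 100  # 900 majority, 100 minority

print(majority_count_after_undersampling(y, 0.5))  # 200
print(majority_count_after_undersampling(y, 1.0))  # 100
```

With 100 minority samples, a ratio of 0.5 keeps 200 majority samples, and a ratio of 1.0 balances the classes fully.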

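sampling_strategy can also be given as one of the following strings.

'majority' : resample only the majority class
'not minority' : resample all classes except the minority class
'not majority' : resample all classes except the majority class
'all' : resample all classes
'auto' : equivalent to 'not minority'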

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from imblearn.under_sampling import RandomUnderSampler

# Generate a synthetic imbalanced data set (about 90% class 0, 10% class 1)
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Undersample the training split only; 0.5 is the minority/majority ratio after resampling
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

# Train on the resampled data
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_resampled, y_resampled)

# Evaluate on the untouched test split
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print(report)

https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html

