Handling Missing Values

yessen 2023. 12. 22. 16:52

1. Handling NaN values (removal, replacement, mean imputation)

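<SQL>

-- Removal: keep only rows where weight is present
SELECT * FROM table
WHERE weight IS NOT NULL

-- Replacement: substitute a fixed value (1) for NULL
SELECT COALESCE(weight, 1) AS weight
FROM table

-- Mean imputation: substitute the column average for NULL
SELECT COALESCE(weight, (SELECT AVG(weight) FROM table)) AS weight
FROM table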

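<R>

library(tidyr)

table %>% drop_na(weight)  # drop rows where weight is missing

or

na.omit(table)  # drop rows with a missing value in any column

table %>% replace_na(list(weight=1))  # replace missing weight with 1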
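
The three strategies above (removal, replacement with a fixed value, mean imputation) can be tried end to end with Python's built-in `sqlite3` module. This is only a minimal sketch: the table name `measurements` and its four rows are made up for illustration, and a different name is used because `table` is a reserved word in SQLite.

```python
import sqlite3

# Hypothetical in-memory table; the name and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER, weight REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 20.0), (4, None)],
)

# Removal: keep only rows whose weight is present.
dropped = conn.execute(
    "SELECT id, weight FROM measurements WHERE weight IS NOT NULL ORDER BY id"
).fetchall()

# Replacement: substitute a fixed value (1) for missing weights.
constant_filled = conn.execute(
    "SELECT id, COALESCE(weight, 1) AS weight FROM measurements ORDER BY id"
).fetchall()

# Mean imputation: AVG ignores NULLs, so the mean covers observed rows only.
mean_filled = conn.execute(
    "SELECT id, COALESCE(weight, (SELECT AVG(weight) FROM measurements)) AS weight "
    "FROM measurements ORDER BY id"
).fetchall()
conn.close()
```

Because `AVG` skips NULLs, the subquery returns the mean of the observed weights, which is what mean imputation needs.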

2. PMM (predictive mean matching)

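- Fit a regression model on the rows that have observed values -> compute the distributions of the coefficients and the error

-> draw new coefficients and a new error variance from those distributions -> compute predicted values with the regression model defined by the drawn coefficients and error variance

-> among the observed data, select the value closest to the predicted value as the imputed value

-> compute the coefficient and error distributions of the regression model re-fit on the imputed data -> repeat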
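
The steps above can be sketched in plain Python for a single numeric predictor. This is a simplified, single-pass illustration and not what `mice` or `fancyimpute` actually run: the function name `pmm_impute` and the sample data are hypothetical, the "repeat" step is omitted, and the donor is simply the single closest observed row as described above (as far as I know, `mice` instead draws at random from a small pool of close donors by default). Matching the drawn-model prediction for a missing row against fitted-model predictions for observed rows is also an assumption of this sketch.

```python
import math
import random

def pmm_impute(x, y, seed=0):
    """One PMM pass with a single predictor x; None in y marks a missing value."""
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(y) if v is not None]
    mis = [i for i, v in enumerate(y) if v is None]
    n = len(obs)

    # 1. Fit y = a + b * (x - x_bar) on the observed rows by least squares.
    x_bar = sum(x[i] for i in obs) / n
    y_bar = sum(y[i] for i in obs) / n
    sxx = sum((x[i] - x_bar) ** 2 for i in obs)
    sxy = sum((x[i] - x_bar) * (y[i] - y_bar) for i in obs)
    a_hat, b_hat = y_bar, sxy / sxx
    sse = sum((y[i] - a_hat - b_hat * (x[i] - x_bar)) ** 2 for i in obs)

    # 2. Draw a new error variance and new coefficients from their distributions.
    #    With x centered, the intercept and slope estimates are uncorrelated.
    chi2 = rng.gammavariate((n - 2) / 2, 2.0)  # chi-square draw, n - 2 d.o.f.
    sigma2 = sse / chi2
    a_star = a_hat + rng.gauss(0.0, math.sqrt(sigma2 / n))
    b_star = b_hat + rng.gauss(0.0, math.sqrt(sigma2 / sxx))

    # 3. Predict observed rows with the fitted model, missing rows with the drawn one.
    pred_obs = {i: a_hat + b_hat * (x[i] - x_bar) for i in obs}
    completed = list(y)
    for j in mis:
        pred_j = a_star + b_star * (x[j] - x_bar)
        # 4. Borrow the observed value whose prediction is closest.
        donor = min(obs, key=lambda i: abs(pred_obs[i] - pred_j))
        completed[j] = y[donor]
    return completed

# Illustrative data: two of the eight y values are missing.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, None, 4.1, 5.2, None, 6.8, 8.1]
completed = pmm_impute(x, y, seed=1)
```

Every imputed value is copied from an actually observed row, so PMM never produces values outside the range of the observed data. Calling the function with different seeds yields different completed datasets, which is the idea behind the `m` imputations in the library calls.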

<R>

library(mice)

table$type <- as.factor(table$type)
table$x <- table$x=='TRUE'

mice_tb <- mice(table, m=10, maxit=50, method='pmm', seed=1)

# Run 50 iterations to obtain 10 imputed datasets, stored in mice_tb

<Python>

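import numpy as np
import pandas as pd
from fancyimpute import MICE  # the MICE class ships only with older fancyimpute releases

# Treat the string 'None' as missing and set the column dtypes
table.replace('None', np.nan, inplace=True)
table['weight'] = table['weight'].astype('float64')
table['type'] = table['type'].astype('category')
table['x'] = table['x'].astype('category')

# One-hot encode the categorical columns, dropping the first level of each
dummy_x = pd.get_dummies(table[['type', 'x']], drop_first=True)
mice = MICE(n_imputations=10, n_burn_in=50, impute_type='pmm')
production_mice = mice.multiple_imputations(
	pd.concat([table[['weight']], dummy_x], axis=1))  # store the imputed values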

Connecting dots via Data
