DACON #Today's Python #One-a-Day Today's Python

Lv2. Predicting Ttareungi (Seoul public bike) data with missing-value interpolation and random forest

sososoy 2021. 11. 5. 17:45

Data download

Run the cell below to load the data into Colab.
You can run a cell with Ctrl + Enter.

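[1] # Download the data into Colab using the download link.

!wget 'https://bit.ly/3gLj0Q6'

import zipfile

# wget saves the file under the last part of the URL, so the archive is named '3gLj0Q6'.
with zipfile.ZipFile('3gLj0Q6', 'r') as existing_zip:
    existing_zip.extractall('data')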
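To illustrate the extraction step above, here is a minimal sketch that builds a tiny ZIP archive in memory and inspects it with the standard-library zipfile module. The file names and contents are made up for the example and are not the real dataset.

```python
import io
import zipfile

# Build a small in-memory archive (hypothetical stand-in for the downloaded file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as new_zip:
    new_zip.writestr('train.csv', 'id,count\n1,10\n')
    new_zip.writestr('test.csv', 'id\n2\n')

# Re-open the archive for reading and look at what it contains.
with zipfile.ZipFile(buffer, 'r') as existing_zip:
    names = existing_zip.namelist()
    first_line = existing_zip.read('train.csv').decode('utf-8').splitlines()[0]

print(names)
print(first_line)
```

Calling namelist() before extractall() is a quick way to confirm that the archive holds the files you expect.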

Importing libraries

[2] # Import libraries

# import [library] as [alias]

# pandas, NumPy

import pandas as pd

import numpy as np

Loading the files

To load a data file (CSV file) in Python, use the pandas library.
Import pandas first, then load the file with the pandas read_csv function.

[3]

#import pandas as pd

#data = pd.read_csv('path/to/file_name.csv')

train = pd.read_csv('data/train.csv')

test = pd.read_csv('data/test.csv')
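As a small self-contained illustration of read_csv, the sketch below parses CSV text from memory instead of a file on disk. The column names and values are hypothetical and only loosely resemble the real data.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk; the second row has no count.
csv_text = "id,hour,count\n1,8,120\n2,9,\n3,10,95\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # the empty field is read as NaN, so 'count' becomes a float column
```

Empty fields are read as NaN, which is why the missing-value checks in the EDA section work directly on the loaded frame.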

EDA

[4]

# Check the first 5 rows of the data

#train

train.head()

#test

test.head()

# Check the number of rows and columns - shape

print('train shape (rows, columns):', train.shape)

print('test shape (rows, columns):', test.shape)

# Check for missing values

train.isnull().sum()

test.isnull().sum()
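The missing-value check above can be tried on a toy frame. This is a minimal sketch with made-up column names and values; it also shows the share of missing rows per column, which isnull().mean() gives directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries.
toy = pd.DataFrame({
    'hour': [0, 1, 2, 3],
    'temperature': [20.5, np.nan, 19.0, np.nan],
    'humidity': [60.0, 62.0, np.nan, 65.0],
})

missing_count = toy.isnull().sum()   # number of NaN values per column
missing_ratio = toy.isnull().mean()  # fraction of NaN values per column

summary = pd.DataFrame({'missing': missing_count, 'ratio': missing_ratio})
print(summary)
```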

# Check the data info (column types and non-null counts)

train.info()

test.info()

# View summary statistics of the numeric columns

train.describe()

# Import the libraries needed for visualization

import matplotlib

import matplotlib.pyplot as plt

import seaborn as sns

# Render the minus sign correctly on plot axes

plt.rc('axes', unicode_minus=False)

# Hide warning messages that do not affect the analysis.

import warnings

warnings.filterwarnings('ignore')

sns.histplot(train['count'])

train.corr()

plt.figure(figsize = (12,12))

sns.heatmap(train.corr(),annot = True)
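Besides reading the heatmap, it can help to rank features by their correlation with the target. The sketch below does this on synthetic data; the columns and the relationship between hour and count are invented for the example, not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'count' depends strongly on 'hour', and 'noise' is unrelated.
rng = np.random.default_rng(0)
n = 200
hour = rng.integers(0, 24, size=n)
toy = pd.DataFrame({
    'hour': hour,
    'noise': rng.normal(size=n),
    'count': 5 * hour + rng.normal(scale=2.0, size=n),
})

# Correlation of every feature with the target, ranked by absolute value.
target_corr = toy.corr()['count'].drop('count')
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a feature that a tree-based model could still use.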

sns.barplot(x = 'hour', y = 'count', data = train)

Data preprocessing

Handling missing values

[17]

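# Drop rows that contain missing values.
# Note: the cells in this section are alternatives. Once rows are dropped, the
# later cells have nothing left to fill, so apply only one method to freshly loaded data.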

train.dropna(inplace = True)

[19]

# Replace missing values with a specific constant

train.fillna(0,inplace = True)

[20]

# Replace missing values with each column's mean (test is also filled with the train means)

train.fillna(train.mean(),inplace = True)

test.fillna(train.mean(),inplace = True)

[23]

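# Fill missing values by interpolation (linear by default).
# Note: only train is interpolated here, so test still needs its own fill.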

train.interpolate(inplace=True)
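The four approaches above behave quite differently, which is easiest to see side by side. Here is a minimal sketch on a made-up Series; each method is applied to the same original data rather than one after another.

```python
import numpy as np
import pandas as pd

# Hypothetical values with gaps.
s = pd.Series([10.0, np.nan, 30.0, np.nan, np.nan, 60.0])

dropped = s.dropna()              # removes the rows with NaN
zero_filled = s.fillna(0)         # constant fill
mean_filled = s.fillna(s.mean())  # mean of the observed values (100 / 3)
interpolated = s.interpolate()    # linear interpolation between neighbouring rows

print(len(dropped))
print(mean_filled.tolist())
print(interpolated.tolist())  # [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
```

Linear interpolation fills each gap from the rows just before and after it, so it is most meaningful when the row order carries information, such as time order.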

Transforming continuous variables

[24]

# Visualize the continuous variables.

for col in train.columns:
    plt.figure(figsize = (4,4))
    plt.title(col)
    sns.histplot(train[col])
    plt.show()

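# When a distribution is skewed, apply a log transform (np.log1p).
# Apply the same transform to train and test so the model sees the same scale in both.
# Run this cell only once; running it again would apply the log twice.

train['hour_bef_pm2.5'] = np.log1p(train['hour_bef_pm2.5'])

train['hour_bef_pm10'] = np.log1p(train['hour_bef_pm10'])

test['hour_bef_pm2.5'] = np.log1p(test['hour_bef_pm2.5'])

test['hour_bef_pm10'] = np.log1p(test['hour_bef_pm10'])

sns.histplot(train['hour_bef_pm2.5'])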
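To see why a log transform, rather than min-max scaling, is the tool for a skewed distribution, the sketch below compares the two on a made-up right-skewed Series. The values are hypothetical and only imitate a dust measurement with one extreme reading.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values with one large outlier.
pm = pd.Series([5.0, 8.0, 10.0, 12.0, 15.0, 20.0, 200.0])

log_pm = np.log1p(pm)       # log(1 + x): compresses the long right tail
restored = np.expm1(log_pm) # inverse transform, recovers the original values
minmax_pm = (pm - pm.min()) / (pm.max() - pm.min())  # rescales to [0, 1]

# Min-max scaling is linear, so the skewness stays the same; log1p reduces it.
print(pm.skew(), minmax_pm.skew(), log_pm.skew())
```

Min-max scaling only changes the range, so the shape of the histogram is unchanged, while log1p pulls the extreme value closer to the rest.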

Modeling

[25]

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

[26]

X_train = train.drop(['id', 'count'], axis = 1)

y_train = train['count']

X_test = test.drop('id', axis = 1)

[27]

from sklearn.model_selection import GridSearchCV

[28]

RandomForestRegressor()

[29]

param = {'min_samples_split': [30, 50, 70],
         'max_depth': [5, 6, 7],
         'n_estimators': [50, 150, 250]}

[30]

gs = GridSearchCV(estimator=model, param_grid=param, scoring = 'neg_mean_squared_error', cv = 3)

[31]

gs.fit(X_train, y_train)
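After fitting, the search object exposes the winning parameters and the cross-validated score. The sketch below shows how to read them, using synthetic data and a deliberately small grid so it runs quickly; the names toy_gs, X and y are hypothetical and the grid is not the one used above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: y depends mainly on the first feature.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(120, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=120)

# A small illustrative grid: 2 x 2 = 4 parameter combinations.
toy_param = {'max_depth': [2, 5], 'n_estimators': [10, 30]}
toy_gs = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=toy_param,
    scoring='neg_mean_squared_error',
    cv=3,
)
toy_gs.fit(X, y)

# best_score_ is the negated MSE, so flip the sign before taking the root.
best_rmse = np.sqrt(-toy_gs.best_score_)
print(toy_gs.best_params_)
print(best_rmse)
```

With the default refit=True, the search refits the best combination on all the training data, which is why calling predict on the fitted search object works directly.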

Creating the submission file

[32]

submission = pd.read_csv('data/submission.csv')

[33]

pred = gs.predict(X_test)

[34]

submission['count'] = pred

[35]

submission.to_csv('gridsearch.csv', index = False)
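Before uploading, a couple of quick checks can catch the most common mistakes, such as a length mismatch or NaN predictions. This is a minimal sketch with a hypothetical template and made-up predictions; it writes to an in-memory buffer instead of a file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the submission template and the model predictions.
toy_submission = pd.DataFrame({'id': [1, 2, 3], 'count': [0, 0, 0]})
toy_pred = np.array([12.5, 80.0, 33.2])

# Sanity checks: one prediction per row, and no missing predictions.
assert len(toy_pred) == len(toy_submission)
assert not np.isnan(toy_pred).any()

toy_submission['count'] = toy_pred

# index=False keeps the row index out of the file, so only 'id' and 'count' are written.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```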
