Preface
This is the third article in my machine learning series, and it will be the last one summarizing the housing rent prediction competition. The competition lasted a month, and my write-up has somehow taken another month. Stretching the point a little, machine learning itself will be a long road; future posts in this series will mostly be knowledge-sharing pieces, since I am still a beginner in this field, and I will share new things with you as I learn them.
The earlier articles covered the non-technical lessons from the competition, a first look at the dataset, and the feature engineering. Today I introduce the model used this time: XGBoost.
Introduction to the XGBoost Model
Online resources on XGBoost's underlying theory are scarce; most stay at the application level, and I have only learned a bit of the application side myself. For the theory, see this tutorial by Tianqi Chen:
https://xgboost.readthedocs.io/en/latest/tutorials/model.html
A brief overview:
XGBoost is a supervised learning model; the underlying model is essentially an ensemble of CART trees. Predicting with the ensemble means summing each tree's prediction score to obtain the final prediction. (The original post showed two figures here: a single CART tree and a small ensemble of CART trees used to judge whether a person likes computer games; the second figure illustrates that the ensemble's prediction is simply the sum of the individual trees' scores.)
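To make the idea concrete, here is a minimal sketch in plain Python (the two toy trees, their split rules, and the leaf scores are invented for illustration; they are not taken from the original figures):

```python
# A CART ensemble's prediction is just the sum of each tree's leaf score.
# Two toy trees hard-coded as functions; splits and scores are made up.

def tree_1(age):
    # First tree splits on age.
    return 2.0 if age < 15 else -1.0

def tree_2(uses_computer_daily):
    # Second tree splits on daily computer use.
    return 0.9 if uses_computer_daily else -0.9

def ensemble_predict(age, uses_computer_daily):
    # Final prediction = sum over all trees.
    return tree_1(age) + tree_2(uses_computer_daily)

print(ensemble_predict(age=10, uses_computer_daily=True))   # 2.9
print(ensemble_predict(age=40, uses_computer_daily=False))  # -1.9
```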
Parameter overview:
The official parameter documentation is here: https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters
Notes on the more important parameters:
The objective parameter (which learning task and loss to optimize):
- "reg:linear": linear regression.
- "reg:logistic": logistic regression.
- "binary:logistic": logistic regression for binary classification; outputs probabilities.
- "binary:logitraw": logistic regression for binary classification; outputs the raw score w^T x before the logistic transformation.
- "count:poisson": Poisson regression for count data; outputs the mean of the Poisson distribution. In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
- "multi:softmax": multiclass classification with the softmax objective; also requires setting num_class (the number of classes).
- "multi:softprob": same as softmax, but outputs a vector of ndata * nclass values, which can be reshaped into an ndata-row, nclass-column matrix; each row holds the predicted probability of that sample belonging to each class.
Regularization and tree parameters:
- lambda [default=0]: penalty coefficient of the L2 regularization term.
- alpha [default=0]: penalty coefficient of the L1 regularization term.
- lambda_bias: L2 regularization on the bias term, default 0. (There is no L1 regularization on the bias, because the bias is not important under L1.)
- eta [default=0.3]: the shrinkage step size used in updates to prevent overfitting. After each boosting step the weights of newly added features are obtained directly; eta shrinks these weights to make the boosting process more conservative. Range: [0,1].
- max_depth [default=6]: the maximum depth of a tree. Range: [1,∞].
- min_child_weight [default=1]: the minimum sum of instance weights in a child node. If the instance-weight sum of a leaf node is smaller than min_child_weight, the splitting process stops. In linear regression mode, this simply corresponds to the minimum number of samples each node must contain. The larger the value, the more conservative the algorithm. Range: [0,∞].
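As a small, self-contained sketch of the multi:* objectives and the required num_class setting described above (the synthetic data and all parameter values are made up for illustration; this is not from the competition code):

```python
import numpy as np
import xgboost as xgb

# Toy 3-class problem: 100 samples, 4 features (synthetic data).
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = rng.randint(0, 3, size=100)

dtrain = xgb.DMatrix(X, label=y)
params = {
    'objective': 'multi:softprob',  # per-class probabilities; 'multi:softmax' outputs class labels
    'num_class': 3,                 # required by the multi:* objectives
    'max_depth': 4,
    'eta': 0.3,
}
booster = xgb.train(params, dtrain, num_boost_round=10)

proba = booster.predict(dtrain)
# (100, 3) with recent xgboost; older versions return a flat
# ndata * nclass vector that you reshape yourself.
print(proba.shape)
```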
A code example of XGBoost parameter settings:
```python
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=1729)
print(X_train.shape, X_test.shape)

# Model parameter settings
xlf = xgb.XGBRegressor(max_depth=10,
                       learning_rate=0.1,
                       n_estimators=10,
                       silent=True,            # deprecated in newer xgboost (use verbosity)
                       objective='reg:linear',
                       nthread=-1,
                       gamma=0,
                       min_child_weight=1,
                       max_delta_step=0,
                       subsample=0.85,
                       colsample_bytree=0.7,
                       colsample_bylevel=1,
                       reg_alpha=0,
                       reg_lambda=1,
                       scale_pos_weight=1,
                       seed=1440,
                       missing=None)

xlf.fit(X_train, y_train, eval_metric='rmse', verbose=True,
        eval_set=[(X_test, y_test)], early_stopping_rounds=100)

# Score / predict
preds = xlf.predict(X_test)
```
Competition Code
This was only a brief introduction to XGBoost; I mostly understand the application layer, not much of the theory. The XGB code analyzed below is the first-place team's open-sourced code.
Importing the dataset
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

train_data = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df = train_data[train_data.loc[:, 'Time'] < 3]
val_df = train_data[train_data.loc[:, 'Time'] == 3]

del train_data
```
The score of XGB with default parameters serves as the baseline; any model that performs worse than this 2.554 baseline is ruled out.
```python
def xgb_eval(train_df, val_df):
    train_df = train_df.copy()
    val_df = val_df.copy()

    # Label-encode the categorical room-direction column.
    try:
        from sklearn.preprocessing import LabelEncoder
        lb_encoder = LabelEncoder()
        # pd.concat replaces Series.append, which was removed in pandas 2.0
        lb_encoder.fit(pd.concat([train_df.loc[:, 'RoomDir'], val_df.loc[:, 'RoomDir']]))
        train_df.loc[:, 'RoomDir'] = lb_encoder.transform(train_df.loc[:, 'RoomDir'])
        val_df.loc[:, 'RoomDir'] = lb_encoder.transform(val_df.loc[:, 'RoomDir'])
    except Exception as e:
        print(e)

    import xgboost as xgb
    X_train = train_df.drop(['Rental'], axis=1)
    Y_train = train_df.loc[:, 'Rental'].values
    X_val = val_df.drop(['Rental'], axis=1)
    Y_val = val_df.loc[:, 'Rental'].values

    from sklearn.metrics import mean_squared_error

    try:
        eval_df = val_df.copy().drop('Time', axis=1)
    except Exception as e:
        eval_df = val_df.copy()

    reg_model = xgb.XGBRegressor(max_depth=5, n_estimators=500, n_jobs=-1)
    reg_model.fit(X_train, Y_train)

    y_pred = reg_model.predict(X_val)
    print(np.sqrt(mean_squared_error(Y_val, y_pred)), end='')

    # Residual of each validation sample.
    eval_df.loc[:, 'Y_pred'] = y_pred
    eval_df.loc[:, 'RE'] = eval_df.loc[:, 'Y_pred'] - eval_df.loc[:, 'Rental']

    print('')
    feature = X_train.columns
    fe_im = reg_model.feature_importances_
    print(pd.DataFrame({'fe': feature, 'im': fe_im}).sort_values(by='im', ascending=False))

    # Plot residuals against the true rental value.
    import matplotlib.pyplot as plt
    plt.clf()
    plt.figure(figsize=(15, 4))
    plt.plot([Y_train.min(), Y_train.max()], [0, 0], color='red')
    plt.scatter(x=eval_df.loc[:, 'Rental'], y=eval_df.loc[:, 'RE'])
    plt.show()

    return eval_df
```
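A quick usage sketch, evaluating the default-parameter baseline on the month-3 validation split created above:

```python
# Run the evaluation helper, then inspect the largest residuals.
eval_df = xgb_eval(train_df=train_df, val_df=val_df)
print(eval_df.loc[:, 'RE'].abs().sort_values(ascending=False).head())
```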
Trying to drop raw features
Using XGB as a filter on the raw features, the features whose removal leaves the score unchanged or even improves it are: Time, RentRoom (clear gain), RoomDir, Livingroom, RentType (clear gain), SubwayLine (clear gain), and SubwayDis (clear gain).
```python
# Scores after dropping each feature:
# 'Time': 2.558, 'Neighborhood': 2.592, 'RentRoom': 2.531, 'Height': 2.57, 'TolHeight': 2.591, 'RoomArea': 3
# 'RoomDir': 2.548, 'RentStatus': 2.561, 'Bedroom': 2.584, 'Livingroom': 2.548, 'Bathroom': 2.590, 'RentType': 2.538
# 'Region': 2.583, 'BusLoc': 2.594, 'SubwayLine': 2.521, 'SubwaySta': 2.569, 'SubwayDis': 2.537, 'RemodCond': 2.571
for col in train_df.columns:
    if col != 'Rental':
        print('drop col: {}'.format(col))
        tmp_train_df = train_df.drop([col], axis=1)
        tmp_val_df = val_df.drop([col], axis=1)
        eval_df = xgb_eval(train_df=tmp_train_df, val_df=tmp_val_df)

# Dropping them all together: 2.553
tmp_train_df = train_df.copy()
tmp_val_df = val_df.copy()
tmp_train_df.drop(['Time', 'RentRoom', 'RoomDir', 'Livingroom', 'RentType', 'SubwayLine', 'SubwayDis'],
                  axis=1, inplace=True)
tmp_val_df.drop(['Time', 'RentRoom', 'RoomDir', 'Livingroom', 'RentType', 'SubwayLine', 'SubwayDis'],
                axis=1, inplace=True)
eval_df = xgb_eval(train_df=tmp_train_df, val_df=tmp_val_df)
```
Feature selection
Throwing in all of the engineered features at once performed poorly, so features were added one at a time with a greedy strategy (forward selection, backward selection); a sketch of the forward-selection loop follows the code block below.
```python
train_data = pd.read_csv('train.csv')
train_df = train_data[train_data.loc[:, 'Time'] < 3]
val_df = train_data[train_data.loc[:, 'Time'] == 3]

drop_cols = ['SubwayLine', 'RentRoom', 'Time']        # raw features to drop

comb_train_df = train_df.copy()
comb_val_df = val_df.copy()

# The forward feature selection was a brute-force search with a for loop;
# the final selected feature combination is:
# ['ab_Height', 'TolRooms', 'Area/Room', 'BusLoc_rank', 'SubwayLine_rank']

comb_train_df.loc[:, 'ab_Height'] = comb_train_df.loc[:, 'Height'] / (comb_train_df.loc[:, 'TolHeight'] + 1)
comb_val_df.loc[:, 'ab_Height'] = comb_val_df.loc[:, 'Height'] / (comb_val_df.loc[:, 'TolHeight'] + 1)

comb_train_df.loc[:, 'TolRooms'] = comb_train_df.loc[:, 'Livingroom'] + comb_train_df.loc[:, 'Bedroom'] + comb_train_df.loc[:, 'Bathroom']
comb_val_df.loc[:, 'TolRooms'] = comb_val_df.loc[:, 'Livingroom'] + comb_val_df.loc[:, 'Bedroom'] + comb_val_df.loc[:, 'Bathroom']
comb_train_df.loc[:, 'Area/Room'] = comb_train_df.loc[:, 'RoomArea'] / (comb_train_df.loc[:, 'TolRooms'] + 1)
comb_val_df.loc[:, 'Area/Room'] = comb_val_df.loc[:, 'RoomArea'] / (comb_val_df.loc[:, 'TolRooms'] + 1)

rank_cols = ['BusLoc', 'SubwayLine']
for col in rank_cols:
    # Rank each category by its mean rental and use the rank as a feature.
    rank_df = train_df.loc[:, [col, 'Rental']].groupby(col, as_index=False).mean().sort_values(by='Rental').reset_index(drop=True)
    rank_df.loc[:, col + '_rank'] = rank_df.index + 1        # +1 reserves rank 0 for missing values
    rank_fe_df = rank_df.drop(['Rental'], axis=1)
    comb_train_df = comb_train_df.merge(rank_fe_df, how='left', on=col)
    comb_val_df = comb_val_df.merge(rank_fe_df, how='left', on=col)
    try:
        comb_train_df.drop([col], axis=1, inplace=True)
        comb_val_df.drop([col], axis=1, inplace=True)
    except Exception as e:
        print(e)
for drop_col in drop_cols:
    try:
        comb_train_df.drop(drop_col, axis=1, inplace=True)
        comb_val_df.drop(drop_col, axis=1, inplace=True)
    except Exception as e:
        pass

# With the greedily added features the score is now 2.403
eval_df = xgb_eval(train_df=comb_train_df, val_df=comb_val_df)
```
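The brute-force loop itself is not part of the open-sourced code; a minimal sketch of greedy forward selection as described (candidate_features, the add_features helper that attaches a list of engineered features to a frame, and the score_fn callback are hypothetical names used for illustration) might look like:

```python
def greedy_forward_selection(base_train, base_val, candidate_features,
                             add_features, score_fn):
    # Greedily add whichever candidate feature lowers validation RMSE most,
    # stopping when no remaining candidate improves the score.
    selected = []
    best_score = score_fn(base_train, base_val)   # baseline, lower is better
    improved = True
    while improved:
        improved = False
        best_cand = None
        for feat in candidate_features:
            if feat in selected:
                continue
            tr = add_features(base_train.copy(), selected + [feat])
            va = add_features(base_val.copy(), selected + [feat])
            score = score_fn(tr, va)
            if score < best_score:
                best_score, best_cand = score, feat
        if best_cand is not None:
            selected.append(best_cand)
            improved = True
    return selected, best_score
```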
Parameter tuning. 1. For datasets that are not very large, sklearn's GridSearchCV can be used to tune by brute force. Example code:
```python
from sklearn.model_selection import GridSearchCV

params = {'depth': [3],
          'iterations': [5000],
          'learning_rate': [0.1, 0.2, 0.3],
          'l2_leaf_reg': [3, 1, 5, 10, 100],
          'border_count': [32, 5, 10, 20, 50, 100, 200]}
# 'cat' is the estimator being tuned (these parameter names are CatBoost's).
clf = GridSearchCV(cat, params, cv=3)
clf.fit(x_train_2, y_train_2)
```
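After fitting, the winning combination can be read straight off the search object:

```python
# Best parameter combination found and its mean cross-validated score.
print(clf.best_params_)
print(clf.best_score_)
```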
For larger datasets, this first method takes a very long time. 2. Tune one parameter at a time: hold the other parameters fixed, sweep the first parameter, pick its best value, then move on to the next parameter. This saves time, but can sometimes get stuck in a local optimum. 3. Adjust the corresponding parameters by observing the data's distribution, for example the number of leaf nodes of a tree model: with many variables, too few leaves will underfit. A sketch of strategy 2 follows.
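A minimal sketch of the one-parameter-at-a-time strategy (the parameter grid, its values, and the rmse_for helper are illustrative assumptions, not the original tuning code):

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error

def rmse_for(params, X_train, Y_train, X_val, Y_val):
    # Fit a regressor with the given parameters and score it on the validation set.
    model = xgb.XGBRegressor(n_jobs=-1, **params)
    model.fit(X_train, Y_train)
    return np.sqrt(mean_squared_error(Y_val, model.predict(X_val)))

def tune_sequentially(X_train, Y_train, X_val, Y_val):
    # Sweep one parameter while holding the others fixed, keep its best
    # value, then move on to the next parameter.
    best = {'max_depth': 6, 'n_estimators': 500, 'learning_rate': 0.1}
    grid = {
        'max_depth': [4, 5, 6, 8, 10],
        'n_estimators': [200, 500, 1000, 1880],
        'learning_rate': [0.05, 0.1, 0.2],
    }
    for name, values in grid.items():
        scores = {v: rmse_for({**best, name: v}, X_train, Y_train, X_val, Y_val)
                  for v in values}
        best[name] = min(scores, key=scores.get)   # lowest RMSE wins
        print(name, '->', best[name])
    return best
```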
Prediction and submission
```python
def xgb_pred():
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')

    # Same raw features dropped during feature selection (the original code
    # relied on the module-level drop_cols defined earlier).
    drop_cols = ['SubwayLine', 'RentRoom', 'Time']

    try:
        from sklearn.preprocessing import LabelEncoder
        lb_encoder = LabelEncoder()
        # pd.concat replaces Series.append, which was removed in pandas 2.0
        lb_encoder.fit(pd.concat([train_df.loc[:, 'RoomDir'], test_df.loc[:, 'RoomDir']]))
        train_df.loc[:, 'RoomDir'] = lb_encoder.transform(train_df.loc[:, 'RoomDir'])
        test_df.loc[:, 'RoomDir'] = lb_encoder.transform(test_df.loc[:, 'RoomDir'])
    except Exception as e:
        print(e)

    train_df.loc[:, 'ab_Height'] = train_df.loc[:, 'Height'] / (train_df.loc[:, 'TolHeight'] + 1)
    test_df.loc[:, 'ab_Height'] = test_df.loc[:, 'Height'] / (test_df.loc[:, 'TolHeight'] + 1)
    train_df.loc[:, 'TolRooms'] = train_df.loc[:, 'Livingroom'] + train_df.loc[:, 'Bedroom'] + train_df.loc[:, 'Bathroom']
    test_df.loc[:, 'TolRooms'] = test_df.loc[:, 'Livingroom'] + test_df.loc[:, 'Bedroom'] + test_df.loc[:, 'Bathroom']
    train_df.loc[:, 'Area/Room'] = train_df.loc[:, 'RoomArea'] / (train_df.loc[:, 'TolRooms'] + 1)
    test_df.loc[:, 'Area/Room'] = test_df.loc[:, 'RoomArea'] / (test_df.loc[:, 'TolRooms'] + 1)

    rank_cols = ['BusLoc', 'SubwayLine']
    for col in rank_cols:
        rank_df = train_df.loc[:, [col, 'Rental']].groupby(col, as_index=False).mean().sort_values(by='Rental').reset_index(drop=True)
        rank_df.loc[:, col + '_rank'] = rank_df.index + 1   # +1 reserves rank 0 for missing values
        rank_fe_df = rank_df.drop(['Rental'], axis=1)
        train_df = train_df.merge(rank_fe_df, how='left', on=col)
        test_df = test_df.merge(rank_fe_df, how='left', on=col)
        try:
            train_df.drop([col], axis=1, inplace=True)
            test_df.drop([col], axis=1, inplace=True)
        except Exception as e:
            print(e)
    for drop_col in drop_cols:
        try:
            train_df.drop(drop_col, axis=1, inplace=True)
            test_df.drop(drop_col, axis=1, inplace=True)
        except Exception as e:
            pass

    print(train_df.columns, test_df.columns)

    import xgboost as xgb
    X_train = train_df.drop(['Rental'], axis=1)
    Y_train = train_df.loc[:, 'Rental'].values
    test_id = test_df.loc[:, 'id']
    X_test = test_df.drop(['id'], axis=1)

    reg_model = xgb.XGBRegressor(max_depth=8, n_estimators=1880, n_jobs=-1)
    # Note: early stopping is monitored on the training set here, as in the original code.
    reg_model.fit(X_train, Y_train, eval_set=[(X_train, Y_train)], verbose=100, early_stopping_rounds=10)

    y_pred = reg_model.predict(X_test)

    sub_df = pd.DataFrame({
        'id': test_id,
        'price': y_pred
    })
    sub_df.to_csv('sub.csv', index=False)

    return None

xgb_pred()
```
The first-place single XGB model scored 1.94, consistent between offline and online, with twenty-odd features in total. Compared with that, my own XGB was lacking in the direction of feature combinations: my single model used about 10 features and scored 2.01.
Original title: 機(jī)器學(xué)習(xí)實(shí)戰(zhàn)--住房月租金預(yù)測(3)
Source: WeChat public account 人工智能愛好者社區(qū) (AI_shequ)