# Tianchi Diabetes Top12

### Tianchi Precision Medicine Competition: Diabetes Genetic Risk Prediction

#### Feature Engineering

##### Constructing New Features

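1. Build addition, subtraction, multiplication, and division features between pairs of features to capture feature interactions (motivated by interpretable effects such as gene antagonism and gene synergy).
2. Build numeric transforms of each feature on its own, such as squares, higher powers, and roots.
3. Generate features with a polynomial-feature package (this performed poorly on the online leaderboard).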
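The first two constructions can be sketched as follows. This is a minimal sketch, not the competition code: the column names `a` and `b`, the helper `add_interaction_features`, and the small `eps` guard against division by zero are assumptions made for illustration.

```python
import itertools

import numpy as np
import pandas as pd


def add_interaction_features(df, cols, eps=1e-6):
    """Return a copy of df with pairwise arithmetic and per-column power features."""
    out = df.copy()
    # Pairwise add / subtract / multiply / divide between every two columns.
    for a, b in itertools.combinations(cols, 2):
        out[f"{a}_plus_{b}"] = df[a] + df[b]
        out[f"{a}_minus_{b}"] = df[a] - df[b]
        out[f"{a}_times_{b}"] = df[a] * df[b]
        out[f"{a}_div_{b}"] = df[a] / (df[b] + eps)  # eps avoids division by zero
    # Per-column square and square root.
    for c in cols:
        out[f"{c}_sq"] = df[c] ** 2
        out[f"{c}_sqrt"] = np.sqrt(df[c].clip(lower=0))  # clip so the root is defined
    return out


demo = pd.DataFrame({"a": [1.0, 4.0], "b": [2.0, 8.0]})
features = add_interaction_features(demo, ["a", "b"])
```

For the third item, scikit-learn's `PolynomialFeatures` is one package that generates such product terms automatically; the README does not say which package was actually used.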

##### Handling Missing Values

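1. Inspect the data distribution; for features whose distribution is not long-tailed, fill missing values with the mean or the median.
2. Treat a feature with missing values as the label and apply the Label Propagation algorithm to fill it in a semi-supervised way.
3. Models such as GBDT are not used for imputation because, for features with many missing values (40%-75%), there is no guarantee that the imputed values follow the same distribution as the observed data.
4. Drop features with more than 75% missing values.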
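Rules 1 and 4 can be sketched as below. This is a minimal sketch under assumptions: the function name `fill_missing`, the skewness cutoff of 1.0 used to decide whether a column is long-tailed, and the choice of the mean over the median are illustrative and not taken from the repository. Long-tailed columns are left untouched so that another method, such as Label Propagation, can handle them.

```python
import numpy as np
import pandas as pd


def fill_missing(df, drop_ratio=0.75, skew_cutoff=1.0):
    """Drop very sparse columns, then mean-fill columns that are not long-tailed."""
    # Rule 4: drop columns whose share of missing values exceeds drop_ratio.
    missing_ratio = df.isna().mean()
    out = df.loc[:, missing_ratio <= drop_ratio].copy()
    # Rule 1: mean-fill only columns without a long tail (small absolute skewness).
    for col in out.columns:
        if out[col].isna().any() and abs(out[col].skew()) <= skew_cutoff:
            out[col] = out[col].fillna(out[col].mean())
    return out


demo = pd.DataFrame({
    "sparse": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing, dropped
    "symmetric": [1.0, 2.0, np.nan, 4.0, 5.0],        # skewness 0, mean-filled
    "long_tail": [1.0, 1.0, 1.0, 100.0, np.nan],      # strongly skewed, left as NaN
})
cleaned = fill_missing(demo)
```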

##### Model Selection

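```python
# Fragment of the feature search. Choose_Best_Feature, now_feature,
# the_last_best and cv_mean are not defined in this excerpt.
if Choose_Best_Feature(now_feature) < the_last_best:
    now_feature.pop()  # the newest feature lowered the CV score, so drop it
else:
    print('Now CV:', cv_mean)
```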
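The fragment above looks like the inner step of a greedy forward feature search: add a candidate, re-score, and drop the candidate again if the cross-validation score gets worse. Below is a self-contained sketch of such a loop, under assumptions: `score_fn` stands in for the repository's `Choose_Best_Feature` (whose body is not shown), and the toy scorer exists only for the demonstration.

```python
def greedy_forward_select(candidates, score_fn):
    """Keep each candidate feature only if it does not lower the score."""
    selected = []
    best = float("-inf")
    for feature in candidates:
        selected.append(feature)
        score = score_fn(selected)
        if score < best:
            selected.pop()  # the new feature hurt the score, discard it
        else:
            best = score    # accept the feature and raise the bar
    return selected, best


# Toy scorer (an assumption for the demo): useful features add 1, noise subtracts 1.
gain = {"glucose": 1, "age": 1, "noise": -1}
chosen, best_score = greedy_forward_select(
    ["glucose", "noise", "age"], lambda feats: sum(gain[f] for f in feats)
)
```

In practice `score_fn` would run cross-validation on the selected columns, which makes the search cost one model fit per candidate feature.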

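```python
# DF is assumed to be an alias for pandas.DataFrame; the excerpt omits the import.
from pandas import DataFrame as DF


def get_pic(model, feature_name):
    """Return the model's feature importances, sorted from high to low."""
    ans = DF()
    ans['name'] = feature_name
    ans['score'] = model.feature_importances_
    print(ans[ans['score'] > 0].shape)  # how many features have non-zero importance
    return ans.sort_values(by=['score'], ascending=False).reset_index(drop=True)


nums = 45
feature_name1 = train_data[feature_name].columns
# Train the three models first. Method 1: take the top-K features by
# feature_importances_ from each model and use their union.
# Method 2: use their intersection instead.
```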
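The two combination strategies mentioned in the comment, union versus intersection of each model's top-K features, can be sketched without any trained model. The importance tables below are made-up numbers shaped like the output of `get_pic`, and `top_k_names` is a hypothetical helper.

```python
import pandas as pd


def top_k_names(importance, k):
    """Return the names of the k highest-scoring features with a positive score."""
    ranked = importance[importance["score"] > 0].sort_values("score", ascending=False)
    return set(ranked["name"].head(k))


# Made-up importance tables, one per model.
lgb_imp = pd.DataFrame({"name": ["f1", "f2", "f3", "f4"], "score": [9, 7, 1, 0]})
xgb_imp = pd.DataFrame({"name": ["f1", "f2", "f3", "f4"], "score": [8, 1, 6, 0]})
gbc_imp = pd.DataFrame({"name": ["f1", "f2", "f3", "f4"], "score": [5, 4, 0, 3]})

tops = [top_k_names(imp, k=2) for imp in (lgb_imp, xgb_imp, gbc_imp)]
union_features = sorted(set.union(*tops))             # method 1: any model ranks it top-K
intersect_features = sorted(set.intersection(*tops))  # method 2: every model ranks it top-K
```

The union keeps more features and relies on the models to ignore weak ones; the intersection is stricter and keeps only features all three models agree on.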

```python
# Import aliases are assumed from how the names are used; the excerpt omits them.
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.model_selection import cross_val_score as cv


def get_model(nums, cv_fold):
    # train_data, train_label, test_data, feature_name and get_ans_face
    # (the selected feature list) are not defined in this excerpt.
    feature_name1 = train_data[feature_name].columns
    print('New Feature: ', len(get_ans_face))

    # Model 1: LightGBM
    new_lgb_model = lgb.LGBMClassifier(objective='binary', n_estimators=300, max_depth=3,
                                       min_child_samples=6, learning_rate=0.102, random_state=1)
    cv_model = cv(new_lgb_model, train_data[get_ans_face], train_label, cv=cv_fold, scoring='f1')
    new_lgb_model.fit(train_data[get_ans_face], train_label)
    m1 = cv_model.mean()

    # Model 2: XGBoost
    new_xgb_model1 = xgb.XGBClassifier(objective='binary:logistic', n_estimators=300, max_depth=4,
                                       learning_rate=0.101, random_state=1)
    cv_model = cv(new_xgb_model1, train_data[get_ans_face].values, train_label, cv=cv_fold, scoring='f1')
    new_xgb_model1.fit(train_data[get_ans_face].values, train_label)
    m2 = cv_model.mean()

    # Model 3: scikit-learn gradient boosting, with NaN replaced by the constant 7
    new_gbc_model = GBC(n_estimators=310, subsample=1, min_samples_split=2, max_depth=3,
                        learning_rate=0.19, min_weight_fraction_leaf=0.1)
    kkk = train_data[get_ans_face].fillna(7)
    cv_model = cv(new_gbc_model, kkk.values, train_label, cv=cv_fold, scoring='f1')
    new_gbc_model.fit(kkk.values, train_label)
    m3 = cv_model.mean()

    print((m1 + m2 + m3) / 3)  # mean cross-validated F1 of the three models

    # Average the predicted class probabilities of the three models
    pro1 = new_lgb_model.predict_proba(test_data[get_ans_face])
    pro2 = new_xgb_model1.predict_proba(test_data[get_ans_face].values)
    pro3 = new_gbc_model.predict_proba(test_data[get_ans_face].fillna(7).values)
    ans = (pro1 + pro2 + pro3) / 3
    return ans
```
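`get_model` ends by averaging the class probabilities of the three classifiers, which amounts to soft voting with equal weights. The step in isolation looks like this, with made-up arrays standing in for the `predict_proba` outputs; the 0.5 threshold is an assumption, since the README does not show how `ans` is turned into labels.

```python
import numpy as np

# Made-up predict_proba outputs: one row per sample, columns = P(class 0), P(class 1).
pro1 = np.array([[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]])
pro2 = np.array([[0.8, 0.2], [0.6, 0.4], [0.2, 0.8]])
pro3 = np.array([[0.7, 0.3], [0.2, 0.8], [0.4, 0.6]])

avg = np.mean([pro1, pro2, pro3], axis=0)  # equal-weight soft vote
labels = (avg[:, 1] >= 0.5).astype(int)    # threshold on the positive-class probability
```

Because the competition metric used in the code is F1, the threshold is a natural thing to tune on validation data rather than fix at 0.5.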

README Source: luoda888/tianchi-diabetes-top12