A High-level Scorecard Modeling API | 评分卡建模尽在于此
Documentation page: https://scorecard-bundle.bubu.blue/
Scorecard-Bundle is a high-level Scorecard modeling API that is easy to use and consistent with Scikit-Learn. It covers the major steps of training a Scorecard model, including feature discretization with ChiMerge, WOE encoding, feature evaluation with information value and collinearity, a Logistic-Regression-based Scorecard model, and model evaluation for binary classification tasks. All the transformers and model classes in Scorecard-Bundle comply with Scikit-Learn's fit-transform-predict convention.
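To make the WOE and information value (IV) steps concrete, here is a minimal pure-Python sketch for a single discretized feature. It is not the library's implementation: the function name `woe_iv`, the smoothing constant `eps`, and the sign convention (WOE as the log of the event share over the non-event share) are assumptions made for illustration.

```python
import math
from collections import Counter

def woe_iv(bins, y, eps=1e-10):
    """Return ({bin: WOE}, IV) for one discretized feature and a 0/1 target.

    Hypothetical helper for illustration only; `eps` avoids log(0) for
    bins that contain no events or no non-events.
    """
    events = Counter(b for b, t in zip(bins, y) if t == 1)
    non_events = Counter(b for b, t in zip(bins, y) if t == 0)
    n_event = sum(events.values())
    n_non_event = sum(non_events.values())
    woe, iv = {}, 0.0
    for b in sorted(set(bins)):
        p_event = events[b] / n_event + eps              # share of all events in this bin
        p_non_event = non_events[b] / n_non_event + eps  # share of all non-events in this bin
        woe[b] = math.log(p_event / p_non_event)
        iv += (p_event - p_non_event) * woe[b]
    return woe, iv

bins = ["low", "low", "low", "high", "high", "high", "high", "low"]
y = [0, 0, 1, 1, 1, 0, 1, 0]
woe, iv = woe_iv(bins, y)
print(woe, iv)
```

Each term of the IV sum is non-negative, so a feature's IV does not depend on which sign convention is chosen for WOE.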
A complete example showing how to build a scorecard with Scorecard-Bundle: Example Notebooks
More detailed and reader-friendly documentation is available at https://scorecard-bundle.bubu.blue/
In Scorecard-Bundle, the core algorithms for WOE/IV calculation and scorecard transformation are based on the methods introduced in Mamdouh Refaat's book "Credit Risk Scorecards: Development and Implementation Using SAS"; ChiMerge was written based on Randy Kerber's paper "ChiMerge: Discretization of Numeric Attributes".
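As background on the ChiMerge idea, the algorithm repeatedly merges the pair of adjacent intervals whose class distributions are most similar, measured by a chi-square statistic. The sketch below computes that statistic in plain Python. The function names are hypothetical, and skipping cells with zero expected count is an assumption (implementations handle this case differently); it does not reproduce the library's code.

```python
def chi2_adjacent(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals.

    counts_a and counts_b are per-class sample counts, e.g. [n_y0, n_y1].
    """
    rows = [counts_a, counts_b]
    n_classes = len(counts_a)
    row_totals = [sum(r) for r in rows]
    col_totals = [rows[0][j] + rows[1][j] for j in range(n_classes)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(n_classes):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # assumption: cells with zero expected count are skipped
                chi2 += (rows[i][j] - expected) ** 2 / expected
    return chi2

def most_similar_pair(interval_counts):
    """Return (index, scores): the adjacent pair (index, index + 1) with the lowest chi-square."""
    scores = [
        chi2_adjacent(interval_counts[i], interval_counts[i + 1])
        for i in range(len(interval_counts) - 1)
    ]
    return min(range(len(scores)), key=scores.__getitem__), scores

# Three intervals with [n_y0, n_y1] counts; the first two look alike.
idx, scores = most_similar_pair([[8, 2], [7, 3], [1, 9]])
print(idx, scores)
```

In this toy input the first two intervals have nearly the same event rate, so they are the next merge candidate.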
I developed Scorecard-Bundle in my private time, but its code wouldn't be as good if my superior Andyshi hadn't allowed me to use it in projects at work, if my colleagues (e.g. zeyunH) hadn't been active in using it, or if users hadn't reported issues when they found bugs. Thanks to everyone who helps make Scorecard-Bundle better.
Installing the latest version is strongly recommended as every version either corrected known bugs or added useful functionality. In principle, critical bugs are fixed as soon as they are revealed. Therefore please file an issue if you suspect the presence of a bug when using Scorecard-Bundle.
Note that Scorecard-Bundle depends on NumPy, Pandas, matplotlib, Scikit-Learn, and SciPy, which can be installed individually or together through Anaconda.
Pip: Scorecard-Bundle can be installed with pip: pip install --upgrade scorecardbundle
Note that the latest version may not be available on some pip mirror sites (e.g. https://mirrors.aliyun.com/pypi/simple/). To update to the latest version, use the following command, which specifies https://pypi.org/project as the source:
pip install -i https://pypi.org/project --upgrade scorecardbundle
Manually: Download the code from GitHub <https://github.com/Lantianzz/Scorecard-Bundle> and import it directly:
import sys
sys.path.append(r'E:\Github\Scorecard-Bundle') # raw string keeps the backslashes literal; add the path that contains the code
from scorecardbundle.feature_discretization import ChiMerge as cm
from scorecardbundle.feature_discretization import FeatureIntervalAdjustment as fia
from scorecardbundle.feature_encoding import WOE as woe
from scorecardbundle.feature_selection import FeatureSelection as fs
from scorecardbundle.model_training import LogisticRegressionScoreCard as lrsc
from scorecardbundle.model_evaluation import ModelEvaluation as me
from scorecardbundle.model_interpretation import ScorecardExplainer as mise
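When importing from a downloaded copy, it can help to confirm that the package is actually discoverable before running the imports above. The following is a small sketch using only the standard library; the helper name `add_source_dir` is hypothetical, and the example path is the placeholder path used above.

```python
import importlib.util
import sys
from pathlib import Path

def add_source_dir(source_dir, package="scorecardbundle"):
    """Put `source_dir` on sys.path and report whether `package` can then be found.

    Hypothetical helper; it only checks discoverability and does not import the package.
    """
    path = str(Path(source_dir))
    if path not in sys.path:
        sys.path.append(path)
    return importlib.util.find_spec(package) is not None

found = add_source_dir(r"E:\Github\Scorecard-Bundle")
print("scorecardbundle importable:", found)
```

If this prints `False`, the folder passed in is probably not the one that directly contains the `scorecardbundle` directory.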
`TypeError: 'bool' object is not iterable` or `DeprecationWarning: elementwise comparison failed`).
V1.2.2 fixed some non-critical bugs in previous versions.
Corrected the use of deprecated parameters: previous versions passed the annotation text to `plt.annotate()` through the parameter `s`. This parameter has since been renamed `text`, and from Python3.9 onward continuing to use `s` may raise `TypeError annotate() missing 1 required positional argument: 'text'`. V1.2.2 uses the parameter `text` when calling `plt.annotate()`.
Changed default parameter values: the default value of parameter `min_intervals` in `ChiMerge` changed from 1 to 2.
Adjusted the naming of private variables in classes: the classes build on the `BaseEstimator` and `TransformerMixin` classes in Scikit-learn, and for each class parameter Scikit-learn checks whether it exists on the instance as an attribute with exactly the same name. The previous code stored such parameters as private variables with a two-underscore prefix, which resulted in errors like `cannot found __xx in class xxxx` when users tried to print the instance or access these private variables. This problem did not prevent users from getting correct results. In `ChiMerge`, `WOE` and `LogisticRegressionScoreCard`, the parameters are now stored as attributes with the same names to avoid the problem.
V1.2.1 is an emergency update that fixes 2 related bugs, which are triggered only in rare cases but are hard to debug for someone who is not familiar with the code. Thanks to @zeyunH for bringing one of the bugs to my attention.
Added parameter `force_inf` to `scorecardbundle/utils/func_numpy.py/_assign_interval_base()` and related code. This parameter controls whether to force the right boundary of the largest interval to be positive infinity. The default is True.
When the largest boundary `b_max` passed in is larger than or equal to the maximum feature value, the largest output interval is originally (xxx, b_max]. In tasks like fitting ChiMerge, where the output intervals are supposed to cover the entire value space (-inf ~ inf), `force_inf` should be set to True so that the largest interval is overwritten from (xxx, b_max] to (xxx, inf]. In other words, the previous largest boundary value is abandoned.
Use `force_inf=True` in tasks like fitting ChiMerge, where the output intervals should cover the entire value space, so that the largest interval is fixed to extend to infinity. Use `force_inf=False` in tasks like ChiMerge transform and Scorecard predict, where feature values only need to be mapped to intervals based on the given boundaries.
When `_assign_interval_base` is called in ChiMerge `fit()`, the largest interval is overwritten from (xxx, b_max] to (xxx, inf] to cover the entire value range. Previously, the code only performed this adjustment when the largest boundary value was equal to the maximum value of the data, while in practice the largest boundary may be larger due to rounding (e.g. the maximum value is 3.14159, the threshold happens to fall on this value, and it is rounded up to 3.1416 by the `decimal` parameter of ChiMerge). From V1.2.1, the condition has been changed to >=.
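The effect of `force_inf` described above can be illustrated with a small standalone sketch. This is not the code of `_assign_interval_base`: the function `assign_intervals`, its signature, and the label format are assumptions chosen for illustration; only the `force_inf` behaviour and the >= condition are taken from the notes above.

```python
import math
from bisect import bisect_left

def assign_intervals(values, boundaries, force_inf=True):
    """Map each value to a right-closed interval label such as '(1.5, inf]'.

    Hypothetical helper, not the library function.
    """
    bounds = sorted(set(boundaries))
    if force_inf and bounds and bounds[-1] >= max(values):
        # Abandon the largest boundary so that the top interval runs to +inf.
        bounds = bounds[:-1]
    edges = [-math.inf] + bounds + [math.inf]
    labels = []
    for x in values:
        i = bisect_left(edges, x, 1)  # index of the first edge >= x
        labels.append(f"({edges[i - 1]}, {edges[i]}]")
    return labels

values = [1.0, 2.0, 3.14159]
boundaries = [1.5, 3.1416]  # largest boundary rounded up past the data maximum

fit_labels = assign_intervals(values, boundaries, force_inf=True)
predict_labels = assign_intervals(values, boundaries, force_inf=False)
print(fit_labels)
print(predict_labels)
```

With `force_inf=True` the top label becomes `(1.5, inf]`, while `force_inf=False` keeps `(1.5, 3.1416]`, so labels produced at prediction time match the boundaries that were passed in.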
`force_inf=False` is now set in function `assign_interval_str` when calling Scorecard `predict()`. This avoids a KeyError that occurred because the maximum-interval adjustment mentioned above generated an interval that does not exist in the Scorecard rules.
Added a check on the `X_beforeWOE` parameter of `LogisticRegressionScoreCard.predict()`. If the Scorecard rules contain features that are not in the passed feature data, or the passed feature data contains features that are not in the Scorecard rules, an exception is raised.
feature_discretization:
- Added parameter `decimal` to class `ChiMerge.ChiMerge()`, which allows users to control the number of decimals of the feature interval boundaries.
- `FeatureIntervalAdjustment.plot_event_dist()`.
- Added function `FeatureIntervalAdjustment.feature_stat()`, which computes the input feature's sample distribution, including the sample sizes, event sizes and event proportions of each feature value.
feature_selection.FeatureSelection:
- Added function `identify_colinear_features()`, which identifies highly correlated feature pairs that may cause collinearity problems.
- Added function `unstacked_corr_table()`, which returns the unstacked correlation table to help analyze collinearity problems.
model_training.LogisticRegressionScoreCard:
- The `LogisticRegressionScoreCard` class now accepts all parameters of `sklearn.linear_model.LogisticRegression`, and its `fit()` function accepts all parameters of the `fit()` of `sklearn.linear_model.LogisticRegression` (including `sample_weight`).
- Added parameter `baseOdds` to `LogisticRegressionScoreCard`. This allows users to pass user-defined base odds (# of y=1 / # of y=0) to the Scorecard model.
model_evaluation.ModelEvaluation:
- Added function `pref_table`, which evaluates the classification performance at different levels of model scores. This function is useful for setting a classification threshold based on precision and recall.
model_interpretation:
- Added function `ScorecardExplainer.important_features()` to help interpret the result of an individual instance. This function identifies the features that contribute the most to pushing the total score of a particular instance above a threshold.
`scorecardbundle.feature_discretization.ChiMerge.ChiMerge` now ensures that the output discretized feature values are continuous intervals from negative infinity to infinity, covering all possible values. This was done by modifying the `_assign_interval_base` function and the `chi_merge_vector` function.
The default value of the `min_intervals` parameter in `scorecardbundle.feature_discretization.ChiMerge.ChiMerge` changed from None to 1, so that features with only one unique value no longer cause an error. A default of 1 is also more consistent with the actual meaning, as a feature has at least one interval.
The `scorecardbundle.feature_discretization.FeatureIntervalAdjustment` class covers the functionality related to manually adjusting features in the feature engineering stage. For now this class only contains the `plot_event_dist` function, which visualizes a feature's sample distribution and event rate distribution. This is to facilitate feature adjustment decisions in order to obtain better explainability and predictability.
Scorecard-Bundle is a high-level, Python-based Scorecard modeling API that is easy to use and follows Scikit-Learn's calling conventions; all of its classes comply with Scikit-Learn's fit-transform-predict convention. It covers feature discretization with ChiMerge, WOE encoding, feature evaluation with information value (IV) and collinearity, a Logistic-Regression-based Scorecard model, and model evaluation for binary classification tasks.
A complete example showing how to train a Scorecard model: Example Notebooks
More detailed and reader-friendly documentation is available at https://scorecard-bundle.bubu.blue/
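In Scorecard-Bundle, the core calculation logic for WOE and IV and for the scorecard transformation comes from the book 《信用风险评分卡研究 —基于SAS的开发与实施》, translated by 王松奇 and 林治乾 from Mamdouh Refaat's "Credit Risk Scorecards: Development and Implementation Using SAS"; the ChiMerge algorithm reproduces the original paper by Randy Kerber, "ChiMerge: Discretization of Numeric Attributes".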
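For readers unfamiliar with the scorecard transformation mentioned above, the sketch below shows a common textbook scaling of odds into points. It is offered under assumptions rather than as the library's formula: the function names and the parameter names `pdo`, `base_points` and `base_odds` are hypothetical, and odds are taken as P(y=1) / P(y=0).

```python
import math

def scale_params(pdo, base_points, base_odds):
    """Return (factor, offset) such that score = offset + factor * ln(odds)."""
    factor = pdo / math.log(2)                           # points added when the odds double
    offset = base_points - factor * math.log(base_odds)  # score equals base_points at base_odds
    return factor, offset

def score_from_odds(odds, pdo, base_points, base_odds):
    """Score for given odds P(y=1) / P(y=0) under the scaling above."""
    factor, offset = scale_params(pdo, base_points, base_odds)
    return offset + factor * math.log(odds)

# With a negative pdo, the score falls as the event odds rise.
base = score_from_odds(0.05, pdo=-20, base_points=100, base_odds=0.05)
doubled = score_from_odds(0.10, pdo=-20, base_points=100, base_odds=0.05)
print(base, doubled)
```

By construction, the score equals `base_points` at `base_odds` and changes by `pdo` points each time the odds double.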
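Although I developed Scorecard-Bundle in my private time, its code would not be as good as it is now if my superior Andyshi had not allowed me to use it at work, if my colleagues (e.g. zeyunH) had not actively used it and given feedback, or if users had not filed issues when they found bugs. Thanks to everyone who has helped make Scorecard-Bundle better.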
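Because every release either fixes known bugs or adds important new functionality, installing the latest version is strongly recommended. In principle, serious bugs are fixed as soon as they are discovered, so if you suspect a bug while using Scorecard-Bundle, please record it in an issue.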
Note that Scorecard-Bundle depends on NumPy, Pandas, matplotlib, Scikit-Learn and SciPy, which can be installed individually or together through Anaconda.
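Pip: Scorecard-Bundle can be installed with pip: pip install --upgrade scorecardbundle
Note that the latest version may not yet be available on some mirror sites (e.g. https://mirrors.aliyun.com/pypi/simple/). To update to the latest version, use the command below, which specifies https://pypi.org/project as the source: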
pip install -i https://pypi.org/project --upgrade scorecardbundle
Manually: Download the code from GitHub <https://github.com/Lantianzz/Scorecard-Bundle> and import it directly:
import sys
sys.path.append(r'E:\Github\Scorecard-Bundle') # raw string keeps the backslashes literal; add the path that contains the code
from scorecardbundle.feature_discretization import ChiMerge as cm
from scorecardbundle.feature_discretization import FeatureIntervalAdjustment as fia
from scorecardbundle.feature_encoding import WOE as woe
from scorecardbundle.feature_selection import FeatureSelection as fs
from scorecardbundle.model_training import LogisticRegressionScoreCard as lrsc
from scorecardbundle.model_evaluation import ModelEvaluation as me
from scorecardbundle.model_interpretation import ScorecardExplainer as mise
`TypeError: 'bool' object is not iterable` or `DeprecationWarning: elementwise comparison failed`).
V1.2.2 fixed several non-critical bugs.
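Previous versions passed the text to `plt.annotate()` through the parameter `s`, but this parameter has been renamed `text`, and continuing to use the old parameter in Python3.9 may cause `TypeError annotate() missing 1 required positional argument: 'text'`. The new code uses the `text` parameter instead.
The default value of the `min_intervals` parameter changed from 1 to 2.
The classes build on Scikit-learn's `BaseEstimator` and `TransformerMixin`, and Scikit-learn checks whether each parameter exists under the same name among the attributes of the class instance. The old code stored all parameters as private variables prefixed with two underscores (`__`), which led to errors like `cannot found __xx in class xxxx` when users tried to print the instance or access the private variables. This error does not affect normal use of the code. In the three classes `ChiMerge`, `WOE` and `LogisticRegressionScoreCard`, the class parameters now exist on the instance as attributes with the same names.
V1.2.1 was released as an emergency update to fix two rare bugs. The bugs below are hard to track down for users who are not familiar with the code. Thanks to @zeyunH for pointing out one of them.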
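Added parameter `force_inf` to function `scorecardbundle/utils/func_numpy.py/_assign_interval_base()` and related code. This parameter controls whether the right boundary of the largest interval is forced to positive infinity. The default is True.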
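When the largest boundary `b_max` is greater than or equal to the maximum of the feature data, the largest output interval is originally (xxx, b_max]. When fitting ChiMerge to compute the discretization thresholds, the output intervals need to cover the entire value range (-inf ~ inf), so this parameter should be set to True, which changes the largest interval from (xxx, b_max] to (xxx, inf]. In effect, the original largest threshold is discarded.
Use `force_inf=True` so that the largest interval is corrected as needed to extend to positive infinity; use `force_inf=False` in tasks such as ChiMerge transform or Scorecard predict(), where intervals should be output strictly according to the thresholds.
In `fit()`, the old code only made the adjustment described above when the largest threshold was equal to the maximum of the data, yet in practice rounding can make the largest threshold greater than the maximum (e.g. the maximum is 3.14159, the largest threshold happens to fall on this value, and it is rounded to 3.1416 by the `decimal` parameter of ChiMerge). From V1.2.1, the condition is therefore >=.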
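`assign_interval_str` now sets `force_inf=False`, so the code no longer alters the largest output interval when the largest threshold equals the data maximum. That alteration produced intervals absent from the scoring rules and caused a KeyError when the rules were applied.
Added a check on `X_beforeWOE`: an exception is raised when the scoring rules contain features that are missing from the feature data, or the feature data contains features that are missing from the scoring rules.
Feature discretization (feature_discretization):
- Added parameter `decimal` to `ChiMerge.ChiMerge()`, which allows users to control the number of decimals of the output feature interval boundaries.
- `FeatureIntervalAdjustment.plot_event_dist()`.
- `FeatureIntervalAdjustment.feature_stat()` computes a feature's distribution, including the sample distribution and event rate distribution across its values.
Feature selection (feature_selection.FeatureSelection):
- `identify_colinear_features()` identifies highly correlated features and outputs the list of features that have the lower IV within each highly correlated group.
- `unstacked_corr_table()` outputs the feature correlation table for analyzing collinearity problems.
Model training (model_training.LogisticRegressionScoreCard):
- The `LogisticRegressionScoreCard` class now accepts any parameter of `sklearn.linear_model.LogisticRegression`, and its `fit()` function accepts any parameter of the `fit()` function of `sklearn.linear_model.LogisticRegression` (including `sample_weight`).
- Added parameter `baseOdds` to `LogisticRegressionScoreCard`. This allows users to pass in user-defined base odds (# of y=1 / # of y=0).
Model evaluation (model_evaluation.ModelEvaluation):
- `pref_table` evaluates the classification performance (precision, recall, F1, sample proportion, etc.) at different levels of model scores. This function helps users choose a classification threshold based on classification performance.
Scorecard interpretation (model_interpretation):
- `ScorecardExplainer.important_features()` helps interpret the model result of an individual sample. This function identifies the features that matter most to the model result.
The `plot_event_dist` function visualizes the sample distribution and event rate distribution, which makes it easier to adjust features to obtain better explainability and predictive power.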
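The kind of per-value statistics described above (sample distribution and event rate of a discretized feature) can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the output format of `feature_stat()` or `plot_event_dist()`: the helper `event_stats` and its dictionary keys are hypothetical.

```python
from collections import defaultdict

def event_stats(feature_values, y):
    """Per-value sample size, sample share, event count and event rate.

    Hypothetical helper for a discretized feature and a 0/1 target.
    """
    size = defaultdict(int)
    events = defaultdict(int)
    for value, target in zip(feature_values, y):
        size[value] += 1
        events[value] += int(target == 1)
    total = len(feature_values)
    return {
        value: {
            "sample_size": size[value],
            "sample_share": size[value] / total,
            "event_size": events[value],
            "event_rate": events[value] / size[value],
        }
        for value in sorted(size)
    }

values = ["(-inf, 30]", "(-inf, 30]", "(30, inf]", "(30, inf]", "(30, inf]"]
y = [1, 0, 0, 0, 1]
stats = event_stats(values, y)
for interval, row in stats.items():
    print(interval, row)
```

Intervals with very small sample shares or with event rates that break an otherwise monotonic trend are typical candidates for manual merging.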