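Today I watched the film Margin Call. In it, risk analyst Peter Sullivan gets hold of a risk-model analysis that his laid-off boss had been working on, works through it carefully, and uncovers a huge hole in the firm's financial assessment: the value at risk (VaR) of the assets the firm holds carries major risk, and if those assets go bad, the loss will exceed the value of the firm. Senior management then launches a series of rescue efforts.
The film left a strong impression on me about data analysis and data modelling. Had no one collected the historical data, pulled it together and built a model, the hole might not have been found so quickly.
With that in mind, I looked on Kaggle for similar data to practice on, and then remembered a practice project from a community member I follow: Prosper Loan Data (reply "網貸" in the official account's backend to download the data). Drawing on that practice assignment and my own understanding, I ran an analysis whose end goal is a model that predicts which borrowers will repay their loans and which will default.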
Step 1: Import the data
### Import the data
library(readr)
loandata <- read_csv("F:/prosperLoanData.csv")
View(loandata)
str(loandata)
loandata has 81 variables and 113,937 rows.
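Step 2: Understand the data
With 81 variables, those that have little bearing on the outcome of interest, loan status, are left out and are not defined here.
ListingCreationDate: date the listing was created
LoanStatus: loan status (Completed, Current, Defaulted, Chargedoff, etc.)
EmploymentStatus: employment status (Self-employed, Employed, etc.)
EmploymentStatusDuration: how long the employment status has lasted (in months)
IsBorrowerHomeowner: whether the borrower owns a home
CreditScoreRangeLower/CreditScoreRangeUpper: lower/upper bound of the consumer credit score
InquiriesLast6Months: number of credit inquiries in the last 6 months
BorrowerRate: the interest rate on the loan, used as a proxy for the price of borrowing on the P2P platform. BorrowerRate excludes other fees. It is the return the borrower pays to investors and the most direct and important cost of financing, reflecting the cost of funds that both sides accept after weighing all factors.
Term: the loan term, i.e. the final repayment period the borrower commits to when borrowing through the platform. The term reflects the asset's liquidity; assets with longer terms should carry a liquidity premium (a higher rate).
CreditGrade/ProsperRating(Alpha): credit rating. The former is the rating of customers before July 1, 2009, the latter the rating after July 1, 2009. The higher the rating, the stronger the ability to repay.
CreditScore: consumer credit score provided by a consumer credit agency, similar to Zhima (Sesame) Credit in China.
StatedMonthlyIncome: the customer's monthly income. The higher it is, the more confident investors are that principal and interest will be repaid on time.
DelinquenciesLast7Years: number of delinquencies in the past 7 years at the time the credit profile was submitted. To some extent it reflects the creditworthiness of the person posting the listing.
BankcardUtilization: credit card balance used as a percentage of the total credit card limit at the time the credit profile was submitted
LoanOriginalAmount: the amount the borrower has borrowed from Prosper (the Prosper data dictionary defines this field as the origination amount of the loan). The more principal borrowed, the greater the repayment pressure, though a large value may also indicate that the customer relies heavily on Prosper.
DebtToIncomeRatio: the borrower's debt-to-income ratio. The higher it is, the worse the borrower's finances and the lower the ability to repay, so investors should demand a higher return when lending to such a borrower.
Occupation: the borrower's occupation
IncomeRange: the borrower's annual income range
BorrowerState: the state the borrower lives in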
This analysis builds a predictive model for LoanStatus from the variables above.
Step 3: Data preprocessing
3.1 Select a subset
Because there are so many variables, select only the ones needed and build a new data set, newloandata.
### 3.1 Select a subset
library(dplyr)
## Rename the three columns whose names contain spaces and parentheses
names(loandata)[c(14,15,17)] <- c("ProsperRating.numeric","ProsperRating.Alpha","ListingCategory.numeric")
## Select the subset
newloandata <- select(loandata, ListingCreationDate, LoanStatus, EmploymentStatus, EmploymentStatusDuration,
                      IsBorrowerHomeowner, CreditScoreRangeLower, CreditScoreRangeUpper,
                      InquiriesLast6Months, BorrowerRate, Term, CreditGrade, ProsperRating.Alpha,
                      StatedMonthlyIncome, DelinquenciesLast7Years, BankcardUtilization,
                      LoanOriginalAmount, DebtToIncomeRatio, Occupation, IncomeRange, BorrowerState)
View(newloandata)
3.2 Recode the data
The main task is to recode LoanStatus, defining "repaid" as "1" and "not repaid" as "0".
## 3.2 Inspect the values of LoanStatus
PastDue <- c("Past Due (>120 days)","Past Due (1-15 days)","Past Due (16-30 days)",
             "Past Due (31-60 days)","Past Due (61-90 days)","Past Due (91-120 days)")
## Collapse every "Past Due" label into a single PastDue class
newloandata$LoanStatus[newloandata$LoanStatus %in% PastDue] <- "PastDue"
## Fold Cancelled into Current
newloandata$LoanStatus[newloandata$LoanStatus=="Cancelled"] <- "Current"
## Fold Defaulted into Chargedoff
newloandata$LoanStatus[newloandata$LoanStatus=="Defaulted"] <- "Chargedoff"
## Fold FinalPaymentInProgress into Completed
newloandata$LoanStatus[newloandata$LoanStatus=="FinalPaymentInProgress"] <- "Completed"
## Check the counts
table(newloandata$LoanStatus)
Then group further:
## Fold PastDue into Completed, i.e. treat it as repaid
newloandata$LoanStatus[newloandata$LoanStatus=="PastDue"] <- "Completed"
## Drop the loans still in progress, i.e. the Current rows
newloandata <- newloandata[!(newloandata$LoanStatus=="Current"),]
## Check the counts again
table(newloandata$LoanStatus)
Represent LoanStatus as 0 (not repaid) and 1 (repaid):
## Set Completed to 1, the repaid class
newloandata$LoanStatus[newloandata$LoanStatus=="Completed"] <- "1"
## Set Chargedoff to 0, the not-repaid class
newloandata$LoanStatus[newloandata$LoanStatus=="Chargedoff"] <- "0"
newloandata$LoanStatus <- as.factor(newloandata$LoanStatus)
## Check the counts again
table(newloandata$LoanStatus)
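The recoding rules above can be collected into one function, which makes them easier to check on a few hand-picked labels. This is only a sketch on toy data: `recode_status` is a hypothetical helper, not part of the original workflow, and it follows the grouping described in the text (past-due loans count as repaid; Current and Cancelled are dropped).

```r
# Hypothetical helper: collapse raw Prosper LoanStatus labels into
# "1" (repaid), "0" (not repaid) or NA (still running, to be dropped).
recode_status <- function(status) {
  repaid <- c("Completed", "FinalPaymentInProgress")
  unpaid <- c("Chargedoff", "Defaulted")
  out <- rep(NA_character_, length(status))
  # Following the text, every "Past Due (...)" label counts as repaid
  out[status %in% repaid | grepl("^Past Due", status)] <- "1"
  out[status %in% unpaid] <- "0"
  factor(out, levels = c("0", "1"))
}

toy <- c("Completed", "Current", "Defaulted", "Past Due (1-15 days)",
         "Chargedoff", "Cancelled", "FinalPaymentInProgress")
recoded <- recode_status(toy)
keep <- !is.na(recoded)   # FALSE for Current and Cancelled
table(recoded[keep])
```

Keeping the rules in one place also makes it easy to revisit the choice of counting past-due loans as repaid.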
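3.3 Check for missing values
Use the following code to pick out the variables that contain missing values:
data <- sapply(newloandata, function(x) sum(is.na(x)))  ## count the NAs in each column
data1 <- data[data != 0]
data1
So many variables have missing values that the output above does not show the pattern of missingness at a glance, so plot it with the missmap() function:
library(Amelia)  ## missmap() comes from the Amelia package
missmap(newloandata, main="Missing Value Of Loandata")
The three variables with the most missing values are CreditGrade, ProsperRating.Alpha and EmploymentStatusDuration. The first two are credit ratings, and they are missing because Prosper changed its rating scheme after July 2009. The third is how long the employment status has lasted. All three affect loan status, so their missing values need to be filled in.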
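To rank variables by how much is missing without a plot, the per-column counts can be turned into a small sorted table. A minimal sketch on a toy data frame; `missing_summary` is a hypothetical helper name.

```r
# Hypothetical helper: count and share of NA per column, worst first,
# keeping only the columns that actually have missing values.
missing_summary <- function(df) {
  n_missing <- vapply(df, function(x) sum(is.na(x)), integer(1))
  out <- data.frame(variable  = names(n_missing),
                    n_missing = unname(n_missing),
                    share     = unname(n_missing) / nrow(df),
                    stringsAsFactors = FALSE)
  out <- out[out$n_missing > 0, ]
  out[order(-out$n_missing), ]
}

toy <- data.frame(a = c(1, NA, 3, NA), b = c("x", "y", NA, "z"), c = 1:4)
res <- missing_summary(toy)
res
```

Applied to the real data this would be `missing_summary(newloandata)`.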
3.3 Fill in missing values
3.3.1 Filling EmploymentStatusDuration
First locate the missing values:
### Fill EmploymentStatusDuration
### Find the positions where EmploymentStatusDuration is missing
which(is.na(newloandata$EmploymentStatusDuration))
Then look at the corresponding EmploymentStatus:
### Inspect the EmploymentStatus of those rows
newloandata$EmploymentStatus[which(is.na(newloandata$EmploymentStatusDuration))]
In these rows EmploymentStatus is either NA or "Not available", so the missing EmploymentStatusDuration values can be filled with 0:
### Fill EmploymentStatusDuration with 0 (numeric, so the column keeps its type)
newloandata$EmploymentStatusDuration[which(is.na(newloandata$EmploymentStatusDuration))] <- 0
### Check whether any missing values remain
sapply(newloandata, function(x) sum(is.na(x)))
The missing values of EmploymentStatusDuration are now all filled.
3.3.2 Filling EmploymentStatus
Fill the missing EmploymentStatus values with "Not available":
### Fill EmploymentStatus with "Not available"
newloandata$EmploymentStatus[which(is.na(newloandata$EmploymentStatus))] <- "Not available"
### Check whether EmploymentStatus still has missing values
sapply(newloandata, function(x) sum(is.na(x)))
The missing values of EmploymentStatus are now all filled.
3.3.3 Filling CreditScoreRangeLower/CreditScoreRangeUpper
### Average CreditScoreRangeLower and CreditScoreRangeUpper into a new variable
newloandata$CreditScore <- (newloandata$CreditScoreRangeLower + newloandata$CreditScoreRangeUpper)/2
### Check the missing values of CreditScore
sapply(newloandata, function(x) sum(is.na(x)))
Missing values remain. Since this is a consumer credit score, the median is a candidate fill value.
First plot the distribution to check whether the median is suitable:
### Plot to check whether the median is a reasonable fill value
library(ggplot2)
library(ggthemes)
ggplot(newloandata, aes(x=CreditScore)) +
  geom_density(fill="pink", alpha=0.4) +
  geom_vline(aes(xintercept=median(CreditScore, na.rm = TRUE)), colour="red", linetype="dashed", lwd=1) +
  theme_few() + ggtitle("The density of CreditScore")
The plot shows that most values lie between 500 and 750, so the median can be used to fill the missing values:
### Fill the missing values with the median
newloandata$CreditScore[which(is.na(newloandata$CreditScore))] <- median(newloandata$CreditScore, na.rm = TRUE)
### Check the missing values of CreditScore again
sapply(newloandata, function(x) sum(is.na(x)))
The missing values of CreditScore are now all filled.
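Median filling is applied to several numeric columns in this step, so it can be wrapped in a small function. A minimal sketch; `impute_median` is a hypothetical helper and the scores are toy values.

```r
# Hypothetical helper: replace the NAs of a numeric vector with the median
# of its observed values.
impute_median <- function(x) {
  stopifnot(is.numeric(x))
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

scores <- c(640, NA, 700, 720, NA, 680)
filled_scores <- impute_median(scores)
filled_scores
```

On the real data the call would look like `newloandata$CreditScore <- impute_median(newloandata$CreditScore)`.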
3.3.4 Filling InquiriesLast6Months
Plot the distribution to check whether the median is suitable:
ggplot(newloandata, aes(x=InquiriesLast6Months)) +
  geom_density(fill="skyblue", alpha=0.4) +
  geom_vline(aes(xintercept=median(InquiriesLast6Months, na.rm = TRUE)), colour="red", linetype="dashed", lwd=1) +
  theme_few() + ggtitle("The density of InquiriesLast6Months")
The plot shows that most values lie between 0 and 20, so the median can be used to fill the missing values:
### Fill the missing values with the median
newloandata$InquiriesLast6Months[which(is.na(newloandata$InquiriesLast6Months))] <- median(newloandata$InquiriesLast6Months, na.rm = TRUE)
### Check the missing values of InquiriesLast6Months again
sapply(newloandata, function(x) sum(is.na(x)))
The missing values of InquiriesLast6Months are now all filled.
3.3.5 Filling DelinquenciesLast7Years
Plot the distribution to check whether the median is suitable:
ggplot(newloandata, aes(x=DelinquenciesLast7Years)) +
  geom_density(fill="blue", alpha=0.4) +
  geom_vline(aes(xintercept=median(DelinquenciesLast7Years, na.rm = TRUE)), colour="red", linetype="dashed", lwd=1) +
  theme_few() + ggtitle("The density of DelinquenciesLast7Years")
The plot shows that most values lie between 0 and 10, so the median can be used to fill the missing values:
### Fill the missing values with the median
newloandata$DelinquenciesLast7Years[which(is.na(newloandata$DelinquenciesLast7Years))] <- median(newloandata$DelinquenciesLast7Years, na.rm = TRUE)
### Check the missing values of DelinquenciesLast7Years again
sapply(newloandata, function(x) sum(is.na(x)))
The missing values of DelinquenciesLast7Years are now all filled.
3.3.6 Filling BankcardUtilization
Plot the distribution to check whether the median is suitable:
ggplot(newloandata, aes(x=BankcardUtilization)) +
  geom_density(fill="grey", alpha=0.4) +
  geom_vline(aes(xintercept=median(BankcardUtilization, na.rm = TRUE)), colour="red", linetype="dashed", lwd=1) +
  theme_few() + ggtitle("The density of BankcardUtilization")
Fill the missing values with the median:
### Fill the missing values with the median
newloandata$BankcardUtilization[which(is.na(newloandata$BankcardUtilization))] <- median(newloandata$BankcardUtilization, na.rm = TRUE)
### Check the missing values of BankcardUtilization again
sapply(newloandata, function(x) sum(is.na(x)))
The missing values of BankcardUtilization are now all filled.
Next, bin the BankcardUtilization values:
### Bin BankcardUtilization
q25 <- quantile(newloandata$BankcardUtilization, 0.25, na.rm = TRUE)
q50 <- quantile(newloandata$BankcardUtilization, 0.5, na.rm = TRUE)
newloandata$BankCardUse <- NA_character_
newloandata$BankCardUse[newloandata$BankcardUtilization < q25] <- "Mild Use"
newloandata$BankCardUse[newloandata$BankcardUtilization >= q25 & newloandata$BankcardUtilization < q50] <- "Medium Use"
newloandata$BankCardUse[newloandata$BankcardUtilization >= q50 & newloandata$BankcardUtilization < 1] <- "Heavy Use"
newloandata$BankCardUse[newloandata$BankcardUtilization >= 1] <- "Super Use"
newloandata$BankCardUse <- as.factor(newloandata$BankCardUse)
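The same four bins can also be produced in one call to `cut()`. The sketch below uses toy utilization values; `bin_card_use` is a hypothetical helper, and it assumes the median utilization is below 1 so that the break points are strictly increasing.

```r
# Hypothetical helper: bin utilization into the four usage classes
# using the first quartile, the median and 1 as cut points.
bin_card_use <- function(u) {
  q <- quantile(u, c(0.25, 0.5), na.rm = TRUE)
  cut(u, breaks = c(-Inf, q[1], q[2], 1, Inf),
      labels = c("Mild Use", "Medium Use", "Heavy Use", "Super Use"),
      right = FALSE)   # intervals are [lower, upper), matching the >= / < tests
}

u <- c(0.05, 0.20, 0.40, 0.60, 0.80, 0.95, 1.00, 1.20)
use <- bin_card_use(u)
data.frame(u, use)
```

`cut()` stops with an error if the break points are not unique, which is a useful guard when the quartile and the median coincide.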
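3.3.7 Filling DebtToIncomeRatio
### Loan status of the rows where DebtToIncomeRatio is missing
loandata_1 <- newloandata[which(is.na(newloandata$DebtToIncomeRatio)),]
names(loandata_1)
loandata_1 <- loandata_1[, c("LoanStatus","DebtToIncomeRatio")]
table(loandata_1$LoanStatus)
The share of unpaid loans among these rows is fairly large, so consider filling the missing values based on the quartiles:
summary(newloandata$DebtToIncomeRatio)
Q1 is 0.13 and Q3 is 0.3, so:
### Fill the missing values with random draws between Q1 and Q3
n_missing <- sum(is.na(newloandata$DebtToIncomeRatio))
newloandata$DebtToIncomeRatio[which(is.na(newloandata$DebtToIncomeRatio))] <- runif(n_missing, 0.13, 0.3)
### Check the missing values of DebtToIncomeRatio again
sapply(newloandata, function(x) sum(is.na(x)))
The missing values of DebtToIncomeRatio are now filled.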
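Because this fill is random, it differs on every run unless a seed is set. A minimal sketch of a reproducible version on toy values; `fill_between_quartiles` and its `seed` argument are hypothetical, and the quartiles are computed from the data instead of being typed in by hand.

```r
# Hypothetical helper: fill NAs with uniform draws between Q1 and Q3.
fill_between_quartiles <- function(x, seed = 42) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  miss <- is.na(x)
  set.seed(seed)   # fixes the draws; note this resets the global RNG state
  x[miss] <- runif(sum(miss), min = q[[1]], max = q[[2]])
  x
}

ratio <- c(0.10, 0.15, NA, 0.22, 0.30, NA, 0.45)
filled_ratio <- fill_between_quartiles(ratio)
filled_ratio
```

Observed values are left untouched; only the two NA positions receive draws from the interquartile range.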
3.3.8 Filling Occupation
### Find the positions where Occupation is missing
which(is.na(newloandata$Occupation))
### Inspect the EmploymentStatus of those rows
newloandata$EmploymentStatus[which(is.na(newloandata$Occupation))]
For the missing values, EmploymentStatus is either "Other" or "Not available", so "Other" can be used to fill them:
### Fill Occupation with "Other"
newloandata$Occupation[which(is.na(newloandata$Occupation))] <- "Other"
### Check whether Occupation still has missing values
sapply(newloandata, function(x) sum(is.na(x)))
The missing values of Occupation are now filled.
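3.3.9 Filling BorrowerState
### Loan status of the rows where BorrowerState is missing
loandata_2 <- newloandata[which(is.na(newloandata$BorrowerState)),]
names(loandata_2)
loandata_2 <- loandata_2[, c("LoanStatus","BorrowerState")]
table(loandata_2$LoanStatus)
The share of unpaid loans is fairly large, and this variable is just a label for the borrower's state, so the missing values can be replaced with a level of their own:
### Fill BorrowerState with "None"
newloandata$BorrowerState[which(is.na(newloandata$BorrowerState))] <- "None"
### Check whether BorrowerState still has missing values
sapply(newloandata, function(x) sum(is.na(x)))
The missing values of BorrowerState are now filled.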
3.3.10 Filling CreditGrade/ProsperRating.Alpha
Next come CreditGrade and ProsperRating.Alpha. These two variables hold the customer's credit rating before and after July 1, 2009 respectively, so the data has to be split at that date.
3.3.10.1 Filling CreditGrade
### Split the data at July 1, 2009
newloandata$ListingCreationDate <- as.Date(newloandata$ListingCreationDate)
loandata_before <- newloandata[newloandata$ListingCreationDate < as.Date("2009-07-01"),]
### Missing values of CreditGrade
sapply(loandata_before, function(x) sum(is.na(x)))
There are 131 missing values. That is few enough to ignore, so drop those rows:
### Keep only the rows where CreditGrade is present
loandata_before <- filter(loandata_before, !is.na(CreditGrade))
### Check whether CreditGrade still has missing values
sapply(loandata_before, function(x) sum(is.na(x)))
The missing values of CreditGrade have been dealt with.
3.3.10.2 Filling ProsperRating.Alpha
### Split the data at July 1, 2009
loandata_after <- newloandata[newloandata$ListingCreationDate >= as.Date("2009-07-01"),]
### Missing values of ProsperRating.Alpha
sapply(loandata_after, function(x) sum(is.na(x)))
After splitting at July 1, 2009, ProsperRating.Alpha has no missing values.
At this point all missing values have been handled.
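Step 4: Compute and visualize
This part looks at the following questions:
1. How does employment duration relate to loan status?
2. How does home ownership relate to loan status?
3. How does the consumer credit score relate to loan status?
4. How does the number of credit inquiries relate to loan status?
5. How does the credit rating relate to loan status?
6. How do the customer's occupation, monthly income and annual income relate to loan status?
7. How does the number of delinquencies in the past 7 years relate to loan status?
8. How does credit card usage relate to loan status?
9. How does borrowing on the Prosper platform relate to loan status?
10. How does the debt-to-income ratio relate to loan status?
11. How does the borrower rate relate to loan status?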
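4.1 How does employment duration relate to loan status?
Check whether employment duration and loan status are related, i.e. whether a longer time in employment goes with a better ability to repay.
library(ggplot2)
### 1. Employment duration vs. loan status
newloandata$EmploymentStatusDuration <- as.integer(newloandata$EmploymentStatusDuration)
ggplot(data = newloandata, aes(x = EmploymentStatusDuration, color = LoanStatus)) +
  geom_line(stat = 'bin') +
  labs(title = "The LoanStatus By EmploymentStatusDuration", x = "EmploymentStatusDuration", y = "Count", color = "LoanStatus")
The plot shows that the share of unpaid loans falls as employment duration grows, and at the longest durations there are essentially no defaults. In other words, someone with a stable income from work is less likely to default on a loan.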
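The plot compares raw counts of the two classes, so a per-bin repayment rate can make the trend easier to read. The sketch below uses toy durations and statuses, not the Prosper data; `repay_rate_by_bin` and the break points are hypothetical.

```r
# Hypothetical helper: share of repaid loans (status == "1") in each bin of x.
repay_rate_by_bin <- function(x, status, breaks) {
  bins <- cut(x, breaks = breaks, right = FALSE)   # [lower, upper) intervals
  tapply(status == "1", bins, mean)
}

duration <- c(2, 5, 10, 14, 30, 40, 55, 70)   # months, toy values
status   <- factor(c("0", "1", "0", "1", "1", "0", "1", "1"))
rates <- repay_rate_by_bin(duration, status, breaks = c(0, 12, 48, Inf))
rates
```

The same helper could be reused for the other numeric variables examined in this step, with break points chosen per variable.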
4.2 How does home ownership relate to loan status?
### 2. Home ownership vs. loan status
mosaicplot(table(newloandata$IsBorrowerHomeowner, newloandata$LoanStatus), main="The Loanstatus By IsBorrowerHomeowner", color = c('pink','skyblue'))
The plot shows that borrowers who own a home repay at a slightly higher rate than those who do not, but this factor has little effect on whether a loan is repaid.
4.3 How does the consumer credit score relate to loan status?
### 3. Consumer credit score vs. loan status
options(digits=1)  ## note: this changes how numbers are printed for the rest of the session
class(newloandata$CreditScore)
ggplot(data = newloandata, aes(x = CreditScore, color = LoanStatus)) +
  geom_line(stat = 'bin') +
  labs(title = "The LoanStatus By CreditScore", x = "CreditScore", y = "Count", color = "LoanStatus")
The plot shows that the repayment rate rises with the consumer credit score, so a person's credit score has some influence on whether the loan ends up repaid.
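4.4 How does the number of credit inquiries relate to loan status?
### 4. Credit inquiries vs. loan status
ggplot(data = newloandata[newloandata$InquiriesLast6Months < 20,], aes(x = InquiriesLast6Months, color = LoanStatus)) +
  geom_line(stat = 'bin') +
  labs(title = "The LoanStatus By InquiriesLast6Months", x = "InquiriesLast6Months", y = "Count", color = "LoanStatus")
When the number of inquiries is below 10, some effect on loan status is visible, but above 10 the repaid and unpaid curves essentially coincide. So it seems reasonable to guess that this variable has little influence on a borrower's ability to repay.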
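4.5 How does the credit rating relate to loan status?
### 5. Credit rating vs. loan status
par(mfrow=c(2,1))
### Effect of the rating used before July 1, 2009: CreditGrade
mosaicplot(table(loandata_before$CreditGrade, loandata_before$LoanStatus), main="The Loanstatus By CreditGrade", color = c('pink','skyblue'))
### Effect of the rating used after July 1, 2009: ProsperRating.Alpha
mosaicplot(table(loandata_after$ProsperRating.Alpha, loandata_after$LoanStatus), main="The Loanstatus By ProsperRating.Alpha", color = c('pink','skyblue'))
The mosaic plots show that the higher the credit rating, the higher the repayment rate: AA has the highest repayment rate and NC the lowest. Most borrowers are rated C or D, and the gap between the AA and NC repayment rates is large, so the credit rating has some influence on loan status.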
4.6 How are occupations distributed, and how do monthly and annual income relate to loan status?
Occupation distribution:
ggplot(data=newloandata, aes(x=Occupation)) + geom_bar() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
"Other" is the most common occupation, which matches what the data processing showed earlier. It suggests that many applicants do not state their actual occupation when applying for a loan, or that some may be trying to mislead.
Monthly and annual income vs. loan status:
### Monthly income vs. loan status
newloandata$Monthly <- cut(newloandata$StatedMonthlyIncome,
  breaks = c(-Inf, 3000, 6000, 9000, 12000, 15000, 20000, Inf),
  labels = c("0-3000","3000-6000","6000-9000","9000-12000","12000-15000","15000-20000",">20000"),
  right = FALSE)  ## intervals are [lower, upper); ">20000" is kept as a level
p1 <- ggplot(data = newloandata, aes(x = Monthly, fill = LoanStatus)) +
  geom_bar(position="fill") + ggtitle("The Loanstatus By MonthlyIncome") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
### Annual income vs. loan status
## The levels must match the strings in the data exactly; check with table(newloandata$IncomeRange)
newloandata$IncomeRange <- factor(newloandata$IncomeRange, levels=c("Not employed","Not displayed","$0",
  "$1-24,999","$25,000-49,999","$50,000-74,999","$75,000-99,999","$100,000+"))
p2 <- ggplot(data = newloandata, aes(x = IncomeRange, fill = LoanStatus)) +
  geom_bar(position="fill") + ggtitle("The Loanstatus By IncomeRange") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
library(gridExtra)
grid.arrange(p1, p2, ncol=2)
The plots show that the repayment rate is somewhat higher at higher monthly incomes, but the difference is small. The same holds for annual income: higher earners repay at a higher rate, but again the difference is small. Income alone is therefore not enough to judge whether someone will repay.
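4.7 How does the number of delinquencies in the past 7 years relate to loan status?
### 7. Delinquencies in the past 7 years vs. loan status
ggplot(data = newloandata, aes(x = DelinquenciesLast7Years, color = LoanStatus)) +
  geom_line(stat = 'bin') +
  labs(title = "The LoanStatus By DelinquenciesLast7Years", x = "DelinquenciesLast7Years", y = "Count", color = "LoanStatus")
Customers with no delinquencies in the past 7 years repay at a higher rate, and the more delinquencies, the lower the repayment rate.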
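4.8 How does credit card usage relate to loan status?
### 8. Credit card usage vs. loan status
ggplot(data = newloandata, aes(x = BankCardUse, fill = LoanStatus)) +
  geom_bar(position="fill") + ggtitle("The Loanstatus By BankCardUse") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Borrowers whose credit card usage is "Mild Use" or "Medium Use" repay at a relatively high rate, while "Super Use" has the lowest repayment rate. Credit card usage can therefore give a first indication of a borrower's ability to repay.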
4.9 How does borrowing on the Prosper platform relate to loan status?
### 9. Amount borrowed on Prosper vs. loan status
newloandata$LoanOriginal <- NA_character_
newloandata$LoanOriginal[newloandata$LoanOriginalAmount >= 1000 & newloandata$LoanOriginalAmount < 4000] <- "1000-4000"
newloandata$LoanOriginal[newloandata$LoanOriginalAmount >= 4000 & newloandata$LoanOriginalAmount < 7000] <- "4000-7000"
newloandata$LoanOriginal[newloandata$LoanOriginalAmount >= 7000 & newloandata$LoanOriginalAmount < 10000] <- "7000-10000"
newloandata$LoanOriginal[newloandata$LoanOriginalAmount >= 10000 & newloandata$LoanOriginalAmount <= 13000] <- "10000-13000"
newloandata$LoanOriginal[newloandata$LoanOriginalAmount > 13000] <- ">13000"
## Order the levels of the new LoanOriginal column
newloandata$LoanOriginal <- factor(newloandata$LoanOriginal, levels=c("1000-4000","4000-7000",
  "7000-10000","10000-13000",">13000"))
ggplot(data=newloandata, aes(x=LoanOriginal, fill=LoanStatus)) +
  geom_bar(position = "fill") +
  ggtitle("The Loanstatus By LoanOriginalAmount")
The amount borrowed on Prosper has little effect on loan status: the repayment rate is roughly the same across the bins.
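4.10 How does the debt-to-income ratio relate to loan status?
summary(newloandata$DebtToIncomeRatio)
The printed quartiles of DebtToIncomeRatio all show as 0 while the maximum is 10, so most values fall below 1. (The zeros are most likely a rounding effect of the options(digits=1) set in 4.3; the quartiles found in 3.3.7 were 0.13 and 0.3.)
### 10. Debt-to-income ratio vs. loan status
ggplot(data = newloandata[newloandata$DebtToIncomeRatio < 1,], aes(x = DebtToIncomeRatio, color = LoanStatus)) +
  geom_line(stat = 'bin') +
  labs(title = "The LoanStatus By DebtToIncomeRatio", x = "DebtToIncomeRatio", y = "Count", color = "LoanStatus")
The lower the debt ratio, the higher the repayment rate. In other words, borrowers whose existing debt is low are better able to repay.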
4.11 How does the borrower rate relate to loan status?
### 11. Borrower rate vs. loan status
ggplot(data = newloandata, aes(x = BorrowerRate, color = LoanStatus)) +
  geom_line(stat = 'bin') +
  labs(title = "The LoanStatus By BorrowerRate", x = "BorrowerRate", y = "Count", color = "LoanStatus")
The higher the borrower rate, the lower the repayment rate, so this variable does affect loan status.
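Step 5: Build the model and predict
The analysis above shows that EmploymentStatusDuration, CreditScore, CreditGrade, ProsperRating.Alpha, DelinquenciesLast7Years, BankCardUse, DebtToIncomeRatio and BorrowerRate each have some effect on loan status, so these are used as predictors in the model.
### Modelling
### Training and test sets, with July 1, 2009 as the dividing line
### Randomly draw 70% of loandata_before as the training set; the remaining 30% is the test set
set.seed(156)
train_idx_before <- sample(nrow(loandata_before), 0.7*nrow(loandata_before))
train_before <- loandata_before[train_idx_before,]
test_before <- loandata_before[-train_idx_before,]
### Build the model with a random forest
library(randomForest)
before_mode <- randomForest(LoanStatus ~ StatusDuration + CreditScore +
                            CreditGrade + Delinquencies +
                            BankCardUse + DebtRatio + LoanBorrowerRate,
                            data=train_before, importance=TRUE)
Because the variables used in the model need to be factors, and a factor should not have too many levels, each one is grouped to reduce the number of levels. EmploymentStatusDuration, CreditScore, DelinquenciesLast7Years, DebtToIncomeRatio and BorrowerRate are binned; StatusDuration, Delinquencies, DebtRatio and LoanBorrowerRate in the formula above are presumably the resulting binned columns, whose construction is not shown here.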
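The binning step itself is not shown, so the following is only a guess at what it might look like: a sketch that turns a numeric column into a factor with a handful of levels, using quantiles as cut points. The helper name `bin_by_quantile`, the choice of quantile cut points and the number of bins are all assumptions, and the rates are toy values.

```r
# Hypothetical helper: bin a numeric vector into at most n_bins levels,
# with quantiles as cut points (an assumption; the actual cut points
# used for the model are not given in the text).
bin_by_quantile <- function(x, n_bins = 4) {
  probs  <- seq(0, 1, length.out = n_bins + 1)
  breaks <- unique(quantile(x, probs, na.rm = TRUE))  # drop duplicate cut points
  cut(x, breaks = breaks, include.lowest = TRUE)
}

rate <- c(0.06, 0.09, 0.12, 0.15, 0.18, 0.21, 0.25, 0.31)
rate_bin <- bin_by_quantile(rate)
table(rate_bin)
```

If the formula's `LoanBorrowerRate` is indeed a binned `BorrowerRate`, it could be built as `bin_by_quantile(newloandata$BorrowerRate)` before the before/after split, so that both subsets share the same levels. Heavily tied columns may need hand-picked break points instead.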
### Plot the model error
plot(before_mode, ylim = c(0,1))
legend("topright", colnames(before_mode$err.rate), col=1:3, fill=1:3)
The plot shows that the model's error is lower for predicting repayment than for predicting non-repayment, so it is easier to predict who is likely to repay.
### Analyse variable importance
imp <- importance(before_mode)
varImportance <- data.frame(variables=row.names(imp), Importance=round(imp[,'MeanDecreaseGini'],2))
### Rank the variables by importance
library(dplyr)
rankImportance <- varImportance %>% mutate(Rank = paste0('#', dense_rank(desc(Importance))))
### Plot variable importance with ggplot
ggplot(rankImportance, aes(x=reorder(variables,Importance), y=Importance, fill=Importance)) +
  geom_bar(stat='identity') +
  geom_text(aes(x=variables, y=0.5, label=Rank), hjust=0, vjust=0.55, size=4, colour='red') +
  labs(x='Variables') +
  coord_flip() + theme_few() + ggtitle('The Importance of Variables')
The three most important variables are BorrowerRate, CreditGrade and DebtToIncomeRatio.
### Predict on the test set
predit_before <- predict(before_mode, test_before)
pert_before <- table(test_before$LoanStatus, predit_before, dnn = c("Actual","Predicted"))
pert_before
      Predicted
Actual    0    1
     0 1179 2022
     1  822 4662
The model predicts repayers fairly well, but its overall accuracy is only 67.25% ((1179 + 4662) / 8685), so the choice of predictors still needs work.
Next, the model for loans after July 1, 2009:
### Training and test sets, with July 1, 2009 as the dividing line
### Randomly draw 70% of loandata_after as the training set; the remaining 30% is the test set
set.seed(187)
train_idx_after <- sample(nrow(loandata_after), 0.7*nrow(loandata_after))
train_after <- loandata_after[train_idx_after,]
test_after <- loandata_after[-train_idx_after,]
### Build the model with a random forest
library(randomForest)
after_mode <- randomForest(LoanStatus ~ StatusDuration + CreditScore +
                           ProsperRating.Alpha + Delinquencies +
                           BankCardUse + DebtRatio + LoanBorrowerRate,
                           data=train_after, importance=TRUE)
Check the model's error:
### Plot the model error
plot(after_mode, ylim = c(0,1))
legend("topright", colnames(after_mode$err.rate), col=1:3, fill=1:3)
Again it is easier to predict who will repay; the error for non-repayment is larger.
### Analyse variable importance
imp <- importance(after_mode)
varImportance <- data.frame(variables=row.names(imp), Importance=round(imp[,'MeanDecreaseGini'],2))
### Rank the variables by importance
library(dplyr)
rankImportance <- varImportance %>% mutate(Rank = paste0('#', dense_rank(desc(Importance))))
### Plot variable importance with ggplot
ggplot(rankImportance, aes(x=reorder(variables,Importance), y=Importance, fill=Importance)) +
  geom_bar(stat='identity') +
  geom_text(aes(x=variables, y=0.5, label=Rank), hjust=0, vjust=0.55, size=4, colour='red') +
  labs(x='Variables') +
  coord_flip() + theme_few() + ggtitle('The Importance of Variables')
The three most important variables are BorrowerRate, ProsperRating.Alpha and DebtToIncomeRatio, the same pattern as before.
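### Predict on the test set
predit_after <- predict(after_mode, test_after)
pert_after <- table(test_after$LoanStatus, predit_after, dnn = c("Actual","Predicted"))
pert_after
      Predicted
Actual    0    1
     0   39 1855
     1   48 6542
This model's accuracy is 77.57% ((39 + 6542) / 8484), well above that of the pre-July 2009 model. One reading is that when the platform changed its credit rating it also revised its assessment model, protecting the platform's interests. Note, though, that the model labels almost every test loan as repaid (only 39 of the 1,894 unpaid loans are caught), so much of this accuracy reflects the class balance.
And indeed, repayment is predicted more accurately than non-repayment.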
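Overall accuracy hides how differently the two classes are predicted, so it helps to compute per-class rates from the two confusion matrices reported above. A minimal sketch: the counts are copied from those tables, and `confusion_metrics` and `make_cm` are hypothetical helper names.

```r
# Hypothetical helper: summary rates from a 2x2 confusion matrix whose
# rows are actual classes and columns predicted classes, ordered "0", "1".
confusion_metrics <- function(cm) {
  c(accuracy      = sum(diag(cm)) / sum(cm),
    recall_unpaid = cm["0", "0"] / sum(cm["0", ]),  # unpaid loans caught
    recall_repaid = cm["1", "1"] / sum(cm["1", ]),  # repaid loans caught
    share_repaid  = sum(cm["1", ]) / sum(cm))       # accuracy of always predicting "1"
}

make_cm <- function(v) {
  matrix(v, nrow = 2, byrow = TRUE,
         dimnames = list(Actual = c("0", "1"), Predicted = c("0", "1")))
}

before <- make_cm(c(1179, 2022, 822, 4662))   # pre-July 2009 test set
after  <- make_cm(c(39, 1855, 48, 6542))      # post-July 2009 test set
metrics <- rbind(before = confusion_metrics(before),
                 after  = confusion_metrics(after))
round(metrics, 4)
```

Comparing `accuracy` with `share_repaid` shows how much a model gains over simply predicting that everyone repays, which is a more telling comparison between the two periods than accuracy alone.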
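Step 6: Summary
This exercise gave me a better understanding of random forest prediction models. I ran into many problems while building the model and solved them by searching online. That took some time, but when the same problems came up again while building the post-July 2009 model, I could at least solve them quickly.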