CN110019166A

CN110019166A - Method for screening attribute data and customer loss early warning method

Info

Publication number: CN110019166A
Application number: CN201711417983.9A
Authority: CN
Inventors: 田雨农; 苍柏; 唐丽娜
Original assignee: Dalian Roiland Technology Co Ltd
Current assignee: Dalian Roiland Technology Co Ltd
Priority date: 2017-12-25
Filing date: 2017-12-25
Publication date: 2019-07-16

Abstract

A method for screening attribute data and a customer churn early warning method, belonging to the field of data processing. The method for screening attribute data comprises: screening attribute data using an information gain method; screening attribute data using a point-biserial correlation coefficient method; and obtaining several attribute data from each of the two methods and taking the intersection of the two to obtain the screened attribute data. The effect is that combining the information gain method with the point-biserial correlation coefficient method improves screening accuracy.

Description

Method for screening attribute data and customer churn early warning method

Technical field

The invention belongs to the field of data processing and relates to a method for screening attribute data and a customer churn early warning method.

Background

As the number of car dealers and repair stations keeps growing, competition in the automotive after-sales service market has become increasingly fierce. Together with customers' growing sensitivity to service prices and rising expectations of in-store service, this has pushed the customer churn rate of 4S stores up year after year. The direct consequence of heavy customer churn is economic loss; at a deeper level it also indirectly damages the reputation of the 4S store, which then falls into a vicious circle of declining revenue and reputation. From the perspective of a 4S store, effectively identifying customers with a high probability of churn and successfully retaining them are therefore top priorities of customer relationship management. In addition, the maturing of data mining technology and the continuous accumulation of 4S store operating data provide a good foundation for exploring the likelihood of customer churn from the data. The present invention makes in-depth use of 4S store operating data and proposes a customer churn early warning model based on a decision tree algorithm. With this model, a 4S store can obtain a list of the customers with a high probability of churning in the coming period, which provides good conditions for customer relationship maintenance.

Summary of the invention

To solve the above problems, the present invention proposes the following scheme: a method for screening attribute data, comprising: screening attribute data using an information gain method; screening attribute data using a point-biserial correlation coefficient method; and obtaining several attribute data from each of the information gain method and the point-biserial correlation coefficient method, and taking the intersection of the two to obtain the screened attribute data.

The present invention also proposes a customer churn early warning method, comprising: S1. collecting customers' basic attribute data, vehicle purchase data, and after-sales 4S store visit behavior data; S2. determining the target variable and the independent variables; S3. screening the independent variables; S4. constructing a decision tree model; S5. making actual predictions with the decision tree model and issuing a churn warning when necessary; wherein the independent variables are screened by any of the methods for screening attribute data described in the present invention.

Beneficial effects: combining the two independent-variable screening methods, the information gain method and the point-biserial correlation coefficient method, provides a new approach to variable screening for classification models and improves screening accuracy. In addition, the selected attributes reflect the characteristics of customer relationship management in the automotive industry, so the resulting decision tree model is better suited to the industry and offers a feasible customer churn early warning scheme for automotive customer relationship management.

Brief description of the drawings

Figure 1 is a flow chart of customer churn early warning.

Detailed description

The present invention is mainly implemented through the following technical solution:

1. Collect each customer's basic attribute data, vehicle purchase data, and after-sales 4S store visit behavior data, and build a database

1) Basic customer attribute data mainly include ID number, name, gender, age, province, city, contact information, education level, hobbies, industry, and similar information;

2) Vehicle purchase data mainly include chassis number, dealer, model, selling price, and similar data;

3) After-sales 4S store visit behavior data mainly include visit type (for example repair, maintenance, accident, or claim), visit time, amount spent, mileage at the visit, labor cost, spare parts cost, settlement date, and maintenance and repair items.

2. Data cleaning

1) Missing value handling: for example, a missing gender, age, province, or city can be decoded from the corresponding digits of the ID number; a missing model can be decoded from the chassis number; a missing selling price can be filled with the mean selling price of that vehicle model; a missing amount spent can be calculated from the maintenance items and their unit prices; and so on;

2) Identification and handling of noisy data: because of how the data are collected and entered, the after-sales visit data contain some noise. First, these data must be identified; the present invention mainly uses the DBSCAN algorithm (a density-based clustering algorithm) to identify noisy data. Second, the identified outliers are "smoothed" using a binning method.
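The two cleaning steps above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the field names (`id_number`, `gender`, `birth_year`, `model`, `price`), the function names, and the parameters `eps`, `min_pts`, and `bin_size` are all hypothetical. The ID decoding assumes the standard 18-digit resident ID layout (digits 7 to 10 hold the birth year; the 17th digit is odd for male, even for female). The DBSCAN shown is a simplified one-dimensional version, and the smoothing replaces each noise point with the median of its equal-frequency bin, which is one possible reading of "binning".

```python
from statistics import mean, median

def fill_from_id(record):
    """Fill a missing gender / birth year from an 18-digit ID number (assumed layout)."""
    rec = dict(record)
    id_number = rec.get("id_number")
    if id_number and len(id_number) == 18:
        if rec.get("gender") is None:
            # 17th digit: odd = male, even = female
            rec["gender"] = "M" if int(id_number[16]) % 2 == 1 else "F"
        if rec.get("birth_year") is None:
            rec["birth_year"] = int(id_number[6:10])
    return rec

def fill_price_by_model(records):
    """Replace a missing price with the mean price of the same vehicle model."""
    by_model = {}
    for r in records:
        if r.get("price") is not None:
            by_model.setdefault(r["model"], []).append(r["price"])
    filled = []
    for r in records:
        r = dict(r)
        if r.get("price") is None and r["model"] in by_model:
            r["price"] = mean(by_model[r["model"]])
        filled.append(r)
    return filled

def dbscan_1d(values, eps, min_pts):
    """Minimal 1-D DBSCAN: returns a cluster id per value, -1 marks noise."""
    n = len(values)
    labels = [None] * n  # None = not yet visited

    def neighbours(i):
        return [j for j in range(n) if abs(values[i] - values[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels

def smooth_noise_by_bin_median(values, labels, bin_size):
    """Replace noise-labelled values with the median of their equal-frequency bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    smoothed = list(values)
    for start in range(0, len(order), bin_size):
        bin_idx = order[start:start + bin_size]
        bin_median = median(values[i] for i in bin_idx)
        for i in bin_idx:
            if labels[i] == -1:
                smoothed[i] = bin_median
    return smoothed
```

In practice a library implementation of DBSCAN over several columns at once would probably be preferred; the pure-Python version only makes the noise-labelling logic visible.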

3. Determine the target variable

The target variable (churned or not churned) is determined by whether the customer visits the 4S store within a specified period: a customer who visits within the period is a non-churned customer, and otherwise a churned customer. Commonly used time windows are 3 months, 6 months, or one year; the present invention uses one year.
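A minimal sketch of this labelling rule, assuming each customer's most recent visit date is known. The function name, the `reference_date` parameter, the default of 365 days for the one-year window, and the treatment of a customer with no recorded visit as churned are assumptions for illustration only.

```python
from datetime import date, timedelta

def label_churn(last_visit, reference_date, window_days=365):
    """Return 1 (churned) if no visit fell within the window before reference_date, else 0."""
    if last_visit is None:
        return 1  # assumption: no recorded visit counts as churned
    within_window = reference_date - last_visit <= timedelta(days=window_days)
    return 0 if within_window else 1
```

Passing `window_days=90` or `window_days=180` would approximate the 3-month and 6-month windows mentioned above.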

4. Compute the independent variables

From the cleaned data obtained in step 2, compute several basic attributes related to customer churn behavior, mainly including age, selling price, vehicle age, mileage at the last 4S store visit, number of 4S store visits, number of accidents, cumulative accident cost, number of maintenance visits, average maintenance visits per year, average cost per maintenance visit, number of overdue maintenance visits, cumulative maintenance cost, number of repairs, average repairs per year, average cost per repair, and cumulative repair cost.
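As a rough sketch of how a few of these attributes might be derived from one customer's visit records: the record layout (`type`, `date` as an ISO `YYYY-MM-DD` string, `mileage`, `amount`), the visit type names, and the output keys are hypothetical and not taken from the patent.

```python
def customer_features(visits, car_age_years):
    """Aggregate one customer's visit records into some of the attributes listed above."""
    maint = [v for v in visits if v["type"] == "maintenance"]
    accident = [v for v in visits if v["type"] == "accident"]
    maint_cost = sum(v["amount"] for v in maint)
    # ISO date strings sort chronologically, so max() picks the latest visit
    last = max(visits, key=lambda v: v["date"]) if visits else None
    return {
        "visit_count": len(visits),
        "last_mileage": last["mileage"] if last else None,
        "accident_count": len(accident),
        "accident_cost_total": sum(v["amount"] for v in accident),
        "maintenance_count": len(maint),
        "maintenance_cost_total": maint_cost,
        "maintenance_cost_per_visit": maint_cost / len(maint) if maint else 0.0,
        "maintenance_per_year": len(maint) / car_age_years if car_age_years > 0 else 0.0,
    }
```

The repair-related attributes would follow the same pattern with a different visit type; overdue maintenance would additionally need the recommended service interval, which the text does not specify.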

5. Screen the independent variables

The importance of the independent variables from step 4 is evaluated by combining the information gain method and the point-biserial correlation coefficient method, so as to select the attributes of higher importance.

The main process of the information gain method is as follows:

1) Compute the expected information Info(D) needed to correctly classify an observation in D:

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

where D denotes the set of all observations, $p_i$ is the non-zero probability that an arbitrary observation in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$, and the sum runs over the m classes;

2) Compute the amount of information needed to classify the observations in D according to attribute A:

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$

where attribute A takes v distinct values $\{a_1, a_2, \ldots, a_v\}$ in data set D, so that A can be used to partition D into v subsets $\{D_1, D_2, \ldots, D_v\}$, where $D_j$ contains the observations in D whose value of A is $a_j$;

3) Compute the information gain of attribute A:

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$

4) Set a threshold and remove the basic attributes whose information gain is very small.
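The four steps above can be sketched in Python as follows. This assumes categorical (or already discretized) attribute values; continuous attributes such as selling price or mileage would need to be binned first, and the text does not say how. The function names and the `threshold` parameter are illustrative.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D): entropy of the class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_a(values, labels):
    """Info_A(D): size-weighted entropy of the subsets obtained by partitioning on A."""
    n = len(labels)
    partitions = {}
    for a, y in zip(values, labels):
        partitions.setdefault(a, []).append(y)
    return sum(len(part) / n * info(part) for part in partitions.values())

def gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_a(values, labels)

def select_by_gain(table, labels, threshold):
    """Keep attributes whose gain reaches the threshold (table: name -> list of values)."""
    return {name for name, col in table.items() if gain(col, labels) >= threshold}
```

An attribute that splits the two classes perfectly has a gain equal to Info(D), while one whose subsets keep the original class mix has a gain of zero and would be removed by any positive threshold.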

The main process of the point-biserial correlation coefficient method is as follows:

1) Compute the proportion $Y_p$ of observations for which the target variable Y takes one value, and the proportion $Y_q$ for which it takes the other value;

2) Compute the mean $\bar{X}_p$ of the part of the independent variable X corresponding to $Y_p$;

3) Compute the mean $\bar{X}_q$ of the part of the independent variable X corresponding to $Y_q$;

4) Compute the standard deviation $S_x$ of the independent variable X;

5) Compute the correlation coefficient between the independent variable X and the target variable Y according to the formula

$$r = \frac{\bar{X}_p - \bar{X}_q}{S_x}\sqrt{Y_p Y_q}$$
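A short Python sketch of these five steps, using the standard point-biserial formula. It assumes Y is coded as 0/1 and that the standard deviation of X is the population standard deviation (with that choice the result coincides with the Pearson correlation); the text does not state which form of standard deviation is intended.

```python
from math import sqrt
from statistics import mean, pstdev

def point_biserial(x, y):
    """Point-biserial correlation between numeric x and binary y (coded 0/1)."""
    xp = [xi for xi, yi in zip(x, y) if yi == 1]
    xq = [xi for xi, yi in zip(x, y) if yi == 0]
    if not xp or not xq:
        raise ValueError("y must contain both classes")
    y_p = len(xp) / len(x)  # proportion of observations with y == 1
    y_q = len(xq) / len(x)  # proportion of observations with y == 0
    s_x = pstdev(x)         # population standard deviation of X (assumption)
    if s_x == 0:
        return 0.0          # constant X carries no information about Y
    return (mean(xp) - mean(xq)) / s_x * sqrt(y_p * y_q)
```

Because the sign only reflects which class is coded as 1, a screening rule would normally compare the absolute value of the coefficient against a threshold.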

The information gain method and the point-biserial correlation coefficient method each yield several independent variables of higher importance; taking the intersection of the two sets gives the combined result of the two methods.

After screening with the above methods, the independent variables of higher importance are: mileage at the last 4S store visit, number of accidents, selling price, average maintenance visits per year, average cost per maintenance visit, and number of overdue maintenance visits.
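The intersection step could be sketched as below, assuming the two methods have already produced a score per attribute. The function name, the dictionary inputs, and both thresholds are hypothetical; the patent does not give threshold values.

```python
def screen_attributes(gain_scores, corr_scores, gain_threshold, corr_threshold):
    """Keep only the attributes selected by both screening methods."""
    by_gain = {a for a, g in gain_scores.items() if g >= gain_threshold}
    # the sign of the correlation depends on class coding, so compare magnitudes
    by_corr = {a for a, r in corr_scores.items() if abs(r) >= corr_threshold}
    return by_gain & by_corr
```

Taking the intersection is the stricter of the obvious combinations: an attribute survives only if both the entropy-based and the correlation-based criteria rate it as important.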

6. Build the decision tree model

Build a decision tree model from the independent variables selected in step 5: compute the information gain of each attribute, select the attribute with the largest information gain as the root node, and create a branch for each value of that attribute. For each branch, compute the information gain of all remaining attributes, again select the attribute with the largest information gain as the new split node, and create the corresponding branches. Repeat this process recursively until no attributes remain; the node is then defined as a leaf node and labeled with the class that has the most samples among all the samples at that node. In addition, leaf nodes that contain no samples must be pruned.
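The recursive procedure described above corresponds closely to the classic ID3 algorithm, and a compact sketch is given below under some assumptions: attributes are categorical or already discretized, each row is a dictionary, and the tree is represented as nested tuples. The sketch also stops early when a node is pure, a common shortcut that the text does not mention, and it branches only on values actually observed at a node, so empty leaves never arise and need no separate pruning. All names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr], []).append(y)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or a tuple (attr, {value: subtree}, majority_class)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority  # leaf: pure node, or no attributes left to split on
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in {row[best] for row in rows}:  # only values seen at this node
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx], remaining)
    return (best, branches, majority)

def predict(tree, row):
    """Walk the tree; fall back to the node's majority class for unseen values."""
    while isinstance(tree, tuple):
        attr, branches, majority = tree
        if row[attr] not in branches:
            return majority
        tree = branches[row[attr]]
    return tree
```

Class labels are assumed to be strings or integers here, since tuples are reserved for internal nodes.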

7. Verify the accuracy of the model

Feed customer data labeled as churned or non-churned into the constructed decision tree model, compare the predicted results with the actual results to determine the accuracy of the model, and revise the model accordingly.
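One simple way to quantify the difference between predicted and actual results is sketched below. The text only speaks of accuracy; precision and recall for the churned class are added here as an assumption, because churned customers are often a minority and accuracy alone can be misleading. The function name and the `"churn"` label are illustrative.

```python
def evaluate(actual, predicted, positive="churn"):
    """Compare predicted and actual labels; `positive` is the churned class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

The labeled data used for this check should be held out from the data used to build the tree; otherwise the measured accuracy will be optimistic.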

8. Actual prediction and churn warning

Use the revised churn prediction model to make predictions for the current non-churned customers, focus on the customers with a higher churn probability, and issue churn warnings.
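As a minimal illustration of the warning step, assuming the model yields an estimated churn probability per customer (for a decision tree this could be the share of churned training samples in the leaf a customer falls into, which is an assumption not stated in the text). The function name, the customer ids, and the default threshold of 0.5 are hypothetical.

```python
def churn_warnings(churn_probability, threshold=0.5):
    """Return ids of customers whose churn probability reaches the threshold, highest first."""
    flagged = [(cid, p) for cid, p in churn_probability.items() if p >= threshold]
    flagged.sort(key=lambda item: item[1], reverse=True)
    return [cid for cid, _ in flagged]
```

Sorting by probability lets the 4S store contact the customers most at risk first when follow-up capacity is limited.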

Claims

1. A method for screening attribute data, characterized by comprising:

screening attribute data using an information gain method;

screening attribute data using a point-biserial correlation coefficient method;

obtaining several attribute data from each of the information gain method and the point-biserial correlation coefficient method, and taking the intersection of the two to obtain the screened attribute data.

2. The method for screening attribute data according to claim 1, characterized in that the step of screening attribute data using the information gain method comprises:

1) computing the expected information Info(D) needed to correctly classify an observation in D:

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

where D denotes the set of all observations, $p_i$ is the non-zero probability that an arbitrary observation in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$, i is the index of the value taken by an observation, and m is the total number of such values;

2) computing the amount of information needed to classify the observations in D according to attribute A:

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$

where attribute A takes v distinct values $\{a_1, a_2, \ldots, a_v\}$ in data set D, D is partitioned by attribute A into v subsets $\{D_1, D_2, \ldots, D_v\}$, and $D_j$ contains the observations in D whose value of A is $a_j$;

3) computing the information gain of attribute A:

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D);$$

4) setting a threshold and, according to the gain, removing part of the basic attribute data from D; the remaining basic attribute data are the screened attribute data.

3. The method for screening attribute data according to claim 1, characterized in that the step of screening attribute data using the point-biserial correlation coefficient method comprises:

1) computing the proportion $Y_p$ of observations for which the target variable Y takes one value, and the proportion $Y_q$ for which it takes the other value;

2) computing the mean $\bar{X}_p$ of the part of the independent variable X corresponding to $Y_p$;

3) computing the mean $\bar{X}_q$ of the part of the independent variable X corresponding to $Y_q$;

4) computing the standard deviation $S_x$ of the independent variable X;

5) computing the correlation coefficient between the independent variable X and the target variable Y according to the formula

$$r = \frac{\bar{X}_p - \bar{X}_q}{S_x}\sqrt{Y_p Y_q}.$$

4. The method for screening attribute data according to claim 1, characterized in that a decision tree model is built from the screened attribute data as follows: the attribute with the largest information gain is selected as the root node, and a branch is created for each value of that attribute; for each branch, the attribute with the largest information gain among all remaining attributes is selected as the new split node, and the corresponding branches are created; the above process is repeated recursively until no attributes remain, at which point the node is defined as a leaf node and labeled with the class that has the most samples among all the samples.

5. A customer churn early warning method, characterized by comprising:

S1. collecting customers' basic attribute data, vehicle purchase data, and after-sales 4S store visit behavior data;

S2. determining the target variable and the independent variables;

S3. screening the independent variables;

S4. constructing a decision tree model;

S5. making actual predictions with the decision tree model, and issuing a churn warning when necessary;

wherein the independent variables are screened by the method for screening attribute data according to any one of claims 1-3.

6. The customer churn early warning method according to claim 5, characterized in that the decision tree model is constructed using the method of building a decision tree model in the method for screening attribute data according to claim 4.

7. The customer churn early warning method according to claim 5, characterized in that step S1 includes step S1_2, data cleaning, the data cleaning comprising one or more of the following:

1) missing value handling;

2) identification and handling of noisy data.

8. The customer churn early warning method according to claim 5, characterized by further comprising the step: S6. verifying the accuracy of the model: customer data labeled as churned or non-churned are fed into the constructed decision tree model for analysis, and the predicted results are compared with the actual results to determine the accuracy of the model and to revise the model.

9. The customer churn early warning method according to claim 5 or 8, characterized by further comprising step S6. actual prediction and churn warning: predictions are made for the current non-churned customers using the decision tree model or the revised decision tree model, and churn warnings are issued for the customers with a higher churn probability.