CN110019166A - Method for screening attribute data and customer loss early warning method
- Publication number: CN110019166A
- Application number: CN201711417983.9A
- Authority: CN (China)
- Prior art keywords: data, attribute, attribute data, screening, early warning
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
- G06Q30/0202—Market predictions or forecasting for commercial activities
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Strategic Management (AREA)
- Finance (AREA)
- Development Economics (AREA)
- Data Mining & Analysis (AREA)
- Accounting & Taxation (AREA)
- General Physics & Mathematics (AREA)
- Entrepreneurship & Innovation (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Probability & Statistics with Applications (AREA)
- General Business, Economics & Management (AREA)
- Game Theory and Decision Science (AREA)
- Economics (AREA)
- Marketing (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Quality & Reliability (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Description
Technical field
The invention belongs to the field of data processing and relates to a method for screening attribute data and a customer churn early warning method.
Background
As the number of car dealers and repair stations keeps growing, competition in the automotive after-sales service market has intensified. Customers are also increasingly sensitive to service prices and increasingly demanding about in-store service, so the customer churn rate of 4S stores rises year by year. The direct consequence of heavy churn is economic loss. It also indirectly damages a 4S store's reputation, which traps the store in a vicious circle of declining revenue and reputation. From the 4S store's perspective, effectively identifying customers with a high churn probability and successfully retaining them are therefore the top priorities of customer relationship management. In addition, the maturing of data mining technology and the steady accumulation of 4S store operating data provide a good basis for studying the likelihood of churn from the data. Building on 4S store operating data, the invention proposes a customer churn early warning model based on a decision tree algorithm. With this model, a 4S store can obtain a list of the customers most likely to churn over a coming period, which supports customer relationship maintenance.
Summary of the invention
To solve the above problem, the invention proposes the following scheme: a method for screening attribute data, comprising: screening attribute data using the information gain method; screening attribute data using the point-biserial correlation coefficient method; and taking the intersection of the attribute data obtained by each of the two methods to obtain the screened attribute data.
The invention further proposes a customer churn early warning method, characterized by comprising: S1. collecting customers' basic attribute data, vehicle purchase data, and after-sales 4S store visit behavior data; S2. determining the target variable and the independent variables; S3. screening the independent variables; S4. constructing a decision tree model; S5. using the decision tree model for actual prediction and issuing a churn warning when necessary; wherein the independent variables are screened with the method for screening attribute data according to any embodiment of the invention.
Beneficial effects: combining the two independent-variable screening methods, the information gain method and the point-biserial correlation coefficient method, offers a new approach to variable screening for classification models and is intended to improve screening accuracy. In addition, the selected attributes are specific to customer relationship management in the automotive industry, so the resulting decision tree model is better suited to that industry and provides a feasible customer churn early warning scheme for it.
Description of drawings
Figure 1 is a flowchart of customer churn early warning.
Detailed description
The invention is mainly realized through the following technical solution:
1. Collect each customer's basic attribute data, vehicle purchase data, and after-sales 4S store visit behavior data, and build a database
1) Basic customer attribute data mainly include ID card number, name, gender, age, province, city, contact information, education level, hobbies, and industry;
2) Vehicle purchase data mainly include chassis number, dealer, vehicle model, and selling price;
3) After-sales 4S store visit behavior data mainly include visit type (for example repair, maintenance, accident, or claim), visit time, amount spent per visit, mileage at the visit, labor cost, spare parts cost, settlement date, and maintenance and repair items.
2. Data cleaning
1) Missing value handling: for example, a missing gender, age, province, or city can be filled in by decoding the corresponding digits of the ID card number; a missing vehicle model can be filled in by decoding the chassis number; a missing selling price can be filled in with the mean selling price of that vehicle model; a missing visit amount can be computed from the maintenance and repair items and their unit prices;
2) Identification and handling of noisy data: because of how data are collected and entered, customers' after-sales visit data contain some noise. First, the noisy data must be identified; the invention mainly uses the DBSCAN algorithm (a density-based clustering algorithm) for this. Second, the identified outliers are "smoothed" with a binning method.
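As an illustration of step 2), the following is a minimal sketch rather than the patent's implementation. It assumes one-dimensional data (for example visit amounts), applies the DBSCAN definition of a noise point (neither a core point nor within `eps` of a core point), and then smooths only the flagged values by equal-frequency bin medians. The function names, the parameters `eps`, `min_pts`, and `n_bins`, the sample amounts, and the choice of bin medians (rather than bin means or bin boundaries) are all assumptions made for illustration.

```python
from statistics import median


def dbscan_noise_1d(values, eps, min_pts):
    """Return the indices of DBSCAN noise points in 1-D data.

    A core point has at least min_pts points (itself included) within eps.
    A noise point is neither a core point nor within eps of a core point.
    """
    n = len(values)
    neighbours = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbours[i]) >= min_pts}
    return [
        i for i in range(n)
        if i not in core and not any(j in core for j in neighbours[i])
    ]


def smooth_noise_by_bin_medians(values, noise_idx, n_bins):
    """Replace each flagged value with the median of its equal-frequency bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = -(-len(values) // n_bins)  # ceiling division: points per bin
    noise = set(noise_idx)
    smoothed = list(values)
    for start in range(0, len(order), size):
        bin_idx = order[start:start + size]
        bin_median = median(values[i] for i in bin_idx)
        for i in bin_idx:
            if i in noise:
                smoothed[i] = bin_median
    return smoothed


# Hypothetical visit amounts; 9000 stands in for a data-entry error.
amounts = [500, 520, 480, 510, 9000, 495]
noise = dbscan_noise_1d(amounts, eps=50, min_pts=3)
cleaned = smooth_noise_by_bin_medians(amounts, noise, n_bins=2)
```

Only the flagged values are changed, so ordinary records keep their original amounts.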
3. Determine the target variable
The target variable (churned or not churned) is determined by whether the customer visits the 4S store within a specified time window: a customer who visits within the window is a non-churned customer, otherwise a churned customer. Commonly used windows are 3 months, 6 months, or one year; the invention uses one year.
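The labelling rule above can be sketched as follows. This is an illustrative sketch only: the function name, the `as_of` labelling date, and the treatment of customers with no recorded visit as churned are assumptions, and the one-year window is approximated as 365 days.

```python
from datetime import date, timedelta


def label_churn(last_visit, as_of, window_days=365):
    """Return 1 (churned) or 0 (not churned) for one customer.

    last_visit is the date of the most recent 4S store visit, or None if
    no visit is recorded; as_of is the date on which labels are assigned.
    """
    if last_visit is None:
        return 1  # assumption: no recorded visit counts as churned
    return 0 if as_of - last_visit <= timedelta(days=window_days) else 1


# Hypothetical dates for illustration.
recent = label_churn(date(2017, 3, 1), as_of=date(2017, 12, 25))
stale = label_churn(date(2016, 6, 1), as_of=date(2017, 12, 25))
```

Passing `window_days=90` or `window_days=180` would correspond to the shorter windows mentioned in the text.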
4. Compute the independent variables
From the cleaned data obtained in step 2, compute a number of basic attributes related to customer churn behavior, mainly: age, selling price, vehicle age, mileage at the last 4S store visit, number of 4S store visits, number of accidents, cumulative accident cost, number of maintenance visits, average maintenance visits per year, average cost per maintenance visit, number of overdue maintenance visits, cumulative maintenance cost, number of repairs, average repairs per year, average cost per repair, and cumulative repair cost.
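A few of these attributes can be derived from raw visit records as in the sketch below. The record layout (dictionaries with `type`, `date`, `amount`, and `mileage` keys), the feature names, and the sample data are hypothetical; the patent does not specify a schema, and only a subset of the listed attributes is computed.

```python
from datetime import date


def customer_features(visits, purchase_date, as_of):
    """Aggregate one customer's visit records into churn-related features."""
    # Vehicle age in years; at least one day to avoid division by zero.
    car_age_years = max((as_of - purchase_date).days, 1) / 365.0
    maintenance = [v for v in visits if v["type"] == "maintenance"]
    repairs = [v for v in visits if v["type"] == "repair"]
    last = max(visits, key=lambda v: v["date"]) if visits else None
    maintenance_total = sum(v["amount"] for v in maintenance)
    return {
        "car_age_years": car_age_years,
        "visit_count": len(visits),
        "last_visit_mileage": last["mileage"] if last else None,
        "maintenance_count": len(maintenance),
        "maintenance_per_year": len(maintenance) / car_age_years,
        "maintenance_cost_total": maintenance_total,
        "maintenance_cost_avg": (
            maintenance_total / len(maintenance) if maintenance else 0.0
        ),
        "repair_count": len(repairs),
    }


# Hypothetical visit history for one customer.
visits = [
    {"type": "maintenance", "date": date(2017, 3, 1), "amount": 600.0, "mileage": 5000},
    {"type": "repair", "date": date(2017, 6, 10), "amount": 1500.0, "mileage": 9000},
    {"type": "maintenance", "date": date(2017, 9, 1), "amount": 800.0, "mileage": 12000},
]
features = customer_features(visits, date(2016, 12, 25), date(2017, 12, 25))
```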
5. Screen the independent variables
The information gain method and the point-biserial correlation coefficient method are combined to evaluate the importance of the independent variables from step 4 and to select the more important attributes.
The main process of the information gain method is as follows:
1) Compute the expected information Info(D) needed to correctly classify an observation in D:
$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
where D is the set of all observations, m is the number of classes, and $p_i$ is the non-zero probability that an arbitrary observation in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$;
2) Compute the amount of information still needed to classify the observations in D after partitioning by attribute A:
$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)$$
where attribute A takes v distinct values $\{a_1, a_2, \ldots, a_v\}$ in D, so A partitions D into v subsets $\{D_1, D_2, \ldots, D_v\}$, and $D_j$ contains the observations in D whose value of A is $a_j$;
3) Compute the information gain of attribute A:
$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$
4) Set a threshold and remove the basic attributes whose information gain is very small.
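The information gain computation in steps 1) to 3) can be sketched as follows for a discrete attribute. The function names are illustrative, and the sketch assumes continuous attributes such as mileage have already been discretized into categories, which the text does not describe.

```python
from collections import Counter
from math import log2


def info(labels):
    """Info(D): entropy of the class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_after_split(values, labels):
    """Info_A(D): size-weighted entropy of the partitions induced by attribute A."""
    n = len(labels)
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    return sum(len(part) / n * info(part) for part in partitions.values())


def gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_after_split(values, labels)


# Hypothetical toy data: 1 = churned, 0 = not churned.
labels = [1, 1, 0, 0]
perfect = gain(["high", "high", "low", "low"], labels)   # attribute separates the classes
useless = gain(["high", "low", "high", "low"], labels)   # attribute carries no information
```

An attribute would then be dropped in step 4) when its gain falls below the chosen threshold.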
The main process of the point-biserial correlation coefficient method is as follows:
1) Compute the proportion $Y_p$ of observations for which the target variable Y takes one value and the proportion $Y_q$ of observations for which it takes the other value;
2) Compute the mean $\bar{X}_p$ of the part of the independent variable X that corresponds to $Y_p$;
3) Compute the mean $\bar{X}_q$ of the part of the independent variable X that corresponds to $Y_q$;
4) Compute the standard deviation $S_x$ of the independent variable X;
5) Compute the correlation coefficient between the independent variable X and the target variable Y according to the formula
$$r = \frac{\bar{X}_p - \bar{X}_q}{S_x}\sqrt{Y_p Y_q}$$
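The five steps above can be sketched as follows. This sketch assumes the standard point-biserial formula, in which $S_x$ is the population standard deviation of X; with that choice the result equals the Pearson correlation between X and a 0/1 coded Y. The function name and the sample data are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(x, y):
    """Point-biserial correlation between numeric x and binary y (coded 0/1)."""
    x_p = [xi for xi, yi in zip(x, y) if yi == 1]  # X values where Y = 1
    x_q = [xi for xi, yi in zip(x, y) if yi == 0]  # X values where Y = 0
    y_p = len(x_p) / len(x)
    y_q = len(x_q) / len(x)
    # pstdev is the population standard deviation, matching sqrt(Y_p * Y_q).
    return (mean(x_p) - mean(x_q)) / pstdev(x) * sqrt(y_p * y_q)


# Hypothetical data: overdue maintenance counts and churn labels.
overdue = [1, 2, 3, 4]
churned = [0, 0, 1, 1]
r = point_biserial(overdue, churned)
```

Attributes whose coefficient is close to zero in absolute value would be treated as weakly related to churn.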
The information gain method and the point-biserial correlation coefficient method each yield a set of the more important independent variables; the intersection of the two sets is the combined result of the two methods.
After screening with the above methods, the more important independent variables are: mileage at the last 4S store visit, number of accidents, selling price, average maintenance visits per year, average cost per maintenance visit, and number of overdue maintenance visits.
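Taking the intersection of the two screening results can be sketched as below. The score values, attribute names, and thresholds are invented for illustration, and using the absolute value of the correlation is an assumption; the patent does not state its thresholds or scores.

```python
def select_by_intersection(gain_scores, corr_scores, gain_min, corr_min):
    """Keep the attributes that pass both the gain and the correlation screen."""
    by_gain = {a for a, g in gain_scores.items() if g >= gain_min}
    # Assumption: a strong negative correlation is as informative as a positive one.
    by_corr = {a for a, r in corr_scores.items() if abs(r) >= corr_min}
    return sorted(by_gain & by_corr)


# Hypothetical scores for four candidate attributes.
gain_scores = {"last_mileage": 0.21, "accidents": 0.15, "age": 0.02, "price": 0.12}
corr_scores = {"last_mileage": -0.40, "accidents": 0.33, "age": 0.35, "price": 0.05}
selected = select_by_intersection(gain_scores, corr_scores, gain_min=0.10, corr_min=0.30)
```

Here `age` passes only the correlation screen and `price` only the gain screen, so neither is selected.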
6. Build the decision tree model
Build the decision tree model from the independent variables selected in step 5. Compute the information gain of each attribute, choose the attribute with the largest information gain as the root node, and create one branch for each value of that attribute. For each branch, compute the information gain of all remaining attributes, again choose the attribute with the largest information gain as the new split node, and create the corresponding branches. Repeat this process recursively until no attributes remain; the node is then defined as a leaf node and labeled with the most frequent class among its samples. In addition, leaf nodes that contain no samples must be pruned.
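The recursive construction described above resembles the ID3 algorithm and can be sketched as follows. This is a simplified illustration, not the patent's implementation: it assumes categorical attribute values, also stops early when a node is pure (a common shortcut the text does not mention), and creates branches only for values that actually occur, so empty leaves never arise and no pruning step is needed. All names and the sample rows are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Information gain of splitting (rows, labels) on attribute attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or a dict describing a split node."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority  # leaf: pure node or no attributes left
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    rest = [a for a in attrs if a != best]
    node = {"attr": best, "default": majority, "children": {}}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        node["children"][value] = build_tree(
            [r for r, _ in subset], [y for _, y in subset], rest
        )
    return node


def predict(tree, row):
    """Walk the tree; unseen attribute values fall back to the node's majority class."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree


# Hypothetical discretized training data.
rows = [
    {"overdue": "high", "accidents": "none"},
    {"overdue": "high", "accidents": "some"},
    {"overdue": "low", "accidents": "none"},
    {"overdue": "low", "accidents": "some"},
]
labels = ["churn", "churn", "stay", "stay"]
tree = build_tree(rows, labels, ["overdue", "accidents"])
```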
7. Verify the accuracy of the model
Feed customer data labeled as churned or not churned into the constructed decision tree model, compare the predicted results with the actual results to determine the model's accuracy, and revise the model accordingly.
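Comparing predictions with actual labels can be sketched as below. The patent only mentions accuracy; precision and recall for the churned class are added here as an assumption, since they are commonly reported alongside accuracy when the classes are imbalanced. The function name and sample labels are illustrative.

```python
def evaluate(actual, predicted, positive=1):
    """Return accuracy, precision, and recall for the positive (churned) class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels: 1 = churned, 0 = not churned.
metrics = evaluate(actual=[1, 1, 0, 0, 1], predicted=[1, 0, 0, 0, 1])
```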
8. Actual prediction and churn warning
Use the revised churn prediction model to predict for the customers who are currently not churned, focus on those with a high churn probability, and issue churn warnings.
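Turning predictions into a warning list can be sketched as follows. The patent does not say how a churn probability is obtained from the tree or where the cut-off lies; this sketch assumes a probability per customer is already available (for example the share of churned training samples in the leaf a customer falls into) and uses a hypothetical threshold of 0.7. The customer IDs are invented.

```python
def churn_warnings(churn_probability, threshold=0.7):
    """Return (customer_id, probability) pairs at or above the threshold, highest first."""
    flagged = [(c, p) for c, p in churn_probability.items() if p >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical probabilities for customers who are currently not churned.
probabilities = {"C001": 0.90, "C002": 0.20, "C003": 0.75}
warnings_list = churn_warnings(probabilities)
```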
Claims (9)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201711417983.9A CN110019166A (en) | 2017-12-25 | 2017-12-25 | Method for screening attribute data and customer loss early warning method |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN110019166A (en) | 2019-07-16 |
Family
ID=67186969
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201711417983.9A Pending CN110019166A (en) | 2017-12-25 | 2017-12-25 | Method for screening attribute data and customer loss early warning method |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN110019166A (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102567807A (en) * | 2010-12-23 | 2012-07-11 | 上海亚太计算机信息系统有限公司 | Method for predicating gas card customer churn |
| US20170004513A1 (en) * | 2015-07-01 | 2017-01-05 | Rama Krishna Vadakattu | Subscription churn prediction |
| CN106529714A (en) * | 2016-11-03 | 2017-03-22 | 大唐融合通信股份有限公司 | Method and system predicting user loss |
| CN107169284A (en) * | 2017-05-12 | 2017-09-15 | 北京理工大学 | A kind of biomedical determinant attribute system of selection |
| CN107203822A (en) * | 2016-03-16 | 2017-09-26 | 上海吉贝克信息技术有限公司 | Method and system based on the Logistic security customers attrition predictions returned |
Non-Patent Citations (1)
| Title |
|---|
| 徐建华;孙健;陈天池;: "基于家庭用户的流失预警模型构建", 信息通信, no. 10, pages 236 * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113962740A (en) * | 2021-10-27 | 2022-01-21 | 彩虹无线(北京)新技术有限公司 | Early warning method and device for passenger loss of automobile 4S store |
| CN119007952A (en) * | 2024-07-25 | 2024-11-22 | 康键信息技术(深圳)有限公司 | Inquiry user loss prediction method and device, electronic equipment and storage medium |
| CN119007952B (en) * | 2024-07-25 | 2025-10-21 | 康键信息技术(深圳)有限公司 | Method, device, electronic device and storage medium for predicting user churn |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190716 |