Fraud Analysis in Healthcare

Fraud in the healthcare sector is increasing rapidly. According to the fraud statistics provided by BCBSM, “The National Healthcare Anti-Fraud Association estimates conservatively that health care fraud costs the nation about $68 billion annually — about 3 per cent of the nation’s $2.26 trillion in health care spending. Other estimates range as high as 10 per cent of annual health care expenditure, or $230 billion.” The major types of healthcare fraud are fraud by service providers, fraud by insurance subscribers and fraud by insurance carriers.

We need technology that can help prevent such fraud. In this blog post, I will show how we can use machine learning algorithms to detect potential fraud in the healthcare sector.

Data Gathering

The training and testing data required for the model can be downloaded from here. Before creating the model, we need to preprocess the data.


Data Processing

Preprocessing the data involves several steps that make the model more effective. Before that, let’s load our training data in Python.

import pandas as pd

Train = pd.read_csv("Data/datasets_188596_421248_Train-1542865627584.csv")
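The later steps also use three more DataFrames: Train_Beneficiarydata, Train_Inpatientdata and Train_Outpatientdata. Here is a minimal sketch of loading several CSV files in one go. The helper name load_training_files and the file names in the demo are assumptions made for illustration, so substitute the names of the files you actually downloaded.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_training_files(data_dir, file_names):
    """Read each CSV listed in file_names (label -> file name) from data_dir."""
    data_dir = Path(data_dir)
    return {label: pd.read_csv(data_dir / name) for label, name in file_names.items()}


# Demo on a throwaway directory with made-up files, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"BeneID": ["B1", "B2"], "Gender": [1, 2]}).to_csv(
        Path(tmp) / "beneficiary.csv", index=False
    )
    pd.DataFrame({"BeneID": ["B1"], "ClaimID": ["C1"]}).to_csv(
        Path(tmp) / "inpatient.csv", index=False
    )
    frames = load_training_files(
        tmp,
        {"Train_Beneficiarydata": "beneficiary.csv", "Train_Inpatientdata": "inpatient.csv"},
    )

Train_Beneficiarydata = frames["Train_Beneficiarydata"]
Train_Inpatientdata = frames["Train_Inpatientdata"]
print(Train_Beneficiarydata.shape, Train_Inpatientdata.shape)
```

With the real data you would point data_dir at the folder holding the downloaded files instead of a temporary directory.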

In the above code, I have used the pandas module to load the training CSV files. The code that follows assumes the other training files have been loaded the same way into Train_Beneficiarydata, Train_Inpatientdata and Train_Outpatientdata. The beneficiary data contains indicator columns for several major chronic conditions: Alzheimer’s, heart failure, kidney disease, cancer, obstructive pulmonary disease, depression, diabetes, ischemic heart disease, osteoporosis, rheumatoid arthritis and stroke. These columns are in the Train_Beneficiary CSV file, coded as 1 and 2. Let’s recode them as 1 and 0 by replacing every 2 with 0.

Train_Beneficiarydata = Train_Beneficiarydata.replace({'ChronicCond_Alzheimer': 2, 'ChronicCond_Heartfailure': 2, 'ChronicCond_KidneyDisease': 2,
    'ChronicCond_Cancer': 2, 'ChronicCond_ObstrPulmonary': 2, 'ChronicCond_Depression': 2,
    'ChronicCond_Diabetes': 2, 'ChronicCond_IschemicHeart': 2, 'ChronicCond_Osteoporasis': 2,
    'ChronicCond_rheumatoidarthritis': 2, 'ChronicCond_stroke': 2}, 0)

Train_Beneficiarydata = Train_Beneficiarydata.replace({'RenalDiseaseIndicator': 'Y'}, 1)
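To see what this recoding does, here is a minimal sketch on a made-up three-row frame. The rows are invented. I am also assuming that RenalDiseaseIndicator holds the strings 'Y' and '0', and I map both values explicitly (a small variation on the replace call above) so the column ends up numeric rather than a mix of strings and integers.

```python
import pandas as pd

# Made-up rows; the column names follow the code above.
toy = pd.DataFrame({
    "ChronicCond_Alzheimer": [1, 2, 2],
    "ChronicCond_Diabetes": [2, 1, 2],
    "RenalDiseaseIndicator": ["0", "Y", "0"],
})

# Replace 2 with 0, but only in the named chronic-condition columns.
toy = toy.replace({"ChronicCond_Alzheimer": 2, "ChronicCond_Diabetes": 2}, 0)

# Assumption: the indicator only ever holds 'Y' or '0'.
toy["RenalDiseaseIndicator"] = toy["RenalDiseaseIndicator"].map({"Y": 1, "0": 0})

print(toy)
```

After this step every indicator column is a plain 0/1 flag, which is easier for the model to work with.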

The next step is to calculate each patient’s age: the age at death if a date of death (DOD) is recorded, otherwise the present age. I have also created a new column in our dataset that denotes 1 for patients who have died and 0 for those who have not.

Train_Beneficiarydata['DOB'] = pd.to_datetime(Train_Beneficiarydata['DOB'], format='%Y-%m-%d')
# errors='coerce' turns missing or unparseable dates of death into NaT
Train_Beneficiarydata['DOD'] = pd.to_datetime(Train_Beneficiarydata['DOD'], format='%Y-%m-%d', errors='coerce')
Train_Beneficiarydata['Age'] = round(((Train_Beneficiarydata['DOD'] - Train_Beneficiarydata['DOB']).dt.days)/365)

# Where there is no date of death, use the age as of 2013-12-01 instead
Train_Beneficiarydata['Age'] = Train_Beneficiarydata['Age'].fillna(round(((pd.to_datetime('2013-12-01', format='%Y-%m-%d') - Train_Beneficiarydata['DOB']).dt.days)/365))
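The column that flags whether a patient has died is not shown above. Here is a minimal sketch of how it could be built alongside the age, on two made-up beneficiaries. The column name WhetherDead is my assumption for illustration, and the reference date is the same 2013-12-01 used above.

```python
import pandas as pd

# Made-up beneficiaries: the second one has no recorded date of death.
toy = pd.DataFrame({
    "DOB": ["1940-05-01", "1950-01-15"],
    "DOD": ["2012-06-01", None],
})
toy["DOB"] = pd.to_datetime(toy["DOB"], format="%Y-%m-%d")
toy["DOD"] = pd.to_datetime(toy["DOD"], format="%Y-%m-%d", errors="coerce")

# Hypothetical column name: 1 if a date of death is recorded, otherwise 0.
toy["WhetherDead"] = toy["DOD"].notna().astype(int)

# Age at death where DOD is known, otherwise age on the reference date.
reference_date = pd.Timestamp("2013-12-01")
end_date = toy["DOD"].fillna(reference_date)
toy["Age"] = ((end_date - toy["DOB"]).dt.days / 365).round()

print(toy[["WhetherDead", "Age"]])
```

Dividing by 365 ignores leap days, so the result is an approximation that is then rounded to whole years.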


For patients with no date of death, I have calculated the age against a dummy date in 2013, because the available data is from 2012. The next important step is to calculate the number of days each patient was admitted for, which is very important information for our fraud detection model. The number of admit days can easily be calculated from the admission date and the discharge date in the Train_Inpatient CSV file.

Train_Inpatientdata['AdmissionDt'] = pd.to_datetime(Train_Inpatientdata['AdmissionDt'], format='%Y-%m-%d')
Train_Inpatientdata['DischargeDt'] = pd.to_datetime(Train_Inpatientdata['DischargeDt'], format='%Y-%m-%d')
# Add 1 so that the day of admission is counted as well
Train_Inpatientdata['AdmitForDays'] = ((Train_Inpatientdata['DischargeDt'] - Train_Inpatientdata['AdmissionDt']).dt.days) + 1
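As a quick worked example of the admit-days calculation, here are two made-up stays: one where the patient is discharged on the day of admission and one that lasts a week.

```python
import pandas as pd

# Made-up stays, both starting on the same day.
toy = pd.DataFrame({
    "AdmissionDt": ["2012-03-01", "2012-03-01"],
    "DischargeDt": ["2012-03-01", "2012-03-08"],
})
toy["AdmissionDt"] = pd.to_datetime(toy["AdmissionDt"], format="%Y-%m-%d")
toy["DischargeDt"] = pd.to_datetime(toy["DischargeDt"], format="%Y-%m-%d")

# Adding 1 counts the admission day itself, so a same-day discharge is 1 day, not 0.
toy["AdmitForDays"] = (toy["DischargeDt"] - toy["AdmissionDt"]).dt.days + 1

print(toy["AdmitForDays"].tolist())
```

The same-day stay comes out as 1 day and the week-long stay as 8 days, because both the first and the last day are counted.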

Now the task is to merge the four training files into a single dataset that can be used to train the model.

# Columns shared by the outpatient and inpatient claim files, used as the merge keys
common_columns = ['BeneID', 'ClaimID', 'ClaimStartDt', 'ClaimEndDt', 'Provider',
    'InscClaimAmtReimbursed', 'AttendingPhysician', 'OperatingPhysician', 'OtherPhysician',
    'ClmDiagnosisCode_1', 'ClmDiagnosisCode_2', 'ClmDiagnosisCode_3', 'ClmDiagnosisCode_4',
    'ClmDiagnosisCode_5', 'ClmDiagnosisCode_6', 'ClmDiagnosisCode_7', 'ClmDiagnosisCode_8',
    'ClmDiagnosisCode_9', 'ClmDiagnosisCode_10', 'ClmProcedureCode_1', 'ClmProcedureCode_2',
    'ClmProcedureCode_3', 'ClmProcedureCode_4', 'ClmProcedureCode_5', 'ClmProcedureCode_6',
    'DeductibleAmtPaid', 'ClmAdmitDiagnosisCode']

# The outer merge keeps every claim from both files
Train_Allpatientdata = pd.merge(Train_Outpatientdata, Train_Inpatientdata, on=common_columns, how='outer')
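The merge above combines only the outpatient and inpatient claims. Here is a minimal sketch of how the beneficiary details and the provider labels could then be attached, using made-up miniature frames. The intermediate name Train_AllPatientDetailsdata is my assumption, as is the idea that Train holds one row per provider with its PotentialFraud label.

```python
import pandas as pd

# Made-up miniature versions of the frames used in this post.
Train = pd.DataFrame({"Provider": ["P1", "P2"], "PotentialFraud": ["Yes", "No"]})
Train_Beneficiarydata = pd.DataFrame({"BeneID": ["B1", "B2"], "Gender": [1, 2]})
Train_Allpatientdata = pd.DataFrame({
    "BeneID": ["B1", "B2", "B2"],
    "ClaimID": ["C1", "C2", "C3"],
    "Provider": ["P1", "P1", "P2"],
})

# Attach beneficiary details to every claim (hypothetical intermediate name).
Train_AllPatientDetailsdata = pd.merge(
    Train_Allpatientdata, Train_Beneficiarydata, on="BeneID", how="inner"
)

# Attach the provider-level fraud label to every claim.
Train_ProviderWithPatientDetailsdata = pd.merge(
    Train, Train_AllPatientDetailsdata, on="Provider"
)

print(Train_ProviderWithPatientDetailsdata)
```

Each claim row now carries the patient's details and the label of the provider that filed it, which is the shape the model needs.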


Some columns, such as ‘BeneID’ and ‘ClaimID’, are no longer required because they were only used to link the four files. Others, such as ‘AdmissionDt’ and ‘DischargeDt’, were used for our calculations and are not needed any more. Let’s drop all these columns.

remove_these_columns = ['BeneID', 'ClaimID', 'ClaimStartDt', 'ClaimEndDt', 'AttendingPhysician',
    'OperatingPhysician', 'OtherPhysician', 'ClmDiagnosisCode_1',
    'ClmDiagnosisCode_2', 'ClmDiagnosisCode_3', 'ClmDiagnosisCode_4',
    'ClmDiagnosisCode_5', 'ClmDiagnosisCode_6', 'ClmDiagnosisCode_7',
    'ClmDiagnosisCode_8', 'ClmDiagnosisCode_9', 'ClmDiagnosisCode_10',
    'ClmProcedureCode_1', 'ClmProcedureCode_2', 'ClmProcedureCode_3',
    'ClmProcedureCode_4', 'ClmProcedureCode_5', 'ClmProcedureCode_6',
    'ClmAdmitDiagnosisCode', 'AdmissionDt',
    'DischargeDt', 'DiagnosisGroupCode', 'DOB', 'DOD',
    'State', 'County', 'Provider', 'DeductibleAmtPaid', 'Race', 'NoOfMonths_PartACov', 'NoOfMonths_PartBCov',
    'IPAnnualReimbursementAmt', 'IPAnnualDeductibleAmt', 'OPAnnualReimbursementAmt', 'OPAnnualDeductibleAmt']

Train_category_removed = Train_ProviderWithPatientDetailsdata.drop(columns=remove_these_columns)

TrainData = Train_category_removed

Two problems remain to be solved: handling the missing values and converting the values of the ‘PotentialFraud’ column to 1 and 0.


import numpy as np

# Fill missing values in every numeric column with 0
cols1 = TrainData.select_dtypes([np.number]).columns
TrainData[cols1] = TrainData[cols1].fillna(value=0)
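The code above handles the missing values; the label conversion is not shown. Here is a minimal sketch of both steps on a made-up frame, assuming the ‘PotentialFraud’ column is stored as the strings 'Yes' and 'No'.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing numeric value.
toy = pd.DataFrame({
    "PotentialFraud": ["Yes", "No", "Yes"],
    "AdmitForDays": [3.0, np.nan, 1.0],
})

# Assumption: the label is stored as the strings 'Yes' and 'No'.
toy["PotentialFraud"] = toy["PotentialFraud"].map({"Yes": 1, "No": 0})

# Fill missing numeric values with 0, as in the code above.
numeric_cols = toy.select_dtypes([np.number]).columns
toy[numeric_cols] = toy[numeric_cols].fillna(value=0)

print(toy)
print(toy.isna().sum().sum())  # total number of missing values left
```

Filling with 0 is a simple choice; whether 0 is a sensible stand-in depends on the column, so it is worth checking which columns actually had gaps.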

Let’s check some statistics on the data that has been prepared.

TrainData.describe(include='all')
TrainData.corr(numeric_only=True)


The first line of the above code shows summary statistics for each column, such as count, mean, std, 25%, 50%, 75% and max, while the next line prints the correlation of every column with every other column.
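One way to put the correlation table to use is to look only at how each feature correlates with the label. Here is a minimal sketch on a made-up frame, assuming ‘PotentialFraud’ has already been converted to 1 and 0.

```python
import pandas as pd

# Made-up numeric data; PotentialFraud is already coded as 1 and 0.
toy = pd.DataFrame({
    "AdmitForDays": [1, 2, 3, 4],
    "Age": [80, 70, 60, 50],
    "PotentialFraud": [0, 0, 1, 1],
})

summary = toy.describe(include="all")
correlations = toy.corr()

# Correlation of each feature with the label, strongest first.
with_label = correlations["PotentialFraud"].drop("PotentialFraud")
print(with_label.sort_values(key=abs, ascending=False))
```

A strong correlation with the label on real data suggests a feature is informative, though it says nothing about why.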

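from sklearn.naive_bayes import GaussianNB

# X (the feature columns) and Y (the 'PotentialFraud' label) are assumed to have been split from TrainData
model_GBN = GaussianNB().fit(X, Y)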
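For each class, Gaussian Naive Bayes fits a mean and a variance per feature. Here is a minimal sketch on synthetic data (not the claims data) that shows what the fitted model exposes; the two clusters and their centres are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic data: two features per row, class 1 centred further from zero than class 0.
X_class0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X_class1 = rng.normal(loc=4.0, scale=1.0, size=(50, 2))
X = np.vstack([X_class0, X_class1])
Y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, Y)

# The model stores one mean per class and feature: the centres of the fitted Gaussians.
print(model.classes_)
print(model.theta_)

# predict_proba returns one column per entry of classes_, in the same order.
proba = model.predict_proba([[0.0, 0.0], [4.0, 4.0]])
print(proba.round(3))
```

The column order of predict_proba follows classes_, which is worth checking before reading off "the probability of fraud" from a fixed column index.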

I have used Gaussian Naive Bayes, which is a probability-based classification algorithm. By using GaussianNB, I am assuming that the data for each label is drawn from a simple Gaussian distribution. Now we are ready to train our model and test it with dummy data.

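# Dummy record; the values must be in the same order as the feature columns used for training
input_data = [9000,8,1,0,0,0,0,1,0,0,0,0,0,0,0,81,0]
Age = input_data[15]  # assumption: Age is the second-to-last feature

if Age <= 0 or Age >= 100:
    print("Illegal data")
else:
    probability = model_GBN.predict_proba([input_data])
    # Probability of the second class in model_GBN.classes_, i.e. fraud when the label is coded 0/1
    prob = probability[0][1]
    if prob >= 0.5:
        print("Potential Fraud Detected")
    else:
        print("No Potential Fraud Detected")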
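The check above can also be wrapped in a small function so it is easy to reuse and test. This is only a sketch: classify_record and FixedProbabilityModel are hypothetical names, the stand-in model exists purely so the example runs without a trained classifier, and I am assuming that the second column of predict_proba is the fraud class.

```python
def classify_record(model, features, age, threshold=0.5):
    """Return a verdict string for one record, rejecting implausible ages first."""
    if age <= 0 or age >= 100:
        return "Illegal data"
    # Column 1 of predict_proba is assumed to be the fraud class (label coded 0/1).
    prob_fraud = model.predict_proba([features])[0][1]
    if prob_fraud >= threshold:
        return "Potential Fraud Detected"
    return "No Potential Fraud Detected"


class FixedProbabilityModel:
    """Stand-in for a trained classifier so the sketch runs on its own."""

    def __init__(self, prob_fraud):
        self.prob_fraud = prob_fraud

    def predict_proba(self, rows):
        return [[1 - self.prob_fraud, self.prob_fraud] for _ in rows]


record = [9000, 8, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 81, 0]
print(classify_record(FixedProbabilityModel(0.8), record, age=81))
print(classify_record(FixedProbabilityModel(0.2), record, age=81))
print(classify_record(FixedProbabilityModel(0.8), record, age=120))
```

With a real model you would pass the trained classifier in place of the stand-in and choose the threshold to suit how cautious the check should be.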



You now know how to create a model using Gaussian Naive Bayes to predict potential fraud in the healthcare industry.

I’d love to hear any feedback or questions. You can ask questions by leaving a comment, and I will try my best to answer them.
