Data science for health insurance fraud detection
Summary of the entity:
Almerys specializes in collecting, storing and processing sensitive data. It provides products and services in digital trust, sovereignty and privacy by design, ranging from tier 4+ data centres to identity management, electronic signature, dynamic consent management, transactional payments, legal and probative archiving of sensitive data, and personalized service brokering. Most of the data processed by Almerys concerns health, in particular the reimbursement of medical expenses.
Summary of the challenge:
The objective of the challenge is to develop solutions that allow anti-fraud agencies and insurance companies to detect fraud committed by healthcare practitioners and patients in reimbursement operations. The solution should exploit historical datasets of healthcare reimbursement requests, both to detect fraud in past operations and to propose mechanisms that prevent future fraud.
Description of the global challenge:
In many European countries, the health system reimburses part of the cost of medical care through public and/or private health insurance. This reimbursement system is based on trust: a patient’s care needs must be real and approved by a healthcare professional, and the health costs declared by the practitioner must correspond to the medical acts actually carried out. Unfortunately, abuses occur and many cases of fraud are recorded. Since the early 2000s, when reimbursement of medical care became widely paperless, the anti-fraud services of health insurers have been able to use big data and artificial intelligence to detect fraud in real time. Fraud comes from patients (fake prescriptions), from healthcare professionals (fake care) or from both acting together. In a country like France (about 67 million inhabitants), 261 million euros of fraud were detected in 2018. Each year, progress in artificial intelligence algorithms makes it possible to detect more new cases of fraud.
The goal of this complex challenge is to detect health insurance fraud by working on a real dataset of medical reimbursement data. It is composed of two sub-challenges that start from the same dataset: the first focuses on supervised machine learning techniques and the second on unsupervised machine learning using clustering models.
Sub-challenges composing this experiment:
This challenge is composed of 2 sub-challenges:
- Supervised machine learning for fraud detection (REACH-2022-READYMADE-ALMERYS_2.1)
- Unsupervised clustering for fraud detection (REACH-2022-READYMADE-ALMERYS_2.2)
Expected global results:
- To develop solutions based on predictive models that detect potential fraud perpetrators and provide a fraud likelihood score for each optician.
- To create new anti-fraud rules, based on the behaviour of opticians who belong to a fraud cluster, that make it possible to block reimbursement requests upstream.
Supervised machine learning for fraud detection
Code:
REACH-2022-READYMADE-ALMERYS_2.1
Summary of the sub-challenge:
Using supervised machine learning techniques to detect health insurance fraud.
Description of the challenge:
The goal of this sub-challenge is to label every line of datasets 1 and 2 with a binary tag (fraud vs non-fraud) using the most appropriate supervised machine learning techniques.
This challenge is based on 3 datasets:
- In the first dataset, each line describes one sale of glasses, with information on the price, the type of correction, the beneficiary (patient) and the optician (seller).
- In the second dataset, each line summarizes the information for a single optician (total number of glasses sold, average selling price of glasses, etc.). This dataset describes the behaviour of each optician; note that it is computed by aggregating data from the first dataset.
- The third dataset is a list of opticians who are already known fraudsters.
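The relationship between the first two datasets can be illustrated with a minimal sketch. It assumes hypothetical field names (`optician_id`, `price`) and invented values, since the real schema is not given here, and it computes only a few example behaviour indicators per optician.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows in the style of dataset 1: one record per sale of glasses.
# The field names and values are assumptions, not the real schema.
sales = [
    {"optician_id": "OPT_A", "price": 250.0},
    {"optician_id": "OPT_A", "price": 350.0},
    {"optician_id": "OPT_B", "price": 900.0},
]


def aggregate_by_optician(rows):
    """Build one behaviour record per optician from per-sale rows."""
    prices = defaultdict(list)
    for row in rows:
        prices[row["optician_id"]].append(row["price"])
    return {
        optician: {
            "n_sales": len(values),
            "avg_price": mean(values),
            "max_price": max(values),
        }
        for optician, values in prices.items()
    }


profiles = aggregate_by_optician(sales)
print(profiles["OPT_A"])
# {'n_sales': 2, 'avg_price': 300.0, 'max_price': 350.0}
```

Each entry of `profiles` plays the role of one line of the second dataset; the actual indicators provided with the challenge may differ.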
The main difficulty of this sub-challenge is that only a small proportion of the opticians who commit fraud are already known (listed in dataset 3); many others commit fraud without having been detected yet.
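One possible starting point for this partially labelled setting is a naive baseline that treats every optician absent from the known list as non-fraudulent, fits a classifier, and then ranks those opticians by predicted score. The sketch below does this with a small logistic regression written from scratch. The features, identifiers and values are invented, and the approach is an assumption for illustration, not a method prescribed by the challenge. Because some unlabelled opticians are in fact fraudsters, scores from such a baseline are biased downward and are better read as a ranking than as calibrated probabilities.

```python
import math

# Hypothetical optician profiles in the style of dataset 2:
# [average selling price, share of sales with the strongest correction].
# Identifiers, features and values are invented for illustration.
profiles = {
    "OPT_01": [180.0, 0.10],
    "OPT_02": [210.0, 0.15],
    "OPT_03": [190.0, 0.12],
    "OPT_04": [200.0, 0.08],
    "OPT_05": [620.0, 0.70],
    "OPT_06": [650.0, 0.75],
    "OPT_07": [600.0, 0.65],  # not in the known list, but behaves like them
}
known_fraudsters = {"OPT_05", "OPT_06"}  # stands in for dataset 3


def standardize(rows):
    """Rescale each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def predict(weights, bias, x):
    """Logistic model output, read here as a fraud likelihood score."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Plain batch gradient descent on the logistic loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        errors = [predict(weights, bias, x) - t for x, t in zip(X, y)]
        for j in range(len(weights)):
            weights[j] -= lr * sum(e * x[j] for e, x in zip(errors, X)) / len(X)
        bias -= lr * sum(errors) / len(X)
    return weights, bias


ids = list(profiles)
X = standardize([profiles[i] for i in ids])
# Naive labelling: known fraudsters are 1, everyone else is treated as 0.
y = [1.0 if i in known_fraudsters else 0.0 for i in ids]
weights, bias = fit_logistic(X, y)

# Rank only the opticians that are not already in the known list.
scores = {i: predict(weights, bias, x) for i, x in zip(ids, X)}
ranked = sorted(
    (i for i in ids if i not in known_fraudsters),
    key=lambda i: scores[i],
    reverse=True,
)
print(ranked[0])
```

In this toy data, the unlabelled optician whose behaviour resembles the known fraudsters ends up at the top of the ranking, which is the kind of candidate the complementary list is meant to contain.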
Data to be used:
- Historical data of reimbursement requests from opticians to insurance companies
- Behavioral description of opticians
- ID of proven fraudulent opticians
It is important to note that all sensitive data (name, address, etc.) has already been anonymized via the tool provided by Gnubila (Anonymizer).
Expected outcomes:
- To provide a list complementing dataset 3, containing the IDs of additional opticians you suspect of fraud.
- To propose new fraud detection rules (such as a price of glasses above a given threshold, etc.) that detect both the fraudulent opticians already known (dataset 3) and the new opticians your algorithm has identified as fraudsters.
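A price-threshold rule of the kind mentioned above could be expressed and checked roughly as follows. The threshold, the minimum share of sales above it, and all data are hypothetical values chosen for illustration.

```python
# Hypothetical sales in the style of dataset 1: (optician ID, price of the glasses).
sales = [
    ("OPT_01", 180.0), ("OPT_01", 220.0),
    ("OPT_05", 640.0), ("OPT_05", 700.0), ("OPT_05", 150.0),
    ("OPT_07", 610.0), ("OPT_07", 655.0),
]
known_fraudsters = {"OPT_05"}  # stands in for dataset 3

PRICE_THRESHOLD = 600.0  # assumed value, to be tuned on real data
MIN_SHARE = 0.5          # assumed: flag if at least half of the sales exceed it


def flag_opticians(rows, threshold, min_share):
    """Return opticians whose share of sales above the threshold reaches min_share."""
    totals, above = {}, {}
    for optician, price in rows:
        totals[optician] = totals.get(optician, 0) + 1
        if price > threshold:
            above[optician] = above.get(optician, 0) + 1
    return {o for o, n in totals.items() if above.get(o, 0) / n >= min_share}


flagged = flag_opticians(sales, PRICE_THRESHOLD, MIN_SHARE)

# How many of the known fraudsters does the rule catch?
recall_on_known = len(flagged & known_fraudsters) / len(known_fraudsters)
# Which flagged opticians are not yet in the known list?
new_candidates = sorted(flagged - known_fraudsters)

print(recall_on_known, new_candidates)
# 1.0 ['OPT_07']
```

Checking a rule against dataset 3 in this way gives a rough measure of how well it recovers known cases, while the remaining flagged opticians are candidates for the complementary list.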
Unsupervised clustering for fraud detection
Code:
REACH-2022-READYMADE-ALMERYS_2.2
Summary of the sub-challenge:
Using unsupervised clustering techniques to detect health insurance fraud.
Description of the challenge:
The goal of this sub-challenge is to find, in the clusters that contain already known fraudulent opticians, other opticians whose behaviour is very similar and who could therefore also be considered fraudsters.
This challenge is based on 3 datasets:
- In the first dataset, each line describes one sale of glasses, with information on the price, the type of correction, the beneficiary (patient) and the optician (seller).
- In the second dataset, each line summarizes the information for a single optician (total number of glasses sold, average selling price of glasses, etc.). This dataset describes the behaviour of each optician; note that it is computed by aggregating data from the first dataset.
- The third dataset is a list of opticians who are already known fraudsters.
Using datasets 1 and 2, apply unsupervised clustering methods to group opticians with similar behaviour into clusters.
It is important to emphasize that this sub-challenge requires an unsupervised learning method, so data from dataset 3 must not be used as an input. Only once the opticians have been grouped may you use dataset 3 to see which clusters the already known fraudulent opticians fall into.
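The two-step procedure described above (cluster first, consult dataset 3 afterwards) might look like the following sketch. It uses a small k-means implementation with a deterministic farthest-point initialisation. The choice of k-means, the number of clusters, the features and all values are assumptions made for illustration; any unsupervised clustering method could take its place.

```python
import math

# Hypothetical optician profiles in the style of dataset 2:
# [average selling price, share of sales with the strongest correction].
profiles = {
    "OPT_01": [180.0, 0.10],
    "OPT_02": [210.0, 0.15],
    "OPT_03": [190.0, 0.12],
    "OPT_04": [200.0, 0.08],
    "OPT_05": [620.0, 0.70],
    "OPT_06": [650.0, 0.75],
    "OPT_07": [600.0, 0.65],
}


def standardize(rows):
    """Rescale each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, iterations=20):
    """Basic k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iterations):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centroids)


# Step 1: clustering uses only the behavioural features, never dataset 3.
ids = list(profiles)
labels = kmeans(standardize([profiles[i] for i in ids]), k=2)

# Step 2: only now is the list of known fraudsters (dataset 3) consulted.
known_fraudsters = {"OPT_05", "OPT_06"}
fraud_clusters = {label for i, label in zip(ids, labels) if i in known_fraudsters}
candidates = sorted(
    i for i, label in zip(ids, labels)
    if label in fraud_clusters and i not in known_fraudsters
)
print(candidates)
# ['OPT_07']
```

Keeping `known_fraudsters` out of the clustering step and using it only to interpret the resulting clusters is what keeps the method unsupervised.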
Data to be used:
- Historical data of reimbursement requests from opticians to insurance companies
- Behavioral description of opticians
- ID of proven fraudulent opticians
It is important to note that all sensitive data (name, address, etc.) has already been anonymized via the tool provided by Gnubila (Anonymizer).
Expected outcomes:
- To provide a list complementing dataset 3, containing the IDs of additional opticians you suspect of fraud.
- To propose new fraud detection rules (such as a price of glasses above a given threshold, etc.) that detect both the fraudulent opticians already known (dataset 3) and the new opticians your algorithm has identified as fraudsters.
How do we apply?
Read the Guidelines for Applicants
Doubts or questions? Read more about REACH on the About Us page, have a look at our FAQ section or drop us an email at opencall@reach-incubator.eu.