2018 - Kaggle - TalkingData AdTracking Fraud Detection Challenge: Silver Medal
The TalkingData AdTracking Fraud Detection Challenge is a data mining competition hosted by TalkingData on Kaggle. I finished with a silver medal, working solo.
The journey through this competition was quite interesting for me; it is probably the competition I have spent the least time on. I discovered it on April 25th, casually downloaded a submission from someone else's kernel, and submitted it; at that point I was ranked around 300th, and I put the competition aside. When I came back to it on May 2nd, I had dropped to around 1100th. From May 2nd to May 7th I spent only about 6 days on it (I had other work to finish during that period, so this was not a full-time effort). Luckily, I ended up with a silver medal. The figure below shows how my ranking changed over the course of the competition.
Because of the short time, all of my results came from a single LightGBM model; I did not try any other models. So I will share only the two parts that I think matter most: processing hundred-million-row data and feature construction.
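For reference, here is a minimal sketch of such a single-model setup; the hyperparameters are illustrative placeholders, not the values I actually used:

```python
import lightgbm as lgb

# Assumes X_train / y_train already hold the constructed features and
# the is_attributed target. All hyperparameters below are placeholders.
params = {
    "objective": "binary",   # click-fraud detection is a binary task
    "metric": "auc",         # the competition metric
    "learning_rate": 0.1,
    "num_leaves": 31,
}
dtrain = lgb.Dataset(X_train, label=y_train)
model = lgb.train(params, dtrain, num_boost_round=500)
```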
The data provided by the organizers is about 10 GB, with more than 100 million samples, so processing it within limited memory is critical in this competition. I generally relied on the following operations (see the sketch after this list):

- When a variable `a` (especially one holding a large object) is no longer used, remove it from memory with `del a`.
- Call `gc.collect()` to trigger garbage collection so that the freed memory is actually reclaimed.
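Here is a minimal sketch of that pattern; the file name and the intermediate `features` frame are placeholders, not my actual pipeline:

```python
import gc

import pandas as pd

# Placeholder file name; the real competition training file is ~10 GB.
train = pd.read_csv("train.csv")

# ... derive whatever smaller frames you need from `train` ...
features = train[["ip", "app", "channel"]].copy()

# Once the large variable is no longer needed, delete the reference
# and explicitly trigger garbage collection to reclaim the memory.
del train
gc.collect()
```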
Feature construction is particularly critical for improving the results. It can be decomposed into two questions: which dataset to construct features on, and which features to construct.

At the beginning, I constructed features on the `train+test` dataset and got 0.9800 on the public LB, in the bronze medal zone. Later I constructed features on the `train+test_supplement` dataset instead, and the score went up directly to 0.9813 on the public LB, into the silver medal zone! From this we can see that a model whose features are built from `train+test` carries much more bias than one whose features are built from `train+test_supplement`.
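In code, the switch is just a matter of which frames are concatenated before the group-by features are computed. Here `make_features` is a hypothetical stand-in for the feature construction described below:

```python
import pandas as pd

# `train` and `test_supplement` are the competition files;
# `make_features` is a hypothetical helper implementing the
# group-by features listed in the table below.
full = pd.concat([train, test_supplement], ignore_index=True, sort=False)
full = make_features(full)

# The first len(train) rows still correspond to the training set.
train_feats = full.iloc[: len(train)]
```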
The features I constructed are listed below; each is defined by a group-by key and an aggregation:

| Group by | Feature |
| --- | --- |
| `[ip, app, channel, device, os]` | next time delta |
| `[ip, os, device]` | next time delta |
| `[ip, os, device, app]` | next time delta |
| `[ip, channel]` | previous time delta |
| `[ip, os]` | previous time delta |
| `[ip]` | unique count of `channel` |
| `[ip, device, os]` | unique count of `app` |
| `[ip, day]` | unique count of `hour` |
| `[ip]` | unique count of `app` |
| `[ip, app]` | unique count of `os` |
| `[ip]` | unique count of `device` |
| `[app]` | unique count of `channel` |
| `[ip]` | cumcount of `os` |
| `[ip, device, os]` | cumcount of `app` |
| `[ip, day, hour]` | count |
| `[ip, app]` | count |
| `[ip, app, os]` | count |
| `[ip, app, os]` | variance of `day` |
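A minimal pandas sketch of how each family of features above can be computed; the frame `df` and the derived column names are illustrative assumptions, not my exact code:

```python
import pandas as pd

# Assumes a frame `df` with the competition columns ip, app, device,
# os, channel, day, hour and a datetime column click_time.
df = df.sort_values("click_time")

# Next time delta: seconds until the next click within the same group.
df["next_delta"] = (
    df.groupby(["ip", "os", "device"])["click_time"].shift(-1) - df["click_time"]
).dt.total_seconds()

# Previous time delta: seconds since the previous click in the group.
df["prev_delta"] = (
    df["click_time"] - df.groupby(["ip", "channel"])["click_time"].shift(1)
).dt.total_seconds()

# Unique count: number of distinct channels seen per ip.
df["ip_nunique_channel"] = df.groupby("ip")["channel"].transform("nunique")

# Cumcount: running click index within a group (one common reading of
# "[ip], cumcount of os" is to group by ip and os together).
df["ip_os_cumcount"] = df.groupby(["ip", "os"]).cumcount()

# Count: group size.
df["ip_day_hour_count"] = df.groupby(["ip", "day", "hour"])["app"].transform("count")

# Variance of day within a group.
df["ip_app_os_var_day"] = df.groupby(["ip", "app", "os"])["day"].transform("var")
```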
If you have any ideas, for example you have found a bug somewhere, think one of my methods is incorrect or hard to follow, or have more creative ideas, please feel free to open an issue or a pull request, or discuss it with me directly! In addition, if you star or fork this project to motivate someone who has just entered the field of data mining, I will be grateful~