There are four important aspects we need to work on to perform well in a Kaggle competition (or in any data science competition, for that matter):
1. Feature Engineering:
This is the first and a very important part of solving a problem. I generally spend more time on this part than on the others, since it is the one that can give the biggest gains if done correctly. It includes:
- Feature selection
- Feature transformation
- Feature creation
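As a minimal sketch of the three activities above, assuming pandas and NumPy and a made-up toy table (every column name and value here is hypothetical, chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Toy data; all columns and values are invented for this sketch.
df = pd.DataFrame({
    "price": [120.0, 80.0, 300.0, 45.0, 150.0, 60.0],
    "area": [60.0, 40.0, 120.0, 30.0, 75.0, 35.0],
    "sold_on": pd.to_datetime([
        "2016-01-15", "2016-02-20", "2016-03-05",
        "2016-04-11", "2016-05-30", "2016-06-18",
    ]),
    "constant": [1, 1, 1, 1, 1, 1],
    "target": [1, 0, 1, 0, 1, 0],
})

# Feature transformation: compress a skewed numeric column.
df["log_price"] = np.log1p(df["price"])

# Feature creation: combine columns and unpack a date.
df["price_per_area"] = df["price"] / df["area"]
df["sold_month"] = df["sold_on"].dt.month

# Feature selection (a very simple rule): drop columns that never vary.
features = df.drop(columns=["target", "sold_on"])
selected = features.loc[:, features.nunique() > 1]

print(list(selected.columns))
```

Dropping constant columns is only the crudest form of selection; in practice the choice would be driven by the validation score.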
One needs to master all three of these in order to perform well. Understanding the problem objective, understanding the data, data visualisation, and domain knowledge all help a lot at this stage. Some good links with respect to feature engineering are as follows:
2. Cross Validation:
Having a proper validation methodology is very important, since without it we are shooting in the dark and hoping to hit the bull's eye. There are several validation methodologies available, and one needs to choose the best one depending on the objective at hand. A nice article on validation methodology is present in this.
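To illustrate how the choice of validation scheme depends on the problem, here is a small sketch assuming scikit-learn. It contrasts two splitters on toy data: `StratifiedKFold`, which keeps the class balance in every fold, and `TimeSeriesSplit`, which only ever validates on rows that come after the training rows. The data and the choice of these two splitters are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Toy data: 20 rows in time order, with perfectly balanced binary labels.
X = np.arange(20).reshape(-1, 1)
y = np.array([0, 1] * 10)

# Option 1: stratified folds, suitable when rows are exchangeable.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
stratified_folds = list(skf.split(X, y))
for train_idx, valid_idx in stratified_folds:
    print("stratified valid rows:", valid_idx, "positive rate:", y[valid_idx].mean())

# Option 2: time-ordered folds, suitable when the test set lies in the future.
tss = TimeSeriesSplit(n_splits=4)
time_folds = list(tss.split(X))
for train_idx, valid_idx in time_folds:
    print("train up to row", train_idx.max(), "-> validate on rows", valid_idx)
```

If the competition's test set is a later time period, the second scheme mimics it more closely; otherwise the first is a common default.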
3. ML Algorithms Selection and Tuning:
The next step is choosing the algorithm that performs best on the given data. This generally involves trying out the array of algorithms in your armoury and then selecting the ones that perform better. It is absolutely essential to keep expanding this array of algorithms over time in order to perform consistently.
In general these days (Sept 2016), is performing better for most of the structured data competitions and
are performing better for most of the unstructured data competitions. If the dataset is huge, online algorithms like, are performing well.
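Trying out an array of algorithms can be sketched as a simple loop that scores each candidate with the same cross-validation setup. This assumes scikit-learn and synthetic data; the three models below are illustrative picks, not the ones named in the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a competition training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The "array of algorithms" to try; extend this dict over time.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Score every candidate with the same folds and metric.
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    results[name] = scores.mean()

# Print candidates from best to worst mean AUC.
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```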
Proper tuning of the algorithms' parameters gives improved results, so it is necessary. Some good articles on the same are:
There are now also a few libraries like, which do parameter tuning in a more automated way.
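As a rough sketch of parameter tuning, the following uses scikit-learn's `GridSearchCV` on synthetic data. The model, the tiny grid, and the metric are assumptions chosen to keep the example fast; this is a plain exhaustive search, not one of the more automated tuning libraries mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a competition training set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A deliberately small grid; real searches usually cover more values.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated AUC:", round(search.best_score_, 3))
```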
4. Ensembling / Stacking:
In almost all the Kaggle competitions I know of, the final winners use ensembling or stacking of multiple models as their winning strategy. To know more about what ensemble learning is, check this.
There is also an excellent blog post by a fellow Kaggler, Triskelion, on this topic, which also has code for each of the ensembling / stacking methods that will come in handy.
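For a feel of what stacking looks like in code, here is a minimal sketch assuming scikit-learn and synthetic data. It is not taken from the blog post mentioned above. Base models produce out-of-fold predictions on the training set, and a second-level model is trained on those predictions; a plain average of the base predictions is shown alongside as the simplest ensemble.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Illustrative base models; any set of diverse models could be used.
base_models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

# Level 1: out-of-fold predictions, so the meta-model never sees
# a prediction made on a row the base model was trained on.
oof = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on the full training set to predict the test set.
test_meta = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for m in base_models
])

# Level 2: a simple meta-model trained on the base predictions.
meta_model = LogisticRegression()
meta_model.fit(oof, y_train)
stacked_pred = meta_model.predict_proba(test_meta)[:, 1]

# Simplest alternative: average the base model probabilities.
blend_pred = test_meta.mean(axis=1)
```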
Doing all four of these steps correctly will help us get a good spot in the Kaggle competition we are working on. Other important points are:
- Look at the winners' solutions for similar problems from and try using some of those ideas in the competition.
- Team up with people who are similarly interested in working on the problem. This will help you learn faster and also adds a different perspective on solving it.
- Follow the forums regularly, as many ideas for solving the problems will be discussed there.
- Kaggle has also released Scripts, where people share their work in the form of code. Many people use it as a “click and submit” feature to gain good ranks, but learning from these scripts will be immensely helpful in the long run.
Secret of Success 😉
On top of all the points mentioned above, we need to put in a lot of hard work (both ours and that of our systems) and have a strong determination to try repeatedly, since most of the time the new method we try will fail to improve our scores. I would count these failures as valuable learnings.