Novice to Grandmaster in Kaggle

There are four important aspects we need to work on to perform well in a Kaggle competition (or in any data science competition, for that matter):

1. Feature Engineering:

This is the first, and a very important, part of solving a problem. I generally spend more time on this part than on the others, since it is the one that can give the biggest gains if done correctly. It includes:

  • Feature selection
  • Feature transformation
  • Feature creation
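
As a rough illustration of the three activities listed above, here is a minimal sketch in plain Python. The toy rows and the column names (`price`, `area`, `country`, `currency`) are made up for this example, and the log transform, the ratio feature and the constant-column filter are just one simple instance of each activity, not a recommended recipe.

```python
import math

# Hypothetical toy rows; the column names are invented for illustration.
rows = [
    {"price": 100.0, "area": 50.0, "country": "IN", "currency": "USD"},
    {"price": 1000.0, "area": 80.0, "country": "US", "currency": "USD"},
    {"price": 10000.0, "area": 200.0, "country": "IN", "currency": "USD"},
]

def transform(row):
    """Feature transformation: compress a skewed numeric column with log1p."""
    out = dict(row)
    out["log_price"] = math.log1p(row["price"])
    return out

def create(row):
    """Feature creation: derive a ratio from two existing columns."""
    out = dict(row)
    out["price_per_area"] = row["price"] / row["area"]
    return out

def select(rows):
    """Feature selection: drop columns that hold a single value in every row."""
    keep = [k for k in rows[0] if len({r[k] for r in rows}) > 1]
    return [{k: r[k] for k in keep} for r in rows]

features = select([create(transform(r)) for r in rows])
print(sorted(features[0]))
```

Here `currency` is dropped because it never varies, while the two derived columns are kept. On real data, each of these choices would be checked against a validation score rather than assumed to help.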

One needs to master all three in order to perform well. Understanding the problem objective, understanding the data, data visualisation and domain knowledge all help a lot at this stage. Some good links with respect to feature engineering are as follows:

2. Cross Validation:

Having a proper validation methodology is very important; without one, it is like shooting in the dark and hoping to hit the bull's eye. There are several validation methodologies available, and one needs to choose the best one depending on the objective at hand. A nice article on validation methodology is present in this blog link.
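
To make the idea concrete, here is a minimal sketch of one common methodology, k-fold cross validation, in plain Python. The helper names (`k_fold_indices`, `cross_validate`), the mean-predicting baseline and the toy data are all assumptions made for this example; in practice a library such as scikit-learn provides ready-made splitters, and time-ordered or grouped data would need a different split.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices once and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return one mean-absolute-error score per held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit only on the training part so the held-out fold stays unseen.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [abs(predict(model, xs[i]) - ys[i]) for i in held_out]
        scores.append(statistics.mean(errors))
    return scores

# Hypothetical baseline "model": always predict the mean of the training targets.
def fit_mean(xs, ys):
    return statistics.mean(ys)

def predict_mean(model, x):
    return model

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
scores = cross_validate(xs, ys, fit_mean, predict_mean, k=5)
print(statistics.mean(scores), statistics.stdev(scores))
```

Looking at the spread across folds, and not only the mean, gives a sense of how much a change in the local score can be trusted.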

3. ML Algorithms Selection and Tuning:

The next step is choosing the algorithms that perform best on the given data. This generally involves trying out the array of algorithms in your arsenal and then selecting the ones that perform better. It is absolutely essential to grow and improve this array of algorithms over time in order to perform consistently.

In general, these days (Sept 2016), XGBoost performs best in most structured-data competitions, and deep neural nets perform best in most unstructured-data competitions. If the dataset is huge, online learners such as FTRL and Vowpal Wabbit perform well.

Proper tuning of an algorithm's parameters gives improved results, so it is necessary. Some good articles on the same are:

There are now also a few libraries, like hyperopt and bayesopt, that do parameter tuning in a more automated way.
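
As a rough picture of what automated tuning does, here is a minimal sketch of plain random search in Python. Note that this is a simpler technique than the guided search those libraries perform, and it does not use their APIs. The function `random_search`, the parameter names (`learning_rate`, `subsample`) and the toy loss are assumptions for this example; the toy loss stands in for "train a model and return its validation loss".

```python
import random

def random_search(objective, space, n_trials=50, seed=0):
    """Sample parameters uniformly from `space` and keep the lowest loss seen."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            name: rng.uniform(low, high) for name, (low, high) in space.items()
        }
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

# Hypothetical stand-in for a real validation loss; its minimum is at (0.1, 0.8).
def toy_validation_loss(params):
    return (params["learning_rate"] - 0.1) ** 2 + (params["subsample"] - 0.8) ** 2

space = {"learning_rate": (0.01, 0.3), "subsample": (0.5, 1.0)}
best_params, best_loss = random_search(toy_validation_loss, space, n_trials=200)
print(best_params, best_loss)
```

With a real model, `objective` would typically return a cross-validated score, so that tuning does not simply overfit a single validation split.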

4. Ensembling / Stacking:

In almost all the Kaggle competitions I know of, the final winners used ensembling or stacking of multiple models as their winning strategy. To learn more about ensemble learning, check this post by Analytics Vidhya.

There is also an excellent blog post by fellow Kaggler Triskelion, the Kaggle Ensembling Guide, which has code for each of the ensembling / stacking methods and will come in handy.
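
For a quick feel of the simplest form of ensembling, here is a minimal sketch of weighted averaging in plain Python. It is not taken from the guide mentioned above; the function names and the toy predictions are made up for this example. Stacking goes a step further by training a second-level model on out-of-fold predictions, which this sketch does not do.

```python
def blend(predictions, weights=None):
    """Weighted average of several models' predictions, row by row."""
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)
    ]

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and predictions from two models that err on different rows.
y_true = [1.0, 0.0, 1.0, 0.0]
model_a = [0.9, 0.4, 0.6, 0.1]
model_b = [0.6, 0.1, 0.9, 0.4]

blended = blend([model_a, model_b])
print(mse(y_true, model_a), mse(y_true, model_b), mse(y_true, blended))
```

In this toy case the blend has a lower error than either model alone, because the two models make their larger mistakes on different rows. Averaging models that make the same mistakes would bring little or no gain.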

Other Aspects:

Doing all four of these steps correctly will help us get a good spot in the Kaggle competition we are working on. Other important points are:

  • Look at the winners' solutions for similar problems on the Kaggle Blog and try using some of their ideas in the competition.
  • Team up with people who are similarly interested. This helps you learn faster and adds a different perspective on solving the problem.
  • Follow the Kaggle Forums regularly, as many ideas for solving the problems are discussed there.
  • Kaggle has also released Kernels, where people share their work in the form of code. Many people use it as a “Click and submit” feature to gain good ranks, but learning from these scripts will be immensely helpful in the long run.

Secret of Success 😉

On top of all the points mentioned above, we need to put in a lot of hard work (both ours and that of our systems) and have a strong determination to try repeatedly, since most of the time the new method we try will fail to improve our scores. I would count these failures as valuable learning.
