For anyone who is new to Kaggle, I have summarized what Kaggle is and how to use it after reading Kaggle's official documentation and the Kaggle Guide.
I hope it helps those who, like me, have just been introduced to Kaggle.
If there is anything that needs to be corrected, please open an Issue.
FYI, the 'Hello Kaggle' document rarely deals with Python programming or machine learning theory and focuses on Kaggle usage.
For those of you who are looking for programming, data science, and machine learning materials, I'll leave some links that have helped me.
Kaggle is a platform that hosts data analysis Competitions.
Competitions are commonly hosted by companies that provide data to be analyzed for their research challenges or key services.
With the artificial intelligence and machine learning boom, the number of participants has continued to increase, and Kaggle was acquired by Google, a subsidiary of Alphabet, in 2017.
Since the acquisition, Kaggle has become a critical site for data scientists and engineers, not just a platform.
Kaggler? Kaggling?
> Just as searching with Google is called Googling, Kaggle's users are called Kagglers, and participating in a Competition is called Kaggling.
Jobs
The Jobs service was originally provided, but it ended on December 22, 2020.
Courses cover Python, machine learning, visualization, and so on.
Kaggle's courses can be quite useful if you haven't learned these topics step by step or if you studied an older course.
Courses are in English, free, and come with a certificate of completion.
English
Data scientists from all over the world gather on Kaggle and use English by default.
Competition notices, Datasets, and Discussion are also in English.
Below is the photo of Discussion
and Site Forum
.
If you look at the profiles of Competition winners, they come from a variety of countries: the USA, Korea, Russia, China, India, and so on.
Programming Language
Python and R are used a lot.
Purpose | Knowledge Required |
---|---|
Competition participation | Python, R, data analysis |
Competition organizer | Data analysis, English |
Discussion with Kaggler | English |
Learning through Courses | English |
Internet, Python and R, PC
Server with GPU or Workstation, and high-capacity HDD or SSD
Infrastructure for data analytics
Kaggle is web-based and provides tools for data analysis (Notebook).
Notebook: the programming environment for data analysis provided by Kaggle.
It is similar to Jupyter Notebook.
It provides a 4-core CPU + 16GB RAM by default. The GPU server provides a 2-core CPU + GPU + 13GB RAM.
It is provided free of charge, and the GPU can be used for 30 hours a week.
Dataset
Dataset.
When you upload a Dataset, you can use the Private setting to keep it hidden from the outside world.
If you make it Public, the Apache 2.0 License is applied, so decide carefully.
Company Training
Example: staff training for creating neural network-based machine learning programs
What if we didn't use Kaggle?
With Kaggle, building a development environment, checking the score, and deployment are much easier and less expensive.
Discussion
If you don't know something, you can ask in the Site Forums and the Competition section of Communities.
Communities
Site Forums
Refer to Competitions Documentation.
Featured: the most common type of Competition, with prize money between $100 and $1,500,000.
Research
Getting Started: for new Kagglers
Titanic: Machine Learning from Disaster, House Prices: Advanced Regression Techniques, Digit Recognizer
These three competitions are the most recommended and the most helpful for newcomers to machine learning.
Playground: for data scientists and engineers
Recruitment: for job opportunities
It offers a Job Interview opportunity. Participants can upload a resume at the end of the Competition.
Annual: Competition held regularly
Analytics: to effectively explain the results
Sign Up
Click the Register button on the upper right to sign up first.
Courses
Courses: as described above.
Refer to Kaggle Progression System.
Before I explain how to become a Contributor, I will explain Kaggle Tiers and Medals.
Kaggle Tiers
Kaggle has a Progression System, which is simply the Kaggler's Tier.
This rating is a good indicator of your ability as a data scientist.
It also shows intuitively how much you have grown.
The Kaggle Tiers are divided into five levels, and each level has conditions that must be met to reach it.
Novice
Contributor
Expert
Master
Grandmaster
Also, as you can see in the pictures above, the Kaggle Tier is rated separately for Competitions, Datasets, Notebooks, and Discussion.
Click on the account icon in the upper right and select My Profile to go to the profile page.
There you can check your profile information, your Kaggle activity, and your tiers.
Medal
Medals show a Kaggler's performance in each field.
Competition
Notebook
Dataset
Comment
To become a Contributor, you only need to satisfy the conditions. From Expert onward, however, you must collect the medals required by the conditions in each category.
Competitions have different medal criteria depending on the number of participating teams.
Datasets, Notebooks, and Discussion are evaluated by Votes. The more Votes a post has, the more Kagglers have recommended it.
Note that each post holds only one type of medal in each category.
For example, if a post in Datasets receives 20 Votes, its bronze medal is replaced by a silver medal.
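The "one medal per post" rule can be sketched in a few lines. This is only an illustration: the 20-vote silver threshold comes from the example above, while the bronze (5) and gold (50) values are assumptions that should be checked against the Kaggle Progression System page, and `medal_for_votes` is a hypothetical helper name.

```python
from typing import Optional

# Vote thresholds, highest medal first. Only the 20-vote silver value is taken
# from the text; bronze=5 and gold=50 are assumed values for illustration.
ASSUMED_DATASET_THRESHOLDS = [("gold", 50), ("silver", 20), ("bronze", 5)]


def medal_for_votes(votes: int, thresholds=ASSUMED_DATASET_THRESHOLDS) -> Optional[str]:
    """Return the single medal a post holds for a given number of votes."""
    for medal, minimum in thresholds:
        if votes >= minimum:
            # The highest medal reached replaces any lower one.
            return medal
    return None


for votes in (3, 5, 19, 20):
    print(votes, medal_for_votes(votes))
```

Because the thresholds are checked from the highest medal down, a post never holds two medals at once, which matches the bronze-to-silver replacement described above.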
Being Contributor
Click Edit Profile, and enter the following:
Bio (self-introduction)
Occupation
Organization
City
You can also add a profile image and Social Media freely.
Click Phone Verification on the profile screen.
Fill in the Country Code, Phone Number, and Not a Robot boxes and click Send Code.
Enter the code you receive and click Verify to complete authentication.
Run a notebook through a Course or by creating your own Notebook and executing any code.
"4. Participate in the Competition" also runs a notebook, so you can skip this step.
Select one Competition in the 'Getting Started' category.
When you open it, you will see the menu below in the middle of the screen.
Click 'Notebooks' here and take a look at other people's notebooks.
Pick one notebook and open it. In the upper right corner you will see a button for copying it; click this button to copy the notebook.
Once the copy is complete, click Save Version in the upper right corner.
Version Name: You can enter a name for the version.
Version Type: There are two options, Quick Save or Save & Run All (Commit). Quick Save saves the notebook without executing it, and Save & Run All (Commit) saves it and executes it.
Select Save & Run All here and press the Save button.
Go back to your profile and click Notebook to see the notebook you just copied.
When you click on this notebook, there is an Output entry in the right menu.
Press Output, select Submission.csv, and click Submit to Competition on the right.
The screen then moves to the Leaderboard menu, and the submitted file is scored automatically.
After scoring, you can check your score and click Jump to your position on the leaderboard to see your ranking.
Go to Discussion, enter the topic you want, and click any article you are interested in (the Getting Started forum in Site Forums is recommended).
Leave comments. If the text is useful or you like it, press Vote as well.
Contributor!
Competitions, Datasets, Notebooks, and Discussion.
Competitions. You can also check how many people are in each tier.
Notebook?
You can use it to take part in a Competition or share a Notebook with other Kagglers. Some Notebooks are shared only for training or to demonstrate skills.
A Notebook uses Code Cells and Markdown Cells to hold code and descriptions of the code, text, images, etc.
Notebook
Go to the Notebook menu and click the button in the upper right corner.
Kaggle Notebook has two types: Script and Notebook.
A Script is written and executed the way you would in a commonly used code editor.
A Notebook is an interactive development environment similar to Jupyter Notebook. Its distinguishing feature is that you can split the code into cells and execute only the cells you want.
Press File in the upper left corner and hover your cursor over Edit Type to select the type. In addition, you can choose between Python and R in Language.
You can change the name by clicking on the field at the top left that looks like the picture below.
The first time you create a Notebook, you will see the following code:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
The code above loads the NumPy and pandas libraries and then walks the /kaggle/input directory, printing the path of every input file it finds.
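As a small variation on that default cell, the sketch below collects only the CSV files instead of printing everything. The helper name `list_input_files` and the `.csv` filter are assumptions for illustration; only the /kaggle/input location comes from the code above.

```python
import os


def list_input_files(root="/kaggle/input", suffix=".csv"):
    """Return sorted paths of files under `root` whose names end with `suffix`."""
    matches = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith(suffix):
                matches.append(os.path.join(dirname, filename))
    return sorted(matches)


# Outside Kaggle the directory usually does not exist; os.walk then yields
# nothing, so the loop simply prints no paths.
for path in list_input_files():
    print(path)
```

Returning a list rather than printing makes it easy to pass a chosen path straight to `pd.read_csv` in a later cell.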
Let's print Hello Kaggle! in a Notebook. Place the cursor in any code cell and press the + Code button to add a new cell.
Then complete the following:
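The original screenshot is not reproduced here. A minimal cell that does this could look like the following (the variable name `message` is only for illustration):

```python
# Contents of the new code cell: store the greeting and print it.
message = "Hello Kaggle!"
print(message)
```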
Press the play button at the top left of the cell, or press Ctrl + Enter or Shift + Enter, to execute the code. The output will look like this:
These are the functions of the buttons that can be seen in the cell.
Notebook
Set Public
& Private
A Notebook can be made public to share it with other Kagglers. If you don't want to share it, or when you work as a team, you can set it to Private or share it with specific users.
Click the Share button in the upper right corner to open a window for the public or private setting.
If Privacy is set to Public, the Notebook is released under the Apache 2.0 License.
Use Collaborators to add users as collaborators.
Settings
Language: sets the programming language, Python or R.
Environment: sets the Docker image. Original keeps the development environment from when the Notebook was created, and Latest Available uses the latest development environment provided by Kaggle.
Accelerator: sets whether to use a GPU or TPU.
GPU/TPU Quota: shows the usage time of GPU and TPU.
Internet: sets whether or not to connect to the Internet.
Set Internet to On to enable the connection. Google accounts also allow you to use the BigQuery, Cloud Storage, and AutoML services from GCP (Google Cloud Platform).
Data
from Notebook
A Kaggle Notebook can use not only Competition data but also the wide variety of shared Datasets.
In that case, the files must be added to the Notebook separately.
i. new Notebook
Open the Dataset you want to use, and press New Notebook to set the files up automatically.
ii. existing Notebook
To use it in an existing Notebook, first open your Notebook.
Click the + Add Data button in the upper right corner.
Search for a Dataset and press Add after you choose the Dataset.
Go to the Data menu and click the + New Data button in the upper right corner.
Enter the Dataset Title and click Select Files to Upload to upload the file. (Compressed file types such as zip or tar.gz are also possible.)
Click Create to upload the Dataset. You can import the uploaded Dataset using the i or ii method.
Notebook
With the ii method, a window will appear, where you can click on the Kernel Output Files tab to use the output data from another Notebook.
Notebook
External packages that are available through pip can be installed by clicking Console at the bottom of the Notebook and running pip install package_name.
You can also use pip directly in a code cell, as shown in the following two examples:
!pip install package_name
import os
os.system('pip install package_name')
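A third option, sketched here as an alternative rather than something the original describes, is to call pip through `subprocess` with the interpreter that is running the Notebook. Unlike `os.system`, `check_call` raises an error if the installation fails. The function names are hypothetical, and `package_name` remains a placeholder.

```python
import subprocess
import sys


def pip_install_command(package_name):
    """Build the argument list for installing a package with the current interpreter's pip."""
    return [sys.executable, "-m", "pip", "install", package_name]


def pip_install(package_name):
    """Run the install and raise CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_command(package_name))


# Only the command is printed here; call pip_install("package_name") to run it.
print(pip_install_command("package_name"))
```

On Kaggle, installing from the package index presumably requires the Internet setting of the Notebook to be On.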
Source Code
from Dataset in Notebook
If you add an example dataset that contains the package hello_kaggle to a Notebook, you can add the ../input/example-dataset/hello_kaggle directory to the module search path.
The code to add is as follows:
import sys
sys.path.append("../input/example-dataset/hello_kaggle")
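To see what this does without a real dataset, the sketch below rebuilds a similar layout in a temporary directory. The module `hk_greeting.py` and its function `say_hello` are hypothetical stand-ins, not part of any actual dataset.

```python
import importlib
import os
import sys
import tempfile

# Recreate a tiny stand-in for the dataset layout in a temporary directory.
# `hk_greeting.py` and `say_hello` are hypothetical names used only for this demo.
dataset_root = tempfile.mkdtemp()
package_dir = os.path.join(dataset_root, "example-dataset", "hello_kaggle")
os.makedirs(package_dir)
with open(os.path.join(package_dir, "hk_greeting.py"), "w") as handle:
    handle.write("def say_hello():\n    return 'Hello Kaggle!'\n")

# Same idea as sys.path.append("../input/example-dataset/hello_kaggle"):
# the directory itself is on the path, so its modules import by file name.
if package_dir not in sys.path:
    sys.path.append(package_dir)

importlib.invalidate_caches()
hk_greeting = importlib.import_module("hk_greeting")
print(hk_greeting.say_hello())
```

Note that appending the hello_kaggle directory itself makes the modules inside it importable by name. To write `import hello_kaggle` instead, the parent directory (../input/example-dataset) would need to be on the path.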
What can a Notebook be used for besides data analysis Competitions?
A Notebook will be shared (Public) after the Competition is finished.
Competition is in progress.
Data File to use in Competition Notebook?
When working on a Competition, the Data tab is located in the upper right corner of the Notebook. There are three types of files you can click on, described as follows.
train.csv: training data with the correct answer labels.
test.csv: test data without the correct answer labels.
Sample_submission.csv: an example of the submission file format.
View the Data menu of the Competition to see what data each file contains.
For example, let's look at Titanic - Machine Learning from Disaster.
In the picture above, click on the Data menu to read the Overview as follows.
If you scroll down further, you can select each file to view its data and download it as follows.
Let's use these files to build a model and create a csv file to submit.
(The same is explained in 4. Participate in the Competition.)
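The relationship between the test file and the submission file can be sketched with the standard `csv` module. This is a toy illustration, not a real model: the two rows stand in for test.csv, the "female survives" rule is only a placeholder for a trained model, and the PassengerId/Survived layout is the one used by the Titanic competition.

```python
import csv
import io

# A tiny stand-in for test.csv; the real file has many more columns and rows.
TEST_CSV = "PassengerId,Sex\n892,male\n893,female\n"


def make_submission(test_file, out_file):
    """Write one prediction per test row using a placeholder rule (not a trained model)."""
    reader = csv.DictReader(test_file)
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])  # header expected in the submission
    for row in reader:
        prediction = 1 if row["Sex"] == "female" else 0
        writer.writerow([row["PassengerId"], prediction])


output = io.StringIO()
make_submission(io.StringIO(TEST_CSV), output)
print(output.getvalue())
```

In a Kaggle Notebook you would open the real test.csv under /kaggle/input and write the result to a csv file in /kaggle/working so that it appears under Output.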
Click Save Version in the upper right corner of the Notebook screen. (If the code has not been executed, click Save & Run All (Commit).)
In Save & Run All (Commit), Commit has the same meaning as a Git commit on GitHub, where I am currently working.
With it, Kaggle Notebook can refer back to previously written versions of the source code.
Now return to your profile and click Notebook to see the notebook you just saved.
When you click on this notebook, there is an Output entry in the right menu.
Press the Output menu, select Submission.csv, and click Submit to Competition on the right.
The screen then moves to the Leaderboard menu, and the submitted file is scored automatically.
After scoring, you can check your score and click Jump to your position on the leaderboard to see your ranking.
Kaggle Guide.
Baseline: implementing the general-purpose algorithm
Data Analysis Notebook
A Notebook that analyzes Competition data and shows visualizations.
It finds correlations, rules, and structure in the analyzed data without creating data to submit. It also looks for independent variables that fit well with the dependent variable.
If you have little Competition experience, it would be a good start to build knowledge and insight by looking at data analyzed by other Kagglers.
Fork Notebook
If you are new to machine learning and Kaggle, one way to start is to fork a public notebook without doing the data analysis or model development yourself.
Fork means to copy a version of the source code.
Open the Notebook you'd like to fork and press the copy button.
Merge, Blending, Stacking, Ensemble Notebook
You will often see a Notebook with words such as Merge, Blending, Stacking, and Ensemble in its title.
This is a Notebook combining several Notebooks.
Example:
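The linked example is not reproduced here. As a rough illustration of what a blending Notebook does, the sketch below takes a weighted average of the predictions from two models. The prediction values and the helper name `blend` are made up for the demo, and real Notebooks often use more elaborate schemes such as stacking.

```python
def blend(predictions, weights=None):
    """Weighted average of several models' predictions, row by row."""
    if not predictions:
        raise ValueError("need at least one prediction list")
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("one weight per prediction list is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all prediction lists must have the same length")
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(length)
    ]


# Made-up probabilities from two hypothetical models for three test rows.
model_a = [0.25, 0.75, 0.5]
model_b = [0.75, 0.25, 1.0]
print(blend([model_a, model_b]))
print(blend([model_a, model_b], weights=[3, 1]))
```

Giving a higher weight to the model with the better validation score is a common starting point, though the best weights depend on the Competition.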
Since a Competition is carried out in this order, I think it is better to study a variety of Notebooks to understand the process rather than just looking at the winner's notebook.
A Competition is literally a competition, so a shared (public) Notebook usually means that sharing it does not seriously affect its author's score.
If you look at the Notebooks of winners, you can often see that they used the latest technology or a different solution from the shared notebooks.
Competitions in Kaggle sometimes have specific rules. This is because Competitions are usually hosted by a company or organization, and special rules are often created to achieve the results that the company or organization wants.
Which rules should I check?
Rules: To win a Competition, you must first know its rules. Check the Rules menu for each Competition.
Evaluation: On the Evaluation page of Overview, look at the Evaluation function to see which evaluation method is applied. Usually, statistics-based functions are used.
One-person score check limit: If you could check the score as often as you liked by submitting a result file after every small change, the competition would not produce meaningful results, so there is usually a limit on the number of scored submissions.
Notebook Only Competition: Submit results using Kaggle Notebook only.
When Kaggle Notebook is used, Kagglers are more likely to share Notebooks, and all participants can easily find good ideas by viewing shared Notebooks.
Closed Competition
One good thing about Kaggle is that it keeps the discussions and notebooks of Competitions that ended a long time ago.

Competition | Used Technology | Description |
---|---|---|
Mercari Price Suggestion Challenge (2018.2) | TF-IDF Vector + Fully Connected Neural Network | Learn the frequency of each word with neural networks |
Toxic Comment Classification Challenge (2018.3) | FastText, GloVe + GRU + LightGBM | A combination of word vector dictionaries learned from time series data |
Avito Demand Prediction Challenge (2018.6) | FastText + LSTM + 2D-CNN | Learn data and images of sentences simultaneously with neural networks |
Quora Insincere Questions Classification (2019.1) | GloVe, para + OOV Token + LSTM + 1D-CNN | Learn vocabularies through OOV token |
Jigsaw Unintended Bias in Toxicity Classification (2019.6) | BERT + XLNet + GPT2 | BERT models appeared on Kaggle |
won the Competition topic by topic (I just checked, and the last commit was 11 months ago).
Competitions will continue to release their latest technology-enabled solutions on the Private Leaderboard page after they end.
public Dataset
As a public Dataset, the UCI Machine Learning Repository is famous.
Data Repository
Like GitHub, you can use Kaggle as a convenient place to store Datasets and Notebooks (Free!)
Dataset directly to Notebook.
public Dataset and up to 20GB total for all private Dataset.
Kaggle API
Kaggle API is an API that lets you use various functions of Kaggle from various development environments.
It is implemented in Python 3, and you use it by entering commands in a terminal.
Kaggle API
You must install Python and pip before starting.
1. Install Kaggle API using pip install kaggle.
2. Then enter your profile, click on the button that looks like this, and press Accounts.
3. Click Create New API Token here to download the json file.
4. Move the json file to the user's home directory as .kaggle/kaggle.json. Now you are ready to use Kaggle API.
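Before running any command, it can help to confirm that the token file is where the steps above put it. The following is a minimal sketch, assuming the downloaded file contains `username` and `key` fields. `load_kaggle_credentials` is a hypothetical helper, not part of the Kaggle API package.

```python
import json
import os


def load_kaggle_credentials(path=None):
    """Read kaggle.json and return (username, key); raise if it is missing or incomplete."""
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".kaggle", "kaggle.json")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"API token not found at {path}")
    with open(path) as handle:
        token = json.load(handle)
    missing = [field for field in ("username", "key") if not token.get(field)]
    if missing:
        raise ValueError(f"kaggle.json is missing: {', '.join(missing)}")
    return token["username"], token["key"]


try:
    username, _ = load_kaggle_credentials()
    print("Token found for", username)
except (FileNotFoundError, ValueError) as error:
    print("Not ready yet:", error)
```

The key is never printed, since the token works like a password.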
Kaggle API
Use the kaggle competitions list command to see which Competitions are currently in progress.
To get the Competition files, check the files with kaggle competitions files COMPETITION_NAME and run kaggle competitions download COMPETITION_NAME to download them.
For more information on Kaggle API, please visit Kaggle Public API Documentation.
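The commands above are meant for a terminal, but they can also be driven from Python. The sketch below only wraps the three commands quoted in this section. The helper names are hypothetical, and `titanic` is used as an example competition name.

```python
import shutil
import subprocess


def kaggle_command(action, competition=None):
    """Build the argument list for `kaggle competitions <action> [competition]`."""
    if action not in ("list", "files", "download"):
        raise ValueError(f"unsupported action: {action}")
    command = ["kaggle", "competitions", action]
    if action != "list":
        if not competition:
            raise ValueError("a competition name is required")
        command.append(competition)
    return command


def run_kaggle(action, competition=None):
    """Run the command if the kaggle CLI is installed; return its text output or None."""
    if shutil.which("kaggle") is None:
        return None
    result = subprocess.run(kaggle_command(action, competition),
                            capture_output=True, text=True, check=False)
    return result.stdout


print(" ".join(kaggle_command("files", "titanic")))
```

Calling `run_kaggle("list")` returns `None` when the CLI is not installed, so the sketch can be tried safely before the setup steps are complete.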
First of all, thank you for reading Hello Kaggle!
I studied Python for the first time in April 2020, but I was unable to concentrate fully on my studies because I started military service in July of the same year.
That's why I couldn't study data science in depth, and I still need more knowledge to understand it.
Now I'm finally stepping into machine learning and Kaggle.
While writing Hello Kaggle!, I improved my understanding of Kaggle, and I'm going to start with a Getting Started Competition.
I'm also eager to keep up with the latest technology by looking at other outstanding Kagglers' Notebooks.
Hopefully, everyone who reads Hello Kaggle! will have the best time in 2021. Let's Keep Going!