## A Worksheet for the Lichess Data Set

The first code cell is supplied by Kaggle and lists the dataset files that are available. What's available depends on which dataset was under inspection when New Notebook got pressed.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

A standard opening is to read the provided csv file into a pandas DataFrame. `read_csv` comes with a huge number of options, set by named arguments. We consider ourselves lucky if the defaults do the job.

In [2]:
games = pd.read_csv('/kaggle/input/chess/games.csv')
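
When the defaults are not enough, a few named arguments usually cover it. The sketch below is only an illustration: `sample_csv` is a made-up three-row stand-in for the real file, not an excerpt from it, and the column selection is an assumption chosen to match the columns used later in this worksheet.

```python
import io
import pandas as pd

# Hypothetical stand-in for games.csv; the values are invented for illustration.
sample_csv = io.StringIO(
    "id,rated,white_rating,black_rating,winner\n"
    "a1,True,1500,1191,white\n"
    "b2,False,1322,1261,black\n"
    "c3,True,1496,1500,white\n"
)

# Named arguments override the defaults: keep only some columns,
# pin one column's dtype, and stop after the first two data rows.
sample = pd.read_csv(
    sample_csv,
    usecols=["id", "white_rating", "black_rating", "winner"],
    dtype={"white_rating": "int32"},
    nrows=2,
)
print(sample.dtypes)
print(sample.shape)
```

`usecols` and `nrows` are handy for a first look at a large file, since they cut down what gets loaded.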

We've likely already studied the columns and how full each one is (i.e. how riddled with NaNs), but it can't hurt to study them again now that the data is loaded. According to the Kaggle gods or guides, these are not all unique games.

What is of interest? Good questions will not occur to you (or me) until we have some understanding of the columns available. 

Since our two chess players are usually rated, with a boolean field flagging if not, and since we have the moves for most games, we can start envisioning an investigation: which games between players of widely different ratings nevertheless took a lot of moves to resolve?

Another question: how might we curate a DataFrame called upsets, in which the player rated lower by, say, 500 points or more nevertheless checkmated the stronger player? See below.

In [3]:
games.info()
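
Beyond `info()`, NaN counts and repeated ids can be checked directly. This is a minimal sketch on a made-up miniature table (`toy` and its values are hypothetical), assuming that a repeated `id` is what marks a non-unique game.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the games table; values are invented.
toy = pd.DataFrame({
    "id": ["a1", "b2", "b2", "c3"],
    "white_rating": [1500, 1322, 1322, 1496],
    "moves": ["e4 e5 Nf3", np.nan, np.nan, "d4 d5"],
})

missing_per_column = toy.isna().sum()         # count of NaNs in each column
repeated_ids = toy["id"].duplicated().sum()   # rows whose id appeared earlier
deduped = toy.drop_duplicates(subset="id")    # keep the first row for each id

print(missing_per_column)
print(repeated_ids, len(deduped))
```

Whether to drop the repeats depends on the question being asked, so the worksheet's own cells carry on with the data as loaded.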

We won't need all the columns going forward. Let's take a fork in the road and start down our path with filter_cols. Later, we will reorder these columns and compute new columns from what they contain.

In [18]:
filter_cols = games.loc[:, ['id', 'white_rating', 'black_rating', 'moves', "victory_status", "winner"]]
filter_cols.head()

The diff column will measure the difference in rating between the two players in each game. The difference will never be negative, i.e. it is the "absolute value" of the gap.

In [19]:
filter_cols["diff"] = abs(filter_cols.white_rating - filter_cols.black_rating)
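
To see what the absolute value throws away, here is a small sketch on invented ratings (`toy` is hypothetical). The signed difference says which side is higher rated, and the absolute value keeps only the size of the gap.

```python
import pandas as pd

# Hypothetical ratings, invented for illustration.
toy = pd.DataFrame({
    "white_rating": [1500, 1200, 1800],
    "black_rating": [1450, 1900, 1800],
})

# Positive when White is higher rated, negative when Black is.
signed = toy["white_rating"] - toy["black_rating"]

# Series.abs() gives the same result as the built-in abs() applied to a Series.
toy["diff"] = signed.abs()
print(toy)
```

Because the sign is lost, later cells that need to know who was the underdog compare the two rating columns again.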

In [20]:
def number_of_moves(the_moves):
    """
    Return the number of whitespace-separated moves in a moves string.
    """
    return len(the_moves.split())

In [7]:
filter_cols.moves[0] # example moves string

In [21]:
number_of_moves(filter_cols.moves[0])

In [22]:
filter_cols["num_moves"] = filter_cols.moves.apply(number_of_moves) # apply calls the function on each element of the Series
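
The same count can be had without a helper function by using the `.str` accessor. This is a sketch on an invented Series, assuming each whitespace-separated token in a moves string is one move by one player.

```python
import pandas as pd

# Hypothetical moves strings, invented for illustration.
moves = pd.Series(["e4 e5 Nf3 Nc6", "d4 d5", "c4"])

via_apply = moves.apply(lambda s: len(s.split()))  # element by element
via_str = moves.str.split().str.len()              # vectorized string methods

print(via_apply.tolist())
print(via_str.tolist())
```

One practical difference: if a moves entry were NaN, the `apply` version would raise an error because a float has no `split`, while the `.str` version would pass the NaN through.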

In [23]:
filter_cols.head()

Reorder the columns for `final`, the DataFrame on which future data analysis will be based.

In [31]:
final = filter_cols[["id", "white_rating", "black_rating", "diff", "winner",
                     "victory_status", "num_moves", "moves"]]

In [32]:
final.sort_values(["diff", "num_moves"], ascending=False)

In [34]:
final.sort_values(["num_moves", "diff"], ascending = False)
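
The order of the column list decides which column is the primary sort key, and `ascending` may also be a list with one flag per column. A minimal sketch on invented rows (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical rows, invented for illustration.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "diff": [700, 700, 50, 300],
    "num_moves": [20, 95, 120, 60],
})

# Biggest rating gap first; ties broken by the longest game.
by_diff = toy.sort_values(["diff", "num_moves"], ascending=[False, False])

# Biggest rating gap first; ties broken by the shortest game.
mixed = toy.sort_values(["diff", "num_moves"], ascending=[False, True])

print(by_diff["id"].tolist())
print(mixed["id"].tolist())
```

Sorting by `diff` first puts the lopsided pairings on top, which suits the earlier question about mismatched games that still ran long.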

In [35]:
upset_for_black = final.query("white_rating - black_rating >= 500 and winner == 'black' and victory_status == 'mate'")

In [36]:
upset_for_white = final.query("black_rating - white_rating >= 500 and winner == 'white' and victory_status == 'mate'")

In [43]:
upsets = pd.concat([upset_for_black, upset_for_white], axis=0)
upsets.reset_index(drop=True, inplace=True)
upsets
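
As an alternative to two queries and a concat, the same filter can be written as one boolean mask. This is a sketch on invented rows, not a replacement for the cells above: `toy`, `gap` and the other names are hypothetical.

```python
import pandas as pd

# Hypothetical games, invented for illustration.
toy = pd.DataFrame({
    "white_rating": [2000, 1200, 1900, 1500, 1300],
    "black_rating": [1400, 1800, 1300, 1450, 1900],
    "winner": ["black", "white", "white", "black", "black"],
    "victory_status": ["mate", "mate", "mate", "mate", "resign"],
})

gap = toy["white_rating"] - toy["black_rating"]

# Black wins although White is rated at least 500 points higher, and vice versa.
black_upset = (gap >= 500) & (toy["winner"] == "black")
white_upset = (gap <= -500) & (toy["winner"] == "white")

# Keep only the upsets that ended in checkmate.
upsets_toy = toy[(black_upset | white_upset) & (toy["victory_status"] == "mate")]
print(upsets_toy)
```

The mask version keeps the rows in their original order, whereas the concat version lists all of Black's upsets before White's.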

Let's export this upsets DataFrame to a csv file.

In [46]:
upsets.to_csv("upsets.csv", header=True, index=False)
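
A quick way to check an export is to read it straight back. The sketch below uses an in-memory buffer instead of a file on disk, and `toy` is an invented two-row frame.

```python
import io
import pandas as pd

# Hypothetical two-row frame, invented for illustration.
toy = pd.DataFrame({"id": ["a1", "b2"], "diff": [600, 712]})

buffer = io.StringIO()
toy.to_csv(buffer, header=True, index=False)  # same arguments as the cell above
buffer.seek(0)                                # rewind before reading back

restored = pd.read_csv(buffer)
print(restored)
```

With `index=False` the row labels are left out of the file. With the default `index=True` they would be written as an extra unnamed first column.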