Fifa 21 Python Data Cleaning
Overview
This repository contains the code to clean up the FIFA 21 dataset game. The purpose of this project is to clean and standardize the value format of several columns that originally varied. FIFA21 dataset contains 18K+ records of player data and 77 Columns. Each record represents a unique player and includes various attributes
Issues found in the data
- Inconsistent value formats : Several columns have inconsistent value formats, which could potentially disrupt operations such as grouping by, filtering, or aggregating.
- Missing Values : Some columns have missing values that need to be handled differently.
Tools Used
- Python : used Numpy and Pandas Library for Cleaning
- Visual Code Studio : IDE for Data Exploration and Cleaning
Columns Explanation
For descriptions of all columns, you can refer to a file named “FIFA21 Column Explanation”
Data Analysis
Data Analysis Table of Contents
- Importing Libraries
- Cleanning Club Column
- Cleanning Contract Column
- Cleanning Height Column
- Cleanning Weight Column
- Cleanning Loan Date End Column
- Cleanning W/F Column
- Cleanning SM Column
- Cleanning Hits Column
- Cleaning Value, Wage, Release Clause Column
Importing Libraries and Read the csv file
import numpy as np import pandas as pd
df = pd.read_csv('fifa21 data.csv') pd.set_option('display.max_columns', None) df
df.info()
Data Cleanning
create a copy of dataframe
data = df.copy()
data.head(5)
1. Cleanning Club Column
data['Club'].dtype
data['Club'].unique()
There are many whitespace characters in some club names in the Club column. Since the whitespace characters are both at the beginning and end of the string, using .strip() would be appropriate to remove them.
data['Club'] = data['Club'].str.strip()
data['Club'].unique()
2. Cleanning Contract Column
data['Contract'].dtype
data['Contract'].unique()
using
.loc
to search for values in the ‘Contract’ column for each row where the value contains the string ‘On Loan’ or ‘Free’.
for index, kontrak in data.loc[(data['Contract'].str.contains('On Loan')) | (data['Contract'].str.contains('Free')), 'Contract'].items():
print(kontrak)
Declare a function
extract_contract
that accepts one parameter contract
, which is used to create 3 new columns: ‘Contract Start’, ‘Contract End’, and ‘Contract Length (years)’ with values taken from the ‘Contract’ column.
def extract_contract(contract):
if contract == 'Free' or 'On Loan' in contract:
start_date = np.nan
end_date = np.nan
contract_length = 0
else:
start_date, end_date = contract.split(' ~ ')
start_year = int(start_date[:4])
end_year = int(end_date[:4])
contract_length = end_year - start_year
return start_date, end_date, contract_length
new_columns = ['Contract Start', 'Contract End', 'Contract Length(years)']
new_data = data['Contract'].apply(lambda x: pd.Series(extract_contract(x)))
for i in range(len(new_columns)):
data.insert(loc=data.columns.get_loc('Contract')+1+i, column=new_columns[i], value=new_data[i])
data[['Contract', 'Contract Start', 'Contract End', 'Contract Length(years)']].sample(10)
Declare a function Categorize_Contract
that categorizes values in the ‘Contract’ column into three categories: ‘Free’, ‘On Loan’, and ‘Contract’. Then create a new column ‘Contract Status’ to the DataFrame data to the right of the ‘Contract Length (years)’ column. The values for this new column are generated by applying the Categorize_Contract
function to each value in the ‘Contract’ column.
def Categorize_Contract(contract):
if contract == 'Free':
return 'Free'
elif 'On Loan'in contract:
return 'On Loan'
else:
return 'Contract'
data.insert(data.columns.get_loc('Contract Length(years)')+1, 'Contract Status', data['Contract'].apply(Categorize_Contract))
data.sample(4)
data[['Contract', 'Contract Start', 'Contract End', 'Contract Length(years)', 'Contract Status']].sample(5)
3. Cleanning Height Column
data['Height'].dtype
data['Height'].unique()
Declare a function
transform_height
that accepts one parameter used to remove the string ‘cm’ and convert the height value from inches to centimeters, resulting integer data type only
def transform_height(x):
if 'cm' in x:
return int(x.strip('cm'))
else:
feet, inches = x.split("'")
total_inches = int(feet)*12 + int(inches.strip('"'))
return round(total_inches * 2.54)
data['Height'] = data['Height'].apply(transform_height)
data['Height'].unique()
rename Height Column to Height(Cm)
data = data.rename(columns={'Height':'Height(Cm)'})
data.sample(1)
4. Cleanning Weight Column
data['Weight'].dtype
data['Weight'].unique()
Declare a function transform_weight
that removes the string ‘kg’ and ‘lbs’ characters, then converts the weight from pounds to kilograms, resulting integer data type
def transform_weight(x):
if 'kg' in x:
return int(x.strip('kg'))
else:
pounds = int(x.strip('lbs'))
return round(pounds/2.205)
data['Weight'] = data['Weight'].apply(transform_weight)
data['Weight'].unique()
rename ‘weight’ column to ‘weight(kg)’
data = data.rename(columns={'Weight':'Weight(kg)'})
data.sample(7)
5. Cleanning Loan Date End Column
data['Loan Date End'].dtype
data['Loan Date End'].unique()
player_onloan = data[data['Contract Status'] == 'On Loan']
player_onloan[['Contract','Contract Status', 'Contract End', 'Loan Date End']]
There are many missing values in the Loan Date End
column because not all players in this dataset are players in the “On Loan” status. There are 1013 rows of data from players in the “On Loan” status, and I think this column might be useful in the future, so I won’t delete it.
6. Cleanning W/F Column
data['W/F'].dtype
data['W/F'].unique()
data['W/F'] = data['W/F'].str.replace('★', '')
data['W/F'].unique()
7. Cleanning SM Column
data['SM'].dtype
data['SM'].unique()
data['SM'] = data['SM'].str.replace('★', '')
data['SM'].unique()
8. Cleanning Hits Column
data['Hits'].dtype
data['Hits'].unique()
Declare a function convert_hits
that replaces null values with 0, removes the string ‘k’ character, multiplies it by 1000, and converts it to an integer data type. Finally, if there’s a float or integer value, it will be converted to an integer value only.
def convert_hits(x):
if pd.isnull(x):
return 0
elif 'K' in str(x):
number_only = str(x).replace('K', '')
number_float = float(number_only)
number_int = int(number_float * 1000)
return number_int
elif isinstance(x, float) or isinstance(x, int):
return int(x)
else:
return int(float(x))
data['Hits'] = data['Hits'].apply(convert_hits)
data['Hits'].dtype
data['Hits'].unique()
9. Cleaning Value, Wage, Release Clause Column
Since I want to use the ‘Value’, ‘Wage’, ‘Release Clause’ columns as aggregate columns in Power BI, it would be easier if I convert the values of these three columns into integers only value and add the euro denomination in the column names.
def convert_euro_to_number(value):
if '€' in value:
if 'M' in value:
return int(float(value.replace('€', '').replace('M', '')) * 1e6)
elif 'K' in value:
return int(float(value.replace('€', '').replace('K', '')) * 1e3)
else:
return 0
return int(value)
The function convert_euro_to_number is designed to convert Euro-formatted values into numeric format. It checks for the presence of the Euro symbol (‘€’) in the value and then handles cases where the value is in millions (‘M’) or thousands (‘K’). Ultimately, it simplifies the conversion process, making it easier to analyze Euro values numerically.
Applying convert_euro_to_number
function to ‘Value’, ‘Wage’, and ‘Release Clause’ Column
data['Value'] = data['Value'].apply(convert_euro_to_number)
data['Wage'] = data['Wage'].apply(convert_euro_to_number)
data['Release Clause'] = data['Release Clause'].apply(convert_euro_to_number)
Renaming ‘Value’, ‘Wage’, and ‘Release Clause’ Column
data.rename(columns={'Value':'Value(Euro)'}, inplace=True)
data.rename(columns={'Wage':'Wage(Euro)'}, inplace=True)
data.rename(columns={'Release Clause':'Release Clause(Euro)'}, inplace=True)
data.sample(5)
Finally Save new data into CSV file.
data.to_csv('Fifa21_fix.csv', index=False)