Fifa 21 Python Data Cleaning

Overview

This repository contains the code to clean up the FIFA 21 dataset game. The purpose of this project is to clean and standardize the value format of several columns that originally varied. FIFA21 dataset contains 18K+ records of player data and 77 Columns. Each record represents a unique player and includes various attributes

Issues found in the data

Inconsistent value formats : Several columns have inconsistent value formats, which could potentially disrupt operations such as grouping by, filtering, or aggregating.
Missing Values : Some columns have missing values that need to be handled differently.

Fairuz Mujahid Annabil

Tools Used

Python : used Numpy and Pandas Library for Cleaning
Visual Code Studio : IDE for Data Exploration and Cleaning
Columns Explanation

For descriptions of all columns, you can refer to a file named “FIFA21 Column Explanation”

Data Analysis

Data Analysis Table of Contents
- Importing Libraries
- Cleanning Club Column
- Cleanning Contract Column
- Cleanning Height Column
- Cleanning Weight Column
- Cleanning Loan Date End Column
- Cleanning W/F Column
- Cleanning SM Column
- Cleanning Hits Column
- Cleaning Value, Wage, Release Clause Column
  Importing Libraries and Read the csv file
```
import numpy as np
import pandas as pd
```
```
df = pd.read_csv('fifa21 data.csv')
pd.set_option('display.max_columns', None)
df
```
```
df.info()
```

Data Cleanning

create a copy of dataframe

data = df.copy()
data.head(5)

1. Cleanning Club Column

data['Club'].dtype

data['Club'].unique()

There are many whitespace characters in some club names in the Club column. Since the whitespace characters are both at the beginning and end of the string, using .strip() would be appropriate to remove them.

data['Club'] = data['Club'].str.strip()
data['Club'].unique()

2. Cleanning Contract Column

data['Contract'].dtype

data['Contract'].unique()

using .loc to search for values in the ‘Contract’ column for each row where the value contains the string ‘On Loan’ or ‘Free’.

for index, kontrak in data.loc[(data['Contract'].str.contains('On Loan')) | (data['Contract'].str.contains('Free')), 'Contract'].items():
    print(kontrak)

Declare a function extract_contract that accepts one parameter contract, which is used to create 3 new columns: ‘Contract Start’, ‘Contract End’, and ‘Contract Length (years)’ with values taken from the ‘Contract’ column.

def extract_contract(contract):
    if contract == 'Free' or 'On Loan' in contract:
        start_date = np.nan
        end_date = np.nan
        contract_length = 0
    else:
        start_date, end_date = contract.split(' ~ ')
        start_year = int(start_date[:4])
        end_year = int(end_date[:4])
        contract_length = end_year - start_year
    return start_date, end_date, contract_length

new_columns = ['Contract Start', 'Contract End', 'Contract Length(years)']
new_data = data['Contract'].apply(lambda x: pd.Series(extract_contract(x)))

for i in range(len(new_columns)):
    data.insert(loc=data.columns.get_loc('Contract')+1+i, column=new_columns[i], value=new_data[i])

data[['Contract', 'Contract Start', 'Contract End', 'Contract Length(years)']].sample(10)

Declare a function Categorize_Contract that categorizes values in the ‘Contract’ column into three categories: ‘Free’, ‘On Loan’, and ‘Contract’. Then create a new column ‘Contract Status’ to the DataFrame data to the right of the ‘Contract Length (years)’ column. The values for this new column are generated by applying the Categorize_Contract function to each value in the ‘Contract’ column.

def Categorize_Contract(contract):
    if contract == 'Free':
        return 'Free'
    elif 'On Loan'in contract:
        return 'On Loan'
    else:
        return 'Contract'

data.insert(data.columns.get_loc('Contract Length(years)')+1, 'Contract Status', data['Contract'].apply(Categorize_Contract))
data.sample(4)

data[['Contract', 'Contract Start', 'Contract End', 'Contract Length(years)', 'Contract Status']].sample(5)

3. Cleanning Height Column

data['Height'].dtype

data['Height'].unique()

Declare a function transform_height that accepts one parameter used to remove the string ‘cm’ and convert the height value from inches to centimeters, resulting integer data type only

def transform_height(x):
    if 'cm' in x:
        return int(x.strip('cm'))
    else:
        feet, inches = x.split("'")
        total_inches = int(feet)*12 + int(inches.strip('"'))
        return round(total_inches * 2.54)
    
data['Height'] = data['Height'].apply(transform_height)
data['Height'].unique()

rename Height Column to Height(Cm)

data = data.rename(columns={'Height':'Height(Cm)'})
data.sample(1)

4. Cleanning Weight Column

data['Weight'].dtype

data['Weight'].unique()

Declare a function transform_weight that removes the string ‘kg’ and ‘lbs’ characters, then converts the weight from pounds to kilograms, resulting integer data type

def transform_weight(x):
    if 'kg' in x:
        return int(x.strip('kg'))
    else:
        pounds = int(x.strip('lbs'))
        return round(pounds/2.205)
    
data['Weight'] = data['Weight'].apply(transform_weight)
data['Weight'].unique()

rename ‘weight’ column to ‘weight(kg)’

data = data.rename(columns={'Weight':'Weight(kg)'})
data.sample(7)

5. Cleanning Loan Date End Column

data['Loan Date End'].dtype

data['Loan Date End'].unique()

player_onloan = data[data['Contract Status'] == 'On Loan']
player_onloan[['Contract','Contract Status', 'Contract End', 'Loan Date End']]

There are many missing values in the Loan Date End column because not all players in this dataset are players in the “On Loan” status. There are 1013 rows of data from players in the “On Loan” status, and I think this column might be useful in the future, so I won’t delete it.

6. Cleanning W/F Column

data['W/F'].dtype

data['W/F'].unique()

data['W/F'] = data['W/F'].str.replace('★', '')
data['W/F'].unique()

7. Cleanning SM Column

data['SM'].dtype

data['SM'].unique()

data['SM'] = data['SM'].str.replace('★', '')
data['SM'].unique()

8. Cleanning Hits Column

data['Hits'].dtype

data['Hits'].unique()

Declare a function convert_hits that replaces null values with 0, removes the string ‘k’ character, multiplies it by 1000, and converts it to an integer data type. Finally, if there’s a float or integer value, it will be converted to an integer value only.

def convert_hits(x):
    if pd.isnull(x):
        return 0
    elif 'K' in str(x):
        number_only = str(x).replace('K', '')
        number_float = float(number_only)
        number_int = int(number_float * 1000)
        return number_int
    elif isinstance(x, float) or isinstance(x, int):
        return int(x)
    else:
        return int(float(x))

    
data['Hits'] = data['Hits'].apply(convert_hits)

data['Hits'].dtype

data['Hits'].unique()

9. Cleaning Value, Wage, Release Clause Column

Since I want to use the ‘Value’, ‘Wage’, ‘Release Clause’ columns as aggregate columns in Power BI, it would be easier if I convert the values of these three columns into integers only value and add the euro denomination in the column names.

def convert_euro_to_number(value):
    if '€' in value:
        if 'M' in value:
            return int(float(value.replace('€', '').replace('M', '')) * 1e6)
        elif 'K' in value:
            return int(float(value.replace('€', '').replace('K', '')) * 1e3)
        else:
            return 0
    return int(value)

The function convert_euro_to_number is designed to convert Euro-formatted values into numeric format. It checks for the presence of the Euro symbol (‘€’) in the value and then handles cases where the value is in millions (‘M’) or thousands (‘K’). Ultimately, it simplifies the conversion process, making it easier to analyze Euro values numerically.

Applying convert_euro_to_number function to ‘Value’, ‘Wage’, and ‘Release Clause’ Column

data['Value'] = data['Value'].apply(convert_euro_to_number)
data['Wage'] = data['Wage'].apply(convert_euro_to_number)
data['Release Clause'] = data['Release Clause'].apply(convert_euro_to_number)

Renaming ‘Value’, ‘Wage’, and ‘Release Clause’ Column

data.rename(columns={'Value':'Value(Euro)'}, inplace=True)
data.rename(columns={'Wage':'Wage(Euro)'}, inplace=True)
data.rename(columns={'Release Clause':'Release Clause(Euro)'}, inplace=True)

data.sample(5)

Finally Save new data into CSV file.

data.to_csv('Fifa21_fix.csv', index=False)