Skip to the content.

Fifa 21 Python Data Cleaning

Overview

This repository contains the code to clean up the FIFA 21 dataset game. The purpose of this project is to clean and standardize the value format of several columns that originally varied. FIFA21 dataset contains 18K+ records of player data and 77 Columns. Each record represents a unique player and includes various attributes

Issues found in the data

LinkedIn

Fairuz Mujahid Annabil

Tools Used

  1. Python : used Numpy and Pandas Library for Cleaning
  2. Visual Code Studio : IDE for Data Exploration and Cleaning

    Columns Explanation

    For descriptions of all columns, you can refer to a file named “FIFA21 Column Explanation”

    Data Analysis

    Data Analysis Table of Contents

Data Cleanning

create a copy of dataframe

data = df.copy()
data.head(5)

image

1. Cleanning Club Column

data['Club'].dtype
data['Club'].unique()

image There are many whitespace characters in some club names in the Club column. Since the whitespace characters are both at the beginning and end of the string, using .strip() would be appropriate to remove them.

data['Club'] = data['Club'].str.strip()
data['Club'].unique()

image

2. Cleanning Contract Column

data['Contract'].dtype
data['Contract'].unique()

image using .loc to search for values in the ‘Contract’ column for each row where the value contains the string ‘On Loan’ or ‘Free’.

for index, kontrak in data.loc[(data['Contract'].str.contains('On Loan')) | (data['Contract'].str.contains('Free')), 'Contract'].items():
    print(kontrak)

image Declare a function extract_contract that accepts one parameter contract, which is used to create 3 new columns: ‘Contract Start’, ‘Contract End’, and ‘Contract Length (years)’ with values taken from the ‘Contract’ column.

def extract_contract(contract):
    if contract == 'Free' or 'On Loan' in contract:
        start_date = np.nan
        end_date = np.nan
        contract_length = 0
    else:
        start_date, end_date = contract.split(' ~ ')
        start_year = int(start_date[:4])
        end_year = int(end_date[:4])
        contract_length = end_year - start_year
    return start_date, end_date, contract_length

new_columns = ['Contract Start', 'Contract End', 'Contract Length(years)']
new_data = data['Contract'].apply(lambda x: pd.Series(extract_contract(x)))

for i in range(len(new_columns)):
    data.insert(loc=data.columns.get_loc('Contract')+1+i, column=new_columns[i], value=new_data[i])
data[['Contract', 'Contract Start', 'Contract End', 'Contract Length(years)']].sample(10)

image

Declare a function Categorize_Contract that categorizes values in the ‘Contract’ column into three categories: ‘Free’, ‘On Loan’, and ‘Contract’. Then create a new column ‘Contract Status’ to the DataFrame data to the right of the ‘Contract Length (years)’ column. The values for this new column are generated by applying the Categorize_Contract function to each value in the ‘Contract’ column.

def Categorize_Contract(contract):
    if contract == 'Free':
        return 'Free'
    elif 'On Loan'in contract:
        return 'On Loan'
    else:
        return 'Contract'

data.insert(data.columns.get_loc('Contract Length(years)')+1, 'Contract Status', data['Contract'].apply(Categorize_Contract))
data.sample(4)

image

data[['Contract', 'Contract Start', 'Contract End', 'Contract Length(years)', 'Contract Status']].sample(5)

image

3. Cleanning Height Column

data['Height'].dtype
data['Height'].unique()

image Declare a function transform_height that accepts one parameter used to remove the string ‘cm’ and convert the height value from inches to centimeters, resulting integer data type only

def transform_height(x):
    if 'cm' in x:
        return int(x.strip('cm'))
    else:
        feet, inches = x.split("'")
        total_inches = int(feet)*12 + int(inches.strip('"'))
        return round(total_inches * 2.54)
    
data['Height'] = data['Height'].apply(transform_height)
data['Height'].unique()

image

rename Height Column to Height(Cm)

data = data.rename(columns={'Height':'Height(Cm)'})
data.sample(1)

image

4. Cleanning Weight Column

data['Weight'].dtype
data['Weight'].unique()

image

Declare a function transform_weight that removes the string ‘kg’ and ‘lbs’ characters, then converts the weight from pounds to kilograms, resulting integer data type

def transform_weight(x):
    if 'kg' in x:
        return int(x.strip('kg'))
    else:
        pounds = int(x.strip('lbs'))
        return round(pounds/2.205)
    
data['Weight'] = data['Weight'].apply(transform_weight)
data['Weight'].unique()

image

rename ‘weight’ column to ‘weight(kg)’

data = data.rename(columns={'Weight':'Weight(kg)'})
data.sample(7)

image

5. Cleanning Loan Date End Column

data['Loan Date End'].dtype
data['Loan Date End'].unique()

image

player_onloan = data[data['Contract Status'] == 'On Loan']
player_onloan[['Contract','Contract Status', 'Contract End', 'Loan Date End']]

image

There are many missing values in the Loan Date End column because not all players in this dataset are players in the “On Loan” status. There are 1013 rows of data from players in the “On Loan” status, and I think this column might be useful in the future, so I won’t delete it.

6. Cleanning W/F Column

data['W/F'].dtype
data['W/F'].unique()
data['W/F'] = data['W/F'].str.replace('★', '')
data['W/F'].unique()

image

7. Cleanning SM Column

data['SM'].dtype
data['SM'].unique()
data['SM'] = data['SM'].str.replace('★', '')
data['SM'].unique()

image

8. Cleanning Hits Column

data['Hits'].dtype
data['Hits'].unique()

image

Declare a function convert_hits that replaces null values with 0, removes the string ‘k’ character, multiplies it by 1000, and converts it to an integer data type. Finally, if there’s a float or integer value, it will be converted to an integer value only.

def convert_hits(x):
    if pd.isnull(x):
        return 0
    elif 'K' in str(x):
        number_only = str(x).replace('K', '')
        number_float = float(number_only)
        number_int = int(number_float * 1000)
        return number_int
    elif isinstance(x, float) or isinstance(x, int):
        return int(x)
    else:
        return int(float(x))

    
data['Hits'] = data['Hits'].apply(convert_hits)

image

data['Hits'].dtype

image

data['Hits'].unique()

image image

9. Cleaning Value, Wage, Release Clause Column

Since I want to use the ‘Value’, ‘Wage’, ‘Release Clause’ columns as aggregate columns in Power BI, it would be easier if I convert the values of these three columns into integers only value and add the euro denomination in the column names.

def convert_euro_to_number(value):
    if '€' in value:
        if 'M' in value:
            return int(float(value.replace('€', '').replace('M', '')) * 1e6)
        elif 'K' in value:
            return int(float(value.replace('€', '').replace('K', '')) * 1e3)
        else:
            return 0
    return int(value)

The function convert_euro_to_number is designed to convert Euro-formatted values into numeric format. It checks for the presence of the Euro symbol (‘€’) in the value and then handles cases where the value is in millions (‘M’) or thousands (‘K’). Ultimately, it simplifies the conversion process, making it easier to analyze Euro values numerically.

Applying convert_euro_to_number function to ‘Value’, ‘Wage’, and ‘Release Clause’ Column

data['Value'] = data['Value'].apply(convert_euro_to_number)
data['Wage'] = data['Wage'].apply(convert_euro_to_number)
data['Release Clause'] = data['Release Clause'].apply(convert_euro_to_number)

Renaming ‘Value’, ‘Wage’, and ‘Release Clause’ Column

data.rename(columns={'Value':'Value(Euro)'}, inplace=True)
data.rename(columns={'Wage':'Wage(Euro)'}, inplace=True)
data.rename(columns={'Release Clause':'Release Clause(Euro)'}, inplace=True)
data.sample(5)

image

Finally Save new data into CSV file.

data.to_csv('Fifa21_fix.csv', index=False)