Python Data Analytics Tutorial - Reading & Cleaning Data with Pandas: Part 2

Exploring the Data: Show Information about the Data Frame

Example 1: Show information about the data frame
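A minimal sketch of how such information can be displayed, assuming a small made-up data frame (the institution names and enrolment numbers are invented for illustration, and the column names are assumed to match the ones used later in this tutorial):

```python
import pandas as pd

# Hypothetical sample data; names and numbers are invented for illustration.
data = pd.DataFrame({
    "Higher Education Institution": ["Inst A", "Inst B", "Inst C", "Inst D"],
    "Enrolled_Under Graduate": [120, 95, 210, 60],
    "Enrolled_Post Graduate": [30, 20, 55, 10],
})

print(data.shape)     # number of rows and columns as a tuple
print(data.columns)   # the column names
print(data.dtypes)    # the data type of each column
data.info()           # prints the index range, non-null counts and dtypes
```

Here data.shape and data.dtypes are attributes (no parentheses), while data.info() is a method that prints its summary directly.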

Display all the rows and columns
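By default pandas shortens the printed output of a large data frame. One possible way to show everything, sketched here on a made-up 100-row data frame, is to lift the display limits temporarily with pd.option_context (the variable names default_view and full_view are only illustrative):

```python
import pandas as pd

# Hypothetical data frame with 100 rows, enough to trigger truncated output.
data = pd.DataFrame({"Value": range(100)})

default_view = repr(data)  # shortened: only the first and last rows appear

# Inside the with block there is no limit on displayed rows or columns.
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    full_view = repr(data)
    print(data)            # prints all 100 rows
```

The limits return to their previous values when the with block ends. pd.set_option("display.max_rows", None) would instead change the setting for the rest of the session.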

Slicing the data: Show specific rows/columns of a data frame

Slicing is a technique for extracting smaller subsets of rows or columns from a large data set.

dataFrame.head(): Display the first 5 rows of the data frame.
dataFrame.head(n): Display the first n rows of the data frame. [n is an int value]
dataFrame.tail(): Display the last 5 rows of the data frame.
dataFrame.tail(n): Display the last n rows of the data frame. [n is an int value]
dataFrame[start:end+1]: Display the rows from position start to position end. The slice stops one row before its upper bound, hence end+1.
dataFrame[start:end+1:step]: Display the rows from position start to position end, taking every step-th row.
dataFrame["Column"]: Display a specific column.
dataFrame[["Column1", "Column2", ...]]: Display specific columns.

Example 2: Show specific rows of the data frame
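A minimal sketch of row slicing, assuming a small made-up data frame with the default integer index (all names and numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical sample data; names and numbers are invented for illustration.
data = pd.DataFrame({
    "Higher Education Institution": ["Inst A", "Inst B", "Inst C", "Inst D",
                                     "Inst E", "Inst F", "Inst G", "Inst H"],
    "Enrolled_Under Graduate": [120, 95, 210, 60, 150, 80, 175, 40],
    "Enrolled_Post Graduate": [30, 20, 55, 10, 45, 15, 50, 5],
})

first_three = data.head(3)   # rows at positions 0, 1, 2
last_two = data.tail(2)      # rows at positions 6, 7
middle = data[2:5]           # rows at positions 2, 3, 4 (position 5 is excluded)
every_other = data[0:8:2]    # rows at positions 0, 2, 4, 6

print(middle)
print(every_other)
```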

Example 3: Show specific columns of the data frame
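A minimal sketch of column selection, again on made-up data. Note the difference between single and double square brackets: the first returns a Series, the second a DataFrame.

```python
import pandas as pd

# Hypothetical sample data; names and numbers are invented for illustration.
data = pd.DataFrame({
    "Higher Education Institution": ["Inst A", "Inst B", "Inst C"],
    "Enrolled_Under Graduate": [120, 95, 210],
    "Enrolled_Post Graduate": [30, 20, 55],
})

# Single brackets with one name: the result is a pandas Series.
institutions = data["Higher Education Institution"]

# Double brackets with a list of names: the result is a DataFrame.
subset = data[["Higher Education Institution", "Enrolled_Post Graduate"]]

print(institutions)
print(subset)
```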

Slice Data Using loc (1)

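The pandas loc indexer lets us select and slice data by label, using both the row index and the column names. Unlike an ordinary Python slice, a loc slice includes both its start and its end label.
It is a powerful tool that lets us focus on the rows and columns that matter for our data analytics.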

	    dataframe.loc[starting_row:end_row,starting_column:end_column]

Example 4: Slice data using loc

The following code displays rows 2 to 5 (inclusive) and the columns from "Higher Education Institution" to "Enrolled_Post Graduate".

	    data.loc[2:5,"Higher Education Institution": "Enrolled_Post Graduate"]

Note that you need to use the index labels of the rows and the names of the columns.
In this example the row range is 2:5.
The column range is "Higher Education Institution":"Enrolled_Post Graduate".
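A runnable sketch of this kind of loc slice, assuming a made-up data frame in which the columns appear in the order shown (names and numbers are invented, and "Enrolled_Under Graduate" is an assumed column name):

```python
import pandas as pd

# Hypothetical sample data; names and numbers are invented for illustration.
data = pd.DataFrame({
    "Higher Education Institution": ["Inst A", "Inst B", "Inst C", "Inst D",
                                     "Inst E", "Inst F", "Inst G", "Inst H"],
    "Enrolled_Under Graduate": [120, 95, 210, 60, 150, 80, 175, 40],
    "Enrolled_Post Graduate": [30, 20, 55, 10, 45, 15, 50, 5],
    "Specialisation": ["Education", "Foundation"] * 4,
})

# Both ends of a loc slice are included: rows 2, 3, 4, 5 and every column
# from "Higher Education Institution" through "Enrolled_Post Graduate".
result = data.loc[2:5, "Higher Education Institution":"Enrolled_Post Graduate"]
print(result)
```

The "Specialisation" column is left out because it comes after the end label of the column slice.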

Slice Data Using loc (2)

You can also display rows and columns that are not in sequence. To do so, list them inside square brackets [ ].

Example 5: Display rows 3, 5, and 7 and columns "Higher Education Institution" and "Enrolled_Under Graduate"

	    data.loc[[3,5,7],["Higher Education Institution", "Enrolled_Under Graduate"]]
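The same selection as a self-contained sketch on made-up data (names and numbers are invented, and "Enrolled_Under Graduate" is an assumed column name):

```python
import pandas as pd

# Hypothetical sample data; names and numbers are invented for illustration.
data = pd.DataFrame({
    "Higher Education Institution": ["Inst A", "Inst B", "Inst C", "Inst D",
                                     "Inst E", "Inst F", "Inst G", "Inst H"],
    "Enrolled_Under Graduate": [120, 95, 210, 60, 150, 80, 175, 40],
    "Enrolled_Post Graduate": [30, 20, 55, 10, 45, 15, 50, 5],
})

# Lists pick individual row labels and column names, in the order given.
picked = data.loc[[3, 5, 7],
                  ["Higher Education Institution", "Enrolled_Under Graduate"]]
print(picked)
```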

Slice data using iloc

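The pandas iloc indexer is similar to loc for slicing rows and columns, but it uses integer positions for both rows and columns instead of index labels and column names. As with an ordinary Python slice, the end position of an iloc slice is excluded.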
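A minimal sketch of iloc on made-up data (names and numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical sample data; names and numbers are invented for illustration.
data = pd.DataFrame({
    "Higher Education Institution": ["Inst A", "Inst B", "Inst C", "Inst D",
                                     "Inst E", "Inst F", "Inst G", "Inst H"],
    "Enrolled_Under Graduate": [120, 95, 210, 60, 150, 80, 175, 40],
    "Enrolled_Post Graduate": [30, 20, 55, 10, 45, 15, 50, 5],
})

# Row positions 2, 3, 4 and column positions 0, 1 (the end of each slice is excluded).
block = data.iloc[2:5, 0:2]

# Lists of positions select rows and columns that are not in sequence.
picked = data.iloc[[3, 5, 7], [0, 2]]

print(block)
print(picked)
```

Compare data.iloc[2:5], which returns three rows, with data.loc[2:5], which returns four on the default integer index.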

Changing the Index

The default index of a DataFrame consists of integer values starting from zero. To use one of the columns as the index instead, call .set_index as follows:

	    data.set_index("ColumnName",inplace=True)

Example 6: Change the index of our test example to Higher Education Institution

	    data.set_index("Higher Education Institution",inplace=True)

Note: Higher Education Institution is now the index and is displayed differently in the DataFrame: its values appear in bold as the row labels.
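A short sketch of what changing the index makes possible, on made-up data. This version does not use inplace=True, so set_index returns a new DataFrame and the original stays unchanged (the name indexed is only illustrative):

```python
import pandas as pd

# Hypothetical sample data; names and numbers are invented for illustration.
data = pd.DataFrame({
    "Higher Education Institution": ["Inst A", "Inst B", "Inst C"],
    "Enrolled_Under Graduate": [120, 95, 210],
    "Enrolled_Post Graduate": [30, 20, 55],
})

# The chosen column becomes the row labels and is no longer a regular column.
indexed = data.set_index("Higher Education Institution")

# Rows can now be looked up by institution name with loc.
post_grads_b = indexed.loc["Inst B", "Enrolled_Post Graduate"]
print(indexed)
print(post_grads_b)
```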

Example 7: Resetting the Index

There are different ways to reset the index back to its original values. One common method is to rerun the line that reads the data from your source. Alternatively, you can use the function .reset_index(), which restores the default integer index and moves the current index back into a regular column.

Reset the index of our test example to its original values

	    data.reset_index(inplace=True)
	    data.head()
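A self-contained sketch of the round trip on made-up data. It also shows the drop=True option, which discards the old index instead of turning it back into a column:

```python
import pandas as pd

# Hypothetical sample data; names and numbers are invented for illustration.
data = pd.DataFrame({
    "Higher Education Institution": ["Inst A", "Inst B", "Inst C"],
    "Enrolled_Under Graduate": [120, 95, 210],
    "Enrolled_Post Graduate": [30, 20, 55],
})

data.set_index("Higher Education Institution", inplace=True)
data.reset_index(inplace=True)   # the index returns as the first column
print(data.head())

# With drop=True the old index is thrown away rather than restored as a column.
dropped = data.set_index("Higher Education Institution").reset_index(drop=True)
print(dropped.head())
```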

Statistics/Aggregation Commands

Pandas makes it very simple to calculate different statistics when you need to summarise the data in a data frame.

Syntax for using a statistics method on a specific column

	    dataframe["Column"].statistics_method()

Syntax for using a statistics method on all columns

	    dataframe.statistics_method()
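A minimal sketch with a few common statistics methods, on made-up data (names and numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical sample data; names and numbers are invented for illustration.
data = pd.DataFrame({
    "Higher Education Institution": ["Inst A", "Inst B", "Inst C", "Inst D"],
    "Enrolled_Under Graduate": [120, 95, 210, 60],
    "Enrolled_Post Graduate": [30, 20, 55, 10],
})

# Statistics on one column.
post_total = data["Enrolled_Post Graduate"].sum()
post_mean = data["Enrolled_Post Graduate"].mean()
post_max = data["Enrolled_Post Graduate"].max()

# Statistics on all columns; numeric_only=True skips the text column.
column_means = data.mean(numeric_only=True)

# describe() reports count, mean, std, min, quartiles and max in one table.
summary = data.describe()

print(post_total, post_mean, post_max)
print(column_means)
print(summary)
```

Passing numeric_only=True is a cautious choice here because the data frame also holds a text column, which has no mean.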

Displaying unique values in a column

Finding the unique (distinct) values in a column is often needed when analysing your data.

	    dataFrame["Column"].unique()

For example, to find the unique values in the column "Specialisation", use the function .unique():

	    data["Specialisation"].unique()
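A short sketch on made-up data. The related methods nunique() and value_counts() are shown as well, since they answer the closely related questions of how many distinct values there are and how often each occurs:

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
data = pd.DataFrame({
    "Specialisation": ["Education", "Foundation", "Education",
                       "Foundation", "Education"],
})

unique_values = data["Specialisation"].unique()   # distinct values, in order of first appearance
how_many = data["Specialisation"].nunique()       # number of distinct values
counts = data["Specialisation"].value_counts()    # occurrences of each value

print(unique_values)
print(how_many)
print(counts)
```

The parentheses matter: without them, data["Specialisation"].unique refers to the method itself and does not return the values.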

Calculated Columns

Pandas allows you to easily add new columns to the DataFrame. This is usually used to create a new calculated column.

Syntax to create a new column:

	    DataFrame["New Column"] = expression

Example: The example below creates a new column, Total Enrolled, which is the sum of Enrolled_Under Graduate and Enrolled_Post Graduate.

	    data["Total Enrolled"] = data["Enrolled_Under Graduate"] + data["Enrolled_Post Graduate"]
	    data.head()
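The same calculation as a self-contained sketch on made-up data (names and numbers are invented, and the column names are assumptions):

```python
import pandas as pd

# Hypothetical sample data; names and numbers are invented for illustration.
data = pd.DataFrame({
    "Higher Education Institution": ["Inst A", "Inst B", "Inst C"],
    "Enrolled_Under Graduate": [120, 95, 210],
    "Enrolled_Post Graduate": [30, 20, 55],
})

# The addition is applied row by row, so no explicit loop is needed.
data["Total Enrolled"] = (data["Enrolled_Under Graduate"]
                          + data["Enrolled_Post Graduate"])
print(data.head())
```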

Appending Data: Join Two DataFrames

	    # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
	    newDataFrame = pd.concat([dataFrame1, dataFrame2])

Example: You have two data frames, as below. Both sheets have the same structure. They contain students from the Education and Foundation specialisations. We need to combine both into one DataFrame.
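A minimal sketch of combining two data frames with the same structure, using pd.concat. The two frames, their StudentID values and the variable names are invented for illustration:

```python
import pandas as pd

# Two hypothetical data frames with identical columns.
education = pd.DataFrame({
    "StudentID": [101, 102],
    "Specialisation": ["Education", "Education"],
})
foundation = pd.DataFrame({
    "StudentID": [201, 202],
    "Specialisation": ["Foundation", "Foundation"],
})

# Stack the rows of both frames; ignore_index=True renumbers the rows 0..n-1
# instead of keeping each frame's own 0, 1 labels.
combined = pd.concat([education, foundation], ignore_index=True)
print(combined)
```

Without ignore_index=True the combined frame would keep the row labels 0, 1, 0, 1, which makes label-based selection with loc ambiguous.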

Writing data to external file

Example: Write the data you cleaned in the previous example to an external file.

	    with pd.ExcelWriter("NewData.xlsx") as writer:
	        data.to_excel(writer, sheet_name="Sheet1")

The lines above store the DataFrame data in an Excel file 'NewData.xlsx' in a sheet with the name 'Sheet1'. The with block saves and closes the file when it ends.

Summary of Pandas Commands
