Python Data Analytics Tutorial — Reading & Cleaning Data with Pandas: Part3

Slice data using simple conditions

You can also slice data using conditions based on the value of one or more columns.

The structure of the command line is: DataFrame[Condition]

Example 1: Display all the rows where "Specialisation" is "Health and Medicine"

Condition: data["Specialisation"] == "Health and Medicine" the code is:

	    data[data["Specialisation"] == "Health and Medicine"]

Another Slice data using simple conditions

Example 2: Display all the rows where "Enrolled _ UnderGraduate" is 1000 and above

	    data[data["Enrolled _ UnderGraduate"] >= 1000]

Example 3: Display total number of rows where "Enrolled _ UnderGraduate" is greater or equal to the average "Enrolled _ UnderGraduate"

	    data[data["Enrolled _ UnderGraduate"] >= data["Enrolled _ UnderGraduate"].mean()]

Slice data using multiple conditions

To slice data using more than one condition, you need to put each condition inside a bracket ( ). You also need to use a logical operator. The symbol & is used for and. The symbol | is used for or.

The structure of the command line is:

DataFrame[(Condition) & (Condition)]
DataFrame[(Condition) | (Condition)]

Example 4: Display all the rows where "Specialisation" is "Law" or "Humanities"

        (data["Specialisation"] == "Law") | (data["Specialisation"] == "Humanities")

Example 5: Display all the rows where " Enrolled _ UnderGraduate" value is from 500 to 800

        (data["Enrolled _ UnderGraduate"] ≥ 500) & (data["Enrolled _ UnderGraduate" ≤ 800)

Sorting data with one column

Sorting data is a technique that display data in an ascending or descending order.

        data.sort_values(['CoulmnName'], ascending=False)

By default, the data will be sorted in ascending order.

Example: Sort the Enrolled _ UnderGraduate column

        data.sort_values('Enrolled _ UnderGraduate')

        data.sort_values('Enrolled _ UnderGraduate', ascending = False)

Sorting data with more than one column

To sort data based on more than one column, you need to include all columns inside the square bracket.

Example: Sort Specialisation in ascending order, Enrolled undergraduate in descending order

        data.sort_values(['Specialisation','Enrolled _ UnderGraduate'],ascending=[True,False])

Grouping Data

Grouping data by columns is used when you have duplicated data in a particular column. For example you, the same student is doing more than one course and has grades in each course. The function needed for this operation is .groupby

        dataframe.groupby(['Column1', 'Column2']).mean()

Instead of mean(), you can use sum(), min() or any of the aggregation commands.

Example: Display the total Number of students enrolled in each specialisation

        data.groupby('Specialization') [ ['Total Enrolled']].sum()

Example 2: Total Number of students enrolled in each specialisation, by each Institution

        data.groupby(['Higher Education Institution', 'Specialization')] [ ['Total Enrolled']].sum()

Summary of Pandas Commands

Data Analytics Part-3