The provided code demonstrates various operations performed on a pandas DataFrame loaded from a CSV file. Let's break down each step and add the additional modifications as requested.
This demonstration uses the "california_housing_test.csv" dataset, which is found in the Google Colab samples. A refernce to obtain this data can be found on the sqlandpy.com website.
- Importing the necessary libraries and loading the data:
import pandas as pd
df = pd.read_csv('C:\\Data\\sample_data\\california_housing_test.csv')
df
In this step, the 'pandas' library is imported, and the 'read_csv()' function is used to load the data from the specified CSV file, "C:\\Data\\sample_data\\california_housing_test.csv". The loaded data is assigned to the variable 'df', which represents a DataFrame. Finally, the DataFrame is displayed by calling 'df'.
- Selecting specific columns from the DataFrame:
df[["longitude","latitude","housing_median_age"]]
Here, the 'df' DataFrame is accessed using square brackets '[]' to select specific columns. The columns '"longitude"', '"latitude"', and '"housing_median_age"' are chosen by passing them as a list of column names. This operation returns a subset of the original DataFrame that contains only the selected columns.
- Adding a restriction for the top 5 rows:
df[["longitude","latitude","housing_median_age"]].head(5)
To restrict the output to only the top 5 rows, the '.head(5)' method is added to the previously selected columns. This method returns the first 5 rows of the DataFrame, combining both column selection and row restriction.
- Adding an 'order by' clause:
df[["longitude","latitude","housing_median_age"]] \
.head(5) \
.sort_values(["housing_median_age","latitude"], ascending=(True,False))
To sort the DataFrame based on columns, the '.sort_values()' method is used. The columns to be sorted are specified as a list '["housing_median_age","latitude"]'. The 'ascending' parameter is set to '(True,False)' to sort '"housing_median_age"' in ascending order and '"latitude"' in descending order. This method is chained after the previous operations to sort the restricted DataFrame.
By combining all the modifications to the code, we get the following consolidated code:
import pandas as pd
Import data from csv to dataframe
df = pd.read_csv('C:\\Data\\sample_data\\california_housing_test.csv')
Select some columns from a dataframe
We will select these columns: "longitude","latitude","housing_median_age"
selected_columns = df[["longitude","latitude","housing_median_age"]]
Add restriction for only the top 5 rows
top_5_rows = selected_columns.head(5)
Add 'order by' on these columns "housing_median_age" asc, "latitude" desc
sorted_dataframe = top_5_rows.sort_values(["housing_median_age","latitude"], ascending=(True,False))
Display the sorted DataFrame
sorted_dataframe
This code imports the data, selects the desired columns, restricts it to the top 5 rows, and then sorts the resulting DataFrame by the specified columns. Finally, it displays the sorted DataFrame.