Import Data into Pandas Dataframe and Query It

The provided code demonstrates various operations performed on a pandas DataFrame loaded from a CSV file. Let's break down each step and add the additional modifications as requested.

This demonstration uses the "california_housing_test.csv" dataset, which is found in the Google Colab samples. A refernce to obtain this data can be found on the sqlandpy.com website.

  1. Importing the necessary libraries and loading the data:
    import pandas as pd
    df = pd.read_csv('C:\\Data\\sample_data\\california_housing_test.csv')
    df

In this step, the 'pandas' library is imported, and the 'read_csv()' function is used to load the data from the specified CSV file, "C:\\Data\\sample_data\\california_housing_test.csv". The loaded data is assigned to the variable 'df', which represents a DataFrame. Finally, the DataFrame is displayed by calling 'df'.

  1. Selecting specific columns from the DataFrame:
    df[["longitude","latitude","housing_median_age"]]

Here, the 'df' DataFrame is accessed using square brackets '[]' to select specific columns. The columns '"longitude"', '"latitude"', and '"housing_median_age"' are chosen by passing them as a list of column names. This operation returns a subset of the original DataFrame that contains only the selected columns.

  1. Adding a restriction for the top 5 rows:
    df[["longitude","latitude","housing_median_age"]].head(5)

To restrict the output to only the top 5 rows, the '.head(5)' method is added to the previously selected columns. This method returns the first 5 rows of the DataFrame, combining both column selection and row restriction.

  1. Adding an 'order by' clause:
    df[["longitude","latitude","housing_median_age"]] \
    .head(5) \
    .sort_values(["housing_median_age","latitude"], ascending=(True,False))

To sort the DataFrame based on columns, the '.sort_values()' method is used. The columns to be sorted are specified as a list '["housing_median_age","latitude"]'. The 'ascending' parameter is set to '(True,False)' to sort '"housing_median_age"' in ascending order and '"latitude"' in descending order. This method is chained after the previous operations to sort the restricted DataFrame.

By combining all the modifications to the code, we get the following consolidated code:

import pandas as pd

Import data from csv to dataframe

df = pd.read_csv('C:\\Data\\sample_data\\california_housing_test.csv')

Select some columns from a dataframe

We will select these columns: "longitude","latitude","housing_median_age"

selected_columns = df[["longitude","latitude","housing_median_age"]]

Add restriction for only the top 5 rows

top_5_rows = selected_columns.head(5)

Add 'order by' on these columns "housing_median_age" asc, "latitude" desc

sorted_dataframe = top_5_rows.sort_values(["housing_median_age","latitude"], ascending=(True,False))

Display the sorted DataFrame

sorted_dataframe

This code imports the data, selects the desired columns, restricts it to the top 5 rows, and then sorts the resulting DataFrame by the specified columns. Finally, it displays the sorted DataFrame.