Python, pandas, and databases
Posted on Thu 15 June 2023 in Databases
Pandas is a Python package that makes working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python 1. Data scientists commonly load data from databases into Pandas dataframes. However, when you are learning to use Pandas, it is hard to find a public database with which you can practice meaningful data operations.
This post shows you how to use Pandas to read data from a database that contains enough data to be interesting, and how to perform basic data preparation. The examples in this tutorial use the Microsoft AdventureWorks LT sample database. You may create your own version of this database or you may use the public AdventureWorks SQL Server. In this post, you will use the public server so that you can immediately get started with the examples.
The short answer
Once you are connected to a database, Pandas makes it easy to load data from it.
If you read my previous posts about reading databases schema information or using Python programs to read data from a database, you have already learned how to connect to a database.
Programmers can use the Pandas read_sql_table() method to read entire SQL database tables into Pandas dataframes by passing in the table name as a parameter. Then, they can use Pandas to join, transform, and filter those dataframes until they create the dataset that they need.
Python programmers who are proficient in writing SQL queries may pass a string containing an SQL query to the Pandas read_sql_query() method to extract and transform a dataset before loading it into a Pandas dataframe.
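For example, here is a minimal sketch of both approaches. It assumes you already have a SQLAlchemy-compatible database URI in a variable named uri (you will build one later in this post) and that the database contains the SalesLT.Product table used throughout this tutorial:

import pandas as pd

# Read an entire table into a dataframe...
products = pd.read_sql_table('Product', uri, schema='SalesLT')

# ...or send an SQL query to the server and load only the selected data.
prices = pd.read_sql_query('SELECT Name, ListPrice FROM SalesLT.Product', uri)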
Set up your environment
Before you start working through this tutorial, create a Python virtual environment and install the packages you need in it. Then, start a Jupyter Notebook so you can follow along with this tutorial. You may use the Python REPL instead, if you do not want to use Jupyter.
Basic configuration
I have already covered the virtual environment setup process in my previous posts so I will just list the required commands here, without explanation.
$ sudo su
$ curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
$ curl https://packages.microsoft.com/config/ubuntu/$(lsb_release -rs)/prod.list > /etc/apt/sources.list.d/mssql-release.list
$ exit
$ sudo apt-get update
$ sudo ACCEPT_EULA=Y apt-get install -y msodbcsql18
$
$ mkdir data-science
$ cd data-science
$ python -m venv .venv
$ source ./.venv/bin/activate
(.venv) $ pip install jupyterlab
(.venv) $ pip install python-dotenv
(.venv) $ pip install pyodbc
(.venv) $ sudo apt install unixodbc
Install Pandas
Install Pandas. When Pandas is installed, NumPy will also be installed. NumPy (Numerical Python) is an open source Python library that’s used for working with arrays of numerical data in Python. Also install some Excel packages that help Pandas write dataframes to local storage as Excel spreadsheets.
(.venv) $ pip install pandas
(.venv) $ pip install openpyxl xlsxwriter xlrd
Install SQLAlchemy
Install SQLAlchemy. Pandas methods use SQLAlchemy functions when they read SQL database tables. Pandas does not support pyodbc connection objects. It only supports SQLAlchemy connections, SQLite connections, or a database string URI. Even when you use just the database string URI, Pandas still uses SQLAlchemy functions to connect to and read from an SQL database.
(.venv) $ pip install sqlalchemy
We don't use it explicitly in this tutorial, but SQLAlchemy is a very interesting library. It can be used with Pandas to define how programs access databases, to manage connections, and to build database queries using only Python code. I will cover more about using SQLAlchemy in future posts.
Start a notebook
Start a new Jupyter Notebook.
(.venv) $ jupyter-lab
When following along with the code examples in this document, open a new notebook cell for each example, enter the code, and run it. The results of code run in previous cells are held in memory and are available to subsequent cells. For example, a dataframe created in one cell can be used in a later cell.
The database
Use the public SQL server that is supported by the team that runs the sqlservercentral.com web site. It contains most of the AdventureWorks LT sample database. As far as I know, this is the only public SQL server that contains interesting data and is available to use for free 2.
If this public server is not available when you read this post, you can create your own AdventureWorks LT database on a free server in Microsoft Azure. Or, you can run a version of the same database locally on your PC with SQLite or Docker.
Prepare to connect to the database
To connect to the database, first define an environment variable that contains the connection string. If you are working with a database on Microsoft Azure, you will get the string from the Azure Portal or from the database administrator. In this case, the connection string is based on the database and user information provided on the sqlservercentral.com web site.
In your terminal window, run the following command to create a dotenv file that contains the correct connection string:
(.venv) $ echo 'CONN_PUBLIC="Driver={ODBC Driver 18 for SQL Server};'\
'Server=tcp:sqlservercentralpublic.database.windows.net,1433;'\
'Database=AdventureWorks;Uid=sqlfamily;Pwd=sqlf@m1ly;Encrypt=yes;'\
'TrustServerCertificate=no;"' > .env
Then switch to your Jupyter Notebook (or REPL). In the first Jupyter Notebook cell, get the connection string environment variable and assign it to the variable named connection_string:
import os
from dotenv import load_dotenv
load_dotenv('.env', override=True)
connection_string = os.getenv('CONN_PUBLIC')
Pandas does not directly support pyodbc connection objects. Pandas uses SQLAlchemy, which uses the pyodbc library and the Microsoft ODBC driver, to manage database connections, based on the information you provide in the database string URI. Convert the connection string you stored in the dotenv file into the format required so it can be used as a database URI.
The code below performs the conversion. The quote_plus() function replaces spaces and special characters with escape codes so that, if your connection string contains a password with special characters, Pandas can still use it. Then, prepend a prefix that identifies the dialect and driver that SQLAlchemy and Pandas will use. Run the code below to set up the database URI:
from urllib.parse import quote_plus
url_string = quote_plus(connection_string)
uri = f'mssql+pyodbc:///?odbc_connect={url_string}'
You are not yet connected to the database. You have created a database string URI in the right format so Pandas and SQLAlchemy can use it to create and manage connections to the database. Now, you are ready to use Pandas to read data from the database.
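If you want to confirm that the URI works before you go further, one optional check (not required for the rest of this tutorial) is to create a SQLAlchemy engine from the URI and open a connection:

# Optional sanity check: build a SQLAlchemy engine from the URI and run a
# trivial query. Pandas accepts either the raw URI or this engine object.
from sqlalchemy import create_engine, text

engine = create_engine(uri)
with engine.connect() as conn:
    print(conn.execute(text('SELECT 1')).scalar())  # prints 1 if the connection works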
Database schema
To read data from the database, you need to know the schema information. Ideally, the database administrator would provide you with documentation that describes the schemas, tables, and data relationships. You can read more information about the AdventureWorks LT database in my previous posts. For this example, you just need to know the schema and table names.
The only schema available to you, other than the system schemas, is the SalesLT schema. It contains the following tables, which you can also confirm programmatically, as shown in the sketch after this list:
- Address
- Customer
- CustomerAddress
- Product
- ProductCategory
- ProductDescription
- ProductModel
- ProductModelProductDescription
- SalesOrderDetail
- SalesOrderHeader
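If you do not have documentation handy, the short sketch below, which assumes the uri variable you created earlier, uses SQLAlchemy's inspector to list the table names in the SalesLT schema:

# List the tables in the SalesLT schema using SQLAlchemy's inspector.
from sqlalchemy import create_engine, inspect

inspector = inspect(create_engine(uri))
print(inspector.get_table_names(schema='SalesLT'))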
Read database tables into Pandas dataframes
Developers who want to use Pandas to read entire tables from the SQL database into Pandas dataframes may use the Pandas read_sql_table() method. Simply pass it the name of the table in the database, the database URI, and the schema name.
Pandas does the work of creating a connection to the database, reading all the columns in the table, exporting them into a Pandas dataframe, and returning the dataframe. For example, the code below reads the contents of the ProductCategory table and prints the first few rows.
import pandas as pd
categories = pd.read_sql_table('ProductCategory', uri, 'SalesLT')
print(categories.head(7))
The printed dataframe looks like the following output:
ProductCategoryID ParentProductCategoryID Name \
0 1 NaN Bikes
1 2 NaN Components
2 3 NaN Clothing
3 4 NaN Accessories
4 5 1.0 Mountain Bikes
5 6 1.0 Road Bikes
6 7 1.0 Touring Bikes
rowguid ModifiedDate
0 CFBDA25C-DF71-47A7-B81B-64EE161AA37C 2002-06-01
1 C657828D-D808-4ABA-91A3-AF2CE02300E9 2002-06-01
2 10A7C342-CA82-48D4-8A38-46A2EB089B74 2002-06-01
3 2BE3BE36-D9A2-4EEE-B593-ED895D97C2A6 2002-06-01
4 2D364ADE-264A-433C-B092-4FCBF3804E01 2002-06-01
5 000310C0-BCC8-42C4-B0C3-45AE611AF06B 2002-06-01
6 02C5061D-ECDC-4274-B5F1-E91D76BC3F37 2002-06-01
You can read multiple tables into different dataframes to build up a set of data to analyze. Get the contents of the Product and ProductModel tables:
products = pd.read_sql_table('Product', uri, 'SalesLT')
models = pd.read_sql_table('ProductModel', uri, 'SalesLT')
Merging dataframes in Pandas
To get more interesting data sets, we need to join database tables. For now, because we are using the Pandas read_sql_table() method, we will accomplish this by reading different tables into Pandas dataframes and merging the dataframes.
It is easier to merge dataframes if you already understand the relationships between them. If you imported entire tables into dataframes, look at the database documentation and find the primary keys and foreign keys that define relationships.
Print the output of each dataframe's columns attribute:
print(products.columns)
print(categories.columns)
print(models.columns)
The output shows the column names in each dataframe:
Index(['ProductID', 'Name', 'ProductNumber', 'Color', 'StandardCost',
'ListPrice', 'Size', 'Weight', 'ProductCategoryID', 'ProductModelID',
'SellStartDate', 'SellEndDate', 'DiscontinuedDate', 'ThumbNailPhoto',
'ThumbnailPhotoFileName', 'rowguid', 'ModifiedDate'],
dtype='object')
Index(['ProductCategoryID', 'ParentProductCategoryID', 'Name', 'rowguid',
'ModifiedDate'],
dtype='object')
Index(['ProductModelID', 'Name', 'CatalogDescription', 'rowguid',
'ModifiedDate'],
dtype='object')
You see that the ProductCategoryID column in the products dataframe matches the ProductCategoryID column in the categories dataframe. It's not obvious, but the ParentProductCategoryID column in the categories dataframe also has a foreign-key relationship with the ProductCategoryID column in the same table (which is why you need to document your database schema). The ProductModelID column in the products dataframe matches the ProductModelID column in the models dataframe.
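If the documentation is not available, one option, sketched below on the assumption that the uri variable from earlier is still defined, is to ask SQLAlchemy's inspector for the foreign keys defined on a table:

# Print the foreign keys defined on the Product table. Each entry lists the
# local column(s) and the table and column(s) they reference.
from sqlalchemy import create_engine, inspect

inspector = inspect(create_engine(uri))
for fk in inspector.get_foreign_keys('Product', schema='SalesLT'):
    print(fk['constrained_columns'], '->', fk['referred_table'], fk['referred_columns'])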
Merging two dataframes
Create a dataframe that lists selected product information, including the product name, model, and category. Merge the products dataframe and the categories dataframe into a new dataframe named df1. By default, the Pandas merge operation performs an inner join, so it returns only the merged rows that match between the left and right sides of the join.
You can join two dataframes together using the Pandas merge() function, or you can join any number of dataframes together using each dataframe's merge() method. Pandas does not know about the relationships between the tables in the database. In simple cases, Pandas performs the inner join by matching columns with the same name in each dataframe. But you have to be careful when multiple columns share the same name, as they do in this case: Pandas will return only the rows that match in all of the same-named columns. To avoid this problem, specify the column that each merge should match on.
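To see the pitfall in action, the sketch below merges without the on parameter. Pandas then matches on every shared column name (ProductCategoryID, Name, rowguid, and ModifiedDate in this data), which is likely to return few or no rows:

# Merging without 'on' matches on every column name the two dataframes share,
# which is almost never what you want here.
bad_merge = pd.merge(left=products, right=categories)
print(bad_merge.shape)  # expect zero, or very few, rows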
To merge the products and categories dataframes on the correct key column, use the following code:
df1 = pd.merge(left=products, right=categories, on='ProductCategoryID')
print(df1.shape)
print(df1.head(2))
Use the Pandas dataframe's shape attribute to check the number of rows and columns in the dataframe, since you are only printing the first two rows.
(295, 21)
ProductID Name_x ProductNumber Color StandardCost \
0 680 HL Road Frame - Black, 58 FR-R92B-58 Black 1059.31
1 706 HL Road Frame - Red, 58 FR-R92R-58 Red 1059.31
ListPrice Size Weight ProductCategoryID ProductModelID ... \
0 1431.5 58 1016.04 18 6 ...
1 1431.5 58 1016.04 18 6 ...
SellEndDate DiscontinuedDate \
0 NaT NaT
1 NaT NaT
ThumbNailPhoto \
0 b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...
1 b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...
ThumbnailPhotoFileName rowguid_x \
0 no_image_available_small.gif 43DD68D6-14A4-461F-9069-55309D90EA7E
1 no_image_available_small.gif 9540FF17-2712-4C90-A3D1-8CE5568B2462
ModifiedDate_x ParentProductCategoryID Name_y \
0 2008-03-11 10:01:36.827 2.0 Road Frames
1 2008-03-11 10:01:36.827 2.0 Road Frames
rowguid_y ModifiedDate_y
0 5515F857-075B-4F9A-87B7-43B4997077B3 2002-06-01
1 5515F857-075B-4F9A-87B7-43B4997077B3 2002-06-01
[2 rows x 21 columns]
You can see that Pandas uses the "_x" and "_y" suffixes to rename merged columns with the same name.
If you want to use other suffixes, you can specify them as parameters when you call the merge() method. For example, the output would be clearer if you wrote the following code, instead:
df1 = pd.merge(
left=products,
right=categories,
on='ProductCategoryID',
suffixes=['_product','_category']
)
print(df1.shape)
print(df1.head(2))
Now, after the merge, the two columns that were both named "Name" are called "Name_product" and "Name_category".
Merging multiple dataframes
Pandas dataframes have a merge() method that works the same as the Pandas merge() function with the calling data frame being considered the left side in the join.
Perform the same merge as you performed above with the Pandas merge() function, but use the dataframe's merge() method:
df2 = products.merge(categories, on='ProductCategoryID', suffixes=['_product','_category'])
print(df2.shape)
display(df2.head(2))
You get the same output as was displayed when you used the Pandas merge() function.
You can chain multiple merge() methods together to join multiple dataframes in one statement.
df3 = (
products
.merge(categories,
on='ProductCategoryID',
suffixes=['_product','_category'])
.merge(models)
.merge(categories,
left_on='ParentProductCategoryID',
right_on='ProductCategoryID',
suffixes=['_child','_parent'])
)
print(df3.shape)
print(df3.head(2))
The output has 30 columns, and the column names are confusing, with duplicate data spread across columns.
(295, 30)
ProductID Name_x ProductNumber Color StandardCost \
0 680 HL Road Frame - Black, 58 FR-R92B-58 Black 1059.31
1 706 HL Road Frame - Red, 58 FR-R92R-58 Red 1059.31
ListPrice Size Weight ProductCategoryID_child ProductModelID ... \
0 1431.5 58 1016.04 18 6 ...
1 1431.5 58 1016.04 18 6 ...
ModifiedDate_y Name_child CatalogDescription \
0 2002-06-01 HL Road Frame None
1 2002-06-01 HL Road Frame None
rowguid_child ModifiedDate_child \
0 4d332ecc-48b3-4e04-b7e7-227f3ac2a7ec 2002-05-02
1 4d332ecc-48b3-4e04-b7e7-227f3ac2a7ec 2002-05-02
ProductCategoryID_parent ParentProductCategoryID_parent Name_parent \
0 2 NaN Components
1 2 NaN Components
rowguid_parent ModifiedDate_parent
0 c657828d-d808-4aba-91a3-af2ce02300e9 2002-06-01
1 c657828d-d808-4aba-91a3-af2ce02300e9 2002-06-01
[2 rows x 30 columns]
You can make the resulting dataframe more useful by choosing only the columns you want from each dataframe and renaming columns as needed.
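For example, the minimal sketch below starts from the df1 dataframe you created with the '_product' and '_category' suffixes, keeps a few columns, and renames them:

# Keep only the columns of interest and give them friendlier names.
product_summary = (
    df1[['ProductID', 'Name_product', 'Name_category', 'Color', 'ListPrice']]
    .rename(columns={'Name_product': 'Product', 'Name_category': 'Category'})
)
print(product_summary.head(2))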
You can keep chaining Pandas dataframe methods to create complex operations that merge more dataframes, rename and delete columns, filter results, and more. The Pandas documentation is a good resource for learning more about method chaining and the other available dataframe methods.
Select data from the database using SQL queries
When working with large amounts of data, you may prefer to perform most of your data joins, grouping, and filtering operations on the database server instead of locally on your PC. The Pandas read_sql_query() function enables you to send an SQL query to the database and then load the selected data into a dataframe.
The read_sql_query function
To select data from the SQL database, you need to create an SQL query statement using the SQL language. For example, see the SQL statement below that selects all the columns in a table:
SELECT * FROM SalesLT.Product
To use that statement with Pandas, run the following code:
statement = "SELECT * FROM SalesLT.Product"
products = pd.read_sql_query(statement, uri)
print(products.shape)
print(products.sample(2))
The output shows a random sample of two rows from the products dataframe, which contains the entire contents, 295 rows with 17 columns, of the Product table.
(295, 17)
ProductID Name ProductNumber Color StandardCost \
104 809 ML Mountain Handlebars HB-M763 None 27.4925
47 752 Road-150 Red, 52 BK-R93R-52 Red 2171.2942
ListPrice Size Weight ProductCategoryID ProductModelID \
104 61.92 None NaN 8 54
47 3578.27 52 6540.77 6 25
SellStartDate SellEndDate DiscontinuedDate \
104 2006-07-01 NaT None
47 2005-07-01 2006-06-30 None
ThumbNailPhoto \
104 b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...
47 b'GIF89aP\x001\x00\xf7\x00\x00\x92\x04\x07\xc6...
ThumbnailPhotoFileName rowguid \
104 no_image_available_small.gif AE6020DF-D9C9-4D34-9795-1F80E6BBF5A5
47 superlight_red_small.gif 5E085BA0-3CD5-487F-85BB-79ED1C701F23
ModifiedDate
104 2008-03-11 10:01:36.827
47 2008-03-11 10:01:36.827
Joining tables into one dataframe
Another benefit of SQL is that, when working with large databases, you can join tables and filter data more efficiently on the SQL server because it is optimized for these kinds of operations.
SQL query statements can select specific columns from tables, filter returned rows based on your criteria, join tables, rename columns, and more. For example, the query below is similar to the previous Pandas merge example, above. It joins the Product, ProductCategory, and ProductModel tables.
statement = """
SELECT *
FROM SalesLT.Product AS P
JOIN SalesLT.ProductCategory AS PC1 ON ( P.ProductCategoryID = PC1.ProductCategoryID )
JOIN SalesLT.ProductModel AS PM ON ( P.ProductModelID = PM.ProductModelID )
JOIN SalesLT.ProductCategory AS PC2 ON ( PC2.ProductCategoryID = PC1.ParentProductCategoryID )
"""
df = pd.read_sql_query(statement, uri)
print(df.shape)
print(df.sample(2))
The df dataframe has thirty-two columns, including duplicate column names. Similar to the example above, where you merged multiple dataframes, the information presented is confusing.
(295, 32)
ProductID Name ProductNumber Color StandardCost \
204 909 ML Mountain Seat/Saddle SE-M798 None 17.3782
82 787 Mountain-300 Black, 44 BK-M47B-44 Black 598.4354
ListPrice Size Weight ProductCategoryID ProductModelID ... \
204 39.14 None NaN 19 80 ...
82 1079.99 44 11852.31 5 21 ...
ProductModelID Name CatalogDescription \
204 80 ML Mountain Seat/Saddle 2 None
82 21 Mountain-300 None
rowguid ModifiedDate ProductCategoryID \
204 5CEFBB6E-3B7E-414F-AC1B-8F6DF741FB21 2007-06-01 2
82 ECDDD0D7-2DB2-464D-B2DA-89BFFC6276AA 2006-06-01 1
ParentProductCategoryID Name rowguid \
204 None Components C657828D-D808-4ABA-91A3-AF2CE02300E9
82 None Bikes CFBDA25C-DF71-47A7-B81B-64EE161AA37C
ModifiedDate
204 2002-06-01
82 2002-06-01
[2 rows x 32 columns]
Filtering data
When you use SQL queries in Pandas, you can select a smaller subset of data to read into your dataframe. This is more efficient than reading in all the data from a table and then using Pandas to remove data you don't need.
For example, if you only need a list of five random products from the database, with the product name, model, category, and parent category, run the following code:
statement = """
SELECT TOP 5
P.Name AS 'Product',
M.Name AS 'Model',
C.Name AS 'Category',
PC.Name AS 'Parent'
FROM SalesLT.Product AS P
JOIN SalesLT.ProductCategory AS C ON ( P.ProductCategoryID = C.ProductCategoryID )
JOIN SalesLT.ProductModel AS M ON ( P.ProductModelID = M.ProductModelID )
JOIN SalesLT.ProductCategory AS PC ON ( PC.ProductCategoryID = C.ParentProductCategoryID )
ORDER BY NEWID()
"""
df = pd.read_sql_query(statement, uri)
print(df.shape)
print(df)
The output shows that the df dataframe now contains only five rows of product information, with four columns (not counting the index). You do not need to perform any additional Pandas operations to reduce the data to only what you need.
(5, 4)
Product Model Category Parent
0 Rear Brakes Rear Brakes Brakes Components
1 HL Fork HL Fork Forks Components
2 Road-250 Red, 52 Road-250 Road Bikes Bikes
3 Touring-3000 Blue, 58 Touring-3000 Touring Bikes Bikes
4 Road-750 Black, 58 Road-750 Road Bikes Bikes
This approach is best when you have large database tables and are only interested in specific columns or in rows that meet certain criteria. Instead of putting pressure on the network and on your computer's memory by moving entire tables to your PC, you can select the data you need on the SQL server and transfer only that selection.
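For example, the sketch below pushes a row filter to the server with a WHERE clause, using column names from the same SalesLT.Product table, so only the matching rows cross the network:

statement = """
SELECT P.Name AS 'Product', P.ListPrice
FROM SalesLT.Product AS P
WHERE P.ListPrice > 3000
ORDER BY P.ListPrice DESC
"""
expensive = pd.read_sql_query(statement, uri)
print(expensive.shape)
print(expensive.head(2))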
Saving pandas dataframes
Pandas works in your computer's memory. The Pandas workflow does not require that you save work to disk and then read it back later. If you are restarting a data analysis project, you would normally go back to the original data source, read in the data again, and re-apply the Pandas operations you need. If you are handing the dataframe off to another program, you would normally do that in memory.
One case where you might want to save a dataframe to disk is when you want to deliver the results as a spreadsheet to an end user who has asked for one. To save a dataframe to an Excel spreadsheet, execute the following statement (it drops the binary ThumbNailPhoto column first, because the Excel writers generally cannot store raw bytes):
df1.drop(columns=['ThumbNailPhoto']).to_excel("product_info.xlsx", index=False)
You can add style methods to make the Excel spreadsheet look nice for your stakeholder. You can also output complex spreadsheets with multiple worksheets.
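For example, here is a minimal sketch that uses the xlsxwriter package you installed earlier to write two of the dataframes from this post to separate worksheets in one workbook:

# Write two dataframes to separate worksheets in a single Excel workbook.
with pd.ExcelWriter('product_report.xlsx', engine='xlsxwriter') as writer:
    categories.to_excel(writer, sheet_name='Categories', index=False)
    models.to_excel(writer, sheet_name='Models', index=False)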
Another case might be where you want to convert the Pandas dataframe into a CSV file for use by Power BI. This is not the best way to integrate Pandas and Power BI; it is better to use Pandas within Power BI. But sometimes you have to work the way others with whom you collaborate are working. To save a Pandas dataframe as a CSV file, execute the following statement:
df1.to_csv("product_info.csv", index=False)
You will not normally need to save your Pandas dataframe to disk for your own use, and you should avoid saving data to disk when you work with sensitive data. However, if you want to save your Pandas dataframe to disk so you can use it later, the best option is to pickle the dataframe. Pickling saves the dataframe object to a file and maintains column data types and other Pandas-specific information that would be lost if the dataframe were saved in other formats. Another, more modern, option is the Parquet file format, but pickle is most commonly used for individual projects. To pickle a Pandas dataframe, execute the following statement:
df1.to_pickle("product_info.pkl")
To read back the pickled dataframe, execute the following statement:
new_df = pd.read_pickle("product_info.pkl")
Conclusion
You have learned enough to read data from an SQL database into a Pandas dataframe. You can use Pandas to read entire database tables into dataframes, then operate on them locally on your PC. You can also use Pandas to send an SQL query to the database and read the results into a dataframe.
In the end, your decision about whether you read entire tables into separate dataframes and then join and manipulate them in Pandas, or whether you run SQL queries that load already selected and joined data into one or more dataframes, will depend on issues like the purpose of your application, the size of the database tables and pandas dataframes, database server performance, and the processing and memory resources available on your workstation.
In my opinion, if your data comes from a database, you should do most of your data joining and filtering using the database and then use Pandas for additional data cleaning and analysis. If your data comes from spreadsheets or CSV files, you have to use Pandas to combine, filter, clean, and analyze data.
1. From the Pandas package overview documentation, accessed on March 17, 2023.
2. Many data sets are available to the public, but very few of them run on database servers.