Interfacing Pandas to an Access Database¶
Introduction¶
This post is intended to be a guide for Python users, who wish to process a Microsoft (MS) Access database. The assumption is that pandas
will be the primary analysis tool.
One option is to read the Access datatables into a MS Excel spreadsheet. This allows for an initial exploration, and access to Excel spreadsheets is well supported by pandas
, but the two step translation loses the meta-data (column types, etc), so direct interfacing was chosen for implementation.
Background¶
Open Database Connectivity (ODBC) is a standard application programming interface (API) for accessing database management systems (DBMS). The extensive support for ODBC in the Python ecosystem makes it the obvious choice for interfacing to MS Access.
As shown below, the appropriate drivers from MS must be installed to support the interfacing operations.
pyodbc
Package¶
The pyodbc
package was the first choice for an implementation of code to read the databasde
import pandas as pd
import pyodbc
# used for environment documentation
import sys
import os
import platform
import datetime
%load_ext watermark
The watermark extension is already loaded. To reload it, use: %reload_ext watermark
%load_ext lab_black
The lab_black extension is already loaded. To reload it, use: %reload_ext lab_black
Driver environment¶
As discussed above, the appropriate software drivers from MS must be installed to support the intergace to Access databases. We can determine the drivers available by the following code snippet, which create a list of the drivers available (as seen by the pyodbc
package).
[x for x in pyodbc.drivers()]
['SQL Server', 'SQL Server Native Client 11.0', 'Microsoft Access Driver (*.mdb, *.accdb)', 'Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)', 'Microsoft Access dBASE Driver (*.dbf, *.ndx, *.mdx)', 'Microsoft Access Text Driver (*.txt, *.csv)']
In the initial run of this snippet, the only drivers available were the SQL Server
drivers. Going to the link
https://www.microsoft.com/en-us/download/details.aspx?id=54920
and clicking the "Download" button will install the Microsoft Access Driver
, as shown above.
Testing the Interface¶
To test the interface, we create a connection string that specifies the interfacing driver we need, and the database location (in this case, local to the machine running Python). Then we create a connection.
connStr = (
r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
r"DBQ=D:\WaterQualityDB2022\wq2022.accdb;"
)
cnxn = pyodbc.connect(connStr)
We know from the exploration via Excel, that there is a table Equipment
. We attempt to read the table into pandas
.
sql_qry = 'select * from Equipment'
df = pd.read_sql(sql_qry, cnxn)
df.head()
D:\Anaconda2022\lib\site-packages\pandas\io\sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
Equipment ID | Equipment Name | Serial | Probe Serial | Model | |
---|---|---|---|---|---|
0 | 1 | Maroochy 1 | Horiba U52 | ||
1 | 2 | Maroochy 2 | Horiba U52 | ||
2 | 3 | Maroochy 3 | Horiba U52 | ||
3 | 4 | Maroochy 4 | Horiba U52 | ||
4 | 5 | Maroochy 5 | Horiba U52 |
The problem here is that although this works, we have a warning message from pandas
that the ODC access we are using is not supported:
UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
Now in general, application or package warning messages have to considered seriously, especially in software intended to have an extended lifetime, because subsequent updates to pandas
could totally remove support for non-SQLAlchemy access methods.
SQLAlchemy Implementation¶
Investigation of integration of the SQLAlchemy package, pandas, and MS Access revealed one potential item of concern: MS Access is not a supported database that the SQLAlchemy project supports. I suspect that this is because Python (especially in the machine learning domain) is oriented towards large to very large databases, and not the relatively small databases that MS Access is typically used for.
However, there is an independent Python package sqlalchemy_access
, that provides the interface between SQLAlchemy and MS Access. Note that we still have to import the pyodbc
package. sqlalchemy_access
did not appear to be in the Anaconda ecosystem, so I just downloaded it from GitHub GitHub link
import pandas as pd
import sqlalchemy as sa
import sqlalchemy_access
import pyodbc
Next (as before), we define a connection specification, and establish a connection to the database.
import sqlalchemy_access
connection_string = (
r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
r"DBQ=D:\WaterQualityDB2022\wq2022.accdb;"
)
connection_url = sa.engine.URL.create(
"access+pyodbc",
query={"odbc_connect": connection_string},
)
engine = sa.create_engine(connection_url)
Now we can retrieve the Equipment
table, and display the contents, without generating warning messages.
qry = 'select * from Equipment;'
df = pd.read_sql_query(qry, engine)
df.head(1).T
0 | |
---|---|
Equipment ID | 1 |
Equipment Name | Maroochy 1 |
Serial | |
Probe Serial | |
Model | Horiba U52 |
Later SQLAlchemy¶
There is a problem: at sqlalchemy 2.0.11, the code above gives :
AttributeError: 'OptionEngine' object has no attribute 'execute'
The suggested fix is
import pandas as pd
from sqlalchemy import create_engine, text
engine = create_engine(connection_url)
query = 'SELECT * FROM my_table'
df = pd.read_sql_query(sql=text(query), con=engine.connect())
This fix also works fine at sqlalchemy 1.4.39
df = pd.read_sql_query(
sql=sa.text(qry),
con=engine.connect(),
)
df.head(1).T
0 | |
---|---|
Equipment ID | 1 |
Equipment Name | Maroochy 1 |
Serial | |
Probe Serial | |
Model | Horiba U52 |
Database Structure¶
We can confirm that SQLAlchemy 'sees' the same database tables, as can be seen via Excel.
from sqlalchemy import inspect
insp = inspect(engine)
insp.get_table_names()
['Equipment', 'Maintenance', 'Parameter Standards', 'People', 'Results_Kit_Calibration', 'Results_Sites_Water_Testing', 'Sites']
Reproducibility¶
This sections contains information to support the reproducibility of the software.
import sys
import os
import platform
import datetime
# | warn: false
%reload_ext watermark
# | warn: false
%reload_ext lab_black
Python execution environment¶
theNotebook = 'InterfacingToWQDatabase.ipynb'
print(f'Notebook: {theNotebook}')
print('python version : ' + sys.version)
print('pandas version : ' + pd.__version__)
print('current wkg dir: ' + os.getcwd())
print(
'Notebook run at: '
+ str(datetime.datetime.now())
+ ' local time'
)
print(
'Notebook run at: '
+ str(datetime.datetime.utcnow())
+ ' UTC'
)
print('Notebook run on: ' + platform.platform())
Notebook: InterfacingToWQDatabase.ipynb python version : 3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)] pandas version : 1.4.4 current wkg dir: C:\Users\donrc\Documents\JupyterNotebooks\PythonNotebookProject\develop Notebook run at: 2023-05-24 20:30:00.363071 local time Notebook run at: 2023-05-24 10:30:00.364066 UTC Notebook run on: Windows-10-10.0.19045-SP0
watermark
data¶
Execution environment as reported by the watermark
utility
%watermark
Last updated: 2023-05-24T20:30:00.391079+10:00 Python implementation: CPython Python version : 3.9.13 IPython version : 7.31.1 Compiler : MSC v.1916 64 bit (AMD64) OS : Windows Release : 10 Machine : AMD64 Processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel CPU cores : 8 Architecture: 64bit
%watermark -h -iv -co
conda environment: D:\Anaconda2022 Hostname: DESKTOP-SODFUN6 sys : 3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)] pandas : 1.4.4 pyodbc : 4.0.34 sqlalchemy_access: 1.1.4.dev0 sqlalchemy : 1.4.39 platform : 1.0.8