Spreadsheet Data in Python
Pandas is a Python library for data analysis.
SheetJS is a JavaScript library for reading and writing data from spreadsheets.
This demo uses SheetJS to process data from a spreadsheet and translate to the Pandas DataFrame format. We'll explore how to load SheetJS from Python scripts, generate DataFrames from workbooks, and write DataFrames back to workbooks.
The "Complete Example" includes a wrapper library that simplifies importing and exporting spreadsheets.
Pandas includes limited support for reading spreadsheets (pandas.from_excel
)
and writing XLSX spreadsheets (pandas.DataFrame.to_excel
).
SheetJS supports many common spreadsheet formats that Pandas cannot process.
SheetJS operations also offer more flexibility in processing complex worksheets.
This demo was tested in the following deployments:
Architecture | JS Engine | Pandas | Python | Date |
---|---|---|---|---|
darwin-x64 | Duktape 2.7.0 | 2.2.1 | 3.12.2 | 2024-03-15 |
darwin-arm | Duktape 2.7.0 | 2.2.2 | 3.12.3 | 2024-06-30 |
win10-x64 | Duktape 2.7.0 | 2.2.1 | 3.12.2 | 2024-03-25 |
win11-arm | Duktape 2.7.0 | 2.2.2 | 3.11.5 | 2024-06-20 |
linux-x64 | Duktape 2.7.0 | 1.5.3 | 3.11.3 | 2024-03-21 |
linux-arm | Duktape 2.7.0 | 1.5.3 | 3.11.2 | 2024-06-20 |
Integration Detailsβ
sheetjs.py
is a wrapper script that provides
helper methods for reading and writing spreadsheets. Installation notes are
included in the "Complete Example" section.
JS in Pythonβ
JS code cannot be directly evaluated in Python implementations.
To run JS code from Python, JavaScript engines1 can be embedded in Python
modules or dynamically loaded using the ctypes
foreign function library2.
This demo uses ctypes
with the Duktape engine.
Wrapperβ
The script exports a class named SheetJSWrapper
. It is a context manager that
initializes the Duktape engine and executes SheetJS scripts on entrance. All
work should be performed in the context:
#!/usr/bin/env python3
from sheetjs import SheetJSWrapper
with SheetJSWrapper() as sheetjs:
# Parse file
wb = sheetjs.read_file("pres.numbers")
print("Loaded file pres.numbers")
# Get first worksheet name
first_ws_name = wb.get_sheet_names()[0]
print(f"Reading from sheet {first_ws_name}")
# Generate DataFrame from first worksheet
df = wb.get_df(first_ws_name)
print(df.info())
# Export DataFrame to XLSB
sheetjs.write_df(df, "SheetJSPandas.xlsb", sheet_name="DataFrame")
Reading Filesβ
sheetjs.read_file
accepts a path to a spreadsheet file. It will parse the file
and return an object representing the workbook.
The get_sheet_names
method of the workbook returns a list of sheet names.
The get_df
method of the workbook generates a DataFrame from the workbook. The
specific sheet can be selected by passing the name.
For example, the following code reads pres.numbers
and generates a DataFrame
from the second worksheet:
with SheetJSWrapper() as sheetjs:
# Parse file
wb = sheetjs.read_file(path)
# Generate DataFrame from second worksheet
ws_name = wb.get_sheet_names()[1]
df = wb.get_df(ws_name)
# Print metadata
print(df.info())
Under the hood, sheetjs.py
performs the following steps:
flowchart LR
file[(workbook\nfile)]
subgraph SheetJS operations
bytes(Byte\nstring)
wb((SheetJS\nWorkbook))
csv(CSV\nstring)
end
subgraph Pandas operations
stream(CSV\nStream)
df[(Pandas\nDataFrame)]
end
file --> |`open`/`read`\nPython ops| bytes
bytes --> |`XLSX.read`\nParse Bytes| wb
wb --> |`sheet_to_csv`\nExtract Data| csv
csv --> |`StringIO`\nPython ops| stream
stream --> |`read_csv`\nParse CSV| df
-
Pure Python operations read the spreadsheet file and generate a byte string.
-
SheetJS libraries parse the string and generate a clean CSV.
- The
read
method3 parses file bytes into a SheetJS workbook object4 - After selecting a worksheet,
sheet_to_csv
5 generates a CSV string
-
Python operations convert the CSV string to a stream object.6
-
The Pandas
read_csv
method7 ingests the stream and generate a DataFrame.
Writing Filesβ
sheetjs.write_df
accepts a DataFrame and a path. It will attempt to export
the data to a spreadsheet file.
For example, the following code exports a DataFrame to SheetJSPandas.xlsb
:
with SheetJSWrapper() as sheetjs:
# Export DataFrame to XLSB
sheetjs.write_df(df, "SheetJSPandas.xlsb", sheet_name="DataFrame")
Under the hood, sheetjs.py
performs the following steps:
flowchart LR
subgraph Pandas operations
df[(Pandas\nDataFrame)]
json(JSON\nString)
end
subgraph SheetJS operations
aoo(array of\nobjects)
wb((SheetJS\nWorkbook))
u8a(File\nbytes)
end
file[(workbook\nfile)]
df --> |`to_json`\nPandas ops| json
json --> |`JSON.parse`\nJS Engine| aoo
aoo --> |`json_to_sheet`\nSheetJS Ops| wb
wb --> |`XLSX.write`\nUint8Array| u8a
u8a --> |`open`/`write`\nPython ops| file
-
The Pandas DataFrame
to_json
method8 generates a JSON string. -
JS engine operations translate the JSON string to an array of objects.
-
SheetJS libraries process the data array and generate file bytes.
- The
json_to_sheet
method9 creates a SheetJS sheet object from the data. - The
book_new
method10 creates a SheetJS workbook that includes the sheet. - The
write
method11 generates the spreadsheet file bytes.
- Pure Python operations write the bytes to file.
Complete Exampleβ
This example will extract data from an Apple Numbers spreadsheet and generate a DataFrame. The DataFrame will be exported to the binary XLSB spreadsheet format.
The Windows build requires Visual Studio with "Desktop development with C++". Commands must be run in a "Native Tools Command Prompt" session.
- Install Pandas:
python3 -m pip install pandas
On macOS and Linux, the install command may require root access:
sudo python3 -m pip install pandas
When pip
is not installed, the command will fail:
/usr/bin/python3: No module named pip
pip
must be installed. On Arch Linux-based platforms including the Steam Deck,
python-pip
can be installed through the package manager:
sudo pacman -Syu python-pip
In some local tests, the install failed with the following error:
error: externally-managed-environment
Pandas must be installed through the package manager:
- Debian and Ubuntu distributions:
sudo apt-get install python3-pandas
- Arch Linux-based platforms including the Steam Deck:
sudo pacman -Syu python-pandas
- macOS systems with a Python version from Homebrew:
sudo python3 -m pip install pandas --break-system-packages
- Build the Duktape shared library:
- MacOS
- Linux
- Windows
curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz
cd duktape-2.7.0
make -f Makefile.sharedlibrary
cd ..
curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz
cd duktape-2.7.0
make -f Makefile.sharedlibrary
cd ..
- Download and extract the source tarball. Commands must be run in WSL
bash
:
curl -LO https://duktape.org/duktape-2.7.0.tar.xz
tar -xJf duktape-2.7.0.tar.xz
(Run bash
, then run the aforementioned commands, then run exit
to exit WSL)
- Enter the source folder:
cd duktape-2.7.0
- Edit
src\duk_config.h
and add the highlighted lines to the end of the file:
#endif /* DUK_CONFIG_H_INCLUDED */
#define DUK_EXTERNAL_DECL extern __declspec(dllexport)
#define DUK_EXTERNAL __declspec(dllexport)
- Build the Duktape DLL:
cl /O2 /W3 /Isrc /LD /DDUK_SINGLE_FILE /DDUK_F_DLL_BUILD /DDUK_F_WINDOWS /DDUK_COMPILING_DUKTAPE src\\duktape.c
- Move up to the parent directory:
cd ..
- Copy the shared library to the current folder. When the demo was last tested, the shared library file name differed by platform:
OS | name |
---|---|
macOS | libduktape.207.20700.so |
Linux | libduktape.so.207.20700 |
Windows | duktape.dll |
- MacOS
- Linux
- Windows
cp duktape-*/libduktape.* .
cp duktape-*/libduktape.* .
copy duktape-2.7.0\duktape.dll .
- Download the SheetJS Standalone script and move to the project directory:
curl -LO https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/shim.min.js
curl -LO https://cdn.sheetjs.com/xlsx-0.20.3/package/dist/xlsx.full.min.js
- Download the following test scripts and files:
curl -LO https://docs.sheetjs.com/pres.numbers
curl -LO https://docs.sheetjs.com/pandas/sheetjs.py
curl -LO https://docs.sheetjs.com/pandas/SheetJSPandas.py
- Edit the
sheetjs.py
script.
The lib
variable declares the path to the library:
lib = "libduktape.207.20700.so"
- MacOS
- Linux
- Windows
The name of the library is libduktape.207.20700.so
:
lib = "libduktape.207.20700.so"
The name of the library is libduktape.so.207.20700
:
lib = "./libduktape.so.207.20700"
The name of the library is duktape.dll
:
lib = ".\\duktape.dll"
In addition, the following changes must be made:
str_to_c
must be defined as follows:
def str_to_c(s):
b = s
if type(b) == str: b = s.encode("latin1")
return [c_char_p(b), len(b)]
eval_file
mustopen
with moderb
:
def eval_file(ctx, path):
with open(path, "rb") as f:
code = f.read()
- Run the script:
python3 SheetJSPandas.py pres.numbers
If successful, the script will display DataFrame metadata:
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 5 non-null object
1 Index 5 non-null int64
dtypes: int64(1), object(1)
It will also export the DataFrame to SheetJSPandas.xlsb
. The file can be
inspected with a spreadsheet editor that supports XLSB files.
Other Librariesβ
Other Python DataFrame libraries mirror the Pandas DataFrame API.
Polarsβ
Polars is a similar DataFrame library that offers many features from Pandas DataFrames.
Polars includes limited support for reading and writing spreadsheets by wrapping third-party libraries. In practice, Polars communicates with the third-party libraries using intermediate CSV files.12
SheetJS supports many common spreadsheet formats that Polars cannot process.
SheetJS operations also offer more flexibility in processing complex worksheets.
The Pandas example requires a few slight changes to work with Polars:
- Polars DataFrames expose
write_json
instead ofto_json
:
- json = df.to_json(orient="records")
+ json = df.write_json(row_oriented=True)
- Polars DataFrames do not expose
info
Polars Demoβ
This demo was tested in the following deployments:
Architecture | JS Engine | Polars | Python | Date |
---|---|---|---|---|
darwin-x64 | Duktape 2.7.0 | 0.20.15 | 3.12.2 | 2024-03-15 |
darwin-arm | Duktape 2.7.0 | 0.20.31 | 3.12.3 | 2024-06-30 |
win10-x64 | Duktape 2.7.0 | 0.20.16 | 3.12.2 | 2024-03-25 |
win10-arm | Duktape 2.7.0 | 0.20.31 | 3.11.5 | 2024-06-20 |
linux-x64 | Duktape 2.7.0 | 0.20.16 | 3.11.3 | 2024-03-21 |
linux-arm | Duktape 2.7.0 | 0.20.31 | 3.11.2 | 2024-06-20 |
-
Follow the Pandas "Complete Example" through the end.
-
Edit
sheetjs.py
.
- Near the top of the script, change the import from
pandas
topolars
:
from io import StringIO
from polars import read_csv
duk = CDLL(lib)
- Within the
export_df_to_wb
function, change thedf.to_json
line:
def export_df_to_wb(ctx, df, path, sheet_name="Sheet1", book_type=None):
json = df.write_json(row_oriented=True)
- Edit
SheetJSPandas.py
.
- In the
process
function, changedf.info()
todf
:
# Generate DataFrame from first worksheet
df = wb.get_df()
print(df)
Change the export filename from SheetJSPandas.xlsb
to SheetJSPolars.xlsb
:
# Export DataFrame to XLSB
sheetjs.write_df(df, "SheetJSPolars.xlsb", sheet_name="DataFrame")
- Install Polars:
python3 -m pip install polars
On macOS and Linux, the install command may require root access:
sudo python3 -m pip install pandas
On Windows, the C++ Clang Compiler for Windows
component must be installed
through the Visual Studio installer.
On Arch Linux-based platforms including the Steam Deck, the install may fail:
error: externally-managed-environment
It is recommended to use a virtual environment for Polars.
venv
must be installed through the system package manager:
- Debian and Ubuntu distributions:
sudo apt-get install python3.11-venv
-
venv
is included in thepython
package in Arch Linux-based platforms. -
macOS systems with a Python version from Homebrew:
brew install pyenv-virtualenv
After installing venv
, the following commands set up the virtual environment:
mkdir sheetjs-polars
cd sheetjs-polars
python3 -m venv .
./bin/pip install polars
cp ../libduktape.* ../SheetJSPandas.py ../sheetjs.py ../*.js ../*.numbers .
- Run the script:
python3 SheetJSPandas.py pres.numbers
If the virtual environment was configured in the previous step, run:
./bin/python3 SheetJSPandas.py pres.numbers
If successful, the script will display DataFrame data:
shape: (5, 2)
ββββββββββββββββ¬ββββββββ
β Name β Index β
β --- β --- β
β str β i64 β
ββββββββββββββββͺββββββββ‘
β Bill Clinton β 42 β
β GeorgeW Bush β 43 β
β Barack Obama β 44 β
β Donald Trump β 45 β
β Joseph Biden β 46 β
βββ βββββββββββββ΄ββββββββ
It will also export the DataFrame to SheetJSPolars.xlsb
. The file can be
inspected with a spreadsheet editor that supports XLSB files.
Footnotesβ
-
See "Other Languages" for more examples. β©
-
See "Workbook Object" β©
-
See the examples in "IO tools" in the Pandas documentation. β©
-
See
pandas.read_csv
in the Pandas documentation. β© -
See
pandas.DataFrame.to_json
in the Pandas documentation. β© -
As explained in the Polars documentation, "... the target Excel sheet is first converted to CSV ... and then parsed with Polarsβ
read_csv()
function." β©