Pandas-Trim

Pandas-Trim is a middleware between Pandas and SDQL (OOPSLA'22). It parses the pandas code and transforms it to SDQL IR, which is then computed by the C++ code generated by SDQL.py.

This project is a frontend & optimizer for SDQL query engine, which has been accepted by SIGMOD 2023 Efficient Query Processing in Python Using Compilation

Clone

git clone https://github.com/cxunka/pd2sd.git

Install

python3 setup.py install

Dev Branch (Experimental)

git clone --branch DuckPlan https://github.com/cxunka/pd2sd.git

Dependency

Linux / Unix / Windows

pip3 install -r requirements.txt

Essential

pip3 install toml
pip3 install varname

Optional

pandas and duckdb are optional packages. If these packages are installed, then pysdql will use duckdb and pandas to verify that the result is correct.

You may use pysdql.set_verify(False) to turn it off.

pip3 install duckdb
pip3 install pandas

Configuration

The configuration file is config.toml under the pysdql package.

You may use pysdql.get_pysdql_path() to get the absolute path of pysdql package.

Option	Type	Description
`enable_verification`	`bool`	Whether to use pandas and duckdb to verify the correctness of the query.
`display_query`	`bool`	Whether display the query that is generated and executed.
`display_result`	`bool`	Whether display the result of a query.

Write a Query

With using the tosdql decorator, a query in pandas can be transformed to SDQL. If pandas was installed, the result will be as the same type and value as that in pandas. Otherwise, it will be transformed to python built-in types.

Example 1.1: Load Data

import pysdql as pd

li = pd.read_csv(f"path/to/dataset/lineitem.tbl",
                 sep='|',
                 header=None,
                 names=['l_orderkey', 'l_partkey', 'l_suppkey', 'l_linenumber', 'l_quantity',
                        'l_extendedprice', 'l_discount', 'l_tax', 'l_returnflag', 'l_linestatus',
                        'l_shipdate', 'l_commitdate', 'l_receiptdate', 'l_shipinstruct', 'l_shipmode', 'l_comment'],
                 index_col=False,
                 dtype={"l_orderkey": int, "l_partkey": int, "l_suppkey": int, "l_linenumber": int,
                        "l_quantity": float, "l_extendedprice": float, "l_discount": float, "l_tax": float,
                        "l_returnflag": str, "l_linestatus": str, "l_shipinstruct": str, "l_shipmode": str,
                        "l_comment": str},
                 parse_dates=['l_shipdate', 'l_commitdate', 'l_receiptdate'])

Example 1.2: `tosdql` decorator

@tosdql
def query(lineitem):
    lineitem['revenue'] = lineitem.l_extendedprice * lineitem.l_discount

    result = lineitem.agg({'revenue': 'sum'})

    return result

print(query(li))

output

>> Optimized Query <<
revenue    1.080857e+09
dtype: float64

TPC-H Test

pysdql provides all 22 TPC-H benchmark queries to verify the correctness, to run these queries, pysdql.tpch_query() is particularly useful.

The templates for TPC-H queries are provided in pysdql/query/tpch/template.py

Example 2.1: TPC-H Benchmark

import pysdql

# test single query
pysdql.tpch_query(1)

# test a range of queries
pysdql.tpch_query(range(1, 11))

# test a list of queries
pysdql.tpch_query([1, 11, 21])

# set optimize=False to get unoptimized results
pysdql.tpch_query(1, optimize=False)

# set execution_mode and threads_count
pysdql.tpch_query(1, execution_mode=0, threads_count=1)

# set verbose=False summary results
pysdql.tpch_query(1, verbose=True)

Name		Name	Last commit message	Last commit date
Latest commit History 331 Commits
.ipynb_checkpoints		.ipynb_checkpoints
.vscode		.vscode
archive		archive
compiler_tests		compiler_tests
docs		docs
output		output
pysdql		pysdql
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
benchmark.ipynb		benchmark.ipynb
demo.py		demo.py
demo2.py		demo2.py
demo3.py		demo3.py
demo4.py		demo4.py
demo5.py		demo5.py
demo6.py		demo6.py
requirements.txt		requirements.txt
setup.py		setup.py
test_duckdb.py		test_duckdb.py
test_hyper_opt_py_01.ipynb		test_hyper_opt_py_01.ipynb
test_hyper_unopt_py_01.ipynb		test_hyper_unopt_py_01.ipynb
test_pandas.py		test_pandas.py
test_pd2sd.py		test_pd2sd.py

License

Unka-Malloc/pandas-sdql.py

Folders and files

Latest commit

History

Repository files navigation

Pandas-Trim

Clone

Install

Dev Branch (Experimental)

Dependency

Essential

Optional

Configuration

Write a Query

Example 1.1: Load Data

Example 1.2: tosdql decorator

TPC-H Test

Example 2.1: TPC-H Benchmark

Query Plans

Q1

Q2

Q3

Q4

Q5

Q6

Q7

Q8

Q9

Q10

Q11

Q12

Q13

Q14

Q15

Q16

Q17

Q18

Q19

Q20

Q21

Q22

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Example 1.2: `tosdql` decorator

Packages