Skip to content

[SIGMOD'23] Pandas Front-end for SDQL.py. [Pandas] -> [FP] -> [SDQL IR]

License

Notifications You must be signed in to change notification settings

Unka-Malloc/pandas-sdql.py

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Pandas-Trim

Pandas-Trim is a middleware between Pandas and SDQL (OOPSLA'22). It parses the pandas code and transforms it to SDQL IR, which is then computed by the C++ code generated by SDQL.py.

This project is a frontend & optimizer for SDQL query engine, which has been accepted by SIGMOD 2023 Efficient Query Processing in Python Using Compilation

Clone

git clone https://github.com/cxunka/pd2sd.git

Install

python3 setup.py install

Dev Branch (Experimental)

git clone --branch DuckPlan https://github.com/cxunka/pd2sd.git

Dependency

Linux / Unix / Windows

pip3 install -r requirements.txt

Essential

pip3 install toml
pip3 install varname

Optional

pandas and duckdb are optional packages. If these packages are installed, then pysdql will use duckdb and pandas to verify that the result is correct.

You may use pysdql.set_verify(False) to turn it off.

pip3 install duckdb
pip3 install pandas

Configuration

The configuration file is config.toml under the pysdql package.

You may use pysdql.get_pysdql_path() to get the absolute path of pysdql package.

Option Type Description
enable_verification bool Whether to use pandas and duckdb to verify the correctness of the query.
display_query bool Whether display the query that is generated and executed.
display_result bool Whether display the result of a query.

Write a Query

With using the tosdql decorator, a query in pandas can be transformed to SDQL. If pandas was installed, the result will be as the same type and value as that in pandas. Otherwise, it will be transformed to python built-in types.

Example 1.1: Load Data

import pysdql as pd

li = pd.read_csv(f"path/to/dataset/lineitem.tbl",
                 sep='|',
                 header=None,
                 names=['l_orderkey', 'l_partkey', 'l_suppkey', 'l_linenumber', 'l_quantity',
                        'l_extendedprice', 'l_discount', 'l_tax', 'l_returnflag', 'l_linestatus',
                        'l_shipdate', 'l_commitdate', 'l_receiptdate', 'l_shipinstruct', 'l_shipmode', 'l_comment'],
                 index_col=False,
                 dtype={"l_orderkey": int, "l_partkey": int, "l_suppkey": int, "l_linenumber": int,
                        "l_quantity": float, "l_extendedprice": float, "l_discount": float, "l_tax": float,
                        "l_returnflag": str, "l_linestatus": str, "l_shipinstruct": str, "l_shipmode": str,
                        "l_comment": str},
                 parse_dates=['l_shipdate', 'l_commitdate', 'l_receiptdate'])

Example 1.2: tosdql decorator

@tosdql
def query(lineitem):
    lineitem['revenue'] = lineitem.l_extendedprice * lineitem.l_discount

    result = lineitem.agg({'revenue': 'sum'})

    return result

print(query(li))

output

>> Optimized Query <<
revenue    1.080857e+09
dtype: float64

TPC-H Test

pysdql provides all 22 TPC-H benchmark queries to verify the correctness, to run these queries, pysdql.tpch_query() is particularly useful.

The templates for TPC-H queries are provided in pysdql/query/tpch/template.py

Example 2.1: TPC-H Benchmark

import pysdql

# test single query
pysdql.tpch_query(1)

# test a range of queries
pysdql.tpch_query(range(1, 11))

# test a list of queries
pysdql.tpch_query([1, 11, 21])

# set optimize=False to get unoptimized results
pysdql.tpch_query(1, optimize=False)

# set execution_mode and threads_count
pysdql.tpch_query(1, execution_mode=0, threads_count=1)

# set verbose=False summary results
pysdql.tpch_query(1, verbose=True)

Query Plans

Q1

QueryPlan

Q2

QueryPlan

Q3

QueryPlan

Q4

QueryPlan

Q5

QueryPlan

Q6

QueryPlan

Q7

QueryPlan

Q8

QueryPlan

Q9

QueryPlan

Q10

QueryPlan

Q11

QueryPlan

Q12

QueryPlan

Q13

QueryPlan

Q14

QueryPlan

Q15

QueryPlan

Q16

QueryPlan

Q17

QueryPlan

Q18

QueryPlan

Q19

QueryPlan

Q20

QueryPlan

Q21

QueryPlan

Q22

QueryPlan

About

[SIGMOD'23] Pandas Front-end for SDQL.py. [Pandas] -> [FP] -> [SDQL IR]

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published