PostgreSQL vs. pandas — how to balance tasks between server and client side

13 min readJul 19, 2017

and

The first step in building data-driven applications often includes writing a lot of queries in SQL. As you move your analysis to production, it is up to the developer to balance tasks optimally between data retrieval from the database server and data analysis in a client e.g. Python or R. During this automation step, you want to balance server and client tasks optimally as your choices can have a big impact on performance.

So how do you do that?

Every day — somewhere on earth: A typical “server-side vs. client-side” debate

There are predominant opinions in the analytics and developer community: some think the server should do as much as possible, others want to do everything on the client side. Both are too absolute to be helpful. The fact that you can do everything in SQL or python is neither a good reason nor an excuse for actually doing it all using the same tool.

SQL is the most common data manipulation language (DML) used to retrieve and manipulate data in a relational database. It was designed to carry out database routines (e.g. joins, select, filtering, grouping) and fine-grained CRUD (create, read, update and delete) operations. As such, it is the necessary first step to get data into a usable format — and you can do much more with it.

Inbuilt analytical functions (regex, ranking, quantile, lag or lead) allow you to go a long way and if you really want to, you can run regressions in SQL with UDFs. There is no limit to your (awkward) fantasies and it’s certainly possible to try to do too much in SQL.

Yet, for many data scientists and developers, SQL is a necessary evil and we have seen many cases of trying to do too much in “code” that should have been done in SQL. Indeed, pandas also supports all basic data manipulation tasks (joins, filtering, grouping) but this does not make the pandas vs. SQL choice a matter of taste. For many, pandas is just the path of least resistance, as it seems quicker to grab the raw data and slice and analyse it in python.

The key is thus to find a smart balance between the two. At least this is what we thought, and decided to write this blog post about SQL vs. pandas to explore where data preparation in SQL should end and where analysis in python should begin.

The answer, we thought, should be guided by what is most maintainable and efficient in getting the job done.

Experiment 1 — warming up

We decided to run some experiments and started out with a simple comparison of standard database operations i.e. join , filter, sort, and groupby between pandas and a local postgres database. To give an idea of the size of the simulated data, the largest table contained 10 million rows and 3 columns ~ 277MB.

A similar comparison was done by tianhuil who benchmarked pandas to SQLite and found that SQLite outperforms pandas for filterand sort but is slower for groupbyand join at a table size of 10 million rows.

Note that our measurement of execution time differs to tianhuil’s analysis. He timed the operation of e.g. filter after loading the data in Python. We timed the operation select * from table and filter to mirror more closely what would actually happen in practice.

Another nice database benchmarking reference that covers pandas and R vs. all databases is szilar. He finds that pandas is 30 seconds faster than postgres for join and almost 3min(!) faster than postgres forgroupby. It is worth noting that szilar did not configure postgres, which explains the poor postgres performance, as we mention below. Interestingly, he also found that the open source GPU database MapD can do joins in less than 1 second, which is good to know!

Now to our results: The slowest, fastest and median across 30 runs are displayed below:

table 1 — results of experiment 1 (run on 16GB laptop, Ubuntu)

We found that postgreSQL outperforms pandas in 4/4 test cases. It is much faster for join, filter, and groupby. sortwas marginally better than pandas, though this was highly dependent on working memory configurations. Without tweaking the config, postgres sorts much slower than pandas if the data set is >1MB.

So since we are keeping score, that’s 1-0 for postgreSQL and time for experiment number 2.

Experiment 2 — to the real-world

Next, we took a simple SQL query to get car model cross-configurations, i.e. how many users showed interest in a particular pair of cars by configuring both. This query can be used to recommend similar cars to users. (If we just lost you, have a look at our car chooser to configure a car.)

def do_it_in_sql():
     sql1 = """
     with recent_views as (
             select user_id, model_name
             from car_config_table
             where created_at > current_date - interval '2 months'
     ),
     popular_models as (
             select model_name as slug, 
                    count(distinct user_id) configs
             from recent_configurations
             group by 1
             having count(distinct user_id) > 3
     ),
     popular_configurations as (
             select C.model_name, C.user_id, M.configs
             from recent_configurations C
             join popular_models M on M.slug = C.model_name
     )     SELECT
     C1.model_name model_name,
     C1.configs model_configs,
     C2.model_name recommended_model_name,
     C2.configs recommended_model_configs,
     count(distinct C1.user_id) combo_configs
     from popular_configurations C1
          join popular_configurations C2 on C1.user_id = C2.user_id
     where
     C1.model_name <> C2.model_name
     group by 1,2,3,4
     order by model_name
     """
     df = execute_to_postgres(sql1)
     return dfdef do_it_in_pandas():
    sql= """
    select user_id, model_name
    from car_config_table
    where created_at > current_date - interval '2 months'
    """
    df = query_postgres(sql)
    df['configs'] = df.groupby(['model_name'])['user_id'].transform('nunique')
    df = df[df['configs'] > 3]
    crossdf = df.merge(df, on='user_id',how='outer')
    crossdf = crossdf[crossdf.model_name_y != crossdf.model_name_x]
    crossdf['combo_configs'] = crossdf.groupby(['model_name_x', 'model_name_y'])['user_id'].transform('nunique')
    crossdf = crossdf[['model_name_x', 'users_x', 'model_name_y', 'users_y','combo_configs']].drop_duplicates().sort_values('model_name_x')
    return crossdf

The first function (do_it_in_sql) executes a sql statement through a connector (e.g. psycopg2). It firstly pulls all configurations within the last 2 months into recent_views, then finds the total number of configurations by model into popular_models.Step 1 and 2 are then connected in the temp table popular_configurations used to generate the final cross configurations output.

The second function (do_it_in_pandas)reproduces the same in pandas by pulling all configurations from 2 months into memory and then constructing the cross_configurations table. So which one do you think was faster?

The result was a bit unexpected: SQL executed in 7.5 and pandas in 6 seconds.

So the scores are now leveled at 1-1.

To investigate this further and/or save the grace of SQL, we did some more query tuning and wrote 2 more optimised SQL queries to race against pandas:

sql2 uses a window function instead of subqueries used in the base query sql1 (see do_it_in_sqlfuntion);
sql3, in turn, uses a table/materialised view to add the count of distinct users for each model (called model_configs).

In addition, we set appropriate indices, vacuumed the respective database tables and further configured postgres to tune performance e.g. by bumping up shared_buffers for caching data and work_mem for larger in-memory sorts.

The results of 30 runs are shown below.

table 2 — results experiment 2 (run on 16GB laptop, Ubuntu)

sql2 = """
  with recent_configurations as (
    select user_id,
           model_name,
           COUNT(user_id) OVER (PARTITION BY model_name) as configs
    from car_config_table
    where created_at > current_date - interval '2 months'
    group by 1,2
  )  SELECT C1.model_name model_name,
         C1.configs model_configs,
         C2.model_name recommended_model_name, 
         C2.configs recommended_model_configs,
         count(distinct C1.user_id) combo_configs
  from recent_configurations C1
       join recent_configurations C2 on C1.user_id = C2.user_id
  where C1.model_name <> C2.model_name
  group by 1,2,3,4
  having count(distinct C1.user_id) > 3
 """sql3 = """
 drop table IF EXISTS analytics.mv;
 create table analytics.mv as (
    select user_id,
           model_name,
           COUNT(user_id) OVER (PARTITION BY model_name) as configs         
    from car_config_table
    where created_at > current_date - interval '2 months'
    group by 1,2);
 
 SELECT C1.model_name model_name,
        C1.configs model_configs,
        C2.model_name recommended_model_name,
        C2.configs recommended_model_configs,
        count(distinct C1.user_id) combo_configs
 from analytics.mv C1
 join analytics.mv C2 on C1.user_id = C2.user_id
 where C1.model_name <> C2.model_name
 group by 1,2,3,4
 having count(distinct C1.user_id) > 3
 """

Query optimisation — a drop in the ocean

As we can see from table 2, the query optimisation helped a bit — sql2 (which made use of window functions) is about 8 ms faster than the original query (sql1). Tweaking configurations also gained us 5ms on average for sql1 itself, which went from 7.5 to 7.0 seconds.

Yet, it is not enough to beat pandas.

What this experiment demonstrates is how many aspects there are to SQL query performance. Query optimisation is helpful, but the time spent on database configurations, proper indexing, distribution/sort keys, reducing contention, workload management and so on, make a bigger difference than query tuning itself. To get a full picture, bandwidth/network throughput and DB load also need to be taken into account.

Pandas is just a package and does not have the complexity of a database. It comes speed-optimised out of the box and is ready to race after a pip install. Postgres on the other hand is sleepwalking out of the box and the devil lies in the configuration. It does not even use quicksort by default. The default working memory allocated to postgres is 1MB! If you have more than that, it spills the data to disk to sort it, which is terrible for query performance and spectacularly useless for a database that can handle hundreds of terabytes!

Tuning postgres configurations for high performance is thus crucial as the impact on query performance is very big. Having said that, the right configuration is hard - too much or too little of it results in query performance as reliable and variable as the weather. Good luck!

bandwidth — “Never underestimate the bandwidth of a station wagon full of tapes hurling down the highway.” (A.S. Tanenbaum)

Another important aspect is the size of the data and bandwidth between SQL server and (python) client.

Whether a piece of code executes faster on the server or in pandas depends on whether the data which the database sends back to the client increases or decreases during the processing step.

For example, recent_configurations in sql1 has dimension (843k x 3), which is used to construct the final cross configurations table (699k x 5). Hence, in the do_it_in_sqlfunction, we increased the amount of data that the SQL server sends back to the client.

While sending it between SQL server and python client, the communication link is affected by bandwidth, which is typically small but can become a bottleneck. Loading a smaller dataset and computing locally in python can therefore be quicker than computing it on the server and sending extra data over a low bandwidth channel like WiFi.

Now in our case network bandwidth was not a concern, since we ran the postgres database and python on the same machine. Nevertheless, data still needs to be transferred between the SQL and python process, since they do not share memory.

Scale — Panda’s limit is your memory — SQL scales more efficiently

With increasing workload the case for the DB becomes more clear. While running a task on the client side can be faster as long as it fits into memory, you can scale much better in a database. Of course, you could scale locally with tools that allow you to do so like Spark or out-of-core computation tools like Dask. But even if you were to load the whole data into Spark, you still end up loading it all into memory. So what did you gain by shipping it out of the DB other than some bandwidth bottlenecks? Not much.

Spark is a distributed data processing engine and not a database, but if you use it for database operations like joins, set operations and queries, e.g. as prep for some ML task, there is a strong case for keeping the heavy lifting in e.g. AWS Redshift as much as possible. It will give you a faster answer regardless of whether the db client is connecting through local sockets or remotely over the network. Also, the additional complexity and maintenance introduced by Spark is probably more than that of a database.

Maintainability & readability

Finding the right balance between SQL and pandas is a trade-off between performance and maintainability. Experiment 1 showed that SQL outperforms pandas when looking at standalone operations like filter, groupby, sortand join, but experiment 2 showed it can be slower in real-world example queries. Why?

Well, to some extent this is because a database gives you stuff you don’t ask for like concurrency, locking, constraints, etc., which costs some execution time and lead you to do things you thought you don’t have to worry about like configuration mistakes. But on a more fundamental level, SQL is also not designed for transforming data in certain ways and there are many operations that go beyond SQL’s capabilities. SQL isn’t good for describing sets of rows, e.g. weighted moving average, string manipulation, regex and pivot tables. Procedural approaches like UDFs are also not as good as in real procedural languages and much of the formatting, math or statistics can be more efficiently performed in a general purpose language like python.

Yet, In practice, you will still use many of these features in SQL if you get more done in one go. So the benefits of keeping more tasks on the server side go beyond mere execution time. Practical performance improvements come from minimising data sent over the network (which is more secure and less I/O-heavy), standardisation, reliability, scalability and better collaboration. Thus, at a reasonable performance, SQL is still preferable since it is more maintainable in many ways.

In practice, SQL is a more structured and more readable approach to get the data in a usable format. A join in SQL looks much better than a join/merge in pandas and any non-technical team member can understand SQL very fast too. In addition, SQL is a standard, so any technical person will have some knowledge of it and will be able to grasp it without requiring specialist knowledge. So since SQL is almost self-documenting and more readable, it’s also more maintainable. That’s a big advantage since for long-lived applications, maintenance is a huge cost and simplicity will often dictate the design.

In the end, knowing SQL well will also speed up your workflow, which (for many data analysts) often includes writing a lot of queries in SQL in the beginning, and since data is often in a relational database anyway, it usually makes sense to stick with it. You’ll save yourself a ton of time and pain in the future by not pulling it locally every time and then load it in Python to continue.

SQL also integrates better with production code that may be written in some other programming language. So when a useful ad-hoc analysis becomes part of production, you already have most of it in SQL, which saves time on not having to convert back and forth between SQL and the app language.

For the data team at carwow, using SQL also has positive side effects for working collaboratively, since SQL queries written by others can be tagged and reused for related analyses.

Conclusions

SQL is the best tool to use for basic data retrieval and the better you know it, the more it will speed up your workflow. More advanced data manipulation tasks can be performed both on the server and client side and it is up to the analyst to balance the workload optimally.

Maybe the biggest pain with performing advanced data manipulation on the server side in SQL is that it introduces the complexity of a full-blown database, which is high in maintenance. It requires careful configuration of the database environment and maintaining optimal performance is a full-time job.

The biggest advantage of performing it in pandas is that it comes well configured out of the box to perform data manipulation tasks very fast for ‘smallish’ data. Yet, pandas should not be used to perform basic database operations (data aggregation or querying) as it adds another I/O operation, is limited by your memory and thus does not scale well. It integrates less smoothly with other frameworks in production and it is also less readable. This tends to make the code harder to understand and thus less maintainable.

‘The balance is the point where you get highest maintainability with acceptable performance’

The sweet spot for how to best balance server side and client-side tasks, is where you get highest maintainability with acceptable performance. How to best get there depends on a lot of factors, including DB load, bandwidth, distribution & sort keys set in the database and more general query tuning. We hope that our discussion of these aspects will help you determine where to best draw the line between server and client processes for your use case. If not, check out our rules of thumb at the end of the post.

SQL is a necessary evil and you should embrace it!

So what should be your priorities for balancing server and client side tasks?

Here are some rules of thumb:

If doing a task in SQL can cut the amount of data returned to the client (e.g. filtering to a smaller subset of data), then the task belongs on the server.
If the amount of data returned to the client remains unchanged or grows (e.g. adding complex calculated columns; cross-joins, etc.) by doing it in SQL, the task belongs into client side code.
Test different setups on the server and client side to see which is more efficient. In the worst case, you’ll learn something.
Never do in code what the SQL server can do well for you: Data extraction (CRUD, joins and set operations) & simple data analysis.
If it’s painful or ugly, do it in client-side code: Complex data analysis belongs into code. Leave formatting or math for the client side. The database exists mainly to facilitate fast extraction of data.
Minimise SQL complexity: Split overly complex, non-performant queries. Two simpler queries will save the headache of maintaining one mega-query. Alternatively, split it into a simple query and handle the complexity in client-side code.
Minimise database round trips: Try to do as much as you can in one operation. Every semicolon is one round trip and adds another I/O operation.
Configure your database carefully e.g. for postgres. Otherwise, you default to sub-optimal algorithms which is expensive.
It’s well worth investing time in database schema optimisation.
Same goes for setting optimal foreign/sort/distribution keys and properly normalised tables to maintain the integrity of data.

Interested in making an Impact? Join the carwow-team!
Feeling social? Connect with us on Twitter and LinkedIn :-)

Carwow Product, Design & Engineering

PostgreSQL vs. pandas — how to balance tasks between server and client side

Published in Carwow Product, Design & Engineering

Written by Tina Wenzel