SQL Tips¶

Info

Author: Jeremy, Published on 2021-08-27, Reading time: about 10 minutes, WeChat official account article link:

1 Introduction¶

Last time, my colleague Void mainly talked about his experience and evaluation of the data warehouse he has used. Unlike Void, I mainly work with Snowflake and connect Snowflake to write SQL in different platforms. I also have some tips for beginners in data analysis to avoid common mistakes. I hope this can help everyone.

2 SQL Structure¶

2.1 Choosing the Right Driver Table¶

When joining two or more tables, generally both the requirements and performance need to be considered:

On the business side: choose a table with the same granularity as the result table as the driver table. For example, if the result table is a behavior indicator of a group of target users on a certain day, a snapshot of a user table is generally selected as the driver table for filtering, and then relevant user behavior tables are left joined.
On performance: choose a small table as the driver table to join a large table and create an index on the large table. If there is a large difference in the number of records between the two tables that need to be joined, the small table is generally placed first. In table join, only the index in the right connection table will be effective, and the efficiency of the index will be more obvious in the large table.

2.2 Deduplication Before Joining¶

When joining tables, if the two tables are not deduplicated, a large number of duplicate records will appear due to the Cartesian product. Generally, there are several ways to deduplicate before joining:

deduplication with distinct
deduplication with group by
deduplication with window function row_number () (note that window function can only be used in MySQL 8.0 or above)

Suppose we currently have two tables, an employee table dept_emp, with the following data:

emp_no	dept_no	from_date	to_date
10001	d001	1986-06-26	9999-01-01
10002	d001	1996-08-03	2000-01-01
10002	d001	2001-08-03	9999-01-01
10003	d002	1995-12-03	9999-01-01

Another table manager has the following data:

dept_no	emp_no	from_date	to_date
d001	10001	1996-08-03	9999-01-01
d002	10003	1990-08-05	9999-01-01

We want to find all employees corresponding to managers. If the employee is already a manager, it will not be displayed. In the above example, it is as follows:

emp_no	manager
10002	10001

Note that emp_no in the first table is duplicated. If connected directly without deduplication, two records will appear:

emp_no	manager
10002	10001
10002	10001

All three deduplication methods can achieve deduplication:

Deduplication with distinct¶

Select dept.emp_no,mgn.emp_no
From
(
    Select distinct emp_no,dept_no
    From dept_emp
) dept
Left Join
(
    Select distinct emp_no,dept_no
    From dept_manager
) mgn
on dept.dept_no = mgn.dept_no
Where dept.emp_no != mgn.emp_no

Deduplication with group by¶

Select dept.emp_no,mgn.emp_no
From
(
    Select emp_no,dept_no
    From dept_emp
    group by 1
) dept
Left Join
(
    Select emp_no,dept_no
    From dept_manager
    group by 1
) mgn
on dept.dept_no = mgn.dept_no
Where dept.emp_no != mgn.emp_no

Deduplication with window function¶

Although the above two deduplication methods can achieve the effect, the selection of duplicate records is actually random. We are not sure which duplicate record will be selected in the final query. In actual business, it is often necessary to ensure the consistency of the running result. At this time, you can use window functions to sort with additional fields, then take the first one (such as taking the record with the newest time field from_date) to deduplicate. The corresponding SQL is:

Select dept.emp_no,mgn.emp_no
From
(
    Select emp_no,dept_no,
    row_number() over(partition by emp_no,dept_no order by from_date desc) as rec_rank
    From dept_emp
    group by 1
    where rec_rank = 1
) dept
Left Join
(
    Select emp_no,dept_no
    From dept_manager
    group by 1
) mgn
on dept.dept_no = mgn.dept_no
Where dept.emp_no != mgn.emp_no

2.3 Using Common Table Expressions (CTE) Wisely¶

Common Table Expressions (CTE) is similar to functions in programming languages and can be used to extract subquery statements that would otherwise appear multiple times, effectively reducing the number of lines of code.

In addition to making the query statement more concise and neat, CTE can also achieve "advanced" query functions such as recursion. According to the three elements of recursion:

Termination condition
Function running state, which gradually approaches the termination condition with each run
Function calls itself

In CTE, two queries need to be included. The first is a fixed-point query, which returns a regular query of a valid table, and then superimposes a recursive query that calls itself through union all. Unlike functions, recursive queries do not have an explicit termination condition and will stop recursion when the result of the second recursive query is empty.

The common form of recursion is as follows:

WITH CTE AS (
SELECT column1,column2... FROM tablename WHERE conditions
UNION ALL
SELECT column1,column2... FROM tablename
INNER JOIN CTE ON conditions
)

SELECT *
FROM CTE

3 Summary¶

The above is a super incomplete summary of SQL knowledge points, but in practical business, we often do not need very fancy code. Understanding and applying deduplication, connection, subqueries, window functions, and CTE can cover the vast majority of business scenarios, and more need to improve the readability and execution performance of the code. If you are interested in small tips on SQL in business or interviews, I recommend the following two readings:

SQL for Mere Mortals (basic)
SQL Antipatterns: Avoiding the Pitfalls of Database Programming (advanced)

If you want to see more SQL or experience sharing topics similar to this, please feel free to leave us a message and comment.

Viewed times