SQL Tips¶
Info
Author: Jeremy, Published on 2021-08-27, Reading time: about 10 minutes, WeChat official account article link:
1 Introduction¶
Last time, my colleague Void mainly talked about his experience and evaluation of the data warehouse he has used. Unlike Void, I mainly work with Snowflake and connect Snowflake to write SQL in different platforms. I also have some tips for beginners in data analysis to avoid common mistakes. I hope this can help everyone.
2 SQL Structure¶
2.1 Choosing the Right Driver Table¶
When joining two or more tables, generally both the requirements and performance need to be considered:
- On the business side: choose a table with the same granularity as the result table as the driver table. For example, if the result table is a behavior indicator of a group of target users on a certain day, a snapshot of a user table is generally selected as the driver table for filtering, and then relevant user behavior tables are left joined.
- On performance: choose a small table as the driver table to join a large table and create an index on the large table. If there is a large difference in the number of records between the two tables that need to be joined, the small table is generally placed first. In table join, only the index in the right connection table will be effective, and the efficiency of the index will be more obvious in the large table.
2.2 Deduplication Before Joining¶
When joining tables, if the two tables are not deduplicated, a large number of duplicate records will appear due to the Cartesian product. Generally, there are several ways to deduplicate before joining:
- deduplication with distinct
- deduplication with group by
- deduplication with window function row_number () (note that window function can only be used in MySQL 8.0 or above)
Suppose we currently have two tables, an employee table dept_emp
, with the following data:
emp_no | dept_no | from_date | to_date |
---|---|---|---|
10001 | d001 | 1986-06-26 | 9999-01-01 |
10002 | d001 | 1996-08-03 | 2000-01-01 |
10002 | d001 | 2001-08-03 | 9999-01-01 |
10003 | d002 | 1995-12-03 | 9999-01-01 |
Another table manager
has the following data:
dept_no | emp_no | from_date | to_date |
---|---|---|---|
d001 | 10001 | 1996-08-03 | 9999-01-01 |
d002 | 10003 | 1990-08-05 | 9999-01-01 |
We want to find all employees corresponding to managers. If the employee is already a manager, it will not be displayed. In the above example, it is as follows:
emp_no | manager |
---|---|
10002 | 10001 |
Note that emp_no in the first table is duplicated. If connected directly without deduplication, two records will appear:
emp_no | manager |
---|---|
10002 | 10001 |
10002 | 10001 |
All three deduplication methods can achieve deduplication:
Deduplication with distinct¶
Select dept.emp_no,mgn.emp_no
From
(
Select distinct emp_no,dept_no
From dept_emp
) dept
Left Join
(
Select distinct emp_no,dept_no
From dept_manager
) mgn
on dept.dept_no = mgn.dept_no
Where dept.emp_no != mgn.emp_no
Deduplication with group by¶
Select dept.emp_no,mgn.emp_no
From
(
Select emp_no,dept_no
From dept_emp
group by 1
) dept
Left Join
(
Select emp_no,dept_no
From dept_manager
group by 1
) mgn
on dept.dept_no = mgn.dept_no
Where dept.emp_no != mgn.emp_no
Deduplication with window function¶
Although the above two deduplication methods can achieve the effect, the selection of duplicate records is actually random. We are not sure which duplicate record will be selected in the final query. In actual business, it is often necessary to ensure the consistency of the running result. At this time, you can use window functions to sort with additional fields, then take the first one (such as taking the record with the newest time field from_date
) to deduplicate. The corresponding SQL is:
Select dept.emp_no,mgn.emp_no
From
(
Select emp_no,dept_no,
row_number() over(partition by emp_no,dept_no order by from_date desc) as rec_rank
From dept_emp
group by 1
where rec_rank = 1
) dept
Left Join
(
Select emp_no,dept_no
From dept_manager
group by 1
) mgn
on dept.dept_no = mgn.dept_no
Where dept.emp_no != mgn.emp_no
2.3 Using Common Table Expressions (CTE) Wisely¶
Common Table Expressions (CTE) is similar to functions in programming languages and can be used to extract subquery statements that would otherwise appear multiple times, effectively reducing the number of lines of code.
In addition to making the query statement more concise and neat, CTE can also achieve "advanced" query functions such as recursion. According to the three elements of recursion:
- Termination condition
- Function running state, which gradually approaches the termination condition with each run
- Function calls itself
In CTE, two queries need to be included. The first is a fixed-point query, which returns a regular query of a valid table, and then superimposes a recursive query that calls itself through union all. Unlike functions, recursive queries do not have an explicit termination condition and will stop recursion when the result of the second recursive query is empty.
The common form of recursion is as follows:
WITH CTE AS (
SELECT column1,column2... FROM tablename WHERE conditions
UNION ALL
SELECT column1,column2... FROM tablename
INNER JOIN CTE ON conditions
)
SELECT *
FROM CTE
3 Summary¶
The above is a super incomplete summary of SQL knowledge points, but in practical business, we often do not need very fancy code. Understanding and applying deduplication, connection, subqueries, window functions, and CTE can cover the vast majority of business scenarios, and more need to improve the readability and execution performance of the code. If you are interested in small tips on SQL in business or interviews, I recommend the following two readings:
- SQL for Mere Mortals (basic)
- SQL Antipatterns: Avoiding the Pitfalls of Database Programming (advanced)
If you want to see more SQL or experience sharing topics similar to this, please feel free to leave us a message and comment.