
SQL tips

Info

Author: Void · Published: 2021-07-20 · Reading time: about 10 minutes · Originally published as a WeChat official account article.

1 Introduction

SQL is the language used to interact with databases. For data scientists, writing SQL is as much a part of the daily routine as eating meals. SQL is not complicated, but there are still details and lessons worth sharing.
This article records some useful experience and knowledge from day-to-day work with databases.

2 MySQL, SQL Server, Oracle

These three are the best-known and easiest-to-learn databases, well suited to individual use and small-scale data storage: good for personal exploration, keeping interesting datasets, and writing queries to get a feel for SQL.

3 Teradata, Snowflake

These two databases are commonly encountered at work. A defining characteristic of Teradata is parallelism: unlike Hadoop's master-slave structure, it is built on a massively parallel processing (MPP) architecture that distributes load and isolates tasks across nodes.

Teradata's SQL syntax differs little from the standard; the main caveat is that there is no limit clause, so top must be used instead. In addition, the logical execution order of the clauses is:

where -> join -> group by -> qualify -> select
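
For example (the orders table and its columns are hypothetical):

select top 10 * from orders order by order_amt desc;  -- top replaces limit

select customer_id, order_amt
from orders
qualify row_number() over (partition by customer_id order by order_amt desc) = 1;  -- largest order per customer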

To optimize Teradata, the aim is to minimize I/O and have each step filter out as much data as possible. Hive is different: it performs multi-table joins before applying where and loads the left-hand table into memory, so each table should be filtered before the join to optimize the query.
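
For example (table and partition names are illustrative), push the filters into each side of the join rather than applying them afterwards:

select a.id, b.val
from (select id from t1 where dt = '2021-07-20') a
join (select id, val from t2 where dt = '2021-07-20') b
on a.id = b.id;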

Snowflake is a relatively new database. The databases mentioned above all have decades of history; Snowflake's defining characteristic is that it is a cloud database. In today's era of everything moving to the cloud, going cloud-native has become a must.
In addition to its novel technical architecture, Snowflake has the following advantages in terms of user experience:

  • A rich set of features (such as built-in EDA charts) and a polished UI.
  • A powerful query dialect, e.g. support for qualify, with no glaring defects or deal-breakers.

4 Hadoop

In the Hadoop ecosystem, Hive and Spark are the most widely used tools. Hive is a data warehousing tool built on top of Hadoop that executes queries written in Hive SQL as MapReduce jobs. At present, large-scale data storage at internet companies both in China and abroad still appears to be dominated by Hadoop.
I have also used Hive at work for some time; here is a summary of the pitfalls I have encountered.

  • Subqueries in Hive must have an alias:
select * from
(select * from table) a;  -- the alias "a" is mandatory; omitting it is a parse error
  • Hive's limit is applied only after all the logical operations have completed. The following setting enables an optimization that samples the data source instead:
set hive.limit.optimize.enable=true;
  • Passing parameters in Hive:
set hivevar:var=1;
select ${var};
If the parameter is itself a string, it must be quoted, e.g. date_sub('${timestamp}', 7).
  • Data skew (e.g., values overly concentrated in certain fields) makes a small number of reduce tasks run far longer than the rest, which severely hurts query efficiency. Two standard Hive settings can mitigate it (whether they help depends on the workload):
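set hive.groupby.skewindata=true;  -- split a skewed group-by into two jobs to rebalance keys
set hive.optimize.skewjoin=true;   -- process skewed join keys in a separate job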
  • Filter the data in each subtable as much as possible before joining.
  • Merge overly numerous small files:
set hive.execution.engine=tez;
set hive.merge.tezfiles=true;  -- merge small output files (eases HDFS NameNode namespace pressure)
set tez.output.compression.type=BLOCK;
set hive.merge.size.per.task=256000000;       -- target size of merged files (~256 MB)
set hive.merge.smallfiles.avgsize=256000000;  -- merge when the average output file is below this size
set hive.exec.compress.output=true;
  • Updating and deleting data in Hive is somewhat involved and requires transactional (ACID) tables plus configuration changes, as sketched below.
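A minimal sketch of the usual session settings (exact requirements vary by Hive version; the target table must also be bucketed, stored as ORC, and created with transactional=true):
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.exec.dynamic.partition.mode=nonstrict;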
  • Task progress can be viewed through YARN, such as:
yarn application -list
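Other standard YARN subcommands are also handy, e.g.:
yarn application -status <application_id>
yarn application -kill <application_id>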

5 Conclusion

I am not a professional data warehouse developer; the experience above comes from personal, hands-on use. Beyond the daily grind of writing queries, it is also interesting to understand the differences and characteristics of the various databases.

