Categories
Data Science

SQL Notes for Data Science

SQL is one of the most widely used programming language but yet it remains too much undervalued. Here is some quick SQL query snippets for refreshing the memory.

Introductory Concepts

Optional: Learn about FOREIGN KEY and PRIMARY KEY and the different Database Normalization types from this Wikipedia page. Probably you will need only up to 1st to 4th Normal Form of a database.

These basic concepts are useful to understand why we need table joins. But not entirely essential.

Also this note does not talk about creating database schema and modifying it.

Basic SQL Syntax

In the example below, you are querying two columns, column_name_1, column_name_2 from the table but we only want the unique values from the second column, so we use DISTINCT. The conditions to select particular data are added in the WHERE clause. The two selected columns are then sorted first by ascending order and then by descending order using the ORDER BY clause. And finally we are only interested in maximum 10 rows.

SELECT column_name_1, DISTINCT(column_name_2)
FROM table_name
WHERE column_name_1 = "abc" and column_name_2 = "xyz" 
ORDER BY column_name_1 ASC, column_name_2 DESC
LIMIT 10

Case When

Another simple clause is CASE WHEN which is useful as a if-else block. Between a CASE ... END There can be as many WHEN block and then one ELSE clause

CASE
    WHEN condition_1 THEN result_1
    WHEN condition_2 THEN result_2
    ELSE final_value
END

Union vs Union All

Union enables stacking two table with same columns one on top of the other. UNION only keeps distinct rows from two tables, on the other hand UNION ALL keeps all rows.

Intermediate Difficulty

Aggregate Functions, HAVING clause

Aggregate functions are basically statistical summary functions like SUM(), MIN(), MAX(), AVG() etc. See the section titled “Rank, Window, Lag, Lead” below.

OVER(), PARTITION BY

From SQLShack.com “In the SQL GROUP BY clause, we can use a column in the select statement if it is used in Group by clause as well. It does not allow any column in the select clause that is not part of GROUP BY clause”. Definitely check out the link to get a clear idea about these useful clauses and their differences.

Table Joins

Joins are very intuitive but may become quite complex especially when used with nested sub-queries.

There are various types of SQL joins. Cross join simply makes all the combinations of the two tables as shown in the image from SQLShack.com. You should check out their post on Cross join, it is very detailed!

SQLShack.com has excellent explanation on Cross Join

Cross Join can be run by this example where table_name_1 and table_name_2 are cross joined. This is often time very inefficient. So be careful with cross joins and look out for ways to avoid it.

SELECT column_name_1, column_name_2
FROM table_name_1,table_name_2 

I find this image below to explain the SQL joins very effective

Image: SQL Joins explained by Lasse V. Karlssen on Stackoverflow

Carsson Forter writes, “By filtering and aggregating your data before joining, you write the most efficient SQL. Joins are expensive to process so you want the fewest possible rows before joining two tables together… if you have a JOIN and a WHERE clause in the same CTE, SQL processes the JOIN first”.

So trying to filter and then joining is typically more efficient. That brings us to more advanced topics, including subqueries and CTEs.

Advanced Syntax

Common Table Expressions (CTE)

CTEs gives a way to store some queried data with a name so that it can be used in more complex queries. That way, CTEs help write complex SQL queries in a cleaner way.

CTEs commonly have this syntax, you can have multiple CTE’s. Source here

WITH 
    cte1 as (SELECT * from table WHERE ...),
    cte2 as (SELECT * from table WHERE ...)
select * from cte1 union select * from cte2

This article by Carsson Forter provides a great example of using multiple CTEs to calculate the avg time between two transactions of a customer. This is his example which also used group by clause to aggregate data by user name.

-- First, find all of user_a's transactions today 
with user_a_trans as (
  SELECT username, time
  FROM transactions
  WHERE day = '2017-09-08'
  AND username = 'user_a'),

-- Join each transaction to all transactions occurring after it

joined_trans as (
  SELECT username, time, future_times
  FROM user_a_trans a
  INNER JOIN user_a_trans b
  ON b.time > a.time),

-- Find the immediate next transaction using MIN()
next_trans as (
  SELECT username, time, MIN(future_times) as next_time
  FROM joined_trans
  GROUP BY username, time)

-- Average difference of the time and the next transaction's time
SELECT AVG(next_time - time) as avg_time_to_next_transaction
from next_trans;

Rank, Window, Lag, Lead

The Window clause is useful to set a condition over which data is split. WINDOW can be at a single location and referred to while calling aggregation functions like SUM() and AVG(). This is the link for the code below.

SELECT sum(salary) OVER w, avg(salary) OVER w
  FROM empsalary
  WINDOW w AS (PARTITION BY depname ORDER BY salary DESC);

For LAG, see a thorough explanations here. The same logic applies on LEAD

For rank functions, ROW_NUMBER(), RANK(), and DENSE_RANK(), see this resource.

FIRST_VALUE and LAST_VALUE

A good example of these functions is found on this website. Copying the example query here

USE AdventureWorks2012;   
GO  
SELECT JobTitle, LastName, VacationHours,   
       FIRST_VALUE(LastName) OVER (
                   PARTITION BY JobTitle   
                   ORDER BY VacationHours ASC  
                   ROWS UNBOUNDED PRECEDING  ) 
        AS FewestVacationHours  
FROM HumanResources.Employee AS e  
INNER JOIN Person.Person AS p   
    ON e.BusinessEntityID = p.BusinessEntityID  
ORDER BY JobTitle;

Geospatial SQL Queries

This is another important group of SQL queries practically useful but not talked about in most cases.

Leave a Reply

Your email address will not be published. Required fields are marked *