Master Product Data Analytics
Your Guide To Data Analytics Mastery
2. SQL & Data Manipulation (Extracting and Transforming Data)
SQL (Structured Query Language) is the standard language for interacting with relational databases.
SQL is used extensively to extract, transform, and manipulate data. This section will cover the core SQL concepts you need to know for the Meta interview and beyond.
We'll start with the fundamentals and then move on to more advanced techniques. 💪
For Anyone getting started with the Video below, is a great stating point!
2.1 Core SQL Syntax
Let's start with the building blocks of SQL. These are the fundamental commands you'll use in almost every query you write.
-
2.1.1 SELECT, FROM, WHERE (Filtering Data)
These are the most basic and essential SQL keywords. They form the foundation of most queries.
- SELECT: Specifies the columns you want to retrieve from a table.
- FROM: Specifies the table you want to retrieve data from.
- WHERE: Filters the data based on a specified condition.
Example:
SELECT user_id, name, email FROM users WHERE country = 'USA';
This query retrieves the `user_id`, `name`, and `email` columns from the `users` table, but only for rows where the `country` column is equal to 'USA'.
References:
-
2.1.2 JOINs (INNER, LEFT, RIGHT, FULL OUTER - Combining Tables)
JOINs are used to combine data from two or more tables based on a related column between them.
- INNER JOIN: Returns rows when there is a match in both tables.
- LEFT (OUTER) JOIN: Returns all rows from the left table, and the matched rows from the right table. If there is no match, it returns NULL for the columns from the right table.
- RIGHT (OUTER) JOIN: Returns all rows from the right table, and the matched rows from the left table. If there is no match, it returns NULL for the columns from the left table.
- FULL (OUTER) JOIN: Returns all rows when there is a match in either the left or right table. If there is no match, it returns NULL for the columns from the non-matching table.
Example:
SELECT orders.order_id, users.name FROM orders INNER JOIN users ON orders.user_id = users.user_id;
This query returns the `order_id` from the `orders` table and the `name` from the `users` table, but only for rows where the `user_id` is the same in both tables (i.e., only for orders that are associated with a user).
References:
-
2.1.3 GROUP BY and Aggregate Functions (COUNT, SUM, AVG, MIN, MAX)
The `GROUP BY` statement groups rows that have the same values in one or more columns into a summary row. Aggregate functions are often used with `GROUP BY` to perform calculations on each group.
- COUNT: Counts the number of rows in each group.
- SUM: Calculates the sum of a numeric column in each group.
- AVG: Calculates the average of a numeric column in each group.
- MIN: Returns the minimum value of a column in each group.
- MAX: Returns the maximum value of a column in each group.
Example:
SELECT country, COUNT(*) AS num_users FROM users GROUP BY country;
This query groups the rows in the `users` table by the `country` column and then counts the number of users in each country.
References:
-
2.1.4 ORDER BY (Sorting Data)
The `ORDER BY` clause sorts the result set in ascending or descending order based on one or more columns.
Example:
SELECT user_id, name, registration_date FROM users ORDER BY registration_date DESC;
This query retrieves the `user_id`, `name`, and `registration_date` columns from the `users` table and sorts the results in descending order of `registration_date`.
References:
-
2.1.5 HAVING (Filtering Aggregated Data)
The `HAVING` clause is used to filter the results of a `GROUP BY` query based on aggregate functions. It's similar to the `WHERE` clause, but it operates on grouped data rather than individual rows.
Example:
SELECT country, COUNT(*) AS num_users FROM users GROUP BY country HAVING COUNT(*) > 100;
This query groups the rows in the `users` table by `country`, counts the number of users in each country, and then filters the results to include only countries with more than 100 users.
References:
2.2 Advanced SQL Techniques
Now that you've mastered the basics, let's move on to some more advanced SQL techniques that will enable you to perform complex data manipulations and analyses.
-
2.2.1 Window Functions (ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, NTILE)
Window functions perform calculations across a set of rows that are related to the current row. They are similar to aggregate functions, but instead of returning a single value for each group, they return a value for each row.
- ROW_NUMBER: Assigns a unique sequential integer to each row within the partition.
- RANK: Assigns a rank to each row within the partition, with the same rank for equal values. Gaps may appear in the sequence.
- DENSE_RANK: Similar to RANK, but without gaps in the ranking sequence.
- LAG: Accesses data from a previous row in the result set.
- LEAD: Accesses data from a following row in the result set.
- NTILE: Divides the rows within a partition into a specified number of groups (e.g., quartiles, deciles).
Example:
SELECT order_id, order_date, amount, RANK() OVER (PARTITION BY customer_id ORDER BY order_date) as order_rank FROM orders;
This query assigns a rank to each order within each customer's set of orders, based on the order date.
References:
- W3Schools: SQL Window Functions (Note: W3Schools' coverage of window functions is limited)
- Wikipedia: Window function (SQL)
-
2.2.2 Subqueries and CTEs (Common Table Expressions)
Subqueries are queries nested within another query. They can be used to perform operations in multiple steps or to filter data based on the results of another query.
CTEs (Common Table Expressions) are temporary result sets that you can reference within a `SELECT`, `INSERT`, `UPDATE`, or `DELETE` statement. They are defined using the `WITH` clause and are useful for breaking down complex queries into smaller, more manageable parts.
Example (Subquery):
SELECT order_id, amount FROM orders WHERE customer_id IN (SELECT customer_id FROM customers WHERE country = 'USA');
This query uses a subquery in the `WHERE` clause to select only orders from customers in the USA.
Example (CTE):
WITH USACustomers AS ( SELECT customer_id FROM customers WHERE country = 'USA' ) SELECT order_id, amount FROM orders WHERE customer_id IN (SELECT customer_id FROM USACustomers);
This query uses a CTE called `USACustomers` to define a temporary result set of customers from the USA, and then uses that CTE in the main query to select orders from those customers.
References:
- W3Schools: SQL Subqueries
- Wikipedia: Common Table Expression
- PostgreSQL Documentation: WITH Queries (Common Table Expressions) - This is a more in-depth resource.
-
2.2.3 String Manipulation Functions
SQL provides a variety of functions for manipulating strings, such as:
- CONCAT: Concatenates two or more strings.
- SUBSTRING: Extracts a substring from a string.
- LENGTH: Returns the length of a string.
- UPPER/LOWER: Converts a string to uppercase or lowercase.
- REPLACE: Replaces occurrences of a substring within a string.
- TRIM: Removes leading and/or trailing spaces from a string.
Example:
SELECT CONCAT(first_name, ' ', last_name) AS full_name FROM users;
References:
- W3Schools: SQL String Functions (Note: This is a general overview of SQL functions, including string functions)
- PostgreSQL Documentation: String Functions and Operators (Specific to PostgreSQL, but many functions are common across database systems)
-
2.2.4 Date and Time Functions
SQL provides functions for working with dates and times, such as:
- NOW/CURRENT_DATE/CURRENT_TIMESTAMP: Returns the current date and/or time.
- DATE_PART/EXTRACT: Extracts a specific part of a date or time (e.g., year, month, day).
- DATE_ADD/DATE_SUB: Adds or subtracts a time interval from a date.
- DATEDIFF: Calculates the difference between two dates.
Example:
SELECT order_id, order_date FROM orders WHERE DATE_PART('year', order_date) = 2023;
References:
- W3Schools: SQL Dates
- PostgreSQL Documentation: Date/Time Functions and Operators (Specific to PostgreSQL, but many functions are common across database systems)
2.3 Query Optimization
Writing efficient SQL queries is crucial for working with large datasets. Here are some techniques for optimizing your queries:
-
2.3.1 Understanding Execution Plans
Most database systems provide a way to view the execution plan of a query, which shows how the database will execute the query. This can help you identify bottlenecks and areas for improvement.
Example (PostgreSQL):
EXPLAIN SELECT * FROM users WHERE country = 'USA';
References:
-
2.3.2 Indexing Strategies
Indexes are data structures that improve the speed of data retrieval operations on a database table. Creating appropriate indexes on frequently queried columns can significantly speed up queries.
Example:
CREATE INDEX idx_country ON users (country);
References:
-
2.3.3 Efficient Joins and Filtering
Here are some tips for writing efficient joins and filters:
- Use `INNER JOIN` instead of `WHERE` clause to join tables when possible.
- Filter data as early as possible in the query using `WHERE` and `HAVING` clauses.
- Avoid using `OR` conditions in `JOIN` clauses, as they can lead to poor performance.
- Use `EXISTS` instead of `COUNT(*)` to check for the existence of rows.
2.4 Data Cleaning with SQL
Real-world data is often messy and inconsistent. SQL can be used to clean and transform data before analysis. Here are some common data cleaning tasks:
-
2.4.1 Handling Missing Values (NULLs)
SQL provides functions for dealing with NULL values:
- COALESCE: Returns the first non-NULL expression in a list.
- IS NULL / IS NOT NULL: Checks if a value is NULL or not.
- NULLIF: Returns NULL if two expressions are equal, otherwise returns the first expression.
Example:
SELECT COALESCE(email, 'N/A') as email FROM users;
-
2.4.2 Data Type Conversions
You may need to convert data from one type to another (e.g., string to integer, date to string). SQL provides functions like `CAST` and `CONVERT` for this purpose.
Example:
SELECT CAST(order_date AS DATE) FROM orders;
W3Schools: SQL CAST Function (SQL Server syntax, but similar functions exist in other database systems)
-
2.4.3 Identifying and Removing Duplicates
Duplicate rows can skew your analysis. You can use `DISTINCT` or `GROUP BY` to identify and remove duplicates.
Example:
SELECT DISTINCT user_id FROM users;