SQL for Data Analysis

Original article was published on Artificial Intelligence on Medium

SQL

stands for Structured Query Language, it is used to communicate with a database, it is a standard language for storing, manipulating and retrieving data in databases.

the main SQL statements of MySQL which are classified in the
following table:

LDD: The Data Definition Language is the part of SQL that describes tables
and other objects handled by DBMSs.

LMD: is a computer programming language used for adding (inserting), deleting, and modifying (updating) data in a database.

LID: allows establishing a combination of operations on tables (relation). The result of this combination of operations is itself a table whose existence lasts only a time.

LCD: A data control language is a programming language and a subset of SQL to control access to data in a database.

Key

Keys are fields in a table which participate in below activities in RDBMS(relational database management system) :

  • To create relationships between two tables.
  • To maintain uniqueness in a table.
  • To keep consistent and valid data in the database.
  • Might help in fast data retrieval by facilitating indexes on column(s).

Supports various types of keys, which are listed below:

  • Candidate Key: Candidate key is a key to a table that can be selected as a primary key of the table. A table can have multiple candidate keys, out of which one can be selected as a primary key.
  • Primary Key: The primary key is a candidate key of the table selected to identify each record uniquely in the table. The primary key does not allow null value in the column and keeps unique values throughout the column. In the above example, Employee_Id is a primary key of the Employee table. In SQL Server, by default primary key creates a clustered index on a heap table (a table which does not have a clustered index is known as a heap table). We can also define a nonclustered primary key on a table by defining the type of index explicitly.
  • Unique Key: The unique key is similar to the primary key and does not allow duplicate values in the column. It has below differences in the comparison of primary key:
  1. It allows one null value in the column.
  2. By default, it creates a nonclustered index on heap tables.
  • Alternate Key: Alternate key is a candidate key, currently not selected as the primary key of the table.
  • Composite Key: Composite key (also known as a compound key or concatenated key) is a group of two or more columns that identify each row of a table uniquely. An individual column of the composite key might not able to uniquely identify the record. It can be a primary key or candidate key also.
  • Super Key: Super key is a set of columns on which all columns of the table are functionally dependent. It is a set of columns that uniquely identifies each row in a table. The super key may hold some additional columns which are not strictly required to uniquely identify each row. Primary key and candidate keys are minimal super keys or you can say a subset of super keys.
  • Foreign Key: In a relationship between two tables, a primary key of one table is referred to as a foreign key in another table. A foreign key can have duplicate values in it and can also keep null values if the column is defined to accept nulls.

JOIN Queries

Joins help retrieving data from two or more database tables.
The tables are mutually related using primary and foreign keys.

Types of joins

Cross JOIN is the simplest form of JOINs that matches each row from one database table to all rows of another. In other words, it gives us combinations of each row of the first table with all records in the second table. Suppose we want to get all member records against all the movie records, we can use the script shown below to get our desired results.

INNER JOIN

The inner JOIN is used to return rows from both tables that satisfy the given condition. Suppose, you want to get a list of members who have rented movies together with titles of movies rented by them. You can simply use an INNER JOIN for that, which returns rows from both tables that satisfy with given conditions.

Outer JOINs

MySQL Outer JOINs return all records matching from both tables.

It can detect records having no match in the joined table. It returns NULL values for records of the joined tables if no match is found.

Sounds Confusing? Let’s look into an example –

LEFT JOIN

Assume now you want to get titles of all movies together with names of members who have rented them. It is clear that some movies have not being rented by anyone. We can simply use LEFT JOIN for the purpose.

The LEFT JOIN returns all the rows from the table on the left even if no matching rows have been found in the table on the right. Where no matches have been found in the table on the right, NULL is returned.

RIGHT JOIN

RIGHT JOIN is obviously the opposite of LEFT JOIN. The RIGHT JOIN returns all the columns from the table on the right even if no matching rows have been found in the table on the left. Where no matches have been found in the table on the left, NULL is returned.

“ON” and “USING” clauses

The ON condition for the natural join is basically an equijoin of identical column names.

The USING clause can also be used for the same purpose. The difference with USING is it needs to have identical names for matched columns in both tables.

Apart from using ON and USING with JOINs, you can use many other MySQL clauses like GROUP BY, WHERE and even functions like SUM, AVG, etc.

Aggregate Functions

An aggregate function performs a calculation on a set of values and returns a single value. Except for COUNT, aggregate functions ignore null values. Aggregate functions are often used with the GROUP BY clause of the SELECT statement.

All aggregate functions are deterministic. In other words, aggregate functions return the same value each time that they are called, when called with a specific set of input values. The OVER clause may follow all aggregate functions, except the STRING_AGG, GROUPING or GROUPING_ID functions.

Use aggregate functions as expressions only in the following situations:

  • The select list of a SELECT statement (either a subquery or an outer query).
  • A HAVING clause.

Transact-SQL provides the following aggregate functions:

Conclusion:

I am currently trying to learn skills related to machine learning and data science and trying to share them along the way so things will get more clear to me and I would be helpful too.

I hope I was able to clarify it a little to you SQL, I will be uploading a lot of more explanation because why not 🙂

Github

LinkedIn

Zahra Elhamraoui

References :

[1] link1

[2] link2

[3]Apprendre SQL Avec MySQL