1.1 Introduction
Normalization is the process by which you make the data efficient for storage in a database. It is an important concept. It is a way of analyzing data and arriving at different stages by following some standard rules and procedure - at each stage unwanted properties of data are eliminated.
Purpose of Normalization
1 .reduce redundancy of data (you don’t need to store the same attribute in two tables)
2. remove insert, update and delete anomalies of data
3. storing the related attributes in the same table
Functional Dependancy
Functional dependency is where an attribute in a table determines the occurrence of another attribute in the same table. If you want the formal definition which I guarantee you won’t understand here it is :
“in a table with attributes x,y, attribute y is functionally dependent on attribute x if and only if each value of x has only one value of y associated with it”
Let’s understand this through an example
Let’s consider Social Security number (SSN) and Name. As we looked at earlier the name is not unique. But SSN is unique. So for each occurrence of SSN determines the name. Each occurrence of the name doesn’t determine the SSN. In other words if I give you the name and ask you to give the SSN number you will be thoroughly confused. As for the same name there will be many SSN numbers. But If I give you the SSN and ask for the name then you can give me the name as for any given SSN number there will only be one name i.e. SSN number is unique and the attribute name is dependent aka functionally dependent on it.
So it is written like this:
X -> Y
SSN -> Name
The attributes on the left side are known as determinants. Read as X determines Y or SSN determines Name.
A key is one or a combination of attributes that is used to uniquely identify a row in a table.
So a key will determine all other attributes. i.e. all other attributes are functionally dependent on the key.
Composite Keys revisited
Remember the concept of Composite keys? Where two attributes team together to form a unique row? For example in the order line table – where you keep the details of an order i.e. the products that make up the order – the primary key would be order no and product no.
Full Functional Dependancy and Partial Dependancy
Although Product price is dependent on the composite key (order no and product no) actually speaking it is dependent only on the product no. So the attribute product price is only partially dependent on the key attributes (order no and product no).
Note: the above example assumes that the same price is quoted to all the orders. There could be some businesses which may give special prices for bulk orders. However even then it would go as a discount against the price. So the product price would remain the same.
Consider the airline industry. The price of a ticket doesn’t depend on the flight no (which determines your destination), but also on the date and time. Usually during peak hours the air fare is higher than non-peak hours. So the attribute “Ticket Price” is fully functionally dependent on the key attribute (Flight No, Date and Time).
Note that Flight no alone is not a unique attribute – as the same flight can go and come back several times for short trips.
Transitive Dependancy
Transitive Dependency is a classic example of A=B, B=C so A=C type of reasoning. If there is a dependency among non-key fields (fields which do not make up the primary or unique keys)
In the above example the pilot no and name are also dependent on the flight no and date and time. However the pilot name is dependent on the pilot no. So the attribute Pilot name doesn’t depend on the “Key” it depends on a non-key attribute – Pilot No. So this is transitive dependency.
Now we shall look at the different forms of normalization. As mentioned earlier each form is a way of purifying data. It is like water purification that happens in stages. First you remove the dirt from water – stones, dirt etc. Then you filter it for finer dirt. Then you add chemicals to purify it even further.
First Normal Form
The customary definition is that “a table is said to be in 1 NF only and only if all domains or attributes contain atomic or single valued data and all repeating groups are removed”
Here are the formal rules for 1st normal form:
1. Each column value must be atomic.
2. All values for a column must be of the same type- in the same column you cannot have a numeric value and a text value. For example you cannot have column nameandweight where you put the name of the person and the weight of the person. Because one is a numeric value and the other is a character or text data. When we say numeric what me mean is that it will make sense to perform arithmetic operations on that data. For example the weight of several rows – of several people – can be added, averaged on,
3. Each column name must be unique
4. The order or columns doesn’t matter – this and rule no 6 below are implemented by most DBMS. The row and the column order doesn’t matter.
5. The rows of a table must be unique
6. The order of the rows doesn’t matter
Let’s understand this through an example.
Second Normal Form
For a table to be in second normal form – 2NF – first it has to be in 1NF. As I said before it is a process for purifying data. You can’t start at middle point. That will be like sex without foreplay!
A table is in @NF if all of its non-key attributes are dependent on all of the key i.e. in 2NF we remove all partial dependencies.
If the tables doesn’t have a composite key this question wouldn’t rise in the first place – provided that the table is in 1NF.
That’s why in modern database designs we hardly use any composite keys. We use artificial identifiers – a running number generated by the system itself – by use of a sequence – as the primary key.
Let’s work this through an example:
Look at the below ‘order’ table:
As you will realize the key is Order No and Product Code, which can be shown as below
(Order No, Product Code, Product Description, Product Price)
The key attributes are underlined.
Here is how you normalize this table into 2NF:
1. Identify and isolate the problem – this is often the first step in most problems!
See where the partial dependency is occurring.
2. Represent it using either dependency diagrams as shown above or like below:
Order No, Product Code, -> Product Description, Product Price
Product Code -> Product Description, Product Price
The Product Description and Product Price are dependent only on the Product Code and not on Order no.
3. So now that the problem has been identified and represented, take action!
What can we do with the above table? Divide it into two:
Order Table (Order No, Product Code)
Order No | Product Code |
OR1 | P1 |
OR1 | P2 |
OR3 | P1 |
OR4 | P3 |
Product Table (Product Code, Product Description, Product Price)
Product Code | Product Description | Product Price |
P1 | Godiva Chocolates 100g | .5 |
P2 | Australian Apples – large | .1 |
P3 | Coke Classic Can 300ml | .2 |
As you will notice by removing partial dependency we have also minimized data redundancy.
Third Normal Form
A table is in 3rd normal form if it is in 2NF and has not transitive dependencies.
Let’s look at the following example, the flight tables
(Flight No, Date, Time, Ticket Price, Pilot No, Pilot Name)
Flight No | Date | Time | Ticket Price | Pilot No | Pilot Name |
AA876 | 2/12/2014 | 230 | 100 | 1 | Steve McDonald |
AA876 | 2/12/2014 | 1330 | 250 | 4 | Tommy Jones |
AA876 | 2/12/2014 | 2030 | 175 | 7 | Raymond Brown |
To normalize this to bring to the 3NF follow the same steps identified above:
1. Identify and isolate the problem – See where the transitive dependency is occurring.
2. Represent it using either dependency diagrams as shown above or like below:
Flight No, Date, Time, - > Ticket Price,
Pilot No -> Pilot Name
The Pilot Name is dependent on Pilot No and not on the Key attributes – that’s where the transitive dependency occurs.
3. So now that the problem has been identified and represented, take action!
Again divide this into two tables
Flight Table (Flight No, Date, Time, Ticket Price)
Flight No | Date | Time | Ticket Price | Pilot No |
AA876 | 2/12/2014 | 230 | 100 | 1 |
AA876 | 2/12/2014 | 1330 | 250 | 4 |
AA876 | 2/12/2014 | 2030 | 175 | 7 |
Pilot Table (Pilot No, Pilot Name)
Pilot No | Pilot Name |
1 | Steve McDonald |
4 | Tommy Jones |
7 | Raymond Brown |
There are some advanced aspects of normalization which we will look in the next chapter.