Search This Blog

Friday, June 15, 2018

Normalization


Normalization is a technique of organize data in the database to minimizing the data redundancy (repetition of similar data) and also avoid insertion, deletion and update anomalies.




Data Redundancy: We want to reduce the data redundancy means repetition of similar data multiple places multiple times because it not only leads some extra spaces on the database but leads multiple other issues also like insertion, deletion and update anomalies.
To understand these anomalies let us take an example of a Student table.

Insertion Anomaly
Suppose for a new admission, until and unless a student opts for a branch, data of the student cannot be inserted, or else we will have to set the branch information as NULL.
Also, if we have to insert data of 100 students of same branch, then the branch information will be repeated for all those 100 students.
These scenarios are nothing but Insertion anomalies.


Updation Anomaly
What if Mr. X leaves the college? or is no longer the HOD of computer science department? In that case all the student records will have to be updated, and if by mistake we miss any record, it will lead to data inconsistency. This is Updation anomaly.


Deletion Anomaly
In our Student table, two different informations are kept together, Student information and Branch information. Hence, at the end of the academic year, if student records are deleted, we will also lose the branch information. This is Deletion anomaly.




Normalization Rule

Normalization rules are divided into the following normal forms:
  1. First Normal Form 
  2. Second Normal Form 
  3. Third Normal Form 
  4. BCNF



First Normal Form (1NF)


For a table to be in the First Normal Form, it should follow the following 4 rules:
  1. It should only have single(atomic) valued attributes/columns.
  2. Each column should only contain same type of value.
  3. All the columns in a table should have unique names.
  4. Order in which data is stored, does not matter.

Second Normal Form (2NF)

For a table to be in the Second Normal Form,
  1. It should be in the First Normal form.
  2. It should not have Partial Dependency.


Let's take an example of a Student table with columns student_idnamereg_no(registration number), branch and address(student's home address).
student_id
name
reg_no
branch
address

In this table, student_id is the primary key and will be unique for every row, hence we can use student_id to fetch any row of data from this table
Even for a case, where student names are same, if we know the student_id we can easily fetch the correct record.


student_id
name
reg_no
branch
address
10
Akon
07-WY
CSE
Kerala
11
Akon
08-WY
IT
Gujarat

Hence we can say a Primary Key for a table is the column or a group of columns(composite key) which can uniquely identify each record in the table.
I can ask from branch name of student with student_id 10, and I can get it. Similarly, if I ask for name of student with student_id 10 or 11, I will get it. So all I need is student_id and every other column depends on it, or can be fetched using it.
This is Dependency and we also call it Functional Dependency.

Partial Dependency

Now that we know what dependency is, we are in a better state to understand what partial dependency is.
For a simple table like Student, a single column like student_id can uniquely identfy all the records in a table.
But this is not true all the time. So now let's extend our example to see if more than 1 column together can act as a primary key.
Let's create another table for Subject, which will have subject_id and subject_name fields and subject_id will be the primary key.


subject_id
subject_name
1
Java
2
C++
3
Php
Now we have a Student table with student information and another table Subject for storing subject information.
Let's create another table Score, to store the marks obtained by students in the respective subjects. We will also be saving name of the teacher who teaches that subject along with marks.
score_id
student_id
subject_id
marks
teacher
1
10
1
70
Java Teacher
2
10
2
75
C++ Teacher
3
11
1
80
Java Teacher
In the score table we are saving the student_id to know which student's marks are these and subject_id to know for which subject the marks are for.
Together, student_id + subject_id forms a Candidate Key for this table, which can be the Primary key.
Confused, How this combination can be a primary key?
See, if I ask you to get me marks of student with student_id 10, can you get it from this table? No, because you don't know for which subject. And if I give you subject_id, you would not know for which student. Hence we need student_id + subject_id to uniquely identify any row.

But where is Partial Dependency?

Now if you look at the Score table, we have a column names teacher which is only dependent on the subject, for Java it's Java Teacher and for C++ it's C++ Teacher & so on.
Now as we just discussed that the primary key for this table is a composition of two columns which is student_id & subject_id but the teacher's name only depends on subject, hence the subject_id, and has nothing to do with student_id.
This is Partial Dependency, where an attribute in a table depends on only a part of the primary key and not on the whole key.

How to remove Partial Dependency?

There can be many different solutions for this, but out objective is to remove teacher's name from Score table.
The simplest solution is to remove columns teacher from Score table and add it to the Subject table. Hence, the Subject table will become:


subject_id
subject_name
teacher
1
Java
Java Teacher
2
C++
C++ Teacher
3
Php
Php Teacher
And our Score table is now in the second normal form, with no partial dependency.


score_id
student_id
subject_id
marks
1
10
1
70
2
10
2
75
3
11
1
80

Quick Recap

1.    For a table to be in the Second Normal form, it should be in the First Normal form and it should not have Partial Dependency.
2.    Partial Dependency exists, when for a composite primary key, any attribute in the table depends only on a part of the primary key and not on the complete primary key.
3.    To remove Partial dependency, we can divide the table, remove the attribute which is causing partial dependency, and move it to some other table where it fits in well.




Third Normal Form (3NF)

A table is said to be in the Third Normal Form when,
1.    It is in the Second Normal form.
2.    And, it doesn't have Transitive Dependency.

In our last tutorial, we learned about the second normal form and even normalized our Score table into the 2nd Normal Form.
So let's use the same example, where we have 3 tables, StudentSubject and Score.

Student Table

student_id
name
reg_no
branch
address
10
Akon
07-WY
CSE
Kerala
11
Akon
08-WY
IT
Gujarat
12
Bkon
09-WY
IT
Rajasthan

Subject Table

subject_id
subject_name
teacher
1
Java
Java Teacher
2
C++
C++ Teacher
3
Php
Php Teacher

Score Table

score_id
student_id
subject_id
marks
1
10
1
70
2
10
2
75
3
11
1
80
In the Score table, we need to store some more information, which is the exam name and total marks, so let's add 2 more columns to the Score table.
score_id
student_id
subject_id
marks
exam_name
total_marks

What is Transitive Dependency?

With exam_name and total_marks added to our Score table, it saves more data now. Primary key for our Score table is a composite key, which means it's made up of two attributes or columns → student_id + subject_id.
Our new column exam_name depends on both student and subject. For example, a mechanical engineering student will have Workshop exam but a computer science student won't. And for some subjects you have Prctical exams and for some you don't. So we can say that exam_name is dependent on both student_id and subject_id.
And what about our second new column total_marks? Does it depend on our Score table's primary key?
Well, the column total_marks depends on exam_name as with exam type the total score changes. For example, practicals are of less marks while theory exams are of more marks.
But, exam_name is just another column in the score table. It is not a primary key or even a part of the primary key, and total_marks depends on it.
This is Transitive Dependency. When a non-prime attribute depends on other non-prime attributes rather than depending upon the prime attributes or primary key.

How to remove Transitive Dependency?

Again the solution is very simple. Take out the columns exam_name and total_marks from Score table and put them in an Exam table and use the exam_id wherever required.

Score Table: In 3rd Normal Form

score_id
student_id
subject_id
marks
exam_id

The new Exam table

exam_id
exam_name
total_marks
1
Workshop
200
2
Mains
70
3
Practicals
30

Advantage of removing Transitive Dependency

The advantage of removing transitive dependency is,
·         Amount of data duplication is reduced.
·         Data integrity achieved.

Boyce and Codd Normal Form (BCNF)

Boyce and Codd Normal Form is a higher version of the Third Normal form. This form deals with certain type of anomaly that is not handled by 3NF. A 3NF table which does not have multiple overlapping candidate keys is said to be in BCNF. For a table to be in BCNF, following conditions must be satisfied:
·         R must be in 3rd Normal Form
·         and, for each functional dependency ( X → Y ), X should be a super Key.

No comments:

Post a Comment


This is a User Friendly Blog.
Simple Interface and Simple Controls are used.
Post your comments so i can modify blog regarding your wish.