Normalization is a technique of organize data in the
database to minimizing the data redundancy (repetition of similar data) and
also avoid insertion, deletion and update anomalies.
Data Redundancy: We want to reduce the data redundancy means
repetition of similar data multiple places multiple times because it not only leads
some extra spaces on the database but leads multiple other issues also like
insertion, deletion and update anomalies.
To understand these anomalies let us take an example of a
Student table.
Insertion Anomaly
Suppose for a new
admission, until and unless a student opts for a branch, data of the student
cannot be inserted, or else we will have to set the branch information as NULL.
Also, if we have to
insert data of 100 students of same branch, then the branch information will be
repeated for all those 100 students.
These scenarios are
nothing but Insertion anomalies.
Updation Anomaly
What if Mr. X leaves
the college? or is no longer the HOD of computer science department? In that
case all the student records will have to be updated, and if by mistake we miss
any record, it will lead to data inconsistency. This is Updation anomaly.
Deletion Anomaly
In our Student table,
two different informations are kept together, Student information and Branch
information. Hence, at the end of the academic year, if student records are
deleted, we will also lose the branch information. This is Deletion anomaly.
Normalization Rule
Normalization rules are divided into the following normal forms:
- First Normal Form
- Second Normal Form
- Third Normal Form
- BCNF
First Normal Form (1NF)
For a table to be in
the First Normal Form, it should follow the following 4 rules:
- It should only have single(atomic) valued attributes/columns.
- Each column should only contain same type of value.
- All the columns in a table should have unique names.
- Order in which data is stored, does not matter.
Second Normal Form (2NF)
For a table to be in the Second Normal Form,
- It should be in the First Normal form.
- It should not have Partial Dependency.
Let's take an example of a Student table with
columns
student_id
, name
, reg_no
(registration
number), branch
and address
(student's
home address).
student_id
|
name
|
reg_no
|
branch
|
address
|
In this table,
student_id
is
the primary key and will be unique for every row, hence we can use student_id
to
fetch any row of data from this table
Even for a case, where student names are same, if we know
the
student_id
we
can easily fetch the correct record.
student_id
|
name
|
reg_no
|
branch
|
address
|
10
|
Akon
|
07-WY
|
CSE
|
Kerala
|
11
|
Akon
|
08-WY
|
IT
|
Gujarat
|
Hence we can say a Primary Key for a table is the column or a group of columns(composite key) which can uniquely identify each record in the table.
I can ask from branch name of student with
student_id
10,
and I can get it. Similarly, if I ask for name of student with student_id
10 or 11,
I will get it. So all I need is student_id
and
every other column depends on it, or can be fetched using it.
This is Dependency and we also call it Functional
Dependency.
Partial Dependency
Now that we know what dependency is, we are in a better state to
understand what partial dependency is.
For a simple table like Student, a single column like
student_id
can
uniquely identfy all the records in a table.
But this is not true all the time. So now let's extend our example
to see if more than 1 column together can act as a primary key.
Let's create another table for Subject, which will
have
subject_id
and subject_name
fields
and subject_id
will
be the primary key.
subject_id
|
subject_name
|
1
|
Java
|
2
|
C++
|
3
|
Php
|
Now we have a Student table with student
information and another table Subject for storing subject
information.
Let's create another table Score, to store the marks obtained
by students in the respective subjects. We will also be saving name of
the teacher who teaches that subject along with marks.
score_id
|
student_id
|
subject_id
|
marks
|
teacher
|
1
|
10
|
1
|
70
|
Java
Teacher
|
2
|
10
|
2
|
75
|
C++
Teacher
|
3
|
11
|
1
|
80
|
Java
Teacher
|
In the score table we are saving the student_id to
know which student's marks are these and subject_id to know
for which subject the marks are for.
Together,
student_id + subject_id
forms
a Candidate Key for this table, which can be the Primary key.
Confused, How this combination can be a primary key?
See, if I ask you to get me marks of student with
student_id
10,
can you get it from this table? No, because you don't know for which subject.
And if I give you subject_id
, you would
not know for which student. Hence we need student_id
+ subject_id
to uniquely identify any row.
But where is Partial Dependency?
Now if you look at the Score table, we have a
column names
teacher
which is only
dependent on the subject, for Java it's Java Teacher and for C++ it's C++
Teacher & so on.
Now as we just discussed that the primary key for this table is a
composition of two columns which is
student_id
& subject_id
but
the teacher's name only depends on subject, hence the subject_id
, and has
nothing to do with student_id
.
This is Partial Dependency, where an attribute in a
table depends on only a part of the primary key and not on the whole key.
How to remove Partial Dependency?
There can be many different solutions for this, but out objective
is to remove teacher's name from Score table.
The simplest solution is to remove columns
teacher
from
Score table and add it to the Subject table. Hence, the Subject table will
become:
subject_id
|
subject_name
|
teacher
|
1
|
Java
|
Java
Teacher
|
2
|
C++
|
C++
Teacher
|
3
|
Php
|
Php
Teacher
|
And our Score table is now in the second normal form, with no
partial dependency.
score_id
|
student_id
|
subject_id
|
marks
|
1
|
10
|
1
|
70
|
2
|
10
|
2
|
75
|
3
|
11
|
1
|
80
|
Quick Recap
1.
For a table to be in the Second Normal form, it should be in the
First Normal form and it should not have Partial Dependency.
2.
Partial Dependency exists, when for a composite primary key, any
attribute in the table depends only on a part of the primary key and not on the
complete primary key.
3.
To remove Partial dependency, we can divide the table, remove the
attribute which is causing partial dependency, and move it to some other table
where it fits in well.
Third Normal Form (3NF)
A table is said to be in the Third Normal Form when,
1.
It is in the Second Normal form.
2.
And, it doesn't have Transitive Dependency.
In our last tutorial, we learned about the second
normal form and even normalized our Score table into
the 2nd Normal Form.
So let's use the same example, where we have 3 tables, Student, Subject and Score.
Student Table
student_id
|
name
|
reg_no
|
branch
|
address
|
10
|
Akon
|
07-WY
|
CSE
|
Kerala
|
11
|
Akon
|
08-WY
|
IT
|
Gujarat
|
12
|
Bkon
|
09-WY
|
IT
|
Rajasthan
|
Subject Table
subject_id
|
subject_name
|
teacher
|
1
|
Java
|
Java
Teacher
|
2
|
C++
|
C++
Teacher
|
3
|
Php
|
Php
Teacher
|
Score Table
score_id
|
student_id
|
subject_id
|
marks
|
1
|
10
|
1
|
70
|
2
|
10
|
2
|
75
|
3
|
11
|
1
|
80
|
In the Score table, we need to store some more information, which
is the exam name and total marks, so let's add 2 more columns to the Score
table.
score_id
|
student_id
|
subject_id
|
marks
|
exam_name
|
total_marks
|
What is Transitive Dependency?
With
exam_name
and total_marks
added
to our Score table, it saves more data now. Primary key for our Score table is
a composite key, which means it's made up of two attributes or columns → student_id
+ subject_id.
Our new column
exam_name
depends
on both student and subject. For example, a mechanical engineering student will
have Workshop exam but a computer science student won't. And for some subjects
you have Prctical exams and for some you don't. So we can say that exam_name
is
dependent on both student_id
and subject_id
.
And what about our second new column
total_marks
? Does it
depend on our Score table's primary key?
Well, the column
total_marks
depends
on exam_name
as
with exam type the total score changes. For example, practicals are of less
marks while theory exams are of more marks.
But,
exam_name
is
just another column in the score table. It is not a primary key or even a part
of the primary key, and total_marks
depends
on it.
This is Transitive Dependency. When a non-prime
attribute depends on other non-prime attributes rather than depending upon the
prime attributes or primary key.
How to remove Transitive
Dependency?
Again the solution is very simple. Take out the columns
exam_name
and total_marks
from
Score table and put them in an Exam table and use the exam_id
wherever
required.
Score Table: In 3rd Normal Form
score_id
student_id
subject_id
marks
exam_id
score_id
student_id
subject_id
marks
exam_id
The new Exam table
exam_id
|
exam_name
|
total_marks
|
1
|
Workshop
|
200
|
2
|
Mains
|
70
|
3
|
Practicals
|
30
|
Advantage of removing Transitive
Dependency
The advantage of removing transitive dependency is,
·
Amount of data duplication is reduced.
·
Data integrity achieved.
Boyce and Codd Normal Form (BCNF)
Boyce and Codd Normal Form is a higher version of the
Third Normal form. This form deals with certain type of anomaly that is not
handled by 3NF. A 3NF table which does not have multiple overlapping candidate
keys is said to be in BCNF. For a table to be in BCNF, following conditions must
be satisfied:
·
R must be in 3rd Normal Form
·
and, for each functional dependency ( X → Y ), X should be a super
Key.
No comments:
Post a Comment
This is a User Friendly Blog.
Simple Interface and Simple Controls are used.
Post your comments so i can modify blog regarding your wish.