Home Company Services Experience Process Articles FAQs Contact Us

Data Modeling


Data modeling refers to the process of collecting data items and defining entities. First, find all the data items in a particular problem domain, for example, customer, account, program, interest, clerk for banking industry. Then describe clearly and accurately what each data item means and what its properties are. After that, consolidate data items into entities based on similarities or relationships. For each entity, define the attributes based on the properties of the corresponding data items, and identify the relationships among entities.

 

Note that during the phase of data modeling, we are not doing any particular database design work. We focus on understanding the data and analyzing their relationships. Therefore for data modeling, we use the terms entity, attribute and instance, instead of table, field and record, to distinguish data models from database representations. Theoretically, data modeling and database representations can be independently chosen. For example, we can model the data using the object-oriented approach that is familiar to application developers, and represent the data with the relational database. With the increasing supply of object-to-relation mapping software, this approach has become more popular, especially for certain enterprise and e-commerce applications. Since the object-oriented approach is better covered in the object-oriented analysis and design, this article only discuss certain topics of data modeling using the relational approach. However, we still want to make the distinction between data modeling and database representation, because they usually focus on different objects, target at different audiences, use different tools, require different skills and have different concerns, especially for applications that have very large and complex data. To summarize, we will use the relational approach, or more specifically, the entity-relationship approach to build data models, for the purpose of understanding the data, without the context or constraints from any particular database.

 

Soft Skills

 

Data modeling is usually the first opportunity to get familiar with the problem domain and customer’s requirements. It demands not only deep domain knowledge, but also strong soft skills and good understanding of business practice. There are enormous and recursive interactions between data modelers and business units, IT departments, end users, database developers, DBAs, application developers, SQA, and support engineers. Data modelers must understand the company’s business goals, daily operations, stakeholder interests, business rules and information flow in order to model the data effectively and correctly. They think from both business and engineering perspectives, and interact with different types and levels of people (the way to communicate with corporate executives are certainly different from the way with software engineers). They must be open-minded and good listeners because no matter how many projects they have done before, there are no two projects that are the same. From data modeling point of view, there are no two companies conducting business in the same way and having the same business rules. Skills such as how to lead, motivate, comply with business manners, conduct interviews, analyze business logics, and understand people factors, are crucially important.

 

Identify Entities

 

Entity is a set of data items that are highly cohesive to one another but loosely coupled to data items of other entities. In other words, entities are self contained and self sustained, and the interactions among entities are minimized. Most likely entities directly map to concrete concepts in the problem domains, but there are entities that are derived from abstract concepts or merely for data modeling purposes.

 

One of the major decisions to make in identifying entities is how to group data items that share some common properties. Sometimes two data items have a big portion of distinctive properties, but still share some common properties. For example, data items X and Y share common properties A; in addition to that, X has properties B and Y has properties C. One straightforward way is to model the data is to define an entity X for item X, and an entity Y for item Y. However, this would spread instances of entities X and Y that share the common attributes derived from properties A into two entities, which would create problems in data retrieval. For example, if we want to find all the instances that have the same value of attributes A, we need to search all the entities that contains attributes A. Not only this is search inefficient, but keeping track of all the entities that share the same attributes is a big headache.

 

Another way to model items X and Y that have common properties is to define an entity XY whose attributes are the consolidation of properties A, B and C. This way solves the problems mentioned above, but is an obvious overshoot. For any instance that is actually X, all the C attributes will be irrelevant in entity XY, which lead to certain waste of storage and cost of performance.

 

The third way to deal with such case is to use the super-entity/sub-entity method. define a super-entity SPA whose attributes are derived from the common attribute A, a sub-entity SBB that contains attribute B, and a sub-entity SBC that contains attribute C. The primary keys of SBB and SBC also appear in SPA, which is how the super-entity/sub-entity association is maintained, but never appear in each other. For example, an online store can sell things from books to cars, from cell phones to drugs. While they all share some common attributes such as name, quantity, price, and sales promotion, they obvious have attributes that are unique to themselves. Books have writers and publishers, cars have warranties, cell phones have a number associated with it, and drugs have insurance and prescription information. We can define entities for book, car, cell phone and drug. However, if we put all the attributes in each entity, it will be very hard to find a particular instance. For example, without knowing that “Top Mobile” is actually a cell phone name, we need to search for all the four entities to match that instance. An alternative is to define a super-entity such as “Merchandise”, and put all the common attributes in that entity.

 

Define Primary Key

 

Every entity shall have one and only one primary key, which can be a simple one or a composite one. The primary purpose is to eliminate duplicated instances. Each instance of an entity must be easily, and sometime uniquely, identified. The relationships among entities are through the association of primary key and foreign key. Primary key serves as the ID of each instance of the entity. It must be unique, cannot be empty, shall contain one single attribute – or as few as possible. It shall never change either. Since primary key is used for identification and navigation, it cannot contain sensitive attributes such as password, social society number, date of birth, etc. Things like driver’s license number, credit card number, and account number shall not be used in their complete format. In the case that one data item is distributed among several entities, such as the super-entity/sub-entity case, the same primary key must designate the same instance throughout the whole database or even the whole organization. For example, no two employees shall use the same employee ID if used as the primary key. No matter how many entities are used to maintain a complete set of data of an employee, all the tables must use the same employee ID to refer to the same employee.

 

Identify Attributes

 

Decompose entity to the minimum. It is easier to get multiple attributes, combine them and present to the user, comparing with searching or processing the subset of an attribute. The query is hard to write and the subset syntax is usually not well defined. For example, if use name is an attribute, people can enter James Bond or Bond James or Bond, James. How do you find the super agent based on his famous last name? The better way is to have two attributes: first name and last name, which makes query easier to write and no room for confusion. You don’t want to mess up with Mr. Bond, right?

 

Make sure an attribute belongs to the right table. The example is that students check out books from a library. If the rule is that all the books must be checked out at one time, then the checkout date attribute shall belong to a “Checkout Records” table. If the rule is that each book can be checked out individually, then the checkout date attribute shall belong to a “Checkout Books” table.

 

Be careful about dependency among attributes. Common cases are one attributes contains a calculation or concatenation results that take data from other attributes as the inputs. Examples are total amount, price after discount. The problem is that they have to be kept in sync. If such dependency is absolutely necessary, user application must be carefully designed to keep the data integrity.

 

Define Entity Relationships

 

In relational data model, relationships among entities are everything. On one end, you want to balance the size of entity (don’t put all the attributes in one entity) and the complexity of relationships. On the other end, you don’t want too many standalone entities, which has no relationship to other entities.

 

Even though the entity relationships can be very complex, the types of relationships are quite simple: no relation, one-to-one, one-to-many and many-to-many, and the desired types of relationships are even less: one-to-many. Entities that are not standalone are always related to others. One-to-one relationships can be avoided by grouping the attributes into one entity. In other words, unless for certain unusual reasons, e.g., avoid big size entity, one-to-one relationships shall be avoided.

 

Eliminate many-to-many relationship between entities. The solution is to create a third entity, the linking entity, which contains the primary keys from the two entities, to break the many-to-many relationship to two one-to-many relationships. In the linking entity, the two attributes are each foreign key and combined as a composite primary key. For example, since there is a many-to-many relationship between Customers and Accounts, a linking entity “Customer Accounts” is created to break it into two one-to-many relationships. In the linking entity, two foreign keys each from the two entities serve as the composite primary key.


Jerry Zhong, April 2000.