Tuesday, 11 October 2011

DITA - week three: the joys of structuring and querying information stored in databases

"Woah, hold up there a minute!" I hear you say. "Week three? What happened to week two?" Erm, yes, I regret to say there's been a change to our regularly scheduled DITA services due to me still grappling with some of the finer points of the week two tasks - I'll be finishing those shortly (after which you will all be able to marvel at my mad HTML skillz), but for now let's talk about databases and (more importantly for us) how to query them using SQL.

Databases - what they're good for

First of all, why are databases good? Databases allow us to centralize information, thereby reducing the inevitable inconsistencies that creep in when that same information is stored in a number of different places. The example given in the lecture was that of a company, with different departments of that company storing information on its employees. If each department (for example, human resources, accounts/payroll, etc.) keeps their own files on employees (recording information on, for example, home address, phone number, bank account, etc.) you've got a lot of information being duplicated. And when any of that information changes (let's say somebody's phone number), each department will need to record that change. That's a lot of work being repeated, and once you start dealing with a large number of employees it's inevitable that someone somewhere won't update that information. It's much simpler to store all that information in one place that can be accessed and searched by all of the different departments - changes only need to be made once, saving time, eliminating redundant information, and minimizing inconsistencies. In other words, bring on the database!

Databases are also good because an organization (or individual) can structure it to suit their own needs. They can be made homogeneous in structure, and can be designed to create efficient searches. These three points, however, apply when you create or own the data involved - if it's not your data a database isn't so appropriate (information retrieval becomes more appropriate in that case - but more on that in week four).

What exactly is a database?

Okay, so we know that databases are useful - but what exactly is a database? Basically, a database is a collection of data tables, where each table is a two-dimensional table of data (in other words, rows and columns).  Each table in the database describes the attributes of a single thing - or, to use the technical term, an entity. For example, one table in a database could be about publishers, with that table listing different attributes of publishers, such as the name, address, city, and phone number. Another table in the same database could be about book titles, listing attributes such as title, year published, ISBN.

Each table in a database will also share a relationship with one other table in that database. In other words, each table will share a piece of information (an attribute) with one other table - for example, your publishers table and book titles table may both contain the attribute publisher ID. These relationships are extremely important, as they allow the tables in the database to all connect to each other - which in turn allows us to search across the tables of the database, pulling out the information we are looking for.

Searching databases with SQL


We search databases using SQL - Structured Query Language - which is a language that allows communication with a DBMS (Database Management System). Using SQL we can create databases, and insert, modify, and delete data from the data tables, but here we'll be focusing on using SQL to query what's in the data tables.

The basic syntax of an SQL search is pretty simple. You:

Search fields
From tables in the database
Where something is true;

That semicolon at the end is important, by the way - it signifies the end of the command.

So, if you wanted to see all of the names of the publishers in the publishers table, you would type:
search name from publishers;

If you wanted to put in some kind of condition, for example, all of the publishers from New York, you would type:
search name from publishers where city = "New York";

Of course, what starts off as being "pretty simple" quickly turns into "headache inducingly complex". To create more precise searches there are a number of different operators you can use. For example:

select * allows you to select everything
= is equal to
< is less than
> is greater than
<= is less than or equal to
>= is greater than or equal to
<> is not equal to
order by (column name) allows you to order the results of your search in a particular way
order by (column name) desc allows you to order the results in descending order
% is your wildcard character. Importantly, this isn't use with =, but is instead using with like. For example, select title from titles where title like "%philosoph%"; will give you all of the titles that have variations of the word philosophy anywhere in the title.
and
or
not


Using these operators allows us to create very precise searches - however, it's important in SQL to enter these in precisely, and in the correct order. Making a mistake on this, or missing anything out, the database will refuse to play ball with you.

Searching across the tables in a database is particularly tricky. In order to do this you'll need to know the relationship between the different data tables. You then need to put this relationship in your SQL command, joining the two tables. Earlier, I gave the example of a publishers table and book titles table both containing the attribute publisher ID. To get information from both of these tables you need to mention that particular relationship in your SQL command. So, if you wanted to find out the titles and publishers of books that had variations of the word philosophy in their title, you would type:
select title, company_name from publishers, titles where publishers.pubid = titles.pubid and title like "%philosoph%";

The part in the command where publishers.pubid = titles.pubid is what joins the two tables together, thereby allowing you to search across both tables.

Okay, this post is already far too long - time to wrap things up with some thoughts and reflections on the exercises for this week.

Reflections on this week's exercises


I found this week's exercises to be a little hard-going at first, although by the end of the session I think I had started to wrap my head around using SQL. Understanding the relationships between the tables in a database seems pretty crucial - you can do this with the commands show tables; and desc (table name); - the first command shows you what tables are available, and the second shows you the details of one of the data tables. By going through each data table you'll see what the relationships are between them (which you really need to know for searching across the database).

Mistakes that I made early on were forgetting to end my commands with ; and forgetting to put in underscores in attribute names (for example, typing in select year published instead of select year_published). I realized pretty quickly that I wasn't going to make much progress if I kept forgetting these!

Getting the syntax of commands exactly right is something that's extremely important. One of the questions was to find the name of the publisher who published a book with the ISBN 0-0280074-8-4. I typed select name from publishers, titles where publishers.pubid = titles.pubid and isbn = 0028007484;


This didn't work. I needed to type in the ISBN with the dashes. However, that didn't work either. Why? It turns out I was forgetting to put quotation marks around the ISBN - it should've been isbn = "0-0280074-8-4";


I need the quotation marks because the dashes in the ISBN are characters - and if you have a string of characters in your command, you need to use quotation marks. (This is not the case with pure numbers, however.)

So for example:
search name from publishers where city = "New York";
select title, company_name from publishers, titles where publishers.pubid = titles.pubid and title like "%philosoph%"; 
select title from titles where year_published = 1973;


I also discovered that if you are searching across two (or more) tables, and are selecting information that is contained in both tables (their relationship), then you need to specify exactly which table you want that information to come from. For example, let's say you're looking for company name and publishers ID for publishers that have published books with the word philosophy in the title. At first, the temptation is to type:
select company_name, pubid from publishers, titles where publishers.pubid = titles.pubid and title like "%philosophy%";


But this won't work - instead, you'll be told that the pubid in field list is ambiguous. This is because you haven't specified which of the two tables you want the publishers ID info to come from. It doesn't matter which one you specify (the information is the same in both tables). So, you should type:
select company_name, publishers.pubid from publishers, titles where publishers.pubid = titles.pubid and title like "%philosophy%";


Finally, I learnt that you can search across three (or more) tables in a single search - you just need to put all of their relationships in the command. For example:
select author, title, titles.isbn from authors, title_author, titles where titles.isbn = title_author.isbn and authors.au_id = title_author.au_id and title = "a beginner's guide to basic";


In this command I have searched across the authors, title author, and titles tables, using the relationships to join the titles and title author tables and the title author and authors tables.

And that's all for this week. Stayed tuned for when we go back in time to DITA week two - Cameron's adventures with HTML!

1 comment:

  1. I like that you ask what a database is good for. I leap right over that and assure databases of my eternal devotion.

    ReplyDelete