Working with multiple tables

Where modern databases start to get both more confusing and more useful is when you start making multiple tables and link them together. For example, we now have a list of all of our albums, why not a list of all of the songs as well?

Import songs

Create a new table called “songs”. This one will have a field for song, length, artist, album, year, rating, track, and genre. Set up the ID as an auto increment field as we did for albums.

New for this table is the “time” field type. Our song length is going to be a time.

Import “songs.txt” as you earlier imported “albums.txt”. Note that there is an extra field in songs.txt that we are not importing: a combined date and time field.

Once you’ve imported the songs, scroll through them to see if there are any obvious inconsistencies. You might notice that, in the import screen, the track says “x of y”, i.e., “4 of 9” in the above example. But “4 of 9” is not a number. When MySQL imports text into a numeric field, it will grab the first characters if the first characters are numbers, and then toss the rest.

You should end up with something like this:

Grouping songs

Now that we have actual songs, we can look up, for example, how many songs we have in each genre.

select genre, count(*) as songcount from songs group by genre

We can do everything in this table that we could in the other one, just on different data.

Relating tables

So we have a song table with song information, and an album table with album information. Suppose we want to combine the two on our display? We need to have a field that relates the two tables. In this case, the item that is the same is likely to be the album name.

select song, songs.artist, songs.album, purchasedate, songs.rating/10 as "Song Rating", albums.rating as "Album Rating"
from songs join albums on songs.album=albums.album
order by artist
limit 10

I’ve chosen to “limit 10” so that I only see ten rows. Otherwise, the screen would fill up with several thousand rows. When you are testing your queries, it is often useful to limit the number of records you ask for.

Indexing columns

If you have large databases, or if you have complex select statements, searches can be slow. This is because MySQL has to re-index the columns every time you execute the statement. If you are willing to slow down adding data, however, you can pre-index columns. Generally, you don’t want to pre-index any column that is not searched on, related to, or sorted by.

That last query took about three seconds on the computer I’m using to test this. That doesn’t seem like a long time, but imagine that this is on a web site and we’re getting hundreds of requests every minute. Three seconds per request means that our server will get overloaded: we’re getting more requests in than we can put out.

How can we speed it up? We are sorting it by artist and checking the songs album against the albums album. Let’s add an index for those three columns and see if it reduces response time.

There are three basic kinds of indexes you’ll use most often: a primary index, a unique index, and a normal, non-unique index.

A primary index is a unique index, and is usually your id field. You can only have one primary index per table.

Since artists and album names can occur more than once, make all of these be an “index”, a non-unique index.

Here is how the index area of your Structure tab should look for the songs table. Your albums table should look the same, but without the artist index.

After making these three indexes, the same query now took under a second on this computer. That’s over thirty times as fast. If this were a web page, that would be critical.

Conflicting column names

Let’s take another look at that query.

select song, songs.artist, songs.album, purchasedate, songs.rating/10 as "Song Rating", albums.rating as "Album Rating"
from songs join albums on songs.album=albums.album
order by artist
limit 10

For the from portion of the statement, we have “from songs join albums”. This is the part that combines the two tables. However, we also need to restrict the combined table, because by default MySQL combines the tables in every possible way. If we had a table with the records “2” and “3”, and another table with the records “3” and “6”, a simple combine would give us four rows: “2, 3”, “2, 6”, “3, 3”, and “3, 6”.

So, to the on portion of the statement, we add a restriction that matches a column in one table to the same column in the other: “on songs.album=albums.album”.

Once we have more than one table, it is possible--often even likely--that one or more columns will have the same name. We differentiate between the two columns by prefacing the column name with the table name, and separating the two with a period. So, “songs.album” is the album column from the songs table, and “albums.album” is the album column from the albums table.

We do the same thing in the select portion of the statement. We have two choices for artist, and we choose the artist column from the songs table. Similarly for the album column. for the rating column, we choose to display each one (although we modify the rating column in songs to make it comparable to the rating column in albums) and give them each a special name.

Missing records

We’re asking MySQL for all songs whose album name matches an album name in the albums table. What if a song doesn’t have an album name?

select * from songs where album=""

There are five songs with no album. Let’s try and find one in the above query:

select song, songs.artist, songs.album, purchasedate, songs.rating/10 as "Song Rating", albums.rating as "Album Rating"
from songs join albums on songs.album=albums.album
where song="Eddie Fisher"
order by artist
limit 10

Nothing. That’s because there is no album with no name in the albums table. The same thing will happen if we have some songs with an album name that does not match any album name in the albums table. If we want to see all songs, even ones with no matching album, we need to join the tables together with a left join. A left join is like a normal join, but it does not throw any records out on the left. In this case, the left table is “songs”, so a left join will show all songs, even ones with no album.

select song, songs.artist, songs.album, purchasedate, songs.rating/10 as "Song Rating", albums.rating as "Album Rating"
from songs left join albums on songs.album=albums.album
where song="Eddie Fisher"
order by artist
limit 10

When we do a left join, the relevant portion of the where part of the statement moves into an on area. We are grabbing “from songs left join albums on songs.album=albums.album”.

Now, we find our Eddie Fisher record.

What if we wanted a list of all songs with no matching album? Notice that in the above example, purchasedate and Album Rating are both NULL for Eddie Fisher. That’s because those columns are in the albums table, and there is no matching record for Eddie Fisher in the albums table. We can use that to our advantage.

select song, songs.artist, songs.album from songs
left join albums on songs.album=albums.album
where albums.album is NULL
order by artist

There are 1,198 songs that don’t have a matching album. If this were important data, we might want to figure out why. Most likely, some album names have been spelled differently in one table than they are in the other. Depending on the source of the data and the nature of the differences, you might make your fixes in MySQL using the update statement, or in the original source and re-import.

Quotes in statements

Some of the orphaned tracks make sense, but some of them don’t. Why is Aerosmith’s Rocks album orphaned?

select album from albums where artist="Aerosmith"

One of the “bad” albums is Aerosmith’s “Rocks”. The title of the album in albums has quotes in it, whereas the title of the album in songs does not. Let’s assume that the title without quotes is the correct one. We need to update the albums table and give it the new title.

select * from albums where album='"Rocks"'

I always try to do a select before I do an update, using the same where that I’ll be using in the update. This reduces the risk of completely destroying my data. Look at that where statement. Because the text that we’re looking for contains double quotes, we cannot use double quotes to surround the text. MySQL supports single quotes as well as double quotes to get around this.

If we surround the text with double quotes, we should avoid double quotes in the text. If we surround the text with single quotes, we should avoid single quotes in the text.

If our text has both double quotes and single quotes in it, we can “backquote” the offending quotes. For example, we could have used:

select * from albums where album="\"Rocks\""

Finally, go ahead and update that album. Note that in CocoaMySQL, you may find it easier--and safer--to make the change in the spreadsheet view.

update albums set album="Rocks" where album='"Rocks"'

If you redo the NULL check, there will now be only 1,189 songs without a matching album.

Using IDs

We’ve set up each table with a unique ID. We should be using those rather than album names to relate our tables together. Otherwise, when we change an album name in one table, we also need to change it in the other. By adding a reference to the unique ID, which never changes and never has to change, we only have to update the album name in one place.

The way to do this is to duplicate our table, with an added field.

create table newSongs select songs.*, albums.id as albumID
from songs left join albums on songs.album=albums.album

Once we’ve verified that the new table has what we want, we can rename the old one to “oldSongs” and the new one to “songs”.

Remember that our new table will contain a bunch of songs with no albumID:

select * from newSongs where albumID is NULL

For real data, we would need to go through these records and find out why, fixing those that can be fixed. Once we do fix it, we could get rid of the “album” column in the newSongs table.

One other thing to fix in newSongs is the ID column: it was not transferred over as auto increment. So go ahead and switch the ID column to auto increment.

First, it tells you “invalid default value for ID”. Switch “Null” to YES and the default value to NULL. Then, you’ll get a different error.

Try the following:

select * from newSongs where id=7185

This is the Time Warp from the Rocky Horror Picture Show. It’s in there twice. How many items do we have like this?

select id, artist, song, album, count(*) as idCount from newSongs group by id having idCount > 1

Having is a form of where that often works better after group by. If you can use where, however, it is almost always a better choice.

Looks like we have 61 songs that ended up getting duplicated. Why?

It turns out that different artists occasionally release albums of the same name. Tracy Chapman released an album called “Crossroads”; and there was a movie called “Crossroads”. Our select statement couldn’t tell the difference.

One solution would be to check the artist as well as the album when doing the transfer. But while it would solve this problem, it will exacerbate the problem of different spellings: instead of only different spellings for album mattering, different spellings for either album or artist will matter.

These are the sorts of problems you run into when you try to reconcile data from multiple sources. Sometimes you can fix the problem at the source; other times, you’ll need to do the work to reconcile them in MySQL. If it is important data, however, it is almost always worth the time now to fix the data, rather than spending more time later after problems have had a chance to build up.

Separating data

Our songs table includes a genre column. Suppose you wanted to add a description and a history for each genre. You could go ahead and add two new columns to the songs table, but this would mean duplicating the same description and history for each genre across every record that contains that genre.

select count(*) from songs where genre="Rock"

The “rock” genre alone has 2,289 songs in it. That’s a lot of work. In general, you don’t want to have to duplicate more than one column across all of your records. When the only genre column was the genre itself, separating that into a different table wouldn’t have saved much. But now that genre is also going to be tied to a description and a history, it will be a lot less work and a lot more reliable to separate the genres into their own table.

What we really need to do is create a new table, much like we did above. Since we’ll be creating it from almost scratch this time, we won’t have the same problems we had above.

The first thing to do is make a new table with all the genres in it. We know how to make a table from data in an existing table. First, we find the select statement we’re going to need.

select distinct genre from songs

That shows us the 29 genres we’re going to have in our new genre table.

create table genres select distinct genre from songs

That creates a very simple one-column table with our genre names in it. The next step is to add the rest of the columns we need: an auto increment ID column, and the two text columns for description and history, respectively.

You already know how to create the ID column. But our current text column type, the variable length character column, is really designed for small pieces of text. If your text is going to contain more than 255 characters, varchar will not suffice. For longer text, we have the text column type. Text columns can contain up to 65,535 characters. (If that isn’t enough for you, there is also mediumtext which can contain up to 16,777,215 characters!)

Make the description and history columns be text types.

Now, go ahead and set up some content for the description and history of a few of the genres. For example:

update genres set description="Rock and roll is a form of popular music, usually featuring vocals (often with vocal harmony), electric guitars and a strong back beat; other instruments, such as the saxophone, are common in some styles. Rock's social impact on the world is unparalleled by any other kind of music. It has been credited with ending wars and spreading peace and tolerance, as well as corrupting the innocent and spreading moral rot." where genre="Rock"

update genres set history="Rock and roll emerged as a defined musical style in America in the 1950s, though elements of rock and roll can be heard in rhythm and blues records as far back as the 1920s. Early rock and roll combined elements of blues, boogie woogie, jazz and rhythm and blues, and is also influenced by traditional Appalachian folk music, gospel and country and western. Going back even further, rock and roll can trace a foundational lineage to the old Five Points district of mid-19th century New York City, the scene of the first fusion of heavily rhythmic African shuffles and sand dances with melody driven European genres, particularly the Irish jig.\nRocking was a term first used by gospel singers in the American South to mean something akin to spiritual rapture. By the 1940s, however, the term was used as a double entendre, ostensibly referring to dancing, but with the hidden subtextual meaning of sex; an example of this is Roy Brown's \"Good Rocking Tonight\". This type of song was usually relegated to \"race music\" (the music industry code name for rhythm and blues) outlets and was rarely heard by mainstream white audiences. In 1951, Cleveland, Ohio disc jockey Alan Freed would begin playing this type of music for his white audience, and it is Freed who is credited with coining the phrase \"rock and roll\" to describe the rollicking R&B music that he brought to the airwaves." where genre="Rock"

I pulled this information off of Wikipedia (http://en.wikipedia.org/). You may find it easier to go there and copy some of this text out. Remember to watch out for quotes! You cannot surround your text with the same kind of quotes that the text contains unless you backquote the quotes in the text, as I’ve done with the history of rock above.

The final step is to replace the genres in the songs table with the genre ID we’ve just created. First, create an unsigned integer column in songs called “genreID”.

Since we’ll be searching on this genreID, you’ll want to create a standard index for the column as well.

Now, let’s set the genre ID to the correct ID from the genres table. First, we text the where we’re going to use.

select song, artist, songs.genre, genres.genre, genres.id
from songs join genres on songs.genre=genres.genre

If that shows you what you’d expect to see, backup your data (always backup your data before making a potentially data-threatening change such as this!) and convert that select into an update:

update songs, genres set genreID=genres.id where songs.genre=genres.genre

This may take a while! It has 7,000 records to update.

You should now be able to get genre information for any songs whose genre you’ve entered information for. You’ll only have to enter that information once for all songs of that genre; and if you alter the name of the genre that alteration will follow through as long as the genre ID stays the same.

select song, artist, description
from songs left join genres on songs.genreID=genres.id
where song like "%chain%"
order by artist

This will show all songs whose name contains “chain”, along with the artist and the description of the genre.

Once you’ve verified that this works, you’ll want to remove the extraneous genre column in the songs table. In CocoaMySQL you can just choose the column in the structure tab and delete it. In the mysql command line utility use:

alter table songs drop column genre;

You no longer need that column.