Monday, February 20, 2017

Data and Dimension - Pt. 1

Going into the field of DH, one must be just as aware of the "Digital" side as the "Humanities" side. The digital end of digital humanities is no small matter, and quite a bit of work goes into the tools we use for textual analysis. This week, I'll be walking through two articles from Blackwell's A Companion to Digital Humanities, both of which deal with the technological side.

A few semesters back, I decided to learn about coding. I took lessons at Codecademy and learned a little about HTML and CSS, and I loved it. It was amazing to learn a little of the "language" of computers, and I honestly wish I had had this opportunity when I was younger. Part of what draws me to DH is the opportunity to learn to use technology in a way that marries it to my first love, literature.

The reason I bring up coding is that it, much like databases, is so incredibly complicated. Having taken a few lessons on basic coding, I know a fraction of a fraction of what goes into making one single webpage. I get the same feeling from this reading on databases: they can be quite simple, but so much goes into making a truly responsive database.

To delve more into that point we have our first article, "Databases," by Stephen Ramsay. Databases have existed, in one form or another, for a long time and serve as a way to categorize and store data for easy retrieval. Computerized databases add another element to the mix: a need for systems that "facilitate interaction with multiple end users, provide platform-independent representations of data, and allow for dynamic insertion and deletion of information." Databases play a large role in DH, as the compilation of data can aid in charting relationships and themes across a number of books or fields of data. Although this, as previously discussed, may seem daunting to the humanist, it is actually quite an exciting addition to the field. As Ramsay notes:
The most exciting database work in humanities computing necessarily launches upon less certain territory. Where the business professional might seek to capture airline ticket sales or employee data, the humanist scholar seeks to capture historical events, meetings between characters, examples of dialectical formations, or editions of novels; where the accountant might express relations in terms like "has insurance" or "is the supervisor of", the humanist interposes the suggestive uncertainties of "was influenced by", "is simultaneous with", "resembles", "is derived from."
The first model of database discussed in this article is the relational model, which organizes data as relationships between tables, or sets of data. E. F. Codd, who first proposed this model, reasoned that a database "could be thought of as a set of propositions…and hence that all of the apparatus of formal logic could be directly applied to the problem of database access and related problems." Sounds logical to me!

Databases are quite complicated entities, and I'm going to try to keep my secondhand explanation as simple as possible. Ramsay delves into the finer points of what goes into a database: at its simplest, a database is a system that can store and query data, and that can answer simple questions by linking together the data it stores. Simple as these systems are, though, databases can get quite large, and that's when the algorithms start getting more complicated.
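To make that "store and query" idea concrete, here's a minimal sketch of my own (the table, field names, and sample novels are invented for illustration, not taken from Ramsay's chapter), using Python's built-in sqlite3 module:

```python
import sqlite3

# An in-memory database: the simplest possible store-and-query system.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Store some data...
cur.execute("CREATE TABLE novels (title TEXT, author TEXT, pub_year INTEGER)")
cur.executemany("INSERT INTO novels VALUES (?, ?, ?)", [
    ("Moby-Dick", "Herman Melville", 1851),
    ("Billy Budd", "Herman Melville", 1924),
    ("The Scarlet Letter", "Nathaniel Hawthorne", 1850),
])

# ...then answer a simple question from it: which of these are Melville's?
cur.execute("SELECT title FROM novels WHERE author = ? ORDER BY pub_year",
            ("Herman Melville",))
melville = [row[0] for row in cur.fetchall()]
print(melville)  # → ['Moby-Dick', 'Billy Budd']
```

The SQL statements inside the strings are the part the humanist actually writes; Python here is just a convenient way to run them.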

Ramsay goes into the different categorizations of data with his example of a database that stores information about current editions of American novels. By showing the problems that can arise from the most simplistic categorizations, he explains that there are other ways data can be categorized that are more complicated but yield better results. Further, there are different ways in which data can be related to other data, which complicates things even more. In his American novels example, he talks about relating one author to many works (1:M), or many publishers to many works (M:M), and how the system would logically go about making the calculations needed to give a result.
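Here's a sketch of how those two shapes are typically modeled; the table and column names are my own (Ramsay's chapter uses its own example tables), but the pattern is standard: a 1:M relationship is a column pointing back at the "one" side, while an M:M relationship needs a third, linking table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
-- 1:M -- one author, many works: each work carries its author's id.
CREATE TABLE authors (author_id INTEGER PRIMARY KEY, last_name TEXT);
CREATE TABLE works   (work_id INTEGER PRIMARY KEY, title TEXT,
                      author_id INTEGER REFERENCES authors(author_id));

-- M:M -- many publishers, many works: the link lives in a third
-- "junction" table, one row per published edition.
CREATE TABLE publishers (publisher_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE editions (work_id INTEGER REFERENCES works(work_id),
                       publisher_id INTEGER REFERENCES publishers(publisher_id));
""")
cur.execute("INSERT INTO authors VALUES (1, 'Melville')")
cur.executemany("INSERT INTO works VALUES (?, ?, 1)",
                [(1, "Moby-Dick"), (2, "Billy Budd")])
cur.executemany("INSERT INTO publishers VALUES (?, ?)",
                [(1, "Penguin"), (2, "Norton")])
# Moby-Dick has editions from both publishers: the M:M in action.
cur.executemany("INSERT INTO editions VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])

# Joining across the junction table: which publishers carry Moby-Dick?
cur.execute("""
    SELECT publishers.name FROM publishers
    JOIN editions ON editions.publisher_id = publishers.publisher_id
    JOIN works ON works.work_id = editions.work_id
    WHERE works.title = 'Moby-Dick' ORDER BY publishers.name
""")
pubs = [row[0] for row in cur.fetchall()]
print(pubs)  # → ['Norton', 'Penguin']
```

Those JOINs are exactly the "calculations" the system performs to link related data back together.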

Are you still with me? It's about to get much more technical. The next subject to be discussed is schema design. This is where we get into programming: a database schema is created using Structured Query Language, or SQL. The best, least daunting way I can think to describe it is that in using SQL, the user is telling the machine what to do. The humanist tells the computer what they want it to do with the data they will be using. Even though it involves a lot of code, it's somehow less daunting if you think of it as giving directions. Below is an example of SQL, a basic structure that will later be filled in with data.
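The image of Ramsay's SQL didn't survive here, so what follows is my own reconstruction of roughly what such a schema looks like, built from the field names he mentions (last_name, first_name, year_of_birth, year_of_death, title, pub_year, name, city); the table names are my guess. It's an empty structure only, columns and datatypes with no data yet, run through Python's sqlite3 just to confirm it parses:

```python
import sqlite3

# A basic structure to be filled in with data later: three tables,
# each column declared with a datatype (character data or integers).
schema = """
CREATE TABLE authors (
    last_name      VARCHAR(80),
    first_name     VARCHAR(80),
    year_of_birth  INTEGER,
    year_of_death  INTEGER
);
CREATE TABLE works (
    title     VARCHAR(80),
    pub_year  INTEGER
);
CREATE TABLE publishers (
    name  VARCHAR(80),
    city  VARCHAR(80)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)

# List the tables the schema created.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # → ['authors', 'publishers', 'works']
```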

Ramsay explains this next part well, so I'm going to direct you to him for this bit:
Like most programming languages, SQL includes the notion of a datatype. Datatype declarations help the machine to use space more efficiently and also provide a layer of verification for when the actual data is entered (so that, for example, a user cannot enter character data into a date field). In this example, we have specified that the last_name, first_name, title, city, and name fields will contain character data of varying length (not to exceed 80 characters), and that the year_of_birth, year_of_death, and pub_year fields will contain integer data. Other possible datatypes include DATE (for day, month, and year data), TEXT (for large text blocks of undetermined length), and BOOLEAN (for true/false values). Most of these can be further specified to account for varying date formats, number bases, and so forth. PostgreSQL, in particular, supports a wide range of datatypes, including types for geometric shapes, Internet addresses, and binary strings.
There's a lot more technical discussion in this article that I struggle to explain in my own words, so I direct you to the text if you're interested in learning more about the programming side of SQL. What I've come to see is that it's incredibly interesting and incredibly precise. Much the same as coding, there is a fine art to speaking the language of the computer and communicating effectively. I'd be excited to take a lesson in this and get hands-on instruction. I'm hoping for a THATCamp in my area, as I think that would be the best opportunity to learn from others.

Ramsay broaches the discussion of data management in the last few paragraphs of his article. With great power comes great responsibility, so to speak, and anyone who has ever played around with HTML can tell you that the smallest error can throw off a large amount of work. The same is true of database management, and Ramsay suggests giving full access to very few people, for the sake of data and code security. After all, not many people need full access to an entire system. The less room for error, the better.

As we can see from this brief introduction to databases, there's a lot that goes into database programming, and there's certainly a learning curve that is not easily overcome. Luckily, there are a ton of resources to help the aspiring learner. Ramsay cites three of the most commonly used SQL implementations, MySQL, mSQL, and PostgreSQL, as helpful options for those interested in using this methodology.

Because the readings this week are so dense, I'm going to split up the Great Wall of Text and direct you here for part two of this week's blog post!
