The issue of student debt is one of the most pressing for the millennial generation. Every year, millions of young people leave college with an ever-increasing burden of debt, threatening their purchasing power and financial security. The data bears these facts out, as the Tableau story below will demonstrate. The purpose of this story is to show that student debt continues to be an unaddressed issue and is affecting the viability and ultimate value of the current higher education model.
The story type I used was change over time. Using the College Scorecard dataset, I was able to pull a decade of tuition, debt and income data for non-profit and public institutions. The story looks at changes in tuition, total debt and debt-to-income from 2003 to 2013. I provided three story points to tell the story, each building logically on the previous one: the increase in tuition sets the stage for the consequential increase in debt burden, and because tuition grew faster than income over the period, student debt-to-income naturally increased as well.
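The arithmetic behind that third story point can be sketched with a few made-up numbers (these are illustrative only, not actual Scorecard figures):

```python
# Hypothetical figures for illustration only -- not actual Scorecard values.
tuition_2003, tuition_2013 = 10_000, 18_000   # average annual tuition
income_2003, income_2013 = 40_000, 45_000     # median post-graduation income
debt_2003, debt_2013 = 15_000, 27_000         # median total debt at graduation

# Growth rates over the decade.
tuition_growth = (tuition_2013 - tuition_2003) / tuition_2003   # 0.80
income_growth = (income_2013 - income_2003) / income_2003       # 0.125

# Because tuition (and therefore debt) grew faster than income,
# the debt-to-income ratio rises.
dti_2003 = debt_2003 / income_2003   # 0.375
dti_2013 = debt_2013 / income_2013   # 0.60
print(f"Tuition grew {tuition_growth:.1%}, income grew {income_growth:.1%}")
print(f"Debt-to-income rose from {dti_2003:.2f} to {dti_2013:.2f}")
```

Any plausible numbers with the same shape (tuition growth outpacing income growth) produce the same direction of change in the ratio.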
Visualizations are a tool for understanding data. They express relationships between data points and the significance and importance of values without requiring the reader to scroll through tables or calculate percentages. Visualizations provide immediate understanding through vision and interpretation of the image. All of the data used for this exercise comes from the 2013 College Scorecard dataset, the most complete data I could find as of this writing.
One of the first images I created is a map of the United States showing the tuition of bachelor’s-degree-granting universities across geographies. The size of each circle indicates the amount of in-state tuition charged by the institution, while the color indicates whether the institution is public (light blue) or non-profit private (dark blue). As expected, dark blue circles tend to be significantly larger on average. The visualization also shows that, in terms of distribution, more expensive private universities tend to be located in the Northeast, while the South and Midwest feature less expensive public and private options.
The second graph I produced shows the geographic distribution of student populations using a bubble graph. The graph demonstrates that most undergraduate students are located in the Southeast region (1,994,386) while the fewest are located in the Rocky Mountains region (410,525). I created a legend separately to spell out each state.
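The aggregation behind a chart like this amounts to a group-and-sum of undergraduate enrollment by region. A minimal sketch, using made-up enrollment numbers and a deliberately partial state-to-region mapping:

```python
# Sketch of the region aggregation behind the bubble chart.
# Enrollment numbers here are made up for illustration; the real values
# come from the Scorecard undergraduate enrollment (UGDS) column.
rows = [
    {"state": "FL", "ugds": 50000},
    {"state": "GA", "ugds": 30000},
    {"state": "CO", "ugds": 20000},
]

# Partial state-to-region mapping (assumed; extend to all states).
region_of = {"FL": "Southeast", "GA": "Southeast", "CO": "Rocky Mountains"}

totals = {}
for row in rows:
    region = region_of[row["state"]]
    totals[region] = totals.get(region, 0) + row["ugds"]
print(totals)  # {'Southeast': 80000, 'Rocky Mountains': 20000}
```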
My final visualization shows the correlation between family income and SAT scores at universities in Illinois. As seen in the graph, there appears to be a positive correlation between family income and test scores. Northwestern University and the University of Chicago, two of the most prestigious universities in the state, have the highest test scores and the highest average family incomes of any undergraduate-degree-granting institution in the dataset.
While it’s important to note that these visualizations are not developed using entirely scientific methods, they still serve as a useful way to digest what is otherwise a noisy dataset or table. This exercise has taught me the importance of visualizations and the utility that tools such as Tableau can provide when dealing with data.
To dive deeper into my dataset, I used Google BigQuery to run SQL queries on my data table. Because of the size of my dataset (over 10MB), I first had to upload the file to Google Cloud.
To sample the use of BigQuery on my dataset, I decided to query the number of accredited, undergraduate-degree-granting public and non-profit institutions in the state of Illinois. To run this query I constructed the following SQL:
SELECT *
FROM my_table  -- placeholder for the Scorecard table as uploaded to BigQuery
WHERE STABBR = 'IL'
AND PREDDEG = 3
AND CONTROL != 3
The structure of the query is to select all the results (not excluding any columns) from my table where the state is Illinois, the predominant degrees granted are undergraduate, and control of the institution is not for-profit.
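The same filter logic can be tried locally, for example against a few made-up rows in SQLite (the column coding follows the Scorecard documentation: PREDDEG 3 = predominantly bachelor’s-degree granting; CONTROL 1/2/3 = public / private non-profit / private for-profit):

```python
import sqlite3

# A few made-up rows using the Scorecard column coding:
# PREDDEG 3 = predominantly bachelor's-degree granting;
# CONTROL 1 = public, 2 = private non-profit, 3 = private for-profit.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scorecard (INSTNM TEXT, STABBR TEXT, PREDDEG INT, CONTROL INT)"
)
conn.executemany(
    "INSERT INTO scorecard VALUES (?, ?, ?, ?)",
    [
        ("University of Chicago", "IL", 3, 2),
        ("Some For-Profit College", "IL", 3, 3),  # excluded: CONTROL = 3
        ("Community College X", "IL", 2, 1),      # excluded: PREDDEG != 3
        ("Ohio State University", "OH", 3, 1),    # excluded: wrong state
    ],
)
rows = conn.execute(
    "SELECT INSTNM FROM scorecard "
    "WHERE STABBR = 'IL' AND PREDDEG = 3 AND CONTROL != 3"
).fetchall()
print(rows)  # [('University of Chicago',)]
```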
A COUNT(*) over the same conditions returns: 71 results
I’m actually somewhat surprised by the number of results; however, they’re confirmed by viewing the institution names, IDs and cities from the table. The query took about five seconds to run, despite needing to scan tens of thousands of rows.
Lake Forest College
Trinity Christian College
University of St Francis
North Central College
Illinois Institute of Technology
Trinity International University-Illinois
North Park University
National Louis University
Olivet Nazarene University
University of Illinois at Springfield
Governors State University
Saint Xavier University
Loyola University Chicago
Eastern Illinois University
Chicago State University
Southern Illinois University-Edwardsville
Western Illinois University
Northeastern Illinois University
Illinois State University
Robert Morris University Illinois
University of Illinois at Urbana-Champaign
University of Illinois at Chicago
Northern Illinois University
Southern Illinois University-Carbondale
Moody Bible Institute
VanderCook College of Music
Blessing Rieman College of Nursing
National University of Health Sciences
Saint Anthony College of Nursing
Saint Francis Medical Center College of Nursing
Lakeview College of Nursing
Illinois Wesleyan University
Lincoln Christian University
School of the Art Institute of Chicago
University of Chicago
St. John’s College-Department of Nursing
Rosalind Franklin University of Medicine and Science
For assignment 2, I was tasked with designing a web structure and data schema for the dataset I am using. To do this, I reviewed the resources from webstyleguide and Lucidchart to gain a better understanding of information architecture principles.
As seen in Figure 1, my site structure is mostly hierarchical. Sub-pages are accessed by first visiting or hovering over the relevant content or page, which then provides access to the sub-pages that sit underneath the parent category. I attempted to allow for internal routing as well: for example, from Resources you can choose between following an external link or being routed within the site to the College Scorecard sub-page of the Data category. This organization seemed to make the most sense to me, though I’m sure it would undergo revision once actually implemented. The process had me thinking about how a site should be structured and whether the most logical way is necessarily the most user-friendly way.
For part 2 of my assignment, I was asked to design a database schema for my dataset. This was somewhat difficult because my dataset is a “flat file,” with most columns representing attributes of the university ID. I gave it my best shot and focused on dividing the data into tables around three key groups of attributes, eventually joined back together: 1) the university’s key attributes, such as size, type and tuition; 2) its accreditation agency and the compliance needed to be accredited; and 3) its students’ post-graduate profile on earnings and debt. This wasn’t an entirely straightforward exercise because of the nature of my data, but I wanted to give it a thoughtful effort and split the data as best I could.
As seen in Figure 2, the entity records ultimately direct to the Accredited College entity record. The idea is to bring all the information from the College record (basic facts), the accreditation profile/agency, and the post-graduate survey and federal student loan reporting into the same Accredited College record. The information is gathered in separate places, not all at once, so for it to be joined together without significant mistakes, data integrity must be maintained along the way. For example, while each college is located in one state, a state can have many colleges, so State needs to be a separate entity rather than being stored in the college table. Likewise, since a university can be accredited by multiple agencies (regional, national, or program accreditation), the accreditation profile for compliance needs to be separated as well. Overall, I felt this was a good exercise in thinking through and separating the dataset logically.
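One way to make the schema concrete is to sketch it in SQLite. The table and column names below are my own simplification of the design, not actual Scorecard column names, and the inserted college row is made up:

```python
import sqlite3

# Sketch of the normalized schema: state and accreditation are split out
# from the college table, and outcomes hold post-graduate earnings/debt.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE state (
    state_id TEXT PRIMARY KEY                 -- e.g. 'IL'
);
CREATE TABLE college (
    college_id INTEGER PRIMARY KEY,
    name TEXT,
    state_id TEXT REFERENCES state(state_id), -- one state, many colleges
    control TEXT,                             -- public / non-profit
    tuition INTEGER
);
CREATE TABLE accreditation (                  -- one college, many agencies
    college_id INTEGER REFERENCES college(college_id),
    agency TEXT
);
CREATE TABLE outcomes (                       -- post-graduate earnings and debt
    college_id INTEGER REFERENCES college(college_id),
    median_earnings INTEGER,
    median_debt INTEGER
);
""")
conn.execute("INSERT INTO state VALUES ('IL')")
conn.execute(
    "INSERT INTO college VALUES (1, 'Example University', 'IL', 'non-profit', 30000)"
)
conn.execute("INSERT INTO accreditation VALUES (1, 'Higher Learning Commission')")
conn.execute("INSERT INTO outcomes VALUES (1, 55000, 25000)")

# Joining the pieces back into one 'Accredited College' view.
row = conn.execute("""
    SELECT c.name, a.agency, o.median_earnings, o.median_debt
    FROM college c
    JOIN accreditation a ON a.college_id = c.college_id
    JOIN outcomes o ON o.college_id = c.college_id
""").fetchone()
print(row)  # ('Example University', 'Higher Learning Commission', 55000, 25000)
```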
When it comes to transforming data into information, the tool used is often as important as the data itself. The College Scorecard dataset will require data cleansing, and any tool used to create visualizations will need to handle large data volumes. The first step in analyzing data is to understand the structure of your dataset.
To begin, I will use Microsoft Excel to open the .CSV files in the raw data dump and better understand how the data is structured. This will give me an idea of what is relevant and usable in the dataset, as well as an initial sense of what data needs to be cleansed and which columns or rows will be irrelevant to the analysis. Excel provides quick and easy formulas, sums and averages for simple calculations.
Second, I will look to a tool such as Microsoft Power BI to analyze and visualize the data. Power BI offers powerful visualizations and lets me add fields and filters to present information in the most relevant manner. Finally, it allows relationships to be established between fields, tables and even datasets. Understanding a dataset, its components and the relationships between its fields and tables is important to using it as effective information.
As discussed in class, data can be noisy, messy and at times difficult to understand. However, using data analysis tools we can transform data into information. The key distinction lies in usability: information contains relevant and easily digested facts and figures. Part of the process of converting data to information involves reducing the volume being presented. In this post, I’ll explore ways to decrease the size and scope of my dataset to make my analysis more manageable.
Because my dataset centers on U.S. higher education, five key categories stand out for narrowing the data: state, private vs. public, size, institution type, and financial aid. For example, I might focus on private universities in Illinois with 5,000 to 15,000 students. To narrow the analysis further, I would pull only the key figures relevant to my questions, such as cost, employment rate, and debt and earnings levels after graduation. Not only will this narrow the scope of my dataset, it will allow for more relevant comparisons between similar institutions.
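This kind of narrowing is a straightforward row filter. A minimal sketch using Python’s csv module and a made-up stand-in for the Scorecard file (the real column names differ; undergraduate enrollment, for instance, is UGDS in the actual file):

```python
import csv
import io

# A tiny made-up stand-in for the Scorecard CSV, for illustration only.
raw = """name,state,control,enrollment
Example Private U,IL,private,8000
Big Public U,IL,public,40000
Small Private C,IL,private,1200
Out-of-State U,IN,private,9000
"""

# Keep only Illinois private institutions with 5,000-15,000 students.
reader = csv.DictReader(io.StringIO(raw))
selected = [
    r["name"]
    for r in reader
    if r["state"] == "IL"
    and r["control"] == "private"
    and 5000 <= int(r["enrollment"]) <= 15000
]
print(selected)  # ['Example Private U']
```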
In my next post, I will investigate data analysis tools I can use to visualize and present the information extracted from the data.
For my project, I plan to look at the College Scorecard dataset.
In September of 2015, the U.S. Government launched the College Scorecard website. The site is designed to allow parents, students and other interested consumers to easily compare statistics on higher education institutions. Key stats such as cost, post-graduation earnings, debt levels, employment levels, and other facts are found within the government’s data sets. Full data sets are available on the college scorecard website.
For my topic, I will be analyzing the government’s dataset in order to better understand college as both an investment and a financial decision. The raw data, containing information ranging from 1996 to 2015, are large, with each year representing about 100MB of data; the uncompressed files total 2GB. In terms of structure, each file has more than 7,700 rows and 1,700 columns.
To tackle this dataset, I plan to break the data into more manageable pieces. Some of this has already been done by the government and can be downloaded from the College Scorecard website. As stated in the reading, the ultimate goal of information architecture is to make information findable and understandable. In my next post, I plan to consider ways to break down and compare the data in more detail.