Process:
Goal:
The goal of the project was to experience the thought process of a researcher in the Digital Humanities field.
Choosing the Project:
At the time we had to choose our project, the popular series “Game of Thrones” had just come to an end. It came up in nearly every conversation, and we were curious about the dominance of each character, house and gender, season by season and across the series as a whole, measured by spoken words as a complement to the well-known plot.
Process:
-
Data collection -
-
The first step was to find transcripts of the Game of Thrones series. We used the https://genius.com/albums/Game-of-thrones website and collected the transcript text from it as is with Selenium (a Python library), saving one text file per episode.
-
Finding data sources about the characters, including names, gender and house – we queried Wikipedia pages through the Wikipedia API and the DBpedia API, extracting the character list (names of main and supporting characters) along with the gender, house and aliases of each character in the list, and stored the result as a list of Character data structures in a JSON file.
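The transcript collection step could be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper: the function names, the `s01e02.txt` file naming scheme and the choice to read the whole page body are all assumptions made for the example.

```python
from pathlib import Path


def episode_filename(season: int, episode: int) -> str:
    """Hypothetical naming scheme: one text file per episode, e.g. s01e02.txt."""
    return f"s{season:02d}e{episode:02d}.txt"


def fetch_page_text(url: str) -> str:
    """Return the visible text of a page using Selenium.

    Assumption: taking the text of the whole <body> is good enough; the real
    scraper may have targeted a more specific element on the page.
    """
    # Imported lazily so the file helpers work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.find_element(By.TAG_NAME, "body").text
    finally:
        driver.quit()


def save_transcript(text: str, season: int, episode: int, out_dir: str) -> Path:
    """Write the transcript text as is to its per-episode file."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / episode_filename(season, episode)
    path.write_text(text, encoding="utf-8")
    return path
```

Calling `save_transcript(fetch_page_text(url), season, episode, "transcripts")` once per episode page would then produce the file-per-episode layout described above.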
-
-
Parsing – we created a parser that iterates over the transcript files line by line (skipping description lines), extracts the name of the speaking character and the spoken line, and then:
-
Creates a Line data structure holding the season, the episode, the text (kept for data preservation) and the number of words.
-
Finds the character in the list of Character data structures by comparing the speaker against each name and alias.
-
Adds the new Line to the character's structure (in its list-of-Lines field).
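The three parsing steps above can be sketched as below. The field names follow the description, but the exact structures are assumptions, as is the transcript format: the sketch supposes spoken lines look like `NAME: text` and that description lines are bracketed or have no `NAME:` prefix.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    season: int
    episode: int
    text: str
    num_words: int


@dataclass
class Character:
    name: str
    gender: str
    house: str
    aliases: list = field(default_factory=list)
    lines: list = field(default_factory=list)


def find_character(characters, speaker):
    """Match the speaker against each character's name and aliases (case-insensitive)."""
    key = speaker.strip().lower()
    for character in characters:
        names = [character.name] + character.aliases
        if key in (n.lower() for n in names):
            return character
    return None


def parse_transcript(raw_lines, season, episode, characters):
    """Attach every recognised spoken line to its character."""
    for raw in raw_lines:
        raw = raw.strip()
        # Assumed format: description lines are bracketed or lack a "NAME:" prefix.
        if not raw or raw.startswith("[") or ":" not in raw:
            continue
        speaker, text = raw.split(":", 1)
        character = find_character(characters, speaker)
        if character is None:
            continue  # speaker is not in the character list
        text = text.strip()
        character.lines.append(Line(season, episode, text, len(text.split())))
```

Counting words with a plain whitespace split is the simplest choice; a different tokenisation would shift the totals slightly but not the overall approach.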
-
-
CSV creation –
-
From the information collected in the unified data structure (the JSON file) of the Characters list, we calculated and saved, for each character, the number of words per season and in total.
-
We created a main CSV with all the needed information (name, gender, house, # of words in each season, # of words in total).
-
We created additional CSVs to make building the desired graphs more convenient, for example (name, gender, house, season, # of words) or (season, # of episodes), for presentation purposes.
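A rough sketch of the CSV step, under stated assumptions: characters are represented here as plain dictionaries loaded from the JSON file, and the column names (`season_1`, `total`, `words`) are hypothetical. It writes the wide "main" layout and the long (name, gender, house, season, # of words) layout.

```python
import csv
from collections import Counter


def words_per_season(lines):
    """Sum the word counts of one character's lines, keyed by season."""
    totals = Counter()
    for line in lines:
        totals[line["season"]] += line["num_words"]
    return totals


def write_main_csv(characters, seasons, path):
    """One row per character: name, gender, house, words per season, total."""
    header = ["name", "gender", "house"] + [f"season_{s}" for s in seasons] + ["total"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for c in characters:
            totals = words_per_season(c["lines"])
            per_season = [totals.get(s, 0) for s in seasons]
            # The total covers only the seasons listed in `seasons`.
            writer.writerow([c["name"], c["gender"], c["house"], *per_season, sum(per_season)])


def write_long_csv(characters, seasons, path):
    """One row per (character, season), a shape that plotting tools handle easily."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "gender", "house", "season", "words"])
        for c in characters:
            totals = words_per_season(c["lines"])
            for s in seasons:
                writer.writerow([c["name"], c["gender"], c["house"], s, totals.get(s, 0)])
```

The long layout repeats the character attributes on every row, which is redundant but lets a charting tool filter or group by season without reshaping the data.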
-
-
Graphs – we used the CSV files to create graphs with the Tableau Public tool. We built graphs presenting the collected results along our three main subjects of interest (gender, house and individual character), with the intention of answering our research question regarding the dominance within each one throughout the seasons.
-
Interesting extra findings:
-
Word cloud - we created a word cloud from the processed transcripts of all seasons using the word cloud Python library, which prepares the text and renders an image of words in which the size of each word corresponds to its number of occurrences in the text.
-
Topic modeling – we used the gensim and nltk Python libraries to run LDA (a topic modeling algorithm) on the transcripts. The results weren’t clear and didn’t make sense in the context of the series, so we excluded them from the presentation.
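The idea behind the word cloud, where each word's size follows its number of occurrences, comes down to a frequency count. The sketch below is illustrative only: the stop-word list is a tiny made-up sample, and the rendering helper assumes the `wordcloud` package, which may or may not be the library the project used.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real libraries ship larger ones).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "i", "you"}


def word_frequencies(text):
    """Count occurrences of each lowercase word, ignoring stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def render_cloud(frequencies, out_path):
    """Render the frequencies as an image; requires the wordcloud package."""
    from wordcloud import WordCloud  # imported lazily, optional dependency

    WordCloud(width=800, height=400).generate_from_frequencies(frequencies).to_file(out_path)
```

Inspecting `word_frequencies(text).most_common(20)` before rendering is a quick way to check that the largest words in the image will be meaningful rather than filler.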
-
-
Conclusions – we used the results presented in the graphs and in the word cloud to answer the research question and to understand the findings in the context of the series (the plot that we know from watching it).
Bibliography :
Tools :
-
Tableau
-
Wix
-
Wikipedia API via the wikipediapi library
-
DBpedia API
-
Selenium
-
Word cloud python library