Siobhan Grayson (UCD) – Data Science Student of The Year Finalist

Siobhan Grayson has made it to the finals of the Data Science Student of the Year Award, powered by Core Media.

On August 17th, in Croke Park, Siobhan was one of a number of student finalists competing to be this year’s Data Science Student of the Year. It was a bit of a nerve-wracking experience for everyone!

We caught up with Siobhan, who truly is an ambassador for Data Science and is heavily involved in PyData Dublin. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other.

We were delighted that Siobhan was keen to write a guest blog for us, even though she had not written one before. This did not stand in her way: while researching how to approach it, she delved into the insights and facts of what our blogs and their performance had achieved to date!

This will be an interesting read that highlights the growth and engagement of the Awards and will also give you some tips if you want to analyse your own online performance!

Thanks Siobhan, we look forward to welcoming you and the other finalists on September 21st!

Check out what Siobhan had to say:

So here’s my confession: this is my first blog post. I had a lot of ideas but I wasn’t sure what would be most suitable for the DatSci Awards. To get inspiration, I decided it would be a good idea to review what other DatSci bloggers had submitted before me (see here for previous posts). One way to do this is to close read each post in turn. Another way is to take a distant reading perspective, an approach that allows higher-level comparisons to be made across all posts. In this case, I went with the latter as it allows me to demonstrate some simple text analysis techniques that can be used to gain insights about the type of blog posts being submitted.

Collect the Data

The first step in this approach is to collect the data for our corpus. In this case, I used a Python package called Scrapy:

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Scrapy has great documentation and the code below is just an adaptation of the sample spider project they walk through on their tutorial page. The details that needed to be tailored specifically for the DatSci Awards site were the CSS Selectors. In order to know which parts of the page I needed to collect, I had to inspect the tags used within the site’s HTML. For example, to collect the Title of each blog post I used the following CSS Selector:

This means: return only the text contained between the <h1 class=entry-title> tags, which are themselves contained between the <article> tags of the page. The code in its entirety is as follows:


Once you have followed the guidelines for setting up a new project and have saved the code above as your spider, from your terminal, navigate to the same directory as your spider project and run:

This will save the title, date, and text content of each post in a file called ‘blogs.json’.

Analyse the Data

Now that we have our data, let’s import it into a Pandas DataFrame and begin to analyse it. For this part, I have also uploaded a Jupyter Notebook containing the code, which can be found here.
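
As a minimal sketch of this import step, the snippet below loads two made-up records shaped like the spider output (title, date, content) into a DataFrame called df_blogs. The records and the date format are assumptions; with a real crawl you would pass the path to blogs.json instead of the in-memory sample.

```python
import io

import pandas as pd

# Two made-up records in the shape the spider is described as saving.
sample_json = io.StringIO("""[
  {"title": "Post B", "date": "2017-05-02", "content": "Second post text."},
  {"title": "Post A", "date": "2016-09-23", "content": "First post text."}
]""")

# With a real file this would be pd.read_json("blogs.json", convert_dates=False).
df_blogs = pd.read_json(sample_json, convert_dates=False)

# Parse the date strings explicitly so later sorting is chronological.
df_blogs["date"] = pd.to_datetime(df_blogs["date"])
print(df_blogs.dtypes)
```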

To start, I’m going to preprocess the textual content of each blog post using a Python package called TextBlob:

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

The TextBlob processed text is stored in a new column called blob_content in our df_blogs DataFrame.

To get an overall sense of the blog posts submitted to the DatSci Awards, I will visualize the word count and sentiment score of each post sorted by date. TextBlob makes this easy as it has built-in functionality for both tasks.

  • To get the word count of a blog post, I just need to take the TextBlob words property and find its len().
  • To get the sentiment score for a post, I can just use the TextBlob property sentiment.polarity.

Plot the Results

To plot the results sorted by date, I’ll sort df_blogs by the date column and then set it to be the index of the DataFrame. I’ll then change the format of the date using t.strftime('%d-%m-%Y') (day-month-year) to eliminate seconds, minutes, and hours from appearing in the final plots.
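
A minimal sketch of this sorting and reformatting step, using two made-up rows in place of the real posts:

```python
import pandas as pd

df_blogs = pd.DataFrame({
    "title": ["Later post", "Earlier post"],
    "date": pd.to_datetime(["2017-05-02 09:30:00", "2016-09-23 14:05:10"]),
})

# Sort chronologically, then make the date the index of the DataFrame.
df_blogs = df_blogs.sort_values("date").set_index("date")

# Keep only day-month-year so the time of day does not clutter the plots.
df_blogs.index = df_blogs.index.map(lambda t: t.strftime("%d-%m-%Y"))

print(list(df_blogs.index))  # ['23-09-2016', '02-05-2017']
```

Sorting has to happen before the dates are turned into strings, because day-first strings do not sort chronologically.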

For the purposes of plotting, I’ll use a Python library called Matplotlib:

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

I’m also going to import the library Seaborn as this will automatically apply the default Seaborn plot style which I prefer to Matplotlib’s.


The following function can then be used to produce the plots displayed in Fig. A and Fig. B below.
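
As a rough sketch of such a plotting function: bar length encodes word count and bar colour encodes sentiment. The name senti_plot_sketch, its parameters, the RdYlGn colour map, and the toy data are all assumptions made for illustration, and the Seaborn styling is left out to keep the example short.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize


def senti_plot_sketch(df, filename, vmin=-1.0, vmax=1.0, cmap_name="RdYlGn"):
    """Horizontal bars: length = word count, colour = sentiment polarity."""
    norm = Normalize(vmin=vmin, vmax=vmax)
    cmap = plt.get_cmap(cmap_name)
    colours = [cmap(float(norm(s))) for s in df["sentiment"]]

    fig, ax = plt.subplots(figsize=(6, 1 + 0.4 * len(df)))
    positions = list(range(len(df)))
    ax.barh(positions, df["word_count"], color=colours)
    ax.set_yticks(positions)
    ax.set_yticklabels(df.index)
    ax.set_xlabel("Word count")

    # Colour bar showing how sentiment values map to colours.
    mappable = ScalarMappable(norm=norm, cmap=cmap)
    mappable.set_array([])
    fig.colorbar(mappable, ax=ax, label="Sentiment polarity")

    fig.savefig(filename, bbox_inches="tight")
    plt.close(fig)
    return colours


# Made-up data, indexed by formatted date strings.
toy = pd.DataFrame(
    {"word_count": [650, 2400, 900], "sentiment": [0.15, 0.20, 0.35]},
    index=["01-06-2016", "10-05-2017", "20-08-2017"],
)
out_path = os.path.join(tempfile.mkdtemp(), "sentiment.png")
bar_colours = senti_plot_sketch(toy, out_path)
```

Passing different vmin and vmax values changes only the colour scaling, which is the difference between the two figures described below.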

For Fig. A, I use the following parameters when calling the function:

For Fig. B, I modify the code to normalise the colour range between the minimum sentiment value and maximum sentiment value ([min sentiment = 0.102, max sentiment = 0.386]) recorded for our dataset.
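
To see why this rescaling helps, here is a small sketch using Matplotlib’s Normalize class (one possible way to do the scaling, not necessarily the one used for the figures). It maps the same three sentiment scores onto the 0 to 1 colour axis under both ranges.

```python
from matplotlib.colors import Normalize

full_range = Normalize(vmin=-1.0, vmax=1.0)     # scaling used for Fig. A
data_range = Normalize(vmin=0.102, vmax=0.386)  # scaling used for Fig. B

for score in (0.102, 0.244, 0.386):
    on_full = float(full_range(score))  # (score + 1) / 2
    on_data = float(data_range(score))  # (score - 0.102) / (0.386 - 0.102)
    print(score, round(on_full, 3), round(on_data, 3))
```

On the full scale every score lands in a narrow band between roughly 0.55 and 0.69 of the colour axis, so the bars look alike. On the data range the same scores spread across the whole axis from 0 to 1.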

By executing the code above we get the following figures:

Fig. A: Color scaled between [-1 (negative sentiment), 1 (positive sentiment)].
Fig. B: Color scaled between [min sentiment = 0.102, max sentiment = 0.386].

Fig. A provides a means of comparing the sentiment of the posts against the entire range of values that are possible. In the case of TextBlob:

  • A value of 0 indicates a post is neutral in sentiment.
  • Values that extend from 0 to -1 represent increasingly negative degrees of sentiment.
  • Values that extend from 0 to +1 represent increasingly positive degrees of sentiment.

In Fig. A, all blogs appear green to light green in colour, reflecting that all posts are positive in sentiment and that this positivity has carried through from 2016 into 2017. In other words, it’s a nice visual representation of the enthusiasm shared by bloggers for Data Science and the DatSci Awards.

To get a better idea of how sentiment varies just within the corpus itself, I then replotted the same figure but with colour values normalised between the minimum sentiment (0.102) and maximum sentiment (0.386) recorded for our dataset. In this case, two posts stand out, both appearing bright yellow in colour. We’ll come back to these later to find out what the authors were talking about.

Another result that stands out from the figures is that one post appears to be 4 times longer than the rest. To find the average word count of the posts and other summary statistics, I used pandas.DataFrame.describe:

“Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.”

Executing df_blogs.describe() produces the table in Fig. C.

Fig. C: The table that results when the Pandas describe() function is applied to our df_blogs DataFrame.
Fig. D: A bar chart depicting the number of posts that are published each month.

  • Count gives the number of blog posts that are currently available on the DatSci Awards website. Hence, it is the same, 29, for both the word_count and sentiment columns.
  • The mean word_count is 838 words, while the max is 4304, so the longest post is in fact not 4 but over 5 times longer than the average post.
  • The min sentiment of 0.102 is indeed within the positive range of values.
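
The summary above can be reproduced in miniature as follows. The four-row DataFrame is made up; only the final line uses the figures quoted from Fig. C.

```python
import pandas as pd

# Made-up stand-in for df_blogs with the two numeric columns.
toy = pd.DataFrame({
    "word_count": [500, 700, 900, 4300],
    "sentiment": [0.10, 0.20, 0.25, 0.39],
})

stats = toy.describe()  # rows: count, mean, std, min, 25%, 50%, 75%, max
print(stats.loc[["count", "mean", "min", "max"]])

# Ratio of the longest post to the mean, using the values quoted in the post.
print(round(4304 / 838, 2))  # 5.14
```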

Fig. D depicts the number of blog posts that are published per month. We can see that the DatSci Blog was active between the months of May and September last year. This coincides with the DatSci Awards season, from the announcement of submission deadlines, right up until the awards ceremony itself held in September. In fact, the last two blog posts of 2016 occur the day after the awards ceremony held on the 22nd September 2016.

The next post appears at the start of May 2017, denoting the start of the new DatSci Awards season. So, even with no idea what the data represents, we can see that whatever it is, it is seasonal. So far, 2017 has had 13 blog submissions, 7 more than for the same time period last year. Given the trend, I think it’s likely that 2017 will have a higher number of posts than 2016, but will it also beat the highest number of submissions in a month, currently held by September 2016 with 10 posts? I guess we’ll have to check back in 30 days’ time to find out.

For completeness, I have included the code for producing Fig. D below. Note that from datetime import datetime is required, while the function itself is called by month_hist(df_blogs, 'month_count.pdf').
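
One way such a function could be written is sketched below. It assumes the DataFrame index holds dates as day-month-year strings, as set up earlier, and the name month_hist_sketch, the body, and the toy data are illustrative rather than the original implementation.

```python
import os
import tempfile
from collections import Counter
from datetime import datetime

import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
import pandas as pd


def month_hist_sketch(df, filename):
    """Save a bar chart of posts per month and return the counts."""
    # Turn each 'dd-mm-YYYY' index entry into a sortable 'YYYY-mm' label.
    months = [datetime.strptime(day, "%d-%m-%Y").strftime("%Y-%m") for day in df.index]
    counts = Counter(months)
    labels = sorted(counts)

    fig, ax = plt.subplots()
    ax.bar(labels, [counts[label] for label in labels])
    ax.set_xlabel("Month")
    ax.set_ylabel("Number of posts")
    fig.savefig(filename, bbox_inches="tight")
    plt.close(fig)
    return dict(counts)


# Made-up data: two posts in one month, one in another.
toy = pd.DataFrame(
    {"word_count": [650, 700, 900]},
    index=["22-09-2016", "23-09-2016", "02-05-2017"],
)
out_path = os.path.join(tempfile.mkdtemp(), "month_count.pdf")
month_counts = month_hist_sketch(toy, out_path)
print(month_counts)
```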

What are DatSci Bloggers Writing about?

So far, we know:

  • When DatSci bloggers like to post,
  • How long their posts tend to be, and
  • The sentiment of their posts.

But we still have no idea as to what they’re actually writing about. As of right now, if you didn’t know the source of the data, it could be anything from summer travel diaries to movie reviews (where all the movies so far have been good). Or even if you do know the source, posts could still be potentially irrelevant to the DatSci Awards themselves. To address this, I’m going to apply a method known as Term Frequency-Inverse Document Frequency or TF-IDF for short:

A numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

This will allow us to examine what words are most important for individual blog posts in comparison to the overall corpus, and hence, will give a sense of the different topics DatSci bloggers like to cover.

Note 1: A portion of the code I will use next is an adaptation of Steven Loria’s implementation, which conveniently uses TextBlob too. He has written an informative blog post on the topic, which can be found here. Comments at the start of code cells will denote when content has been adapted from Loria’s implementation.

Note 2: For those familiar with scikit-learn you might prefer sklearn.feature_extraction.text.TfidfVectorizer.

Before applying TF-IDF I’m going to further preprocess the corpus by removing common words such as the, and, I, my, is, etc. These words are known as Stop Words and are usually removed as they generally provide little value when attempting to classify the content of a document. To do this, I will use the stop word list from the Python library Natural Language Toolkit (NLTK).
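
The idea can be sketched without any downloads by using a tiny hand-written stop list. This list and the remove_stop_words helper are illustrative stand-ins; NLTK’s English list, available via nltk.corpus.stopwords.words("english") once the corpus has been downloaded, is much longer.

```python
import re

# A tiny illustrative stop list, standing in for NLTK's English list.
STOP_WORDS = {"the", "and", "i", "my", "is", "a", "of", "to"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stop_words]


print(remove_stop_words("The winner of the DatSci Awards is my hero"))
# ['winner', 'datsci', 'awards', 'hero']
```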

The following functions, taken from Steven Loria’s blog post, will be used to compute the TF-IDF for each document.
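
For reference, here is a sketch of TF-IDF in the same spirit. It is an assumption-laden stand-in: it works on plain lists of tokens rather than TextBlob objects, uses a +1 smoothing term in the IDF denominator, and adds a hypothetical top_terms helper for picking the highest-scoring words.

```python
import math


def tf(word, doc):
    """Term frequency: share of the document's tokens that are `word`."""
    return doc.count(word) / len(doc)


def n_containing(word, corpus):
    """Number of documents in the corpus that contain `word`."""
    return sum(1 for doc in corpus if word in doc)


def idf(word, corpus):
    """Inverse document frequency, with +1 in the denominator as smoothing."""
    return math.log(len(corpus) / (1 + n_containing(word, corpus)))


def tfidf(word, doc, corpus):
    """TF-IDF score of `word` for one document within the corpus."""
    return tf(word, doc) * idf(word, corpus)


def top_terms(doc, corpus, n=3):
    """The n highest-scoring distinct terms of one document."""
    scores = {word: tfidf(word, doc, corpus) for word in set(doc)}
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Three made-up, already tokenised documents.
corpus = [
    ["marathon", "runner", "pacing", "runner"],
    ["awards", "winner", "data"],
    ["data", "science", "awards"],
]
print(top_terms(corpus[0], corpus, n=1))  # ['runner']
```

With this smoothing, a word that appears in two of the three documents scores exactly zero, and a word that appears in every document would score below zero, so common words sink to the bottom of the ranking.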

The next code sample demonstrates how I apply the above functions to the corpus and store the top 3 most important terms in a dictionary, where the keys of the dictionary are the blog titles.

The TF-IDF dictionary is then converted into a DataFrame and merged, using the title column, to the original DataFrame.
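
A minimal sketch of this conversion and merge, with a hand-written dictionary standing in for the real TF-IDF output. The titles, terms, and the tfidf_terms column name are made up for illustration.

```python
import pandas as pd

df_blogs = pd.DataFrame({"title": ["Post A", "Post B"], "word_count": [650, 900]})

# Hypothetical TF-IDF output: blog title -> top three terms.
top_terms_by_title = {
    "Post A": ["winner", "awards", "ceremony"],
    "Post B": ["runner", "pacing", "marathon"],
}

# One row per title, with the three terms joined into a single label.
df_tfidf = pd.DataFrame(
    [(title, " - ".join(terms)) for title, terms in top_terms_by_title.items()],
    columns=["title", "tfidf_terms"],
)

# A left merge keeps every original post, even one without TF-IDF terms.
df_merged = df_blogs.merge(df_tfidf, on="title", how="left")
print(df_merged)
```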

Finally, I adapted the previous sentiment plotting function, senti_plots, such that the top three most important words for each post would appear over their respective bar. In particular:

  • I added alpha = 0.6 so that colours would be lighter,
  • I added a for loop for annotating each post with its respective TF-IDF terms, and
  • I changed Seaborn’s style to remove the grey grid background so that text is more legible.

The adapted function was then called, which resulted in the following plot, Fig. E:

Fig. E: Color represents sentiment values scaled between [min sentiment = 0.102, max sentiment = 0.386]. Bar length = word count of blog post. Three words overlaying each bar are the top 3 TF-IDF terms for that post.

Immediately, our original sentiment figure becomes even more informative:

  • The last post of 2016 is about who won: “winner – datsciawards – 22”.
  • The longest post has: “pb (personal best) – runner – pacing” which aligns with Barry Smyth’s guest blog titled Using AI to Run your Best Marathon.
  • We can now also see what the post with the highest positive sentiment is about: “oisin – ceadar – boydell”. This is in fact a post by a previous winner, Oisin Boydell, describing what it is like to have won as a part of the CeADAR research team.
  • Finally, another nice finding, especially as someone that will be attending the awards this year, is that the sentiment of the posts by Paul Hayes, DatSci Awards compère, increases from 2016 (15-09-2016) to 2017 (31-08-2017). Clearly, the DatSci Awards were a lot of fun last year, as even the compère can’t hide their excitement for this year’s event!

Topic Modeling

The very last type of text analysis I’m going to conduct is known as Topic Modeling:

Topic modeling can be described as a method for finding a group of words (i.e. a topic) from a collection of documents that best represents the information in the collection.

Hence, although TF-IDF has provided insight into what individual posts are about, topic modeling builds on TF-IDF to learn what topics are common across multiple posts. In addition, since there are two years of data available, we can also investigate whether these topics changed between 2016 and 2017. Therefore, for this task I’m going to use a method known as Dynamic Topic Modeling via Non-negative Matrix Factorization, an approach developed by Dr. Derek Greene from the Insight Centre for Data Analytics, who in the spirit of open science has made the code freely available online.

First, I need to prepare the data so that it’s in a suitable format for the library to work. To keep things simple I’m going to break the data into two different time windows, one for each year.
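
One way to prepare such windows is sketched below, assuming the library reads one directory of plain-text files per time window (check its ReadMe for the exact layout it expects). The write_time_windows helper, the file naming, and the toy posts are all made up for illustration.

```python
import os
import tempfile

import pandas as pd


def write_time_windows(df, out_dir):
    """Write each post to <out_dir>/<year>/post_NNN.txt; return {year: count}."""
    written = {}
    for year, group in df.groupby(df["date"].dt.year):
        window_dir = os.path.join(out_dir, str(year))
        os.makedirs(window_dir, exist_ok=True)
        for n, text in enumerate(group["content"]):
            path = os.path.join(window_dir, f"post_{n:03d}.txt")
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(text)
        written[int(year)] = len(group)
    return written


# Made-up posts spanning the two years.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-09-23", "2017-05-02", "2017-08-31"]),
    "content": ["First post text.", "Second post text.", "Third post text."],
})
window_root = tempfile.mkdtemp()
windows = write_time_windows(toy, window_root)
print(windows)
```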

Now I can start to follow the steps outlined in the ReadMe for the library.

Step 1 Navigate to the directory dynamic-nmf-master and execute the following:

Step 2 I’m going to set k=2 so that the algorithm looks for 2 topics:

Step 3 Finally, to view the results (see Fig. F and Fig. G) run:

Fig. F: The two 2016 topics found.
Fig. G: How the two topics from 2016 look in 2017.

At first glance, the two topics in 2016 look very similar. So, to make it a little more obvious and see if there are two distinct topics present, let’s just look at the non-overlapping terms:

From the above table, the two topics that underlie the posts of 2016 become a bit clearer. Topic 1 appears to be concerned with analytics in Ireland in general, and how it is applied by both companies and research. Topic 2 is much more centered on the event of the DatSci Awards itself, taking place in September.
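
Finding the non-overlapping terms of two topics is a small set operation, sketched here on made-up term lists. The real lists come from the topic model output, and these example terms are assumptions chosen only to echo the description above.

```python
# Made-up top terms for two topics.
topic_1 = ["data", "analytics", "ireland", "companies", "research", "awards"]
topic_2 = ["data", "awards", "datsci", "september", "event", "analytics"]

# Keep each topic's own ranking order while dropping the shared terms.
shared = set(topic_1) & set(topic_2)
only_in_1 = [term for term in topic_1 if term not in shared]
only_in_2 = [term for term in topic_2 if term not in shared]

print(only_in_1)  # ['ireland', 'companies', 'research']
print(only_in_2)  # ['datsci', 'september', 'event']
```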

Both topics then shift in 2017 to become identical in their focus. How they change, in my opinion, is a reflection of how the relevance of the DatSci Awards has already been established in the previous year. There is no need to explain that data analytics is present in Ireland as they have already made that fact clear. Therefore, more concentration can now be placed on the event itself and the value that it brings to the Irish data science community, rather than having to prove that a data science community in Ireland exists.

Conclusion

To conclude, although this post has turned out slightly longer than I originally intended, I hope it has been informative both from an implementation point of view and from the insights that were made along the way. It’s been great to get a bird’s-eye view of the diverse range of topics being covered by DatSci bloggers, and also to visualise how the DatSci Awards have inspired such positivity. For the complete code implementation and data behind this post, the repository can be found here. If you’re interested in learning more about the different open source libraries and tools available to data scientists, and how they’re implemented, then you should pop along to one of the PyData Dublin Meetups. These are held monthly and details of the next event can be found here.

You can be part of the DatSci Awards as well on the 21st of September in Croke Park, Dublin, Ireland. Be sure to get your ticket for a great opportunity to talk to and learn from over 400 leading professionals in the Data Science community!

By | 2017-09-07T12:28:56+00:00 September 4th, 2017|
