This blog post was originally written for a series of articles by ZEEF curators. In this case, it was a contribution on behalf of my ZEEF Data Science page. I’m providing 7 essential resources & tips, trying to predict which information related to data science the readers might find interesting.
#1: Data science
Data science is an umbrella term for a collection of techniques from many distinct areas, such as computer science, statistics, and machine learning, to name just a few. The main objective is to extract information from data and turn it into knowledge on which you can base your decisions. It sounds easy, but it is rarely straightforward. Usually, the process comprises many steps, starting with a research question. Once you know what you want to study, you need to obtain the right data, clean it, explore it, and create and evaluate a model. You repeat this cycle a couple of times, and only then are you ready to look for a way to communicate your results properly. The Python for Data Analysis book is a great starting point: it guides you through all these stages and helps you get this workflow under your skin.
I believe that you’re eager to do some real ‘data science’ now. First of all, you need an interesting dataset to play with. Either you already have your own data (congratulations!) or you need to acquire some. As you’ve probably heard before, we happen to be living in the age of information overload, which presumably means that data is everywhere and it’s easy to get it, right? Yes and no. Data is wherever you look; however, it’s not always trivial to get what you want. The path of least resistance when searching for data is to explore publicly available datasets. People tend to organize them in curated lists such as ‘Awesome Public Datasets’ by Xiaming Chen. Alternatively, you can use one of the data repositories, like datahub.io. If you don’t succeed, you can try to find a public API and collect the precious data yourself. Chances are high that such an API is not available or is very limited. In that case you have to find a way to extract the data by other means, for example, by scraping webpages. This approach typically requires some data-cleaning steps, which might be costly in terms of time and effort.
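To give a feeling for what such data-cleaning steps can look like, here is a minimal sketch using only Python's standard library. The raw text, the column names, and the `clean_rows` helper are all made up for illustration; real scraped data is usually messier and needs rules tailored to the source.

```python
import csv
import io

# Made-up "scraped" data: stray whitespace, a missing value, a duplicate row.
RAW = """name,count
 Alpha ,120
Beta,
Alpha,120
Gamma,45
"""

def clean_rows(text):
    """Strip whitespace, drop incomplete rows, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        name = (row["name"] or "").strip()
        count = (row["count"] or "").strip()
        if not name or not count:
            continue  # skip records with a missing field
        record = (name, int(count))
        if record in seen:
            continue  # skip rows we have already kept
        seen.add(record)
        cleaned.append(record)
    return cleaned

rows = clean_rows(RAW)
print(rows)
```

Even in this toy case, three separate decisions were needed (trim, drop, deduplicate), which hints at why cleaning tends to eat so much time.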
Having a good understanding of statistics is extremely helpful when performing data analysis. Actually, without some basics in stats, it’s hard to come to reliable conclusions in many data science scenarios. A rule of thumb says that the first step after getting a dataset is to have a quick look at it, and basic descriptive statistics are a good friend of yours here. If your dataset contains numerical variables, you might be interested in their distributions: their center (e.g., the mean) and how spread out they are (e.g., the variance).
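As a quick illustration of that first look, the sketch below computes a center and a spread for a small, made-up list of numbers with Python's built-in `statistics` module. The variable names are only for illustration.

```python
import statistics

# Made-up numerical variable.
values = [2, 4, 4, 4, 5, 5, 7, 9]

center_mean = statistics.mean(values)        # arithmetic mean
center_median = statistics.median(values)    # middle value, robust to outliers
spread_pop = statistics.pvariance(values)    # variance, treating values as the whole population
spread_sample = statistics.variance(values)  # variance, treating values as a sample (divides by n - 1)

print(center_mean, center_median, spread_pop, round(spread_sample, 3))
```

Note the two variance functions: which one is appropriate depends on whether your data is the whole population or only a sample drawn from it.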
Sometimes, you run into trouble with datasets that are too large to process. Then it’s time to apply one of the sampling methods, and statistics answers the question of which data points to choose so that they represent the whole dataset in the best possible way. In short, statistics offers you a toolbox for understanding your data, distinguishing between correlation and causation, analyzing patterns, modeling, predicting, etc. Last but not least, statistics quantifies the uncertainty of your outcomes and therefore tells you how much confidence you can place in your results. In our ZEEF list you can find, among others, this awesome hands-on tutorial called ‘An Introduction to Statistics’ prepared by Thomas Haslwanter.
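The simplest sampling method is simple random sampling, where every data point has the same chance of being picked. The sketch below draws such a sample from a synthetic "large" dataset and uses the standard error to express how uncertain the sample mean is. The dataset is generated on the spot, and the sizes and seed are arbitrary assumptions.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic stand-in for a dataset that is too large to process in full.
population = [rng.gauss(50, 10) for _ in range(100_000)]

# Simple random sample without replacement.
sample = rng.sample(population, 1_000)

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# Standard error of the mean: a rough measure of how far the
# sample mean is likely to be from the population mean.
std_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f} (standard error {std_error:.2f})")
```

Other schemes, such as stratified sampling, may represent the data better when it contains distinct subgroups.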
#4: Machine learning
In layman’s terms, the goal of machine learning algorithms is to learn to make decisions based on data. This approach, in contrast to designing hard-coded algorithms, has the huge benefit that one method can serve many purposes. Moreover, machine learning systems are designed to improve as new data comes in. And that’s exactly why your Amazon account looks different when you’re logged in than when you’re not: as you browse the catalog, it learns your preferences. Google search, to mention another example, is constantly learning the importance of webpages. You don’t have time to manually inspect the thousands of results it returns; all you want is for the ten blue links to be the best hits. If you want to start with machine learning right away, you should visit Joseph Misiti’s GitHub repository with a great hack-first-get-serious-later tutorial called Dive into Machine Learning. It uses Python and one of its most popular ML libraries, scikit-learn.
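To make the idea of learning from data rather than hard-coding rules a bit more concrete, here is a toy nearest-centroid classifier written in plain Python. It is not scikit-learn and not taken from the tutorial mentioned above; the `fit` and `predict` helpers and the tiny training set are made up for illustration.

```python
import statistics

def fit(points, labels):
    """Learn one centroid (the mean point) per label from the training data."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return {
        label: tuple(statistics.fmean(coords) for coords in zip(*pts))
        for label, pts in groups.items()
    }

def predict(centroids, point):
    """Return the label whose centroid is closest to the given point."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, point))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Made-up training data: two features per example, two classes.
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_labels = ["small", "small", "small", "large", "large", "large"]

model = fit(train_points, train_labels)
print(predict(model, (2, 2)))
print(predict(model, (7, 9)))
```

Nothing in `fit` or `predict` knows what "small" or "large" means. Feed the same two functions different data and they serve a different purpose, and calling `fit` again with more examples updates the centroids.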
I’ve already mentioned the descriptive power of statistics. Let me illustrate the importance of visualization with one example where simple statistics are not enough: Anscombe’s quartet. It is a collection of four different datasets, each with two variables, x and y. Interestingly, these datasets appear nearly the same through the lens of statistics. They share almost identical values of the following properties: mean of x, sample variance of x, mean of y, correlation between x and y, and linear regression line. Yet once you plot them, they turn out to be very dissimilar.
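You can check these properties yourself. The sketch below hard-codes the quartet’s values as published by Anscombe and computes the summary statistics listed above with plain Python. The `summarize` helper is a name chosen here for illustration.

```python
import statistics

# Anscombe's quartet: datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Compute means, sample variance of x, correlation, and the least-squares line."""
    n = len(xs)
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "var_x": sxx / (n - 1),
        "mean_y": mean_y,
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }

summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, summary in summaries.items():
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

The four printed rows come out practically identical, which is exactly why a scatter plot of each dataset is the only way to see how different they are.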
Data visualization is important both when analyzing data and when conveying your findings. The human eye and brain are great co-workers when it comes to recognizing patterns. They make it easy for us to immediately spot relationships, trends, outliers or anomalies in visualizations, especially for low-dimensional data. Whenever possible, you should try to leverage the enormous bandwidth of the human visual system and explain your data in graphical form. You can find many dataviz resources on the ZEEF Data Science page, but I’d recommend that you first get some inspiration from this amazing overview of visualizations based on the D3.js library.
Data science in various forms is being introduced as a new program at many universities around the world. Massive online courses go hand-in-hand with this trend, and you can already find a plethora of free or very affordable courses that will guide you from Introduction to Data Science, through Data Analysis and Statistical Inference, Data Mining or Data Visualization, to Machine Learning taught by Andrew Ng.
An animated visualization of cultural mobility in the world between 600 BC and the present, revealing people’s migration patterns. The animation is based on a publication by M. Schich et al., with data extracted from the publicly available Freebase knowledge base.
Now that you have all the pieces together, it’s time to apply your knowledge in practice. And what could be more fun than participating in a competition? Data science challenges, such as those on Kaggle, are a great opportunity to test your own abilities and to learn from others (you’ll also get nice data for free). On top of that, if you manage to win, you might be offered a dream job or at least a lot of money. If that doesn’t tickle your fancy, some competitions (e.g., DrivenData.org) offer another, more noble reward: saving the world!
Use your data science skills for doing good: DrivenData.org.