Pytrends is an unofficial API for tracking Google Search trends, and we can use matplotlib to visualize these trends over time. You can read more about the package in the project’s GitHub repository here.
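As a rough sketch of that workflow, the snippet below requests interest-over-time data through pytrends and plots it with matplotlib. The keywords, date range, and helper names (`build_timeframe`, `fetch_interest`, `plot_interest`) are illustrative assumptions, not taken from the original post, and the request needs network access plus the `pytrends`, `pandas`, and `matplotlib` packages.

```python
def build_timeframe(start, end):
    """Join two ISO dates into the 'YYYY-MM-DD YYYY-MM-DD' string pytrends accepts."""
    return f"{start} {end}"


def fetch_interest(keywords, start="2010-01-01", end="2020-01-01"):
    """Return a DataFrame of relative search interest, one column per keyword."""
    # Imported here so the helpers can be defined without pytrends installed.
    from pytrends.request import TrendReq

    pytrends = TrendReq(hl="en-US", tz=360)
    pytrends.build_payload(kw_list=keywords, timeframe=build_timeframe(start, end))
    df = pytrends.interest_over_time()
    # 'isPartial' flags incomplete periods; it is not a keyword series.
    return df.drop(columns=["isPartial"], errors="ignore")


def plot_interest(df):
    """Draw one line per keyword over time."""
    import matplotlib.pyplot as plt

    ax = df.plot(figsize=(10, 5))
    ax.set_xlabel("Date")
    ax.set_ylabel("Relative search interest (0-100)")
    ax.set_title("Google Search interest over time")
    plt.tight_layout()
    plt.show()


# Example call (hypothetical keywords, requires network access):
# plot_interest(fetch_interest(["data science", "machine learning"]))
```

Google Trends reports relative interest scaled from 0 to 100 rather than raw search counts, so the lines are best read as comparisons within a single request.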
Well, we all know what Google Search is. We all use it several times a day, sometimes without a second thought, to access millions of search results for a wide range of topics. Fun fact: the Google Search Toolbar was first released in 2000 for Internet Explorer 5!
By looking at a time series visualization of Google Search keywords over a decade, we can draw valuable…
Before I dive into the definition of the No Free Lunch theorem, let’s quickly discuss the context. The beauty of data science and machine learning is that no two datasets will ever be the same. The size, noise and content will always be different. Therefore, our approach to every problem must be different.
The No Free Lunch theorem states that there is no one model that works best for every problem. The assumptions of a great model for one problem may not hold for another problem. …
In data science, the key to learning any new technology will always be practicing first-hand with a project. To learn Tableau, I analyzed the survival rates of passengers on the Titanic. The full project can be found here, hosted on Tableau Public.
Anyone familiar with Kaggle, the data science and machine learning dataset resource, may already recognize the Titanic dataset. This dataset provides observations for each passenger on the Titanic and their survival outcome. For the purposes of this project, only 871 observations from the training set were used. Ultimately, out of the 2,435 total passengers on board, only…
Last week, I shared a tutorial about creating a spam filter to classify an email. You can find it linked here. In that blog, I walked through the theory behind a Naive Bayes algorithm. And as promised, this blog will be about implementing all of that code.
We can test out the model by feeding in real-life data. A popular dataset that is commonly used for spam filter testing is the SpamAssassin public corpus. We’ll be looking at the files prefixed with 20021010. …
Creating a spam filter isn’t a new concept, but it’s important to understand the underlying theory that drives these predictions. Furthermore, understanding the theory behind machine learning algorithms in general is crucial for a Data Scientist to effectively implement them on real-life data.
Naive Bayes classifiers are a popular statistical technique used for email filtering. These algorithms typically use bag of words features to identify spam emails. This baseline technique can tailor itself to the email needs of individual users and give a low false positive rate, which is generally acceptable to users.
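To make the bag-of-words idea concrete, here is a minimal sketch of a multinomial Naive Bayes spam filter using only the standard library. It is not the author's implementation: the class name, the whitespace tokenizer, the add-one (Laplace) smoothing, and the toy messages are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Bag-of-words Naive Bayes with additive smoothing (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; 1.0 is add-one smoothing
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, messages, labels):
        # Count how often each word appears in each class.
        for text, label in zip(messages, labels):
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def _log_score(self, tokens, label):
        # log P(label) + sum of log P(word | label); assumes both classes
        # were seen during training so the prior is never zero.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + self.alpha * len(self.vocab)
        for word in tokens:
            if word in self.vocab:  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + self.alpha) / denom)
        return score

    def predict(self, text):
        tokens = tokenize(text)
        return max(("spam", "ham"), key=lambda lbl: self._log_score(tokens, lbl))


# Toy training data, invented for this example.
clf = NaiveBayesSpamFilter()
clf.fit(
    ["win free money now", "free prize claim now",
     "meeting at noon tomorrow", "lunch tomorrow with the team"],
    ["spam", "spam", "ham", "ham"],
)
print(clf.predict("claim your free prize"))
print(clf.predict("team meeting tomorrow"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term keeps a word that appears in only one class from driving the other class's probability to zero.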
The key to the Naive Bayes algorithm…
A key skill for any Data Scientist is the ability to write production-quality code to create models and deploy them into cloud environments. Typically, working with cloud computing and data architectures falls under the Data Engineer job title. However, every data professional is expected to be a generalist who can adapt and scale their projects.
Here is an introduction to popular platforms that I have seen across dozens of job descriptions. This doesn’t mean that we have to become experts overnight, but it helps to understand the services that are out there. …
In general, when people think of Natural Language Processing (NLP), they tend to restrict it to English. This stems from the assumption that English is the only language NLP can be applied to. Because of this linguistic bias, I decided to investigate how to preprocess Chinese text data for NLP.
First, I would like to thank my cohort mate David Bruce for pointing out this disparity. In his blog post on Learning a New Language in a Word Cloud, he shared that Professor Emily M. …
Now that I’m ten weeks deep in a fifteen-week Data Science program, there’s still a subject that weighs heavily on my mind. From the very start of the program, the language and educational examples have almost always been binary. This makes sense, given that computers themselves are binary machines. But Data Science specifically deals with real-life human data and problems. So, it should be able to adapt to the evolving identities of the people it’s about. To be clear, this isn’t a critique of the program that I’m in; this is a widespread problem across industries.
As a disclaimer, I’m writing this…
Most data scientists refer to either Python or R as their “go-to” programming language. Both have vast software ecosystems and communities, so either language is suitable for almost any data science task.
So the question is, which language should an aspiring data scientist learn first? Long story short, the answer is usually Python. However, each language has its own strengths and weaknesses to consider before diving in head first.
Additionally, it’s important to note that Python and R are not the only programming languages or tools that can be used for data science. …
You have the right to know what companies do with your personal data.
Currently, I’m enrolled in a 15-week data science bootcamp with Flatiron School. Although it’s only the fourth week, I have already dived head-first into the world of data science. And in such a short amount of time, we’ve covered a range of data analysis tools and methods, such as Python, SQL and JSON. However, I’ve noticed that there has been a lack of discourse surrounding the ethics of data science, which is not too surprising. …
Data Scientist | Machine Learning | Digital Media Studies