Data science - introduction
In its simplest terms, data science is the practice of understanding and analyzing data using scientific methods. In more complex terms, it is an interdisciplinary field of study that combines the skills of a statistician, a computer scientist, and a subject matter expert to understand and draw conclusions from data. Data science can help you understand the value in your data, evaluate the true impact of changes to your business processes or policies, and predict future trends and outcomes so you can make faster and better strategic decisions.
Data science is a field that has been growing at an incredible rate over the past few years. Just how fast has it grown? According to a recent report, job postings for data scientists have increased by 650% since 2012, and job openings will increase another 15% over the next 10 years.
But while data science is one of the hottest fields around right now, it’s still important to make sure that you’re doing your research before joining the industry. Some people might tell you that data science is a “get rich quick” scheme, but that’s not always the case — especially when you’re just starting out.
So what does it take to get started in data science? What does a typical day look like for a data scientist? And where are all these jobs popping up?
These are some of the things we will discuss in this blog.
If you’re a data scientist, you probably don’t need to be told that your job is in demand. According to a recent Glassdoor study, the 50 best jobs in America right now are dominated by careers in data science and analytics. Among the top 25 are four different data-centric roles: data scientist (No. 1), data engineer (No. 9), analytics manager (No. 12), and business intelligence developer (No. 21).
"What does a Data Scientist do?"
If you were to ask 10 data scientists this question, you'd get 10 different answers. It wouldn't be surprising if some of the definitions were quite vague. That's because the field of data science is still new. It's very likely that there isn't a single person in the world who can answer this question with absolute authority. Even the definition of data science will change as we gain more experience and refine our understanding of how we work with data.
Data science tasks include data manipulation, calculation, and visualization.
Data manipulation is a daily task for data scientists. Data comes from different sources in different formats. To get the data we need, we may have to remove irrelevant columns and rows from a dataset. We may also need to merge two datasets and generate new features.
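The three steps just described (removing irrelevant columns and rows, merging two datasets, and generating a new feature) can be sketched in plain Python. This is only a minimal illustration: the field names, the values, and the threshold of 50 for a "large" order are all made up for the example, and in practice a dataframe library would usually do this work.

```python
# Hypothetical example data: field names and values are invented for illustration.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 40.0, "internal_note": "x"},
    {"order_id": 2, "customer_id": "c2", "amount": None, "internal_note": "y"},
    {"order_id": 3, "customer_id": "c1", "amount": 60.0, "internal_note": "z"},
]
customers = {
    "c1": {"country": "DE"},
    "c2": {"country": "US"},
}


def prepare(orders, customers):
    """Clean the order rows, merge in customer data, and add one feature."""
    prepared = []
    for row in orders:
        # Remove rows where the amount is missing.
        if row["amount"] is None:
            continue
        # Remove the irrelevant column by copying only the fields we keep.
        clean = {k: v for k, v in row.items() if k != "internal_note"}
        # Merge: look up the customer's attributes through the shared key.
        clean.update(customers.get(row["customer_id"], {}))
        # New feature: flag orders of 50 or more as large (assumed threshold).
        clean["is_large"] = clean["amount"] >= 50
        prepared.append(clean)
    return prepared


result = prepare(orders, customers)
for row in result:
    print(row)
```

The order with the missing amount is dropped, and each remaining row gains a country and an is_large flag.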
Calculation plays an important role in data science. Data scientists use calculation methods to analyze data, such as using means to describe a dataset or using standard deviation to detect outliers. They also use calculation methods to support machine learning algorithms, such as normalization and feature extraction.
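As a rough sketch of these calculations, the snippet below uses Python's built-in statistics module to compute a mean and a standard deviation, flag outliers by their distance from the mean, and normalize the values to z-scores. The sample values and the cutoff of 2 standard deviations are assumptions chosen for the example, not a universal rule.

```python
import statistics

# Hypothetical measurements; 40.0 is deliberately far from the rest.
values = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 12.0, 40.0]


def z_scores(data):
    """Normalize data to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(data)
    std = statistics.pstdev(data)
    return [(x - mean) / std for x in data]


def find_outliers(data, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > threshold]


print("mean:", statistics.mean(values))
print("standard deviation:", statistics.pstdev(values))
print("outliers:", find_outliers(values))
```

On small samples a single extreme value also inflates the standard deviation itself, so a stricter cutoff such as 3 may never trigger. That is why the threshold is left as a parameter.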
Visualization is the main way we present our results and analysis. We use visualization techniques to explore data, visualize distributions and correlations of features, and present the performance of machine learning models.
Some other tasks that data scientists do:
Data Cleaning
Cleaning data is the process of getting your data from its initial collection point to a form in which you can start using it for the tasks you have in mind. You'll need to clean your data before you can apply any machine learning algorithm to it.
Summarizing and describing your dataset
Understanding the structure of a dataset is an important step in becoming familiar with it. When we first look at a dataset, we are often interested in quantitative results such as:
- The mean, median and mode of quantitative columns (e.g., how much money was spent on average?).
- How many observations are there? Are there missing values?
- How many unique values are there in each column? Which variables are numerical, and which ones are categorical?
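The questions above can be answered with a few lines of code. The following is a minimal sketch using only Python's standard library. The dataset, the column names, and the use of None for missing values are assumptions made for the example.

```python
import statistics

# Hypothetical dataset; None marks a missing value.
rows = [
    {"spend": 20.0, "channel": "web"},
    {"spend": 35.0, "channel": "store"},
    {"spend": None, "channel": "web"},
    {"spend": 20.0, "channel": "web"},
]


def summarize(rows, column):
    """Describe one column: counts, missing values, type, and basic statistics."""
    present = [r[column] for r in rows if r[column] is not None]
    summary = {
        "observations": len(rows),
        "missing": len(rows) - len(present),
        "unique": len(set(present)),
        # Treat a column as numerical only if every present value is a number.
        "numerical": bool(present)
        and all(isinstance(v, (int, float)) for v in present),
    }
    if summary["numerical"]:
        summary["mean"] = statistics.mean(present)
        summary["median"] = statistics.median(present)
        summary["mode"] = statistics.mode(present)
    return summary


spend_summary = summarize(rows, "spend")
channel_summary = summarize(rows, "channel")
print(spend_summary)
print(channel_summary)
```

For the spend column this reports one missing value and the mean, median and mode of the remaining three. The channel column is reported as categorical, with two unique values.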
Exploratory Data Analysis (EDA) is an approach, or philosophy, for data analysis that employs a variety of techniques (mostly graphical) to maximize insight into a data set, uncover underlying structure, detect outliers and anomalies, and test underlying assumptions. EDA differs from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing.
A nice introduction to data sets:
What are the practical applications of Data science?
One important feature in eCommerce is categorization of products.
Better categorization of products means easier searching for customers on online stores and can lead to better conversions.
A somewhat related topic within the field of text classification is URL classification. In marketing, for example, it is useful to know which vertical a website belongs to before you publish on it, so classifying websites by URL helps. An introduction to URL databases for web content filtering: https://www.alpha-quantum.com/blog/url-database/url-database/. A standard for URL classification of websites is the one from IAB.
Text classification is usually done in an automated way using machine learning models. Typical machine learning models used for website categorization and product categorization include:
- naive Bayes
- support vector machines
- logistic regression
- long short-term memory (LSTM)
- recurrent neural nets
- transformer neural nets
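To make the first model in this list more concrete, here is a small sketch of a multinomial naive Bayes classifier for product categorization, written with only the Python standard library. The product titles, the two categories, the whitespace tokenization, and the add-one smoothing are all assumptions made for illustration. A production system would typically rely on a machine learning library and far more training data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior of the class.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocabulary:
                    continue  # ignore words never seen during training
                count = self.word_counts[label][word]
                # Add the smoothed log likelihood of the word given the class.
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: product titles with made-up categories.
titles = [
    "wireless bluetooth headphones",
    "noise cancelling headphones",
    "usb charging cable",
    "cotton summer dress",
    "slim fit denim jeans",
    "leather winter jacket",
]
categories = ["electronics", "electronics", "electronics",
              "clothing", "clothing", "clothing"]

classifier = NaiveBayesTextClassifier()
classifier.fit(titles, categories)
print(classifier.predict("bluetooth headphones cable"))
print(classifier.predict("denim jacket"))
```

The same idea carries over to website categorization: the training texts would be page content or URL tokens, and the labels would be the verticals.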