The new professional profiles of Big Data and Analytics
Andrew Pole was at his desk when two people from marketing came to ask him: “We want to find out if a customer is pregnant, even if she doesn’t want us to know. Can you do that?”
Andrew had always been obsessed with the intersection of data and human behavior, so it did not take him long to discover several patterns in the data that served his mission. In his interview with the NY Times, he explains:
… Test after test, analyzing the data, and in a short time some useful patterns emerged. Lotions, for example. … They were buying large quantities of unscented lotion at the beginning of their second trimester. … Sometime in the first 20 weeks, pregnant women bought supplements such as calcium, magnesium and zinc. Many customers buy soap and cotton balls, but when someone suddenly begins to buy a lot of unscented soap and extra-large bags of cotton balls, in addition to hand sanitizers and wipes, it indicates that the stork might be arriving soon.
As Andrew’s computers tracked the patterns in the data, about 25 products were identified which, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. More importantly, he could also estimate the date the stork would arrive, so the store could send coupons timed to very specific stages of the pregnancy.
The work of Andrew and his colleagues led to one widely reported incident. Target sent baby clothing promotions to a 17-year-old girl, and her father demanded that Target explain the reason for the promotion. After that, Target decided to reveal its secret.
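The pregnancy prediction score in this story can be pictured as a weighted sum over the products in a shopper's basket. The following is only a minimal sketch of that idea: the product names come from the quote, but the point values, the `SIGNAL_POINTS` table and the `prediction_score` function are hypothetical, since the actual model is not described here.

```python
# Hypothetical point values per product; the real weights are not public.
SIGNAL_POINTS = {
    "unscented lotion": 30,
    "calcium supplement": 15,
    "magnesium supplement": 15,
    "zinc supplement": 15,
    "extra-large bag of cotton balls": 25,
}


def prediction_score(basket):
    """Add up the points of every distinct signal product found in the basket."""
    distinct_items = set(basket)  # buying the same product twice counts once
    return sum(SIGNAL_POINTS.get(item, 0) for item in distinct_items)


basket = ["unscented lotion", "zinc supplement", "bread", "unscented lotion"]
print(prediction_score(basket))  # 45
```

A real system would presumably learn such weights from historical purchases instead of fixing them by hand; the sketch only shows how several weak signals combine into a single score.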
The Insight Generation Process
In the underworld of data science, we call a recommendation that generates a competitive advantage an insight. If you want to work professionally in the glorious world of Big Data and Analytics, you must be clear that generating insights is your goal.
Insight generation can be generalized into five steps; personally, I like to add two extra steps, for a total of seven:
- Collection: Activities related to the extraction, collection and storage of data.
- Obtaining: Activities related to gaining easy access to the data, as well as manipulating signals.
- Cleaning: Activities involved in validating the data and discarding irrelevant signals.
- Exploration: Activities focused on the visualization / presentation of data.
- Modeling: Activities that generate a mathematical representation of a real-world process.
- Interpretation: Activities whose objectives are to facilitate the human interpretation of a mathematical model.
- Deployment: Activities that make it easier for all users to access the interpretation of the model.
Although Andrew’s story sounds like a one-man job, there was actually a team of many people behind it, and each of them played a very important role in generating the recommendation that gave Target its advantage.
Normally the size of the team depends on how big the business is. In a startup, two or three people probably do the whole process, and in the worst case a single person does everything (the Showman). But if we talk about a titan like Amazon, rest assured that behind an insight there are at least 10 people, who in turn coordinate more people.
So do not worry about not being the Showman, and join me to discover where you can apply your skills in this exciting process.
When talking about data collection, it is impossible to ignore Big Data. To address this issue, I like to cite the paper Sergey Brin and Larry Page published in 1998, where they detail the operation of the Google search engine. In it you can read that in one weekend they downloaded the entire Internet, approximately 200 GB, and when they tried to manipulate the collected files, they realized that the computers of that time could not process that amount of information. And so they coined the term BigFiles.
Although today 200 GB does not seem like much information, in 1998 that was Big Data. To put it in simple terms: Big Data is the use of tools and/or techniques that allow the manipulation of large volumes of information. Consider that the Large Hadron Collider in Geneva generates 300 GB per second, which is then filtered down to the relevant events. It is estimated that it generates 27 terabytes of filtered information per day. At this point the use of Hadoop and Spark makes sense, but that will be a subject for another occasion.
Data collection tasks, even if they seem simple, require a lot of effort. For example, if we want to market a smart refrigerator that connects to the store so that your consumables are shipped to you as they run out, we need at least three kinds of professionals:
- Software Architects: The person in charge of proposing the connection scheme between the refrigerator in the consumer’s home and the company’s data center. This professional must evaluate the whole end-to-end process. If you are good at connecting systems that use different technologies, this is your place.
- Embedded Systems Engineers: The person dedicated to connecting sensors, saving the signal readings and replicating them in the data centers. If you like the Internet of Things and you are good at playing with the GPIO pins of the Raspberry Pi, this is where you have to be.
- Data Engineers: The person in charge of the data storage logic. This role hides a secret superpower: the data engineer can speed up information retrieval by partitioning the databases and optimizing the queries. Imagine you can save one second per record in the database; if your query covers one million records, you will have saved a million seconds. If you are obsessed with orders of computational complexity, you can make a difference here.
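To get a feel for the Data Engineer's numbers, the arithmetic can be worked out in a few lines. This is a rough sketch, not a benchmark: `time_saved` and `worst_case_lookups` are hypothetical helper names, and comparing a full scan against a binary search over a sorted index is a simplifying assumption about how an optimized query avoids reading every record.

```python
import math
from datetime import timedelta


def time_saved(records, seconds_saved_per_record):
    """Total time saved when every record is processed a little faster."""
    return timedelta(seconds=records * seconds_saved_per_record)


def worst_case_lookups(n):
    """Rough number of records inspected to find one key among n records."""
    return {
        "full_scan": n,                             # O(n): read every record
        "sorted_index": math.ceil(math.log2(n)),    # O(log n): binary search
    }


print(time_saved(1_000_000, 1.0))     # 11 days, 13:46:40
print(worst_case_lookups(1_000_000))  # {'full_scan': 1000000, 'sorted_index': 20}
```

A million seconds is more than eleven days of computing time, which is why the order of complexity of a query matters more than any single clever trick.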
Depending on how the business operates, obtaining the data can be simple or complicated. For example, if the business has an open-data culture, it can give you direct access to the database, so that a simple SELECT will be enough to get you the data. But if the business is very protective, it will most likely send you a batch of files that you will have to Extract, Transform and Load (ETL). In the real world, all companies are jealous of their data, and you will need help running the ETL on 30 GB of CSV files. The professionals involved are:
- ETL Analysts: This person follows the ETL process in order to propose the platform and/or database engine that is needed. The role requires knowledge of operating systems, scripting, interconnection between servers and automation of ETL jobs. If your forte is the GNU/Linux terminal and you’re good at scripting to automate tasks, your place is here.
- Data Engineers
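The Extract, Transform and Load steps mentioned above can be sketched with nothing more than the Python standard library. This is a minimal illustration under stated assumptions: the `product` and `quantity` columns, the `purchases` table and the function names are all hypothetical, and an in-memory SQLite database stands in for whatever engine the ETL Analyst would actually propose.

```python
import csv
import io
import sqlite3


def extract(csv_file):
    """Extract: stream rows one at a time so a large file never sits in memory."""
    yield from csv.DictReader(csv_file)


def transform(rows):
    """Transform: normalize product names and discard incomplete records."""
    for row in rows:
        product = (row.get("product") or "").strip().lower()
        quantity = (row.get("quantity") or "").strip()
        if not product or not quantity:
            continue  # skip rows with a missing field
        yield product, int(quantity)


def load(conn, records, batch_size=1000):
    """Load: insert the records in batches and return how many were stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases (product TEXT, quantity INTEGER)"
    )
    batch, total = [], 0
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO purchases VALUES (?, ?)", batch)
            total += len(batch)
            batch.clear()
    if batch:  # flush the last, partially filled batch
        conn.executemany("INSERT INTO purchases VALUES (?, ?)", batch)
        total += len(batch)
    conn.commit()
    return total


# A tiny in-memory file standing in for one of the delivered CSV files.
sample = io.StringIO("product,quantity\n Lotion ,3\nCotton balls,2\n,5\nZinc,\n")
conn = sqlite3.connect(":memory:")
loaded = load(conn, transform(extract(sample)))
print(loaded)  # 2
```

Because each stage is a generator, rows flow through one at a time, which is the property that matters when the input is tens of gigabytes rather than four lines.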
In my opinion, consulting topic experts is the most critical task of the process, because many times the insights presented to the stakeholders are not adequate given the nature of the process. For example, in Supply Chain you cannot tell executives that the main cause of unfulfilled orders is customer demand. It is as if you were telling them that they have to reduce their number of customers so as not to have production problems. You see how silly that sounds.
The type of expert who should advise you depends on the nature of the process. They may be experts in marketing or human resources, or physicists, actuaries, mechatronics engineers, geologists, psychologists, etc.
Data cleaning has to do with deciding which signals/events are going to be integrated into the study. This task is usually planned by Data Engineers, Data Scientists and Topic Experts. Depending on the volume of the data, it can be executed by Data Engineers (if it is Big Data) or by Data Scientists (if it is a small collection).
In recent years, society has become aware that the information collected by companies can invade people’s privacy. For example, in the Facebook scandal with Cambridge Analytica, the US government had to intervene to ensure that no laws were broken. Keeping the use of data within such rules is what is called Data Governance, and companies had to create a new profile for it:
- Custodian of Data: This person is responsible for speaking up if they find a privacy violation in the data being used to conduct a study. If you care about compliance with the laws of your country and have the programming knowledge to audit data, do not hesitate to join the Edward Snowden club.
Exploration, Modeling and Interpretation of Data
Yes, these three tasks are performed by the Data Scientist; not for nothing is it called the sexiest job of the 21st century. If you have ever asked yourself, “Does taking a Machine Learning course make me a Data Scientist?”, let me tell you that it does not.
The search for patterns in data is not new; it dates from 1986. In the 90s it was called Business Intelligence. One of the companies that took advantage of this term was Oracle with its OLAP platform, which made it possible to generate reports that helped in decision-making. Oracle also added a module that implemented some Data Mining algorithms: Anomaly Detection, Association, Decision Tree, Expectation Maximization, Generalized Linear Models, k-Means, Naive Bayes, Nonnegative Matrix Factorization, Orthogonal Partitioning Clustering, Singular Value Decomposition, Principal Components Analysis, and Support Vector Machine. The people in charge of running these algorithms were called Data Analysts.
In May 2008, DJ Patil, a mathematician who worked at LinkedIn and a nerd at heart, had more advanced ideas on how to analyze the graph of relationships between the users of the social network. His unconventional way of analyzing the problem attracted a lot of attention, and the industry realized that people with scientific training had skills that could be a strong differentiator in the market. So Patil called his friends at Facebook, and together they invented the title Data Scientist.
Data Scientists are people who handle more complex concepts than those you can pick up in a Machine Learning course. Given their training, they have a very high capacity for abstraction; that is, they can easily model a real-life process in mathematical terms, so that the Machine Learning algorithm becomes just a tool. Reaching that level of innovation is what all companies are looking for today.
On the other hand, in the last two years companies have begun to recruit people who do Machine Learning and integrate them into collaborative teams whose objective is to generate insights, no matter how innovative the methodology behind them is. So we find ourselves in a scenario in which the Data Scientist is replaced by Machine Learning Engineers and Experts, with the Experts guiding the Machine Learning Engineers. And the cherry on the cake are the UX Designers, who provide the visual impact.
So do not worry: take that Machine Learning course to make your way as a Data Scientist, and make an effort to learn to model well.
Under these trends, we can highlight the following professional profiles:
- Data Scientist: Person in charge of modeling the business process in order to discover patterns that result in an insight. If you are curious by nature, you like to explain the reason for things and you know why Random Forest generates a weighted list of features while Naive Bayes does not, then don’t wait any longer, this is your place.
- Machine Learning Engineer: Person who, under the expert’s advice, applies supervised and/or unsupervised learning algorithms to model the business process. If you are a Machine Learning enthusiast and you are just starting out on this path, do not doubt that this is a good starting point.
- Data Artist: Person in charge of generating the reports/infographics that summarize the modeling process and present the insight to users in a simple way. If you are equally passionate about design and data, don’t doubt that you can make a difference on any team that counts on you.
- Analytical Project Manager: Person in charge of coordinating analytical developments and responsible for translating the needs of stakeholders into actions for the analytical development team. If you like to identify areas of opportunity when talking with people and you are good at translating ideas into technical Machine Learning concepts, this is definitely your place.
Once the insight is approved by the stakeholders, it must be made available to all the users who are going to use it. There are several ways to deliver it: SMS messages, email messages, notifications on a cell phone, a PDF report that arrives by email, a dashboard hosted on a website, etc.
In the last year, a machine learning model exchange standard has opened up the possibility of porting a trained model to almost any platform. The exchange format is called ONNX. Backed by industry heavyweights, it promises that you can train your model once and serve it anywhere. This opens up a new professional profile:
- Analytical DevOps: Person in charge of deploying, maintaining and monitoring the model throughout its useful life, because in the industry big projects work in iterations and each version needs to improve on the last. If you are not afraid to change platforms (Cloud, Smartphone, Embedded, etc.) and you are good at programming in several languages, this is your place.
Although it seems unbelievable, there are people who cover all these skills; that is, they could carry out the entire process without anyone else’s help. People of this type have been in the industry for many years, and they know which trend is coming and which one is on its way out. So prepare yourself well, because the field of Big Data and Analytics is just starting to boom, and I hope that in the future you can work in one of these professional positions:
- Head of Analytical Consulting: Analytical Expert whom you go to when you have any questions about your model. For example, this professional can tell you what to change in your model to reduce overfitting or how to transform features to improve the accuracy of your model. They can also advise you on how to change the modeling so that it correctly captures the business process.
- Analytics Director: Expert who knows the business very well, as well as the data that is available. They are able to propose new analytics that will become services and/or products. Among their responsibilities is monitoring the market to know what is valuable for the business and what is not.
About The Author
Saul León Silverio is a graduate of the Benemérita Universidad Autónoma de Puebla. During his bachelor’s and master’s degrees he participated in a research group focused on Natural Language Processing, competing with research groups from IBM, Yahoo and universities around the world. He currently works at General Electric Infrastructure Querétaro, modifying predictive models to optimize the Supply Chain. In his spare time, he likes to read about new technologies, program IoT devices and develop Digital Assistants.