Data Science | Machine Learning | Lessons Learned

My top 3 learnings after 5 years in Data Science

What have I learned working in the data science field?

Swapnil Kangralkar

I’ve had almost 8 years of professional experience now, 5 of them specifically in data science and machine learning. In my current role, my team and I design and build predictive machine learning models and promote emerging technologies within the department. This article describes my top three learnings as a data scientist. I hope they give aspiring data scientists a sense of what’s inside the real world of data science and what they should expect.

1. You will spend 80% of your time in data preparation

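The Pareto principle, or 80–20 rule, states that for many outcomes, roughly 80% of consequences result from 20% of the causes (the law of the “vital few”). Data science work runs contrary to this rule: roughly 80% of your time goes into data preparation, leaving about 20% for everything else, including the modeling itself.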

Most of us, while practicing the concepts of machine learning, use academic datasets and spend little to no time on data cleaning or Exploratory Data Analysis (EDA). This is because those datasets are usually small and not very complex. They leave some room to practice data cleaning, but they are nowhere close to the real-world data you will face in organizations. Companies usually have hundreds of data marts that hold data from various departments and lines of business. A frequent challenge is merging records from several datasets, and oftentimes there is no common key to merge on.
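To make the "no common key" problem concrete, here is a minimal sketch of one possible workaround: fuzzy matching on a name field using Python’s standard `difflib` module. The two record lists, the `customer` field, the helper names, and the 0.85 threshold are all hypothetical, chosen only for illustration. A real project would likely need more careful normalization and a manual review of the matches.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_merge(left, right, field, threshold=0.85):
    """Attach the best-matching right record to each left record.

    Left records without a match at or above the threshold are kept as-is,
    with match_score set to None (similar in spirit to a left join).
    """
    merged = []
    for l_rec in left:
        best, best_score = None, 0.0
        for r_rec in right:
            score = similarity(l_rec[field], r_rec[field])
            if score > best_score:
                best, best_score = r_rec, score
        if best is not None and best_score >= threshold:
            extra = {k: v for k, v in best.items() if k != field}
            merged.append({**l_rec, **extra, "match_score": round(best_score, 2)})
        else:
            merged.append({**l_rec, "match_score": None})
    return merged


# Hypothetical extracts from two data marts that share no ID column.
sales = [
    {"customer": "Acme Corp.", "revenue": 1200},
    {"customer": "Globex Inc", "revenue": 800},
]
support = [
    {"customer": "ACME Corp", "tickets": 7},
    {"customer": "Initech", "tickets": 3},
]

merged = fuzzy_merge(sales, support, "customer")
for row in merged:
    print(row)
```

In this toy data, "Acme Corp." and "ACME Corp" normalize to the same string and are merged, while "Globex Inc" finds no close counterpart and is left unmatched. The threshold is a judgment call: set it too low and unrelated records get merged, set it too high and genuine matches are missed.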

The first law of Machine Learning is “Garbage in, garbage out”. Depending on the problem you are solving, the type of solution you are designing (e.g. classification or regression), and the type of algorithm you plan to use (e.g. decision trees vs. neural nets), you will have to prepare your data accordingly. This takes massive time and effort. You have to spend time exploring the data and digging through the metadata dictionaries to understand it, and if you are building your model directly on the source data, your efforts and challenges multiply manifold.
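As a small illustration of preparing data for the algorithm you plan to use: neural nets usually train better on scaled numeric inputs and need categorical values encoded as numbers, while tree-based models are generally insensitive to feature scaling. The sketch below shows min-max scaling and one-hot encoding in plain Python. The feature names and values are made up, and in practice you would probably reach for a library rather than hand-rolled helpers like these.

```python
def min_max_scale(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode categories as 0/1 vectors, one position per category (sorted)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical raw features for three customers.
ages = [20, 30, 40]
plans = ["basic", "premium", "basic"]

scaled_ages = min_max_scale(ages)   # [0.0, 0.5, 1.0]
encoded_plans = one_hot(plans)      # [[1, 0], [0, 1], [1, 0]]

# Combine into one numeric row per customer, ready for a model that expects numbers.
rows = [[age] + plan for age, plan in zip(scaled_ages, encoded_plans)]
print(rows)
```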

2. You will almost always seek help and communicate with the domain experts

A domain expert can be an expert by education and/or by their experience in that field. You may have aced machine learning concepts at your university. However, when you start your career in a company, you rarely have any knowledge about the particular domain your company operates in. Domain experts help define a framework for your project. They have spent enough time in that domain to know the challenges you may have in the present and in the future. Not all knowledge is explicit. Domain experts have this tacit knowledge that is hard to document and that they gathered over the years in the form of experience, wisdom, insight, and intuition.

You will need domain experts in the first place to understand the business, ask the right questions, and develop a statement of scope for your project. Once that is done, domain experts can tell you what data is available within the company, what is not available, where it lives, what you can use, and what is off limits. For example, consider a text analytics project on call center data. As a data scientist, you can get your hands on the data (e.g. the transcript of a chat between a call center agent and a customer), but only a person who has actually worked as a call center agent can guide you through how the chats are recorded, whether the whole chat or only part of it is recorded, what happens before and after the chat, and so on. Domain experts can act as quality assurance for your models and can help you diagnose, optimize, and recognize patterns in the data. If documentation is not well maintained in the company, domain experts are sometimes your only option for understanding the seasonal patterns in your data. You can get nowhere without a domain expert beside you.

3. Not all solutions you build will make it to production

There is no doubt that companies today pour money into the development of machine learning models to improve their products or services. However, support for data science means more than just money. For data science, machine learning, and artificial intelligence to be successful, it is important not only for data scientists but also for business people and management to understand what machine learning models are and how they work. A client who doesn’t understand how your model makes predictions will always be skeptical and won’t buy in. This, in turn, holds back your model’s learning, since a model that nobody adopts gets no feedback to improve on. After all, data science and business intelligence are all about implementing and learning, learning and implementing.

Another major hurdle is the disconnect between data scientists and the IT department. The job of the IT department is to ensure that things work and that nothing breaks in the system. IT may not always envision the needs of a data scientist and will be resistant to implementing new processes and technologies. Also, sometimes it can be very expensive for IT to build and maintain the infrastructure required for some of the machine learning technologies. Not to mention the efforts required to keep up with the security concerns of the new versions that keep rolling in so often. Therefore, it is necessary that departments have a shared goal and that they do not operate in silos.

The above learnings are my own. I would love to hear from others how relevant they are to you.

Thank you for reading. Get in touch if you have further questions via LinkedIn.
