I've been given a dataset and I need to assess its quality.
Use Pandas Profiling to quickly generate a report that gives you a first overview of the data.
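As a minimal sketch of generating that overview: the library is currently published as ydata-profiling (the older pandas-profiling package exposes the same `ProfileReport` class). The function name `write_profile_report`, the report title, and the file names are assumptions made for illustration.

```python
import pandas as pd


def write_profile_report(df: pd.DataFrame, output_path: str = "report.html") -> None:
    """Write an HTML overview of df (hypothetical helper, not part of the library)."""
    # Imported here so the rest of a script still runs if the package is not installed.
    from ydata_profiling import ProfileReport

    report = ProfileReport(df, title="Dataset overview")
    report.to_file(output_path)


# Example usage, assuming a local file named data.csv (hypothetical):
# write_profile_report(pd.read_csv("data.csv"))
```

Opening the resulting HTML file in a browser shows the warnings and per-variable details.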
Your first step should be to look for warnings and messages at the top of the document. Entries about missing values point you to variables that may need attention during the data cleaning and imputation phases of your machine learning problem. Since this is an assessment, simply note that data is missing in these variables, then try to determine why by loading the data into a pandas DataFrame and looking at a few examples.
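A minimal sketch of that follow-up in pandas, using made-up data. The column names and values are hypothetical; in practice you would load your own file.

```python
import pandas as pd

# Hypothetical toy data; in practice, load your own file with pd.read_csv(...).
df = pd.DataFrame(
    {
        "age": [34, None, 29, None],
        "city": ["Paris", "Lyon", None, "Nice"],
        "income": [52000, 48000, 61000, 39000],
    }
)

# Fraction of missing values per column, largest first.
missing_ratio = df.isna().mean().sort_values(ascending=False)
print(missing_ratio)

# Pull the rows where "age" is missing to look for a pattern.
rows_missing_age = df[df["age"].isna()]
print(rows_missing_age.head())
```

Reading a handful of the affected rows is often enough to suggest a cause, such as a source system that never recorded the field.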
Are there a lot of duplicated rows? Depending on the dataset, duplicates may reveal that something went wrong when it was put together. If every entry is supposed to be unique because it represents a single (entity, timestamp, target) tuple, ask yourself why that isn't the case. Is it possible that the dataset was created by appending a collection of other documents, leading to duplicate lines? If so, you may have to preprocess the dataset to get rid of the duplicate rows.
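The duplicate check can be sketched in pandas as follows. The data and the choice of (entity, timestamp) as the key that should be unique are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data where each (entity, timestamp) pair should appear once.
df = pd.DataFrame(
    {
        "entity": ["a", "a", "b", "b"],
        "timestamp": ["2021-01-01", "2021-01-01", "2021-01-01", "2021-01-02"],
        "target": [1, 1, 0, 1],
    }
)

# Rows that are exact copies of an earlier row.
n_exact_duplicates = int(df.duplicated().sum())

# Every row sharing a key that is supposed to be unique (keep=False marks all copies).
key_collisions = df[df.duplicated(subset=["entity", "timestamp"], keep=False)]

# Drop exact duplicates, keeping the first occurrence.
deduplicated = df.drop_duplicates().reset_index(drop=True)
```

Checking the key separately from whole-row duplicates matters: two rows with the same key but different targets point to a different problem than two identical rows.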
Look for variables that are flagged as highly correlated with other variables. High correlation means that one variable may carry exactly (or almost) the same information as the other, in which case keeping both adds little for a machine learning model. Picking one variable out of each correlated pair also avoids the cost of storing both.
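To list the correlated pairs yourself, one possible sketch uses the Pearson correlation that `DataFrame.corr` computes by default. The data and the 0.95 threshold are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "height_in" is "height_cm" expressed in another unit.
df = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
        "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
        "shoe_size": [38, 44, 37, 42, 40],
    }
)

threshold = 0.95  # arbitrary cut-off chosen for this sketch

# Absolute Pearson correlation between every pair of columns.
corr = df.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

highly_correlated = [
    (row, col, float(upper.loc[row, col]))
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(highly_correlated)
```

Here only the two height columns are reported, which matches the intuition that one of them is redundant.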
Look at each variable in turn and view its details.
Look at the distribution of values. Are they uniformly distributed, normally distributed, binomially distributed, etc.?
If a variable has only two possible values, do they occur about equally often, or does one value dominate the other? If you were to predict this variable, you would have to deal with class imbalance.
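A quick way to quantify that balance, sketched with a made-up binary column (the name `churned` and the 95/5 split are hypothetical):

```python
import pandas as pd

# Hypothetical binary column with a dominant value.
labels = pd.Series(["no"] * 95 + ["yes"] * 5, name="churned")

# Share of each value; normalize=True returns proportions instead of counts.
class_share = labels.value_counts(normalize=True)

# Ratio between the most and least frequent value; 1.0 means perfectly balanced.
imbalance_ratio = class_share.max() / class_share.min()
print(class_share)
print(imbalance_ratio)
```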
Do the values of the variables make sense to you? Do some variables combine several pieces of information, such as a value and the unit used for the measurement? You would generally prefer composite values to be split into separate variables, as they will be easier for machine learning models to process.
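Splitting a composite value can be sketched with a regular expression. The `weight` column, its format, and the new column names are assumptions; real data will usually need a more forgiving pattern.

```python
import pandas as pd

# Hypothetical column mixing a value and its unit.
df = pd.DataFrame({"weight": ["70 kg", "154.5 lb", "82kg"]})

# Capture the number and the unit into two named columns.
pattern = r"^\s*(?P<weight_value>\d+(?:\.\d+)?)\s*(?P<weight_unit>[A-Za-z]+)\s*$"
parts = df["weight"].str.extract(pattern)
parts["weight_value"] = parts["weight_value"].astype(float)

# Keep the original column next to the two new ones for comparison.
df = pd.concat([df, parts], axis=1)
print(df)
```

Rows that do not match the pattern come back as missing values, which makes malformed entries easy to spot.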
When looking at the distribution of numeric values, are there outliers (values that are a lot smaller or larger than the rest)? It is sometimes important to ask those who provided the data whether they can explain those outliers. In general you will want to exclude outliers from training, as they may skew your model toward them, resulting in less than ideal results for all the other data points.
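One common way to flag candidate outliers is the 1.5 × IQR rule. This is only one possible definition, picked here for illustration, and the data is made up.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value.
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="response_time_ms")

# Interquartile range: the spread of the middle 50% of the values.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged for review.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]
print(outliers)
```

Flagged values are candidates to discuss with the data provider, not values to drop automatically.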
The quality of a dataset is inversely proportional to the number of operations you need to apply to it to make it a clean dataset. That is to say that if you don't need to do anything on the data provided to you, then it is a good dataset.
As an AI developer, what skills do I use the most at work?
The following list is ordered according to the importance I give to those skills in my success at work.
- Analysis/understanding the problems: There's no point in doing work if you don't know why you're doing it. Understanding why you are solving a specific problem will give you insights on the best way to solve the problem itself using the right approach.
- Programming: I'm a software developer, so I spend most of my time (or hope I do) writing code. That also includes reading code written by others.
- Prioritization: I need to decide what tasks are more important than other tasks and organize the list of tasks from most to least important. At my current job, it's everyone's responsibility to decide what is important to build.
- Time management: I have a variety of responsibilities so I need to juggle between them and the time I allocate to them. I also need to accomplish certain tasks according to deadlines, so it is important to properly manage my time, what gets done and when it gets done.
- Design/code organization: Programming is more than science. You also need to think about how the different parts of the code interact with each other and how to organize the code so that it is easy for others to understand the code and to participate in its development.
- Negotiation: Working in an organization means negotiating with others to make your ideas heard, accepted and developed. Not everyone will agree with your ideas, so you need to spend time and effort to convince others that your idea is the best.
- Scheduling: While time management is about making sure that you're spending your time on a task at the right moment, scheduling is about organizing your use of time with others in the company. It is about scheduling meetings and/or events.
- Delegation: Being able to offload work to others is important. It is however a difficult skill because you need to be able to properly communicate with others what needs to be done and the expectations that you have (or that others have) regarding the completion of the task.
- Debugging: I love playing detective when there are issues in the code I maintain. Over the years I've developed a good ability to understand a system and to quickly pinpoint the cause of a bug or of poor performance.
- Communication: Work is largely exchanging information and aligning with your coworkers. The most important skill of communication is listening because you want to understand first before saying anything.
How can you make people more productive?
Start by asking your teammates to assess whether they are at their peak productivity or not. If they already are, ask them what you could do to help them be more productive. If they are not at their peak productivity, ask them what is making them unproductive. Make it your goal to help your teammates be as productive and efficient as possible.
Being unproductive is like slow software. To improve it, you first have to profile the program to get an idea of what is making it slow. Once you have a general idea of the causes, focus on the biggest sources of slowness because they are likely to be the ones you can help improve.
In software development, the biggest source of productivity loss is meetings. Meetings break developers' flow, consume time, and are generally easy to replace with a properly written document. If you can act as the proxy for multiple members of your team so that they do not have to attend all the meetings you are attending, you can help them reduce this meeting fatigue.
Another source of inefficiency is back-and-forth code reviews. If you need multiple iterations of review to close a PR, the PR was too big. If you end up with more than two rounds of back-and-forth, simply reach out to the author of the PR and do a pair programming session with them to resolve most of your issues with the PR in a single session.
Code depends on other code. It should be your first priority to make sure that code developed by others is not stuck in the PR queue for a long period of time, because the author might be building on that code. The code in that PR may already be used by code the author is writing while waiting for the review. If they need to go back and fix things because of delays in the PR, their whole development process becomes inefficient. Having to wait or being blocked is an important source of productivity loss.
Help others avoid working on too many tasks at once. The brain does not handle multitasking very well, and it generally leads to lower performance and a higher error rate.
There are no tests on the project I joined. How do I get started?
When joining a new project without tests, here is the value you need to provide through the addition of tests:
- the application works and doesn't crash
- the application works and supports a few input cases
- the application works and supports a variety of input cases
- the application works and is robust to most input cases
Start by finding the main entrypoint of the program and calling it in a test. Your test doesn't have to do much other than ensure you can start the program and possibly exit. Your goal should not be to assert anything yet, but to exercise the code. Create a few tests that do very little other than start and terminate the program. Once you've covered a few use cases, you can use those tests to ensure that the application can start, do a few things, then terminate without crashing.
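Such a first test might look like the sketch below. The `main` function is a stand-in defined inline so the example is self-contained; in a real project you would import the actual entrypoint, for instance `from myapp.cli import main` (a hypothetical module path). The `test_` prefix assumes a runner such as pytest that discovers tests by name.

```python
# Stand-in for the real entrypoint (hypothetical); import yours instead.
def main(argv):
    if "--help" in argv:
        print("usage: myapp [--help]")
        return 0
    return 0


def test_program_starts_and_exits():
    # No assertion on purpose: the test passes as long as nothing raises.
    main([])


def test_help_flag_does_not_crash():
    # A second way of starting the program, still without asserting on output.
    main(["--help"])
```

Even without assertions, these tests fail as soon as an import breaks or startup raises an exception, which is the first level of value listed above.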
Start unit testing the various parts of the code that are critical. To determine what is critical, you'll have to dig into the code. With the tests you initially wrote you will get a sense of the "critical" pieces, simply due to the fact that they are executed whenever the program starts and stops, which are two things you always want to work.
When writing unit tests, always aim to write a test case that covers the happy path first. You want to demonstrate that the functionality a class or function is supposed to provide is there first and foremost. Then you want to test its robustness and its ability to handle a variety of input cases. Given a large codebase, start by covering most of the code with the happy paths before you start to dig into the special cases.
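As an illustration of that ordering, here is a sketch with a hypothetical function `parse_price` and its tests: the happy path comes first, then robustness cases. Plain `assert` statements are used so the example does not depend on any test framework.

```python
def parse_price(text):
    """Hypothetical function under test: turn '$1,234.50' into 1234.5."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    if not cleaned:
        raise ValueError("empty price")
    return float(cleaned)


def test_parse_price_happy_path():
    # First, show that the intended functionality is there.
    assert parse_price("$1,234.50") == 1234.5


def test_parse_price_ignores_surrounding_whitespace():
    # Then widen the range of supported inputs.
    assert parse_price("  $7 ") == 7.0


def test_parse_price_rejects_empty_input():
    # Finally, check how invalid input is handled.
    try:
        parse_price("   ")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

On a large codebase, the first of these three tests would be written for many functions before the other two are written for any of them.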
What is the most important part of a meeting?
If you get invited to a meeting that only has a title, ask for an agenda. If the agenda is only the title of the meeting, ask for a detailed agenda.
If people cannot prepare an agenda for a meeting, then there is no point in meeting. A meeting with no preparation will generally be ineffective simply because it will be spent either sharing information, something that could have been done without meeting, or actually doing preparatory work for something that would deserve to be done in a group, but not as a meeting in itself.
If you need to make some decisions with a few people, then prepare the list of decisions that will be made during the meeting.
Make sure that your meetings have an agenda. You will get a lot more out of your meeting because you will know the outcome you expect out of them.