What data do I need to do time series forecasting?
There are three values that you must know for each data point of your time series:
- its entity, which represents a unique value identifying the time series (e.g., a product SKU). Without this information, it is not possible to construct a sequence of points since there's no logical grouping between the points.
- its timestamp, which represents the moment in time the data point was recorded. Without this information, it is not possible to construct a sequence of points since there's no sequential ordering between the points.
- its target, which represents the measurement of the data point itself that we want to predict. Without this information, we have effectively nothing to base a prediction on.
Such information would look as follows when organized in a table:
Additionally, you may also have recorded additional values at the same time, which can be a useful source of information when trying to predict a time series.
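As a minimal sketch of this layout, here is a small long-format dataset with one row per (entity, timestamp). All values are made up for illustration: the SKU names, the dates, and the extra `price` column are assumptions, and `group_series` is a hypothetical helper.

```python
# Hypothetical long-format time series: one row per (entity, timestamp).
# Every value below is invented for illustration.
rows = [
    # entity, timestamp,    target (units sold), price (additional value)
    ("SKU-A", "2024-01-02", 15, 9.99),
    ("SKU-A", "2024-01-01", 12, 9.99),
    ("SKU-A", "2024-01-03", 11, 10.49),
    ("SKU-B", "2024-01-01", 3, 24.00),
    ("SKU-B", "2024-01-02", 4, 24.00),
    ("SKU-B", "2024-01-03", 6, 22.50),
]


def group_series(rows):
    """Group rows by entity, then order each group by timestamp."""
    series = {}
    for entity, timestamp, target, _price in rows:
        series.setdefault(entity, []).append((timestamp, target))
    for points in series.values():
        points.sort()  # ISO-formatted dates sort correctly as strings
    return series


series = group_series(rows)
print(series["SKU-A"])
```

The entity provides the grouping, the timestamp provides the ordering, and the target is what remains to be predicted.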
Let's see what happens if we remove each of these columns, to illustrate why each one is necessary.
Removing the entity effectively leaves us with two values for the same timestamp. If the data were in this format and we were told that a new entity begins each time the timestamp goes below its previous value, we would be able to reconstruct the initial table with its entity column.
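A rough sketch of that reconstruction, assuming the rows are stored entity by entity and in increasing timestamp order within each entity. The function name and sample values are hypothetical. Note that only the grouping is recovered, not the original entity names.

```python
def reconstruct_entities(rows):
    """Label each (timestamp, target) row with an entity index.

    Assumption: a timestamp lower than the previous one marks the
    start of a new entity.
    """
    labelled = []
    entity_id = 0
    previous = None
    for timestamp, target in rows:
        if previous is not None and timestamp < previous:
            entity_id += 1  # the timestamp went backwards: new entity
        labelled.append((entity_id, timestamp, target))
        previous = timestamp
    return labelled


rows = [
    ("2024-01-01", 12),
    ("2024-01-02", 15),
    ("2024-01-01", 3),
    ("2024-01-02", 4),
]
labelled = reconstruct_entities(rows)
print(labelled)
```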
Removing the timestamp gives us the values the entity may take, but we don't know when. Again, if we're told that the rows have been kept in chronological order and were recorded at a known, regular interval, we could reconstruct the timestamp column.
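A minimal sketch of rebuilding timestamps, under two assumptions that go beyond row order alone: the first timestamp is known, and points are evenly spaced. `reconstruct_timestamps` and its parameters are hypothetical.

```python
from datetime import date, timedelta


def reconstruct_timestamps(targets, start, step=timedelta(days=1)):
    """Pair each target with a timestamp, assuming rows are in
    chronological order and evenly spaced by `step` from `start`."""
    return [(start + i * step, value) for i, value in enumerate(targets)]


points = reconstruct_timestamps([12, 15, 11], start=date(2024, 1, 1))
print(points)
```

If only the order is known, the sequence can still be modelled by position, but calendar effects such as weekdays or seasons are lost.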
Removing the target column makes this problem impossible to solve. We're left with only the entities that were measured and the time of measurement, but no measurement, which makes the two other values useless.
Is it possible to extract someone's beliefs by reading their writings?
If someone has written a blog for example, it is possible to figure out a variety of information about them simply based on what is written in those articles.
If someone writes about a specific piece of software, we may not be able to infer right away what they think about it, but we know that they've spent enough time to learn a bit about it. If we know all the other alternatives in this category of software, we could infer that the writer believes that this software is possibly better than the alternatives, otherwise why would they have picked it?
If the writer writes a lot on a topic, that is also a clue about their beliefs. They probably think that this topic is important, hence why they write about it. Maybe they write about this topic because it is lucrative to them.
What the writer doesn't write about is also informative. If they write mostly about technology, maybe they don't care about politics or sports?
It is possible to extract if-then rules from their writings, which generally express some form of belief that if something holds, then something else follows. There are other variants where only "if" is provided (if something, something else; or something else, if something).
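As a deliberately naive sketch, the three sentence shapes above can be matched with regular expressions. The patterns and the `extract_rules` helper are assumptions for illustration; real prose would need proper sentence splitting and parsing.

```python
import re

# "if X, then Y" and "if X, Y"
IF_FIRST = re.compile(
    r"^\s*if\s+(?P<cond>[^,]+),\s*(?:then\s+)?(?P<cons>.+?)[.!?]?\s*$",
    re.IGNORECASE,
)
# "Y if X" and "Y, if X"
IF_LAST = re.compile(
    r"^\s*(?P<cons>.+?),?\s+if\s+(?P<cond>.+?)[.!?]?\s*$",
    re.IGNORECASE,
)


def extract_rules(text):
    """Return (condition, consequence) pairs found in the text."""
    rules = []
    # Crude sentence split on whitespace following ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        match = IF_FIRST.match(sentence) or IF_LAST.match(sentence)
        if match:
            rules.append((match.group("cond").strip(), match.group("cons").strip()))
    return rules


text = (
    "If you have little data, then start with a simple model. "
    "You should cache results if the query is slow. "
    "I like tea."
)
rules = extract_rules(text)
print(rules)
```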
A writer may use certain adjectives to describe things, such as "easy", "simple", "straightforward", "difficult", "impossible", "hard", etc. Those are also useful indicators of the writer's beliefs.
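One rough way to surface these indicators is to count them. The sketch below uses only the adjectives listed above; the grouping into "easy" and "hard" sets and the `difficulty_signals` helper are assumptions, and a count says nothing about what each adjective was applied to.

```python
import re
from collections import Counter

EASY_WORDS = {"easy", "simple", "straightforward"}
HARD_WORDS = {"difficult", "impossible", "hard"}


def difficulty_signals(text):
    """Count occurrences of 'easy'-like and 'hard'-like adjectives."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word in EASY_WORDS | HARD_WORDS)
    return {
        "easy": sum(counts[word] for word in EASY_WORDS),
        "hard": sum(counts[word] for word in HARD_WORDS),
    }


signals = difficulty_signals("Setting it up was easy, but tuning it is hard. Really hard.")
print(signals)
```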
I've been given a dataset and I need to assess its quality.
Use Pandas Profiling (since renamed ydata-profiling) to quickly generate a document that will provide you with a first overview of the data.
Your first step should be to look for warnings and messages at the top of the document. Look for entries about missing values; those will point you to variables that may need attention during the data cleaning and data imputation phases of your machine learning problem. As you are doing an assessment, simply indicate that data is missing in these variables, and then see if you can determine why by loading the data in a pandas dataframe and looking at a few examples.
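To make the missing-value check concrete without depending on any library, here is a small sketch over a list of records. The records and the `missing_counts` helper are hypothetical; with a pandas dataframe, `df.isna().sum()` gives a comparable per-column count.

```python
def is_missing(value):
    """Treat None, empty strings and NaN as missing."""
    return value is None or value == "" or value != value  # NaN != NaN


def missing_counts(records):
    """Count missing values per column over a list of dict records."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts.setdefault(column, 0)
            if is_missing(value):
                counts[column] += 1
    return counts


records = [
    {"sku": "A", "units": 12, "price": 9.99},
    {"sku": "A", "units": None, "price": 9.99},
    {"sku": "B", "units": 3, "price": float("nan")},
    {"sku": "", "units": 4, "price": 24.0},
]
counts = missing_counts(records)
print(counts)
```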
Are there a lot of duplicated rows? Depending on the data you've been provided, this may help you identify whether or not something is wrong with the data you were provided. If all entries are supposed to be unique because they represent a single (entity, timestamp, target) tuple, then you should ask yourself why it isn't the case. Is it possible that the dataset was created by appending a collection of other documents, leading to duplicate lines? If so, you may have to do some dataset preprocessing in order to get rid of duplicate rows.
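A minimal sketch of finding and removing duplicate rows, assuming each row is a hashable (entity, timestamp, target) tuple. The data and helper names are made up; in pandas, `DataFrame.duplicated()` and `DataFrame.drop_duplicates()` cover the same ground.

```python
from collections import Counter


def find_duplicates(rows):
    """Return each row that appears more than once, with its count."""
    return {row: n for row, n in Counter(rows).items() if n > 1}


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


rows = [
    ("SKU-A", "2024-01-01", 12),
    ("SKU-A", "2024-01-02", 15),
    ("SKU-A", "2024-01-01", 12),  # exact duplicate of the first row
]
duplicates = find_duplicates(rows)
unique_rows = drop_duplicates(rows)
print(duplicates, len(unique_rows))
```

Rows that share an (entity, timestamp) pair but differ in target are a different problem: they are conflicting measurements rather than duplicates, and deserve a question to whoever provided the data.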
Look for variables that are flagged as highly correlated with other variables. High correlation means that one variable may carry exactly (or almost) the same information as the other, in which case keeping both adds little for a machine learning model. Keeping only one variable out of each highly correlated pair also avoids the cost of storing both.
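As an illustration of why a highly correlated pair is redundant, the sketch below computes the Pearson correlation between two hypothetical columns holding the same temperature in different units. The column values are made up, and Pearson is only one of the correlation measures a profiling report may use.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


celsius = [0, 10, 20, 30]
fahrenheit = [32, 50, 68, 86]  # same measurements, different unit
r = pearson(celsius, fahrenheit)
print(round(r, 6))
```

A coefficient at or near 1 (or -1) suggests that one of the two columns can be dropped without losing information.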
Look at each variable in turn and view its details.
Look at the distribution of values. Are they uniformly distributed, normally distributed, binomially distributed, etc.?
If there are only two possible values for a variable, are those values approximately equally frequent, or is one value dominant compared to the other? Were you to try and predict this variable, you would have to deal with class imbalance.
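A quick way to quantify the imbalance is to compute the share of each value. The labels below are invented, and `class_shares` is a hypothetical helper; in pandas, `Series.value_counts(normalize=True)` returns the same proportions.

```python
from collections import Counter


def class_shares(values):
    """Return the fraction of rows taken by each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {label: n / total for label, n in counts.items()}


labels = ["ok"] * 95 + ["fraud"] * 5
shares = class_shares(labels)
print(shares)
```

With a 95/5 split like this one, a model that always predicts the dominant value is already 95% accurate, which is why accuracy alone is a poor metric on imbalanced targets.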
Are the values of the variables sensible to you? Are variables composed of multiple pieces of information, such as the value and the unit used for the measurement? You would generally prefer composite values to be separated into different variables, as they will be easier to process using machine learning models.
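For instance, a column holding strings such as "12.5 kg" can be split into a numeric value and a unit. The pattern below is a sketch that assumes a plain decimal number followed by a short alphabetic unit; real data usually has more variants.

```python
import re

# A signed decimal number, optional whitespace, then an alphabetic unit or %.
VALUE_UNIT = re.compile(
    r"^\s*(?P<value>[-+]?\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z%]+)\s*$"
)


def split_value_unit(raw):
    """Split '12.5 kg' into (12.5, 'kg'); return None if it does not parse."""
    match = VALUE_UNIT.match(raw)
    if match is None:
        return None
    return float(match.group("value")), match.group("unit")


print(split_value_unit("12.5 kg"))
print(split_value_unit("300g"))
print(split_value_unit("n/a"))
```

Values that fail to parse are worth listing in the assessment, since they often reveal inconsistent data entry.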
When looking at the distribution of numeric values, are there outliers (values that are either a lot smaller or a lot larger than the rest)? It is sometimes important to ask those who provided you with the data if they can explain those outliers. You will often want to exclude outliers during training, as they may skew your model toward them, resulting in less than ideal results for all the other data points.
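One common rule of thumb for flagging outliers, shown here as a sketch rather than the only option, is the 1.5 × IQR fence: anything further than 1.5 interquartile ranges outside the first or third quartile is flagged. The sample values and the multiplier `k` are assumptions.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


values = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(values)
print(outliers)
```

Flagged values are candidates for a conversation with the data provider, not automatic deletions: a value of 95 may be a sensor glitch or a genuine sales spike.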
The quality of a dataset is inversely proportional to the number of operations you need to apply to it to make it a clean dataset. That is to say that if you don't need to do anything on the data provided to you, then it is a good dataset.
What is learning according to machine learning?
It is (for supervised learning) looking at numerous samples, decomposing them into input variables and their associated target variable, and deriving according to an algorithm how to predict the target variable given input variables.
It is the (potentially lossy) compression of the observed samples, where the learning algorithm describes the compression/decompression algorithm. The compressed data is the information necessary for the algorithm to make predictions (decompression).
It is the creation of some "memory" of the observed samples. Whereas an untrained model has no memory of the dataset since it hasn't seen the data, a trained model has some form of memory. A simple model such as sklearn's DummyRegressor will learn and memorize the mean of the target variable. It may not have learned and memorized much, but it has built its internal model of the data.
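To make the "memory" idea concrete, here is a tiny stand-in for a mean-predicting dummy model. It mimics the `fit`/`predict` shape used by scikit-learn estimators but is not sklearn's implementation; the class name and data are hypothetical.

```python
class MeanRegressor:
    """Toy model whose only 'memory' of the training data is the target mean."""

    def __init__(self):
        self.mean_ = None  # untrained: nothing memorized yet

    def fit(self, X, y):
        # Learning here is just storing one number derived from the data.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        if self.mean_ is None:
            raise RuntimeError("model has not been fitted")
        # The inputs are ignored: every prediction is the memorized mean.
        return [self.mean_ for _ in X]


model = MeanRegressor().fit([[1], [2], [3]], [10, 20, 30])
predictions = model.predict([[4], [5]])
print(predictions)
```

The whole training set has been compressed into a single float, which ties back to the compression view of learning: very lossy, but still a model of the data.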
It is to imitate as closely as possible the source of data it is trained on. This means that given input variables, it should produce target values that are as close as possible to those observed during training (learning).
How do you determine whether you have a useful model?
In machine learning, one way you can determine that you have a useful model is to compare it against baseline models. In a field such as time series, one can create models that are based on previous values, such as a naive (lag 1) model, which predicts that the next value will be equal to the current value, or a moving average, which averages the last X values and returns this average as the next predicted value. In this field, we expect that a model that can predict more accurately than those baseline models may prove to be useful.
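The two baselines can be sketched in a few lines, along with a simple one-step-ahead backtest that scores them with mean absolute error. The series is made up, and `backtest` is a hypothetical helper; any candidate model would be scored the same way and compared against these numbers.

```python
def naive_forecast(history):
    """Predict that the next value equals the last observed value."""
    return history[-1]


def moving_average_forecast(history, window):
    """Predict the mean of the last `window` observed values."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def backtest(series, forecaster, start):
    """Score one-step-ahead forecasts of series[start:], each made
    using only the values observed before that step."""
    predictions = [forecaster(series[:t]) for t in range(start, len(series))]
    return mean_absolute_error(series[start:], predictions)


series = [10, 12, 11, 13, 12, 14, 13, 15]
naive_mae = backtest(series, naive_forecast, start=3)
ma_mae = backtest(series, lambda h: moving_average_forecast(h, window=3), start=3)
print(naive_mae, ma_mae)
```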
In other cases, we may already have an existing model from which we can generate predictions. This model may also serve as a baseline which other models will have to beat in order to replace it.
There is however a case where the answer isn't clear: what happens when out of various models, one of the baseline models is the best? Then it becomes a question of whether the prediction interval produced by the model satisfies your needs for your problem.
Note that even when you beat the baseline models, if the best model still does not satisfy the error requirements that have been defined on the metrics you care about, the model may still not be useful. For example, an OCR model that has 95% accuracy at the character level will still produce about 5 errors every 100 characters on average, which may be too high for the requirements of the system to be produced.
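The arithmetic behind this example can be made explicit. The sketch below also estimates how likely a whole field is to come out error-free, under the simplifying assumption that character errors are independent (real OCR errors tend to cluster, so treat this as a rough estimate). The 20-character field length is an invented example.

```python
def expected_errors(char_accuracy, n_chars):
    """Expected number of wrong characters out of n_chars."""
    return (1 - char_accuracy) * n_chars


def prob_error_free(char_accuracy, n_chars):
    """Probability that all n_chars are correct, assuming independent errors."""
    return char_accuracy ** n_chars


print(round(expected_errors(0.95, 100), 2))
print(round(prob_error_free(0.95, 20), 3))
```

Under that assumption, only about a third of 20-character fields would be read without any error, which shows how a seemingly high per-character accuracy can still fail a per-field requirement.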
Another factor to keep in mind when comparing models is whether the improvements on the metric you are optimizing are significant. The mean absolute error (MAE) may be lower for one model than for another, but if its confidence interval is wider than the other model's and the two intervals overlap, then you cannot claim that one model is really better than the other. There may even be cases where you will prefer a model with a higher MAE simply because its confidence interval is narrower than that of a model with a lower MAE.
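One way to obtain such intervals, offered here as a sketch and not the only valid method, is a percentile bootstrap over each model's absolute errors. The error values, the number of resamples, and the helper names are all assumptions.

```python
import random


def bootstrap_mae_interval(abs_errors, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the MAE."""
    rng = random.Random(seed)
    n = len(abs_errors)
    # Resample the errors with replacement and recompute the MAE each time.
    maes = sorted(
        sum(rng.choices(abs_errors, k=n)) / n for _ in range(n_resamples)
    )
    lower = maes[round((alpha / 2) * n_resamples)]
    upper = maes[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def intervals_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical absolute errors: model B has the lower MAE (0.95 vs 1.0)
# but its errors are far more spread out.
errors_a = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7]
errors_b = [0.1, 2.2, 0.2, 2.6, 0.1, 0.3, 1.9, 0.2]

interval_a = bootstrap_mae_interval(errors_a)
interval_b = bootstrap_mae_interval(errors_b)
print(interval_a, interval_b, intervals_overlap(interval_a, interval_b))
```

Overlapping intervals are a conservative signal rather than a formal test; when both models are scored on the same data points, bootstrapping the paired difference in errors is usually more informative.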
There might be other attributes of the model that you may need to take into account. A model that takes days to train may be useless when you need it to be up to date every hour. A faster to train but less accurate model may be more useful in this case.
- Can the model under evaluation beat baseline models?
- Does the model under evaluation satisfy error requirements?
- Is the model under evaluation significantly better than existing models?
- Does the model under evaluation satisfy requirements such as time to train (e.g., less than 6 hours) or necessary resources (CPU, GPU, RAM)?
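The checklist above can be sketched as a single function that reports which criteria pass. Everything here is hypothetical: the field names, the use of MAE as the metric, and the thresholds are placeholders for whatever your problem defines.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    mae: float                  # error of the model under evaluation
    baseline_mae: float         # best error among the baseline models
    max_acceptable_mae: float   # error requirement for the problem
    significantly_better: bool  # outcome of your significance check
    train_hours: float
    max_train_hours: float


def is_useful(evaluation):
    """Return (overall verdict, per-question results)."""
    checks = {
        "beats_baseline": evaluation.mae < evaluation.baseline_mae,
        "meets_error_requirement": evaluation.mae <= evaluation.max_acceptable_mae,
        "significantly_better": evaluation.significantly_better,
        "meets_training_budget": evaluation.train_hours <= evaluation.max_train_hours,
    }
    return all(checks.values()), checks


verdict, checks = is_useful(
    Evaluation(
        mae=1.2,
        baseline_mae=1.6,
        max_acceptable_mae=1.0,
        significantly_better=True,
        train_hours=2.0,
        max_train_hours=6.0,
    )
)
print(verdict, checks)
```

In this made-up case the model beats the baseline and trains fast enough, yet it is still not useful because it misses the error requirement.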