12 Apr 2020

Tesseract TSV format


Tesseract (an open source OCR engine) supports a TSV format as output. I looked online for some documentation about the columns but couldn't find anything, so I looked at the source code.

Here is a summary description of each column, what they represent, and the range of valid values they can have.

  • level: hierarchical layout (a word is in a line, which is in a paragraph, which is in a block, which is in a page), a value from 1 to 5
    • 1: page
    • 2: block
    • 3: paragraph
    • 4: line
    • 5: word
  • page_num: page number, starting from 1; when given a list of images it is the index of the file, and when given a multi-page document it is the page within that document
  • block_num: block number within the page, starting from 1 (0 on rows above the block level, such as the page row)
  • par_num: paragraph number within the block, starting from 1 (0 on rows above the paragraph level)
  • line_num: line number within the paragraph, starting from 1 (0 on rows above the line level)
  • word_num: word number within the line, starting from 1 (0 on rows above the word level)
  • left: x coordinate in pixels of the text bounding box top left corner, starting from the left of the image
  • top: y coordinate in pixels of the text bounding box top left corner, starting from the top of the image
  • width: width of the text bounding box in pixels
  • height: height of the text bounding box in pixels
  • conf: confidence value, from 0 (no confidence) to 100 (maximum confidence), -1 for all levels except 5
  • text: detected text, empty for all levels except 5

Here is an example of the TSV format output, for reference.

level page_num block_num par_num line_num word_num left top width height conf text
1 1 0 0 0 0 0 0 1024 800 -1
2 1 1 0 0 0 98 66 821 596 -1
3 1 1 1 0 0 98 66 821 596 -1
4 1 1 1 1 0 105 66 719 48 -1
5 1 1 1 1 1 105 66 74 32 90 The
5 1 1 1 1 2 205 67 143 40 87 (quick)
5 1 1 1 1 3 376 69 153 41 89 [brown]
5 1 1 1 1 4 559 71 105 40 89 {fox}
5 1 1 1 1 5 687 73 137 41 89 jumps!
4 1 1 1 2 0 104 115 784 51 -1
5 1 1 1 2 1 104 115 96 33 91 Over
5 1 1 1 2 2 224 117 60 32 89 the
5 1 1 1 2 3 310 117 224 39 88 $43,456.78
5 1 1 1 2 4 561 121 136 42 92 <lazy>
5 1 1 1 2 5 722 123 70 32 92 #90
5 1 1 1 2 6 818 125 70 41 89 dog
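
To make the column descriptions concrete, here is a minimal sketch of how the word rows could be grouped back into lines of text. It assumes the columns are separated by tabs, as the format name implies, and it uses only a few rows from the example above. The names read_lines and min_conf are made up for this illustration and are not part of Tesseract.

```python
import csv
import io
from collections import defaultdict

# A few rows from the example above, with tabs between the columns.
SAMPLE_TSV = "\n".join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext",
    "1\t1\t0\t0\t0\t0\t0\t0\t1024\t800\t-1\t",
    "4\t1\t1\t1\t1\t0\t105\t66\t719\t48\t-1\t",
    "5\t1\t1\t1\t1\t1\t105\t66\t74\t32\t90\tThe",
    "5\t1\t1\t1\t1\t2\t205\t67\t143\t40\t87\t(quick)",
    "4\t1\t1\t1\t2\t0\t104\t115\t784\t51\t-1\t",
    "5\t1\t1\t1\t2\t1\t104\t115\t96\t33\t91\tOver",
    "5\t1\t1\t1\t2\t2\t224\t117\t60\t32\t89\tthe",
])

WORD_LEVEL = 5  # level 5 rows are the only ones that carry text


def read_lines(tsv_text, min_conf=0.0):
    """Group word rows into lines keyed by (page, block, paragraph, line)."""
    # QUOTE_NONE: the detected text may contain quote characters that
    # should be kept as-is rather than treated as CSV quoting.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words_by_line = defaultdict(list)
    for row in reader:
        if int(row["level"]) != WORD_LEVEL:
            continue
        if float(row["conf"]) < min_conf:
            continue
        key = tuple(
            int(row[name])
            for name in ("page_num", "block_num", "par_num", "line_num")
        )
        words_by_line[key].append(row["text"] or "")
    return {key: " ".join(words) for key, words in words_by_line.items()}


lines = read_lines(SAMPLE_TSV)
confident_lines = read_lines(SAMPLE_TSV, min_conf=90)
```

Here lines maps (1, 1, 1, 1) to "The (quick)", while confident_lines keeps only the words with a confidence of at least 90.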

31 Mar 2020

Resume filtering


What do I look for in a resume?

Let's start with the things I don't look at in a resume for a position in which experience is expected:

  • Your university: I couldn't care less where you've studied. While having a university degree may sometimes tell me you've been serious enough to go through the pain of university, I also know it's possible to go through university without acquiring any knowledge.
  • Your grades: It's great that you have A+ in so many classes. However grades do not always generalize to an effective worker. Furthermore, most of the applicants will also have high grades, which makes it a noisy/useless signal. Do understand that I also do not have the time to fact-check your grades, so you might as well have written you had a perfect score in every class. Be careful with this, as some people will see it as something to probe you on during the initial interviews, and this could backfire on you.
  • Your extra-curricular activities: Unless you are doing extra-curricular activities that are relevant to the position you are applying for, I don't need them. I am not filtering for people with whom I could do things outside of work.
  • The list of all your publications: I work in a scientific field, and while for some people publications are badges of success, I see a list of articles as filler in a resume. If I want to know all the articles you've published, I can look them up on Google Scholar. Instead, focus on listing the areas of research you're interested in and indicating how many papers in those areas you've published.

Here are the things I look for:

  • 2-3 pages: If you cannot summarize your accomplishments in 3 pages or fewer, then you don't know how to summarize. I don't want to know everything you've done in your professional life. I don't want to know every single paper you've published, every conference/workshop you've attended, every grant you've received, every honor you've been given.
  • Relevant work experience: If you've been working in the same position for a different company, that will generally be a good thing. It means you already have prior experience in the field, you have seen how another company has accomplished what you might still do at your new job. It means you'll be easier to ramp up and may require less supervision/support during that period.
  • List what you did: If you only list the title/position you had and the company, I have no clue what you did there. You might as well not have worked there. Clearly list the big tasks/milestones you've worked on and what was your contribution.
  • List clear and quantitative accomplishments: "Increased sales by 200%", "Largely reduced operation costs" may sound great, but without the ability to compare against something, those accomplishments do not mean much.
  • List technologies you've used: When hiring, it is common to want new recruits to already have some experience with the tools that are used at the company, especially if you need them to ramp up quickly. This is even more important when the set of tools used in your industry is common enough, as it will communicate how in touch with the field you are.
  • How long you've worked at your previous jobs: Are you the kind of person that will keep a job for a year then move to the next one? That works against you. I'm not looking to work with people that have a short time horizon for their positions. I want to work with people that will go through the marathon with me.
  • Proper ordering of the sections of your resume: It's a little thing, but the ordering of the sections in your resume will communicate a lot to me. It will let me know whether you know how to prioritize, which is a critical skill. This point goes hand in hand with the "2-3 pages" item, as they both show that you are able to critically assess the content you produce.
  • What you studied in university: I expect people that apply to the positions I filter for to have studied in a certain set of domains. This is generally not a very important criterion, but it gives me a better idea of your professional career.
  • When you finished your degree: This is used to determine how recent your education is. I consider professional experience once the degree is completed, not while it is being completed.
  • Free of grammatical and syntactical errors: Make sure your resume doesn't have major grammatical or syntactical mistakes. Those communicate a lack of seriousness and professionalism that I would expect in your future communication with others in the company. If your resume was initially written in a different language, make sure it is thoroughly translated.
  • GitHub account: If you list one, expect me to look at it. If you don't contribute much (fewer than 20 contributions per year), then it's simply better not to list it.
  • Personal website: If you list one, I will look at it as well. I have a background in web development, so I will use it as an additional way to evaluate you. Make sure it is online. A personal website that is down or for which the domain expired will lose you points.
30 Mar 2020

Data profiling


What is data profiling?

Data profiling is the process of extracting information about data. Given tabular data (think of an Excel spreadsheet), we commonly want to extract the following properties about each column:

  • Number of rows
  • Number of cells without data
  • Number of cells with a value of zero
  • Number of distinct/unique values
  • Number of duplicate rows
  • Minimum, mean, median, maximum, quantiles, range, standard deviation, variance, sum
  • Values distribution
  • Most common values
  • Examples of values
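
As a rough sketch, several of the properties listed above can be computed with the Python standard library alone. The function names and the heights_cm sample are made up for this illustration, the column is assumed to hold numbers, and None is assumed to mark a cell without data.

```python
import statistics
from collections import Counter


def profile_column(values):
    """Return basic profiling statistics for one numeric column."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    profile = {
        "rows": len(values),
        "missing": len(values) - len(present),
        "zeros": sum(1 for v in present if v == 0),
        "distinct": len(counts),
        "most_common": counts.most_common(3),
    }
    if present:  # these statistics are undefined for an empty column
        profile.update({
            "min": min(present),
            "max": max(present),
            "mean": statistics.mean(present),
            "median": statistics.median(present),
            "sum": sum(present),
        })
    return profile


def count_duplicate_rows(rows):
    """Count rows that repeat an earlier row exactly."""
    counts = Counter(tuple(row) for row in rows)
    return sum(n - 1 for n in counts.values())


heights_cm = [170, 0, None, 182, 170, -5]
report = profile_column(heights_cm)
duplicates = count_duplicate_rows([(1, "a"), (1, "a"), (2, "b"), (1, "a")])
```

In this sample the report shows one missing cell, one zero and a minimum of -5, which are the kind of values you would want to question before using the data.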

The process of data profiling allows a data scientist or engineer to quickly identify potential sources of problems in the data, such as:

  • Negative numbers when numbers should all be positive
  • Missing values which may need to be imputed or for which the row may have to be removed
  • Issues with the distribution of values such as class imbalance if we plan to solve a classification problem

In an ideal situation, data profiling reports:

  • No missing cells: you do not have to ask whether the data can be filled in, and you do not need to impute it using assumptions
  • Proper normalization of the data (e.g., values separated from their units): the data can be used as-is, otherwise you need to transform the column to extract the numeric value from the unit
  • All the data in a column using the same unit, unless otherwise specified (e.g., you do not want data in meters, centimeters, feet or inches in the same column): your data is consistent, otherwise you need to identify the scales/units used and transform the data to use a common unit
  • Little to no row duplication: you know that your data was collected without creating duplicate entries, which sometimes happens when databases are merged manually to create a data file, otherwise you may have to drop the duplicate rows or identify how many of the duplicates should be kept
29 Mar 2020

Time series forecasting projects


What are the general steps of a time series forecasting project?

Using a tool such as pandas-profiling, the dataset provided by the client is profiled and a variety of summary statistics are produced. For numerical values these include the min, mean, median, max, quartiles, number of samples, number of zeros, and number of missing values. Other types of data have their own set of properties computed.

These summary statistics allow you to quickly have a glance at the data. You will want to look for missing values to assess whether there's a problem with the provided data. Sometimes missing data can imply that you should use the prior value that was set. Sometimes it means that the data isn't available, which can be an issue and may require you to do some form of data imputation down the road.

Common things to look for in time series data are gaps in data (periods where no data has been recorded), the trend/seasonality/residual decomposition per time series, the autocorrelation and partial autocorrelation plots, distribution of values grouped by a certain period (by month, by week, by day, by day of the week, by hour), line/scatter plots of values grouped by the same periods.

Data is rarely clean and ready to be consumed. This means many things: removing invalid values, converting invalid values or values out of range into a valid range, splitting cells that have multiple values in them into separate cells (e.g., "10 cm" split into "10" and "cm").
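
A small sketch of two of these cleaning steps, using only the standard library. The pattern and the function names are assumptions made for this example, and real data will usually need a more forgiving pattern.

```python
import re

# A number, optional whitespace, then a unit made of letters or "%".
VALUE_WITH_UNIT = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z%]+)\s*$")


def split_value_and_unit(cell):
    """Split a cell such as "10 cm" into (10.0, "cm"); None if it does not match."""
    match = VALUE_WITH_UNIT.match(cell)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


def clamp(value, low, high):
    """Bring a value that is out of range back into [low, high]."""
    return max(low, min(high, value))


parsed = split_value_and_unit("10 cm")
```

Returning None for cells that do not match keeps invalid values visible, so you can decide later whether to remove or fix them.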

A variety of transformations can be applied to the cleaned data, ranging from data imputation (setting values where values are missing using available data), applying a function on the data, such as power, log or square root transform, differencing (computing the difference with the prior value), going from time zoned date time to timestamps, etc.
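
Three of these transformations can be sketched in a few lines. This is a simplified illustration on plain lists, with None assumed to mark a missing value and the imputation assumed to reuse the prior value. The function names are made up for the example.

```python
import math


def forward_fill(values):
    """Impute missing values (None) with the last value seen before them."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)  # stays None until a first value is seen
    return filled


def difference(values):
    """Difference with the prior value; the result is one item shorter."""
    return [current - prior for prior, current in zip(values, values[1:])]


def log_transform(values):
    """Natural log of each value; only valid for strictly positive values."""
    return [math.log(value) for value in values]


filled = forward_fill([1, None, None, 4])
deltas = difference([1, 4, 9])
```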

Common feature generation transformations are applied, such as computing lagged values on variables, moving averages/median, exponential moving averages, extracting the latest min/max, counting the number of peaks encountered so far, etc. Feature generation is where you create additional information for your model to consume with the hope that it will provide it some signal it can make use of.
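
Here is a minimal sketch of three of these features on a plain list. The function names are made up for the example. Note that the averages include the current value, so if the model must only see the past, apply the lag first.

```python
def lag(values, k):
    """Shift the series by k steps; the first k entries have no history."""
    return [values[i - k] if i >= k else None for i in range(len(values))]


def moving_average(values, window):
    """Mean of the last `window` values; None until enough history exists."""
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(values[i + 1 - window:i + 1]) / window)
    return averages


def exponential_moving_average(values, alpha):
    """Weight the newest value by alpha and the running average by 1 - alpha."""
    averages, ema = [], None
    for value in values:
        ema = value if ema is None else alpha * value + (1 - alpha) * ema
        averages.append(ema)
    return averages


lagged = lag([1, 2, 3, 4], 1)
smoothed = moving_average([1, 2, 3, 4], 2)
```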

Before attempting to find a good model for the problem at hand you want to start with simple/naive models. The time series naive model simply predicts the future by using the latest value as its prediction.
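
The naive model described above fits in a few lines. The function name and the horizon parameter are made up for this sketch.

```python
def naive_forecast(history, horizon):
    """Predict the next `horizon` values by repeating the latest value."""
    if not history:
        raise ValueError("history must contain at least one value")
    return [history[-1]] * horizon


forecast = naive_forecast([3, 5, 8], horizon=3)
```

Any model you try later should do better than this on your error metric, otherwise it is not worth its added cost.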

With a baseline established, you can now run a variety of experiments, which generally means trying different models on the same dataset while evaluating them the same way (same training/validation splits). In time series, we do cross-validation by creating a train/validation split where the validation split (i.e., the samples in the validation set) occurs temporally after the training split. Each cross-validation split represents a different point in time at which the models are trained and evaluated for their performance.
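
One way to build such splits is an expanding window, where each split trains on everything before its validation period. This is a sketch of that one scheme, not the only way to do it, and the function and parameter names are made up for the example. Samples are assumed to be sorted by time.

```python
def expanding_window_splits(n_samples, n_splits, validation_size):
    """Return (train_indices, validation_indices) pairs, oldest split first."""
    splits = []
    for i in range(n_splits):
        # The last split validates on the final `validation_size` samples,
        # and each earlier split ends one validation period before the next.
        validation_end = n_samples - (n_splits - 1 - i) * validation_size
        validation_start = validation_end - validation_size
        if validation_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        train = list(range(validation_start))
        validation = list(range(validation_start, validation_end))
        splits.append((train, validation))
    return splits


splits = expanding_window_splits(n_samples=10, n_splits=3, validation_size=2)
```

With 10 samples, the first split trains on samples 0 to 3 and validates on 4 and 5, and the last one trains on samples 0 to 7 and validates on 8 and 9.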

After you've completed a few experiments you'll have a variety of results to analyze. You will want to look at your primary performance metric, which is generally defined as an error metric you are trying to minimize. Examples of error metrics are MAE, MSE, RMSE, MAPE, SMAPE, WAPE, MASE. Performance is evaluated on your validation data (out-of-sample) and lets you have an idea of how the model will perform on data it hasn't seen during training, which closely replicates the situation you will encounter in production.
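
Four of these metrics are short enough to write out. The sketch below assumes two lists of the same length and, for MAPE, that no actual value is zero.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the data."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; undefined if an actual is 0."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


actual = [100, 200, 400]
predicted = [110, 190, 400]
errors = {
    "mae": mae(actual, predicted),
    "rmse": rmse(actual, predicted),
    "mape": mape(actual, predicted),
}
```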

With many models and their respective primary metric computed, you can pick the one which has produced the lowest error across the cross-validation train/validation splits.
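
As a sketch, if the error of each model is averaged over the splits, selection is a single call to min. Averaging is an assumption of this example (you could also compare medians or worst cases), and the model names and numbers are made up.

```python
import statistics


def select_model(errors_by_model):
    """Return the name of the model with the lowest mean error over the splits."""
    return min(
        errors_by_model,
        key=lambda name: statistics.mean(errors_by_model[name]),
    )


# Hypothetical error values, one per cross-validation split.
errors_by_model = {
    "naive": [10.0, 12.0, 11.0],
    "candidate": [8.0, 9.0, 13.0],
}
best = select_model(errors_by_model)
```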

Once the model has been selected, it is packaged to be deployed. This generally implies something as simple as pickling the model object and loading it in the remote environment so it can be used to do predictions.
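
A minimal sketch of that round trip with the pickle module from the standard library. NaiveModel is a stand-in made up for this example, and the bytes are kept in memory instead of being written to a file and sent to a remote environment.

```python
import pickle


class NaiveModel:
    """Stand-in model that repeats the last value it saw during fit."""

    def fit(self, history):
        self.last_value_ = history[-1]
        return self

    def predict(self, horizon):
        return [self.last_value_] * horizon


model = NaiveModel().fit([3, 5, 8])

# Packaging side: turn the fitted model into bytes.
payload = pickle.dumps(model)

# Serving side: the class definition must be importable there as well.
# Only unpickle data you trust, since loading a pickle can run code.
restored = pickle.loads(payload)
prediction = restored.predict(2)
```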

There are two modes of forecasting:

  • Offline: Data used for forecasting is collected during a period of time and then a scheduled task uses this newly available data to create new forecasts. This is generally used for systems with large amounts of data where the forecasts are not needed in real-time, such as forecasting tomorrow's stock price, the minimum and maximum temperature, the volume of stocks that will be sold during the week, etc.
  • Online: Data used for forecasting is given to the model and predictions are expected to be returned within a short time frame, on the order of less than a second to a minute.

Raw data is transformed and feature engineered, then given to the model to use to forecast.

28 Mar 2020

Writing with simple vocabulary


Why should I write using simple, frequently used words?

Using simple language will allow more people to understand your message.

Using simple words to explain ideas that are more involved makes it easier to understand those ideas.

It's also easier to identify errors in reasoning when you're expressing yourself with simple language.

Using rare words does not mean that you are more intelligent or have smart thoughts. It simply means you're trying to conceal yourself by using words others may not understand.

Like writing software, you should aim to keep your writing simple. It makes it easier on the readers, who don't have to spend their time trying to understand what you're trying to say.

Writing such articles is very difficult. For example, this article was written using only the 5000 most common words according to Wiktionary.