data

## Select, Fetch, Extract, Watch: Web Scraping Architecture

The internet is full of useful information just waiting to be collected, processed and analysed. While getting started scraping data from the web is straightforward, it's easy to tangle the whole process together in a way that makes it fragile to failure, or hard to change as requirements change. And the internet is inconsistent and ever-changing; in anything but the smallest scraping projects you're going to run into failures. I find it useful to conceptually break web scraping down into four steps: selecting the data to retrieve, fetching the data from the internet, extracting structured data from the raw responses, and watching the process to make sure it's functioning correctly.
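The four steps can be kept separate in code. Here's a minimal Python sketch (all function names are hypothetical, and the fetch step is injected so the network layer can be mocked or retried independently):

```python
def select_urls():
    # Select: decide what to retrieve (here a fixed list;
    # in practice a sitemap, index page or crawl frontier).
    return ["https://example.com/a", "https://example.com/b"]

def extract_title(html):
    # Extract: pull structured data out of a raw response.
    start = html.find("<title>")
    end = html.find("</title>")
    return html[start + len("<title>"):end] if start >= 0 and end >= 0 else None

def run_pipeline(fetch, stats):
    # Fetch is passed in so it can be swapped for a mock in tests;
    # stats is the watch step, recording failures instead of crashing.
    results = []
    for url in select_urls():
        try:
            html = fetch(url)
        except Exception:
            stats["failed"] = stats.get("failed", 0) + 1
            continue
        title = extract_title(html)
        if title is not None:
            results.append((url, title))
    return results
```

Because fetching is isolated, extraction can be re-run over stored responses without touching the network.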

• Edward Ross
data

• Edward Ross
data

## Price Hysteresis

The demand curve in economics represents the relationship between price and quantity sold. It's generally not possible to know the demand curve without varying prices to measure it. But if you lower a price and then raise it again you don't always get back to your original volume: this is price hysteresis. Hysteresis is where the value of a quantity depends on how you got there. Price hysteresis means that the quantity of goods sold doesn't depend just on the price today, but on previous prices too.

• Edward Ross
data

## More Profitable A/B with Test and Roll

When running an A/B test the required sample sizes can seem enormous. For example, to observe a 2 percentage point uplift on a 60% conversion rate requires over 9,000 people in each group at the standard 95% confidence level with 80% power. If you've got fewer than 18,000 customers you can reach, which is very common in business to business settings, it's impossible to conduct this test. But if you frame the problem in terms of expected outcomes, running an A/B test on a few hundred users may actually greatly improve your results.
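The headline number can be checked with the standard normal-approximation formula for a two-proportion test (this is the classical power calculation, not the Test & Roll method itself; the z values below are the usual approximations for 95% confidence and 80% power):

```python
from math import ceil

def ab_sample_size(p1, p2, z_alpha=1.96, z_beta=0.84):
    # Per-group sample size for a two-proportion z-test,
    # using the normal approximation to the binomial.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a 2 percentage point uplift on a 60% conversion rate:
n = ab_sample_size(0.60, 0.62)  # over 9,000 per group
```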

• Edward Ross
data

## Reference Sets as Pervasive Models

Suppose you have a long standing heart condition, and are considering undergoing a surgical procedure that could alleviate the condition, but has its own set of risks. You happen to have a good friend who's a statistician at the hospital you're being seen at and can get you historical frequencies of complications. However she asks you which reference set you want to use. Do you want complications related to just the specific procedure treating your condition, or for all procedures on that area of the heart for a variety of conditions?

• Edward Ross
data

## The Way of the Physicist

A large number of the physicists I trained with are now data scientists, and it's not uncommon to meet a data scientist who trained in Physics. Part of this is because there aren't a lot of physics jobs, especially in Australia. But another reason is that the training we get as physicists is very similar to what you need for data science. David Bailey, a physicist from the University of Toronto, has objectives for their undergraduate Physics program which describe "the Way of the Physicist":

• Edward Ross
maths

## Probability Distributions Between the Mean and the Median

The normal distribution is used throughout statistics: because of the Central Limit Theorem it occurs in many applications, and it's also computationally convenient. The expectation value of the normal distribution is the mean, which has many nice arithmetic properties, but the drawback of being sensitive to outliers. When discussing constant models I noted that the minimiser of the Lᵖ error is a generalisation of the mean; for $$p = 2$$ it's the mean, for $$p = 1$$ it's the median, and for $$p = \infty$$ it's the midrange (halfway between the maximum and minimum points).
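This can be checked numerically by brute-force minimising the Lᵖ error over a grid of candidate constants (a sketch; the data and grid resolution are arbitrary):

```python
def lp_error(c, xs, p):
    # L-infinity error is the maximum deviation; otherwise sum |x - c|^p.
    if p == float("inf"):
        return max(abs(x - c) for x in xs)
    return sum(abs(x - c) ** p for x in xs)

def lp_minimiser(xs, p, steps=10000):
    # Brute-force search over an even grid between min and max.
    lo, hi = min(xs), max(xs)
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda c: lp_error(c, xs, p))

xs = [1, 2, 3, 10]
# p=2 recovers the mean (4.0), p=1 the median (any point in [2, 3]),
# and p=inf the midrange (5.5).
```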

• Edward Ross
data

## Metrics for Binary Classification

When evaluating a binary classifier (e.g. will this user convert?) the most obvious metric is accuracy: the probability that a random prediction is correct. One issue with this metric is that if 90% of the cases are one class a high accuracy isn't really impressive; you need to contrast it with a constant model predicting the most frequent class. More subtly, it's not a very sensitive measure; by measuring the cross-entropy of the predicted probabilities you get a much better idea of how well your model is working.
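A small sketch of the contrast (the predictions are made up; both models get every case right at a 0.5 threshold, but cross-entropy separates the confident model from the hesitant one):

```python
from math import log

def accuracy(y_true, y_prob):
    # Fraction of cases where thresholding the probability at 0.5 is correct.
    return sum(t == (p >= 0.5) for t, p in zip(y_true, y_prob)) / len(y_true)

def cross_entropy(y_true, y_prob):
    # Average negative log-likelihood of the predicted probabilities.
    return -sum(log(p) if t else log(1 - p) for t, p in zip(y_true, y_prob)) / len(y_true)

y_true = [True, True, True, False]
confident = [0.95, 0.90, 0.90, 0.10]
hesitant = [0.55, 0.60, 0.60, 0.45]
```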

• Edward Ross
data

## Building a Reputation in Data Science

As a professional your reputation is very important to your career success. To get people to offer you work, pay for your advice or buy products from you, they need to trust that you will deliver them value. The most common heuristic for this is your reputation: what other people say about you, what you have done and what certifications you have. It's important that your reputation is very specific to the kind of work you want others to buy.

• Edward Ross
maths

## Cosine Similarity is Euclidean Distance

In mathematics it's surprising how often something that's obvious (or trivial) to one person can be revolutionary (or weeks of work) to someone else. I was looking at the annoy (Approximate Nearest Neighbours, Oh Yeah) library and saw this comment: "Cosine distance is equivalent to Euclidean distance of normalized vectors". I hadn't realised it at all, but once the claim was made I could immediately verify it. Given two vectors u and v their distance is given by the length of the vector between them: $$d = \| u - v \| = \sqrt{(u - v) \cdot (u - v)}$$.
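The claim is easy to verify numerically: for unit vectors the squared Euclidean distance is $$\| u - v \|^2 = 2 - 2 u \cdot v$$, exactly twice the cosine distance, so nearest neighbours under one measure are nearest neighbours under the other. A quick check with arbitrary example vectors:

```python
from math import sqrt

def norm(u):
    return sqrt(sum(x * x for x in u))

def normalise(u):
    n = norm(u)
    return [x / n for x in u]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1 - dot / (norm(u) * norm(v))

def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u, v = [3.0, 4.0], [5.0, 12.0]
# Squared Euclidean distance of the normalised vectors
# equals twice the cosine distance.
```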

• Edward Ross
data

## Glassbox Machine Learning

Can we have an interpretable model that performs as well as blackbox models like gradient boosted trees and neural networks? In a 2020 Empirical Methods for Natural Language Processing keynote, Rich Caruana says yes. He calls interpretable models glassbox machine learning, in contrast to blackbox machine learning: models in which a person can explicitly see how they work, and follow the steps from inputs to outputs. This interpretability is subtly different from explainable (explainable to who?

• Edward Ross
data

## Centroid for Cosine Similarity

Cosine similarity is often used as a similarity measure in machine learning. Suppose you have a group of points (like a cluster) and you want to represent the group by a single point: the centroid. Then you can talk about how well formed the group is by the average distance of points from the centroid, or compare it to other centroids. Surprisingly it's not much more complex than finding the geometric centre in Euclidean space, if you pick the right coordinate system.
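The "right coordinate system" is the unit sphere: project each point onto it, take the ordinary componentwise mean, and project back. A minimal sketch:

```python
from math import sqrt

def normalise(u):
    n = sqrt(sum(x * x for x in u))
    return [x / n for x in u]

def cosine_centroid(points):
    # Normalise each point onto the unit sphere, average componentwise,
    # then normalise the mean back onto the sphere.
    unit = [normalise(p) for p in points]
    mean = [sum(col) / len(unit) for col in zip(*unit)]
    return normalise(mean)

points = [[1.0, 0.0], [0.0, 1.0]]
```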

• Edward Ross
data

## Using Behaviour to Understand Items

When people access products online their behaviour gives lots of information about both the people and the products. This information deeply enriches your understanding of how to better serve your customers and how your products are related to each other, and can help answer deeper questions about them. However you need to find a way to unlock the information. Using behavioural information can greatly improve modelling on the tabular data in your database.

• Edward Ross
data

## Structuring a Project Like a Kaggle Competition

Analytics projects are messy. It's rarely clear at the start how to frame the business problem, whether a given approach will actually work, or whether you can get it adopted by your partners. However once you have a framing, the modelling part can be iterated on quickly by structuring the project like a Kaggle competition. The modelling part of analytics projects will go smoothly only if you have clear evaluation criteria.

• Edward Ross
python

## Decorating Pandas Tables

When looking at Pandas dataframes in a Jupyter notebook it can be hard to find what you're looking for in a big mess of numbers. Something that can help is formatting the numbers, making them shorter and using graphics to highlight points of interest. Using Pandas style you can make the story of your dataframe stand out in a Jupyter notebook, and even export the styling to Excel. The Pandas style documentation gives pretty clear examples of how to use it.

• Edward Ross
analytics

## Insights From Google Analytics for a Small Blog

I started regularly writing this website to get better at writing, to build a portfolio and to share my learnings. Because of this I haven't been focussed on building an audience or looking at analytics. However now that I've been writing continuously for 6 months I thought I'd see if I could learn anything interesting from Google Analytics. I installed Google Analytics on the website a couple of weeks ago to see how people are actually viewing my site.

• Edward Ross
data

## Importance of Collecting Your Own Training Data

A couple of years ago I built whatcar.xyz which predicts the make and model of Australian cars. It was built mainly with externally sourced data and so only works sometimes, under good conditions. To make it better I've started collecting my own training data. External data sources are extremely convenient for training a model as they can often be obtained much more cheaply than curating your own data. But the data will almost always be different to what you are actually performing inference on, and so you're relying on a certain amount of generalisation.

• Edward Ross
nlp

## Building NLP Datasets from Scratch

There's a common misconception that the best way to build up an NLP dataset is to first define a rigorous annotation schema and then crowdsource the annotations. The problem is that it's actually really hard to guess the right annotation schema up front, and this is often the hardest part on the modelling side (as opposed to the business side). This is explained wonderfully by spaCy's Matthew Honnibal at PyData 2018.

• Edward Ross
statistics

## Experimental Generalisability

Experiments reveal the relationship between inputs and outcomes. With statistical methods you can often, with enough observations, tell whether there's a strong relationship or if it's just noise. However it's much harder to know how generally the relationship holds, yet this is essential for making decisions. Suppose you're testing two alternate designs for a website. One has a red and green button with a Santa hat and bauble, and the other has a blue button.

• Edward Ross
data

## Gridded Population of the World

I've spent the last few hours looking at the Gridded Population of the World, which estimates population density consistently with national censuses and population registers. This would have been a massive job to compile and is really interesting to look at. You can immediately see a strip through the north of India, Pakistan and Bangladesh that is incredibly dense. The north-east of China and the island of Java in Indonesia are also very dense.

• Edward Ross
data

## Dataflow Chasing

When making changes to a new model training pipeline I find it really useful to understand the dataflow. Analytics workflows are done as a series of transformations, taking some inputs and producing some outputs (or in the case of mutation, an input is also an output). Seeing this dataflow gives a big picture overview of what is happening and makes it easier to understand the impact of changes. Generally you can view the process as a directed and (hopefully) acyclic graph.
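With Python 3.9+ this view is directly supported by the standard library: represent the pipeline as a mapping from each step to its inputs, and `graphlib` gives you a topological order (the step names here are hypothetical):

```python
from graphlib import TopologicalSorter

# A hypothetical training pipeline as a DAG:
# each step maps to the steps whose outputs it consumes.
pipeline = {
    "raw_events": [],
    "clean_events": ["raw_events"],
    "features": ["clean_events"],
    "labels": ["clean_events"],
    "model": ["features", "labels"],
}

# static_order yields each step after all of its inputs, so anything
# appearing after a changed step is potentially affected by the change.
order = list(TopologicalSorter(pipeline).static_order())
```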

• Edward Ross
data

## Contact Tracing in Fighting Epidemics

The state government of Victoria, Australia has recently announced a plan for responding to the current Covid-19 pandemic. Based on epidemiological modelling they have committed to reducing restrictions based on 14 day averages of new case numbers. If the 14 day average of daily new cases is 30-50 in 3 weeks they will reduce restrictions; if it is below 5 a month after that they will reduce restrictions again.

• Edward Ross
maths

## Modelling the Spread of Infectious Disease

Understanding the spread of infectious disease is very important for public health policy. Whether it's the seasonal flu, HIV or a novel pandemic, the health implications of infectious diseases can be huge. A change in decision can mean saving thousands of lives and relieving massive suffering and related economic productivity losses. The SIR model is simple, but captures the underlying dynamics of how quickly infectious diseases spread.
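A minimal sketch of the SIR dynamics using Euler integration (the parameters are illustrative; beta/gamma = 3 corresponds to each case infecting three others in a fully susceptible population):

```python
def sir_step(s, i, r, beta, gamma, dt):
    # One Euler step of the SIR equations:
    # ds/dt = -beta*s*i, di/dt = beta*s*i - gamma*i, dr/dt = gamma*i
    new_infections = beta * s * i * dt
    recoveries = gamma * i * dt
    return s - new_infections, i + new_infections - recoveries, r + recoveries

def simulate(s0=0.99, i0=0.01, beta=0.3, gamma=0.1, days=200, steps_per_day=10):
    # Track the susceptible, infected and recovered fractions over time.
    s, i, r = s0, i0, 0.0
    dt = 1 / steps_per_day
    history = [(s, i, r)]
    for _ in range(days * steps_per_day):
        s, i, r = sir_step(s, i, r, beta, gamma, dt)
        history.append((s, i, r))
    return history
```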

• Edward Ross
data

## Embeddings for categories

Categorical objects with a large number of categories are quite problematic for modelling. While many models can work with them, it's really hard to learn parameters across many categories without doing a lot of work to get extra features. If you've got a related dataset containing these categories you may be able to meaningfully embed them in a low dimensional vector space, which many models can handle. Categorical objects occur all the time in business settings; products, customers, groupings and of course words.

• Edward Ross
data

## From Descriptive to Predictive Analytics

The starting point for an analysis is often summary statistics, such as the mean or the median. For some of these you're going to want them more precise, more timely, or cut by thinner segments. When the data gets too volatile to report on, it's a good time to reframe the descriptive statistics as a predictive problem. Businesses often have a lot of reporting around important metrics cut by key segments.

• Edward Ross
data

## Interpretable models with Cynthia Rudin

A while ago I came across Cynthia Rudin through her work on the FICO Explainable Machine Learning Challenge. Her team got an honourable mention and she wrote an opinion piece with Joanna Radin on explainable models. I think the article was hyperbolic in claiming interpretable models always work as well as black box models. On the other hand I only came across her because of this article, so taking an extreme viewpoint in the media is a good way to get attention.

• Edward Ross
data

## Topic Modelling to Bootstrap a Classifier

Sometimes you want to classify documents, but you don't have an existing classification. Building a classification that is mutually exclusive and completely exhaustive is actually very hard. Topic modelling is a great way to quickly get started with a basic classification. Creating a classification may sound easy until you try to do it. Think about novels; is a Sherlock Holmes novel a mystery novel or a crime novel (or both)? Or do we go more granular and call it a detective novel, or even more specifically a whodunit?

• Edward Ross
data

## Rough Coarse Geocoding

A coarse geocoder takes a human description of a large area like a city, area or country and returns the details of that location. I've been looking into the source of the excellent Placeholder (a component of the Pelias geocoder) to understand how this works. The overall approach is straightforward, but it takes a lot of work to get it to be reliable. A key component of a geocoder is a gazetteer that contains the names of locations.

• Edward Ross
data

## Refining Location with Placeholder

Placeholder is a great library for Coarse Geocoding, and I'm using it for finding locations in Australia. In my application I want to get the location to a similar level of granularity; however the input may be for a higher level of granularity. Placeholder doesn't directly provide a method to do this, but you can use their SQLite database to do it. For example to find the largest locality for East Gippsland, with Who's On First id 102049039, you can use the SQL.

• Edward Ross
maths

## Dip Statistic for Multimodality

If you've got a distribution you may want a way to tell if it has multiple components. For example a sample of heights may have a couple of peaks for different genders, or other attributes. While you could determine this by explicitly modelling them as a mixture, the results are sensitive to your choice of model. Another approach is statistical tests for multimodality. One common test is Silverman's Test, which checks for the number of modes in a kernel density estimate; the trick is choosing the right width.

• Edward Ross
sql

## Create User Sessions with SQL

Sometimes you may want to experiment with sessions and need to hand-roll your own in SQL. There's a good Mode blog post on how to do this. If you're using Postgres or Greenplum you may be able to use Apache MADlib's Sessionize for the basic case. This blog post will give a very brief summary of how to do this with some examples in Presto/Athena. The idea of a session is to capture a continuous unit of user activity.

• Edward Ross
maths

## Differentiation is Linear Approximations

Differentiation is the process of creating a local linear approximation of a function. This is useful because linear functions are very well understood and efficient to work with. One application of them is gradient descent, often used for fitting models in machine learning. In this context a function is something that maps between coordinate spaces. For example consider an image classifier that takes a 128x128 pixel image with three channels for colours (Red, Green, Blue) and returns a probability that the image contains a cat and the probability that the image contains a dog.
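For a one-dimensional function the idea is easy to check numerically: the error of the linear approximation f(x) + f′(x)·h shrinks quadratically as h shrinks (the cubic below is an arbitrary example):

```python
def linear_approximation(f, df, x, h):
    # The derivative gives the best local linear approximation:
    # f(x + h) ~ f(x) + df(x) * h, with error shrinking faster than h.
    return f(x) + df(x) * h

f = lambda x: x ** 3
df = lambda x: 3 * x ** 2

x = 2.0
# Approximation errors for shrinking step sizes; each tenfold
# reduction in h reduces the error roughly a hundredfold.
errors = [abs(f(x + h) - linear_approximation(f, df, x, h)) for h in (0.1, 0.01, 0.001)]
```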

• Edward Ross
data

## Data Tests with SQL

A challenge of data analytics is that the data can change as well as the code. The systems producing and collecting data are often changed, which can lead to missing or corrupt data. This can easily corrupt reports and machine learning systems. Worst of all, the data may be lost permanently. So if you're going to use some data it's important to check it regularly to catch the worst kinds of mistakes as early as possible.

• Edward Ross
data

## Sessionisation Experiments

You don't need a lot of data to prove a point. People often think statistics requires big expensive datasets that cost a lot to acquire. However in relatively unexplored spaces a small amount of data can have a high yield in changing a decision. I've been working on some problems around web sessionisation. The underlying model is that when someone visits your website they may come at different times for different reasons.

• Edward Ross
data

## Metrics you can Drive

Tracking a metric can help to drive dramatic improvements. When your team is focused on a metric you can test what has impact and quickly optimise it. However for this to work it's important that the metric is something you can actually impact. When people start looking for a metric to track they want to look for things that have a direct impact on the business, such as revenue, share price or customer satisfaction.

• Edward Ross
data

## Data Models

Information is useful in that it helps make better decisions. This is much easier if the data is represented in a way that closely matches the conceptual model of the business. Building a useful view of the data can dramatically decrease the time and cost of answering questions and even elevate the conversation to answering deeper questions about the business. A typical example of where analysis can help is trying to increase revenue of a digitally sold product.

• Edward Ross
data

## A Checklist for NLP models

When training machine learning models typically you get a training dataset for fitting the model and a test dataset for evaluating the model (on small datasets techniques like cross-validation are common). You typically assume the performance on your chosen metric on the test dataset is the best way of judging the model. However it's really easy for systematic biases or leakage to creep into the datasets, meaning that your evaluation will differ significantly from real world usage.

• Edward Ross
data

## Deep Neural Networks as a Building Block

Deep Neural Networks have transformed dealing with unstructured data like images and text, making totally new things possible. However they are difficult to train, require a large amount of relevant training data, are hard to interpret, hard to debug and hard to refine. I think for these reasons there's a lot of space to use neural networks as a building block for extracting structured data for less parameterised models. Josh Tenenbaum gave an excellent keynote at ACL 2020 titled Cognitive and computational building blocks for more human-like language in machines.

• Edward Ross
data

## Sequential Weak Labelling for NER

The traditional way to train an NER model on a new domain is to annotate a whole bunch of data. Techniques like active learning can speed this up, but neural models starting from random weights especially require a ton of data. A more modern approach is to take a large pretrained NER model and fine tune it on your dataset. This is the approach of AdaptaBERT (paper), using BERT. However this takes a large amount of GPU compute and finicky regularisation techniques to get right.

• Edward Ross
data

## Using HTML in NLP

Many documents available on the web have meaningful markup. Headers, paragraph breaks, links, emphasis and lists all change the meaning of the text. A common way to deal with HTML documents in NLP is to strip away all the markup, e.g. with Beautiful Soup's .get_text. This is fine for a bag of words approach, but for more structured text extraction or language model this seems like throwing away a lot of information.

• Edward Ross
python

## Demjson for parsing tricky Javascript Objects

Modern Javascript web frameworks often embed the data used to render each webpage in the HTML. This means an easy way of extracting data is capturing the string representation of the object with a pushdown automaton and then parsing it. Python's inbuilt json.loads is effective but won't handle very dynamic Javascript; demjson will. The problem shows up when using json.loads as the following obscure error: json.decoder.JSONDecodeError: Expecting value: line N column M (char X). Looking near the indicated character, in my case I see that it is a JavaScript undefined, which is not valid JSON.

• Edward Ross
python

## Tips for Extracting Data with Beautiful Soup

Beautiful Soup can be a useful library for extracting information from HTML. Unfortunately there are a lot of little issues I hit working with it to extract data from a careers webpage using Common Crawl. The library is still useful enough to work with, but the issues make me want to look at alternatives like lxml (via html5-parser). The source data can be obtained at the end of the article. Use a good HTML parser: Python has an inbuilt html.

• Edward Ross
data

Downloading files can often be a bottleneck in a data pipeline because network I/O is slow. A really simple way to handle this is to run multiple downloads in parallel across threads. While it's possible to deal with the unused CPU cycles using asynchronous processing, in Python it's generally easier to throw more threads at it. Using multiprocessing can be very simple if you can make the processing occur in a pure function or object method, and both the arguments and results are picklable.
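A sketch with concurrent.futures from the standard library (the fetch function is a placeholder for whatever does the actual download):

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, fetch, max_workers=8):
    # Threads work well here because downloads are I/O bound:
    # while one thread waits on the network, others can run.
    # executor.map returns results in the same order as urls.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))
```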

• Edward Ross
data

## Processing RDF nquads with grep

I am trying to extract Australian Job Postings from Web Data Commons, which extracts structured data from Common Crawl. I previously came up with a SPARQL query to extract the Australian jobs from the domain, country and currency. Unfortunately it's quite slow, but we can speed it up dramatically by replacing it with a similar script in grep. With a short grep script we can get twenty thousand Australian Job Postings with metadata from 16 million lines of compressed nquads in 30 seconds on my laptop.

• Edward Ross
data

## Coarse Geocoding

Sometimes you have some description of a location and want to work out where it is. This is called geocoding; if you just want to know what state or country it's in, it's called coarse geocoding. I found that while many structured JobPostings contain a country, some have it as a description rather than a country code, and some put the location in other fields. We can often find the country using geocoding.

• Edward Ross
python

## Analysing Web Data Commons with SPARQL

I am trying to understand how the JobPosting schema is used in Web Data Commons structured data extracts from Common Crawl. I wrote a lot of ad hoc Python to get usage statistics on JobPosting. However SPARQL is a tool that makes it much easier to answer these kinds of questions. After reading in the graphs individually they can be combined into a rdflib.Dataset so we can query them all together.

• Edward Ross
python

## Converting RDF to Dictionary

The Web Data Commons has a vast repository of structured RDF Data about local businesses, hostels, job postings, products and many other things from the internet. Unfortunately it's not in a format that's easy to do analysis on. We can stream the nquad format to get RDFlib Graphs, but we still need to convert the data into a form we can do analysis on. We'll do this by turning the relations into dictionaries of properties to the list of objects they contain.
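The shape of the transformation can be illustrated with plain tuples rather than rdflib objects (the subjects and property names below are made up):

```python
from collections import defaultdict

def triples_to_dict(triples):
    # Group (subject, predicate, object) triples into
    # {subject: {predicate: [objects]}} for easy analysis.
    result = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        result[subject][predicate].append(obj)
    return {s: dict(preds) for s, preds in result.items()}

triples = [
    ("job1", "title", "Data Scientist"),
    ("job1", "skill", "Python"),
    ("job1", "skill", "SQL"),
]
```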

• Edward Ross
data

The Web Data Commons extracts structured RDF Data from about one monthly Common Crawl per year. These contain a vast amount of structured information about local businesses, hostels, job postings, products and many other things from the internet. Python's RDFLib can read the n-quad format the data is stored in, but by default it requires reading all of the millions to billions of relations into memory. However it's possible to process this data in a streaming fashion, allowing it to be processed much faster.

• Edward Ross
commoncrawl

## Common Crawl Index Athena

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. There are petabytes of data archived so directly searching through them is very expensive and slow. To search for pages that have been archived within a domain (for example all pages from wikipedia.com) you can search the Capture Index. But this doesn't help if you want to search for paths archived across domains. For example you might want to find how many domains have been archived, or the distribution of languages of archived pages, or find pages offered in multiple languages to build a corpus of parallel texts for a machine translation model.

• Edward Ross
commoncrawl

## Extracting Text, Metadata and Data from Common Crawl

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. You can search the index to find where pages from a particular website are archived, but you still need a way to access the data. Common Crawl provides the data in 3 formats: if you just need the text of the internet, use the WET files; if you just need the response metadata, HTML head information or links in the webpage, use the WAT files; if you need the whole HTML (with all the metadata), use the full WARC files. The index only contains locations for the WARC files; the WET and WAT files are just summarisations of them.

• Edward Ross
commoncrawl

## Searching 100 Billion Webpages With Capture Index

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. Every month they use Apache Nutch to follow links across the web and download over a billion unique items to Amazon S3, and they have data back to 2008. This is like what Google and Bing do to build their search engines, the difference being that Common Crawl provides their data to the world for free.

• Edward Ross
jobs

## Understanding Job Ad Titles with Salary

Different industries have different ways of distinguishing seniority in a job title. Is an HR Officer more senior than an HR Administrator? Is a PHP web developer more skilled than a PHP developer? How different is a medical sales executive to general sales roles? Using the jobs from the Adzuna Job Salary Predictions Kaggle Competition I've found common job titles and can use the advertised salary to help understand them. Note that since the data is from the UK from several years ago a lot of the details aren't really applicable, but the techniques are.

• Edward Ross
data

## Simple Metrics

I have a tendency to create really complex metrics. Sometimes when I'm analysing data I'll need to transform the data to understand it. I often calculate the ratio of common metrics to get a more stable rate. Or when building a machine learning model I'll find that log-loss or root mean square log error is the right metric. This can be appropriate for gaining insight or training a model, but it's not good for communication.

• Edward Ross
nlp

## Summary of Finding Near Duplicates in Job Ads

I've been trying to find near duplicate job ads in the Adzuna Job Salary Predictions Kaggle Competition. Job ads can be duplicated because a hirer posts the same ad multiple times to a job board, or to multiple job boards. Finding exact duplicates is easy by sorting the job ads or a hash of them. But the job board may mangle the text in some way, or add its own footer, or the hirer might change a word or two in different posts.
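Near-duplicate detection at scale uses Minhash, but the quantity it approximates, the k-Jaccard similarity over word k-grams, is simple to compute exactly on a pair of ads (the ad texts below are invented):

```python
def shingles(text, k=3):
    # The k-Jaccard similarity works on the set of k-grams of tokens.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    # Size of the intersection over the size of the union.
    return len(a & b) / len(a | b)

ad1 = "great role in a growing team apply now with your cv"
ad2 = "great role in a growing team apply today with your cv"
similarity = jaccard(shingles(ad1), shingles(ad2))
```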

• Edward Ross
jobs

## Finding Duplicate Companies with Cliques

We've found pairs of near duplicate texts in 400,000 job ads from the Adzuna Job Salary Predictions Kaggle Competition. We tried to extract groups of similar ads by finding connected components in the graph of similar ads. Unfortunately with a low threshold of similarity we ended up with a chain of ads that were each similar to the next, but the first and last ad were totally unrelated. One way to work around this is to find cliques: groups of job ads where every job ad is similar to all of the others.

• Edward Ross
excel

## Spreadsheets as a Rough Annotation Tool

I needed to design some heuristic thresholds for grouping together items. In my first attempt I iteratively tried to guess the thresholds by trying them on different examples. This was directionally useful, but as I refined the thresholds I had to keep going back to check whether I had broken earlier examples. To improve this I used a spreadsheet as a rough annotation tool. There are various tools for data entry like org mode tables in Emacs, or you can use a spreadsheet interface in R with data.

• Edward Ross
data

## Bridging Bipartite Graph

When you have behavioural data between actors and events you naturally get a bipartite graph. For example you can have the actors as customers and events as products that are purchased, or the actors as users of a website and the events as videos that are viewed, or the actors as members of a forum and the events as posts they comment on. One of the ways to represent this is to relate actors by the number of events they both participate in.
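A sketch of that projection from the bipartite graph onto actors (the actor and event names are invented):

```python
from collections import defaultdict
from itertools import combinations

def shared_event_counts(participations):
    # participations: iterable of (actor, event) pairs from a bipartite graph.
    # Returns, for each pair of actors, the number of events they share:
    # the one-mode projection of the bipartite graph onto actors.
    actors_by_event = defaultdict(set)
    for actor, event in participations:
        actors_by_event[event].add(actor)
    counts = defaultdict(int)
    for actors in actors_by_event.values():
        for a, b in combinations(sorted(actors), 2):
            counts[(a, b)] += 1
    return dict(counts)

edges = [("alice", "p1"), ("bob", "p1"), ("alice", "p2"), ("bob", "p2"), ("carol", "p2")]
```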

• Edward Ross
data

## Clustering for Exploration

Suppose you're running a website with tens of thousands of different products, and no satisfactory way to group them up. Even a mediocre clustering can really help bootstrap your understanding. You can use the clusters to see new patterns in the data, and you can manually refine the clusters much more easily than you can make them. There are many techniques to cluster structured data or even detect them as communities in the graph of interactions with your users.

• Edward Ross
data

## Community detection in Graphs

People using a website or app will have different patterns of behaviour. It can be useful to cluster the customers or products to help understand the business and make better strategic decisions. One way to view this data is as an interaction graph between people and the products they interact with. Clustering a graph of interactions is called "community detection". Santo Fortunato's review article and user guide provide a really good introduction to community detection.

• Edward Ross
data

## Finding Common Substrings

I've found pairs of near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using Minhash. One thing that would be useful to know is what the common sections of the ads are. Typically if they have a high 3-Jaccard similarity it's because they have some text in common. The most asymptotically efficient way to find the longest common substring would be to build a suffix tree, but for experimentation the heuristics in Python's difflib work well enough.
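A quick sketch with difflib (the ad texts are invented; a shared boilerplate footer is exactly the kind of thing this surfaces):

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    # find_longest_match uses heuristics rather than a suffix tree,
    # but works well in practice; autojunk=False avoids it discarding
    # frequent characters in longer texts.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]

ad1 = "Apply now! Widgets Inc is an equal opportunity employer."
ad2 = "Join us today. Widgets Inc is an equal opportunity employer."
```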

• Edward Ross
data

## Simple Models

My first instinct when dealing with a new problem is to reach for a complex technique to solve it. However I've almost always found it more useful to start with a simple model before trying something more complex. You gain a lot from trying simple models and the cost is low. Even if they're not enough to solve the problem (and sometimes they are), they will often give a lot of information about the problem which will set you up for later techniques.

• Edward Ross
data

## From Bernoulli to Binomial Distributions

Suppose that you flip a fair coin 10 times; how many heads will you get? You'd expect it to be close to 5, but it might be a bit higher or lower. If you got 7 heads would you reconsider your assumption that the coin is fair? What if you got 70 heads out of 100 flips? This might seem a bit abstract, but the inverse problem is often very important. Given that 7 out of 10 people convert on a new call to action, can we say it's more successful than the existing one that converts at 50%?
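
These probabilities are easy to check directly with the binomial distribution; a quick sketch using only the standard library:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# chance of 7 or more heads in 10 flips of a fair coin: quite plausible
p7_of_10 = sum(binom_pmf(k, 10, 0.5) for k in range(7, 11))

# chance of 70 or more heads in 100 flips of a fair coin: vanishingly small
p70_of_100 = sum(binom_pmf(k, 100, 0.5) for k in range(70, 101))
```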

• Edward Ross
data

## Clustering for Segmentation

Dealing with thousands of different items is difficult. When you've got a couple of dozen you can view them together, but as you get into the hundreds, thousands and beyond it becomes necessary to group items to make sense of them. For example if you've got a list of customers you might group them by state, or by annual spend. But sometimes it would be useful to split them into a few groups using some heuristic criteria; clustering is a powerful technique to do this.
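
As a minimal sketch of the idea, here's a toy one-dimensional k-means grouping customers by annual spend (the spends are invented; a real project would reach for a library such as scikit-learn):

```python
def kmeans_1d(values, k, iters=20):
    """A toy one-dimensional k-means (assumes k >= 2 and values non-empty)."""
    values = sorted(values)
    # spread the initial centroids across the observed range
    centroids = [values[i * (len(values) - 1) // (k - 1)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # move each centroid to the mean of its group
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids, groups

# annual spends split into two natural groups
centroids, groups = kmeans_1d([101, 2, 3, 100, 1, 102], k=2)
```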

• Edward Ross
data

## Representing Decision Trees on a grid

A decision tree is a series of conditional rules leading to an outcome. When stated as a chain of if-then-else rules it can be really hard to understand what is going on. If the number of dimensions and cutpoints is relatively small it can be useful to visualise the tree on a grid to understand it. Decision trees are often represented as a hierarchy of splits. Here's an example of a classification tree on Titanic survivors.

• Edward Ross
data

## Four Competencies of an Effective Analyst

Analysts tend to be natural problem solvers, good at reasoning and adept with numbers. But to know how to frame the problem and what to look for they need to understand the context. To solve the problems they have to collect the right data and perform any necessary calculations. To have impact they need to be able to understand what's valuable, communicate their insights and influence decisions. These make up the four competencies of an effective analyst.

• Edward Ross
data

## 4am Rule for timeseries

When you've got a timeseries that doesn't have a timezone attached to it, the natural question is "what timezone is this data from?" Sometimes it's UTC, sometimes it's the timezone of the server, and sometimes it's the timezone of one of the locations it's about (which may or may not change with daylight savings). When it's people's web activity there's a simple heuristic to check this: activity will be at a minimum between 3am and 5am local time.
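
A rough way to apply the heuristic, given only the recorded hour of each event (the activity data here is synthetic):

```python
from collections import Counter

def quietest_hour(hours):
    """The clock hour (0-23) with the least recorded activity."""
    counts = Counter({h: 0 for h in range(24)})
    counts.update(hours)
    return min(counts, key=counts.get)

def plausible_offsets(hours):
    """Hour offsets that would shift the quietest hour into the 3am-5am window."""
    q = quietest_hour(hours)
    return [o for o in range(-12, 13) if 3 <= (q + o) % 24 <= 5]

# synthetic activity: steady traffic with a lull at hour 9, suggesting the
# local timezone is 4-6 hours behind the recorded clock
hours = [h for h in range(24) if h != 9 for _ in range(5)] + [9]
```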

• Edward Ross
data

A very useful open dataset the Australian Government provides is the Geocoded National Address File (G-NAF). This is a database mapping addresses to locations. This is really useful for applications that want to provide information or services based on someone's location. For instance you could build a custom store finder, get aggregate details of your customers, or locate business entities with an address, for example ATMs. There's another open and editable dataset of geographic entities, Open Street Map (and it has a pretty good open source Android app OsmAnd).

• Edward Ross
emacs

## Pipetable to CSV

Sometimes I get output as pipe tables in Emacs that I want to convert into a CSV to put somewhere else. This is really easy with regular expressions. I often get data output from an SQL query like this:

```
 text         | num  | value
--------------+------+-------------
 Some text    | 0.3  | 0.2
 Rah rah      | 7    | 0.00123
(2 rows)
```

Running `sed 's/\(^ *\| *|\|(.*\) */,/g'` gives:

```
,text,num,value
--------------+------+-------------
,Some text,0.3,0.2
,Rah rah,7,0.00123
,
```

I can delete the divider line and then use it as a CSV.

• Edward Ross
sql

## Binning data in SQL

Generally when combining datasets you want to join them on some key. But sometimes you really want a range lookup like Excel's VLOOKUP. A common example is binning values; you want to group values into custom ranges. While you could do this with a giant CASE statement, it's much more flexible to specify in a separate table (for regular intervals you can do it with some integer division gymnastics). It is possible to implement VLOOKUP in SQL by using window functions to select the right rows.
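
As a sketch of the window-function approach, here's a runnable example in SQLite through Python's sqlite3 (the tables and bin edges are invented): join each row to every bin with a lower bound at or below its value, then keep only the highest matching bound.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (customer TEXT, spend REAL);
INSERT INTO sales VALUES ('a', 50), ('b', 250), ('c', 700);
CREATE TABLE bins (lower_bound REAL, label TEXT);
INSERT INTO bins VALUES (0, 'small'), (100, 'medium'), (500, 'large');
""")

rows = con.execute("""
SELECT customer, spend, label FROM (
    SELECT s.customer, s.spend, b.label,
           -- rank matching bins so the tightest lower bound comes first
           ROW_NUMBER() OVER (PARTITION BY s.customer
                              ORDER BY b.lower_bound DESC) AS rn
    FROM sales s JOIN bins b ON b.lower_bound <= s.spend
) t
WHERE rn = 1
ORDER BY customer
""").fetchall()
```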

• Edward Ross
data

## Representing Interaction Networks

Behavioural data can illuminate the structure of the underlying actors. For example looking at which products customers buy can help understand how both the products and customers interact. The same idea can apply to people who attend events, watch the same movie, or have authored a scientific paper together. There are a few ways to represent these kinds of interactions which gives a large toolbox of ways to approach the problem.

• Edward Ross
data

## Analysis Needs to Change A Decision

Any analysis where the results won't change a decision is worthless. Before even thinking of getting any data it's worth being clear on how it impacts the decision. There are lots of reasons people want an analysis. Sometimes it's to confirm what they already believe (and they'll discount anything that tells them otherwise). Sometimes it's to prove to others something they believe; possibly to inform a decision someone else is making. But it's most valuable when it affects a decision they can make with an outcome they care about.

• Edward Ross
sql

## SQL Views for hiding business logic

The longer I work with a database the more I learn the dark corners of the dataset. Make sure you exclude the rows created by the test accounts listed in another table. Don't use the create_date field, use the real_create_date_v2 instead, unless it's not there, then just use create_date. Make sure you only get data from the latest snapshot for the key. Very quickly I end up with complex spaghetti SQL, which either contains monstrous subqueries or a chain of CREATE TEMPORARY TABLE.
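
One way out is a view that encodes the business logic once. Here's a sketch in SQLite via Python's sqlite3, with an invented schema echoing the fields above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE accounts (
    id INTEGER, is_test INTEGER,
    create_date TEXT, real_create_date_v2 TEXT
);
INSERT INTO accounts VALUES
    (1, 0, '2020-01-01', NULL),
    (2, 1, '2020-01-02', '2020-02-02'),   -- a test account to exclude
    (3, 0, '2020-01-03', '2020-03-03');

-- encode the dark corners once, instead of in every query
CREATE VIEW clean_accounts AS
SELECT id, COALESCE(real_create_date_v2, create_date) AS create_date
FROM accounts
WHERE is_test = 0;
""")

rows = con.execute(
    "SELECT id, create_date FROM clean_accounts ORDER BY id"
).fetchall()
```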

• Edward Ross
data

## The Problem with Jaccard for Clustering

The Jaccard Index is a useful measure of similarity between two sets. It makes sense for any two sets, is efficient to compute at scale, and its complement is a metric. However for clustering it has one major disadvantage: small sets are never close to large sets. Suppose you have sets that you want to cluster together for analysis. For example each set could be a website and the elements are people who visit that website.
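
The issue is easy to see numerically (a sketch with made-up sets):

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)

small = set(range(10))
large = set(range(100))   # small is entirely contained in large

# 0.1: a low similarity, even though every element of small is in large
similarity = jaccard(small, large)
```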

• Edward Ross
maths

## Jaccard Shingle Inequality

Two similar documents are likely to share many phrases relative to the number of words in the document. If you're concerned with plagiarism and copyright, getting the same data through multiple sources, or finding versions of the same document, this approach could be useful. In particular MinHash can quickly find pairs of items with a high Jaccard index, which we can run on sequences of w tokens. A hard question is: what's the right number for w?
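
To make this concrete, here's a small sketch of w-shingling with the Jaccard index (the sentences are invented); note how the similarity can only fall as w grows:

```python
def shingles(tokens, w):
    """All contiguous runs of w tokens, as a set."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

a = "the cat sat on the mat".split()
b = "the cat sat on the hat".split()

# Jaccard similarity of the w-shingles for increasing w
j1 = jaccard(shingles(a, 1), shingles(b, 1))
j2 = jaccard(shingles(a, 2), shingles(b, 2))
j3 = jaccard(shingles(a, 3), shingles(b, 3))
```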

• Edward Ross
python

## Showing Side-by-Side Diffs in Jupyter

When comparing two texts it's useful to have a side-by-side comparison highlighting the differences. This is straightforward using HTML in Jupyter Notebooks with Python, and the inbuilt difflib. I used this to display job ads duplicated between different sites. For a long document it's important to align the sentences (otherwise it's hard to compare the differences), and highlight the individual differences at a word level. Overall the problems are breaking up a text into sentences and words, aligning the sentences, finding word level differences and displaying them side-by-side.
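
A minimal version of the side-by-side table with the standard library (the texts are invented):

```python
import difflib

left = ["The quick brown fox", "jumps over the lazy dog"]
right = ["The quick brown fox", "leaps over the lazy dog"]

# an HTML table showing the two texts side by side, differences highlighted
table = difflib.HtmlDiff().make_table(left, right,
                                      fromdesc="site A", todesc="site B")

# in a Jupyter notebook this renders with:
#   from IPython.display import HTML; HTML(table)
```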

• Edward Ross
nlp

## Creating a Diff Recipe in Prodigy

I created a simple custom recipe to show diffs between two texts in Prodigy. I intend to use this to annotate near-duplicates. The process was pretty easy, but I got tripped up a little. I've been extracting job titles and skills from the job ads in the Adzuna Job Salary Predictions Kaggle Competition. One thing I noticed is there are a lot of job ads that are almost exactly the same; sometimes between the train and test set, which is a data leak.

• Edward Ross
data

## All of Statistics

For anyone who wants to learn statistics and has a maths or physics background I highly recommend Larry Wasserman's All of Statistics. It covers a wide range of statistics with enough mathematical detail to really understand what's going on, but not so much that the machinery is overwhelming. What I learned reading it really helped me understand statistics well enough to design bespoke statistical experiments and effectively use and implement machine learning models.

• Edward Ross
python

## Counting n-grams with Python and with Pandas

Sequences of words are useful for characterising text and for understanding text. If two texts have many similar sequences of 6 or 7 words it's very likely they have a similar origin. When splitting apart text it can be useful to keep common phrases like "New York" together rather than treating them as the separate words "New" and "York". To do this we need a way of extracting and counting sequences of words.
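
Counting n-grams needs only a zip and a Counter (the example text is made up):

```python
from collections import Counter

def ngrams(tokens, n):
    """Lazily yield all contiguous n-token sequences."""
    return zip(*(tokens[i:] for i in range(n)))

tokens = "new york is bigger than new york state".split()
bigrams = Counter(ngrams(tokens, 2))
bigrams[("new", "york")]   # 2
```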

• Edward Ross
nlp

## Not using NER for extracting Job Titles

I've been trying to use a Named Entity Recogniser (NER) to extract job titles from the titles of job ads to better understand a collection of job ads. While NER is great, it's not the right tool for this job, and I'm going to switch to a counting based approach. NER models try to extract things like the names of people, places or products. spaCy's NER model, which I used, is optimised for these cases (looking at things like capitalisation of words).

• Edward Ross
nlp

## Rules, Pipelines and Models

Over the past decade deep neural networks have revolutionised dealing with unstructured data. Problems that were intractable, from identifying the objects in a video, through generating realistic text, to translating speech between languages, are now handled in real-time production systems. You might think that today all problems on text, audio and images should be solved by training end-to-end neural networks. However rules and pipelines are still extremely valuable in building systems, and can leverage the information extracted from the black-box neural networks.

• Edward Ross
nlp

## Training a job title NER with Prodigy

In a couple of hours I trained a reasonable job title Named Entity Recogniser for job ad titles using Prodigy, with over 70% accuracy. While 70% doesn't sound great, it's a bit ambiguous what a job title is, and getting the exact bounds of the job title can be a hard problem. It's definitely good enough to be useful, and could be improved. After thinking through an annotation scheme for job titles I wanted to try annotating and training a model.

• Edward Ross
nlp

## Annotating Job Titles

When doing Named Entity Recognition it's important to think about how to set up the problem. There's a balance between what you're trying to achieve and what the algorithm can do easily. Coming up with an annotation scheme is hard, because as soon as you start annotating you notice lots of edge cases. This post will go through an example with extracting job titles from job ads. In our previous post we looked at what was in a job ad title and a way of extracting some common job titles from the ads.

• Edward Ross
nlp

## What's in a Job Ad Title?

The job title should succinctly summarise what the role is about, so it should tell you a lot about the role. However in practice job titles can range from very broad to very narrow, be obscure or acronym-laden and even hard to nail down. They're even hard to extract from a job ad's title - which is what I'll focus on in this series. In a previous series of posts I developed a method that could extract skills written in a very particular way.

• Edward Ross
data

## Data Transformations in the Shell

There are many great tools for filtering, transforming and aggregating data like SQL, R dplyr and Python Pandas (not to mention Excel). But sometimes when I'm working on a remote server I want to quickly extract some information from a file without switching to one of these environments. The standard unix tools like uniq, sort, sed and awk can do blazing fast transformations on text files that don't fit in memory and are easy to chain together.

• Edward Ross