## Imports
# For downloading webpages
from urllib.request import urlretrieve
from time import sleep
# For large download jobs
from tenacity import retry, wait_exponential, stop_after_attempt
# For saving paths
from pathlib import Path
# For extracting data
import re
from lxml import html
from PyPDF2 import PdfReader
from dataclasses import dataclass
from typing import Optional
# For analysis
from collections import Counter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_columns = 30
# A directory to store the data
data_dir = Path('data')
data_dir.mkdir(exist_ok=True)
Dataset of 2024 Victorian Council Election Candidates
In local elections what the candidates campaign on tells us a lot about what matters to them, and by extension to their communities. To understand what is important to the people of the Australian state of Victoria I’ve compiled a dataset of the statements of all the candidates to make it easy to explore the differences across regions.
The dataset is available as a zipped CSV or on Kaggle datasets. The rest of this article goes through the details of how the data was sourced from the Victorian Electoral Commission website. Note that I reviewed the site’s robots.txt file and legal terms before collecting the data, and I took care to minimize server load during the process.
What’s in the data?
Victoria is broken up into 79 Local Government Areas, which are further subdivided into wards. Due to the Local Government Act 2020 most metropolitan wards only have one member that can be elected, but regional wards can still have multiple members. Altogether there are 646 vacancies, contested by 2231 candidates in 466 elections.
Voting is compulsory for all enrolled voters, who receive in the mail a voting card, a reply paid envelope and a candidate leaflet. The candidate leaflet has a photo and statement from each of the candidates in their ward, which is also available on the Victorian Electoral Commission website. For example, for any ward of Darebin City Council you can click “expand all” to see the candidates’ statements. There is also a link to a Candidate Questionnaire PDF that has additional details about the candidate’s vision for the council, their relevant expertise, and political endorsements.
We are going to get the information from these statements and questionnaires for all of the candidates.
Finding the council pages
Each council has its own page that we need to extract data from. There are a few ways to find them: one would be to extract the names from all URLs on the 2024 Local Council Election landing page, but an alternative is to use their sitemap.xml file, which contains a listing of most of the URLs on the site.
sitemap_url = 'https://www.vec.vic.gov.au/sitemap.xml'
sitemap_path = data_dir / 'sitemap.xml'

if not sitemap_path.exists():
    urlretrieve(sitemap_url, sitemap_path)

with open(sitemap_path, 'rt') as f:
    sitemap = f.read()
The nomination pages all end in “Nominations”, so we can extract them from the sitemap XML with a crude regex:
nomination_urls = re.findall('https://[^<>]+Nominations', sitemap)
nomination_urls[:5]
['https://www.vec.vic.gov.au:443/Voting/2024 local council elections/Macedon Ranges Shire Council/Nominations',
'https://www.vec.vic.gov.au:443/Voting/2024 local council elections/Mount Alexander Shire Council/Nominations',
'https://www.vec.vic.gov.au:443/Voting/2024 local council elections/Wellington Shire Council/Nominations',
'https://www.vec.vic.gov.au:443/Voting/2024 local council elections/Ballarat City Council/Nominations',
'https://www.vec.vic.gov.au:443/Voting/2024 local council elections/Benalla Rural City Council/Nominations']
We get 78 distinct URLs, which covers all the 79 councils except Moira Shire Council which is under administration due to misconduct and won’t have an election this year.
len(nomination_urls)
78
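If the regex feels too crude, the same URLs can be pulled out by parsing the sitemap as XML. This is a minimal sketch rather than what was run for the dataset: the tiny `SAMPLE_SITEMAP`, its `Results` entry and the helper `find_nomination_urls` are made up for illustration, and it assumes the file uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature sitemap; the real file is much larger.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.vec.vic.gov.au:443/Voting/2024 local council elections/Ballarat City Council/Nominations</loc></url>
  <url><loc>https://www.vec.vic.gov.au:443/Voting/2024 local council elections/Ballarat City Council/Results</loc></url>
  <url><loc>https://www.vec.vic.gov.au:443/Voting/2024 local council elections/Alpine Shire Council/Nominations</loc></url>
</urlset>
"""

# Namespace defined by the sitemaps.org protocol (assumed to apply here).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def find_nomination_urls(sitemap_xml: str) -> list[str]:
    """Return the <loc> values that end in '/Nominations', in document order."""
    root = ET.fromstring(sitemap_xml)
    locs = [loc.text.strip() for loc in root.iterfind(".//sm:loc", NS) if loc.text]
    return [u for u in locs if u.endswith("/Nominations")]


sample_urls = find_nomination_urls(SAMPLE_SITEMAP)
print(sample_urls)
```

Matching on the end of each `<loc>` value avoids accidentally capturing text that merely contains the word somewhere in the middle of another element.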
Downloading the nomination pages
A pattern I find useful when extracting data from the web is to download the pages first, and extract the data as a second step. Extracting data from webpages is error prone because you can’t always anticipate the differences between different pages, and it’s easy to get failures and missing or invalid data. If you don’t have the original webpages saved you at best need to retrieve them again, and at worst may not be able to recover the data again. For a more complex job I would save the pages as WARC which also captures their origin, but in this case I’ll just download the pages.
Let’s create a directory just for the nomination pages.
nomination_dir = data_dir / 'nomination'
nomination_dir.mkdir(exist_ok=True)
For a given URL
url = nomination_urls[0]
url
'https://www.vec.vic.gov.au:443/Voting/2024 local council elections/Macedon Ranges Shire Council/Nominations'
We can extract the council name as the second to last part of the path:
name = url.split('/')[-2]
name
'Macedon Ranges Shire Council'
This is a useful filename because it’s unique and informative:
path = nomination_dir / name
Then, if we haven’t already, we can download the URL to the path (URL encoding the spaces in the URL string with their hex representation ‘%20’)
if not path.exists():
    urlretrieve(url.replace(' ', '%20'), path)
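Replacing spaces by hand is enough for these URLs, but the standard library can do the escaping more generally. A small sketch, assuming we want to leave `:` and `/` untouched so the scheme, port and path separators survive:

```python
from urllib.parse import quote

url = ("https://www.vec.vic.gov.au:443/Voting/2024 local council elections/"
       "Macedon Ranges Shire Council/Nominations")

# ':' and '/' are marked safe; spaces and other unsafe characters are percent-encoded.
encoded = quote(url, safe=":/")
print(encoded)
```

For this URL the result is the same as `url.replace(' ', '%20')`; the difference would only show up if a council name contained other characters that need escaping.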
We can do this for all 78 councils, waiting 2 seconds between each request so that we don’t put too much load on the server:
for url in nomination_urls:
    name = url.split('/')[-2]
    path = nomination_dir / name
    if not path.exists():
        print(path.name)
        urlretrieve(url.replace(' ', '%20'), path)
        sleep(2)
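The same download-once-then-wait pattern can be wrapped in a function so it is easy to try out without touching the network. This is only a sketch: `download_missing`, `fake_fetch` and the `example.org` URLs are hypothetical, and the fetcher is passed in so that `urlretrieve` could be swapped in for real use.

```python
import tempfile
from pathlib import Path
from time import sleep
from typing import Callable


def download_missing(urls: list[str], dest_dir: Path,
                     fetch: Callable[[str, Path], object],
                     delay: float = 2.0,
                     pause: Callable[[float], object] = sleep) -> list[str]:
    """Fetch each URL whose file is not yet on disk; return the names fetched."""
    fetched = []
    for url in urls:
        name = url.split('/')[-2]
        path = dest_dir / name
        if path.exists():
            continue  # saved on an earlier run, so no request is made
        fetch(url.replace(' ', '%20'), path)
        fetched.append(name)
        pause(delay)  # wait only after a real request
    return fetched


def fake_fetch(url: str, path: Path) -> None:
    """Stand-in fetcher that writes a placeholder instead of downloading."""
    path.write_text(f"<html>{url}</html>")


demo_urls = [
    "https://example.org/Voting/Alpine Shire Council/Nominations",
    "https://example.org/Voting/Ballarat City Council/Nominations",
]

with tempfile.TemporaryDirectory() as tmp:
    first_run = download_missing(demo_urls, Path(tmp), fake_fetch, pause=lambda s: None)
    second_run = download_missing(demo_urls, Path(tmp), fake_fetch, pause=lambda s: None)

print(first_run)
print(second_run)
```

The second call fetches nothing because both files already exist, which is what makes it safe to re-run the job after a failure part way through.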
Extracting candidate statements
Let’s look at how we can extract the statements from the HTML of a Council’s page. We will start by picking out one Council:
paths = sorted(d for d in nomination_dir.iterdir() if not d.is_dir())
len(paths)
78
path = paths[19]
with open(path, 'rb') as f:
    content = f.read()
path.name
'East Gippsland Shire Council'
There are a number of options to parse the HTML, from the inbuilt html.parser to something faster like html5-parser or selectolax, but since this is a fairly simple job I’m going to stick to LXML’s HTML parser. There are also a few different selectors I could use like BeautifulSoup’s API or CSS Selectors, but I will use XPaths which are built into LXML.
tree = html.fromstring(content)
Getting the wards
My best tool for extracting information is to open the webpage in a browser, open the Developer Tools and inspect the element I want to get information from. For example, I want to first get the Ward and then the candidates in the ward.
It looks like it’s a <h2> element, so my first guess would be to get all of these headers, which gets me a bit too much.
for x in tree.xpath("//h2"):
    print(x.text)
Navigation
East Gippsland Shire Council (9 vacancies)
Connect with us
I saw in the developer tools that the field I wanted to extract had a class and id; looking at the attributes we can see that only the Ward headers have an id attribute.
for x in tree.xpath("//h2"):
    print(x.text, x.attrib)
Navigation {'class': 'vh'}
East Gippsland Shire Council (9 vacancies) {'class': 'h3', 'id': 'East Gippsland Shire Council'}
Connect with us {}
So we can extract all the ward text using an XPath query:
wards = tree.xpath("//h2[@id != '']")
for w in wards:
    print(w.text)
East Gippsland Shire Council (9 vacancies)
We can extract the name and number of vacancies with a little regex:
ward_re = re.compile(r'(?P<name>[^(]+) \((?P<vacancies>\d+) vacan.*')
ward_re.match(wards[0].text).groupdict()
{'name': 'East Gippsland Shire Council', 'vacancies': '9'}
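To see what the pattern accepts and rejects, here is a small sketch that runs it over a few headings. Only the first heading comes from the page above; the single-vacancy wording and the `Example Ward` name are assumptions made up for illustration.

```python
import re

ward_re = re.compile(r'(?P<name>[^(]+) \((?P<vacancies>\d+) vacan.*')

headings = [
    "East Gippsland Shire Council (9 vacancies)",  # from the page above
    "Example Ward (1 vacancy)",                    # hypothetical single-member ward
    "Connect with us",                             # unrelated heading, no match
]

parsed = []
for heading in headings:
    m = ward_re.match(heading)
    if m is None:
        parsed.append(None)
    else:
        # Convert the captured digits to an int straight away
        parsed.append((m.group("name"), int(m.group("vacancies"))))

print(parsed)
```

Ending the pattern at `vacan` is what lets one expression cover both “vacancy” and “vacancies”.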
Getting the candidates
Unfortunately the candidates don’t actually sit underneath the header in the DOM, but in a sibling table.
A little XPath can get us the closest table to the Ward name:
table = wards[0].xpath('following-sibling::div[1]//table')[0]
The table consists of pairs of rows. The first has the candidate’s name, a link to their questionnaire response, and contact details; the second (hidden in the browser until “expand” is pressed) contains their photo and candidate statement, split across two cells.
Pandas can often read HTML tables but is a bit simple for this job:
pd.read_html(html.tostring(table))[0]
Candidates | Unnamed: 1_level_0 | Unnamed: 2_level_0 | |
---|---|---|---|
Name | Questionnaire | Contact Details | |
0 | WHITE, John | See questionnaire response | John White Mob: 0409 576 355 jingella2@wideb... |
1 | NaN | Candidate Statement John White Experienced Co... | Candidate Statement John White Experienced Co... |
2 | ASHWORTH, Jodie | See questionnaire response | Jodie Ashworth Mob: 0494 164 427 jodie4eastg... |
3 | NaN | Candidate Statement I’m passionate and commit... | Candidate Statement I’m passionate and commit... |
4 | JOHN, Jes | See questionnaire response | Jes John jesjohnegscc@gmail.com |
5 | NaN | Candidate Statement I am a lifelong local who... | Candidate Statement I am a lifelong local who... |
6 | NICHOL, Aly | See questionnaire response | No contact person provided |
7 | NaN | Candidate Statement I am running for Council ... | Candidate Statement I am running for Council ... |
8 | CURTIS, Valerie | See questionnaire response | Valerie Curtis Mob: 0430 571 093 valerie@qui... |
9 | NaN | Candidate Statement My name is Valerie Curtis... | Candidate Statement My name is Valerie Curtis... |
10 | BURY, Clive | See questionnaire response | Clive Bury Mob: 0458 563 327 clive@bury.com.au |
11 | NaN | Candidate Statement I am running for Councill... | Candidate Statement I am running for Councill... |
12 | JAKOBI, Cheryl | See questionnaire response | Cheryl Jakobi Mob: 0437 353 808 cheryljakobi... |
13 | NaN | Candidate Statement Let’s work together to ma... | Candidate Statement Let’s work together to ma... |
14 | BUCKLEY, Sonia | See questionnaire response | Sonia Buckley Mob: 0429 636 342 soniabuckley... |
15 | NaN | Candidate Statement Vote One Sonia Buckley. A... | Candidate Statement Vote One Sonia Buckley. A... |
16 | DAVIES, Suzanne | See questionnaire response | Tristan Dooley Mob: 0430 378 357 gumtree14@h... |
17 | NaN | Candidate Statement I am a passionate East Gi... | Candidate Statement I am a passionate East Gi... |
18 | REEVES, Mark | See questionnaire response | Mark Reeves Mob: 0417 319 342 Alt: (03) 5154... |
19 | NaN | Candidate Statement Mark is an experienced Ma... | Candidate Statement Mark is an experienced Ma... |
20 | TREVASKIS, Ian | See questionnaire response | Ian trevaskis Mob: 0477 838 821 |
21 | NaN | Candidate Statement Ian Trevaskis stands for ... | Candidate Statement Ian Trevaskis stands for ... |
22 | IRELAND, Judy | See questionnaire response | Judy Ireland Mob: 0428 119 252 judy.ireland.... |
23 | NaN | Candidate Statement A resident of East Gippsl... | Candidate Statement A resident of East Gippsl... |
24 | COATES, Janice | No questionnaire response provided | No contact person provided |
25 | NaN | Candidate Statement Being a proud East Gippsl... | Candidate Statement Being a proud East Gippsl... |
26 | CROOK, Tom | See questionnaire response | Tom Crook Mob: 0417 557 114 vote1tomcrook@gm... |
27 | NaN | Candidate Statement Vote 1 Tom Crook. For a b... | Candidate Statement Vote 1 Tom Crook. For a b... |
28 | FARQUHAR, Bernie | See questionnaire response | Sue FARQUHAR Mob: 0439 366 962 Alt: (03) 515... |
29 | NaN | Candidate Statement Born and raised in Bairns... | Candidate Statement Born and raised in Bairns... |
30 | DAVIS, Barry | No questionnaire response provided | Barry Davis baz_87_@hotmail.com |
31 | NaN | Candidate Statement Proud to be a 5th generat... | Candidate Statement Proud to be a 5th generat... |
32 | KRUSE, Sasha | See questionnaire response | Sasha Kruse Mob: 0493 833 415 sashakruse@pro... |
33 | NaN | Candidate Statement I grew up on Gunaikurnai ... | Candidate Statement I grew up on Gunaikurnai ... |
34 | COLUMBUS, Steven | See questionnaire response | Steven Columbus steven.columbus.eg@gmail.com |
35 | NaN | Candidate Statement Youth and young families ... | Candidate Statement Youth and young families ... |
36 | ALLEN, Arthur | See questionnaire response | Arthur Allen Mob: 0447 028 159 arthur.r.alle... |
37 | NaN | Candidate Statement Four years have gone quic... | Candidate Statement Four years have gone quic... |
38 | NICHOLAS, James | See questionnaire response | James Nicholas Mob: 0455 988 600 JamesNichol... |
39 | NaN | Candidate Statement I believe that community ... | Candidate Statement I believe that community ... |
40 | BADY, Susie | See questionnaire response | Susie Bady susannabady@gmail.com |
41 | NaN | Candidate Statement I live in Cabbage Tree Cr... | Candidate Statement I live in Cabbage Tree Cr... |
42 | EASTMAN, Joanne | See questionnaire response | Joanne Eastman Mob: 0494 130 879 voteforjoan... |
43 | NaN | Candidate Statement I am hard-working, energe... | Candidate Statement I am hard-working, energe... |
44 | STEPHENSON, Matt (Retired) | NaN | NaN |
Instead we can get all the table’s rows:
rows = table.xpath('.//tr')
The first row is just a header saying “Candidates”
html.tostring(rows[0])
b'<tr class="candidate-row mobile">\r\n <th scope="col">Candidates</th>\r\n </tr>\r\n '
The second has the headers “Name”, “Questionnaire”, and “Contact Details”
html.tostring(rows[1])
b'<tr class="candidate-row desktop">\r\n <th scope="col">Name</th>\r\n <th scope="col">Questionnaire</th>\r\n <th scope="col">Contact Details</th>\r\n </tr>\r\n '
This is our first example of a first row of candidate details
first_row = rows[2]
html.tostring(first_row)
b'<tr class="candidate-row">\r\n <td>\r\n WHITE, John\r\n </td>\r\n\r\n <td>\r\n <a href="https://itsitecoreblobvecprd.blob.core.windows.net/public-files/Council/Questionnaires/21222.pdf" class="has-link"><!-- config key:\'QuestionnaireResponsePresentText\' -->\r\nSee questionnaire response</a>\r\n </td>\r\n <td>\r\n <span>John White</span>\r\n <br>\r\n <span>Mob: 0409 576 355</span>\r\n <br>\r\n <span>jingella2@wideband.net.au</span>\r\n </td>\r\n </tr>\r\n '
We can read the name from the text of the first table data:
name = first_row.xpath('.//td[1]/text()')[0].strip()
name
'WHITE, John'
And the questionnaire link and contact details from the next two table data:
link = first_row.xpath('.//td[2]/a/@href')[0]
contact = first_row.xpath('.//td[3]/span/text()')

link
'https://itsitecoreblobvecprd.blob.core.windows.net/public-files/Council/Questionnaires/21222.pdf'
The following row has their photo, and the candidate statement split across two cells.
second_row = rows[3]
html.tostring(second_row)
b'<tr id="east-gippsland-shire-council_1" class="candidate-row hidden">\r\n <td>\r\n <img src="https://itsitecoreblobvecprd.blob.core.windows.net/public-files/Council/Photos/21222.png" alt="Photograph of candidate WHITE, John" width="128">\r\n </td>\r\n <td colspan="2">\r\n <strong>Candidate Statement</strong><br>\r\n <p>John White Experienced Councillor for 8 years. Deputy Mayor 2018/2019, Mayor 2019/2020, supporting and representing community during bushfires. Established productive relationships with government agencies and emergency services. Initiated recovery funding, Active Landcare member for 40-years, CFA member for 50 years. My objectives are to make Council decisions open and transparent, put the community first. Always ignore vested interests. Insist on responsible Council decisions that prioritise resources to meet the community’s primary needs. Ease cost of living pressures for ratepayers/residents by managing expenditure to get best value from Council rates and external funding. Defer non-urgent expenditure until economic recovery. Stop cost over-runs, financial waste and funding loss with accurate, professional and cost-effective project planning and implementation. Ensure Councillors adherence to the Councillor Code of Conduct. Push for Council to be accessible and responsive, and for Councillor decision-making in the best interests of the community. Personally, I am not aligned with any faction/group or political party. My job is to represent everybody and fight for everyone’s right to be heard. I believe that the team is greater than the individual. I will not trade my vote for personal advantage. I will maintain my independence and support my community. Vote one John White</p>\r\n </td>\r\n </tr>\r\n '
We can read the image URL from the first table data:
photo_url = second_row.xpath('.//td[1]/img/@src')
photo_url
['https://itsitecoreblobvecprd.blob.core.windows.net/public-files/Council/Photos/21222.png']
And the candidate statement from the second:
statement = second_row.xpath('.//td[2]/p/text()')[0]
statement
'John White Experienced Councillor for 8 years. Deputy Mayor 2018/2019, Mayor 2019/2020, supporting and representing community during bushfires. Established productive relationships with government agencies and emergency services. Initiated recovery funding, Active Landcare member for 40-years, CFA member for 50 years. My objectives are to make Council decisions open and transparent, put the community first. Always ignore vested interests. Insist on responsible Council decisions that prioritise resources to meet the community’s primary needs. Ease cost of living pressures for ratepayers/residents by managing expenditure to get best value from Council rates and external funding. Defer non-urgent expenditure until economic recovery. Stop cost over-runs, financial waste and funding loss with accurate, professional and cost-effective project planning and implementation. Ensure Councillors adherence to the Councillor Code of Conduct. Push for Council to be accessible and responsive, and for Councillor decision-making in the best interests of the community. Personally, I am not aligned with any faction/group or political party. My job is to represent everybody and fight for everyone’s right to be heard. I believe that the team is greater than the individual. I will not trade my vote for personal advantage. I will maintain my independence and support my community. Vote one John White'
Let’s make some small data objects to hold all this information:
@dataclass(frozen=True)
class Ward:
    council: str
    ward: str
    vacancies: int

@dataclass
class Candidate:
    name: str
    questions: Optional[str]
    photo: Optional[str]
    contact: list[str]
    statement: Optional[str]
    ward: Optional[Ward] = None
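A short sketch of how these two classes behave differently. The values are made up, and the reading of why `Ward` is frozen (so that it cannot be changed by accident and can be hashed) is my assumption rather than something stated here.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)
class Ward:
    council: str
    ward: str
    vacancies: int


@dataclass
class Candidate:
    name: str
    questions: Optional[str]
    photo: Optional[str]
    contact: list[str]
    statement: Optional[str]
    ward: Optional[Ward] = None


ward = Ward("Example Shire Council", "Example Ward", 2)   # hypothetical ward
candidate = Candidate("EXAMPLE, Alex", None, None, [], None)

# Candidate is a regular dataclass, so the ward can be attached later.
candidate.ward = ward

# Ward is frozen: assigning to a field raises FrozenInstanceError.
try:
    ward.vacancies = 3
    frozen_error = False
except FrozenInstanceError:
    frozen_error = True

# Frozen dataclasses with the default eq=True are hashable, so equal wards collapse in a set.
unique_wards = {ward, Ward("Example Shire Council", "Example Ward", 2)}
print(frozen_error, len(unique_wards))
```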
And put together our HTML scraping into a function; I’ve added a few changes to make it more robust:
- questionnaire link is optional since some candidates did not provide one
- photo is optional since some candidates did not provide one
- sometimes the statement is not provided
- sometimes there isn’t a second row in uncontested elections
def parse_candidate_table(table):
    candidates = []

    rows = table.xpath('.//tr')

    # Skip header rows
    rows = rows[2:]

    i = 0
    while (i < len(rows)):
        first_row = rows[i]
        i = i + 1
        name = first_row.xpath('.//td[1]/text()')[0].strip()
        link = first_row.xpath('.//td[2]/a/@href')
        # Some candidates didn't submit a questionnaire response
        link = link[0] if link else None
        contact = first_row.xpath('.//td[3]/span/text()')

        # If we have an expansion row with more details
        if i < len(rows) and 'id' in rows[i].attrib:
            second_row = rows[i]
            i = i + 1

            photo_url = second_row.xpath('.//td[1]/img/@src')
            # Some candidates didn't submit a photo
            photo_url = photo_url[0] if photo_url else None
            statement = second_row.xpath('.//td[2]/p/text()')
            statement = statement[0] if statement else None
        # We don't always have this, e.g. in unopposed contests
        # or for retired but listed candidates
        else:
            photo_url = None
            statement = None

        candidates.append(Candidate(
            name,
            link,
            photo_url,
            contact,
            statement))
    return candidates
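The fiddly part of this function is the row pairing, which is easier to see without the HTML. A minimal sketch, using plain dictionaries as stand-ins for the `<tr>` elements; the rows and the helper `pair_rows` are made up for illustration.

```python
from typing import Optional


def pair_rows(rows: list[dict]) -> list[tuple[dict, Optional[dict]]]:
    """Pair each summary row with the detail row that follows it, if any.

    A row counts as a detail row when it has an 'id' key, mirroring the
    check for an 'id' attribute on the hidden expansion row.
    """
    pairs = []
    i = 0
    while i < len(rows):
        summary = rows[i]
        i += 1
        detail = None
        if i < len(rows) and 'id' in rows[i]:
            detail = rows[i]
            i += 1
        pairs.append((summary, detail))
    return pairs


# Hypothetical rows: the second candidate has no expansion row.
rows = [
    {'name': 'A, Alex'},
    {'id': 'example_1', 'statement': 'Vote 1'},
    {'name': 'B, Blake'},
    {'name': 'C, Casey'},
    {'id': 'example_3', 'statement': 'Hello'},
]

pairs = pair_rows(rows)
for summary, detail in pairs:
    print(summary['name'], detail['statement'] if detail else None)
```

Advancing the index by one or two, depending on whether a detail row is present, keeps a missing expansion row from shifting every later candidate onto the wrong statement.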
Then to parse an entire Council page we just need to iterate over the Wards:
def parse_page(content, council):
    candidates = []
    wards = []

    tree = html.fromstring(content)

    for ward_node in tree.xpath("//h2[@id != '']"):
        ward_name, ward_vacancies = ward_re.match(ward_node.text).groups()
        ward_vacancies = int(ward_vacancies)

        ward = Ward(council, ward_name, ward_vacancies)

        wards.append(ward)
        table = ward_node.xpath('following-sibling::div[1]//table')[0]

        for candidate in parse_candidate_table(table):
            candidate.ward = ward
            candidates.append(candidate)
    return wards, candidates
We can now apply this to get all the candidates
candidates = []
wards = []

for path in paths:
    with open(path, 'rb') as f:
        content = f.read()
    w, c = parse_page(content, path.name)
    wards.extend(w)
    candidates.extend(c)
Checking for missing data
As noted earlier, the VEC states there are 466 elections, but unfortunately we are 2 short:
466 - len(wards)
2
We are also short of the expected 646 vacancies:
646 - sum(w.vacancies for w in wards)
10
If we look back on our councils we can see that we’re missing candidates from Melbourne City Council.
councils = {p.name for p in paths}
ward_councils = {w.council for w in wards}

councils.difference(ward_councils)
{'Melbourne City Council'}
The City of Melbourne is different to other Local Government Areas in that it holds two elections, one to elect the Lord Mayor and Deputy Mayor, and another to elect 9 councillors. This explains our missing 2 elections and 10 vacancies (if we count the “Leadership Team” vacancy for the Lord and Deputy Mayor together as 1 team vacancy, rather than 2 positions). These changes mean the Melbourne City Council has a very different dynamic to every other council, with the Leadership team vote being separate but connected to the Council votes, and each team or unaffiliated candidate stating their preferences for all other candidates.
This means that the data on the Melbourne City Council nomination page is represented quite differently.
path = nomination_dir / 'Melbourne City Council'

with open(path, 'rb') as f:
    content = f.read()

tree = html.fromstring(content)
We can extract the names of candidates for the Mayor and Deputy Mayor election, giving 22 candidates across 11 parties.
leadership_team_candidates = [x.strip() for x in tree.xpath("//tr[contains(@class, 'candidate-row')]/td[1]/p/text()") if x.strip() and not x.strip().startswith('No')]

len(leadership_team_candidates)
22
We do something slightly different to get the names of the candidates for the 9 Council positions.
melb_candidates = [x.strip() for x in tree.xpath("//tr[contains(@class, 'candidate-row')]/td[1]/text()") if x.strip()]

len(melb_candidates)
58
I can’t really meaningfully represent Melbourne City Council with the same data structure as I have for other councils. Teams and preferencing, which are important for understanding its structure, don’t exist in other councils. Apart from “ungrouped” candidates, the candidates have group statements rather than individual ones. All this means that Melbourne City Council should really be a separate dataset, and I won’t build it here.
All together we’ve accounted for all but 1 of the 2231 candidates, which is pretty comprehensive.
2231 - (len(candidates) + len(leadership_team_candidates) + len(melb_candidates))
1
Analysing candidate data
Before going any further with publishing the data let’s sense check the data we’ve got and put it into a tidy form.
Contact information
The candidates are public figures and have provided their public contact details; however I don’t want to republish these details in a form that’s easy for bulk spam. For the same reason I won’t show contact details in the output of this article, but I have often checked they actually contain the kind of information I attribute to them.
The majority of candidates provide 3 pieces of contact information, often a phone number, an email and a name. Some also offer an alternate phone number, and some don’t have a phone number.
Counter(len(c.contact) for c in candidates)
Counter({3: 1348, 2: 554, 0: 152, 4: 96})
We can use a simple heuristic that works very well for detecting emails, and the majority of candidates supply an email address, and no one has more than one:
def is_email(s):
    return '@' in s

Counter([sum([1 for cf in c.contact if is_email(cf)]) for c in candidates])
Counter({1: 1875, 0: 275})
The email domain can sometimes hold interesting information, like whether they are a member of a political group such as the Greens, the Victorian Socialists or the Australian Labor Party. Some of the tail domains are also sites the candidate themselves have set up which are relevant for finding more information (although some are just businesses they work for).
def get_email_domain(s):
    return s.split('@')[-1]

Counter(get_email_domain(cf).lower()
        for c in candidates
        for cf in c.contact if is_email(cf)).most_common(20)
[('gmail.com', 1036),
('hotmail.com', 96),
('outlook.com', 94),
('vic.greens.org.au', 71),
('bigpond.com', 60),
('victoriansocialists.org.au', 59),
('yahoo.com', 26),
('yahoo.com.au', 24),
('icloud.com', 22),
('me.com', 11),
('live.com.au', 10),
('optusnet.com.au', 10),
('darebinlabor.org', 9),
('proton.me', 6),
('mail.com', 6),
('bigpond.net.au', 6),
('libertarians.org.au', 6),
('outlook.com.au', 5),
('tpg.com.au', 5),
('iprimus.com.au', 4)]
Many candidates provide a phone number, and some provide two:
def is_mobile(s):
    return bool(re.match(r"^(Mob|Alt): \d{4} \d{3} \d{3}$", s))

def is_landline(s):
    return bool(re.match(r"^Alt: \(\d{2}\) \d{4} \d{4}$", s))

def is_phone_number(s):
    return is_mobile(s) or is_landline(s)

Counter([sum([1 for cf in c.contact if is_phone_number(cf)])
         for c in candidates])
Counter({1: 1469, 0: 584, 2: 97})
Not all the phone numbers are unique, for example many Greens candidates have the same mobile number suggesting it might be a number run by their political office. However this doesn’t occur often enough to make for informative data, so I won’t include it in the dataset.
The remaining field is a contact person’s name; this is often the candidate themselves, but not always, especially for political parties.
Counter([sum([1 for cf in c.contact if
              not (is_phone_number(cf) or is_email(cf))])
         for c in candidates])
Counter({1: 1998, 0: 152})
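Putting the three heuristics together, each contact field falls into exactly one bucket. The sketch below uses entirely made-up contact fields in the same shape as the scraped ones; the helper `classify` is hypothetical and simply treats anything that is neither an email nor a recognised phone number as a name.

```python
import re


def is_email(s):
    return '@' in s


def is_mobile(s):
    return bool(re.match(r"^(Mob|Alt): \d{4} \d{3} \d{3}$", s))


def is_landline(s):
    return bool(re.match(r"^Alt: \(\d{2}\) \d{4} \d{4}$", s))


def is_phone_number(s):
    return is_mobile(s) or is_landline(s)


def classify(field: str) -> str:
    """Label a contact field as 'email', 'phone' or, by elimination, 'name'."""
    if is_email(field):
        return 'email'
    if is_phone_number(field):
        return 'phone'
    return 'name'


# Made-up contact details, not taken from the dataset.
contact = ["Alex Example", "Mob: 0400 000 000", "Alt: (03) 0000 0000", "alex@example.org"]
kinds = [classify(field) for field in contact]
print(kinds)

# A phone number in an unexpected format falls through to 'name'.
unexpected = classify("0400000000")
print(unexpected)
```

Because the name is found by elimination, any phone number written in a format the regexes don’t cover would be counted as a name, which is why it is worth eyeballing the leftovers.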
We can then extract the relevant data fields out of the contact details:
@dataclass
class ContactDetails:
    contact_name: Optional[str]
    has_phone: bool
    has_email: bool
    email_domain: Optional[str]

contact_details = []

for c in candidates:
    contact = c.contact

    contact_name = [cf for cf in contact if
                    not (is_phone_number(cf) or is_email(cf))]
    contact_name = contact_name[0] if contact_name else None

    email = [cf for cf in contact if is_email(cf)]
    has_email = bool(email)
    email_domain = get_email_domain(email[0]) if email else None

    has_phone = any([cf for cf in contact if is_phone_number(cf)])

    contact_details.append(
        ContactDetails(
            contact_name,
            has_phone,
            has_email,
            email_domain
        ))
Which makes for a nice dataframe:
df_contact_details = pd.DataFrame(contact_details)
df_contact_details
contact_name | has_phone | has_email | email_domain | |
---|---|---|---|---|
0 | John Andersen | True | True | yahoo.com.au |
1 | Peter Smith | True | True | expressfinanceoptions.com.au |
2 | Jean-Pierre Louis Ronco | True | True | hotmail.com |
3 | David Byrne | True | True | arderns.com.au |
4 | Bruce McDonald | True | True | gmail.com |
... | ... | ... | ... | ... |
2145 | Corinne Heintze | True | True | gmail.com |
2146 | Andrew McLean | True | True | gmail.com |
2147 | Chris Lehmann | False | True | bigpond.com.au |
2148 | Karly Kirk | True | True | YAHOO.COM |
2149 | Kylie ZANKER | True | True | gmail.com |
2150 rows × 4 columns
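Note that the table keeps each domain exactly as the candidate typed it (row 2148 shows `YAHOO.COM`), whereas the earlier count lower-cased the domains first. If the column is going to be grouped on, it may be worth normalising it. A small sketch with made-up addresses; `normalised_email_domain` is a hypothetical helper, not part of the code above.

```python
def get_email_domain(s: str) -> str:
    return s.split('@')[-1]


def normalised_email_domain(s: str) -> str:
    """Strip whitespace and lower-case the domain so differently cased copies compare equal."""
    return get_email_domain(s).strip().lower()


# Made-up addresses that differ only in the case of the domain.
examples = ["someone@YAHOO.COM", "someone.else@yahoo.com"]

raw_domains = {get_email_domain(e) for e in examples}
clean_domains = {normalised_email_domain(e) for e in examples}
print(raw_domains, clean_domains)
```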
Checking other fields
df_candidate = pd.DataFrame(candidates).drop('contact', axis=1)
df_candidate = pd.concat([df_candidate.drop('ward', axis=1),
                          df_contact_details,
                          pd.json_normalize(df_candidate['ward'])],
                         axis=1)
df_candidate
name | questions | photo | statement | contact_name | has_phone | has_email | email_domain | council | ward | vacancies | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | ANDERSEN, John | https://itsitecoreblobvecprd.blob.core.windows... | https://itsitecoreblobvecprd.blob.core.windows... | My name is John Andersen, also known as Snowy.... | John Andersen | True | True | yahoo.com.au | Alpine Shire Council | Alpine Shire Council | 7 |
1 | SMITH, Peter | None | https://itsitecoreblobvecprd.blob.core.windows... | My name is Peter Smith. I am a 54 year old res... | Peter Smith | True | True | expressfinanceoptions.com.au | Alpine Shire Council | Alpine Shire Council | 7 |
2 | RONCO, Jean-Pierre | https://itsitecoreblobvecprd.blob.core.windows... | https://itsitecoreblobvecprd.blob.core.windows... | As a candidate at the Alpine Shire Council ele... | Jean-Pierre Louis Ronco | True | True | hotmail.com | Alpine Shire Council | Alpine Shire Council | 7 |
3 | BYRNE, David | https://itsitecoreblobvecprd.blob.core.windows... | https://itsitecoreblobvecprd.blob.core.windows... | My name is David Byrne. Ten years ago, my fami... | David Byrne | True | True | arderns.com.au | Alpine Shire Council | Alpine Shire Council | 7 |
4 | McDONALD, Bruce | https://itsitecoreblobvecprd.blob.core.windows... | https://itsitecoreblobvecprd.blob.core.windows... | A resident of Bright with a no-nonsense, down ... | Bruce McDonald | True | True | gmail.com | Alpine Shire Council | Alpine Shire Council | 7 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2145 | HEINTZE, Corinne | https://itsitecoreblobvecprd.blob.core.windows... | None | None | Corinne Heintze | True | True | gmail.com | Yarriambiack Shire Council | Dunmunkle Ward | 2 |
2146 | McLEAN, Andrew | https://itsitecoreblobvecprd.blob.core.windows... | None | None | Andrew McLean | True | True | gmail.com | Yarriambiack Shire Council | Hopetoun Ward | 2 |
2147 | LEHMANN, Chris K. | https://itsitecoreblobvecprd.blob.core.windows... | None | None | Chris Lehmann | False | True | bigpond.com.au | Yarriambiack Shire Council | Hopetoun Ward | 2 |
2148 | KIRK, Karly | https://itsitecoreblobvecprd.blob.core.windows... | None | None | Karly Kirk | True | True | YAHOO.COM | Yarriambiack Shire Council | Warracknabeal Ward | 2 |
2149 | ZANKER, Kylie | https://itsitecoreblobvecprd.blob.core.windows... | None | None | Kylie ZANKER | True | True | gmail.com | Yarriambiack Shire Council | Warracknabeal Ward | 2 |
2150 rows × 11 columns
All of the candidates within a council are unique:
df_candidate[['council', 'name']].duplicated().any()
False
Although some common names occur in different councils:
(df_candidate[df_candidate['name'].duplicated(keep=False)][['name', 'council', 'ward']]
 .sort_values(['name', 'council', 'ward']))
name | council | ward | |
---|---|---|---|
1553 | CLARK, David | Moyne Shire Council | Moyne Shire Council |
1680 | CLARK, David | Pyrenees Shire Council | Ercildoune Ward |
851 | CLARK, Tony | Hepburn Shire Council | Hepburn Shire Council |
867 | CLARK, Tony | Hindmarsh Shire Council | North Ward |
49 | HARRIS, Mark | Ballarat City Council | Golden Point Ward |
1439 | HARRIS, Mark | Moonee Valley City Council | Airport Ward |
33 | JONES, Mark | Ballarat City Council | Buninyong Ward |
218 | JONES, Mark | Benalla Rural City Council | Benalla Rural City Council |
361 | OWEN, Brett | Cardinia Shire Council | Beacon Hills Ward |
1373 | OWEN, Brett | Mitchell Shire Council | Central Ward |
194 | SMITH, Greg | Bayside City Council | Dendy Ward |
1870 | SMITH, Greg | Whitehorse City Council | Kingsley Ward |
159 | WILLIAMS, David | Bayside City Council | Boyd Ward |
958 | WILLIAMS, David | Hume City Council | Roxburgh Park Ward |
1441 | WRIGHT, David | Moonee Valley City Council | Airport Ward |
1658 | WRIGHT, David | Port Phillip City Council | Port Melbourne Ward |
Most candidates filled out the questionnaire, and they are all different:
p_has_questionnaire_url = 1-df_candidate['questions'].isna().mean()

assert not df_candidate['questions'].dropna().duplicated().any()
f"{p_has_questionnaire_url:0.2%} of candidates have a questionnaire"
'93.95% of candidates have a questionnaire'
And most of them provided a photo:
p_has_photo_url = 1-df_candidate['photo'].isna().mean()

assert not df_candidate['photo'].dropna().duplicated().any()
f"{p_has_photo_url:0.2%} of candidates have a photo"
'96.28% of candidates have a photo'
Most candidates have a statement; I noticed when collecting the data that in unopposed wards the candidates do not have statements.
p_has_statement = 1 - df_candidate.statement.isna().mean()
f"{p_has_statement:0.2%} of candidates have a statement"
'96.84% of candidates have a statement'
We can sanity check that the number of characters in the candidate statements is in a reasonable range:
statement_num_char = df_candidate.statement.str.len()

ax = statement_num_char.plot.hist(bins=30)
ax.set_title('Distribution of number of characters in candidate statements');
The shortest statements look legitimate (but not particularly persuasive)
df_candidate[statement_num_char < 75]
| | name | questions | photo | statement | contact_name | has_phone | has_email | email_domain | council | ward | vacancies |
---|---|---|---|---|---|---|---|---|---|---|---|
1721 | MENADUE, Alan | https://itsitecoreblobvecprd.blob.core.windows... | https://itsitecoreblobvecprd.blob.core.windows... | www.alanmenadue.com | Alan Maxwell Menadue | True | True | gmail.com | Stonnington City Council | Greville Ward | 1 |
2135 | TALBETT, James | https://itsitecoreblobvecprd.blob.core.windows... | https://itsitecoreblobvecprd.blob.core.windows... | My policies include having a beer and watching... | James Talbett | True | False | None | Yarra Ranges Shire Council | O'Shannassy Ward | 1 |
If we consider a word to be a whitespace-separated string, the candidate statements average about 6 characters per word (counting the separating spaces), except for one outlier:
statement_num_word = df_candidate.statement.fillna('').str.split().map(len)

plt.scatter(statement_num_word, statement_num_char)
(statement_num_char / statement_num_word).dropna().mean()
6.26055256228232
That outlier appears to have a lot of links (I hope they didn’t get printed in the candidate leaflet):
df_candidate[df_candidate.statement.str.len() > 2000].iloc[0].statement
'We need a competent, trustworthy Council that is not a variant of the old Council. November 2017: Independent Broad-based Anti-corruption Commission (IBAC), “Operation Sandon Public examinations into allegations of corruption in local government planning and property development decisions”. Tuesday 11th February 2020: City of Casey Municipal Monitor Report, Page 9 Findings, included: “In response to the intense scrutiny by IBAC and the appointment of a municipal monitor, the Councillors have prioritised protecting their reputations rather than that of the City of Casey.”, Page 10 Recommendations, included: “I therefore recommend that: 1. The Casey City Council is dismissed and administrators are appointed.”. Friday 14th February 2020: The Age newspaper, “Casey councillors desert ship as investigator recommends their sacking”. Tuesday 18th February 2020: Parliamentary Debates (Hansard), Page 295, “Her report is damning. It is damning as to the conduct of all councillors ....”; Berwick Star News newspaper, “State moves to sack Casey Council”; Local Government (Casey City Council) Act 2020, “[Assented to 19 February 2020]”; Casey Council meeting. Thursday 20th February 2020: Council dismissed. 27th July 2023: IBAC’s “Special Report”. 
Please refer to all references regarding the dismissal of Casey Council, (https://www.ibac.vic.gov.au/operation-sandon), (https://www.localgovernment.vic.gov.au/__data/assets/pdf_file/0031/207796/City-of-Casey-Municipal-Monitor-Report-February-2020.pdf), (https://www.theage.com.au/politics/victoria/casey-councillors-desert-ship-as-investigator-recommends-their-sacking-20200214-p540sy.html), (https://www.parliament.vic.gov.au/4a4c7d/globalassets/hansard-daily-pdfs/hansard-2145855009-5096/hansard-2145855009-5096.pdf), (https://berwicknews.starcommunity.com.au/news/2020-02-18/state-moves-to-sack-casey-council/), (https://content.legislation.vic.gov.au/sites/default/files/2020-02/20-002aa%20authorised.pdf), (https://content.legislation.vic.gov.au/sites/default/files/2020-04/20-2aa002%20authorised.pdf), (https://www.youtube.com/watch?v=ETshjEJJjOY), (https://www.casey.vic.gov.au/sites/default/files/2020-02/Agenda%20-%20Council%20Meeting%20-%20Website%20-%2018-2-20.pdf%20%28reduced%29_0.pdf), (https://www.casey.vic.gov.au/sites/default/files/2020-02/Council%20Meeting%20-%20Minutes%20-%20Website%20-%2018-2-20.pdf), (https://www.ibac.vic.gov.au/operation-sandon-special-report).'
Ward level statistics
We can also sanity check the ward level statistics
ward_stats = df_candidate.groupby(['council', 'ward']).agg(vacancies = ('vacancies', 'max'), candidates=('name', 'count'))
ward_stats
council | ward | vacancies | candidates |
---|---|---|---|
Alpine Shire Council | Alpine Shire Council | 7 | 12 |
Ararat Rural City Council | Ararat Rural City Council | 7 | 12 |
Ballarat City Council | Alfredton Ward | 1 | 6 |
Ballarat City Council | Brown Hill Ward | 1 | 3 |
Ballarat City Council | Buninyong Ward | 1 | 3 |
... | ... | ... | ... |
Yarra Ranges Shire Council | Streeton Ward | 1 | 2 |
Yarra Ranges Shire Council | Walling Ward | 1 | 2 |
Yarriambiack Shire Council | Dunmunkle Ward | 2 | 2 |
Yarriambiack Shire Council | Hopetoun Ward | 2 | 2 |
Yarriambiack Shire Council | Warracknabeal Ward | 2 | 2 |
464 rows × 2 columns
Because of the Local Government Act 2020 the metropolitan wards, which are the majority, have only 1 vacancy.
ward_stats['vacancies'].value_counts()
1 411
3 21
7 13
2 12
5 4
9 3
Name: vacancies, dtype: int64
Most wards have 2 or 3 candidates, with a long tail.
ax = plt.bar(*zip(*Counter(ward_stats['candidates']).items()))
plt.xticks(range(1, 25, 1));
Some wards are uncontested, with as many vacancies as candidates, and others are very hotly contested; for example, Dendy Ward in Bayside City Council has 18 candidates for its 1 seat.
ward_stats['candidates_per_vacancy'] = ward_stats['candidates'] / ward_stats['vacancies']
ward_stats.sort_values('candidates_per_vacancy')
council | ward | vacancies | candidates | candidates_per_vacancy |
---|---|---|---|---|
Loddon Shire Council | Wedderburn Ward | 1 | 1 | 1.0 |
Latrobe City Council | Newborough Ward | 1 | 1 | 1.0 |
Horsham Rural City Council | Wyn Wyn Ward | 1 | 1 | 1.0 |
Horsham Rural City Council | Kalimna Ward | 1 | 1 | 1.0 |
Hindmarsh Shire Council | West Ward | 2 | 2 | 1.0 |
... | ... | ... | ... | ... |
Wyndham City Council | Williams Landing Ward | 1 | 12 | 12.0 |
Wyndham City Council | Wimba Ward | 1 | 12 | 12.0 |
Wyndham City Council | Bemin Ward | 1 | 15 | 15.0 |
Bayside City Council | Castlefield Ward | 1 | 15 | 15.0 |
Bayside City Council | Dendy Ward | 1 | 18 | 18.0 |
464 rows × 3 columns
In fact there are a large number of uncontested wards:
uncontested_wards = ward_stats[ward_stats['candidates'] <= ward_stats['vacancies']]
uncontested_wards
council | ward | vacancies | candidates | candidates_per_vacancy |
---|---|---|---|---|
Banyule City Council | Beale Ward | 1 | 1 | 1.0 |
Buloke Shire Council | Mallee Ward | 2 | 2 | 1.0 |
Cardinia Shire Council | Pakenham Hills Ward | 1 | 1 | 1.0 |
Central Goldfields Shire Council | Maryborough West Ward | 1 | 1 | 1.0 |
Central Goldfields Shire Council | Tullaroop Ward | 1 | 1 | 1.0 |
Corangamite Shire Council | Cooriemungle Ward | 1 | 1 | 1.0 |
Corangamite Shire Council | Gnotuk Ward | 1 | 1 | 1.0 |
Corangamite Shire Council | Lake Elingamite Ward | 1 | 1 | 1.0 |
Corangamite Shire Council | Lake Keilambete Ward | 1 | 1 | 1.0 |
Corangamite Shire Council | Leura Ward | 1 | 1 | 1.0 |
Corangamite Shire Council | Mt Elephant Ward | 1 | 1 | 1.0 |
Greater Bendigo City Council | Epsom Ward | 1 | 1 | 1.0 |
Greater Dandenong City Council | Noble Park Ward | 1 | 1 | 1.0 |
Greater Shepparton City Council | Midland Ward | 1 | 1 | 1.0 |
Greater Shepparton City Council | Poplar Ward | 1 | 1 | 1.0 |
Hindmarsh Shire Council | West Ward | 2 | 2 | 1.0 |
Horsham Rural City Council | Kalimna Ward | 1 | 1 | 1.0 |
Horsham Rural City Council | Wyn Wyn Ward | 1 | 1 | 1.0 |
Latrobe City Council | Newborough Ward | 1 | 1 | 1.0 |
Loddon Shire Council | Boort Ward | 1 | 1 | 1.0 |
Loddon Shire Council | Terrick Ward | 1 | 1 | 1.0 |
Loddon Shire Council | Wedderburn Ward | 1 | 1 | 1.0 |
Manningham City Council | Tullamore Ward | 1 | 1 | 1.0 |
Manningham City Council | Yarra Ward | 1 | 1 | 1.0 |
Mildura Rural City Council | Karadoc Ward | 1 | 1 | 1.0 |
Mildura Rural City Council | Sunset Country Ward | 1 | 1 | 1.0 |
Monash City Council | Gallaghers Ward | 1 | 1 | 1.0 |
Mount Alexander Shire Council | Barkers Creek Ward | 1 | 1 | 1.0 |
Mount Alexander Shire Council | Forest Creek Ward | 1 | 1 | 1.0 |
Mount Alexander Shire Council | Loddon River Ward | 1 | 1 | 1.0 |
Mount Alexander Shire Council | Moonlight Creek Ward | 1 | 1 | 1.0 |
Mount Alexander Shire Council | Tarrengower Ward | 1 | 1 | 1.0 |
Wangaratta Rural City Council | King River Ward | 1 | 1 | 1.0 |
Wangaratta Rural City Council | Warby Ward | 1 | 1 | 1.0 |
Warrnambool City Council | Botanic Ward | 1 | 1 | 1.0 |
Warrnambool City Council | Russells Creek Ward | 1 | 1 | 1.0 |
Warrnambool City Council | Wollaston Ward | 1 | 1 | 1.0 |
Yarra Ranges Shire Council | Chirnside Ward | 1 | 1 | 1.0 |
Yarra Ranges Shire Council | Ryrie Ward | 1 | 1 | 1.0 |
Yarriambiack Shire Council | Dunmunkle Ward | 2 | 2 | 1.0 |
Yarriambiack Shire Council | Hopetoun Ward | 2 | 2 | 1.0 |
Yarriambiack Shire Council | Warracknabeal Ward | 2 | 2 | 1.0 |
There are a total of 42 wards, with 47 positions, that are uncontested.
len(uncontested_wards), uncontested_wards['vacancies'].sum()
(42, 47)
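The ward-level aggregation and the uncontested filter can be checked on a tiny example. This is a sketch on made-up rows (councils "A" and "B" are hypothetical), using the same named aggregation and the same `<=` comparison as above:

```python
import pandas as pd

# Five hypothetical candidates across three wards
toy = pd.DataFrame({
    "council": ["A", "A", "A", "B", "B"],
    "ward": ["North", "North", "South", "B", "B"],
    "name": ["P", "Q", "R", "S", "T"],
    "vacancies": [1, 1, 1, 2, 2],
})

# One row per (council, ward): vacancies is constant within a ward,
# and counting names gives the number of candidates
stats = toy.groupby(["council", "ward"]).agg(
    vacancies=("vacancies", "max"),
    candidates=("name", "count"),
)

# A ward is uncontested when there are no more candidates than seats
uncontested = stats[stats["candidates"] <= stats["vacancies"]]

print(len(uncontested), uncontested["vacancies"].sum())
```

Here the unsubdivided council "B" (two candidates for two seats) and the single-candidate "South" ward are both flagged, while "North" is a real contest.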
Extracting Candidate Questionnaires
There’s additional information about candidates in the questionnaires they submitted, and for uncontested elections this is the only information we have. We have to download them all from external PDFs.
We will save them all to a separate folder:
question_dir = data_dir / 'questionnaires'
question_dir.mkdir(exist_ok=True)
There are a lot of files to download, and we deliberately go slowly to avoid overloading the server (at most 1 document every 2 seconds), so the job runs for a long time. If a download fails we'll retry with exponential backoff using tenacity.
@retry(wait=wait_exponential(multiplier=1, min=4, max=60), stop=stop_after_attempt(7))
def download(url, path):
    # Skip files we already have, so the job can be safely re-run
    if path.exists():
        return
    urlretrieve(url, path)
    # Pause after each download to limit server load
    sleep(2)
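To get a feel for what those retry settings imply, here is a rough sketch of a clamped exponential schedule. It is not tenacity's code: `backoff_schedule` is a hypothetical helper, and the exponent tenacity starts from may differ between versions, so treat the numbers as approximate.

```python
def backoff_schedule(attempts, multiplier=1, minimum=4, maximum=60):
    """Approximate waits in seconds between `attempts` tries.

    Each wait is multiplier * 2**n, clamped to [minimum, maximum].
    This is an illustration, not tenacity's implementation.
    """
    waits = []
    for n in range(attempts - 1):  # no wait after the final attempt
        raw = multiplier * 2 ** n
        waits.append(max(minimum, min(raw, maximum)))
    return waits


schedule = backoff_schedule(7)
print(schedule, sum(schedule))
```

Under these assumptions a file that keeps failing is abandoned after roughly a minute of waiting in total, which is small next to the two-second pause applied to every successful download.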
Then we can download all the candidates’ questionnaires (this took me a couple of hours):
for c in candidates:
    url = c.questions
    if not url:
        continue
    # Name the local file after the last segment of the URL
    path = question_dir / url.split('/')[-1]

    download(url, path)
While we’re downloading assets we can also grab all of their photographs:
photo_dir = data_dir / 'photos'
photo_dir.mkdir(exist_ok = True)

for c in candidates:
    url = c.photo
    if not url:
        continue
    path = photo_dir / url.split('/')[-1]

    download(url, path)
Parse Questionnaire PDFs
For some reason VEC decided the best way to publish the candidate questionnaire responses was in separate PDFs. Luckily the formatting is consistent enough and informative enough that we can actually extract it. We have questions for most of the candidates:
questions = sorted(question_dir.glob('*.pdf'))
len(questions)
2098
While PDF is an open standard it’s pretty gnarly to extract things out of, and I’m not sure what the best general libraries for parsing them are. In this case PyPDF2 is good enough to read the data because it’s very consistent, but our approach would break with any small variation.
All of the questionnaires consist of a single page, so we only need to read that page. PyPDF2 uses a visitor pattern, which lets us run conditional extraction depending on the text, the current transformation matrix cm, the text matrix tm, the font dictionary and the font size. Each matrix has six entries: the first four encode scaling and rotation and the last two a translation, so positions 4 and 5 should correspond to the x and y offset. I originally printed all the attributes to find out which ones might be meaningful, which is too messy to show here, and discovered that all the text of the candidate responses occurs twice, once with a missing fontDict. Moreover the cm[5] variable is different for most of the text fields (and consistent between PDFs) so we can use that to extract the relevant field.
path = questions[0]
reader = PdfReader(path)
page = reader.pages[0]

def visitor_text_attributes(text, cm, tm, fontDict, fontSize):
    # Only show non-empty text from the copy that has no font dictionary
    if text.strip() and not fontDict:
        print(dict(text=text, x=cm[4], y=cm[5]))

page.extract_text(visitor_text=visitor_text_attributes);
{'text': 'Frankston City Council 2024', 'x': 242.47, 'y': 708.41}
{'text': 'Derinya Ward', 'x': 242.47, 'y': 688.58}
{'text': 'GRACE, Cassandra', 'x': 242.47, 'y': 669.57}
{'text': 'I believe in a council that listens, values diversity, and acts with transparency. I would like to\nwork collaboratively as a team to make decisions that impact the residents of the City of\nFrankston. I would like to provide respectful, honest and professional leadership of our\ncommunity with community engagement.', 'x': 46.62, 'y': 541.58}
{'text': 'Councillor?', 'x': 445.75, 'y': 515.25}
{'text': 'I have been a dedicated community volunteer for over 30 years. I live in the Derinya Ward\nand have done so my entire life. I have a deep and strong understanding of our community. I\nunderstand what it means to be a business owner. I have strong leadership skills.', 'x': 46.62, 'y': 405.45}
{'text': 'Yes', 'x': 423.55, 'y': 370.11}
{'text': 'Yes', 'x': 423.31, 'y': 349.27}
{'text': 'No', 'x': 423.57, 'y': 327.73}
{'text': 'No', 'x': 423.24, 'y': 261.59}
{'text': 'www.cassandragrace.net', 'x': 46.62, 'y': 99.74}
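To see why entries 4 and 5 of the matrix act as the x and y offset, here is a small sketch of how a six-entry PDF matrix `[a, b, c, d, e, f]` transforms a point. The helper `apply_matrix` is hypothetical, and the matrix below assumes no scaling or rotation, which I have not verified for these files; only the offsets are taken from the output above.

```python
def apply_matrix(m, x, y):
    """Apply a PDF-style matrix [a, b, c, d, e, f] to the point (x, y)."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


# Assumed identity scale/rotation, with the offsets printed for the ward field
cm = [1.0, 0.0, 0.0, 1.0, 242.47, 688.58]

# The origin of the text block lands exactly at (cm[4], cm[5])
origin = apply_matrix(cm, 0, 0)
print(origin)
```

Whatever the first four entries are, the origin always maps to `(e, f)`, which is why `cm[5]` is a stable label for where a field starts on the page.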
We use these observations to map the y coordinates back to fields.
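coords = {
    # The sample output above prints 708.41 for this field, so this key
    # never matches and no council field ends up in the parsed output
    708.51: "council",
    688.58: "ward",
    669.57: "name",
    541.58: "vision",
    405.45: "expertise",
    370.11: "read_plan",
    349.27: "read_code",
    327.73: "has_party",
    285.94: "party_name",
    261.59: "councillor",
    240.23: "council_attendance",
    99.74: "contact"
}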
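Looking fields up by exact float equality works here because the coordinates are identical between PDFs, but it is brittle. As a sketch of a more forgiving alternative (not what was used to build the dataset), a lookup can accept any coordinate within a small tolerance; `lookup_field` and the `tol` value are assumptions chosen for illustration.

```python
def lookup_field(y, fields, tol=0.05):
    """Return the field whose y coordinate is within `tol` of `y`, else None."""
    nearest = min(fields, key=lambda k: abs(k - y))
    return fields[nearest] if abs(nearest - y) <= tol else None


# A few of the coordinates observed above
sample_fields = {688.58: "ward", 669.57: "name", 541.58: "vision"}

print(lookup_field(688.58, sample_fields))  # exact match
print(lookup_field(669.60, sample_fields))  # close enough to "name"
print(lookup_field(600.00, sample_fields))  # nothing nearby
```

The tolerance has to stay well below the smallest gap between two fields, otherwise neighbouring answers could be confused.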
We can then parse a PDF by extracting the text that has an empty fontDict and whose cm[5] coordinate corresponds to a known field. Extra keyword arguments are passed through into the output dictionary as context, so that we can add the file name.
def parse_pdf(path, **kwargs):
    reader = PdfReader(path)
    page = reader.pages[0]

    # Start from the caller's context (e.g. the file name)
    data = kwargs
    def visitor(text, cm, tm, fontDict, fontSize):
        if text.strip() and not fontDict:
            coord = cm[5]
            if coord in coords:
                value = coords[coord]
                # Each field should appear at most once per questionnaire
                if value in data:
                    raise ValueError(f"Duplicate {value}")
                data[value] = text

    page.extract_text(visitor_text=visitor)
    return data
And apply it to all of the questionnaires:
questionnaires = []

for q in questions:
    questionnaires.append(parse_pdf(q, question_file=q.name))
Analysing questionnaires
Let’s do some sense checking of the data in the questionnaires. We’ll ignore the contacts because they add little beyond what we already have, and they may not be appropriate to publish in a dataset.
df_q = pd.DataFrame(questionnaires).drop('contact',axis=1)

df_q
| | question_file | ward | name | vision | expertise | read_plan | read_code | has_party | councillor | council_attendance | party_name |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19672.pdf | Derinya Ward | GRACE, Cassandra | I believe in a council that listens, values di... | I have been a dedicated community volunteer fo... | Yes | Yes | No | No | NaN | NaN |
1 | 19673.pdf | Tooronga Ward | BELL, Jamie | A better Stonnington Council prioritises trans... | As a management consultant, my ability to anal... | Yes | Yes | No | No | NaN | NaN |
2 | 19674.pdf | Dendy Ward | SCANLON, Terence | Bayside vision for improvement & growth while ... | Multinational business management skills, impl... | Yes | Yes | No | No | NaN | NaN |
3 | 19675.pdf | Coliban Ward | LESSER, Max | A connected community | Previous local government employee and innovat... | Yes | Yes | No | No | NaN | NaN |
4 | 19676.pdf | Jeeralang Ward | DUNCAN, Peter | My vision for Latrobe City is to have a counci... | My expertise and attributes that will assist m... | Yes | Yes | No | No | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2093 | 21939.pdf | Streeton Ward | PETERSEN, Sigrid | To work with local council, residents and comm... | I am local mother and business owner, with str... | Yes | Yes | No | No | NaN | NaN |
2094 | 21941.pdf | Walling Ward | FERRIER, David | I want to lead the way in developing support p... | Over the past 3 years I have led a local commu... | Yes | Yes | No | No | NaN | NaN |
2095 | 21943.pdf | Horsham North Ward | ROSS, Kathy | I would like to see a community that is engage... | I have understanding of government and legisla... | Yes | Yes | No | No | NaN | NaN |
2096 | 21944.pdf | Banksia Ward | MATTHEWS, Corey | Answer not submitted | Answer not submitted | Answer not submitted | Answer not submitted | Answer not submitted | Answer not submitted | NaN | NaN |
2097 | 21945.pdf | Tooradin Ward | DIZON, Jennifer | My vision for the City of Casey is a vibrant, ... | As a lawyer with extensive experience in legal... | Yes | Yes | No | No | NaN | NaN |
2098 rows × 11 columns
The only columns with missing values are council_attendance, which only makes sense if councillor is Yes, and party_name, which only makes sense if has_party is Yes.
df_q.isna().sum()
question_file 0
ward 0
name 0
vision 0
expertise 0
read_plan 0
read_code 0
has_party 0
councillor 0
council_attendance 1687
party_name 1818
dtype: int64
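The claim that a dependent answer is missing exactly when its parent question is not "Yes" can be tested directly. A minimal sketch on toy rows (the three rows below are invented, mirroring the kinds of answers seen in the questionnaires):

```python
import pandas as pd

# Hypothetical answers: a non-councillor, a councillor, and a blank form
toy = pd.DataFrame({
    "councillor": ["No", "Yes", "Answer not submitted"],
    "council_attendance": [None, ">90%", None],
})

missing = toy["council_attendance"].isna()
is_councillor = toy["councillor"] == "Yes"

# Attendance should be present for councillors and absent for everyone else
consistent = bool((missing == ~is_councillor).all())
print(consistent)
```

The same comparison applies to `party_name` against `has_party`.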
All the files are distinct
df_q['question_file'].duplicated().any()
False
Almost all candidates read the Council Plan for the council they want to be elected to, and read the Councillor Code of conduct, but surprisingly a few candidates didn’t.
df_q[['read_plan', 'read_code']].value_counts(dropna=False)
read_plan read_code
Yes Yes 2053
No Yes 26
No 9
Yes No 5
Answer not submitted Answer not submitted 4
Yes 1
dtype: int64
Almost one-fifth of the candidates are currently Councillors, and most of them have attended more than 90% of council meetings.
df_q[['councillor', 'council_attendance']].value_counts(dropna=False)
councillor council_attendance
No NaN 1683
Yes >90% 374
75 – 90% 37
Answer not submitted NaN 4
dtype: int64
Most candidates have not been endorsed by a party, and of those who have, most are with the Australian Greens or Victorian Socialists. Note that many more candidates are affiliated with a party but not officially endorsed, and they do not need to disclose that affiliation. The political affiliation of candidates is murky: there are candidates running as independent Liberals using the Liberal party colours but not endorsed by the Liberal party, and similarly candidates running as community Labor who are not officially endorsed by the Labor party. There are also other quasi-parties such as Yarra For All in Yarra and the Socialist Alliance.
df_q[['has_party', 'party_name']].value_counts(dropna=False)
has_party party_name
No NaN 1814
Yes Australian Greens 123
Victorian Socialists 77
Australian Labor Party 53
Libertarian 14
AJP 9
Answer not submitted NaN 4
Yes Liberal 4
dtype: int64
All PDFs have text from the vision and expertise:
Many Victorian Socialists (top) and Australian Greens (second) candidates submitted an identical vision, and some candidates did not submit an answer.
Counter(df_q['vision']).most_common(5)
[('Victorian Socialists don’t hide behind the lie that local government is only about roads, rates\nand rubbish. I’ll fight for a local government that will stand up to the state and federal\ngovernment and demand policies that put the public before corporate profits.',
38),
('Council has an important role in ensuring that all of us thrive and that our local environment is\nprotected. Together, we can continue building a supportive, vibrant and diverse community by\ndelivering the infrastructure and services we all need.',
22),
('Answer not submitted', 11),
('A vibrant, inclusive community where residents and businesses thrive. A council led by\nindependent residents, not tied to political parties or campaign donors, but representing the\nvoices of Melbourne. Revitalised neighbourhoods, increased green spaces, and improved\nservices. We will ensure council delivers on its core priorities, maximising value for\nratepayers.',
4),
('Focus needs to be on local council mandate and not wasting money on issues outside its\njurisdiction. Do not remove parking spaces nor increase speed limits. Work together with the\ncommunity and small businesses.',
2)]
Even more concerning, a lot of Socialist candidates have the same “expertise”, and some people did not submit an answer.
Counter(df_q['expertise']).most_common(3)
[('I’m committed to standing up for the community. First of all that means I want to listen to you.\nI’ll always answer my phone to residents and community groups and campaign to make sure\nyour voice is heard.',
25),
('Answer not submitted', 11),
('Local Small Business Experience and having grown up and have always worked within the\ncommunity.',
2)]
The number of characters looks pretty reasonable for the vision and relevant expertise:
ax = df_q['vision'].str.len().plot.hist(bins=30)
ax.set_title('Number of Characters in Vision for Council');
ax = df_q['expertise'].str.len().plot.hist(bins=30)
ax.set_title('Number of Characters in relevant Expertise');
Putting it all together
Let’s merge together the questionnaire with the other candidate data.
df_candidate['question_file'] = df_candidate['questions'].str.replace(".*/", "", regex=True)
df_candidate['photo_file'] = df_candidate['photo'].str.replace(".*/", "", regex=True)

df_full = df_candidate.merge(df_q, how='left', on='question_file', suffixes=('', '_q'))
assert len(df_full) == len(df_candidate)
df_full.head()
| | name | questions | photo | statement | contact_name | has_phone | has_email | email_domain | council | ward | vacancies | question_file | photo_file | ward_q | name_q | vision | expertise | read_plan | read_code | has_party | councillor | council_attendance | party_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ANDERSEN, John | https://itsitecoreblobvecprd.blob.core.windows... | https://itsitecoreblobvecprd.blob.core.windows... | My name is John Andersen, also known as Snowy.... | John Andersen | True | True | yahoo.com.au | Alpine Shire Council | Alpine Shire Council | 7 | 20384.pdf | 20384.png | Alpine Shire Council | ANDERSEN, John | Greater community engagement in council decisi... | Hard working ethic and community-based experie... | Yes | Yes | No | No | NaN | NaN |
1 | SMITH, Peter | None | https://itsitecoreblobvecprd.blob.core.windows... | My name is Peter Smith. I am a 54 year old res... | Peter Smith | True | True | expressfinanceoptions.com.au | Alpine Shire Council | Alpine Shire Council | 7 | None | 19773.png | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | RONCO, Jean-Pierre | https://itsitecoreblobvecprd.blob.core.windows... | https://itsitecoreblobvecprd.blob.core.windows... | As a candidate at the Alpine Shire Council ele... | Jean-Pierre Louis Ronco | True | True | hotmail.com | Alpine Shire Council | Alpine Shire Council | 7 | 19701.pdf | 19701.png | Alpine Shire Council | RONCO, Jean-Pierre | I would like to see a united community which b... | I enjoy listening to diverse opinions, I work ... | Yes | Yes | No | No | NaN | NaN |
3 | BYRNE, David | https://itsitecoreblobvecprd.blob.core.windows... | https://itsitecoreblobvecprd.blob.core.windows... | My name is David Byrne. Ten years ago, my fami... | David Byrne | True | True | arderns.com.au | Alpine Shire Council | Alpine Shire Council | 7 | 20266.pdf | 20266.png | Alpine Shire Council | BYRNE, David | The Alpine Shire’s unique character is shaped ... | As the owner of Arderns Caravan Park in Myrtle... | Yes | Yes | No | No | NaN | NaN |
4 | McDONALD, Bruce | https://itsitecoreblobvecprd.blob.core.windows... | https://itsitecoreblobvecprd.blob.core.windows... | A resident of Bright with a no-nonsense, down ... | Bruce McDonald | True | True | gmail.com | Alpine Shire Council | Alpine Shire Council | 7 | 20189.pdf | 20189.png | Alpine Shire Council | McDONALD, Bruce | To make and keep Alpine Shire the best place t... | Common sense, no nonsense ability to solve pro... | Yes | Yes | No | No | NaN | NaN |
We’ll check that the data from the questionnaire is consistent:
assert (df_full['name_q'].isna() | (df_full['name'] == df_full['name_q'])).all()
assert (df_full['ward_q'].isna() | (df_full['ward'] == df_full['ward_q'])).all()
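The join and the consistency checks can be tried end to end on toy data. This sketch uses invented rows and an `example.org` URL as a placeholder; it also adds two optional `merge` arguments that were not used above, `validate` (to fail early if a file name repeats on the questionnaire side) and `indicator` (to show which candidates found a match).

```python
import pandas as pd

# Hypothetical candidates: one with a questionnaire URL, one without
cands = pd.DataFrame({
    "name": ["A, One", "B, Two"],
    "questions": ["https://example.org/files/101.pdf", None],
})
# Hypothetical parsed questionnaire for the first candidate
parsed = pd.DataFrame({"question_file": ["101.pdf"], "name": ["A, One"]})

# Keep only the part of the URL after the final slash
cands["question_file"] = cands["questions"].str.replace(".*/", "", regex=True)

merged = cands.merge(
    parsed, how="left", on="question_file", suffixes=("", "_q"),
    validate="many_to_one", indicator=True,
)

# A left join keeps every candidate, and matched names must agree
assert len(merged) == len(cands)
assert (merged["name_q"].isna() | (merged["name"] == merged["name_q"])).all()
print(merged[["name", "question_file", "_merge"]])
```

Candidates without a questionnaire come through as `left_only`, with the questionnaire columns left empty.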
Let’s tidy up the columns and column names for export
df = (
    df_full.copy()
    [['council', 'ward', 'vacancies', 'name',
      'statement', 'vision', 'expertise',
      'read_plan', 'read_code', 'has_party',
      'party_name', 'councillor', 'council_attendance',
      'has_phone', 'has_email', 'email_domain', 'contact_name',
      'photo_file']]
)

df.loc[df['councillor'] == 'No', 'council_attendance'] = 'N/A'
df.loc[df['has_party'] == 'No', 'party_name'] = 'N/A'
df.loc[~df['has_email'], 'email_domain'] = 'N/A'
df = df.fillna('Answer not submitted')

df['has_phone'] = np.where(df['has_phone'], 'Yes', 'No')
df['has_email'] = np.where(df['has_email'], 'Yes', 'No')

df = df.rename(columns = dict(
    vacancies = "ward_vacancies",
))

df.to_csv('2024_victorian_council_candidates.csv', index=False)
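One thing to watch when loading the exported file with pandas: the string 'N/A' is among the values `read_csv` treats as missing by default, so the distinction between "not applicable" and a real gap is lost unless that behaviour is turned off. A small round-trip sketch on toy data (written to an in-memory buffer rather than the real file):

```python
import io

import pandas as pd

# Two hypothetical rows using the same 'N/A' marker as the export
toy = pd.DataFrame({
    "name": ["A, One", "B, Two"],
    "party_name": ["N/A", "Australian Greens"],
})
buf = io.StringIO()
toy.to_csv(buf, index=False)

# Default parsing turns 'N/A' into a missing value
default = pd.read_csv(io.StringIO(buf.getvalue()))

# keep_default_na=False keeps the text exactly as it was written
literal = pd.read_csv(io.StringIO(buf.getvalue()), keep_default_na=False)

print(default["party_name"].isna().tolist())
print(literal["party_name"].tolist())
```

Passing `keep_default_na=False` when reading the dataset keeps 'N/A' and 'Answer not submitted' as the two distinct labels they were exported as.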
What next
There are various ways to further enrich the data for more specific analyses. For example, with the geometric boundaries the councils and wards could be mapped to ABS Statistical Areas, and linked to Census and other demographic data. They could also be linked to state or national electoral boundaries (which would be much larger) and to votes for major political parties. More could be done to find council candidates’ specific political affiliations; for example, Councilwatch ran a survey that includes political party affiliation. We could also run a separate extract for Melbourne City Council, for which there’s a richer structure and more information available.
However, I think the dataset as is should be rich enough to do interesting analyses. I have a hypothesis that the topics candidates focus on within a council are much narrower than the topics between councils; this could give an indication of the topics that are of interest within a specific area. It could be interesting to get a list of topics and see which candidates mention them, and their positions, and how this varies across councils. Hopefully you find this a useful resource for understanding Victoria and its people.