The Canadian Centre for Policy Alternatives' series of reports on the pay of Canada's top 100 CEOs contains that data in tables in PDF files. But I do databases, so I needed the data in SQL tables, which meant extracting it from the PDFs in a form that could be converted to SQL statements.
Use pdftotext with -raw or -layout (-layout preserves the spacing where there are no dollar amounts):
> pdftotext -layout -f 17 -l 19 Canadas_CEO_Elite_100Final.pdf CanElite.pdf.txt
That gave me lines of text, and I edited each of the 1,500 lines (100 per year, across 15 years of reports) to transform this:
Rank Name Company Base Salary Bonus Shares Options Pension Other Total
1 Frank Stronach Magna International Inc. 205,988 41,908,220 17,006,353 2,690,726 61,811,287
2 Donald Walker Magna International Inc. 329,358 10,121,908 668,613 4,073,219 1,485,739 16,678,837
...
100 Pierre Beaudoin Bombardier Inc. 1,162,900 870,200 1,027,800 528,600 184,600 131,500 3,892,000
to this:
call insert_ceo_data3('2010','Frank Stronach', 'Magna International Inc.', 205988, 41908220, NULL, 17006353, NULL, 2690726, 61811287);
call insert_ceo_data3('2010','Donald Walker', 'Magna International Inc.', 329358, 10121908, 668613, 4073219, NULL, 1485739, 16678837);
...
call insert_ceo_data3('2010','Pierre Beaudoin', 'Bombardier Inc.', 1162900, 870200, 1027800, 528600, 184600, 131500, 3892000);
I wrote various SQL routines to ingest the data, as not every year's report has the same columns or column order.
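For reference, the ingest routine can be a thin wrapper around an INSERT. Here is a minimal sketch of what an insert_ceo_data3 might look like, assuming MySQL; the staging table ceo_pay_raw and its column names are my guesses, not the actual schema, and the per-year variants would differ mainly in which columns they take:

DELIMITER //
CREATE PROCEDURE insert_ceo_data3(
    IN p_year    CHAR(4),
    IN p_name    VARCHAR(100),
    IN p_company VARCHAR(100),
    IN p_base    BIGINT,
    IN p_bonus   BIGINT,
    IN p_shares  BIGINT,
    IN p_options BIGINT,
    IN p_pension BIGINT,
    IN p_other   BIGINT,
    IN p_total   BIGINT
)
BEGIN
    -- staging table and column names are assumptions for illustration
    INSERT INTO ceo_pay_raw (report_year, ceo_name, company_name,
                             base_salary, bonus, shares, stock_options,
                             pension, other, total)
    VALUES (p_year, p_name, p_company, p_base, p_bonus, p_shares,
            p_options, p_pension, p_other, p_total);
END //
DELIMITER ;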
Sometimes a number column was left empty or held only a dash (-); I took both as NULL. Some numbers were in parentheses, e.g. (23456), which I interpreted as -23456.
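Much of that cleanup happened in the text editing, but it could also be pushed into SQL by accepting the raw tokens as strings. A sketch of such a helper, again assuming MySQL (the function name clean_amount is hypothetical):

DELIMITER //
CREATE FUNCTION clean_amount(p_raw VARCHAR(32)) RETURNS BIGINT
DETERMINISTIC
BEGIN
    -- empty cells and bare dashes become NULL
    IF p_raw IS NULL OR p_raw = '' OR p_raw = '-' THEN
        RETURN NULL;
    END IF;
    -- parenthesized amounts like (23456) become negative
    IF p_raw LIKE '(%)' THEN
        RETURN -CAST(REPLACE(REPLACE(REPLACE(p_raw, '(', ''),
                                     ')', ''), ',', '') AS SIGNED);
    END IF;
    -- otherwise strip the thousands separators and convert
    RETURN CAST(REPLACE(p_raw, ',', '') AS SIGNED);
END //
DELIMITER ;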
This was a lot of editing, but sed scripting helped. To remove the embedded commas from number strings:
>sed 's/\([0-9]\),\([0-9]\)/\1\2/g' CanElite.pdf.txt > CanElite.pdf.commad.txt
With the commas removed, if the blanks between numbers were consistent, this sometimes worked: replace "xxx yyy" with "xxx,yyy" to separate the integers for SQL (\1 and \2 remember the matched digits):
>sed 's/\([0-9]\) \([0-9]\)/\1,\2/g' 2020paradise.txt.sql > foo.txt
Lots of tweaking of lines of text was needed to get the raw data into SQL format. VS Code has a nice text editor, and those proficient in sed, awk, or tr could do all of this faster than I did.
With enough data entered, I tried to normalize the names of CEOs and the names of firms. For instance, I took M.H. McCain = Michael McCain = Michael H. McCain, reducing multiple identifiers for one person to a single canonical name. Errors in doing this are all mine.
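In SQL, that collapsing amounts to one update per person against a normalized-name column; a sketch, with assumed table and column names:

UPDATE ceo_pay_raw
SET ceo_name_norm = 'Michael H. McCain'
WHERE ceo_name IN ('M.H. McCain', 'Michael McCain', 'Michael H. McCain');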
Corporate names were also tweaked to normalize them, but I was not sure whether Foobar Trust was the same firm as Foobar Income Fund, etc., so I erred on the side of more firms, not fewer. The raw data used Inc. or Inc, Corp. or Corp; I normalized those abbreviations as best I could and added the corrected names back in a separate normalized column. This was a lot of busywork.
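The abbreviation cleanup can be expressed the same way; a sketch under the same assumed names, which appends the missing period to bare Inc and Corp endings (LIKE '% Inc' only matches names that end in a bare Inc, so Income Fund names are untouched):

UPDATE ceo_pay_raw
SET company_name_norm = CASE
    WHEN company_name LIKE '% Inc'  THEN CONCAT(company_name, '.')
    WHEN company_name LIKE '% Corp' THEN CONCAT(company_name, '.')
    ELSE company_name
END;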
With a normalized set of CEO names and a normalized set of corporation names, primary and foreign keys could be applied for data integrity. I also added StatsCan data on average wages, et voilà.
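The end result is a schema along these lines; a sketch only, assuming MySQL, with my own guesses at table and column names:

CREATE TABLE ceo (
    ceo_id   INT AUTO_INCREMENT PRIMARY KEY,
    ceo_name VARCHAR(100) NOT NULL UNIQUE    -- the normalized name
);

CREATE TABLE firm (
    firm_id   INT AUTO_INCREMENT PRIMARY KEY,
    firm_name VARCHAR(100) NOT NULL UNIQUE   -- the normalized name
);

CREATE TABLE ceo_pay (
    report_year   CHAR(4) NOT NULL,
    ceo_id        INT NOT NULL,
    firm_id       INT NOT NULL,
    base_salary   BIGINT,
    bonus         BIGINT,
    shares        BIGINT,
    stock_options BIGINT,
    pension       BIGINT,
    other         BIGINT,
    total         BIGINT,
    PRIMARY KEY (report_year, ceo_id),       -- one ranking entry per CEO per year
    FOREIGN KEY (ceo_id)  REFERENCES ceo(ceo_id),
    FOREIGN KEY (firm_id) REFERENCES firm(firm_id)
);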