The Top 100 CEO data transform

The Canadian Centre for Policy Alternatives' series of reports on the pay of Canada's top 100 CEOs presents that data in tables inside PDF files. But I do databases, so I needed the data in SQL tables.

That meant getting the text out of the PDFs in a form that could be turned into SQL statements.

use -raw or -layout (-layout preserves the spacing, so a column with no dollar amount stays visibly blank)
> pdftotext -layout -f 17 -l 19 Canadas_CEO_Elite_100Final.pdf CanElite.pdf.txt

Now I had lines of text, and I edited each of the roughly 1,500 lines (100 per year, over 15 years of reports) to transform this:

Rank  Name               Company                     Base Salary        Bonus       Shares      Options      Pension        Other        Total
1     Frank Stronach     Magna International Inc.        205,988   41,908,220               17,006,353                 2,690,726   61,811,287
2     Donald Walker      Magna International Inc.        329,358   10,121,908      668,613    4,073,219                 1,485,739   16,678,837
...
100   Pierre Beaudoin    Bombardier Inc.               1,162,900      870,200    1,027,800      528,600      184,600      131,500    3,892,000
                

to this:

            call insert_ceo_data3('2010','Frank Stronach', 'Magna International Inc.', 205988 , 41908220 , NULL , 17006353 , NULL , 2690726 , 61811287);
            call insert_ceo_data3('2010','Donald Walker', 'Magna International Inc.', 329358 , 10121908 , 668613 , 4073219 , NULL , 1485739 , 16678837);   
            ...    
            call insert_ceo_data3('2010','Pierre Beaudoin', 'Bombardier Inc.', 1162900 , 870200 , 1027800 , 528600 , 184600 , 131500 , 3892000);
        

I wrote various SQL routines to ingest the data, since not every year's report has the same columns or column order. Sometimes a number's column was left blank (which I took as NULL) or held only a dash (also NULL).
Some numbers were in parentheses, e.g. (23456), which I interpreted as -23456.
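
To give a flavour of those routines, here is a minimal sketch of one of them, assuming MySQL; the ceo_pay staging table and its column names are my illustration here, not the real schema, and the real routines varied with each year's columns:

    DELIMITER //
    CREATE PROCEDURE insert_ceo_data3(
        IN p_year    CHAR(4),
        IN p_name    VARCHAR(100),
        IN p_company VARCHAR(100),
        IN p_base    BIGINT,
        IN p_bonus   BIGINT,
        IN p_shares  BIGINT,
        IN p_options BIGINT,
        IN p_pension BIGINT,
        IN p_other   BIGINT,
        IN p_total   BIGINT)
    BEGIN
        -- blanks, dashes, and parenthesized figures have already been turned
        -- into NULLs and negative numbers by the time this is called
        INSERT INTO ceo_pay (report_year, ceo_name, company_name,
                             base_salary, bonus, shares, stock_options,
                             pension, other_comp, total_comp)
        VALUES (p_year, p_name, p_company, p_base, p_bonus, p_shares,
                p_options, p_pension, p_other, p_total);
    END //
    DELIMITER ;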

This was a lot of editing, but sed scripting helped with the repetitive parts:

remove embedded commas from number strings:
>sed 's/\([0-9]\),\([0-9]\)/\1\2/g' CanElite.pdf.txt > CanElite.pdf.commad.txt

With the commas removed, and when the blanks between numbers were consistent, this sometimes worked:

for SQL, replace the blank between two numbers with a comma to separate the integers:
replace xxx yyy with xxx,yyy (\1 and \2 recall the matched digits; note the pattern matches exactly one blank, so wider gaps needed squeezing first)
>sed 's/\([0-9]\) \([0-9]\)/\1,\2/g'  2020paradise.txt.sql > foo.txt

It took a lot of tweaking of individual lines of text to get the raw data into SQL form. VS Code has a nice text editor, and those proficient in sed, awk, or tr could do all of this faster than I did.
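
For example, much of the per-line drudgery could probably be folded into one awk script along these lines. This is a rough sketch, not the method actually used: the substr() offsets are guesses to be tuned per report year, and rows with blank columns, lone dashes, or apostrophes in names would still need hand editing:

    # tosql.awk -- a rough sketch; offsets must match the -layout output
    /^ *[0-9]+ / {                          # only rows starting with a rank
        name = substr($0, 22, 26)           # CEO name column (offset = guess)
        comp = substr($0, 48, 35)           # company column  (offset = guess)
        nums = substr($0, 83)               # the dollar figures
        gsub(/ +$/, "", name); gsub(/ +$/, "", comp)
        gsub(/,/, "", nums)                 # strip embedded commas
        gsub(/\(/, "-", nums); gsub(/\)/, "", nums)   # (23456) -> -23456
        sub(/^ +/, "", nums); sub(/ +$/, "", nums)
        gsub(/ +/, " , ", nums)             # blanks between figures -> commas
        printf "call insert_ceo_data3('%s', '%s', '%s', %s);\n", yr, name, comp, nums
    }

>awk -v yr=2010 -f tosql.awk CanElite.pdf.txt > CanElite.sql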

With enough data entered I tried to normalize the names of CEOs and the names of firms.

For instance, I took M.H. McCain = Michael McCain = Michael H. McCain, collapsing multiple identifiers for one person down to a single canonical one. Any errors in doing this are mine alone.
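
One way to record such identities is an alias table mapping every raw spelling to one canonical name. A sketch, with table and column names of my own invention:

    -- every raw spelling points at one canonical CEO name
    CREATE TABLE ceo_alias (
        raw_name VARCHAR(100) PRIMARY KEY,
        ceo_name VARCHAR(100) NOT NULL
    );
    INSERT INTO ceo_alias VALUES
        ('M.H. McCain',       'Michael H. McCain'),
        ('Michael McCain',    'Michael H. McCain'),
        ('Michael H. McCain', 'Michael H. McCain');

    -- assumes ceo_pay carries a ceo_name_norm column for the canonical form
    UPDATE ceo_pay p
      JOIN ceo_alias a ON a.raw_name = p.ceo_name
       SET p.ceo_name_norm = a.ceo_name;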

Corporate names were also tweaked toward a normal form, but I could not be sure whether Foobar Trust was the same firm as Foobar Income Fund, etc., so I erred on the side of more firms rather than fewer. The raw data used Inc. or Inc, Corp. or Corp, and so on; I normalized those abbreviations as best I could and stored the results in a separate column of corrected names. This was a lot of busywork.
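
The suffix cleanup at least lends itself to bulk SQL. A sketch, again with column names that are mine rather than the real schema's:

    -- copy the raw name, then add the period missing from bare suffixes
    UPDATE ceo_pay SET company_norm = company_name;
    UPDATE ceo_pay SET company_norm = CONCAT(company_norm, '.')
     WHERE company_norm LIKE '% Inc'
        OR company_norm LIKE '% Corp'
        OR company_norm LIKE '% Ltd';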

With a normalized set of CEO names and a normalized set of corporation names, primary and foreign keys could be applied for data integrity. I also added Statistics Canada data on average wages, et voilà.
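
Roughly, that integrity layer might look like the sketch below; every name here, including the statcan_wages table, is illustrative rather than the actual schema:

    -- one row per distinct CEO and firm, referenced from the pay table
    CREATE TABLE ceo  (ceo_id  INT AUTO_INCREMENT PRIMARY KEY,
                       ceo_name  VARCHAR(100) NOT NULL UNIQUE);
    CREATE TABLE firm (firm_id  INT AUTO_INCREMENT PRIMARY KEY,
                       firm_name VARCHAR(100) NOT NULL UNIQUE);
    ALTER TABLE ceo_pay
        ADD COLUMN ceo_id  INT,
        ADD COLUMN firm_id INT,
        ADD FOREIGN KEY (ceo_id)  REFERENCES ceo (ceo_id),
        ADD FOREIGN KEY (firm_id) REFERENCES firm (firm_id);

    -- the payoff: average top-100 pay against the average wage, per year
    SELECT p.report_year,
           AVG(p.total_comp) / w.avg_annual_wage AS ceo_to_avg_wage_ratio
      FROM ceo_pay p
      JOIN statcan_wages w ON w.report_year = p.report_year
     GROUP BY p.report_year, w.avg_annual_wage;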
