Exploration and Cleaning for Real Estate Data Set with SQL
In this study, we will work on a real estate data .
Click here for data sets.
We have a data set with 56000 rows of information about the properties of a big real estate company.
Our aim in this data set;
To separate address information by street and district
Populate property address data
To fix all the values in the SoldAsVacant column as ‘YES’ or ‘NO’
Delete duplicate values
Delete unused columns
- Standardize date format
- As seen following section data format is ‘2013–02–08 00:00:00.000’
- To convert this data format to ‘yyyy-mm-dd’ format;
select CONVERT(date,SaleDate) from homesales
- After getting correct format to update our data set;
ALTER TABLE homesales Add SaleDateConverted Date;
Update homesales SET SaleDateConverted = CONVERT(Date,SaleDate)
2. Populate property address data
- First lets see how many empty values there are in PropertyAddress.
select count(*) PropertyAddress from homesales
where PropertyAddress is null
- To fill these with who has same parcelID rows;
select a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, ISNULL(a.PropertyAddress, b.PropertyAddress)
from homesales a
join homesales b on a.ParcelID = b.ParcelID and a.[UniqueID ] != b.[UniqueID ]
where a.PropertyAddress is null
— Mean of ‘’ ISNULL (a.PropertyAddress, b.PropertyAddress) ‘’ is if a.PropertyAddress has empty value fill that with b.PropertyAddress
- To update the data set;
update a set a.PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress) from homesales a
join homesales b on a.ParcelID = b.ParcelID and a.[UniqueID ] != b.[UniqueID ] where a.PropertyAddress is null
- To check again;
select count(*) PropertyAddress from homesales
where PropertyAddress is null
- As seen there is no null value anymore.
3. Breaking out address into individual columns (address, city, state)
select PropertyAddress from homesales
- As seen above number , street name and suburb names are together. We will separate them as new columns.
select SUBSTRING(PropertyAddress, 1, CHARINDEX(‘,’, PropertyAddress) -1) Address, SUBSTRING(PropertyAddress, CHARINDEX(‘,’, PropertyAddress) +1, LEN(PropertyAddress)) AS Suburb from homesales
- After completing method to update data;
alter table homesales add adress varchar(255)
update homesales set adress = SUBSTRING(PropertyAddress, 1, CHARINDEX(‘,’, PropertyAddress) -1)
alter table homesales add suburb varchar(255)
update homesales set suburb = SUBSTRING(PropertyAddress, CHARINDEX(‘,’, PropertyAddress) +1, LEN(PropertyAddress))
- ‘’ CHARINDEX(,’, PropertyAddress) ‘’ The first element in it indicates which character we are looking for, and the second element in which variable we are looking for. The result is the number of the searched character from the left in the cell.
- To check
select adress, suburb from homesales
- As seen above PropertyAddress has been separated as address and suburb.
4. Change y and n to yes and no in “sold as vacant” field.
- When we examine the data in detail, we see that some values in SoldAsVacant column are entered as YES, NO or Y, N. We will convert all these values to YES and NO.
SELECT SoldAsVacant FROM homesales
SELECT
CASE
WHEN SoldAsVacant = ‘Y’ then ‘YES’
WHEN SoldAsVacant = ’N’ THEN ‘NO’
ELSE SoldAsVacant
END
FROM homesales
UPDATE homesales SET SoldAsVacant = CASE
WHEN SoldAsVacant = ‘Y’ then ‘YES’
WHEN SoldAsVacant = ’N’ THEN ‘NO’
ELSE SoldAsVacant
END
- To check the accuracy of our transaction;
SELECT COUNT(*) FROM homesales WHERE SoldAsVacant = ‘Y’
SELECT COUNT(*) FROM homesales WHERE SoldAsVacant = ‘N’
5. Removing duplicates
- When we examine data detailed we will see some rows have same value. If ParcelID, PropertyAddress, SalePrice, SaleDate and LegalReference values are same we will accept those values as duplicate.
- With the code block below, a column that says 1 has been added next to the repeated lines, next to the non-repeating ones.
Select *,ROW_NUMBER() OVER (
PARTITION BY ParcelID,PropertyAddress,SalePrice,SaleDate,LegalReference
ORDER BY UniqueID ) row_num From homesales order by ParcelID
- Now we will use the CTE method to sort by the ROW number column
select * FROM (Select *,ROW_NUMBER() OVER (
PARTITION BY ParcelID,PropertyAddress,SalePrice,SaleDate,LegalReference
ORDER BY UniqueID ) row_num From homesales)T
WHERE row_num > 1
- As seen above we had just row_number = 2 which is duplicates.
- To delete these ones;
WITH RowNumCTE AS(Select *, ROW_NUMBER() OVER ( PARTITION BY ParcelID, PropertyAddress, SalePrice, SaleDate, LegalReference
ORDER BY
UniqueID ) row_num From homesales )
DELETE
From RowNumCTE
Where row_num > 1
6. Deleting unused columns
ALTER TABLE homesales DROP COLUMN PropertyAddress, LegalReference