{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Scrape Wikipedia for microprocessor transistor counts\n", "- https://en.wikipedia.org/wiki/Transistor_count\n", "- 'Moore's law is the observation that the number of transistors in a dense integrated circuit doubles about every two years' - Wikipedia" ] }, { "cell_type": "code", "execution_count": 186, "metadata": {}, "outputs": [], "source": [ "# // 1. Import packages that we need:\n", "import numpy as np\n", "import pandas as pd\n", "# // Web scraping: \n", "import requests\n", "import string\n", "from bs4 import BeautifulSoup\n", "# // OS. Sometimes needed for finding the working directory:\n", "import os\n", "# // datetime\n", "from datetime import datetime\n", "# // regex library used to detect the presence of particular characters (e.g. extracting numbers from a string)\n", "import re\n", "from pprint import pprint \n", "\n", "# // altair + practice datasets\n" ] }, { "cell_type": "code", "execution_count": 187, "metadata": {}, "outputs": [], "source": [ "# Scrape with Beautiful Soup\n", "\n", "URL = \"https://en.wikipedia.org/wiki/Transistor_count\"\n", "\n", "# // Request the html from the URL:\n", "html = requests.get(URL)\n", "\n", "# // Get the soup of this page\n", "soup = BeautifulSoup(html.content, 'html.parser')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notes:\n", "- Tried many iterations of cleaning unnecessary text from the data; in multiple instances one row was formatted slightly differently, leading to incorrect cleaning.\n", " - e.g. stripping 'mm2' from each number in the area column worked for all but 1 observation, which had its units formatted differently\n", " - the area data has ' mm2' attached in multiple formats, so extract only the numbers (including decimals).\n", " row[5] = re.sub(\"[^\\d\\.]\", \"\", row[5])\n", " - then select all but the last character, which is always the '2' from the mm2. 
\n", " row[5] = row[5][:-1]\n", " - what worked instead was splitting on 'm' and then stripping any whitespace" ] }, { "cell_type": "code", "execution_count": 224, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | Processor | Transistors | Year | Designer | Process | Area mm2 | Transistors/mm2 |
|---|---|---|---|---|---|---|---|
| 0 | Intel 4004 | 2250 | 1971-01-01 | Intel | 10,000 nm | 12 | 187 |
| 1 | TMX 1795 | 3078 | 1971-01-01 | Texas Instruments | ? | 30 | 102 |
| 2 | Intel 8008 | 3500 | 1972-01-01 | Intel | 10,000 nm | 14 | 250 |
| 3 | Toshiba TLCS-12 | 11000 | 1973-01-01 | Toshiba | 6,000 nm | 32 | 343 |
| 4 | Intel 4040 | 3000 | 1974-01-01 | Intel | 10,000 nm | 12 | 250 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 159 | HiSilicon Kirin 9000 | 15300000000 | 2020-01-01 | Huawei | 5 nm | 114 | 134210526 |
| 160 | Apple A15 | 15000000000 | 2021-01-01 | Apple | 5 nm | 107 | 139301634 |
| 161 | AMD Ryzen 7 5800H | 10700000000 | 2021-01-01 | AMD | 7 nm | 180 | 59444444 |
| 162 | Apple M1 Pro | 33700000000 | 2021-01-01 | Apple | 5 nm | 245 | 137551020 |
| 163 | Apple M1 Max | 57000000000 | 2021-01-01 | Apple | 5 nm | 432 | 131944444 |

164 rows × 7 columns
\n", "