
Northern Border Pipeline (NBP) Data – 12873

⚠️ If the resource you are scraping requires you to agree to any Terms & Conditions,
please do not proceed and notify your contract manager immediately. Under no
circumstances should you create a false account or fake identity.

Description:

Please write a scraper tool to scrape at least the first page of the NBP data. If possible, also scrape
the first 3 pages of the data in the frame.

• Root Domain: https://ebb.tceconnects.com/infopost/


• Data are under the:
o Northern Border Pipeline Company (NBPL) >
o Transaction Reporting >
o Interruptible
Then the right-hand side frame contains the data.
It appears that the right-hand side frame can be accessed directly via the following link. Please verify, though:
https://ebb.tceconnects.com/infopost/ReportViewer.aspx?%2FInfoPost%2FTransInterrupt&AssetNbr=3029
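A minimal fetch sketch using only the standard library. The direct ReportViewer URL is unverified, per the note above; ASP.NET report viewers often depend on session cookies and ViewState from a prior visit to the root page, so be prepared to switch to a session-capable HTTP client if a plain GET does not return the table. The function name `fetch_report_page` and the User-Agent string are illustrative.

```python
# Sketch: fetch the ReportViewer page. The URL is taken from the spec
# and must be verified before use; a plain GET may not be sufficient
# for an ASP.NET report viewer.
import urllib.request

ROOT_URL = "https://ebb.tceconnects.com/infopost/"
REPORT_URL = (
    "https://ebb.tceconnects.com/infopost/ReportViewer.aspx"
    "?%2FInfoPost%2FTransInterrupt&AssetNbr=3029"
)

def fetch_report_page(url: str = REPORT_URL, timeout: int = 30) -> str:
    """GET the report page and return its HTML as text."""
    req = urllib.request.Request(url, headers={"User-Agent": "scrape-12873"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```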
Scraping Description:

• Data can be accessed in various formats via the Save button above


• We will scrape the main table highlighted in blue; the schema is below
• Add two columns, data_url and scrape_datetime

The desired schema is listed below

Root URL:

• https://ebb.tceconnects.com/infopost/
• https://ebb.tceconnects.com/infopost/ReportViewer.aspx?%2FInfoPost%2FTransInterrupt&AssetNbr=3029 - Please verify before use

Job Frequency:

Realtime (every minute)

Output Columns:

File One: summary.csv

| Column name | Datatype | Example value | Comment |
| --- | --- | --- | --- |
| scrape_datetime | datetime | datetime.datetime.now().isoformat() | Ensure this is a datetime in ISO-8601 format, with one value per run, evaluated at the start of the script rather than at the time of each request |
| data_url | string | | URL where the row is scraped from |
| posting_date_time | datetime | | ISO datetime |
| contract_holder | string | 196748938 | |
| contract_holder_name | string | Constellation Energy Generation, LLC | |
| k_holder_prop | string | 1444 | |
| svc_req_k | string | 102273 | |
| rate_schedule | string | PAL | |
| contract_status | string | A | |
| contract_begin_date | date | 4/4/2013 | |
| contract_end_date | date | 3/31/2024 | |
| k_ent_begin_date | date | 10/2/2021 | |
| K_ent_end_date | date | 2/27/2023 | |
| deal_type | string | 65 | |
| it_qty_k | int | 100,000 | |
| location_indicator | string | | |
| market_based_rate_ind | string | | |
| disc_provisions | string | | |
| term_notes | string | | |
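The two added columns can be populated as sketched below, assuming the scraped rows have already been parsed into a pandas DataFrame. Per the comment in the schema, `scrape_datetime` is computed once at script start, not per request.

```python
# Sketch: add the two required columns to the scraped table.
import datetime
import pandas as pd

# One value per run, evaluated at the start of the script,
# not at the time of each request.
SCRAPE_DATETIME = datetime.datetime.now().isoformat()

def add_required_columns(df: pd.DataFrame, data_url: str) -> pd.DataFrame:
    """Return a copy of df with data_url and scrape_datetime appended."""
    out = df.copy()
    out["data_url"] = data_url
    out["scrape_datetime"] = SCRAPE_DATETIME
    return out
```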

Timeline:

You may complete this job any time and submit any required files to the linked GitHub repository
within one week of accepting the job.

Please submit your code here: https://github.com/international-data-repository-cpd/scrape-12873

Submission Files:

Sample.csv for sample data

A requirements.txt listing dependencies

scrape/ - containing all of the source code

Main file: scrape.py, which will be run with an output $filename.

Job Schema/Output Format:

You should save the output csv using these settings from a pandas DataFrame:

encoding="utf-8",
line_terminator="\n",
quotechar='"',
quoting=csv.QUOTE_ALL,
index=False
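The settings above translate into a `to_csv` call like the sketch below, with the output filename taken from the command-line argument described under Runtime Environment. Note that pandas 2.0 renamed `line_terminator` to `lineterminator`; use whichever matches your pinned pandas version.

```python
# Sketch: write the output CSV with the required settings.
# Run as: python scrape.py $filename
import csv
import sys
import pandas as pd

def save_output(df: pd.DataFrame, filename: str) -> None:
    """Save df using the job's required to_csv settings."""
    df.to_csv(
        filename,
        encoding="utf-8",
        lineterminator="\n",  # spelled line_terminator on pandas < 2.0
        quotechar='"',
        quoting=csv.QUOTE_ALL,
        index=False,
    )

if __name__ == "__main__":
    save_output(pd.DataFrame(), sys.argv[1])
```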

Runtime Environment:

Your code will be copied from the repository root to /usr/src/scrape

You should feel free to modify the requirements as you need. However, you must keep the
awscli dependency.

You may also upload additional binaries into the repository root and reference them
there.

Please do not change the Dockerfile or shell scripts in the repository, as this will cause
automated test failures.

python scrape.py $filename

Page access limitations (max requests / day):

10% of website traffic max.
If you encounter a captcha during your scrape job, please contact the job poster before continuing.
